CS Workshop in reinforcement learning

Project report –“4 in a row”, Michal Samuel & Yuval Kalev

Source code (zip)

Part 1:

About the project 1

Game representation. 1

Basic board features. 1

Advanced (strategic) board features. 1

Score Function. 1

Learning Algorithm.. 2

The training process. 2

Background & Bibliography. 2

Part 2:

Major functions. 3

Relations between functions. 3

Major data structures. 4

About the project

“4 in a row” is a 2 player board game. The board is made up of 7 columns by 6 rows of empty slots. In each turn the player can drop a coin in one of those columns. The coin will be placed at the lowest empty slot on that column. The winner of the game is the first to connect 4 of his coins in either a horizontal, vertical or diagonal line. We will refer to this as “4-in-a-row”. If the board is full and no player has completed a “4-in-a-row”, the game is a tie.

Out project implements a learning algorithm which is used to “train” a computer opponent in the game 4 in a Row, so that it can be later played against.

The concept of the training algorithm is to make the computer play against itself using a certain strategic policy, learn from each game played and improve the policy until it reaches a point where it is as close to ideal as possible.

Back to contents

Game representation

We represent the game board using a 6x7 slot matrix. Each slot has a color which represents which player filled that slot. An empty slot is represented by color 0. The matrix rows are numbered from 0 to 5 (top to bottom) and 0 to 6 (left to right).

In order to estimate the score of each state we chose to use a reduction of the board status to a feature vector. Each vector element is a counter for its respective game feature. The feature vector is a property of the board, which is updated each time a coin is added.

We chose the following game features:

Basic board features

1. Player has a single coin in a potential “4-in-a-row”. We count this feature as follows : we scan the entire board in windows of 4 to each direction and increment the counter for every window that contained only 1 coin of the player (3 empty slots in addition). This is a basic feature, which roughly estimates the number of “4-in-a-row” potentially available to the player.

2. Player has 2 coins in a potential horizontal “4-in-a-row”. We count this feature the same way as feature #1.

3. Player has 2 coins in a potential vertical “4-in-a-row”. We count this feature the same way as feature #1.

4. Player has 2 coins in a potential diagonal “4-in-a-row”. We count this feature the same way as feature #1.

5. Player has 3 coins in a potential horizontal “4-in-a-row”. We count this feature the same way as feature #1.

6. Player has 3 coins in a potential vertical “4-in-a-row”. We count this feature the same way as feature #1.

7. Player has 3 coins in a potential diagonal “4-in-a-row”. We count this feature the same way as feature #1.

Advanced (strategic) board features

These features are based on the article “A Knowledge-based Approach of Connect-Four” by Victor Allis (chapter 3) [1] .

In this section we introduce a concept called “Threats”. A threat is defined for a specific player. A player threat is an empty slot in the game, which filling it with the player’s coin results in winning. In a case of a threat, we’ll notice that if the opponent places a coin in the slot below the threat, the player will win. The opponent will try to avoid placing a coin in that slot for as long as possible.

An Even threat is a threat made in an even row (rows 0, 2 or 4). Respectively, an Odd threat can occur on rows 1, 3 or 5. The idea is that even and odd threats have different significance for different players. Based on [1], it is noted that for player 1 odd threats are better than even threats (easier to realize the threat). For player 2, even threats are better than odd.

8. Player 1 has an even threat.

9. Player 1 has an odd threat.

10. Player 2 has an even threat.

11. Player 2 has an odd threat.

12. Player has a threat over a threat (both his own threats). The rational behind this feature is that if a player has a threat over another threat, he wins (either the opponent blocks the bottom allowing the player to realize the top threat or the opponent disregards the bottom threat allowing the player to realize the bottom threat.

13. Player has a threat over an opponent threat. The rational behind this feature is that in such a case the player’s threat is useless since either the opponent places a coin and realizes his bottom threat and wins or the player blocks the opponent bottom threat allowing the opponent to block the player’s top threat.

14. “4-in-a-row”. This is a winning state.

Note: except for 8-11, all the features are evaluated by a difference between the player and the opponent count of that feature. Features 8-9 count the occurrences of the feature for player 1 and features 10-11 for player 2.

Back to contents

Score Function

To evaluate the board status we use a linear scoring function, which is a multiplication of the feature vector (a) by the weights vector (w):

Score = Σ a_i*w_i

There are 2 exceptions to that: in a winning state the score will be 1, and in a losing state it will be -1. This is because these 2 cases have a known score, and there is no need to evaluate them.

Note: 4 in a Row is not a symmetric game, which means that the starting player has an advantage over the 2^nd player. Therefore, we use a different weight vector for each player.

Back to contents

Learning Algorithm

We chose to use an on-policy TD(0) control algorithm - Sarsa, in order to improve our scoring function.

The basic concept of the Sarsa algorithm is that we would like to improve our score function and to do that, we choose a game move from time (t), check the score afterwards (and after the opponent’s action) (t+1) and then update the score of (t) so that it will be closer to the score at (t+1) :

w_t+1 = w_t + a*[r + score(s_t+1) - score(s_t)] * att(s_t)

Choosing the next move during the learning process:

One of the challenges in choosing the moves is balancing out exploration vs. exploitation. The way we implemented the learning process is ε-greedy (in a probability of ε we will make a random move, otherwise we’ll find the best possible move using a min-max tree). We start off as knowing very little about the game, so we want to explore more (use more random moves). As the learning process develops we use less random moves and more “educated” ones.

Therefore, we set the value of ε to be 1 / (number of times the weights vector was updated).

The opponent always chooses his action in an “educated” manner, according to his scoring function.

Back to contents

The training process

First of all, since we have 2 weight vectors, in each training iteration we learn both vectors (simulate an entire game where player 1 is the “learning” player, and another where player 2 learns).

The training process was as follows:

1. Ran 100,000 game simulations (for each weight vector) with a min-max depth of 3 for the learning player and 1 for the opponent. We used a=0.001.

2. Ran 100,000 iterations with a min-max depth of 3 for the learning player and 2 for the opponent (a = 0.0001).

3. Ran 200,000 iterations with a min-max depth of 3 for both players (a = 0.0001).

4. Ran 200,000 iterations with a min-max depth of 5 for both players (a = 0.0001).

The idea behind this procedure was to let the learning player train in moderation. First against weaker opponents and as he gets “smarter”, train against better opponents.

Back to contents

Background & Bibliography

1. Victor Allis, “A Knowledge-based Approach of Connect-Four”, Vrije University, Holland, 1998.

2. The almighty Wikipedia.

3. A.G. Barto and R.S., Reinforcement Learning, MIT Press, 1998.

Back to contents

Major functions

package Data

Class BoardData:

public void updateFeatureVec()

Evaluates the feature vector according to the board status. scans all 4-slot “windows” in the board, and for each calls updateFeature(). Called after a coin is added.

public void updateFeature(int direction, Point emptySlot, int [] count, int player)

Updates features in the feature vector according to the details of the window. Variables: direction- the direction of the window (horizontal, vertical or diagonal). emptySlot: the position of the last empty slot in the window, if exists (this is relevant only for threat updates). count : an array that has the count for the empty, red and black coins in the window. player : the player who asks for the best action.

Class MinMax:

public static double stateScore(Vector weight, Vector state)

Calculates the state’s based on the feature vector and weight vector.

public static int calcMinMax(Vector weight, int depth, int player, BoardData board)

Decides what is the best action using a min-max recursive function (minMaxRecursion()). weight : the weight vector. depth : the depth of the min-max. player : the player who asks for the best action. board : the game board. The function returns the number of the column of the best action.

private static MinMaxRet minMaxRecursion(Vector weight, int depth, Boolean max, int player, double alpha, double beta, int actionDepth, BoardData b)

This is the main min-max recursion. It uses alpha and beta pruning in order to eliminate “impossible” action choices. The function returns the action which is best to perform and its score. the variables weight, depth, player and b are the same as the last function. max : Marks the current node type (max or min). initially true. alpha : the score of the best known action so far (initially –infinity). beta : the score of the worst known action so far (initially +infinity). actionDepth : the depth of the leaf at the end of the action chosen. Used for the special case where one of the paths leads to a win and we want to choose the fastest action (shallowest node) for the win.

Class Sarsa:

public static void simulateGame(WeightVector myWeights, WeightVector opWeights, int player)

This function simulates 1 game and updates the weight vector according to the Sarsa algorithm. At the end of the game, the new weights vector is written to the vector file (for the learning player). myWeights : the weight vector of the learning player. opWeights : the weight vector of the opponent. player : the learning player number.

private static void updateWeightVector(WeightVector w, double r, double oldScore, double newScore, Vector oldState)

Updates the weight vector according to the TD(0) algorithm :

new w = w + ALPHA * (r + newScore – oldScore) * oldState

Note: ALPHA is a constant which is assigned on the call to the training function.

Back to contents

Relations between functions

Training

Sarsa.train()

calls Sarsa.simulateGame()

calls MinMax.calcMinMax()

calls MinMax.minMaxRecursion()

calls BoardData.getPlayerFeatures()

MinMax.stateScore()

BoardData.getWinner()

BoardData.addCoinData()

calls BoardData.updateThreatOverThreat()

BoardData.updateFeatureVec()

calls BoardData.updateFeature()

calls BoardData.updateThreatOverThreat()

BoardData.setNextPlayer()

BoardData.popFromCol()

BoardData.initializeBoard()

calls Slot.InitializeSlot()

BoardData.play()

BoardData.getPlayerFeatures()

MinMax.stateScore()

BoardData.isColumnAvailable()

BoardData.isGameOver()

BoardData.getWinner()

Sarsa.updateWeightVector()

WeightVector.writeVectorToFile()

Back to contents

Major data structures

Data.BoardData

Contains all the Data properties of the board, amongst them:

· A Boolean 3D array of threats, of size 6x7x2, where the value at [row][col][player] tells whether the player has a threat at this location.

· A 2x2 matrix of threats counters (even/odd x player1/player2)

· A feature vector:

Data.Vector

Contains an array of a specific size, and the size of that array. performs various vector operations.

· 2 Weight vectors, 1 for each player

Data.WeightVector

An extension of Vector, handles writing and reading from file as well.

· The board itself is represented by a 6x7 Slot matrix

Data.Slot

A board slot, contains all the slot data, and the coin placed in the slot, if it exists.

Back to contents

Contents

Part 1:

Part 2: