CS Workshop in Reinforcement Learning
Project Report
This project implements a learning algorithm which is used to “train” a computer opponent in the game Connect Four.
The concept of the training algorithm is to make the computer play against itself using a certain strategic policy, learn from each game played and improve the policy until it reaches a point where it is as close to ideal as possible.
We represent the game board using a 6x7 slot matrix. Each slot has a color which indicates which player filled that slot; an empty slot is represented by color 0. The matrix rows are numbered 0 to 5 (top to bottom) and the columns 0 to 6 (left to right).
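As a minimal sketch of this representation (the class and method names here are illustrative, not the project's actual code), the board and a coin drop could look like:

```java
// Illustrative sketch of the 6x7 board matrix described above.
public class Board {
    public static final int ROWS = 6, COLS = 7;
    public static final int EMPTY = 0;          // color 0 = empty slot

    // board[row][col] holds the color of the player who filled the slot.
    private final int[][] board = new int[ROWS][COLS];

    // Drop a coin into a column: it lands in the lowest empty row
    // (row 5 is the bottom, since rows are numbered top to bottom).
    public int drop(int col, int color) {
        for (int row = ROWS - 1; row >= 0; row--) {
            if (board[row][col] == EMPTY) {
                board[row][col] = color;
                return row;
            }
        }
        return -1;  // column is full
    }

    public int get(int row, int col) { return board[row][col]; }

    public static void main(String[] args) {
        Board b = new Board();
        int r1 = b.drop(3, 1);  // first coin in column 3 lands on row 5
        int r2 = b.drop(3, 2);  // second coin stacks on row 4
        System.out.println(r1 + " " + r2);  // prints "5 4"
    }
}
```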
In order to estimate the score of each state, we chose to reduce the board state to a feature vector. Each vector element is a counter for its respective game feature. The feature vector is a property of the board and is updated each time a coin is added.
We chose the following game features:
1. Player has a single coin in a potential “4-in-a-row”. We count this feature as follows: we scan the entire board in windows of 4 in each direction and increment the counter for every window that contains exactly 1 coin of the player (and 3 empty slots). This is a basic feature, which roughly estimates the number of “4-in-a-row”s potentially available to the player.
2. Player has 2 coins in a potential horizontal “4-in-a-row”. We count this feature the same way as feature #1.
3. Player has 2 coins in a potential vertical “4-in-a-row”. We count this feature the same way as feature #1.
4. Player has 2 coins in a potential diagonal “4-in-a-row”. We count this feature the same way as feature #1.
5. Player has 3 coins in a potential horizontal “4-in-a-row”. We count this feature the same way as feature #1.
6. Player has 3 coins in a potential vertical “4-in-a-row”. We count this feature the same way as feature #1.
7. Player has 3 coins in a potential diagonal “4-in-a-row”. We count this feature the same way as feature #1.
These features are based on the article “A Knowledge-based Approach of Connect-Four” by Victor Allis [1] (chapter 3).
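The window-scan counting used for features 1-7 can be sketched as follows (a standalone illustration, not the project's `updateFeatureVec()` itself; names are assumptions):

```java
// Sketch of the 4-slot window scan used for features 1-7.
public class FeatureScan {
    static final int ROWS = 6, COLS = 7, EMPTY = 0;
    // Window directions: horizontal, vertical, and the two diagonals.
    static final int[][] DIRS = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};

    // Counts windows holding exactly `coins` coins of `player` and only
    // empty slots otherwise (feature #1 uses coins = 1, etc.).
    static int countWindows(int[][] board, int player, int coins) {
        int total = 0;
        for (int[] d : DIRS)
            for (int r = 0; r < ROWS; r++)
                for (int c = 0; c < COLS; c++) {
                    int endR = r + 3 * d[0], endC = c + 3 * d[1];
                    if (endR >= ROWS || endC < 0 || endC >= COLS) continue;
                    int mine = 0, empty = 0;
                    for (int k = 0; k < 4; k++) {
                        int v = board[r + k * d[0]][c + k * d[1]];
                        if (v == player) mine++;
                        else if (v == EMPTY) empty++;
                    }
                    if (mine == coins && empty == 4 - coins) total++;
                }
        return total;
    }

    public static void main(String[] args) {
        int[][] board = new int[ROWS][COLS];
        board[5][3] = 1;  // a single player-1 coin in the bottom-center slot
        // 4 horizontal + 1 vertical + 2 diagonal windows contain it:
        System.out.println(countWindows(board, 1, 1));  // prints 7
    }
}
```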
In this section we introduce a concept called “threats”. A threat is defined for a specific player: it is an empty slot which, if filled with the player’s coin, results in a win. Note that if the opponent places a coin in the slot directly below a threat, the player can win by playing on top of it, so the opponent will try to avoid placing a coin in that slot for as long as possible.
An even threat is a threat in an even row (rows 0, 2 or 4); correspondingly, an odd threat occurs in rows 1, 3 or 5. The idea is that even and odd threats have different significance for each player. Based on [1], for player 1 odd threats are better than even threats (easier to realize), while for player 2 even threats are better than odd.
8. Player 1 has an even threat.
9. Player 1 has an odd threat.
10. Player 2 has an even threat.
11. Player 2 has an odd threat.
12. Player has a threat over a threat (both his own threats). The rationale behind this feature is that if a player has a threat directly above another of his threats, he wins: either the opponent blocks the bottom threat, allowing the player to realize the top one, or the opponent disregards the bottom threat, allowing the player to realize it.
13. Player has a threat over an opponent threat. The rationale behind this feature is that in such a case the player’s threat is useless: either the opponent places a coin, realizing his bottom threat and winning, or the player blocks the opponent’s bottom threat, allowing the opponent to block the player’s top threat.
14. “4-in-a-row”. This is a winning state.
Note: except for features 8-11, every feature is evaluated as the difference between the player’s and the opponent’s counts of that feature. Features 8-9 count the occurrences of the feature for player 1, and features 10-11 for player 2.
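Threat detection and the even/odd classification above can be sketched as follows (an illustrative standalone version; the project itself keeps threats incrementally in a 3D array, as described in the code documentation below):

```java
// Sketch: a slot (row, col) is a threat for `player` if it is empty and
// filling it would complete a 4-in-a-row for that player.
public class Threats {
    static final int ROWS = 6, COLS = 7, EMPTY = 0;
    static final int[][] DIRS = {{0, 1}, {1, 0}, {1, 1}, {1, -1}};

    static boolean isThreat(int[][] b, int row, int col, int player) {
        if (b[row][col] != EMPTY) return false;
        b[row][col] = player;               // tentatively place the coin
        boolean wins = false;
        for (int[] d : DIRS) {
            int run = 1;                    // contiguous line through (row, col)
            for (int s = -1; s <= 1; s += 2)
                for (int k = 1; k < 4; k++) {
                    int r = row + s * k * d[0], c = col + s * k * d[1];
                    if (r < 0 || r >= ROWS || c < 0 || c >= COLS
                            || b[r][c] != player) break;
                    run++;
                }
            if (run >= 4) wins = true;
        }
        b[row][col] = EMPTY;                // undo the tentative placement
        return wins;
    }

    // Even threats sit on rows 0, 2 or 4 (rows numbered top to bottom).
    static boolean isEvenThreat(int row) { return row % 2 == 0; }

    public static void main(String[] args) {
        int[][] b = new int[ROWS][COLS];
        b[5][0] = 1; b[5][1] = 1; b[5][2] = 1;    // player 1: three in a row
        System.out.println(isThreat(b, 5, 3, 1)); // true: slot completes 4-in-a-row
        System.out.println(isEvenThreat(5));      // false: row 5 is an odd row
    }
}
```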
To evaluate the board status we use a linear scoring function, which is a multiplication of the feature vector (a) by the weights vector (w):
Score = Σ ai*wi
There are 2 exceptions to that: in a winning state the score will be 1, and in a losing state it will be -1. This is because these 2 cases have a known score, and there is no need to evaluate them.
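A minimal sketch of this scoring function, including the two exceptions (the name and signature here are illustrative; the project's actual `stateScore()` is documented below):

```java
// Sketch of the linear scoring function with the win/loss exceptions.
public class Scoring {
    // Score = sum of a_i * w_i over the feature vector a and weight vector w,
    // except that a won game scores 1 and a lost game scores -1.
    static double stateScore(double[] features, double[] weights,
                             boolean won, boolean lost) {
        if (won)  return 1.0;
        if (lost) return -1.0;
        double score = 0.0;
        for (int i = 0; i < features.length; i++)
            score += features[i] * weights[i];
        return score;
    }

    public static void main(String[] args) {
        double[] a = {2, 1, 0};
        double[] w = {0.1, 0.3, 0.5};
        System.out.println(stateScore(a, w, false, false)); // 2*0.1 + 1*0.3 = 0.5
        System.out.println(stateScore(a, w, true, false));  // 1.0 (winning state)
    }
}
```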
We chose to use an on-policy TD(0) control algorithm - Sarsa, in order to improve our scoring function.
The basic concept of the Sarsa algorithm is that we would like to improve our scoring function. To do that, we choose a game move at time (t), check the score afterwards (and after the opponent’s action) at (t+1), and then update the weights so that the score of (t) moves closer to the score at (t+1):
wt+1 = wt + α·[r + score(st+1) − score(st)]·a(st)
where α is the learning rate, r is the reward, and a(st) is the feature vector of the state at time (t).
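The update above can be sketched in code as follows (an illustrative version of the weight update; the project's actual `updateWeightVector()` is documented later in the report):

```java
// Sketch of the TD(0) weight update:
// w <- w + alpha * (r + score(s_{t+1}) - score(s_t)) * a(s_t),
// where a(s_t) is the feature vector of the earlier state.
public class TdUpdate {
    static void updateWeights(double[] w, double alpha, double r,
                              double oldScore, double newScore,
                              double[] oldFeatures) {
        double tdError = r + newScore - oldScore;   // the TD error
        for (int i = 0; i < w.length; i++)
            w[i] += alpha * tdError * oldFeatures[i];
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.25};
        double[] a = {1.0, 2.0};    // features of the earlier state
        // r = 0 mid-game; the new state looked better than the old one,
        // so the TD error is 0.6 - 0.4 = 0.2 and the weights move toward it:
        updateWeights(w, 0.001, 0.0, 0.4, 0.6, a);
        // w is now approximately {0.5002, -0.2496}
        System.out.println(w[0] + " " + w[1]);
    }
}
```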
Choosing the next move during the learning process:
One of the challenges in choosing the moves is balancing out exploration vs. exploitation. The way we implemented the learning process is ε-greedy (in a probability of ε we will make a random move, otherwise we’ll find the best possible move using a min-max tree). We start off as knowing very little about the game, so we want to explore more (use more random moves). As the learning process develops we use less random moves and more “educated” ones.
Therefore, we set the value of ε to be 1 / (number of times the weights vector was updated).
The opponent always chooses his action in an “educated” manner, according to his scoring function.
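The ε-greedy choice with the shrinking ε described above can be sketched as (names are illustrative; in the project the greedy move comes from `calcMinMax()`):

```java
import java.util.Random;

// Sketch of the epsilon-greedy move choice with
// epsilon = 1 / (number of weight-vector updates so far).
public class EpsilonGreedy {
    static final Random RNG = new Random();

    // numUpdates: how many times the weight vector has been updated.
    // bestColumn would come from the min-max search; here it is a parameter.
    static int chooseColumn(int numUpdates, int bestColumn, int numColumns) {
        double epsilon = 1.0 / Math.max(1, numUpdates);
        if (RNG.nextDouble() < epsilon)
            return RNG.nextInt(numColumns);   // explore: random column
        return bestColumn;                    // exploit: greedy min-max move
    }

    public static void main(String[] args) {
        // Early in training (few updates) moves are mostly random; after many
        // updates epsilon shrinks and the greedy move dominates.
        int greedy = 0;
        for (int i = 0; i < 1000; i++)
            if (chooseColumn(100_000, 3, 7) == 3) greedy++;
        System.out.println(greedy > 950);  // almost always the greedy move
    }
}
```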
First of all, since we have 2 weight vectors, in each training iteration we learn both vectors (simulate an entire game where player 1 is the “learning” player, and another where player 2 learns).
The training process was as follows:
1. Ran 100,000 game simulations (for each weight vector) with a min-max depth of 3 for the learning player and 1 for the opponent (α = 0.001).
2. Ran 100,000 iterations with a min-max depth of 3 for the learning player and 2 for the opponent (α = 0.0001).
3. Ran 200,000 iterations with a min-max depth of 3 for both players (α = 0.0001).
4. Ran 200,000 iterations with a min-max depth of 5 for both players (α = 0.0001).
The idea behind this procedure was to let the learning player train gradually: first against weaker opponents, and as he gets “smarter”, against better ones.
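The schedule above can be written down as a configuration table (the `Stage` record and names are illustrative; each stage would drive calls to the project's `simulateGame()`):

```java
// Sketch of the staged training schedule as a configuration table.
public class TrainingSchedule {
    record Stage(int games, int learnerDepth, int opponentDepth, double alpha) {}

    static final Stage[] STAGES = {
        new Stage(100_000, 3, 1, 0.001),   // weakest opponent first
        new Stage(100_000, 3, 2, 0.0001),
        new Stage(200_000, 3, 3, 0.0001),  // equal-strength opponents
        new Stage(200_000, 5, 5, 0.0001),  // deeper search for both players
    };

    static int totalGames() {
        int total = 0;
        for (Stage s : STAGES) total += s.games;
        return total;
    }

    public static void main(String[] args) {
        // Per stage, the training loop runs `games` simulations with the given
        // min-max depths and learning rate (one game per weight vector).
        System.out.println(totalGames());  // prints 600000
    }
}
```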
1. Victor Allis, “A Knowledge-based Approach of Connect-Four”, Master’s thesis, Vrije Universiteit, Amsterdam, 1988.
2. The almighty Wikipedia.
3. R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
public void updateFeatureVec()
Evaluates the feature vector according to the board status: scans all 4-slot “windows” in the board and calls updateFeature() for each. Called after a coin is added.
public void updateFeature(int direction, Point emptySlot, int[] count, int player)
Updates features in the feature vector according to the details of the window. Variables: direction: the direction of the window (horizontal, vertical or diagonal). emptySlot: the position of the last empty slot in the window, if one exists (relevant only for threat updates). count: an array holding the counts of the empty, red and black coins in the window. player: the player who asks for the best action.
public static double stateScore(Vector weight, Vector state)
Calculates the state’s score based on the feature vector and the weight vector.
public static int calcMinMax(Vector weight, int depth, int player, BoardData board)
Decides what is the best action using a min-max recursive function (minMaxRecursion()). weight : the weight vector. depth : the depth of the min-max. player : the player who asks for the best action. board : the game board. The function returns the number of the column of the best action.
private static MinMaxRet minMaxRecursion(Vector weight, int depth, Boolean max, int player, double alpha, double beta, int actionDepth, BoardData b)
This is the main min-max recursion. It uses alpha-beta pruning in order to eliminate “impossible” action choices. The function returns the best action to perform and its score. The variables weight, depth, player and b are the same as in the previous function. max: marks the current node type (max or min); initially true. alpha: the score of the best known action so far (initially −infinity). beta: the score of the worst known action so far (initially +infinity). actionDepth: the depth of the leaf at the end of the chosen action chain. Used for the special case where one of the paths leads to a win and we want to choose the fastest winning action (the shallowest node).
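The pruning mechanism can be illustrated with a self-contained sketch (not the project's `minMaxRecursion()`: the real function expands board states and scores them with the scoring function, while here leaf scores of a perfect binary game tree are supplied directly):

```java
// Sketch of min-max search with alpha-beta pruning over a perfect binary
// tree whose leaf scores are given left-to-right.
public class AlphaBeta {
    static double search(double[] leaves, int lo, int hi, boolean max,
                         double alpha, double beta) {
        if (hi - lo == 1) return leaves[lo];           // leaf node
        int mid = (lo + hi) / 2;
        double best = max ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
        for (int[] range : new int[][]{{lo, mid}, {mid, hi}}) {
            double v = search(leaves, range[0], range[1], !max, alpha, beta);
            if (max) { best = Math.max(best, v); alpha = Math.max(alpha, v); }
            else     { best = Math.min(best, v); beta  = Math.min(beta, v); }
            if (beta <= alpha) break;   // prune: this line cannot be reached
        }
        return best;
    }

    public static void main(String[] args) {
        double[] leaves = {3, 5, 6, 9, 1, 2, 0, -1};
        // max( min(max(3,5), max(6,9)), min(max(1,2), max(0,-1)) ) = 5
        System.out.println(search(leaves, 0, leaves.length, true,
                Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY));
    }
}
```

Note how the prune fires on the right half: once its first sub-branch scores 2, which is already below the 5 guaranteed on the left, the remaining leaves need not be examined.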
public static void simulateGame(WeightVector myWeights, WeightVector opWeights, int player)
This function simulates 1 game and updates the weight vector according to the Sarsa algorithm. At the end of the game, the new weights vector is written to the vector file (for the learning player). myWeights : the weight vector of the learning player. opWeights : the weight vector of the opponent. player : the learning player number.
private static void updateWeightVector(WeightVector w, double r, double oldScore, double newScore, Vector oldState)
Updates the weight vector according to the TD(0) algorithm :
new w = w + ALPHA * (r + newScore – oldScore) * oldState
Note: ALPHA is a constant which is assigned on the call to the training function.
Contains all the data properties of the board, amongst them:
· A boolean 3D array of threats, of size 6x7x2, where the value at [row][col][player] tells whether the player has a threat at that location.
· A 2x2 matrix of threat counters (even/odd × player 1/player 2).
· A feature vector (a Vector: contains an array of a specific size and the size of that array, and performs various vector operations).
· 2 weight vectors, 1 for each player (a WeightVector is an extension of Vector which also handles writing to and reading from file).
· The board itself, represented by a 6x7 Slot matrix (a Slot is a board slot; it contains all the slot data and the coin placed in the slot, if it exists).