Optimal Asset Allocation


In this work, the asset allocation problem is considered as a Markovian Decision Process which can be optimized using Reinforcement Learning based algorithms. Using An artificial model of a stock market, it is shown that SARSA optimization can be used to find a successful policy for portfolio management.

This project is based on the paper Optimal Asset Allocation using Adaptive Dynamic Programming by Ralph Neuneier, 1997. It presents and develops the ideas introduced in this paper, implements a simulation demonstrating these ideas and describes experiments with different stock models in order to show that reinforcement learning can successfully be utilized to this problem.

This work was submitted as a final project in a Reinforcement Learning course given in Tel-Aviv University, 1999/2000, by Prof. Yishai Mansour.

Moti Gindi, 031826035

The Problem

Asset allocation (or portfolio management) is the investment of capital to various trading opportunities (in our case, several different stocks). A portfolio is constructed with the aim of achieving a maximal expected return for a given risk level and time horizon. To compose a theoretically optimal portfolio, the investor has to solve a difficult optimization problem consisting of two phases (Breally, 1991). First, the expected yields are estimated simultaneously with a certainty measure. Second, based on these estimates, a portfolio is constructed obeying the risk level the investor is willing to accept (the so-called mean-variance method). The problem is further complicated if transaction costs (commission) must be considered and if the investor wants to revise the decision at every time step.

In the following work, the modeling phase and the search for optimal portfolio management decisions are considered in the framework of Markovian Decision Process, MDP.


The Model

The model defines the environment in which the agent (our investor) is operating. In our case the model consists of two elements:

n A stock market containing several different stocks each with different behavior pattern.

n A portfolio consisting of the current assets value and the current investment details.

The model behavior is defined based on the following assumptions. Some of these assumptions are actually simplifications of the original problem. Yet, these simplifications do not restrict the generalization of the proposed methods with respect to real applications. They were done merely because of computational reasons. The assumptions:

n The stock market contains several stocks each with different behavior.

n The investor can decide in each step either to stay out of the market or to invest in a single stock. It is not possible to invest in more then one stock simultaneously.

n The investor is small and does not influence the market by his trading.

n The investor has no risk aversion and always invests the total amount of money he has[1].

n The investor may trade at each time step for an infinite time horizon.

n Each transaction made by the investor costs a commission. The commission is a combination of a fixed base value and a variable cost, depending on the investment value.

As mentioned earlier, each stock in the market has its own behavior pattern. The basic behavior of a stock is stochastic, i.e. based on a random process. The parameters of this random process defines the behavior pattern of the stock.. These parameters are:

n Minimum, Maximum and Initial value - defines the range of values that the stock can have.

n Trend - defines the trend of the stock. The decision at each time step whether the stock will increase or decrease is based on the result of a coin toss (Bernoulli trial). The trend of the stock is actually defined by the probability of success (i.e. the probability of value increase) in this trail. The success probability can be defined for each possible value of the stock. Therefore the trend is actually represented as a vector [Min...Max] of success probabilities.

n Stability - defines the stability of the stock. After deciding at each time step if the stock should increase or decrease, it should be decided what will be the amount of change. This amount defines the stability of the stock - as it is larger, the changes of the stock are much more drastic, and thus it is less stable.
The choice of the amount of change at each time step is based on a Poisson process. The number of arrivals at a single time step is translated to the amount of change in the stock value. This number is computed as a random deviate drawn from a Poisson distribution As the number of arrivals is bigger, the change will be grater.
The mean (l) of the Poisson distribution defines its center and therefore defines it in a precise manner. The mean value can be defined for each possible value of the stock. Therefore the stability is actually represented as a vector [Min...Max] of Poisson distribution mean values.

As an example for the above description of stock configuration, lets look at a specific stock definition and its behavior. The stock parameters and transition probabilities are chosen to simulate a situation where the value follows an increasing trend, but with higher values, a drop to a very low values becomes more and more probable. Using the above terminology it can be said that the stability of the stock is decreasing as its value is increasing. The exact configuration of the stock is as follows:

Minimum: 26, Maximum: 40, Initial: 31

Trend:[1.000000 0.966667 0.933333 0.900000 0.866667 0.833333 0.800000 0.766667 0.733333

0.700000 0.666667 0.633333 0.600000 0.566667 0.000000]

Stability: [0.333333 0.666667 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000

3.333333 3.666667 4.000000 4.333333 4.666667 5.000000]

A realization of this stock behavior will look like this:

Figure 1: A realization of a stock behavior

At each step the investor can decide to change his investment. The decision made at each step is whether to invest in the stock market or not, and in which stock to invest if the decision was positive. This action is translated to a series of buy and sell actions, depending on the request and the current holdings of the investor.

For example, if the investor currently holds the stock STCK1 and he wants to invest at the next time step in the stock STCK2, the request will be translated to the following actions:

n Sell current holding of STCK1, pay commission for selling.

n Buy STCK2 using the total portfolio value, pay commission for buying.

Each time step in the model symbolizes another day in the market. At each new day the following actions take place:

1. The new value of each stock is determined.

2. The new value of the investor's portfolio is computed, based on his investments in the previous day. The new value is computed using the formula:

where is the portfolio value at time t, and is the value of the invested stock.

3. The investor decides on his action for the next day and performs it, paying commission if necessary:

e being the commission paid: .


Portfolio Management as a Markovian Decision Problem

MDP provides a model for decision making problems in a stochastic environment. MDP can be described by a finite state set, a finite set of admissible control actions for every state, a set of transition probabilities which describes the dynamics of the system and a return function.

In the context discussed in this work, the state of the MDP is defined by elements which describe the financial time series, i.e. the stocks values, and of elements which quantify the current characteristics of the portfolio.

Each state contain the following information:

n The current value of each stock in the market - a vector of values.

n The current invested stock - an index indicating the stock id, 0 indicating staying out of the market.

n The current percentage of the portfolio invested - as mentioned earlier, in the current implementation only the total amount (100%) can be invested.

n The current value of the portfolio - since the amount of money in the portfolio is actually represented as a real value, with infinite optional values, it is necessary to use discretization in order to reduce the possible number of value. Therefore this value is discretized into integer bins. The size of every bin is a parameter that can be determined by the user (a size of 1, for example, causes a simple rounding of the real value).

Note, that out of the variables which form the state vector, the stock value is actually independent of the portfolio decisions, but the wealth and returns are not. Therefore, asset allocation is a control problem, and not merely a prediction problem.

At each time step the investor can choose to perform an action at the market. This action is equal to an action in the MDP. The action contains the following elements:

n In which stock to invest - an index indicating the requested stock id, 0 indicating staying out of the market. As explained above, this decision is translated by the model into a series of buy and sell actions, depending on the current holding of the investor.

n The percentage of the portfolio to invest - as mentioned earlier, in the current implementation only the total amount (100%) can be invested.

The immediate return function in the MDP is computed as a function of the action chosen, the current portfolio value and the change in the invested stock value:

The return is equal to the relative change of the stock value weighted with the portfolio value. That return is reduced by the transaction commission, if necessary. If it was decided to stay out of the market, the change of the stock value is considered to be 0 (and then the only amount paid is the commission for selling, if necessary).

Since the state of the MDP is determined by the combination of the stocks values, the chosen investment and the portfolio value - the transition function of the MDP is stochastic on the stocks value and portfolio value and deterministic on the chosen investment (which is chosen by the investor in each new day).

In each time step the new values of the stocks and the portfolio are computed, and a new action is chosen. These three elements determine the new state of the system.


The Learning Process

As explained in the previous section the transition function of the MDP is implicit in the model. In each time step the new values of the stocks and the portfolio are computed, and a new action is chosen. These three elements determine the new state of the system. Yet, the transition probabilities are unknown analytically. Therefore the process of finding the optimal policy involves a Reinforcement-Learning method which does not require a model of the system but optimizes the policy by sampling state-action pairs and returns while interacting with the system.

In this work, the optimal policy is computed using the SARSA algorithm. SARSA is an on-line algorithm - it can update the policy used for the learning process while interacting with the environment.

In this algorithm we always investigate two states - the current and the next one. Its name is derived from the fact that we use the current state (S), current action (A), current reward (R), next state (S) and next action (A). We hold at every step the expected reward of every pair of state and action (s,a) which are reached during the optimization process - Q(s,a). The updating of the Q function is done using the formula:

The choice of the next action at every time step in crucial. It presents the exploration vs. exploitation dilemma. The most popular way of solving this dilemma is using an e-greedy action-selection scheme. Under this scheme, at every state, we chose at probability (1-e) the optimal action of this state and at probability e we chose a random action. e can be changed during the optimization process to offer a more exploitative behavior in the initial phases and more exploitative in later phases.

Although e-greedy action-selection is an effective and popular means of balancing exploration and exploitation in reinforcement learning, one drawback is that when it explores it chooses equally among all actions. This means that it is just as likely to choose the worst appearing action as it is to choose the next-to-best. In tasks where the worst actions are very bad, this may be unsatisfactory. The obvious solution is to vary the action probabilities as a graded function of estimated value. The greedy action is still given the highest selection probability, but all the others are ranked and weighted according to their value estimates. These are called Softmax action selection rules. The most common softmax method uses a Gibbs, or Boltzmann, distribution. It chooses action a on the tth step with probability:

where T is a positive parameter called the temperature. High temperatures cause the actions to be all (nearly) equi-probable. Low temperatures cause a greater difference in selection probability for actions that differ in their value estimates.

From the literature, it is unclear whether softmax action selection or e-greedy action selection is better. This may depend on the task and on human factors. The implemented system supports both these options.

In order to get an intuition about the degree of the convergence of the optimization process, at every update of the Q-function by the SARSA algorithm it is checked whether this update have changed the choice of the optimal action in this state. If so, this optimization cycle is assumed to be a significant cycle. Naturally, while the optimization process progresses and convergence is achieved, the number of new significant cycles is decreasing.


The Application

The application consists of three main modules:

n Model - implementation of the environment of the agent operation, as described above.

n Optimizer - implementation of a MDP and of a SARSA-Optimizer, as described above.

n User Interface - a graphic representation of the system state, giving the ability to determine parameters relevant to the model and to the optimization process.

The main user interface window holds information about:

n The daily value of each stock in market.

n The daily decision made by the agent.

n The daily portfolio value.

As demonstrated in the following screen shot, this information is presented graphically throughout the optimization process. In addition, the main window also presents some textual information about the progress of the optimization process. This data is presented in the bottom of the window.

Figure 2: The main control window of the application

The first step while working with the application is loading a model. A model is described using a configuration file. The structure of the model configuration file is straight forward. It is demonstrated as follows:

2.0 // Initialization value of the portfolio

2// Number of stocks in the market

STCK1 // Name of 1st stock

26 40 31 // Minimum, Maximum and Initialization value of the 1st stock

1.000000 0.966667 0.933333 0.900000 0.866667 0.833333 0.800000 0.766667 0.733333 0.700000 0.666667 0.633333 0.600000 0.566667 0.000000 // Trend of the 1st stock

0.333333 0.666667 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333 3.666667 4.000000 4.333333 4.666667 5.000000 // Stability of the 1st stock

STCK2// Name of the 2nd stock

26 40 31 // etc.

0.800000 0.800000 0.300000 0.300000 0.800000 0.800000 0.300000 0.200000 0.800000 0.700000 0.200000 0.200000 0.700000 0.700000 0.200000

0.222222 0.222222 0.444444 0.444444 0.444444 0.666666 0.666666 1.000000 0.666666 0.666666 0.444444 0.444444 0.444444 0.222222 0.222222

During the optimization process it is possible, at any point, to load or save the current policy. Serializing the policy, and later on loading it from the disk - gives to ability to save a partial result of the optimization process for later use.

There are some general parameters of the simulation and optimization process that can be determined by the user. This can be done using the options window of the application:

  3 (  ): Whether to use Epsilon-Greedy or Softmax selection method  3 (  ): Time out in milliseconds between every time step  3 (  ): In automatic mode if the portfolio value is 0, then the simulation is restarted automatically  3 (  ): The bin size when translating the real portfolio value to an integer value  3 (  ): Whether to use fixed alpha value or decreasing value

Figure 3: The configuration options of the application

The application can generate a textual log of the optimization and simulation process. This enable the user to carefully examine the behavior of the system. Yet, since that in most cases the application is active for a long time, the log generated tends to be very large. Therefore, this option is usually turned off.

In the log, every module and sub-module of the application reports of any major action it performes. The structure of the log file is straight forward. It is demonstrated as follows:

The logged running time is Fri May 05 22:04:34 2000

RandomGenerator::Generator is up

StockMarket::Market is up

Portfolio::Portfolio is up, init value is 2.000000

StockMarket::Adding a new stock (#1) STCK1 to market

Stock::Created stock STCK1 (16,30,21)

Stock::Initialized trend of stock STCK1: (1.000000 0.966667 0.933333 0.900000 0.866667 0.833333...)

Stock::Initialized stability of stock STCK1: (0.333333 0.666667 1.000000 1.333333 1.666667 2.000000 ...)

Model::Model is up with 1 stocks

Policy::Created a new policy with init value of 0.000000

Optimizer::Optimizer is up with Epsilon value of 0.100000 and Lambda value of 1.000000

Optimizer::Epsilon value was changed to 0.100000

Optimizer::Lambda value was changed to 1.000000

Optimizer::Alpha value was changed to -1.000000

Optimizer::Action select method was changed to 0

Optimizer::A single cycle of optimization is starting

Optimizer::>Current state (S) is <[21],0,1.000000,2.000000>

Optimizer::>Chose to perform optimal action

Optimizer::>Chose to perform first action (A) <1,1.000000>

Model::Performing action <1,1.000000>

Portfolio::Payed commission of 0.120000, current value is 1.880000

Portfolio::Invested 100% of total value 1.880000 in stock #1

StockMarket::A new day in the market

Stock::Increased STCK1 by 1, new value is 22

Portfolio::Investment changed in 4%, current value is 1.969524

Optimizer::>Reward for the action was -0.030476

Optimizer::>New state (SS) is <[22],1,1.000000,2.000000>

Optimizer::>Chose to perform optimal action

Optimizer::>Chose to perform second action (AA) <1,1.000000>

Optimizer::>Updated (S,A) with -0.030476

Optimizer::A single cycle of optimization is starting

Optimizer::>Current state (S) is <[22],1,1.000000,2.000000>




In this section we use the model and optimization algorithms presented to demonstrate how Reinforcement Learning can be used to optimize asset allocation.

Since it is very easy to configure the model tested using the described configuration files, it is also very easy to construct different testing experiments. During the testing phase of the application some experiments were conducted. From here on these experiments will be described[2].

In the first experiment the investment opportunities included only a single stock. The stock parameters and transition probabilities were chosen to simulate a situation where the value follows an increasing trend, but with higher values, a drop to a very low values becomes more and more probable. The exact configuration of the stock was presented in the first section.

The initial investment was 2.0, at each step the agent had to decide either to change his investment or to remain with the present holdings. Each transaction cost a commission as described in the previous sections.

The agent had the possibility of interacting with the environment continuously. Every time the portfolio value has reached to zero, the model was initialized to its initial state (naturally, without initializing the policy). The Q-values were initialized with zero, the e value was 0.1 and the l parameter was fixed at 0.5 (causing the algorithm to be quite greedy for getting a profit in the present, giving a relatively small importance to future states values).

After ~10000 time steps (days in the market) the simulation was stopped and initialized. In order to achieve a more exploitative behavior the e was decreased to 0.01. The following graphs present the outcome of the agent actions for the next 300 days:

Figure 4: The stock value (top), the portfolio value and the decisions (bottom)

The success of the policy can be easily demonstrated by the continuous increase of the portfolio value (from 2 to 100) in this time period. Lets look more thoroughly at some interesting decisions made by the agent:

n At the beginning, the stock value oscillated at high values. The policy chose to stay out of the market during this period, because the risk was too high. This kind of behavior returned throughout all the time period (for example, days 97-105, 107-120 etc.). In these cases the agent chose not to risk his current holdings, rather to wait until the stock value will decrease.

n In the beginning of the time period, when the portfolio value is relatively small, the agent tended to be much more conservative then later own. It chose to stay out of the market, when the risk was too high, for much longer periods then later. This course of action seems reasonable when taking into account the fact that in small values, the fixed commission paid for selling of buying is relatively significant, and thus the agent should invest only when he has a very good chance for winning. Loosing when the total portfolio value is small decreases dramatically the possibility for future investments. Therefore the agent tried to avoid such situations.

n When a drop to low stock values became very probable, the agent anticipated the decrease and sold his holdings. This kind of behavior can be found at days 73, 140 etc.

n Looking at day 67, 73 or 137 a similar behavior pattern emerges - when there is a small decrease in the stock value which is sufficient for the agent to get back into the market, he does so. Knowing that the stock has an increasing trend, in most cases this decision is justified.

Note that the amount of change of the portfolio value get higher in magnitude as the value is increasing. This is because the agent has no risk aversion and has no choice but always to invest to total capital.

The state space in this experiment contained roughly 25x2x100=5000 states. Yet the obtained policy contained only 2363 states that were reached at least once. Out of these, around half (1148) were reached only once. This suggest that the theoretical state space in much larger then the one encountered in reality.

It should be mentioned that the same experiment was conducted using softmax action-selection scheme. However, in this case the convergence time of the algorithm was much slower. It seems that in most cases the algorithm decided to stay out of the market and thus did not have any ability to increase the portfolio value. Since, in many cases when investing the first time in the stock the reward was negative (because of the commission which had to be paid), the option of staying out of the market became the best action. Therefore the options of doing nothing was chosen many times, creating a very slow improvement of the agent decisions. It seems (as was proven in the next experiments) that the softmax scheme operates better when there is more optional actions (in this model there were only two).

Another experiment was conducted using a stock market of two stocks. In this experiment a second stock was added. It also had an increasing trend, slightly smaller then the one of the first stock. Yet, this stock was much more solid - the amount of change at each time step was very small (mean value of 1). The exact configuration of the model tested can be found at the download section.

The running parameters in this experiment were identical to the previous one. However, the obtained policy was checked only after ~100000 time steps. Since this model only added to the investment opportunities of the first model, it could be expected that (after training of the agent) the increase rate of the portfolio value of this model will be faster. After all, the agent could improve the previous policy using the second stock investment possibility. Indeed, the resulting increase rate of the capital was slightly faster in this experiment: After initializing the model, it took to the agent around 250 days to reach a portfolio value of 100 (in comparison to 300 days in the previous experiment).

Figure 5: The stocks values (top), the portfolio value and the decisions (bottom)

In this experiment the possible state space contained 25x10x3x100=75000 states. Yet the obtained policy contained only 33667 states that were reached at least once. Out of these, around half (14096) were reached only once. It is interesting to notice that the ratio between the total number of possible states, the states actually visited and the states visited only once - was roughly the same in the two experiments. This fact might suggest that the pattern of states visits is somehow inherent to the model, and does not depend dramatically in the specific model.

The same model was also tested using softmax action-selection scheme. In contrast to the first experiment, there was no noticeable change in convergence time of the two methods. As mentioned earlier, this behavior can possibly be accounted for by the larger number of possible action at each state.



In this work, the task of asset allocation (portfolio management) was approached by Reinforcement Learning algorithms. SARSA optimization was successfully utilized to this problem using an artificial stock market model.

Future work has the address the possibility of risk aversion (by allowing investment of only part of the portfolio value) and the possibility of several alternative investments simultaneously. These options present a dramatic increase in the dimension of the state space and therefore may force the use of Neural Networks as value function approximators.



n Optimal Asset Allocation using Adaptive Dynamic Programming, paper by Ralph Neuneier, Siemens AG, 1997.

n Reinforcement Learning: An Introduction, book by Richard S. Sutton and Andrew G. Barto, MIT Press, Cambridge, 1998.

n Numerical Recipes in C: The Art of Scientific Computing", a book from Cambridge University Press.



The Optimal Asset Allocation Program

The Optimal Asset Allocation Source Code

Experimented Models and Policies

A DOS Utility that can be used to display a policy (*.plc files) in a readable textual format

[1] Actually, in the implemented model and MDP there is a possibility to decide to invest only part of the portfolio value. Yet, while experimenting with the application is turned out that adding this level of freedom to the agent increased very drastically the convergence time of the learning process. Therefore it was decided that, in this stage, the investing possibility will be restricted to investing the total amount of portfolio value.

[2] Notice that the model configuration file and the optimal policy file of each experiment can be found at the download section.