In this work, the
asset allocation problem is considered as a Markovian Decision
Process which can be optimized using Reinforcement
Learning based algorithms. Using an artificial model of a
stock market, it is shown that SARSA optimization can be
used to find a successful policy for portfolio management.
This project is
based on the paper “Optimal Asset Allocation using Adaptive
Dynamic Programming” by Ralph Neuneier, 1997. It presents and
develops the ideas introduced in this paper, implements a
simulation demonstrating these ideas and describes experiments
with different stock models, in order to show that reinforcement
learning can be successfully applied to this problem.
This work was
submitted as a final project in a Reinforcement Learning course
given at Tel-Aviv University, 1999/2000, by Prof. Yishai Mansour.
Asset allocation (or
portfolio management) is the investment of capital in various
trading opportunities (in our case, several different stocks). A
portfolio is constructed with the aim of achieving a maximal
expected return for a given risk level and time horizon. To
compose a theoretically optimal portfolio, the investor has to
solve a difficult optimization problem consisting of two phases (Brealey,
1991). First, the expected yields are estimated simultaneously
with a certainty measure. Second, based on these estimates, a
portfolio is constructed obeying the risk level the investor is
willing to accept (the so-called mean-variance method).
The problem is further complicated if transaction costs (commission)
must be considered and if the investor wants to revise the
decision at every time step.
In the following
work, the modeling phase and the search for optimal portfolio
management decisions are considered in the framework of a Markovian
Decision Process (MDP).
The model defines
the environment in which the agent (our investor) is operating.
In our case the model consists of two elements:
- A stock market containing several different stocks, each with a different behavior pattern.
- A portfolio consisting of the current assets value and the current investment details.
The model behavior
is defined based on the following assumptions. Some of these
assumptions are actually simplifications of the original problem.
Yet, these simplifications do not restrict the generality of
the proposed methods with respect to real applications; they were
made merely for computational reasons. The assumptions:
- The stock market contains several stocks, each with its own behavior pattern.
- The investor can decide at each step either to stay out of the market or to invest in a single stock. It is not possible to invest in more than one stock simultaneously.
- The investor is small and does not influence the market by his trading.
- The investor has no risk aversion and always invests the total amount of money he has.
- The investor may trade at each time step for an infinite time horizon.
- Each transaction made by the investor costs a commission. The commission is a combination of a fixed base value and a variable cost, depending on the investment value.
As mentioned earlier, each stock in the market has its own behavior pattern.
The basic behavior of a stock is stochastic, i.e. based on a
random process. The parameters of this random process define the
behavior pattern of the stock. These parameters are:
- Minimum, Maximum and Initial value - define the range of values that the stock can have.
- Trend - defines the trend of the stock. The decision at each time step whether the stock will increase or decrease is based on the result of a coin toss (Bernoulli trial). The trend of the stock is actually defined by the probability of “success” (i.e. the probability of a value increase) in this trial. The success probability can be defined for each possible value of the stock. Therefore the trend is actually represented as a vector [Min...Max] of success probabilities.
- Stability - defines the stability of the stock. After deciding at each time step whether the stock should increase or decrease, the amount of change must be determined. This amount defines the stability of the stock - the larger it is, the more drastic the changes of the stock, and thus the less stable the stock. The choice of the amount of change at each time step is based on a Poisson process: the number of “arrivals” in a single time step is translated to the amount of change in the stock value. This number is computed as a random deviate drawn from a Poisson distribution; the more arrivals, the greater the change. The mean (λ) of the Poisson distribution defines its “center” and therefore defines it in a precise manner. The mean value can be defined for each possible value of the stock. Therefore the stability is actually represented as a vector [Min...Max] of Poisson distribution mean values.
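The stock process described above can be sketched in Python. This is a minimal illustration, not the project's actual code; the helper names and all parameter values are invented for the example:

```python
import math
import random

MIN_VAL, MAX_VAL = 26, 40  # example value range for a single stock

def poisson_sample(lam, rng):
    # Draw a Poisson deviate via Knuth's method: count uniform draws
    # until their running product falls below exp(-lam).
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def step(value, trend, stability, rng):
    # Direction: a Bernoulli trial whose success probability is the
    # trend entry for the current stock value.
    up = rng.random() < trend[value]
    # Magnitude: a Poisson deviate whose mean is the stability entry
    # for the current stock value.
    change = poisson_sample(stability[value], rng)
    value += change if up else -change
    return max(MIN_VAL, min(MAX_VAL, value))  # clamp to [Min, Max]
```

With a flat trend vector (probability 0.6 of rising) and a mean step size of 1, a simulated path drifts upward while staying inside the allowed range.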
As an example of the above description of stock configuration,
let's look at a specific stock definition and its behavior. The
stock parameters and transition probabilities are chosen to
simulate a situation where the value follows an increasing trend,
but with higher values, a drop to very low values becomes more
and more probable. Using the above terminology, it can be said
that the stability of the stock decreases as its value increases.
The exact configuration of the stock is as follows:
26, Maximum: 40,
Trend: [0.966667 0.933333 0.900000 0.866667 0.833333 0.800000 0.766667 0.733333 0.700000 0.666667 0.633333 0.600000 0.566667 0.000000]
Stability: [0.333333 0.666667 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.333333 3.666667 4.000000 4.333333 4.666667 5.000000]
A realization of this stock's behavior will look like this:
Figure 1: A realization of a stock's behavior
At each step the investor can decide to change his investment.
The decision made at each step is whether to invest in the stock
market or not, and, if so, in which stock to invest. This
action is translated into a series of buy and sell actions,
depending on the request and the current holdings of the investor.
For example, if
the investor currently holds the stock STCK1 and he wants to
invest at the next time step in the stock STCK2, the request will
be translated to the following actions:
- Sell the current holding of STCK1, paying commission for selling.
- Buy STCK2 using the total portfolio value, paying commission for buying.
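The translation of a decision into orders can be sketched as follows. This is an illustrative helper, assuming the commission model described above (a fixed base fee plus a value-proportional part); the names and fee values are not from the original code:

```python
def translate_action(current, target, value, base_fee, rate):
    """Turn a target holding (0 = out of the market) into sell/buy
    orders, charging commission for each transaction."""
    orders = []
    if current == target:          # keeping the same holding costs nothing
        return orders, value
    if current != 0:               # first sell the current holding
        value -= base_fee + rate * value
        orders.append(("sell", current))
    if target != 0:                # then buy the requested stock
        value -= base_fee + rate * value
        orders.append(("buy", target))
    return orders, value
```

For example, switching from one stock to another produces a sell followed by a buy, with commission charged twice.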
Each time step in
the model symbolizes another day in the market. At each new day
the following actions take place:
1. The new value of each stock is determined.
2. The new value of the investor's portfolio is computed, based on his investments in the previous day. The new value is computed using the formula

   v_{t+1} = v_t * (x_{t+1} / x_t)

   where v_t is the portfolio value at time t, and x_t is the value of the invested stock.
3. The investor decides on his action for the next day and performs it, paying commission if necessary:

   v_t <- v_t - c_t

   with c_t being the commission paid: a fixed base value plus a part proportional to the invested value.
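The two wealth updates of a market day can be written as a short sketch. The split of the commission into base and proportional coefficients is an assumption; the fully-invested update itself matches the sample log shown later in the document, where a holding of 1.88 grows to about 1.969524 when the stock moves from 21 to 22:

```python
def new_portfolio_value(v_t, x_t, x_t1):
    # Fully-invested wealth update: the portfolio scales with the
    # relative change of the invested stock, v_{t+1} = v_t * x_{t+1} / x_t.
    return v_t * x_t1 / x_t

def pay_commission(v, base, rate):
    # Commission: fixed base value plus a part proportional to the
    # invested value (the exact coefficients here are assumptions).
    return v - (base + rate * v)
```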
Portfolio Management as a Markovian Decision Problem
MDP provides a model for decision-making problems in a stochastic
environment. An MDP is described by a finite state set, a finite
set of admissible control actions for every state, a set of
transition probabilities which describe the dynamics of the
system, and a return function.
In the context
discussed in this work, the state of the MDP is
defined by elements which describe the financial time series, i.e.
the stocks' values, and by elements which quantify the current
characteristics of the portfolio.
Each state contains
the following information:
- The current value of each stock in the market - a vector of values.
- The currently invested stock - an index indicating the stock id, with 0 indicating staying out of the market.
- The current percentage of the portfolio invested - as mentioned earlier, in the current implementation only the total amount (100%) can be invested.
- The current value of the portfolio - since the amount of money in the portfolio is actually represented as a real value, with infinitely many possible values, it is necessary to use discretization in order to reduce the number of possible values. Therefore this value is discretized into integer “bins”. The size of every bin is a parameter that can be determined by the user (a size of 1, for example, causes a simple rounding of the real value).
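A state can then be assembled as a small tuple. The bin-based discretization is the one described above; the tuple layout and helper names are only illustrative:

```python
def discretize(value, bin_size=1.0):
    # Map the real-valued portfolio value onto an integer bin index;
    # a bin size of 1 amounts to simple rounding.
    return round(value / bin_size)

def make_state(stock_values, invested, portfolio_value, bin_size=1.0):
    # State: stock values, index of the held stock (0 = out of the
    # market) and the discretized portfolio value.
    return (tuple(stock_values), invested, discretize(portfolio_value, bin_size))
```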
Note that, out of
the variables which form the state vector, the stock value is
actually independent of the portfolio decisions, but the wealth
and returns are not. Therefore, asset allocation is a control
problem, and not merely a prediction problem.
At each time step the investor can choose to perform an action
in the market. This action corresponds to an action in the MDP
and contains the following elements:
- In which stock to invest - an index indicating the requested stock id, with 0 indicating staying out of the market. As explained above, this decision is translated by the model into a series of buy and sell actions, depending on the current holding of the investor.
- The percentage of the portfolio to invest - as mentioned earlier, in the current implementation only the total amount (100%) can be invested.
The immediate return function in the MDP is computed as a
function of the action chosen, the current portfolio value and
the change in the invested stock value:

r_t = v_t * (x_{t+1} - x_t) / x_t - c_t

where v_t is the portfolio value, x_t the value of the invested
stock and c_t the commission paid. The return is equal to the
relative change of the stock value weighted by the portfolio
value, reduced by the transaction commission, if necessary. If it
was decided to stay out of the market, the change of the stock
value is considered to be 0 (and then the only amount paid is the
commission for selling, if necessary).
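A sketch of this return computation (the helper name is illustrative). The numbers in the sample log later in the document - a post-commission holding of 1.88, a stock move from 21 to 22 and a commission of 0.12 yielding a reward of -0.030476 - are consistent with it:

```python
def immediate_return(v_t, x_t, x_t1, invested, commission):
    # Relative change of the invested stock, weighted by the portfolio
    # value; 0 when staying out of the market. Any commission paid is
    # subtracted from the return.
    change = v_t * (x_t1 - x_t) / x_t if invested else 0.0
    return change - commission
```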
Since the state of the MDP is determined by the combination of
the stocks' values, the chosen investment and the portfolio
value, the transition function of the MDP is stochastic in the
stocks' values and the portfolio value, and deterministic in the
chosen investment (which is chosen by the investor each new day).
In each time step
the new values of the stocks and the portfolio are computed, and
a new action is chosen. These three elements determine the new
state of the system.
As explained in the previous section, the transition function of
the MDP is implicit in the model, and the transition
probabilities are unknown analytically. Therefore
the process of finding the optimal policy involves a Reinforcement-Learning
method which does not require a model of the system but optimizes
the policy by sampling state-action pairs and returns while
interacting with the system.
In this work, the
optimal policy is computed using the SARSA algorithm.
SARSA is an on-line algorithm - it can update the policy used for
the learning process while interacting with the environment.
In this algorithm we always consider two states - the current one
and the next one. Its name is derived from the fact that we use
the current state (S), current action (A), current reward (R),
next state (S') and next action (A'). For every state-action pair
(s,a) reached during the optimization process we maintain an
estimate of its expected return, Q(s,a). The Q function is
updated using the formula

Q(s,a) <- Q(s,a) + α * [r + λ * Q(s',a') - Q(s,a)]

where α is the learning rate and λ is the discount factor.
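In Python, the tabular update can be sketched as follows; the class name and the default parameter values are placeholders:

```python
from collections import defaultdict

class Sarsa:
    def __init__(self, alpha=0.1, gamma=0.5):
        self.q = defaultdict(float)      # Q(s,a), zero-initialized
        self.alpha = alpha               # learning rate
        self.gamma = gamma               # discount factor (Lambda in the text)

    def update(self, s, a, r, s2, a2):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma*Q(s',a') - Q(s,a))
        target = r + self.gamma * self.q[(s2, a2)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
        return self.q[(s, a)]
```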
The choice of the next action at every time step is crucial. It
presents the exploration vs. exploitation dilemma. The most
popular way of handling this dilemma is an ε-greedy
action-selection scheme. Under this scheme, at every state, we
choose with probability (1-ε) the optimal action of this state
and with probability ε a random action. ε can be changed during
the optimization process to obtain a more explorative behavior in
the initial phases and a more exploitative one in later phases.
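The ε-greedy rule itself is a few lines; this is a generic sketch over a Q-table keyed by (state, action) pairs:

```python
import random

def epsilon_greedy(q, state, actions, eps, rng=random):
    # With probability eps explore uniformly at random; otherwise
    # exploit the action with the highest estimated value.
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```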
Although ε-greedy action-selection is an effective and popular means of balancing
exploration and exploitation in reinforcement learning, one
drawback is that when it explores it chooses equally among all
actions. This means that it is just as likely to choose the worst
appearing action as it is to choose the next-to-best. In tasks
where the worst actions are very bad, this may be unsatisfactory.
The obvious solution is to vary the action probabilities as a
graded function of estimated value. The greedy action is still
given the highest selection probability, but all the others are
ranked and weighted according to their value estimates. These are
called softmax action selection rules. The most common softmax
method uses a Gibbs, or Boltzmann, distribution. It chooses
action a on the t-th step with probability

P_t(a) = e^{Q_t(a)/T} / Σ_b e^{Q_t(b)/T}

where T is a positive parameter called the temperature. High
temperatures cause the actions to be all (nearly) equiprobable.
Low temperatures cause a greater difference in selection
probability for actions that differ in their value estimates.
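A sketch of Boltzmann selection; subtracting the maximum preference before exponentiating is a standard numerical-stability trick, not part of the original description:

```python
import math
import random

def softmax_action(q, state, actions, temperature, rng=random):
    # P(a) is proportional to exp(Q(s,a) / T): high T makes the actions
    # nearly equiprobable, low T approaches greedy selection.
    prefs = [q.get((state, a), 0.0) / temperature for a in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]
    r = rng.random() * sum(weights)
    for a, w in zip(actions, weights):
        r -= w
        if r <= 0:
            return a
    return actions[-1]   # guard against floating-point round-off
```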
In the literature, it is unclear whether softmax action selection
or ε-greedy action selection is better. This may depend on the
task and on human factors. The implemented system supports both
of these options.
In order to get an intuition about the degree of convergence of
the optimization process, at every update of the Q-function by
the SARSA algorithm it is checked whether this update has changed
the choice of the optimal action in that state. If so, this
optimization cycle is considered a “significant cycle”.
Naturally, as the optimization process progresses and convergence
is achieved, the number of new significant cycles decreases.
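This bookkeeping can be sketched as a wrapper around the Q-update; the function and argument names are illustrative:

```python
def is_significant(q, state, actions, apply_update):
    # A cycle is "significant" if applying the Q-update changes the
    # greedy action in the updated state. Counting significant cycles
    # gives a rough convergence signal: fewer of them means the
    # policy is settling down.
    before = max(actions, key=lambda a: q.get((state, a), 0.0))
    apply_update()
    after = max(actions, key=lambda a: q.get((state, a), 0.0))
    return before != after
```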
The implemented application consists of three main modules:
- Model - an implementation of the environment in which the agent operates, as described above.
- Optimizer - an implementation of an MDP and of a SARSA optimizer, as described above.
- User Interface - a graphic representation of the system state, giving the ability to determine parameters relevant to the model and to the optimization process.
The main user
interface window holds information about:
- The daily value of each stock in the market.
- The daily decision made by the agent.
- The daily portfolio value.
As demonstrated in
the following screen shot, this information is presented
graphically throughout the optimization process. In addition, the
main window also presents some textual information about the
progress of the optimization process. This data is presented at
the bottom of the window.
Figure 2: The main
control window of the application
The first step when working with the application is loading a
model. A model is described using a configuration file. The
structure of the model configuration file is straightforward, as
demonstrated below:
Initialization value of the portfolio
Number of stocks in the market
// Name of 1st stock
40 31 // Minimum, Maximum and Initialization value of the 1st stock
0.966667 0.933333 0.900000 0.866667 0.833333 0.800000 0.766667 0.733333
0.700000 0.666667 0.633333 0.600000 0.566667 0.000000
// Trend of the 1st stock
0.666667 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
3.333333 3.666667 4.000000 4.333333 4.666667 5.000000
// Stability of the 1st stock
// Name of the 2nd stock
40 31 // etc.
0.800000 0.300000 0.300000 0.800000 0.800000 0.300000 0.200000 0.800000
0.700000 0.200000 0.200000 0.700000 0.700000 0.200000
0.222222 0.444444 0.444444 0.444444 0.666666 0.666666 1.000000 0.666666
0.666666 0.444444 0.444444 0.444444 0.222222 0.222222
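A small reader for this layout could look like the sketch below. The exact format (whitespace-separated tokens, '//' starting a comment, three header numbers per stock followed by trend and stability vectors) is inferred from the example above and may not match the original parser exactly:

```python
def parse_model(text):
    tokens = []
    for line in text.splitlines():
        line = line.split("//")[0]     # strip trailing comments
        tokens.extend(line.split())
    it = iter(tokens)
    model = {"portfolio": float(next(it)), "stocks": []}
    for _ in range(int(next(it))):     # number of stocks in the market
        name = next(it)
        lo, hi, init = int(next(it)), int(next(it)), int(next(it))
        size = hi - lo + 1             # one entry per possible stock value
        model["stocks"].append({
            "name": name, "min": lo, "max": hi, "init": init,
            "trend": [float(next(it)) for _ in range(size)],
            "stability": [float(next(it)) for _ in range(size)],
        })
    return model
```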
During the optimization process it is possible, at any point, to
load or save the current policy. Serializing the policy, and
later loading it from disk, gives the ability to save a partial
result of the optimization process for later use.
There are some general
parameters of the simulation and optimization process that
can be determined by the user. This can be done using the options
window of the application:
Figure 3: The
configuration options of the application
The application can generate a textual log of the optimization
and simulation process. This enables the user to carefully
examine the behavior of the system. Yet, since in most cases the
application is active for a long time, the generated log tends to
be very large. Therefore, this option is usually turned off.
In the log, every module and sub-module of the application
reports any major action it performs. The structure of the log
file is straightforward, as demonstrated below:
logged running time is Fri May 05 22:04:34 2000
is up, init value is 2.000000
a new stock (#1) STCK1 to market
stock STCK1 (16,30,21)
trend of stock STCK1: (1.000000 0.966667 0.933333 0.900000 0.866667
stability of stock STCK1: (0.333333 0.666667 1.000000 1.333333 1.666667
is up with 1 stocks
a new policy with init value of 0.000000
is up with Epsilon value of 0.100000 and Lambda value of 1.000000
value was changed to 0.100000
value was changed to 1.000000
value was changed to -1.000000
select method was changed to 0
single cycle of optimization is starting
state (S) is <,0,1.000000,2.000000>
to perform optimal action
to perform first action (A) <1,1.000000>
commission of 0.120000, current value is 1.880000
100% of total value 1.880000 in stock #1
new day in the market
STCK1 by 1, new value is 22
changed in 4%, current value is 1.969524
for the action was -0.030476
state (SS) is <,1,1.000000,2.000000>
to perform optimal action
to perform second action (AA) <1,1.000000>
(S,A) with -0.030476
single cycle of optimization is starting
state (S) is <,1,1.000000,2.000000>
In this section we use the model and optimization algorithms
presented above to demonstrate how Reinforcement Learning can be
used to optimize portfolio management.
Since it is very easy to configure the tested model using the
described configuration files, it is also very easy to construct
different testing experiments. During the testing phase of the
application several experiments were conducted; they are
described below.
In the first
experiment the investment opportunities included only a single
stock. The stock parameters and transition probabilities were
chosen to simulate a situation where the value follows an
increasing trend, but with higher values, a drop to very low
values becomes more and more probable. The exact configuration of
the stock was presented in the first section.
The initial investment was 2.0; at each step the agent had to
decide either to change his investment or to remain with his
present holdings. Each transaction cost a commission as described
in the previous sections.
The agent had the
possibility of interacting with the environment continuously.
Every time the portfolio value reached zero, the model was
initialized to its initial state (naturally, without initializing
the policy). The Q-values were initialized to zero, the ε value
was 0.1 and the λ parameter was fixed at 0.5 (causing the
algorithm to be quite greedy for profit in the present, giving
relatively small importance to the values of future states).
After ~10000 time steps (“days” in the market) the simulation was
stopped and initialized. In order to achieve a more exploitative
behavior, ε was decreased to 0.01. The following graphs present
the outcome of the agent's actions for the next 300 days:
Figure 4: The stock
value (top), the portfolio value and the decisions (bottom)
The success of the
policy can be easily demonstrated by the continuous increase of
the portfolio value (from 2 to 100) in this time period. Let's
look more thoroughly at some interesting decisions made by the agent:
- At the beginning, the stock value oscillated at high values. The policy chose to stay out of the market during this period, because the risk was too high. This kind of behavior recurred throughout the whole time period (for example, days 97-105, 107-120 etc.). In these cases the agent chose not to risk his current holdings, but rather to wait until the stock value decreased.
- At the beginning of the time period, when the portfolio value was relatively small, the agent tended to be much more conservative than later on. It chose to stay out of the market, when the risk was too high, for much longer periods than later. This course of action seems reasonable considering that at small values the fixed commission paid for selling or buying is relatively significant, and thus the agent should invest only when he has a very good chance of winning. Losing when the total portfolio value is small dramatically decreases the possibility of future investments; therefore the agent tried to avoid such situations.
- When a drop to low stock values became very probable, the agent anticipated the decrease and sold his holdings. This kind of behavior can be found at days 73, 140 etc.
- Looking at days 67, 73 or 137, a similar behavior pattern emerges - when there is a small decrease in the stock value which is sufficient for the agent to get back into the market, he does so. Knowing that the stock has an increasing trend, in most cases this decision is justified.
Note that the changes in the portfolio value become higher in
magnitude as the value increases. This is because the agent has
no risk aversion and has no choice but always to invest the total
capital.
The state space in
this experiment contained roughly 25x2x100=5000 states. Yet the
obtained policy contained only 2363 states that were reached at
least once. Out of these, around half (1148) were reached only
once. This suggests that the theoretical state space is much
larger than the one encountered in reality.
It should be
mentioned that the same experiment was conducted using softmax
action-selection scheme. However, in this case the convergence of
the algorithm was much slower. It seems that in most cases the
algorithm decided to stay out of the market and thus had no
ability to increase the portfolio value. Since in many cases the
first investment in the stock yielded a negative reward (because
of the commission which had to be paid), the option of staying
out of the market became the best action. Therefore the option of
doing nothing was chosen many times, resulting in very slow
improvement of the agent's decisions. It seems (as shown in the
next experiment) that the softmax scheme operates better when
there are more possible actions (in this model there were only
two).
A second experiment was conducted using a stock market of two
stocks. The added stock also had an increasing trend, slightly
smaller than that of the first stock. Yet this stock was much
more solid - the amount of change at each time step was very
small (a mean value of 1). The exact configuration of the tested
model can be found in the download section. The parameters in
this experiment were identical to those of the previous one.
However, the obtained policy was checked only after ~100000 time
steps. Since this model only added to the investment
opportunities of the first model, it could be expected that
(after training the agent) the increase rate of the portfolio
value in this model would be faster; after all, the agent could
improve the previous policy using the second stock as an
additional investment possibility. Indeed, the resulting increase
rate of the capital was slightly faster in this experiment: after
initializing the model, it took the agent around 250 days to
reach a portfolio value of 100 (in comparison to 300 days in the
previous experiment).
Figure 5: The
stocks values (top), the portfolio value and the decisions (bottom)
In this experiment
the possible state space contained 25x10x3x100=75000 states. Yet
the obtained policy contained only 33667 states that were reached
at least once. Out of these, around half (14096) were reached
only once. It is interesting to notice that the ratio between the
total number of possible states, the states actually visited and
the states visited only once - was roughly the same in the two
experiments. This fact might suggest that the pattern of state
visits is somehow inherent to the model, and does not depend
dramatically on the specific model.
The same model was
also tested using the softmax action-selection scheme. In
contrast to the first experiment, there was no noticeable
difference in the convergence time of the two methods. As
mentioned earlier, this behavior can possibly be accounted for by
the larger number of possible actions at each state.
In this work, the
task of asset allocation (portfolio management) was approached by
Reinforcement Learning algorithms. SARSA optimization was
successfully applied to this problem using an artificial stock
market.
Future work should address the possibility of risk aversion (by
allowing investment of only part of the portfolio value) and the
possibility of several alternative investments held simultaneously.
These options present a dramatic increase in the dimension of the
state space and therefore may force the use of Neural Networks as
value function approximators.
- “Optimal Asset Allocation using Adaptive Dynamic Programming”, Ralph Neuneier, Siemens AG, 1997.
- “Reinforcement Learning: An Introduction”, Richard S. Sutton and Andrew G. Barto, MIT Press, Cambridge, 1998.
- “Numerical Recipes in C: The Art of Scientific Computing”, Cambridge University Press.
- The Optimal Asset Allocation Source Code
- A DOS utility that can be used to display a policy (*.plc files) in a readable textual format