\documentstyle[12pt,psfig]{book}
\setlength{\evensidemargin}{.0 in}
\addtolength{\topmargin}{-1in}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{8in}

\newcommand{\remove}[1]{}

\newcommand{\Do}{{\small\bf do}\ }
\newcommand{\Return}{{\small\bf return\ }}
\newcommand{\Proc}[1]{#1\+}
\newcommand{\Returns}{{\small\bf returns}}
\newcommand{\Procbegin}{{\small\bf begin}}
\newcommand{\If}{{\small\bf if}\ \=\+}
\newcommand{\Then}{{\small\bf then}\ \=\+}
\newcommand{\Else}{\<{\small\bf else}\ \>}
\newcommand{\Elseif}{\<{\small\bf elseif}\ }
\newcommand{\Endif}{\<{\small\bf end\ if\ }\-\-}
\newcommand{\Endproc}[1]{{\small\bf end} #1\-}
\newcommand{\For}{{\small\bf for}\ \=\+}
\newcommand{\Endfor}{{\small\bf end\ for}\ \-}
\newcommand{\Loop}{{\bf loop}\ \= \+}
\newcommand{\Endloop}{{\bf end\ loop}\ \-}
\newenvironment{program}{
    \begin{minipage}{\textwidth}
    \begin{tabbing}
    \ \ \ \ \=\kill
  }{
    \end{tabbing}
    \end{minipage}
  }

\def\blankline{\hbox{}}


\newcommand{\lecture}[7]{
   \pagestyle{headings}
   \thispagestyle{plain}
   \newpage
   \setcounter{chapter}{#1}
   \setcounter{page}{#2}
%  \set\thechapter{#3}
   \noindent
   \begin{center}
   \framebox{
      \vbox{
    \hbox to 6.28in { {\bf Computational  Learning Theory
                       \hfill Fall Semester, 1999/00} }
       \vspace{4mm}
       \hbox to 6.28in { {\Large \hfill Lecture #1: #3  \hfill} }
       \vspace{2mm}
       \hbox to 6.28in { {\it Lecturer: #4 \hfill Scribe: #5, #6, #7} }
      }
   }
   \end{center}
   \markboth{Lecture #1: #3}{Lecture #1: #3}
   \vspace*{4mm}
}

\newcommand{\topic}[2]{\section{#1} \index{#2} \markright{#1}}
\newcommand{\subtopic}[2]{\subsection{#1} \index{#2}}
\newcommand{\subsubtopic}[2]{\subsubsection{#1} \index{#2}}

\renewcommand{\cite}[1]{[#1]}

\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
\newcommand{\be}{\begin{enumerate}}
\newcommand{\ee}{\end{enumerate}}
\newcommand{\blank}{\vspace{1ex}}   % generates a blank line in the output

\newtheorem{theorem}{Theorem}[chapter]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{corollary}[theorem]{Corollary}
\newcommand{\qed}{\hfill $\Box$}
\newenvironment{proof}{\par{\bf Proof:}}{\qed \par}

\newenvironment{dfn}{{\vspace*{1ex} \noindent \bf Definition }}{\vspace*{1ex}}
\newcommand{\bigdef}[2]{\index{#1}\begin{dfn} {\rm #2} \end{dfn}}
\newcommand{\smalldef}[1]{\index{#1} {\em #1}}

% **** YOUR NOTES GO HERE ****************************************************

\begin{document}
\lecture{3}{1}{November 4}{Yishay Mansour}{Orit Mazor} {Sigal Korczyn} {Nimrod Hoofien}
\begin{flushleft}
\topic{Introduction}{}
\subtopic {The Problem}{}
In the following course, we will refrain mainly to discrete time problems, that is, at any given point in time $T=1 \ldots N$ and only in those points the agent may perform one single action. If $N$ is finite, we shall refer to these problems as having a {\em finite horizon}. When $N$ is infinite, these problems will be referred to as having an {\em infinite horizon}. When discussing finite horizon problems, at time $T=N$ the agent is not allowed to perform an action and instead will be rewarded with an immediate end point profit according to the position it is in.
\subtopic {states and actions}{}
Lets define $S$ to be the set of states of the system. For each state $s\in{S}$ we define the set of actions $A_{s}$ as the actions that the agent may perform in a state $s$. For simplicity, we will assume $S$ and $A$ are discrete and not time dependent.At any given state, an action can be performed deterministically (using a function mapping states to actions) or stochastically (randomly). When choosing an action stochastically we will define a distribution q(a) for every ${a}\in{A_{s}}$.
\subtopic {Immediate reward and the probability of transition}{}
As a result of performing an action $a\in{A_{s}}$ in state $s\in{S}$ at time t:

1. The agent is rewarded an immediate reward $R_{t}(s,a)$. We define the expectation of $R_{t}$ as $r_{t}(s,a)=E[R_{t}(s,a)]$.

2.The system transfers to a new state $s'$, determined according to transition probability $P_{t}(s'|s,a)$.
We assume that $P_{t}$ is well defined, that is, that for every $s \in S$ and $a \in A_s$, $\sum_j P(j|s,a) = 1$.

We will not discuss how or when the immediate reward reach the agent. It may be accumulated in the time frame $[t, t+1]$, or alternatively, it can be given in
a single point in time between $t$ and $t+1$. In any case all that matters to the agent is that the immediate reward reaches it before t+1.

A Markovian process is define as a process in which the only information needed from history is the current state. We define a Markovian process as (T, S, A, $P_{t}(.|s,a)$, $R_{t}(s,a)$). The process defined here is a Markovian one since the following states and immediate rewards (and therefore the whole continuation of the process) depends only on the current state and operation chosen, and not on the history.

\subtopic {Policy and Decision Rules}{}

\begin{figure}
\psfig{file=model.ps ,width=6in ,clip=}
\caption{model diagram}
\end{figure}

A decision rule may need memory of the whole history to determine the most profitable course of action, or it may need only the current state. It may
also be a deterministic rule (resulting in one singe operation) or a stochastic one (resulting in a distribution on a set of operations).
We will define the following set of rules:

MD - a deterministic Markovian rule (having no memory):

$d_{t}:S\rightarrow A_{s}$	, $d_{t}(s)\in{A_{s}}$

HD - a deterministic history dependent rule:

$d_{t}:H_{t}\rightarrow A_{s}$

Where we define the history as:
$H_{1}=s$ , and $H_{t}=H_{t-1} \times A \times S$	, $H=\bigcup H_{t}$

MR - a stochastic Markovian rule:

$d_{t}:S\rightarrow P(A)$ ,where $P(A)$ is the set of distributions on $A$.

HR - a stochastic history dependent rule:

$d_{t}:H_{t}\rightarrow P(A)$

We define a stationary rule to be a rule that is independent on time, i.e. $\forall{t}, d_{t}=d$, for some $d$.

SD - a deterministic, stationary rule.

SR - a stochastic, stationary rule.

\subtopic {Policy} {}
A policy, at any given point in time, decides which action the agent selects.
Let $D^{k}_{t}$ be all the options at time $t$, then we define $P_{i}^{k}=D^{k}_{1} \times ... \times D^{k}_{n-1}$.
We also define $\Pi^{SD}$ and $\Pi^{DT}$ to be a stationary, random or deterministic rule, respectively.

\subtopic {Example 1}{}
The first example is a Markovian decision making problem with one time phase. This implies that $N=2$, $T=\{1,2\}$ and $r(s')$ is the value of state $s'$ in the end.

When the system is started at state $s_0\in{S}$, the goal of the agent is to choose an action $a\in{A}$ that maximizes the value of $v(a)=E[R_1(s_0,a_1) + r(s')]$

We will choose a deterministic policy $\pi$ that selects an action as follows:
$u(a)= E[R_1(s,a)] +E[R(x_2)]$, when $x_2$ is a random variable describing the state reached after action $a$. Writing it explicitly, $u(a) = r_1(s,a) + \sum_{j \in S} P(j|s,a)*r_2(j)$

The agent will choose an action $a^*$, that maximizes $V(a)$, i.e.
$ a^* \in \{a_1|u(a_1) = \max_{a \in A} \{U(a)\}\}$

In the above example, there is no stochastic rule that is better then the deterministic policy presented. If there was, then such a stochastic rule would have an action distribution of $q(a)$ and the return of such a policy would be $\sum q(a)U(a)$, but since we chose $a^*$ so that $U(a) \leq U(a^*)$ for any action $a$, then $\sum q(a)U(a) \leq \sum q(A)U(a^*) \leq U(a^*)$ (in simpler words, the best deterministic choice is always at least as good as any average).

\subtopic {Example 2}{}

\begin{figure}
\psfig{file=states.ps ,width=6in,  clip=}
\caption{State Diagram}
\end{figure}

Lets define a state machine with two states $s_1$, $s_2$.

Formally, in the example we have:

time: $T=\{1...N\}$

states: $S = \{s_1,s_2\}$

actions: $A_{s_1}=\{a_{11}, a_{12}\},  A_{s_2}=\{a_{21}\}$

reward: $r_t(s_1,a_{11}) = 5, r_t(s_1,a_{12}) = 10, r_t(s_2,a_{21}) = -1$

final reward: $r_N(s_1) = 0, r_N(s_2) = 0$

transition function: $P_t(s_1|s_1,a_{11}) = 0.5, P_t(s_2|s_1,a_{11}) = 0.5 $

$P_t(s_2|s_1,a_{12}) = 1, P_t(s_2|s_2,s_{21}) = 1$

Now we want to maximize the return. First we compare two deterministic policies:

$\pi_1$- always chooses $a_{11}$ when in state $s_{1}$.

$\pi_2$ - always chooses $a_{12}$ when in state $s_{2}$.

Recall that $V_N(\Pi)$ is the expected return of policy $\Pi$ in $N$ time steps.
\begin{eqnarray*}
V_3(\pi_1) & = & \frac{1}{2}(5-1+0) +\frac{1}{2}(5+5+0)=7 \\
V_3(\pi_2)& = & 10-1 =9 \\
V_5(\pi_1) & = & \frac {1}{2}(5-3)+ \frac {1}{4}(10-2)+\frac{1}{8}(20) = 7.25 \\
V_5(\pi_2) & = & 10 - 3 = 7
\end{eqnarray*}
Notice how in shorter time frames $\pi_2$ is the preferable policy, while in longer ones, $\pi_1$ brings on the higher returns.

Lets try determining the best policy. We do this by looking first at the reward gained in the last step, for N=2:
\begin{eqnarray*}
Q_2(s_1,a_{11}) & = & 5 \\
Q_2(s_1,a_{12}) & = &10
\end{eqnarray*}
In this case:
\begin{eqnarray*}
V_2(s_1) & = & 10 \\
V_2(s_2) & = & -1
\end{eqnarray*}
When N=2, The optimal action from $s_1$ is $a_{12}$.

For N=3, we use the optimal policy for N=2, after we do one action.
\begin{eqnarray*}
Q_3(s_1,a_{11}) & = & 5+\frac{1}{2}V_2(s_1)+\frac{1}{2}V_2(s_2) = 5+\frac{1}{2}*10+\frac{1}{2}*(-1) = 9.5 \\
Q_3(s_1,a_{12}) & = & 10+ V_2(s_2) = 10 - 1 = 9
\end{eqnarray*}
Hence, when N=3 the optimal action from $s_1$ is $a_{11}$.

\topic{Finite Horizon}{}

\subtopic {Expected Total Reward}{1}

\begin{math} V^{\pi}_N(s)=E^{\pi}_s[\sum_{t=1}^{N-1} R_t (X_t,Y_t)+R_n(X_N)]\end{math} where,

$X_t$- State in time $t$

$Y_t$- Action in time $t$

Under the assumption that $M>|r_{t}(s,a)|$ for each $t\leq N$ and that $A$ and $S$ are discrete then $V^{\pi}_{n}(s)$ is well defined for all $\pi\in\Pi^{HR}$. ($\Pi^{HR}$ is the group of stochastic, history-depended decision rules). Namely, for each decision rule there is a well defined value.

\subtopic {Return Estimation of a Given Policy}{1}

Let $\pi\in\Pi^{HR}$ be a given policy. We would like to calculate it's value $V^{\pi}_{N}(s)$. An inefficient way of calculating $V^{\pi}_{N}(s)$ is creating all possible histories, finding the return value for each one and calculating the expectation of it. Then we can compute the value by taking weighted sum. An alternative calculation method is calculating return of the last step first, and recurse for the previous steps, until reaching the first step.
Let us define a set of variables $U^{\pi}_{t}(h_{t})$ to represent the expected return starting in time $t$ until time $N$ where $h_{t}$ notes the history (previous steps and actions) until time t. We can write $U^{\pi}_t$ as:

\(U^{\pi}_t = E^{\pi}_{h_t}[\sum_{n=t}^{N-1} r_n(X_n,Y_n)+r_N(X_N)]\) where $h_t = (h_{t-1}, a_{t-1}, s_t)$

We can define $U^{\pi}_t$ as a function of $U^{\pi}_{t+1}$ as follows:

\(U^{\pi}_t(h_t) = E^{\pi}h_t[r_t(X_t,Y_t)+U_{t+1}(h_t,Y_t,X_{t+1})]\)

Note that $X_t$ is known from the history and $Y_t$ is known if $\pi$ is deterministic. $X_{t+1}$ is unknown for any policy since the environment is stochastic.

{\em The idea}: Calculate the values of $U^{\pi}_t$ from $t=N$ to $t=1$ using a dynamic programming algorithm.

\pagebreak
\subsubtopic {Dynamic Programming Algorithm for $\pi\in\Pi^{HD}$}{1}


1. \begin{math} t \leftarrow N\end{math}

\begin{math}\forall h_N=(h_{N-1},a_{N-1},s_N) : U^{\pi}_N(h_N)\leftarrow r_N(s_N)\end{math}

2. Repeat $t\leftarrow t-1$ until t=1

\begin{math}\forall h_t=(h_{t-1},a_{t-1},s_t) \in H_t \end{math}

\begin{math}U^{\pi}_t(h_t)\leftarrow r_t(s_t,d_t(h_t)) + \sum_{j\in S} P_t(j|s_t,d_t(h_t))*U^{\pi}_{t+1}(h_t,d_t(h_t),j)\end{math}

The next lemma states the correction of the algorithm.
\begin{lemma}
\label{L3:lemma1}
for $\pi\in\Pi^{HD}, U^{\pi}_1(s)=V^{\pi}_N(s)$
\end{lemma}
\begin{proof}
We will use a reversed induction to show that $U^{\pi}_t(h_t)$ equals the expected return of $\pi$ given history $h_t$.

{\em Basis}: for t=N the lemma is obviously true (by the definition of V).

{\em Induction Step}:  $U^{\pi}_t(h_t)$ can be written as:

(1) \begin{math} U^{\pi}_t(h_t)= E^{\pi}_{h_t}[r_t(X_t,Y_t)+U_{t+1}(h_t,Y_t,X_{t+1)}]\end{math}

$X_t=s_t$ is known from $h_t$

$Y_t=d_t(h_t)$ is known from $\pi$ since the policy is deterministic.

(2) \begin{math}U^{\pi}_t(h_t) = r_t(s_t,d_t(h_t)) + E^{\pi}_{h_t}[U^{\pi}_{t+1}(h_t,d_t(h_t),x_{t+1)}]\end{math} and at stage 2 of the algorithm we calculate exactly that expectation. Therefore, by using the induction hypothesis, it holds also for $t$. This implies that at the end, we have computed the expected return of $\pi$ from the start state, which is by definition $V^{\pi}_N(s)$.
\end{proof}
If the policy was a stochastic one ($\pi\in\Pi^{HR}$), we would change stage 2 of the algorithm and get:

(2)\begin{math} U^{\pi}_t(h_t)\leftarrow \sum_{a\in A} q_{h_t}(a)\{r_t(s_t,d_t(h_t)) + \sum_{j\in S} P_t(j|s_t,d_t(h_t))*U^{\pi}_{t+1}(h_t,d_t(h_t),j)\}\end{math}

\subsubtopic {Computational Complexity}{2}

Let $K$ be the number of states and $L$ the number of actions. Then $|H_t|=K^{t+1} L^t$ and the computation time would be:
\begin{math} \sum_{t=0}^{N-1} K|H_t| \leq K^2 \sum_{t=0}^{N-1} (KL)^t \end{math}

The above value is for a general history dependent policy. If the policy is Markovian, then $|H_t|=K$ and we get a computation time of $K^2(N-1)$

\subtopic {The Optimality Principal}{3}

Let us define: \(U^*_t(h_t) = \max_{\pi\in{\Pi^{HR}}} \{U^{\pi}_t(h_t)\}\). For the simplicity of the discussion we assume that both $S$ and $A$ are finite.
The optimality equations (also known as Bellman equations) are:

\begin{math}U_t(h_t) = \max_{a\in A} \{r_t(s_t,a) + \sum_{j\in S} P_t(j|s_t,a)U_{t+1}(h_t,a,j)\},\end{math}

Where $U_N(h_N)=r_N(s_N)$ for $h_N=(h_{N-1},a_{N-1},s_N)$.

Note that replacing the max operation in the equation with the action taken by policy $\pi$, we get the equation for computing the return of policy $\pi$.

\begin{theorem}Let $U_t$ be the solution for the optimality equations for $t = 1$ up to $N-1$, then:

(1) \begin{math} \forall t \leq N-1, \forall h_t\in H_t, \; u_t(h_t)=u^*_t(h_t) \end{math}

(2) \begin{math} \forall s_1 \in S, \; u_1(s_1)=V^*_N(s_1) \end{math}
\end{theorem}

\begin{proof}First, we will show by induction that $u_t(h_t) \geq u^*_t(h_t)$ for $1\leq t\leq N$ and for every $h_t\in H_t$. Then we exhibit a specific policy $\pi$ for which $V^{\pi}_t(h_t)=u_t(h_t)$ and that this will conclude the proof.

{\em Basis}: When $t=N$, by definition $u_N(h_N)=r_N(s_N)=u^{\pi}_N(h_N)$ for every policy $\pi$ and therefore $u_N(h_N)=u^*_N(h_N)$.

{\em Induction Step}: Let us assume that $u_t(h_t)\geq u^*_t(h_t)$ for $n+1 \leq t \leq N$ and $h_t\in H_t$. Let $\pi'$ be a policy in $\Pi^{HR}$ (that performs the operation $d_n'$ on step $n$). for $t=n$:

\begin{math} u_n(h_n) = \max_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j|s_n,a)u_{n+1}(h_n,a,j)\}\end{math}

Consider an arbitrary policy $\pi'$, that for history $h_n$ uses stochastic action $q(a)$. By the induction assumption,
\begin{eqnarray*}
u_n(h_n) & \geq  & \max_{ a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j|s_n,a)u^*_{n+1}(h_n,a,j)\} \\
& \geq & \max_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j|s_n,a) u^{\pi'}_{n+1}(h_n,a,j)\} \\
& \geq & \sum_{a \in A} q(a) \{ r(s_n,a) + \sum_{j \in S} P_n(j|s_n,a) u^{\pi'}_{n+1}(h_n,a,j)\} \\
 & = & u^{\pi'}_n(h_n)
\end{eqnarray*}
Therefore, $u_n(h_n) \geq U^{\pi'}_n(h_n)$ for every $\pi'$ and in particular to the optimal policy, $u_n(h_n)\geq u^*_n(h_n)= \max_{\pi'} u^{\pi'}_n(h_n)$.
It is left to show that there exists a policy $\pi'$ for which $V^{\pi'}_N(h_n)=u_n(h_n)$. Let us define $\pi'$ in the following manner:

\begin{math}\pi'(h_n) = argmax_{a \in A} \{ r(s_n,a) + \sum_{j \in S} P_n(j|s_n,a)u_{n+1}(h_n,a,j) \}\end{math}

We will now show that $u_n(h_n)=u^{\pi'}_n(h_n)$ using reversed induction on $n$.

{\em Basis}: for $n=N$ this is obviously true.

{\em Induction Step}:
\begin{eqnarray*}
u^{\pi'}_n(h_n) & = & r(s_n,\pi'(h_n)) + \sum_{j \in S} P_n(j|s_n, \pi'(h_n))u^{\pi'}_{n+1}(h_n, \pi'(h_n),j) \\
& = & r(s_n,\pi'(h_n)) + \sum_{j \in S} P_n(j|s_n, \pi'(h_n))u_{n+1}(h_n) \\
& = & u_n(h_n)
\end{eqnarray*}
The second equality is by the induction hypothesis, that $u^{\pi'}_t=u_t(h_t)$ for $t \geq n+1$. Therefore $u^{\pi'}_n(h_n)=u_n(h_n)$ and since $u^*_n(h_n) \geq u^{\pi'}_n(h_n)$ it is clear that $u_n(h_n)=u^*_n(h_n)$ as was required.
\end{proof}
\end{flushleft}
\end{document}