In this article, we retell the paper in a simpler form that is easier to follow. For now, we focus only on the first part of the paper, which tries to answer the following question:

Is there a way to guarantee policy improvement?

To show this, we need three ingredients:

1. Policy performance measurement
2. Policy improvement algorithm
3. Improved policy performance estimation

The overall idea is that if we can give a lower bound on how much the improved policy gains over the current one, and show that this bound is $> 0$, then policy improvement is guaranteed.

So the path forward is to look at ways to estimate the performance of the improved policy.

### Basics

$V_\pi(s)$ is a state-value function.

$Q_\pi(s, a)$ is an action-value function.

$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$ is an advantage function.
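To make these three definitions concrete, here is a minimal sketch in pure Python that evaluates a fixed policy on a tiny tabular MDP and derives the advantage from $Q_\pi$ and $V_\pi$. The MDP (`P`, `R`, `GAMMA`) and the helper names are made up for illustration; they do not come from the paper.

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a][s2] = transition probability, R[s][a] = expected reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]
N_S, N_A = 2, 2


def evaluate(pi, iters=2000):
    """Iterative policy evaluation: returns (V, Q) for the policy pi[s][a]."""
    V = [0.0] * N_S
    Q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return V, Q


def advantage(pi):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s)."""
    V, Q = evaluate(pi)
    return [[Q[s][a] - V[s] for a in range(N_A)] for s in range(N_S)]


pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy
A = advantage(pi)
# Under pi itself, the advantage averages to (numerically) zero in every state.
for s in range(N_S):
    print(s, sum(pi[s][a] * A[s][a] for a in range(N_A)))
```

The fact that $\sum_a \pi(a \vert s) A_\pi(s, a) = 0$ in every state is used again later when the conservative update is analyzed.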

## Policy performance

We first define the policy performance as the average value over start states:

$$
\eta(\pi) = \mathrm{E}_{s \sim D} \left[ V_\pi(s) \right]
$$

where the start state distribution is $D$. In the paper, $D$ can be replaced by any other distribution under the notion of a restart distribution, but that does not matter much here. We simply assume some fixed start state distribution.

## Conservative greedy policy improvement

The usual policy improvement step changes the current policy to $\mathrm{argmax}_a A_\pi(s, a)$ for all $s$. Here we consider a more general update that moves the policy only part of the way, using $\alpha \in [0, 1]$ as a step-size parameter:

$$
\pi_{new}(a \vert s) = (1-\alpha) \pi(a \vert s) + \alpha \pi'(a \vert s)
$$

where $\pi'$ is a greedy improvement of $\pi$.
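As a small illustration of the conservative update $(1-\alpha)\pi + \alpha\pi'$, the sketch below builds a greedy policy from an advantage table and mixes it with the current policy. The advantage values and the function names `greedy_policy` and `mix_policy` are hypothetical, chosen only for this example.

```python
def greedy_policy(A):
    """Deterministic policy pi'(s) = argmax_a A[s][a], stored as one-hot rows."""
    policy = []
    for row in A:
        best = max(range(len(row)), key=lambda a: row[a])
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy


def mix_policy(pi, pi_prime, alpha):
    """Conservative update: (1 - alpha) * pi + alpha * pi'."""
    return [[(1 - alpha) * p + alpha * q for p, q in zip(row, row_prime)]
            for row, row_prime in zip(pi, pi_prime)]


pi = [[0.5, 0.5], [0.5, 0.5]]
A = [[-0.3, 0.3], [0.4, -0.4]]   # hypothetical advantage table for pi
pi_prime = greedy_policy(A)      # [[0.0, 1.0], [1.0, 0.0]]
pi_new = mix_policy(pi, pi_prime, alpha=0.25)
print(pi_new)                    # [[0.375, 0.625], [0.625, 0.375]]
```

With $\alpha = 0$ the policy is unchanged and with $\alpha = 1$ we recover the usual greedy improvement step.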

So our goal is to guarantee improvement under the conservative greedy update, that is:

$$
\begin{equation}
\eta(\pi_{new}) - \eta(\pi) > 0
\label{eq:eta}
\end{equation}
$$

That is, at any moment we need to find an $\alpha$ that satisfies the above inequality. In other words: how small must $\alpha$ be for the update to still improve the policy?

## Improved policy performance estimation

As you can see from $\eqref{eq:eta}$, we need the improved policy performance $\eta(\pi_{new})$, but we want to get it cheaply because we may need to try several values of $\alpha$. It is not viable to simply rerun policy evaluation (on a new set of experience from $\pi_{new}$); that is just too slow. Instead, we estimate its lower bound.

In the paper, the author shows two ways to do this estimation:

1. Using a first-order Taylor expansion. Unfortunately, this approach does not give us a lower bound, but it is a useful starting point anyway.
2. Using the author’s proposed approach. This gives a lower bound.

### Using Taylor’s series to approximate

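If we expand $\eta$ around $\pi$ to first order, for a small perturbation $x$ of the policy, we get:

$$
\begin{equation}
\eta(\pi + x) = \eta(\pi) + x \cdot \nabla_\pi \eta(\pi) + \mathrm{O}(x^2)
\label{eq:eta_x}
\end{equation}
$$

Here we have an approximation error on the order of $\mathrm{O}(x^2)$, although we do not know its constant factor.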

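Our policy improvement is not expressed in terms of the aforementioned $x$; we would rather express it in terms of $\alpha$ (recall the conservative policy improvement).

So we want to get the estimate of something like:

$$
\eta(\pi_{new}) = \eta(\pi) + \alpha \nabla_\alpha \eta(\pi) + \mathrm{O}(\alpha^2)
$$

where $\nabla_\alpha \eta(\pi)$ denotes the derivative of $\eta(\pi_{new})$ with respect to $\alpha$, evaluated at $\alpha = 0$.

In $\eqref{eq:eta_x}$, the only problematic part is the second term (the first derivative): we want $\nabla_\alpha$, not $\nabla_\pi$.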

We now begin to derive the $\nabla_\alpha \eta(\pi)$.

The gradient of policy performance was first derived in the policy gradient theorem of Sutton et al. (1999). We state it here without proof:

$$
\begin{equation}
\nabla_\pi \eta(\pi) = \sum_s d_\pi(s) \sum_a \nabla_\pi \pi(a \vert s) Q_\pi(s, a)
\label{eq:policy_gradient}
\end{equation}
$$

where $d_\pi(s)$ is the discounted state visitation frequency. For completeness:

$$
d_\pi(s) = \sum_{t=0}^\infty \gamma^t \mathrm{P}(s_t=s, \pi)
$$

where $\mathrm{P}(s_t=s, \pi)$ is the probability of visiting state $s$ after taking $t$ steps under a policy $\pi$. Please note that $d_\pi$ is not a probability distribution (it does not sum to $1$), but we can make it one by multiplying it by $1-\gamma$ (since $\sum_{i=0}^\infty \gamma^i = \frac{1}{1-\gamma}$).
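The normalization claim can be checked numerically. The sketch below accumulates $\gamma^t \mathrm{P}(s_t = s, \pi)$ over a long but finite horizon on a made-up two-state MDP; the transition table `P`, the start distribution `D`, and the truncation horizon are all assumptions for illustration.

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP: P[s][a][s2] = transition probability.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
D = [1.0, 0.0]  # hypothetical start-state distribution


def visitation(pi, horizon=2000):
    """d_pi(s) = sum_t gamma^t * P(s_t = s, pi), truncated at `horizon` steps."""
    n = len(P)
    mu = list(D)          # state distribution at time t
    d = [0.0] * n
    for t in range(horizon):
        for s in range(n):
            d[s] += GAMMA ** t * mu[s]
        # Push the state distribution one step forward under pi.
        mu = [sum(mu[s] * pi[s][a] * P[s][a][s2]
                  for s in range(n) for a in range(len(pi[s])))
              for s2 in range(n)]
    return d


pi = [[0.5, 0.5], [0.5, 0.5]]
d = visitation(pi)
print(sum(d))                          # close to 1 / (1 - GAMMA) = 10
print([(1 - GAMMA) * x for x in d])    # normalized version, sums to about 1
```

The truncation is harmless here because $\gamma^{2000}$ is far below floating-point precision.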

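In $\eqref{eq:policy_gradient}$, we replace $\nabla_\pi$ with $\nabla_\alpha$ and write the policy as a function of $\alpha$:

$$
\begin{equation}
\nabla_\alpha \eta(\pi) = \sum_s d_\pi(s) \sum_a \nabla_\alpha \left[ (1-\alpha) \pi(a \vert s) + \alpha \pi'(a \vert s) \right] Q_\pi(s, a)
\label{eq:eta_alpha1}
\end{equation}
$$

Consider $\nabla_\alpha \left[ (1-\alpha) \pi + \alpha \pi' \right]$:

$$
\begin{equation}
\nabla_\alpha \left[ (1-\alpha) \pi + \alpha \pi' \right] = \pi' - \pi
\label{eq:pi_grad}
\end{equation}
$$

We substitute $\eqref{eq:pi_grad}$ into $\eqref{eq:eta_alpha1}$ and use $\sum_a \pi(a \vert s) Q_\pi(s, a) = V_\pi(s)$ and $\sum_a \pi'(a \vert s) = 1$:

$$
\begin{equation}
\begin{aligned}
\nabla_\alpha \eta(\pi) &= \sum_s d_\pi(s) \sum_a \left[ \pi'(a \vert s) - \pi(a \vert s) \right] Q_\pi(s, a) \\
&= \sum_s d_\pi(s) \left[ \sum_a \pi'(a \vert s) Q_\pi(s, a) - V_\pi(s) \right] \\
&= \sum_s d_\pi(s) \sum_a \pi'(a \vert s) A_\pi(s, a)
\end{aligned}
\label{eq:eta_grad}
\end{equation}
$$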

This gradient can be computed without the need to further interact with the environment. We just need to change $\pi$ to $\pi'$ and then rerun on the previous experience.

The quantity in $\eqref{eq:eta_grad}$ is closely related to the policy advantage, which is defined as:

$$
\begin{equation}
\mathbb{A}_\pi(\tilde{\pi}) = \mathrm{E}_{s \sim d_\pi} \mathrm{E}_{a \sim \tilde{\pi}} \left[ A_\pi(s, a) \right]
\label{eq:policy_adv}
\end{equation}
$$

Since it is written as an expectation, the state distribution here must be normalized, that is, it uses $(1-\gamma) d_\pi$. Hence, with $\tilde{\pi} = \pi'$: $\mathbb{A}_\pi(\tilde{\pi}) = (1-\gamma) \nabla_\alpha \eta(\pi)$

Intuitively, the policy advantage tells us how much $\tilde{\pi}$ tries to take large advantages (be greedy). If $\tilde{\pi} = \pi$, this quantity is $0$. It is maximized when $\tilde{\pi}$ is the greedy policy with respect to $\pi$.

Don’t be confused! $\mathbb{A}_\pi$ is a policy advantage, which looks at all states, whereas $A_\pi$ is an advantage function, which looks at a particular state and action.
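The relation $\mathbb{A}_\pi(\pi') = (1-\gamma) \nabla_\alpha \eta(\pi)$ can be sanity-checked with a finite difference. The sketch below does this on a made-up two-state MDP; `P`, `R`, `D`, the candidate policy `pi_prime`, and all helper names are assumptions for illustration, not part of the paper.

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]
D = [0.5, 0.5]                 # hypothetical start-state distribution
S, ACT = range(2), range(2)


def evaluate(pi, iters=2000):
    """Iterative policy evaluation: returns (V, Q) for pi[s][a]."""
    V = [0.0, 0.0]
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in S)
              for a in ACT] for s in S]
        V = [sum(pi[s][a] * Q[s][a] for a in ACT) for s in S]
    return V, Q


def eta(pi):
    """Policy performance: expected value over start states."""
    V, _ = evaluate(pi)
    return sum(D[s] * V[s] for s in S)


def visitation(pi, horizon=2000):
    """Unnormalized discounted state visitation d_pi."""
    mu, d = list(D), [0.0, 0.0]
    for t in range(horizon):
        d = [d[s] + GAMMA ** t * mu[s] for s in S]
        mu = [sum(mu[s] * pi[s][a] * P[s][a][s2] for s in S for a in ACT)
              for s2 in S]
    return d


def policy_advantage(pi, pi_tilde):
    """Expectation of A_pi over s ~ (1 - gamma) * d_pi and a ~ pi_tilde."""
    V, Q = evaluate(pi)
    d = visitation(pi)
    return sum((1 - GAMMA) * d[s] * pi_tilde[s][a] * (Q[s][a] - V[s])
               for s in S for a in ACT)


def mix(pi, pi_prime, alpha):
    return [[(1 - alpha) * pi[s][a] + alpha * pi_prime[s][a] for a in ACT]
            for s in S]


pi = [[0.5, 0.5], [0.5, 0.5]]
pi_prime = [[0.0, 1.0], [0.0, 1.0]]   # hypothetical candidate policy
h = 1e-6
finite_diff = (eta(mix(pi, pi_prime, h)) - eta(pi)) / h
predicted = policy_advantage(pi, pi_prime) / (1 - GAMMA)
print(finite_diff, predicted)          # the two numbers should nearly agree
```

Note that `policy_advantage` only needs quantities computed under $\pi$, which is what makes this gradient cheap to estimate.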

#### Taylor’s expansion of policy performance

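We now get:

$$
\eta(\pi_{new}) = \eta(\pi) + \frac{\alpha}{1-\gamma} \mathbb{A}_\pi(\pi') + \mathrm{O}(\alpha^2)
$$

Now, we can draw some conclusions from the above equation: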

1. With policy improvement the second term (first derivative) is positive (if the policy is not optimal).
2. If $\alpha$ is small enough, the second term will dominate the third term (second derivative) resulting in policy improvement.

The only problem is that we do not know how small $\alpha$ has to be to guarantee policy improvement. We now turn to a different approach.

### Using the author’s approach

In order to guarantee policy improvement, we need to show that $\eta(\pi_{new}) - \eta(\pi) > 0$.

We first rewrite it in a different form.

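#### Lemma 6.1

For any two policies $\pi$ and $\tilde{\pi}$:

$$
\eta(\tilde{\pi}) - \eta(\pi) = \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t, \tilde{\pi})} \mathrm{E}_{a \sim \tilde{\pi}} \left[ A_\pi(s, a) \right]
$$

Proof: Let $r_t$ be the reward at time $t$ along a trajectory $\tau$ generated by $\tilde{\pi}$ from a start state $s_0 \sim D$. Since $A_\pi(s_t, a_t) = \mathrm{E} \left[ r_t + \gamma V_\pi(s_{t+1}) \right] - V_\pi(s_t)$, the value terms telescope:

$$
\begin{aligned}
\mathrm{E}_{\tau \sim \tilde{\pi}} \left[ \sum_{t=0}^\infty \gamma^t A_\pi(s_t, a_t) \right] &= \mathrm{E}_{\tau \sim \tilde{\pi}} \left[ \sum_{t=0}^\infty \gamma^t \left( r_t + \gamma V_\pi(s_{t+1}) - V_\pi(s_t) \right) \right] \\
&= \mathrm{E}_{\tau \sim \tilde{\pi}} \left[ \sum_{t=0}^\infty \gamma^t r_t \right] - \mathrm{E}_{s_0 \sim D} \left[ V_\pi(s_0) \right] \\
&= \eta(\tilde{\pi}) - \eta(\pi)
\end{aligned}
$$

With Lemma 6.1 we now have:

$$
\begin{equation}
\eta(\pi_{new}) - \eta(\pi) = \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t, \pi_{new})} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right]
\label{eq:eta_delta}
\end{equation}
$$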

where $P(s_t, \pi_{new})$ is the probability of visiting state $s_t$ at time $t$ under policy $\pi_{new}$.
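This identity is exact, so it can be verified numerically. The following sketch compares the true performance difference with the discounted sum of advantages of $\pi$ taken under the state distribution of $\pi_{new}$. The MDP, the start distribution `D`, the candidate policy, and the step size $\alpha = 0.3$ are all made up for illustration.

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]
D = [0.5, 0.5]                 # hypothetical start-state distribution
S, ACT = range(2), range(2)


def evaluate(pi, iters=2000):
    """Iterative policy evaluation: returns (V, Q) for pi[s][a]."""
    V = [0.0, 0.0]
    for _ in range(iters):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in S)
              for a in ACT] for s in S]
        V = [sum(pi[s][a] * Q[s][a] for a in ACT) for s in S]
    return V, Q


def eta(pi):
    V, _ = evaluate(pi)
    return sum(D[s] * V[s] for s in S)


def visitation(pi, horizon=2000):
    """Unnormalized discounted state visitation: sum_t gamma^t P(s_t = s, pi)."""
    mu, d = list(D), [0.0, 0.0]
    for t in range(horizon):
        d = [d[s] + GAMMA ** t * mu[s] for s in S]
        mu = [sum(mu[s] * pi[s][a] * P[s][a][s2] for s in S for a in ACT)
              for s2 in S]
    return d


def mix(pi, pi_prime, alpha):
    return [[(1 - alpha) * pi[s][a] + alpha * pi_prime[s][a] for a in ACT]
            for s in S]


pi = [[0.5, 0.5], [0.5, 0.5]]
pi_prime = [[0.0, 1.0], [0.0, 1.0]]   # hypothetical candidate policy
pi_new = mix(pi, pi_prime, alpha=0.3)

V, Q = evaluate(pi)                   # advantages are those of the OLD policy
d_new = visitation(pi_new)            # states are visited under the NEW policy
lhs = eta(pi_new) - eta(pi)
rhs = sum(d_new[s] * pi_new[s][a] * (Q[s][a] - V[s]) for s in S for a in ACT)
print(lhs, rhs)                       # the two numbers should nearly agree
```

The catch is visible in the code: `d_new` requires rolling out $\pi_{new}$, which is exactly what we want to avoid.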

Evidently, we have $P(s_t, \pi)$ but we do not have $P(s_t, \pi_{new})$. A way forward is to estimate $\eqref{eq:eta_delta}$ with what we have. The error of such an estimate comes from the mismatch between $P(s_t, \pi_{new})$ and $P(s_t, \pi)$; intuitively, a small $\alpha$ should result in a small mismatch and vice versa. This suggests splitting $P(s_t, \pi_{new})$ into a part that coincides with $P(s_t, \pi)$, which we can work with, and a remainder, which we cannot. We get an informed estimate from the first part and put a bound on the second.

#### The two parts

Consider the policy $\pi_{new}$. We know from its definition that it is a compound policy. Another way to look at it is that we have two policies, $\pi$ and $\pi'$: with probability $\alpha$ we select an action according to $\pi'$, and with probability $1-\alpha$ we select an action according to $\pi$.

At time $t$, we define our two parts as:

1. Part one: we follow $\pi$ from $t=0$ until now.
2. Part two: at some point we selected an action from $\pi'$.

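If we have been following $\pi$ all along.

The probability is $P(\text{follow } \pi) = (1-\alpha)^t = 1 - \rho_t$.

The expected advantage function for this part is:

$$
\mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right]
$$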

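If we have followed $\pi'$ at any point prior to $t$.

The probability is $P(\text{not follow } \pi) = \rho_t = 1-(1-\alpha)^t$.

The expected advantage function for this part is:

$$
\mathrm{E}_{s \sim P(s_t \vert \text{not follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right]
$$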

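We cannot sample states from this part, so we bound it instead. We define:

$$
\epsilon = \max_s \left\vert \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \right\vert
$$

The bound holds trivially: no average of $\mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$ over states can exceed its maximum magnitude. As you shall see later on, the smaller the $\epsilon$, the tighter our estimate will be.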

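The total expected advantage function at time $t$ is then the sum of both parts, weighted by their probabilities:

$$
\begin{equation}
\begin{aligned}
\mathrm{E}_{s \sim P(s_t, \pi_{new})} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] = {} & (1-\rho_t) \mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \\
& + \rho_t \mathrm{E}_{s \sim P(s_t \vert \text{not follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right]
\end{aligned}
\label{eq:two_paths}
\end{equation}
$$

Furthermore, we can show that $\mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] = \alpha \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$, because the advantage of $\pi$ averaged under $\pi$ itself is zero:

$$
\begin{equation}
\begin{aligned}
\mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] &= \sum_a \left[ (1-\alpha) \pi(a \vert s) + \alpha \pi'(a \vert s) \right] A_\pi(s, a) \\
&= (1-\alpha) \cdot 0 + \alpha \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]
\end{aligned}
\label{eq:pi'_a}
\end{equation}
$$

Substitute $\eqref{eq:pi'_a}$ into $\eqref{eq:two_paths}$:

$$
\begin{equation}
\begin{aligned}
\mathrm{E}_{s \sim P(s_t, \pi_{new})} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] = {} & \alpha (1-\rho_t) \mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\
& + \alpha \rho_t \mathrm{E}_{s \sim P(s_t \vert \text{not follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]
\end{aligned}
\label{eq:two_paths2}
\end{equation}
$$

This is just for a single time step $t$. We still need to incorporate it into whole trajectories, which extend from $t=0$ to $t=\infty$.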

#### Apply to all time steps

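Substitute $\eqref{eq:two_paths2}$ into $\eqref{eq:eta_delta}$:

$$
\begin{equation}
\begin{aligned}
\eta(\pi_{new}) - \eta(\pi) = {} & \alpha \sum_{t=0}^\infty \gamma^t (1-\rho_t) \mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\
& + \alpha \sum_{t=0}^\infty \gamma^t \rho_t \mathrm{E}_{s \sim P(s_t \vert \text{not follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]
\end{aligned}
\label{eq:two_parts3}
\end{equation}
$$

Looking more carefully at the first term, $\rho_t$ depends on $\alpha$, which is what we want to solve for (remember, we want a policy-improving $\alpha$). In this form, solving for $\alpha$ is very hard because there is no closed form. We want the $\sum$ term to be a constant independent of $\alpha$, so that solving for $\alpha$ becomes trivial.

To realize this, we further substitute $\epsilon$ into $\eqref{eq:two_parts3}$. Each of the two expectations multiplied by $\rho_t$ is at least $-\epsilon$, so:

$$
\begin{equation}
\eta(\pi_{new}) - \eta(\pi) \ge \alpha \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] - 2 \alpha \epsilon \sum_{t=0}^\infty \gamma^t \rho_t
\label{eq:two_parts4}
\end{equation}
$$

Consider $\sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$. This is in fact the unnormalized policy advantage (see $\eqref{eq:policy_adv}$), because states reached by following only $\pi$ are distributed exactly as under $\pi$ itself. It equals $\frac{1}{1-\gamma} \mathbb{A}_\pi(\pi')$.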

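Now consider $\sum_{t=0}^\infty \gamma^t \rho_t$. Substituting $\rho_t = 1-(1-\alpha)^t$ and summing the two geometric series, we get:

$$
\sum_{t=0}^\infty \gamma^t \rho_t = \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} = \frac{\alpha \gamma}{(1-\gamma)(1-\gamma(1-\alpha))}
$$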
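The geometric-series step is easy to get wrong by a factor, so here is a quick numerical check: a truncated version of $\sum_t \gamma^t \rho_t$ against the closed form given in the code comment. The function names and the truncation horizon are arbitrary choices for this sketch.

```python
def rho_series(alpha, gamma, horizon=5000):
    """Truncated sum of gamma^t * rho_t with rho_t = 1 - (1 - alpha)^t."""
    return sum(gamma ** t * (1 - (1 - alpha) ** t) for t in range(horizon))


def rho_series_closed_form(alpha, gamma):
    """1/(1-gamma) - 1/(1-gamma(1-alpha)), simplified to a single fraction."""
    return alpha * gamma / ((1 - gamma) * (1 - gamma * (1 - alpha)))


for alpha in (0.01, 0.1, 0.5, 1.0):
    print(alpha, rho_series(alpha, 0.9), rho_series_closed_form(alpha, 0.9))
```

For small $\alpha$ the sum grows roughly linearly in $\alpha$, which is why the penalty term in the final bound is quadratic in $\alpha$.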

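Substitute them into $\eqref{eq:two_parts4}$:

$$
\begin{equation}
\eta(\pi_{new}) - \eta(\pi) \ge \frac{\alpha}{1-\gamma} \left( \mathbb{A}_\pi(\pi') - \frac{2 \alpha \gamma \epsilon}{1-\gamma(1-\alpha)} \right)
\label{eq:two_parts5}
\end{equation}
$$

We call equation $\eqref{eq:two_parts5}$ Theorem 4.1.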

#### Finding the right step

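Finally, we want to guarantee policy improvement by selecting a proper $\alpha$. For that, the lower bound must be positive, so we solve for $\alpha$:

$$
\mathbb{A}_\pi(\pi') - \frac{2 \alpha \gamma \epsilon}{1-\gamma(1-\alpha)} > 0
\quad \Longleftrightarrow \quad
\alpha < \frac{(1-\gamma) \mathbb{A}_\pi(\pi')}{\gamma \left( 2 \epsilon - \mathbb{A}_\pi(\pi') \right)}
$$

Here we assume $\mathbb{A}_\pi(\pi') > 0$; note that $\epsilon \ge \mathbb{A}_\pi(\pi')$ always holds, so the denominator is positive.

Such an $\alpha$ is guaranteed to improve the policy because it is computed from the pessimistic estimate (the lower bound).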
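As a rough sketch of how the step size could be chosen in practice, the code below assumes the lower bound has the form $\frac{\alpha}{1-\gamma} \left( \mathbb{A}_\pi(\pi') - \frac{2\alpha\gamma\epsilon}{1-\gamma(1-\alpha)} \right)$ and solves for the range of $\alpha$ where it stays positive. The function names and the example numbers are hypothetical, and the paper may pick $\alpha$ differently (for instance by maximizing a looser bound).

```python
def improvement_lower_bound(alpha, policy_adv, eps, gamma):
    """Assumed form of the lower bound on eta(pi_new) - eta(pi)."""
    penalty = 2 * alpha * gamma * eps / (1 - gamma * (1 - alpha))
    return alpha / (1 - gamma) * (policy_adv - penalty)


def max_safe_alpha(policy_adv, eps, gamma):
    """Supremum of step sizes in [0, 1] with a positive lower bound.

    Any alpha strictly between 0 and the returned value has a positive bound.
    Returns 0.0 when the policy advantage is not positive (no guarantee).
    """
    if policy_adv <= 0:
        return 0.0
    threshold = (1 - gamma) * policy_adv / (gamma * (2 * eps - policy_adv))
    return min(1.0, threshold)


# Hypothetical numbers: policy advantage 0.5, eps 1.0, discount 0.9.
alpha_max = max_safe_alpha(0.5, 1.0, 0.9)
print(alpha_max)                                               # about 0.037
print(improvement_lower_bound(alpha_max / 2, 0.5, 1.0, 0.9))   # positive
print(improvement_lower_bound(alpha_max * 2, 0.5, 1.0, 0.9))   # negative
```

The safe range shrinks as $\gamma$ approaches $1$ or as $\epsilon$ grows relative to the policy advantage, which matches the intuition that long horizons and large worst-case advantages call for smaller steps.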