Part 1: The Optimization Problem of Reinforcement Learning

## Explicit Formula for the Sequence of States

Before diving into the value function, quality function, and algorithms for reinforcement learning, I want to give this brief intermezzo. In the first post, I introduced the idea of a sequence of distributions over states given a policy.

The main use for this is to formally write down how an agent traverses an environment and do so in a (comparatively) compact manner. If you look at the above equation closely, you can see that it is defined recursively. In order to compute the next element, we first need the previous element and then do some computation with it.
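Written out, the recursion from the first post looks roughly like this (a reconstruction, since the rendered equation is not reproduced here; continuous state and action spaces are assumed):

```latex
S_{k+1}(s_{k+1}) \;=\; \int_{\mathcal{S}} \int_{\mathcal{A}} T(s_{k+1} \mid s_k, a_k)\, \pi(a_k \mid s_k)\, S_k(s_k)\, \mathrm{d}a_k\, \mathrm{d}s_k
```

Each element of the sequence is obtained by pushing the previous distribution **S_k** through the policy and the transition function.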

In this post, I want to rewrite the above so that we can move from the initial state distribution **S_0** directly to an arbitrary element **S_k** of the sequence. To do this, I will first introduce a “skip” operator, which lets us skip elements in the sequence and, moving forward in time, go directly from some **S_i** to a later **S_j**. This will simplify the derivations of the value function and quality function.

## The Skip Operator

The skip operator takes the following form
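(The original equation was rendered as an image; the following is a LaTeX reconstruction from the surrounding description, with notation assumed.)

```latex
\mathcal{T}^{(k)}(s_k \mid s_0, a_0) \;=\; \int \cdots \int \left[\prod_{i=1}^{k-1} T(s_{i+1} \mid s_i, a_i)\, \pi(a_i \mid s_i)\right] T(s_1 \mid s_0, a_0)\, \mathrm{d}a_1 \cdots \mathrm{d}a_{k-1}\, \mathrm{d}s_1 \cdots \mathrm{d}s_{k-1}
```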

This doesn’t look too bad, right? The **calligraphic T** is a glorified transition function that computes the probability of transitioning from **s_0** to **s_k** in exactly **k** steps when choosing **a** as the first action and then following the current policy. (Note: We will encounter a similar idea later for the quality function Q.) Let’s look at this operator when we would like to skip a single (k=1) state (**S_j**):

We again read this from the inside out: Given the state **s_j** (which we want to skip) and an action **a_j** we choose in this state, what is the likelihood of transitioning into the next state **s_j+1**? We then multiply this by the likelihood of choosing **a_j** in the state **s_j** – determined by the policy – and sum over all possibilities for **a_j**. This gives us the probability of transitioning from **s_j** into **s_j+1**. We then multiply this probability by the chance of ending up in **s_j** when we execute action **a_j-1** in state **s_j-1**. In other words, we merge a triplet of the form **T·π·T** by computing the probability of transitioning from **s_j-1** to **s_j+1** when choosing **a_j-1**, while accounting for all the pathways through which this might occur when following the current policy.
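Concretely, merging the triplet reads as follows (a reconstruction; notation assumed):

```latex
\int_{\mathcal{S}} \int_{\mathcal{A}} T(s_{j+1} \mid s_j, a_j)\, \pi(a_j \mid s_j)\, T(s_j \mid s_{j-1}, a_{j-1})\, \mathrm{d}a_j\, \mathrm{d}s_j \;=\; \mathcal{T}^{(2)}(s_{j+1} \mid s_{j-1}, a_{j-1})
```

The inner integral marginalizes out the action the policy would pick in **s_j**, and the outer integral marginalizes out **s_j** itself.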

Skipping multiple steps works recursively, each time merging a triplet until we have computed one big transition function that starts in **s_0** and ends in **s_k** after **k** steps. This explicit form has **2(k-1)** integrals!

## Rewriting the Sequence of State Distributions

This skip operator is pretty neat. With it, we can re-write **S_k** from the first post in this series using this explicit formula:
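The rewritten formula plausibly takes this shape (a reconstruction; notation assumed):

```latex
S_k(s_k) \;=\; \int_{\mathcal{S}} \int_{\mathcal{A}} \mathcal{T}^{k}(s_k \mid s_0, a_0)\, \pi(a_0 \mid s_0)\, S_0(s_0)\, \mathrm{d}a_0\, \mathrm{d}s_0
```

Instead of unrolling the recursion one element at a time, the skip operator carries us from **S_0** to **S_k** in a single expression.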

Note that I moved the k from being in parentheses to the superscript. I did this because we can chain the skip-1 operator (k=1) n times and get the same result as applying the skip-n operator (k=n) directly. This can be proven easily via induction. We will use this property to show that the value function is recursive.
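To make the chaining property concrete, here is a minimal numerical sketch on a hypothetical discrete MDP (the sizes, random seed, and variable names are my own illustrative assumptions, not from this series): composing skip operators reproduces the direct skip-k operator, and jumping from **S_0** to a later distribution via the skip operator matches the step-by-step recursion.

```python
import numpy as np

# Hypothetical small discrete MDP, used only to sanity-check the chaining
# property; sizes and randomness are illustrative assumptions.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# T[s, a, s'] = P(s' | s, a); each (s, a) slice is a distribution over s'.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)

# pi[s, a] = pi(a | s); each state's action distribution sums to 1.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

def compose(op_m, op_n):
    """Merge two skip operators: sum over the intermediate state s and the
    action b the policy chooses there (the T-pi-T merge in discrete form)."""
    return np.einsum('xas,sb,sby->xay', op_m, pi, op_n)

def skip(k):
    """Skip-k operator: probability of reaching y from x in exactly k steps,
    choosing a as the first action and then following pi."""
    op = T  # k = 1 is just the ordinary transition function
    for _ in range(k - 1):
        op = compose(op, T)
    return op

# Chaining: composing skip-2 with skip-2 equals applying skip-4 directly.
assert np.allclose(compose(skip(2), skip(2)), skip(4))

# The rewritten sequence: propagating S_0 one step at a time agrees with
# jumping straight to S_3 via the skip-3 operator.
S0 = rng.random(n_states)
S0 /= S0.sum()
direct = np.einsum('xay,xa,x->y', skip(3), pi, S0)
step_by_step = S0
for _ in range(3):
    step_by_step = np.einsum('xay,xa,x->y', T, pi, step_by_step)
assert np.allclose(direct, step_by_step)
```

The `einsum` subscripts spell out exactly which variables are marginalized at each merge, mirroring the integrals in the continuous case.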