# Faster RNG for Reinforcement Learning in Python

When you start doing reinforcement learning you will sooner or later come to the point where you will generate random numbers. Initializing policy networks or Q-tables, choosing between exploration or exploitation, or selecting among equally good actions are a few examples. Numpy is very efficient at generating those random numbers, and most of the time (like 95%) this is all you need. However, there is one particular edge case were numpy is not the best solution, and that is exactly the case we encounter in RL a lot: generating single random numbers (i.e., to select an action epsilon-greedy).

Generating single random numbers in numpy is a bad idea, because every numpy call gets sent to the numpy engine and then back to python, which creates overhead that dominates runtime for single random numbers. In this case it is (much) more efficient to use the python standard library instead. However, if you can generate the random numbers in batches, numpy is significantly faster than the standard library again. Take a look at this simple profile:

So if you do your profiling your code and notice that RNG adds up to a significant portion of your runtime, consider pre-generating the random numbers in numpy and then save them to a list. This solution sacrifices a bit of readability, but allows for much faster code. Here is an example that mimics the syntax of the python standard library:

That’s it. Thanks for reading and Happy Coding!

# Parallelism in Python Generators

Yesterday I stumbled upon a StackOverflow question that asked about implementing a Rosetta Code problem in parallel to speed it up. One easy way to do it is one, which is a modification of the python example on rosettacode.org.

When I posted the question the OP commented that he/she was looking for using non-terminal sub-processes that yield these super-d values. I thought about it, but that version does not seem very practical. If the main process is interested in the results of the computation, then temporary concurrency will be the cleaner solution (like in this example). If the main thread isn’t, e.g., if you are running an old-school threaded web-server, that hands off incoming connections to a worker thread, then solutions with non-terminal sub-processes can make sense. In the latter case you are essentially “starting a program N times in parallel and shutting them down together”. This certainly makes sense, but only if those programs don’t need to communicate. Remember to KISS.

Thank you for reading and Happy Coding!

# Reinforcement Learning: Notation and Proofs (Part 2)

## Explicit Formula for the Sequence of States

Before diving into the value function, quality function, and algorithms for reinforcement learning, I want to give this brief intermezzo. In the first post, I introduced the idea of a sequence of distributions over states given a policy.

$S_{k+1}: S \times \Pi \to [0,1] ,~ (s', \pi) \mapsto \int_s S_k(s) \int_a \pi(s,a) T(s,a,s') ~\mathrm{d}a ~\mathrm{d}s$

The main use for this is to formally write down how an agent traverses an environment and do so in a (comparatively) compact manner. If you look at the above equation closely, you can see that it is defined recursively. In order to compute the next element, we first need the previous element and then do some computation with it.

In this post, I want to rewrite the above so that we can move from the initial state distribution S_0 directly to an arbitrary element S_k of the sequence. To do this, I will first introduce a “skip” operator, which allows us to skip elements in the sequence and, moving forward in time, go directly from S_i to an S_j. This will simplify the derivations of the value function and quality function.

## The Skip Operator

The skip operator takes the following form

$S_+: (\mathbb{N}\times S \times S \times \Pi) \to [0,1],~ (k, s_0, s_k, \pi) \mapsto \int_a \mathscr{T}^k(s_0, a, s_k, \pi)~~\mathrm{d}a$

This doesn’t look too bad, right? The calligraphic T is a glorified transition function that computes the probability of transitioning from s_0 to s_k in exactly k steps when choosing a as the first action and then following the current policy (Note: We will encounter a similar idea later for the quality function Q). Let’s look at this operator when we would like to skip a single (k=1) state (S_j):

$\mathscr{T}(s_{j-1},a_{j-1},s_{j+1}) = \int_{s_j}T(s_{j-1},a_{j-1},s_j)\int_{a_j}\pi(s_j,a_j)T(s_j, a_j, s_{j+1})~\mathrm{d}a_j~\mathrm{d}s_j$

We again read this from the inside out: Given the state s_j (we want to skip) and an action a_j we choose in this state, what is the likelihood of transitioning into the next state s_j+1? We then multiply this by the likelihood of choosing a_j in the state s_j – determined by the policy – and sum over all possibilities for a_j. This gives us the odds of transitioning from s_j into s_j+1. We then multiply these odds with the chance of ending up in s_j when we execute action a_j-1 in state s_j-1. In other words, we merge a triplet of the form T~\pi~T by computing the probability of transitioning from s_j-1 to s_j+1 when choosing a_j-1, while accounting for all the pathways through which this might occur when following the current policy.

Skipping multiple steps works recursively, each time merging $T~\pi~\mathscr{T}$ until we have computed a big transition function that starts in s_0 and ends in s_k after k steps. This explicit form as 2(k-1) many integrals!

## Rewriting the Sequence of State Distributions

This skip operator is pretty neat. With it, we can re-write S_k from the first post in this series using this explicit formula:

$S_k(s', \pi) = \int_s S_0(s) S_+^k(s,s',\pi) ~\mathrm{d}s$

Note that I moved the k from being in parentheses to the superscript. I did this, because we can chain the skip-1 operator (k=1) n times, and get the same result as applying the skip-n operator (k=n) directly. This can be proven easily via induction. We will use this property to show that the value function is recursive.

# Reinforcement Learning: Notations and Proofs (Part 1)

## Part 1 – The Optimization Problem of RL

I started solving reinforcement learning problems quite a while ago now; way back when it was still considered a “nice idea, but only really of theoretical interest”. However, it just recently came to me that I treat it as a purely applied field; I do stuff that seems to be working, but I never deeply looked at the math behind various methods and algorithms. It’s mostly lived experience, and I want to rectify that. Unfortunately, I didn’t mange to find many proofs or rigorous explanations on the topic online. This is where this series of posts comes in. I will go over some basics, write them down properly, and derive some ubiquitous relationships that are useful when doing reinforcement learning.

If you would like to see certain topics, feel free to leave a comment below; its always more fun to write about things others actually want to read 🙂

## Preliminaries and Definitions

Let’s start by defining basic terminology. Let $S\subset\mathbb{R}^n$, be a set of states, $A\subset \mathbb{R}^n$ a set of actions, and $\pi:S\times A\to [0,1]$ a (stochastic) policy from the set of all policies $\Pi = \{\pi : \int_s \pi(s) ~\mathrm{d}s = 1 \}$. Further, let $R: S\times A \times S \to \mathbb{R}$ be a reward function, which we assume to be bounded (it makes convergence easier). Next, let $T: S\times A\times S \to [0,1]$ be a transition function which satisfies $\forall s\in S \forall a\in A : \int_{s'} T(s,a,s') ~\mathrm{d}s = 1$. These definitions cover the vast majority of practical problems I have encountered. Further, let $S_0: s \to [0, 1]$ be a distribution over starting states that again satisfies $\int_s S_0(s) ~\mathrm{d}s = 1$.

One consequence of above definitions is that we can analytically express the progress in the environment as the evolution of the distribution over states across time influenced by some sequence of actions $a_k \in A$. I.e, given such a sequence of a_ks, and starting in S_0, we can recursively compute the distribution over states as

$S_{k+1}: S \to [0,1] ,~ (s') \mapsto \int_S S_k(s)T(s,a_k,s') ~\mathrm{d}s$

Don’t let this weird notation throw you; it means that we assume to be in state s at time step k, and then check how likely it is to end up in a resulting state s’ given that we execute action a_k. This gives a probability distribution over s’ assuming we are in s and do a_k. Next we scale it by the probability of actually being in s at time k (given by S_k) and then “sum” over all these distributions, which – they are not countable – takes the form of an integral.

More than just fixed sequences of actions, we can also track the the evolution of a policy as

$S_{k+1}: S \times \Pi \to [0,1] ,~ (s', \pi) \mapsto \int_s S_k(s) \int_a \pi(s,a) T(s,a,s') ~\mathrm{d}a ~\mathrm{d}s$

The product of the three terms gives the probability of making the transition from s into s’ by choosing a at time k. Summing over all actions yields the chance of transitioning from s into s’ at time k. Finally, summing over s yields the odds of transitioning into s’ at time k, which is the same as being in s’ at time k+1.

Following up on that, we can use this sequence S_k to compute the reward we expect to obtain at time k when following a policy

$R_{k}: S \times \Pi \to [0,1] ,~ (\pi) \mapsto \int_s S_k(s) \int_a \pi(s,a) \int_{s'} T(s,a,s') R(s,a,s') ~\mathrm{d}s' ~\mathrm{d}a ~\mathrm{d}s$

Which has the same structure as computing S_k+1 except that we scale each transition by the reward that this transition would give and also sum over all the possible future states s’. This results in the reward we would expect to receive at time k when following the chosen policy.

## The Optimization Problem of Reinforcement Learning

Our goal is to find policies that collect a lot of reward. To measure “collect a lot of reward” we can use the sequence R_k. In practice, we also introduce a discount factor $\gamma\in (0,1)$, because we would like the policy to prefer immediate rewards to rewards in the future (and because it guarantees that the reward is finite). Gamma is a hyper-parameter that we need to choose, which immediately leads to the question how sensitive the optimal policy is to changes in this gamma. Something we will study in a future post.

In sum, this enables us to define an objective function and corresponding optimization problem as

$\mathcal{O}: \Pi \to \mathbb{R}: (\pi) \mapsto \sum_k \gamma^k R_k \rightsquigarrow \max_\pi$

Finding solutions to this optimization problem is the fundamental challenge in reinforcement learning. However, there is one more crux. If we knew all the functions involved, in particular the transition function T, we could use standard optimization methods to tackle this problem, but we know neither T nor R (the reward function) explicitly; hence, computing the value of our objective O for a given policy explicitly is impossible. Instead, we have to approximate the objective function O by sampling from the environment and, then, use that approximation in our numerical optimization process. Yikes!

It is easy to see from the above that convergence of naive methods is under serious threat as soon as the variance in the environment increases, i.e. state and action spaces become large. This will turn into the curse of dimensionality, once we blow up the optimization problem using the notion of value function and Q function. It motivates why a lot of recent work achieves new breakthroughs simply by increasing the computational power (read: number of samples one is able to generate and process). More examples means better approximation, which means more stable numerical optimization. It also motivates one of the current research areas: Find new algorithms that are more sample efficient, i.e., algorithms that are less sensitive to bad approximations of the environment.

## Conclusion

This covers basic notation in reinforcement learning. In my next post on this I will write down the notion of a value function V and a q-function Q and show how both can be defined recursively. There is also a relation between Q and V which we will derive. Finally, we will prove that optimality in one state of the value function implies optimality in every future state as well.

This will be the stepping stone for looking at surrogate functions in reinforcement learning, i.e., how to cleverly replace the objective function stated above with another objective function that is easier to optimize and that yields policies optimal under the original objective. This is the one of the bedrock ideas in current reinforcement learning, and is essentially what algorithms like PPO are doing. They say “look, if we optimize this other cool function instead, we can get policies that also maximize reward and finding them is easier compared to using the original objective”.

Happy coding!

# Franka Emika Panda in Gazebo with ROS and Docker

In my last post I’ve written about creating Gazebo packages from the existing panda models and mentioned that I’m also working on a docker image that includes ROS. Well here it is 🙂 You can check out the GitHub repository (be sure to leave a star if you like it), and – assuming you have an operating docker host – it should be very straight forward to set up.

Since I spent way too much time fixing dependencies and writing configuration files, I would like to share my experience (and of course the code :P) so that the solutions to the problems I encountered along the way are in one place.

### High Level Overview

Before diving into the setup of the individual parts, I want to quickly talk about how the various components interact. This overview will make it easier to understand what each system is trying to accomplish, and what it expects.

Contrary to my initially guess – that I could simply extend the Gazebo package I created earlier – I found that it is easier to start completely from scratch for this image. The logic here is that the robot will be spawned into Gazebo from ROS, and hence it makes sense to properly integrate things into the ROS architecture. This means using .urdf to describe the robot model, instead of .sdf files like I did in the previous post. It also means that I settled for one model – the robot with the gripper – although support for both versions can be added.

The main pipeline goes from MoveIt via ROS Control to Gazebo. MoveIt receives a goal pose and plans a trajectory to it. It then sends position commands to ROS control which sets targets for a series of PID controllers. These controllers are what steer the simulated joints in Gazebo. Gazebo then sends the joint encoder readings back to the ROS framework, where they are picked up by the PID controllers to stabilize the robot, MoveIt to decide the next step of the trajectory, and RViZ to visualize the robot. MoveIt and Gazebo both use the .urdf I generated in the last post, and I got the controller parameters from Erdal Pekel’s blog post.

### ROS Control and Stabilizing the Robot

In retrospect getting the controllers to work is quite easy. While configuring it, it was quite annoying, because they didn’t seem to be loading properly. The issue was dependencies 🙂 Before I go into the packages needed to connect ROS control to gazebo, however, I want to show you the config file that I used to set the PID controls. I got this file from Erdal Pekel’s Blog who spend quite a bit of time tuning those numbers. Amazing job!

To get these controllers to actually control the simulation, we can start them with a single line in the launch file, and point them to the panda_controller.yaml. To communicate with gazebo, however, they will need the gazebo_ros_control package, which doesn’t seem to ship with gazebo_ros nor with ros_control. Additionally the default ros container doesn’t ship with any controllers, so I had to install the effort controllers in the effort_controllers package, but also the joint-trajectory-controller which will install provide the controllers mentioned in above file. I didn’t test if the other effort controllers are necessary, so there might be a dependency on which this doesn’t actually depend.

If the low-level controllers are working properly, the arm in the simulation should freeze in a position that is different from the one it will settle in based on gravity. Usually, the simulation starts, and the arm falls. Then – a short while later – the controllers finish loading and stabilize the arm in some (slightly awkward) position.

### Creating the MoveIt configuration

The MoveIt package in the repository (called panda_moveit) is the result of me using the moveit setup wizzard with the generated .urdf file. Initially, I tried using the official panda_moveit_config package, but failed to get things to work for two key reasons: (1) the official moveit config is meant for the physical robot, which uses controllers unsupported by Gazebo. This mismatch in controllers is detrimental. (2) I (unknowingly) chose a different naming convention in the .urdf compared to the officially used one. This can be fixed by me renaming things, but at this point I had already found out about reason (1), and thought I can keep it if I have to create my own config anyway.

To create this package, I followed the MoveIt Setup Assistant Tutorial, trying to stick as close to the existing panda_moveit_config as possible. For example, I included the ‘ready’ pose that panda uses to give the robot a default state 🙂

One file that I had to add manually to the generated configuration to get it to work was a config file for the MoveIt controllers. This tells MoveIT which controller groups it should use to actuate which motors, and what messages to send to the ROS control nodes. I then had to replace the panda_moveit_controller_manager.launch.xml with the newly created ones.

To start MoveIt and the planning context together with Gazebo, I added the move_group.launch file to the launch file in the panda_gazebo package. I didn’t touch the other launch files that the moveit setup assistant generated; hence, they are likely in a broken state.

### Controlling the Simulator from RViZ

Finally, and to add some icing on the cake, controlling the Gazebo simulation through RViZ provides a very good test if everything is working as it should. I added the RViZ node to the launch file in the panda_gazebo package. After launching the file and waiting for everything to load, I created a layout in RViZ, configured it to my liking, and stored the layout as an .rviz config file. I then added this file as an argument to the launch options for the RViZ node.

A strange situation that I encountered here was that the static frame RViZ tries to use as reference in the global settings was set to map instead of world. I’m not 100% sure why the default is map, but if it is set to this, the interactive markers won’t show for panda’s endeffector.

With all this in place, I can now use the interact handles to drag the robot into a desired goal position and click ‘plan and execute’.

### Putting It All Together in Docker

After all this work of assembling the pieces into a working pipeline, I thought I can take some additional time to alleviate some of the pain others may have when setting this up; especially the pain that comes from missing dependencies. As a result, I decided to create a docker image that is portable and will set all this up for you.

The Dockerfile itself is very short:

It just installs the necessary ROS packages (it is totally my own stupidity, but I can’t stress enough how much time I wasted figuring out which packages I need), adds the workspace created above, and then modifies the image’s entrypoint to lay the catkin workspace over the default ROS workspace.

One interesting thing happens in line 16, where I fix a bug that is very unique to gazebo in docker. As a container starts ‘fresh’ gazebo doesn’t have any models downloaded, which means it will download the two models it needs to create empty.world. This costs time and, unfortunately, interacts with ROS control in such a way that the controllers crash. To correct this, I pre-populate gazebo’s model cache with the models it needs.

The other file I use for building the image is the build script:

The script is again very short. Line 4 generates the .urdf file from the .xacro file (a process I used in my post on how to get panda into Gazebo). Line 7 is interesting, because we spin up a clean ROS image to build the catkin workspace that we created, and then use those generated files to add them to the final image. Arguably this could be solved more elegantly via multi-stage builds; however, I learned about this feature after implementing things this way. Thus I will leave it as is until a future refactor, and will use multi-stage builds in future images.

That’s it! These posts usually take my entire weekend to write. If you think they are good, be sure to to like the post, and leave a star on my GitHub repository. This way I know that writing these is a worthy use of my time.

Happy coding!

# H5-Index of HRI Venues

I got bored during a late Saturday evening, so I decided to query Scopus for the last 5 years of publications for major venues in HRI. After aggregating all this data, I wanted to look at the impact factor and related metrics of these papers. However, I realized only afterwards that I can’t calculate the impact factor from the data I gathered. Instead I did the next best thing, computed the H5 scores for all the venues, and rank ordered them by this score. Maybe this graph is useful to some 🙂

Thank you for reading, and (although there is no programming here) happy coding!

# Franka Emika Panda in Gazebo (without ROS using Docker)

The background for this post is that I am currently visiting ISIR, and I’ve started a new project working with the panda robot by Franka Emika. Unfortunately, we only have a single physical robot, and we are 2 PhDs using it for different experiments. To minimize downtime, I am, hence, setting up a simulator for it, and, to facilitate reusability, I decided to go with a Docker based setup.

The image is available on GitHub and called panda_sim. You can clone it, build it, and see how it works for you. While I tried to keep ROS out of this, the current model still loads the gazebo ros plugin for control. If I get around to it, I will remove this dependency in a future version, to get it completely ROS independent.

I am currently working on a ROS based image that runs a Gazebo server and a ROS controller for the robot with gripper. This way, you can easily spin up a simulator for your robot behavior. You can even do that in parallel for some epic deep reinforcement learning action.

#### Preparing the Robot Model

To use Panda in the simulator, we need to convert the existing model into .sdf format. While Gazebo can work with .urdf files, this requires a parallel ROS installation, which we try to avoid. Internally, the .urdf is converted to .sdf anyway, so we might as well supply .sdf and save the dependency.

First, we need to get the model from franka_ros which is located in the franka_description folder. It is a .urdf model, so in order to use it in Gazebo, we need to add some additional information such as joint inertia, or that the arm should be attached rigidly to the world frame. Erdal Pekel also has a tutorial how to bring Panda into Gazebo (using ROS). I used his numbers and suggestions to modify the files.

Next, as we are ripping out the model from an existing ROS package, we will also need to update the paths in the .urdf. In particular, I removed \$(find franka_description)/robots/ in both the panda_arm.urdf.xacro and the panda_arm_hand.urdf.xacro, and changed the robot_name to panda. I also changed the description_pkg value to panda_arm_hand or panda_arm in the panda_arm.xacro, depending on the model; this name of the package needs to match the name of the model folder (see below). In hand.xacro the value for description_pkg is hardcoded, so I introduced the description_pkg variable, and set it appropriately.

#### Converting .xacro to .sdf

After doing all the necessary modifications, we need to convert the .xacro to a .urdf. Docker can again be incredibly helpful, as we can spin up a throw away ROS instance, do our conversion, and save the result on the host:

docker run -v <path/to/modified>/franka_description:/xacro osrf/ros:kinetic-desktop-full rosrun xacro xacro --inorder /xacro/robots/panda_arm_hand.urdf.xacro > panda_arm_hand.urdf

Next, we want to convert the model to a .sdf. This is again solved in an easy one liner using docker:

docker run -v <path/to/generated/urdf>:/urdf gazebo:latest gz sdf -p /urdf/panda.urdf > model.sdf

#### Assembling the Model

Finally, all that is left is to assemble the pieces into a full Gazebo model of Panda. For this we create a new folder called panda and copy the meshes folder and the model.sdf in there. We then create a models.config to describe the model to Pandas as follows

Here is the folder structure:

panda_arm_hand
|__meshes
|  |__collision
|  |  |__files from franka_description
|  |__visual
|     |__files from franka_description
|__model.config
|__model.sdf 

Copying this folder into ~/.gazebo/models will make it available to Gazebo. To pack it into a docker image, I wrote a small script in build.sh that will construct the above model and then build a docker image with the models already installed. The script will create a tmp folder where it stores the fully constructed models, so you can also run the script and get just the models, if that’s what you need.

Thank you for reading, and happy coding! If you liked this article, and would like to hear more ROS, Gazebo, or Panda related stuff, consider giving this post a like, or leaving me a comment 🙂

# How to play custom animations during speech on a NAO (or Pepper)

I’ve been asked multiple times now how to sync animations and speech on a NAO – or Pepper for that matter; especially from Python.

The answer to that is, there are two options:

1. The first one is to create the animation in Choreograph and then export it to a python script. You then create your usual handle to the text-to-speech module, but instead of calling the say method directly, e.g., tts.say(“Hello”), you call it through the module’s post method, e.g., tts.post.say(“Hello”). This method exists for every function in the API and essentially just makes a non-blocking call. You can then call your animation.
2. You create a custom animation in Choreograph, upload it to the robot, and call it through AnimatedSay or QiChat. Other than being the, I think, cleaner solution, it allows you more fine grained control over when in the sentence the animation starts and when it should stop. This is what I will describe in more detail below.

### Step 1: Create the Animation

Fairly straight forward, and the same for both solutions. You use Choreograph to create a new Timeline box in which you create the animation that you would like. You then connect the timeline box to the input and output of the behavior and make sure it works as you’d expect when you press the green play button.

### Step 2: Configure the Project and Upload it to the Robot

In this step, you configure the new animation to be deployed as an app on the robot.

Go to the properties of the project.

Then make sure to select a minimum naoqi version (for NAO 2.1, for Pepper 2.5), the supported models (usually any model of either NAO or Pepper respectively) and set the ID of the Application. We will use this when calling the animations, so choose something snappy, yet memorable. Finally, it is always nice to add a small Description.

Next, we need to reorganize the app a bit. Create a new folder and name it after your animation; again, we will use this name to call our behavior, so make sure it’s descriptive. Then move the behavior that contains your animation – by default called behavior1.xar – into the folder you just created, and rename it to behavior.xar .

Finally, connect to your robot and use the first button in the bottom right corner to upload the app you just created to your robot.

### Step 3: Use ALAnimatedSpeech from Python

Note: If you don’t want NAO to use the random gestures it typically uses when speaking in animated speech, consider setting the BodyLanguageMode to disabled. You can still play animations, but it won’t automatically start any.

For existing animations – that come with the robot by default – you call the animation like this

"Hello! ^start(animations/Stand/Gestures/Hey_1) Nice to meet you!"

Now, animations is nothing but an app that is installed on the robot. You can even see listed it in the bottom right corner of Choreograph. Inside the app, there are folders for the different stable poses of NAO like Stand, or Sit, which are again divided into types of animations, e.g., Gestures which you can see above. Inside these folders there is, yet another, folder named after the animation (Hey_1), inside of which is a behavior file called behavior.xar.

We have essentially recreated this structure in our own app and installed it right next to the animations app. So, we can call our own animations using the exact same logic:

"Hello! ^start(pacakge_name/animation_name) Nice to meet you!"


It also works with all the other aspects of the ALAnimatedSpeech module, so ^stop, ^wait, ^run, will work just as fine. You can also assign tags to your animations and then make it choose random animations for that tag group.

Finally, please be aware that the robot will return to it’s last specified pose after finishing an animation. Hence, if you want the robot to wait in a different position after the animation finished, you will have to do that by creating a custom posture. I have some comments on that here: The hidden potential of NAO and Pepper – Custom Robot Postures in naoqi v2.4