Self-Signed SSL Certificates with IP SAN v2.0 — This time in Docker

For the love of all things good; there is two things I don’t like: (1) unnecessary convoluted setups and (2) redoing work I’ve done earlier. SSL Certificates seem to combine both into one beautifully painful mess.

I once again found myself in the need to generate a throwaway SSL certificate for local development and testing. As I’ve posted earlier there is a way to do that with more or less effort. Since I found myself doing this for the third time now I decided to spice things up a bit, cutting down the time it takes. I’ve created a small docker container that’s sole purpose it is to create a certificate from a given config file. The way it works is:

  1. Create a folder (e.g. ./certs) and place the below config file (named config.cfg) into it.
  2. Modify the config file, adding all the SANs required for this certificate
  3. docker run –rm -it -v./certs:/certs firefoxmetzger/create_ssl

That’s it. It will drop a `private.key` and `certificate.crt` into the previously created folder. It will also print the properties of the certificate into console so you can make sure the SANs are actually added.

Here is the config.cfg template

And here a link to the GitHub repo with the code of the docker image:


Google Code Jam 2018: Trouble Sort

Link to the exercise.

The second problem in this years Code Jam was Trouble Sort. This problem was substantially easier then the first one. At least for the small test.

The idea is that one performs bubble sort with three elements at a time and, if the last element in the triplet bigger then the first, one reverses the entire 3-element sub-list. Pseudo-code for this algorithm was given. The goal of the exercise was to, given a list, sort that list using above strategy and then asserting that it was ordered. If it was one should print “OK” otherwise one should print the index of the (0-indexed) first element of non-increasing order (e.g. [3,9,7] should return 1).

The problem with passing the big test was that lists could have up to `10^8` many elements and above trouble sort implementation has `O(n^2)`. This means using the naive implementation will quickly run into the timeout.

The “trick” is to see that trouble sort is equivalent to splitting the unsorted list into elements with even and odd index, sorting them and then concatenating those sorted lists. Since this sort can use a reasonably fast algorithm (e.g. pythons native one) it can happen in O(n*log(n)) which will be significantly faster and work on the second, big set of tests. (I know this because I read it in the analysis).

The way I got that intuition was thinking about the fact that reversing a three element list is the same as swapping elements i-1 and i+1 . So we swap with a “distance” of two which made me wonder if it would be possible to swap the second and third element in the list in some way. The only swaps that involve the third element are when the “cursor” is at position two and four and that would swap with the first or fifth element respectively which are both uneven numbers. This trend continues for all numbers (even swap with even, odd with odd) and suddenly you begin to see a pattern.

Unfortunately I failed this challenge completely. My logic is correct, however if you look carefully at the last line, you will see that it outputs something like “Case 1:” however it should output “Case #1:”. I spent over 2 hours trying to find the mistake…

Sometimes it’s the small things that catch you. This probably won’t happen to me again … ever.

Google Code Jam 2018: Saving The Universe Again

During last week I’ve learned about a Code Jam hub in our University. The idea is to meet up and participate in the Google Code Jam 2018 . I ended up participating and advancing into the first round, so I thought I’d share my solutions here for anybody interested. (And for me for further reference)

Task 1: Saving The Universe Again (Link)

An alien robot is shooting a beam that will destroy all algorithms knowledge (not sure how an algorithm can have knowledge, but lets not get philosophical).

The beam starts with strength of 1 and follows a given sequence of actions (string with two literals “S” and “C”). Whenever it shoots “S” it deals damage equal to it’s current strength and whenever it charges “C” the strength is doubled. The sequence “SCSS” would thus deal 5 damage.

You have a shield that can absorb D damage (humanity needs the D!) and can swap any two adjacent literals in the string. What is the minimal number of swaps to reduce the damage dealt to a number <= D? (If impossible return “IMPOSSIBLE”)

You can understand this problem as a tree search (again! gives you a competitive advantage to know these things). The state is the current string and it’s children are all the new states that result from swapping in different places. For each such state you can calculate the damage. To find the closest node to the root node that has damage <= D a possibility is to use simple breadth first search.

Breaching from my usual style I will post the code first and then talk about it below. This is the pure, unmodified code I submitted during the hub, so it is intentionally left a bit messy. It’s still fairly readable though … It’s python and it’s short.

Note: This will work for “small” sequences and is too inefficient for large ones. The branching factor of the tree is len(sequence)-1 which can be up to 30. I timed out on the second (big) test, because I didn’t optimize this part.

Pure breadth first search on a tree with branching factor >10 is generally not a good idea. As such I’ve implemented the search using a queue rather then choosing a recursive approach. This allows to better scale the code as ideas for run time optimization come in.

By optimization I mean pruning of nodes that are clearly disadvantageous. One example is in line 32 where I keep a dictionary of nodes that have been visited. If I encounter the same state again, I already know a shorter path to it; thus I can safely ignore it. To be honest, I should have done the weeding out directly in get_nodes to reduce constant overhead.

Another optimization, which I didn’t implement, would be to see that this is a convex problem. There is exactly one clear minimum (S[…]SC[…]C) and one clear maximum (C[…]CS[…]S) along the border of the problem and there are no local optima in between. Thus, we can ignore all swaps that increase the damage or keep it constant, i.e. “SS”, “SC”, “CC”, and focus on those that swap “CS” into “SC”. This will severely decrease the branching factor. (Note: We already ignore “SS” and “CC” as part of above optimization)

Another consequence of this is that we can swap the queue type from a FIFO (i.e. breadth first) into a priority queue, sorting by current damage dealt choosing the node with lowest damage as the next node. This is essentially Dijkstra’s algorithm (as we don’t use a heuristic, otherwise it would be A*). It keeps amazing me again and again how easily you can swap between different tree searches by simply changing the queue type.

Applying those three optimizations should cut down search time by enough to cope with the big test. It also gets very close to the other solutions I’ve seen; that is “pass over the string back to front and switch all “CS” into “SC”. Repeat until no change was made for an entire sweep of the string and then output the total number of swaps. My approach is simply more formal and can be applied to other problems more easily.

Layered Layers: Residual Blocks in the Sequential Keras API

I’ve been looking at the AlphaGo:Zero network architecture [1] and was searching for existing implementations. I’ve found quite a few (here , here and here) with varying degrees of completeness. The cleanest is probably this one but it depends on Jupyter.

What surprised me was that I couldn’t find one that used Keras’ sequential API. While residual blocks aren’t exactly sequential, from a high level view the architecture itself is; it simply stacks (a lot of) residual blocks. So it should be possible to create something like this, right?

The answer is, of course: Yes, there isn’t much that you can’t do in Python. We are actually using this strategy already. Sequential itself inherits from Layer and, in fact, Container (a class sitting between Sequential and Layer in the inheritance hierarchy) states so itself: A Container is a directed acyclic graph of layers. It is the topological form of a “model”. A Model is simply a Container with added training routines. (source)

It works by defining the residual block as a new Keras layer. Depending on how tightly integrated you want it this can be quite short:

Inside the block we fall back to the functional way of stacking layers. If you want better integration, e.g. model.summary() showing the number of trainable weights, there is additional plumbing. Above just shows the gist . . . (gosh! That pun was bad).

Once that is written, we can use model.add( Residual(32, (3,3) )) as we would any other layer. Nice!

To close with an example, I modified the Keras CNN example on CIFAR10 and replaced the hidden convolutional layers with residual ones. I haven’t optimized performance, but you can see how it works. If you are familiar with the example, you might appreciate how similar it looks.


[1] Silver, David, et al. “Mastering the game of go without human knowledge.” Nature 550.7676 (2017): 354.

7x speedup with an optimized TensorFlow Input pipeline: TFRecords + Dataset API

A while ago I posted an updated version of tensorflow’s how to read TFRecords. Today I want to share another version of this file that was created to show how to further optimize the data pipeline.

Before delving into it let me quickly reflect on TFRecords and Datasets.

TFRecords have long been tensorflow’s recommended input method (though I find that folders with images are usually preferred by people). They are made of Google Protocol Buffers stored on disk in a single file. This is advantageous, because this will store the file in one big chunk on the hard drive, meaning faster reading time on HDDs and (I believe) faster average reading time compared to classical image formats like .jpg when reading actual image data.

The Dataset API on the other hand is the new preferred format of reading data. It comes from the observation that feeding data into TF is the steepest part of the learning curve for beginners. It also unifies all the various existing methods in one approach (aka feed_dict or queues). Finally, it allows us to worry about input on a high(er) level which is always convenient.

Now, let’s get to the meat. The idea is simple: Before the pipeline was

  1. read a single record / example / image
  2. decode the record / example / image
  3. augment the image (not necessary in a MWE, but really important for images)
  4. normalize the image (again some NN wizardry that people assume you “know”)
  5. shuffle the examples
  6. batch them up for training
  7. use the batches for the interesting stuff

Shuffle creates a queue of single examples. This works, but is slower then it could be. If we can write the augmentation and normalization to process batches instead of images we can do this:

  1. read a single example
  2. shuffle the examples
  3. batch them up for training
  4. decode the batch
  5. augment the batch
  6. normalize the batch
  7. use batches for the interesting stuff

As you can see, the trick is to batch them up as soon as possible and then decode / augment in batches. I didn’t dig deeply into this, but for some reason it makes the training A LOT faster.

This is a run after the change:

u\] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 650 Ti BOOST, pci bus id: 0000:01:00.0, compute capability: 3.0)
Step 0: loss = 2.32 (0.253 sec)
Step 100: loss = 2.13 (0.003 sec)
Step 200: loss = 1.90 (0.004 sec)
Step 300: loss = 1.59 (0.006 sec)
Step 400: loss = 1.16 (0.003 sec)
Step 500: loss = 0.95 (0.003 sec)
Step 600: loss = 0.84 (0.006 sec)
Step 700: loss = 0.66 (0.006 sec)
Step 800: loss = 0.79 (0.005 sec)
Step 900: loss = 0.62 (0.004 sec)
Step 1000: loss = 0.66 (0.003 sec)
Done training for 2 epochs, 1100 steps.

and for comparison a run where the batch is created at the end of the input pipeline and decoding is done first. This is what is currently implemented in the example:

2018-02-19 22:52:21.999019: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 650 Ti BOOST, pci bus id: 0000:01:00.0, compute capability: 3.0)
Step 0: loss = 2.32 (0.572 sec)
Step 100: loss = 2.13 (0.029 sec)
Step 200: loss = 1.93 (0.029 sec)
Step 300: loss = 1.65 (0.030 sec)
Step 400: loss = 1.34 (0.030 sec)
Step 500: loss = 0.93 (0.030 sec)
Step 600: loss = 0.73 (0.030 sec)
Step 700: loss = 0.68 (0.030 sec)
Step 800: loss = 0.67 (0.030 sec)
Step 900: loss = 0.56 (0.030 sec)
Step 1000: loss = 0.44 (0.029 sec)
Done training for 2 epochs, 1100 steps.

If we estimate the batched version to 0.0043 s/100ep and the single example speed to 0.03 s/100ep then we get a ~7x speedup. Pretty nice for just swapping around 2 lines of code.

Here is the code for the batched version

If you test it with your machine, let me know your batch times and what machine you are using =) I’d love to hear from you.

Happy coding!

Installing realsense SDK 2.0 in anaconda 3 (Ubuntu 16.04)

As part of my teaching duties at Uppsala University I am preparing a lab on the Intel Realsense D415 depth cameras.

In this post I want to show how I’ve set up the SDK in an anaconda 3 virtual environment on my Ubuntu 16.04. The instructions on how to compile from source provided by Intel are pretty good. On Ubuntu that is the required way to go, because the libraries are not provided by Ubuntu’s package manager.

There is an existing CMake project which I could pretty much use as is, however I had to slightly reconfigure it to work with anaconda 3 (matching the python version).

As a first step I needed an anaconda environment. I used python 3.6 and the name IIS_lab (because that happens to be the name of the lab I will teach)

conda create IIS_lab -python=3.6

This creates the Python executable and library that I had to include in CMake to build against the correct python version. I prefer to use the cmake-gui to configure CMake projects. In the section PYTHON there are two variables to be replaced. Here is the location and new value:


Also I had to check the BUILD > BUILD_PYTHON_BINDINGS box. [If unchecked, the PYTHON category might be missing. In this case simply configure the project again after you’ve checked it.]

Once those two values were set, I could generate and then build the project following the Intel instructions (including the kernel patch). Once done there were two files of interest:


The name of latter may differ depending on python version, c-compiler and 64-bit vs 32-bit OS. I had to copy those into anaconda’s virtual environment and rename the latter. For brevity I will call the location <env-path> and it expands to ~/<username>/anaconda3/env/<env-name>.


As you can see, I reloved the “.cpython-36m-x86_64-linux-gnu” ending. This is because the name of the file defines how the library is imported and the dot character ” . ” has a special meaning in python =) .

That’s it. Now I was be able to use the realsense SDK in my conda environment via

source activate IIS_lab
>>> import pyrealsense2 as rs

Please feel free to comment and share this if you think it was helpful.

Happy coding!