7x speedup with an optimized TensorFlow Input pipeline: TFRecords + Dataset API

A while ago I posted an updated version of TensorFlow’s example on how to read TFRecords. Today I want to share another version of this file that was created to show how to further optimize the data pipeline.

Before delving into it let me quickly reflect on TFRecords and Datasets.

TFRecords have long been TensorFlow’s recommended input format (though I find that most people prefer plain folders of images). They consist of Google Protocol Buffers stored on disk in a single file. This is advantageous because the data is stored in one big chunk on the hard drive, which means faster reading on HDDs and (I believe) faster average reading than classical image formats like .jpg when reading the actual image data.

The Dataset API, on the other hand, is the new preferred way of reading data. It comes from the observation that feeding data into TF is the steepest part of the learning curve for beginners. It also unifies the various existing methods (e.g. feed_dict or queues) in one approach. Finally, it allows us to think about input on a high(er) level, which is always convenient.

Now, let’s get to the meat. The idea is simple. Before, the pipeline was:

  1. read a single record / example / image
  2. decode the record / example / image
  3. augment the image (not necessary in a MWE, but really important for images)
  4. normalize the image (again some NN wizardry that people assume you “know”)
  5. shuffle the examples
  6. batch them up for training
  7. use the batches for the interesting stuff

Shuffle creates a queue of single examples. This works, but is slower than it could be. If we can write the augmentation and normalization to process batches instead of single images, we can do this:

  1. read a single example
  2. shuffle the examples
  3. batch them up for training
  4. decode the batch
  5. augment the batch
  6. normalize the batch
  7. use batches for the interesting stuff

As you can see, the trick is to batch them up as soon as possible and then decode / augment in batches. I didn’t dig deeply into this, but for some reason it makes the training A LOT faster.
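The reordered steps can be sketched with the tf.data API roughly as follows. This is a minimal sketch and not necessarily the exact code from the example: it assumes a TensorFlow version that provides the tf.io namespace, and the feature names "image_raw" and "label", the function name and the default sizes are placeholders for an MNIST-style record layout.

```python
def batched_input_pipeline(filename, batch_size=100, shuffle_buffer=10000):
    """Build a tf.data pipeline that batches first and decodes afterwards."""
    # Imported inside the function so that merely loading this sketch does
    # not require TensorFlow to be installed.
    import tensorflow as tf

    # Assumed record layout: raw uint8 pixels plus an int64 label.
    feature_spec = {
        "image_raw": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def decode_batch(serialized):
        # parse_example handles a whole vector of serialized records at once.
        parsed = tf.io.parse_example(serialized, feature_spec)
        # decode_raw requires every record in the batch to have the same length.
        images = tf.io.decode_raw(parsed["image_raw"], tf.uint8)
        return images, tf.cast(parsed["label"], tf.int32)

    def normalize_batch(images, labels):
        # Scale pixel values from [0, 255] to [-0.5, 0.5].
        return tf.cast(images, tf.float32) / 255.0 - 0.5, labels

    dataset = tf.data.TFRecordDataset(filename)  # 1. read single examples
    dataset = dataset.shuffle(shuffle_buffer)    # 2. shuffle the examples
    dataset = dataset.batch(batch_size)          # 3. batch them up
    dataset = dataset.map(decode_batch)          # 4. decode the batch
    # 5. a batch-wise augmentation step would go here
    dataset = dataset.map(normalize_batch)       # 6. normalize the batch
    return dataset
```

Note that in this ordering the shuffle buffer only holds serialized strings, and the parsing function is called once per batch instead of once per example.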

This is a run after the change:

u\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 650 Ti BOOST, pci bus id: 0000:01:00.0, compute capability: 3.0)
Step 0: loss = 2.32 (0.253 sec)
Step 100: loss = 2.13 (0.003 sec)
Step 200: loss = 1.90 (0.004 sec)
Step 300: loss = 1.59 (0.006 sec)
Step 400: loss = 1.16 (0.003 sec)
Step 500: loss = 0.95 (0.003 sec)
Step 600: loss = 0.84 (0.006 sec)
Step 700: loss = 0.66 (0.006 sec)
Step 800: loss = 0.79 (0.005 sec)
Step 900: loss = 0.62 (0.004 sec)
Step 1000: loss = 0.66 (0.003 sec)
Done training for 2 epochs, 1100 steps.

and for comparison a run where the batch is created at the end of the input pipeline and decoding is done first. This is what is currently implemented in the example:

2018-02-19 22:52:21.999019: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 650 Ti BOOST, pci bus id: 0000:01:00.0, compute capability: 3.0)
Step 0: loss = 2.32 (0.572 sec)
Step 100: loss = 2.13 (0.029 sec)
Step 200: loss = 1.93 (0.029 sec)
Step 300: loss = 1.65 (0.030 sec)
Step 400: loss = 1.34 (0.030 sec)
Step 500: loss = 0.93 (0.030 sec)
Step 600: loss = 0.73 (0.030 sec)
Step 700: loss = 0.68 (0.030 sec)
Step 800: loss = 0.67 (0.030 sec)
Step 900: loss = 0.56 (0.030 sec)
Step 1000: loss = 0.44 (0.029 sec)
Done training for 2 epochs, 1100 steps.

If we estimate the batched version at 0.0043 s per step (the mean of the times reported for steps 100 to 1000) and the single-example version at 0.03 s per step, then we get a ~7x speedup. Pretty nice for just swapping around 2 lines of code.
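For anyone who wants to redo the estimate, here is the arithmetic as a small Python snippet. The numbers are copied from the two logs above; leaving out step 0 is my choice, since that step presumably includes one-time startup cost.

```python
from statistics import mean

# Step times in seconds from the two logs above (steps 100 to 1000).
batched = [0.003, 0.004, 0.006, 0.003, 0.003, 0.006, 0.006, 0.005, 0.004, 0.003]
per_example = [0.029, 0.029, 0.030, 0.030, 0.030, 0.030, 0.030, 0.030, 0.030, 0.029]

batched_mean = mean(batched)
per_example_mean = mean(per_example)
speedup = per_example_mean / batched_mean

print(f"batched: {batched_mean:.4f} s, per example: {per_example_mean:.4f} s")
print(f"speedup: {speedup:.1f}x")
```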

Here is the code for the batched version

If you test it with your machine, let me know your batch times and what machine you are using =) I’d love to hear from you.

Happy coding!


Installing realsense SDK 2.0 in anaconda 3 (Ubuntu 16.04)

As part of my teaching duties at Uppsala University I am preparing a lab on the Intel Realsense D415 depth cameras.

In this post I want to show how I’ve set up the SDK in an anaconda 3 virtual environment on my Ubuntu 16.04. The instructions on how to compile from source provided by Intel are pretty good. On Ubuntu that is the required way to go, because the libraries are not provided by Ubuntu’s package manager.

There is an existing CMake project which I could pretty much use as is; however, I had to slightly reconfigure it to work with anaconda 3 (matching the python version).

As a first step I needed an anaconda environment. I used python 3.6 and the name IIS_lab (because that happens to be the name of the lab I will teach):

conda create -n IIS_lab python=3.6

This creates the Python executable and library that I had to include in CMake to build against the correct python version. I prefer to use the cmake-gui to configure CMake projects. In the section PYTHON there are two variables to be replaced. Here is the location and new value:


Also I had to check the BUILD > BUILD_PYTHON_BINDINGS box. [If unchecked, the PYTHON category might be missing. In this case simply configure the project again after you’ve checked it.]

Once those two values were set, I could generate and then build the project following the Intel instructions (including the kernel patch). Once done there were two files of interest:


The name of the latter may differ depending on python version, C compiler and 64-bit vs 32-bit OS. I had to copy those into anaconda’s virtual environment and rename the latter. For brevity I will call the location <env-path>; it expands to ~/anaconda3/envs/<env-name>.


As you can see, I removed the “.cpython-36m-x86_64-linux-gnu” ending. This is because the name of the file defines how the library is imported and the dot character ” . ” has a special meaning in python =) .
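If you want to script the renaming, the target file name can be derived like this. It is only a sketch: the helper name is made up, and the ".so" extension in the example is an assumption about the built file on Linux.

```python
from pathlib import Path


def plain_module_name(built_file: str) -> str:
    """Drop everything between the module name and the final extension."""
    path = Path(built_file)
    module = path.name.split(".", 1)[0]  # text before the first dot
    return module + path.suffix          # keep only the last extension


print(plain_module_name("pyrealsense2.cpython-36m-x86_64-linux-gnu.so"))
```

For the example name this prints pyrealsense2.so.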

That’s it. Now I was able to use the realsense SDK in my conda environment via

source activate IIS_lab
>>> import pyrealsense2 as rs

Please feel free to comment and share this if you think it was helpful.

Happy coding!

Parsing TFRecords with the Tensorflow Dataset API

Update: Datasets are now part of the example in the Tensorflow library.

The Datasets API has become the new standard in feeding things into Tensorflow. Moreover, there seem to be plans to deprecate queues and other inputs, unifying the way data is fed into models. The idea now is to (1) create a Dataset object (in this case a TFRecordDataset) and then (2) create an Iterator that will extract elements and feed them into the model.
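In code, the two steps look roughly like this. This is a minimal sketch that assumes TensorFlow 2.x with eager execution, where a dataset can be iterated like any Python iterable (in TensorFlow 1.x one would call dataset.make_one_shot_iterator() and get_next() instead). The feature names "image_raw" and "label" and the function name are placeholders for an MNIST-style record.

```python
def first_example(filename):
    """Return the first decoded (image, label) pair from a TFRecord file."""
    # Imported inside the function so that merely loading this sketch does
    # not require TensorFlow to be installed.
    import tensorflow as tf

    # Assumed record layout: raw uint8 pixels plus an int64 label.
    feature_spec = {
        "image_raw": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }

    def decode(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        image = tf.io.decode_raw(parsed["image_raw"], tf.uint8)
        return image, tf.cast(parsed["label"], tf.int32)

    # (1) create the Dataset object
    dataset = tf.data.TFRecordDataset(filename).map(decode)
    # (2) create an iterator and extract one element from it
    iterator = iter(dataset)
    return next(iterator)
```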

I’ve modified tensorflow’s example on “how to read data” to reflect that change and submitted a PR to the tensorflow repo. Until it gets merged, take a look at the new code below. It is a lot easier to read, see for yourself:


Extract the Windows Product Key from a running Windows Machine

I always had the suspicion that Windows saves the used product key in some way. Today I learned that it does so in a very simple manner: converted into hex and stored in a registry value called DigitalProductId. The catch is that it doesn’t use UTF-8, ASCII or another standard encoding, but rather some “home brew”.

The registry location is:

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\DigitalProductId

Searching the web, I came across this handy script (found here), which I copied into a gist (see below). It reads out the registry, converts the value and then displays the resulting product key in human-readable form. (Assuming product keys can be considered human-readable.)
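As a rough illustration of what such a conversion can look like, here is a Python sketch of the base-24 decoding that is widely circulated for older Windows versions. It is not necessarily what the script above does: the byte offset (52), the length (15 bytes) and the alphabet are assumptions taken from those circulated scripts, and newer Windows versions are reported to need a modified scheme.

```python
ALPHABET = "BCDFGHJKMPQRTVWXY2346789"  # assumed 24-character key alphabet
KEY_OFFSET = 52                        # assumed start of the encoded key
KEY_LENGTH = 15                        # assumed number of encoded bytes


def decode_product_key(digital_product_id: bytes) -> str:
    """Decode the DigitalProductId bytes into a 5x5 product key string."""
    # Treat the 15 bytes as one little-endian integer and repeatedly divide
    # it by 24; each remainder selects one character of the key.
    key_bytes = list(digital_product_id[KEY_OFFSET:KEY_OFFSET + KEY_LENGTH])
    decoded = []
    for _ in range(25):
        remainder = 0
        for j in range(KEY_LENGTH - 1, -1, -1):  # most significant byte first
            remainder = remainder * 256 + key_bytes[j]
            key_bytes[j] = remainder // 24
            remainder %= 24
        decoded.append(ALPHABET[remainder])
    key = "".join(reversed(decoded))  # remainders come out last digit first
    return "-".join(key[i:i + 5] for i in range(0, 25, 5))
```

On Windows, the raw bytes could be read with the standard-library winreg module (winreg.QueryValueEx on the key given above) and passed to this function.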

pyØMQ bind / connect vs. pub / sub

In zmq one is told that it doesn’t matter which side of the communication “binds” to a socket and which side “connects”; rather, it should be the “stable” side that “binds”. However, for the publisher / subscriber (pub/sub) pattern it does matter, at least in pyzmq.

More precisely, the order in which the subscriber and publisher are initialized correlates with which side should bind or connect.

Let’s look at the following 4 cases:

                          First: PUB, Second: SUB   First: SUB, Second: PUB
PUB: bind, SUB: connect   works (1)                 works (2)
PUB: connect, SUB: bind   works (3)                 doesn’t work (fix) (4)

Case 1

This case works. However, if the publisher starts sending messages while the subscriber is still connecting they are lost. This is known as the slow-joiner-symptom.
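A minimal pyzmq sketch of this setup (PUB binds first, SUB connects second) is shown below. The short sleep before sending is a crude workaround for the slow joiner and not a robust synchronization scheme; the function name, the loopback address and the timings are arbitrary choices for illustration.

```python
import time


def pub_sub_roundtrip(message=b"hello", settle_seconds=0.2):
    """Send one message from a bound PUB socket to a connected SUB socket."""
    # Imported inside the function so that merely loading this sketch does
    # not require pyzmq to be installed.
    import zmq

    context = zmq.Context()
    pub = context.socket(zmq.PUB)
    sub = context.socket(zmq.SUB)
    try:
        # First: the publisher binds.
        port = pub.bind_to_random_port("tcp://127.0.0.1")
        # Second: the subscriber connects and subscribes to everything.
        sub.setsockopt(zmq.SUBSCRIBE, b"")
        sub.connect(f"tcp://127.0.0.1:{port}")
        # Give the subscription time to reach the publisher; anything sent
        # before that would be dropped (the slow joiner).
        time.sleep(settle_seconds)
        pub.send(message)
        # Wait up to one second for the message to arrive.
        if sub.poll(timeout=1000):
            return sub.recv()
        return None
    finally:
        pub.close(linger=0)
        sub.close(linger=0)
        context.term()
```

Calling pub_sub_roundtrip() should return the message that was sent; with settle_seconds=0 the message may be lost and the function may return None.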

Case 2

This case simply works. It also is the “preferred” way of setting up a PUB / SUB with zmq.

Case 3

Now this case is a bit special, at least in pyzmq. One would expect the slow-joiner-symptom, similar to case 1. However, at least in pyzmq messages are queued on the publisher’s side instead of being thrown away, until a subscriber binds to the address.

Once the subscriber binds to an address, the publisher dumps all the messages it has queued up to the subscriber, even those sent before the connection was established.

Case 4

This case is strange in the very sense of the word. When the publisher connects, it happily starts sending messages as the address is bound. However, the subscriber doesn’t receive anything. Yep, it’s like the publisher doesn’t even exist.

However, if the subscriber polls at least once after the publisher has connected, all subsequent messages will be delivered correctly. This is true even if the publisher has not sent anything yet. (see gist)


While all 4 scenarios can be made to work, one has to be aware of their peculiarities to avoid pitfalls.

If the subscriber binds, one has to keep an eye on the high water mark on the publisher (case 3) and be aware that messages may be ignored until the subscriber tries to receive for the first time (case 4).

If the publisher binds, one has to be aware of the slow-joiner-symptom (case 1).

A Private Docker Registry with SSL on an Offline Docker Swarm

A part of my master’s thesis is to set up a Docker Swarm to parallelize reinforcement learning experiments. For this I needed a registry hosted by the swarm. This is because the swarm is unfortunately offline and I somehow have to distribute images across nodes.

Many tutorials online show how to set up a registry with SSL certificates and authentication using nginx. However, I wanted something a little simpler. Further, I don’t have a domain name that I can set as common name (CN), so I have to use the IP address for the certificate. This has to be added as a SAN (subject alternative name), something that the usual tutorials don’t describe.

Note: “Also as of the Effective Date, the CA SHALL NOT issue a certificate with an Expiry Date later than 1 November 2015 with a subjectAlternativeName extension or Subject commonName field containing a Reserved IP Address or Internal Server Name.” – Baseline Requirements for the Issuance and Management of Publicly-Trusted Certificates, v.1.0. Thus, it is necessary to use self-signed certificates in this scenario.

The process breaks down into 3 simple steps:

  1. Create a self-signed SSL certificate with IP SAN
  2. Set up the registry service using the certificate
  3. Give the Nodes in the Swarm Access to the Certificate


For this post my “swarm” will be a single manager node running in a VM.

The last command gives the IP that has to be named in the certificate. I first set up everything I need on the host machine, then deploy it to the swarm. This is fancy talk for “make a folder with all the good stuff and scp its contents to the VM”: a poor man’s deploy, and as we know, all students are poor.

Create a self-signed SSL certificate with IP SAN

The IP for the node running my registry is: Yours may be different. I wrote a custom openssl.cnf based off the example at /etc/ssl/openssl.cnf and placed it into ~/my_docker_registry_deploy_folder/
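The SAN-specific part of such a config file can be generated with a few lines of Python. This is only a sketch of the relevant fragment, not the complete openssl.cnf: the function name is made up, 192.0.2.10 is a placeholder address, and the subjectAltName line is assumed to go into the extension section that the config’s x509_extensions setting points to.

```python
import ipaddress


def san_section(ip: str) -> str:
    """Return the openssl.cnf lines that add an IP address as SAN."""
    # ip_address() raises ValueError if the string is not a valid IP.
    address = ipaddress.ip_address(ip)
    return (
        "subjectAltName = @alt_names\n"
        "\n"
        "[ alt_names ]\n"
        f"IP.1 = {address}\n"
    )


print(san_section("192.0.2.10"))
```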

Remember that the IP in the last line may differ in your case. With that I could generate a private key and the certificate:

When prompted to enter some information, I left everything blank. Usually the CN has to be equal to the domain name, but in this case the SAN takes care of this. Quickly verify that the SAN is specified:

openssl x509 -in certs/certificate.crt -text -noout

The important line is:

X509v3 Subject Alternative Name:
IP Address:

That’s all for the certificates.

Set up the registry service using the certificate

To make the deployment easy, I wrote a small docker-compose file that starts the registry service:

This launches the registry as a service on port 433. A potential flaw is that this container isn’t constrained to a specific machine. This matters because volumes are not shared between nodes, so all stored images would be lost if the container migrates. In the toy example there is only 1 machine; however, with an actual swarm (that’s the point of this exercise, right?) one would introduce such placement constraints or use a storage solution that migrates with the container.

Time for deployment:

The certificate will be wiped on reboot if we leave it in the home directory. Thus, I place it in a persistent location, which is also the location the container reads it from.

Give the Nodes in the Swarm Access to the Certificate

The registry is ready and set up. However, when pushing or pulling, docker will issue a self-signed certificate error because it cannot verify the certificate. To fix this, each client that wants to interact with the server needs a copy of the certificate. I installed the certificate on each client into


Then, I restarted docker on the client to reload the certificates:

sudo service docker restart

That’s it! Now I can push to this registry just like any other registry.

docker tag registry:2
docker push
docker pull

Thanks for reading and happy coding!