TL;DR: Simple demo of docker checkpoint & restore
Checkpoint the container
It’s not running anymore
Restore the container
It’s running again!!
What is CR and CRIU?
CR (checkpoint and restart) is a technology that saves the memory state of process into files and resume the processes from the saved state. CRIU is a tool originally developed to CR LXC containers.
Since Docker can run LXC containers, we should be able to CR Docker containers by using CRIU. I’ve experimented this before and wrote this post. Unfortunately, the experiment didn’t succeed because CRIU didn’t support Docker very well at that moment.
More than a year has passed since then and CRIU team made a lots of effort to support Docker.
In this post, I will show you how Docker checkpoint/restore works with CRIU and why I’m excited about it with use cases.
Create CRIU vagrant box
CR Docker by using CRIU is still under experiment on libcontainer project, so you need to compile docker with experimental flag enabled. Also, it’s not fully merged into Docker, so you need to use a fork of Docker that one of developers in CRIU team created. In addition to that, you also need to compile Kernel with special kernel module enabled.
It’s not very fun to do all these things, but don’t worry! I’ve created a Vagrant box which has done all these things and uploaded for you! You just need to download the box and run on your local machine. If you are interested in doing these things, I also put an instruction in appendix.
Spin up vagrant box
Assuming that the name of VM is vg-1, run the following commands.
1 2 3 4 5 6
That’s all you have to do to try CRIU!
docker restore/checkpoint commands
Docker running on vg-1 is enabled experimental features and has two commands you’ve never seen before.
1 2 3 4 5 6
1 2 3 4 5 6
restore command use the CRIU command which I compiled and installed on the vagrant box.
1 2 3 4 5 6 7 8 9 10 11 12
Here’s a very simple example of how to CR a Docker container.
Starting a container
1 2 3 4 5 6
This command runs a number-printer container which keeps printing incremental numbers.
Checkpointing a container
Now, let’s checkpoint this container. You can do this with
docker checkpoint command. Because number-printer container is running in the foreground, do this from other terminal.
Once you checkpoint the container, you will see that it stops printing numbers.
docker ps command doesn’t show the container anymore.
Restoring a container
Let’s restore the number-printing container with
docker restore command.
If restoring succeeded, the number-printing container again started printing numbers. It’s also back in the result of
docker ps. We succeeded in checkpointing and restoring the container!
Pause/Unpause VS Checkpoint/Restore
You may say, “Wait, you can already do this with docker pause/unpause commands”.
That’s right and you don’t see a behavior difference between pause/unpause and checkpoint/restore in the previous example. Both commands stop a job of container in the middle and resume later.
The difference is while checkpoint/restart saves the memory state of containers into a disk while pause/unpause doesn’t.
You can think of pause/unpause as sending SIGSTOP and SIGCONT to UNIX process. You can run a process in foreground and hitting Ctrl+z will stop the process. You can also restore the process with
Pause/unpause commands do the similar thing to running containers.
Checkpoint/restore does more complicated things. It dumps the memory state of running containers into the disk and restore the containers by reading the memory dump.
Because checkpoint/restore saves containers into the disk, we can do more interesting things that we cannot do with pause/unpase. We will see these interesting use cases in the next section.
Here are some use cases of CR Docker containers.
(1) Resuming long-running containers
You sometimes want to run tasks that take very long time. For example, if you run a program that calculates digits of Pi on a Docker container, the container need to be long-running to calculate trillion digits.
But what if you accidentally shutdown your host machine? This will shutdown the Pi container and you will lose trillion digits that the container has calculated.
CRIU is a great tool to solve this problem. You can periodically checkpoint the container and be prepared for the accident. If the accident happens, you just need to restore the container and resume the calculation in the middle.
(2) Speeding up slow-start containers
There are applications that take very long time to start. You can speed up such containers by checkpointing the containers after slow applications started.
Here is an example that demonstrates this use case. Let’s assume that redis is super slow to start (This is completely not true in reality!!).
We just start a redis container as usual.
Because redis is slow to start in the world of this blog, we need to wait for 20 secs before redis is ready to accept connections. After wating 20 secs, we checkpoint the container.
We need to restore the redis container into a new container so that you can start multiple containers from the saved container. To do this, you need to use —force=true and pass a new container id.
The started container is immediately ready to accept the connection without waiting 20 secs.
The cool thing is that you can repeat the same process to start multiple containers very fast.
1 2 3 4
The above example theoretically takes 100 secs (20 sec x 5) to finish without CR. With CR, the five containers start in the blink of an eye.
(3) Container migration
You can do the Docker container migration with CRIU.
To see how this works, you need to run two vagrant VMs. You should be already running vg-1, so you need to run one more VM vg-2.
1 2 3 4 5 6 7 8
Once you run the two vagrant VMs, run number-printer container on vg-1.
1 2 3 4 5
You can see that number-printer container is keep printing number in the background.
1 2 3 4 5 6 7
Now let’s migrate the container to vg-2. I made a helper shell script to do this, so you need to download it to your local machine.
You need to pass three arguments to use the script.
The first argument is the name of the container to migrate. In our case, it’s np.
The second and third arguments are path to vagrant directories that you migrate from and to.
Once the script successfully finished, you can go to vg-2 and check with
docker logs np command. You should see that the number-printing container resumes printing numbers from the place that it’s paused on vg-1.
The migration succeeded!
We’ve seen how to CR Docker container with CRIU and some use cases. I hope people will come up more interesting use cases and develop tools that takes the advantages of CR.
Here are useful resources to learn more about CR Docker with CRIU.