
How to Not Be the Engineer Running 3.5GB Docker Images

David McKay

Let’s cut to the chase: you’re adopting a microservice architecture, and you’re planning to use Docker. There’s a reason it is so en vogue – it solves lots and lots of problems and has zero negative effect on our projects, right? As with every tool, technology, or paradigm thrust upon us as we scrappily try to maintain our sanity while jumping from shiny to shiny, we need to learn the gotchas.

To do this, I like to start with a simple question: How might this new shiny bite me on the ass, and what can I do to avoid having teeth marks on my rear? I want to tackle a problem I have seen repeatedly during my consultations with teams/organizations adopting Docker.

Behemoth Docker Images

Woah! Look at the size of that image. Our awesome microservice is 3.5GB! So much for micro.

What on earth is a Docker image anyways?

To understand why our images are big, we need to understand what images are in the first place.

A Docker image is the output of a docker build. The build process runs each of the instructions within a Dockerfile. Each instruction executed creates a layer. Layers encapsulate the file system changes that the instruction has caused. A Docker image is a collection of layers. Let’s look closer so we can describe a Docker image in more detail.

Example:

Assume we’re going to bring Docker into our PHP workflow. In order to run our PHP application, we need a Debian-based system with PHP installed. We’ll need to describe the environment required to run our application within a Docker container.
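A minimal Dockerfile along those lines might look like this (a sketch; the echo message and the php5-cli package are illustrative choices for Debian Jessie):

FROM debian:jessie

# Announce the build; note that this instruction touches no files
RUN echo "Building our PHP image"

# Install PHP from the Debian repositories
RUN apt-get update
RUN apt-get install -y php5-cli

# Serve our application with PHP's built-in web server
CMD ["php", "-S", "0.0.0.0:8080"]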

Super simple. Super declarative. Though completely useless until we build it. The build process takes a Dockerfile and context and produces a Docker image.

The context is the directory that will be sent to the build process to satisfy any file requirements, such as ADD or COPY instructions.
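For example, building from the directory containing our Dockerfile makes that directory the context (the image tag is an arbitrary choice):

$ docker build -t my-php-app .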

So what’s actually going on? What’s inside my Docker image?

It’s a file system. When you run an apt-get install vim, all you’re telling the computer to do is put some files on your hard drive. The Docker image encapsulates that and keeps track of all new / modified / deleted files.

These file system changes are tracked in layers. Each layer is the encapsulation of the file system changes for each instruction in your Dockerfile.

Docker provides a command, docker history, to visualize our Docker images.
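A sketch of its output for our PHP image (the image IDs and sizes are illustrative and will differ on your machine):

$ docker history my-php-app
IMAGE          CREATED         CREATED BY                                      SIZE
0a1b2c3d4e5f   2 minutes ago   /bin/sh -c #(nop)  CMD ["php" "-S" "0.0.0.0…   0B
<missing>      2 minutes ago   /bin/sh -c apt-get install -y php5-cli          40MB
<missing>      2 minutes ago   /bin/sh -c apt-get update                       10MB
<missing>      2 minutes ago   /bin/sh -c echo "Building our PHP image"        0B
<missing>      2 weeks ago     /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      2 weeks ago     /bin/sh -c #(nop) ADD file:… in /               125MB

Two things stand out: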

  1. We have no control over the size of our base image, other than changing the base image.
    This is the “<missing>” layer at the bottom of the list.
  2. Some keywords cost us nothing. Examples include CMD, USER, WORKDIR, etc.

Note: If your command makes no changes to the file system (like our RUN echo “Building …”), a layer is still created. It just has a zero-byte size.

So in order to keep our images micro, we need to keep the output of our layers to a minimum.

Gotchas

1. File Ownership & Permissions

Never, and I mean it, never change the ownership or permissions of a file inside a Dockerfile unless you absolutely NEED to. When you need to, try to modify as few files as possible.

Although comparisons can be made, Docker isn’t like Git. It doesn’t know what changes have happened inside your layer, only which files are affected. This will cause Docker to create a new layer, replicating/replacing the files in full. This can cause your image to double in size if you’re modifying particularly large files, or worse, every file!

Example:
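For instance, a sketch of the anti-pattern (the paths are illustrative):

# Copy our application into the image
COPY . /var/www

# Anti-pattern: this rewrites every file under /var/www into a new
# layer, so the files are now stored twice in the image
RUN chown -R www-data:www-data /var/www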

Tip: If you’re having problems with permissions inside your container, modify them using your entrypoint script, or modify the user id to reflect what you need. Do not modify the files.

Example:

Changing the user-id of www-data to match yours. Tweak as necessary:

RUN usermod -u 1000 www-data

Or run your container with an entrypoint script.
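A sketch of such a script (the LOCAL_UID variable, its default, and the script path are assumptions; adapt them to your setup):

#!/bin/sh
# entrypoint.sh: align www-data's uid with the user that owns our
# mounted code, then hand control to the container's real command
usermod -u "${LOCAL_UID:-1000}" www-data
exec "$@"

Wire it up in your Dockerfile with ENTRYPOINT ["/entrypoint.sh"] and pass the uid at run time, e.g. docker run -e LOCAL_UID=$(id -u) my-php-app.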

2. Clean up after untidy commands

Sometimes commands leave a trail of garbage in their wake and couldn’t care less about the size of your images. We accept this on our desktops and preach “cache” and “performance”. Inside our images, it’s just pure filth.

Example:
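Consider a build along these lines (a trimmed-down sketch):

FROM debian:jessie

RUN apt-get update
RUN apt-get install -y vim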

If you run docker history against this image, you’ll see that our apt-get update layer costs us about 10MB and our apt-get install layer costs us about 30MB. Obviously these are trivial examples, but in larger builds this space will accumulate!

First, let’s examine what each command is doing to our image. To do this, start an interactive container and get a bash shell inside it:

$ docker run -ti --rm --name live debian:jessie bash

You’ll be live inside the innards of a Debian container and at a bash prompt. Next, let’s get a second terminal window open and inspect the container:

$ docker diff live
$

No output. That’s good, because we’ve not done anything yet. docker diff allows us to see what’s changed inside our container. So let’s run our first command:

Note: “$ ” is my local prompt and “root@4552beab7001:/#” is inside the container.

root@4552beab7001:/# apt-get update
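Back in our second terminal, docker diff now has plenty to say (output trimmed; the exact filenames depend on your configured mirrors):

$ docker diff live
C /var
C /var/lib
C /var/lib/apt
C /var/lib/apt/lists
A /var/lib/apt/lists/httpredir.debian.org_debian_dists_jessie_InRelease
A /var/lib/apt/lists/httpredir.debian.org_debian_dists_jessie_main_binary-amd64_Packages.gz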

Oooh, we’ve just discovered where our 10MB is going. Let’s fix it by tweaking our Dockerfile to delete the apt cache after installing vim. Your initial thought may be to simply add a cleanup step.
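Something like this, perhaps, with the cleanup as its own instruction (a sketch):

FROM debian:jessie

RUN apt-get update
RUN apt-get install -y vim
RUN rm -rf /var/lib/apt/lists/*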

Unfortunately, this will only add another layer and won’t affect the previous layers. So although we’re deleting files, the previous layer still contains them. The common trick is to chain our commands at the shell level. This way, the files no longer exist when the RUN finishes, and they never exist in our history.
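Rewriting our sketch with the commands chained:

FROM debian:jessie

RUN apt-get update && \
    apt-get install -y vim && \
    rm -rf /var/lib/apt/lists/*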

Much better 🙂 You can repeat that process for every RUN inside your Dockerfile and really cut the fat out of your image.

Tips

Tip #1.

Create and maintain your own base images, preferably on Alpine! Alpine Linux (http://alpinelinux.org/) is tiny (Under 5MB!) and has a really strong package manager. If you can, use it and keep your base images lean.
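A base image on Alpine might start like this (a sketch; the package name is illustrative):

FROM alpine:3.4

# --no-cache fetches the package index on the fly rather than
# storing it in a layer, so there's nothing to clean up afterwards
RUN apk add --no-cache php5-cli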

Why is creating / maintaining your own base image ideal? Most “official” images are quite bloated and try to be as general as possible. You know what you need. It’s like compiling your own kernel, only not as dangerous 😀

Tip #2.

ONBUILD. Use it. When crafting base images, ONBUILD gives you a great way to reuse the image for both development and production. ONBUILD tells Docker that when the image is used as a base, it should perform some extra instructions, such as the following, which puts our code into the container for a production build.

ONBUILD ADD . /var/www

As this only runs when the image is used as a base, our docker-compose.yml, used for development, can instead mount a volume into the container, getting our code changes in without a rebuild 🙂
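A sketch of the development side of that docker-compose.yml (the service name and paths are assumptions):

services:
  web:
    build: .
    volumes:
      # Mount the working copy over the path that ONBUILD ADD
      # populates in a production build
      - .:/var/www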

Tip #3.

Be careful using community images. They disappear. Often. Fork and maintain your own if it’s mission critical. You’re also putting your trust in the maintainer to protect your attack surface, but that’s a security issue and a post for another time.
