Creating small Docker images for Dataproc Serverless Pyspark projects managed with Poetry
Dataproc Serverless on GCP is Google’s solution for running Apache Spark workloads without provisioning and managing a Dataproc cluster. It currently supports batch and interactive workloads (using Jupyter notebooks), has up-to-date versions of Apache Spark (up to 3.4 at the time of writing), and has some very useful features such as autoscaling.
It supports all of the common languages for writing your Spark applications: Pyspark, R, Spark SQL and Java/Scala. This blog post assumes you are writing your applications in Pyspark (Python).
Poetry
If you’ve never heard of Poetry, check out the official website. Poetry is a dependency manager for Python projects, and can be very useful if your Pyspark project relies on external packages. Poetry is not the focus of this blog post, however; just search the internet for something like `poetry python` and you’ll find lots of good resources on the subject. A good introduction to using Poetry for a Pyspark project can be found here.
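For context, a minimal `pyproject.toml` for a Poetry-managed Pyspark project might look something like the sketch below. The project name and dependencies are made up for illustration; keeping `pyspark` in a dev group is just one way to have it available locally while letting Dataproc provide Spark at runtime.

```toml
# Illustrative pyproject.toml for a Poetry-managed Pyspark project.
# Names, versions and dependencies are placeholders.
[tool.poetry]
name = "my-pyspark-jobs"
version = "0.1.0"
description = "Example Pyspark jobs packaged with Poetry"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"
# External packages your jobs rely on, for example:
requests = "^2.31"

[tool.poetry.group.dev.dependencies]
# pyspark only for local development and tests; Dataproc mounts Spark at runtime
pyspark = "~3.4"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```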
Dataproc Serverless custom Docker image
Dataproc Serverless runs your Spark workloads in Docker containers. There’s a default container image, but it’s not very useful if you have external Python dependencies. That’s why custom images are also supported. Google’s documentation has some useful information to get you started. They list the following requirements for the Docker image:
- Any OS image can be used as a base image, but Debian 11 images (such as `debian:11-slim`) are preferred.
- The image must contain the packages `procps` and `tini`.
- The image requires a `spark` user with a UID and GID of `1099`.
The documentation also has some other interesting information. One thing to note is that you must not include the Spark binaries and the Java Runtime Environment in your container image, because Dataproc mounts these into your container at runtime. This is nice, because it keeps our custom Docker image quite small. Another important point for us is that by default Dataproc mounts Conda to `/opt/dataproc/conda`. But we are not using Conda for dependency management, we’re using Poetry. Luckily, Dataproc allows us to specify which Python environment to use in the container image, by setting the `PYSPARK_PYTHON` environment variable.
There’s an example custom Dockerfile in the documentation, which shows some interesting details. For example, besides `procps` and `tini`, it also installs the package `libjemalloc2`, so I’m assuming that this is also a required package which Google simply forgot to list. We will be using this example for the first version of our Docker image.
Custom Dockerfile, first version
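A minimal sketch of what this first Dockerfile could look like is shown below. It follows the steps described in the breakdown that follows; details such as the `/src` path, the exact Poetry flags and the user-creation commands are assumptions, not a verbatim copy of any official example:

```dockerfile
# Sketch of a first-version Dockerfile (paths, versions and flags are assumptions)
FROM python:3.9-slim

# Packages required by Dataproc Serverless (libjemalloc2 taken from Google's example)
RUN apt-get update && \
    apt-get install -y procps tini libjemalloc2 && \
    rm -rf /var/lib/apt/lists/*

# Install Poetry with pip and use it to install the project's dependencies
# directly into the system Python environment (no virtualenv).
COPY pyproject.toml poetry.lock ./
RUN pip install --no-cache-dir poetry && \
    poetry config virtualenvs.create false && \
    poetry install --no-root --only main   # --only main needs Poetry >= 1.2

# Point Dataproc at the interpreter that now has our dependencies
ENV PYSPARK_PYTHON=/usr/local/bin/python3.9

# Make the job code importable: copy the jobs directory to $PYTHONPATH/jobs
ENV PYTHONPATH=/src
COPY jobs /src/jobs

# Dataproc Serverless requires a spark user with UID and GID 1099
RUN groupadd -g 1099 spark && \
    useradd -u 1099 -g 1099 -m spark
USER spark
```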
Let’s break down this Dockerfile to understand what is going on:
- First of all, even though Google suggests using a Debian 11 base image, we are using `python:3.9-slim` here, which is based on the latest stable version of Debian. It’s very convenient for us to use this image, because it comes with a Python environment pre-installed.
- We copy the `pyproject.toml` and `poetry.lock` files from the root of our repository into the Docker image. These files are needed by Poetry to install the external dependencies exactly as they are specified.
- We use `pip` to install Poetry in the Docker image.
- By setting `virtualenvs.create` to `false`, Poetry will not create a virtual environment, but instead install all dependencies in the system’s Python environment. This is fine as long as you do it inside a Docker container, and for us it makes the usage of the Docker image simpler, but it is normally recommended to always create virtual environments for your projects.
- Because Poetry installed all the dependencies in the system Python environment, we can simply set `PYSPARK_PYTHON` to `/usr/local/bin/python3.9`.
- This example assumes the Pyspark project repository has all the Pyspark code in a `jobs` directory, which we copy as a whole to `$PYTHONPATH/jobs`, in this case `/src/jobs`.
Build the image and push it to Google Artifact Registry. You can then test your Spark job from the command line:
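For example, with placeholder project, region, bucket and image names, building, pushing and submitting a test batch could look something like this:

```bash
# Build and push the image (registry path and tag are placeholders)
docker build -t europe-west1-docker.pkg.dev/my-project/my-repo/pyspark-poetry:v1 .
docker push europe-west1-docker.pkg.dev/my-project/my-repo/pyspark-poetry:v1

# Submit a Dataproc Serverless batch that runs on the custom image.
# The main script (a placeholder on GCS here) can import modules from /src/jobs in the image.
gcloud dataproc batches submit pyspark gs://my-bucket/main.py \
    --project=my-project \
    --region=europe-west1 \
    --container-image=europe-west1-docker.pkg.dev/my-project/my-repo/pyspark-poetry:v1
```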
This first version of our Docker image should work fine, but we can improve on it. Can we make the image smaller? We install Poetry into the image only to run the commands that install the dependencies; after that’s done, Poetry doesn’t serve any purpose for our Spark applications. Is there a way to use Poetry as a dependency manager without shipping it in the final Docker image, where it only takes up unnecessary disk space? Inspired by this StackOverflow post, let’s get to work on our second version.
Custom Dockerfile, second version
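Again, a sketch of what such a multi-stage Dockerfile could look like; the stage name, the `/venv` and `/src` paths and the exact Poetry and pip invocations are assumptions:

```dockerfile
# --- Stage 1: build a virtual environment with the project's dependencies ---
FROM python:3.9-slim AS builder

RUN pip install --no-cache-dir poetry
COPY pyproject.toml poetry.lock ./

# Create a plain virtualenv and pipe Poetry's exported requirements into pip.
# (poetry export ships with Poetry 1.x; Poetry 2.x needs the export plugin.)
RUN python -m venv /venv && \
    poetry export --format requirements.txt --without-hashes | \
    /venv/bin/pip install --no-cache-dir -r /dev/stdin

# --- Stage 2: final image, without Poetry ---
FROM python:3.9-slim

RUN apt-get update && \
    apt-get install -y procps tini libjemalloc2 && \
    rm -rf /var/lib/apt/lists/*

# Copy only the virtual environment from the builder stage and use its interpreter
COPY --from=builder /venv /venv
ENV PYSPARK_PYTHON=/venv/bin/python

ENV PYTHONPATH=/src
COPY jobs /src/jobs

RUN groupadd -g 1099 spark && \
    useradd -u 1099 -g 1099 -m spark
USER spark
```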
The Dockerfile is very similar to the one from the original StackOverflow post. So let’s see what’s going on.
- First of all, this Dockerfile utilizes a multi-stage build, with two separate stages. Stage one creates a Python virtual environment with all of the project’s dependencies, and stage two copies that virtual environment into the final image
- Instead of installing all dependencies into the system Python environment, this time we are creating a virtual environment. This makes it very simple to copy the environment into the final image. All we then have to do is point `PYSPARK_PYTHON` to `/venv/bin/python`.
- We don’t use Poetry to create the virtual environment. Instead, we use Poetry’s `export` command and pipe its output into `pip` to install the dependencies into the virtual environment.
- The final image will not contain Poetry, saving us a couple of hundred MB of disk space.
Build the image and push it to Artifact Registry again. In the GCP UI you can see the image size and compare it to the previous version. Hopefully, the new image will be significantly smaller!
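If you prefer the terminal over the GCP UI, you can also compare the (uncompressed) local sizes of both versions, assuming they were built and tagged under the same repository name (placeholder below):

```bash
# Lists all locally built tags of the repository, including their sizes
docker images europe-west1-docker.pkg.dev/my-project/my-repo/pyspark-poetry
```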
Run into any issues? See mistakes in the examples? Please let me know by dropping me an e-mail.