Get a Fully Configured Apache Airflow Docker Dev Stack with Bitnami
Aug 02, 2020
I've been using Apache Airflow for around two years now to build out custom workflow interfaces, like those used for Laboratory Information Management Systems (LIMS), computer vision pre- and post-processing pipelines, and to set and forget other genomics pipelines.
My favorite feature of Airflow is how completely agnostic it is to the work you are doing or where that work is taking place. It could take place locally, on a Docker image, on Kubernetes, on any number of AWS services, on an HPC system, etc. Using Airflow allows me to concentrate on the business logic of what I'm trying to accomplish without getting too bogged down in implementation details.
During that time I've adopted a set of systems that I use to quickly build out the main development stack with Docker and Docker Compose, using the Bitnami Apache Airflow stack. For production, I either deploy the same Docker Compose stack, if it's a small enough instance that is isolated, or use Kubernetes when I need to interact with other services or file systems.
Bitnami vs Roll Your Own
I used to roll my own Airflow containers using Conda. I still use this approach for most of my other containers, including the microservices that interact with my Airflow system, but configuring Airflow is a lot more than just installing packages. Even just installing those packages is a pain, and I could rarely count on a rebuild working without trouble. Then, on top of the packages, you need to configure database connections and a message queue.
In comes the Bitnami Apache Airflow docker compose stack for dev and Bitnami Apache Airflow Helm Chart for prod!
Bitnami, in their own words:
Bitnami makes it easy to get your favorite open source software up and running on any platform, including your laptop, Kubernetes and all the major clouds. In addition to popular community offerings, Bitnami, now part of VMware, provides IT organizations with an enterprise offering that is secure, compliant, continuously maintained and customizable to your organizational policies. https://bitnami.com/
Bitnami stacks (usually) work completely the same from their Docker Compose stacks to their Helm charts. This means I can test and develop locally using my compose stack, build out new images, versions, packages, etc., and then deploy to Kubernetes. The configuration, environment variables, and everything else behave the same. It would be a fairly large undertaking to do all this from scratch, so I use Bitnami.
They have plenty of enterprise offerings, but everything included here is open source and there is no paywall involved.
And no, I am not affiliated with Bitnami, although I have kids that eat a lot and don't have any particular ethical aversions to selling out. ;-) I've just found their offerings to be excellent.
Grab the Source Code
Everything you need to follow along is included in the post, or you can subscribe to the DevOps for Data Scientists Tutorials newsletter and get the source code delivered to you in a nice zip file.
Project Structure
I like to have my projects organized so that I can run tree and have a general idea of what's happening.
Apache Airflow has three main components: the application, the worker, and the scheduler. Each of these has its own Docker image to separate out the services. Additionally, there is a database and a message queue, but we won't be doing any customization to these.
.
└── docker
└── bitnami-apache-airflow-1.10.10
├── airflow
│ └── Dockerfile
├── airflow-scheduler
│ └── Dockerfile
├── airflow-worker
│ └── Dockerfile
├── dags
│ └── tutorial.py
    └── docker-compose.yml
So what we have here is a directory called bitnami-apache-airflow-1.10.10, which brings us to a very important point: pin your versions! It will save you so, so much pain and frustration!
Then we have one Dockerfile per Airflow piece.
Create this directory structure with:
mkdir -p docker/bitnami-apache-airflow-1.10.10/{airflow,airflow-scheduler,airflow-worker,dags}
The Docker Compose File
This is my preference for the docker-compose.yml file. I made a few changes for my own preferences: I pin versions, build my own Docker images, and add volume mounts for the dags, plugins, and database backups, along with mounting the Docker socket so I can run DockerOperators from within my stack.
You can always go and grab the original docker-compose file here.
version: '2'
services:
  postgresql:
    image: 'docker.io/bitnami/postgresql:10-debian-10'
    volumes:
      - 'postgresql_data:/bitnami/postgresql'
    environment:
      - POSTGRESQL_DATABASE=bitnami_airflow
      - POSTGRESQL_USERNAME=bn_airflow
      - POSTGRESQL_PASSWORD=bitnami1
      - ALLOW_EMPTY_PASSWORD=yes
  redis:
    image: docker.io/bitnami/redis:5.0-debian-10
    volumes:
      - 'redis_data:/bitnami'
    environment:
      - ALLOW_EMPTY_PASSWORD=yes
  airflow-scheduler:
    # image: docker.io/bitnami/airflow-scheduler:1-debian-10
    build:
      context: airflow-scheduler
    environment:
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_EXECUTOR=CeleryExecutor
      # If you'd like to load the example DAGs change this to yes!
      - AIRFLOW_LOAD_EXAMPLES=no
      # only works with 1.10.11
      #- AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=true
      #- AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    volumes:
      - airflow_scheduler_data:/bitnami
      - ./plugins:/opt/bitnami/airflow/plugins
      - ./dags:/opt/bitnami/airflow/dags
      - ./db_backups:/opt/bitnami/airflow/db_backups
      - /var/run/docker.sock:/var/run/docker.sock
  airflow-worker:
    # image: docker.io/bitnami/airflow-worker:1-debian-10
    build:
      context: airflow-worker
    environment:
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_EXECUTOR=CeleryExecutor
      - AIRFLOW_LOAD_EXAMPLES=no
      # only works with 1.10.11
      #- AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=true
      #- AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    volumes:
      - airflow_worker_data:/bitnami
      - ./plugins:/opt/bitnami/airflow/plugins
      - ./dags:/opt/bitnami/airflow/dags
      - ./db_backups:/opt/bitnami/airflow/db_backups
      - /var/run/docker.sock:/var/run/docker.sock
  airflow:
    # image: docker.io/bitnami/airflow:1-debian-10
    build:
      # You can also specify the build context
      # as cwd and point to a different Dockerfile
      context: .
      dockerfile: airflow/Dockerfile
    environment:
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_EXECUTOR=CeleryExecutor
      - AIRFLOW_LOAD_EXAMPLES=no
      # only works with 1.10.11
      #- AIRFLOW__WEBSERVER__RELOAD_ON_PLUGIN_CHANGE=True
      #- AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    ports:
      - '8080:8080'
    volumes:
      - airflow_data:/bitnami
      - ./dags:/opt/bitnami/airflow/dags
      - ./plugins:/opt/bitnami/airflow/plugins
      - ./db_backups:/opt/bitnami/airflow/db_backups
      - /var/run/docker.sock:/var/run/docker.sock
volumes:
  airflow_scheduler_data:
    driver: local
  airflow_worker_data:
    driver: local
  airflow_data:
    driver: local
  postgresql_data:
    driver: local
  redis_data:
    driver: local
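Because the compose file mounts /var/run/docker.sock into the scheduler and worker containers, tasks can drive the host's Docker daemon directly. As a hedged sketch of what that looks like (the DAG id, image, and command below are placeholders I made up, and the DockerOperator's docker Python dependency may need to be added to your custom images, which is covered in the Build Custom Airflow Docker Containers section), a DAG using it could look roughly like this:
# dags/docker_example.py -- a sketch, not part of the original stack
from datetime import timedelta

from airflow import DAG
# In Airflow 1.10.x the DockerOperator lives here; it needs the `docker`
# Python package installed in the scheduler/worker images to actually run.
from airflow.operators.docker_operator import DockerOperator
from airflow.utils.dates import days_ago

dag = DAG(
    'docker_example',
    default_args={'owner': 'airflow', 'retries': 1,
                  'retry_delay': timedelta(minutes=5)},
    start_date=days_ago(1),
    schedule_interval=None,
)

# Runs a throwaway container against the host Docker daemon through the
# /var/run/docker.sock mount from the compose file above
hello_from_docker = DockerOperator(
    task_id='hello_from_docker',
    image='alpine:3.12',  # placeholder image
    command='echo "hello from inside a container"',
    docker_url='unix://var/run/docker.sock',
    auto_remove=True,
    dag=dag,
)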
Pin your versions
The version of Apache Airflow used here is 1.10.10. The 1.10.11 release has some cool updates I would like to incorporate, so I will keep an eye on it!
You can always keep up with the latest Apache Airflow versions by checking out the changelog on the main site.
We are using Bitnami, which has bots that automatically build and update their images as new releases come along.
While that approach is great for keeping images fresh, I strongly recommend against just hoping that the latest version will be backwards compatible and work with your setup.
Instead, pin a version, and when a new version comes along test it out in your dev stack. At the time of writing the most recent version is 1.10.11, but it doesn't quite work out of the box, so we are using 1.10.10.
Bitnami Apache Airflow Docker Tags
Generally speaking, a docker tag corresponds to the application version. Sometimes there are other variants as well, such as base OS. Here we can just go with the application version.
- Bitnami Apache Airflow Scheduler Image Tags
- Bitnami Apache Airflow Worker Image Tags
- Bitnami Apache Airflow Web Image Tags
Build Custom Images
In our docker-compose file we have placeholders in order to build custom images.
We'll just create a minimal Dockerfile for each one for now. Later I'll show you how to customize your Docker containers with extra system or Python packages.
Airflow Application
echo "FROM docker.io/bitnami/airflow:1.10.10" > docker/bitnami-apache-airflow-1.10.10/airflow/Dockerfile
This will give you the following airflow application Dockerfile.
FROM docker.io/bitnami/airflow:1.10.10
Airflow Scheduler
echo "FROM docker.io/bitnami/airflow-scheduler:1.10.10" > docker/bitnami-apache-airflow-1.10.10/airflow-scheduler/Dockerfile
This will give you the following airflow scheduler Dockerfile.
FROM docker.io/bitnami/airflow-scheduler:1.10.10
Airflow Worker
echo "FROM docker.io/bitnami/airflow-worker:1.10.10" > docker/bitnami-apache-airflow-1.10.10/airflow-worker/Dockerfile
This will give you the following airflow worker Dockerfile.
FROM docker.io/bitnami/airflow-worker:1.10.10
Bring Up The Stack
Grab the docker-compose file above and let's get rolling!
cd code/docker/bitnami-apache-airflow-1.10.10
# Bring it up in foreground
docker-compose up
# Bring it up in the background
# docker-compose up -d
If this is your first time running the command this will take some time. Docker will fetch any images it doesn't already have, and build all the airflow-* images.
Navigate to the UI
Once everything is up and running, navigate to the UI at http://localhost:8080.
Unless you changed the configuration, the default username/password is user/bitnami.
Login to check out your Airflow web UI!
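If you'd rather sanity-check the stack from a script than from the browser, here's a small sketch using the requests library against Airflow 1.10's /health endpoint. This assumes the stack is running on localhost:8080 and that the health endpoint is left unauthenticated, as it is in a default setup.
# health_check.py -- quick sanity check against the local dev stack
import requests

# The /health endpoint reports the status of the metadatabase and scheduler
resp = requests.get("http://localhost:8080/health", timeout=10)
resp.raise_for_status()
print(resp.json())
# Expected shape: {"metadatabase": {"status": "healthy"}, "scheduler": {...}}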
Add in a Custom DAG
Here's a DAG that I grabbed from the Apache Airflow Tutorial. I've only included it here for the sake of completeness.
from datetime import timedelta

# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
    # 'wait_for_downstream': False,
    # 'dag': dag,
    # 'sla': timedelta(hours=2),
    # 'execution_timeout': timedelta(seconds=300),
    # 'on_failure_callback': some_function,
    # 'on_success_callback': some_other_function,
    # 'on_retry_callback': another_function,
    # 'sla_miss_callback': yet_another_function,
    # 'trigger_rule': 'all_success'
}

dag = DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

t2 = BashOperator(
    task_id='sleep',
    depends_on_past=False,
    bash_command='sleep 5',
    retries=3,
    dag=dag,
)

dag.doc_md = __doc__

t1.doc_md = """\
#### Task Documentation
You can document your task using the attributes `doc_md` (markdown),
`doc` (plain text), `doc_rst`, `doc_json`, `doc_yaml` which gets
rendered in the UI's Task Instance Details page.
![img](http://montcs.bloomu.edu/~bobmon/Semesters/2012-01/491/import%20soul.png)
"""

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7)}}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    depends_on_past=False,
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag,
)

t1 >> [t2, t3]
Anyways, grab this file and put it in your code/docker/bitnami-apache-airflow-1.10.10/dags folder. The name of the file itself doesn't matter; the DAG name will be whatever you set in the file.
Airflow will pick up the new DAG file automatically, and if you refresh the UI you should see your new tutorial DAG listed.
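If you'd rather confirm it from a script, Airflow 1.10 also ships an experimental REST API. The sketch below assumes the default [api] auth backend, which accepts unauthenticated requests; if your setup differs these calls may return 401s.
# check_tutorial_dag.py -- a sketch against Airflow 1.10's experimental REST API
import requests

BASE = "http://localhost:8080/api/experimental"

# A 200 response here means Airflow has parsed the tutorial DAG
task_info = requests.get(BASE + "/dags/tutorial/tasks/print_date", timeout=10)
print(task_info.status_code, task_info.json())

# Trigger a manual run (an empty JSON body is accepted)
run = requests.post(BASE + "/dags/tutorial/dag_runs", json={}, timeout=10)
print(run.status_code, run.json())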
Build Custom Airflow Docker Containers
If you'd like to add additional system or Python packages, you can do so.
# code/docker/bitnami-apache-airflow-1.10.10/airflow/Dockerfile
FROM docker.io/bitnami/airflow:1.10.10

# From here - https://github.com/bitnami/bitnami-docker-airflow/blob/master/1/debian-10/Dockerfile
USER root
RUN apt-get update && apt-get upgrade -y && \
    apt-get install -y vim && \
    rm -r /var/lib/apt/lists /var/cache/apt/archives

RUN bash -c "source /opt/bitnami/airflow/venv/bin/activate && \
    pip install flask-restful && \
    deactivate"
To be clear, I don't especially endorse this approach anymore, except that I like to add flask-restful for creating custom REST API plugins.
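As a rough illustration of why I bother with flask-restful, here is a hypothetical, minimal REST endpoint plugin sketch of my own (the file name, class names, and URL are placeholders, not part of the stack). It would go in the plugins folder that the compose file mounts into the containers.
# plugins/hello_api.py -- a hypothetical minimal REST plugin sketch
from airflow.plugins_manager import AirflowPlugin
from flask import Blueprint
from flask_restful import Api, Resource

# A plain Flask blueprint; Airflow registers plugin blueprints at startup
hello_bp = Blueprint("hello_api", __name__, url_prefix="/api/v1/hello")
hello_api = Api(hello_bp)


class Hello(Resource):
    def get(self):
        return {"message": "hello from a custom Airflow plugin"}


hello_api.add_resource(Hello, "/")


class HelloApiPlugin(AirflowPlugin):
    name = "hello_api_plugin"
    flask_blueprints = [hello_bp]

Once the webserver picks the plugin up (a restart is needed before 1.10.11), the endpoint should answer at http://localhost:8080/api/v1/hello/.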
I like to treat Apache Airflow the way I treat web applications. I've been burned too many times, so now my web apps take care of routing and rendering views, and absolutely nothing else.
Airflow is about the same, except it handles the business logic of my workflows and absolutely nothing else. If I have some crazy pandas/tensorflow/opencv/whatever stuff I need to do I'll build that into a separate microservice and not touch my main business logic. I like to think of Airflow as the spider that sits in the web.
Still, I'm paranoid enough that I like to build my own images so I can then push them to my own docker repo.
Wrap Up and Where to go from here
Now that you have your foundation, it's time to build out your data science workflows! Add some custom DAGs, create some custom plugins, and generally build stuff.
If you'd like to get the full picture of all my Apache Airflow tips and tricks, including:
- Best practices for separating out your business logic from your Airflow Application
- Build out custom plugins with REST APIs and Flask Blueprints
- Deploy to production with Helm
- Patch your Airflow instance to use CORS to build out interfaces with React, Angular, or another system.
- CI/CD scripts to build and deploy your custom docker images.
Please check out the Apache Airflow Project Lab.
Cheat Sheet
Here are some hopefully helpful commands and resources.
Log into your Apache Airflow Instance
The default username and password is user and bitnami.
Docker Compose Commands
Build
cd code/docker/bitnami-apache-airflow-1.10.10/
docker-compose build
Bring up your stack! Running docker-compose up makes all your logs come up on STDERR/STDOUT.
cd code/docker/bitnami-apache-airflow-1.10.10/
docker-compose build && docker-compose up
If you'd like to run it in the background instead, use -d.
cd code/docker/bitnami-apache-airflow-1.10.10/
docker-compose build && docker-compose up -d
Bitnami Apache Airflow Configuration
You can further customize your Airflow instance using environment variables that you pass into the docker-compose file. Check out the Bitnami README for details.
Load DAG files
Custom DAG files can be mounted to /opt/bitnami/airflow/dags.
Specifying Environment variables using Docker Compose
version: '2'
services:
  airflow:
    image: bitnami/airflow:latest
    environment:
      - AIRFLOW_FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
      - AIRFLOW_EXECUTOR=CeleryExecutor
      - AIRFLOW_DATABASE_NAME=bitnami_airflow
      - AIRFLOW_DATABASE_USERNAME=bn_airflow
      - AIRFLOW_DATABASE_PASSWORD=bitnami1
      - AIRFLOW_PASSWORD=bitnami123
      - AIRFLOW_USERNAME=user
      - [email protected]
Clean up after Docker
Docker can take up a lot of room on your filesystem.
If you'd like to clean up just the Airflow stack then:
cd code/docker/bitnami-apache-airflow-1.10.10
docker-compose stop
docker-compose rm -f -v
Running docker-compose rm -f forcibly removes the stack's stopped containers, and the -v flag removes the associated data volumes as well.
Remove all docker images everywhere
This will stop all running containers, remove them, and delete all images.
docker container stop $(docker container ls -aq)
docker system prune -f -a
This will remove all containers AND data volumes
docker system prune -f -a --volumes