Apache Airflow Tutorial – Part 1 Introduction

Tags: apache airflow, distributed computing, docker, python | Mar 09, 2019

What is Apache Airflow?

Briefly, Apache Airflow is a workflow management system (WMS). It groups tasks into analyses and defines the logic for when these analyses should be run. Then it gives you all kinds of amazing logging, reporting, and a nice graphical view of your analyses. I'll let you hear it directly from the folks at Apache Airflow:

Apache Airflow is a platform to programmatically author, schedule and monitor workflows.

Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Source - https://airflow.apache.org/

Analyses in Airflow are Directed Acyclic Graphs (DAGs). DAG is the more precise term, but it is not as widely used as 'analysis' or 'workflow', so you will see these terms used interchangeably.

 

Why use Apache Airflow?

As usual, I can tell you why I use Apache Airflow. A large part of my job is running analyses. These analyses can have complex steps and dependencies, and the actual outcomes need to be tracked as well. I need a robust set of logic for triggering these workflows, either from the command line or a web interface. I have a few that are triggered as cron-like jobs, but for now most of my Airflow tasks are triggered directly.

When you use Airflow, you get a very nice framework along with a scheduling system, a web UI, and very robust logging, all out of the box!

Cool! How do I get started?

I found the conceptual platform of Airflow to be very intuitive, but I have been working in the High Performance Computing (HPC) space for longer than I am willing to admit. Oh, how time flies! It's an amazing and incredibly well designed platform, but it does have a bit of a learning curve. In particular, the initial setup is not for the faint of heart, but hopefully I have taken care of some of that for you. Additionally, Airflow itself has excellent documentation and numerous examples. Before we dive into the hands-on aspects, let's cover the different components of Airflow.

Airflow Components

Directed Acyclic Graphs - DAGs

The DAG is the grouping of your tasks, or even a single task, along with its scheduling logic. Here is a super minimal DAG example: one task, scheduled to run once per day, starting 2019-01-01. I like to think of the DAG as my analysis blueprint.

A word of caution here: if you are looking at the Airflow website, many of the example DAGs have start dates back in 2015. The scheduler will automatically go and backfill your tasks, which could result in a LOT of task runs for a toy example.

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta


default_args = {
	'owner': 'airflow',
	'depends_on_past': False,
	# Beginning 2019-01-01 ...
	'start_date': datetime(2019, 1, 1),
	'email': ['[email protected]'],
	'email_on_failure': False,
	'email_on_retry': False,
	'retries': 1,
	'retry_delay': timedelta(minutes=5),
}

# Run this DAG once per day
dag = DAG('tutorial', default_args=default_args, schedule_interval=timedelta(days=1))


t1 = DummyOperator(
	task_id='dummy_task',
	dag=dag)
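
Since this example only has one task, there is nothing for the scheduler to order. As a quick sketch of how dependencies are declared, continuing the example above, you add more operators to the same dag and chain them; the second task and its task_id are made up for illustration.

# Illustrative second task; not part of the minimal example above.
t2 = DummyOperator(
	task_id='another_dummy_task',
	dag=dag)

# Declare that t1 must finish before t2 starts.
# The bitshift syntax is shorthand for t1.set_downstream(t2).
t1 >> t2

The scheduler will then only run t2 once t1 has completed successfully.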


Operator

The Operator is the set of instructions for HOW your task is going to be executed. Note, defining an operator does not execute the task. All operators inherit from the BaseOperator and take task_id and dag; the other parameters are specific to the Operator itself. I will briefly cover 3 of these operators below, but there are a ton, and you can make your own custom operators!

 

Dummy Operator

The DummyOperator is really not used for anything but illustrative purposes. It doesn't actually execute any code or run any tasks.

from airflow.operators.dummy_operator import DummyOperator
t1 = DummyOperator(
	task_id='dummy_task',
	dag=dag)

Bash Operator

The BashOperator gives the instructions for executing, you guessed it, bash commands! Notice that the BashOperator takes the bash_command parameter as well as task_id and dag.

from airflow.operators.bash_operator import BashOperator
templated_command = """
echo 'hello world'
"""
t3 = BashOperator(
	task_id='templated',
	bash_command=templated_command,
	# Tasks must be associated to a dag to run
	dag=dag)
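
The task above is named 'templated' because bash_command is a Jinja-templated field. Here is a hedged sketch of what that buys you; the {{ ds }} macro is Airflow's built-in execution date, while the variable name and task_id below are made up for illustration.

from airflow.operators.bash_operator import BashOperator

# Airflow renders the Jinja template before handing the command to bash.
# {{ ds }} expands to the execution date, e.g. 2019-01-01.
date_command = """
echo 'execution date is {{ ds }}'
"""
t4 = BashOperator(
	task_id='echo_execution_date',
	bash_command=date_command,
	dag=dag)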

Python Operator

I love explicit naming conventions. Do you see the pattern here? It's beautiful. Instead of the bash_command we had in the BashOperator, we have python_callable.

from airflow.operators.python_operator import PythonOperator


def do_some_stuff_task(ds, **kwargs):
	pass


do_some_stuff_op = PythonOperator(
	task_id='do_some_stuff_task',
	provide_context=True,
	dag=dag,
	python_callable=do_some_stuff_task,
)
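
Because provide_context=True, Airflow passes the template context, things like ds (the execution date) and the task instance, into your callable as keyword arguments. Here is a small sketch of a callable that actually uses them; the names and print statements are just illustrative.

from airflow.operators.python_operator import PythonOperator


def print_context_task(ds, **kwargs):
	# ds is the execution date string; kwargs holds the rest of the context.
	print('execution date: {}'.format(ds))
	print('task instance: {}'.format(kwargs.get('task_instance')))


print_context_op = PythonOperator(
	task_id='print_context_task',
	provide_context=True,
	python_callable=print_context_task,
	dag=dag,
)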


Sensors

Sensors are Operators that continually poll some condition until it is met. You could be waiting for a file to appear on a filesystem or for an HTTP request to complete.
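
As a concrete sketch, Airflow (1.10 at the time of writing) ships a FileSensor in its contrib package that pokes the filesystem on an interval until a file shows up; the file path and task_id below are made up for illustration, and it relies on the default fs_default filesystem connection.

from airflow.contrib.sensors.file_sensor import FileSensor

# Poke every 60 seconds, for up to an hour, waiting for the file to appear.
# Uses the default 'fs_default' filesystem connection.
wait_for_file = FileSensor(
	task_id='wait_for_data_file',
	filepath='/tmp/data_ready.txt',
	poke_interval=60,
	timeout=60 * 60,
	dag=dag)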

Tasks

Tasks are the STUFF. They define what is actually executed. Let's take the PythonOperator example above: the do_some_stuff_task function is the code that gets executed. This is a silly example because nothing is actually happening, but you get the idea.

from airflow.operators.python_operator import PythonOperator


def do_some_stuff_task(ds, **kwargs):
	"""Actual code executed"""
	pass


do_some_stuff_op = PythonOperator(
	# STUFF: the callable below is the code that actually runs
	python_callable=do_some_stuff_task,
	# task_id and dag are still required, just like in the example above
	task_id='do_some_stuff_task',
	dag=dag,
)

Wrap Up

Airflow is a complex system, but understanding DAGs, Operators and Tasks should be enough to get you going. Check out Part 2 to get your Airflow development environment up and running with Docker.
