Apache Airflow Tutorial – Part 1 Introduction
Mar 09, 2019

What is Apache Airflow?
Briefly, Apache Airflow is a workflow management system (WMS). It groups tasks into analyses and defines a logical template for when those analyses should be run. Then it gives you all kinds of amazing logging, reporting, and a nice graphical view of your analyses. I'll let you hear it directly from the folks at Apache Airflow:
Apache Airflow is a platform to programmatically author, schedule and monitor workflows.
Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Source - https://airflow.apache.org/
Analyses in Airflow are called Directed Acyclic Graphs (DAGs). 'DAG' is the more precise term, but it is not as widely used as 'analysis' or 'workflow', so you will see these terms used interchangeably.
Why use Apache Airflow?
As usual, I can tell you why I use Apache Airflow. A large part of my job is running analyses. These analyses can have complex steps and dependencies, and their outcomes need to be tracked as well. I need robust logic for triggering these workflows, either from the command line or a web interface. I have a few that are triggered as cron-like jobs, but for now most of my Airflow tasks are triggered directly.
When you use Airflow, you get a very nice framework along with a scheduling system, a web UI, and very robust logging, all out of the box!
Cool! How do I get started?
I found the conceptual platform of Airflow to be very intuitive, but I have been working in the High Performance Computing (HPC) space for longer than I am willing to admit. Oh, how time flies! It's an amazing and incredibly well-designed platform, but it does have a bit of a learning curve. In particular, the initial setup is not for the faint of heart, but hopefully I have taken care of some of that for you. Additionally, Airflow itself has excellent documentation and numerous examples. Before we dive into the hands-on aspects, let's cover the different components of Airflow.
Airflow Components
Directed Acyclic Graphs - DAGs
The DAG is the grouping of your tasks, or even a single task, along with its scheduling logic. I like to think of it as my analysis blueprint. Here is a super minimal DAG example: one task, scheduled to run once per day, starting 2019-01-01.
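(The sketch below is illustrative rather than lifted from any particular project; the DAG id, task id, and the choice of a BashOperator are my own placeholders.)

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A minimal DAG: run once per day, starting 2019-01-01.
dag = DAG(
    dag_id="my_minimal_dag",          # illustrative name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# A single task that just echoes a message.
say_hello = BashOperator(
    task_id="say_hello",              # illustrative name
    bash_command="echo 'Hello from Airflow!'",
    dag=dag,
)
```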
A word of caution here: if you are looking at the examples on the Airflow website, many of the DAGs have start dates back in 2015. The scheduler will automatically go and backfill your tasks from the start date, which could result in a LOT of task runs for a toy example.
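If you want to avoid that backfill while experimenting, one option (a sketch, assuming a reasonably recent Airflow 1.x where the DAG-level catchup flag is available) is to turn catchup off on the DAG:

```python
from datetime import datetime

from airflow import DAG

# With catchup=False the scheduler only creates runs from the most recent
# schedule interval onward, instead of backfilling every interval since
# the start_date. The DAG id here is an illustrative placeholder.
dag = DAG(
    dag_id="my_minimal_dag",
    start_date=datetime(2015, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)
```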