Apache Airflow: Dask Executor
Airflow is a Python-based workflow management tool suited for scheduled jobs and data pipeline orchestration. Airflow’s development lifecycle is rapid, and as a result the core technologies Airflow integrates with have changed over time. At the heart of Airflow is a concept called the executor; the choice of executor determines how well Airflow can distribute and scale its workloads. Airflow offers a mix of local and distributed executors, but today’s focus is on one of Airflow’s relatively new distributed executors: the Dask Executor.
Dask is a software project that aims to natively scale Python. Dask’s architecture has two primary components: a scheduler and its workers. A welcome feature of Dask is its ability to run locally or as a distributed cluster of workers. At the enterprise level this ability presents two architecture-based use cases: 1) a cost-effective Airflow staging environment with the airflow webserver, airflow scheduler, dask-scheduler, and dask-worker processes (ideally services) all running on one host, and 2) a production, microservices-based layout that separates those same services according to resource usage.
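As a purely illustrative sketch of those two layouts (the host names are invented for the example):
# Use case 1: staging — everything colocated on one host
staging-host:    airflow webserver, airflow scheduler, dask-scheduler, dask-worker
# Use case 2: production — services split by resource profile
airflow-host:    airflow webserver, airflow scheduler, dask-scheduler
worker-host-01:  dask-worker
worker-host-02:  dask-worker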
When compared to the recently deprecated Mesos Executor, the Dask Executor has shown immediate performance improvements with respect to 1) job queuing speed and 2) parallel workload distribution. Dask’s work stealing ensures that underutilized workers can relieve over-utilized ones in a configurable, real-time manner.
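Work stealing is enabled by default in dask.distributed, and it is tunable. As a minimal sketch (assuming a recent dask.distributed release; confirm the key path against your installed version), a ~/.config/dask/distributed.yaml on the scheduler host could pin the behavior explicitly:
distributed:
  scheduler:
    work-stealing: True
The equivalent environment variable (DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING) is handy for temporarily toggling it off when debugging task placement.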
Standing Things Up
Following base installation steps in the Airflow documentation hands you a local Airflow webserver/scheduler running Airflow’s Sequential Executor. Note that moving to a distributed Airflow executor requires an external datastore (mysql, postgres, etc.). Details on setting up a database for Airflow are out of scope… but once you have an external database available, it is time to integrate with Dask.
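For orientation, pointing Airflow at that external database is a one-line change in airflow.cfg. The sketch below assumes a 1.10-era config layout (where sql_alchemy_conn lives under [core]) and uses placeholder credentials and hostname:
[core]
# SqlAlchemy connection string to the external metadata database (placeholder values)
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@db-host:5432/airflow
Remember to initialize the schema afterwards (airflow initdb on 1.10, airflow db init on newer releases).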
In addition to Airflow’s normal dependencies, the Dask Executor requires Dask to be installed on the Airflow scheduler and Airflow worker nodes. Linux package managers like apt and yum can install Dask, but beware: packages from older OS variants may be a major version behind the latest release.
pip3 install "dask[complete]"
For simplicity, let’s assume we are setting up the colocated Airflow + Dask staging environment described earlier in the post. Navigate to where pip3 installed the Dask binaries to find two command-line utilities: dask-scheduler and dask-worker.
Starting the Dask scheduler locally can be performed by executing:
dask-scheduler --host 127.0.0.1 --port 8786 --pid-file /var/run/dask/dask-scheduler.pid
Starting a Dask worker locally can be performed by executing:
dask-worker 127.0.0.1:8786 --pid-file /var/run/dask/dask-worker.pid
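Before wiring Airflow in, it is worth confirming that the scheduler and worker can see each other. A quick sanity check from a Python shell (a minimal sketch, assuming dask[complete] pulled in dask.distributed):
from dask.distributed import Client

# Connect to the local Dask scheduler and print what it knows about its workers
client = Client("127.0.0.1:8786")
print(client.scheduler_info())
client.close()
If the worker registered correctly, scheduler_info() will list it along with its available cores and memory.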
Only one line in the [dask] section of airflow.cfg is required to hook the Dask Executor up to the Dask cluster we just created: cluster_address.
[dask]
# This section only applies if you are using the DaskExecutor in
# [core] section above
# The IP address and port of the Dask cluster's scheduler.
cluster_address = 127.0.0.1:8786
At this point you should be equipped to start/restart the Airflow services now configured with the Dask Executor; example restart commands follow the config snippet below.
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, KubernetesExecutor
executor = DaskExecutor
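For reference, on a 1.10-era install the restart can be as simple as the commands below (the -D flag daemonizes each process; adapt accordingly if you run these under systemd or another service manager):
airflow webserver -D
airflow scheduler -D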
The move to a distributed Dask cluster is relatively straightforward, too: establish the dask-scheduler service at any reachable IP address (e.g., 9.9.9.9) and configure the dask worker(s) to report there.
dask-scheduler --host 9.9.9.9 --port 8786 --pid-file /var/run/dask/dask-scheduler.pid
dask-worker 9.9.9.9:8786 --pid-file /var/run/dask/dask-worker.pid
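The only Airflow-side change in this layout is pointing cluster_address at the remote scheduler (9.9.9.9 again standing in for your own host):
[dask]
cluster_address = 9.9.9.9:8786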
Parting Thoughts
Ultimately, a few notable observations presented themselves:
The Dask scheduler proved to be a lightweight process, so pairing it with the Airflow scheduler, as described in use case #2 above, worked well.
The benefits of switching to the Airflow Dask Executor were instant and impactful.
Happy Airflowing (with Dask)!