Dask Tips and Tricks – HighLevelGraphs

Oct 24, 2019
Dask is an open source project in Python that allows you to scale your code on your laptop or on a cluster. Not only does it have a very clear syntax, it also lets you declare your order of operations in a data structure. This is a feature I was very interested in, as it tends to be the use case I am tasked with most often. It's cool stuff!
For those of you who have written MPI, this is kind of like that, except you don't have to write MPI! If you would like to know more about basic Dask syntax, check out my blog post on Parallelizing For Loops with Dask.

Dask Syntax

Normally when using Dask you wrap dask.delayed around a function call, and then, once all of those calls are queued up, you tell Dask to compute your results. This is great, and I really like this syntax, but what about when you are fed a list of tasks and need to somehow feed these to Dask?
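If you haven't seen that delayed style before, here is a minimal sketch of it (assuming the local scheduler; see the post linked above for the details):

import dask
from dask import delayed


@delayed
def add(x, y):
    # A normal function, wrapped so calls are queued lazily
    return x + y


# Queue up the work...
total = add(add(1, 2), 3)

# ...then tell Dask to actually run it.
print(total.compute())  # 6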

That is where a HighLevelGraph comes in!

Dask HighLevelGraphs

Dask HighLevelGraphs allow you to define a data structure that is essentially a series of jobs. Each one of those jobs has one or more tasks. You can also think of each job as being a bucket for its tasks. Each task in a job can be executed in parallel, meaning tasks within a job must not depend upon one another!

[Figure: Dask HighLevelGraph jobs and tasks]
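Under the hood, each task inside one of those buckets uses Dask's standard task format: a tuple whose first element is a callable and whose remaining elements are its arguments. If an argument is a key that appears elsewhere in the graph, Dask substitutes that task's result before calling the function. A minimal sketch with the plain synchronous scheduler (the inc name here is just for illustration):

import dask


def inc(a):
    return a + 1


# A raw Dask graph: each task is (function, *args).
# 'x' is a key in the graph, so it is replaced by its value (1)
# before inc runs.
dsk = {
    'x': 1,
    'y': (inc, 'x'),
}

print(dask.get(dsk, 'y'))  # 2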

Then you define your job dependencies and COMPUTE!

Here, task-2 depends upon task-1.

We can also have Dask draw this out for us using graphviz (more on that below).

[Figure: Dask dependency graph]

Let's see some code!

Now that we've laid down the foundations, let's execute some code!

from dask.highlevelgraph import HighLevelGraph
from dask.distributed import Client

# Get the scheduler from the docker compose service 'scheduler'
client = Client('scheduler:8786')

###########################################################
# Example 1 - A Simple Graph
# with two jobs and two tasks per job
###########################################################


def task1(arg1):
    return arg1


def task2(arg2):
    return arg2


# Layers: each job ('task-1', 'task-2') maps task names to task tuples
layers = {
    'task-1': {
        'task-1-1': (task1, 'arg1'),
        'task-1-2': (task1, 'arg2')
    },
    'task-2': {
        'task-2-1': (task2, 'arg1'),
        'task-2-2': (task2, 'arg2')
    }
}

# Dependencies: task-2 cannot start until task-1 has finished
dependencies = {
    'task-1': set(),
    'task-2': {'task-1'}
}

graph = HighLevelGraph(layers, dependencies)

graph.visualize(filename='graph-1.svg')
print(client.get(graph, 'task-2-2'))

We have our layers (our jobs plus their tasks) and our dependencies. Once we have those, the world is ours!

Calling graph.visualize() gets you the image I showed earlier, which maps out all of your dependencies.

From there, call client.get(graph, 'name-of-task'), and it will get you the result from that task! 
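And if I am reading the distributed docs right, client.get will also take a list of keys, so you can pull back several results in one call (using the keys from the example above):

# Fetch several task results at once by passing a list of keys.
results = client.get(graph, ['task-1-1', 'task-2-1', 'task-2-2'])
print(results)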
