Dask Tips and Tricks – HighLevelGraphs

Oct 24, 2019
Dask is an open source project in Python that allows you to scale your code on your laptop or on a cluster. Not only does it have a very clear syntax, it also lets you declare your order of operations in a data structure. This is a feature I was very interested in, as it tends to be the use case I am tasked with most often. It's cool stuff!
For those of you who have written MPI, this is kind of like that, except you don't have to write MPI! If you would like to know more about basic Dask syntax, check out my blog post on Parallelizing For Loops with Dask.

Dask Syntax

Normally when using Dask you wrap dask.delayed around a function call, and then, once all of those calls are queued up, you tell Dask to compute your results. This is great, and I really like this syntax, but what about when you are fed a list of tasks and need to somehow feed these to Dask?
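If you haven't seen that delayed style before, here is a minimal sketch of it (assuming the local scheduler; see the post linked above for the details):

import dask
from dask import delayed


@delayed
def add(x, y):
    # A normal function, wrapped so calls are queued lazily
    return x + y


# Queue up the work...
total = add(add(1, 2), 3)

# ...then tell Dask to actually run it.
print(total.compute())  # 6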

That is where a HighLevelGraph comes in!

Dask HighLevelGraphs

Dask HighLevelGraphs allow you to define a data structure that is essentially a series of jobs. Each one of those jobs has one or more tasks. You can also think of each job as being a bucket for its tasks. Each task in a job can be executed in parallel, meaning tasks within a job must not depend upon one another!

[Figure: Dask HighLevelGraph jobs and tasks]
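Under the hood, each task inside one of those buckets uses Dask's standard task format: a tuple whose first element is a callable and whose remaining elements are its arguments. If an argument is a key that appears elsewhere in the graph, Dask substitutes that task's result before calling the function. A minimal sketch with the plain synchronous scheduler (the inc name here is just for illustration):

import dask


def inc(a):
    return a + 1


# A raw Dask graph: each task is (function, *args).
# 'x' is a key in the graph, so it is replaced by its value (1)
# before inc runs.
dsk = {
    'x': 1,
    'y': (inc, 'x'),
}

print(dask.get(dsk, 'y'))  # 2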

Then you define your job dependencies and COMPUTE!

Here, task-2 depends upon task-1.

We can also have Dask draw this out for us using graphviz (more on that below).

[Figure: Dask dependency graph]

Let's see some code!

Now that we've laid down the foundations, let's execute some code!

from dask.highlevelgraph import HighLevelGraph
from dask.distributed import Client

# Get the scheduler from the docker compose service 'scheduler'
client = Client('scheduler:8786')

###########################################################
# Example 1 - A Simple Graph
# with two jobs and two tasks per job
###########################################################


def task1(arg1):
    return arg1


def task2(arg2):
    return arg2


# Layers: each job ('task-1', 'task-2') maps task names to task tuples
layers = {
    'task-1': {
        'task-1-1': (task1, 'arg1'),
        'task-1-2': (task1, 'arg2')
    },
    'task-2': {
        'task-2-1': (task2, 'arg1'),
        'task-2-2': (task2, 'arg2')
    }
}

# Dependencies: task-2 cannot start until task-1 has finished
dependencies = {
    'task-1': set(),
    'task-2': {'task-1'}
}

graph = HighLevelGraph(layers, dependencies)

graph.visualize(filename='graph-1.svg')
print(client.get(graph, 'task-2-2'))

We have our layers (our jobs plus their tasks) and our dependencies. Once we have those, the world is ours!

Calling graph.visualize() gets you the image I showed earlier, which maps out all of your dependencies.

From there, call client.get(graph, 'name-of-task'), and it will get you the result from that task! 
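And if I am reading the distributed docs right, client.get will also take a list of keys, so you can pull back several results in one call (using the keys from the example above):

# Fetch several task results at once by passing a list of keys.
results = client.get(graph, ['task-1-1', 'task-2-1', 'task-2-2'])
print(results)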
