Dask Development Log

This work is supported by Continuum Analytics the XDATA Program and the Data Driven Discovery Initiative from the Moore Foundation

To increase transparency I’m blogging weekly about the work done on Dask and related projects during the previous week. This log covers work done between 2016-12-11 and 2016-12-18. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.

Themes of last week:

Benchmarking new scheduler and worker on larger systems
Kubernetes and Google Container Engine
Fastparquet on S3

Rewriting Load Balancing

In the last two weeks we rewrote a significant fraction of the worker and scheduler. This enables future growth, but also resulted in a loss of our load balancing and work stealing algorithms (the old one no longer made sense in the context of the new system.) Careful dynamic load balancing is essential to running atypical workloads (which are surprisingly typical among Dask users) so rebuilding this has been all-consuming this week for me personally.

Briefly, Dask initially assigns tasks to workers taking into account the expected runtime of the task, the size and location of the data that the task needs, the duration of other tasks on every worker, and where each piece of data sits on all of the workers. Because the number of tasks can grow into the millions and the number of workers can grow into the thousands, Dask needs to figure out a near-optimal placement in near-constant time, which is hard. Furthermore, after the system runs for a while, uncertainties in our estimates build, and we need to rebalance work from saturated workers to idle workers relatively frequently. Load balancing intelligently and responsively is essential to a satisfying user experience.

We have a decently strong test suite around these behaviors, but it’s hard to be comprehensive on performance-based metrics like this, so there has also been a lot of benchmarking against real systems to identify new failure modes. We’re doing what we can to create isolated tests for every failure mode that we find to make future rewrites retain good behavior.

Generally working on the Dask distributed scheduler has taught me the brittleness of unit tests. As we have repeatedly rewritten internals while maintaining the same external API our testing strategy has evolved considerably away from fine-grained unit tests to a mixture of behavioral integration tests and a very strict runtime validation system.

Rebuilding the load balancing algorithms has been high priority for me personally because these performance issues inhibit current power-users from using the development version on their problems as effectively as with the latest release. I’m looking forward to seeing load-balancing humming nicely again so that users can return to git-master and so that I can return to handling a broader base of issues. (Sorry to everyone I’ve been ignoring the last couple of weeks).

Test deployments on Google Container Engine

I’ve personally started switching over my development cluster from Amazon’s EC2 to Google’s Container Engine. Here are some pro’s and con’s from my particular perspective. Many of these probably have more to do with how I use each particular tool rather than intrinsic limitations of the service itself.

In Google’s Favor

Native and immediate support for Kubernetes and Docker, the combination of which allows me to more quickly and dynamically create and scale clusters for different experiments.
Dynamic scaling from a single node to a hundred nodes and back ten minutes later allows me to more easily run a much larger range of scales.
I like being charged by the minute rather than by the hour, especially given the ability to dynamically scale up
Authentication and billing feel simpler

In Amazon’s Favor

I already have tools to launch Dask on EC2
All of my data is on Amazon’s S3
I have nice data acquisition tools, s3fs, for S3 based on boto3. Google doesn’t seem to have a nice Python 3 library for accessing Google Cloud Storage :(

I’m working from Olivier Grisel’s repository docker-distributed although updating to newer versions and trying to use as few modifications from naive deployment as possible. My current branch is here. I hope to have something more stable for next week.

Fastparquet on S3

We gave fastparquet and Dask.dataframe a spin on some distributed S3 data on Friday. I was surprised that everything seemed to work out of the box. Martin Durant, who built both fastparquet and s3fs has done some nice work to make sure that all of the pieces play nicely together. We ran into some performance issues pulling bytes from S3 itself. I expect that there will be some tweaking over the next few weeks.