├── requirements.txt ├── depth-first.png ├── breadth-first.png ├── dask-datarevenue.png ├── CONTRIBUTING.md ├── Makefile ├── make.bat ├── fullspectrum.md ├── template.md ├── index.rst ├── icecube-cosmic-rays.md ├── hydrologic-modeling.md ├── datarevenue.md ├── network-modeling.md ├── sidewalk-labs.md ├── conf.py ├── pangeo.md ├── prefect-workflows.md ├── mosquito-sequencing.md ├── satellite-imagery.md └── README.md /requirements.txt: -------------------------------------------------------------------------------- 1 | myst-parser 2 | sphinx 3 | dask-sphinx-theme>=3.0.5 4 | -------------------------------------------------------------------------------- /depth-first.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-stories/HEAD/depth-first.png -------------------------------------------------------------------------------- /breadth-first.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-stories/HEAD/breadth-first.png -------------------------------------------------------------------------------- /dask-datarevenue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-stories/HEAD/dask-datarevenue.png -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | Dask is a community maintained project. We welcome contributions in the form of bug reports, documentation, code, design proposals, and more. 2 | 3 | For general information on how to contribute see https://docs.dask.org/en/latest/develop.html. 4 | 5 | ## Project specific notes 6 | 7 | This project contains stories of how people are using Dask, you can find instructions on adding your story here https://github.com/dask/dask-stories#how-to-share-your-story. 8 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = DaskStories 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=. 11 | set BUILDDIR=_build 12 | set SPHINXPROJ=DaskStories 13 | 14 | if "%1" == "" goto help 15 | 16 | %SPHINXBUILD% >NUL 2>NUL 17 | if errorlevel 9009 ( 18 | echo. 19 | echo.The 'sphinx-build' command was not found. 
Make sure you have Sphinx 20 | echo.installed, then set the SPHINXBUILD environment variable to point 21 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 22 | echo.may add the Sphinx directory to PATH. 23 | echo. 24 | echo.If you don't have Sphinx installed, grab it from 25 | echo.http://sphinx-doc.org/ 26 | exit /b 1 27 | ) 28 | 29 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 30 | goto end 31 | 32 | :help 33 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 34 | 35 | :end 36 | popd 37 | -------------------------------------------------------------------------------- /fullspectrum.md: -------------------------------------------------------------------------------- 1 | Full Spectrum: Credit and Banking 2 | ================================= 3 | 4 | Who am I? 5 | --------- 6 | 7 | My name is [Hussain Sultan](https://www.linkedin.com/in/hussainsultan/). 8 | I am a partner at [Full Spectrum 9 | Analytics](https://www.fullspectrumanalytics.com/). I create personalized 10 | analytics software within banks for the sake of equitable and profitable 11 | decision making. 12 | 13 | 14 | What problem am I trying to solve? 15 | ---------------------------------- 16 | 17 | Lending businesses create and manage valuations and cashflow models that output 18 | the profitability expectations for customer segments. These models are complex 19 | because they form a network of equations that need to be scored efficiently and 20 | keep track of inputs/outputs at scale. 21 | 22 | 23 | How Dask helps 24 | -------------- 25 | 26 | Dask is instrumental in my work for creating efficient cashflow model 27 | management systems and general data science enablement on data lakes. 28 | 29 | Dask provides a way to construct the dependencies of cashflow equations as a 30 | DAG (using the [dask.delayed](https://docs.dask.org/en/latest/delayed.html) 31 | interface) and provides a good developer experience for building 32 | scoring/gamification/model tracking applications. 33 | 34 | 35 | Why I chose Dask originally 36 | --------------------------- 37 | 38 | I chose dask for three reasons: 39 | 40 | 1. It was lightweight 41 | 2. The granular task scheduling approach to scaling both dataframes and 42 | arbitrary computations fit my use case well 43 | 3. It is easy to scale my team with Python programmers 44 | 45 | 46 | Some of the pain points of using Dask in our problem 47 | ---------------------------------------------------- 48 | 49 | It's hard to get organization buy-in to adopt an open-source technology without 50 | vendored support and enterprise SLAs. 51 | 52 | In a recent project, we had to integrate with the Orc data format that turned 53 | out to be more expensive than I originally anticipated (compounded by 54 | enterprise hadoop set-up and encryption requirements). These changes have 55 | since been upstreamed though, and so things are easier now. 56 | 57 | 58 | Some of the technology that we use around Dask 59 | ---------------------------------------------- 60 | 61 | We deployed on generic internal server with Jenkins scheduling a Jupyter 62 | notebook to execute. We built everything out using our internal analytics 63 | platform. We didn't have to worry about security because everything was behind 64 | a corporate firewall. 65 | -------------------------------------------------------------------------------- /template.md: -------------------------------------------------------------------------------- 1 | Template 2 | ======== 3 | 4 | Who am I? 
5 | --------- 6 | 7 | A brief description of who you are, and the name of the project for which you 8 | use Dask. 9 | 10 | 11 | The Problem I'm trying to solve 12 | ------------------------------- 13 | 14 | Include context and detail here about the problem that you're trying to solve. 15 | Details are *very* welcome here. You're probably writing to someone within 16 | your own field so feel free to use technical speech. 17 | 18 | You shouldn't mention Dask here yet; focus on your problem instead. Why is it 19 | important? Why is it hard? Who does this problem affect? 20 | 21 | 22 | How Dask helps 23 | -------------- 24 | 25 | Describe how Dask helps you to solve this problem. Again, details are welcome. 26 | New readers probably won't know about specific API like "we use client.scatter" 27 | but probably will be able to follow terms used as headers in documentation like 28 | "we used dask dataframe and the futures interface together". 29 | 30 | We also encourage you to mention how your use of Dask has *changed* over time. 31 | What originally drew you to the project? Is that still why you use it or has 32 | your perception or needs changed? 33 | 34 | 35 | Pain points when using dask 36 | --------------------------- 37 | 38 | Dask has issues and it's not always the right solution for every problem. What 39 | are things that you ran into that you think others in your field should know 40 | ahead of time? 41 | 42 | 43 | Technology I use around Dask 44 | ---------------------------- 45 | 46 | This might be other libraries that you use with Dask for analysis or data 47 | storage, cluster technologies that you use to deploy or capture logs, etc.. 48 | Anything that you think someone like you might want to use alongside Dask. 49 | 50 | 51 | Other information 52 | ----------------- 53 | 54 | Is there something else that didn't fit into the sections above? Feel free to 55 | make your own. 56 | 57 | 58 | Links 59 | ----- 60 | 61 | Links and images throughout the document are great. You may want to list links 62 | again here. This might be links to your company or project, links to blogposts 63 | or notebooks that you've written about the topic, or links to relevant source 64 | code. Anything that someone who was interested in your story could use to 65 | learn more. 66 | 67 | We also strongly encourage you to include images. These might be output 68 | results from your analyses, diagrams showing your architecture, or anything 69 | that helps to convey who your group is, and the kind of work that you're doing. 70 | -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | Dask Use Cases 2 | ============== 3 | 4 | Dask is a versatile tool that supports a variety of workloads. 5 | This page contains brief and illustrative examples of how people use Dask in practice. 6 | These emphasize breadth and hopefully inspire readers to find new ways 7 | that Dask can serve them beyond their original intent. 8 | 9 | .. toctree:: 10 | :maxdepth: 1 11 | 12 | sidewalk-labs.md 13 | mosquito-sequencing.md 14 | fullspectrum.md 15 | icecube-cosmic-rays.md 16 | pangeo.md 17 | hydrologic-modeling.md 18 | network-modeling.md 19 | satellite-imagery.md 20 | prefect-workflows.md 21 | datarevenue.md 22 | 23 | Overview 24 | -------- 25 | 26 | Dask uses can be roughly divided in the following two categories: 27 | 28 | 1. 
Large NumPy/Pandas/Lists with 29 | `Dask Array `_, 30 | `Dask DataFrame `_, 31 | `Dask Bag `_, 32 | to analyze large datasets with familiar techniques. 33 | This is similar to Databases, Spark_, or big array libraries 34 | 35 | 2. Custom task scheduling. You submit a graph of functions that depend on 36 | each other for custom workloads. This is similar to Luigi_, Airflow_, 37 | Celery_, or Makefiles_ 38 | 39 | Most people today approach Dask assuming it is a framework like Spark, designed 40 | for the first use case around large collections of uniformly shaped data. 41 | However, many of the more productive and novel use cases fall into the second 42 | category where Dask is used to parallelize custom workflows. 43 | 44 | In the real-world applications above we see that people end up using both 45 | sides of Dask to achieve novel results. 46 | 47 | Contributing 48 | ------------ 49 | 50 | If you solve interesting problems with Dask then we want you to share your 51 | story. Hearing from experienced users like yourself can help newcomers quickly 52 | identify the parts of Dask and the surrounding ecosystem that are likely to be 53 | valuable to them. 54 | 55 | Stories are collected as pull requests to `github.com/dask/dask-stories 56 | `_. You may wish to read a few of the 57 | stories above to get a sense for the typical level of information. There is a 58 | template in the repository with suggestions, but you can also structure your 59 | story a different way. 60 | 61 | .. toctree:: 62 | :maxdepth: 1 63 | 64 | template.md 65 | 66 | .. _Airflow: https://airflow.apache.org/ 67 | .. _Luigi: https://luigi.readthedocs.io/en/latest/ 68 | .. _Celery: http://www.celeryproject.org/ 69 | .. _Spark: https://spark.apache.org/ 70 | .. _Makefiles: https://en.wikipedia.org/wiki/Make_(software) 71 | -------------------------------------------------------------------------------- /icecube-cosmic-rays.md: -------------------------------------------------------------------------------- 1 | IceCube: Detecting Cosmic Rays 2 | ============================== 3 | 4 | Who am I? 5 | --------- 6 | 7 | I'm [James Bourbeau](https://github.com/jrbourbeau), I'm a graduate student in 8 | the Physics department at the University of Wisconsin at Madison. I work at 9 | the [IceCube South Pole Neutrino Observatory](https://icecube.wisc.edu/) 10 | studying the cosmic-ray energy spectrum. 11 | 12 | 13 | What problem am I trying to solve? 14 | ---------------------------------- 15 | 16 | Cosmic rays are energetic particles that originate from outer space. While they 17 | have been studied since the early 1900s, the sources of high-energy cosmic rays 18 | are still not well known. I analyze data collected by IceCube to study how the 19 | cosmic-ray spectrum changes with energy and particle mass; this can help provide 20 | valuable insight into our understanding of the origin of cosmic rays. 21 | 22 | This involves developing algorithms to perform energy reconstruction as well 23 | as particle mass group classification for events detected by IceCube. In 24 | addition, we use detector simulation and an iterative unfolding algorithm to 25 | correct for inherit detector biases and the finite resolution of our 26 | reconstructions. 27 | 28 | 29 | How Dask Helps us 30 | ----------------- 31 | 32 | I originally chose to use Dask because of the 33 | [Dask Array](https://docs.dask.org/en/latest/array.html) and 34 | [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html) data 35 | structures. 
I use Dask Dataframe to load thousands of 36 | [HDF](https://www.hdfgroup.org/) files and then apply further feature 37 | engineering and filtering data preprocessing steps. The final dataset can be 38 | up to 100GB in size, which is too large to load into our available RAM. So 39 | being able to easily distribute this load while still using the familiar 40 | pandas API has become invaluable in my research. 41 | 42 | Later I discovered the 43 | [Dask delayed](https://docs.dask.org/en/latest/delayed.html) iterface and now 44 | use it to parallelize code that doesn't easily conform to the Dask Array or 45 | Dask Dataframe use cases. For example, I often need to perform thousands of 46 | independent calculations for the pixels in a HEALPix sky map. I've found Dask 47 | delayed to be really useful for parallelizing these types of embarrassingly 48 | parallel calculations with minimal hassle. 49 | 50 | I also use several of the 51 | [diagnostic tools](https://docs.dask.org/en/latest/diagnostics-local.html) 52 | Dask offers such as the progress bar and resource profiler. Working in a large 53 | collaboration with shared computing resources, it's great to be able to 54 | monitor how many resources I'm using and scale back or scale up accordingly. 55 | 56 | 57 | Pain points of using Dask 58 | ------------------------- 59 | 60 | There were two main pain points I encountered when first using Dask: 61 | 62 | - Getting used to the idea of lazy computation. While this isn't an issue that 63 | is specific to Dask, it was something that took time to get used to. 64 | 65 | - Dask is a fairly large project with many components and it took some time to 66 | figure out how all the various pieces fit together. Luckily, the user 67 | documentation for Dask is quite good and I was able to get over this initial 68 | learning curve. 69 | 70 | 71 | Technology that we use around Dask 72 | ---------------------------------- 73 | 74 | We store our data in HDF files, which Dask has nice read and write support 75 | for. We also use several other Python data stack tools like Jupyter, 76 | scikit-learn, matplotlib, seaborn, etc. Recently, we've started experimenting 77 | with using HTCondor and the 78 | [Dask distributed scheduler](https://distributed.dask.org/en/latest/) to 79 | scale up to using hundreds of workers on a cluster. 80 | -------------------------------------------------------------------------------- /hydrologic-modeling.md: -------------------------------------------------------------------------------- 1 | NCAR: Hydrological Modeling 2 | =========================== 3 | 4 | Who am I? 5 | --------- 6 | 7 | I am [Joe Hamman](http://joehamman.com/about/) and I am a Project Scientist in the [Computational Hydrology Group](https://ncar.github.io/hydrology/) at the [National Center for Atmospheric Research](https://ncar.ucar.edu/). I am a core developer of the [Xarray](http://Xarray.pydata.org) project and a contributing member of the [Pangeo](http://pangeo-data.org/) project. I study subjects in the areas of climate change, hydrology, and water resource engineering. 8 | 9 | 10 | What problem am I trying to solve? 11 | ---------------------------------- 12 | 13 | Climate change will bring widespread impacts to the hydrologic cycle. 
We know this because many research studies, conducted over the past two decades, have shown what the first order effects of climate change will look like in managed and natural hydrologic systems in terms of things like water availability, drought, wildfire, extreme precipitation, and floods. However, we don't have a very good understanding of the characteristic uncertainties that come from our choice of tools that we use to estimate these changes. 14 | 15 | In the field of hydroclimatology, the tools we use are numerical models of the climate and hydrologic systems. These models can be constructed in many ways and it is often difficult understand how specific choices we make when building a model impact the inferences we can draw from them (e.g. the impact of climate change on flood frequency). We are working on methods to expose and constrain methodological uncertainties in the climate impacts modeling paradigm for water resource applications. This includes developing and analyzing large ensembles of climate projections and interrogating these ensembles to understand specific sources of uncertainty. 16 | 17 | 18 | How does Dask help? 19 | ------------------- 20 | 21 | The climate and hydrologic sciences rely heavily on data stored in formats like HDF5 and NetCDF. We often use [Xarray](http://xarray.pydata.org) as an interface and friendly data model for these formats. Under the hood, Xarray uses either NumPy or Dask arrays. This allows us to scale the same Xarray computations we would typically do in-memory using NumPy, to larger tasks using Dask. 22 | 23 | In my own research, I use Dask and Xarray as the core computational libraries for working with large datasets (10s-100s of terabytes). Often the operations we do with these datasets are fairly simple reduction operations where we may compare the average climate from two periods. 24 | 25 | Why we chose Dask 26 | ----------------- 27 | 28 | When working on scientific analysis tasks, I don't want to think about parallelizing my code. We chose to work with Dask because `Dask.array` was nearly drop-in-compatible with NumPy. This meant that adopting Dask, inside or outside of Xarray, was much easier than adopting another parallel framework. Along those same lines, Dask is well integrated with other key parts of the scientific Python stack, including Pandas, Scikit-Learn, etc. 29 | 30 | Pain points 31 | ----------- 32 | 33 | Originally deploying Dask on HPC systems was a bit of a pain. But this has 34 | gotten much easier. 35 | 36 | Additionally while Dask is easy to use, it's also easy to break. The freedom 37 | it provides also means that you have the freedom to shoot yourself in the foot. 38 | 39 | Also diagnosing performance issues can be more complex than when just using 40 | Numpy. It's still a bit of an art rather than a science. 
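To make the kind of computation described above concrete, here is a minimal sketch of a two-period climate comparison with Xarray backed by Dask arrays, with Dask's local diagnostics attached; the file pattern, variable name, years, and chunk size are hypothetical, not our actual configuration:

```python
import xarray as xr
from dask.diagnostics import ProgressBar, ResourceProfiler, visualize

# Lazily open a multi-file NetCDF dataset; "chunks" makes Xarray back it with Dask arrays.
ds = xr.open_mfdataset("runoff_*.nc", chunks={"time": 365})

# Mean climate over two (hypothetical) periods, then the difference between them.
historical = ds["runoff"].sel(time=slice("1971", "2000")).mean("time")
future = ds["runoff"].sel(time=slice("2071", "2100")).mean("time")
change = future - historical

# Nothing has been computed yet; trigger the work and record progress and resource usage.
with ProgressBar(), ResourceProfiler(dt=0.25) as rprof:
    result = change.compute()

visualize([rprof])  # renders a resource-usage timeline (requires bokeh)
```

When running with the distributed scheduler instead, the Dask dashboard provides a similar (and richer) view of what the workers are doing.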
41 | 42 | Technology we use around Dask 43 | ----------------------------- 44 | 45 | - We use [Xarray](https://xarray.pydata.org) to provide a higher level (and familiar) interface around Numpy arrays and Dask arrays 46 | - We use NetCDF and HDF files for data storage 47 | - I mostly work on HPC systems and have been helping develop the [dask-jobqueue](https://dask-jobqueue.readthedocs.io) package for deploying Dask on job queueing systems 48 | - In the [Pangeo](https://pangeo-data.github.io) project, we're exploring Dask applications using Kubernetes and Jupyter notebooks 49 | 50 | Other thoughts 51 | -------------- 52 | 53 | I'm quite interested in enabling more intuitive scientific analysis workflows, particularly when parallelization is required. Dask has been a big part of our efforts to facilitate a "beginning-to-end" workflow pattern on large datasets. 54 | -------------------------------------------------------------------------------- /datarevenue.md: -------------------------------------------------------------------------------- 1 | Discovering biomarkers for rare diseases in blood samples 2 | ========================================================= 3 | 4 | ![](dask-datarevenue.png) 5 | 6 | ## Who am I? 7 | I am Markus Schmitt and I am CEO at [Data Revenue](https://www.datarevenue.com/). We build custom machine learning solutions in a number of fields, including everything from medicine to car manufacturing. 8 | 9 | ## What problem am I trying to solve? 10 | It's hard for doctors to diagnose rare diseases based on blood tests. Usually patients are subjected to expensive and prolonged semi-manual gene testing. 11 | 12 | We are analysing thousands of blood samples, comparing sick and healthy patients. We look for biomarkers, compounds in the blood, such as iron, and these help doctors to identify people who have rare diseases (and those who do not). 13 | Currently, this is done offline, on historical samples, but with more work this could be used in real-time too: to analyse patients' blood and give them feedback faster and more cheaply than is currently possible. 14 | 15 | ## How does Dask help? 16 | We started the project without Dask, writing our own custom multiprocessing functionality. This was a burden to maintain, and Dask made it simple to switch over to thinking at a directed acyclic graph (DAG) level. It was great to stop thinking about individual cores. 17 | 18 | Dask has allowed us to run all of our analysis in parallel, shortening the overall feedback loop and letting us get results faster. 19 | We've found Dask to be extremely flexible. We have used it extensively to help with our distributed analysis, but Dask adds value for us in simpler cases too. We have systems which revolve around user-submitted jobs, and we can use Dask to help schedule these, whether or not 'big data' is involved. 20 | 21 | ## Why did I choose Dask? 22 | After it became clear that we were wasting significant time maintaining our custom multiprocessing code, we considered several alternatives before choosing Dask. We specifically considered 23 | - [Apache Flink](https://flink.apache.org/): we found this not only lacking some functionality that we needed, but also very complex to set up and maintain. 24 | - [Apache Spark](https://spark.apache.org/): we similarly found this to be a time sink in terms of set up and maintenance. 
25 | - [Apache Hadoop](https://hadoop.apache.org/): we found the MapReduce framework too restrictive, and it felt like we were always pushing round pegs into square holes. 26 | 27 | Ultimately, we chose Dask because it offered a great balance of simplicity and power. It is flexible enough to let us do nearly everything we need, but simple enough to not put a maintenance burden on our team. 28 | We use the Dask scheduler along with the higher level APIs. Our data fits well into a table structure, so the DataFrame API provides a lot of value for us. 29 | 30 | ## Pain points 31 | Dask has largely done exactly what we want, but we have had some memory issues. Sometimes, because of the higher layer of abstraction, it's not easy to predict exactly what data will be loaded into memory. 32 | In some cases, Dask loads far more data into memory than we expected, and we have to add custom logic to delay the processing of some tasks. These tasks are then only run when previous tasks have completed and memory is freed up. 33 | Because we work with medical data, we also have a strong focus on security and compliance. We found the Dask integration with Kubernetes to be lacking some features we needed around this. Specifically we had to apply some manual SSL patches to ensure that data was always transferred over SSL. 34 | But overall, wherever Dask has fallen short of our needs, it has been easy for us to patch these. Dask's architecture makes it easy to understand and change as needed. 35 | 36 | ## Technology I use around Dask 37 | We run Dask on Kubernetes on AWS, with batch processing managed by [Luigi](https://luigi.readthedocs.io/en/stable/). We use scikit-learn for machine learning, and [Dash](https://www.datarevenue.com/ml-tools/dash) for interactive GUIs and Dashboards. 38 | 39 | ## Anything else to know? 40 | We use Dask in many of our projects, from smaller ones all the way up to our largest integrations (for example with Daimler) and we have been impressed with how versatile it is. We process several billion records, amounting to many terabytes of data, through Dask every day. 41 | 42 | ## Links 43 | - [Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS](https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray) 44 | - [How to Scale your Machine Learning Pipeline with Dask](https://www.datarevenue.com/en-blog/how-to-scale-your-machine-learning-pipeline) 45 | -------------------------------------------------------------------------------- /network-modeling.md: -------------------------------------------------------------------------------- 1 | Mobile Network Modeling 2 | ======================= 3 | 4 | Who am I? 5 | --------- 6 | 7 | I am [Sameer Lalwani](https://www.linkedin.com/in/lalwanisameer/), and I specialize in modeling wireless networks. 8 | 9 | We use these models to help operators with their technology decisions by building digital twins of their wireless networks. These models are used to quantify the impact of technology on user experience, network KPI & economics. 10 | 11 | 12 | The problem I'm trying to solve 13 | ------------------------------- 14 | 15 | Mobile network operators face uncertainty whenever a new technology emerges. Their existing networks are complex with several bands and pre-existing sites. They need to understand the implications of bringing a new technology into this mix. By modeling their networks we can help reduce the uncertainty that management faces, while giving them actionable insights about their network. 
16 | 
17 | For example, we are currently in the middle of a 4G to 5G transition, presenting operators with decisions like the following: 
18 | 
19 | - Which spectrum band should they bid on? 
20 | - What type of user experience and network performance can they expect? 
21 | - How much capital & operating expense will be required? 
22 | - What site locations should they target to meet coverage and capacity requirements? 
23 | - Should they consider refarming or acquiring new bands? 
24 | 
25 | These models are used to create digital twins at the scale of a city or a country. Based on known subscriber behavior and data from census, traffic, and other sources, "synthetic subscribers" are created. These subscribers can easily run into a few million, with every individual having its own physical location on a map. 
26 | We also bring in other GIS layers, such as building layouts and roads, to model indoor and outdoor subscribers. 
27 | 
28 | The network can also have anywhere from a few hundred to tens of thousands of cell sites. 
29 | 
30 | Our models calculate data rates at an individual subscriber level by computing RF path loss & SINR from all nearby sites. This takes our datasets into tens of millions of rows on which numerical computation needs to be done. 
31 | 
32 | We also run lots of scenarios on this network to understand its capabilities and limitations. 
33 | 
34 | 
35 | How Dask helps 
36 | -------------- 
37 | 
38 | ### Distributed computing and Dask delayed 
39 | 
40 | Dask Distributed helps run our scenarios across multiple machines while remaining within the memory constraints of each machine. We create approximately 10-20 machines on our VMware infrastructure, with one Linux machine running the Dask scheduler and all other machines running Dask workers with identical Conda environments. 
41 | 
42 | Our team runs Jupyter notebooks on their desktop machines, which connect to the Dask scheduler. Their notebooks work on their own datasets, but for specialized functions and scenarios their tasks get transparently sent to the Dask scheduler. Each site count scenario takes 10-15 minutes to run, so it's nice to have 40 of them running in parallel. 
43 | 
44 | ### Dask Dataframe 
45 | 
46 | We use LiDAR data sets to calculate line of sight for mmWave propagation from lamp posts. The LiDAR data sets for the full city are often too large to open on a single machine. Dask Dataframe allows us to pool the resources of multiple machines while keeping our logic similar to Pandas dataframes. 
47 | 
48 | We use Dask delayed to process multiple LiDAR files in parallel and then combine them into a single Dask Dataframe representing the full city. The line of sight calculation is CPU intensive, so it is nice to distribute it across multiple cores. 
49 | 
50 | One of the things we really appreciate about Dask is that it gives us the ability to focus on the model logic without worrying about scalability. Once the code is ready, it's seamless to call the functions using delayed and have them run across multiple CPU cores. 
51 | 
52 | 
53 | Pain points when using Dask 
54 | --------------------------- 
55 | 
56 | I find Dask Dataframe to be too slow for multi-column groupby operations. For such tasks I sort the Dataframe by the columns and partition it by the first column. I then use a delayed operation to apply pandas to each partition, which ends up being much faster. 
57 | 
58 | 
59 | Technology I use around Dask 
60 | ---------------------------- 
61 | 
62 | - GIS: Geopandas. We would love to see this natively supported. 
63 | - Traffic Flows: networkx 
64 | - Analytics: scikit-learn, scipy, pandas 
65 | - Visualization: Datashader with Dask distributed for LiDAR data. 
66 | - Charts: Holoviews, Seaborn 
67 | - Data Storage: HDF5 
68 | 
69 | All our deployments are done manually. We bring up these machines for our analytics with identical Conda environments and tear them down once we are done. 
70 | 
71 | Previously we used IPython parallel, but found that Dask was easier to set up, allowed us to write more complex logic, and let us share our computing pool between multiple users. 
72 | 
73 | 
74 | Links to this work 
75 | ------------------ 
76 | 
77 | Examples of models which use LiDAR to calculate line of sight: 
78 | 
79 | 1. [5G mmWave Coverage from Lamp posts](https://www.linkedin.com/pulse/how-much-5g-coverage-you-really-get-from-lampposts-sameer/) 
80 | 2. [mmWave Backhaul & Economics](https://www.linkedin.com/pulse/why-terragraph-mmwave-backhaul-essential-5g-lalwani-mba-ms-ee/) 
81 | -------------------------------------------------------------------------------- /sidewalk-labs.md: -------------------------------------------------------------------------------- 
1 | Sidewalk Labs: Civic Modeling 
2 | ============================= 
3 | 
4 | Who am I? 
5 | --------- 
6 | 
7 | I'm [Brett Naul](https://github.com/bnaul). 
8 | I work at [Sidewalk Labs](https://www.sidewalklabs.com/). 
9 | 
10 | 
11 | What problem am I trying to solve? 
12 | ---------------------------------- 
13 | 
14 | My team @ Sidewalk ("Model Lab") uses machine learning models to study human 
15 | travel behavior in cities and produce high-fidelity simulations of the travel 
16 | patterns/volumes in a metro area. Our process has three main steps: 
17 | 
18 | - Construct a "synthetic population" from census data and other sources of 
19 | demographic information; this population is statistically representative of 
20 | the true population but contains no actual identifiable individuals. 
21 | 
22 | - Train machine learning models on anonymous mobile location data to 
23 | understand behavioral patterns in the region (what times do people go to 
24 | lunch, what factors affect an individual's likelihood to use public 
25 | transit, etc.). 
26 | 
27 | - For each person in the synthetic population, generate predictions from 
28 | these models and combine the resulting activities into a single 
29 | model of all the activity in a region. 
30 | 
31 | For more information see our blogpost [Introducing Replica: A Next-Generation Urban Planning Tool](https://medium.com/sidewalk-talk/introducing-replica-a-next-generation-urban-planning-tool-1b7425222e9e). 
32 | 
33 | 
34 | How Dask Helps 
35 | -------------- 
36 | 
37 | Generating activities for millions of synthetic individuals is extremely 
38 | computationally intensive; even with, for example, a 96-core instance, 
39 | simulating a single day in a large region initially took days. It was important 
40 | to us to be able to run a new simulation from scratch overnight, and scaling to 
41 | hundreds/thousands of cores across many workers with Dask let us accomplish our 
42 | goal. 
43 | 
44 | 
45 | Why we chose Dask originally, and how these reasons have changed over time 
46 | -------------------------------------------------------------------------- 
47 | 
48 | Our code consists of a mixture of legacy research-quality code and newer 
49 | production-quality code (mostly Python).
Before I started we were using Google 50 | Cloud Dataflow (Python 2 only, massively scalable but generally an astronomical 51 | pain to work with / debug) and multiprocessing (something like ~96 cores max). 52 | 53 | Dask let us scale beyond a single machine with only minimal changes to our data 54 | pipeline. If we had been starting from scratch I think it's likely we would 55 | have gone in a different direction (something like C++ or Go microservices, 56 | especially since we have strong Google ties), but from my perspective as a 57 | hybrid infrastructure engineer/data scientist, having all of our models in 58 | Python makes it easy to experiment and debug statistical issues. 59 | 60 | 61 | Some of the pain points of using Dask for our problem 62 | ----------------------------------------------------- 63 | 64 | There is lots of special dask knowledge that only I possess, for example: 65 | 66 | - In which formats we can serialize data that will allow for it to be 67 | reloaded efficiently? Sometimes we can use parquet, other times it should 68 | be CSVs so we can easily chunk them dynamically at runtime 69 | 70 | - Sometimes we load data on the client and scatter to the workers, and other 71 | times we load chunks directly on the workers 72 | 73 | - The debugging process is sufficiently more complicated compared to local 74 | code that it's harder for other people to help resolve issues that occur 75 | on workers 76 | 77 | - The scheduler has been the source of most of our scaling issues: when 78 | the number of tasks/chunks of data gets too large, the scheduler tends 79 | to fall over silently in some way. 80 | 81 | Some of these failures might be to Kubernetes (if we run out of RAM, we 82 | don't see an OOM error; the pod just disappears and the job will restart). 83 | We had to do some hand-tuning of things like timeouts to make things more 84 | stable, and there was quite a bit of trial and error to get to a relatively 85 | reliable state 86 | 87 | - This has more to do with our deploy process but we would sometimes 88 | end up in situations where the scheduler and worker were running 89 | different dask/distributed versions and things will crash when tasks 90 | are submitted but not when the connection is made, which makes it 91 | take a while to diagnose (plus the error tends to be something 92 | inscrutable like `KeyError: ...` that others besides me would have no 93 | idea how to interpret) 94 | 95 | 96 | Some of the technology that we use around Dask 97 | ---------------------------------------------- 98 | 99 | - Google Kubernetes Engine: lots of worker instances (usually 16 cores each), 1 100 | scheduler, 1 job runner client (plus some other microservices) 101 | - Make + Helm 102 | - For debugging/monitoring I usually kubectl port-forward to 8786 and 8787 103 | and watch the dashboard/submit tasks manually. The dashboard is not very 104 | reliable over port-forward when there are lots of workers (for some reason 105 | the websocket connection dies repeatedly) but just reconnecting to the pod 106 | and refreshing always does the trick 107 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Configuration file for the Sphinx documentation builder. 4 | # 5 | # This file does only contain a selection of the most common options. 
For a 6 | # full list see the documentation: 7 | # http://www.sphinx-doc.org/en/master/config 8 | 9 | # -- Path setup -------------------------------------------------------------- 10 | 11 | # If extensions (or modules to document with autodoc) are in another directory, 12 | # add these directories to sys.path here. If the directory is relative to the 13 | # documentation root, use os.path.abspath to make it absolute, like shown here. 14 | # 15 | # import os 16 | # import sys 17 | # sys.path.insert(0, os.path.abspath('.')) 18 | 19 | 20 | # -- Project information ----------------------------------------------------- 21 | 22 | project = 'Dask Stories' 23 | copyright = '2018, Dask Community' 24 | author = 'Dask Community' 25 | 26 | # The short X.Y version 27 | version = '' 28 | # The full version, including alpha/beta/rc tags 29 | release = '' 30 | 31 | 32 | # -- General configuration --------------------------------------------------- 33 | 34 | # If your documentation needs a minimal Sphinx version, state it here. 35 | # 36 | # needs_sphinx = '1.0' 37 | 38 | # Add any Sphinx extension module names here, as strings. They can be 39 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 40 | # ones. 41 | extensions = [ 42 | "myst_parser" 43 | ] 44 | 45 | 46 | # Add any paths that contain templates here, relative to this directory. 47 | templates_path = ['_templates'] 48 | 49 | # The suffix(es) of source filenames. 50 | # You can specify multiple suffix as a list of string: 51 | # 52 | source_suffix = { 53 | '.rst': 'restructuredtext', 54 | '.txt': 'markdown', 55 | '.md': 'markdown', 56 | } 57 | 58 | # source_suffix = '.rst' 59 | 60 | # The master toctree document. 61 | master_doc = 'index' 62 | 63 | # The language for content autogenerated by Sphinx. Refer to documentation 64 | # for a list of supported languages. 65 | # 66 | # This is also used if you do content translation via gettext catalogs. 67 | # Usually you set "language" from the command line for these cases. 68 | language = None 69 | 70 | # List of patterns, relative to source directory, that match files and 71 | # directories to ignore when looking for source files. 72 | # This pattern also affects html_static_path and html_extra_path . 73 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 74 | 75 | # The name of the Pygments (syntax highlighting) style to use. 76 | # Commenting this out for now, if we register dask pygments, 77 | # then eventually this line can be: 78 | # pygments_style = "dask" 79 | 80 | 81 | # -- Options for HTML output ------------------------------------------------- 82 | 83 | # The theme to use for HTML and HTML Help pages. See the documentation for 84 | # a list of builtin themes. 85 | # 86 | html_theme = 'dask_sphinx_theme' 87 | 88 | # Theme options are theme-specific and customize the look and feel of a theme 89 | # further. For a list of options available for each theme, see the 90 | # documentation. 91 | # 92 | # html_theme_options = {} 93 | 94 | # Add any paths that contain custom static files (such as style sheets) here, 95 | # relative to this directory. They are copied after the builtin static files, 96 | # so a file named "default.css" will overwrite the builtin "default.css". 97 | html_static_path = ['_static'] 98 | 99 | # Custom sidebar templates, must be a dictionary that maps document names 100 | # to template names. 101 | # 102 | # The default sidebars (for documents that don't match any pattern) are 103 | # defined by theme itself. 
Builtin themes are using these templates by 104 | # default: ``['localtoc.html', 'relations.html', 'sourcelink.html', 105 | # 'searchbox.html']``. 106 | # 107 | # html_sidebars = {} 108 | 109 | 110 | # -- Options for HTMLHelp output --------------------------------------------- 111 | 112 | # Output file base name for HTML help builder. 113 | htmlhelp_basename = 'DaskStoriesdoc' 114 | 115 | 116 | # -- Options for LaTeX output ------------------------------------------------ 117 | 118 | latex_elements = { 119 | # The paper size ('letterpaper' or 'a4paper'). 120 | # 121 | # 'papersize': 'letterpaper', 122 | 123 | # The font size ('10pt', '11pt' or '12pt'). 124 | # 125 | # 'pointsize': '10pt', 126 | 127 | # Additional stuff for the LaTeX preamble. 128 | # 129 | # 'preamble': '', 130 | 131 | # Latex figure (float) alignment 132 | # 133 | # 'figure_align': 'htbp', 134 | } 135 | 136 | # Grouping the document tree into LaTeX files. List of tuples 137 | # (source start file, target name, title, 138 | # author, documentclass [howto, manual, or own class]). 139 | latex_documents = [ 140 | (master_doc, 'DaskStories.tex', 'Dask Stories Documentation', 141 | 'Dask Community', 'manual'), 142 | ] 143 | 144 | 145 | # -- Options for manual page output ------------------------------------------ 146 | 147 | # One entry per manual page. List of tuples 148 | # (source start file, name, description, authors, manual section). 149 | man_pages = [ 150 | (master_doc, 'daskstories', 'Dask Stories Documentation', 151 | [author], 1) 152 | ] 153 | 154 | 155 | # -- Options for Texinfo output ---------------------------------------------- 156 | 157 | # Grouping the document tree into Texinfo files. List of tuples 158 | # (source start file, target name, title, author, 159 | # dir menu entry, description, category) 160 | texinfo_documents = [ 161 | (master_doc, 'DaskStories', 'Dask Stories Documentation', 162 | author, 'DaskStories', 'One line description of project.', 163 | 'Miscellaneous'), 164 | ] 165 | -------------------------------------------------------------------------------- /pangeo.md: -------------------------------------------------------------------------------- 1 | Pangeo: Earth Science 2 | ===================== 3 | 4 | Who Am I? 5 | --------- 6 | 7 | I am [Ryan Abernathey](http://rabernat.github.io), a physical oceanographer and 8 | professor at [Columbia University](http://columbia.edu) / 9 | [Lamont Doherty Earth Observatory](http://ldeo.columbia.edu). 10 | 11 | I am a founding member of the [Pangeo Project](http://pangeo.io), an 12 | initiative aimed at coordinating and supporting the development of open source 13 | software for the analysis of very large geoscientific datasets such as 14 | satellite observations or climate simulation outputs. Pangeo is funded by 15 | [National Science Foundation Grant 16 | 1740648](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1740648&HistoricalAwards=false), 17 | of which I am the principal investigator. 18 | 19 | What Problem are We Trying to Solve? 20 | ------------------------------------ 21 | 22 | Many oceanographic and atmospheric science datasets consist of multi-dimensional 23 | arrays of numerical data, such as temperature sampled on a regular latitude, 24 | longitude, depth, time grid. These can be real data, observed by instruments 25 | like weather balloons, satellites, or other sensors; or they can be "virtual" 26 | data, produced by simulations. Scientists in these fields perform an extremely 27 | wide range of different analyses on these datasets. 
For example: 28 | 29 | - simple statistics like mean and standard deviation 30 | - principal component analysis of spatio-temporal variability 31 | - intercomparison of datasets with different spatio-temporal sampling 32 | - spectral analysis (Fourier transforms) over various space and time dimensions 33 | - budget diagnostics (e.g. calculating terms in the equation for heat conservation) 34 | - machine learning for pattern recognition and prediction 35 | 36 | Scientists like to work interactively and iteratively, trying out calculations, 37 | visualizing the results, and tweaking their code until they eventually settle on 38 | a result that is worthy of publication. 39 | 40 | The traditional workflow is to download datasets to a personal laptop or 41 | workstation and peform all analysis there. As sensor technology and computer 42 | power continue to develop, the volume of our datasets is growing exponentially. 43 | This workflow is not feasible or efficient with multi-terabyte datasets, and it 44 | is impossible with petabyte-scale datasets. The fundamental problem we are 45 | trying to solve in Pangeo is **how do we maintain the ability to perform 46 | rapid, interactive analysis in the face of extremely large datasets?** 47 | Dask is an essential part of our solution. 48 | 49 | How Dask Helps 50 | -------------- 51 | 52 | Our large multi-dimensional arrays map very well to Dask's `array` model. Our 53 | users tend to interact with Dask via [Xarray](http://xarray.pydata.org), which 54 | adds additional label-aware operations and group-by / resample capabilities. 55 | The Xarray data model is explicitly inspired by the Common Data Model format 56 | widely used in geosciences. Xarray has incorporated dask from very early in its 57 | development, leading to close integration between these packages. 58 | 59 | Pangeo provides configurations for deploying Jupyter, Xarray and Dask on 60 | high-performance computing clusters and cloud platforms. On these platforms, 61 | our users load data lazily using xarray from a variety of different storage 62 | formats and perform analysis inside Jupyter notebooks. Working closely with 63 | the Dask development team, we have tried to simplify the process of launching 64 | Dask clusters interactively by using packages such as 65 | [dask-kubernetes](https://github.com/dask/dask-kubernetes) and 66 | [dask-jobqueue](https://github.com/dask/dask-jobqueue). 67 | Users employ those packages to interactively launch their own Dask clusters 68 | across many nodes of the compute system. Dask then automatically parallelizes 69 | the xarray-based computations without users having to write much specialized 70 | parallel code. Users appreciate the Dask dashboard, which provides a visual 71 | indication of the progress and efficiency of their ongoing analysis. When 72 | everything is working well, Dask is largely transparent to the user. 73 | 74 | Why We Chose Dask Originally 75 | ---------------------------- 76 | 77 | Pangeo emerged from the Xarray development group, so Dask was a natural choice. 78 | Beyond this, Dask's flexibility is a good fit for our applications; as 79 | described above, scientists in this domain perform a huge range of different 80 | types of analysis. We need a parallel computing engine which does not strongly 81 | constrain the type of computations that can be performed nor require the user 82 | to engage with the details of parallelization. 83 | 84 | Pain Points 85 | ----------- 86 | 87 | Dask's flexibility comes with some overhead. 
88 | I have the impression that the size of the graphs our users generate, which 89 | can easily exceed a million tasks, is pushing the limits of the dask scheduler. 90 | It is not uncommon for the scheduler to crash, or to take an uncomfortably long 91 | time to process, when these tasks are submitted. Our workaround is mostly to 92 | fall back on the sort of loop-based iteration over large datasets that we had 93 | to do pre-Dask. All of this undermines the interactive experience we are trying 94 | to achieve. 95 | 96 | However, the first year of this project has made me optimistic about the future. 97 | I think the interaction between Pangeo users and Dask developers has been 98 | pretty successful. Our use cases have helped identify several performance 99 | bottlenecks that have been fixed at the Dask level. If this trend can continue, 100 | I'm confident we will be able to reach our desired scale (petabytes) and speed. 101 | 102 | A broader issue relates to onboarding of new users. While I said above that 103 | Dask operates transparently to the users, this is not always the case. Users 104 | used to writing loop-based code to process datasets have to be retrained around 105 | the delayed-evaluation paradigm. It can be a challenge to translate legacy code 106 | into a Dask-friendly format. Some sort of "cheat sheet" might be able to help 107 | with this. 108 | 109 | Technology around Dask 110 | ---------------------- 111 | 112 | [Xarray](https://xarray.pydata.org) is the main way we interact with Dask. We use the 113 | [`dask-jobqueque`](https://jobqueue.dask.org) and 114 | [`dask-kubernetes`](https://kubernetes.dask.org) projects heavily. 115 | 116 | We also use [Zarr](http://zarr.readthedocs.io) extensively for storage, 117 | especially on the cloud, where we also employ 118 | [`gcsfs`](https://gcsfs.readthedocs.io) and 119 | [`s3fs`](https://s3fs.readthedocs.io) to interface with cloud storage. 120 | 121 | 122 | Copyright and License 123 | --------------------- 124 | 125 | Copyright 2020 Ryan Abernathey. I license this work under a [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license. 126 | -------------------------------------------------------------------------------- /prefect-workflows.md: -------------------------------------------------------------------------------- 1 | Prefect: Production Workflows 2 | ============================ 3 | 4 | Who am I? 5 | --------- 6 | 7 | I am [Chris White](http://github.com/cicdw); I am the CTO 8 | at [Prefect](https://www.prefect.io), a company building the next generation of workflow automation platforms for data engineers and data 9 | scientists. In this role, I am the core developer of our [open source engine](https://github.com/PrefectHQ/prefect) 10 | which allows users to build, schedule and execute robust workflows. 11 | 12 | The Problem I'm trying to solve 13 | ------------------------------- 14 | 15 | Most teams are responsible for maintaining production workflows that 16 | are critical to the team's mission. Historically these workflows consisted 17 | largely of batch ETL jobs, but more recently include things such as 18 | deploying parametrized machine learning models, ad-hoc reporting, and 19 | handling event-driven processes. 
20 | 21 | Typically this means developers need a workflow system which can do things such as: 22 | - retry failed tasks 23 | - schedule jobs to run automatically 24 | - log detailed progress (and history) of the workflow 25 | - provide a dashboard / UI for inspecting system health 26 | - provide notification hooks for when things go wrong 27 | 28 | among many other things. We at Prefect like to think of a workflow system as 29 | a technical insurance policy - you shouldn't really notice it much when 30 | things are going well, but it should be maximally useful when things go wrong. 31 | 32 | Prefect's goal is to build the next generation workflow system. Older systems 33 | such as [Airflow](https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4) and Luigi are limited 34 | by their model of workflows as slow-moving, regularly scheduled, 35 | with limited inter-task communication. Prefect, on the other hand, embraces 36 | this new reality and makes very few assumptions about the nature and requirements of 37 | workflows, thereby supporting more dynamic use cases in both data engineering 38 | and data science. 39 | 40 | 41 | How Dask helps 42 | -------------- 43 | 44 | Prefect was designed and built with Dask in mind. Historically, workflow systems 45 | such as [Airflow](https://airflow.apache.org/) handled _all_ scheduling, of both 46 | workflows _and_ the individual tasks contained within the workflows. This pattern introduces a number of problems: 47 | - this puts an enormous burden on the central scheduler (it is scheduling _every single action_ taken in the system) 48 | - it adds non-trivial latency to task runs 49 | - in practice, this limits the amount of dynamicism workflows can have 50 | - it also tends to limit the amount of data tasks can share, as all information is routed through the central scheduler 51 | - it requires users to have an external scheduler service running to run their workflows at all! 52 | 53 | Instead, Prefect handles the scheduling of _workflows_, and lets Dask 54 | handle the scheduling and resource management of _tasks_ within each workflow. This 55 | provides a number of benefits out of the box: 56 | 57 | - **Task scheduling:** Dask handles all task scheduling within a workflow, allowing Prefect to incentivize smaller tasks which Dask schedules with millisecond latency 58 | - **"Dataflow":** because Dask handles serializing and communicating the appropriate information between Tasks, Prefect can support "dataflow" as a first-class pattern 59 | - **Distributed computation:** Dask handles allocating Tasks to workers in a cluster, allowing users to immediately realize the benefits of distributed computation with minimal overhead 60 | - **Parallelism:** whether running in a cluster or locally, Dask provides parallel Task execution off the shelf 61 | 62 | Additionally, because Dask is written in pure Python and has an active open source community, 63 | we can very easily get feedback on possible bugs, and even contribute to improving the software ourselves. 64 | 65 | To achieve this ability to run workflows with many tasks, we found that Dask's [Futures interface](https://docs.dask.org/en/latest/futures.html) 66 | serves us well. In order to support dynamic tasks (i.e., tasks which spawn other tasks), we rely on Dask [worker clients](http://distributed.dask.org/en/latest/task-launch.html?highlight=worker_client). 
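As a rough sketch of that pattern (generic Dask code with made-up function names, not Prefect's actual internals), a task running on a worker can open a `worker_client` and submit further tasks to the same scheduler:

```python
from dask.distributed import Client, worker_client

def score(item):
    return item + 1

def fan_out(items):
    # Runs on a worker; worker_client() connects back to the same scheduler,
    # so this task can submit and wait on child tasks of its own.
    with worker_client() as client:
        futures = client.map(score, items)
        return client.gather(futures)

if __name__ == "__main__":
    client = Client()                                # local cluster for illustration
    result = client.submit(fan_out, [1, 2, 3]).result()
    print(result)                                    # [2, 3, 4]
```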
We have also occasionally experimented with [Dask Queues](http://distributed.dask.org/en/latest/api.html?highlight=sharing%20futures#distributed.Queue) to implement more complicated behavior such as future-sharing and resource throttling, but are not currently using them (mainly for design reasons). 67 | 68 | Pain points when using Dask 69 | --------------------------- 70 | 71 | Our biggest pain point in using Dask has largely revolved around the ability (or lack 72 | thereof) to share futures between clients. To provide a concrete example, suppose we start with a 73 | list of numbers and, using [`client.map`](https://distributed.readthedocs.io/en/latest/api.html#distributed.Client.map) 74 | twice, we proceed to compute `x -> x + 1 -> x + 2` for each element of our list. When using only dask primitives and a single client, 75 | these computations proceed asychronously, meaning that the final computation of each branch 76 | can begin without waiting on the other middle computations, as in this schematic: 77 | 78 | ![Depth First Execution](depth-first.png) 79 | 80 | However, in Prefect, we aren't simply passing around Dask futures created from a single `Client` - when a [`map` operation](https://docs.prefect.io/guide/core_concepts/mapping.html#prefect-approach) occurs, the dask futures are actually created by a `worker_client` and attached to a Prefect `State` object. 81 | *Ideally*, we would leave these futures unresolved at this stage so that computation can proceed as above. However, because 82 | it is non-trivial to share futures between clients we must `gather` the futures with this same client, making 83 | our computation proceed in a "breadth-first" manner: 84 | 85 | ![Breadth first execution](breadth-first.png) 86 | 87 | This isn't the worst thing, but for longer pipelines it would be very nice to have the faster branches 88 | of the pipeline proceed with execution so that final results are produced earlier for inspection. 89 | 90 | **Update**: [As of Prefect 0.12.0](https://medium.com/the-prefect-blog/map-faster-mapping-improvements-in-prefect-0-12-0-7cacc3f14e16), Prefect now supports Depth First Execution when running on Dask. 91 | 92 | Technology we use around Dask 93 | ---------------------------- 94 | 95 | Our preferred deployment of Prefect Flows uses [dask-kubernetes](https://github.com/dask/dask-kubernetes) 96 | to spin up a short-lived Dask Cluster in Kubernetes. 97 | 98 | Otherwise, the logic contained within Prefect Tasks can be essentially arbitrary; 99 | many tasks in the system interact with databases, GCP resources, AWS, etc. 100 | 101 | 102 | Links 103 | ----- 104 | 105 | - [Prefect Repo](https://github.com/PrefectHQ/prefect) 106 | - [Prefect on Dask Example](https://docs.prefect.io/guide/tutorials/dask-cluster.html) 107 | - [Dask-Kubernetes](https://kubernetes.dask.org) 108 | - [Blog post on some Prefect / Dask improvements](https://medium.com/the-prefect-blog/map-faster-mapping-improvements-in-prefect-0-12-0-7cacc3f14e16) 109 | -------------------------------------------------------------------------------- /mosquito-sequencing.md: -------------------------------------------------------------------------------- 1 | Genome Sequencing for Mosquitos 2 | =============================== 3 | 4 | Who am I? 5 | --------- 6 | 7 | I'm [Alistair Miles](http://alimanfoo.github.io/about/) and I work for Oxford 8 | University [Big Data Institute](https://www.bdi.ox.ac.uk/) but am also 9 | affiliated with the [Wellcome Sanger Institute](https://www.sanger.ac.uk/). 
I
10 | lead the malaria vector (mosquito) genomics programme within the [malaria
11 | genomic epidemiology network](https://www.malariagen.net), an international network of
12 | researchers and malaria control professionals developing new technologies based
13 | on genome sequencing to aid in the effort towards malaria elimination. I also
14 | have a technical role as Head of Epidemiological Informatics for the [Centre
15 | for Genomics and Global Health](http://www.cggh.org/), which means I have some
16 | oversight and responsibility for computing and software architecture and
17 | direction within our teams at Oxford and Sanger.
18 | 
19 | 
20 | What problem am I trying to solve?
21 | ----------------------------------
22 | 
23 | Malaria is still a major cause of mortality, particularly in sub-Saharan
24 | Africa. Research has shown that the best way to reduce malaria is to control
25 | the mosquitoes that transmit malaria between people. Unfortunately, mosquito
26 | populations are becoming resistant to the insecticides used to control them.
27 | New mosquito control tools are needed. New systems for mosquito population
28 | surveillance/monitoring are also needed to help inform and adapt control
29 | strategies to respond to mosquito evolution. We have established a project to
30 | perform an initial survey of mosquito genetic diversity, by sequencing whole
31 | genomes of approximately 3,000 mosquitoes collected from field sites across 18
32 | African countries, [The Anopheles gambiae 1000 Genomes Project](https://www.malariagen.net/ag1000g).
33 | We are currently working to scale up our
34 | sequencing operations to be able to sequence ~10,000 mosquitoes per year, and
35 | to integrate genome sequencing into regular mosquito monitoring programmes
36 | across Africa and Southeast Asia.
37 | 
38 | 
39 | How does Dask help?
40 | -------------------
41 | 
42 | Whole genome sequence data is a relatively large-scale data resource, requiring
43 | specialised processing and analysis to extract key information, e.g.,
44 | identifying genes involved in the evolution of insecticide resistance. We use
45 | conventional bioinformatic approaches for the initial phases of data processing
46 | (alignment, variant calling, phasing); beyond that point, however, we switch to
47 | interactive and exploratory analysis using Jupyter notebooks.
48 | 
49 | Making interactive analysis of large-scale data work well is obviously a challenge,
50 | because inefficient code and/or use of computational resources vastly increases
51 | the time taken for any computation, destroying the ability of an analyst to
52 | explore many different possibilities within a dataset. Dask helps by providing
53 | an easy-to-use framework for parallelising computations, either across multiple
54 | cores on a single workstation, or across multiple nodes in a cluster. We have
55 | built a software package called
56 | [scikit-allel](http://scikit-allel.readthedocs.io/en/latest/) to help with our
57 | genetic analyses, and use Dask within that package to parallelise a number of
58 | commonly used computations.
59 | 
60 | 
61 | Why did I choose Dask?
62 | ----------------------
63 | 
64 | Normally the transition from a serial (i.e., single-core) implementation of any
65 | given computation to a parallel (multi-core) implementation requires the code
66 | to be completely rewritten, because parallel frameworks usually offer a
67 | completely different API, and managing complex parallel workflows is a
68 | significant challenge.
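As a minimal illustration of how small that rewrite can be with Dask (a hedged sketch, not scikit-allel's actual code), the same reduction written with NumPy and with dask.array:

```python
import numpy as np
import dask.array as da

# A toy matrix of 0/1 allele calls, shaped (variants, samples).
calls = np.random.randint(0, 2, size=(100_000, 100))

# NumPy: eager and single-threaded.
np_freq = calls.sum(axis=1) / calls.shape[1]

# Dask: the same expression, chunked and evaluated in parallel on compute().
dcalls = da.from_array(calls, chunks=(10_000, 100))
da_freq = (dcalls.sum(axis=1) / dcalls.shape[1]).compute()

assert np.allclose(np_freq, da_freq)
```
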
69 | 
70 | Originally Dask was appealing because it provided a familiar API,
71 | with the dask.array package following the numpy API (which we were already
72 | using) relatively closely. Dask also handled all the complexity of constructing
73 | and running complex, multi-step computational workflows.
74 | 
75 | Today, we're also interested in the flexibility Dask offers to initially
76 | parallelise over multiple cores in a single computer via multi-threading, and
77 | then switch to running on a multi-node cluster with relatively little change in
78 | our code. Thus computations can be scaled up or down with great convenience.
79 | When we first started using Dask we were focused on making effective use of
80 | multiple threads for working on a single computer; now, as data grows, we
81 | are moving data and computation into a cloud setting and looking to make use of
82 | Dask via Kubernetes.
83 | 
84 | 
85 | Pain points?
86 | ------------
87 | 
88 | Initially when we started using Dask in 2015 we hit a few bugs and some of the
89 | error messages generated by Dask were very cryptic, so debugging some problems
90 | was hard. However, the stability of the code base, the user documentation, and
91 | the error messages have improved a lot recently, and the sustained investment
92 | in Dask is clearly adding a lot of value for users.
93 | 
94 | It is still difficult to think about how to code up parallel operations over
95 | multidimensional arrays where one or more dimensions are dropped by the
96 | function being mapped over the data, but there is some inherent complexity
97 | there so probably not much Dask can do to help.
98 | 
99 | The Dask code base itself is tidy and consistent but quite hard to get into
100 | in order to understand and debug issues. Again, Dask is handling a lot of inherent
101 | complexity so maybe not much can be done.
102 | 
103 | 
104 | Technology I use around Dask
105 | ----------------------------
106 | 
107 | We are currently working on deploying both JupyterHub and Dask on top of
108 | Kubernetes in the cloud, following the approach taken in the [Pangeo
109 | project](http://pangeo-data.org/). We use Dask primarily through the
110 | scikit-allel package. We also use Dask together with the
111 | [Zarr](http://zarr.readthedocs.io/en/stable/) array storage library (in fact
112 | the original motivation for writing Zarr was to provide a storage library that
113 | enabled Dask to efficiently parallelise I/O-bound computations).
114 | 
115 | 
116 | 
117 | Anything else to know?
118 | ----------------------
119 | 
120 | Our analysis code is still quite heterogeneous, with some code making use of a
121 | bespoke approach to out-of-core computing which we developed prior to being
122 | aware of Dask, and the remainder using Dask. This is just a legacy of timing,
123 | with some work having started prior to knowing about Dask. With the stability
124 | and maturity of Dask now I am very happy to push towards full adoption.
125 | 
126 | One cognitive shift that this requires is for users to get used to lazy
127 | (deferred) computation. This can be a stumbling block to start with, but is
128 | worth the effort of learning because it gives the user the ability to run
129 | larger computations. So I have been thinking about writing a blog post to
130 | communicate the message that we are moving towards adopting Dask wherever
131 | possible, and to give an introduction to the lazy coding style, with examples
132 | from our domain (population genomics).
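For instance, a minimal sketch of that lazy style (the Zarr store name and computation here are hypothetical):

```python
import dask.array as da

# Nothing is read or computed yet; this only builds a graph of work.
gt = da.from_zarr("ag1000g_genotypes.zarr")  # hypothetical Zarr store
alt_count = (gt > 0).sum(axis=1)

# Only now are chunks read and the reduction run in parallel.
result = alt_count.compute()
```
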
There are also still quite a few
133 | functions in scikit-allel that could be parallelised via Dask but haven't yet
134 | been, so I still have an aspiration to work on that. I'm not sure when I'll get to
135 | these, but hopefully this conveys the intention to adopt Dask more widely and also to
136 | help train people in our immediate community to use it.
137 | 
--------------------------------------------------------------------------------
/satellite-imagery.md:
--------------------------------------------------------------------------------
1 | Satellite Imagery Processing
2 | ============================
3 | 
4 | Who am I?
5 | ---------
6 | 
7 | I am [David Hoese](http://github.com/djhoese) and I work as a software
8 | developer at the [Space Science and Engineering Center
9 | (SSEC)](https://www.ssec.wisc.edu/) at the University of Wisconsin-Madison.
10 | My job is to create software that makes meteorological data more accessible to
11 | scientists and researchers.
12 | 
13 | I am also a member of the open source
14 | [PyTroll](http://pytroll.github.io/) community, where I act as a core developer
15 | on the [SatPy](http://satpy.readthedocs.io/en/latest/) library. I use SatPy in
16 | my SSEC projects, Polar2Grid and Geo2Grid, which provide a simple command
17 | line interface on top of the features provided by SatPy.
18 | 
19 | 
20 | The Problem I'm trying to solve
21 | -------------------------------
22 | 
23 | Satellite imagery data is often hard to read and use because of the many
24 | different formats and structures that it can come in. To make satellite imagery
25 | more useful, the SatPy library wraps common operations performed on satellite
26 | data in simple interfaces. Typically, meteorological satellite data needs to go
27 | through some or all of the following steps:
28 | 
29 | - **Read:** Read the observed scientific data from one or more data files while
30 | keeping track of geolocation and other metadata to make the data the
31 | most useful and descriptive.
32 | - **Composite:** Combine one or more different "channels" of data to bring out
33 | certain features in the data. This is typically shown as RGB
34 | images.
35 | - **Correct:** Sometimes data has artifacts, from the instrument hardware or the
36 | atmosphere for example, that can be removed or adjusted.
37 | - **Resample:** Visualization tools often support a small subset of Earth
38 | projections and satellite data is usually not in those
39 | projections. Resampling can also be useful when wanting to do
40 | intercomparisons between different instruments (on the same
41 | satellite or not).
42 | - **Enhancement:** Data can be normalized or scaled in certain ways that make
43 | certain atmospheric conditions more apparent. This can also
44 | be used to better fit data into other data types (8-bit
45 | integers, etc.).
46 | - **Write:** Visualization tools typically only support a few specific file
47 | formats. Some of these formats are difficult to write or have small
48 | differences depending on what application they are destined for.
49 | 
50 | As satellite instrument technology advances, scientists have to learn how to
51 | handle more channels for each instrument and at spatial and temporal
52 | resolutions that were unheard of when they were learning how to use satellite
53 | data. If they are lucky, scientists may have access to a high-performance
54 | computing system, while the rest may have to settle for long execution times on
55 | their desktop or laptop machines.
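In SatPy, a typical pass through those steps looks roughly like this (the reader, composite, and area names below are placeholders that depend on the particular data):

```python
from satpy import Scene

# Hypothetical input granule; the reader name depends on the sensor/format.
scn = Scene(filenames=["/data/goes16/granule_001.nc"], reader="abi_l1b")

scn.load(["true_color"])               # read + composite + correct
local = scn.resample("my_target_area")  # resample to a target projection
local.save_datasets(writer="geotiff")  # enhance + write
```
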
By optimizing the parts of the processing
56 | that take a lot of time and memory, it is our hope that scientists can worry
57 | about the science and leave the annoying parts to SatPy.
58 | 
59 | 
60 | How Dask helps
61 | --------------
62 | 
63 | SatPy's use of Dask makes it possible to do calculations on laptops that
64 | used to require high-performance server machines. SatPy was originally
65 | drawn to [Xarray](http://xarray.pydata.org/en/stable/)'s `DataArray` objects
66 | for their metadata-storing functionality and their support for Dask arrays. We knew
67 | that our original usage of Numpy masked arrays was not scalable to the new
68 | satellite data being produced. SatPy has now switched to `DataArray`
69 | objects backed by Dask and leverages Dask's ability to do the following:
70 | 
71 | - **Lazy evaluation:** Software development is so much easier when you don't have
72 | to remove intermediate results from memory to process the next step.
73 | - **Task caching:** Our processing involves a lot of intermediate results that can
74 | be shared between different processes. When things are optimized in the
75 | Dask graph, it saves us as developers from having to code the "reuse" logic
76 | ourselves. It also means that intermediate results that are no longer
77 | needed can be disposed of and their memory freed.
78 | - **Parallel workers and array chunking:** Satellite data is usually compared by
79 | geographic location. So a pixel at one index is often compared with the
80 | pixel of another array at that same index. Splitting arrays into chunks
81 | and processing them separately provides us with a great performance
82 | improvement, and not having to manage which worker gets what chunk of the
83 | array makes development effortless.
84 | 
85 | Benefiting from all of the above lets us create amazing high-resolution RGB
86 | images in 6-8 minutes on 3-year-old laptops, where SatPy's old Numpy
87 | implementation would have taken 35+ minutes just to crash from memory limitations.
88 | 
89 | 
90 | Pain points when using Dask
91 | ---------------------------
92 | 
93 | 1. Dask arrays are not Numpy arrays. Almost everything is supported or is
94 | close enough that you get used to it, but not everything. With most things you
95 | can get away with it and get perfectly good performance; with others you may end up
96 | computing your arrays multiple times in just a couple of lines of code without
97 | knowing it. Sometimes I wish that there was a Dask feature to
98 | raise an exception if your array is computed without you specifically
99 | saying it was OK.
100 | 
101 | 2. Common satellite data formats, like GeoTIFF, can't always be
102 | written to by multiple writers (multiple nodes on a cluster), and some
103 | aren't even thread-safe. Opening a file object and using it with
104 | `dask.array.store` may work with some schedulers and not others.
105 | 
106 | 3. Dimension changes are a pain. Satellite data processing sometimes involves
107 | lookup tables, to save on bandwidth limitations when sending data from the
108 | satellite to the ground or in other similar situations. Having to use lookup
109 | tables, including something like a KDTree, can be really difficult and
110 | confusing to code with Dask and get right. It typically involves using
111 | `atop`, `map_blocks`, or sometimes suffering the penalty of passing things
112 | to a `Delayed` function where the entire data array is passed as one
113 | complete memory-hungry array.
114 | 
115 | 4. 
A lot of satellite processing seems to perform better with the default
116 | threaded Dask scheduler than with the distributed scheduler, due to the nature of
117 | the problems being solved. A lot of processing, especially the creation of
118 | RGB images, requires comparing multiple arrays in different ways and can
119 | suffer from the amount of communication between distributed workers. There
120 | isn't an easy way that I know of to control where things are processed and
121 | which scheduler to use without requiring users to know detailed information
122 | about the internals of Dask.
123 | 
124 | 
125 | Technology I use around Dask
126 | ----------------------------
127 | 
128 | As mentioned earlier, SatPy uses [Xarray](http://xarray.pydata.org/en/stable/) to
129 | wrap most of our Dask operations when possible. We have other useful tools that
130 | we've created in the PyTroll community to help support deploying satellite
131 | processing tools on servers, but they are not specific to Dask.
132 | 
133 | 
134 | Links
135 | -----
136 | 
137 | - [PyTroll Community](http://pytroll.github.io/)
138 | - [SatPy](http://satpy.readthedocs.io/en/latest/)
139 | - [SatPy Examples](http://satpy.readthedocs.io/en/latest/examples.html)
140 | - [PyResample](http://pyresample.readthedocs.io/en/latest/)
141 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Dask Stories
2 | ============
3 | 
4 | This repository holds stories from experienced Dask users that are intended to
5 | help new users get a sense for how Dask might apply to their field.
6 | 
7 | If you are curious about Dask then please read the various submitted stories.
8 | 
9 | If you use Dask today to solve interesting problems, we would love to have you
10 | share your story. Hearing from experienced users like yourself can help
11 | newcomers quickly identify the parts of Dask and the surrounding ecosystem that
12 | are likely to be valuable to them.
13 | 
14 | 
15 | ## How to share your story
16 | 
17 | We welcome stories of any form. However, if you'd like some guidance then we
18 | recommend the following rubric. We've included suggestions of how to break up
19 | a story below, including two entirely fabricated stories alongside each set of
20 | instructions.
21 | 
22 | 
23 | ### Who are you?
24 | 
25 | Include your name, the project you work with, and your role within that
26 | project. Some examples:
27 | 
28 | - I'm [Joseph Chen](). I manage a quantitative quality control group at
29 | [XYZ automotive](), a company that makes automotive parts.
30 | - I'm [Alice Singh](). I'm a post-doctoral scholar at the University of
31 | Arizona. I work at the [National Solar Observatory]() studying sunspots and
32 | solar flares.
33 | 
34 | Links are welcome.
35 | 
36 | 
37 | ### What problem are you trying to solve?
38 | 
39 | Include context and detail here about the problem that you're trying to solve.
40 | Details are *very* welcome here. You're probably writing to someone within
41 | your own field so feel free to use technical speech. You shouldn't necessarily
42 | mention Dask here; focus on your problem instead.
43 | 
44 | #### Example: XYZ Automotive
45 | 
46 | XYZ automotive produces thousands of kinds of parts for cars and millions of
47 | each part. Many of these parts generate telemetry about how they're doing on a
48 | second-by-second basis.
We process this information over time to learn when
49 | parts might fail, and what kinds of activities might lead to failure. This
50 | helps us react in a variety of ways:
51 | 
52 | 1. We signal drivers that they should service their vehicle soon, reducing the
53 | chance of a serious problem while on the road
54 | 2. We inform automotive mechanics where the problem is, reducing the cost of
55 | diagnostics
56 | 3. We roll these discoveries back into the design process, helping our
57 | engineers develop solid products.
58 | 
59 | However, analyzing billions of time series is hard, especially when those time
60 | series come from thousands of different kinds of devices. This is both a "big
61 | data" problem and a "heterogeneous data" problem. We employ a team of data
62 | scientists to analyze this data. We use Pandas, Scikit-Learn, statsmodels, and
63 | [lifelines](http://lifelines.readthedocs.io/en/latest/) for survival analysis,
64 | along with some internal libraries. In particular, we found scaling
65 | survival analysis to be a challenge as our business grew.
66 | 
67 | #### Example: Solar Astronomers
68 | 
69 | My research analyzes correlation information from high-resolution solar
70 | astronomy data that might precede solar flare activity or correlate with
71 | sunspots. This both helps our understanding of basic science in
72 | Magneto-Hydro-Dynamics (MHD) and improves the durability of Earth's
73 | satellite fleet during dangerous solar storm events.
74 | 
75 | In practice, this means that we build algorithms to analyze a real-time stream
76 | of high-resolution images of the sun to predict future activity. We do image
77 | segmentation to find spots for testing data, and discrete wavelet transforms
78 | *both spatially and across time* to create features for downstream machine
79 | learning algorithms.
80 | 
81 | Our data is big. Single images can be hundreds of megabytes and we have years
82 | of them. This means hundreds of terabytes.
83 | 
84 | 
85 | ### How Dask Helps
86 | 
87 | Describe how Dask helps to solve this problem. Again, details are welcome.
88 | New readers probably won't know about specific APIs like "we use client.scatter"
89 | but probably will be able to follow terms used as headers in documentation like
90 | "we used dask dataframe and the futures interface together".
91 | 
92 | We also encourage you to mention how your use of Dask has *changed* over time.
93 | What originally drew you to the project? Is that still why you use it, or have
94 | your perception or needs changed?
95 | 
96 | #### Example: XYZ Automotive
97 | 
98 | Dask helps us solve our problem by parallelizing our internal timeseries
99 | libraries. We're in a weird position where we have "big data", but it's all
100 | very different, so standard projects like time-series databases, Apache Spark,
101 | or Dask Dataframe weren't really a good fit. However, we found that we could
102 | just use Pandas/Scikit-Learn/Lifelines code along with [Dask delayed]() to
103 | parallelize our existing solutions pretty easily. The code is still pretty
104 | much what we had before (which is good, we had a lot invested there) but now it
105 | operates well on larger datasets and uses all of our cores.
106 | 
107 | As a result, our large analysis workstations quickly became very
108 | popular and we've had to acquire more, but people seem pretty happy. We've
109 | looked into the distributed scheduler for cluster computing and it's
110 | interesting, but we haven't found a sufficient business need yet.
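The kind of change involved is small; a purely illustrative sketch (hypothetical file and column names) of wrapping existing per-part analysis code with Dask delayed:

```python
import dask
import pandas as pd

@dask.delayed
def analyze(csv_path):
    # Existing single-device Pandas logic, unchanged.
    df = pd.read_csv(csv_path)
    return df["failure_metric"].mean()  # stand-in for the real survival analysis

# One lazy task per device; Dask then runs them across all local cores.
results = dask.compute(*[analyze(path) for path in ["part_001.csv", "part_002.csv"]])
```
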
111 | 
112 | #### Example: Solar Astronomer
113 | 
114 | We use [Dask Arrays]() to process large stacks of images with the Numpy API.
115 | Most of my colleagues and I are comfortable with Numpy, and so the learning
116 | curve to parallelize with Dask Array was very easy. We have two common
117 | workloads:
118 | 
119 | - We combine Dask arrays with the Scikit-Image library to apply standard
120 | filters over image stacks in an embarrassingly parallel way. Nothing fancy
121 | here, but it's nice not to have to worry about saturating cores and keeping
122 | memory use low. Dask handles that for us.
123 | - We rechunk our images to be in blocks that include time, then we overlap
124 | them with halos so that neighboring blocks have a bit of nearby information,
125 | and then we apply more complex algorithms, mostly DWT; we're also starting
126 | to experiment with convolutional neural nets.
127 | 
128 | Finally, we also just use Dask array as we play around with new datasets. It's
129 | our go-to solution now for just hacking around on our laptop.
130 | 
131 | We're also starting to use Dask's futures interface while we prototype a
132 | real-time processing system for live analysis and alerts, but that's a separate
133 | project.
134 | 
135 | Originally most of this work started on my laptop with the local scheduler.
136 | However, as we've started doing more complex workloads, we've found that
137 | Dask is no longer able to do everything in low memory. As a result, we now run
138 | dask.distributed on our cluster on 10+TB image stacks. We've thought about
139 | going bigger, but don't yet have the allocation.
140 | 
141 | 
142 | ### Some of the pain points of using Dask for your problem
143 | 
144 | Dask has issues and it's not always the right solution for every problem. What
145 | are things that you ran into that you think others in your field should know
146 | ahead of time?
147 | 
148 | 
149 | #### Example: XYZ Automotive
150 | 
151 | When we started using parallelism, we found that large sections of our codebase
152 | didn't parallelize well. We used a lot of string comparisons that didn't play
153 | well with the GIL. We switched to using Dask's multiprocessing scheduler, but
154 | then communication costs were too high. Eventually we worked around this by
155 | using Pandas categoricals, which solve the problem pretty well, but only up to
156 | about 16 cores per process.
157 | 
158 | We also have some problems with diagnostics. The diagnostics on the local
159 | scheduler aren't as nice as those on the distributed scheduler. There are
160 | tradeoffs both ways.
161 | 
162 | 
163 | #### Example: Solar Astronomer
164 | 
165 | Getting new users comfortable with lazy evaluation took a bit of time.
166 | 
167 | When we first started, the overhead of overlapping computations with Dask array
168 | was really high. It looks like this has been mostly fixed in recent
169 | versions, though. Still, as we continue to scale out to larger datasets
170 | we feel like we always run into new kinds of overhead. It can be worked around
171 | and things are getting a lot better, but if you want to go to 100TB, expect to
172 | do some tuning. Fortunately the codebase is all Python, so we've been able to
173 | improve things and send patches upstream.
174 | 
175 | When using the distributed scheduler we had problems with our HPC cluster,
176 | which is pretty finicky about workers running out of RAM.
It took a bit of work to get the
177 | configuration right so that workers killed themselves before the cluster found
178 | out about them.
179 | 
180 | 
181 | ### Some of the technology you use around Dask
182 | 
183 | This might be other libraries that you use with Dask for analysis or data
184 | storage, cluster technologies that you use to deploy or capture logs, etc.
185 | Anything that you think someone like you would like to know about.
186 | 
187 | 
188 | #### Example: XYZ Automotive
189 | 
190 | We mostly use Pandas, Scikit-Learn, and Lifelines for computation. We get data
191 | in CSV and convert to Parquet using Arrow. We use the general PyData stack for
192 | plotting and such.
193 | 
194 | 
195 | #### Example: Solar Astronomer
196 | 
197 | We store our data in FITS files and use AstroPy to read it, but now we're
198 | looking at moving over to TIFF or Zarr. We've also just started looking at
199 | XArray, which has good Dask support and seems to have a strong community.
200 | 
201 | For cluster deployment we use PBS and the
202 | [dask-jobqueue](https://jobqueue.dask.org) project locally, though
203 | we're starting to look at storing data on AWS and using the [Dask-Helm
204 | chart](https://docs.dask.org/en/latest/setup/kubernetes-helm.html) or
205 | [dask-kubernetes](https://kubernetes.dask.org).
206 | 
--------------------------------------------------------------------------------