├── requirements.txt ├── depth-first.png ├── breadth-first.png ├── dask-datarevenue.png ├── CONTRIBUTING.md ├── Makefile ├── make.bat ├── fullspectrum.md ├── template.md ├── index.rst ├── icecube-cosmic-rays.md ├── hydrologic-modeling.md ├── datarevenue.md ├── network-modeling.md ├── sidewalk-labs.md ├── conf.py ├── pangeo.md ├── prefect-workflows.md ├── mosquito-sequencing.md ├── satellite-imagery.md └── README.md /requirements.txt: -------------------------------------------------------------------------------- 1 | myst-parser 2 | sphinx 3 | dask-sphinx-theme>=3.0.5 4 | -------------------------------------------------------------------------------- /depth-first.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-stories/HEAD/depth-first.png -------------------------------------------------------------------------------- /breadth-first.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-stories/HEAD/breadth-first.png -------------------------------------------------------------------------------- /dask-datarevenue.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dask/dask-stories/HEAD/dask-datarevenue.png -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | Dask is a community maintained project. We welcome contributions in the form of bug reports, documentation, code, design proposals, and more. 2 | 3 | For general information on how to contribute see https://docs.dask.org/en/latest/develop.html. 4 | 5 | ## Project specific notes 6 | 7 | This project contains stories of how people are using Dask, you can find instructions on adding your story here https://github.com/dask/dask-stories#how-to-share-your-story. 8 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = DaskStories 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=. 11 | set BUILDDIR=_build 12 | set SPHINXPROJ=DaskStories 13 | 14 | if "%1" == "" goto help 15 | 16 | %SPHINXBUILD% >NUL 2>NUL 17 | if errorlevel 9009 ( 18 | echo. 19 | echo.The 'sphinx-build' command was not found. 
Make sure you have Sphinx 20 | echo.installed, then set the SPHINXBUILD environment variable to point 21 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 22 | echo.may add the Sphinx directory to PATH. 23 | echo. 24 | echo.If you don't have Sphinx installed, grab it from 25 | echo.http://sphinx-doc.org/ 26 | exit /b 1 27 | ) 28 | 29 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 30 | goto end 31 | 32 | :help 33 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 34 | 35 | :end 36 | popd 37 | -------------------------------------------------------------------------------- /fullspectrum.md: -------------------------------------------------------------------------------- 1 | Full Spectrum: Credit and Banking 2 | ================================= 3 | 4 | Who am I? 5 | --------- 6 | 7 | My name is [Hussain Sultan](https://www.linkedin.com/in/hussainsultan/). 8 | I am a partner at [Full Spectrum 9 | Analytics](https://www.fullspectrumanalytics.com/). I create personalized 10 | analytics software within banks for the sake of equitable and profitable 11 | decision making. 12 | 13 | 14 | What problem am I trying to solve? 15 | ---------------------------------- 16 | 17 | Lending businesses create and manage valuations and cashflow models that output 18 | the profitability expectations for customer segments. These models are complex 19 | because they form a network of equations that need to be scored efficiently and 20 | keep track of inputs/outputs at scale. 21 | 22 | 23 | How Dask helps 24 | -------------- 25 | 26 | Dask is instrumental in my work for creating efficient cashflow model 27 | management systems and general data science enablement on data lakes. 28 | 29 | Dask provides a way to construct the dependencies of cashflow equations as a 30 | DAG (using the [dask.delayed](https://docs.dask.org/en/latest/delayed.html) 31 | interface) and provides a good developer experience for building 32 | scoring/gamification/model tracking applications. 33 | 34 | 35 | Why I chose Dask originally 36 | --------------------------- 37 | 38 | I chose dask for three reasons: 39 | 40 | 1. It was lightweight 41 | 2. The granular task scheduling approach to scaling both dataframes and 42 | arbitrary computations fit my use case well 43 | 3. It is easy to scale my team with Python programmers 44 | 45 | 46 | Some of the pain points of using Dask in our problem 47 | ---------------------------------------------------- 48 | 49 | It's hard to get organization buy-in to adopt an open-source technology without 50 | vendored support and enterprise SLAs. 51 | 52 | In a recent project, we had to integrate with the Orc data format that turned 53 | out to be more expensive than I originally anticipated (compounded by 54 | enterprise hadoop set-up and encryption requirements). These changes have 55 | since been upstreamed though, and so things are easier now. 56 | 57 | 58 | Some of the technology that we use around Dask 59 | ---------------------------------------------- 60 | 61 | We deployed on generic internal server with Jenkins scheduling a Jupyter 62 | notebook to execute. We built everything out using our internal analytics 63 | platform. We didn't have to worry about security because everything was behind 64 | a corporate firewall. 65 | -------------------------------------------------------------------------------- /template.md: -------------------------------------------------------------------------------- 1 | Template 2 | ======== 3 | 4 | Who am I? 
5 | --------- 6 | 7 | A brief description of who you are, and the name of the project for which you 8 | use Dask. 9 | 10 | 11 | The Problem I'm trying to solve 12 | ------------------------------- 13 | 14 | Include context and detail here about the problem that you're trying to solve. 15 | Details are *very* welcome here. You're probably writing to someone within 16 | your own field so feel free to use technical speech. 17 | 18 | You shouldn't mention Dask here yet; focus on your problem instead. Why is it 19 | important? Why is it hard? Who does this problem affect? 20 | 21 | 22 | How Dask helps 23 | -------------- 24 | 25 | Describe how Dask helps you to solve this problem. Again, details are welcome. 26 | New readers probably won't know about specific API like "we use client.scatter" 27 | but probably will be able to follow terms used as headers in documentation like 28 | "we used dask dataframe and the futures interface together". 29 | 30 | We also encourage you to mention how your use of Dask has *changed* over time. 31 | What originally drew you to the project? Is that still why you use it or has 32 | your perception or needs changed? 33 | 34 | 35 | Pain points when using dask 36 | --------------------------- 37 | 38 | Dask has issues and it's not always the right solution for every problem. What 39 | are things that you ran into that you think others in your field should know 40 | ahead of time? 41 | 42 | 43 | Technology I use around Dask 44 | ---------------------------- 45 | 46 | This might be other libraries that you use with Dask for analysis or data 47 | storage, cluster technologies that you use to deploy or capture logs, etc.. 48 | Anything that you think someone like you might want to use alongside Dask. 49 | 50 | 51 | Other information 52 | ----------------- 53 | 54 | Is there something else that didn't fit into the sections above? Feel free to 55 | make your own. 56 | 57 | 58 | Links 59 | ----- 60 | 61 | Links and images throughout the document are great. You may want to list links 62 | again here. This might be links to your company or project, links to blogposts 63 | or notebooks that you've written about the topic, or links to relevant source 64 | code. Anything that someone who was interested in your story could use to 65 | learn more. 66 | 67 | We also strongly encourage you to include images. These might be output 68 | results from your analyses, diagrams showing your architecture, or anything 69 | that helps to convey who your group is, and the kind of work that you're doing. 70 | -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | Dask Use Cases 2 | ============== 3 | 4 | Dask is a versatile tool that supports a variety of workloads. 5 | This page contains brief and illustrative examples of how people use Dask in practice. 6 | These emphasize breadth and hopefully inspire readers to find new ways 7 | that Dask can serve them beyond their original intent. 8 | 9 | .. toctree:: 10 | :maxdepth: 1 11 | 12 | sidewalk-labs.md 13 | mosquito-sequencing.md 14 | fullspectrum.md 15 | icecube-cosmic-rays.md 16 | pangeo.md 17 | hydrologic-modeling.md 18 | network-modeling.md 19 | satellite-imagery.md 20 | prefect-workflows.md 21 | datarevenue.md 22 | 23 | Overview 24 | -------- 25 | 26 | Dask uses can be roughly divided in the following two categories: 27 | 28 | 1. 
Large NumPy/Pandas/Lists with 29 | `Dask Array `_, 30 | `Dask DataFrame `_, 31 | `Dask Bag `_, 32 | to analyze large datasets with familiar techniques. 33 | This is similar to Databases, Spark_, or big array libraries 34 | 35 | 2. Custom task scheduling. You submit a graph of functions that depend on 36 | each other for custom workloads. This is similar to Luigi_, Airflow_, 37 | Celery_, or Makefiles_ 38 | 39 | Most people today approach Dask assuming it is a framework like Spark, designed 40 | for the first use case around large collections of uniformly shaped data. 41 | However, many of the more productive and novel use cases fall into the second 42 | category where Dask is used to parallelize custom workflows. 43 | 44 | In the real-world applications above we see that people end up using both 45 | sides of Dask to achieve novel results. 46 | 47 | Contributing 48 | ------------ 49 | 50 | If you solve interesting problems with Dask then we want you to share your 51 | story. Hearing from experienced users like yourself can help newcomers quickly 52 | identify the parts of Dask and the surrounding ecosystem that are likely to be 53 | valuable to them. 54 | 55 | Stories are collected as pull requests to `github.com/dask/dask-stories 56 | `_. You may wish to read a few of the 57 | stories above to get a sense for the typical level of information. There is a 58 | template in the repository with suggestions, but you can also structure your 59 | story a different way. 60 | 61 | .. toctree:: 62 | :maxdepth: 1 63 | 64 | template.md 65 | 66 | .. _Airflow: https://airflow.apache.org/ 67 | .. _Luigi: https://luigi.readthedocs.io/en/latest/ 68 | .. _Celery: http://www.celeryproject.org/ 69 | .. _Spark: https://spark.apache.org/ 70 | .. _Makefiles: https://en.wikipedia.org/wiki/Make_(software) 71 | -------------------------------------------------------------------------------- /icecube-cosmic-rays.md: -------------------------------------------------------------------------------- 1 | IceCube: Detecting Cosmic Rays 2 | ============================== 3 | 4 | Who am I? 5 | --------- 6 | 7 | I'm [James Bourbeau](https://github.com/jrbourbeau), I'm a graduate student in 8 | the Physics department at the University of Wisconsin at Madison. I work at 9 | the [IceCube South Pole Neutrino Observatory](https://icecube.wisc.edu/) 10 | studying the cosmic-ray energy spectrum. 11 | 12 | 13 | What problem am I trying to solve? 14 | ---------------------------------- 15 | 16 | Cosmic rays are energetic particles that originate from outer space. While they 17 | have been studied since the early 1900s, the sources of high-energy cosmic rays 18 | are still not well known. I analyze data collected by IceCube to study how the 19 | cosmic-ray spectrum changes with energy and particle mass; this can help provide 20 | valuable insight into our understanding of the origin of cosmic rays. 21 | 22 | This involves developing algorithms to perform energy reconstruction as well 23 | as particle mass group classification for events detected by IceCube. In 24 | addition, we use detector simulation and an iterative unfolding algorithm to 25 | correct for inherit detector biases and the finite resolution of our 26 | reconstructions. 27 | 28 | 29 | How Dask Helps us 30 | ----------------- 31 | 32 | I originally chose to use Dask because of the 33 | [Dask Array](https://docs.dask.org/en/latest/array.html) and 34 | [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html) data 35 | structures. 
I use Dask Dataframe to load thousands of 36 | [HDF](https://www.hdfgroup.org/) files and then apply further feature 37 | engineering and filtering data preprocessing steps. The final dataset can be 38 | up to 100GB in size, which is too large to load into our available RAM. So 39 | being able to easily distribute this load while still using the familiar 40 | pandas API has become invaluable in my research. 41 | 42 | Later I discovered the 43 | [Dask delayed](https://docs.dask.org/en/latest/delayed.html) iterface and now 44 | use it to parallelize code that doesn't easily conform to the Dask Array or 45 | Dask Dataframe use cases. For example, I often need to perform thousands of 46 | independent calculations for the pixels in a HEALPix sky map. I've found Dask 47 | delayed to be really useful for parallelizing these types of embarrassingly 48 | parallel calculations with minimal hassle. 49 | 50 | I also use several of the 51 | [diagnostic tools](https://docs.dask.org/en/latest/diagnostics-local.html) 52 | Dask offers such as the progress bar and resource profiler. Working in a large 53 | collaboration with shared computing resources, it's great to be able to 54 | monitor how many resources I'm using and scale back or scale up accordingly. 55 | 56 | 57 | Pain points of using Dask 58 | ------------------------- 59 | 60 | There were two main pain points I encountered when first using Dask: 61 | 62 | - Getting used to the idea of lazy computation. While this isn't an issue that 63 | is specific to Dask, it was something that took time to get used to. 64 | 65 | - Dask is a fairly large project with many components and it took some time to 66 | figure out how all the various pieces fit together. Luckily, the user 67 | documentation for Dask is quite good and I was able to get over this initial 68 | learning curve. 69 | 70 | 71 | Technology that we use around Dask 72 | ---------------------------------- 73 | 74 | We store our data in HDF files, which Dask has nice read and write support 75 | for. We also use several other Python data stack tools like Jupyter, 76 | scikit-learn, matplotlib, seaborn, etc. Recently, we've started experimenting 77 | with using HTCondor and the 78 | [Dask distributed scheduler](https://distributed.dask.org/en/latest/) to 79 | scale up to using hundreds of workers on a cluster. 80 | -------------------------------------------------------------------------------- /hydrologic-modeling.md: -------------------------------------------------------------------------------- 1 | NCAR: Hydrological Modeling 2 | =========================== 3 | 4 | Who am I? 5 | --------- 6 | 7 | I am [Joe Hamman](http://joehamman.com/about/) and I am a Project Scientist in the [Computational Hydrology Group](https://ncar.github.io/hydrology/) at the [National Center for Atmospheric Research](https://ncar.ucar.edu/). I am a core developer of the [Xarray](http://Xarray.pydata.org) project and a contributing member of the [Pangeo](http://pangeo-data.org/) project. I study subjects in the areas of climate change, hydrology, and water resource engineering. 8 | 9 | 10 | What problem am I trying to solve? 11 | ---------------------------------- 12 | 13 | Climate change will bring widespread impacts to the hydrologic cycle. 
We know this because many research studies, conducted over the past two decades, have shown what the first order effects of climate change will look like in managed and natural hydrologic systems in terms of things like water availability, drought, wildfire, extreme precipitation, and floods. However, we don't have a very good understanding of the characteristic uncertainties that come from our choice of tools that we use to estimate these changes. 14 | 15 | In the field of hydroclimatology, the tools we use are numerical models of the climate and hydrologic systems. These models can be constructed in many ways and it is often difficult understand how specific choices we make when building a model impact the inferences we can draw from them (e.g. the impact of climate change on flood frequency). We are working on methods to expose and constrain methodological uncertainties in the climate impacts modeling paradigm for water resource applications. This includes developing and analyzing large ensembles of climate projections and interrogating these ensembles to understand specific sources of uncertainty. 16 | 17 | 18 | How does Dask help? 19 | ------------------- 20 | 21 | The climate and hydrologic sciences rely heavily on data stored in formats like HDF5 and NetCDF. We often use [Xarray](http://xarray.pydata.org) as an interface and friendly data model for these formats. Under the hood, Xarray uses either NumPy or Dask arrays. This allows us to scale the same Xarray computations we would typically do in-memory using NumPy, to larger tasks using Dask. 22 | 23 | In my own research, I use Dask and Xarray as the core computational libraries for working with large datasets (10s-100s of terabytes). Often the operations we do with these datasets are fairly simple reduction operations where we may compare the average climate from two periods. 24 | 25 | Why we chose Dask 26 | ----------------- 27 | 28 | When working on scientific analysis tasks, I don't want to think about parallelizing my code. We chose to work with Dask because `Dask.array` was nearly drop-in-compatible with NumPy. This meant that adopting Dask, inside or outside of Xarray, was much easier than adopting another parallel framework. Along those same lines, Dask is well integrated with other key parts of the scientific Python stack, including Pandas, Scikit-Learn, etc. 29 | 30 | Pain points 31 | ----------- 32 | 33 | Originally deploying Dask on HPC systems was a bit of a pain. But this has 34 | gotten much easier. 35 | 36 | Additionally while Dask is easy to use, it's also easy to break. The freedom 37 | it provides also means that you have the freedom to shoot yourself in the foot. 38 | 39 | Also diagnosing performance issues can be more complex than when just using 40 | Numpy. It's still a bit of an art rather than a science. 
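To make the kind of computation described above concrete, here is a minimal sketch of a two-period climate comparison with Xarray backed by Dask arrays, with Dask's local diagnostics attached; the file pattern, variable name, years, and chunk size are hypothetical, not our actual configuration:

```python
import xarray as xr
from dask.diagnostics import ProgressBar, ResourceProfiler, visualize

# Lazily open a multi-file NetCDF dataset; "chunks" makes Xarray back it with Dask arrays.
ds = xr.open_mfdataset("runoff_*.nc", chunks={"time": 365})

# Mean climate over two (hypothetical) periods, then the difference between them.
historical = ds["runoff"].sel(time=slice("1971", "2000")).mean("time")
future = ds["runoff"].sel(time=slice("2071", "2100")).mean("time")
change = future - historical

# Nothing has been computed yet; trigger the work and record progress and resource usage.
with ProgressBar(), ResourceProfiler(dt=0.25) as rprof:
    result = change.compute()

visualize([rprof])  # renders a resource-usage timeline (requires bokeh)
```

When running with the distributed scheduler instead, the Dask dashboard provides a similar (and richer) view of what the workers are doing.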
41 | 42 | Technology we use around Dask 43 | ----------------------------- 44 | 45 | - We use [Xarray](https://xarray.pydata.org) to provide a higher level (and familiar) interface around Numpy arrays and Dask arrays 46 | - We use NetCDF and HDF files for data storage 47 | - I mostly work on HPC systems and have been helping develop the [dask-jobqueue](https://dask-jobqueue.readthedocs.io) package for deploying Dask on job queueing systems 48 | - In the [Pangeo](https://pangeo-data.github.io) project, we're exploring Dask applications using Kubernetes and Jupyter notebooks 49 | 50 | Other thoughts 51 | -------------- 52 | 53 | I'm quite interested in enabling more intuitive scientific analysis workflows, particularly when parallelization is required. Dask has been a big part of our efforts to facilitate a "beginning-to-end" workflow pattern on large datasets. 54 | -------------------------------------------------------------------------------- /datarevenue.md: -------------------------------------------------------------------------------- 1 | Discovering biomarkers for rare diseases in blood samples 2 | ========================================================= 3 | 4 | ![](dask-datarevenue.png) 5 | 6 | ## Who am I? 7 | I am Markus Schmitt and I am CEO at [Data Revenue](https://www.datarevenue.com/). We build custom machine learning solutions in a number of fields, including everything from medicine to car manufacturing. 8 | 9 | ## What problem am I trying to solve? 10 | It's hard for doctors to diagnose rare diseases based on blood tests. Usually patients are subjected to expensive and prolonged semi-manual gene testing. 11 | 12 | We are analysing thousands of blood samples, comparing sick and healthy patients. We look for biomarkers, compounds in the blood, such as iron, and these help doctors to identify people who have rare diseases (and those who do not). 13 | Currently, this is done offline, on historical samples, but with more work this could be used in real-time too: to analyse patients' blood and give them feedback faster and more cheaply than is currently possible. 14 | 15 | ## How does Dask help? 16 | We started the project without Dask, writing our own custom multiprocessing functionality. This was a burden to maintain, and Dask made it simple to switch over to thinking at a directed acyclic graph (DAG) level. It was great to stop thinking about individual cores. 17 | 18 | Dask has allowed us to run all of our analysis in parallel, shortening the overall feedback loop and letting us get results faster. 19 | We've found Dask to be extremely flexible. We have used it extensively to help with our distributed analysis, but Dask adds value for us in simpler cases too. We have systems which revolve around user-submitted jobs, and we can use Dask to help schedule these, whether or not 'big data' is involved. 20 | 21 | ## Why did I choose Dask? 22 | After it became clear that we were wasting significant time maintaining our custom multiprocessing code, we considered several alternatives before choosing Dask. We specifically considered 23 | - [Apache Flink](https://flink.apache.org/): we found this not only lacking some functionality that we needed, but also very complex to set up and maintain. 24 | - [Apache Spark](https://spark.apache.org/): we similarly found this to be a time sink in terms of set up and maintenance. 
25 | - [Apache Hadoop](https://hadoop.apache.org/): we found the MapReduce framework too restrictive, and it felt like we were always pushing round pegs into square holes. 26 | 27 | Ultimately, we chose Dask because it offered a great balance of simplicity and power. It is flexible enough to let us do nearly everything we need, but simple enough to not put a maintenance burden on our team. 28 | We use the Dask scheduler along with the higher level APIs. Our data fits well into a table structure, so the DataFrame API provides a lot of value for us. 29 | 30 | ## Pain points 31 | Dask has largely done exactly what we want, but we have had some memory issues. Sometimes, because of the higher layer of abstraction, it's not easy to predict exactly what data will be loaded into memory. 32 | In some cases, Dask loads far more data into memory than we expected, and we have to add custom logic to delay the processing of some tasks. These tasks are then only run when previous tasks have completed and memory is freed up. 33 | Because we work with medical data, we also have a strong focus on security and compliance. We found the Dask integration with Kubernetes to be lacking some features we needed around this. Specifically we had to apply some manual SSL patches to ensure that data was always transferred over SSL. 34 | But overall, wherever Dask has fallen short of our needs, it has been easy for us to patch these. Dask's architecture makes it easy to understand and change as needed. 35 | 36 | ## Technology I use around Dask 37 | We run Dask on Kubernetes on AWS, with batch processing managed by [Luigi](https://luigi.readthedocs.io/en/stable/). We use scikit-learn for machine learning, and [Dash](https://www.datarevenue.com/ml-tools/dash) for interactive GUIs and Dashboards. 38 | 39 | ## Anything else to know? 40 | We use Dask in many of our projects, from smaller ones all the way up to our largest integrations (for example with Daimler) and we have been impressed with how versatile it is. We process several billion records, amounting to many terabytes of data, through Dask every day. 41 | 42 | ## Links 43 | - [Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS](https://www.datarevenue.com/en-blog/pandas-vs-dask-vs-vaex-vs-modin-vs-rapids-vs-ray) 44 | - [How to Scale your Machine Learning Pipeline with Dask](https://www.datarevenue.com/en-blog/how-to-scale-your-machine-learning-pipeline) 45 | -------------------------------------------------------------------------------- /network-modeling.md: -------------------------------------------------------------------------------- 1 | Mobile Network Modeling 2 | ======================= 3 | 4 | Who am I? 5 | --------- 6 | 7 | I am [Sameer Lalwani](https://www.linkedin.com/in/lalwanisameer/), and I specialize in modeling wireless networks. 8 | 9 | We use these models to help operators with their technology decisions by building digital twins of their wireless networks. These models are used to quantify the impact of technology on user experience, network KPI & economics. 10 | 11 | 12 | The problem I'm trying to solve 13 | ------------------------------- 14 | 15 | Mobile network operators face uncertainty whenever a new technology emerges. Their existing networks are complex with several bands and pre-existing sites. They need to understand the implications of bringing a new technology into this mix. By modeling their networks we can help reduce the uncertainty that management faces, while giving them actionable insights about their network. 
16 | 
17 | For example, we are currently in the middle of a 4G to 5G transition, presenting operators with decisions like the following: 
18 | 
19 | - Which spectrum band should they bid on? 
20 | - What type of user experience and network performance can they expect? 
21 | - How much capital & operating expense will be required? 
22 | - What site locations should they target to meet coverage and capacity requirements? 
23 | - Should they consider refarming or acquiring new bands? 
24 | 
25 | These models are used to create digital twins at the scale of a city or a country. Based on known subscriber behavior and data from census, traffic, and other sources, "synthetic subscribers" are created. These subscribers can easily run into a few million, with every individual having its own physical location on a map. 
26 | We also bring in other GIS layers, such as building layouts and roads, to model indoor and outdoor subscribers. 
27 | 
28 | The network can also have anywhere from a few hundred to tens of thousands of cell sites. 
29 | 
30 | Our models calculate data rates at an individual subscriber level by computing RF path loss & SINR from all nearby sites. This takes our datasets into tens of millions of rows on which numerical computation needs to be done. 
31 | 
32 | We also run lots of scenarios on this network to understand its capabilities and limitations. 
33 | 
34 | 
35 | How Dask helps 
36 | -------------- 
37 | 
38 | ### Distributed computing and Dask delayed 
39 | 
40 | Dask Distributed helps run our scenarios across multiple machines while remaining within the memory constraints of each machine. We create approximately 10-20 machines on our VMware infrastructure, with one Linux machine running the Dask scheduler and all other machines running Dask workers with identical Conda environments. 
41 | 
42 | Our team runs Jupyter notebooks on their desktop machines, which connect to the Dask scheduler. Their notebooks work on their own datasets, but for specialized functions and scenarios their tasks get transparently sent to the Dask scheduler. Each site count scenario takes 10-15 minutes to run, so it's nice to have 40 of them running in parallel. 
43 | 
44 | ### Dask Dataframe 
45 | 
46 | We use LiDAR data sets to calculate line of sight for mmWave propagation from lamp posts. The LiDAR data sets for the full city are often too large to open on a single machine. Dask Dataframe allows us to pool the resources of multiple machines while keeping our logic similar to Pandas dataframes. 
47 | 
48 | We use Dask delayed to process multiple LiDAR files in parallel and then combine them into a single Dask Dataframe representing the full city. The line of sight calculation is CPU intensive, so it is nice to distribute it across multiple cores. 
49 | 
50 | One of the things we really appreciate about Dask is that it gives us the ability to focus on the model logic without worrying about scalability. Once the code is ready, it's seamless to call the functions using delayed and have them run across multiple CPU cores. 
51 | 
52 | 
53 | Pain points when using Dask 
54 | --------------------------- 
55 | 
56 | I find Dask Dataframe to be too slow for multi-column groupby operations. For such tasks I sort the Dataframe by the columns and partition it by the first column. I then use a delayed operation to apply pandas to each partition, which ends up being much faster. 
57 | 
58 | 
59 | Technology I use around Dask 
60 | ---------------------------- 
61 | 
62 | - GIS: Geopandas. We would love to see this natively supported. 
63 | - Traffic Flows: networkx 
64 | - Analytics: scikit-learn, scipy, pandas 
65 | - Visualization: Datashader with Dask distributed for LiDAR data. 
66 | - Charts: Holoviews, Seaborn 
67 | - Data Storage: HDF5 
68 | 
69 | All our deployments are done manually. We bring up these machines for our analytics with identical Conda environments and tear them down once we are done. 
70 | 
71 | Previously we used IPython parallel, but found that Dask was easier to set up, allowed us to write more complex logic, and let us share our computing pool between multiple users. 
72 | 
73 | 
74 | Links to this work 
75 | ------------------ 
76 | 
77 | Examples of models which use LiDAR to calculate line of sight: 
78 | 
79 | 1. [5G mmWave Coverage from Lamp posts](https://www.linkedin.com/pulse/how-much-5g-coverage-you-really-get-from-lampposts-sameer/) 
80 | 2. [mmWave Backhaul & Economics](https://www.linkedin.com/pulse/why-terragraph-mmwave-backhaul-essential-5g-lalwani-mba-ms-ee/) 
81 | -------------------------------------------------------------------------------- /sidewalk-labs.md: -------------------------------------------------------------------------------- 
1 | Sidewalk Labs: Civic Modeling 
2 | ============================= 
3 | 
4 | Who am I? 
5 | --------- 
6 | 
7 | I'm [Brett Naul](https://github.com/bnaul). 
8 | I work at [Sidewalk Labs](https://www.sidewalklabs.com/). 
9 | 
10 | 
11 | What problem am I trying to solve? 
12 | ---------------------------------- 
13 | 
14 | My team @ Sidewalk ("Model Lab") uses machine learning models to study human 
15 | travel behavior in cities and produce high-fidelity simulations of the travel 
16 | patterns/volumes in a metro area. Our process has three main steps: 
17 | 
18 | - Construct a "synthetic population" from census data and other sources of 
19 | demographic information; this population is statistically representative of 
20 | the true population but contains no actual identifiable individuals. 
21 | 
22 | - Train machine learning models on anonymous mobile location data to 
23 | understand behavioral patterns in the region (what times do people go to 
24 | lunch, what factors affect an individual's likelihood to use public 
25 | transit, etc.). 
26 | 
27 | - For each person in the synthetic population, generate predictions from 
28 | these models and combine the resulting activities into a single 
29 | model of all the activity in a region. 
30 | 
31 | For more information see our blogpost [Introducing Replica: A Next-Generation Urban Planning Tool](https://medium.com/sidewalk-talk/introducing-replica-a-next-generation-urban-planning-tool-1b7425222e9e). 
32 | 
33 | 
34 | How Dask Helps 
35 | -------------- 
36 | 
37 | Generating activities for millions of synthetic individuals is extremely 
38 | computationally intensive; even with, for example, a 96-core instance, 
39 | simulating a single day in a large region initially took days. It was important 
40 | to us to be able to run a new simulation from scratch overnight, and scaling to 
41 | hundreds/thousands of cores across many workers with Dask let us accomplish our 
42 | goal. 
43 | 
44 | 
45 | Why we chose Dask originally, and how these reasons have changed over time 
46 | -------------------------------------------------------------------------- 
47 | 
48 | Our code consists of a mixture of legacy research-quality code and newer 
49 | production-quality code (mostly Python).
Before I started we were using Google 50 | Cloud Dataflow (Python 2 only, massively scalable but generally an astronomical 51 | pain to work with / debug) and multiprocessing (something like ~96 cores max). 52 | 53 | Dask let us scale beyond a single machine with only minimal changes to our data 54 | pipeline. If we had been starting from scratch I think it's likely we would 55 | have gone in a different direction (something like C++ or Go microservices, 56 | especially since we have strong Google ties), but from my perspective as a 57 | hybrid infrastructure engineer/data scientist, having all of our models in 58 | Python makes it easy to experiment and debug statistical issues. 59 | 60 | 61 | Some of the pain points of using Dask for our problem 62 | ----------------------------------------------------- 63 | 64 | There is lots of special dask knowledge that only I possess, for example: 65 | 66 | - In which formats we can serialize data that will allow for it to be 67 | reloaded efficiently? Sometimes we can use parquet, other times it should 68 | be CSVs so we can easily chunk them dynamically at runtime 69 | 70 | - Sometimes we load data on the client and scatter to the workers, and other 71 | times we load chunks directly on the workers 72 | 73 | - The debugging process is sufficiently more complicated compared to local 74 | code that it's harder for other people to help resolve issues that occur 75 | on workers 76 | 77 | - The scheduler has been the source of most of our scaling issues: when 78 | the number of tasks/chunks of data gets too large, the scheduler tends 79 | to fall over silently in some way. 80 | 81 | Some of these failures might be to Kubernetes (if we run out of RAM, we 82 | don't see an OOM error; the pod just disappears and the job will restart). 83 | We had to do some hand-tuning of things like timeouts to make things more 84 | stable, and there was quite a bit of trial and error to get to a relatively 85 | reliable state 86 | 87 | - This has more to do with our deploy process but we would sometimes 88 | end up in situations where the scheduler and worker were running 89 | different dask/distributed versions and things will crash when tasks 90 | are submitted but not when the connection is made, which makes it 91 | take a while to diagnose (plus the error tends to be something 92 | inscrutable like `KeyError: ...` that others besides me would have no 93 | idea how to interpret) 94 | 95 | 96 | Some of the technology that we use around Dask 97 | ---------------------------------------------- 98 | 99 | - Google Kubernetes Engine: lots of worker instances (usually 16 cores each), 1 100 | scheduler, 1 job runner client (plus some other microservices) 101 | - Make + Helm 102 | - For debugging/monitoring I usually kubectl port-forward to 8786 and 8787 103 | and watch the dashboard/submit tasks manually. The dashboard is not very 104 | reliable over port-forward when there are lots of workers (for some reason 105 | the websocket connection dies repeatedly) but just reconnecting to the pod 106 | and refreshing always does the trick 107 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # Configuration file for the Sphinx documentation builder. 4 | # 5 | # This file does only contain a selection of the most common options. 
For a 6 | # full list see the documentation: 7 | # http://www.sphinx-doc.org/en/master/config 8 | 9 | # -- Path setup -------------------------------------------------------------- 10 | 11 | # If extensions (or modules to document with autodoc) are in another directory, 12 | # add these directories to sys.path here. If the directory is relative to the 13 | # documentation root, use os.path.abspath to make it absolute, like shown here. 14 | # 15 | # import os 16 | # import sys 17 | # sys.path.insert(0, os.path.abspath('.')) 18 | 19 | 20 | # -- Project information ----------------------------------------------------- 21 | 22 | project = 'Dask Stories' 23 | copyright = '2018, Dask Community' 24 | author = 'Dask Community' 25 | 26 | # The short X.Y version 27 | version = '' 28 | # The full version, including alpha/beta/rc tags 29 | release = '' 30 | 31 | 32 | # -- General configuration --------------------------------------------------- 33 | 34 | # If your documentation needs a minimal Sphinx version, state it here. 35 | # 36 | # needs_sphinx = '1.0' 37 | 38 | # Add any Sphinx extension module names here, as strings. They can be 39 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 40 | # ones. 41 | extensions = [ 42 | "myst_parser" 43 | ] 44 | 45 | 46 | # Add any paths that contain templates here, relative to this directory. 47 | templates_path = ['_templates'] 48 | 49 | # The suffix(es) of source filenames. 50 | # You can specify multiple suffix as a list of string: 51 | # 52 | source_suffix = { 53 | '.rst': 'restructuredtext', 54 | '.txt': 'markdown', 55 | '.md': 'markdown', 56 | } 57 | 58 | # source_suffix = '.rst' 59 | 60 | # The master toctree document. 61 | master_doc = 'index' 62 | 63 | # The language for content autogenerated by Sphinx. Refer to documentation 64 | # for a list of supported languages. 65 | # 66 | # This is also used if you do content translation via gettext catalogs. 67 | # Usually you set "language" from the command line for these cases. 68 | language = None 69 | 70 | # List of patterns, relative to source directory, that match files and 71 | # directories to ignore when looking for source files. 72 | # This pattern also affects html_static_path and html_extra_path . 73 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 74 | 75 | # The name of the Pygments (syntax highlighting) style to use. 76 | # Commenting this out for now, if we register dask pygments, 77 | # then eventually this line can be: 78 | # pygments_style = "dask" 79 | 80 | 81 | # -- Options for HTML output ------------------------------------------------- 82 | 83 | # The theme to use for HTML and HTML Help pages. See the documentation for 84 | # a list of builtin themes. 85 | # 86 | html_theme = 'dask_sphinx_theme' 87 | 88 | # Theme options are theme-specific and customize the look and feel of a theme 89 | # further. For a list of options available for each theme, see the 90 | # documentation. 91 | # 92 | # html_theme_options = {} 93 | 94 | # Add any paths that contain custom static files (such as style sheets) here, 95 | # relative to this directory. They are copied after the builtin static files, 96 | # so a file named "default.css" will overwrite the builtin "default.css". 97 | html_static_path = ['_static'] 98 | 99 | # Custom sidebar templates, must be a dictionary that maps document names 100 | # to template names. 101 | # 102 | # The default sidebars (for documents that don't match any pattern) are 103 | # defined by theme itself. 
Builtin themes are using these templates by 104 | # default: ``['localtoc.html', 'relations.html', 'sourcelink.html', 105 | # 'searchbox.html']``. 106 | # 107 | # html_sidebars = {} 108 | 109 | 110 | # -- Options for HTMLHelp output --------------------------------------------- 111 | 112 | # Output file base name for HTML help builder. 113 | htmlhelp_basename = 'DaskStoriesdoc' 114 | 115 | 116 | # -- Options for LaTeX output ------------------------------------------------ 117 | 118 | latex_elements = { 119 | # The paper size ('letterpaper' or 'a4paper'). 120 | # 121 | # 'papersize': 'letterpaper', 122 | 123 | # The font size ('10pt', '11pt' or '12pt'). 124 | # 125 | # 'pointsize': '10pt', 126 | 127 | # Additional stuff for the LaTeX preamble. 128 | # 129 | # 'preamble': '', 130 | 131 | # Latex figure (float) alignment 132 | # 133 | # 'figure_align': 'htbp', 134 | } 135 | 136 | # Grouping the document tree into LaTeX files. List of tuples 137 | # (source start file, target name, title, 138 | # author, documentclass [howto, manual, or own class]). 139 | latex_documents = [ 140 | (master_doc, 'DaskStories.tex', 'Dask Stories Documentation', 141 | 'Dask Community', 'manual'), 142 | ] 143 | 144 | 145 | # -- Options for manual page output ------------------------------------------ 146 | 147 | # One entry per manual page. List of tuples 148 | # (source start file, name, description, authors, manual section). 149 | man_pages = [ 150 | (master_doc, 'daskstories', 'Dask Stories Documentation', 151 | [author], 1) 152 | ] 153 | 154 | 155 | # -- Options for Texinfo output ---------------------------------------------- 156 | 157 | # Grouping the document tree into Texinfo files. List of tuples 158 | # (source start file, target name, title, author, 159 | # dir menu entry, description, category) 160 | texinfo_documents = [ 161 | (master_doc, 'DaskStories', 'Dask Stories Documentation', 162 | author, 'DaskStories', 'One line description of project.', 163 | 'Miscellaneous'), 164 | ] 165 | -------------------------------------------------------------------------------- /pangeo.md: -------------------------------------------------------------------------------- 1 | Pangeo: Earth Science 2 | ===================== 3 | 4 | Who Am I? 5 | --------- 6 | 7 | I am [Ryan Abernathey](http://rabernat.github.io), a physical oceanographer and 8 | professor at [Columbia University](http://columbia.edu) / 9 | [Lamont Doherty Earth Observatory](http://ldeo.columbia.edu). 10 | 11 | I am a founding member of the [Pangeo Project](http://pangeo.io), an 12 | initiative aimed at coordinating and supporting the development of open source 13 | software for the analysis of very large geoscientific datasets such as 14 | satellite observations or climate simulation outputs. Pangeo is funded by 15 | [National Science Foundation Grant 16 | 1740648](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1740648&HistoricalAwards=false), 17 | of which I am the principal investigator. 18 | 19 | What Problem are We Trying to Solve? 20 | ------------------------------------ 21 | 22 | Many oceanographic and atmospheric science datasets consist of multi-dimensional 23 | arrays of numerical data, such as temperature sampled on a regular latitude, 24 | longitude, depth, time grid. These can be real data, observed by instruments 25 | like weather balloons, satellites, or other sensors; or they can be "virtual" 26 | data, produced by simulations. Scientists in these fields perform an extremely 27 | wide range of different analyses on these datasets. 
For example: 28 | 29 | - simple statistics like mean and standard deviation 30 | - principal component analysis of spatio-temporal variability 31 | - intercomparison of datasets with different spatio-temporal sampling 32 | - spectral analysis (Fourier transforms) over various space and time dimensions 33 | - budget diagnostics (e.g. calculating terms in the equation for heat conservation) 34 | - machine learning for pattern recognition and prediction 35 | 36 | Scientists like to work interactively and iteratively, trying out calculations, 37 | visualizing the results, and tweaking their code until they eventually settle on 38 | a result that is worthy of publication. 39 | 40 | The traditional workflow is to download datasets to a personal laptop or 41 | workstation and peform all analysis there. As sensor technology and computer 42 | power continue to develop, the volume of our datasets is growing exponentially. 43 | This workflow is not feasible or efficient with multi-terabyte datasets, and it 44 | is impossible with petabyte-scale datasets. The fundamental problem we are 45 | trying to solve in Pangeo is **how do we maintain the ability to perform 46 | rapid, interactive analysis in the face of extremely large datasets?** 47 | Dask is an essential part of our solution. 48 | 49 | How Dask Helps 50 | -------------- 51 | 52 | Our large multi-dimensional arrays map very well to Dask's `array` model. Our 53 | users tend to interact with Dask via [Xarray](http://xarray.pydata.org), which 54 | adds additional label-aware operations and group-by / resample capabilities. 55 | The Xarray data model is explicitly inspired by the Common Data Model format 56 | widely used in geosciences. Xarray has incorporated dask from very early in its 57 | development, leading to close integration between these packages. 58 | 59 | Pangeo provides configurations for deploying Jupyter, Xarray and Dask on 60 | high-performance computing clusters and cloud platforms. On these platforms, 61 | our users load data lazily using xarray from a variety of different storage 62 | formats and perform analysis inside Jupyter notebooks. Working closely with 63 | the Dask development team, we have tried to simplify the process of launching 64 | Dask clusters interactively by using packages such as 65 | [dask-kubernetes](https://github.com/dask/dask-kubernetes) and 66 | [dask-jobqueue](https://github.com/dask/dask-jobqueue). 67 | Users employ those packages to interactively launch their own Dask clusters 68 | across many nodes of the compute system. Dask then automatically parallelizes 69 | the xarray-based computations without users having to write much specialized 70 | parallel code. Users appreciate the Dask dashboard, which provides a visual 71 | indication of the progress and efficiency of their ongoing analysis. When 72 | everything is working well, Dask is largely transparent to the user. 73 | 74 | Why We Chose Dask Originally 75 | ---------------------------- 76 | 77 | Pangeo emerged from the Xarray development group, so Dask was a natural choice. 78 | Beyond this, Dask's flexibility is a good fit for our applications; as 79 | described above, scientists in this domain perform a huge range of different 80 | types of analysis. We need a parallel computing engine which does not strongly 81 | constrain the type of computations that can be performed nor require the user 82 | to engage with the details of parallelization. 83 | 84 | Pain Points 85 | ----------- 86 | 87 | Dask's flexibility comes with some overhead. 
88 | I have the impression that the size of the graphs our users generate, which 89 | can easily exceed a million tasks, is pushing the limits of the dask scheduler. 90 | It is not uncommon for the scheduler to crash, or to take an uncomfortably long 91 | time to process, when these tasks are submitted. Our workaround is mostly to 92 | fall back on the sort of loop-based iteration over large datasets that we had 93 | to do pre-Dask. All of this undermines the interactive experience we are trying 94 | to achieve. 95 | 96 | However, the first year of this project has made me optimistic about the future. 97 | I think the interaction between Pangeo users and Dask developers has been 98 | pretty successful. Our use cases have helped identify several performance 99 | bottlenecks that have been fixed at the Dask level. If this trend can continue, 100 | I'm confident we will be able to reach our desired scale (petabytes) and speed. 101 | 102 | A broader issue relates to onboarding of new users. While I said above that 103 | Dask operates transparently to the users, this is not always the case. Users 104 | used to writing loop-based code to process datasets have to be retrained around 105 | the delayed-evaluation paradigm. It can be a challenge to translate legacy code 106 | into a Dask-friendly format. Some sort of "cheat sheet" might be able to help 107 | with this. 108 | 109 | Technology around Dask 110 | ---------------------- 111 | 112 | [Xarray](https://xarray.pydata.org) is the main way we interact with Dask. We use the 113 | [`dask-jobqueque`](https://jobqueue.dask.org) and 114 | [`dask-kubernetes`](https://kubernetes.dask.org) projects heavily. 115 | 116 | We also use [Zarr](http://zarr.readthedocs.io) extensively for storage, 117 | especially on the cloud, where we also employ 118 | [`gcsfs`](https://gcsfs.readthedocs.io) and 119 | [`s3fs`](https://s3fs.readthedocs.io) to interface with cloud storage. 120 | 121 | 122 | Copyright and License 123 | --------------------- 124 | 125 | Copyright 2020 Ryan Abernathey. I license this work under a [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license. 126 | -------------------------------------------------------------------------------- /prefect-workflows.md: -------------------------------------------------------------------------------- 1 | Prefect: Production Workflows 2 | ============================ 3 | 4 | Who am I? 5 | --------- 6 | 7 | I am [Chris White](http://github.com/cicdw); I am the CTO 8 | at [Prefect](https://www.prefect.io), a company building the next generation of workflow automation platforms for data engineers and data 9 | scientists. In this role, I am the core developer of our [open source engine](https://github.com/PrefectHQ/prefect) 10 | which allows users to build, schedule and execute robust workflows. 11 | 12 | The Problem I'm trying to solve 13 | ------------------------------- 14 | 15 | Most teams are responsible for maintaining production workflows that 16 | are critical to the team's mission. Historically these workflows consisted 17 | largely of batch ETL jobs, but more recently include things such as 18 | deploying parametrized machine learning models, ad-hoc reporting, and 19 | handling event-driven processes. 
20 | 21 | Typically this means developers need a workflow system which can do things such as: 22 | - retry failed tasks 23 | - schedule jobs to run automatically 24 | - log detailed progress (and history) of the workflow 25 | - provide a dashboard / UI for inspecting system health 26 | - provide notification hooks for when things go wrong 27 | 28 | among many other things. We at Prefect like to think of a workflow system as 29 | a technical insurance policy - you shouldn't really notice it much when 30 | things are going well, but it should be maximally useful when things go wrong. 31 | 32 | Prefect's goal is to build the next generation workflow system. Older systems 33 | such as [Airflow](https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4) and Luigi are limited 34 | by their model of workflows as slow-moving, regularly scheduled, 35 | with limited inter-task communication. Prefect, on the other hand, embraces 36 | this new reality and makes very few assumptions about the nature and requirements of 37 | workflows, thereby supporting more dynamic use cases in both data engineering 38 | and data science. 39 | 40 | 41 | How Dask helps 42 | -------------- 43 | 44 | Prefect was designed and built with Dask in mind. Historically, workflow systems 45 | such as [Airflow](https://airflow.apache.org/) handled _all_ scheduling, of both 46 | workflows _and_ the individual tasks contained within the workflows. This pattern introduces a number of problems: 47 | - this puts an enormous burden on the central scheduler (it is scheduling _every single action_ taken in the system) 48 | - it adds non-trivial latency to task runs 49 | - in practice, this limits the amount of dynamicism workflows can have 50 | - it also tends to limit the amount of data tasks can share, as all information is routed through the central scheduler 51 | - it requires users to have an external scheduler service running to run their workflows at all! 52 | 53 | Instead, Prefect handles the scheduling of _workflows_, and lets Dask 54 | handle the scheduling and resource management of _tasks_ within each workflow. This 55 | provides a number of benefits out of the box: 56 | 57 | - **Task scheduling:** Dask handles all task scheduling within a workflow, allowing Prefect to incentivize smaller tasks which Dask schedules with millisecond latency 58 | - **"Dataflow":** because Dask handles serializing and communicating the appropriate information between Tasks, Prefect can support "dataflow" as a first-class pattern 59 | - **Distributed computation:** Dask handles allocating Tasks to workers in a cluster, allowing users to immediately realize the benefits of distributed computation with minimal overhead 60 | - **Parallelism:** whether running in a cluster or locally, Dask provides parallel Task execution off the shelf 61 | 62 | Additionally, because Dask is written in pure Python and has an active open source community, 63 | we can very easily get feedback on possible bugs, and even contribute to improving the software ourselves. 64 | 65 | To achieve this ability to run workflows with many tasks, we found that Dask's [Futures interface](https://docs.dask.org/en/latest/futures.html) 66 | serves us well. In order to support dynamic tasks (i.e., tasks which spawn other tasks), we rely on Dask [worker clients](http://distributed.dask.org/en/latest/task-launch.html?highlight=worker_client). 
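As a rough sketch of that pattern (generic Dask code with made-up function names, not Prefect's actual internals), a task running on a worker can open a `worker_client` and submit further tasks to the same scheduler:

```python
from dask.distributed import Client, worker_client

def score(item):
    return item + 1

def fan_out(items):
    # Runs on a worker; worker_client() connects back to the same scheduler,
    # so this task can submit and wait on child tasks of its own.
    with worker_client() as client:
        futures = client.map(score, items)
        return client.gather(futures)

if __name__ == "__main__":
    client = Client()                                # local cluster for illustration
    result = client.submit(fan_out, [1, 2, 3]).result()
    print(result)                                    # [2, 3, 4]
```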
We have also occasionally experimented with [Dask Queues](http://distributed.dask.org/en/latest/api.html?highlight=sharing%20futures#distributed.Queue) to implement more complicated behavior such as future-sharing and resource throttling, but are not currently using them (mainly for design reasons). 67 | 68 | Pain points when using Dask 69 | --------------------------- 70 | 71 | Our biggest pain point in using Dask has largely revolved around the ability (or lack 72 | thereof) to share futures between clients. To provide a concrete example, suppose we start with a 73 | list of numbers and, using [`client.map`](https://distributed.readthedocs.io/en/latest/api.html#distributed.Client.map) 74 | twice, we proceed to compute `x -> x + 1 -> x + 2` for each element of our list. When using only dask primitives and a single client, 75 | these computations proceed asychronously, meaning that the final computation of each branch 76 | can begin without waiting on the other middle computations, as in this schematic: 77 | 78 | ![Depth First Execution](depth-first.png) 79 | 80 | However, in Prefect, we aren't simply passing around Dask futures created from a single `Client` - when a [`map` operation](https://docs.prefect.io/guide/core_concepts/mapping.html#prefect-approach) occurs, the dask futures are actually created by a `worker_client` and attached to a Prefect `State` object. 81 | *Ideally*, we would leave these futures unresolved at this stage so that computation can proceed as above. However, because 82 | it is non-trivial to share futures between clients we must `gather` the futures with this same client, making 83 | our computation proceed in a "breadth-first" manner: 84 | 85 | ![Breadth first execution](breadth-first.png) 86 | 87 | This isn't the worst thing, but for longer pipelines it would be very nice to have the faster branches 88 | of the pipeline proceed with execution so that final results are produced earlier for inspection. 89 | 90 | **Update**: [As of Prefect 0.12.0](https://medium.com/the-prefect-blog/map-faster-mapping-improvements-in-prefect-0-12-0-7cacc3f14e16), Prefect now supports Depth First Execution when running on Dask. 91 | 92 | Technology we use around Dask 93 | ---------------------------- 94 | 95 | Our preferred deployment of Prefect Flows uses [dask-kubernetes](https://github.com/dask/dask-kubernetes) 96 | to spin up a short-lived Dask Cluster in Kubernetes. 97 | 98 | Otherwise, the logic contained within Prefect Tasks can be essentially arbitrary; 99 | many tasks in the system interact with databases, GCP resources, AWS, etc. 100 | 101 | 102 | Links 103 | ----- 104 | 105 | - [Prefect Repo](https://github.com/PrefectHQ/prefect) 106 | - [Prefect on Dask Example](https://docs.prefect.io/guide/tutorials/dask-cluster.html) 107 | - [Dask-Kubernetes](https://kubernetes.dask.org) 108 | - [Blog post on some Prefect / Dask improvements](https://medium.com/the-prefect-blog/map-faster-mapping-improvements-in-prefect-0-12-0-7cacc3f14e16) 109 | -------------------------------------------------------------------------------- /mosquito-sequencing.md: -------------------------------------------------------------------------------- 1 | Genome Sequencing for Mosquitos 2 | =============================== 3 | 4 | Who am I? 5 | --------- 6 | 7 | I'm [Alistair Miles](http://alimanfoo.github.io/about/) and I work for Oxford 8 | University [Big Data Institute](https://www.bdi.ox.ac.uk/) but am also 9 | affiliated with the [Wellcome Sanger Institute](https://www.sanger.ac.uk/). 
I
10 | lead the malaria vector (mosquito) genomics programme within the [malaria
11 | genomic epidemiology network](https://www.malariagen.net), an international network of
12 | researchers and malaria control professionals developing new technologies based
13 | on genome sequencing to aid in the effort towards malaria elimination. I also
14 | have a technical role as Head of Epidemiological Informatics for the [Centre
15 | for Genomics and Global Health](http://www.cggh.org/), which means I have some
16 | oversight and responsibility for computing and software architecture and
17 | direction within our teams at Oxford and Sanger.
18 | 
19 | 
20 | What problem am I trying to solve?
21 | ----------------------------------
22 | 
23 | Malaria is still a major cause of mortality, particularly in sub-Saharan
24 | Africa. Research has shown that the best way to reduce malaria is to control
25 | the mosquitoes that transmit malaria between people. Unfortunately, mosquito
26 | populations are becoming resistant to the insecticides used to control them.
27 | New mosquito control tools are needed. New systems for mosquito population
28 | surveillance/monitoring are also needed to help inform and adapt control
29 | strategies to respond to mosquito evolution. We have established a project to
30 | perform an initial survey of mosquito genetic diversity, by sequencing whole
31 | genomes of approximately 3,000 mosquitoes collected from field sites across 18
32 | African countries, [The Anopheles gambiae 1000 Genomes Project](https://www.malariagen.net/ag1000g).
33 | We are currently working to scale up our
34 | sequencing operations to be able to sequence ~10,000 mosquitoes per year, and
35 | to integrate genome sequencing into regular mosquito monitoring programmes
36 | across Africa and Southeast Asia.
37 | 
38 | 
39 | How does Dask help?
40 | -------------------
41 | 
42 | Whole genome sequence data is a relatively large-scale data resource, requiring
43 | specialised processing and analysis to extract key information, e.g.,
44 | identifying genes involved in the evolution of insecticide resistance. We use
45 | conventional bioinformatic approaches for the initial phases of data processing
46 | (alignment, variant calling, phasing); beyond that point, however, we switch to
47 | interactive and exploratory analysis using Jupyter notebooks.
48 | 
49 | Making interactive analysis of large-scale data work well is obviously a challenge,
50 | because inefficient code and/or use of computational resources vastly increases
51 | the time taken for any computation, destroying the ability of an analyst to
52 | explore many different possibilities within a dataset. Dask helps by providing
53 | an easy-to-use framework for parallelising computations, either across multiple
54 | cores on a single workstation, or across multiple nodes in a cluster. We have
55 | built a software package called
56 | [scikit-allel](http://scikit-allel.readthedocs.io/en/latest/) to help with our
57 | genetic analyses, and use Dask within that package to parallelise a number of
58 | commonly used computations.
59 | 
60 | 
61 | Why did I choose Dask?
62 | ----------------------
63 | 
64 | Normally the transition from a serial (i.e., single-core) implementation of any
65 | given computation to a parallel (multi-core) implementation requires the code
66 | to be completely rewritten, because parallel frameworks usually offer a
67 | completely different API, and managing complex parallel workflows is a
68 | significant challenge.
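As a minimal illustration of how small that rewrite can be with Dask (a hedged sketch, not scikit-allel's actual code), the same reduction written with NumPy and with dask.array:

```python
import numpy as np
import dask.array as da

# A toy matrix of 0/1 allele calls, shaped (variants, samples).
calls = np.random.randint(0, 2, size=(100_000, 100))

# NumPy: eager and single-threaded.
np_freq = calls.sum(axis=1) / calls.shape[1]

# Dask: the same expression, chunked and evaluated in parallel on compute().
dcalls = da.from_array(calls, chunks=(10_000, 100))
da_freq = (dcalls.sum(axis=1) / dcalls.shape[1]).compute()

assert np.allclose(np_freq, da_freq)
```
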
69 | 
70 | Originally Dask was appealing because it provided a familiar API,
71 | with the dask.array package following the numpy API (which we were already
72 | using) relatively closely. Dask also handled all the complexity of constructing
73 | and running complex, multi-step computational workflows.
74 | 
75 | Today, we're also interested in the flexibility Dask offers to initially
76 | parallelise over multiple cores in a single computer via multi-threading, and
77 | then switch to running on a multi-node cluster with relatively little change in
78 | our code. Thus computations can be scaled up or down with great convenience.
79 | When we first started using Dask we were focused on making effective use of
80 | multiple threads for working on a single computer; now, as data grows, we
81 | are moving data and computation into a cloud setting and looking to make use of
82 | Dask via Kubernetes.
83 | 
84 | 
85 | Pain points?
86 | ------------
87 | 
88 | Initially when we started using Dask in 2015 we hit a few bugs and some of the
89 | error messages generated by Dask were very cryptic, so debugging some problems
90 | was hard. However, the stability of the code base, the user documentation, and
91 | the error messages have improved a lot recently, and the sustained investment
92 | in Dask is clearly adding a lot of value for users.
93 | 
94 | It is still difficult to think about how to code up parallel operations over
95 | multidimensional arrays where one or more dimensions are dropped by the
96 | function being mapped over the data, but there is some inherent complexity
97 | there so probably not much Dask can do to help.
98 | 
99 | The Dask code base itself is tidy and consistent but quite hard to get into
100 | in order to understand and debug issues. Again, Dask is handling a lot of inherent
101 | complexity so maybe not much can be done.
102 | 
103 | 
104 | Technology I use around Dask
105 | ----------------------------
106 | 
107 | We are currently working on deploying both JupyterHub and Dask on top of
108 | Kubernetes in the cloud, following the approach taken in the [Pangeo
109 | project](http://pangeo-data.org/). We use Dask primarily through the
110 | scikit-allel package. We also use Dask together with the
111 | [Zarr](http://zarr.readthedocs.io/en/stable/) array storage library (in fact
112 | the original motivation for writing Zarr was to provide a storage library that
113 | enabled Dask to efficiently parallelise I/O-bound computations).
114 | 
115 | 
116 | 
117 | Anything else to know?
118 | ----------------------
119 | 
120 | Our analysis code is still quite heterogeneous, with some code making use of a
121 | bespoke approach to out-of-core computing which we developed prior to being
122 | aware of Dask, and the remainder using Dask. This is just a legacy of timing,
123 | with some work having started prior to knowing about Dask. With the stability
124 | and maturity of Dask now I am very happy to push towards full adoption.
125 | 
126 | One cognitive shift that this requires is for users to get used to lazy
127 | (deferred) computation. This can be a stumbling block to start with, but is
128 | worth the effort of learning because it gives the user the ability to run
129 | larger computations. So I have been thinking about writing a blog post to
130 | communicate the message that we are moving towards adopting Dask wherever
131 | possible, and to give an introduction to the lazy coding style, with examples
132 | from our domain (population genomics).
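For instance, a minimal sketch of that lazy style (the Zarr store name and computation here are hypothetical):

```python
import dask.array as da

# Nothing is read or computed yet; this only builds a graph of work.
gt = da.from_zarr("ag1000g_genotypes.zarr")  # hypothetical Zarr store
alt_count = (gt > 0).sum(axis=1)

# Only now are chunks read and the reduction run in parallel.
result = alt_count.compute()
```
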
There are also still quite a few
133 | functions in scikit-allel that could be parallelised via Dask but haven't yet
134 | been, so I still have an aspiration to work on that. I'm not sure when I'll get to
135 | these, but hopefully this conveys the intention to adopt Dask more widely and also to
136 | help train people in our immediate community to use it.
137 | 
--------------------------------------------------------------------------------
/satellite-imagery.md:
--------------------------------------------------------------------------------
1 | Satellite Imagery Processing
2 | ============================
3 | 
4 | Who am I?
5 | ---------
6 | 
7 | I am [David Hoese](http://github.com/djhoese) and I work as a software
8 | developer at the [Space Science and Engineering Center
9 | (SSEC)](https://www.ssec.wisc.edu/) at the University of Wisconsin-Madison.
10 | My job is to create software that makes meteorological data more accessible to
11 | scientists and researchers.
12 | 
13 | I am also a member of the open source
14 | [PyTroll](http://pytroll.github.io/) community, where I act as a core developer
15 | on the [SatPy](http://satpy.readthedocs.io/en/latest/) library. I use SatPy in
16 | my SSEC projects, Polar2Grid and Geo2Grid, which provide a simple command
17 | line interface on top of the features provided by SatPy.
18 | 
19 | 
20 | The Problem I'm trying to solve
21 | -------------------------------
22 | 
23 | Satellite imagery data is often hard to read and use because of the many
24 | different formats and structures that it can come in. To make satellite imagery
25 | more useful, the SatPy library wraps common operations performed on satellite
26 | data in simple interfaces. Typically, meteorological satellite data needs to go
27 | through some or all of the following steps:
28 | 
29 | - **Read:** Read the observed scientific data from one or more data files while
30 | keeping track of geolocation and other metadata to make the data the
31 | most useful and descriptive.
32 | - **Composite:** Combine one or more different "channels" of data to bring out
33 | certain features in the data. This is typically shown as RGB
34 | images.
35 | - **Correct:** Sometimes data has artifacts, from the instrument hardware or the
36 | atmosphere for example, that can be removed or adjusted.
37 | - **Resample:** Visualization tools often support a small subset of Earth
38 | projections and satellite data is usually not in those
39 | projections. Resampling can also be useful when wanting to do
40 | intercomparisons between different instruments (on the same
41 | satellite or not).
42 | - **Enhancement:** Data can be normalized or scaled in certain ways that make
43 | certain atmospheric conditions more apparent. This can also
44 | be used to better fit data into other data types (8-bit
45 | integers, etc.).
46 | - **Write:** Visualization tools typically only support a few specific file
47 | formats. Some of these formats are difficult to write or have small
48 | differences depending on what application they are destined for.
49 | 
50 | As satellite instrument technology advances, scientists have to learn how to
51 | handle more channels for each instrument and at spatial and temporal
52 | resolutions that were unheard of when they were learning how to use satellite
53 | data. If they are lucky, scientists may have access to a high-performance
54 | computing system, while the rest may have to settle for long execution times on
55 | their desktop or laptop machines.
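In SatPy, a typical pass through those steps looks roughly like this (the reader, composite, and area names below are placeholders that depend on the particular data):

```python
from satpy import Scene

# Hypothetical input granule; the reader name depends on the sensor/format.
scn = Scene(filenames=["/data/goes16/granule_001.nc"], reader="abi_l1b")

scn.load(["true_color"])               # read + composite + correct
local = scn.resample("my_target_area")  # resample to a target projection
local.save_datasets(writer="geotiff")  # enhance + write
```
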
By optimizing the parts of the processing
56 | that take a lot of time and memory, it is our hope that scientists can worry
57 | about the science and leave the annoying parts to SatPy.
58 | 
59 | 
60 | How Dask helps
61 | --------------
62 | 
63 | SatPy's use of Dask makes it possible to do calculations on laptops that
64 | used to require high-performance server machines. SatPy was originally
65 | drawn to [Xarray](http://xarray.pydata.org/en/stable/)'s `DataArray` objects
66 | for their metadata-storing functionality and their support for Dask arrays. We knew
67 | that our original usage of Numpy masked arrays was not scalable to the new
68 | satellite data being produced. SatPy has now switched to `DataArray`
69 | objects backed by Dask and leverages Dask's ability to do the following:
70 | 
71 | - **Lazy evaluation:** Software development is so much easier when you don't have
72 | to remove intermediate results from memory to process the next step.
73 | - **Task caching:** Our processing involves a lot of intermediate results that can
74 | be shared between different processes. When things are optimized in the
75 | Dask graph, it saves us as developers from having to code the "reuse" logic
76 | ourselves. It also means that intermediate results that are no longer
77 | needed can be disposed of and their memory freed.
78 | - **Parallel workers and array chunking:** Satellite data is usually compared by
79 | geographic location. So a pixel at one index is often compared with the
80 | pixel of another array at that same index. Splitting arrays into chunks
81 | and processing them separately provides us with a great performance
82 | improvement, and not having to manage which worker gets what chunk of the
83 | array makes development effortless.
84 | 
85 | Benefiting from all of the above lets us create amazing high-resolution RGB
86 | images in 6-8 minutes on 3-year-old laptops, where SatPy's old Numpy
87 | implementation would have taken 35+ minutes just to crash from memory limitations.
88 | 
89 | 
90 | Pain points when using Dask
91 | ---------------------------
92 | 
93 | 1. Dask arrays are not Numpy arrays. Almost everything is supported or is
94 | close enough that you get used to it, but not everything. With most things you
95 | can get away with it and get perfectly good performance; with others you may end up
96 | computing your arrays multiple times in just a couple of lines of code without
97 | knowing it. Sometimes I wish that there was a Dask feature to
98 | raise an exception if your array is computed without you specifically
99 | saying it was OK.
100 | 
101 | 2. Common satellite data formats, like GeoTIFF, can't always be
102 | written to by multiple writers (multiple nodes on a cluster), and some
103 | aren't even thread-safe. Opening a file object and using it with
104 | `dask.array.store` may work with some schedulers and not others.
105 | 
106 | 3. Dimension changes are a pain. Satellite data processing sometimes involves
107 | lookup tables, to save on bandwidth limitations when sending data from the
108 | satellite to the ground or in other similar situations. Having to use lookup
109 | tables, including something like a KDTree, can be really difficult and
110 | confusing to code with Dask and get right. It typically involves using
111 | `atop`, `map_blocks`, or sometimes suffering the penalty of passing things
112 | to a `Delayed` function where the entire data array is passed as one
113 | complete memory-hungry array.
114 | 
115 | 4. 
A lot of satellite processing seems to perform better with the default
116 | threaded Dask scheduler than with the distributed scheduler, due to the nature of
117 | the problems being solved. A lot of processing, especially the creation of
118 | RGB images, requires comparing multiple arrays in different ways and can
119 | suffer from the amount of communication between distributed workers. There
120 | isn't an easy way that I know of to control where things are processed and
121 | which scheduler to use without requiring users to know detailed information
122 | about the internals of Dask.
123 | 
124 | 
125 | Technology I use around Dask
126 | ----------------------------
127 | 
128 | As mentioned earlier, SatPy uses [Xarray](http://xarray.pydata.org/en/stable/) to
129 | wrap most of our Dask operations when possible. We have other useful tools that
130 | we've created in the PyTroll community to help support deploying satellite
131 | processing tools on servers, but they are not specific to Dask.
132 | 
133 | 
134 | Links
135 | -----
136 | 
137 | - [PyTroll Community](http://pytroll.github.io/)
138 | - [SatPy](http://satpy.readthedocs.io/en/latest/)
139 | - [SatPy Examples](http://satpy.readthedocs.io/en/latest/examples.html)
140 | - [PyResample](http://pyresample.readthedocs.io/en/latest/)
141 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Dask Stories
2 | ============
3 | 
4 | This repository holds stories from experienced Dask users that are intended to
5 | help new users get a sense for how Dask might apply to their field.
6 | 
7 | If you are curious about Dask then please read the various submitted stories.
8 | 
9 | If you use Dask today to solve interesting problems, we would love to have you
10 | share your story. Hearing from experienced users like yourself can help
11 | newcomers quickly identify the parts of Dask and the surrounding ecosystem that
12 | are likely to be valuable to them.
13 | 
14 | 
15 | ## How to share your story
16 | 
17 | We welcome stories of any form. However, if you'd like some guidance then we
18 | recommend the following rubric. We've included suggestions of how to break up
19 | a story below, including two entirely fabricated stories alongside each set of
20 | instructions.
21 | 
22 | 
23 | ### Who are you?
24 | 
25 | Include your name, the project you work with, and your role within that
26 | project. Some examples:
27 | 
28 | - I'm [Joseph Chen](). I manage a quantitative quality control group at
29 | [XYZ automotive](), a company that makes automotive parts.
30 | - I'm [Alice Singh](). I'm a post-doctoral scholar at the University of
31 | Arizona. I work at the [National Solar Observatory]() studying sunspots and
32 | solar flares.
33 | 
34 | Links are welcome.
35 | 
36 | 
37 | ### What problem are you trying to solve?
38 | 
39 | Include context and detail here about the problem that you're trying to solve.
40 | Details are *very* welcome here. You're probably writing to someone within
41 | your own field so feel free to use technical speech. You shouldn't necessarily
42 | mention Dask here; focus on your problem instead.
43 | 
44 | #### Example: XYZ Automotive
45 | 
46 | XYZ automotive produces thousands of kinds of parts for cars and millions of
47 | each part. Many of these parts generate telemetry about how they're doing on a
48 | second-by-second basis.
We process this information over time to learn when
49 | parts might fail, and what kinds of activities might lead to failure. This
50 | helps us react in a variety of ways:
51 | 
52 | 1. We signal drivers that they should service their vehicle soon, reducing the
53 | chance of a serious problem while on the road
54 | 2. We inform automotive mechanics where the problem is, reducing the cost of
55 | diagnostics
56 | 3. We roll these discoveries back into the design process, helping our
57 | engineers develop solid products.
58 | 
59 | However, analyzing billions of time series is hard, especially when those time
60 | series come from thousands of different kinds of devices. This is both a "big
61 | data" problem and a "heterogeneous data" problem. We employ a team of data
62 | scientists to analyze this data. We use Pandas, Scikit-Learn, statsmodels, and
63 | [lifelines](http://lifelines.readthedocs.io/en/latest/) for survival analysis,
64 | along with some internal libraries. In particular, we found scaling
65 | survival analysis to be a challenge as our business grew.
66 | 
67 | #### Example: Solar Astronomers
68 | 
69 | My research analyzes correlation information from high-resolution solar
70 | astronomy data that might precede solar flare activity or correlate with
71 | sunspots. This both helps our understanding of basic science in
72 | Magneto-Hydro-Dynamics (MHD) and improves the durability of Earth's
73 | satellite fleet during dangerous solar storm events.
74 | 
75 | In practice, this means that we build algorithms to analyze a real-time stream
76 | of high-resolution images of the sun to predict future activity. We do image
77 | segmentation to find spots for testing data, and discrete wavelet transforms
78 | *both spatially and across time* to create features for downstream machine
79 | learning algorithms.
80 | 
81 | Our data is big. Single images can be hundreds of megabytes and we have years
82 | of them. This means hundreds of terabytes.
83 | 
84 | 
85 | ### How Dask Helps
86 | 
87 | Describe how Dask helps to solve this problem. Again, details are welcome.
88 | New readers probably won't know about specific APIs like "we use client.scatter"
89 | but probably will be able to follow terms used as headers in documentation like
90 | "we used dask dataframe and the futures interface together".
91 | 
92 | We also encourage you to mention how your use of Dask has *changed* over time.
93 | What originally drew you to the project? Is that still why you use it, or have
94 | your perception or needs changed?
95 | 
96 | #### Example: XYZ Automotive
97 | 
98 | Dask helps us solve our problem by parallelizing our internal timeseries
99 | libraries. We're in a weird position where we have "big data", but it's all
100 | very different, so standard projects like time-series databases, Apache Spark,
101 | or Dask Dataframe weren't really a good fit. However, we found that we could
102 | just use Pandas/Scikit-Learn/Lifelines code along with [Dask delayed]() to
103 | parallelize our existing solutions pretty easily. The code is still pretty
104 | much what we had before (which is good, we had a lot invested there) but now it
105 | operates well on larger datasets and uses all of our cores.
106 | 
107 | As a result, our large analysis workstations quickly became very
108 | popular and we've had to acquire more, but people seem pretty happy. We've
109 | looked into the distributed scheduler for cluster computing and it's
110 | interesting, but we haven't found a sufficient business need yet.
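The kind of change involved is small; a purely illustrative sketch (hypothetical file and column names) of wrapping existing per-part analysis code with Dask delayed:

```python
import dask
import pandas as pd

@dask.delayed
def analyze(csv_path):
    # Existing single-device Pandas logic, unchanged.
    df = pd.read_csv(csv_path)
    return df["failure_metric"].mean()  # stand-in for the real survival analysis

# One lazy task per device; Dask then runs them across all local cores.
results = dask.compute(*[analyze(path) for path in ["part_001.csv", "part_002.csv"]])
```
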
111 | 
112 | #### Example: Solar Astronomer
113 | 
114 | We use [Dask Arrays]() to process large stacks of images with the Numpy API.
115 | Most of my colleagues and I are comfortable with Numpy, and so the learning
116 | curve to parallelize with Dask Array was very easy. We have two common
117 | workloads:
118 | 
119 | - We combine Dask arrays with the Scikit-Image library to apply standard
120 | filters over image stacks in an embarrassingly parallel way. Nothing fancy
121 | here, but it's nice not to have to worry about saturating cores and keeping
122 | memory use low. Dask handles that for us.
123 | - We rechunk our images to be in blocks that include time, then we overlap
124 | them with halos so that neighboring blocks have a bit of nearby information,
125 | and then we apply more complex algorithms, mostly DWT; we're also starting
126 | to experiment with convolutional neural nets.
127 | 
128 | Finally, we also just use Dask array as we play around with new datasets. It's
129 | our go-to solution now for just hacking around on our laptop.
130 | 
131 | We're also starting to use Dask's futures interface while we prototype a
132 | real-time processing system for live analysis and alerts, but that's a separate
133 | project.
134 | 
135 | Originally most of this work started on my laptop with the local scheduler.
136 | However, as we've started doing more complex workloads, we've found that
137 | Dask is no longer able to do everything in low memory. As a result, we now run
138 | dask.distributed on our cluster on 10+TB image stacks. We've thought about
139 | going bigger, but don't yet have the allocation.
140 | 
141 | 
142 | ### Some of the pain points of using Dask for your problem
143 | 
144 | Dask has issues and it's not always the right solution for every problem. What
145 | are things that you ran into that you think others in your field should know
146 | ahead of time?
147 | 
148 | 
149 | #### Example: XYZ Automotive
150 | 
151 | When we started using parallelism, we found that large sections of our codebase
152 | didn't parallelize well. We used a lot of string comparisons that didn't play
153 | well with the GIL. We switched to using Dask's multiprocessing scheduler, but
154 | then communication costs were too high. Eventually we worked around this by
155 | using Pandas categoricals, which solve the problem pretty well, but only up to
156 | about 16 cores per process.
157 | 
158 | We also have some problems with diagnostics. The diagnostics on the local
159 | scheduler aren't as nice as those on the distributed scheduler. There are
160 | tradeoffs both ways.
161 | 
162 | 
163 | #### Example: Solar Astronomer
164 | 
165 | Getting new users comfortable with lazy evaluation took a bit of time.
166 | 
167 | When we first started, the overhead of overlapping computations with Dask array
168 | was really high. It looks like this has been mostly fixed in recent
169 | versions, though. Still, as we continue to scale out to larger datasets
170 | we feel like we always run into new kinds of overhead. It can be worked around
171 | and things are getting a lot better, but if you want to go to 100TB, expect to
172 | do some tuning. Fortunately the codebase is all Python, so we've been able to
173 | improve things and send patches upstream.
174 | 
175 | When using the distributed scheduler we had problems with our HPC cluster,
176 | which is pretty finicky about workers running out of RAM.
It took a bit of work to get the
177 | configuration right so that workers killed themselves before the cluster found
178 | out about them.
179 | 
180 | 
181 | ### Some of the technology you use around Dask
182 | 
183 | This might be other libraries that you use with Dask for analysis or data
184 | storage, cluster technologies that you use to deploy or capture logs, etc.
185 | Anything that you think someone like you would like to know about.
186 | 
187 | 
188 | #### Example: XYZ Automotive
189 | 
190 | We mostly use Pandas, Scikit-Learn, and Lifelines for computation. We get data
191 | in CSV and convert to Parquet using Arrow. We use the general PyData stack for
192 | plotting and such.
193 | 
194 | 
195 | #### Example: Solar Astronomer
196 | 
197 | We store our data in FITS files and use AstroPy to read it, but now we're
198 | looking at moving over to TIFF or Zarr. We've also just started looking at
199 | XArray, which has good Dask support and seems to have a strong community.
200 | 
201 | For cluster deployment we use PBS and the
202 | [dask-jobqueue](https://jobqueue.dask.org) project locally, though
203 | we're starting to look at storing data on AWS and using the [Dask-Helm
204 | chart](https://docs.dask.org/en/latest/setup/kubernetes-helm.html) or
205 | [dask-kubernetes](https://kubernetes.dask.org).
206 | 
--------------------------------------------------------------------------------