├── README.md
└── archive
    ├── 2021-07-29-hayes-wasi-data.md
    └── wasi-data-runtimes.png


/README.md:
--------------------------------------------------------------------------------
 1 | # Summary
 2 | 
 3 | The goal of this proposal is to define a mechanism for programs compiled to Wasm to represent and drive distributed algorithms, taking advantage of distributed storage and computation systems.
 4 | 
 5 | Apache Spark is one example of such a system. Its main abstraction is the resilient distributed dataset (RDD), a fault-tolerant collection of elements that can be operated on in parallel.
 6 | 
 7 | The attributes of Apache Spark that have led to its success include:
 8 | 
 9 | - Simple API
10 | - Pluggable
11 | - Support for many languages
12 | - Distributed compute
13 | - Fault tolerant
14 | - Lazy evaluation of compute and transformations
15 | 
16 | A number of distributed systems, such as databases, may benefit from an interoperable data API that is decoupled from existing systems like Apache Spark and Hadoop. By moving compute next to the data, we can not only simplify the system stack but also improve performance, security, safety, and resiliency.
17 | 
18 | ## Goals
19 | 
20 | Create a resilient and distributable data API that will lead to a portable, host- and language-independent ecosystem of composable WebAssembly modules that can run distributed algorithms.
21 | 
22 | - Resiliency
23 | - Enable AOT compilation
24 | - Portable
25 | - Host- and language-independent
26 | - Composable Wasm modules
27 | - Highly performant distributed computation
28 | - Write data pipelines in a familiar language
29 | - Large-scale batch and streaming data processing
30 | - Allow functions to operate on relations polymorphically, i.e. the columns and types may not be fixed when writing the function.
31 | - Allow for new types, functions, operators, etc.
32 | 
33 | ## Proposal
34 | 
35 | This design focuses on allowing a Wasm program to work with distributed data. It stands on the shoulders of other active proposals for Wasm and WASI, including interface types, modules, and shared-nothing linking. To represent distributed algorithms, wasi-data needs to provide a mechanism for a Wasm client to push a distributed query plan to its host and monitor that operation through to completion. There are many ways to achieve this goal, which is why we are still early in the design process for wasi-data. Here is what we are thinking so far:
36 | 
37 | First, we need a way to communicate a DAG of computation between client and host. We are evaluating multiple solutions to this problem:
38 | 
39 | 1. Define the DAG as a [Substrait](https://substrait.io/) plan with extensions for defining custom operators and expressions.
40 | 2. Create our own tree data structure that can be used to communicate a desired distributed operation from the client to the host.
41 | 3. Rather than supporting an entire logical plan, provide an API for registering Wasm-powered user-defined functions (UDFs), table-valued functions (TVFs), and user-defined aggregates (UDAs). Then provide an API for executing SQL queries and either collecting the results or streaming them into a temporary table for additional computation.
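
As a rough illustration of option 2, the sketch below models a plan as a small tree of operator nodes and serializes it for the host. Everything in it is an assumption made for illustration: `PlanNode`, `to_wire`, `ops_in_order`, the operator names, and the function names `parse_row` and `sum_by_key` are hypothetical, and JSON merely stands in for whatever wire format wasi-data might eventually choose.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class PlanNode:
    """One operator in a hypothetical logical plan (not a wasi-data type)."""
    op: str                                   # e.g. "scan", "map", "reduce"
    args: dict                                # operator-specific settings
    inputs: List["PlanNode"] = field(default_factory=list)  # upstream nodes


def to_wire(node: PlanNode) -> str:
    """Serialize the whole plan; JSON is only a placeholder wire format."""
    return json.dumps(asdict(node), sort_keys=True)


def ops_in_order(node: PlanNode) -> List[str]:
    """List operators in execution order (inputs before their consumers)."""
    ordered = []
    for upstream in node.inputs:
        ordered.extend(ops_in_order(upstream))
    ordered.append(node.op)
    return ordered


# Functions are referenced by name here; how the host would resolve those
# names to actual Wasm functions is left open.
scan = PlanNode("scan", {"table": "events"})
mapped = PlanNode("map", {"func": "parse_row"}, [scan])
plan = PlanNode("reduce", {"func": "sum_by_key"}, [mapped])

wire = to_wire(plan)
```

A host receiving `wire` could rebuild the tree and schedule each operator near the data it reads. Option 1 would replace this ad hoc structure with a Substrait plan.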
42 | 
43 | Second, we need a way to make lambdas a first-class concept in Wasm. This means passing a function reference to the host in a way that allows the host to serialize and send that function (with all its required state and dependencies) to another machine (or thread, or process) for execution. This allows a single Wasm module to define everything it needs for a complete distributed operation. We are evaluating the idea of embedding each lambda function into the Wasm file as a separate module to achieve this goal.
44 | 
45 | ## How does this relate to wasi-nn and wasi-parallel?
46 | 
47 | ### wasi-nn
48 | 
49 | Some distributed operations may need to use wasi-nn to execute ML/AI algorithms in a performant way. wasi-data may make additional optimizations to work well with wasi-nn; however, it will mostly be a higher-order concept.
50 | 
51 | ### wasi-parallel
52 | 
53 | Similar to wasi-nn, some distributed operations may use wasi-parallel to process vectors of data in parallel. This could be especially powerful if wasi-parallel gains partition-local access to GPUs, which is one of that proposal's goals. In this way, wasi-data and wasi-parallel will work together to define high-performance distributed algorithms that take advantage of whatever hardware is available in the cluster.
54 | 
55 | ## Similar projects
56 | 
57 | The following libraries and frameworks expose APIs that are similar in nature to wasi-data, and may also benefit from a language-agnostic API for distributed data computation:
58 | 
59 | - Pandas
60 | - Apache Flink
61 | - Timely Dataflow
62 | - Substrait
63 | - Apache Beam
64 | 


--------------------------------------------------------------------------------
/archive/2021-07-29-hayes-wasi-data.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | theme : "simple"
  3 | transition: "none"
  4 | ---
  5 | 
  6 | # wasi-data
  7 | 
  8 | Support for [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) algorithms and distributed computation for data streams
  9 | 
 10 | ---
 11 | 
 12 | ## Problem
 13 | 
 14 | * Input data is far beyond gigabyte-scale
 15 | * I/O-bound
 16 | * Distributed
 17 | * Must be resilient
 18 | 
 19 | note: The future is distributed
 20 | 
 21 | ---
 22 | 
 23 | ## API
 24 | 
 25 | ```java
 26 | DataSet<Row<A,B,C…>>
 27 | 
 28 | map(func (Row<A,B,C>) Row<…>)
 29 | 
 30 | reduce(Row<out>, func(Row<out>, Row<orig>) Row<out>)
 31 | ```
 32 | 
 33 | ---
 34 | 
 35 | ## map-reduce
 36 | 
 37 | Specialization of split-apply-combine
 38 | 
 39 | note: map-reduce brings compute to the data. Traditional parallel algorithms bring data to the compute.
 40 | note: The only time data is moved is when all of the parallel workers communicate their results (step 3)
 41 | 
 42 | ---
 43 | 
 44 | ![wasi-data runtime diagram. Any language compiles to a wasm module that exposes the wasi-data API, loads to a distributed runtime, then those modules are sent to worker nodes. It's the worker nodes that have the wasm runtime embedded and are managing the wasm processes](https://github.com/singlestore-labs/wasi-data/blob/main/archive/wasi-data-runtimes.png?raw=true)
 45 | 
 46 | ---
 47 | 
 48 | ## Real world frameworks based on map-reduce
 49 | 
 50 | There are so many
 51 | 
 52 | note: map-reduce is such a fundamental piece of distributed computation
 53 | 
 54 | ---
 55 | 
 56 | ## To name a few
 57 | 
 58 | * [Apache Hadoop](https://hadoop.apache.org/)
 59 | * [Apache Spark](https://spark.apache.org/)
 60 | * [Apache Flink](https://github.com/apache/flink)
 61 | * [Timely Dataflow](https://github.com/TimelyDataflow/timely-dataflow)
 62 | * [Apache Beam](https://beam.apache.org/)
 63 | * ...
 64 | * Google Cloud Dataflow
 65 | * IBM Streams
 66 | * [Twister2](https://twister2.org/)
 67 | * ...
 68 | 
 69 | ---
 70 | 
 71 | ## Distributed map-reduce
 72 | 
 73 | note: Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases
 74 | 
 75 | Requires an implementation to connect processes performing map and reduce phases.
 76 | 
 77 | * Distributed file system
 78 | * Distributed database
 79 | * Streaming from mappers to reducers
 80 | * Sharding
 81 | 
 82 | ---
 83 | 
 84 | ## Why Wasm and WASI
 85 | 
 86 | note: same reasons as everyone else: heterogeneous hardware and software, edge, server farms
 87 | 
 88 | * Portable
 89 | * Host and language-independent
 90 | * Reliability and Isolation
 91 | * Composable Wasm modules
 92 | * Highly performant distributed computation (SIMD, hardware acceleration)
 93 | 
 94 | ---
 95 | 
 96 | ## Example
 97 | 
 98 | ```java
 99 | createDataSet([
100 |     Row(a=1, b=2., c='string1', d=date(2000, 1, 1)),
101 |     Row(a=2, b=3., c='string2', d=date(2000, 2, 1)),
102 |     Row(a=4, b=5., c='string3', d=date(2000, 3, 1))
103 | ])
104 | 
105 | DataSet<...> input = // [...]
106 | DataSet<...> reduced = input
107 |   .groupBy(/*define key here*/)
108 |   .reduce(/*do something*/);
109 | ```
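
The pipeline above is pseudocode. A runnable, single-process approximation in plain Python might look like the sketch below. The helpers `create_data_set`, `group_by`, and `reduce_groups` are hypothetical stand-ins rather than a wasi-data API, and the grouping key and reducer are made-up examples. The reducer takes an initial value and a `func(acc, row)`, mirroring the `reduce` signature on the API slide.

```python
from collections import defaultdict
from datetime import date
from functools import reduce
from typing import NamedTuple


class Row(NamedTuple):
    a: int
    b: float
    c: str
    d: date


def create_data_set(rows):
    # Stand-in for createDataSet: a "data set" here is just a local list.
    return list(rows)


def group_by(rows, key):
    # Split step: bucket rows by the value of key(row).
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups


def reduce_groups(groups, initial, func):
    # Apply and combine steps: fold each bucket independently, which is the
    # part a distributed runtime could run on separate workers.
    return {k: reduce(func, rows, initial) for k, rows in groups.items()}


data = create_data_set([
    Row(a=1, b=2.0, c="string1", d=date(2000, 1, 1)),
    Row(a=2, b=3.0, c="string2", d=date(2000, 2, 1)),
    Row(a=4, b=5.0, c="string3", d=date(2000, 3, 1)),
])

# Example key and reducer: group by whether `a` is even, then sum `b`.
groups = group_by(data, key=lambda row: row.a % 2 == 0)
reduced = reduce_groups(groups, 0.0, lambda acc, row: acc + row.b)
print(reduced)  # {False: 2.0, True: 8.0}
```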
110 | 


--------------------------------------------------------------------------------
/archive/wasi-data-runtimes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singlestore-labs/wasi-data/331e7bf5b7a460ccc17d571703fc39b53f51ac49/archive/wasi-data-runtimes.png


--------------------------------------------------------------------------------