├── README.md
└── archive
    ├── 2021-07-29-hayes-wasi-data.md
    └── wasi-data-runtimes.png


/README.md:
--------------------------------------------------------------------------------
# Summary

The goal of this proposal is to define a mechanism for programs compiled to Wasm to represent and drive distributed algorithms, taking advantage of distributed storage and computation systems.

Apache Spark is one such example. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.

The attributes of Apache Spark that have led to its success include:

- Simple API
- Pluggable
- Support for many languages
- Distributed compute
- Fault tolerant
- Lazy evaluation of compute and transformations

A number of distributed systems, such as databases, may benefit from an interoperable data API that is decoupled from existing systems like Apache Spark and Hadoop. By moving compute next to the data, we can not only simplify the system stack, but also improve performance, security, safety, and resiliency.

## Goals

Create a resilient and distributable data API that will lead to a portable, host- and language-independent ecosystem of composable WebAssembly modules that can run distributed algorithms.

- Resiliency
- Enable AOT compilation
- Portable
- Host- and language-independent
- Composable Wasm modules
- Highly performant distributed computation
- Write data pipelines in a familiar language
- Large-scale batch and streaming data processing
- Allow functions to operate on relations polymorphically, i.e. the columns and types may not be fixed when writing the function.
- Allow for new types, functions, operators, etc.

## Proposal

This design is focused on allowing a Wasm program to work with distributed data.
This stands on the shoulders of other active proposals for Wasm and WASI, including interface types, modules, and shared-nothing linking. To represent distributed algorithms, wasi-data needs to provide a mechanism for a Wasm client to push a distributed query plan to its host and monitor that operation through to completion. There are many ways to achieve this goal, which is why we are still early in the design process for wasi-data. But here is what we are thinking so far:

First, we need a way to communicate a DAG of computation between client and host. We are evaluating multiple solutions to this problem:

1. Define the DAG as a [Substrait](https://substrait.io/) plan with extensions for defining custom operators and expressions.
2. Create our own tree data structure that can be used to communicate a desired distributed operation from the client to the host.
3. Rather than supporting an entire logical plan, provide an API for registering Wasm-powered user-defined functions (UDF), table-valued functions (TVF), and user-defined aggregates (UDA). Then provide an API for executing SQL queries and either collecting the results or streaming them into a temporary table for additional computation.

Second, we need a way to make lambdas a first-class concept in Wasm. This means passing a function reference to the host in a way that allows the host to serialize and send that function (with all its required state and dependencies) to another machine (or thread, or process) for execution. This allows a single Wasm module to define everything it needs for a complete distributed operation. We are evaluating the idea of embedding each lambda function into the Wasm file as a separate module to achieve this goal.

## How does this relate to wasi-nn and wasi-parallel?

### wasi-nn

Some distributed operations may need to use wasi-nn to execute ML/AI algorithms in a performant way.
wasi-data may make additional optimizations to work well with wasi-nn; however, it will mostly be a higher-order concept.

### wasi-parallel

Similar to wasi-nn, some distributed operations may use wasi-parallel to process vectors of data in parallel. This could be especially powerful if wasi-parallel could gain partition-local access to GPUs, which is one of its goals. In this way, wasi-data and wasi-parallel will work together to define extremely high-performance distributed algorithms that take advantage of whatever hardware is available in the cluster.

## Similar projects

The following libraries and frameworks expose APIs that are similar in nature to wasi-data, and may also benefit from a language-agnostic API for distributed data computation:

- Pandas
- Apache Flink
- Timely Dataflow
- Substrait
- Apache Beam

--------------------------------------------------------------------------------
/archive/2021-07-29-hayes-wasi-data.md:
--------------------------------------------------------------------------------
---
theme : "simple"
transition: "none"
---

# wasi-data

Support for [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) algorithms and distributed computation for data streams

---

## Problem

* Input data is far beyond gigabyte-scale
* I/O-bound
* Distributed
* Must be resilient

note: The future is distributed

---

## API

```java
DataSet<Row<…>>

map(func (Row) Row<…>)

reduce(Row, func(Row, Row) Row)
```

---

## map-reduce

Specialization of split-apply-combine

note: map-reduce brings compute to the data. Traditional parallel algorithms bring data to the compute.
note: The only time data is moved is when all of the parallel workers communicate their results (step 3)

---

![wasi-data runtime diagram. Any language compiles to a wasm module that exposes the wasi-data API, loads to a distributed runtime, then those modules are sent to worker nodes. It's the worker nodes that have the wasm runtime embedded and are managing the wasm processes](https://github.com/singlestore-labs/wasi-data/blob/main/archive/wasi-data-runtimes.png?raw=true)

---

## Real-world frameworks based on map-reduce

There are so many

note: map-reduce is such a fundamental piece of distributed computation

---

## To name a few

* [Apache Hadoop](https://hadoop.apache.org/)
* [Apache Spark](https://spark.apache.org/)
* [Apache Flink](https://github.com/apache/flink)
* [Timely Dataflow](https://github.com/TimelyDataflow/timely-dataflow)
* [Apache Beam](https://beam.apache.org/)
* ...
* Google Cloud Dataflow
* IBM Streams
* [Twister2](https://twister2.org/)
* ...

---

## Distributed map-reduce

note: Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases

Requires an implementation to connect processes performing map and reduce phases.

* Distributed file system
* Distributed database
* Streaming from mappers to reducers
* Sharding

---

## Why WASM and WASI

note: same reasons as everyone else: heterogeneous hardware and software, edge, server farms

* Portable
* Host- and language-independent
* Reliability and isolation
* Composable WASM modules
* Highly performant distributed computation (SIMD, hardware acceleration)

---

## Example

```java
createDataSet([
  Row(a=1, b=2., c='string1', d=date(2000, 1, 1)),
  Row(a=2, b=3., c='string2', d=date(2000, 2, 1)),
  Row(a=4, b=5., c='string3', d=date(2000, 3, 1))
])

DataSet<...> input = // [...]
DataSet<...> reduced = input
  .groupBy(/*define key here*/)
  .reduce(/*do something*/);
```

--------------------------------------------------------------------------------
/archive/wasi-data-runtimes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/singlestore-labs/wasi-data/331e7bf5b7a460ccc17d571703fc39b53f51ac49/archive/wasi-data-runtimes.png
--------------------------------------------------------------------------------