├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Shagun Sodhani

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Lambda-Architecture

Notes on Lambda Architecture

## Why Lambda Architecture

* We are generating data at an unprecedented rate. But data is not the same as knowledge.

* To extract useful insights from the data and to tame the three Vs of data (Volume, Velocity and Variety), we need to rethink our tools and design principles.
* The general transition is as follows:
  * RDBMS
  * Message Queues
  * Sharding

* New set of tools we have:
  * NoSQL Databases - Mongo, Cassandra, HBase
  * Highly Scalable Message Queues - Kafka
  * Distributed filesystems - HDFS
  * MapReduce Paradigm - Hadoop, Spark

* In this series of innovations and improvements, we have an alternate paradigm for Big Data computation - the Lambda Architecture.

## Principle

* We want to develop a system that can answer a query about the data, i.e. **query = function(all data)**

* Effectively, we want to be able to implement arbitrary functions over arbitrary data.

## Building Blocks

* Remember **query = function(data)**. So, in theory, to compute the result, we could run the `function` over the entire data every time.

* But this would be inefficient and expensive.

* So we precompute the query function over the data and store the result as a **batch view**. We can further index these **batch views** to serve query results very fast.

* **batch view = function(all data)**

* **query = function(batch view)**

* Lambda Architecture should have a component to compute the batch views. This component is called the **Batch Layer**.

### Batch Layer

* Batch Layer should be able to do two things:

  * Compute arbitrary functions over arbitrary data, i.e. implement **batch view = function(all data)**

  * Store a copy of the immutable, constantly growing, append-only master dataset

* In pseudo code:

```
function batch_layer():
    while (True):
        ingest_data()
        compute_batch_view()
        append_to_master_dataset()
```

* Regular (bulk, append-only) write workload.

* Can be easily scaled horizontally.

* Hadoop is an example of a batch processing system.
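As a toy illustration of **batch view = function(all data)** (not from the original notes; the page-view event format and function names are made up for this sketch), a batch layer could recompute a per-URL count view from scratch over the whole append-only master dataset on every run:

```python
from collections import Counter

# Hypothetical append-only master dataset: each record is (user, url).
master_dataset = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/about"),
]

def append_to_master_dataset(records, new_records):
    # Records are immutable and only ever appended, never updated in place.
    records.extend(new_records)

def compute_batch_view(records):
    # batch view = function(all data): recompute page-view counts
    # over the ENTIRE dataset, not incrementally.
    return Counter(url for _, url in records)

append_to_master_dataset(master_dataset, [("bob", "/home")])
batch_view = compute_batch_view(master_dataset)
print(batch_view["/home"])  # 3
```

Recomputing from scratch looks wasteful, but it is what makes the layer simple and human-fault tolerant: a bad view is fixed by just rerunning the function.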
* Now we need a component to query over these batch views.

### Serving Layer

* Indexes the batch views so that they can be queried efficiently.

* Effectively, the actual query has been broken down into precomputed sub-queries. These sub-queries were executed in the Batch Layer, and the results of those sub-queries are now being queried to generate the final result.

* In pseudo code:

```
function serving_layer():
    query_batch_view()
```

* When new batch views are available, the Serving Layer indexes them as well.

* Workload:

  * Writes at the time of batch updates.
  * Random read queries.

* Since no random writes are required, the indexing (and database) layer is very simple.

### Things so far

* We have a two-layer architecture - with a Batch and a Serving Layer.

* We can bake fault tolerance into both layers.

* In case there is any human error, the batch views can simply be recomputed from scratch.

* Since the Batch Layer can compute arbitrary functions over arbitrary datasets, the architecture is quite general and extensible.

* Further, we can use any combination of technologies for the two layers.

* This simple architecture is able to solve a lot of big data use cases.

* The architecture so far is, however, not complete.

* The Batch Layer can take a significant amount of time to process a batch of data, which makes it a high latency layer.

* While one batch is being processed, new incoming data is queued and picked up as part of the next batch. This means there is always some backlog of data which is not yet consumed.

* One solution: use smaller batches. But we could waste a lot of time in context switching and still have a backlog of records (in streaming cases). We will talk about this option a little later.
* Now, when we query the Serving Layer, the results do not account for the data which is yet to be processed, and this could be a problem depending on the use case.

* Basically, it is a problem for all the streaming/real-time use cases.

* So the next problem we have to solve is how to make the backlog data available for querying.

* We introduce one more layer, called the Speed Layer, to solve this problem.

### Speed Layer

* The Speed Layer is similar to the Batch Layer in terms of how it processes the incoming data, but there are some key differences:

  * The Speed Layer operates only on the most recent, un-batched data, while the Batch Layer operates on the entire data. Further, the Speed Layer works incrementally over the data. So if a new record is added to the un-batched data, the Speed Layer processes just the newly added record.

  * The Batch Layer can support more complex operations than the Speed Layer. This is partially because the Speed Layer must be low latency and partially because the Speed Layer does not have access to the entire data. So operations that require past data (e.g. get all current users who visited the site at least once in the last month) cannot be trivially supported by the Speed Layer.

  * The Batch Layer has a high latency while the Speed Layer has a low latency.

  * The Speed Layer is a stream processing system while the Batch Layer is, surprise surprise, a batch processing system.

* Spark Streaming and Storm are examples of real-time processing systems.

* In pseudo code:

```
function speed_layer():
    while (True):
        ingest_data()
        compute_speed_view()
```

* The Speed Layer needs to support both random reads and random writes. Hence, it requires a much more complicated database system.
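A minimal sketch (names and numbers invented for illustration) of the contrast: the Speed Layer updates a realtime view one record at a time, and a query merges it with the latest precomputed batch view:

```python
from collections import Counter

# Hypothetical precomputed batch view from the Batch Layer,
# and a realtime view covering only the un-batched records.
batch_view = Counter({"/home": 100, "/about": 40})
realtime_view = Counter()

def ingest_record(url):
    # Incremental update: only the newly arrived record is
    # processed, never the whole dataset.
    realtime_view[url] += 1

def query(url):
    # query = function(batch view, realtime view):
    # merge the stale-but-complete batch result with the
    # fresh-but-partial realtime result.
    return batch_view[url] + realtime_view[url]

ingest_record("/home")
ingest_record("/home")
print(query("/home"))   # 102
print(query("/about"))  # 40
```

The merge here is a simple sum because counts are additive; for other view types the merge logic can be considerably more involved.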
* Once the un-batched data is fed to the Batch Layer and the resulting batch view is passed to the Serving Layer, the corresponding results from the Speed Layer are flushed. The complexity of processing the most recent data is pushed to a layer whose results are temporary. This is known as **Complexity Isolation**.

  This ensures that every query reads only one copy of each record - either from the batch views or from the Speed Layer.

* In the case of any failure, the Speed Layer can be reset and all the un-batched data can be recomputed.

* The last piece of the architecture is how to merge the results from the Speed Layer and the Batch Layer.

## Variants

### Unified Lambda Architecture

* Use the same code for both the Batch and Speed Layers and combine their results almost transparently.
* Examples are systems like Apache Spark and Apache Flink.

### Free Lambda Architecture

* Implement each layer independently, maintain them independently, and write a system to merge the results depending on the use case/business logic.

## Limitations

* One should know whether they actually need the Lambda Architecture. A large number of big data use cases can be solved using just the Batch Layer.
--------------------------------------------------------------------------------