├── cassandra.md
├── choosing_a_database.md
├── database-foundation.md
├── database-research.pdf
├── drawio
│   ├── foundation_of_data_systems
│   └── storage_and_retrieval
├── images
├── ![]().png
├── 2020-10-09-08-07-27.png
├── 2020-10-09-08-22-44.png
├── 2020-10-09-08-23-40.png
├── 2020-10-09-08-25-53.png
├── 2020-10-09-08-26-23.png
├── 2020-10-09-09-45-09.png
├── 2020-10-09-09-51-37.png
├── 2020-10-15-10-14-29.png
├── 2020-10-15-10-33-23.png
├── 2020-10-15-10-41-34.png
├── 2020-10-15-11-02-23.png
├── cassandra
│ ├── 149f7c7b.png
│ ├── 2a482058.png
│ ├── 2e4598fe.png
│ ├── 3580057f.png
│ ├── 5abf55f1.png
│ └── e8f5bac5.png
├── choosing_database
│ ├── ab021a97.png
│ └── cf5618a7.png
├── foundation_of_data_systems.png
├── indexing.png
├── intensive
│ ├── 2020-09-30-11-24-08.png
│ ├── 2020-09-30-12-43-17.png
│ ├── 2020-09-30-12-45-56.png
│ ├── 2020-09-30-12-50-56.png
│ ├── 2020-09-30-13-39-23.png
│ ├── 2020-09-30-13-39-31.png
│ ├── 2020-09-30-13-58-58.png
│ ├── 2020-09-30-14-09-32.png
│ ├── 2020-09-30-14-12-27.png
│ ├── 2020-09-30-15-29-21.png
│ ├── 2020-09-30-15-33-44.png
│ ├── 2020-09-30-15-35-46.png
│ ├── 2020-09-30-15-53-48.png
│ ├── 2020-09-30-15-59-51.png
│ ├── 2020-09-30-21-37-42.png
│ ├── 2020-09-30-21-41-30.png
│ ├── 2020-10-01-09-54-22.png
│ ├── 2020-10-01-09-54-24.png
│ └── 2020-10-01-10-32-50.png
├── mongodb-replicaset
│ ├── 0dde5a21.png
│ ├── 10d422c0.png
│ ├── 2b6a83ef.png
│ ├── 3265315e.png
│ ├── 32cf40dc.png
│ ├── 378a5b1f.png
│ ├── 42075f95.png
│ ├── 53f7d13b.png
│ ├── 5de8cac9.png
│ ├── 648e69b1.png
│ ├── 676c1811.png
│ ├── 93653734.png
│ ├── 937a3729.png
│ ├── 99c0ddfc.png
│ ├── 9a3d1e67.png
│ ├── b2aa36d1.png
│ ├── cda7fe4a.png
│ ├── d1b63713.png
│ ├── de1f1c67.png
│ └── e7b2e770.png
├── pool.png
└── readme.md
│ ├── 2020-10-18-17-14-25.png
│ ├── 2020-10-18-17-14-32.png
│ ├── 2020-10-18-17-14-44.png
│ ├── 2020-10-18-17-22-30.png
│ ├── 2020-10-18-18-07-48.png
│ ├── 2020-10-18-18-11-26.png
│ ├── 2020-10-18-20-52-24.png
│ ├── 2020-10-18-20-53-08.png
│ ├── 2020-10-18-21-13-33.png
│ ├── 2020-10-18-21-13-59.png
│ ├── 2020-10-18-21-14-27.png
│ ├── 2020-10-18-21-14-37.png
│ ├── 2020-10-18-21-15-56.png
│ ├── 2020-10-18-21-17-15.png
│ ├── 2020-10-18-21-19-04.png
│ ├── 2020-10-19-07-27-09.png
│ ├── 2020-10-19-07-27-12.png
│ ├── 2020-10-19-08-11-48.png
│ ├── 2020-10-19-09-31-29.png
│ ├── 2020-10-19-09-46-37.png
│ ├── 2020-10-19-09-47-50.png
│ ├── 2020-10-19-09-51-23.png
│ ├── 2020-10-19-11-12-43.png
│ ├── 2020-10-19-15-25-19.png
│ ├── 2020-10-19-15-25-26.png
│ ├── 2020-10-19-16-56-48.png
│ ├── 2020-10-19-17-08-49.png
│ └── 2020-10-19-17-21-23.png
├── mongodb.md
├── nosql.md
├── postgresql.md
├── readme.md
├── relational_vs_nosql.md
├── scaling.md
└── storage_engine.md
/cassandra.md:
--------------------------------------------------------------------------------
1 | # Cassandra
2 | ## Why use Cassandra?
3 | - A good fit for several nodes, and a great fit if your application is expected to require `dozens of nodes`.
4 | - Lots of writes, statistics, and analysis. Consider your application from the perspective of the ratio of reads to writes: Cassandra offers `excellent throughput on writes`.
5 | - Out-of-the-box support for `geographical distribution` of data: it is easy to replicate data across multiple data centers.
6 |
7 | ## Cassandra key features
8 | - **Linear scalability**: You can keep adding nodes to the cluster and performance continues to scale with them
9 | - **Fault tolerance**: You don't have to worry about a master node going down and the whole system grinding to a halt
10 | - **Commodity hardware**: You do not have to buy specialized servers to run Cassandra; commodity hardware is enough
11 | - **Highly performant**: Designed for real-time applications, unlike traditional batch systems such as Hadoop, which run jobs at night in batches and are not designed for real-time workloads
12 |
13 | ## Database distribution
14 | To distribute the rows across the nodes, a partitioner is used. The partitioner uses an algorithm to determine which node a given row of data will go to
15 |
16 | 
17 |
18 | 
19 |
20 | 
21 |
22 | 
23 |
24 | ### Virtual nodes
25 | Instead of a node being responsible for just one token range, it is instead responsible for many small token ranges (by default, 256 of them)
26 |
27 | - Virtual nodes were created to make it easier to add new nodes to a cluster while keeping the cluster balanced
28 | - When a new node is added, it receives many small token range slices from the existing nodes, to maintain a balanced cluster
29 |
30 | 
31 |
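The partitioning scheme above can be sketched as a toy consistent-hash token ring with virtual nodes. Everything here is illustrative: the node names are made up, `md5` stands in for Cassandra's Murmur3 partitioner, and 8 vnodes per node stand in for the default 256.

```python
import hashlib
from bisect import bisect_right

class TokenRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node owns `vnodes` small token ranges on the ring.
        self.ring = sorted(
            (self._token(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _token(key):
        # Hash to a 64-bit token (md5 here for brevity; Cassandra uses Murmur3).
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, partition_key):
        # A row goes to the node owning the first token >= hash(key), wrapping around.
        i = bisect_right(self.tokens, self._token(partition_key)) % len(self.tokens)
        return self.ring[i][1]

ring = TokenRing(["node1", "node2", "node3"])
print(ring.node_for("user:42"))  # deterministic: always the same node for this key
```

Because each node owns many small slices, adding a node only means inserting its vnode tokens into the ring, so existing nodes each give up a few small ranges and the cluster stays balanced.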
32 | ## Use cases
33 | - Sensor data
34 | - Recommendation engine, gaming, fraud detection, location based services
35 |
36 | ## Data modeling [1]
37 | ### Query-first design
38 | In Cassandra you don't start with the data model; you start with the query model
39 |
40 | ## Cassandra configuration
41 | ### Snitch
42 | The snitch is how the nodes in a cluster know about the topology of the cluster
43 | 
44 |
45 | ## Read/write data [2]
46 |
47 |
48 | ### Gossip
49 | Once every second, each node communicates with up to three other nodes, exchanging information about itself and all the other nodes that it has information about
50 |
51 | - Gossip is the **internal** communication method for nodes in a cluster to talk to each other
52 | - For **external communication** such as an application to a database, CQL or Thrift are used
53 |
54 |
55 |
56 | ## Reference
57 | [1] [Cassandra: The Definitive Guide](https://learning.oreilly.com/library/view/cassandra-the-definitive/9781098115159/)
58 |
59 | [2] https://www.udemy.com/course/apache-cassandra/
60 | https://www.javatpoint.com/use-cases-of-cassandra
--------------------------------------------------------------------------------
/choosing_a_database.md:
--------------------------------------------------------------------------------
1 | # Choosing database
2 |
3 | [7 Database Paradigms - YouTube](https://www.youtube.com/watch?v=W2Z7fbCLSTw)
4 |
5 | ## Column based
6 | - `Decentralized and can scale horizontally`; a popular use case is storing a large amount of time-series data, like records from an IoT device, a weather sensor, or, in the case of Netflix, a history of the different shows you watched
7 | - Often used in situations where you have `frequent writes` but `infrequent updates and reads`
8 | - When you need to retrieve columns of data in one disk block to reduce disk I/O.
9 | - It's not going to be your primary app database; for that you need something more general purpose
10 |
11 | ## Document oriented database
12 | Tradeoff: schemaless, relational-ish queries without joins
13 |
14 | 
15 |
16 | Reads from a frontend application are much faster; however, writing or updating data tends to be more complex. Document databases are far more general purpose than the other options
17 |
18 | For developers, they are easier to use and a good fit for most apps, games, IoT, and many other use cases. If you are not sure how your data is structured at this point, a document database is probably the right choice
19 |
20 |
21 | ## Graph database
22 | 
23 |
24 | ## Choosing your database
25 | Two reasons to consider a NoSQL database: programmer productivity and data access performance.
26 | - To improve programmer productivity by using a database that better matches an application’s needs.
28 | - To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput.
30 |
31 | It's essential to test your expectations about programmer productivity and performance before committing to using a NoSQL technology
32 |
33 | ### Sticking with the default
34 | There are many cases you’re better off sticking with the default option of a relational database:
35 | - You can easily find people with experience of using them.
36 | - They are mature, so you are less likely to run into the rough edges of new technology.
37 | - Picking a new technology will always introduce a risk of problems should things run into difficulties.
38 |
39 |
40 | ---
41 | ### SQL
42 | Traditional SQL databases are missing two important capabilities: linear write scalability (i.e. automatic sharding across multiple nodes) and automatic, zero-data-loss failover.
43 |
44 | This means data volumes ingested cannot exceed the max write throughput of a single node.
45 |
46 | Additionally, some temporary data loss should be expected on failover.
47 |
48 | Zero-downtime upgrades are also very difficult to achieve in the SQL database world.
49 |
50 | ### NoSQL
51 | NoSQL DBs are usually distributed in nature where data gets partitioned or sharded across multiple nodes.
52 | They mandate denormalization, which means inserted data also needs to be copied multiple times to serve the specific queries you have in mind.
--------------------------------------------------------------------------------
/database-foundation.md:
--------------------------------------------------------------------------------
1 | ## 1. Foundation of data systems [1]
2 | 
3 |
4 | ### 1.1. Reliability
5 | - `Tolerate` hardware and software faults
6 | - The system should continue to work correctly even in the face of adversity.
7 |
8 | #### 1.1.1. Hardware faults
9 | Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines:
10 | - Hard disk crash
11 | - RAM becomes faulty
12 | - The power grid has a blackout
13 | - Someone unplugs the wrong network cable.
14 |
15 | We can achieve `reliability` by using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy.
17 |
18 | For example: a single-server system requires planned `downtime` if you need to reboot the machine, whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system.
20 |
21 | #### 1.1.2. Software errors
22 | - Hardware faults are random and independent of each other: it is unlikely that a large number of hardware components will fail at the same time. Software faults, in contrast, tend to be correlated, for example:
23 | - A software bug that causes every instance of an application server to crash when given a particular bad input
24 | - A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses
25 | - Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults
29 |
30 | Solutions:
31 | - Carefully thinking about assumptions and interactions in the system
33 | - Thorough testing
34 | - Process isolation
35 | - Allowing processes to crash and restart
36 | - Measuring, monitoring, and analyzing system behavior in production
37 |
38 | #### 1.1.3. Human errors
39 | Configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages
41 |
42 | How we make our systems reliable:
43 | - Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests
45 | - Make it fast to roll back configuration changes, and roll out new code gradually
47 | - Set up detailed and clear monitoring, such as performance metrics and error rates
49 |
50 | #### 1.1.4. How important is reliability?
51 | - Bugs in business applications cause `lost productivity`
52 | - Outages of ecommerce sites can have huge costs in terms of `lost revenue and damage to reputation`.
54 |
55 |
56 | ### 1.2. Scalability
57 | - Measuring load & performance
58 | - Latency percentiles, throughput
59 | - As the system grows, there should be reasonable ways of dealing with that growth. Scalability is the term we use to describe a system’s ability to cope with increased load.
61 |
62 | #### 1.2.1. Describing Load
63 | - Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak)
65 |
66 | - Home timeline: A user can view tweets posted by the people they follow (300k requests/sec)
67 |
68 | 
69 |
70 | 
71 |
72 | #### 1.2.2. Describing performance
73 | If the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds
75 |
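That percentile definition can be made concrete with a few lines of Python using the nearest-rank method; the response times below are made-up sample data.

```python
def percentile(samples, p):
    """Return the value below which p% of the samples fall (nearest-rank method)."""
    ordered = sorted(samples)
    # Nearest rank: the ceil(p/100 * N)-th ordered sample, 1-indexed.
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

# 100 made-up response times in milliseconds: mostly fast, with a slow tail.
times = [50] * 90 + [200] * 5 + [1500] * 5
print(percentile(times, 50))   # -> 50   (median)
print(percentile(times, 95))   # -> 200  (95th percentile)
print(percentile(times, 99))   # -> 1500 (the slow tail dominates high percentiles)
```

Note how the median hides the tail entirely; the high percentiles are what the slowest users actually experience.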
76 | ### 1.3. Maintainability
77 | - Operability
78 | - Simplicity & evolvability
79 |
80 | #### 1.3.1. Operability
81 | Make it easy for operations teams to keep the system running smoothly.
82 |
83 | #### 1.3.2. Simplicity
84 | Make it easy for new engineers to understand the system, by removing as much complexity as possible. (Note this is not the same as simplicity of the user interface.)
87 |
88 | #### 1.3.3. Evolvability
89 | Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
91 |
92 | 
93 |
94 | ## 2. Data model and query languages
95 | ### 2.1. The Birth of NoSQL
96 | - A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
97 | - Specialized query operations that are not well supported by the relational model
98 | - Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model
100 | 
101 | 
102 |
103 |
104 | ### 2.2. Document data model and JSON
105 | The JSON representation has better `locality` than the multi-table schema. If you want to fetch a profile in the relational example, you need to either perform multiple queries or perform a messy multi-way join between the users table and its subordinate tables. In the JSON representation, all the relevant information is in one place, and one query is sufficient.
107 |
108 | - If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model.
109 |
110 | #### 2.2.1. Advantage
111 | - Flexibility in the document model: Do not enforce any schema on the data in documents
112 | - Data locality for queries: If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality. But it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document
117 |
118 |
119 | #### 2.2.2. Limitation
120 | - You cannot refer directly to a nested item within a document; instead you need to say something like “the second item in the list of positions for user 251”
121 | - Poor support for joins.
122 | - Many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred at which time
123 |
124 | ### 2.3. Schema changes
125 | 
126 |
127 | Schema changes have a bad reputation of being slow and requiring downtime.
128 |
129 | MySQL is a notable exception: it copies the entire table on `ALTER TABLE`, which can mean minutes or even hours of downtime when altering a large table, although various tools exist to work around this limitation
133 |
134 | #### MapReduce Querying
135 | MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google
137 | 
138 |
139 | 
140 |
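A toy word count shows the model's map and reduce phases in plain Python. This is single-process only; the shuffle step that a real framework performs across machines is simulated here with a sort and group-by.

```python
from itertools import groupby

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all counts emitted for one word.
    return (word, sum(counts))

docs = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": collect all intermediate pairs and group them by key,
# as the framework would do between the map and reduce phases.
pairs = sorted(kv for doc in docs for kv in map_phase(doc))
result = dict(
    reduce_phase(word, (count for _, count in group))
    for word, group in groupby(pairs, key=lambda kv: kv[0])
)
print(result["the"])  # -> 3
print(result["fox"])  # -> 2
```

The appeal of the model is that `map_phase` and `reduce_phase` are pure functions, so the framework is free to run them on many machines in parallel.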
141 | ### 2.4. Graph-Like Data Models
142 | - If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
144 |
145 | - If many-to-many relationships are very common in your data, the relational model can handle simple cases, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph
148 |
149 | A graph consists of two kinds of objects: vertices and edges. Many kinds of data can be modeled as a graph. Typical examples include:
150 |
151 | - Social graphs: Vertices are people, and edges indicate which people know each other.
152 | - The web graph: Vertices are web pages, and edges indicate HTML links to other pages.
153 | Well-known algorithms can operate on these graphs:
154 | - Car navigation system: search for the shortest path between two points in a road network
155 | - PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results.
157 |
158 | ## 3. Storage and Retrieval
159 | - Why should you, as an application developer, care how the database handles storage and retrieval internally? Because you need to select a storage engine that is appropriate for your application, from the many that are available.
160 | - In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.
161 | - There is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics.
162 |
163 | Two families of storage engines:
164 | - Log-structured storage engines
165 | - Page oriented storage engines such as B-Trees.
166 |
167 | ### 3.1. Hash index
168 | - Is usually implemented as a hash map.
169 | - Let’s say our data storage consists only of appending to a file. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote
172 | - When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.
173 | 
174 | How do we avoid eventually running out of disk space?
175 | - break the log into segments of a certain size by closing a segment file when it reaches a certain size
176 | - making subsequent writes to a new segment file
177 | - We can then perform compaction on these segments. Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.
179 | 
180 | - merge several segments together at the same time as performing the compaction
181 | 
182 | Hash index implementation, append-only log: The order of key-value pairs in the file does not matter
183 | - File format: It’s faster and simpler to use a binary format
184 | - Deleting records: If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone)
185 | - Crash recovery: If the database is restarted, the in-memory hash maps are lost. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
187 | - Partially written records: The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
189 | - Concurrency control: As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread
190 |
191 | #### Append-only design turns out to be good for several reasons
192 | - Appending and segment merging are sequential write operations, which are generally much faster than random writes
193 | - Concurrency and crash recovery are much simpler if segment files are append only or immutable.
194 | #### Limitation
195 | - The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck.
196 |
197 | ### SSTables (Sorted String Tables) and LSM-Trees
198 | - The sequence of key-value pairs is sorted by key.
199 | Advantages over log segments with hash indexes:
200 | - Merging segments is simple and efficient even if the files are bigger than the available memory
202 | 
203 | - In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient
205 |
206 | 
207 |
208 | #### Constructing and maintaining SSTables
209 | - How do you get your data to be sorted by key in the first place? Incoming writes can occur in any order. We can insert keys in any order and read them back in sorted order by using a balanced tree such as a red-black tree or an AVL tree.
210 |
211 | Workflow:
212 | - When a write comes in, add it to an in-memory balanced tree data structure (a red-black tree, for example). This is sometimes called a memtable.
213 | - When the memtable gets bigger than some threshold -> write it out to disk as an SSTable file
214 | - The new SSTable file becomes the most recent segment of the database.
215 | - In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, and so on
216 | - From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
217 |
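The workflow above can be sketched in a few lines; a Python dict and lists of sorted pairs stand in for the balanced tree and the on-disk SSTable files, and the threshold is artificially tiny.

```python
class ToyLSM:
    """Sketch of the memtable/SSTable write and read paths."""

    def __init__(self, threshold=3):
        self.memtable = {}   # stand-in for the in-memory balanced tree
        self.segments = []   # "on-disk" SSTables, oldest first; each is a sorted list of pairs
        self.threshold = threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            # Flush: write the memtable out as a new sorted segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:  # a real SSTable would use a sparse index + binary search
                if k == key:
                    return v
        return None

db = ToyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 99)]:
    db.put(k, v)
print(db.get("a"))  # -> 99, served from the memtable (newest write wins)
print(db.get("b"))  # -> 2, served from the flushed segment
```

The newest-to-oldest search order is what makes overwrites work without in-place updates, and it is also why a lookup for a completely missing key is the slow case.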
218 | #### Making an LSM-tree out of SSTables
219 |
220 | ### B-Trees
221 | B-Trees keep key-value pairs sorted by key, which allows efficient value lookups and range queries.
222 |
223 | Log-structured indexes break the database down into variable-size segments, typically several megabytes or more in size, and always write a segment sequentially
224 |
225 | B-Trees break the database down into fixed-size blocks or pages, traditionally 4KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
228 |
229 | ### Deciding factors between B-Trees and LSM-Trees
230 | https://rkenmi.com/posts/b-trees-vs-lsm-trees
231 | - B-Trees typically modify entries in-place
232 | - LSM trees on the other hand append entries and discard stale entries from time to time
233 | - B-Trees only have to worry about 1 unique key
234 | - LSM trees will potentially have duplicate keys
235 |
236 | If `reads` are a concern, then it may be worthwhile to look into B-Trees instead of LSM Trees. In a nutshell, if we do a simple database query with LSM Trees, we'll first look at the memtable for the key. If it doesn't exist there, then we look at the most recent SSTable, and if not there, then we look at the 2nd most recent SSTable, and onwards. This means that if the key to be queried doesn't exist at all, then LSM Trees can be quite slow.
237 |
238 | Otherwise, if writes are a concern, LSM Trees are more attractive due to sequential writes. In general, sequential writes are a lot faster than random writes, especially on magnetic hard disks. Do note, though, that the maximum write throughput can be more unpredictable than with B-Trees due to the periodic compaction and merging going on in the background.
239 |
240 | ### Full-text search and fuzzy indexes
241 | - full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word
242 | - ignore grammatical variations of words
243 | - search for occurrences of words near each other in the same document
244 | - Cope with typos in documents or queries
245 |
246 | ### Transaction Processing or Analytics? OLTP, OLAP
247 | 
248 | There was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database instead. This separate database was called a data warehouse.
251 |
252 | ### Data warehousing
253 | A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company
256 | 
257 |
258 | The advantage of using a data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns.
259 |
260 | ### Column-Oriented Storage
261 | - Loading all rows from disk into memory, parsing them, and filtering out those that don't meet the required conditions can take a long time
262 | - The idea behind column-oriented storage is: don't store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse the columns that are used in that query, which can save a lot of work
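
A small illustration of the difference, using made-up sales rows:

```python
# Row-oriented layout: each row is stored together.
rows = [
    {"date": "2024-01-01", "product": "apple",  "qty": 3},
    {"date": "2024-01-01", "product": "banana", "qty": 5},
    {"date": "2024-01-02", "product": "apple",  "qty": 2},
]

# Column-oriented layout: each column stored together (think one file per column).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate touching only `qty` reads just that one column, not whole rows.
print(sum(columns["qty"]))  # -> 10
```

On disk, that means the `qty` "file" is the only one read for this query, which is where the I/O savings for wide analytic tables come from.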
264 |
265 | ### Encoding
266 | - XML, JSON, binary formats
267 | - Apache Thrift, MessagePack, gRPC (Protocol Buffers)
268 | - Avro
269 |
270 | ## Distributed data
271 | There are various reasons why you might want to distribute a database across multiple machines:
273 | - Scalability: If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.
276 |
277 | - Fault tolerance/high availability: If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.
281 |
282 | - Latency: If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them
286 |
287 | ### Replication
288 | Keeping a copy of the same data on multiple machines that are connected via a network. Reasons:
289 | - To keep data geographically close to your users (reduces latency)
290 | - To allow the system to continue working even if some of its parts have failed (increases availability)
291 | - To scale out the number of machines that can serve read queries (increases read throughput)
292 | #### Leader and followers
293 | - One of the replicas is designated the leader
294 | - The other replicas are known as followers
295 | - When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader
296 | 
297 |
298 | ### Setting Up New Followers
299 | - Take a consistent snapshot of the leader’s database at some point in time, if possible without taking a lock on the entire database.
301 | - Copy the snapshot to the new follower node.
302 | - The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log. MySQL calls it the binlog coordinates.
305 | - When the follower has processed the backlog of data changes since the snapshot, we say it has caught up. It can now continue to process data changes from the leader as they happen.
308 | ### Problems with Replication Lag:
309 | The replication lag is the delay between a write happening on the leader and being reflected on a follower.
310 | - Leader-based replication requires all writes to go through a single node
311 | - Read-only queries can go to any replica
312 |
313 | **Read-after-write consistency**:
314 | - When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower.
315 | - For example, user profile information on a social network is normally only editable by the owner of the profile, so the user's own profile can always be read from the leader
317 | **Monotonic Reads**: Monotonic reads means that a user shouldn't see things moving backward in time
318 | - Make sure that each user always makes their reads from the same replica. For example, the replica can be chosen based on a hash of the user ID, rather than randomly. If that replica fails, the user's queries will need to be rerouted to another replica.
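
A sketch of that routing rule; the replica names are hypothetical:

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical replica names

def replica_for(user_id, replicas=REPLICAS):
    """Pin each user to one replica so repeated reads never move backward in time."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return replicas[int.from_bytes(digest[:4], "big") % len(replicas)]

# The same user always lands on the same replica:
print(replica_for(251))
print(replica_for(251) == replica_for(251))  # -> True
```

This only guarantees monotonic reads while the pinned replica is healthy; on failover the user may briefly observe older data on the new replica.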
320 |
321 | **Consistent Prefix Reads**:
322 | Anomaly: If some partitions are replicated slower than others, an observer may see the answer before they see the question.
324 | This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
326 |
327 | One solution is to make sure that any writes that are causally related to each other are written to the same partition
329 |
330 | ### Solutions for Replication Lag
331 | When working with an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours.
334 |
335 | However, if the result is a bad experience for users, it’s important to design the system to provide a stronger guarantee, such as read-after-write
338 |
339 | ### Multi-leader replication
340 | - Allow more than one node to accept writes
341 | Use cases:
342 | - Multi-datacenter operation: You can have a leader in each datacenter
343 |
344 | 
345 |
346 | Compare how the single-leader and multi-leader configurations fare in a multi-datacenter deployment:
348 | - Performance: Every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters. Thus, the inter-datacenter network delay is hidden from users, which means the perceived performance may be better.
352 | - Tolerance of datacenter outages: Each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online.
354 |
355 | Downside:
356 | - The same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved
358 |
359 | #### Synchronous Versus Asynchronous Replication
360 | - Synchronous: The leader waits until the follower has confirmed that it received the write before reporting success to the user
361 | - Asynchronous: The leader sends the message and doesn't wait for a response from the follower
362 | It's impractical for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt.
363 | **Semi-synchronous**: one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous.
364 |
365 | ## The trouble with distributed system
366 | ### Unreliable Networks
367 | Network problems can be surprisingly common. Public cloud services such as EC2 are notorious for having frequent transient network glitches; nobody is immune from network problems.
368 | - Your request may have been lost (perhaps someone unplugged a network cable)
369 | - Your request may be waiting in a queue and will be delivered later
370 | - The remote node may have failed
371 | - The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause), but it will start responding again later
374 | - The remote node may have processed your request, but the response has been lost on the network
376 | - The remote node may have processed your request, but the response has been delayed and will be delivered later
378 | Solution: a timeout. After some time you give up waiting and assume that the response is not going to arrive
380 |
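A minimal sketch of the timeout approach, using a plain TCP connection probe (port 9042 is just an example, Cassandra's default CQL port). Note that a timeout cannot tell you *why* the node did not answer, which is exactly the ambiguity described above.

```python
import socket

def check_node(host, port, timeout=2.0):
    """Probe a remote node; treat no reply within `timeout` seconds as a suspected failure."""
    try:
        # create_connection fails fast on refusal and gives up after `timeout`
        # seconds if nothing answers at all.
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        return "timed out (node may be slow, dead, or unreachable; we cannot tell which)"
    except OSError:
        return "connection failed"

print(check_node("localhost", 9042))
```

Picking the timeout value is a tradeoff: too short and you declare healthy-but-slow nodes dead; too long and users wait on a node that has actually failed.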
381 |
382 | ## References
383 | [1] Martin Kleppmann: [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)
384 |
--------------------------------------------------------------------------------
/database-research.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/database-research.pdf
--------------------------------------------------------------------------------
/drawio/foundation_of_data_systems:
--------------------------------------------------------------------------------
1 | 7Vxtb5w4EP41K919SMQ75GOaps1JiVo1lXr96AUDvgCmxuzL/fqzwbCA2Zd0Ydk9baoGGNtgZuaZGXuGzPSHePWZgDR8wR6MZprirWb6x5mmqY6jsQOnrEuKad6VhIAgT3TaEF7Rv1AQFUHNkQezVkeKcURR2ia6OEmgS1s0QAhetrv5OGo/NQUBlAivLohk6g/k0bCkOqayoT9BFITVk1VFtMSg6iwIWQg8vGyQ9MeZ/kAwpuVZvHqAEWdexZdy3KctrfXECEzoIQOitR+Cv95AsviSpsv75886+nVjirnRdfXC0GPvLy4xoSEOcAKixw31A8F54kF+V4Vdbfo8Y5wyosqI/0BK10KYIKeYkUIaR6K1fCZ/0NZXEaQM58SFO+ZfqQQgAaQ7+mk1w5mmQhxDStZsHIERoGjRngcQKhPU/TZcZSeCse9gsjUFkxkryfpvPv7WrC5/itsVFx9Xrau1uJpAOPqUwnGuwtkpHGNK4Yj7LkCUiyfNNCti0/3gY/bCTbFZv3JcNdxkBePvWQfVTFebRnYW8OMnLkA2f5ywLtjnU+c+ClB+yNYZhXH1IDbv8lnlSElb2rqwDBGFrykoWL9kLrEtdx9F0QOOMCnG6h6Aju8yekYJfoONFst14Nyvn7eAhMLVbpnLMhIDdEN4IOGC1ep6uXFomiVoYcOZVbThxXrpbudYdyKGfsWo0GIhqLu2nHSzw/8Sq2JQRwT1LH5fKpoEtm8wQmCOIkTXw+q9CR3P6NN7R5vrljWQ3rfZqfWovaqdUu11VeLiCdS+w3vf5P96bU7xMyxQjnXt/UAx7GmRoktI4SuGMZDi+77m9noIz5pb5kBIMdRzg4oxSVg2oOIbh4ZXkwa/xt0UbD6D6Pdg8Rwb/hZD7wkB60aHlBumbFcg0LFv3aX9+/qzk3IGgxrByuZedWdbP3NSaEs+6gUwpWH/R/FTjgv7/dTcMQ1TGcZP1dtuZ+OnVDlofgLEW7JXYlQf5BG9wHjgRu0sRQzrQEYbozG6J+bCPhWMhoRgkl0gp7Wz47NsN57yGCSXzeRz47I2te80dPs93rO4+goJYq8PyfAuVbUO9KnqsTsvx4ltkn2rMwh5DpfPJPFy7d+r+NfeHS93+1cGYdR4WbMnVh5L01vqc6uoFwP7SVfJ1TQbPvE7jiABlMceNIQZP76hxGM6WqYTishvYF+5f+OOJyoa9PJnqJBwy5ZbM4mgntSHOpJUqpxNqG7SNSUlS4sQZn+6SOPpolLNlaJtKd6FtyaYxCCS00k3rO0JZCE7eCh7YweXAH6sU0jl89tzYmR5olNM/dv9C/s9hy6TTFYp7/pSZv/KZs1hl/DfeRKkUR5kJS65ehKcBPwOkC4xKUQD5mwqu19uTva8bgfXDFe0Dd4dIN0Ga2YIEJur/rFx9b0w8zfaNnOBGaj9qCgrCZHnwUTyGWPaBK1jFEytJ7C+6zEKqj2aVbg7K6vQWKPO86DQvjwrQAYZ6zkvUZJRkLiFJ8HCcoSXgL0HkLnAQwW6fICinIjbgVm988HeroYmsy4pOyviMEpQEEC+mOzpzYBL2v0tEHNlT+ZZ2ubCFbu/jV2zjdwqJb8Xuc5o2VHlbJD7QzgNFyc+CnIiKkWuajf4Vkyv3pmnDCP1S69F0Y/dEtmSY3c6a2S9I4Gxc+zyzvoLBFlOCo/zv1lXGdr+dVXv3mRXHMMBwpgCEHCFaGOPi139bLRs9if4xQgbXJW27a9VnXQDUrd7fOSAWDhNQaLR3Y9Xe3Te6dH52nsMz1h5M+EZA29YQzNJLY/RlyPtq/Ycz6DIS7KvLAThsVex7hlSf09SWCix2JmaxYYcOz8DChOXzzqFzPYlFEVw4A3Jk+T8u8w2tZ6I8bTMlqvSv4eMkUGY5sNm+6HKFNru4/CdZetgJHU27anV2ZykUHbAcMLQDwwnzHGCd7tT+2ycuEDWkKs1vjAzdLkVsta2iuPJwnRTv3SMHFpHZw6SU5Yw4thb
AqU9GDm2mLMyt4cWc3b6j5OcNuSyn1cUpxFyB//24yRLDMc4O7xOnf7X7lrp/xvlVtkUBExUAGCYh9oAawgb8F7oqopttbGo7MbuvgEjgdeUwPu4wIvL/XJLVQ6oNjgxfOXtyKmyEy/gjefTENc7CLJ1MZaw35iHWOKjVgpBXGTBeU7xDcJ0VmXEq89bFZInSZm+y2KMaRjtS/lfUxwHhYqGfdtOrvVvwZiy9tYqPbz6yv59bPUtTF9fTvyxqbMJXHI1TgKUwDIhXGgslxXhiXFP0tv3Z8Y9tGiVu8CYKUjxEBdmGSiEV2ebs8atGgOvyl3GVd2FZZ9q2z2GWR3tEwxTrs47lWUW7W7Ja95CgvkffPeCvSx7G6V1+udOOBRqH5em3Q1BEsAaDO/Q/70mWkLEhXAGeF5tLXwIaE5gtt8WXAEsh1bdT076ENxb+fEbzoldbv7AURnubv5MlP74Hw==
--------------------------------------------------------------------------------
/drawio/storage_and_retrieval:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/images/![]().png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/![]().png
--------------------------------------------------------------------------------
/images/2020-10-09-08-07-27.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-08-07-27.png
--------------------------------------------------------------------------------
/images/2020-10-09-08-22-44.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-08-22-44.png
--------------------------------------------------------------------------------
/images/2020-10-09-08-23-40.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-08-23-40.png
--------------------------------------------------------------------------------
/images/2020-10-09-08-25-53.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-08-25-53.png
--------------------------------------------------------------------------------
/images/2020-10-09-08-26-23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-08-26-23.png
--------------------------------------------------------------------------------
/images/2020-10-09-09-45-09.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-09-45-09.png
--------------------------------------------------------------------------------
/images/2020-10-09-09-51-37.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-09-09-51-37.png
--------------------------------------------------------------------------------
/images/2020-10-15-10-14-29.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-15-10-14-29.png
--------------------------------------------------------------------------------
/images/2020-10-15-10-33-23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-15-10-33-23.png
--------------------------------------------------------------------------------
/images/2020-10-15-10-41-34.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-15-10-41-34.png
--------------------------------------------------------------------------------
/images/2020-10-15-11-02-23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/2020-10-15-11-02-23.png
--------------------------------------------------------------------------------
/images/cassandra/149f7c7b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/cassandra/149f7c7b.png
--------------------------------------------------------------------------------
/images/cassandra/2a482058.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/cassandra/2a482058.png
--------------------------------------------------------------------------------
/images/cassandra/2e4598fe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/cassandra/2e4598fe.png
--------------------------------------------------------------------------------
/images/cassandra/3580057f.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/cassandra/3580057f.png
--------------------------------------------------------------------------------
/images/cassandra/5abf55f1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/cassandra/5abf55f1.png
--------------------------------------------------------------------------------
/images/cassandra/e8f5bac5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/cassandra/e8f5bac5.png
--------------------------------------------------------------------------------
/images/choosing_database/ab021a97.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/choosing_database/ab021a97.png
--------------------------------------------------------------------------------
/images/choosing_database/cf5618a7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/choosing_database/cf5618a7.png
--------------------------------------------------------------------------------
/images/foundation_of_data_systems.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/foundation_of_data_systems.png
--------------------------------------------------------------------------------
/images/indexing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/indexing.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-11-24-08.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-11-24-08.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-12-43-17.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-12-43-17.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-12-45-56.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-12-45-56.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-12-50-56.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-12-50-56.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-13-39-23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-13-39-23.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-13-39-31.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-13-39-31.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-13-58-58.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-13-58-58.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-14-09-32.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-14-09-32.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-14-12-27.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-14-12-27.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-15-29-21.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-15-29-21.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-15-33-44.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-15-33-44.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-15-35-46.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-15-35-46.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-15-53-48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-15-53-48.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-15-59-51.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-15-59-51.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-21-37-42.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-21-37-42.png
--------------------------------------------------------------------------------
/images/intensive/2020-09-30-21-41-30.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-09-30-21-41-30.png
--------------------------------------------------------------------------------
/images/intensive/2020-10-01-09-54-22.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-10-01-09-54-22.png
--------------------------------------------------------------------------------
/images/intensive/2020-10-01-09-54-24.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-10-01-09-54-24.png
--------------------------------------------------------------------------------
/images/intensive/2020-10-01-10-32-50.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/intensive/2020-10-01-10-32-50.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/0dde5a21.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/0dde5a21.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/10d422c0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/10d422c0.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/2b6a83ef.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/2b6a83ef.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/3265315e.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/3265315e.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/32cf40dc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/32cf40dc.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/378a5b1f.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/378a5b1f.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/42075f95.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/42075f95.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/53f7d13b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/53f7d13b.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/5de8cac9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/5de8cac9.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/648e69b1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/648e69b1.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/676c1811.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/676c1811.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/93653734.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/93653734.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/937a3729.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/937a3729.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/99c0ddfc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/99c0ddfc.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/9a3d1e67.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/9a3d1e67.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/b2aa36d1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/b2aa36d1.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/cda7fe4a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/cda7fe4a.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/d1b63713.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/d1b63713.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/de1f1c67.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/de1f1c67.png
--------------------------------------------------------------------------------
/images/mongodb-replicaset/e7b2e770.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/mongodb-replicaset/e7b2e770.png
--------------------------------------------------------------------------------
/images/pool.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/pool.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-17-14-25.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-17-14-25.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-17-14-32.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-17-14-32.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-17-14-44.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-17-14-44.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-17-22-30.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-17-22-30.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-18-07-48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-18-07-48.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-18-11-26.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-18-11-26.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-20-52-24.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-20-52-24.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-20-53-08.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-20-53-08.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-13-33.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-13-33.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-13-59.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-13-59.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-14-27.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-14-27.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-14-37.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-14-37.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-15-56.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-15-56.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-17-15.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-17-15.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-18-21-19-04.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-18-21-19-04.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-07-27-09.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-07-27-09.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-07-27-12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-07-27-12.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-08-11-48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-08-11-48.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-09-31-29.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-09-31-29.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-09-46-37.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-09-46-37.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-09-47-50.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-09-47-50.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-09-51-23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-09-51-23.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-11-12-43.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-11-12-43.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-15-25-19.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-15-25-19.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-15-25-26.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-15-25-26.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-16-56-48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-16-56-48.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-17-08-49.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-17-08-49.png
--------------------------------------------------------------------------------
/images/readme.md/2020-10-19-17-21-23.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/images/readme.md/2020-10-19-17-21-23.png
--------------------------------------------------------------------------------
/mongodb.md:
--------------------------------------------------------------------------------
1 | # MongoDB
2 | ## Why use MongoDB?
3 | - Combine the best features of key-value stores and relational databases
4 | - Designed for `rapid development` and prototyping of web applications and internet infrastructure. We can start writing code immediately and move fast.
5 | - Flexible data model: Don't need to do a lot of `planning around schemas`
6 | - The data model and persistence strategies are built for high `read-and-write throughput`
7 | - Easy to set up high availability with `automatic failover`
8 | - `Built-in sharding` helps scale out huge data collections; we don't have to do application-side sharding
9 |
10 | ### Weaknesses
11 |
12 |
13 |
14 | ## Key features
15 | ### 1. Document data model
16 | 
17 |
18 | 
19 |
20 | ### 2. Schemaless model advantages
21 | Your application code, not the database, enforces the data’s structure.
22 |
23 | This can speed up initial application development when the schema is changing frequently.
24 |
25 | ### 3. Indexes
26 | 
27 | - Indexes in MongoDB are implemented as a B-tree data structure
28 | - By permitting `multiple secondary indexes`, MongoDB allows users to optimize for a wide variety of queries
29 | - With MongoDB, you can create up to 64 indexes per collection
30 |
31 | ### 4. Replication
32 | 
33 | - MongoDB provides database replication via a topology known as a replica set.
34 | - Replica sets distribute data across two or more machines for redundancy and automate
35 | failover in the event of server and network outages
36 | - Replica sets consist of multiple MongoDB servers
37 | - A replica set’s primary node can accept both reads and writes, but the secondary nodes are read-only
38 |
39 | ### 5. Speed and durability
40 | - `Write speed` is understood as the volume of inserts, updates, and deletes that a database can process in a given time frame
41 | - `Durability` refers to the level of assurance that these write operations have been made permanent
42 |
43 | In MongoDB, users control the speed and durability trade-off by choosing `write semantics` and deciding whether to enable journaling.
44 |
45 | - You can configure MongoDB to `fire-and-forget`, sending off a write without waiting for an acknowledgement. Ideal for low-value data (like clickstreams and logs)
46 |
47 | ### 6. Scaling
48 | 
49 | - Easy to scale out by plugging in more servers
50 | - MongoDB provides a range-based partitioning mechanism known as `sharding`, which automatically manages the distribution of data across nodes
51 | - No application code has to handle the logic of sharding
52 |
53 | ### 7. Aggregation [2]
54 | MongoDB has an aggregation framework modeled on the concept of data processing pipelines, letting you run expressive analytical queries against the database easily, something that is cumbersome to express in traditional `SQL`.
55 |
56 | 
57 |
59 |
60 |
61 | ## Suitable usecases
62 | ### Event logging
63 | - Document databases can store events of many different types and can act as a central data store for event logging.
64 | - Events can be sharded by the name of the application
65 | where the event originated or by the type of event such as `order_processed` or `customer_logged`
66 |
67 | ### Web Analytics or Real-Time Analytics
68 | New metrics can be easily added without schema changes.
69 |
70 | ### E-Commerce Applications
71 | E-commerce applications often need to have flexible schema for products and orders, as well as the
72 | ability to evolve their data models without expensive database refactoring or data migration
73 |
74 | ## When not to use
75 | - **Complex Transactions Spanning Different Operations**: If you need to have atomic cross-document operations, then document databases may not be for you
76 | - **Queries against Varying Aggregate Structure**: Since the data is saved as an aggregate, if the design of the aggregate is constantly changing, you need to normalize the data. In this scenario, document databases may not work.
77 |
78 |
79 | ## Handle schema changes in MongoDB
80 | - Write an upgrade script.
81 | - Incrementally update your documents as they are used.
82 |
83 | https://mongodb.github.io/mongo-csharp-driver/2.10/reference/bson/mapping/schema_changes/
84 | https://derickrethans.nl/managing-schema-changes.html
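The second strategy (incremental updates) can be sketched as a lazy migration: a hypothetical `toV2` upgrade function is applied whenever a document is read, so old documents are migrated one by one as they are touched. The field names and version scheme below are made up for illustration:

```javascript
// Lazy schema migration sketch (hypothetical fields; not a real driver API).
// Old documents carry schemaVersion 1 with a single `name` field; version 2
// splits it into firstName/lastName. Each document is upgraded when read.
function toV2(doc) {
  if (doc.schemaVersion >= 2) return doc; // already migrated, nothing to do
  const [firstName, ...rest] = doc.name.split(" ");
  return {
    _id: doc._id,
    firstName,
    lastName: rest.join(" "),
    schemaVersion: 2,
  };
}

// In a real app, after upgrading you would write the document back, e.g.:
//   db.users.replaceOne({ _id: doc._id }, toV2(doc))
const old = { _id: 1, name: "Ada Lovelace", schemaVersion: 1 };
const migrated = toV2(old);
// migrated: { _id: 1, firstName: "Ada", lastName: "Lovelace", schemaVersion: 2 }
```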
85 |
86 | ## Strength and weakness
87 | ### Strength
88 | - Able to handle huge amounts of data (and huge volumes of requests) through replication and horizontal scaling.
89 | - Flexible data model, no need to conform to a schema
90 | - Easy to use
91 |
92 | ### Weakness
93 | - Encourages denormalization of schemas, which leads to `duplicated` data
94 | - Mongo is focused on large datasets and works best in a large cluster, which can require some effort to design and manage; setting up a Mongo cluster requires a little more forethought
95 | - Database management is complex
96 | - If indexing is implemented poorly, or a composite index is defined in the wrong order, MongoDB can be one of the `slowest databases`
97 | - Not a good fit for applications that perform analytics or require joins, because MongoDB has no joins.
98 |
99 |
100 | ## Mongodb index
101 | http://learnmongodbthehardway.com/schema/indexes/
102 | - Indexes are key to achieving high performance in MongoDB
103 | - They allow the database to search through fewer documents to satisfy a query.
104 | - Without an index MongoDB has to scan through all of the documents in a collection to fulfill the query.
105 | - An index increases the time it takes to insert a document
106 | - Indexes trade off faster queries against storage space.
107 |
108 | - MongoDB automatically uses all free memory on the machine as its cache. System resource monitors show that MongoDB uses a lot of memory, but its usage is dynamic. If another process suddenly needs half the server’s RAM, MongoDB will yield cached memory to the other process.[1]
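The core trade-off listed above (fewer documents examined per query, at the cost of extra work on every insert) can be sketched in memory. This is not a real B-tree; a sorted array with binary search stands in for the index, and all names are made up:

```javascript
// Sketch of why an index narrows the search: a sorted array plays the role
// of the index structure (MongoDB actually uses a B-tree).
function buildIndex(docs, field) {
  // Keeping the index sorted is the maintenance cost paid at write time.
  return docs
    .map((doc, i) => ({ key: doc[field], pos: i }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

function findByIndex(docs, index, key) {
  // Binary search: examines O(log n) entries instead of scanning all docs.
  let lo = 0, hi = index.length - 1, examined = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    examined++;
    if (index[mid].key === key) return { doc: docs[index[mid].pos], examined };
    if (index[mid].key < key) lo = mid + 1;
    else hi = mid - 1;
  }
  return { doc: null, examined };
}

const users = Array.from({ length: 1024 }, (_, i) => ({ id: i, age: i % 90 }));
const idIndex = buildIndex(users, "id");
const { doc, examined } = findByIndex(users, idIndex, 777);
// A collection scan would examine up to 1024 documents; the indexed
// lookup examines at most about log2(1024) + 1 = 11.
```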
109 |
110 | ## Database modeling
111 | ```
112 | In progress ...
113 | ```
114 |
115 | ## MongoDB replica set
116 | - Group of mongod processes that maintain the same dataset
117 | - Redundancy and high availability
118 | - Increased read capacity
119 |
120 | 
121 |
122 | 
123 |
124 | 
125 |
126 | 
127 |
128 | 
129 |
130 | 
131 |
132 | 
133 |
134 | 
135 |
136 | 
137 |
138 | 
139 | - Managed in the application driver
140 |
141 | 
142 |
143 | Replica sets give us high availability, but at some point we want ...
144 |
145 | 
146 |
147 | 
148 |
149 | 
150 |
151 | 
152 |
153 |
154 | 
155 |
156 | 
157 |
158 | ## References
159 | [1] https://stackoverflow.com/a/22010542
160 |
161 | [2] https://docs.mongodb.com/manual/core/map-reduce/
162 |
163 |
--------------------------------------------------------------------------------
/nosql.md:
--------------------------------------------------------------------------------
1 | # Why NoSQL?
2 | - Flexible, schemaless data modeling.
3 | - Help dealing with data in **Aggregates**
4 | - Support large volumes of data by running on clusters. Relational databases are not designed to run efficiently on clusters
5 |
6 | ## Aggregate data model [1]
7 | - Dealing in aggregates makes it much easier for these databases to operate on a cluster, since the aggregate makes a natural unit for replication and sharding.
8 |
9 | - Aggregates are often easier for application programmers to work with, since they often manipulate data through aggregate structures.
10 |
11 | ### Relations and Aggregates
12 | #### Relation
13 | 
14 |
15 | #### Aggregate
16 | ```js
17 | // in customers
18 | {
19 | "id":1,
20 | "name":"Martin",
21 | "billingAddress":[{"city":"Chicago"}]
22 | }
23 | // in orders
24 | {
25 | "id":99,
26 | "customerId":1,
27 | "orderItems":[
28 | {
29 | "productId":27,
30 | "price": 32.45,
31 | "productName": "NoSQL Distilled"
32 | }
33 | ],
"shippingAddress":[{"city":"Chicago"}],
35 | "orderPayment":[
36 | {
37 | "ccinfo":"1000-1000-1000-1000",
38 | "txnId":"abelif879rft",
39 | "billingAddress": {"city": "Chicago"}
40 | }
41 | ],
42 | }
43 | ```
44 |
45 | ## Types of data models
46 | ### Key-value and document data model
47 | Key-value and document databases were strongly aggregate-oriented.
48 | - We can access an aggregate in a key-value store based on its key. Redis allows you to break down the aggregate into lists or sets
49 | - We can access an aggregate in a document database by submitting queries
50 | - With key-value databases, we expect to
51 | mostly look up aggregates using a `key`. With document databases, we mostly expect to submit some
52 | form of `query` based on the internal structure of the document.
53 |
54 | ### Column-family stores
55 | There are many scenarios where you often read a few columns of many rows at once.
56 | - In those cases it's better to store groups of columns for all rows as the basic storage unit on disk.
57 | 
58 |
59 | In an RDBMS, the tuples would be stored row-wise, so the data on the disk would be stored as
60 | follows:
61 |
62 | |John,Smith,42|Bill,Cox,23|Jeff,Dean,35|
63 |
64 | In online-transaction-processing (OLTP) applications, the I/O pattern is mostly reading and writing
65 | all of the values for entire records. As a result, row-wise storage is optimal for OLTP databases.
66 |
67 | In a columnar database, however, all of the columns are stored together as follows:
68 | |John,Bill,Jeff|Smith,Cox,Dean|42,23,35|
69 |
70 | The advantage here is that if we want to read the values of a single column, such as Firstname, reading one disk block yields a lot more relevant information than in the row-oriented case. Another advantage, since each block holds a similar type of data, is that we can use efficient compression for the block, further reducing disk space and I/O.
74 |
75 | Examples include `Cassandra`
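The row-wise vs column-wise layouts above can be sketched by flattening the same three records both ways (plain JavaScript, purely illustrative; no real storage engine lays data out as strings like this):

```javascript
// The same three records laid out row-wise vs column-wise, mirroring the
// |John,Smith,42|... illustration above.
const rows = [
  { first: "John", last: "Smith", age: 42 },
  { first: "Bill", last: "Cox", age: 23 },
  { first: "Jeff", last: "Dean", age: 35 },
];

// Row-wise: each record's values are contiguous (OLTP-friendly: one seek
// reads or writes a whole record).
const rowWise =
  rows.map((r) => `|${r.first},${r.last},${r.age}`).join("") + "|";

// Column-wise: each column's values are contiguous, so reading one column
// touches one "block", and similar values compress well.
const columns = ["first", "last", "age"];
const colWise =
  columns.map((c) => "|" + rows.map((r) => r[c]).join(",")).join("") + "|";

// rowWise: "|John,Smith,42|Bill,Cox,23|Jeff,Dean,35|"
// colWise: "|John,Bill,Jeff|Smith,Cox,Dean|42,23,35|"
```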
76 |
77 | ## Consistency
78 |
79 | ## Reference
80 | [1] https://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
81 |
82 |
--------------------------------------------------------------------------------
/postgresql.md:
--------------------------------------------------------------------------------
1 | # Postgresql
2 |
3 | ## Problem with MySQL [1]
4 | - When defining a field as int(11) you can just happily insert textual data and MySQL will try to convert it
5 |
6 | ```
7 | mysql> create table example ( `number` int(11) not null );
8 | Query OK, 0 rows affected (0.08 sec)
9 |
10 | mysql> insert into example (number) values (10);
11 | Query OK, 1 row affected (0.08 sec)
12 |
13 | mysql> insert into example (number) values ('wat');
14 | Query OK, 1 row affected, 1 warning (0.10 sec)
15 |
16 | mysql> insert into example (number) values ('what is this 10 nonsense');
17 | Query OK, 1 row affected, 1 warning (0.14 sec)
18 |
19 | mysql> insert into example (number) values ('10 a');
20 | Query OK, 1 row affected, 1 warning (0.09 sec)
21 |
22 | mysql> select * from example;
23 | +--------+
24 | | number |
25 | +--------+
26 | | 10 |
27 | | 0 |
28 | | 0 |
29 | | 10 |
30 | +--------+
31 | 4 rows in set (0.00 sec)
32 | ```
33 | - Any table modification (e.g. adding a column) will result in the table being locked for both reading and writing.
34 | - Any operation using such a table will have to wait until the modification has completed. For tables with lots of data this could take hours to complete, possibly leading to application downtime
35 | - This led SoundCloud to develop tools such as `lhm` to deal with this.
37 |
38 | ## Postgresql [1]
39 | - Has the capability of altering tables in various ways without locking them for every operation.
40 | - Support for querying JSON
41 | - Querying/storing key-value pairs
42 | - Pub/sub support and more.
43 | - PostgreSQL strikes a balance between performance, reliability, correctness and consistency.
44 |
45 | ### Transactional DDL
46 |
47 |
48 | ## Replication jargon
49 | ### Write-ahead log (WAL)
50 | - Is the log that keeps track of all transactions. PostgreSQL makes the logs available to the slaves.
51 | - Once slaves have pulled the logs, they just need to execute the transactions therein.
52 |
53 | ### Synchronous replication
54 | - A transaction on the master will not be considered complete until at least one synchronous slave listed in `synchronous_standby_names` reports back.
55 |
56 | ### Asynchronous replication
57 | - A transaction on the master will commit even if no slave has applied the update
58 | - This is expedient for distant servers where you don't want transactions to wait because of network latency, but the downside is that the dataset on the slave might lag behind
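The difference between the two commit rules can be sketched with promises (illustrative JavaScript, not PostgreSQL internals; `standbyAck` is a made-up stand-in for a slave applying the WAL and reporting back):

```javascript
// Sketch of synchronous vs asynchronous commit (illustrative only).
async function commit(txn, standbyAck, { synchronous }) {
  const log = [`local write: ${txn}`]; // WAL written on the master
  if (synchronous) {
    // Master waits: the transaction is not complete until a standby acks.
    await standbyAck(txn);
    log.push("standby acked");
  } else {
    // Fire replication off but do not wait for it; the slave may lag.
    standbyAck(txn).catch(() => {});
  }
  log.push("commit returned");
  return log;
}

const fastStandby = async () => {}; // a standby that acks immediately

// Synchronous: the ack happens before commit returns.
//   ["local write: t1", "standby acked", "commit returned"]
// Asynchronous: commit returns without waiting for the ack.
//   ["local write: t1", "commit returned"]
```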
59 |
60 | ### Streaming replication
61 | - Streaming replication does not require direct file access between master and slaves. Instead, it relies on the PostgreSQL connection protocol to transmit the WALs.
62 |
63 | ### Cascading replication
64 | - Slaves can receive logs from nearby slaves instead of directly from the master
65 |
66 | ### Checkpoint
67 | During its operation, PostgreSQL records changes to transaction log files, but it doesn't immediately flush them to the actual database tables. It usually just keeps the changes in memory, and returns them from memory when they are requested, unless RAM starts getting full and it has to write them out.
68 |
69 | This means that if it crashes, the on-disk tables won't be up to date. It has to replay the transaction logs, applying the changes to the on-disk tables, before it can start back up. That can take a while for a big, busy database.
70 |
71 | For that reason, and so that the transaction logs do not keep growing forever, PostgreSQL periodically does a checkpoint where it makes sure the DB is in a clean state. It flushes all pending changes to disk and recycles the transaction logs that were being used to keep a crash recovery record of the changes.
72 |
73 | ## What is publication
74 | A publication can be defined on any physical master. The node where a publication is defined is referred to as the publisher. A publication is a set of changes generated from a table or a group of tables.
75 |
76 | ## Configuration
77 | 1. pg_hba: host-based authentication
78 |
79 | ## Summary
80 |
81 | You should never start your project with a non-relational data store. It's much easier to add one later on (which will probably never happen) than it is to move from non-relational to relational.
82 | ## References
83 |
[1] https://developer.olery.com/blog/goodbye-mongodb-hello-postgresql/
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
1 | - [Collected resources about database](#collected-resources-about-database)
2 | - [1. Factors to consider when choosing a database system](#1-factors-to-consider-when-choosing-a-database-system)
3 | - [1.1. Storage engine](#11-storage-engine)
4 | - [Btree vs LSM Benchmark [3]](#btree-vs-lsm-benchmark-3)
5 | - [Representatives](#representatives)
6 | - [Take aways](#take-aways)
7 | - [1.2. Types of databases](#12-types-of-databases)
8 | - [1.2.1 Relational database - RDBMS [4]](#121-relational-database---rdbms--4)
9 | - [1.2.1.1 ACID](#1211-acid)
10 | - [1.2.2. NoSQL](#122-nosql)
11 | - [1.2.2.1. CAP theorem](#1221-cap-theorem)
12 | - [1.2.2.2. BASE properties of NoSQL](#1222-base-properties-of-nosql)
13 | - [1.2.2.3. Conceptual techniques of NoSQL [5]](#1223-conceptual-techniques-of-nosql-5)
14 | - [1.2.2.3. Types of data models](#1223-types-of-data-models)
15 | - [1. Key-value](#1-key-value)
16 | - [2. Document-oriented](#2-document-oriented)
17 | - [3. Column-family](#3-column-family)
18 | - [4. Graph-Like Data Models [10]](#4-graph-like-data-models-10)
19 | - [1.2.2.4. Considerations for relational vs. NoSQL systems [4]](#1224-considerations-for-relational-vs-nosql-systems-4)
20 | - [1.2.3. NewSQL [4]](#123-newsql-4)
21 | - [1.3. Reliable, Scalable, and Maintainable [4]](#13-reliable-scalable-and-maintainable-4)
22 | - [Reliability](#reliability)
23 | - [Scalability](#scalability)
24 | - [Maintainability](#maintainability)
25 | - [2. Databases at scale](#2-databases-at-scale)
26 | - [2.1. Caching [11]](#21-caching--11)
27 | - [2.1.1. Types of database caching](#211-types-of-database-caching)
28 | - [2.1.2. Caching strategies](#212-caching-strategies)
29 | - [1. Cache-Aside (Lazy loading)](#1-cache-aside-lazy-loading)
30 | - [2. Write - through](#2-write---through)
31 | - [2.1.3. TTL](#213-ttl)
32 | - [2.1.4. Evictions](#214-evictions)
33 | - [2.2. Indexing](#22-indexing)
34 | - [2.2.1. Single index [17]](#221-single-index-17)
35 | - [2.2.2. Composite index](#222-composite-index)
36 | - [2.3. Partitioning](#23-partitioning)
37 | - [3.4. Horizontal scaling: sharding and replication](#34-horizontal-scaling-sharding-and-replication)
38 | - [3.4.1. Replication](#341-replication)
39 | - [3.4.2. Sharding](#342-sharding)
40 | - [When should we shard?](#when-should-we-shard)
41 | - [Strategies of sharding](#strategies-of-sharding)
42 | - [3. Caveats for working with a database system](#3-caveats-for-working-with-a-database-system)
43 | - [3.1. Connection pool](#31-connection-pool)
44 | - [3.1. RDBMS](#31-rdbms)
45 | - [3.1.1. Concurrency problems](#311-concurrency-problems)
46 | - [3.1.1.1. Dirty Read](#3111-dirty-read)
47 | - [3.1.1.2. Phantom Read](#3112-phantom-read)
48 | - [3.1.1.3. Lost updates](#3113-lost-updates)
49 | - [3.1.1.4. Inconsistent Retrievals](#3114-inconsistent-retrievals)
50 | - [3.1.2. Solving concurrency problems with Locking](#312-solving-concurrency-problems-with-locking)
51 | - [3.1.2.1. Optimistic lock](#3121-optimistic-lock)
52 | - [3.1.2.2. Pessmistic lock](#3122-pessmistic-lock)
53 | - [Exclusive and shared locks](#exclusive-and-shared-locks)
54 | - [3.1.2. Deadlock](#312-deadlock)
55 | - [3.1.3. N + 1 query problem](#313-n--1-query-problem)
56 | - [4. Data warehouse](#4-data-warehouse)
57 | - [5. Cloud-native database](#5-cloud-native-database)
58 | - [6. References](#6-references)
59 |
60 | # Collected resources about database
61 | https://dzone.com/articles/database-architectures-amp-use-cases-explained
62 | # 1. Factors to consider when choosing a database system
63 | ## 1.1. Storage engine
64 | A database storage engine is an internal software component that a database server uses to store, read, update, and delete data in the underlying memory and storage systems. [1]
65 |
66 | 
67 |
68 | | | B-Tree based engine | Log Structured Merge (LSM) based engine |
69 | | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
70 | | |  |  |
71 | | **Definition** | **B-tree** is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. | **LSM tree** is a data structure with performance characteristics best fit for indexed access to files with high write volume over an extended period. |
72 | | **How it works?** | https://www.youtube.com/playlist?list=PLzzVuDSjP25QT0H605qxlcmMy_GBTHi8X | - https://yetanotherdevblog.com/lsm/ <br> - http://www.benstopford.com/2015/02/14/log-structured-merge-trees/ |
73 | | **Pros** | - Data and indexes are organized with the B-Tree concept, and reads/writes always take logarithmic time. For 1 million records, it takes about 20 comparisons in the B-Tree to locate the required data/pointer in the index [2] <br> - Each disk access takes around 5 ms. For 1 million records, it requires 15 ms. For 1000 million records, it takes 1.5 s. For 1 billion records, the bill is 15 seconds just to access one row [2] | Fast sequential writes (as opposed to slow random writes in B-tree engines) |
74 | | **Cons** | - The need to maintain a well-ordered data structure with random writes usually leads to poor write performance, because random writes to storage are more expensive than sequential writes | - Exhibit poor read throughput in comparison to B-tree based engines <br> - Consume more CPU resources during read operations and take more memory/disk storage (can be reduced by using a **Bloom filter**) |
75 |
76 | ### Btree vs LSM Benchmark [3]
77 | 
78 |
79 | ### Representatives
80 | 
81 |
82 | ### Take aways
83 | - If you don't require **extreme write throughput**, `Btree` is likely to be the better choice: `read throughput is better`, and high volumes of writes can still be sustained.
84 | - If you have a workload that requires a **high write throughput**, `LSM` is the best choice.
85 |
86 | ## 1.2. Types of databases
87 | 
88 | ### 1.2.1 Relational database - RDBMS [4]
89 | The main unit of data organization in a relational database is referred to as a **table**. Each **table** is associated with a schema that defines the names and data types for each table **column**.
90 |
91 | Within a table, each data record is represented by a row that is, in turn, identified by a primary key, a tuple of column values that must be unique among all the table rows. Table columns may also reference records that exist in other tables. This type of column is typically referred to as a foreign key.
92 |
93 | - Relational databases have been a prevalent technology for decades.
94 | - They're **mature**, **proven**, and **widely implemented**
95 | - They use SQL (Structured Query Language) to manage data, and support **ACID** guarantees.
96 |
97 | #### 1.2.1.1 ACID
98 | - **A**tomicity: All operations in a transaction must either succeed or all are rolled back
99 |
100 | - **For Example**: a transaction to transfer funds from one account to another involves making a withdrawal operation from the first account and a deposit operation on the second. If the deposit operation failed, you don’t want the withdrawal operation to happen either.[6]
101 |
102 | - **C**onsistency: The database integrity constraints are valid on completion of
103 | the transaction. Which demands that the data must meet all the validation rules. All validation rules must be checked to ensure consistency
104 |
105 | - **For example**: If a field's type in the database is Integer, it should accept only integer values. If you try to store a value of another type in this field, consistency is violated and the transaction will roll back.
106 |
107 | - **I**solation:
108 | Simultaneously occurring transactions do not interfere with each other. Contentious concurrent access is moderated by the database
109 | so that transactions appear to run sequentially
110 | - **For example**: a teller looking up a balance must be isolated from a concurrent transaction involving a withdrawal from the same account. Only when the withdrawal transaction commits successfully and the teller looks at the balance again will the new balance be reported.[6]
111 |
112 | - **D**urability: Irrespective of hardware or software failures, the updates made by the transaction are permanent.
113 | - **For example**: InnoDB is a transactional storage engine for **MySQL**, which by definition must guarantee durability. It accomplishes this by writing its updates in two places: once to a transaction log and again to an in-memory buffer pool. The transaction log is synced to disk immediately, whereas the buffer pool is only eventually synced by a background thread.[7]
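The atomicity example above (a funds transfer whose withdrawal and deposit must succeed or fail together) can be sketched as an all-or-nothing operation. This is illustrative JavaScript; a real database enforces atomicity with transaction logs and rollback, not application code:

```javascript
// All-or-nothing transfer sketch: either both the withdrawal and the
// deposit take effect, or neither does.
function transfer(accounts, from, to, amount) {
  const snapshot = { ...accounts }; // state to restore on failure
  try {
    if (accounts[from] < amount) throw new Error("insufficient funds");
    accounts[from] -= amount; // withdrawal
    if (!(to in accounts)) throw new Error("unknown account");
    accounts[to] += amount; // deposit
    return accounts; // "commit": both operations applied
  } catch (e) {
    Object.assign(accounts, snapshot); // "rollback": undo the withdrawal too
    return accounts;
  }
}

const ok = transfer({ a: 100, b: 0 }, "a", "b", 30);
// ok: { a: 70, b: 30 }
const failed = transfer({ a: 100, b: 0 }, "a", "missing", 30);
// failed: { a: 100, b: 0 } — the withdrawal was rolled back as well
```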
114 |
115 | ### 1.2.2. NoSQL
116 | - NoSQL databases refer to high-performance, non-relational data stores.
117 | - Flexible, schemaless data modeling.
118 | - Support large volumes of data by running on clusters; relational databases are not designed to run efficiently on clusters. NoSQL databases excel in ease of use, scalability, resilience, and availability.
119 | - Help dealing with data in **Aggregates**(See 1.2.2.3)
120 |
121 | However, as we all know, there is no such thing as a free lunch.
122 | To achieve this performance boost, NoSQL databases have to sacrifice something! Being distributed systems, NoSQL databases must adhere to the rules of the **CAP** theorem.
123 |
124 | Additionally, NoSQL databases have to give up **ACID** guarantees to support scalability. Instead, they employ new properties called **BASE** (1.2.2.2).
125 |
126 |
127 | #### 1.2.2.1. CAP theorem
128 | 
129 |
130 | [Databases categorized in CAP Theorem](https://medium.com/system-design-blog/cap-theorem-1455ce5fc0a0)
131 |
132 | In a distributed computer system, you can only support two of the following guarantees:
133 |
134 | - **C**onsistency: Every read receives the most recent write or an error
135 | - **A**vailability: Every request receives a response, without guarantee that it contains the most recent version of the information
136 | - **P**artition Tolerance: The system continues to operate despite arbitrary partitioning due to network failures [8].
137 |
138 | RDBMS prefer CP ⟶ ACID
139 | NoSQL prefer AP ⟶ BASE
140 |
141 | 
142 |
143 | Image taken from the course [Blockchain at BerkeleyX CS198.2x](https://www.youtube.com/watch?v=K12oQCzjPxE&t=5s)
144 |
145 | Networks aren't reliable, so you'll need to support partition tolerance. You'll need to make a software tradeoff between consistency and availability.
146 |
147 | **CP**: Consistency and partition tolerance.
148 |
149 | Distributed systems in this category typically use a voting protocol to ensure that the majority of nodes agree that they have the most recent version of the stored data; in other words, they reach a [quorum](https://en.wikipedia.org/wiki/Quorum_(distributed_computing)#:~:text=A%20quorum%20is%20the%20minimum,operation%20in%20a%20distributed%20system.). This allows the system to recover from network partitioning events. However, if not enough nodes are available to reach quorum, the system will return an error to clients as data consistency is preferred over availability.[9]
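A quorum is simply a majority of the nodes, so the availability check a CP system performs can be sketched in a few lines (illustrative JavaScript, not any specific voting protocol):

```javascript
// Quorum sketch: a CP system serves a request only if a majority of nodes
// (floor(n / 2) + 1) respond; otherwise it returns an error rather than
// risk serving stale data.
const quorum = (clusterSize) => Math.floor(clusterSize / 2) + 1;

function canServe(clusterSize, reachableNodes) {
  return reachableNodes >= quorum(clusterSize);
}

// In a 5-node cluster, 3 nodes form a quorum:
//   canServe(5, 3) → true   (majority reachable, request served)
//   canServe(5, 2) → false  (no quorum, the CP system returns an error)
```

Note that an even-sized cluster gains nothing here: `quorum(4)` is 3, the same as `quorum(5)`, which is why voting clusters are usually deployed with an odd number of nodes.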
150 |
151 | **AP**: Availability and partition tolerance
152 |
153 | This class of distributed systems favors availability over consistency. Even in the case of a network partition, an AP system will try to process read requests, although stale data may be returned to the clients.[9]
154 |
155 | #### 1.2.2.2. BASE properties of NoSQL
156 | While RDBMS follow the ACID properties, NoSQL databases follow BASE properties.
157 |
158 | - **BA**sically available: Guarantees the availability of the data. There will be a response to any request (though it can be a failure).
159 |
160 | - **S**oft state: The state of the system could change over time.
161 |
162 | - **E**ventual consistency: The system will eventually become consistent once it stops receiving input.
163 |
164 | #### 1.2.2.3. Conceptual techniques of NoSQL [5]
165 | - **1. Denormalization**: Copy the same data into multiple documents in order to optimize or simplify query processing.
166 | - IO per query vs total data volume: Data can be grouped in one place for better IO.
167 | - Processing complexity vs total data volume: reduce joins.
168 |
169 | - **2. Aggregates**:
170 | - Relation
171 | 
172 |
173 | - Aggregate
174 | ```js
175 | // in customers
176 | {
177 | "id":1,
178 | "name":"Martin",
179 | "billingAddress":[{"city":"Chicago"}]
180 | }
181 | // in orders
182 | {
183 | "id":99,
184 | "customerId":1,
185 | "orderItems":[
186 | {
187 | "productId":27,
188 | "price": 32.45,
189 | "productName": "NoSQL Distilled"
190 | }
191 | ],
"shippingAddress":[{"city":"Chicago"}],
193 | "orderPayment":[
194 | {
195 | "ccinfo":"1000-1000-1000-1000",
196 | "txnId":"abelif879rft",
197 | "billingAddress": {"city": "Chicago"}
198 | }
199 | ],
200 | }
201 | ```
202 | - Dealing in aggregates makes it much easier for these databases to operate on a cluster, since the aggregate makes a natural unit for replication and sharding.
203 |
204 | - Aggregates are often easier for application programmers to work with, since they often manipulate data through aggregate structures.
205 | - **3. Application side joins**: Joins are rarely supported in NoSQL solutions. As a consequence of the “question-oriented” NoSQL nature, joins are often handled at design time as opposed to relational models where joins are handled at query execution time.
206 |
207 | #### 1.2.2.3. Types of data models
208 | 
209 | **Key-value** and **Document** databases were strongly aggregate-oriented.
210 |
211 | ##### 1. Key-value
212 | - We can access an aggregate in a key-value store based on its key. Redis allows you to break down the aggregate into lists or sets
213 | - With key-value databases, we expect to mostly look up aggregates using a `key`.
214 | - The basic set of operations supported by key-value stores are **insertions**, **deletions**, and **lookups**
215 |
216 | The data access patterns that are used by key-value stores make data partitioning across multiple nodes much easier compared to other database technologies. This property allows key-value stores to scale horizontally so as to accommodate increased traffic demand.
217 |
218 | **Representatives**: Redis, LevelDB, RocksDB, etcd
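The basic operation set (insert, lookup, delete) and the ease of key-based partitioning can be sketched together. This is plain illustrative JavaScript; the toy hash and node count are made up, and real systems use consistent hashing:

```javascript
// Key-value sketch: the whole interface is put/get/delete by key, which is
// also why partitioning is easy — the key alone picks the node.
const NODES = 3;

// Toy hash: sum of character codes (real systems use consistent hashing).
const hash = (key) => [...key].reduce((h, c) => h + c.charCodeAt(0), 0);
const nodeFor = (key) => hash(key) % NODES;

// One Map per node stands in for the per-node storage.
const shards = Array.from({ length: NODES }, () => new Map());

const put = (key, value) => shards[nodeFor(key)].set(key, value);
const get = (key) => shards[nodeFor(key)].get(key);
const del = (key) => shards[nodeFor(key)].delete(key);

put("user:1", { name: "Martin" });
put("user:2", { name: "Pramod" });
// Every operation touches exactly one shard, chosen from the key alone.
const found = get("user:1"); // { name: "Martin" }
del("user:2");
```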
219 |
220 | ##### 2. Document-oriented
221 | 
222 |
223 | - Flexible data model: no need to do a lot of `planning around schemas`.
224 | - Document databases typically implement their own domain-specific language (DSL) for querying data.
225 | - We can access an **aggregate** in a document database by submitting queries
226 | - They also provide advanced primitives (for example, support for map-reduce) for calculating complex aggregations across multiple documents in a collection.
227 | - This makes document databases a great fit for generating **business intelligence** (BI) and other types of analytics reports.
228 |
229 | **Representatives**: [mongodb.md](mongodb.md), CouchDB, Elasticsearch
230 |
231 | ##### 3. Column-family
232 | There are many scenarios where you often read a few columns of many rows at once.
233 | - In those cases it's better to store groups of columns for all rows as the basic storage unit on disk.
234 | 
235 |
236 | In an RDBMS, the tuples would be stored row-wise, so the data on the disk would be stored as
237 | follows:
238 |
239 | `|John,Smith,42|Bill,Cox,23|Jeff,Dean,35|`
240 |
241 | In online-transaction-processing (OLTP) applications, the I/O pattern is mostly reading and writing
242 | all of the values for entire records. As a result, row-wise storage is optimal for OLTP databases.
243 |
244 | In a columnar database, however, all of the columns are stored together as follows
245 |
246 | `|John,Bill,Jeff|Smith,Cox,Dean|42,23,35|`
247 |
248 | The advantage here is that if we want to read values of a single column, such as first_name, reading one disk block yields a lot more relevant information in the column-oriented case than in the row-oriented case.
249 | Another advantage, since each block holds the same
250 | type of data, is that we can use efficient compression for the block, further reducing disk space and
251 | I/O.
252 |
253 | ---
254 | **Representatives**: [cassandra.md](cassandra.md), HBase, BigTable
255 |
256 | ##### 4. Graph-Like Data Models [10]
257 | 
258 | - If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
259 |
260 | - If many-to-many relationships are very common in your data, the relational model can handle simple cases, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
261 |
262 | A graph consists of two kinds of objects: vertices and edges. Many kinds of data can be modeled as a graph. Typical examples include:
263 |
264 | - Social graphs: Vertices are people, and edges indicate which people know each other.
265 | - The web graph: Vertices are web pages, and edges indicate HTML links to other pages.
266 | Well-known algorithms can operate on these graphs:
267 | - Car navigation system: search for the shortest path between two points in a road network
268 | - PageRank can be used on the web graph to determine the popularity of a web page
269 | and thus its ranking in search results.
270 |
271 | **Representatives**: Neo4j
272 |
273 | #### 1.2.2.4. Considerations for relational vs. NoSQL systems [4]
274 | High availability and massive scalability are often more critical to the business than strong consistency. Developers can implement techniques and patterns such as Sagas, CQRS, and asynchronous messaging to embrace eventual consistency.
275 |
276 | | Consider a NoSQL datastore when: | Consider a relational database when: |
277 | | ----------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
278 | | You have high volume workloads that require large scale | Your workload volume is consistent and requires medium to large scale |
279 | | Your workloads don't require ACID guarantees | ACID guarantees are required |
280 | | Your data is dynamic and frequently changes | Your data is predictable and highly structured |
281 | | Data can be expressed without relationships | Data is best expressed relationally |
282 | | You need fast writes and write safety isn't critical | Write safety is a requirement |
283 | | Data retrieval is simple and tends to be flat | You work with complex queries and reports |
284 | | Your data requires a wide geographic distribution | Your users are more centralized |
285 | | Your application will be deployed to commodity hardware, such as with public clouds | Your application will be deployed to large, high-end hardware |
286 |
287 | 
288 |
289 | **Sticking with the default**:
290 |
291 | There are many cases where you’re better off sticking with the default option of a relational database:
292 | - You can easily find people with experience of using them.
293 | - They are mature, so you are less likely to run into the rough edges of new technology.
294 | - Picking a new technology will always introduce a risk of problems should things run into difficulties.
295 | ### 1.2.3. NewSQL [4]
296 | - NewSQL is an emerging database technology that combines the distributed scalability of NoSQL with the ACID guarantees of a relational database.
297 | - NewSQL databases are important for business systems that must process high volumes of data across distributed environments, with full transactional support and ACID compliance.
298 | - A key design goal for NewSQL databases is to work natively in **Kubernetes**, taking advantage of the platform's resiliency and scalability.
299 |
300 | 
301 |
302 | ## 1.3. Reliable, Scalable, and Maintainable [4]
303 | 
304 |
305 | Three concerns that are important in most database systems:
306 |
307 | ### Reliability
308 | - `Tolerate` hardware and software faults
309 | - The system should continue to work correctly even in the face of adversity.
310 |
311 | ### Scalability
312 | - Measuring load & performance
313 | - Latency percentiles, throughput
314 | - As the system grows, there should be reasonable ways of dealing with that growth. Scalability is the term we use to describe a system’s ability to cope with increased
315 | load.
316 |
317 | ### Maintainability
318 | - **Operability**: Make it easy for operations teams to keep the system running smoothly.
319 | - **Simplicity**: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
320 | - **Evolvability**: Make it easy for engineers to make changes to the system in the future, adapting
321 | it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
322 |
323 | > See more details: [database-foundation.md](database-foundation.md)
324 |
325 | # 2. Databases at scale
326 | ## 2.1. Caching [11]
327 | When you’re building distributed applications that require low latency and
328 | scalability, disk-based databases can pose a number of challenges.
329 | - Slow query processing
330 | - Cost to scale
331 |
332 | A database cache supplements your primary database by removing unnecessary pressure on it.
333 |
334 | ### 2.1.1. Types of database caching
335 | Types of database caching:
336 | - **Local caches**:
337 | - A local cache stores your frequently used data within your application.
338 | - This makes data retrieval faster than other caching architectures because it removes network traffic that is associated with retrieving data.
339 | - **A major disadvantage** is that among your applications, each node has
340 | its own resident cache working in a disconnected manner.
341 | - **Remote caches**:
342 | - A remote cache (or “side cache”) is a separate instance (or instances) dedicated for storing the cached data in-memory.
343 | - Remote caches are stored on dedicated servers and are typically built
344 | on key/value NoSQL stores, such as Redis and Memcached.
345 |
346 | ### 2.1.2. Caching strategies
347 | #### 1. Cache-Aside (Lazy loading)
348 | - When your application needs to read data from the database, it checks the cache first to determine whether the data is available.
349 | - If the data is available, the cached data is returned, and the response is issued to the caller.
350 | - If the data isn't available, it is queried from the database. The cache is then populated with the data retrieved from the database, and the data is returned to the caller.
351 |
352 | 
353 |
354 | Advantages:
355 | - The cache contains only data that the application requests, which helps keep the cache size cost effective
356 | - Implementation is easy, straightforward
357 |
358 | Disadvantages:
359 | - The data is loaded into the cache only after a cache miss, so some overhead is added to the initial response time because additional roundtrips to the cache and database are needed.
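The cache-aside steps above can be sketched in Python; `db_fetch`, the TTL value, and the in-memory dict are illustrative stand-ins for a real backing database and a real cache such as Redis:

```python
import time

class CacheAside:
    """Minimal cache-aside sketch: check the cache first, fall back to the database."""
    def __init__(self, db_fetch, ttl_seconds=60):
        self.db_fetch = db_fetch      # function that reads from the primary database
        self.ttl = ttl_seconds
        self.cache = {}               # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.time():
            return entry[0]                                  # cache hit
        value = self.db_fetch(key)                           # cache miss: query the database
        self.cache[key] = (value, time.time() + self.ttl)    # populate the cache
        return value
```

Only keys that are actually requested ever enter the cache, which is exactly the property listed under the advantages.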
360 |
361 | #### 2. Write-through
362 | - The application, batch job, or backend process updates the primary database.
363 | - Immediately afterward, the data is also updated in the cache.
364 |
365 | 
366 |
367 |
368 | Advantages:
369 | - Because the cache is up to date with the primary database, there is a much greater likelihood that the data will be found in the cache. This in turn results in better overall application performance and user experience.
370 | - The performance of your database is optimal because fewer database reads are performed
371 |
372 | Disadvantages:
373 | - Infrequently requested data is also written to the cache, resulting in a larger and more expensive cache.
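The write-through path can be sketched as follows; the dict-based `db` is an illustrative stand-in for the primary database:

```python
class WriteThroughCache:
    """Write-through sketch: every write updates the primary database first,
    then immediately updates the cache, so reads almost always hit the cache."""
    def __init__(self, db):
        self.db = db        # dict standing in for the primary database
        self.cache = {}

    def write(self, key, value):
        self.db[key] = value       # 1. update the primary database
        self.cache[key] = value    # 2. immediately update the cache

    def read(self, key):
        if key in self.cache:      # expected to hit in steady state
            return self.cache[key]
        return self.db.get(key)
```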
374 |
375 | A proper caching strategy includes effective use of both write-through and lazy loading of your data, and setting an appropriate expiration for the data to keep it relevant and lean.
376 |
377 | ### 2.1.3. TTL
378 | - When applying TTLs to your cache keys, you should add some time jitter to your TTL. This reduces the possibility of heavy load occurring when your cached data expires. Take, for example, the scenario of
379 | caching product information. If all your product data expires at the same time
380 | and your application is under heavy load, then your backend database has to
381 | fulfill all the product requests, which could generate too
382 | much pressure on your database, resulting in poor performance. By adding
383 | slight jitter, a randomly generated time value, to your TTLs (e.g., TTL = your
384 | initial TTL value in seconds + jitter), expirations are spread out and the load spike is avoided.
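A minimal sketch of TTL jitter; the base TTL and the jitter bound are arbitrary illustrative values, not recommendations:

```python
import random

def ttl_with_jitter(base_ttl_seconds, max_jitter_seconds=30):
    """Return the base TTL plus a random jitter, so that keys cached at the
    same moment do not all expire at the same moment."""
    return base_ttl_seconds + random.randint(0, max_jitter_seconds)
```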
385 |
386 | ### 2.1.4. Evictions
387 | - Evictions occur when cache memory is overfilled or exceeds the
388 | `maxmemory` setting for the cache, causing the engine to select keys to evict in
389 | order to manage its memory.
390 | - The keys that are chosen are based on the
391 | eviction policy you select.
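One common eviction policy is LRU (least recently used). A toy sketch of the idea, not any particular engine's implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU eviction sketch: when the cache is full, the least recently
    used key is evicted to make room for the new one."""
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, key):
        value = self.entries.pop(key)   # raises KeyError on a miss
        self.entries[key] = value       # re-insert to mark as most recently used
        return value

    def put(self, key, value):
        self.entries.pop(key, None)
        self.entries[key] = value
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)   # evict the least recently used key
```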
392 |
393 | 
394 |
395 |
396 | ## 2.2. Indexing
397 | Querying a database table of n records by a field other than a **key** requires O(n) record reads.
398 |
399 | - A database index is a data structure (typically a **B-tree**) that improves the speed of data retrieval operations on a database table, at the cost of additional writes and storage space to maintain the index data structure.
400 | - Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed.
401 | - Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
402 | ### 2.2.1. Single index [17]
403 | - A single-column index is an index based on the values in one column of a table.
404 | - An index on a column is an additional data structure of the table’s records sorted (typically via b-tree) only on that column.
405 | - Each record in the index also includes a pointer to the original record in the table, such that finding records in the index is equivalent to finding records in the original table.
406 |
407 | For example: We have a data sample
408 |
409 | | ID | first_name | last_name | Class | Position | ssn |
410 | | --- | ---------- | ------------ | ---------- | -------- | ---- |
411 | | 1 | Teemo | Shroomer | Specialist | Top | 2345 |
412 | | 2 | Cecil | Heimerdinger | Specialist | Mid | 5461 |
413 | | 3 | Annie | Hastur | Mage | Mid | 8784 |
414 | | 4 | Fiora | Laurent | Slayer | Top | 7867 |
415 | | 5 | Garen | Crownguard | Fighter | Top | 4579 |
416 | | 6 | Malcolm | Graves | Specialist | ADC | 4578 |
417 | | 7 | Irelia | Lito | Figher | Top | 5689 |
418 | | 8 | Janna | Windforce | Controller | Support | 4580 |
419 | | 9 | Jarvan | Lightshield | Figher | Top | 4579 |
420 | | 10 | Katarina | DuCouteau | Assassin | Mid | 5608 |
421 |
422 | If we create an index on `users.first_name`:
423 | `CREATE INDEX first_name_index ON users (first_name) USING BTREE;`
424 | it would create a sorted list of users by first_name, each with a pointer to their primary key, something like this:
425 | ```
426 | Annie -> 3
427 | Cecil -> 2
428 | Emilia -> 12
429 | Fiora -> 4
430 | Garen -> 5
431 | Irelia -> 7
432 | Janna -> 8
433 | Jarvan -> 9
434 | Jericho -> 19
435 | Katarina -> 10
436 | Kayle -> 11
437 | ```
438 | Then a query like
439 | ```sql
440 | SELECT * FROM users WHERE first_name = 'Teemo';
441 | ```
442 | would take only O(log_2(n)) reads, since the database can perform a binary search on this index, as it is sorted by first_name.
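The lookup can be sketched with a sorted list standing in for the index (the names and IDs come from the sample table above):

```python
import bisect

# A single-column index modeled as (first_name, primary_key) pairs,
# kept sorted by first_name.
index = sorted([
    ("Teemo", 1), ("Cecil", 2), ("Annie", 3), ("Fiora", 4), ("Garen", 5),
    ("Malcolm", 6), ("Irelia", 7), ("Janna", 8), ("Jarvan", 9), ("Katarina", 10),
])

def lookup(first_name):
    """Binary search the index, then follow the pointer back to the primary key."""
    i = bisect.bisect_left(index, (first_name,))
    if i < len(index) and index[i][0] == first_name:
        return index[i][1]
    return None
```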
443 |
444 | ### 2.2.2. Composite index
445 | A composite index is an index based on the values in multiple columns of a table.
446 | Back to our example:
447 |
448 | | ID | first_name | last_name | Class | Position | ssn |
449 | | --- | ---------- | ------------ | ---------- | -------- | ---- |
450 | | 1 | Teemo | Shroomer | Specialist | Top | 2345 |
451 | | 2 | Cecil | Heimerdinger | Specialist | Mid | 5461 |
452 | | 3 | Annie | Hastur | Mage | Mid | 8784 |
453 | | 4 | Fiora | Laurent | Slayer | Top | 7867 |
454 | | 5 | Garen | Crownguard | Fighter | Top | 4579 |
455 | | 6 | Malcolm | Graves | Specialist | ADC | 4578 |
456 | | 7 | Irelia | Lito | Figher | Top | 5689 |
457 | | 8 | Janna | Windforce | Controller | Support | 4580 |
458 | | 9 | Jarvan | Lightshield | Figher | Top | 4579 |
459 | | 10 | Katarina | DuCouteau | Assassin | Mid | 5608 |
460 |
461 | ```sql
462 | CREATE INDEX class_pos_index ON users (class, position);
463 | ```
464 | Then we will have a B-tree like this
465 |
466 | 
467 |
468 | To execute a query using a composite index,
469 | we **must** provide the index keys **in order**.
470 | ```sql
471 | SELECT * FROM users WHERE class = 'Specialist' AND position = 'Top';
472 | ```
473 |
474 | However, a query like
475 | ```sql
476 | SELECT * FROM users WHERE position = 'Top';
477 | ```
478 | will NOT benefit from this composite index because position is the second field.
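This leftmost-prefix rule can be sketched with tuples: entries are sorted by (class, position), so only queries that constrain the leading column(s) can binary search (values below come from the sample table):

```python
import bisect

# Composite index modeled as (class, position, primary_key) tuples,
# sorted by (class, position).
comp_index = sorted([
    ("Specialist", "Top", 1), ("Specialist", "Mid", 2), ("Mage", "Mid", 3),
    ("Slayer", "Top", 4), ("Fighter", "Top", 5), ("Specialist", "ADC", 6),
])

def find(cls, pos):
    """Both leading columns given: binary search narrows the range, O(log n)."""
    i = bisect.bisect_left(comp_index, (cls, pos))
    out = []
    while i < len(comp_index) and comp_index[i][:2] == (cls, pos):
        out.append(comp_index[i][2])
        i += 1
    return out

def find_by_position(pos):
    """Second column only: the sort order does not help, so every entry is scanned, O(n)."""
    return [pk for c, p, pk in comp_index if p == pos]
```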
479 |
480 | ---
481 | **Avoid unnecessary indexes**: adding unnecessary indexes can actually degrade overall performance.
482 | * Additional storage space is needed to store indexes.
483 | * Indexes also need to be updated whenever state-changing queries such as INSERT, UPDATE, and DELETE are made.
484 |
485 | **Guidelines for when to use an index**:
486 | * Do not use an index for low-read but high-write tables, as mentioned previously.
487 | * Do not use an index if the field has `low cardinality`, i.e., a small number of distinct values.
488 | * Do not use an index for small fixed-size tables.
489 |
490 | ## 2.3. Partitioning
491 | Partitioning is the database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.
492 | 
493 |
494 | Types of partitioning:
495 | - **RANGE partitioning**: This type of partitioning assigns rows to partitions based on column values falling within a given range. [18]
496 | ```sql
497 | CREATE TABLE members (
498 | firstname VARCHAR(25) NOT NULL,
499 | lastname VARCHAR(25) NOT NULL,
500 | username VARCHAR(16) NOT NULL,
501 | email VARCHAR(35),
502 | joined DATE NOT NULL
503 | )
504 | PARTITION BY RANGE( YEAR(joined) ) (
505 | PARTITION p0 VALUES LESS THAN (1960),
506 | PARTITION p1 VALUES LESS THAN (1970),
507 | PARTITION p2 VALUES LESS THAN (1980),
508 | PARTITION p3 VALUES LESS THAN (1990),
509 | PARTITION p4 VALUES LESS THAN MAXVALUE
510 | );
511 | ```
512 | - **HASH partitioning**: With this type of partitioning, a partition is selected based on the value returned by a user-defined expression that operates on column values in rows to be inserted into the table.
513 | ```sql
514 | CREATE TABLE employees (
515 | id INT NOT NULL,
516 | fname VARCHAR(30),
517 | lname VARCHAR(30),
518 | hired DATE NOT NULL DEFAULT '1970-01-01',
519 | separated DATE NOT NULL DEFAULT '9999-12-31',
520 | job_code INT,
521 | store_id INT
522 | )
523 | PARTITION BY HASH(store_id)
524 | PARTITIONS 4;
525 | ```
526 | - **LIST partitioning**: Similar to partitioning by RANGE, except that the partition is selected based on columns matching one of a set of discrete values.
527 | ```sql
528 | CREATE TABLE employees (
529 | id INT NOT NULL,
530 | fname VARCHAR(30),
531 | lname VARCHAR(30),
532 | hired DATE NOT NULL DEFAULT '1970-01-01',
533 | separated DATE NOT NULL DEFAULT '9999-12-31',
534 | job_code INT,
535 | store_id INT
536 | )
537 | PARTITION BY LIST(store_id) (
538 | PARTITION pNorth VALUES IN (3,5,6,9,17),
539 | PARTITION pEast VALUES IN (1,2,10,11,19,20),
540 | PARTITION pWest VALUES IN (4,12,13,14,18),
541 | PARTITION pCentral VALUES IN (7,8,15,16)
542 | );
543 | ```
544 | - **KEY partitioning**: This type of partitioning is similar to partitioning by HASH, except that only one or more columns to be evaluated are supplied, and the database server provides its own hashing function.
545 | ```sql
546 | CREATE TABLE members (
547 | firstname VARCHAR(25) NOT NULL,
548 | lastname VARCHAR(25) NOT NULL,
549 | username VARCHAR(16) NOT NULL,
550 | email VARCHAR(35),
551 | joined DATE NOT NULL
552 | )
553 | PARTITION BY KEY(joined)
554 | PARTITIONS 6;
555 | ```
556 |
557 | ## 2.4. Horizontal scaling: sharding and replication
558 | ### 2.4.1. Replication
559 | - Replication is the process of copying data from a central database to one or more databases.
560 |
561 | - Replication can be used to improve the availability of data by preventing the loss of a single server from causing your database service to become unavailable.
562 |
563 | 
564 |
565 | - If our workloads are **read-heavy**, horizontal scaling is usually achieved by spinning up read-replicas, which mirror updates to one or more primary nodes. Writes are always routed to the primary nodes while reads are handled by the read-replicas (ideally) or even by the primaries if the read-replicas cannot be reached.
566 |
567 | ### 2.4.2. Sharding
568 |
569 | 
570 |
571 | - For **write-heavy** workloads, we usually resort to techniques such as data **sharding**.
572 | - Data sharding allows us to split (partition) the contents of one or more tables into multiple database nodes. This partitioning is achieved by means of a per-row shard key, which dictates which node is responsible for storing each row of the table. One caveat of this approach is that it introduces additional complexity at query time. While writes are quite efficient, reads are not trivial, as the database might need to query each individual node and then aggregate the results together in order to answer even a simple query such as `SELECT COUNT(*) FROM X`.
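A hash-based shard-key router can be sketched as follows; using a stable hash (rather than Python's built-in `hash()`, which varies between processes) keeps routing consistent:

```python
import hashlib

def shard_for(shard_key: str, num_shards: int) -> int:
    """Map a per-row shard key to the index of the node that stores the row."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Note that simple modulo routing means most keys move when `num_shards` changes; schemes such as consistent hashing mitigate that.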
573 |
574 | #### When should we shard?
575 | Sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:
576 | - The amount of application data grows to exceed the storage capacity of a single database node.
577 | - The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
578 | - The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.
579 |
580 | **Benefits of sharding**:
581 | - Helps facilitate horizontal scaling (`scaling out`): add more machines to the existing cluster in order to spread out the load and allow for more traffic and faster processing.
582 | - Speeds up query response time.
583 | - Makes an application more reliable by mitigating the impact of outages. With a sharded database, an outage is likely to affect only a single shard.
584 |
585 | **Drawbacks of sharding**:
586 | - You must deal with the complexity of properly implementing a sharded database architecture. If done incorrectly, there's a significant risk that the sharding process can lead to lost data or corrupted tables.
587 | - Sometimes the shards will eventually become unbalanced.
588 | - Once a database has been sharded, it can be very difficult to return it to its unsharded architecture.
589 | - Joins across shards are costly, and are often performed at the application level.
590 | - Sharding isn't natively supported by every database engine. For instance, PostgreSQL does not include automatic sharding as a feature.
591 |
592 | #### Strategies of sharding
593 | https://www.digitalocean.com/community/tutorials/understanding-database-sharding
594 |
595 | **Sharding** is usually used with **Replication** in a database system to improve availability.
596 |
597 | 
598 |
599 | # 3. Caveats for working with a database system
600 | ## 3.1. Connection pool
601 | - A connection pool is a cache of database connections maintained so that the connections can be reused when future requests to the database are required.
602 | - Connection pools are used to enhance the performance of executing commands on a database, reducing the overhead of establishing and closing a connection every time.
603 |
604 | 
605 | Parameters to consider when configuring a connection pool:
606 | - `maxConnections`: the maximum number of connections the pool will create (a common default is 10).
607 |   - A client asks the pool to run a query, and the pool hands it one of those connections.
608 |   - If no connections are left (all of them are busy or in use), the client has to wait, and there is a timeout for that.
609 |
610 | - `connectionTimeoutMillisec`: how long the pool will wait to hand out a connection when all of them are busy, before timing out.
611 | - `idleTimeoutMillisec`: how long a connection may sit unused before the pool closes it.
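The behaviour of these parameters can be sketched with a toy pool; `connect_fn` and the parameter names are illustrative, not any specific driver's API:

```python
import queue

class ConnectionPool:
    """Toy connection pool: a fixed set of reusable connections handed out
    on demand, with a timeout when all of them are busy."""
    def __init__(self, connect_fn, max_connections=10, connection_timeout_sec=5.0):
        self._idle = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            self._idle.put(connect_fn())   # open the connections up front
        self._timeout = connection_timeout_sec

    def acquire(self):
        # Blocks until a connection is free; raises queue.Empty after the timeout
        return self._idle.get(timeout=self._timeout)

    def release(self, conn):
        self._idle.put(conn)   # return the connection to the pool for reuse
```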
612 |
613 | ## 3.2. RDBMS
614 | ### 3.2.1. Concurrency problems
615 | When multiple transactions execute concurrently in an uncontrolled or unrestricted manner, then it might lead to several problems. These problems are commonly referred to as concurrency problems in database environment.
616 |
617 | #### 3.2.1.1. Dirty Read
618 | The dirty read occurs in the case when one transaction updates an item of the database, and then the transaction fails for some reason. The updated database item is accessed by another transaction before it is changed back to the original value. [14]
619 |
620 | For example:
621 | A transaction T1 updates a record which is read by T2. If T1 aborts then T2 now has values which have never formed part of the stable database.
622 |
623 | 
624 | - At time t2, Transaction-Y writes A's value.
625 | - At time t3, Transaction-X reads A's value.
626 | - At time t4, Transaction-Y rolls back, so A's value is changed back to its previous value.
627 | - Transaction-X now contains a value which has never become part of the stable database.
628 |
629 | #### 3.2.1.2. Phantom Read
630 | The so-called phantom problem occurs within a transaction when the same query produces different sets of rows at different times. For example, if a SELECT is executed twice, but returns a row the second time that was not returned the first time, the row is a “phantom” row.
631 |
632 | 
633 | Phantom read process [source](https://vladmihalcea.com/non-repeatable-read/)
634 |
635 | For example: Alice and Bob start two database transactions.
636 | 1. Bob reads all the post_comment records associated with the post row with the identifier value of 1.
637 | 2. Alice adds a new post_comment record which is associated with the post row having the identifier value of 1, and
638 | commits her database transaction.
639 | 3. If Bob re-reads the post_comment records having the post_id column value equal to 1, he will observe a different version of this result set.
640 |
641 | #### 3.2.1.3. Lost updates
642 | A lost update occurs when two different transactions are trying to update the same column on the same row within a database at the same time.
643 |
644 | 
645 | The process involved in a lost update. [Source: Vlad Mihalcea's blog. ](https://medium.com/system-design-blog/cap-theorem-1455ce5fc0a0)
646 | #### 3.2.1.4. Inconsistent Retrievals
647 | Inconsistent Retrievals Problem is also known as unrepeatable read. When a transaction calculates some summary function over a set of data while the other transactions are updating the data, then the Inconsistent Retrievals Problem occurs. [14]
648 |
649 | For example: Suppose two transactions operate on three accounts.
650 | 
651 |
652 | 
653 |
654 | In this example:
655 | - Transaction-X is computing the sum of all balances while Transaction-Y is transferring an amount of 50 from Account-1 to Account-3.
656 | - Transaction-X produces a result of 550, which is incorrect. If we wrote this result to the database, the database would be left in an inconsistent state, because the actual sum is 600.
657 | Here, Transaction-X has seen an inconsistent state of the database.
658 |
659 | ### 3.2.2. Solving concurrency problems with Locking
660 | We can solve concurrency problems above in a RDBMS with locking mechanisms.
661 |
662 | #### 3.2.2.1. Optimistic lock
663 | Optimistic locking assumes there will be no concurrency problem: every time you fetch data, you assume no other thread will modify it, so no lock is taken. Only at update time do you check whether another transaction has modified the data in the meantime, generally using a version number mechanism.
664 |
665 | 
666 |
667 | We have an additional version column. The version column is incremented every time an UPDATE or DELETE is executed, and it is also used in the WHERE clause of the UPDATE and DELETE statements. For this to work, we need to issue the SELECT and read the current version prior to executing the UPDATE or DELETE; otherwise, we would not know what version value to pass to the WHERE clause or how to increment it. [12]
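The version check can be sketched against an in-memory "table"; the row layout and function name are illustrative:

```python
def update_if_version(rows, row_id, new_value, expected_version):
    """Emulates: UPDATE t SET value = ?, version = version + 1
                 WHERE id = ? AND version = ?
    Returns the number of affected rows (0 signals a concurrent modification)."""
    row = rows[row_id]
    if row["version"] != expected_version:
        return 0          # someone else updated the row since we read it
    rows[row_id] = {"value": new_value, "version": expected_version + 1}
    return 1
```

A writer that gets 0 back typically re-reads the row and retries, or reports a conflict to the caller.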
668 |
669 | #### 3.2.2.2. Pessimistic lock
670 | Pessimistic locks are divided into two types: exclusive and shared locks
671 |
672 | ##### Exclusive and shared locks
673 | - **Exclusive lock**: When a transaction acquires an exclusive lock, no other transaction can read or
674 | write the locked data.
675 |
676 | For example: We are implementing a ticket booking system for a cinema
677 | ```sql
678 | CREATE TABLE seats(
679 |   id INT PRIMARY KEY,
680 |   is_booked BOOLEAN,
681 |   customer_id INT
682 | )
683 | ```
684 | 
685 |
686 | ```js
687 | query = "SELECT * FROM seats WHERE id = 15 AND is_booked = false"
688 | result = sqlConn.query(query)
689 | if (!result) {  // no available row returned: the seat is taken
690 |   return "seat is already booked"
691 | }
692 |
693 | updateQuery = "UPDATE seats SET is_booked = true, customer_id = $customer_id WHERE id = 15"
694 | sqlConn.execute(updateQuery)
695 | ```
696 |
697 | These lines of code seem to work fine. However, this piece of code has a race condition and will break when two customers concurrently try to book the same seat, causing inconsistent behaviour.
698 |
699 | | Customer id = 1 | Customer id = 2 |
700 | | ------------------------------------------------------------------ | ------------------------------------------------------------------ |
701 | | SELECT * FROM seats WHERE id = 15 and is_booked = false | SELECT * FROM seats WHERE id = 15 and is_booked = false |
702 | | result = seat(id 15, is_booked = false, customer_id = null ) | result = seat(id 15, is_booked = false, customer_id = null ) |
703 | | update seat set is_booked = true and customer _id = 1 where id =15 | |
704 | | ... a moment later | .... |
705 | | | update seat set is_booked = true and customer _id = 2 where id =15 |
706 |
707 |
708 | Final result: `seat{id: 15, is_booked: true, customer_id: 2}`.
709 | Although customer `1` booked the seat first, the seat finally belongs to customer `2`. We can see that this is a `lost update` concurrency problem.
710 |
711 | How to solve this? We need to acquire an exclusive lock on a row when we're reading it.
712 | ```js
713 | query = "SELECT * FROM seats WHERE id = 15 and is_booked = false FOR UPDATE"
714 | result = sqlConn.query(query)
715 | ```
716 |
717 | With the `FOR UPDATE` clause, we acquire a lock on the row; the lock is released when the transaction commits or rolls back.
718 |
719 |
720 |
721 | - **Shared lock**: every transaction can read, but none can write.
722 | ```sql
723 | SELECT * FROM table_name WHERE ... LOCK in SHARE MODE;
724 | ```
725 | ---
726 | Which one to pick: `optimistic` or `pessimistic`?
727 |
728 | It depends. You should pick your locking scheme based on your application requirements.
729 | - **Optimistic locking** is useful if the possibility for conflicts is very low – there are many records but relatively few users, or very few updates and mostly read-type operations.
730 |
731 | - **Pessimistic locking** is useful if there are a lot of updates and relatively high chances of users trying to update data at the same time.
732 |
733 | For example, in the case of social games
734 | 1. Character and player inventory information is heavily accessed during gameplay, requiring fast read-write access. We should use **Pessimistic locking** for this scenario.
735 |
736 | 2. Player accounts and preferences are read during player login or at the start of a game,
737 | but are not frequently updated. **Optimistic locking** would work well here. [13]
738 | ### 3.2.3. Deadlock
739 |
740 | In a database, a deadlock is a situation that occurs when two or more database sessions each hold a lock on some data, and each session requests a lock on data that another session has already locked. Because the sessions are waiting for each other indefinitely, nothing can get done, and the sessions just waste time instead.
741 |
742 | For example: [16]
743 | ```sql
744 | --Two global temp tables with sample data for demo purposes.
745 | CREATE TABLE ##Employees (
746 | EmpId INT IDENTITY,
747 | EmpName VARCHAR(16),
748 | Phone VARCHAR(16)
749 | )
750 |
751 | INSERT INTO ##Employees (EmpName, Phone)
752 | VALUES ('Martha', '800-555-1212'), ('Jimmy', '619-555-8080')
753 |
754 | CREATE TABLE ##Suppliers(
755 | SupplierId INT IDENTITY,
756 | SupplierName VARCHAR(64),
757 | Fax VARCHAR(16)
758 | )
759 |
760 | INSERT INTO ##Suppliers (SupplierName, Fax)
761 | VALUES ('Acme', '877-555-6060'), ('Rockwell', '800-257-1234')
762 | ```
763 |
764 | ```
765 | Session 1 | Session 2
766 | ===========================================================
767 | BEGIN TRAN; | BEGIN TRAN;
768 | ===========================================================
769 | UPDATE ##Employees
770 | SET EmpName = 'Mary'
771 | WHERE EmpId = 1
772 | ===========================================================
773 | | UPDATE ##Suppliers
774 | | SET Fax = N'555-1212'
775 | | WHERE SupplierId = 1
776 | ===========================================================
777 | UPDATE ##Suppliers
778 | SET Fax = N'555-1212'
779 | WHERE SupplierId = 1
780 | ===========================================================
781 | | UPDATE ##Employees
782 | | SET Phone = N'555-9999'
783 | | WHERE EmpId = 1
784 | ===========================================================
785 | -- Deadlock: each session is now waiting for a lock
786 | -- held by the other; one of them is chosen as the victim.
787 | ===========================================================
788 | ```
789 | ```
790 | Transaction was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
791 | ```
792 | ### 3.2.4. N + 1 query problem
793 | The N+1 query problem occurs when code fetches n records from the database, then runs an additional query for each of those records. [15]
794 |
795 | ```sql
796 | SELECT * FROM users WHERE account_id = 42;
797 | ```
798 | returns users 3,5,27,38,99
799 | - Now select the expenses for each user
800 | ```sql
801 | SELECT * FROM expenses WHERE user_id = 3;
802 | SELECT * FROM expenses WHERE user_id = 5;
803 | SELECT * FROM expenses WHERE user_id = 27;
804 | SELECT * FROM expenses WHERE user_id = 38;
805 | SELECT * FROM expenses WHERE user_id = 99;
806 | ```
807 | **Prefetching** (aka eager loading) means that if we expect the application to also query for the expenses, we batch-query them by user_id, resulting in just one expenses query instead of five.
808 | ```sql
809 | SELECT * FROM users WHERE account_id = 42;
810 | SELECT * FROM expenses WHERE user_id IN (3,5,27,38,99);
811 | ```
812 | Many ORM frameworks already support prefetching (for example, via an `Include`-style method), so just be sure to use it.
813 | # 4. Data warehouse
814 |
815 | # 5. Cloud-native database
816 |
817 | # 6. References
818 | [1] https://blog.yugabyte.com/a-busy-developers-guide-to-database-storage-engines-the-basics/
819 |
820 | [2] https://medium.com/@ckayay/how-to-pick-the-right-database-c2539efe2589
821 |
822 | [3] https://github.com/wiredtiger/wiredtiger/wiki/Btree-vs-LSM
823 |
824 | [4] https://docs.microsoft.com/en-us/dotnet/architecture/cloud-native/relational-vs-nosql-data
825 |
826 | [5] https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
827 |
828 | [6] ACID https://stackoverflow.com/a/999447
829 |
830 | [7] https://www.amazon.com/MongoDB-Action-Kyle-Banker/dp/1935182870
831 |
832 | [8] https://github.com/donnemartin/system-design-primer#eventual-consistency
833 |
834 | [9] https://www.amazon.com/Hands-Software-Engineering-Golang-programming/dp/1838554491
835 |
836 | [10] https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
837 |
838 | [11] https://d0.awsstatic.com/whitepapers/Database/database-caching-strategies-using-redis.pdf
839 |
840 | [12] Optimistic vs. pessimistic locking - https://stackoverflow.com/a/58952004
841 |
842 | [13] https://blog.couchbase.com/optimistic-or-pessimistic-locking-which-one-should-you-pick/
843 |
844 | [14] https://www.javatpoint.com/dbms-concurrency-control
845 |
846 | [15] https://secure.phabricator.com/book/phabcontrib/article/n_plus_one/
847 |
848 | [16] Deadlock - https://stackoverflow.com/a/22825825
849 |
850 | [17] https://medium.com/@User3141592/single-vs-composite-indexes-in-relational-databases-58d0eb045cbe
851 |
852 | [18] https://dev.mysql.com/doc/mysql-partitioning-excerpt/5.7/en/partitioning-overview.html
853 |
--------------------------------------------------------------------------------
/relational_vs_nosql.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/anhthii/database-notes/4e7fad12afab907545e4cb8ee85b9bac3e19b636/relational_vs_nosql.md
--------------------------------------------------------------------------------
/scaling.md:
--------------------------------------------------------------------------------
1 | ## Relational database
2 |
3 | In terms of performance, relational databases such as PostgreSQL [18] and MySQL [17] are generally easy to scale vertically. Switching to a beefier CPU and/or adding more memory to your database server is more or less a standard operating procedure for increasing the queries per second (QPS) or transactions per second (TPS) that the DB can handle. On the other hand, scaling relational databases horizontally is much harder and typically depends on the type of workload you have.
4 |
5 | For write-heavy workloads, we usually resort to techniques such as data sharding. Data sharding allows us to split (partition) the contents of one or more tables into multiple database nodes. This partitioning is achieved by means of a per-row shard key, which dictates which node is responsible for storing each row of the table. One caveat of this approach is that it introduces additional complexity at query time. While writes are quite efficient, reads are not trivial as the database might need to query each individual node and then aggregate the results together in order to answer even a simple query such as `SELECT COUNT(*) FROM X`.
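A minimal sketch of hash-based shard routing, and the scatter-gather that a cross-shard `COUNT(*)` implies (the node names and row counts are illustrative, not from any particular database):

```python
import hashlib

NODES = ["db-node-0", "db-node-1", "db-node-2"]  # illustrative shard nodes

def shard_for(shard_key: str) -> str:
    """Route a row to exactly one node by hashing its shard key."""
    digest = hashlib.sha256(shard_key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# Writes are cheap: each row lands on exactly one node.
rows_by_node = {}
for user_id in range(1000):
    rows_by_node.setdefault(shard_for(str(user_id)), []).append(user_id)

# A simple COUNT(*) now has to touch every node: scatter, then aggregate.
total = sum(len(rows) for rows in rows_by_node.values())
print(total)  # 1000
```

The per-node counts are local and fast; the cost is the extra fan-out and aggregation step at query time.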
6 |
7 | On the other hand, if our workloads are read-heavy, horizontal scaling is usually achieved by spinning up read-replicas, which mirror updates to one or more primary nodes. Writes are always routed to the primary nodes while reads are handled by the read-replicas (ideally) or even by the primaries if the read-replicas cannot be reached.
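The read/write routing just described can be sketched as follows (the connection objects are stand-in strings and the health check is a placeholder, not a real driver API):

```python
import random

class ReplicaRouter:
    """Route writes to the primary and reads to healthy replicas,
    falling back to the primary when no replica is reachable."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def for_write(self):
        # Writes always go to the primary.
        return self.primary

    def for_read(self, is_healthy):
        # Prefer a healthy replica; fall back to the primary.
        healthy = [r for r in self.replicas if is_healthy(r)]
        return random.choice(healthy) if healthy else self.primary

router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.for_write())                    # primary
print(router.for_read(lambda r: False))      # primary (no replica reachable)
```

Real drivers and proxies (e.g. connection poolers) implement this same policy with actual health checks and replication-lag awareness.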
8 |
9 | While relational databases are a great fit for transactional workloads and complex queries, they are not the best tool for querying hierarchical data with arbitrary nesting or for modeling graph-like structures. Moreover, as the volume of stored data exceeds a particular threshold, queries take increasingly longer to run. Eventually, a point is reached where reporting queries that used to execute in real-time can only be processed as offline batch jobs. As a result, companies with high-volume data processing needs have been gradually shifting their focus toward NoSQL databases.
--------------------------------------------------------------------------------
/storage_engine.md:
--------------------------------------------------------------------------------
1 | https://blog.yugabyte.com/a-busy-developers-guide-to-database-storage-engines-the-basics/
2 |
3 | A database storage engine is an internal software component that a database server uses to store, read, update, and delete data in the underlying memory and storage systems.
4 |
5 | ## B-Tree based engine
6 | 
7 |
8 |
9 | ### Pros
10 | - B-trees usually grow wide and shallow, so for most queries very few nodes need to be traversed.
11 | - The net result is high-throughput, low-latency reads
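"Wide and shallow" can be quantified: with a fanout of b children per node, a B-tree over N keys has a height of roughly log_b(N). A quick check, assuming an illustrative fanout of 500 (a plausible order of magnitude for disk-page-sized nodes) and 100 million keys:

```python
import math

fanout = 500            # keys per node, illustrative disk-page-sized node
n_keys = 100_000_000    # 100 million keys

height = math.ceil(math.log(n_keys, fanout))
print(height)  # 3 -> only ~3 node reads to locate any key
```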
12 |
13 | ### Cons
14 | - The need to maintain a well-ordered data structure with random writes usually leads to poor write performance
15 | - This is because random writes to the storage are more expensive than sequential writes.
16 |
17 |
18 | ## Log-Structured Merge (LSM)
19 | As data volumes grew in the mid-2000s, it became necessary to write larger datasets to databases. B-tree engines fell out of favor given their poor write performance.
20 |
21 |
22 | Database designers turned to a new data structure called log-structured merge-tree (or LSM tree)
23 |
24 | > The LSM tree is a data structure with performance characteristics best fit for indexed access to files with high write volume over an extended period.
25 |
26 | 
27 |
28 |
29 | ### Pros
30 | - LSM engines are the de facto standard today for handling workloads with large, fast-growing datasets.
31 | - Fast sequential writes (as opposed to slow random writes in B-tree engines)
32 |
33 | ### Cons
34 | - LSM-tree-based engines exhibit poor read throughput in comparison to B-tree-based engines
35 | - LSM engines consume more CPU resources during read operations and take more memory/disk storage.
36 | However, these issues are mitigated in practice:
37 | - Reads are made faster with approaches such as bloom filters
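A sketch of why bloom filters help LSM reads: before touching an SSTable on disk, the engine consults a small in-memory bit array; a negative answer is definitive, so lookups for absent keys can skip the disk read entirely. A toy version (sizes and hash scheme are illustrative, not any engine's actual parameters):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key: str):
        # Derive num_hashes independent bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely not present": safe to skip the SSTable.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True: a present key is never missed
```

The trade-off is a small false-positive rate (an occasional wasted disk read) in exchange for never missing a present key.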
38 |
39 | ### Real world usage
40 | - Apache Cassandra, Elasticsearch, RocksDB, InfluxDB
41 |
42 | ## Summary
43 | 
44 |
45 | At the core, database storage engines are usually optimized for either read performance (B-tree) or write performance (LSM).
46 |
47 | Over the last 10+ years, data volumes have grown significantly and LSM engines have become the standard. LSM engines can be tuned more easily for higher read performance compared to B-tree engines.
--------------------------------------------------------------------------------