├── assets
│   ├── nyaruko.jpg
│   └── courses
│       └── database
│           ├── doc.png
│           ├── schema_network.png
│           ├── relational_DBMS.png
│           ├── hierarchical_DBMS.png
│           └── object_oriented_DBMS.png
├── courses
│   ├── my_courses.md
│   ├── Compilers
│   │   └── notes.md
│   └── advanced-database-systems
│       └── history_of_databases.md
├── reading-list
│   └── backlog.md
├── README.md
└── books
    ├── notes
    │   ├── grokking_algorithms.md
    │   ├── the_ruthless_elimination_of_hurry.md
    │   ├── we.md
    │   └── designing_data_intensive_applications.md
    └── books_list.md

/assets/nyaruko.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dellamora/reading-diary/HEAD/assets/nyaruko.jpg
--------------------------------------------------------------------------------
/assets/courses/database/doc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dellamora/reading-diary/HEAD/assets/courses/database/doc.png
--------------------------------------------------------------------------------
/assets/courses/database/schema_network.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dellamora/reading-diary/HEAD/assets/courses/database/schema_network.png
--------------------------------------------------------------------------------
/assets/courses/database/relational_DBMS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dellamora/reading-diary/HEAD/assets/courses/database/relational_DBMS.png
--------------------------------------------------------------------------------
/assets/courses/database/hierarchical_DBMS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dellamora/reading-diary/HEAD/assets/courses/database/hierarchical_DBMS.png
--------------------------------------------------------------------------------
/assets/courses/database/object_oriented_DBMS.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dellamora/reading-diary/HEAD/assets/courses/database/object_oriented_DBMS.png
--------------------------------------------------------------------------------
/courses/my_courses.md:
--------------------------------------------------------------------------------
1 | ## I Hope to Finish Before I Die
2 | 
3 | - [Advanced Database Systems (Spring 2020)](/courses/advanced-database-systems/history_of_databases.md)
4 | 
5 | ## Completed Courses
6 | 
7 | - [The Last Algorithms Course You Will Need](https://github.com/dellamora/data-structures-and-algorithms)
--------------------------------------------------------------------------------
/reading-list/backlog.md:
--------------------------------------------------------------------------------
1 | - [Intuitively Understanding the Shannon Entropy](https://www.youtube.com/watch?v=0GCGaw0QOhA)
2 | - [Alan Watts - The Principle Of Not Forcing](https://www.youtube.com/watch?v=ZzaUGhhnlQ8)
3 | - https://shikaan.github.io/assembly/x86/guide/2024/09/16/x86-64-conditionals.html
4 | - [Enter The Arena: Simplifying Memory Management](https://www.youtube.com/watch?v=TZ5a3gCCZYo)
5 | - https://blog.brownplt.org/2021/07/31/behavioral-hof.html
6 | - https://blog.brownplt.org/2022/07/09/plan-comp-hof.html
7 | - https://blog.brownplt.org/2022/08/16/struct-pipe-comp.html
8 | - [A Flock of Functions: Combinators, Lambda Calculus, & Church Encodings in JS - Part II](https://www.youtube.com/watch?v=pAnLQ9jwN-E)
9 | - https://en.m.wikipedia.org/wiki/What_Is_Life%3F
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
13 |
14 |
24 |
25 | **Late 1960s and early 1970s - Relational Model**: [Ted Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) was a mathematician working at IBM Research. He saw developers spending their time rewriting IMS and CODASYL programs every time the database's schema or layout changed.
26 |
27 | To avoid this maintenance:
28 |
29 | - Database abstraction
30 | - Store databases in simple data structures
31 | - Access data through a high-level language
32 | - Physical storage left up to implementation
33 |
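To see the contrast Codd was after, here is a minimal sketch (hypothetical data, in Python rather than anything from the era): the application states *what* it wants over a simple data structure, and the physical layout stays an implementation detail.

```python
# A declarative query over a simple data structure: no pointer-chasing,
# so a change to the physical layout does not break application code.

employees = [
    {"name": "ada", "dept": "research"},
    {"name": "alan", "dept": "research"},
    {"name": "grace", "dept": "systems"},
]

# "Give me everyone in research" - expressed over values, not storage layout.
research = [e["name"] for e in employees if e["dept"] == "research"]
assert research == ["ada", "alan"]
```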
34 |
35 |
36 | **1970s - Relational Model**:
37 |
38 | - System R - IBM Research
39 | - INGRES - U.C. Berkeley (Postgres stands for "post-Ingres")
40 | - Oracle - Larry Ellison
41 |
42 | **1980s - Relational Model**: IBM comes out with DB2 in 1983, SEQUEL becomes the standard (SQL), and many new enterprise DBMSs appear, but Oracle wins the marketplace.
43 |
44 | note: [Stonebraker](https://en.wikipedia.org/wiki/Michael_Stonebraker) creates Postgres
45 |
46 | **1980s - Object-Oriented Databases**: Few of these original DBMSs from the 1980s still exist today, but many of their technologies live on in other forms (JSON, XML).
47 |
48 |
49 |
50 |
51 | **1990s - Boring Days**: No major advancements in database systems or application workloads. Microsoft forks Sybase and creates SQL Server. MySQL is written as a replacement for mSQL. Postgres gets SQL support. SQLite is started in early 2000.
52 |
53 | **2000s - Internet Boom**: The internet boom created a need for new database systems. The old systems were not designed to handle the scale of the internet.
54 |
55 | **2000s - Data Warehouses**: Rise of the special-purpose OLAP DBMSs. Distributed/shared-nothing, relational/SQL, usually closed-source. Significant performance benefits come from using a columnar data storage model.
56 |
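A minimal sketch of why the columnar model pays off for OLAP (toy data, not any particular DBMS): an aggregate over one attribute only has to touch that attribute's values when they are stored contiguously.

```python
rows = [  # row-oriented: each record stored together
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.0},
    {"order_id": 3, "customer": "a", "amount": 7.5},
]

columns = {  # column-oriented: each attribute stored together
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.0, 7.5],
}

# Row store: SUM(amount) must walk every record, reading all attributes.
total_row_store = sum(r["amount"] for r in rows)

# Column store: the same query scans only the "amount" column, which is
# also far friendlier to compression and vectorized execution.
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store == 42.5
```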
57 | **2000s - NoSQL Systems**: Focus on high availability and high scalability. Non-relational data models (document, key/value, etc.), no ACID transactions, custom APIs instead of SQL, usually open source.
58 |
59 | **2010s - NewSQL (a response to NoSQL)**: Provide the same performance for OLTP workloads as NoSQL DBMSs without giving up ACID. Distributed/shared-nothing, relational/SQL, usually closed-source.
60 |
61 | **2010s - Hybrid Systems**: Hybrid Transactional-Analytical Processing (HTAP). Execute fast OLTP like a NewSQL system while also executing complex OLAP queries like a data warehouse system. Distributed/shared-nothing, relational/SQL, mixed open/closed-source.
62 |
63 | **2010s - Cloud Systems**: The first database-as-a-service (DBaaS) offerings were "containerized" versions of existing DBMSs. There are newer DBMSs designed from scratch explicitly for running in a cloud environment.
64 |
65 | **Shared-disk engines**: Instead of writing a custom storage manager, the DBMS leverages distributed storage. This scales the execution layer independently of storage and favors log-structured approaches.
66 |
67 | note: this is what most people mean when they talk about a data lake
68 |
69 | **2010s - Graph Systems**: Systems for storing and querying graph data. Their main advantage over other data models is a graph-centric query API. Recent research demonstrated that it is unclear whether there is any benefit to using a graph-centric execution engine and storage manager.
70 |
71 | **2010s - Timeseries Systems**: Specialized systems that are designed to store timeseries/event data. The design of these systems makes deep assumptions about the distribution of data and workload query patterns.
72 |
73 | **2010s - Specialized Systems**:
74 |
75 | - Embedded DBMSs
76 | - Multi-model DBMSs
77 | - Blockchain DBMSs
78 | - Hardware acceleration
79 |
80 | ### References
81 |
82 | - video: [History of Databases (CMU Databases / Spring 2020)](https://www.youtube.com/watch?v=SdW5RKUboKc&list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O&index=1)
83 |
84 | - Paper: [What Goes Around Comes Around](https://people.cs.umass.edu/~yanlei/courses/CS691LL-f06/papers/SH05.pdf)
85 |
--------------------------------------------------------------------------------
/books/notes/designing_data_intensive_applications.md:
--------------------------------------------------------------------------------
1 | # Part 1: Foundations of Data Systems
2 |
3 | ### Chapter 1: Reliable, Scalable, and Maintainable Applications
4 |
5 | The first chapter was focused on explaining some fundamental ways of thinking about data-intensive applications, what the book will cover, and an overview of what reliability, scalability, and maintainability are.
6 |
7 | - **Reliability**: systems should continue working correctly in the face of hardware faults, software faults, and human error.
8 |
9 | - **Scalability**: As the system expands, whether in terms of data volume, traffic volume, or complexity, it's essential to have viable strategies in place to manage and accommodate this growth effectively.
10 |
11 | - **Maintainability**: systems should be easy to operate, simple, and adaptable.
12 |
13 | ### Chapter 2: Data Models and Query Languages
14 |
15 | Data models are important because they can determine how we think about the problem that we are solving. Applications are built by layering one data model on top of another.
16 |
17 | **SQL**: Based on the relational model proposed by [Edgar F. Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd), SQL is the best-known data model today. Data is organized into relations (called `tables`), where each relation is an unordered collection of `rows`.
18 |
19 | **NoSQL**: NoSQL databases have gained popularity because they handle massive amounts of data and very high write throughput better than relational databases. Many are also free and open source, they support specialized queries that relational databases handle poorly, and they offer more flexibility in data modeling.
20 |
21 | **ORM: The Object-Relational Mismatch**: Application code works with objects, while the relational model stores data as rows spread across many tables, so an awkward translation layer (an ORM) is needed in between. With a document model, the relevant data is often self-contained, so it can be fetched with a single query instead of many joins (the book covers some examples).
22 |
23 | (I just got lazy and didn't write more notes about this chapter)
24 |
25 | ### Chapter 3: Storage and Retrieval
26 |
27 | A database needs to do two things: **store data and retrieve data**.
28 |
29 | - Knowing how a database handles storage and retrieval internally is important for selecting a storage engine that is appropriate for the kind of data you have and the access patterns your application requires.
30 |
31 | **Data Structures**: The most common data structures used in databases are `hash indexes`, `search trees`, and `log-structured storage`.
32 |
33 | **Hash Indexes**: A hash index consists of an array of `buckets`, each of which contains a few `keys` and `pointers` to the records that have that key. The hash function determines which bucket a key is assigned to.
34 |
35 | - **Bitcask** is a storage engine that uses hash indexes and is well suited to situations where the value for each key is updated frequently.
36 |
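A minimal sketch of the bucketed hash index described above (illustrative names and structure, not Bitcask's actual implementation); the "pointer" stored with each key is a byte offset into an append-only data file.

```python
NUM_BUCKETS = 16

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def _bucket(self, key: str) -> list:
        # The hash function determines which bucket a key is assigned to.
        return self.buckets[hash(key) % NUM_BUCKETS]

    def put(self, key: str, offset: int) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already indexed: update in place
                bucket[i] = (key, offset)
                return
        bucket.append((key, offset))      # otherwise add a new entry

    def get(self, key: str) -> int | None:
        for k, offset in self._bucket(key):
            if k == key:
                return offset
        return None

index = HashIndex()
index.put("user:42", 0)     # record written at offset 0 of the data file
index.put("user:42", 128)   # updated value appended later at offset 128
assert index.get("user:42") == 128  # reads always see the latest offset
```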
37 | **B-Trees and LSM-Trees**: B-Trees are a generalization of binary search trees in which each node can contain more than two children. They are well suited to storing data on disk because they read and write fixed-size blocks (pages). LSM-Trees (Log-Structured Merge-Trees) take a different approach: writes are buffered in memory and flushed to disk as sorted segments that are merged in the background, which makes them well suited to write-heavy workloads.
38 |
39 | - **Downsides of LSM-Trees**
40 |
41 | - The process of merging segments is called compaction, and it can interfere with the performance of ongoing reads and writes. LSM-Trees also have more overhead than B-Trees because they have to maintain more data structures.
42 |
43 | - In high-write-throughput scenarios, LSM-Tree databases struggle to balance disk bandwidth between initial writes and ongoing compaction. Initially, when the database is empty, all disk bandwidth is used for writes. However, as the database grows, more bandwidth is needed for compaction. Without careful configuration, this can lead to a backlog of unmerged segments, decreased read performance, and potential disk space exhaustion. Effective management and monitoring of compaction are crucial to avoid these issues.
44 |
45 | **Log-Structured Storage**: Log-structured storage engines write new data to the end of a sequentially written log, and periodically merge segments of the log into a new segment.
46 |
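A minimal sketch of segment merging (compaction), assuming a segment is just a list of key-value records in write order: keeping only the most recent value per key is what reclaims space, and it is also the background work that competes with ongoing reads and writes for disk bandwidth.

```python
def compact(*segments: list[tuple[str, str]]) -> list[tuple[str, str]]:
    latest: dict[str, str] = {}
    # Segments are processed oldest-to-newest, so later writes win.
    for segment in segments:
        for key, value in segment:
            latest[key] = value
    return sorted(latest.items())

old_segment = [("cat", "purr"), ("dog", "woof"), ("cat", "meow")]
new_segment = [("dog", "bark"), ("bird", "tweet")]

# Multiple writes for "cat" and "dog" collapse to one entry each.
assert compact(old_segment, new_segment) == [
    ("bird", "tweet"), ("cat", "meow"), ("dog", "bark"),
]
```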
47 | **Other Indexing Structures**: The book also covers secondary indexes, multi-column indexes, and full-text search / fuzzy indexes.
48 |
49 | - Secondary indexes: A secondary index is an index based on a field that is not the primary key. It speeds up queries that search for records by that field (see the sketch after this list).
50 |
51 | - Multi-column indexes: A multi-column index is an index based on more than one field. Multi-dimensional indexes are a general way of querying several columns at once, as in the range query below.
52 |
53 | ```SQL
54 | SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
55 | AND longitude > -0.1162 AND longitude < -0.1004;
56 | ```
57 |
58 | - Full-text search / fuzzy indexes: Full-text search is a technique for searching a document or a collection of documents for a word or a phrase. Fuzzy indexes are used to search for records that are similar to a given record. (Elasticsearch is a popular search engine that uses fuzzy indexes)
59 |
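A minimal sketch of the secondary index from the first bullet above (hypothetical data): because the indexed field is not unique, each index entry maps a value to the *list* of matching primary keys.

```python
from collections import defaultdict

users = {  # primary key -> record
    1: {"name": "ada", "city": "london"},
    2: {"name": "grace", "city": "new york"},
    3: {"name": "alan", "city": "london"},
}

# Build a secondary index on the non-primary-key field "city".
by_city: dict[str, list[int]] = defaultdict(list)
for pk, record in users.items():
    by_city[record["city"]].append(pk)

# "All users in london" becomes an index lookup plus fetches by primary
# key, instead of a scan over every record.
assert [users[pk]["name"] for pk in by_city["london"]] == ["ada", "alan"]
```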
60 | **Keeping everything in memory**: The book also covers the use of in-memory databases.
61 |
62 | **OLAP VS OLTP**: The difference between OLTP
63 | and OLAP is not always clear-cut, but some typical characteristics are listed below.
64 | | Property | Transaction Processing Systems (OLTP) | Analytic Systems (OLAP) |
65 | |-------------------------------|-----------------------------------------------|-------------------------------------------------|
66 | | Main Read Pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
67 | | Main Write Pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
68 | | Primarily Used By | End user/customer, via web application | Internal analyst, for decision support |
69 | | Data Representation | Latest state of data (current point in time) | History of events that happened over time |
70 | | Dataset Size | Gigabytes to terabytes | Terabytes to petabytes |
71 |
72 | ```text
73 | "On a high level, we saw that storage engines fall into two broad categories: those opti‐
74 | mized for transaction processing (OLTP), and those optimized for analytics (OLAP)."
75 | ```
76 |
77 | **Data Warehousing**: Data warehousing is the process of collecting, storing, and managing data from various sources to provide meaningful business insights. It is a blend of technologies and components that allows the strategic use of data.
78 |
79 | ### Chapter 4: Encoding and Evolution
80 |
81 | **Evolvability**: The ability to make changes to a system in the future, such as adding new features or fixing bugs.
82 |
83 | - For server-side applications, you can perform rolling upgrades, deploying the new version gradually to a few nodes at a time, ensuring smooth operation before moving to others. This minimizes downtime, enabling more frequent releases.
84 |
85 | - Client-side applications rely on users to install updates, which may cause delays in adoption.
86 |
87 | - Backward compatibility: newer code can read data that was written by older code.
88 |
89 | - Forward compatibility: older code can read data that was written by newer code.
90 |
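Both properties can be seen with plain JSON (hypothetical field names; real systems use the schema mechanisms discussed below). Version 2 of a record adds an optional "email" field:

```python
import json

v1_record = json.dumps({"name": "ada"})                    # written by old code
v2_record = json.dumps({"name": "ada", "email": "a@b.c"})  # written by new code

def new_reader(raw: str) -> dict:
    # Backward compatibility: new code supplies a default for the field
    # that older writers never produced.
    return {"email": None, **json.loads(raw)}

def old_reader(raw: str) -> dict:
    # Forward compatibility: old code takes only the fields it knows
    # and ignores the unknown field instead of rejecting the record.
    return {"name": json.loads(raw)["name"]}

assert new_reader(v1_record)["email"] is None
assert old_reader(v2_record) == {"name": "ada"}
```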
91 | **Formats for Encoding Data**: Programs usually work with data in different representations, so we need some kind of translation between them. The book covers some of the most common formats for encoding data.
92 |
93 | **Language-Specific Formats**: Some languages have built-in support for encoding objects into byte sequences. While convenient for in-memory operations, these libraries lack cross-language compatibility. Decoding may require instantiating arbitrary classes, posing security risks. Versioning and efficiency are also challenges, as handling different object versions can be complex.
94 |
95 | (There are a lot of examples in the book, but I didn't write them down)
96 |
97 | **JSON, XML, and Binary Variants**: JSON and XML are popular for encoding data for interchange between systems. They are human-readable, but verbose and slow to parse. Binary variants are more efficient, but sacrifice human readability.
98 |
99 | **Avro**: Avro is a binary encoding format that is smaller and faster to encode and decode than JSON. It is also a schema-based format: individual records carry no field tags, so the reader resolves the data against the writer's schema (shipped alongside the data, e.g. at the start of an Avro file), which is what enables forward and backward compatibility.
100 |
101 | (Note: there are a lot of nice things to say about Avro, but I haven't written them down.)
102 | 
103 |
104 | - Define Schema: Avro schemas are defined using JSON. The schema defines the fields and types of the data being encoded.
105 | - Schema Evolution: Avro supports schema evolution, meaning that the schema can change over time. New fields can be added, fields can be removed, and fields can be renamed. Avro also supports default values for fields, allowing for backward compatibility.
106 | - Writer Schema and Reader Schema: Avro supports the concept of a writer schema and a reader schema. The writer schema is the schema used to encode the data, and the reader schema is the schema used to decode the data. The reader schema can be different from the writer schema, allowing for forward and backward compatibility.
107 |
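A minimal sketch of writer/reader schema resolution (no Avro library, and the resolution rules are simplified to the default-value case): the reader schema adds a field with a default, which is how old data stays readable.

```python
writer_schema = {  # Avro schemas are defined in JSON
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}

reader_schema = {  # a later version of the same record
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

def resolve(record: dict, reader: dict) -> dict:
    out = {}
    for field in reader["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]  # present in writer data
        else:
            # Fill from the reader default (real Avro raises an error
            # if a missing field has no default).
            out[field["name"]] = field["default"]
    return out

decoded = {"name": "ada"}  # what a decoder yields using the writer schema
assert resolve(decoded, reader_schema) == {"name": "ada", "email": None}
```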
108 | **Dataflow Through Databases**: Basically, in a database, the process that writes to the database encodes the data and the process that reads from the database decodes it.
109 |
110 | - Different values written at different times may be encoded differently, so the database must be able to handle multiple encodings of the same field. This is called **schema evolution**.
111 |
112 | **Dataflow Through Services**: When you have processes that need to communicate over a **network**, there are a few different ways of arranging that communication.
113 |
114 | **REST and HTTP**: REST is an architectural style for building distributed systems. It is based on the principles of the web, such as URLs, HTTP methods, and hypermedia. RESTful systems are stateless, meaning that each request from the client to the server must contain all the information necessary to understand the request. RESTful systems are also cacheable, meaning that responses can be cached to improve performance.
115 |
116 | **Remote Procedure Calls (RPC)**: RPC is a synchronous request-response protocol. The client sends a request to the server and waits for a response. The server processes the request and sends a response back to the client. The client is blocked until the server responds.
117 |
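A minimal sketch of that request-response pattern (toy code, no real RPC framework): the client encodes a request and blocks while the "server" decodes it, processes it, and encodes a response.

```python
import json

def server_handle(raw_request: bytes) -> bytes:
    request = json.loads(raw_request)               # server decodes the request
    result = request["a"] + request["b"]            # processes it
    return json.dumps({"result": result}).encode()  # encodes the response

def rpc_add(a: int, b: int) -> int:
    raw = json.dumps({"a": a, "b": b}).encode()  # client encodes the request
    response = server_handle(raw)                # blocks until the server replies
    return json.loads(response)["result"]        # client decodes the response

assert rpc_add(2, 3) == 5
```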
119 | **Message brokers (also called message queues or message-oriented middleware)**: A message broker facilitates communication between different applications by receiving messages from one application and delivering them to another. It acts as an intermediary, enabling decoupled communication and allowing applications to interact without direct awareness of each other.
119 |
120 | ```text
121 | "in general, message brokers are used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic."
122 |
123 | ```
124 |
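A minimal in-process sketch of that pattern (toy code, not a real broker): producers publish to a named topic, and the broker delivers each message to every subscriber of that topic.

```python
from collections import defaultdict
from typing import Callable

class Broker:
    def __init__(self):
        self.subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        for handler in self.subscribers[topic]:  # decoupled delivery:
            handler(message)                     # sender never sees receivers

broker = Broker()
received: list[str] = []
broker.subscribe("orders", received.append)
broker.publish("orders", "order-1 created")
assert received == ["order-1 created"]
```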
125 | **Distributed actor frameworks**: A distributed actor framework is a framework for building distributed systems using the actor model. The actor model is a model of concurrent computation that treats actors as the universal primitives of concurrent computation. In response to a message that it receives, an actor can make local decisions, create more actors, send more messages, and determine how to respond to the next message received.
126 |
127 | **SUMMARY**
128 |
129 | - In databases, data is encoded by the writing process and decoded by the reading process.
130 | - RPC and REST APIs involve encoding a request by the client, decoding it by the server, then encoding a response by the server and decoding it by the client.
131 | - In asynchronous message passing, such as with message brokers or actors, nodes exchange encoded messages, with senders encoding and recipients decoding them for communication.
132 |
--------------------------------------------------------------------------------
/books/books_list.md:
--------------------------------------------------------------------------------
1 | ## Reading
2 | - [Designing Data-Intensive Applications](/books/notes/designing_data_intensive_applications.md)
3 | - [Chip War: The Fight for the World's Most Critical Technology](https://en.wikipedia.org/wiki/Chip_War:_The_Fight_for_the_World%27s_Most_Critical_Technology)
4 |
5 |
6 |
7 | ## Want to Read
8 | - Why We Sleep (Matthew Walker)
9 | - The Circadian Code: Lose Weight, Supercharge Your Energy and Sleep Well Every Night
10 | - The Craft of Research (Wayne C. Booth, Gregory G. Colomb, Joseph M. Williams)
11 | - The Marshmallow Test (Walter Mischel)
12 | - The Complete TurtleTrader: How 23 Novice Investors Became Overnight Millionaires
13 | - [Functional-Light JavaScript](https://github.com/getify/Functional-Light-JS)
14 | - The Design of Everyday Things
15 |
16 | ## Read
17 |