├── .gitignore ├── Behavioral.md ├── OOP_related.md ├── OS_review ├── OS_review.md └── images │ ├── raid.png │ ├── sys_call.png │ └── timer_interrupt.png ├── Phone Interview.md ├── README.md ├── The Technical Interview Cheat Sheet.md ├── complete_system_design ├── .DS_Store ├── glossary_of_system_design │ ├── .DS_Store │ ├── basics.md │ ├── caching.md │ ├── cap_theorem.md │ ├── consistent_hashing.md │ ├── data_partitioning.md │ ├── indexes.md │ ├── key_characteristics_of_distributed_systems.md │ ├── load_balancing.md │ ├── long_polling_websockets_serversent_events.md │ ├── proxies.md │ ├── redundancy_replication.md │ └── sql_nosql.md ├── images │ ├── HTTP_protocol.png │ ├── Vertical_scaling_vs._Horizontal_scaling.png │ ├── accessing.png │ ├── ajax.png │ ├── cap.png │ ├── cap_theorem.png │ ├── client_loadbalancer_server.png │ ├── database_schema.png │ ├── detailed_component.png │ ├── hash1.png │ ├── hash2.png │ ├── hash3.png │ ├── hash4.png │ ├── hash5.png │ ├── high_level_design.png │ ├── high_level_url_shortening.png │ ├── library_catalog_indexes.png │ ├── loadbalancer2.png │ ├── long_polling.png │ ├── proxy.png │ ├── redundancy.png │ ├── redundant_load_balancer.png │ ├── request_flow1.png │ ├── request_flow10.png │ ├── request_flow11.png │ ├── request_flow2.png │ ├── request_flow3.png │ ├── request_flow4.png │ ├── request_flow5.png │ ├── request_flow6.png │ ├── request_flow7.png │ ├── request_flow8.png │ ├── request_flow9.png │ ├── shortening.png │ ├── sse.png │ ├── url1.png │ ├── url2.png │ ├── url3.png │ ├── url4.png │ ├── url5.png │ ├── url6.png │ ├── url7.png │ ├── url8.png │ ├── url9.png │ └── websockets.png └── system_design_problems │ ├── step_by_step_guide.md │ └── url_shortening.md ├── distributed_system └── review.md ├── probability ├── 002_Xinfeng_Zhou_A_Practical_Guide_To_Quant.docx ├── 4710_review.md ├── Practical_Guide_To_Quant.md ├── Xinfeng_Zhou_A_Practical_Guide_To_Quant.pdf ├── complete_practical_guide_to_quant │ ├── Chap1.md │ ├── Chap2.md │ 
├── Chap4.md │ └── images │ │ ├── 2.1.png │ │ ├── 2.2.1.png │ │ ├── 2.2.2.png │ │ ├── 2.2.png │ │ ├── 2.3.png │ │ ├── 2.4.png │ │ ├── 4.1.png │ │ ├── 4.2.1.png │ │ ├── 4.2.png │ │ ├── 4.3.1.png │ │ ├── 4.3.2.png │ │ ├── 4.3.3.png │ │ ├── 4.3.png │ │ ├── 4.4.1.png │ │ ├── 4.4.2.png │ │ ├── 4.4.png │ │ ├── 4.5.png │ │ ├── Table4.1.png │ │ ├── Table4.2.png │ │ └── Table4.3.png └── images │ └── properties_of_random_variables.png ├── quant_trader └── info.md └── system_design ├── Grokking the system design interview.md ├── System Design.md ├── design instagram ├── design url shortening └── glossary_of_system_design ├── basics.md ├── caching.md ├── cap_theorem.md ├── consistent_hashing.md ├── data_partitioning.md ├── indexes.md ├── key_characteristics_of_distributed_systems.md ├── load_balancing.md ├── long_polling_websockets_serversent_events.md ├── proxies.md ├── redundancy_replication.md └── sql_nosql.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.py 2 | *private* 3 | optiver 4 | hrt 5 | drw 6 | de_shaw 7 | jane_street 8 | bridgewater 9 | imc 10 | facebook 11 | PDT 12 | quant_trader 13 | two_sigma 14 | citadel 15 | google 16 | CS5414_slides -------------------------------------------------------------------------------- /Behavioral.md: -------------------------------------------------------------------------------- 1 | ## Questions to ask 2 | ### Two sigma 3 | - Could you briefly introduce Two Sigma? What do you do, and what does your team do? 4 | - What's your typical day? I mean, as an engineer, how do you usually spend your day? 5 | - Halite, the AI challenge, and the 2016 TS Cup: why do you focus so much on AI and robotics? 6 | - Are the things that engineers build only used internally? Also, do you use any products or software from other companies? 7 | - As for trading and all related operations, are humans involved?
In other words, how much do you trust decisions made by machines? 8 | 9 | ## Behavioral Questions 10 | - introduce your favorite project/resume/yourself? 11 | - Have you read the job description? What kind of person is the role looking for, and are you that person? Walk through the project of yours that best matches the requirements, and close by emphasizing that you are a good fit. 12 | 13 | - your greatest weakness/failure? 14 | - Pick a harmless minor weakness/failure. Which past project made you aware of it? What lesson did you learn from it? In which later project did you apply that lesson, what did you do, and what result did you achieve? 15 | - pushy 16 | - others don't get the opportunity to learn 17 | 18 | - your greatest advantage? 19 | - I know you're impressive, but which of your traits best matches the requirements of this role? Close by emphasizing that this particular strength makes you a good fit for the position. 20 | - fast learner 21 | - like to challenge myself with unfamiliar concepts 22 | 23 | - why our company? 24 | - What is the company's mission? My career goal aligns perfectly with it. What does the role require? My background and abilities align perfectly with those requirements. Close by emphasizing that you are a good fit. 25 | - employees' quality: The bar and the median quality of its hires are significantly higher than at other companies, as it makes quite a bit of money per employee and can therefore afford to hire the most competent programmers. At other companies, I'm used to working with people about as good as I am, whereas at jane street I'll be humbled and always learning from the smarter people around me. 26 | - values personal growth: from what I've heard, managers at jane street place a strong emphasis on employees' personal growth, including programming skills and leadership. With a low turnover rate, js can allocate abundant resources, both financial and human, to training and upgrading new employees so they can take on responsibilities in the coming years. This is especially tempting for someone who is about to graduate from college and is seeking tremendous improvement in their skill set. 27 | - collaborative environment: no levels within the company; all programmers are called 'software developers'.
This encourages a collaborative working environment, as we won't be stressed by our teammates' titles. 28 | - Also, what I personally like about the software developer role at jane street is that it is very directed towards the goal of making profits, instead of working on a product for customers as in traditional tech companies. This motivates me as a programmer to contribute more by writing rigorous code and heavily employing unit tests and integration tests. 29 | - At xxx we leverage technology to solve a variety of problems with high degrees of difficulty: managing scarce bandwidth resources, responding to market events in microseconds or less, automatically pricing diverse sets of financial instruments with extremely low error tolerance, and storing and analyzing terabytes of data. Our systems are built to add to the stability of the market, not detract from it; they must operate at peak efficiency in the most extreme market conditions. These systems must also be simple, flexible, and well-architected so they can quickly change to meet the dynamic needs of our industry. Technologists at Optiver work hard, think creatively, and engineer rigorous solutions that make an immediate impact. 30 | - how did you hear about this position? 31 | - Just answer honestly. I usually say I chatted with the company's engineers at a career fair. The key is to repeat at the end: from what I learned, this role does or requires such-and-such, and I have been doing that or have the related skills, so I'm a good fit. 32 | 33 | - what if your teammate/colleague is hard to work with / not contributing? 34 | What do you do when a teammate/colleague doesn't pull their weight or is hard to get along with? 35 | - Do you proactively communicate with your teammate/colleague on a regular basis? Are you willing, for the team's sake, to help shoulder some of their work? Can you resolve the problem in a thoroughly professional way? 36 | 37 | - what if your teammate/colleague disagrees with you? 38 | What do you do when a teammate/colleague disagrees with your opinion? 39 | - Did you spend some time on your own doing a quantitative comparison? Did you present a detailed report or a strong case to your teammate/colleague to persuade them? Can you communicate effectively? 40 | - do a quantitative comparison first 41 | - talk to them 42 | - talk to my manager 43 | 44 | - how do you define success?
45 | - I usually say success means reaching the goals I set for myself, which is easy to talk about. That turns into: do you set goals for yourself? What are your goals? How are you doing on them? How do you want to develop yourself at this company in the future? (develop my tech stack, gain more domain knowledge, see myself in the position of a senior engineer in xx years) 46 | 47 | - what if you get assigned to a challenging task? 48 | - Will you communicate with your boss? Will you communicate with your colleagues? Will you make reasonable requests? Can you resolve the problem in a thoroughly professional way? 49 | - team contract 50 | - schedule 51 | - distribute work reasonably, making use of everyone's strengths 52 | - agreement on emergencies 53 | - keep reflecting on daily work 54 | - double-check that we've satisfied all requirements when the project is finished 55 | - (myself) don't make assumptions, open to any idea in general 56 | 57 | - what if a task is due earlier? what would you do if you have multiple deadlines upcoming? 58 | - How do you manage your time? For example, set up projects and reminders on your calendar. Do you schedule your time according to priorities? Would you set aside some personal interest for the best interest of your team, e.g. not putting too much time into your own exams for the sake of the capstone project? Would you communicate with others to look for a solution? If you are the team lead and learn that the deadline has moved up, would you take action, e.g. call a meeting immediately and re-plan the project's remaining tasks and milestones? 59 | - figure out priority 60 | - set deadlines on my calendar so I won't miss them 61 | - figure out a balance between personal interest and group interest 62 | 63 | - your favorite and least favorite projects and the teamwork in them 64 | 65 | ## Things to notice 66 | - BQ checks whether you are easy to get along with on a team (both up and down the hierarchy), whether you are genuinely interested in the company (this is very important), and client relationships (if the position is client-facing). Find at least one thing you feel makes this company different from others and especially attractive to you. 67 | - Don't answer only the surface meaning of the question. For example, if asked about your weakness, answering "my weakness is that I sometimes pursue perfection too much" and stopping there definitely won't do. 68 | A better answer: I once had project xxx where I pursued perfection too much and missed the deadline. I learned the lesson that finishing the goal sometimes matters more than perfection. In another project xxx, I allocated resources sensibly; even though some things weren't perfect, I finished the task before the deadline, and I proposed a follow-up plan to my boss to polish the project afterwards, which my boss was very happy with. 69 | -------------------------------------------------------------------------------- /OS_review/images/raid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/OS_review/images/raid.png -------------------------------------------------------------------------------- /OS_review/images/sys_call.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/OS_review/images/sys_call.png -------------------------------------------------------------------------------- /OS_review/images/timer_interrupt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/OS_review/images/timer_interrupt.png -------------------------------------------------------------------------------- /Phone Interview.md: -------------------------------------------------------------------------------- 1 | ## Phone Interview Steps & Tricks 2 | ### Process 3 | - Greeting 4 | - "Hey! How is it going?! God, I don't even remember how long I have been working from home, feels like longer than it actually is." 5 | - "I am actually really excited about the interviews; this is the only chance to talk to someone outside of my team and my roommate. (laugh)" 6 | - "How are things in (interviewer's city)? People are starting to wear masks and take things seriously now; hope it will get better soon!" 7 | 8 | - the flow 9 | - Clarifying questions 10 | - Brainstorm data structures before coding 11 | - Think of the brute force solution first 12 | - consider edge cases!! 13 | - State that you know this is too long / inefficient / blah, state why 14 | - Now try to improve upon it 15 | - You can ask for hints if you're absolutely stuck! 16 | - think out loud, explain your thought process 17 | - brute force 18 | - then optimize it 19 | - ask questions!! 20 | - e.g. constraints? the data type of tree node values? does this API care more about speed or space? 21 | - name functions properly, don't name them `solution()` 22 | - "Should I start implementing it in code, or do you want me to continue optimizing it?"
23 | - explain a bit while coding 24 | - explain time complexity 25 | - go through test cases when finished 26 | - improvements on the same coding question 27 | - Javadoc, unit tests, regression tests, performance tuning, benchmarking, A/B testing 28 | 29 | 30 | 31 | ### Miscellaneous 32 | - over-communication 33 | - "Hey, if I look like I am looking to my right/left, that's because my camera is here and I have my codepad opened on another screen" 34 | - "Hi, if I am silent for a couple of secs/mins, I am just thinking about the question" 35 | - Flattery 36 | - compliment them / dig further when they mention a challenge or headache at work 37 | - talk about your potential solutions 38 | - don't say anything negative about companies you've worked at !!!!!! 39 | - Someone who has merely memorized problems writes code and discusses test cases completely differently from someone who truly understands the underlying principles or mathematical proofs. 40 | 41 | 42 | ### Red Flags 43 | - when the interviewer says we're running out of time: "time is up, thank you for your time with us" 44 | - late 45 | - "interesting" may be perfunctory; they may be tired of listening 46 | - "Thank you! (HR) will reach out to you in the next few days." but with no first-person statement highlighting their own action 47 | - "I am not sure XXX." 48 | - "your code seems good..." 49 | 50 | 51 | ### Questions to ask the interviewer 52 | - Name, job title, maybe email (for reference later on) 53 | - What made you choose company X? 54 | - What's the most satisfying project you've worked on? 55 | - What's a typical day for an intern like? 56 | - Any example projects interns have worked on? 57 | - What's your favorite thing about working for your company? 58 | - How does this company compare to other places you've worked before? 59 | - what's a typical day like (if you care about work-life balance) 60 | - What is the most challenging part of your daily work? 61 | - what do you expect from a new hire/intern at my level in the first half year? 62 | - what's the most unique part about working at xxx that you've never experienced before?
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CS_interview_cheatsheet 2 | help me find a job plssss 3 | -------------------------------------------------------------------------------- /complete_system_design/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/.DS_Store -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/glossary_of_system_design/.DS_Store -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/basics.md: -------------------------------------------------------------------------------- 1 | Basics 2 | ==== 3 | 4 | # text 5 | Whenever we are designing a large system, we need to consider a few things: 6 | 7 | What are the different architectural pieces that can be used? 8 | How do these pieces work with each other? 9 | How can we best utilize these pieces: what are the right tradeoffs? 10 | Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save valuable time and resources in the future. In the following chapters, we will try to define some of the core building blocks of scalable systems. Familiarizing yourself with these concepts will greatly help in understanding distributed systems.
In the next section, we will go through Consistent Hashing, CAP Theorem, Load Balancing, Caching, Data Partitioning, Indexes, Proxies, Queues, Replication, and choosing between SQL vs. NoSQL. 11 | 12 | Let's start with the Key Characteristics of Distributed Systems. 13 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/caching.md: -------------------------------------------------------------------------------- 1 | Caching 2 | ==== 3 | # keypoints 4 | - Take advantage of the locality of reference principle: recently requested data is likely to be requested again. 5 | - Caches exist at all levels in the architecture, but are often found at the level nearest to the front end. 6 | 7 | ## Application server cache 8 | - Cache placed on a request layer node. 9 | - When a request layer node is expanded to many nodes 10 | - Load balancer randomly distributes requests across the nodes. 11 | - The same request can go to different nodes. 12 | - Increases cache misses. 13 | - Solutions: 14 | - Global caches 15 | - Distributed caches 16 | 17 | ## Distributed cache 18 | - Each request layer node owns part of the cached data. 19 | - Entire cache is divided up using a consistent hashing function. 20 | - Pro 21 | - Cache space can be increased easily by adding more nodes to the request pool. 22 | - Con 23 | - A missing node leads to lost cache data. 24 | 25 | ## Global cache 26 | - A server or file store that is faster than the original store, and accessible by all request layer nodes. 27 | - Two common forms 28 | - Cache server handles cache misses. 29 | - Used by most applications. 30 | - Request nodes handle cache misses. 31 | - Have a large percentage of the hot data set in the cache. 32 | - An architecture where the files stored in the cache are static and shouldn't be evicted.
33 | - The application logic understands the eviction strategy or hot spots better than the cache. 34 | 35 | ## Content distribution network (CDN) 36 | - For sites serving large amounts of static media. 37 | - Process 38 | - A request first asks the CDN for a piece of static media. 39 | - CDN serves that content if it has it locally available. 40 | - If content isn't available, the CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user. 41 | - If the system is not large enough for a CDN, it can be built like this: 42 | - Serve static media off a separate subdomain using a lightweight HTTP server (e.g. Nginx). 43 | - Cut over the DNS from this subdomain to a CDN later. 44 | 45 | ## Cache invalidation 46 | - Keep the cache coherent with the source of truth. Invalidate the cache when the source of truth has changed. 47 | - Write-through cache 48 | - Data is written into the cache and permanent storage at the same time. 49 | - Pro 50 | - Fast retrieval, complete data consistency, robust to system disruptions. 51 | - Con 52 | - Higher latency for write operations. 53 | - Write-around cache 54 | - Data is written to permanent storage, not the cache. 55 | - Pro 56 | - Reduces flooding the cache with writes that are never re-read. 57 | - Con 58 | - A query for recently written data creates a cache miss and higher latency. 59 | - Write-back cache 60 | - Data is only written to the cache. 61 | - Write to the permanent storage is done later on. 62 | - Pro 63 | - Low latency, high throughput for write-intensive applications. 64 | - Con 65 | - Risk of data loss in case of system disruptions.
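The write-through vs. write-back trade-off in the keypoints above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a real cache library: `backing_store` is just a dict standing in for a database, and all class and method names are made up for the example.

```python
class WriteThroughCache:
    """Writes go to the cache and the backing store together:
    fully consistent, but every write pays the store's latency."""
    def __init__(self, backing_store):
        self.cache = {}
        self.store = backing_store

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value      # synchronous second write -> higher latency

    def read(self, key):
        if key not in self.cache:    # miss: fall back to the slower store
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteBackCache:
    """Writes go to the cache only and are flushed later:
    fast writes, but dirty entries are lost if the node crashes."""
    def __init__(self, backing_store):
        self.cache = {}
        self.dirty = set()           # keys not yet persisted
        self.store = backing_store

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)          # confirmed to the client before persisting

    def flush(self):
        # done on an interval or under certain conditions
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

In the write-back variant, data written but not yet flushed exists only in `self.cache`, which is exactly the data-loss risk the notes call out.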
66 | 67 | ## Cache eviction policies 68 | - FIFO: first in first out 69 | - LIFO: last in first out 70 | - LRU: least recently used 71 | - MRU: most recently used 72 | - LFU: least frequently used 73 | - RR: random replacement 74 | 75 | 76 | 77 | # text 78 | - Load balancing helps you scale horizontally across an ever-increasing number of servers, but caching will enable you to make vastly better use of the resources you already have as well as making otherwise unattainable product requirements feasible. Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again. They are used in almost every layer of computing: hardware, operating systems, web browsers, web applications, and more. A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end where they are implemented to return data quickly without taxing downstream levels. 79 | 80 | ## Application server cache 81 | - Placing a cache directly on a request layer node enables the local storage of response data. Each time a request is made to the service, the node will quickly return local cached data if it exists. If it is not in the cache, the requesting node will query the data from disk. The cache on one request layer node could also be located both in memory (which is very fast) and on the node’s local disk (faster than going to network storage). 82 | - What happens when you expand this to many nodes? If the request layer is expanded to multiple nodes, it’s still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses. 
Two choices for overcoming this hurdle are global caches and distributed caches. 83 | 84 | ## Content Distribution Network (CDN) 85 | - CDNs are a kind of cache that comes into play for sites serving large amounts of static media. In a typical CDN setup, a request will first ask the CDN for a piece of static media; the CDN will serve that content if it has it locally available. If it isn't available, the CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user. 86 | - If the system we are building isn't yet large enough to have its own CDN, we can ease a future transition by serving the static media off a separate subdomain (e.g. static.yourservice.com) using a lightweight HTTP server like Nginx, and cut over the DNS from our servers to a CDN later. 87 | 88 | ## Cache Invalidation 89 | - While caching is fantastic, it does require some maintenance to keep the cache coherent with the source of truth (e.g., the database). If the data is modified in the database, it should be invalidated in the cache; if not, this can cause inconsistent application behavior. 90 | 91 | - Solving this problem is known as cache invalidation; there are three main schemes that are used: 92 | 93 | - Write-through cache 94 | - Under this scheme, data is written into the cache and the corresponding database at the same time. The cached data allows for fast retrieval and, since the same data gets written in the permanent storage, we will have complete data consistency between the cache and the storage. Also, this scheme ensures that nothing will get lost in case of a crash, power failure, or other system disruptions. 95 | - Although write-through minimizes the risk of data loss, every write operation must be done twice before returning success to the client, so this scheme has the disadvantage of higher latency for write operations.
96 | 97 | - Write-around cache 98 | - This technique is similar to write through cache, but data is written directly to permanent storage, bypassing the cache. This can reduce the cache being flooded with write operations that will not subsequently be re-read, but has the disadvantage that a read request for recently written data will create a “cache miss” and must be read from slower back-end storage and experience higher latency. 99 | 100 | - Write-back cache 101 | - Under this scheme, data is written to cache alone and completion is immediately confirmed to the client. The write to the permanent storage is done after specified intervals or under certain conditions. This results in low latency and high throughput for write-intensive applications, however, this speed comes with the risk of data loss in case of a crash or other adverse event because the only copy of the written data is in the cache. 102 | 103 | ## Cache eviction policies 104 | - First In First Out (FIFO) 105 | - The cache evicts the first block accessed first without any regard to how often or how many times it was accessed before. 106 | - Last In First Out (LIFO) 107 | - The cache evicts the block accessed most recently first without any regard to how often or how many times it was accessed before. 108 | - Least Recently Used (LRU) 109 | - Discards the least recently used items first. 110 | - Most Recently Used (MRU) 111 | - Discards, in contrast to LRU, the most recently used items first. 112 | - Least Frequently Used (LFU) 113 | - Counts how often an item is needed. Those that are used least often are discarded first. 114 | - Random Replacement (RR) 115 | - Randomly selects a candidate item and discards it to make space when necessary. 
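As a rough illustration of LRU, the most commonly used of the eviction policies listed above, here is a minimal Python sketch built on `collections.OrderedDict`. It is a toy under stated assumptions (the class name and capacity are invented for the example), not production code.

```python
from collections import OrderedDict

class LRUCache:
    """On overflow, discard the least recently used entry,
    which sits at the front of the ordered dict."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                        # cache miss
        self.data.move_to_end(key)             # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)      # evict the LRU entry
```

For example, with capacity 2, after `put("a", 1)`, `put("b", 2)`, `get("a")`, then `put("c", 3)`, the evicted key is `"b"`: it is the least recently used at the moment of overflow. MRU, FIFO, and LFU differ only in which entry `put` chooses to discard.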
-------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/cap_theorem.md: -------------------------------------------------------------------------------- 1 | CAP Theorem 2 | 3 | # keypoints 4 | [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) 5 | ==== 6 | - it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP) 7 | - Consistency 8 | - All nodes see the same data at the same time 9 | - achieved by updating several nodes before further reads 10 | - every read receives the most recent write or an error 11 | - Availability 12 | - every request receives a response on success/failure 13 | - achieved by replicating the data across different servers 14 | - Partition tolerance 15 | - system continues to work despite message loss or partial failure 16 | - can sustain any amount of network failure 17 | - the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes 18 | - CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability 19 | - CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made. 20 | - [ACID](https://en.wikipedia.org/wiki/ACID) databases choose consistency over availability. 21 | - [BASE](https://en.wikipedia.org/wiki/Eventual_consistency) systems choose availability over consistency. 22 | 23 | # text 24 | - CAP theorem states that it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP): Consistency, Availability, and Partition tolerance. 
When we design a distributed system, trading off among CAP is almost the first thing we want to consider. CAP theorem says while designing a distributed system we can pick only two of the following three options: 25 | - Consistency 26 | - All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads. 27 | - Availability 28 | - Every request gets a response on success/failure. Availability is achieved by replicating the data across different servers. 29 | - Partition tolerance 30 | - The system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages. 31 | ![cap](../images/cap.png) 32 | - We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failures. We can only build a system that has any two of these three properties. Because, to be consistent, all nodes should see the same set of updates in the same order. But if the network loses a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available. 
33 | 34 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/consistent_hashing.md: -------------------------------------------------------------------------------- 1 | Consistent Hashing 2 | ==== 3 | # keypoints 4 | 5 | - Distributed Hash Table (DHT) 6 | - index = hash_function(key) 7 | - distributed caching system 8 | - n cache servers, if index = key % n 9 | - problem 10 | - not horizontally scalable 11 | - when adding a new cache host, all existing mappings are broken 12 | - may not be load balanced 13 | 14 | ## consistent hashing 15 | - minimize reorganization when nodes are added or removed 16 | - only k/n keys need to be remapped 17 | - objects mapped to the same host if possible 18 | - if a host is removed from the system 19 | - objects on that host are shared by other hosts 20 | - if a new host is added 21 | - it takes its share from a few hosts 22 | 23 | ## how it works 24 | - given a list of cache servers 25 | - hash them to integers in the range 26 | - hash a key to a single integer 27 | - move clockwise on the ring 28 | - until finding the first cache 29 | - that cache is the one containing the key 30 | - to add a new server D 31 | - keys originally at C are split 32 | - some are shifted to D 33 | - to remove a server A 34 | - all keys originally mapped to A are remapped to B 35 | - problem: real data is randomly distributed, so it might not be uniform 36 | - add virtual replicas for each cache server 37 | - map each cache to multiple points on the ring, i.e. replicas 38 | - each cache is associated with multiple portions of the ring 39 | 40 | # text 41 | - Distributed Hash Table (DHT) is one of the fundamental components used in distributed scalable systems. Hash Tables need a key, a value, and a hash function, where the hash function maps the key to a location where the value is stored. 42 | - index = hash_function(key) 43 | - Suppose we are designing a distributed caching system.
Given ‘n’ cache servers, an intuitive hash function would be ‘key % n’. It is simple and commonly used. But it has two major drawbacks: 44 | - It is NOT horizontally scalable. Whenever a new cache host is added to the system, all existing mappings are broken. It will be a pain point in maintenance if the caching system contains lots of data. Practically, it becomes difficult to schedule a downtime to update all caching mappings. 45 | - It may NOT be load balanced, especially for non-uniformly distributed data. In practice, it can be easily assumed that the data will not be distributed uniformly. For the caching system, it translates into some caches becoming hot and saturated while the others idle and are almost empty. 46 | - In such situations, consistent hashing is a good way to improve the caching system. 47 | 48 | ## What is Consistent Hashing? 49 | - Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It allows us to distribute data across a cluster in such a way that will minimize reorganization when nodes are added or removed. Hence, the caching system will be easier to scale up or scale down. 50 | - In Consistent Hashing, when the hash table is resized (e.g. a new cache host is added to the system), only ‘k/n’ keys need to be remapped where ‘k’ is the total number of keys and ‘n’ is the total number of servers. Recall that in a caching system using the ‘mod’ as the hash function, all keys need to be remapped. 51 | - In Consistent Hashing, objects are mapped to the same host if possible. When a host is removed from the system, the objects on that host are shared by other hosts; when a new host is added, it takes its share from a few hosts without touching other’s shares. 52 | 53 | ## How does it work? 54 | - As a typical hash function, consistent hashing maps a key to an integer. Suppose the output of the hash function is in the range of [0, 256]. 
Imagine that the integers in the range are placed on a ring such that the values are wrapped around. 55 | - Here’s how consistent hashing works: 56 | 57 | - Given a list of cache servers, hash them to integers in the range. 58 | - To map a key to a server, 59 | - Hash it to a single integer. 60 | - Move clockwise on the ring until finding the first cache it encounters. 61 | - That cache is the one that contains the key. See animation below as an example: key1 maps to cache A; key2 maps to cache C. 62 | ![hash1](../images/hash1.png) 63 | ![hash2](../images/hash2.png) 64 | ![hash3](../images/hash3.png) 65 | ![hash4](../images/hash4.png) 66 | ![hash5](../images/hash5.png) 67 | 68 | - To add a new server, say D, keys that were originally residing at C will be split. Some of them will be shifted to D, while other keys will not be touched. 69 | - To remove a cache or, if a cache fails, say A, all keys that were originally mapped to A will fall into B, and only those keys need to be moved to B; other keys will not be affected. 70 | - For load balancing, as we discussed in the beginning, the real data is essentially randomly distributed and thus may not be uniform. It may make the keys on caches unbalanced. 71 | - To handle this issue, we add “virtual replicas” for caches. Instead of mapping each cache to a single point on the ring, we map it to multiple points on the ring, i.e. replicas. This way, each cache is associated with multiple portions of the ring. 72 | - If the hash function “mixes well,” as the number of replicas increases, the keys will be more balanced. 
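The ring described above (hash points sorted on a circle, clockwise lookup, virtual replicas) can be sketched in Python with the standard `bisect` module. This is illustrative only: MD5 is used merely as a stable, well-mixed hash, and the server names, replica count, and class name are assumptions made up for the example.

```python
import bisect
import hashlib

def _hash(key):
    # MD5 used only as a stable, well-mixed integer hash (not for security)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server is placed at `replicas` virtual points on the ring;
    a key belongs to the first server point clockwise from its hash."""

    def __init__(self, servers=(), replicas=100):
        self.replicas = replicas
        self._points = []    # sorted virtual points on the ring
        self._owner = {}     # point -> server
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            self._points.remove(point)
            del self._owner[point]

    def get(self, key):
        # first point clockwise from hash(key), wrapping past the end
        i = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

Removing a server with this sketch remaps only the keys that server owned, and every other key stays put, which is exactly the k/n remapping property the notes contrast with the mod-n scheme.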
73 | 74 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/data_partitioning.md: -------------------------------------------------------------------------------- 1 | Data Partitioning 2 | ==== 3 | # keypoints 4 | - break up a big database (DB) into many smaller parts 5 | - after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines 6 | 7 | ## Partitioning Methods 8 | - Horizontal partitioning (range based partitioning, data sharding) 9 | - put different rows into different tables 10 | - e.g. 0-1k, 1k-2k, ... 11 | - problem 12 | - if range for partition not chosen carefully, could have unbalanced servers 13 | - Vertical Partitioning 14 | - store tables related to a specific feature in one server 15 | - e.g. server1: insta pics, server2: user info, ... 16 | - problem 17 | - if the app keeps growing, may be necessary to further partition a feature-specific DB across various servers 18 | - Directory Based Partitioning 19 | - create a lookup service which knows your current partitioning scheme 20 | - to find out where a particular data entity resides, query the directory server that holds the mapping from each tuple key to its DB server 21 | 22 | ## Partitioning Criteria 23 | - Key or Hash-based partitioning 24 | - apply a hash function to some key attributes of the entity we are storing -> partition number 25 | - e.g. ID % 100 if we have 100 partitions 26 | - should ensure uniform allocation 27 | - problem 28 | - adding new servers might require rehashing -> downtime for the service 29 | - List partitioning 30 | - each partition assigned a list of values 31 | - to insert a new record, find the partition with the corresponding key 32 | - Round-robin partitioning 33 | - i^th tuple assigned to partition i % n 34 | - Composite partitioning 35 | - combine the above schemes 36 | - e.g. list partitioning -> hash based partitioning 37 | - e.g.
consistent hashing = hash + list partitioning 38 | - when a hash table is resized, only n/m keys need to be remapped on average where n is the number of keys and m is the number of slots 39 | 40 | ## Common Problems of Data Partitioning 41 | - Joins and Denormalization 42 | - if database is partitioned and spread across multiple machines then often not feasible to perform joins 43 | - workaround 44 | - denormalize the database so that queries that previously required joins can be performed from a single table 45 | - but denormalization leads to data inconsistency 46 | - Referential integrity 47 | - enforce data integrity constraints in a partitioned database difficult, e.g. foreign keys 48 | - Rebalancing 49 | - reason to change partition scheme 50 | - data distribution not uniform 51 | - a lot of load on a partition 52 | - solution 53 | - create more DB partitions or rebalance existing partitions 54 | - will incur downtime 55 | - could use directory based partitioning 56 | 57 | 58 | # text 59 | Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers. 60 | 61 | ## Partitioning Methods 62 | - There are many different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are three of the most popular schemes used by various large scale applications. 63 | 64 | - Horizontal partitioning 65 | - In this scheme, we put different rows into different tables. 
For example, if we are storing different places in a table, we can decide that locations with ZIP codes less than 10000 are stored in one table and places with ZIP codes greater than 10000 are stored in a separate table. This is also called range based partitioning, as we are storing different ranges of data in separate tables. Horizontal partitioning is also called Data Sharding. 66 | - The key problem with this approach is that if the value whose range is used for partitioning isn’t chosen carefully, then the partitioning scheme will lead to unbalanced servers. In the previous example, splitting locations based on their ZIP codes assumes that places will be evenly distributed across the different ZIP codes. This assumption is not valid, as there will be a lot more places in a densely populated area like Manhattan than in its suburbs. 67 | 68 | - Vertical Partitioning 69 | - In this scheme, we divide our data to store tables related to a specific feature in their own server. For example, if we are building an Instagram-like application - where we need to store data related to users, photos they upload, and people they follow - we can decide to place user profile information on one DB server, friend lists on another, and photos on a third server. 70 | 71 | - Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, then it may be necessary to further partition a feature-specific DB across various servers (e.g. it would not be possible for a single server to handle all the metadata queries for 10 billion photos by 140 million users). 72 | 73 | - Directory Based Partitioning 74 | - A loosely coupled approach to work around issues mentioned in the above schemes is to create a lookup service which knows your current partitioning scheme and abstracts it away from the DB access code.
So, to find out where a particular data entity resides, we query the directory server that holds the mapping from each tuple key to its DB server. This loosely coupled approach means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without having an impact on the application. 75 | 76 | ## Partitioning Criteria 77 | - Key or Hash-based partitioning 78 | - Under this scheme, we apply a hash function to some key attributes of the entity we are storing; that yields the partition number. For example, suppose we have 100 DB servers and our ID is a numeric value that gets incremented by one each time a new record is inserted; the hash function could then be ‘ID % 100’, which gives us the server number where we can store/read that record. This approach should ensure a uniform allocation of data among servers. The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function, which would require redistribution of data and downtime for the service. A workaround for this problem is to use Consistent Hashing. 79 | 80 | - List partitioning 81 | - In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there. For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries. 82 | 83 | - Round-robin partitioning 84 | - This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the i-th tuple is assigned to partition (i mod n). 85 | 86 | - Composite partitioning 87 | - Under this scheme, we combine any of the above partitioning schemes to devise a new scheme. For example, first applying a list partitioning scheme and then a hash based partitioning.
Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed. 88 | 89 | ## Common Problems of Data Partitioning 90 | - On a partitioned database, there are certain extra constraints on the different operations that can be performed. Most of these constraints are due to the fact that operations across multiple tables or multiple rows in the same table will no longer run on the same server. Below are some of the constraints and additional complexities introduced by partitioning: 91 | 92 | - Joins and Denormalization 93 | - Performing joins on a database which is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database partitions. Such joins will not be efficient, since data has to be compiled from multiple servers. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table. Of course, the service now has to deal with all the perils of denormalization, such as data inconsistency. 94 | 95 | - Referential integrity 96 | - Just as cross-partition queries on a partitioned database are not feasible, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult. 97 | - Most RDBMSs do not support foreign key constraints across databases on different database servers, which means that applications that require referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.
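A quick sketch makes the earlier point about the ‘ID % n’ hash criterion concrete: growing the pool from 4 to 5 servers remaps the vast majority of keys. The numbers and names below are illustrative only:

```python
# Illustrative sketch: measure how many keys 'ID % n' partitioning
# remaps when one server is added to the pool.
def partition(record_id, n_servers):
    return record_id % n_servers  # hash-based partitioning criterion

keys = range(10_000)
before = {k: partition(k, 4) for k in keys}   # 4 servers
after = {k: partition(k, 5) for k in keys}    # add a 5th server
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/{len(keys)} keys must move")  # 8000/10000 keys must move
```

A key keeps its partition only when `k mod 4` equals `k mod 5`, which happens for 4 out of every 20 keys, so 80% of the data has to be shuffled between servers, which is exactly why rehashing forces downtime.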
98 | 99 | - Rebalancing 100 | - There could be many reasons we have to change our partitioning scheme: 101 | - The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition. 102 | - There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos. 103 | - In such cases, either we have to create more DB partitions or have to rebalance existing partitions, which means the partitioning scheme changed and all existing data moved to new locations. Doing this without incurring downtime is extremely difficult. Using a scheme like directory based partitioning does make rebalancing a more palatable experience at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database). 104 | 105 | 106 | Back 107 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/indexes.md: -------------------------------------------------------------------------------- 1 | Indexes 2 | ==== 3 | # keypoints 4 | - a data structure that can be perceived as a table of contents that points us to the location where actual data lives 5 | - Improve the performance of search queries. 6 | - Decrease the write performance bc need to update indices. This performance degradation applies to all insert, update, and delete operations. 7 | 8 | # texts 9 | - Indexes are well known when it comes to databases. Sooner or later there comes a time when database performance is no longer satisfactory. One of the very first things you should turn to when that happens is database indexing. 10 | - The goal of creating an index on a particular table in a database is to make it faster to search through the table and find the row or rows that we want. 
Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records. 11 | 12 | ## Example: A library catalog 13 | - A library catalog is a register that contains the list of books found in a library. The catalog is organized like a database table generally with four columns: book title, writer, subject, and date of publication. There are usually two such catalogs: one sorted by the book title and one sorted by the writer name. That way, you can either think of a writer you want to read and then look through their books or look up a specific book title you know you want to read in case you don’t know the writer’s name. These catalogs are like indexes for the database of books. They provide a sorted list of data that is easily searchable by relevant information. 14 | - Simply put, an index is a data structure that can be perceived as a table of contents that points us to the location where actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index. Assuming a table containing a list of books, the following diagram shows what an index on the ‘Title’ column looks like: 15 | ![library_catalog_indexes](../images/library_catalog_indexes.png) 16 | - Just like a traditional relational data store, we can also apply this concept to larger datasets. The trick with indexes is that we must carefully consider how users will access the data. In the case of data sets that are many terabytes in size, but have very small payloads (e.g., 1 KB), indexes are a necessity for optimizing data access. Finding a small payload in such a large dataset can be a real challenge, since we can’t possibly iterate over that much data in any reasonable time.
Furthermore, it is very likely that such a large data set is spread over several physical devices—this means we need some way to find the correct physical location of the desired data. Indexes are the best way to do this. 17 | 18 | ## How do Indexes decrease write performance? 19 | - An index can dramatically speed up data retrieval but may itself be large due to the additional keys, which slow down data insertion & update. 20 | - When adding rows or making updates to existing rows for a table with an active index, we not only have to write the data but also have to update the index. This will decrease the write performance. This performance degradation applies to all insert, update, and delete operations for the table. For this reason, adding unnecessary indexes on tables should be avoided and indexes that are no longer used should be removed. To reiterate, adding indexes is about improving the performance of search queries. If the goal of the database is to provide a data store that is often written to and rarely read from, in that case, decreasing the performance of the more common operation, which is writing, is probably not worth the increase in performance we get from reading. 21 | For more details, see Database Indexes. 22 | 23 | 24 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/key_characteristics_of_distributed_systems.md: -------------------------------------------------------------------------------- 1 | Key Characteristics of Distributed Systems 2 | ==== 3 | 4 | # keypoints 5 | ## Scalability 6 | - The capability of a system to grow and manage increased demand. 7 | - A system that can continuously evolve to support growing amount of work is scalable. 8 | - Horizontal scaling: by adding more servers into the pool of resources. 9 | - Vertical scaling: by adding more resource (CPU, RAM, storage, etc) to an existing server. 
This approach comes with downtime and an upper limit. 10 | 11 | ## Reliability 12 | - Reliability is the probability that a system performs its intended function, without failure, over a given period. 13 | - A distributed system is reliable if it keeps delivering its service even when one or multiple components fail. 14 | - Reliability is achieved through redundancy of components and data (remove every single point of failure). 15 | 16 | ## Availability 17 | - Availability is the time a system remains operational to perform its required function in a specific period. 18 | - Measured by the percentage of time that a system remains operational under normal conditions. 19 | - A reliable system is available. 20 | - An available system is not necessarily reliable. 21 | - A system with a security hole is available when there is no security attack. 22 | 23 | ## Efficiency 24 | - Latency: response time, the delay to obtain the first piece of data. 25 | - Bandwidth: throughput, amount of data delivered in a given time. 26 | 27 | ## Serviceability / Manageability 28 | - Ease of operating and maintaining the system. 29 | - Simplicity and speed with which a system can be repaired or maintained. 30 | 31 | # text 32 | 33 | Key characteristics of a distributed system include Scalability, Reliability, Availability, Efficiency, and Manageability. Let’s briefly review them: 34 | 35 | ## Scalability 36 | - Scalability is the capability of a system, process, or a network to grow and manage increased demand. Any distributed system that can continuously evolve in order to support the growing amount of work is considered to be scalable. 37 | - A system may have to scale for many reasons, like increased data volume or an increased amount of work, e.g., number of transactions. A scalable system should achieve this scaling without performance loss. 38 | - Generally, the performance of a system, although designed (or claimed) to be scalable, declines with the system size due to the management or environment cost.
For instance, network speed may become slower because machines tend to be far apart from one another. More generally, some tasks may not be distributed, either because of their inherent atomic nature or because of some flaw in the system design. At some point, such tasks would limit the speed-up obtained by distribution. A scalable architecture avoids this situation and attempts to balance the load on all the participating nodes evenly. 39 | - Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more servers into your pool of resources, whereas Vertical scaling means that you scale by adding more power (CPU, RAM, Storage, etc.) to an existing server. 40 | - With horizontal scaling, it is often easier to scale dynamically by adding more machines into the existing pool; vertical scaling is usually limited to the capacity of a single server, and scaling beyond that capacity often involves downtime and comes with an upper limit. 41 | - Good examples of horizontal scaling are Cassandra and MongoDB, as they both provide an easy way to scale horizontally by adding more machines to meet growing needs. Similarly, a good example of vertical scaling is MySQL, as it allows for an easy way to scale vertically by switching from smaller to bigger machines. However, this process often involves downtime. 42 | - ![Vertical scaling vs. Horizontal scaling](../images/Vertical_scaling_vs._Horizontal_scaling.png) 43 | 44 | ## Reliability 45 | - By definition, reliability is the probability that a system keeps performing its intended function, without failure, over a given period. In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail. Reliability represents one of the main characteristics of any distributed system, since in such systems any failing machine can always be replaced by another healthy one, ensuring the completion of the requested task.
46 | - Take the example of a large electronic commerce store (like Amazon), where one of the primary requirements is that any user transaction should never be canceled due to a failure of the machine that is running that transaction. For instance, if a user has added an item to their shopping cart, the system is expected not to lose it. A reliable distributed system achieves this through redundancy of both the software components and data. If the server carrying the user’s shopping cart fails, another server that has the exact replica of the shopping cart should replace it. 47 | - Obviously, redundancy has a cost, and a reliable system has to pay that to achieve such resilience for services by eliminating every single point of failure. 48 | 49 | ## Availability 50 | - By definition, availability is the time a system remains operational to perform its required function in a specific period. It is a simple measure of the percentage of time that a system, service, or a machine remains operational under normal conditions. An aircraft that can be flown for many hours a month without much downtime can be said to have a high availability. Availability takes into account maintainability, repair time, spares availability, and other logistics considerations. If an aircraft is down for maintenance, it is considered not available during that time. 51 | - Reliability is availability over time considering the full range of possible real-world conditions that can occur. An aircraft that can make it through any possible weather safely is more reliable than one that has vulnerabilities to possible conditions. 52 | - Reliability Vs. Availability 53 | - If a system is reliable, it is available. However, if it is available, it is not necessarily reliable.
In other words, high reliability contributes to high availability, but it is possible to achieve a high availability even with an unreliable product by minimizing repair time and ensuring that spares are always available when they are needed. Let’s take the example of an online retail store that has 99.99% availability for the first two years after its launch. However, the system was launched without any information security testing. The customers are happy with the system, but they don’t realize that it isn’t very reliable as it is vulnerable to likely risks. In the third year, the system experiences a series of information security incidents that suddenly result in extremely low availability for extended periods of time. This results in reputational and financial damage to the customers. 54 | 55 | ## Efficiency 56 | - To understand how to measure the efficiency of a distributed system, let’s assume we have an operation that runs in a distributed manner and delivers a set of items as result. Two standard measures of its efficiency are the response time (or latency) that denotes the delay to obtain the first item and the throughput (or bandwidth) which denotes the number of items delivered in a given time unit (e.g., a second). The two measures correspond to the following unit costs: 57 | - Number of messages globally sent by the nodes of the system regardless of the message size. 58 | Size of messages representing the volume of data exchanges. 59 | The complexity of operations supported by distributed data structures (e.g., searching for a specific key in a distributed index) can be characterized as a function of one of these cost units. Generally speaking, the analysis of a distributed structure in terms of ‘number of messages’ is over-simplistic. 
It ignores the impact of many aspects, including the network topology, the network load, and its variation, the possible heterogeneity of the software and hardware components involved in data processing and routing, etc. However, it is quite difficult to develop a precise cost model that would accurately take into account all these performance factors; therefore, we have to live with rough but robust estimates of the system behavior. 60 | 61 | ## Serviceability or Manageability 62 | - Another important consideration while designing a distributed system is how easy it is to operate and maintain. Serviceability or manageability is the simplicity and speed with which a system can be repaired or maintained; if the time to fix a failed system increases, then availability will decrease. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate (i.e., does it routinely operate without failure or exceptions?). 63 | - Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when the system experiences a system fault -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/load_balancing.md: -------------------------------------------------------------------------------- 1 | Load Balancing (LB) 2 | ==== 3 | # keypoints 4 | Help scale horizontally across an ever-increasing number of servers. 
5 | 6 | ## LB locations 7 | - Between user and web server 8 | - Between web servers and an internal platform layer (application servers, cache servers) 9 | - Between internal platform layer and database 10 | 11 | ## Algorithms 12 | - Least connection 13 | - Least response time 14 | - Least bandwidth 15 | - Round robin 16 | - Weighted round robin 17 | - IP hash 18 | 19 | ## Implementation 20 | - Smart clients 21 | - Hardware load balancers 22 | - Software load balancers 23 | 24 | # text 25 | - Load Balancer (LB) is another critical component of any distributed system. It helps to spread the traffic across a cluster of servers to improve responsiveness and availability of applications, websites or databases. LB also keeps track of the status of all the resources while distributing requests. If a server is not available to take new requests or is not responding or has elevated error rate, LB will stop sending traffic to such a server. 26 | - Typically a load balancer sits between the client and the server accepting incoming network and application traffic and distributing the traffic across multiple backend servers using various algorithms. By balancing application requests across multiple servers, a load balancer reduces individual server load and prevents any one application server from becoming a single point of failure, thus improving overall application availability and responsiveness. 27 | ![client_loadbalancer_server](../images/client_loadbalancer_server.png) 28 | - To utilize full scalability and redundancy, we can try to balance the load at each layer of the system. We can add LBs at three places: 29 | - Between the user and the web server 30 | - Between web servers and an internal platform layer, like application servers or cache servers 31 | - Between internal platform layer and database. 32 | ![loadbalancer2](../images/loadbalancer2.png) 33 | 34 | ## Benefits of Load Balancing 35 | - Users experience faster, uninterrupted service. 
Users won’t have to wait for a single struggling server to finish its previous tasks. Instead, their requests are immediately passed on to a more readily available resource. 36 | - Service providers experience less downtime and higher throughput. Even a full server failure won’t affect the end user experience as the load balancer will simply route around it to a healthy server. 37 | - Load balancing makes it easier for system administrators to handle incoming requests while decreasing wait time for users. 38 | - Smart load balancers provide benefits like predictive analytics that determine traffic bottlenecks before they happen. As a result, the smart load balancer gives an organization actionable insights. These are key to automation and can help drive business decisions. 39 | - System administrators experience fewer failed or stressed components. Instead of a single device performing a lot of work, load balancing has several devices perform a little bit of work. 40 | 41 | ## Load Balancing Algorithms 42 | - How does the load balancer choose the backend server? 43 | Load balancers consider two factors before forwarding a request to a backend server. They will first ensure that the server they choose is actually responding appropriately to requests and then use a pre-configured algorithm to select one from the set of healthy servers. We will discuss these algorithms shortly. 44 | 45 | - Health Checks 46 | - Load balancers should only forward traffic to “healthy” backend servers. To monitor the health of a backend server, “health checks” regularly attempt to connect to backend servers to ensure that servers are listening. If a server fails a health check, it is automatically removed from the pool, and traffic will not be forwarded to it until it responds to the health checks again. 47 | 48 | - There is a variety of load balancing methods, which use different algorithms for different needs. 
49 | 50 | - Least Connection Method 51 | — This method directs traffic to the server with the fewest active connections. This approach is quite useful when there are a large number of persistent client connections which are unevenly distributed between the servers. 52 | - Least Response Time Method 53 | — This algorithm directs traffic to the server with the fewest active connections and the lowest average response time. 54 | - Least Bandwidth Method 55 | - This method selects the server that is currently serving the least amount of traffic measured in megabits per second (Mbps). 56 | - Round Robin Method 57 | — This method cycles through a list of servers and sends each new request to the next server. When it reaches the end of the list, it starts over at the beginning. It is most useful when the servers are of equal specification and there are not many persistent connections. 58 | - Weighted Round Robin Method 59 | — The weighted round-robin scheduling is designed to better handle servers with different processing capacities. Each server is assigned a weight (an integer value that indicates the processing capacity). Servers with higher weights receive new connections before those with less weights and servers with higher weights get more connections than those with less weights. 60 | - IP Hash 61 | — Under this method, a hash of the IP address of the client is calculated to redirect the request to a server. 62 | 63 | ## Redundant Load Balancers 64 | - The load balancer can be a single point of failure; to overcome this, a second load balancer can be connected to the first to form a cluster. Each LB monitors the health of the other and, since both of them are equally capable of serving traffic and failure detection, in the event the main load balancer fails, the second load balancer takes over. 
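As a rough illustration of the selection step, here is a Python sketch of round-robin selection gated by a health check, per the algorithms and health checks described above. The names are illustrative only; this is not the API of any real load balancer:

```python
import itertools

# Illustrative sketch: round-robin selection that skips backends
# failing their health check.
class RoundRobinBalancer:
    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy          # health-check callback
        self._cycle = itertools.cycle(self.servers)

    def pick(self):
        # Try each server at most once per pick.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_healthy(server):       # forward only to healthy servers
                return server
        raise RuntimeError("no healthy backend available")

lb = RoundRobinBalancer(["s1", "s2", "s3"], is_healthy=lambda s: s != "s2")
print([lb.pick() for _ in range(4)])  # s2 is skipped: ['s1', 's3', 's1', 's3']
```

Swapping the `pick` body for a minimum over active connection counts, response times, or a hash of the client IP would give the other methods listed above; the health-check gate stays the same in every variant.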
65 | - ![redundant load balancer](../images/redundant_load_balancer.png) 66 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/long_polling_websockets_serversent_events.md: -------------------------------------------------------------------------------- 1 | Long-Polling vs WebSockets vs Server-Sent Events 2 | ==== 3 | 4 | # keypoints 5 | - communication protocols 6 | - long-polling 7 | - WebSockets 8 | - Server-Sent Events 9 | - between a client like a web browser and a web server 10 | - sequence of events for a regular HTTP request 11 | - client opens a connection, requests data from server 12 | - server calculates response 13 | - server sends response back to the client 14 | 15 | ## Ajax Polling 16 | - client repeatedly polls/requests a server for data 17 | - If no data is available, an empty response is returned 18 | - steps 19 | - client opens a connection, requests data from the server using regular HTTP. 20 | - requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 21 | - server calculates the response and sends it back 22 | - client repeats the above three steps periodically 23 | - problem 24 | - client keeps asking the server for new data, a lot of responses are empty -> HTTP overhead 25 | 26 | ## HTTP Long-Polling 27 | - server pushes information to client whenever the data is available.
28 | - client requests as in normal polling, but expects the server may not respond immediately 29 | - if server has no data available, then hold the request instead of sending an empty response, until a timeout 30 | - once data available, full response sent 31 | - client immediately re-requests, so server always has a waiting request 32 | - client has to reconnect periodically after connection closed due to timeouts 33 | 34 | ## WebSockets 35 | - persistent connection between client and server 36 | - both parties can send data at any time 37 | - establishes WebSocket connection through WebSocket handshake 38 | - if it succeeds, client and server can exchange data 39 | - enables communication with low overheads 40 | - real-time data transfer 41 | 42 | ## Server-Sent Events (SSEs) 43 | - client establishes a persistent & long-term connection with the server 44 | - client requires another tech/protocol to send data to server 45 | - steps 46 | - client requests data using regular HTTP 47 | - requested webpage opens a connection to server 48 | - server sends data to client if new info available 49 | - best when real-time traffic needed 50 | - or server generates data in a loop 51 | 52 | 53 | # text 54 | - Long-Polling, WebSockets, and Server-Sent Events are popular communication protocols between a client like a web browser and a web server. First, let’s start with understanding what a standard HTTP web request looks like. The following is the sequence of events for a regular HTTP request: 55 | - The client opens a connection and requests data from the server. 56 | - The server calculates the response. 57 | - The server sends the response back to the client on the opened request. 58 | - ![HTTP protocol](../images/HTTP_protocol.png) 59 | 60 | ## Ajax Polling 61 | - Polling is a standard technique used by the vast majority of AJAX applications. The basic idea is that the client repeatedly polls (or requests) a server for data. The client makes a request and waits for the server to respond with data.
If no data is available, an empty response is returned. 62 | - The client opens a connection and requests data from the server using regular HTTP. 63 | - The requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 64 | - The server calculates the response and sends it back, just like regular HTTP traffic. 65 | - The client repeats the above three steps periodically to get updates from the server. 66 | - The problem with Polling is that the client has to keep asking the server for any new data. As a result, a lot of responses are empty, creating HTTP overhead. 67 | - ![Ajax Polling Protocol](../images/ajax.png) 68 | 69 | ## HTTP Long-Polling 70 | - This is a variation of the traditional polling technique that allows the server to push information to a client whenever the data is available. With Long-Polling, the client requests information from the server exactly as in normal polling, but with the expectation that the server may not respond immediately. That’s why this technique is sometimes referred to as a “Hanging GET”. 71 | - If the server does not have any data available for the client, instead of sending an empty response, the server holds the request and waits until some data becomes available. 72 | - Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server so that the server will almost always have an available waiting request that it can use to deliver data in response to an event. 73 | - The basic life cycle of an application using HTTP Long-Polling is as follows: 74 | - The client makes an initial request using regular HTTP and then waits for a response. 75 | - The server delays its response until an update is available or a timeout has occurred. 76 | - When an update is available, the server sends a full response to the client.
77 | - The client typically sends a new long-poll request, either immediately upon receiving a response or after a pause to allow an acceptable latency period. 78 | - Each Long-Poll request has a timeout. The client has to reconnect periodically after the connection is closed due to timeouts. 79 | 80 | ## WebSockets 81 | - WebSocket provides full-duplex communication channels over a single TCP connection. It provides a persistent connection between a client and a server that both parties can use to start sending data at any time. The client establishes a WebSocket connection through a process known as the WebSocket handshake. If the process succeeds, then the server and client can exchange data in both directions at any time. The WebSocket protocol enables communication between a client and a server with lower overheads, facilitating real-time data transfer from and to the server. This is made possible by providing a standardized way for the server to send content to the browser without being asked by the client and allowing for messages to be passed back and forth while keeping the connection open. In this way, a two-way (bi-directional) ongoing conversation can take place between a client and a server. 82 | 83 | ## Server-Sent Events (SSEs) 84 | - Under SSEs the client establishes a persistent and long-term connection with the server. The server uses this connection to send data to a client. If the client wants to send data to the server, it would require the use of another technology/protocol to do so. 85 | - Client requests data from a server using regular HTTP. 86 | - The requested webpage opens a connection to the server. 87 | - The server sends the data to the client whenever there’s new information available. 88 | - SSEs are best when we need real-time traffic from the server to the client or if the server is generating data in a loop and will be sending multiple events to the client.
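The long-polling lifecycle described above can be sketched as a simple client loop. This is a minimal illustration only: `fetch_update` is a hypothetical stand-in for the blocking HTTP request a real client would make, not a real API.

```python
import time

def fetch_update(timeout):
    """Hypothetical stand-in for an HTTP long-poll request.

    A real client would issue a GET and block until the server
    responds with data, or until `timeout` seconds pass (None)."""
    time.sleep(0)          # a real request would block here
    return "update"

def long_poll(handle, max_requests=3, timeout=30):
    """Re-issue the request immediately after each response, so the
    server almost always has a waiting request it can answer."""
    for _ in range(max_requests):
        data = fetch_update(timeout)
        if data is not None:   # server answered before the timeout
            handle(data)
        # on timeout (data is None), simply reconnect and wait again
```

The key point the sketch shows is the immediate re-request after each response, which is what distinguishes long-polling from plain periodic polling.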
89 | - ![Server Sent Events Protocol](../images/sse.png) -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/proxies.md: -------------------------------------------------------------------------------- 1 | Proxies 2 | ==== 3 | 4 | # keypoints 5 | - A proxy server is an intermediary piece of hardware / software sitting between client and backend server. 6 | - Filter requests 7 | - Log requests 8 | - Transform requests 9 | - adding/removing headers 10 | - encrypting/decrypting 11 | - compressing a resource 12 | - cache 13 | - if multiple clients access a particular resource, proxy server can cache it 14 | 15 | ## Proxy Server Types 16 | - Open Proxy 17 | - accessible by any Internet user 18 | - Anonymous Proxy 19 | - reveals its identity as a server but does not disclose the initial IP address 20 | - Transparent Proxy 21 | - identifies itself 22 | - with the support of HTTP headers, the first IP address can be viewed 23 | - can cache the websites 24 | - Reverse Proxy 25 | - retrieves resources on behalf of a client from servers 26 | - then returned to the client 27 | 28 | # text 29 | - A proxy server is an intermediate server between the client and the back-end server. Clients connect to proxy servers to make a request for a service like a web page, file, connection, etc. In short, a proxy server is a piece of software or hardware that acts as an intermediary for requests from clients seeking resources from other servers. 30 | - Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compressing a resource). Another advantage of a proxy server is that its cache can serve a lot of requests. If multiple clients access a particular resource, the proxy server can cache it and serve it to all the clients without going to the remote server.
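The filter/log/cache roles described above can be illustrated with a toy sketch. Everything here is hypothetical (the blocklist, the `fetch` and `log` callbacks); it shows only the decision logic a proxy applies per request, not a real proxy implementation.

```python
BLOCKED_HOSTS = {"ads.example.com"}   # hypothetical filter rule

def proxy_request(url, cache, fetch, log=print):
    """Serve `url` through the proxy: filter, log, and answer from the
    cache when possible; otherwise forward to the remote via `fetch`."""
    host = url.split("/")[2]          # crude host extraction for the demo
    if host in BLOCKED_HOSTS:
        log(f"BLOCKED {url}")         # filtered request, never forwarded
        return None
    if url in cache:
        log(f"HIT {url}")             # served from the proxy cache
        return cache[url]
    log(f"MISS {url}")
    body = fetch(url)                 # go to the remote server
    cache[url] = body                 # cache for subsequent clients
    return body
```

After the first client fetches a resource, every later client asking for the same URL is answered from the cache without touching the remote server.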
31 | ![proxy](../images/proxy.png) 32 | 33 | ## Proxy Server Types 34 | - Proxies can reside on the client’s local server or anywhere between the client and the remote servers. Here are a few famous types of proxy servers: 35 | 36 | - Open Proxy 37 | - An open proxy is a proxy server that is accessible by any Internet user. Generally, a proxy server only allows users within a network group (i.e. a closed proxy) to store and forward Internet services such as DNS or web pages to reduce and control the bandwidth used by the group. With an open proxy, however, any user on the Internet is able to use this forwarding service. There are two famous open proxy types: 38 | 39 | - Anonymous Proxy 40 | - This proxy reveals its identity as a server but does not disclose the initial IP address. Though this proxy server can be discovered easily, it can be beneficial for some users as it hides their IP address. 41 | - Transparent Proxy 42 | - This proxy server again identifies itself, and with the support of HTTP headers, the first IP address can be viewed. The main benefit of using this sort of server is its ability to cache the websites. 43 | - Reverse Proxy 44 | - A reverse proxy retrieves resources on behalf of a client from one or more servers. These resources are then returned to the client, appearing as if they originated from the proxy server itself. 45 | 46 |
10 | - primary gets all updates 11 | - then ripple through to the replica servers 12 | - replica outputs a message if it received the update successfully 13 | - Shared-nothing architecture 14 | - Each node can operate independently of one another. 15 | - No central service managing state or orchestrating activities. 16 | - New servers can be added without special conditions or knowledge. 17 | - No single point of failure. 18 | 19 | # text 20 | - Redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance. For example, if there is only one copy of a file stored on a single server, then losing that server means losing the file. Since losing data is seldom a good thing, we can create duplicate or redundant copies of the file to solve this problem. 21 | - Redundancy plays a key role in removing the single points of failure in the system and provides backups if needed in a crisis. For example, if we have two instances of a service running in production and one fails, the system can failover to the other one. 22 | ![redundancy](../images/redundancy.png) 23 | - Replication means sharing information to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility. 24 | - Replication is widely used in many database management systems (DBMS), usually with a primary-replica relationship between the original and the copies. The primary server gets all the updates, which then ripple through to the replica servers. Each replica outputs a message stating that it has received the update successfully, thus allowing the sending of subsequent updates.
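The primary-replica update flow above can be sketched as an in-memory toy (hypothetical class and method names, not a real DBMS): the primary applies every write locally, ripples it to each replica, and collects their acknowledgements.

```python
class Replica:
    """A copy of the data that only receives updates from the primary."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True            # ack: update received successfully

class Primary:
    """All writes go to the primary, then ripple through to replicas."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        # Send the update to every replica and collect acks. A real
        # system would wait for these acks before sending the next
        # update, which is what keeps the copies consistent.
        return all(replica.apply(key, value) for replica in self.replicas)
```

If any replica failed to acknowledge, `write` would return `False`, signalling that the update has not yet safely rippled to all copies.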
25 | 26 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/sql_nosql.md: -------------------------------------------------------------------------------- 1 | SQL vs. NoSQL 2 | ==== 3 | # keypoints 4 | ## sql (relational databases) 5 | - structured 6 | - have predefined schemas 7 | - e.g. phone books that store phone numbers and addresses 8 | - store data in rows and columns 9 | - row contains information about one entity 10 | - column contains separate data points 11 | 12 | ## NoSQL (non-relational databases) 13 | - unstructured, distributed 14 | - have a dynamic schema 15 | - e.g. file folders that hold everything from a person’s address to their Facebook ‘likes’ 16 | 17 | ## Common types of NoSQL 18 | ### Key-value stores 19 | - Array of key-value pairs. The "key" is an attribute name. 20 | - Redis, Voldemort, Dynamo. 21 | 22 | ### Document databases 23 | - Data is stored in documents. 24 | - Documents are grouped in collections. 25 | - Each document can have an entirely different structure. 26 | - CouchDB, MongoDB. 27 | 28 | ### Wide-column / columnar databases 29 | - Column families - containers for rows. 30 | - No need to know all the columns up front. 31 | - Each row can have a different number of columns. 32 | - Cassandra, HBase. 33 | 34 | ### Graph database 35 | - Data is stored in graph structures 36 | - Nodes: entities 37 | - Properties: information about the entities 38 | - Lines: connections between the entities 39 | - Neo4J, InfiniteGraph 40 | 41 | ## Differences between SQL and NoSQL 42 | ### Storage 43 | - SQL: store data in tables. 44 | - NoSQL: have different data storage models. 45 | - key-value 46 | - document 47 | - graph 48 | - columnar 49 | 50 | ### Schema 51 | - SQL 52 | - Each record conforms to a fixed schema. 53 | - each row must have data for each column 54 | - Schema can be altered, but it requires modifying the whole database and going offline.
55 | - NoSQL: 56 | - Schemas are dynamic. 57 | - each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’ 58 | 59 | ### Querying 60 | - SQL 61 | - Use SQL (structured query language) for defining and manipulating the data. 62 | - NoSQL 63 | - Queries are focused on a collection of documents. 64 | - UnQL (unstructured query language). 65 | - Different databases have different syntax. 66 | 67 | ### Scalability 68 | - SQL 69 | - Vertically scalable (by increasing the horsepower: memory, CPU, etc) and expensive. 70 | - Horizontally scalable (across multiple servers); but it can be challenging and time-consuming. 71 | - NoSQL 72 | - Horizontally scalable (by adding more servers) and cheap. 73 | 74 | ### ACID 75 | - Atomicity, consistency, isolation, durability 76 | - SQL 77 | - ACID compliant 78 | - Data reliability 79 | - Guarantee of transactions 80 | - NoSQL 81 | - Most sacrifice ACID compliance for performance and scalability. 82 | 83 | ## Which one to use? 84 | ### SQL 85 | - Ensure ACID compliance. 86 | - Reduce anomalies. 87 | - Protect database integrity. 88 | - Data is structured and unchanging. 89 | 90 | ### NoSQL 91 | - Data has little or no structure. 92 | - Make the most of cloud computing and storage. 93 | - Cloud-based storage requires data to be easily spread across multiple servers to scale up. 94 | - Rapid development. 95 | - Frequent updates to the data structure. 96 | 97 | # text 98 | - In the world of databases, there are two main types of solutions: SQL and NoSQL (or relational databases and non-relational databases). Both of them differ in the way they were built, the kind of information they store, and the storage method they use. 99 | - Relational databases are structured and have predefined schemas like phone books that store phone numbers and addresses.
Non-relational databases are unstructured, distributed, and have a dynamic schema like file folders that hold everything from a person’s address and phone number to their Facebook ‘likes’ and online shopping preferences. 100 | 101 | ## SQL 102 | Relational databases store data in rows and columns. Each row contains all the information about one entity and each column contains all the separate data points. Some of the most popular relational databases are MySQL, Oracle, MS SQL Server, SQLite, Postgres, and MariaDB. 103 | 104 | ## NoSQL 105 | Following are the most common types of NoSQL: 106 | - Key-Value Stores: 107 | - Data is stored in an array of key-value pairs. The ‘key’ is an attribute name which is linked to a ‘value’. Well-known key-value stores include Redis, Voldemort, and Dynamo. 108 | - Document Databases 109 | - In these databases, data is stored in documents (instead of rows and columns in a table) and these documents are grouped together in collections. Each document can have an entirely different structure. Document databases include CouchDB and MongoDB. 110 | - Wide-Column Databases 111 | - Instead of ‘tables,’ in columnar databases we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase. 112 | - Graph Databases 113 | - These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph databases include Neo4J and InfiniteGraph.
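To make the storage-model differences concrete, here is a toy comparison of how the same kind of record might look under each model. These are plain Python data structures for illustration only, not the APIs of any real database.

```python
# Relational (SQL): fixed schema, every row has the same columns.
sql_columns = ("id", "name", "city")
sql_row = (1, "Alice", "Ithaca")

# Key-value store: an opaque value looked up by a single key.
kv_store = {"user:1": '{"name": "Alice", "city": "Ithaca"}'}

# Document store: documents in a collection may have different structures.
documents = [
    {"_id": 1, "name": "Alice", "city": "Ithaca"},
    {"_id": 2, "name": "Bob", "likes": ["databases"]},  # no 'city' field
]

# Wide-column: rows in a column family need not share the same columns.
column_family = {
    "row1": {"name": "Alice", "city": "Ithaca"},
    "row2": {"name": "Bob"},
}

# Graph: nodes (entities) with properties, plus edges (connections).
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, 2, "follows")]
```

The contrast to notice: the relational row is meaningless without its fixed column list, while the document, wide-column, and graph representations each carry their own structure per record.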
114 | 115 | ## High level differences between SQL and NoSQL 116 | - Storage 117 | - SQL stores data in tables where each row represents an entity and each column represents a data point about that entity; for example, if we are storing a car entity in a table, different columns could be ‘Color’, ‘Make’, ‘Model’, and so on. 118 | - NoSQL databases have different data storage models. The main ones are key-value, document, graph, and columnar. We will discuss differences between these databases below. 119 | 120 | - Schema 121 | - In SQL, each record conforms to a fixed schema, meaning the columns must be decided and chosen before data entry and each row must have data for each column. The schema can be altered later, but it involves modifying the whole database and going offline. 122 | - In NoSQL, schemas are dynamic. Columns can be added on the fly and each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’ 123 | 124 | - Querying 125 | - SQL databases use SQL (structured query language) for defining and manipulating the data, which is very powerful. In a NoSQL database, queries are focused on a collection of documents. Sometimes it is also called UnQL (Unstructured Query Language). Different databases have different syntax for using UnQL. 126 | 127 | - Scalability 128 | - In most common situations, SQL databases are vertically scalable, i.e., by increasing the horsepower (higher Memory, CPU, etc.) of the hardware, which can get very expensive. It is possible to scale a relational database across multiple servers, but this is a challenging and time-consuming process. 129 | - On the other hand, NoSQL databases are horizontally scalable, meaning we can add more servers easily in our NoSQL database infrastructure to handle a lot of traffic. Any cheap commodity hardware or cloud instances can host NoSQL databases, thus making it a lot more cost-effective than vertical scaling. 
A lot of NoSQL technologies also distribute data across servers automatically. 130 | 131 | - Reliability or ACID Compliance (Atomicity, Consistency, Isolation, Durability): The vast majority of relational databases are ACID compliant. So, when it comes to data reliability and safely guaranteeing transactions, SQL databases are still the better bet. 132 | 133 | Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability. 134 | 135 | ## SQL vs. NoSQL - Which one to use? 136 | When it comes to database technology, there’s no one-size-fits-all solution. That’s why many businesses rely on both relational and non-relational databases for different needs. Even as NoSQL databases are gaining popularity for their speed and scalability, there are still situations where a highly structured SQL database may perform better; choosing the right technology hinges on the use case. 137 | 138 | ### Reasons to use a SQL database 139 | Here are a few reasons to choose a SQL database: 140 | 141 | We need to ensure ACID compliance. ACID compliance reduces anomalies and protects the integrity of your database by prescribing exactly how transactions interact with the database. Generally, NoSQL databases sacrifice ACID compliance for scalability and processing speed, but for many e-commerce and financial applications, an ACID-compliant database remains the preferred option. 142 | Your data is structured and unchanging. If your business is not experiencing massive growth that would require more servers and if you’re only working with data that is consistent, then there may be no reason to use a system designed to support a variety of data types and high traffic volume. 143 | ### Reasons to use a NoSQL database 144 | When all the other components of our application are fast and seamless, NoSQL databases prevent data from being the bottleneck.
Big data has contributed to the success of NoSQL databases, mainly because they handle data differently than traditional relational databases. A few popular examples of NoSQL databases are MongoDB, CouchDB, Cassandra, and HBase. 145 | 146 | Storing large volumes of data that often have little to no structure. A NoSQL database sets no limits on the types of data we can store together and allows us to add new types as the need changes. With document-based databases, you can store data in one place without having to define what “types” of data those are in advance. 147 | Making the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution but requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box, without a lot of headaches. 148 | Rapid development. NoSQL is extremely useful for rapid development as it doesn’t need to be prepped ahead of time. If you’re working on quick iterations of your system which require making frequent updates to the data structure without a lot of downtime between versions, a relational database will slow you down. 149 |
150 | -------------------------------------------------------------------------------- /complete_system_design/images/HTTP_protocol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/HTTP_protocol.png -------------------------------------------------------------------------------- /complete_system_design/images/Vertical_scaling_vs._Horizontal_scaling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/Vertical_scaling_vs._Horizontal_scaling.png -------------------------------------------------------------------------------- /complete_system_design/images/accessing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/accessing.png -------------------------------------------------------------------------------- /complete_system_design/images/ajax.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/ajax.png -------------------------------------------------------------------------------- /complete_system_design/images/cap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/cap.png
-------------------------------------------------------------------------------- /complete_system_design/images/cap_theorem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/cap_theorem.png -------------------------------------------------------------------------------- /complete_system_design/images/client_loadbalancer_server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/client_loadbalancer_server.png -------------------------------------------------------------------------------- /complete_system_design/images/database_schema.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/database_schema.png -------------------------------------------------------------------------------- /complete_system_design/images/detailed_component.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/detailed_component.png -------------------------------------------------------------------------------- /complete_system_design/images/hash1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash1.png -------------------------------------------------------------------------------- /complete_system_design/images/hash2.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash2.png -------------------------------------------------------------------------------- /complete_system_design/images/hash3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash3.png -------------------------------------------------------------------------------- /complete_system_design/images/hash4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash4.png -------------------------------------------------------------------------------- /complete_system_design/images/hash5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash5.png -------------------------------------------------------------------------------- /complete_system_design/images/high_level_design.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/high_level_design.png -------------------------------------------------------------------------------- /complete_system_design/images/high_level_url_shortening.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/high_level_url_shortening.png -------------------------------------------------------------------------------- /complete_system_design/images/library_catalog_indexes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/library_catalog_indexes.png -------------------------------------------------------------------------------- /complete_system_design/images/loadbalancer2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/loadbalancer2.png -------------------------------------------------------------------------------- /complete_system_design/images/long_polling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/long_polling.png -------------------------------------------------------------------------------- /complete_system_design/images/proxy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/proxy.png -------------------------------------------------------------------------------- /complete_system_design/images/redundancy.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/redundancy.png -------------------------------------------------------------------------------- /complete_system_design/images/redundant_load_balancer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/redundant_load_balancer.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow1.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow10.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow11.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow2.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow3.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow4.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow5.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow6.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow7.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow8.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow9.png -------------------------------------------------------------------------------- /complete_system_design/images/shortening.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/shortening.png -------------------------------------------------------------------------------- /complete_system_design/images/sse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/sse.png -------------------------------------------------------------------------------- /complete_system_design/images/url1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url1.png 
-------------------------------------------------------------------------------- /complete_system_design/images/url2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url2.png -------------------------------------------------------------------------------- /complete_system_design/images/url3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url3.png -------------------------------------------------------------------------------- /complete_system_design/images/url4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url4.png -------------------------------------------------------------------------------- /complete_system_design/images/url5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url5.png -------------------------------------------------------------------------------- /complete_system_design/images/url6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url6.png -------------------------------------------------------------------------------- /complete_system_design/images/url7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url7.png -------------------------------------------------------------------------------- /complete_system_design/images/url8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url8.png -------------------------------------------------------------------------------- /complete_system_design/images/url9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url9.png -------------------------------------------------------------------------------- /complete_system_design/images/websockets.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/websockets.png -------------------------------------------------------------------------------- /complete_system_design/system_design_problems/step_by_step_guide.md: -------------------------------------------------------------------------------- 1 | System Design Interviews: A step by step guide 2 | ===== 3 | # keypoints 4 | 5 | 6 | # text 7 | - A lot of software engineers struggle with system design interviews (SDIs) primarily because of three reasons: 8 | - The unstructured nature of SDIs, where the candidates are asked to work on an open-ended design problem that doesn’t have a standard answer. 9 | - Candidates lack experience in developing complex and large scale systems. 10 | - Candidates did not spend enough time to prepare for SDIs. 
11 | - Like coding interviews, candidates who haven’t put deliberate effort into preparing for SDIs mostly perform poorly, especially at top companies like Google, Facebook, Amazon, Microsoft, etc. In these companies, candidates who do not perform above average have a limited chance to get an offer. On the other hand, a good performance always results in a better offer (higher position and salary) since it shows the candidate’s ability to handle a complex system. 12 | - In this course, we’ll follow a step-by-step approach to solving multiple design problems. First, let’s go through these steps: 13 | 14 | ## Step 1: Requirements clarifications 15 | - It is always a good idea to ask questions about the exact scope of the problem we are trying to solve. Design questions are mostly open-ended, and they don’t have ONE correct answer. That’s why clarifying ambiguities early in the interview becomes critical. Candidates who spend enough time defining the end goals of the system always have a better chance of being successful in the interview. Also, since we only have 35-40 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on. 16 | - Let’s expand this with an actual example of designing a Twitter-like service. Here are some questions for designing Twitter that should be answered before moving on to the next steps: 17 | - Will users of our service be able to post tweets and follow other people? 18 | - Should we also design the creation and display of the user’s timeline? 19 | - Will tweets contain photos and videos? 20 | - Are we focusing on the backend only, or are we developing the front-end too? 21 | - Will users be able to search tweets? 22 | - Do we need to display hot trending topics? 23 | - Will there be any push notifications for new (or important) tweets? 24 | - All such questions will determine what our end design will look like.
25 | 26 | ## Step 2: Back-of-the-envelope estimation 27 | - It is always a good idea to estimate the scale of the system we’re going to design. This will also help later when we focus on scaling, partitioning, load balancing, and caching. 28 | - What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)? 29 | - How much storage will we need? We will have different storage requirements if users can have photos and videos in their tweets. 30 | - What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers. 31 | 32 | ## Step 3: System interface definition 33 | - Define what APIs are expected from the system. This will establish the exact contract expected from the system and ensure we haven’t gotten any requirements wrong. Some examples of APIs for our Twitter-like service will be: 34 | - ``postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, …)`` 35 | - ``generateTimeline(user_id, current_time, user_location, …)`` 36 | - ``markTweetFavorite(user_id, tweet_id, timestamp, …)`` 37 | 38 | ## Step 4: Defining data model 39 | - Defining the data model in the early part of the interview will clarify how data will flow between different system components. Later, it will guide data partitioning and management. The candidate should identify various entities of the system, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service: 40 | - User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc. 41 | - Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc. 42 | - UserFollow: UserID1, UserID2 43 | - FavoriteTweets: UserID, TweetID, TimeStamp 44 | - Which database system should we use?
Will NoSQL like Cassandra best fit our needs, or should we use a MySQL-like solution? What kind of block storage should we use to store photos and videos? 45 | 46 | ## Step 5: High-level design 47 | - Draw a block diagram with 5-6 boxes representing the core components of our system. We should identify enough components to solve the actual problem from end to end. 48 | - For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we’re assuming that we will have a lot more read traffic (compared to writes), we can decide to have separate servers to handle these scenarios. On the back-end, we need an efficient database that can store all the tweets and support a huge number of reads. We will also need a distributed file storage system for storing photos and videos. 49 | - ![](../images/high_level_design.png) 50 | 51 | ## Step 6: Detailed design 52 | - Dig deeper into two or three major components; the interviewer’s feedback should always guide us to what parts of the system need further discussion. We should present different approaches, their pros and cons, and explain why we prefer one approach over the other. Remember, there is no single answer; the only important thing is to consider tradeoffs between different options while keeping system constraints in mind. 53 | - Since we will be storing a massive amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issues could that cause? 54 | - How will we handle hot users who tweet a lot or follow lots of people? 55 | - Since users’ timelines will contain the most recent (and relevant) tweets, should we try to store our data so that it is optimized for scanning the latest tweets? 56 | - How much and at which layer should we introduce cache to speed things up?
57 | - What components need better load balancing? 58 | 59 | ## Step 7: Identifying and resolving bottlenecks 60 | - Try to discuss as many bottlenecks as possible and different approaches to mitigate them. 61 | - Is there any single point of failure in our system? What are we doing to mitigate it? 62 | - Do we have enough replicas of the data so that we can still serve our users if we lose a few servers? 63 | - Similarly, do we have enough copies of different services running such that a few failures will not cause a total system shutdown? 64 | - How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades? -------------------------------------------------------------------------------- /complete_system_design/system_design_problems/url_shortening.md: -------------------------------------------------------------------------------- 1 | Designing a URL Shortening service like TinyURL 2 | ===== 3 | 4 | # text 5 | ## Why do we need URL shortening? 6 | - URL shortening is used to create shorter aliases for long URLs. We call these shortened aliases “short links.” Users are redirected to the original URL when they hit these short links. Short links save a lot of space when displayed, printed, messaged, or tweeted. Additionally, users are less likely to mistype shorter URLs. 7 | - For example, if we shorten this page through TinyURL: 8 | - ``https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904/`` 9 | - we would get 10 | - ``http://tinyurl.com/jlg8zpc`` 11 | - The shortened URL is nearly one-third the size of the actual URL. 12 | - URL shortening is used to optimize links across devices, track individual links to analyze audiences, measure ad campaigns’ performance, or hide affiliated original URLs. 13 | - If you haven’t used tinyurl.com before, please try creating a new shortened URL and spend some time going through the various options their service offers.
This will help you a lot in understanding this chapter. 14 | 15 | ## Requirements and Goals of the System 16 | - Our URL shortening system should meet the following requirements: 17 | - **Functional Requirements** 18 | - Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications. 19 | - When users access a short link, our service should redirect them to the original link. 20 | - Users should optionally be able to pick a custom short link for their URL. 21 | - Links will expire after a standard default timespan. Users should be able to specify the expiration time. 22 | - **Non-Functional Requirements** 23 | - The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing. 24 | - URL redirection should happen in real-time with minimal latency. 25 | - Shortened links should not be guessable (not predictable). 26 | - **Extended Requirements** 27 | - Analytics; e.g., how many times did a redirection happen? 28 | - Our service should also be accessible through REST APIs by other services. 29 | 30 | ## Capacity Estimation and Constraints 31 | - Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between reads and writes. 32 | - **Traffic estimates** 33 | - Assuming we will have 500M new URL shortenings per month, with a 100:1 read/write ratio, we can expect 50B redirections during the same period: 34 | - 100 * 500M => 50B 35 | - What would be Queries Per Second (QPS) for our system?
New URL shortenings per second: 36 | - 500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s 37 | - Considering the 100:1 read/write ratio, URL redirections per second will be: 38 | - 100 * 200 URLs/s = 20K/s 39 | - **Storage estimates** 40 | - Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion: 41 | - 500 million * 5 years * 12 months = 30 billion 42 | - Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate–we will dig into it later). We will need 15TB of total storage: 43 | - 30 billion * 500 bytes = 15 TB 44 | - **Bandwidth estimates** 45 | - For write requests, since we expect 200 new URLs every second, total incoming data for our service will be 100KB per second: 46 | - ``200 * 500 bytes = 100 KB/s`` 47 | - For read requests, since every second we expect ~20K URL redirections, total outgoing data for our service would be 10MB per second: 48 | - ``20K * 500 bytes = ~10 MB/s`` 49 | - **Memory estimates** 50 | - If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. 51 | - Since we have 20K requests per second, we will be getting 1.7 billion requests per day: 52 | - ``20K * 3600 seconds * 24 hours = ~1.7 billion`` 53 | - To cache 20% of these requests, we will need 170GB of memory: 54 | - ``0.2 * 1.7 billion * 500 bytes = ~170GB`` 55 | - One thing to note here is that since there will be many duplicate requests (of the same URL), our actual memory usage will be less than 170GB.
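The estimates above are easy to sanity-check with a few lines of arithmetic; here is a quick sketch (variable names are mine, all input figures come from the text):

```python
# Sanity check of the capacity estimates above; all inputs come from the text.
new_urls_per_month = 500_000_000          # 500M new URL shortenings per month
read_write_ratio = 100                    # 100:1 reads to writes
object_size_bytes = 500                   # ballpark size of one stored object

seconds_per_month = 30 * 24 * 3600
writes_per_sec = new_urls_per_month / seconds_per_month        # ~200 URLs/s
reads_per_sec = read_write_ratio * 200                         # 20K redirections/s

stored_objects = new_urls_per_month * 12 * 5                   # 5 years -> 30 billion
storage_tb = stored_objects * object_size_bytes / 10**12       # 15 TB

requests_per_day = 20_000 * 3600 * 24                          # ~1.7 billion per day
cache_gb = 0.2 * requests_per_day * object_size_bytes / 10**9  # ~170 GB (80-20 rule)

print(round(writes_per_sec), reads_per_sec, storage_tb, round(cache_gb))
```

Note that the exact write rate comes out to ~193/s; the text rounds it up to ~200/s before deriving the 20K/s read rate, which is why the cache figure lands slightly above 170GB.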
56 | - **High-level estimates** 57 | - Assuming 500 million new URLs per month and a 100:1 read:write ratio, the following is a summary of the high-level estimates for our service:
- New URL shortenings: ~200/s
- URL redirections: ~20K/s
- Incoming data: ~100KB/s
- Outgoing data: ~10MB/s
- Storage for 5 years: 15TB
- Memory for cache: 170GB
58 | 59 | ## System APIs 60 | - We can have SOAP or REST APIs to expose the functionality of our service. The following could be the definitions of the APIs for creating and deleting URLs: 61 | 62 | - ``createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None)`` 63 | - **Parameters** 64 | - api_dev_key (string) 65 | - The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota. 66 | - original_url (string) 67 | - Original URL to be shortened. 68 | - custom_alias (string) 69 | - Optional custom key for the URL. 70 | - user_name (string) 71 | - Optional user name to be used in the encoding. 72 | - expire_date (string) 73 | - Optional expiration date for the shortened URL. 74 | - **Returns**: (string) 75 | - A successful insertion returns the shortened URL; otherwise, it returns an error code. 76 | - ``deleteURL(api_dev_key, url_key)`` 77 | - Where “url_key” is a string representing the shortened URL to be deleted; a successful deletion returns ‘URL Removed’. 78 | - How do we detect and prevent abuse? 79 | - A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key). 80 | 81 | ## Database Design 82 | - A few observations about the nature of the data we will store: 83 | - We need to store billions of records. 84 | - Each object we store is small (less than 1K). 85 | - There are no relationships between records—other than storing which user created a URL. 86 | - Our service is read-heavy.
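As a concrete way to picture these records, the two tables described below (one for URL mappings, one for users) might hold rows like the following minimal sketch; the field names and types are assumptions pieced together from the stated requirements, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ShortURL:
    """One row of the URL-mapping table (field names are assumptions)."""
    url_hash: str            # the short key; up to 16 chars for custom aliases
    original_url: str
    creation_date: datetime
    expiration_date: datetime
    user_id: int             # which user created the short link

@dataclass
class User:
    """One row of the user table (field names are assumptions)."""
    user_id: int
    name: str
    email: str
    creation_date: datetime
    last_login: datetime
```

The only cross-table reference is `ShortURL.user_id`, which matches the observation that there are no relationships between records beyond recording who created a URL.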
87 | - Database Schema 88 | - We would need two tables: one for storing information about the URL mappings and one for the data of the user who created the short link. 89 | - ![database schema](../images/database_schema.png) 90 | - What kind of database should we use? 91 | - Since we anticipate storing billions of rows, and we don’t need relationships between objects, a NoSQL store like DynamoDB, Cassandra, or Riak is a better choice. A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details. 92 | 93 | ## Basic System Design and Algorithm 94 | - The problem we are solving here is how to generate a short and unique key for a given URL. 95 | - In the TinyURL example in Section 1, the shortened URL is “http://tinyurl.com/jlg8zpc”. The last seven characters of this URL are the short key we want to generate. We’ll explore two solutions here: 96 | 97 | ### Encoding actual URL 98 | - We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for display. This encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]), and if we add ‘+’ and ‘/’ we can use Base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8, or 10 characters? 99 | - Using base64 encoding, a six-letter key would result in 64^6 = ~68.7 billion possible strings. 100 | - Using base64 encoding, an eight-letter key would result in 64^8 = ~281 trillion possible strings. 101 | - With 68.7B unique strings, let’s assume six-letter keys would suffice for our system. 102 | - If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value). Now we only have space for 8 characters per short key; how will we choose our key then? We can take the first 6 (or 8) letters for the key.
This could result in key duplication; to resolve that, we can choose some other characters out of the encoding string or swap some characters. 103 | 104 | - What are the different issues with our solution? 105 | - We have the following couple of problems with our encoding scheme: 106 | - If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable. 107 | - What if parts of the URL are URL-encoded? e.g., http://www.educative.io/distributed.php?id=design, and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except for the URL encoding. 108 | 109 | - Workaround for the issues 110 | - We can append an increasing sequence number to each input URL to make it unique and then generate its hash. We don’t need to store this sequence number in the databases, though. Possible problems with this approach could be an ever-increasing sequence number. Can it overflow? Appending an increasing sequence number will also impact the performance of the service. 111 | - Another solution could be to append the user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one. 112 | - ![1/9](../images/url1.png) 113 | - ![2/9](../images/url2.png) 114 | - ![3/9](../images/url3.png) 115 | - ![4/9](../images/url4.png) 116 | - ![5/9](../images/url5.png) 117 | - ![6/9](../images/url6.png) 118 | - ![7/9](../images/url7.png) 119 | - ![8/9](../images/url8.png) 120 | - ![9/9](../images/url9.png) 121 | 122 | ### Generating keys offline 123 | - We can have a standalone **Key Generation Service (KGS)** that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will take one of the already-generated keys and use it. This approach will make things quite simple and fast. 
Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique. 124 | 125 | - Can concurrency cause problems? 126 | - As soon as a key is used, it should be marked in the database to ensure that it is not used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem? 127 | 128 | - Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory to quickly provide them whenever a server needs them. 129 | 130 | - For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys–which could be acceptable, given the huge number of keys we have. 131 | 132 | - KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server. 133 | 134 | - What would be the key-DB size? 135 | - With base64 encoding, we can generate 68.7B unique six-letter keys. If we need one byte to store one alpha-numeric character, we can store all these keys in: 136 | - 6 (characters per key) * 68.7B (unique keys) = 412 GB. 137 | 138 | - Isn’t KGS a single point of failure? 139 | - Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys. 140 | 141 | - Can each app server cache some keys from key-DB?
142 | - Yes, this can surely speed things up, although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This can be acceptable since we have 68B unique six-letter keys. 143 | 144 | - How would we perform a key lookup? 145 | - We can look up the key in our database to get the full URL. If it’s present in the DB, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If that key is not present in our system, issue an “HTTP 404 Not Found” status or redirect the user back to the homepage. 146 | 147 | - Should we impose size limits on custom aliases? 148 | - Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema). 149 | - ![High level system design for URL shortening](../images/high_level_url_shortening.png) 150 | 151 | ## Data Partitioning and Replication 152 | - To scale out our DB, we need to partition it so that it can store information about billions of URLs. We need to develop a partitioning scheme that would divide and store our data on different DB servers. 153 | 154 | - Range Based Partitioning 155 | - We can store URLs in separate partitions based on the hash key’s first letter. Hence we save all the URLs starting with the letter ‘A’ (and ‘a’) in one partition, save those that start with the letter ‘B’ in another partition, and so on. This approach is called range-based partitioning. We can even combine certain less frequently occurring letters into one database partition. We should come up with a static partitioning scheme so that we can always store/find a URL in a predictable manner.
156 | - The main problem with this approach is that it can lead to unbalanced DB servers. For example, we decide to put all URLs starting with the letter ‘E’ into a DB partition, but later we realize that we have too many URLs that start with the letter ‘E.’ 157 | 158 | - Hash-Based Partitioning 159 | - In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object. 160 | - Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any ‘key’ to a number between [1…256]). This number would represent the partition in which we store our object. 161 | - This approach can still lead to overloaded partitions, which can be solved using Consistent Hashing. 162 | 163 | ## Cache 164 | - We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcached, which can store full URLs with their respective hashes. Before hitting backend storage, the application servers can quickly check if the cache has the desired URL. 165 | 166 | - How much cache memory should we have? 167 | - We can start with 20% of daily traffic and, based on clients’ usage patterns, we can adjust how many cache servers we need. As estimated above, we need 170GB memory to cache 20% of daily traffic. Since a modern-day server can have 256GB memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs. 168 | 169 | - Which cache eviction policy would best fit our needs? 170 | - When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. 
We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of the URLs that have been accessed recently. 171 | 172 | - To further increase the efficiency, we can replicate our caching servers to distribute the load between them. 173 | 174 | - How can each cache replica be updated? 175 | - Whenever there is a cache miss, our servers would be hitting a backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it. 176 | 177 | - ![1/11](../images/request_flow1.png) 178 | - ![2/11](../images/request_flow2.png) 179 | - ![3/11](../images/request_flow3.png) 180 | - ![4/11](../images/request_flow4.png) 181 | - ![5/11](../images/request_flow5.png) 182 | - ![6/11](../images/request_flow6.png) 183 | - ![7/11](../images/request_flow7.png) 184 | - ![8/11](../images/request_flow8.png) 185 | - ![9/11](../images/request_flow9.png) 186 | - ![10/11](../images/request_flow10.png) 187 | - ![11/11](../images/request_flow11.png) 188 | 189 | ## Load Balancer (LB) 190 | - We can add a Load balancing layer at three places in our system: 191 | - Between Clients and Application servers 192 | - Between Application Servers and database servers 193 | - Between Application Servers and Cache servers 194 | - Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, LB will take it out of the rotation and will stop sending any traffic to it. 195 | - A problem with Round Robin LB is that we don’t take the server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. 
To handle this, we can place a more intelligent LB solution that periodically queries the backend servers about their load and adjusts traffic based on that. 196 | 197 | ## Purging or DB cleanup 198 | - Should entries stick around forever, or should they be purged? If a user-specified expiration time is reached, what should happen to the link? 199 | - If we chose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will ensure that only expired links are deleted; some expired links can live longer, but they will never be returned to users. 200 | - Whenever a user tries to access an expired link, we can delete the link and return an error to the user. 201 | - A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low. 202 | - We can have a default expiration time for each link (e.g., two years). 203 | - After removing an expired link, we can put the key back in the key-DB to be reused. 204 | - Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever. 205 | - ![Detailed component design for URL shortening](../images/detailed_component.png) 206 | 207 | ## Telemetry 208 | - How many times has a short URL been used, and what were the user locations? How would we store these statistics? If it is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests? 209 | - Some statistics worth tracking: country of the visitor, date and time of access, web page that referred the click, and the browser or platform from which the page was accessed.
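The lazy-cleanup idea above (delete an expired link only when someone tries to access it) can be sketched in a few lines; the in-memory dict stands in for the URL database, and all names here are hypothetical:

```python
import time

store = {}  # url_key -> (original_url, expire_at); stands in for the URL DB

def put(url_key, original_url, ttl_seconds, now=None):
    """Store a mapping together with its expiration time."""
    now = time.time() if now is None else now
    store[url_key] = (original_url, now + ttl_seconds)

def lookup(url_key, now=None):
    """Return the original URL, or None (treated as an HTTP 404 / error page)
    if the key is missing or expired. Expired entries are deleted on access."""
    now = time.time() if now is None else now
    entry = store.get(url_key)
    if entry is None:
        return None
    original_url, expire_at = entry
    if now >= expire_at:
        del store[url_key]   # lazy cleanup: purge only when the link is hit
        return None
    return original_url
```

A periodic Cleanup service would sweep the same `store` for entries whose `expire_at` has passed; the lookup path above only guarantees that an expired link is never returned to a user.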
210 | 211 | ## Security and Permissions 212 | - Can users create private URLs or allow a particular set of users to access a URL? 213 | - We can store the permission level (public/private) with each URL in the database. We can also create a separate table to store the UserIDs that have permission to see a specific URL. If a user does not have permission and tries to access a URL, we can send an error (HTTP 401) back. Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the ‘Hash’ (or the KGS-generated ‘key’). The columns will store the UserIDs of those users that have permission to see the URL. 214 | 215 | -------------------------------------------------------------------------------- /distributed_system/review.md: -------------------------------------------------------------------------------- 1 | https://www.wisdomjobs.com/e-university/distributed-computing-interview-questions.html 2 | 3 | Question 1. Define Distributed System? 4 | Answer : 5 | A distributed system is a collection of independent computers that appears to its users as a single coherent system. A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. 6 | Question 2. List The Characteristics Of Distributed System? 7 | Answer : 8 | Programs are executed concurrently 9 | There is no global time 10 | Components can fail independently (isolation, crash) 11 | Question 3. Mention The Examples Of Distributed System?
12 | Answer : 13 | The Internet 14 | Intranets 15 | Mobile and ubiquitous computing 16 | -------------------------------------------------------------------------------- /probability/002_Xinfeng_Zhou_A_Practical_Guide_To_Quant.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/002_Xinfeng_Zhou_A_Practical_Guide_To_Quant.docx -------------------------------------------------------------------------------- /probability/4710_review.md: -------------------------------------------------------------------------------- 1 | Introduction To Probability 2 | ==== 3 | 4 | 5 | # Experiments with random outcomes 6 | ## Sample space & probabilities 7 | - sample space \Omega 8 | - set of all the possible outcomes of the experiment 9 | - sample points \omega 10 | - Elements of \Omega 11 | - events 12 | - subsets of \Omega 13 | - \F 14 | - collection of events in \Omega 15 | - probability measure / probability distribution P 16 | - func from \F to \R 17 | - P(A) 18 | - prob of event A 19 | - Kolmogorov's axioms 20 | - 0 <= P(A) <= 1, \any A 21 | - P(\Omega) = 1, P(\empty) = 0 22 | - if A_1, A_2, A_3, ... pairwise disjoint events 23 | - P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i) 24 | - P(A_1 \cup A_2 \cup ... \cup A_n) = P(A_1) + P(A_2) + ... + P(A_n) 25 | - probability space (\Omega, \F, P) 26 | - mutually exclusive 27 | - A_i \cap A_j = \empty 28 | - cartesian product spaces 29 | - A_1 x A_2 x ...
x A_n = {(x_1, x_2, ..., x_n) | x_i \in A_i, i \in [1,n]} 30 | - set of ordered n-tuples with the i-th element from A_i 31 | 32 | ## random sampling 33 | - sampling with & without replacement 34 | - ordered & unordered sample 35 | 36 | ## consequence of the rules of probability 37 | - P(A) + P(A^c) = 1 38 | - monotonicity of probability 39 | - if A \subset B then P(A) <= P(B) 40 | - inclusion - exclusion 41 | - P(A \cup B) = P(A) + P(B) - P(A \cap B) 42 | - P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) 43 | - n-person hat problem: n persons have their hats mixed up; what is the probability that no one gets his/her own hat? How does this probability behave as n -> \inf? 44 | - define event A_i = {person i gets his/her hat} 45 | - P(\bigcap_{i=1}^{n}A_i^c) = 1 - P(\bigcup_{i=1}^n A_i) 46 | - P(A_i1 \cap A_i2 \cap ... \cap A_ik) = P(i_1, ..., i_k get their own hats) = \frac{(n-k)!}{n!} (given k hats assigned correctly, number of ways the remaining (n-k) hats can be assigned to the rest of the guests) 47 | - \sum_k P(A_i1 \cap A_i2 \cap ... \cap A_ik) = \binom{n}{k} \frac{(n-k)!}{n!} = \frac{1}{k!} 48 | - P(\bigcup_{i=1}^n A_i) = 1 - 1/2! + 1/3! + ... + (-1)^{n+1} 1/n! 49 | - P(\bigcap_{i=1}^{n}A_i^c) = 1/2! - 1/3! + ... + (-1)^n 1/n! = \sum_{k=0}^n (-1)^k/k! 50 | - if n -> \inf, then P(\bigcap_{i=1}^{n}A_i^c) -> e^{-1} 51 | 52 | ## random variables 53 | - random variable X is a function from \Omega into the real numbers 54 | - X is degenerate if P(X = b) = 1 55 | - probability distribution of X is P{X \in B} for sets B of real numbers 56 | - X is a discrete random variable if there exists a finite or countably infinite set {k_1, k_2, ...} of real numbers such that \sum_i P(X = k_i) = 1 57 | - probability mass function p.m.f.
of a discrete random variable is p_X(k) = p(k) = Pr(X = k) for all possible values k of X 58 | 59 | # Conditional probability and independence 60 | ## conditional probability 61 | - The conditional probability of A given B is P(A | B) = P(AB) / P(B) 62 | - Multiplication rule for n events 63 | - P(A_1A_2...A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1A_2)...P(A_n|A_1A_2...A_{n-1}) 64 | - a finite collection of events {B_1, ..., B_n} is a partition of \Omega if B_iB_j = \empty whenever i != j and \bigcup_{i=1}^n B_i = \Omega 65 | 66 | ## bayes' formula 67 | - P(B | A) = P(AB) / P(A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)} 68 | - general version of bayes' formula 69 | - P(B_k|A) = P(AB_k)/P(A) = \frac{P(A|B_k)P(B_k)}{\sum_{i=1}^n P(A|B_i)P(B_i)} 70 | 71 | ## independence 72 | - A is independent of B if P(A|B) = P(A), equivalently P(AB) = P(A)P(B) 73 | - if A, B independent, the same is true for A^c and B^c, A^c and B, A and B^c 74 | - X_1, ..., X_n are random variables on the same probability space; they are independent if P(X_1 \in B_1, X_2 \in B_2, ..., X_n \in B_n) = \prod_{k=1}^n P(X_k \in B_k) 75 | 76 | ## independent trials 77 | - **Bernoulli distribution** 78 | - records the result of a single trial with 2 possible outcomes 79 | - 0 <= p <= 1, X ~ Ber(p) with success probability p if X \in {0, 1} and P(X = 1) = p and P(X = 0) = 1-p 80 | - e.g. a sequence of n independent trials 81 | - Pr(X_1 = 0, X_2 = 1, X_3 = X_4 = 0) = p(1-p)^3 82 | - E[X] = p 83 | - Var(X) = p(1-p) 84 | - **Binomial distribution** 85 | - X \sim Bin(n, p) 86 | - Let X be the number of successes in n indep trials, with success probability p; X_i denotes the outcome of trial i 87 | - X = X_1 + X_2 + ...
+ X_n 88 | - Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k} 89 | - E[X] = np 90 | - Var(X) = np(1-p) 91 | - **geometric distribution** 92 | - X \sim Geom(p) 93 | - infinite sequence of indep trials 94 | - X is the number of trials needed to see the first success 95 | - P(X = k) = P(X_1 = 0, X_2 = 0, ..., X_{k-1} = 0, X_k = 1) = (1-p)^{k-1}p 96 | - E[X] = 1/p 97 | - Var(X) = (1-p)/p^2 98 | ## Further topics 99 | - conditional independence 100 | - P(A_i1 A_i2 ... A_ik | B) = P(A_i1 | B) P(A_i2 | B) ... P(A_ik | B) 101 | - e.g. Suppose 9/10 coins are fair, 1/10 coins are biased with tail probability 3/5 102 | - A_1 = first flip yields tail, A_2 = second flip yields tail 103 | - successive flips of **a given coin** are independent 104 | - P(A_1|F) = P(A_2|F) = 1/2, P(A_1|B) = P(A_2|B) = 3/5 105 | - P(A_1A_2|F) = P(A_1|F)P(A_2|F), P(A_1A_2|B) = P(A_1|B)P(A_2|B) 106 | - P(A_1A_2) = P(A_1A_2|F)P(F) + P(A_1A_2|B)P(B) 107 | - hypergeometric distribution 108 | - X \sim Hypergeom(N, N_A, n) 109 | - The result of each draw (the elements of the population being sampled) can be classified into one of two mutually exclusive categories 110 | - The probability of a success changes on each draw, as each draw decreases the population 111 | - X takes values in the set [0, n] 112 | - k is the number of successes / type A items 113 | - P(X = k) = \frac{\binom{N_A}{k} \binom{N - N_A}{n-k}}{\binom{N}{n}} 114 | - sample n items without replacement: choose k items from the N_A type A items, and n-k from the N-N_A type B items 115 | - the birthday problem 116 | - How large should a randomly selected group of people be to guarantee that with probability at least 1/2 there are two people with the same birthday? 117 | - Take a random sample of size k 118 | - p_k = Pr(there is a repetition in the sample); how large should k be to have p_k > 1/2? 119 | - A_k = the first k picks are all distinct 120 | - p(A_k) = \frac{365 * 364 * ...
* (365 - (k-1))}{365^k} 121 | - p_k = 1 - p(A_k) 122 | 123 | # random variables 124 | ## probability distribution of random variables 125 | - discrete: Bernoulli, binomial, geometric 126 | - probability density function p.d.f 127 | - P(X <= b) = \int_{-\inf}^{b} f(x) dx 128 | - if a random variable X has density function f then point values have probability zero 129 | - P(X = c) = \int_c^c f(x) dx = 0 \any c 130 | - f(x) \geq 0 for \any x \in \R 131 | - \int_{-\inf}^{\inf} f(x) dx = 1 132 | - **uniform distribution** 133 | - X \sim Unif[a, b] 134 | - f(x) = 1/(b-a) if x \in [a,b] 135 | 0 otherwise 136 | - P(c <= X <= d) = \int_c^d 1/(b-a) dx 137 | - the value f(x) of a density function is not a probability, but it gives probabilities of sets by integration 138 | - P(a < X < a + \epsilon) \simeq f(a) * \epsilon 139 | - E[X] = (a+b)/2 140 | - Var(X) = (b-a)^2/12 141 | ## cumulative distribution function c.d.f 142 | - F(s) = P(X <= s) \any s \in \R 143 | - P(a < X <= b) = P(X <= b) - P(X <= a) = F(b) - F(a) 144 | - For a discrete random variable 145 | - F(s) = P(X <= s) = \sum_{k:k <= s} P(X = k) 146 | - For a continuous random variable 147 | - F(s) = P(X <= s) = \int_{-\inf}^s f(x) dx 148 | - find pmf/pdf from cdf 149 | - if F is piecewise constant, then X is discrete. Possible values of X are where F has jumps. P(X = x) = magnitude of the jump of F at x 150 | - if F is continuous and F'(x) exists everywhere, except possibly at finitely many points, then X is continuous, f(x) = F'(x).
If F is not differentiable at x, then f(x) can be set arbitrarily 151 | - property of cdf 152 | - monotonicity: if s < t then F(s) <= F(t) 153 | - right continuity: for each t \in \R, F(t) = lim_{s -> t^+} F(s) 154 | - lim_{t -> -\inf} F(t) = 0, lim_{t -> \inf} F(t) = 1 155 | - P(X < a) = lim_{s -> a^-} F(s) 156 | 157 | ## Expectation 158 | - expectation / first moment of a discrete variable: u = E[X] = \sum_{k} kP(X = k) 159 | - expectation of a continuous random variable: E[X] = \int_{-\inf}^{\inf} xf(x) dx 160 | - St. Petersburg paradox: flip a coin, if head, win 2 dollars and game is over; if tail, prize is doubled and flip again. 161 | - Let Y denote the prize 162 | - P(Y = 2^n) = 2^{-n} 163 | - E[Y] = \sum_{n=1}^{\inf} 2^n 2^{-n} = \sum 1 = \inf 164 | - undefined expectation 165 | - you and I flip a fair coin until we see the first head 166 | - let n denote the number of flips needed; if n odd, you pay me 2^n, otherwise I pay you 2^n 167 | - P(X = 2^n) = 2^{-n}, for odd n >= 1 168 | - P(X = -2^n) = 2^{-n} for even n >= 1 169 | - E[X] = 2^1 * 2^{-1} + (-2^2) * 2^{-2} + ... = 1 - 1 + 1 - 1 ... 170 | - the expectation does not exist 171 | - expectation of a function of a random variable 172 | - discrete: E[g(X)] = \sum_k g(k) P(X = k) 173 | - continuous: E[g(X)] = \int_{-\inf}^{\inf} g(x) f(x) dx 174 | - a stick of length l is broken at a uniformly chosen random location. What is the expected length of the longer piece?
175 | - g(x) = l-x if 0 <= x <= l/2 176 | x if l/2 < x <= l 177 | - E[g(X)] = \int_0^l g(x) f(x) dx 178 | = \int_0^{l/2} (l-x)/l dx + 179 | \int_{l/2}^{l} x/l dx 180 | = 3l/4 181 | - the n-th moment of X is E[X^n] = \sum_k k^n P(X = k) 182 | - median / 0.5-th quantile of X is any m that satisfies P(X >= m) >= 1/2, P(X <= m) >= 1/2 183 | - first quartile: p = 0.25, third quartile: p = 0.75 184 | - p-th quantile is any x satisfying P(X <= x) >= p, P(X >= x) >= 1-p 185 | - median of X 186 | - find m with P(X <= m) = 1/2 187 | ## variance 188 | - Var(X) = E[(X - u)^2] = \sigma^2 189 | = E[X^2] - (E[X])^2 190 | - standard deviation SD(X) = \sigma 191 | - discrete: Var(X) = \sum_k (k-u)^2 P(X = k) 192 | - continuous: Var(X) = \int_{-\inf}^{\inf} (x-u)^2 f(x)dx 193 | - for an indicator random variable, Var[I_A] = P(A) P(A^c) 194 | - E(aX+b) = aE[X] + b 195 | - Var(aX+b) = a^2Var(X) 196 | ![Properties of Random Variables](images/properties_of_random_variables.png) 197 | 198 | ## Gaussian distribution 199 | - Z \sim N(0, 1), a random variable Z has standard normal distribution / standard Gaussian distribution if Z has density function \phi(x) = 1/\sqrt(2\pi) e^{-x^2/2} 200 | - bell shaped curve 201 | - c.d.f \Phi(x) = 1/\sqrt(2 \pi) \int_{-\inf}^{x} e^{-s^2/2} ds 202 | - X \sim N(u, \sigma^2) iff. f(x) = 1/\sqrt(2\pi \sigma^2)e^{-(x-u)^2/2\sigma^2} 203 | - if X \sim N(u, \sigma^2), then Z = (X - u)/\sigma \sim N(0, 1) 204 | - if 1 <= k < l are integers and E[X^l] is finite, then E[X^k] is also finite 205 | 206 | # Approximations of the binomial distribution 207 | ## normal approximation 208 | - central limit theorem 209 | - e.g.
pmf of Bin(n, p) distribution can be close to the bell curve of the normal distribution 210 | - law of rare events 211 | - When p is small, Bin(n, p) is close to Poisson(np) 212 | - CLT for binomial 213 | - the pmf of S_n \sim Bin(n, p) approximates the density function of X \sim N(np, np(1-p)) as n becomes large 214 | - let p be fixed, then lim_{n -> \inf} P(a <= (S_n - np)/\sqrt(np(1-p)) <= b) = \int_a^b 1/\sqrt(2\pi) e^{-x^2/2} dx 215 | - Suppose S_n \sim Bin(n, p) with n large and p not too close to 0 or 1, or np(1-p) > 10; then P(a <= (S_n - np)/\sqrt(np(1-p)) <= b) is close to \Phi(b) - \Phi(a) 216 | - first approximate the binomial with the normal distribution, then approximate the c.d.f of the normal distribution using the table in the appendix 217 | - three sigma rule: \Phi(3) - \Phi(-3) \simeq 0.9974 218 | - continuity correction 219 | - compared to P(k_1 <= S_n <= k_2), P(k_1 - 1/2 <= S_n <= k_2 + 1/2) is a better approximation 220 | 221 | ## law of large numbers 222 | - law of large numbers for binomial random variables 223 | - \any fixed \epsilon > 0, lim_{n -> \inf} P(|S_n/n - p| < \epsilon) = 1 224 | - CLT describes the error in the law of large numbers 225 | - S_n / n = p + \sigma/\sqrt(n) * (S_n - np)/(\sigma\sqrt(n)) \simeq p + \sigma/\sqrt(n) Z 226 | - decomposes S_n/n into a sum of p and a random error 227 | - for large n this random error is approximately normal with standard deviation \sigma/\sqrt(n) 228 | 229 | ## applications of the normal approximation 230 | - want to estimate p for a biased coin 231 | - law of large numbers: flip n times, count S_n, take \hat(p) = S_n/n as the estimate for p 232 | - P(|\hat(p) - p| < \epsilon) = P(|S_n/n - p| < \epsilon) 233 | = P(-n\epsilon < S_n - np < n\epsilon) 234 | = (divide both sides by \sqrt(np(1-p))) 235 | \simeq 2\Phi(\epsilon\sqrt(n)/\sqrt(p(1-p))) - 1 236 | >= 2\Phi(2\epsilon\sqrt(n)) - 1 237 | - use it to solve problems like 'how many times should we flip a coin so that \hat(p) is within 0.05 of the true p, with probability at least 0.99?'
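The sample-size question above can be answered by inverting the bound 2\Phi(2e\sqrt(n)) - 1 numerically (it uses the worst case p(1-p) <= 1/4). A minimal sketch; `phi` and `flips_needed` are illustrative helper names, not from the notes:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal c.d.f. Phi(x), via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def flips_needed(eps, conf):
    """Smallest n with 2*Phi(2*eps*sqrt(n)) - 1 >= conf, i.e. enough flips
    to have |p_hat - p| < eps with probability >= conf for any true p,
    since p(1-p) <= 1/4."""
    n = 1
    while 2.0 * phi(2.0 * eps * sqrt(n)) - 1.0 < conf:
        n += 1
    return n
```

For eps = 0.05 and confidence 0.99 this should land near the closed-form answer (z_{0.995}/0.1)^2 = (2.576/0.1)^2, i.e. in the mid-600s.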
238 | - confidence intervals 239 | - (\hat(p) - \epsilon, \hat(p) + \epsilon) contains the true p with probability at least r; 100r% is the confidence level 240 | - e.g. 'find the 95% confidence interval' 241 | - use 2\Phi(2\epsilon\sqrt(n)) - 1 > 0.95 to solve for \epsilon 242 | - maximum likelihood estimator 243 | - \hat(p) = S_n / n 244 | - once S_n = k has been observed, can use the pmf of S_n to compare how likely outcome k is under different values of p 245 | - polling 246 | - actually sampling without replacement - hypergeometric 247 | - but sampling with replacement leads to indep trials and a binomial distribution for the number of successes 248 | - if the sample size n is small compared to the population, then even if sampling with replacement, meeting the same person twice has low chance 249 | - could use Bin(n,p) for polling 250 | - Hypergeom(N, N_A, n) converges to Bin(n, p) as N -> \inf and N_A/N -> p 251 | - random walk 252 | - let X_1, X_2, X_3, ... be indep random variables s.t. P(X_j = 1) = p, P(X_j = -1) = 1-p 253 | - S_0 = 0 254 | - S_n = X_1 + X_2 + ... + X_n 255 | - X_j is the j-th step, S_n is the position after n steps 256 | - the random sequence S_0, S_1, S_2, ... is a simple random walk 257 | - if p = 1/2, then S_n is a symmetric simple random walk, otherwise asymmetric 258 | - T_n = number of times the coin came up heads 259 | - S_n = T_n - (n - T_n) = 2T_n - n 260 | - T_n \sim Bin(n, p) 261 | - E[S_n] = n(2p-1) 262 | - Var[S_n] = 4np(1-p) 263 | 264 | ## Poisson approximation (discrete) 265 | - X \sim Poisson(\lambda) 266 | - \lambda > 0, X has Poisson distribution if X is a nonnegative integer with 267 | - P(X = k) = e^{-\lambda} \lambda^k/k!, k \in \N 268 | - E[X] = \lambda 269 | - Var[X] = \lambda 270 | - law of rare events 271 | - if successes are rare in a sequence of indep trials, then the number of successes is approximated by Poisson 272 | - Let S_n \sim Bin(n, \lambda/n), \lambda/n < 1, then 273 | - lim_{n -> \inf} P(S_n = k) = e^{-\lambda} \lambda^k/k!
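The law of rare events above can be checked numerically; a minimal sketch comparing the Bin(n, \lambda/n) pmf with its Poisson(\lambda) limit (the function names are illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    """P(S_n = k) for S_n ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# With lam = 3 and n = 1000, the error bound in the next bullet
# guarantees every pointwise difference is at most n*p^2 = lam^2/n = 0.009.
```

For example, binom_pmf(1000, 0.003, 2) and poisson_pmf(3, 2) already agree to a few decimal places, in line with the |P(X in A) - P(Y in A)| <= np^2 bound.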
274 | - Let X \sim Bin(n, p) and Y \sim Poisson(np); then for \any subset A \subset {0,1,2,...}, we have |P(X \in A) - P(Y \in A)| <= np^2 275 | - Poisson approximation for counting rare events 276 | - X = num of rare events that are not strongly dependent on each other 277 | - then X is approximately Poisson(\lambda) 278 | - P(X = k) \simeq e^{-\lambda} \lambda^k/k! 279 | - normal and poisson approximation of the binomial 280 | - when np(1-p) > 10 -> use normal 281 | - when np^2 small -> use poisson 282 | 283 | ## Exponential distribution 284 | - X \sim Exp(\lambda) 285 | - \lambda > 0, X has exponential distribution with rate \lambda if X has density function f(x) = \lambda e^{-\lambda x} for x >= 0, and 0 for x < 0 286 | - c.d.f 287 | - F(t) = \int_0^t \lambda e^{-\lambda x} dx = 1 - e^{-\lambda t}, t >= 0 288 | - P(X > t) = 1 - P(X <= t) = e^{-\lambda t} 289 | - E[X] = 1/\lambda 290 | - Var[X] = 1/\lambda^2 291 | - memoryless property 292 | - for any s, t > 0 293 | - P(X > t+s | X > t) = P(X > s) 294 | - e.g. the lifetime of some machine can be modeled by Exp(\lambda) 295 | - regardless of how long the machine has been in operation, the distribution of the remaining time is the same as that of the original lifetime 296 | - behaves as if it were brand new 297 | - no other distribution with a continuous p.d.f on [0, \inf) satisfies the memoryless property 298 | - approximation 299 | - model the time when the first customer arrives on a discrete time scale 300 | - the probability that at least one customer arrives in a time interval of length 1/n is \lambda/n, for large n 301 | - for k = 1,2,3...
if the first customer arrives during [(k-1)/n, k/n], set T_n = k/n 302 | - P(T_n = k/n) = (1-\lambda/n)^{k-1} \lambda/n --> nT_n \sim Geom(\lambda/n) 303 | - \lim_{n -> \inf} P(T_n > t) = e^{-\lambda t}, t >= 0 304 | 305 | # joint distribution of random variables 306 | ## joint distribution of discrete random variables 307 | - X_1, X_2, ..., X_n are discrete random variables 308 | - joint probability mass function 309 | - p(k_1, k_2, ..., k_n) = P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n) 310 | - let p(k_1, ..., k_n) be the joint probability mass function of (X_1, ..., X_n) 311 | - the probability mass function of X_j / marginal probability mass function of X_j 312 | - p_{X_j}(k) = \sum_{l_1, ..., l_{j-1}, l_{j+1}, ..., l_n} p(l_1, ..., l_{j-1}, k, l_{j+1}, ..., l_n) 313 | - multinomial distribution 314 | - n, r positive integers 315 | - p_1, p_2, ..., p_r positive reals 316 | - p_1 + p_2 + ... + p_r = 1 317 | - if the possible values are integer vectors (k_1, ..., k_r) such that 318 | - k_j >= 0 319 | - k_1 + ... + k_r = n 320 | - (X_1, ..., X_r) has multinomial distribution 321 | - joint probability mass function 322 | - P(X_1 = k_1, X_2 = k_2, ..., X_r = k_r) = \binom{n}{k_1, k_2, ..., k_r}p_1^{k_1} ... p_r^{k_r} 323 | - (X_1, ..., X_r) ~ Mult(n, r, p_1, ..., p_r) 324 | 325 | ## joint continuous random variables 326 | - X_1, ..., X_n are jointly continuous if there exists a 327 | - joint density function f on R^n s.t. for subsets B \subset R^n 328 | - P((X_1, ..., X_n) \in B) = \int ... \int_B f(x_1, ..., x_n) dx_1 ... dx_n 329 | - let f be the joint density function of X_1, ..., X_n. 330 | - then each random variable X_j has a density function f_{X_j} that can be obtained by integrating away the other variables from f 331 | - f_{X_j}(x) = \int ... \int (n-1 integrals) f(x_1, ..., x_{j-1}, x, x_{j+1}, ..., x_n) dx_1 ...
dx_{j-1} dx_{j+1} ... dx_n 332 | 333 | ## joint distributions and independence 334 | ### discrete 335 | - let p(k_1, ..., k_n) be the joint probability mass function of the discrete random variables X_1, ..., X_n 336 | - let p_{X_j}(k) = P(X_j = k) be the marginal probability mass function of X_j 337 | - X_1, ..., X_n are independent iff. 338 | - p(k_1, ..., k_n) = p_{X_1}(k_1) ... p_{X_n}(k_n) 339 | - for all possible values k_1, ..., k_n 340 | 341 | ### continuous 342 | - if X_1, ..., X_n have joint density function 343 | - f(x_1, x_2, ..., x_n) = f_{X_1}(x_1)f_{X_2}(x_2)...f_{X_n}(x_n) 344 | - then X_1, ..., X_n are independent 345 | - vice versa 346 | 347 | - suppose X_1, ..., X_{m+n} are independent random variables 348 | - define random variables Y = f(X_1, ..., X_m), Z = g(X_{m+1}, ..., X_{m+n}) 349 | - then Y and Z are independent random variables 350 | 351 | ## joint cumulative distribution function 352 | - discrete random variables 353 | - joint probability mass function 354 | - continuous random variables 355 | - joint probability density function 356 | - joint cumulative distribution function 357 | - F(s_1, ..., s_n) = P(X_1 <= s_1, ..., X_n <= s_n) 358 | - F(x, y) = P(X <= x, Y <= y) = \int_{-\infty}^x \int_{-\infty}^y f(s, t) dt ds 359 | - X_1, ..., X_n are independent iff. 360 | - F(x_1, x_2, ..., x_n) = \prod_{k=1}^n F_{X_k}(x_k) 361 | 362 | # Tail bounds and limit theorems 363 | ## estimating tail probabilities 364 | - Markov's inequality 365 | - let X be a nonnegative random variable 366 | - for any c > 0 367 | - P(X >= c) <= E[X]/c 368 | - Chebyshev's inequality 369 | - X has finite mean \mu and a finite variance \sigma^2 370 | - for any c > 0 371 | - P(|X - \mu| >= c) <= \sigma^2/c^2 372 | 373 | ## law of large numbers 374 | - suppose we have iid random variables X_1, X_2, ... 375 | - with finite mean E[X_1] = \mu 376 | - finite variance Var(X_1) = \sigma^2 377 | - Let S_n = X_1 + ...
+ X_n 378 | - for any fixed \epsilon > 0 we have 379 | - lim_{n -> \infty} P(|S_n/n - \mu| < \epsilon) = 1 380 | 381 | ## central limit theorem 382 | - suppose we have iid random variables X_1, X_2, ... 383 | - with finite mean E[X_1] = \mu 384 | - finite variance Var(X_1) = \sigma^2 385 | - Let S_n = X_1 + ... + X_n 386 | - for any fixed finite a and b 387 | - lim_{n -> \infty} P(a <= \frac{S_n - n\mu}{\sigma \sqrt{n}} <= b) = \Phi(b) - \Phi(a) = the integral of the standard normal density from a to b -------------------------------------------------------------------------------- /probability/Practical_Guide_To_Quant.md: -------------------------------------------------------------------------------- 1 | ## Chapter 2 Brain Teasers 2 | - starts with a simplified version 3 | ### Screwy pirates 4 | - 100 coins divided among 5 pirates 5 | - 2 pirates, a and b 6 | - b proposes: b gets 100, a gets 0 7 | - 3 pirates, a, b, c 8 | - a will support c if offered 1 coin (a gets 0 in the 2-pirate case) 9 | - a: 1, b: 0, c: 99 10 | - 4 pirates, abcd 11 | - b supports d 12 | - a: 0, b: 1, c: 0, d: 99 13 | - 5 pirates, abcde 14 | - a: 1, b: 0, c: 1, d: 0, e: 98 15 | 16 | ### Tiger and sheep 17 | - 1 tiger, 1 sheep 18 | - eat 19 | - 2 tigers, ab 20 | - not eat 21 | - 3 tigers, abc 22 | - a eats 23 | - 4 tigers, abcd 24 | - if a eats, a becomes a sheep and 3 tigers (bcd) remain 25 | - in the 3-tiger case the sheep gets eaten 26 | - so 27 | - a won't eat 28 | - 100 tigers (even), not eat 29 | 30 | ### River crossing 31 | - CD cross, D back, 3min 32 | - AB cross, C back, 2min 33 | - CD cross, 2min 34 | 35 | 36 | ### Card game 37 | - 2 cards 38 | 39 | ### Defective ball 40 | - 4 + 4 + 4 41 | 42 | ### Horse race 43 | - 5 groups, 5 races, get an ordering in each group 44 | - pick the tops from each group, 1 race; then 3 groups have potential answers, 3+2+1 candidates 45 | - race 1,2,3,6,7, then add 11 and race 46 | 47 | ## Chapter 4 Probability Theory 48 | ### Coin toss game 49 | - remove one coin from A 50 | - E1: A more coins 51 | - E2: equal coins 52 | - E3: A fewer
coins 53 | - P(E1) = P(E3) = x, P(E2) = y; then 2x + y = 1 54 | - result = x + y/2 = 0.5 55 | 56 | ### Card game 57 | - 1/13 * 48/51 + 1/13 * 44/51 + ... + 1/13 * 0/51 58 | = 1/(13*51) * (0 + 4 + 8 + ... + 48) 59 | = 1/(13*51) * (0 + 48) * 13 / 2 60 | = 24/51 61 | = 8/17 62 | 63 | 64 | ### Drunk passenger 65 | - E1: seat #1 taken before #100 66 | - E2: seat #100 taken before #1 67 | 68 | ### N points on a circle 69 | - 1/2^{N-1} chance that points 2, ..., N all lie in the semicircle starting at point 1 70 | - the same holds for each point i 71 | - N * 1/2^{N-1} 72 | 73 | ### poker hands 74 | - four-of-a-kind: 13 * 48 75 | - full house: 13 * 12 * 4 * 6 76 | - hand with two pairs: \binom(13,2) * 6 * 6 * 44 77 | 78 | ### hopping rabbit 79 | - stair(1) = 1 80 | - stair(2) = 2 81 | - stair(n) = stair(n-1) + stair(n-2) 82 | 83 | ### screwy pirates 84 | - for each random group of 5, there must be a lock that none of them has the key to, yet all other 6 pirates have the key 85 | - number of locks = \binom(11,5) 86 | - each lock has 6 keys, each pirate has \binom(11,5) * 6 / 11 keys 87 | 88 | ### chess tournament 89 | - conditional probability approach 90 | - in round 1, each player has probability 1/(2^n-1) of meeting player 1 91 | - 1 and 2 do not meet in round 1 with probability (2^n-2)/(2^n-1) 92 | - 1 and 2 do not meet in round 2 with probability (2^{n-1}-2)/(2^{n-1}-1) 93 | - multiply together, get 2^{n-1}/(2^n-1) 94 | - counting approach 95 | - 1 and 2 must be in different subgroups 96 | - 2^{n-1}/(2^n-1) 97 | 98 | ### application letters 99 | - let E_i be the event that envelope i is correct 100 | - P(E_i) = 1/5 101 | - P(E_iE_j) = 1/5 * 1/4 102 | - \sum P(E_iE_j) = 10 * 1/5 * 1/4 = 1/2 103 | - P(E_iE_jE_k) = 1/5 * 1/4 * 1/3 104 | - \sum P(E_iE_jE_k) = \binom(5, 3) * 1/5 * 1/4 * 1/3 = 1/3! 105 | - 1 - 1/2 + 1/3! - 1/4! + 1/5! 106 | 107 | ### birthday problem 108 | - Pr(nobody has the same birthday) < 1/2 109 | - 365 * 364 * ...
* (365-n+1)/365^n < 1/2 110 | 111 | ### 100th digit 112 | - binomial theorem: (x + y)^n = \sum_{k=0}^{n} \binom(n, k) x^k y^{n-k} 113 | - calculate (1-\sqrt(2))^n + (1+\sqrt(2))^n; the sum must be an integer 114 | - 0 < (1-\sqrt(2))^3000 << 10^{-100} 115 | 116 | ### cubic of integer 117 | - x = a + 10b 118 | - x^3 = (a + 10b)^3 = a^3 + 30a^2b + 300ab^2 + 1000b^3 119 | - last digit of x^3 depends on a^3, so a = 1 120 | - second-to-last digit of x^3 depends on 30a^2b = 30b; 3b must end in 1, thus b = 7 121 | - prob = 1/100 122 | 123 | ### boys and girls 124 | - part A 125 | - A = the family has at least one son 126 | - B = both are boys 127 | - {bb, bg, gb, gg} 128 | - Pr = 1/3 129 | 130 | - part B 131 | - 1/2 132 | 133 | ### all-girl world? 134 | X = # of boys before having a girl 135 | X = 0, 1, 2, ... 136 | average proportion of boys = \sum_{k=0}^{\infty} k/(k+1) * (1/2)^{k+1} 137 | 138 | ### unfair coin 139 | B: biased 140 | HS: 10 heads 141 | Pr(B|HS) = \frac{Pr(B \cap HS)}{Pr(HS)} 142 | Pr(HS) = Pr(F \cap HS) + Pr(B \cap HS) 143 | Pr(B \cap HS) = Pr(HS|B) * Pr(B) = 1 * 1/10^3 144 | 145 | Pr(B|HS) = \frac{1/10^3}{(1/2)^{10} * 999/1000 + 1/10^3} 146 | 147 | ### fair probability from an unfair coin 148 | Pr(H) = p 149 | Pr(HH) = p^2 150 | Pr(HT) = Pr(TH) = p(1-p) 151 | Pr(TT) = (1-p)^2 152 | 153 | throw it twice; if HH or TT, discard and throw again 154 | if HT, count as heads; if TH, count as tails 155 | 156 | ### dart game 157 | enumerate all possible outcomes of three throws 158 | 159 | ### birthday line 160 | assume I'm the n-th person 161 | P(n) = Pr(first n-1 persons have different birthdays) * Pr(my birthday is the same as one of theirs) 162 | = \frac{365 * 364 * ...
* (365-n+2)}{365^{n-1}} * \frac{n-1}{365} 163 | find the n such that P(n) > P(n-1) and P(n) > P(n+1) 164 | 165 | ### dice order 166 | Pr = Pr(increasing order | three different numbers) * Pr(three different numbers) 167 | = 1/6 * 5/6 * 4/6 168 | 169 | ### Monty hall problem 170 | if not switch, Pr(win) = 1/3 171 | if switch, Pr(win) = Pr(originally picked a goat) = 2/3 172 | 173 | ### Amoeba population 174 | Let P(E) be the probability that the amoeba population dies out. 175 | Let F1, F2, F3, F4 be those four individual outcomes 176 | P(E) = P(E|F1)P(F1) + P(E|F2)P(F2) + P(E|F3)P(F3) + P(E|F4)P(F4) 177 | = 1/4 + P(E)/4 + P(E)^2/4 + P(E)^3/4 178 | P(E) = \sqrt(2) - 1 179 | 180 | ### candies in a jar 181 | 182 | ### coin toss game 183 | Pr(A win) = Pr(xHT) + Pr(xxxHT) ... 184 | = P(A|H) * 1/2 + P(A|T) * 1/2 185 | P(A|T) = P(B) 186 | = 1-P(A) 187 | conditioned on B's toss 188 | P(A|H) = 1/2*0 + 1/2(1-P(A|H)) 189 | -> P(A|H) = 1/3 190 | P(A) = 4/9 191 | 192 | ### 4.4 Discrete & continuous distributions 193 | 194 | ### meeting probability 195 | Pr(|X-Y| <= 5) = shaded area in a square 196 | 197 | ### probability of a triangle 198 | pieces: x, y-x, 1-y 199 | assume x < y 200 | x + y-x > 1-y -> y > 1/2 201 | y-x + 1-y > x -> x < 1/2 202 | x + 1-y > y-x -> x + 1/2 > y -------------------------------------------------------------------------------- /probability/Xinfeng_Zhou_A_Practical_Guide_To_Quant.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/Xinfeng_Zhou_A_Practical_Guide_To_Quant.pdf -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/Chap1.md: -------------------------------------------------------------------------------- 1 | # Chapter 1 General Principles 2 | - Let us begin this book by exploring five general principles that will be extremely helpful in your
interview process. From my experience on both sides of the interview table, these general guidelines will better prepare you for job interviews and will likely make you a successful candidate. 3 | 4 | ## Build a broad knowledge base 5 | - The length and the style of quant interviews differ from firm to firm. Landing a quant job may mean enduring hours of bombardment with brain teasers, calculus, linear algebra, probability theory, statistics, derivative pricing, or programming problems. To be a successful candidate, you need to have broad knowledge in mathematics, finance and programming. 6 | 7 | - Will all these topics be relevant for your future quant job? Probably not. Each specific quant position often requires only limited knowledge in these domains. General problem solving skills may make more difference than specific knowledge. Then why are quantitative interviews so comprehensive? There are at least two reasons for this: 8 | 9 | - The first reason is that interviewers often have diverse backgrounds. Each interviewer has his or her own favorite topics that are often related to his or her own educational background or work experience. As a result, the topics you will be tested on are likely to be very broad. The second reason is more fundamental. Your problem solving skills, a crucial requirement for any quant job, are often positively correlated to the breadth of your knowledge. A basic understanding of a broad range of topics often helps you better analyze problems, explore alternative approaches, and come up with efficient solutions. Besides, your responsibility may not be restricted to your own projects. You will be expected to contribute as a member of a bigger team. Having broad knowledge will help you contribute to the team's success as well. 10 | 11 | - The key here is "basic understanding." Interviewers do not expect you to be an expert on a specific subject, unless it happens to be your PhD thesis.
The knowledge used in interviews, although broad, covers mainly essential concepts. This is exactly the reason why most of the books I refer to in the following chapters have the word "introduction" or "first" in the title. If I am allowed to give only one suggestion to a candidate, it will be: know the basics very well. 12 | 13 | ## Practice your interview skills 14 | - The interview process starts long before you step into an interview room. In a sense, the success or failure of your interview is often determined before the first question is asked. Your solutions to interview problems may fail to reflect your true intelligence and knowledge if you are unprepared. Although a complete review of quant interview problems is impossible and unnecessary, practice does improve your interview skills. Furthermore, many of the behavioral, technical and resume-related questions can be anticipated. So prepare yourself for potential questions long before you enter an interview room. 15 | 16 | ## Listen carefully 17 | - You should be an active listener in interviews so that you understand the problems well before you attempt to answer them. If any aspect of a problem is not clear to you, politely ask for clarification. If the problem is more than a couple of sentences, jot down the key words to help you remember all the information. For complex problems, interviewers often give away some clues when they explain the problem. Even the assumptions they give may include some information as to how to approach the problem. So listen carefully and make sure you get the necessary information. 18 | 19 | ## Speak your mind 20 | - When you analyze a problem and explore different ways to solve it, never do it silently. Clearly demonstrate your analysis and write down the important steps involved if necessary. This conveys your intelligence to the interviewer and shows that you are methodical and thorough.
If you go astray, the interaction will also give your interviewer the opportunity to correct the course and provide you with some hints. 21 | - Speaking your mind does not mean explaining every tiny detail. If some conclusions are obvious to you, simply state the conclusion without the trivial details. More often than not, the interviewer uses a problem to test a specific concept/approach. You should focus on demonstrating your understanding of the key concept/approach instead of dwelling on less relevant details. 22 | 23 | ## Make reasonable assumptions 24 | - In real job settings, you are unlikely to have all the necessary information or data you'd prefer to have before you build a model and make a decision. In interviews, interviewers may not give you all the necessary assumptions either. So it is up to you to make reasonable assumptions. The keyword here is reasonable. Explain your assumptions to the interviewer so that you will get immediate feedback. To solve quantitative problems, it is crucial that you can quickly make reasonable assumptions and design appropriate frameworks to solve problems based on the assumptions. 25 | 26 | - We are now ready to review basic concepts in quantitative finance subject areas and have fun solving real-world interview problems!
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.2.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.2.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.2.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.3.png 
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.4.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.2.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.1.png 
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.3.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.4.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.4.2.png 
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.4.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.5.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/Table4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/Table4.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/Table4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/Table4.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/Table4.3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/Table4.3.png -------------------------------------------------------------------------------- /probability/images/properties_of_random_variables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/images/properties_of_random_variables.png -------------------------------------------------------------------------------- /quant_trader/info.md: -------------------------------------------------------------------------------- 1 | ## finance 2 | 3 | ## links 4 | https://www.quantstart.com/articles/Self-Study-Plan-for-Becoming-a-Quantitative-Trader-Part-I/ 5 | https://www.investopedia.com/options-basics-tutorial-4583012 -------------------------------------------------------------------------------- /system_design/Grokking the system design interview.md: -------------------------------------------------------------------------------- 1 | https://www.educative.io/courses/grokking-the-system-design-interview?affiliate_id=5749180081373184/ 2 | 3 | # names of large numbers 4 | - 1 KB = 1024 bytes ≈ 10^3 bytes | kilobyte 5 | - 1 MB = 1024 KB ≈ 10^6 bytes | megabyte 6 | - 1 GB ≈ 10^9 bytes | gigabyte 7 | - 1 TB ≈ 10^12 bytes | terabyte 8 | - 1 PB ≈ 10^15 bytes | petabyte 9 | 10 | # Interview Process 11 | - Scope the problem 12 | - Don’t make assumptions. 13 | - Ask clarifying questions to understand the constraints and use cases.
14 | - Steps 15 | - Requirements clarifications 16 | - System interface definition 17 | - Sketch up an abstract design 18 | - Building blocks of the system 19 | - Relationships between them 20 | - Steps 21 | - Back-of-the-envelope estimation 22 | - Defining data model 23 | - High-level design 24 | - Identify and address the bottlenecks 25 | - Use the fundamental principles of scalable system design 26 | - Steps 27 | - Detailed design 28 | - Identifying and resolving bottlenecks 29 | 30 | 31 | -------------------------------------------------------------------------------- /system_design/System Design.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | - start by asking you to design a system that performs a given task 3 | ### examples 4 | - Design a URL shortening service like bit.ly. 5 | - How would you implement the Google search? 6 | - Design a client-server application which allows people to play chess with one another. 7 | - How would you store the relations in a social network like Facebook and implement a feature where one user receives notifications when their friends like the same things as they do? 8 | 9 | ## Step1: System Design Process 10 | - use cases 11 | - e.g. Design a URL shortening service like bit.ly. 12 | - shortening: take an url -> return shorter url 13 | - redirection: take a short url -> redirect to the original url 14 | - custom url: let user custom their short url 15 | - analytics: allow people to look at usage statistics of the url 16 | - automatic link expiration 17 | - manual link removal: remove a short url used before 18 | - UI vs API 19 | - constraints 20 | - usage per second: e.g. assume not in top3 url service but in top10 21 | - start estimating from usage per month 22 | 23 | ## Step2: Abstract design 24 | - draw a simple diagram of your ideas 25 | - e.g. 
url shortening 26 | - application service layer 27 | - shortening service 28 | - generate new hash 29 | - check if it's in data storage 30 | - if not, generate new mapping 31 | - if yes, keep generating until an unused one is found 32 | - redirection service 33 | - retrieve the value given the hash 34 | - data storage layer, keeps track of the hash to url mapping 35 | - act like a big hash table 36 | - stores new mapping 37 | - retrieves value given a key 38 | - hashed_url = convert_to_base_62(md5(original_url + random_salt))[:6] 39 | 40 | ## Step3: Understanding bottleneck 41 | - traffic is probably not going to be hard; the data is more interesting 42 | 43 | ## Step4: Scalability 44 | - ideas 45 | - Vertical scaling 46 | - Horizontal scaling 47 | - Caching 48 | - Load balancing 49 | - Database replication 50 | - Database partitioning 51 | - clones 52 | - every server contains exactly the same codebase and does not store any user-related data, like sessions or profile pictures, on local disk or memory. 53 | - Sessions need to be stored in a centralized data store which is accessible to all your application servers. 54 | - a code change is sent to all your servers without one server still serving old code, since every server serves the same codebase 55 | - servers can now horizontally scale and you can already serve thousands of concurrent requests 56 | - database 57 | - you can stay with MySQL and use it like a NoSQL database 58 | - or you can switch to an easier-to-scale NoSQL database like MongoDB or CouchDB, using NoSQL instead of scaling a relational database 59 | - cache 60 | - A cache is a simple key-value store and it should reside as a buffering layer between your application and your data storage. 61 | - Whenever your application has to read data it should at first try to retrieve the data from your cache.
62 | - only if it’s not in the cache should it then try to get the data from the main data source 63 | - Cached Database Queries 64 | - A hashed version of your query is the cache key 65 | - issues 66 | - expiration: it is hard to delete a cached result when you cache a complex query 67 | - When one piece of data changes (for example a table cell) you need to delete all cached queries that may include that table cell. 68 | - Cached Objects 69 | - store the complete instance of the class or the assembled dataset in the cache 70 | - easily get rid of the object whenever something changes; this makes the overall operation of your code faster and more logical. 71 | - asynchronism 72 | - Async #1 73 | - doing the time-consuming work in advance and serving the finished work with a low request time. 74 | - Async #2 75 | - start the task when the customer is in the bakery and tell him to come back the next day. For a web service, that means handling tasks asynchronously. 76 | - A user comes to your website and starts a very compute-intensive task which would take several minutes to finish. So the frontend of your website sends a job onto a job queue and immediately signals back to the user: your job is in progress, please continue to browse the page 77 | 78 | 79 | 80 | ## Topics 81 | ### Concurrency 82 | - Do you understand threads, deadlock, and starvation? Do you know how to parallelize algorithms? Do you understand consistency and coherence? 83 | 84 | ### Networking 85 | - Do you roughly understand IPC and TCP/IP? Do you know the difference between throughput and latency, and when each is the relevant factor? 86 | 87 | ### Abstraction 88 | - You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?
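The read-through cache described in the cache notes above ("try the cache first, only on a miss go to the main data source") can be sketched in a few lines of Python. This is a minimal in-memory illustration; the `ReadThroughCache` class, its TTL handling, and the `db` dict are hypothetical stand-ins for a real cache layer such as Memcached or Redis in front of a real database:

```python
import time

class ReadThroughCache:
    """Minimal read-through cache: try the cache first, fall back to the
    main data source on a miss, and store the result for later reads."""

    def __init__(self, ttl_seconds=60):
        self._store = {}          # key -> (value, expiry timestamp)
        self._ttl = ttl_seconds

    def get(self, key, load_from_source):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:
                return value              # cache hit
            del self._store[key]          # entry expired, drop it
        value = load_from_source(key)     # cache miss: hit the database
        self._store[key] = (value, time.time() + self._ttl)
        return value

    def invalidate(self, key):
        # "cached objects" advice: drop the whole object when its data changes
        self._store.pop(key, None)

# hypothetical data source standing in for the real database
db = {"user:1": {"name": "alice"}}
cache = ReadThroughCache(ttl_seconds=30)
profile = cache.get("user:1", lambda k: db[k])
```

The `invalidate` method mirrors the "cached objects" point above: when the underlying data changes, evict the whole cached object instead of hunting down every cached query result that might contain it.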
89 | 90 | ### Real-World Performance 91 | - You should be familiar with the speed of everything your computer can do, including the relative performance of RAM, disk, SSD and your network. 92 | 93 | ### Estimation 94 | - Estimation, especially in the form of a back-of-the-envelope calculation, is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write. 95 | 96 | ### Availability and Reliability 97 | - Are you thinking about how things can fail, especially in a distributed environment? Do you know how to design a system to cope with network failures? Do you understand durability? 98 | 99 | 100 | ## random links 101 | - https://www.palantir.com/2011/10/how-to-rock-a-systems-design-interview/ 102 | 103 | -------------------------------------------------------------------------------- /system_design/design instagram: -------------------------------------------------------------------------------- 1 | 2 | # Design instagram 3 | ## what is instagram 4 | - For the sake of this exercise, we plan to design a simpler version of Instagram, where a user can share photos and can also follow other users. The ‘News Feed’ for each user will consist of top photos of all the people the user follows. 5 | 6 | ## Requirements and Goals of the System 7 | - Functional Requirements 8 | - Users should be able to upload/download/view photos. 9 | - Users can perform searches based on photo/video titles. 10 | - Users can follow other users. 11 | - The system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows. 12 | - Non-functional Requirements 13 | - Our service needs to be highly available. 14 | - The acceptable latency of the system is 200ms for News Feed generation. 15 | - Consistency can take a hit (in the interest of availability); if a user doesn’t see a photo for a while, it should be fine.
16 | - The system should be highly reliable; any uploaded photo or video should never be lost. 17 | 18 | ## Some Design Considerations 19 | - The system would be read-heavy, so we will focus on building a system that can retrieve photos quickly. 20 | - Practically, users can upload as many photos as they like. Efficient management of storage should be a crucial factor while designing this system. 21 | - Low latency is expected while viewing photos. 22 | - Data should be 100% reliable. If a user uploads a photo, the system will guarantee that it will never be lost. 23 | 24 | ## Capacity Estimation and Constraints 25 | - Let’s assume we have 500M total users, with 1M daily active users. 26 | - 2M new photos every day, 23 new photos every second. 27 | - Average photo file size => 200KB 28 | - Total space required for 1 day of photos 29 | - 2M * 200KB => 400 GB 30 | - Total space required for 10 years: 31 | - 400GB * 365 (days a year) * 10 (years) ~= 1425TB 32 | 33 | ## High Level System Design 34 | - At a high-level, we need to support two scenarios, one to upload photos and the other to view/search photos. 35 | - Our service would need some object storage servers to store photos and also some database servers to store metadata information about the photos. 36 | 37 | ## Database Schema 38 | - We need to store data about users, their uploaded photos, and people they follow. - Photo table will store all data related to a photo; we need to have an index on (PhotoID, CreationDate) since we need to fetch recent photos first. 39 | - photo table 40 | ``` 41 | photoID: int - key 42 | userID : int 43 | photo_path: char[256] 44 | photo_latitude: int 45 | photo_longitude: int 46 | creation_date: date 47 | 48 | ``` 49 | - user table 50 | ``` 51 | userID: int - key 52 | name: char[20] 53 | email: char[30] 54 | creation_date: date 55 | last_login: date 56 | user_follow: users 57 | ``` 58 | - We need to store relationships between users and photos, to know who owns which photo. 
We also need to store the list of people a user follows. For both of these tables, we can use a wide-column datastore like Cassandra. For the ‘UserPhoto’ table, the ‘key’ would be ‘UserID’ and the ‘value’ would be the list of ‘PhotoIDs’ the user owns, stored in different columns. We will have a similar scheme for the ‘UserFollow’ table. 59 | - Cassandra, like key-value stores in general, always maintains a certain number of replicas to offer reliability. Also, in such data stores, deletes don’t get applied instantly; data is retained for a certain number of days (to support undeleting) before getting removed from the system permanently. 60 | 61 | ## Data Size Estimation 62 | - Let’s estimate how much data will be going into each table and how much total storage we will need for 10 years. 63 | - **User**: Assuming each “int” and “dateTime” is four bytes, each row in the User table will be 68 bytes: 64 | - UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes 65 | - If we have 500 million users, we will need 32GB of total storage. 66 | - 500 million * 68 ~= 32GB 67 | 68 | - **Photo**: Each row in the Photo table will be 284 bytes: 69 | - PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes 70 | - If 2M new photos get uploaded every day, we will need 0.5GB of storage for one day: 71 | - 2M * 284 bytes ~= 0.5GB per day 72 | - For 10 years we will need 1.88TB of storage. 73 | 74 | - **UserFollow**: Each row in the UserFollow table will consist of 8 bytes. If we have 500 million users and on average each user follows 500 users.
We would need 1.82TB of storage for the UserFollow table: 75 | - 500 million users * 500 followers * 8 bytes ~= 1.82TB 76 | - Total space required for all tables for 10 years will be 3.7TB: 77 | 78 | 32GB + 1.88TB + 1.82TB ~= 3.7TB 79 | 80 | ## component design 81 | - Photo uploads (or writes) can be slow as they have to go to the disk, whereas reads will be faster, especially if they are being served from cache. 82 | - Uploading users can consume all the available connections, as uploading is a slow process. This means that ‘reads’ cannot be served if the system gets busy with all the write requests. Before designing our system, we should keep in mind that web servers have a connection limit. 83 | - If we assume that a web server can have a maximum of 500 connections at any time, then it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck, we can split reads and writes into separate services. We will have dedicated servers for reads and different servers for writes to ensure that uploads don’t hog the system. 84 | - Separating photos’ read and write requests will also allow us to scale and optimize each of these operations independently. 85 | 86 | ## Reliability and Redundancy 87 | - Losing files is not an option for our service. Therefore, we will store multiple copies of each file so that if one storage server dies we can retrieve the photo from the other copy present on a different storage server. 88 | - This same principle also applies to other components of the system. If we want to have high availability of the system, we need to have multiple replicas of services running in the system, so that if a few services go down, the system still remains available and running. Redundancy removes the single point of failure in the system.
89 | - If only one instance of a service is required to run at any point, we can run a redundant secondary copy of the service that is not serving any traffic, but it can take over when the primary has a problem. 90 | - Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production and one fails or degrades, the system can failover to the healthy copy. Failover can happen automatically or require manual intervention. 91 | 92 | ## data sharding 93 | ### Partitioning based on UserID 94 | - we’ll find the shard number by UserID % 10 and then store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID. 95 | - How can we generate PhotoIDs? 96 | - Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we will append the ShardID to each PhotoID, it will be unique throughout our system. 97 | - issues 98 | - How would we handle hot users? Several people follow such hot users and a lot of other people see any photo they upload. 99 | - Some users will have a lot of photos compared to others, thus making a non-uniform distribution of storage. 100 | - What if we cannot store all pictures of a user on one shard? If we distribute photos of a user onto multiple shards, will it cause higher latencies? 101 | - Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving a high load, etc. 102 | 103 | ### Partitioning based on PhotoID 104 | - If we can generate unique PhotoIDs first and then find a shard number through “PhotoID % 10”, the above problems will be solved. We would not need to append the ShardID to the PhotoID in this case, as the PhotoID will itself be unique throughout the system. 105 | - How can we generate PhotoIDs?
106 | - Here we cannot have an auto-incrementing sequence in each shard to define PhotoID because we need to know PhotoID first to find the shard where it will be stored. One solution could be that we dedicate a separate database instance to generate auto-incrementing IDs. If our PhotoID can fit into 64 bits, we can define a table containing only a 64 bit ID field. So whenever we would like to add a photo in our system, we can insert a new row in this table and take that ID to be our PhotoID of the new photo. 107 | - Wouldn’t this key generating DB be a single point of failure? 108 | - Yes, it would be. A workaround for that could be defining two such databases with one generating even numbered IDs and the other odd numbered. 109 | - How can we plan for the future growth of our system? 110 | - We can have a large number of logical partitions to accommodate future data growth, such that in the beginning, multiple logical partitions reside on a single physical database server. Since each database server can have multiple database instances on it, we can have separate databases for each logical partition on any server. So whenever we feel that a particular database server has a lot of data, we can migrate some logical partitions from it to another server. We can maintain a config file (or a separate database) that can map our logical partitions to database servers; this will enable us to move partitions around easily. Whenever we want to move a partition, we only have to update the config file to announce the change. 111 | 112 | ## Ranking and News Feed Generation 113 | - What are the different approaches for sending News Feed contents to the users? 114 | - Pull 115 | - Clients can pull the News Feed contents from the server on a regular basis or manually whenever they need it. 
Possible problems with this approach are a) new data might not be shown to the users until clients issue a pull request, and b) most of the time pull requests will result in an empty response if there is no new data. 116 | - Push 117 | - Servers can push new data to the users as soon as it is available. To efficiently manage this, users have to maintain a Long Poll request with the server for receiving the updates. A possible problem with this approach arises with a user who follows a lot of people, or a celebrity user who has millions of followers; in these cases the server has to push updates quite frequently. 118 | - Hybrid 119 | - We can adopt a hybrid approach. We can move all the users who have a high number of follows to a pull-based model and only push data to those users who have a few hundred (or thousand) follows. Another approach could be that the server pushes updates to all the users at no more than a certain frequency, letting users with a lot of follows/updates regularly pull data. 120 | 121 | ## News Feed Creation with Sharded Data 122 | - One of the most important requirements to create the News Feed for any given user is to fetch the latest photos from all people the user follows. For this, we need to have a mechanism to sort photos by their time of creation. To efficiently do this, we can make photo creation time part of the PhotoID. As we will have a primary index on PhotoID, it will be quite quick to find the latest PhotoIDs. 123 | - We can use epoch time for this. Let’s say our PhotoID will have two parts; the first part will represent epoch time and the second part will be an auto-incrementing sequence. So to make a new PhotoID, we can take the current epoch time and append an auto-incrementing ID from our key-generating DB. We can figure out the shard number from this PhotoID (PhotoID % 10) and store the photo there. 124 | - What could be the size of our PhotoID?
Let’s say our epoch time starts today; how many bits would we need to store the number of seconds for the next 50 years? 125 | - 86400 sec/day * 365 (days a year) * 50 (years) => 1.6 billion seconds 126 | - We would need 31 bits to store this number. Since, on average, we expect 23 new photos per second, we can allocate 9 bits to store the auto-incremented sequence. So every second we can store (2^9 => 512) new photos. We can reset our auto-incrementing sequence every second. -------------------------------------------------------------------------------- /system_design/design url shortening: -------------------------------------------------------------------------------- 1 | # Designing a URL Shortening service like TinyURL 2 | ## Why do we need URL shortening? 3 | - URL shortening is used for optimizing links across devices, tracking individual links to analyze audience and campaign performance, and hiding affiliated original URLs. 4 | 5 | ## Requirements and Goals of the System 6 | ### Functional Requirements 7 | - Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications. 8 | - When users access a short link, our service should redirect them to the original link. 9 | - Users should optionally be able to pick a custom short link for their URL. 10 | - Links will expire after a standard default timespan. Users should be able to specify the expiration time. 11 | 12 | ### Non-Functional Requirements 13 | - The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing. 14 | - URL redirection should happen in real-time with minimal latency. 15 | - Shortened links should not be guessable (not predictable). 16 | 17 | ### Extended Requirements 18 | - Analytics; e.g., how many times did a redirection happen?
19 | - Our service should also be accessible through REST APIs by other services. 20 | 21 | ## Capacity Estimation and Constraints 22 | - Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between reads and writes. 23 | 24 | ### Traffic estimates 25 | - Assuming we will have 500M new URL shortenings per month, with a 100:1 read/write ratio, we can expect 50B redirections during the same period: 26 | 100 * 500M => 50B 27 | - What would be the Queries Per Second (QPS) for our system? New URL shortenings per second: 28 | 500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s 29 | - Considering the 100:1 read/write ratio, URL redirections per second will be: 30 | 100 * 200 URLs/s = 20K/s 31 | 32 | ### Storage estimates 33 | - Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion: 34 | 500 million * 5 years * 12 months = 30 billion 35 | 36 | - Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate; we will dig into it later). We will need 15TB of total storage: 37 | 30 billion * 500 bytes = 15 TB 38 | 39 | ### Memory estimates 40 | - If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. 41 | - Since we have 20K requests per second, we will be getting 1.7 billion requests per day: 42 | 20K * 3600 seconds * 24 hours = ~1.7 billion 43 | - To cache 20% of these requests, we will need 170GB of memory. 44 | 0.2 * 1.7 billion * 500 bytes = ~170GB 45 | - One thing to note here is that since there will be a lot of duplicate requests (of the same URL), our actual memory usage will be less than 170GB.
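The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. A sketch in Python, using only the assumptions stated in this section (500M new URLs/month, 100:1 read/write ratio, 500-byte objects, 5-year retention, 80-20 caching):

```python
# Assumptions from the estimates above
new_urls_per_month = 500_000_000
read_write_ratio = 100
object_size_bytes = 500
retention_years = 5

# Traffic
writes_per_sec = new_urls_per_month / (30 * 24 * 3600)       # ~200 URLs/s
reads_per_sec = read_write_ratio * writes_per_sec            # ~20K/s

# Storage over the full retention period
total_objects = new_urls_per_month * 12 * retention_years    # 30 billion
total_storage_tb = total_objects * object_size_bytes / 1e12  # 15 TB

# Memory: cache 20% of daily requests, per the 80-20 rule
requests_per_day = reads_per_sec * 3600 * 24                 # ~1.7 billion
cache_gb = 0.2 * requests_per_day * object_size_bytes / 1e9  # ~170 GB
```

Working the estimate as explicit arithmetic like this makes it easy to re-run the whole chain when one assumption (say, the read/write ratio) changes during the interview.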
46 | 47 | ## System APIs 48 | - The following could be the definitions of the APIs for creating and deleting URLs: 49 | ``` 50 | createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None) 51 | ``` 52 | - Parameters 53 | - api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota. 54 | - original_url (string): Original URL to be shortened. 55 | - custom_alias (string): Optional custom key for the URL. 56 | - user_name (string): Optional user name to be used in the encoding. 57 | - expire_date (string): Optional expiration date for the shortened URL. 58 | - Returns: (string) 59 | - A successful insertion returns the shortened URL; otherwise, it returns an error code. 60 | - ``deleteURL(api_dev_key, url_key)`` 61 | Where “url_key” is a string representing the shortened URL to be deleted. A successful deletion returns ‘URL Removed’. 62 | - How do we detect and prevent abuse? 63 | - A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key). 64 | 65 | ## Database Design 66 | - A few observations about the nature of the data we will store: 67 | - We need to store billions of records. 68 | - Each object we store is small (less than 1K). 69 | - There are no relationships between records—other than storing which user created a URL. 70 | - Our service is read-heavy. 71 | - Database Schema 72 | - We would need two tables: one for storing information about the URL mappings, and one for the data of the user who created the short link.
73 | - url mapping: hash char[16] 74 | - original_url char[512] 75 | - creation_date 76 | - expiration_date 77 | - user_id 78 | - user info 79 | - name 80 | - email 81 | - register_date 82 | - last_login_time 83 | - What kind of database should we use? 84 | - Since we anticipate storing billions of rows, and we don’t need relationships between objects – a NoSQL store like DynamoDB, Cassandra or Riak is a better choice. 85 | - A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details. 86 | 87 | ## Basic System Design and Algorithm 88 | ### encoding actual url 89 | - We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. 90 | - MD5 91 | - The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. 92 | - One basic requirement of any cryptographic hash function is that it should be computationally infeasible to find two distinct messages that hash to the same value. MD5 fails this requirement catastrophically; such collisions can be found in seconds on an ordinary home computer. 93 | - This encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]), and if we add ‘+’ and ‘/’ we can use Base64 encoding. 94 | - Using base64 encoding, a 6-letter key would result in 64^6 = ~68.7 billion possible strings 95 | - Using base64 encoding, an 8-letter key would result in 64^8 = ~281 trillion possible strings 96 | - If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string of more than 21 characters (since each base64 character encodes 6 bits of the hash value). 97 | - Now we only have space for 8 characters per short key; how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication; to resolve that, we can choose some other characters out of the encoding string or swap some characters.
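A minimal sketch of this hash-then-truncate scheme (the function name and the 6-character truncation are illustrative; the duplication caveat above still applies):

```python
import base64
import hashlib

def short_key(original_url: str, length: int = 6) -> str:
    """Hash the URL with MD5, base64-encode the 128-bit digest,
    and keep only the first `length` characters as the short key."""
    digest = hashlib.md5(original_url.encode("utf-8")).digest()  # 16 bytes
    encoded = base64.b64encode(digest).decode("ascii")           # 24 chars (22 + padding)
    return encoded[:length]

# Truncation means distinct URLs can collide, so a real service must
# still check the database for the key and retry/tweak on duplicates.
print(short_key("http://www.educative.io/distributed.php?id=design"))
```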
98 | - issues 99 | - If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable. 100 | - What if parts of the URL are URL-encoded? e.g., http://www.educative.io/distributed.php?id=design, and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except for the URL encoding. 101 | - workarounds 102 | - We can append an increasing sequence number to each input URL to make it unique, and then generate a hash of it. We don’t need to store this sequence number in the database, though. A possible problem with this approach is the ever-increasing sequence number: can it overflow? Appending an increasing sequence number will also impact the performance of the service. 103 | - Another solution could be to append the user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one. 104 | - ![Request flow for shortening of a URL](images/shortening.png) 105 | ### Generating keys offline 106 | - We can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast. Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. 107 | - can concurrency cause problems? 108 | - As soon as a key is used, it should be marked in the database to ensure it doesn’t get reused. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. 109 | - For simplicity, as soon as KGS loads some keys in memory, it can move them to the used-keys table. This ensures each server gets unique keys.
If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys – which could be acceptable, given the huge number of keys we have. 110 | - KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server. 111 | - Can each app server cache some keys from key-DB? 112 | - Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This can be acceptable since we have 68B unique six-letter keys. 113 | - How would we perform a key lookup? 114 | - We can look up the key in our database to get the full URL. If it’s present in the DB, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If that key is not present in our system, issue an “HTTP 404 Not Found” status or redirect the user back to the homepage. 115 | - It is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema). 116 | 117 | ## Data Partitioning and Replication 118 | - Range-Based Partitioning 119 | - We can store URLs in separate partitions based on the first letter of the hash key. Hence we save all the URLs starting with the letter ‘A’ (and ‘a’) in one partition, save those that start with the letter ‘B’ in another partition, and so on. This approach is called range-based partitioning. We can even combine certain less frequently occurring letters into one database partition. We should come up with a static partitioning scheme so that we can always store/find a URL in a predictable manner. 120 | - The main problem with this approach is that it can lead to unbalanced DB servers.
For example, we decide to put all URLs starting with letter ‘E’ into a DB partition, but later we realize that we have too many URLs that start with the letter ‘E’. 121 | 122 | - Hash-Based Partitioning 123 | - In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object. 124 | - Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any ‘key’ to a number between [1…256]), and this number would represent the partition in which we store our object. 125 | 126 | ## Cache 127 | - How much cache memory should we have? 128 | - We can start with 20% of daily traffic and, based on clients’ usage pattern, we can adjust how many cache servers we need. As estimated above, we need 170GB memory to cache 20% of daily traffic. Since a modern-day server can have 256GB memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs. 129 | 130 | - Which cache eviction policy would best fit our needs? 131 | - When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of the URLs that have been accessed recently. 132 | 133 | - How can each cache replica be updated? 134 | - Whenever there is a cache miss, our servers would be hitting a backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it. 
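The “Linked Hash Map” mentioned above is exactly how a compact LRU cache is usually implemented; a toy sketch (the class name and capacity are illustrative, not part of the design):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache for short-key -> original-URL lookups.
    OrderedDict plays the role of the linked hash map: it remembers
    usage order, so the least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                      # cache miss -> fall back to the DB
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, url):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = url
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(2)
cache.put("abc123", "https://example.com/a")
cache.put("def456", "https://example.com/b")
cache.get("abc123")                          # touch: "abc123" is now most recent
cache.put("ghi789", "https://example.com/c") # evicts "def456"
```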
135 | - ![Request flow for accessing a shortened URL](images/accessing.png) 136 | 137 | ## Load Balancer (LB) 138 | - We can add a load-balancing layer at three places in our system: 139 | - Between Clients and Application servers 140 | - Between Application Servers and database servers 141 | - Between Application Servers and Cache servers 142 | - Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and will stop sending any traffic to it. 143 | - A problem with Round Robin LB is that it doesn’t take the server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries the backend server about its load and adjusts traffic based on that. 144 | 145 | ## Purging or DB cleanup 146 | - Should entries stick around forever, or should they be purged? If a user-specified expiration time is reached, what should happen to the link? 147 | - If we choose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will make sure that only expired links are deleted; some expired links may live longer, but they will never be returned to users. 148 | - Whenever a user tries to access an expired link, we can delete the link and return an error to the user. 149 | - A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low. 150 | - We can have a default expiration time for each link (e.g., two years).
151 | - After removing an expired link, we can put the key back in the key-DB to be reused. 152 | - Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever. 153 | 154 | ## Telemetry 155 | - How many times has a short URL been used, and what were the user locations? How would we store these statistics? If it is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests? 156 | - Some statistics worth tracking: country of the visitor, date and time of access, web page that referred the click, and browser or platform from where the page was accessed. 157 | 158 | 159 | ## Security and Permissions 160 | - Can users create private URLs or allow a particular set of users to access a URL? 161 | - We can store the permission level (public/private) with each URL in the database. We can also create a separate table to store UserIDs that have permission to see a specific URL. If a user does not have permission and tries to access a URL, we can send an error (HTTP 401) back. 162 | - Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the ‘Hash’ (or the KGS-generated ‘key’). The columns will store the UserIDs of those users that have permission to see the URL. -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/basics.md: -------------------------------------------------------------------------------- 1 | Basics 2 | ==== 3 | 4 | # text 5 | Whenever we are designing a large system, we need to consider a few things: 6 | 7 | What are the different architectural pieces that can be used? 8 | How do these pieces work with each other? 9 | How can we best utilize these pieces: what are the right tradeoffs?
10 | Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save valuable time and resources in the future. In the following chapters, we will try to define some of the core building blocks of scalable systems. Familiarity with these concepts will greatly help in understanding distributed system concepts. In the next section, we will go through Consistent Hashing, CAP Theorem, Load Balancing, Caching, Data Partitioning, Indexes, Proxies, Queues, Replication, and choosing between SQL vs. NoSQL. 11 | 12 | Let’s start with the Key Characteristics of Distributed Systems. 13 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/caching.md: -------------------------------------------------------------------------------- 1 | Caching 2 | ==== 3 | # keypoints 4 | - Take advantage of the locality of reference principle: recently requested data is likely to be requested again. 5 | - Caches exist at all levels in the architecture, but are often found at the level nearest to the front end. 6 | 7 | ## Application server cache 8 | - Cache placed on a request layer node. 9 | - When a request layer node is expanded to many nodes 10 | - Load balancer randomly distributes requests across the nodes. 11 | - The same request can go to different nodes. 12 | - Increases cache misses. 13 | - Solutions: 14 | - Global caches 15 | - Distributed caches 16 | 17 | ## Distributed cache 18 | - Each request layer node owns part of the cached data. 19 | - Entire cache is divided up using a consistent hashing function. 20 | - Pro 21 | - Cache space can be increased easily by adding more nodes to the request pool. 22 | - Con 23 | - A missing node leads to the loss of its part of the cache. 24 | 25 | ## Global cache 26 | - A server or file store that is faster than the original store, and accessible by all request layer nodes. 27 | - Two common forms 28 | - Cache server handles cache miss.
29 | - Used by most applications. 30 | - Request nodes handle cache miss. 31 | - Have a large percentage of the hot data set in the cache. 32 | - An architecture where the files stored in the cache are static and shouldn’t be evicted. 33 | - The application logic understands the eviction strategy or hot spots better than the cache. 34 | 35 | ## Content delivery network (CDN) 36 | - For sites serving large amounts of static media. 37 | - Process 38 | - A request first asks the CDN for a piece of static media. 39 | - CDN serves that content if it has it locally available. 40 | - If content isn’t available, CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user. 41 | - If the system is not large enough for a CDN, it can be built like this: 42 | - Serve static media off a separate subdomain using a lightweight HTTP server (e.g. Nginx). 43 | - Cut over the DNS from this subdomain to a CDN later. 44 | 45 | ## Cache invalidation 46 | - Keep cache coherent with the source of truth. Invalidate cache when the source of truth has changed. 47 | - Write-through cache 48 | - Data is written into the cache and permanent storage at the same time. 49 | - Pro 50 | - Fast retrieval, complete data consistency, robust to system disruptions. 51 | - Con 52 | - Higher latency for write operations. 53 | - Write-around cache 54 | - Data is written to permanent storage, not cache. 55 | - Pro 56 | - Avoids flooding the cache with written data that is never re-read. 57 | - Con 58 | - A query for recently written data creates a cache miss and higher latency. 59 | - Write-back cache 60 | - Data is only written to cache. 61 | - Write to the permanent storage is done later on. 62 | - Pro 63 | - Low latency, high throughput for write-intensive applications. 64 | - Con 65 | - Risk of data loss in case of system disruptions.
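The three write policies above differ only in when the permanent store sees the write; a minimal sketch (the dicts stand in for a real cache and storage backend):

```python
class WriteThroughCache:
    """Write goes to the cache and permanent storage synchronously."""
    def __init__(self, store):
        self.store, self.cache = store, {}
    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value          # the same request pays for both writes

class WriteAroundCache:
    """Write bypasses the cache; the first read after a write is a miss."""
    def __init__(self, store):
        self.store, self.cache = store, {}
    def write(self, key, value):
        self.store[key] = value          # cache not flooded by cold writes
    def read(self, key):
        if key not in self.cache:        # miss -> fetch from permanent storage
            self.cache[key] = self.store[key]
        return self.cache[key]

class WriteBackCache:
    """Write goes to the cache only; storage is updated on a later flush."""
    def __init__(self, store):
        self.store, self.cache, self.dirty = store, {}, set()
    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)              # lost if the node dies before flushing
    def flush(self):
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

The trade-offs in the list map directly onto the code: write-through pays double latency per write, write-around makes the next read a miss, and write-back risks losing the `dirty` set on a crash.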
66 | 67 | ## Cache eviction policies 68 | - FIFO: first in first out 69 | - LIFO: last in first out 70 | - LRU: least recently used 71 | - MRU: most recently used 72 | - LFU: least frequently used 73 | - RR: random replacement 74 | 75 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/cap_theorem.md: -------------------------------------------------------------------------------- 1 | [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) 2 | ==== 3 | - it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP) 4 | - Consistency 5 | - All nodes see the same data at the same time 6 | - achieved by updating several nodes before further reads 7 | - every read receives the most recent write or an error 8 | - Availability 9 | - every request receives a response on success/failure 10 | - achieved by replicating the data across different servers 11 | - Partition tolerance 12 | - system continues to work despite message loss or partial failure 13 | - can sustain any amount of network failure 14 | - the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes 15 | - CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability 16 | - CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made. 17 | - [ACID](https://en.wikipedia.org/wiki/ACID) databases choose consistency over availability. 18 | - [BASE](https://en.wikipedia.org/wiki/Eventual_consistency) systems choose availability over consistency. 
19 | 20 | # text 21 | - The CAP theorem states that it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP): Consistency, Availability, and Partition tolerance. When we design a distributed system, trading off among CAP is almost the first thing we want to consider. The CAP theorem says that while designing a distributed system we can pick only two of the following three options: 22 | - Consistency 23 | - All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads. 24 | - Availability 25 | - Every request gets a response on success/failure. Availability is achieved by replicating the data across different servers. 26 | - Partition tolerance 27 | - The system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages. 28 | ![cap](../images/cap.png) 29 | - We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failures. We can only build a system that has any two of these three properties. This is because, to be consistent, all nodes should see the same set of updates in the same order. But if the network suffers a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available.
30 | 31 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/consistent_hashing.md: -------------------------------------------------------------------------------- 1 | Consistent Hashing 2 | ==== 3 | # keypoints 4 | 5 | - Distributed Hash Table (DHT) 6 | - index = hash_function(key) 7 | - distributed caching system 8 | - n cache servers, if index = key % n 9 | - problem 10 | - not horizontally scalable 11 | - when adding a new cache host, all existing mappings are broken 12 | - may not be load balanced 13 | 14 | ## consistent hashing 15 | - minimize reorganization when nodes are added or removed 16 | - only k/n keys need to be remapped (k = number of keys, n = number of servers) 17 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/data_partitioning.md: -------------------------------------------------------------------------------- 1 | Data Partitioning 2 | ==== 3 | # keypoints 4 | - break up a big database (DB) into many smaller parts 5 | - after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines 6 | 7 | ## Partitioning Methods 8 | - Horizontal partitioning (range based partitioning, data sharding) 9 | - put different rows into different tables 10 | - e.g. 0-1k, 1k-2k, ... 11 | - problem 12 | - if range for partition not chosen carefully, could have unbalanced servers 13 | - Vertical Partitioning 14 | - store tables related to a specific feature in one server 15 | - e.g.
server1: insta pics, server2: user info..; 16 | - problem 17 | - if it keeps growing, it may be necessary to further partition a feature-specific DB across various servers 18 | - Directory Based Partitioning 19 | - create a lookup service which knows your current partitioning scheme 20 | - to find out where a particular data entity resides, query the directory server that holds the mapping between each tuple key and its DB server 21 | 22 | ## Partitioning Criteria 23 | - Key or Hash-based partitioning 24 | - apply a hash function to some key attributes of the entity we are storing -> partition number 25 | - e.g. ID % 100 if we have 100 partitions 26 | - should ensure uniform allocation 27 | - problem 28 | - adding new servers might require rehashing -> downtime for the service 29 | - List partitioning 30 | - each partition assigned a list of values 31 | - to insert a new record, find the partition with the corresponding key 32 | - Round-robin partitioning 33 | - i^th tuple assigned to partition i % n 34 | - Composite partitioning 35 | - combine the above schemes 36 | - e.g. list partitioning -> hash based partitioning 37 | - e.g. consistent hashing = hash + list partitioning 38 | - when a hash table is resized, only n/m keys need to be remapped on average, where n is the number of keys and m is the number of slots 39 | 40 | ## Common Problems of Data Partitioning 41 | - Joins and Denormalization 42 | - if a database is partitioned and spread across multiple machines, it is often not feasible to perform joins 43 | - workaround 44 | - denormalize the database so that queries that previously required joins can be performed from a single table 45 | - but denormalization leads to data inconsistency 46 | - Referential integrity 47 | - enforcing data integrity constraints in a partitioned database is difficult, e.g.
foreign keys 48 | - Rebalancing 49 | - reasons to change the partitioning scheme 50 | - data distribution not uniform 51 | - a lot of load on a partition 52 | - solution 53 | - create more DB partitions or rebalance existing partitions 54 | - will incur downtime 55 | - could use directory based partitioning 56 | 57 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/indexes.md: -------------------------------------------------------------------------------- 1 | Indexes 2 | ==== 3 | # keypoints 4 | - a data structure that can be perceived as a table of contents that points us to the location where actual data lives 5 | - Improve the performance of search queries. 6 | - Decrease the write performance because indices need to be updated. This performance degradation applies to all insert, update, and delete operations. 7 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/key_characteristics_of_distributed_systems.md: -------------------------------------------------------------------------------- 1 | Key Characteristics of Distributed Systems 2 | ==== 3 | 4 | # keypoints 5 | ## Scalability 6 | - The capability of a system to grow and manage increased demand. 7 | - A system that can continuously evolve to support a growing amount of work is scalable. 8 | - Horizontal scaling: by adding more servers into the pool of resources. 9 | - Vertical scaling: by adding more resources (CPU, RAM, storage, etc.) to an existing server. This approach comes with downtime and an upper limit. 10 | 11 | ## Reliability 12 | - Reliability is the probability that a system will keep delivering its service without failure over a given period. 13 | - A distributed system is reliable if it keeps delivering its service even when one or multiple components fail. 14 | - Reliability is achieved through redundancy of components and data (remove every single point of failure).
15 | 16 | ## Availability 17 | - Availability is the time a system remains operational to perform its required function in a specific period. 18 | - Measured by the percentage of time that a system remains operational under normal conditions. 19 | - A reliable system is available. 20 | - An available system is not necessarily reliable. 21 | - A system with a security hole is available when there is no security attack. 22 | 23 | ## Efficiency 24 | - Latency: response time, the delay to obtain the first piece of data. 25 | - Bandwidth: throughput, amount of data delivered in a given time. 26 | 27 | ## Serviceability / Manageability 28 | - Ease of operating and maintaining the system. 29 | - Simplicity and speed with which a system can be repaired or maintained. 30 | 31 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/load_balancing.md: -------------------------------------------------------------------------------- 1 | Load Balancing (LB) 2 | ==== 3 | # keypoints 4 | Helps scale horizontally across an ever-increasing number of servers.
5 | 6 | ## LB locations 7 | - Between user and web server 8 | - Between web servers and an internal platform layer (application servers, cache servers) 9 | - Between internal platform layer and database 10 | 11 | ## Algorithms 12 | - Least connection 13 | - Least response time 14 | - Least bandwidth 15 | - Round robin 16 | - Weighted round robin 17 | - IP hash 18 | 19 | ## Implementation 20 | - Smart clients 21 | - Hardware load balancers 22 | - Software load balancers 23 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/long_polling_websockets_serversent_events.md: -------------------------------------------------------------------------------- 1 | Long-Polling vs WebSockets vs Server-Sent Events 2 | ==== 3 | 4 | # keypoints 5 | - communication protocols 6 | - long-polling 7 | - WebSockets 8 | - Server-Sent Events 9 | - between a client like a web browser and a web server 10 | - sequence of events for a regular HTTP request 11 | - client opens a connection, requests data from server 12 | - server calculates response 13 | - server sends response back to the client 14 | 15 | ## Ajax Polling 16 | - client repeatedly polls/requests a server for data 17 | - If no data is available, an empty response is returned 18 | - steps 19 | - client opens a connection, requests data from the server using regular HTTP. 20 | - requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 21 | - server calculates the response and sends it back 22 | - client repeats the above three steps periodically 23 | - problem 24 | - client keeps asking the server for new data, a lot of responses are empty -> HTTP overhead 25 | 26 | ## HTTP Long-Polling 27 | - server pushes information to client whenever the data is available.
28 | - client requests as in normal polling, but expects that the server may not respond immediately 29 | - if server has no data available, then hold request instead of sending empty response until a timeout 30 | - once data available, full response sent 31 | - client immediately re-requests, so server almost always has a waiting request 32 | - client has to reconnect periodically after connection closed due to timeouts 33 | 34 | ## WebSockets 35 | - persistent connection between client and server 36 | - both parties can send data at any time 37 | - establishes WebSocket connection through WebSocket handshake 38 | - if it succeeds, client and server can exchange data 39 | - enables communication with low overheads 40 | - real-time data transfer 41 | 42 | ## Server-Sent Events (SSEs) 43 | - client establishes a persistent & long-term connection with the server 44 | - client requires another tech/protocol to send data to server 45 | - steps 46 | - client requests data using regular HTTP 47 | - requested webpage opens a connection to server 48 | - server sends data to client if new info available 49 | - best when real-time traffic needed 50 | - or server generates data in a loop 51 | 52 | # text 53 | - Long-Polling, WebSockets, and Server-Sent Events are popular communication protocols between a client like a web browser and a web server. First, let’s start with understanding what a standard HTTP web request looks like. The following is the sequence of events for a regular HTTP request: 54 | - The client opens a connection and requests data from the server. 55 | - The server calculates the response. 56 | - The server sends the response back to the client on the opened request. 57 | ![HTTP_protocol](../images/HTTP_protocol.png) 58 | 59 | ## Ajax Polling 60 | - Polling is a standard technique used by the vast majority of AJAX applications. The basic idea is that the client repeatedly polls (or requests) a server for data. The client makes a request and waits for the server to respond with data.
If no data is available, an empty response is returned. 61 | - The client opens a connection and requests data from the server using regular HTTP. 62 | - The requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 63 | - The server calculates the response and sends it back, just like regular HTTP traffic. 64 | - The client repeats the above three steps periodically to get updates from the server. 65 | - The problem with Polling is that the client has to keep asking the server for any new data. As a result, a lot of responses are empty, creating HTTP overhead. 66 | ![Ajax Polling Protocol](../images/ajax.png) 67 | 68 | ## HTTP Long-Polling 69 | - This is a variation of the traditional polling technique that allows the server to push information to a client whenever the data is available. With Long-Polling, the client requests information from the server exactly as in normal polling, but with the expectation that the server may not respond immediately. That’s why this technique is sometimes referred to as a “Hanging GET”. 70 | - If the server does not have any data available for the client, instead of sending an empty response, the server holds the request and waits until some data becomes available. 71 | - Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server so that the server will almost always have an available waiting request that it can use to deliver data in response to an event. 72 | - The basic life cycle of an application using HTTP Long-Polling is as follows: 73 | - The client makes an initial request using regular HTTP and then waits for a response. 74 | - The server delays its response until an update is available or a timeout has occurred. 75 | - When an update is available, the server sends a full response to the client.
76 | - The client typically sends a new long-poll request, either immediately upon receiving a response or after a pause to allow an acceptable latency period. 77 | - Each Long-Poll request has a timeout. The client has to reconnect periodically after the connection is closed due to timeouts. 78 | ![Long Polling Protocol](../images/long_polling.png) 79 | 80 | ## WebSockets 81 | - WebSocket provides full-duplex communication channels over a single TCP connection. It provides a persistent connection between a client and a server that both parties can use to start sending data at any time. The client establishes a WebSocket connection through a process known as the WebSocket handshake. If the process succeeds, then the server and client can exchange data in both directions at any time. The WebSocket protocol enables communication between a client and a server with lower overheads, facilitating real-time data transfer from and to the server. This is made possible by providing a standardized way for the server to send content to the browser without being asked by the client and allowing for messages to be passed back and forth while keeping the connection open. In this way, a two-way (bi-directional) ongoing conversation can take place between a client and a server. 82 | ![WebSockets Protocol](../images/websockets.png) 83 | 84 | ## Server-Sent Events (SSEs) 85 | - Under SSEs the client establishes a persistent and long-term connection with the server. The server uses this connection to send data to a client. If the client wants to send data to the server, it would require the use of another technology/protocol to do so. 86 | - Client requests data from a server using regular HTTP. 87 | - The requested webpage opens a connection to the server. 88 | - The server sends the data to the client whenever there’s new information available.
89 | - SSEs are best when we need real-time traffic from the server to the client or if the server is generating data in a loop and will be sending multiple events to the client. 90 | ![Server Sent Events Protocol](../images/sse.png) -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/proxies.md: -------------------------------------------------------------------------------- 1 | Proxies 2 | ==== 3 | 4 | # keypoints 5 | - A proxy server is an intermediary piece of hardware / software sitting between client and backend server. 6 | - Filter requests 7 | - Log requests 8 | - Transform requests 9 | - adding/removing headers 10 | - encrypting/decrypting 11 | - compressing a resource 12 | - cache 13 | - if multiple clients access a particular request, proxy server can cache it 14 | 15 | ## Proxy Server Types 16 | - Open Proxy 17 | - accessible by any Internet user 18 | - Anonymous Proxy 19 | - reveals its identity as a server but does not disclose the initial IP address 20 | - Transparent Proxy 21 | - identifies itself 22 | - with the support of HTTP headers, the first IP address can be viewed 23 | - can cache the websites 24 | - Reverse Proxy 25 | - retrieves resources on behalf of a client from servers 26 | - then returned to the client 27 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/redundancy_replication.md: -------------------------------------------------------------------------------- 1 | Redundancy & Replication 2 | ==== 3 | # keypoints 4 | - Redundancy 5 | - **duplication of critical data or services** with the intention of increased reliability of the system. 6 | - remove single point of failure 7 | - if we have two servers and one fails, system can failover to the other one. 8 | - primary-replica relationship 9 | - between the original and the copies.
10 | - primary gets all updates 11 | - updates then ripple through to the replica servers 12 | - replica outputs a message if it received the update successfully 13 | - Shared-nothing architecture 14 | - Each node can operate independently of one another. 15 | - No central service managing state or orchestrating activities. 16 | - New servers can be added without special conditions or knowledge. 17 | - No single point of failure. 18 | 19 | 20 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/sql_nosql.md: -------------------------------------------------------------------------------- 1 | SQL vs. NoSQL 2 | ==== 3 | # keypoints 4 | ## sql (relational databases) 5 | - structured 6 | - have predefined schemas 7 | - e.g. phone books that store phone numbers and addresses 8 | - store data in rows and columns 9 | - row contains information about one entity 10 | - column contains separate data points 11 | 12 | ## NoSQL (non-relational databases) 13 | - unstructured, distributed 14 | - have a dynamic schema 15 | - e.g. file folders that hold everything from a person’s address to their Facebook ‘likes’ 16 | 17 | ## Common types of NoSQL 18 | ### Key-value stores 19 | - Array of key-value pairs. The "key" is an attribute name. 20 | - Redis, Voldemort, Dynamo. 21 | 22 | ### Document databases 23 | - Data is stored in documents. 24 | - Documents are grouped in collections. 25 | - Each document can have an entirely different structure. 26 | - CouchDB, MongoDB. 27 | 28 | ### Wide-column / columnar databases 29 | - Column families - containers for rows. 30 | - No need to know all the columns up front. 31 | - Each row can have a different number of columns. 32 | - Cassandra, HBase.
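The "entirely different structure" point above can be sketched with a toy in-memory document collection. This is plain Python standing in for a real document store; `ToyCollection`, `insert`, and `find` are hypothetical names for illustration, not a real client API:

```python
# Toy document "collection": each document is a dict, and documents in the
# same collection may have completely different fields (dynamic schema).
class ToyCollection:
    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, **criteria):
        # Return documents whose fields match every criterion; a document
        # missing a queried field simply doesn't match (no NULL columns needed).
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

users = ToyCollection()
users.insert({"name": "alice", "city": "NYC"})
users.insert({"name": "bob", "likes": ["hiking"]})  # different shape, still fine

print(users.find(name="alice"))  # [{'name': 'alice', 'city': 'NYC'}]
```

Note that nothing had to be declared up front: the second insert adds a `likes` field the first document never had, which is exactly what a fixed relational schema would not allow without an `ALTER TABLE`.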
33 | 34 | ### Graph database 35 | - Data is stored in graph structures 36 | - Nodes: entities 37 | - Properties: information about the entities 38 | - Lines: connections between the entities 39 | - Neo4J, InfiniteGraph 40 | 41 | ## Differences between SQL and NoSQL 42 | ### Storage 43 | - SQL: store data in tables. 44 | - NoSQL: have different data storage models. 45 | - key-value 46 | - document 47 | - graph 48 | - columnar 49 | 50 | ### Schema 51 | - SQL 52 | - Each record conforms to a fixed schema. 53 | - each row must have data for each column 54 | - Schema can be altered, but it requires modifying the whole database and going offline. 55 | - NoSQL: 56 | - Schemas are dynamic. 57 | - each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’ 58 | 59 | ### Querying 60 | - SQL 61 | - Use SQL (structured query language) for defining and manipulating the data. 62 | - NoSQL 63 | - Queries are focused on a collection of documents. 64 | - UnQL (unstructured query language). 65 | - Different databases have different syntax. 66 | 67 | ### Scalability 68 | - SQL 69 | - Vertically scalable (by increasing the horsepower: memory, CPU, etc.) and expensive. 70 | - Horizontally scalable (across multiple servers), but it can be challenging and time-consuming. 71 | - NoSQL 72 | - Horizontally scalable (by adding more servers) and cheap. 73 | 74 | ### ACID 75 | - Atomicity, consistency, isolation, durability 76 | - SQL 77 | - ACID compliant 78 | - Data reliability 79 | - Guarantee of transactions 80 | - NoSQL 81 | - Most sacrifice ACID compliance for performance and scalability. 82 | 83 | ## Which one to use? 84 | ### SQL 85 | - Ensure ACID compliance. 86 | - Reduce anomalies. 87 | - Protect database integrity. 88 | - Data is structured and unchanging. 89 | 90 | ### NoSQL 91 | - Data has little or no structure. 92 | - Make the most of cloud computing and storage. 93 | - Cloud-based storage requires data to be easily spread across multiple servers to scale up.
94 | - Rapid development. 95 | - Frequent updates to the data structure. 96 | --------------------------------------------------------------------------------
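The fixed-schema side of this comparison can be seen concretely with Python's built-in `sqlite3` module, used here as a stand-in relational database; the `users` table and its columns are just an illustration:

```python
import sqlite3

# SQL side: a fixed schema — every row has the same columns, enforced up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, city TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "NYC"))

# A row that violates the declared schema is rejected at write time.
rejected = False
try:
    conn.execute("INSERT INTO users (name) VALUES (NULL)")
except sqlite3.IntegrityError:
    rejected = True  # NOT NULL constraint enforced by the database

rows = conn.execute("SELECT name, city FROM users").fetchall()
print(rows)  # [('alice', 'NYC')]
```

This is the integrity guarantee the "Which one to use?" section points to: the database itself refuses malformed data, which reduces anomalies but means any change to the structure requires an explicit schema migration.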