├── .gitignore ├── Behavioral.md ├── OOP_related.md ├── OS_review ├── OS_review.md └── images │ ├── raid.png │ ├── sys_call.png │ └── timer_interrupt.png ├── Phone Interview.md ├── README.md ├── The Technical Interview Cheat Sheet.md ├── complete_system_design ├── .DS_Store ├── glossary_of_system_design │ ├── .DS_Store │ ├── basics.md │ ├── caching.md │ ├── cap_theorem.md │ ├── consistent_hashing.md │ ├── data_partitioning.md │ ├── indexes.md │ ├── key_characteristics_of_distributed_systems.md │ ├── load_balancing.md │ ├── long_polling_websockets_serversent_events.md │ ├── proxies.md │ ├── redundancy_replication.md │ └── sql_nosql.md ├── images │ ├── HTTP_protocol.png │ ├── Vertical_scaling_vs._Horizontal_scaling.png │ ├── accessing.png │ ├── ajax.png │ ├── cap.png │ ├── cap_theorem.png │ ├── client_loadbalancer_server.png │ ├── database_schema.png │ ├── detailed_component.png │ ├── hash1.png │ ├── hash2.png │ ├── hash3.png │ ├── hash4.png │ ├── hash5.png │ ├── high_level_design.png │ ├── high_level_url_shortening.png │ ├── library_catalog_indexes.png │ ├── loadbalancer2.png │ ├── long_polling.png │ ├── proxy.png │ ├── redundancy.png │ ├── redundant_load_balancer.png │ ├── request_flow1.png │ ├── request_flow10.png │ ├── request_flow11.png │ ├── request_flow2.png │ ├── request_flow3.png │ ├── request_flow4.png │ ├── request_flow5.png │ ├── request_flow6.png │ ├── request_flow7.png │ ├── request_flow8.png │ ├── request_flow9.png │ ├── shortening.png │ ├── sse.png │ ├── url1.png │ ├── url2.png │ ├── url3.png │ ├── url4.png │ ├── url5.png │ ├── url6.png │ ├── url7.png │ ├── url8.png │ ├── url9.png │ └── websockets.png └── system_design_problems │ ├── step_by_step_guide.md │ └── url_shortening.md ├── distributed_system └── review.md ├── probability ├── 002_Xinfeng_Zhou_A_Practical_Guide_To_Quant.docx ├── 4710_review.md ├── Practical_Guide_To_Quant.md ├── Xinfeng_Zhou_A_Practical_Guide_To_Quant.pdf ├── complete_practical_guide_to_quant │ ├── Chap1.md │ ├── Chap2.md │ 
├── Chap4.md │ └── images │ │ ├── 2.1.png │ │ ├── 2.2.1.png │ │ ├── 2.2.2.png │ │ ├── 2.2.png │ │ ├── 2.3.png │ │ ├── 2.4.png │ │ ├── 4.1.png │ │ ├── 4.2.1.png │ │ ├── 4.2.png │ │ ├── 4.3.1.png │ │ ├── 4.3.2.png │ │ ├── 4.3.3.png │ │ ├── 4.3.png │ │ ├── 4.4.1.png │ │ ├── 4.4.2.png │ │ ├── 4.4.png │ │ ├── 4.5.png │ │ ├── Table4.1.png │ │ ├── Table4.2.png │ │ └── Table4.3.png └── images │ └── properties_of_random_variables.png ├── quant_trader └── info.md └── system_design ├── Grokking the system design interview.md ├── System Design.md ├── design instagram ├── design url shortening └── glossary_of_system_design ├── basics.md ├── caching.md ├── cap_theorem.md ├── consistent_hashing.md ├── data_partitioning.md ├── indexes.md ├── key_characteristics_of_distributed_systems.md ├── load_balancing.md ├── long_polling_websockets_serversent_events.md ├── proxies.md ├── redundancy_replication.md └── sql_nosql.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.py 2 | *private* 3 | optiver 4 | hrt 5 | drw 6 | de_shaw 7 | jane_street 8 | bridgewater 9 | imc 10 | facebook 11 | PDT 12 | quant_trader 13 | two_sigma 14 | citadel 15 | google 16 | CS5414_slides -------------------------------------------------------------------------------- /Behavioral.md: -------------------------------------------------------------------------------- 1 | ## Questions to ask 2 | ### Two sigma 3 | - Could you briefly introduce Two Sigma? What do you do, and what does your team do? 4 | - What's your typical day? I mean, as an engineer, how do you usually spend your day? 5 | - Halite, the AI challenge, and the 2016 TS Cup: why do you focus so much on AI and robotics? 6 | - Are the things that engineers build only used internally? Also, do you use any products or software from other companies? 7 | - As for trading and all related operations, are humans involved?
In other words, how much do you trust decisions made by machines? 8 | 9 | ## Behavioral Questions 10 | - introduce your favorite project/resume/yourself? 11 | - Have you read the job description? What kind of person is the role looking for, and are you that person? Walk through the project of yours that best matches the requirements, and close by emphasizing that you are a good fit. 12 | 13 | - your greatest weakness/failure? 14 | - Pick a harmless minor weakness/failure. Which past project made you aware of it? What lesson did you learn from it? In which later project did you apply that lesson, what did you do, and what result did you achieve? 15 | - pushy 16 | - others don't get the opportunity to learn 17 | 18 | - your greatest advantage? 19 | - I know you're impressive, but which of your traits best matches the requirements of this role? Close by emphasizing that this particular strength makes you a good fit for the position. 20 | - fast learner 21 | - like to challenge myself with unfamiliar concepts 22 | 23 | - why our company? 24 | - What is the company's mission? My career goal aligns perfectly with it. What does the role require? My background and abilities align perfectly with those requirements. Close by emphasizing that you are a good fit. 25 | - employees' quality: The bar and the median quality of its hires are significantly higher than at other companies, as it makes quite a bit of money per employee and can therefore afford to hire the most competent programmers. At other companies, I'm used to working with people about as good as I am, whereas at jane street I'll be humbled and always learning from the smarter people around me. 26 | - values personal growth: from what I've heard, managers at jane street place a strong emphasis on employees' personal growth, including programming skills and leadership. With a low turnover rate, js can allocate abundant resources, both financial and human, to training and upgrading new employees so they can take on responsibilities in the coming years. This is especially tempting for someone who is about to graduate from college and is seeking tremendous improvement in their skill set. 27 | - collaborative environment: no levels within the company; all programmers are called 'software developers'.
This encourages a collaborative working environment, as we won't be stressed by our teammates' titles. 28 | - Also, what I personally like about the software developer role at jane street is that it is very directed towards the goal of making profits, instead of working on a product for customers as in traditional tech companies. This motivates me as a programmer to contribute more by writing rigorous code and heavily employing unit tests and integration tests. 29 | - At xxx we leverage technology to solve a variety of problems with high degrees of difficulty: managing scarce bandwidth resources, responding to market events in microseconds or less, automatically pricing diverse sets of financial instruments with extremely low error tolerance, and storing and analyzing terabytes of data. Our systems are built to add to the stability of the market, not detract from it; they must operate at peak efficiency in the most extreme market conditions. These systems must also be simple, flexible, and well-architected so they can quickly change to meet the dynamic needs of our industry. Technologists at Optiver work hard, think creatively, and engineer rigorous solutions that make an immediate impact. 30 | - how did you hear about this position? 31 | - Just answer honestly. I usually say I chatted with the company's engineers at a career fair. The key is to repeat at the end: from what I learned, this role does or requires such-and-such, and I have been doing that or have the related skills, so I'm a good fit. 32 | 33 | - what if your teammate/colleague is hard to work with / not contributing? 34 | What do you do when a teammate/colleague doesn't pull their weight or is hard to get along with? 35 | - Do you proactively communicate with your teammate/colleague on a regular basis? Are you willing, for the team's sake, to help shoulder some of their work? Can you resolve the problem in a thoroughly professional way? 36 | 37 | - what if your teammate/colleague disagrees with you? 38 | What do you do when a teammate/colleague disagrees with your opinion? 39 | - Did you spend some time on your own doing a quantitative comparison? Did you present a detailed report or a strong case to your teammate/colleague to persuade them? Can you communicate effectively? 40 | - do a quantitative comparison first 41 | - talk to them 42 | - talk to my manager 43 | 44 | - how do you define success?
45 | - I usually say success means reaching the goals I set for myself, which is easy to talk about. That turns into: do you set goals for yourself? What are your goals? How are you doing on them? How do you want to develop yourself at this company in the future? (develop my tech stack, gain more domain knowledge, see myself in the position of a senior engineer in xx years) 46 | 47 | - what if you get assigned to a challenging task? 48 | - Will you communicate with your boss? Will you communicate with your colleagues? Will you make reasonable requests? Can you resolve the problem in a thoroughly professional way? 49 | - team contract 50 | - schedule 51 | - distribute work reasonably, making use of everyone's strengths 52 | - agreement on emergencies 53 | - keep reflecting on daily work 54 | - double-check that we've satisfied all requirements when the project is finished 55 | - (myself) don't make assumptions, open to any idea in general 56 | 57 | - what if a task is due earlier? what would you do if you have multiple deadlines upcoming? 58 | - How do you manage your time? For example, set up projects and reminders on your calendar. Do you schedule your time according to priorities? Would you set aside some personal interest for the best interest of your team, e.g. not putting too much time into your own exams for the sake of the capstone project? Would you communicate with others to look for a solution? If you are the team lead and learn that the deadline has moved up, would you take action, e.g. call a meeting immediately and re-plan the project's remaining tasks and milestones? 59 | - figure out priority 60 | - set deadlines on my calendar so I won't miss them 61 | - figure out a balance between personal interest and group interest 62 | 63 | - your favorite and least favorite projects and the teamwork in them 64 | 65 | ## Things to notice 66 | - BQ checks whether you are easy to get along with on a team (both up and down the hierarchy), whether you are genuinely interested in the company (this is very important), and client relationships (if the position is client-facing). Find at least one thing you feel makes this company different from others and especially attractive to you. 67 | - Don't answer only the surface meaning of the question. For example, if asked about your weakness, answering "my weakness is that I sometimes pursue perfection too much" and stopping there definitely won't do. 68 | A better answer: I once had project xxx where I pursued perfection too much and missed the deadline. I learned the lesson that finishing the goal sometimes matters more than perfection. In another project xxx, I allocated resources sensibly; even though some things weren't perfect, I finished the task before the deadline, and I proposed a follow-up plan to my boss to polish the project afterwards, which my boss was very happy with. 69 | -------------------------------------------------------------------------------- /OS_review/images/raid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/OS_review/images/raid.png -------------------------------------------------------------------------------- /OS_review/images/sys_call.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/OS_review/images/sys_call.png -------------------------------------------------------------------------------- /OS_review/images/timer_interrupt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/OS_review/images/timer_interrupt.png -------------------------------------------------------------------------------- /Phone Interview.md: -------------------------------------------------------------------------------- 1 | ## Phone Interview Steps & Tricks 2 | ### Process 3 | - Greeting 4 | - "Hey! How is it going?! God, I don't even remember how long I have been working from home, feels like longer than it actually is." 5 | - "I am actually really excited about the interviews; this is the only chance to talk to someone outside of my team and my roommate. (laugh)" 6 | - "How are things in (interviewer's city)? People are starting to wear masks and take things seriously now; hope it will get better soon!" 7 | 8 | - the flow 9 | - Clarifying questions 10 | - Brainstorm data structures before coding 11 | - Think of the brute force solution first 12 | - consider edge cases!! 13 | - State that you know this is too long / inefficient / blah, state why 14 | - Now try to improve upon it 15 | - You can ask for hints if you're absolutely stuck! 16 | - think out loud, explain your thought process 17 | - brute force 18 | - then optimize it 19 | - ask questions!! 20 | - e.g. constraints? the data type of tree node values? does this API care more about speed or space? 21 | - name functions properly, don't name them `solution()` 22 | - "Should I start implementing it in code, or do you want me to continue optimizing it?"
23 | - explain a bit while coding 24 | - explain time complexity 25 | - go through test cases when finished 26 | - improvements on the same coding question 27 | - Javadoc, unit tests, regression tests, performance tuning, benchmarking, A/B testing 28 | 29 | 30 | 31 | ### Miscellaneous 32 | - over-communication 33 | - "Hey, if I look like I am looking to my right/left, that's because my camera is here and I have my codepad opened on another screen" 34 | - "Hi, if I am silent for a couple of secs/mins, I am just thinking about the question" 35 | - Flattery 36 | - compliment them / dig further when they mention a challenge or headache at work 37 | - talk about your potential solutions 38 | - don't say anything negative about companies you've worked at !!!!!! 39 | - Someone who has merely memorized problems writes code and discusses test cases completely differently from someone who truly understands the underlying principles or mathematical proofs. 40 | 41 | 42 | ### Red Flags 43 | - when the interviewer says we're running out of time: "time is up, thank you for your time with us" 44 | - late 45 | - "interesting" may be perfunctory; they may be tired of listening 46 | - "Thank you! (HR) will reach out to you in the next few days." but with no first-person statement highlighting their own action 47 | - "I am not sure XXX." 48 | - "your code seems good..." 49 | 50 | 51 | ### Questions to ask the interviewer 52 | - Name, job title, maybe email (for reference later on) 53 | - What made you choose company X? 54 | - What's the most satisfying project you've worked on? 55 | - What's a typical day for an intern like? 56 | - Any example projects interns have worked on? 57 | - What's your favorite thing about working for your company? 58 | - How does this company compare to other places you've worked before? 59 | - what's a typical day like (if you care about work-life balance) 60 | - What is the most challenging part of your daily work? 61 | - what do you expect from a new hire/intern at my level in the first half year? 62 | - what's the most unique part about working at xxx that you've never experienced before?
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CS_interview_cheatsheet 2 | help me find a job plssss 3 | -------------------------------------------------------------------------------- /complete_system_design/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/.DS_Store -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/glossary_of_system_design/.DS_Store -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/basics.md: -------------------------------------------------------------------------------- 1 | Basics 2 | ==== 3 | 4 | # text 5 | Whenever we are designing a large system, we need to consider a few things: 6 | 7 | What are the different architectural pieces that can be used? 8 | How do these pieces work with each other? 9 | How can we best utilize these pieces: what are the right tradeoffs? 10 | Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save valuable time and resources in the future. In the following chapters, we will try to define some of the core building blocks of scalable systems. Familiarizing yourself with these concepts will greatly help in understanding distributed systems.
In the next section, we will go through Consistent Hashing, CAP Theorem, Load Balancing, Caching, Data Partitioning, Indexes, Proxies, Queues, Replication, and choosing between SQL vs. NoSQL. 11 | 12 | Let's start with the Key Characteristics of Distributed Systems. 13 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/caching.md: -------------------------------------------------------------------------------- 1 | Caching 2 | ==== 3 | # keypoints 4 | - Take advantage of the locality of reference principle: recently requested data is likely to be requested again. 5 | - Caches exist at all levels in the architecture, but are often found at the level nearest to the front end. 6 | 7 | ## Application server cache 8 | - Cache placed on a request layer node. 9 | - When a request layer node is expanded to many nodes 10 | - Load balancer randomly distributes requests across the nodes. 11 | - The same request can go to different nodes. 12 | - Increases cache misses. 13 | - Solutions: 14 | - Global caches 15 | - Distributed caches 16 | 17 | ## Distributed cache 18 | - Each request layer node owns part of the cached data. 19 | - Entire cache is divided up using a consistent hashing function. 20 | - Pro 21 | - Cache space can be increased easily by adding more nodes to the request pool. 22 | - Con 23 | - A missing node leads to lost cache data. 24 | 25 | ## Global cache 26 | - A server or file store that is faster than the original store, and accessible by all request layer nodes. 27 | - Two common forms 28 | - Cache server handles cache misses. 29 | - Used by most applications. 30 | - Request nodes handle cache misses. 31 | - Have a large percentage of the hot data set in the cache. 32 | - An architecture where the files stored in the cache are static and shouldn't be evicted.
33 | - The application logic understands the eviction strategy or hot spots better than the cache. 34 | 35 | ## Content distribution network (CDN) 36 | - For sites serving large amounts of static media. 37 | - Process 38 | - A request first asks the CDN for a piece of static media. 39 | - CDN serves that content if it has it locally available. 40 | - If content isn't available, the CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user. 41 | - If the system is not large enough for a CDN, it can be built like this: 42 | - Serve static media off a separate subdomain using a lightweight HTTP server (e.g. Nginx). 43 | - Cut over the DNS from this subdomain to a CDN later. 44 | 45 | ## Cache invalidation 46 | - Keep the cache coherent with the source of truth. Invalidate the cache when the source of truth has changed. 47 | - Write-through cache 48 | - Data is written into the cache and permanent storage at the same time. 49 | - Pro 50 | - Fast retrieval, complete data consistency, robust to system disruptions. 51 | - Con 52 | - Higher latency for write operations. 53 | - Write-around cache 54 | - Data is written to permanent storage, not the cache. 55 | - Pro 56 | - Reduces flooding the cache with writes that are never re-read. 57 | - Con 58 | - A query for recently written data creates a cache miss and higher latency. 59 | - Write-back cache 60 | - Data is only written to the cache. 61 | - Write to the permanent storage is done later on. 62 | - Pro 63 | - Low latency, high throughput for write-intensive applications. 64 | - Con 65 | - Risk of data loss in case of system disruptions.
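The write-through vs. write-back trade-off in the keypoints above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a real cache library: `backing_store` is just a dict standing in for a database, and all class and method names are made up for the example.

```python
class WriteThroughCache:
    """Writes go to the cache and the backing store together:
    fully consistent, but every write pays the store's latency."""
    def __init__(self, backing_store):
        self.cache = {}
        self.store = backing_store

    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value      # synchronous second write -> higher latency

    def read(self, key):
        if key not in self.cache:    # miss: fall back to the slower store
            self.cache[key] = self.store[key]
        return self.cache[key]


class WriteBackCache:
    """Writes go to the cache only and are flushed later:
    fast writes, but dirty entries are lost if the node crashes."""
    def __init__(self, backing_store):
        self.cache = {}
        self.dirty = set()           # keys not yet persisted
        self.store = backing_store

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)          # confirmed to the client before persisting

    def flush(self):
        # done on an interval or under certain conditions
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

In the write-back variant, data written but not yet flushed exists only in `self.cache`, which is exactly the data-loss risk the notes call out.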
66 | 67 | ## Cache eviction policies 68 | - FIFO: first in first out 69 | - LIFO: last in first out 70 | - LRU: least recently used 71 | - MRU: most recently used 72 | - LFU: least frequently used 73 | - RR: random replacement 74 | 75 | 76 | 77 | # text 78 | - Load balancing helps you scale horizontally across an ever-increasing number of servers, but caching will enable you to make vastly better use of the resources you already have as well as making otherwise unattainable product requirements feasible. Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again. They are used in almost every layer of computing: hardware, operating systems, web browsers, web applications, and more. A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end where they are implemented to return data quickly without taxing downstream levels. 79 | 80 | ## Application server cache 81 | - Placing a cache directly on a request layer node enables the local storage of response data. Each time a request is made to the service, the node will quickly return local cached data if it exists. If it is not in the cache, the requesting node will query the data from disk. The cache on one request layer node could also be located both in memory (which is very fast) and on the node’s local disk (faster than going to network storage). 82 | - What happens when you expand this to many nodes? If the request layer is expanded to multiple nodes, it’s still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses. 
Two choices for overcoming this hurdle are global caches and distributed caches. 83 | 84 | ## Content Distribution Network (CDN) 85 | - CDNs are a kind of cache that comes into play for sites serving large amounts of static media. In a typical CDN setup, a request will first ask the CDN for a piece of static media; the CDN will serve that content if it has it locally available. If it isn't available, the CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user. 86 | - If the system we are building isn't yet large enough to have its own CDN, we can ease a future transition by serving the static media off a separate subdomain (e.g. static.yourservice.com) using a lightweight HTTP server like Nginx, and cut over the DNS from our servers to a CDN later. 87 | 88 | ## Cache Invalidation 89 | - While caching is fantastic, it does require some maintenance to keep the cache coherent with the source of truth (e.g., the database). If the data is modified in the database, it should be invalidated in the cache; if not, this can cause inconsistent application behavior. 90 | 91 | - Solving this problem is known as cache invalidation; there are three main schemes that are used: 92 | 93 | - Write-through cache 94 | - Under this scheme, data is written into the cache and the corresponding database at the same time. The cached data allows for fast retrieval and, since the same data gets written in the permanent storage, we will have complete data consistency between the cache and the storage. Also, this scheme ensures that nothing will get lost in case of a crash, power failure, or other system disruptions. 95 | - Although write-through minimizes the risk of data loss, every write operation must be done twice before returning success to the client, so this scheme has the disadvantage of higher latency for write operations.
96 | 97 | - Write-around cache 98 | - This technique is similar to write through cache, but data is written directly to permanent storage, bypassing the cache. This can reduce the cache being flooded with write operations that will not subsequently be re-read, but has the disadvantage that a read request for recently written data will create a “cache miss” and must be read from slower back-end storage and experience higher latency. 99 | 100 | - Write-back cache 101 | - Under this scheme, data is written to cache alone and completion is immediately confirmed to the client. The write to the permanent storage is done after specified intervals or under certain conditions. This results in low latency and high throughput for write-intensive applications, however, this speed comes with the risk of data loss in case of a crash or other adverse event because the only copy of the written data is in the cache. 102 | 103 | ## Cache eviction policies 104 | - First In First Out (FIFO) 105 | - The cache evicts the first block accessed first without any regard to how often or how many times it was accessed before. 106 | - Last In First Out (LIFO) 107 | - The cache evicts the block accessed most recently first without any regard to how often or how many times it was accessed before. 108 | - Least Recently Used (LRU) 109 | - Discards the least recently used items first. 110 | - Most Recently Used (MRU) 111 | - Discards, in contrast to LRU, the most recently used items first. 112 | - Least Frequently Used (LFU) 113 | - Counts how often an item is needed. Those that are used least often are discarded first. 114 | - Random Replacement (RR) 115 | - Randomly selects a candidate item and discards it to make space when necessary. 
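As a rough illustration of LRU, the most commonly used of the eviction policies listed above, here is a minimal Python sketch built on `collections.OrderedDict`. It is a toy under stated assumptions (the class name and capacity are invented for the example), not production code.

```python
from collections import OrderedDict

class LRUCache:
    """On overflow, discard the least recently used entry,
    which sits at the front of the ordered dict."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                        # cache miss
        self.data.move_to_end(key)             # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)      # evict the LRU entry
```

For example, with capacity 2, after `put("a", 1)`, `put("b", 2)`, `get("a")`, then `put("c", 3)`, the evicted key is `"b"`: it is the least recently used at the moment of overflow. MRU, FIFO, and LFU differ only in which entry `put` chooses to discard.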
-------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/cap_theorem.md: -------------------------------------------------------------------------------- 1 | CAP Theorem 2 | 3 | # keypoints 4 | [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) 5 | ==== 6 | - it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP) 7 | - Consistency 8 | - All nodes see the same data at the same time 9 | - achieved by updating several nodes before further reads 10 | - every read receives the most recent write or an error 11 | - Availability 12 | - every request receives a response on success/failure 13 | - achieved by replicating the data across different servers 14 | - Partition tolerance 15 | - system continues to work despite message loss or partial failure 16 | - can sustain any amount of network failure 17 | - the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes 18 | - CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability 19 | - CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made. 20 | - [ACID](https://en.wikipedia.org/wiki/ACID) databases choose consistency over availability. 21 | - [BASE](https://en.wikipedia.org/wiki/Eventual_consistency) systems choose availability over consistency. 22 | 23 | # text 24 | - CAP theorem states that it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP): Consistency, Availability, and Partition tolerance. 
When we design a distributed system, trading off among CAP is almost the first thing we want to consider. CAP theorem says while designing a distributed system we can pick only two of the following three options: 25 | - Consistency 26 | - All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads. 27 | - Availability 28 | - Every request gets a response on success/failure. Availability is achieved by replicating the data across different servers. 29 | - Partition tolerance 30 | - The system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages. 31 | ![cap](../images/cap.png) 32 | - We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failures. We can only build a system that has any two of these three properties. Because, to be consistent, all nodes should see the same set of updates in the same order. But if the network loses a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available. 
33 | 34 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/consistent_hashing.md: -------------------------------------------------------------------------------- 1 | Consistent Hashing 2 | ==== 3 | # keypoints 4 | 5 | - Distributed Hash Table (DHT) 6 | - index = hash_function(key) 7 | - distributed caching system 8 | - n cache servers, if index = key % n 9 | - problem 10 | - not horizontally scalable 11 | - when adding a new cache host, all existing mappings are broken 12 | - may not be load balanced 13 | 14 | ## consistent hashing 15 | - minimize reorganization when nodes are added or removed 16 | - only k/n keys need to be remapped 17 | - objects mapped to the same host if possible 18 | - if a host is removed from the system 19 | - objects on that host are shared by other hosts 20 | - if a new host is added 21 | - it takes its share from a few hosts 22 | 23 | ## how it works 24 | - given a list of cache servers 25 | - hash them to integers in the range 26 | - hash a key to a single integer 27 | - move clockwise on the ring 28 | - until finding the first cache 29 | - that cache is the one containing the key 30 | - to add a new server D 31 | - keys originally at C are split 32 | - some are shifted to D 33 | - to remove a server A 34 | - all keys originally mapped to A are remapped to B 35 | - problem: real data is randomly distributed, so it might not be uniform 36 | - add virtual replicas for each cache server 37 | - map each cache to multiple points on the ring, i.e. replicas 38 | - each cache is associated with multiple portions of the ring 39 | 40 | # text 41 | - Distributed Hash Table (DHT) is one of the fundamental components used in distributed scalable systems. Hash Tables need a key, a value, and a hash function, where the hash function maps the key to a location where the value is stored. 42 | - index = hash_function(key) 43 | - Suppose we are designing a distributed caching system.
Given ‘n’ cache servers, an intuitive hash function would be ‘key % n’. It is simple and commonly used. But it has two major drawbacks: 44 | - It is NOT horizontally scalable. Whenever a new cache host is added to the system, all existing mappings are broken. It will be a pain point in maintenance if the caching system contains lots of data. Practically, it becomes difficult to schedule a downtime to update all caching mappings. 45 | - It may NOT be load balanced, especially for non-uniformly distributed data. In practice, it can be easily assumed that the data will not be distributed uniformly. For the caching system, it translates into some caches becoming hot and saturated while the others idle and are almost empty. 46 | - In such situations, consistent hashing is a good way to improve the caching system. 47 | 48 | ## What is Consistent Hashing? 49 | - Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It allows us to distribute data across a cluster in such a way that will minimize reorganization when nodes are added or removed. Hence, the caching system will be easier to scale up or scale down. 50 | - In Consistent Hashing, when the hash table is resized (e.g. a new cache host is added to the system), only ‘k/n’ keys need to be remapped where ‘k’ is the total number of keys and ‘n’ is the total number of servers. Recall that in a caching system using the ‘mod’ as the hash function, all keys need to be remapped. 51 | - In Consistent Hashing, objects are mapped to the same host if possible. When a host is removed from the system, the objects on that host are shared by other hosts; when a new host is added, it takes its share from a few hosts without touching other’s shares. 52 | 53 | ## How does it work? 54 | - As a typical hash function, consistent hashing maps a key to an integer. Suppose the output of the hash function is in the range of [0, 256]. 
Imagine that the integers in the range are placed on a ring such that the values are wrapped around. 55 | - Here’s how consistent hashing works: 56 | 57 | - Given a list of cache servers, hash them to integers in the range. 58 | - To map a key to a server, 59 | - Hash it to a single integer. 60 | - Move clockwise on the ring until finding the first cache it encounters. 61 | - That cache is the one that contains the key. See animation below as an example: key1 maps to cache A; key2 maps to cache C. 62 | ![hash1](../images/hash1.png) 63 | ![hash2](../images/hash2.png) 64 | ![hash3](../images/hash3.png) 65 | ![hash4](../images/hash4.png) 66 | ![hash5](../images/hash5.png) 67 | 68 | - To add a new server, say D, keys that were originally residing at C will be split. Some of them will be shifted to D, while other keys will not be touched. 69 | - To remove a cache or, if a cache fails, say A, all keys that were originally mapped to A will fall into B, and only those keys need to be moved to B; other keys will not be affected. 70 | - For load balancing, as we discussed in the beginning, the real data is essentially randomly distributed and thus may not be uniform. It may make the keys on caches unbalanced. 71 | - To handle this issue, we add “virtual replicas” for caches. Instead of mapping each cache to a single point on the ring, we map it to multiple points on the ring, i.e. replicas. This way, each cache is associated with multiple portions of the ring. 72 | - If the hash function “mixes well,” as the number of replicas increases, the keys will be more balanced. 
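The ring described above (hash points sorted on a circle, clockwise lookup, virtual replicas) can be sketched in Python with the standard `bisect` module. This is illustrative only: MD5 is used merely as a stable, well-mixed hash, and the server names, replica count, and class name are assumptions made up for the example.

```python
import bisect
import hashlib

def _hash(key):
    # MD5 used only as a stable, well-mixed integer hash (not for security)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each server is placed at `replicas` virtual points on the ring;
    a key belongs to the first server point clockwise from its hash."""

    def __init__(self, servers=(), replicas=100):
        self.replicas = replicas
        self._points = []    # sorted virtual points on the ring
        self._owner = {}     # point -> server
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            self._points.remove(point)
            del self._owner[point]

    def get(self, key):
        # first point clockwise from hash(key), wrapping past the end
        i = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

Removing a server with this sketch remaps only the keys that server owned, and every other key stays put, which is exactly the k/n remapping property the notes contrast with the mod-n scheme.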
73 | 74 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/data_partitioning.md: -------------------------------------------------------------------------------- 1 | Data Partitioning 2 | ==== 3 | # keypoints 4 | - break up a big database (DB) into many smaller parts 5 | - after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines 6 | 7 | ## Partitioning Methods 8 | - Horizontal partitioning (range based partitioning, data sharding) 9 | - put different rows into different tables 10 | - e.g. 0-1k, 1k-2k, ... 11 | - problem 12 | - if range for partition not chosen carefully, could have unbalanced servers 13 | - Vertical Partitioning 14 | - store tables related to a specific feature in one server 15 | - e.g. server1: insta pics, server2: user info, ... 16 | - problem 17 | - if the app keeps growing, may be necessary to further partition a feature-specific DB across various servers 18 | - Directory Based Partitioning 19 | - create a lookup service which knows your current partitioning scheme 20 | - to find out where a particular data entity resides, query the directory server that holds the mapping from each tuple key to its DB server 21 | 22 | ## Partitioning Criteria 23 | - Key or Hash-based partitioning 24 | - apply a hash function to some key attributes of the entity we are storing -> partition number 25 | - e.g. ID % 100 if we have 100 partitions 26 | - should ensure uniform allocation 27 | - problem 28 | - adding new servers might require rehashing -> downtime for the service 29 | - List partitioning 30 | - each partition assigned a list of values 31 | - to insert a new record, find the partition with the corresponding key 32 | - Round-robin partitioning 33 | - i^th tuple assigned to partition i % n 34 | - Composite partitioning 35 | - combine the above schemes 36 | - e.g. list partitioning -> hash based partitioning 37 | - e.g.
consistent hashing = hash + list partitioning 38 | - when a hash table is resized, only n/m keys need to be remapped on average where n is the number of keys and m is the number of slots 39 | 40 | ## Common Problems of Data Partitioning 41 | - Joins and Denormalization 42 | - if database is partitioned and spread across multiple machines then often not feasible to perform joins 43 | - workaround 44 | - denormalize the database so that queries that previously required joins can be performed from a single table 45 | - but denormalization leads to data inconsistency 46 | - Referential integrity 47 | - enforce data integrity constraints in a partitioned database difficult, e.g. foreign keys 48 | - Rebalancing 49 | - reason to change partition scheme 50 | - data distribution not uniform 51 | - a lot of load on a partition 52 | - solution 53 | - create more DB partitions or rebalance existing partitions 54 | - will incur downtime 55 | - could use directory based partitioning 56 | 57 | 58 | # text 59 | Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers. 60 | 61 | ## Partitioning Methods 62 | - There are many different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are three of the most popular schemes used by various large scale applications. 63 | 64 | - Horizontal partitioning 65 | - In this scheme, we put different rows into different tables. 
For example, if we are storing different places in a table, we can decide that locations with ZIP codes less than 10000 are stored in one table and places with ZIP codes greater than 10000 are stored in a separate table. This is also called range based partitioning, as we are storing different ranges of data in separate tables. Horizontal partitioning is also called Data Sharding. 66 | - The key problem with this approach is that if the value whose range is used for partitioning isn’t chosen carefully, then the partitioning scheme will lead to unbalanced servers. In the previous example, splitting locations based on their ZIP codes assumes that places will be evenly distributed across the different ZIP codes. This assumption is not valid, as there will be a lot more places in a densely populated area like Manhattan than in its suburbs. 67 | 68 | - Vertical Partitioning 69 | - In this scheme, we divide our data to store tables related to a specific feature in their own server. For example, if we are building an Instagram-like application - where we need to store data related to users, photos they upload, and people they follow - we can decide to place user profile information on one DB server, friend lists on another, and photos on a third server. 70 | 71 | - Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, then it may be necessary to further partition a feature-specific DB across various servers (e.g. it would not be possible for a single server to handle all the metadata queries for 10 billion photos by 140 million users). 72 | 73 | - Directory Based Partitioning 74 | - A loosely coupled approach to work around issues mentioned in the above schemes is to create a lookup service which knows your current partitioning scheme and abstracts it away from the DB access code.
So, to find out where a particular data entity resides, we query the directory server that holds the mapping from each tuple key to its DB server. This loosely coupled approach means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without having an impact on the application. 75 | 76 | ## Partitioning Criteria 77 | - Key or Hash-based partitioning 78 | - Under this scheme, we apply a hash function to some key attributes of the entity we are storing; that yields the partition number. For example, suppose we have 100 DB servers and our ID is a numeric value that gets incremented by one each time a new record is inserted; the hash function could then be ‘ID % 100’, which gives us the server number where we can store/read that record. This approach should ensure a uniform allocation of data among servers. The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function, which would require redistribution of data and downtime for the service. A workaround for this problem is to use Consistent Hashing. 79 | 80 | - List partitioning 81 | - In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there. For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries. 82 | 83 | - Round-robin partitioning 84 | - This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the i-th tuple is assigned to partition (i mod n). 85 | 86 | - Composite partitioning 87 | - Under this scheme, we combine any of the above partitioning schemes to devise a new scheme. For example, first applying a list partitioning scheme and then a hash based partitioning.
Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed. 88 | 89 | ## Common Problems of Data Partitioning 90 | - On a partitioned database, there are certain extra constraints on the different operations that can be performed. Most of these constraints are due to the fact that operations across multiple tables or multiple rows in the same table will no longer run on the same server. Below are some of the constraints and additional complexities introduced by partitioning: 91 | 92 | - Joins and Denormalization 93 | - Performing joins on a database which is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database partitions. Such joins will not be efficient, since data has to be compiled from multiple servers. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table. Of course, the service now has to deal with all the perils of denormalization, such as data inconsistency. 94 | 95 | - Referential integrity 96 | - Just as cross-partition queries on a partitioned database are not feasible, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult. 97 | - Most RDBMSs do not support foreign key constraints across databases on different database servers, which means that applications that require referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.
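A quick sketch makes the earlier point about the ‘ID % n’ hash criterion concrete: growing the pool from 4 to 5 servers remaps the vast majority of keys. The numbers and names below are illustrative only:

```python
# Illustrative sketch: measure how many keys 'ID % n' partitioning
# remaps when one server is added to the pool.
def partition(record_id, n_servers):
    return record_id % n_servers  # hash-based partitioning criterion

keys = range(10_000)
before = {k: partition(k, 4) for k in keys}   # 4 servers
after = {k: partition(k, 5) for k in keys}    # add a 5th server
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/{len(keys)} keys must move")  # 8000/10000 keys must move
```

A key keeps its partition only when `k mod 4` equals `k mod 5`, which happens for 4 out of every 20 keys, so 80% of the data has to be shuffled between servers, which is exactly why rehashing forces downtime.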
98 | 99 | - Rebalancing 100 | - There could be many reasons we have to change our partitioning scheme: 101 | - The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition. 102 | - There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos. 103 | - In such cases, either we have to create more DB partitions or have to rebalance existing partitions, which means the partitioning scheme changed and all existing data moved to new locations. Doing this without incurring downtime is extremely difficult. Using a scheme like directory based partitioning does make rebalancing a more palatable experience at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database). 104 | 105 | 106 | Back 107 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/indexes.md: -------------------------------------------------------------------------------- 1 | Indexes 2 | ==== 3 | # keypoints 4 | - a data structure that can be perceived as a table of contents that points us to the location where actual data lives 5 | - Improve the performance of search queries. 6 | - Decrease the write performance bc need to update indices. This performance degradation applies to all insert, update, and delete operations. 7 | 8 | # texts 9 | - Indexes are well known when it comes to databases. Sooner or later there comes a time when database performance is no longer satisfactory. One of the very first things you should turn to when that happens is database indexing. 10 | - The goal of creating an index on a particular table in a database is to make it faster to search through the table and find the row or rows that we want. 
Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records. 11 | 12 | ## Example: A library catalog 13 | - A library catalog is a register that contains the list of books found in a library. The catalog is organized like a database table generally with four columns: book title, writer, subject, and date of publication. There are usually two such catalogs: one sorted by the book title and one sorted by the writer name. That way, you can either think of a writer you want to read and then look through their books or look up a specific book title you know you want to read in case you don’t know the writer’s name. These catalogs are like indexes for the database of books. They provide a sorted list of data that is easily searchable by relevant information. 14 | - Simply put, an index is a data structure that can be perceived as a table of contents that points us to the location where actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index. Assuming a table containing a list of books, the following diagram shows what an index on the ‘Title’ column looks like: 15 | ![library_catalog_indexes](../images/library_catalog_indexes.png) 16 | - Just like a traditional relational data store, we can also apply this concept to larger datasets. The trick with indexes is that we must carefully consider how users will access the data. In the case of data sets that are many terabytes in size, but have very small payloads (e.g., 1 KB), indexes are a necessity for optimizing data access. Finding a small payload in such a large dataset can be a real challenge, since we can’t possibly iterate over that much data in any reasonable time.
Furthermore, it is very likely that such a large data set is spread over several physical devices—this means we need some way to find the correct physical location of the desired data. Indexes are the best way to do this. 17 | 18 | ## How do Indexes decrease write performance? 19 | - An index can dramatically speed up data retrieval but may itself be large due to the additional keys, which slow down data insertion & update. 20 | - When adding rows or making updates to existing rows for a table with an active index, we not only have to write the data but also have to update the index. This will decrease the write performance. This performance degradation applies to all insert, update, and delete operations for the table. For this reason, adding unnecessary indexes on tables should be avoided and indexes that are no longer used should be removed. To reiterate, adding indexes is about improving the performance of search queries. If the goal of the database is to provide a data store that is often written to and rarely read from, in that case, decreasing the performance of the more common operation, which is writing, is probably not worth the increase in performance we get from reading. 21 | For more details, see Database Indexes. 22 | 23 | 24 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/key_characteristics_of_distributed_systems.md: -------------------------------------------------------------------------------- 1 | Key Characteristics of Distributed Systems 2 | ==== 3 | 4 | # keypoints 5 | ## Scalability 6 | - The capability of a system to grow and manage increased demand. 7 | - A system that can continuously evolve to support growing amount of work is scalable. 8 | - Horizontal scaling: by adding more servers into the pool of resources. 9 | - Vertical scaling: by adding more resource (CPU, RAM, storage, etc) to an existing server. 
This approach comes with downtime and an upper limit. 10 | 11 | ## Reliability 12 | - Reliability is the probability that a system performs its intended function, without failure, over a given period. 13 | - A distributed system is reliable if it keeps delivering its service even when one or multiple components fail. 14 | - Reliability is achieved through redundancy of components and data (remove every single point of failure). 15 | 16 | ## Availability 17 | - Availability is the time a system remains operational to perform its required function in a specific period. 18 | - Measured by the percentage of time that a system remains operational under normal conditions. 19 | - A reliable system is available. 20 | - An available system is not necessarily reliable. 21 | - A system with a security hole is available when there is no security attack. 22 | 23 | ## Efficiency 24 | - Latency: response time, the delay to obtain the first piece of data. 25 | - Bandwidth: throughput, amount of data delivered in a given time. 26 | 27 | ## Serviceability / Manageability 28 | - Ease of operating and maintaining the system. 29 | - Simplicity and speed with which a system can be repaired or maintained. 30 | 31 | # text 32 | 33 | Key characteristics of a distributed system include Scalability, Reliability, Availability, Efficiency, and Manageability. Let’s briefly review them: 34 | 35 | ## Scalability 36 | - Scalability is the capability of a system, process, or a network to grow and manage increased demand. Any distributed system that can continuously evolve in order to support the growing amount of work is considered to be scalable. 37 | - A system may have to scale for many reasons, like increased data volume or an increased amount of work, e.g., number of transactions. A scalable system should achieve this scaling without performance loss. 38 | - Generally, the performance of a system, although designed (or claimed) to be scalable, declines with the system size due to the management or environment cost.
For instance, network speed may become slower because machines tend to be far apart from one another. More generally, some tasks may not be distributed, either because of their inherent atomic nature or because of some flaw in the system design. At some point, such tasks would limit the speed-up obtained by distribution. A scalable architecture avoids this situation and attempts to balance the load on all the participating nodes evenly. 39 | - Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more servers into your pool of resources, whereas Vertical scaling means that you scale by adding more power (CPU, RAM, Storage, etc.) to an existing server. 40 | - With horizontal scaling, it is often easier to scale dynamically by adding more machines into the existing pool; vertical scaling is usually limited to the capacity of a single server, and scaling beyond that capacity often involves downtime and comes with an upper limit. 41 | - Good examples of horizontal scaling are Cassandra and MongoDB, as they both provide an easy way to scale horizontally by adding more machines to meet growing needs. Similarly, a good example of vertical scaling is MySQL, as it allows for an easy way to scale vertically by switching from smaller to bigger machines. However, this process often involves downtime. 42 | - ![Vertical scaling vs. Horizontal scaling](../images/Vertical_scaling_vs._Horizontal_scaling.png) 43 | 44 | ## Reliability 45 | - By definition, reliability is the probability that a system keeps performing its intended function, without failure, over a given period. In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail. Reliability represents one of the main characteristics of any distributed system, since in such systems any failing machine can always be replaced by another healthy one, ensuring the completion of the requested task.
46 | - Take the example of a large electronic commerce store (like Amazon), where one of the primary requirements is that any user transaction should never be canceled due to a failure of the machine that is running that transaction. For instance, if a user has added an item to their shopping cart, the system is expected not to lose it. A reliable distributed system achieves this through redundancy of both the software components and data. If the server carrying the user’s shopping cart fails, another server that has the exact replica of the shopping cart should replace it. 47 | - Obviously, redundancy has a cost, and a reliable system has to pay that to achieve such resilience for services by eliminating every single point of failure. 48 | 49 | ## Availability 50 | - By definition, availability is the time a system remains operational to perform its required function in a specific period. It is a simple measure of the percentage of time that a system, service, or a machine remains operational under normal conditions. An aircraft that can be flown for many hours a month without much downtime can be said to have a high availability. Availability takes into account maintainability, repair time, spares availability, and other logistics considerations. If an aircraft is down for maintenance, it is considered not available during that time. 51 | - Reliability is availability over time considering the full range of possible real-world conditions that can occur. An aircraft that can make it through any possible weather safely is more reliable than one that has vulnerabilities to possible conditions. 52 | - Reliability Vs. Availability 53 | - If a system is reliable, it is available. However, if it is available, it is not necessarily reliable.
In other words, high reliability contributes to high availability, but it is possible to achieve a high availability even with an unreliable product by minimizing repair time and ensuring that spares are always available when they are needed. Let’s take the example of an online retail store that has 99.99% availability for the first two years after its launch. However, the system was launched without any information security testing. The customers are happy with the system, but they don’t realize that it isn’t very reliable as it is vulnerable to likely risks. In the third year, the system experiences a series of information security incidents that suddenly result in extremely low availability for extended periods of time. This results in reputational and financial damage to the customers. 54 | 55 | ## Efficiency 56 | - To understand how to measure the efficiency of a distributed system, let’s assume we have an operation that runs in a distributed manner and delivers a set of items as result. Two standard measures of its efficiency are the response time (or latency) that denotes the delay to obtain the first item and the throughput (or bandwidth) which denotes the number of items delivered in a given time unit (e.g., a second). The two measures correspond to the following unit costs: 57 | - Number of messages globally sent by the nodes of the system regardless of the message size. 58 | Size of messages representing the volume of data exchanges. 59 | The complexity of operations supported by distributed data structures (e.g., searching for a specific key in a distributed index) can be characterized as a function of one of these cost units. Generally speaking, the analysis of a distributed structure in terms of ‘number of messages’ is over-simplistic. 
It ignores the impact of many aspects, including the network topology, the network load, and its variation, the possible heterogeneity of the software and hardware components involved in data processing and routing, etc. However, it is quite difficult to develop a precise cost model that would accurately take into account all these performance factors; therefore, we have to live with rough but robust estimates of the system behavior. 60 | 61 | ## Serviceability or Manageability 62 | - Another important consideration while designing a distributed system is how easy it is to operate and maintain. Serviceability or manageability is the simplicity and speed with which a system can be repaired or maintained; if the time to fix a failed system increases, then availability will decrease. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate (i.e., does it routinely operate without failure or exceptions?). 63 | - Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when the system experiences a system fault -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/load_balancing.md: -------------------------------------------------------------------------------- 1 | Load Balancing (LB) 2 | ==== 3 | # keypoints 4 | Help scale horizontally across an ever-increasing number of servers. 
5 | 6 | ## LB locations 7 | - Between user and web server 8 | - Between web servers and an internal platform layer (application servers, cache servers) 9 | - Between internal platform layer and database 10 | 11 | ## Algorithms 12 | - Least connection 13 | - Least response time 14 | - Least bandwidth 15 | - Round robin 16 | - Weighted round robin 17 | - IP hash 18 | 19 | ## Implementation 20 | - Smart clients 21 | - Hardware load balancers 22 | - Software load balancers 23 | 24 | # text 25 | - Load Balancer (LB) is another critical component of any distributed system. It helps to spread the traffic across a cluster of servers to improve responsiveness and availability of applications, websites or databases. LB also keeps track of the status of all the resources while distributing requests. If a server is not available to take new requests or is not responding or has elevated error rate, LB will stop sending traffic to such a server. 26 | - Typically a load balancer sits between the client and the server accepting incoming network and application traffic and distributing the traffic across multiple backend servers using various algorithms. By balancing application requests across multiple servers, a load balancer reduces individual server load and prevents any one application server from becoming a single point of failure, thus improving overall application availability and responsiveness. 27 | ![client_loadbalancer_server](../images/client_loadbalancer_server.png) 28 | - To utilize full scalability and redundancy, we can try to balance the load at each layer of the system. We can add LBs at three places: 29 | - Between the user and the web server 30 | - Between web servers and an internal platform layer, like application servers or cache servers 31 | - Between internal platform layer and database. 32 | ![loadbalancer2](../images/loadbalancer2.png) 33 | 34 | ## Benefits of Load Balancing 35 | - Users experience faster, uninterrupted service. 
Users won’t have to wait for a single struggling server to finish its previous tasks. Instead, their requests are immediately passed on to a more readily available resource. 36 | - Service providers experience less downtime and higher throughput. Even a full server failure won’t affect the end user experience as the load balancer will simply route around it to a healthy server. 37 | - Load balancing makes it easier for system administrators to handle incoming requests while decreasing wait time for users. 38 | - Smart load balancers provide benefits like predictive analytics that determine traffic bottlenecks before they happen. As a result, the smart load balancer gives an organization actionable insights. These are key to automation and can help drive business decisions. 39 | - System administrators experience fewer failed or stressed components. Instead of a single device performing a lot of work, load balancing has several devices perform a little bit of work. 40 | 41 | ## Load Balancing Algorithms 42 | - How does the load balancer choose the backend server? 43 | Load balancers consider two factors before forwarding a request to a backend server. They will first ensure that the server they choose is actually responding appropriately to requests and then use a pre-configured algorithm to select one from the set of healthy servers. We will discuss these algorithms shortly. 44 | 45 | - Health Checks 46 | - Load balancers should only forward traffic to “healthy” backend servers. To monitor the health of a backend server, “health checks” regularly attempt to connect to backend servers to ensure that servers are listening. If a server fails a health check, it is automatically removed from the pool, and traffic will not be forwarded to it until it responds to the health checks again. 47 | 48 | - There is a variety of load balancing methods, which use different algorithms for different needs. 
49 | 50 | - Least Connection Method 51 | — This method directs traffic to the server with the fewest active connections. This approach is quite useful when there are a large number of persistent client connections which are unevenly distributed between the servers. 52 | - Least Response Time Method 53 | — This algorithm directs traffic to the server with the fewest active connections and the lowest average response time. 54 | - Least Bandwidth Method 55 | - This method selects the server that is currently serving the least amount of traffic measured in megabits per second (Mbps). 56 | - Round Robin Method 57 | — This method cycles through a list of servers and sends each new request to the next server. When it reaches the end of the list, it starts over at the beginning. It is most useful when the servers are of equal specification and there are not many persistent connections. 58 | - Weighted Round Robin Method 59 | — The weighted round-robin scheduling is designed to better handle servers with different processing capacities. Each server is assigned a weight (an integer value that indicates the processing capacity). Servers with higher weights receive new connections before those with less weights and servers with higher weights get more connections than those with less weights. 60 | - IP Hash 61 | — Under this method, a hash of the IP address of the client is calculated to redirect the request to a server. 62 | 63 | ## Redundant Load Balancers 64 | - The load balancer can be a single point of failure; to overcome this, a second load balancer can be connected to the first to form a cluster. Each LB monitors the health of the other and, since both of them are equally capable of serving traffic and failure detection, in the event the main load balancer fails, the second load balancer takes over. 
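As a rough illustration of the selection step, here is a Python sketch of round-robin selection gated by a health check, per the algorithms and health checks described above. The names are illustrative only; this is not the API of any real load balancer:

```python
import itertools

# Illustrative sketch: round-robin selection that skips backends
# failing their health check.
class RoundRobinBalancer:
    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy          # health-check callback
        self._cycle = itertools.cycle(self.servers)

    def pick(self):
        # Try each server at most once per pick.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_healthy(server):       # forward only to healthy servers
                return server
        raise RuntimeError("no healthy backend available")

lb = RoundRobinBalancer(["s1", "s2", "s3"], is_healthy=lambda s: s != "s2")
print([lb.pick() for _ in range(4)])  # s2 is skipped: ['s1', 's3', 's1', 's3']
```

Swapping the `pick` body for a minimum over active connection counts, response times, or a hash of the client IP would give the other methods listed above; the health-check gate stays the same in every variant.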
65 | - ![redundant load balancer](../images/redundant_load_balancer.png) 66 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/long_polling_websockets_serversent_events.md: -------------------------------------------------------------------------------- 1 | Long-Polling vs WebSockets vs Server-Sent Events 2 | ==== 3 | 4 | # keypoints 5 | - communication protocols 6 | - long-polling 7 | - WebSockets 8 | - Server-Sent Events 9 | - between a client like a web browser and a web server 10 | - sequence of events for a regular HTTP request 11 | - client opens a connection, requests data from server 12 | - server calculates response 13 | - server sends response back to the client 14 | 15 | ## Ajax Polling 16 | - client repeatedly polls/requests a server for data 17 | - If no data is available, an empty response is returned 18 | - steps 19 | - client opens a connection, requests data from the server using regular HTTP. 20 | - requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 21 | - server calculates the response and sends it back 22 | - client repeats the above three steps periodically 23 | - problem 24 | - client keeps asking the server for new data, a lot of responses are empty -> HTTP overhead 25 | 26 | ## HTTP Long-Polling 27 | - server pushes information to client whenever the data is available.
28 | - client requests as in normal polling, but expects the server may not respond immediately 29 | - if server has no data available, then hold the request instead of sending an empty response, until a timeout 30 | - once data available, full response sent 31 | - client immediately re-requests, so server always has a waiting request 32 | - client has to reconnect periodically after connection closed due to timeouts 33 | 34 | ## WebSockets 35 | - persistent connection between client and server 36 | - both parties can send data at any time 37 | - establishes WebSocket connection through WebSocket handshake 38 | - if it succeeds, client and server can exchange data 39 | - enables communication with low overheads 40 | - real-time data transfer 41 | 42 | ## Server-Sent Events (SSEs) 43 | - client establishes a persistent & long-term connection with the server 44 | - client requires another tech/protocol to send data to server 45 | - steps 46 | - client requests data using regular HTTP 47 | - requested webpage opens a connection to server 48 | - server sends data to client if new info available 49 | - best when real-time traffic needed 50 | - or server generates data in a loop 51 | 52 | 53 | # text 54 | - Long-Polling, WebSockets, and Server-Sent Events are popular communication protocols between a client like a web browser and a web server. First, let’s start with understanding what a standard HTTP web request looks like. The following is the sequence of events for a regular HTTP request: 55 | - The client opens a connection and requests data from the server. 56 | - The server calculates the response. 57 | - The server sends the response back to the client on the opened request. 58 | - ![HTTP protocol](../images/HTTP_protocol.png) 59 | 60 | ## Ajax Polling 61 | - Polling is a standard technique used by the vast majority of AJAX applications. The basic idea is that the client repeatedly polls (or requests) a server for data. The client makes a request and waits for the server to respond with data.
If no data is available, an empty response is returned. 62 | - The client opens a connection and requests data from the server using regular HTTP. 63 | - The requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 64 | - The server calculates the response and sends it back, just like regular HTTP traffic. 65 | - The client repeats the above three steps periodically to get updates from the server. 66 | - The problem with Polling is that the client has to keep asking the server for any new data. As a result, a lot of responses are empty, creating HTTP overhead. 67 | - ![Ajax Polling Protocol](../images/ajax.png) 68 | 69 | ## HTTP Long-Polling 70 | - This is a variation of the traditional polling technique that allows the server to push information to a client whenever the data is available. With Long-Polling, the client requests information from the server exactly as in normal polling, but with the expectation that the server may not respond immediately. That’s why this technique is sometimes referred to as a “Hanging GET”. 71 | - If the server does not have any data available for the client, instead of sending an empty response, the server holds the request and waits until some data becomes available. 72 | - Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server so that the server will almost always have an available waiting request that it can use to deliver data in response to an event. 73 | - The basic life cycle of an application using HTTP Long-Polling is as follows: 74 | - The client makes an initial request using regular HTTP and then waits for a response. 75 | - The server delays its response until an update is available or a timeout has occurred. 76 | - When an update is available, the server sends a full response to the client.
77 | - The client typically sends a new long-poll request, either immediately upon receiving a response or after a pause to allow an acceptable latency period. 78 | - Each Long-Poll request has a timeout. The client has to reconnect periodically after the connection is closed due to timeouts. 79 | 80 | ## WebSockets 81 | - WebSocket provides full-duplex communication channels over a single TCP connection. It provides a persistent connection between a client and a server that both parties can use to start sending data at any time. The client establishes a WebSocket connection through a process known as the WebSocket handshake. If the process succeeds, then the server and client can exchange data in both directions at any time. The WebSocket protocol enables communication between a client and a server with lower overheads, facilitating real-time data transfer from and to the server. This is made possible by providing a standardized way for the server to send content to the browser without being asked by the client and allowing for messages to be passed back and forth while keeping the connection open. In this way, a two-way (bi-directional) ongoing conversation can take place between a client and a server. 82 | 83 | ## Server-Sent Events (SSEs) 84 | - Under SSEs the client establishes a persistent and long-term connection with the server. The server uses this connection to send data to a client. If the client wants to send data to the server, it would require the use of another technology/protocol to do so. 85 | - Client requests data from a server using regular HTTP. 86 | - The requested webpage opens a connection to the server. 87 | - The server sends the data to the client whenever there’s new information available. 88 | - SSEs are best when we need real-time traffic from the server to the client or if the server is generating data in a loop and will be sending multiple events to the client.
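The long-polling lifecycle described above can be sketched as a simple client loop. This is a minimal illustration only: `fetch_update` is a hypothetical stand-in for the blocking HTTP request a real client would make, not a real API.

```python
import time

def fetch_update(timeout):
    """Hypothetical stand-in for an HTTP long-poll request.

    A real client would issue a GET and block until the server
    responds with data, or until `timeout` seconds pass (None)."""
    time.sleep(0)          # a real request would block here
    return "update"

def long_poll(handle, max_requests=3, timeout=30):
    """Re-issue the request immediately after each response, so the
    server almost always has a waiting request it can answer."""
    for _ in range(max_requests):
        data = fetch_update(timeout)
        if data is not None:   # server answered before the timeout
            handle(data)
        # on timeout (data is None), simply reconnect and wait again
```

The key point the sketch shows is the immediate re-request after each response, which is what distinguishes long-polling from plain periodic polling.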
89 | - ![Server Sent Events Protocol](../images/sse.png) -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/proxies.md: -------------------------------------------------------------------------------- 1 | Proxies 2 | ==== 3 | 4 | # keypoints 5 | - A proxy server is an intermediary piece of hardware / software sitting between client and backend server. 6 | - Filter requests 7 | - Log requests 8 | - Transform requests 9 | - adding/removing headers 10 | - encrypting/decrypting 11 | - compressing a resource 12 | - cache 13 | - if multiple clients access a particular resource, proxy server can cache it 14 | 15 | ## Proxy Server Types 16 | - Open Proxy 17 | - accessible by any Internet user 18 | - Anonymous Proxy 19 | - reveals its identity as a server but does not disclose the initial IP address 20 | - Transparent Proxy 21 | - identifies itself 22 | - with the support of HTTP headers, the first IP address can be viewed 23 | - can cache the websites 24 | - Reverse Proxy 25 | - retrieves resources on behalf of a client from servers 26 | - then returned to the client 27 | 28 | # text 29 | - A proxy server is an intermediate server between the client and the back-end server. Clients connect to proxy servers to make a request for a service like a web page, file, connection, etc. In short, a proxy server is a piece of software or hardware that acts as an intermediary for requests from clients seeking resources from other servers. 30 | - Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compressing a resource). Another advantage of a proxy server is that its cache can serve a lot of requests. If multiple clients access a particular resource, the proxy server can cache it and serve it to all the clients without going to the remote server.
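The filter/log/cache roles described above can be illustrated with a toy sketch. Everything here is hypothetical (the blocklist, the `fetch` and `log` callbacks); it shows only the decision logic a proxy applies per request, not a real proxy implementation.

```python
BLOCKED_HOSTS = {"ads.example.com"}   # hypothetical filter rule

def proxy_request(url, cache, fetch, log=print):
    """Serve `url` through the proxy: filter, log, and answer from the
    cache when possible; otherwise forward to the remote via `fetch`."""
    host = url.split("/")[2]          # crude host extraction for the demo
    if host in BLOCKED_HOSTS:
        log(f"BLOCKED {url}")         # filtered request, never forwarded
        return None
    if url in cache:
        log(f"HIT {url}")             # served from the proxy cache
        return cache[url]
    log(f"MISS {url}")
    body = fetch(url)                 # go to the remote server
    cache[url] = body                 # cache for subsequent clients
    return body
```

After the first client fetches a resource, every later client asking for the same URL is answered from the cache without touching the remote server.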
31 | ![proxy](../images/proxy.png) 32 | 33 | ## Proxy Server Types 34 | - Proxies can reside on the client’s local server or anywhere between the client and the remote servers. Here are a few famous types of proxy servers: 35 | 36 | - Open Proxy 37 | - An open proxy is a proxy server that is accessible by any Internet user. Generally, a proxy server only allows users within a network group (i.e. a closed proxy) to store and forward Internet services such as DNS or web pages to reduce and control the bandwidth used by the group. With an open proxy, however, any user on the Internet is able to use this forwarding service. There are two famous open proxy types: 38 | 39 | - Anonymous Proxy 40 | - This proxy reveals its identity as a server but does not disclose the initial IP address. Though this proxy server can be discovered easily, it can be beneficial for some users as it hides their IP address. 41 | - Transparent Proxy 42 | - This proxy server again identifies itself, and with the support of HTTP headers, the first IP address can be viewed. The main benefit of using this sort of server is its ability to cache the websites. 43 | - Reverse Proxy 44 | - A reverse proxy retrieves resources on behalf of a client from one or more servers. These resources are then returned to the client, appearing as if they originated from the proxy server itself. 45 | 46 |
10 | - primary gets all updates 11 | - then ripple through to the replica servers 12 | - replica outputs a message if it received the update successfully 13 | - Shared-nothing architecture 14 | - Each node can operate independently of one another. 15 | - No central service managing state or orchestrating activities. 16 | - New servers can be added without special conditions or knowledge. 17 | - No single point of failure. 18 | 19 | # text 20 | - Redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance. For example, if there is only one copy of a file stored on a single server, then losing that server means losing the file. Since losing data is seldom a good thing, we can create duplicate or redundant copies of the file to solve this problem. 21 | - Redundancy plays a key role in removing the single points of failure in the system and provides backups if needed in a crisis. For example, if we have two instances of a service running in production and one fails, the system can failover to the other one. 22 | ![redundancy](../images/redundancy.png) 23 | - Replication means sharing information to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility. 24 | - Replication is widely used in many database management systems (DBMS), usually with a primary-replica relationship between the original and the copies. The primary server gets all the updates, which then ripple through to the replica servers. Each replica outputs a message stating that it has received the update successfully, thus allowing the sending of subsequent updates.
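The primary-replica update flow above can be sketched as an in-memory toy (hypothetical class and method names, not a real DBMS): the primary applies every write locally, ripples it to each replica, and collects their acknowledgements.

```python
class Replica:
    """A copy of the data that only receives updates from the primary."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True            # ack: update received successfully

class Primary:
    """All writes go to the primary, then ripple through to replicas."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        # Send the update to every replica and collect acks. A real
        # system would wait for these acks before sending the next
        # update, which is what keeps the copies consistent.
        return all(replica.apply(key, value) for replica in self.replicas)
```

If any replica failed to acknowledge, `write` would return `False`, signalling that the update has not yet safely rippled to all copies.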
25 | 26 | -------------------------------------------------------------------------------- /complete_system_design/glossary_of_system_design/sql_nosql.md: -------------------------------------------------------------------------------- 1 | SQL vs. NoSQL 2 | ==== 3 | # keypoints 4 | ## sql (relational databases) 5 | - structured 6 | - have predefined schemas 7 | - e.g. phone books that store phone numbers and addresses 8 | - store data in rows and columns 9 | - row contains information about one entity 10 | - column contains separate data points 11 | 12 | ## NoSQL (non-relational databases) 13 | - unstructured, distributed 14 | - have a dynamic schema 15 | - e.g. file folders that hold everything from a person’s address to their Facebook ‘likes’ 16 | 17 | ## Common types of NoSQL 18 | ### Key-value stores 19 | - Array of key-value pairs. The "key" is an attribute name. 20 | - Redis, Voldemort, Dynamo. 21 | 22 | ### Document databases 23 | - Data is stored in documents. 24 | - Documents are grouped in collections. 25 | - Each document can have an entirely different structure. 26 | - CouchDB, MongoDB. 27 | 28 | ### Wide-column / columnar databases 29 | - Column families - containers for rows. 30 | - No need to know all the columns up front. 31 | - Each row can have a different number of columns. 32 | - Cassandra, HBase. 33 | 34 | ### Graph database 35 | - Data is stored in graph structures 36 | - Nodes: entities 37 | - Properties: information about the entities 38 | - Lines: connections between the entities 39 | - Neo4J, InfiniteGraph 40 | 41 | ## Differences between SQL and NoSQL 42 | ### Storage 43 | - SQL: store data in tables. 44 | - NoSQL: have different data storage models. 45 | - key-value 46 | - document 47 | - graph 48 | - columnar 49 | 50 | ### Schema 51 | - SQL 52 | - Each record conforms to a fixed schema. 53 | - each row must have data for each column 54 | - Schema can be altered, but it requires modifying the whole database and going offline.
55 | - NoSQL: 56 | - Schemas are dynamic. 57 | - each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’ 58 | 59 | ### Querying 60 | - SQL 61 | - Use SQL (structured query language) for defining and manipulating the data. 62 | - NoSQL 63 | - Queries are focused on a collection of documents. 64 | - UnQL (unstructured query language). 65 | - Different databases have different syntax. 66 | 67 | ### Scalability 68 | - SQL 69 | - Vertically scalable (by increasing the horsepower: memory, CPU, etc) and expensive. 70 | - Horizontally scalable (across multiple servers); but it can be challenging and time-consuming. 71 | - NoSQL 72 | - Horizontally scalable (by adding more servers) and cheap. 73 | 74 | ### ACID 75 | - Atomicity, consistency, isolation, durability 76 | - SQL 77 | - ACID compliant 78 | - Data reliability 79 | - Guarantee of transactions 80 | - NoSQL 81 | - Most sacrifice ACID compliance for performance and scalability. 82 | 83 | ## Which one to use? 84 | ### SQL 85 | - Ensure ACID compliance. 86 | - Reduce anomalies. 87 | - Protect database integrity. 88 | - Data is structured and unchanging. 89 | 90 | ### NoSQL 91 | - Data has little or no structure. 92 | - Make the most of cloud computing and storage. 93 | - Cloud-based storage requires data to be easily spread across multiple servers to scale up. 94 | - Rapid development. 95 | - Frequent updates to the data structure. 96 | 97 | # text 98 | - In the world of databases, there are two main types of solutions: SQL and NoSQL (or relational databases and non-relational databases). Both of them differ in the way they were built, the kind of information they store, and the storage method they use. 99 | - Relational databases are structured and have predefined schemas like phone books that store phone numbers and addresses.
Non-relational databases are unstructured, distributed, and have a dynamic schema like file folders that hold everything from a person’s address and phone number to their Facebook ‘likes’ and online shopping preferences. 100 | 101 | ## SQL 102 | Relational databases store data in rows and columns. Each row contains all the information about one entity and each column contains all the separate data points. Some of the most popular relational databases are MySQL, Oracle, MS SQL Server, SQLite, Postgres, and MariaDB. 103 | 104 | ## NoSQL 105 | Following are the most common types of NoSQL: 106 | - Key-Value Stores: 107 | - Data is stored in an array of key-value pairs. The ‘key’ is an attribute name which is linked to a ‘value’. Well-known key-value stores include Redis, Voldemort, and Dynamo. 108 | - Document Databases 109 | - In these databases, data is stored in documents (instead of rows and columns in a table) and these documents are grouped together in collections. Each document can have an entirely different structure. Document databases include CouchDB and MongoDB. 110 | - Wide-Column Databases 111 | - Instead of ‘tables,’ in columnar databases we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase. 112 | - Graph Databases 113 | - These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph databases include Neo4J and InfiniteGraph.
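To make the storage-model differences concrete, here is a toy comparison of how the same kind of record might look under each model. These are plain Python data structures for illustration only, not the APIs of any real database.

```python
# Relational (SQL): fixed schema, every row has the same columns.
sql_columns = ("id", "name", "city")
sql_row = (1, "Alice", "Ithaca")

# Key-value store: an opaque value looked up by a single key.
kv_store = {"user:1": '{"name": "Alice", "city": "Ithaca"}'}

# Document store: documents in a collection may have different structures.
documents = [
    {"_id": 1, "name": "Alice", "city": "Ithaca"},
    {"_id": 2, "name": "Bob", "likes": ["databases"]},  # no 'city' field
]

# Wide-column: rows in a column family need not share the same columns.
column_family = {
    "row1": {"name": "Alice", "city": "Ithaca"},
    "row2": {"name": "Bob"},
}

# Graph: nodes (entities) with properties, plus edges (connections).
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, 2, "follows")]
```

The contrast to notice: the relational row is meaningless without its fixed column list, while the document, wide-column, and graph representations each carry their own structure per record.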
114 | 115 | ## High level differences between SQL and NoSQL 116 | - Storage 117 | - SQL stores data in tables where each row represents an entity and each column represents a data point about that entity; for example, if we are storing a car entity in a table, different columns could be ‘Color’, ‘Make’, ‘Model’, and so on. 118 | - NoSQL databases have different data storage models. The main ones are key-value, document, graph, and columnar. We will discuss differences between these databases below. 119 | 120 | - Schema 121 | - In SQL, each record conforms to a fixed schema, meaning the columns must be decided and chosen before data entry and each row must have data for each column. The schema can be altered later, but it involves modifying the whole database and going offline. 122 | - In NoSQL, schemas are dynamic. Columns can be added on the fly and each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’ 123 | 124 | - Querying 125 | - SQL databases use SQL (structured query language) for defining and manipulating the data, which is very powerful. In a NoSQL database, queries are focused on a collection of documents. Sometimes it is also called UnQL (Unstructured Query Language). Different databases have different syntax for using UnQL. 126 | 127 | - Scalability 128 | - In most common situations, SQL databases are vertically scalable, i.e., by increasing the horsepower (higher Memory, CPU, etc.) of the hardware, which can get very expensive. It is possible to scale a relational database across multiple servers, but this is a challenging and time-consuming process. 129 | - On the other hand, NoSQL databases are horizontally scalable, meaning we can add more servers easily in our NoSQL database infrastructure to handle a lot of traffic. Any cheap commodity hardware or cloud instances can host NoSQL databases, thus making it a lot more cost-effective than vertical scaling. 
A lot of NoSQL technologies also distribute data across servers automatically. 130 | 131 | - Reliability or ACID Compliance (Atomicity, Consistency, Isolation, Durability): The vast majority of relational databases are ACID compliant. So, when it comes to data reliability and safely guaranteeing transactions, SQL databases are still the better bet. 132 | 133 | Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability. 134 | 135 | ## SQL vs. NoSQL - Which one to use? 136 | When it comes to database technology, there’s no one-size-fits-all solution. That’s why many businesses rely on both relational and non-relational databases for different needs. Even as NoSQL databases are gaining popularity for their speed and scalability, there are still situations where a highly structured SQL database may perform better; choosing the right technology hinges on the use case. 137 | 138 | ### Reasons to use a SQL database 139 | Here are a few reasons to choose a SQL database: 140 | 141 | We need to ensure ACID compliance. ACID compliance reduces anomalies and protects the integrity of your database by prescribing exactly how transactions interact with the database. Generally, NoSQL databases sacrifice ACID compliance for scalability and processing speed, but for many e-commerce and financial applications, an ACID-compliant database remains the preferred option. 142 | Your data is structured and unchanging. If your business is not experiencing massive growth that would require more servers and if you’re only working with data that is consistent, then there may be no reason to use a system designed to support a variety of data types and high traffic volume. 143 | ### Reasons to use a NoSQL database 144 | When all the other components of our application are fast and seamless, NoSQL databases prevent data from being the bottleneck.
Big data has contributed to the success of NoSQL databases, mainly because they handle data differently than traditional relational databases. A few popular examples of NoSQL databases are MongoDB, CouchDB, Cassandra, and HBase. 145 | 146 | Storing large volumes of data that often have little to no structure. A NoSQL database sets no limits on the types of data we can store together and allows us to add new types as the need changes. With document-based databases, you can store data in one place without having to define what “types” of data those are in advance. 147 | Making the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution but requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box, without a lot of headaches. 148 | Rapid development. NoSQL is extremely useful for rapid development as it doesn’t need to be prepped ahead of time. If you’re working on quick iterations of your system which require making frequent updates to the data structure without a lot of downtime between versions, a relational database will slow you down. 149 |
150 | -------------------------------------------------------------------------------- /complete_system_design/images/HTTP_protocol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/HTTP_protocol.png -------------------------------------------------------------------------------- /complete_system_design/images/Vertical_scaling_vs._Horizontal_scaling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/Vertical_scaling_vs._Horizontal_scaling.png -------------------------------------------------------------------------------- /complete_system_design/images/accessing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/accessing.png -------------------------------------------------------------------------------- /complete_system_design/images/ajax.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/ajax.png -------------------------------------------------------------------------------- /complete_system_design/images/cap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/cap.png
-------------------------------------------------------------------------------- /complete_system_design/images/cap_theorem.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/cap_theorem.png -------------------------------------------------------------------------------- /complete_system_design/images/client_loadbalancer_server.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/client_loadbalancer_server.png -------------------------------------------------------------------------------- /complete_system_design/images/database_schema.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/database_schema.png -------------------------------------------------------------------------------- /complete_system_design/images/detailed_component.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/detailed_component.png -------------------------------------------------------------------------------- /complete_system_design/images/hash1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash1.png -------------------------------------------------------------------------------- /complete_system_design/images/hash2.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash2.png -------------------------------------------------------------------------------- /complete_system_design/images/hash3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash3.png -------------------------------------------------------------------------------- /complete_system_design/images/hash4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash4.png -------------------------------------------------------------------------------- /complete_system_design/images/hash5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/hash5.png -------------------------------------------------------------------------------- /complete_system_design/images/high_level_design.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/high_level_design.png -------------------------------------------------------------------------------- /complete_system_design/images/high_level_url_shortening.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/high_level_url_shortening.png -------------------------------------------------------------------------------- /complete_system_design/images/library_catalog_indexes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/library_catalog_indexes.png -------------------------------------------------------------------------------- /complete_system_design/images/loadbalancer2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/loadbalancer2.png -------------------------------------------------------------------------------- /complete_system_design/images/long_polling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/long_polling.png -------------------------------------------------------------------------------- /complete_system_design/images/proxy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/proxy.png -------------------------------------------------------------------------------- /complete_system_design/images/redundancy.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/redundancy.png -------------------------------------------------------------------------------- /complete_system_design/images/redundant_load_balancer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/redundant_load_balancer.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow1.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow10.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow11.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow2.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow3.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow4.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow5.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow6.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow7.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow8.png -------------------------------------------------------------------------------- /complete_system_design/images/request_flow9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/request_flow9.png -------------------------------------------------------------------------------- /complete_system_design/images/shortening.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/shortening.png -------------------------------------------------------------------------------- /complete_system_design/images/sse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/sse.png -------------------------------------------------------------------------------- /complete_system_design/images/url1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url1.png 
-------------------------------------------------------------------------------- /complete_system_design/images/url2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url2.png -------------------------------------------------------------------------------- /complete_system_design/images/url3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url3.png -------------------------------------------------------------------------------- /complete_system_design/images/url4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url4.png -------------------------------------------------------------------------------- /complete_system_design/images/url5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url5.png -------------------------------------------------------------------------------- /complete_system_design/images/url6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url6.png -------------------------------------------------------------------------------- /complete_system_design/images/url7.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url7.png -------------------------------------------------------------------------------- /complete_system_design/images/url8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url8.png -------------------------------------------------------------------------------- /complete_system_design/images/url9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/url9.png -------------------------------------------------------------------------------- /complete_system_design/images/websockets.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/complete_system_design/images/websockets.png -------------------------------------------------------------------------------- /complete_system_design/system_design_problems/step_by_step_guide.md: -------------------------------------------------------------------------------- 1 | System Design Interviews: A step by step guide 2 | ===== 3 | # keypoints 4 | 5 | 6 | # text 7 | - A lot of software engineers struggle with system design interviews (SDIs) primarily because of three reasons: 8 | - The unstructured nature of SDIs, where the candidates are asked to work on an open-ended design problem that doesn’t have a standard answer. 9 | - Candidates lack experience in developing complex and large scale systems. 10 | - Candidates did not spend enough time to prepare for SDIs. 
11 | - Like coding interviews, candidates who haven’t put deliberate effort into preparing for SDIs mostly perform poorly, especially at top companies like Google, Facebook, Amazon, Microsoft, etc. In these companies, candidates who do not perform above average have a limited chance to get an offer. On the other hand, a good performance always results in a better offer (higher position and salary) since it shows the candidate’s ability to handle a complex system. 12 | - In this course, we’ll follow a step-by-step approach to solving multiple design problems. First, let’s go through these steps: 13 | 14 | ## Step 1: Requirements clarifications 15 | - It is always a good idea to ask questions about the exact scope of the problem we are trying to solve. Design questions are mostly open-ended, and they don’t have ONE correct answer. That’s why clarifying ambiguities early in the interview becomes critical. Candidates who spend enough time defining the end goals of the system always have a better chance of being successful in the interview. Also, since we only have 35-40 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on. 16 | - Let’s expand this with an actual example of designing a Twitter-like service. Here are some questions for designing Twitter that should be answered before moving on to the next steps: 17 | - Will users of our service be able to post tweets and follow other people? 18 | - Should we also design the creation and display of the user’s timeline? 19 | - Will tweets contain photos and videos? 20 | - Are we focusing on the backend only, or are we developing the front-end too? 21 | - Will users be able to search tweets? 22 | - Do we need to display hot trending topics? 23 | - Will there be any push notifications for new (or important) tweets? 24 | - All such questions will determine what our end design will look like.
25 | 26 | ## Step 2: Back-of-the-envelope estimation 27 | - It is always a good idea to estimate the scale of the system we’re going to design. This will also help later when we focus on scaling, partitioning, load balancing, and caching. 28 | - What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)? 29 | - How much storage will we need? We will have different storage requirements if users can have photos and videos in their tweets. 30 | - What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers. 31 | 32 | ## Step 3: System interface definition 33 | - Define what APIs are expected from the system. This will establish the exact contract expected from the system and ensure we haven’t gotten any requirements wrong. Some examples of APIs for our Twitter-like service will be: 34 | - ``postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, …)`` 35 | - ``generateTimeline(user_id, current_time, user_location, …)`` 36 | - ``markTweetFavorite(user_id, tweet_id, timestamp, …)`` 37 | 38 | ## Step 4: Defining data model 39 | - Defining the data model in the early part of the interview will clarify how data will flow between different system components. Later, it will guide data partitioning and management. The candidate should identify various entities of the system, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service: 40 | - User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc. 41 | - Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc. 42 | - UserFollow: UserID1, UserID2 43 | - FavoriteTweets: UserID, TweetID, TimeStamp 44 | - Which database system should we use?
Will NoSQL like Cassandra best fit our needs, or should we use a MySQL-like solution? What kind of block storage should we use to store photos and videos? 45 | 46 | ## Step 5: High-level design 47 | - Draw a block diagram with 5-6 boxes representing the core components of our system. We should identify enough components to solve the actual problem from end to end. 48 | - For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we’re assuming that we will have a lot more read traffic (compared to writes), we can decide to have separate servers to handle these scenarios. On the back-end, we need an efficient database that can store all the tweets and support a huge number of reads. We will also need a distributed file storage system for storing photos and videos. 49 | - ![](../images/high_level_design.png) 50 | 51 | ## Step 6: Detailed design 52 | - Dig deeper into two or three major components; the interviewer’s feedback should always guide us to what parts of the system need further discussion. We should present different approaches, their pros and cons, and explain why we prefer one approach over the other. Remember, there is no single answer; the only important thing is to consider tradeoffs between different options while keeping system constraints in mind. 53 | - Since we will be storing a massive amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issues could that cause? 54 | - How will we handle hot users who tweet a lot or follow lots of people? 55 | - Since users’ timelines will contain the most recent (and relevant) tweets, should we try to store our data so that it is optimized for scanning the latest tweets? 56 | - How much and at which layer should we introduce cache to speed things up?
57 | - What components need better load balancing? 58 | 59 | ## Step 7: Identifying and resolving bottlenecks 60 | - Try to discuss as many bottlenecks as possible and different approaches to mitigate them. 61 | - Is there any single point of failure in our system? What are we doing to mitigate it? 62 | - Do we have enough replicas of the data so that we can still serve our users if we lose a few servers? 63 | - Similarly, do we have enough copies of different services running such that a few failures will not cause a total system shutdown? 64 | - How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades? -------------------------------------------------------------------------------- /complete_system_design/system_design_problems/url_shortening.md: -------------------------------------------------------------------------------- 1 | Designing a URL Shortening service like TinyURL 2 | ===== 3 | 4 | # text 5 | ## Why do we need URL shortening? 6 | - URL shortening is used to create shorter aliases for long URLs. We call these shortened aliases “short links.” Users are redirected to the original URL when they hit these short links. Short links save a lot of space when displayed, printed, messaged, or tweeted. Additionally, users are less likely to mistype shorter URLs. 7 | - For example, if we shorten this page through TinyURL: 8 | - ``https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904/`` 9 | - we would get 10 | - ``http://tinyurl.com/jlg8zpc`` 11 | - The shortened URL is nearly one-third the size of the actual URL. 12 | - URL shortening is used to optimize links across devices, track individual links to analyze audiences, measure ad campaigns’ performance, or hide affiliated original URLs. 13 | - If you haven’t used tinyurl.com before, please try creating a new shortened URL and spend some time going through the various options their service offers.
This will help you a lot in understanding this chapter. 14 | 15 | ## Requirements and Goals of the System 16 | - Our URL shortening system should meet the following requirements: 17 | - **Functional Requirements** 18 | - Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications. 19 | - When users access a short link, our service should redirect them to the original link. 20 | - Users should optionally be able to pick a custom short link for their URL. 21 | - Links will expire after a standard default timespan. Users should be able to specify the expiration time. 22 | - **Non-Functional Requirements** 23 | - The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing. 24 | - URL redirection should happen in real-time with minimal latency. 25 | - Shortened links should not be guessable (not predictable). 26 | - **Extended Requirements** 27 | - Analytics; e.g., how many times did a redirection happen? 28 | - Our service should also be accessible through REST APIs by other services. 29 | 30 | ## Capacity Estimation and Constraints 31 | - Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between reads and writes. 32 | - **Traffic estimates** 33 | - Assuming we will have 500M new URL shortenings per month, with a 100:1 read/write ratio, we can expect 50B redirections during the same period: 34 | - 100 * 500M => 50B 35 | - What would be Queries Per Second (QPS) for our system?
New URL shortenings per second: 36 | - 500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s 37 | - Considering the 100:1 read/write ratio, URL redirections per second will be: 38 | - 100 * 200 URLs/s = 20K/s 39 | - **Storage estimates** 40 | - Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion: 41 | - 500 million * 5 years * 12 months = 30 billion 42 | - Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate–we will dig into it later). We will need 15TB of total storage: 43 | - 30 billion * 500 bytes = 15 TB 44 | - **Bandwidth estimates** 45 | - For write requests, since we expect 200 new URLs every second, total incoming data for our service will be 100KB per second: 46 | - ``200 * 500 bytes = 100 KB/s`` 47 | - For read requests, since every second we expect ~20K URL redirections, total outgoing data for our service would be 10MB per second: 48 | - ``20K * 500 bytes = ~10 MB/s`` 49 | - **Memory estimates** 50 | - If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. 51 | - Since we have 20K requests per second, we will be getting 1.7 billion requests per day: 52 | - ``20K * 3600 seconds * 24 hours = ~1.7 billion`` 53 | - To cache 20% of these requests, we will need 170GB of memory: 54 | - ``0.2 * 1.7 billion * 500 bytes = ~170GB`` 55 | - One thing to note here is that since there will be many duplicate requests (of the same URL), our actual memory usage will be less than 170GB.
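The estimates above are easy to sanity-check with a few lines of arithmetic; here is a quick sketch (variable names are mine, all input figures come from the text):

```python
# Sanity check of the capacity estimates above; all inputs come from the text.
new_urls_per_month = 500_000_000          # 500M new URL shortenings per month
read_write_ratio = 100                    # 100:1 reads to writes
object_size_bytes = 500                   # ballpark size of one stored object

seconds_per_month = 30 * 24 * 3600
writes_per_sec = new_urls_per_month / seconds_per_month        # ~200 URLs/s
reads_per_sec = read_write_ratio * 200                         # 20K redirections/s

stored_objects = new_urls_per_month * 12 * 5                   # 5 years -> 30 billion
storage_tb = stored_objects * object_size_bytes / 10**12       # 15 TB

requests_per_day = 20_000 * 3600 * 24                          # ~1.7 billion per day
cache_gb = 0.2 * requests_per_day * object_size_bytes / 10**9  # ~170 GB (80-20 rule)

print(round(writes_per_sec), reads_per_sec, storage_tb, round(cache_gb))
```

Note that the exact write rate comes out to ~193/s; the text rounds it up to ~200/s before deriving the 20K/s read rate, which is why the cache figure lands slightly above 170GB.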
56 | - **High-level estimates** 57 | - Assuming 500 million new URLs per month and a 100:1 read:write ratio, the following is a summary of the high-level estimates for our service:
- New URL shortenings: ~200/s
- URL redirections: ~20K/s
- Incoming data: ~100KB/s
- Outgoing data: ~10MB/s
- Storage for 5 years: 15TB
- Memory for cache: 170GB
58 | 59 | ## System APIs 60 | - We can have SOAP or REST APIs to expose the functionality of our service. The following could be the definitions of the APIs for creating and deleting URLs: 61 | 62 | - ``createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None)`` 63 | - **Parameters** 64 | - api_dev_key (string) 65 | - The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota. 66 | - original_url (string) 67 | - Original URL to be shortened. 68 | - custom_alias (string) 69 | - Optional custom key for the URL. 70 | - user_name (string) 71 | - Optional user name to be used in the encoding. 72 | - expire_date (string) 73 | - Optional expiration date for the shortened URL. 74 | - **Returns**: (string) 75 | - A successful insertion returns the shortened URL; otherwise, it returns an error code. 76 | - ``deleteURL(api_dev_key, url_key)`` 77 | - Where “url_key” is a string representing the shortened URL to be deleted; a successful deletion returns ‘URL Removed’. 78 | - How do we detect and prevent abuse? 79 | - A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key). 80 | 81 | ## Database Design 82 | - A few observations about the nature of the data we will store: 83 | - We need to store billions of records. 84 | - Each object we store is small (less than 1K). 85 | - There are no relationships between records—other than storing which user created a URL. 86 | - Our service is read-heavy.
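As a concrete way to picture these records, the two tables described below (one for URL mappings, one for users) might hold rows like the following minimal sketch; the field names and types are assumptions pieced together from the stated requirements, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ShortURL:
    """One row of the URL-mapping table (field names are assumptions)."""
    url_hash: str            # the short key; up to 16 chars for custom aliases
    original_url: str
    creation_date: datetime
    expiration_date: datetime
    user_id: int             # which user created the short link

@dataclass
class User:
    """One row of the user table (field names are assumptions)."""
    user_id: int
    name: str
    email: str
    creation_date: datetime
    last_login: datetime
```

The only cross-table reference is `ShortURL.user_id`, which matches the observation that there are no relationships between records beyond recording who created a URL.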
87 | - Database Schema 88 | - We would need two tables: one for storing information about the URL mappings and one for the data of the user who created the short link. 89 | - ![database schema](../images/database_schema.png) 90 | - What kind of database should we use? 91 | - Since we anticipate storing billions of rows, and we don’t need relationships between objects, a NoSQL store like DynamoDB, Cassandra, or Riak is a better choice. A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details. 92 | 93 | ## Basic System Design and Algorithm 94 | - The problem we are solving here is how to generate a short and unique key for a given URL. 95 | - In the TinyURL example in Section 1, the shortened URL is “http://tinyurl.com/jlg8zpc”. The last seven characters of this URL are the short key we want to generate. We’ll explore two solutions here: 96 | 97 | ### Encoding actual URL 98 | - We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for display. This encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]), and if we add ‘+’ and ‘/’ we can use Base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8, or 10 characters? 99 | - Using base64 encoding, a six-letter key would result in 64^6 = ~68.7 billion possible strings. 100 | - Using base64 encoding, an eight-letter key would result in 64^8 = ~281 trillion possible strings. 101 | - With 68.7B unique strings, let’s assume six-letter keys would suffice for our system. 102 | - If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value). Now we only have space for 8 characters per short key; how will we choose our key then? We can take the first 6 (or 8) letters for the key.
This could result in key duplication; to resolve that, we can choose some other characters out of the encoding string or swap some characters. 103 | 104 | - What are the different issues with our solution? 105 | - We have the following couple of problems with our encoding scheme: 106 | - If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable. 107 | - What if parts of the URL are URL-encoded? e.g., http://www.educative.io/distributed.php?id=design, and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except for the URL encoding. 108 | 109 | - Workaround for the issues 110 | - We can append an increasing sequence number to each input URL to make it unique and then generate its hash. We don’t need to store this sequence number in the databases, though. Possible problems with this approach could be an ever-increasing sequence number. Can it overflow? Appending an increasing sequence number will also impact the performance of the service. 111 | - Another solution could be to append the user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one. 112 | - ![1/9](../images/url1.png) 113 | - ![2/9](../images/url2.png) 114 | - ![3/9](../images/url3.png) 115 | - ![4/9](../images/url4.png) 116 | - ![5/9](../images/url5.png) 117 | - ![6/9](../images/url6.png) 118 | - ![7/9](../images/url7.png) 119 | - ![8/9](../images/url8.png) 120 | - ![9/9](../images/url9.png) 121 | 122 | ### Generating keys offline 123 | - We can have a standalone **Key Generation Service (KGS)** that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will take one of the already-generated keys and use it. This approach will make things quite simple and fast. 
Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique. 124 | 125 | - Can concurrency cause problems? 126 | - As soon as a key is used, it should be marked in the database to ensure that it is not used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem? 127 | 128 | - Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory to quickly provide them whenever a server needs them. 129 | 130 | - For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys–which could be acceptable, given the huge number of keys we have. 131 | 132 | - KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server. 133 | 134 | - What would be the key-DB size? 135 | - With base64 encoding, we can generate 68.7B unique six-letter keys. If we need one byte to store one alpha-numeric character, we can store all these keys in: 136 | - 6 (characters per key) * 68.7B (unique keys) = 412 GB. 137 | 138 | - Isn’t KGS a single point of failure? 139 | - Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys. 140 | 141 | - Can each app server cache some keys from key-DB?
142 | - Yes, this can surely speed things up, although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This can be acceptable since we have 68B unique six-letter keys. 143 | 144 | - How would we perform a key lookup? 145 | - We can look up the key in our database to get the full URL. If it’s present in the DB, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If that key is not present in our system, issue an “HTTP 404 Not Found” status or redirect the user back to the homepage. 146 | 147 | - Should we impose size limits on custom aliases? 148 | - Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema). 149 | - ![High level system design for URL shortening](../images/high_level_url_shortening.png) 150 | 151 | ## Data Partitioning and Replication 152 | - To scale out our DB, we need to partition it so that it can store information about billions of URLs. We need to develop a partitioning scheme that would divide and store our data on different DB servers. 153 | 154 | - Range Based Partitioning 155 | - We can store URLs in separate partitions based on the hash key’s first letter. Hence we save all the URLs starting with the letter ‘A’ (and ‘a’) in one partition, save those that start with the letter ‘B’ in another partition, and so on. This approach is called range-based partitioning. We can even combine certain less frequently occurring letters into one database partition. We should come up with a static partitioning scheme so that we can always store/find a URL in a predictable manner.
156 | - The main problem with this approach is that it can lead to unbalanced DB servers. For example, we decide to put all URLs starting with the letter ‘E’ into a DB partition, but later we realize that we have too many URLs that start with the letter ‘E.’ 157 | 158 | - Hash-Based Partitioning 159 | - In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object. 160 | - Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any ‘key’ to a number between [1…256]). This number would represent the partition in which we store our object. 161 | - This approach can still lead to overloaded partitions, which can be solved using Consistent Hashing. 162 | 163 | ## Cache 164 | - We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcached, which can store full URLs with their respective hashes. Before hitting backend storage, the application servers can quickly check if the cache has the desired URL. 165 | 166 | - How much cache memory should we have? 167 | - We can start with 20% of daily traffic and, based on clients’ usage patterns, we can adjust how many cache servers we need. As estimated above, we need 170GB memory to cache 20% of daily traffic. Since a modern-day server can have 256GB memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs. 168 | 169 | - Which cache eviction policy would best fit our needs? 170 | - When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. 
We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of the URLs that have been accessed recently. 171 | 172 | - To further increase the efficiency, we can replicate our caching servers to distribute the load between them. 173 | 174 | - How can each cache replica be updated? 175 | - Whenever there is a cache miss, our servers would be hitting a backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it. 176 | 177 | - ![1/11](../images/request_flow1.png) 178 | - ![2/11](../images/request_flow2.png) 179 | - ![3/11](../images/request_flow3.png) 180 | - ![4/11](../images/request_flow4.png) 181 | - ![5/11](../images/request_flow5.png) 182 | - ![6/11](../images/request_flow6.png) 183 | - ![7/11](../images/request_flow7.png) 184 | - ![8/11](../images/request_flow8.png) 185 | - ![9/11](../images/request_flow9.png) 186 | - ![10/11](../images/request_flow10.png) 187 | - ![11/11](../images/request_flow11.png) 188 | 189 | ## Load Balancer (LB) 190 | - We can add a Load balancing layer at three places in our system: 191 | - Between Clients and Application servers 192 | - Between Application Servers and database servers 193 | - Between Application Servers and Cache servers 194 | - Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, LB will take it out of the rotation and will stop sending any traffic to it. 195 | - A problem with Round Robin LB is that we don’t take the server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. 
To handle this, we can place a more intelligent LB solution that periodically queries the backend servers about their load and adjusts traffic based on that. 196 | 197 | ## Purging or DB cleanup 198 | - Should entries stick around forever, or should they be purged? If a user-specified expiration time is reached, what should happen to the link? 199 | - If we chose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will ensure that only expired links are deleted; some expired links can live longer, but they will never be returned to users. 200 | - Whenever a user tries to access an expired link, we can delete the link and return an error to the user. 201 | - A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low. 202 | - We can have a default expiration time for each link (e.g., two years). 203 | - After removing an expired link, we can put the key back in the key-DB to be reused. 204 | - Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever. 205 | - ![Detailed component design for URL shortening](../images/detailed_component.png) 206 | 207 | ## Telemetry 208 | - How many times has a short URL been used, and what were the user locations? How would we store these statistics? If it is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests? 209 | - Some statistics worth tracking: country of the visitor, date and time of access, web page that referred the click, and the browser or platform from which the page was accessed.
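The lazy-cleanup idea above (delete an expired link only when someone tries to access it) can be sketched in a few lines; the in-memory dict stands in for the URL database, and all names here are hypothetical:

```python
import time

store = {}  # url_key -> (original_url, expire_at); stands in for the URL DB

def put(url_key, original_url, ttl_seconds, now=None):
    """Store a mapping together with its expiration time."""
    now = time.time() if now is None else now
    store[url_key] = (original_url, now + ttl_seconds)

def lookup(url_key, now=None):
    """Return the original URL, or None (treated as an HTTP 404 / error page)
    if the key is missing or expired. Expired entries are deleted on access."""
    now = time.time() if now is None else now
    entry = store.get(url_key)
    if entry is None:
        return None
    original_url, expire_at = entry
    if now >= expire_at:
        del store[url_key]   # lazy cleanup: purge only when the link is hit
        return None
    return original_url
```

A periodic Cleanup service would sweep the same `store` for entries whose `expire_at` has passed; the lookup path above only guarantees that an expired link is never returned to a user.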
210 | 211 | ## Security and Permissions 212 | - Can users create private URLs or allow a particular set of users to access a URL? 213 | - We can store the permission level (public/private) with each URL in the database. We can also create a separate table to store the UserIDs that have permission to see a specific URL. If a user does not have permission and tries to access a URL, we can send an error (HTTP 401) back. Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the ‘Hash’ (or the KGS-generated ‘key’). The columns will store the UserIDs of those users that have permission to see the URL. 214 | 215 | -------------------------------------------------------------------------------- /distributed_system/review.md: -------------------------------------------------------------------------------- 1 | https://www.wisdomjobs.com/e-university/distributed-computing-interview-questions.html 2 | 3 | Question 1. Define Distributed System? 4 | Answer : 5 | A distributed system is a collection of independent computers that appears to its users as a single coherent system. A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages. 6 | Question 2. List The Characteristics Of Distributed System? 7 | Answer : 8 | Programs are executed concurrently 9 | There is no global time 10 | Components can fail independently (isolation, crash) 11 | Question 3. Mention The Examples Of Distributed System?
12 | Answer : 13 | The Internet 14 | Intranets 15 | Mobile and ubiquitous computing 16 | -------------------------------------------------------------------------------- /probability/002_Xinfeng_Zhou_A_Practical_Guide_To_Quant.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/002_Xinfeng_Zhou_A_Practical_Guide_To_Quant.docx -------------------------------------------------------------------------------- /probability/4710_review.md: -------------------------------------------------------------------------------- 1 | Introduction To Probability 2 | ==== 3 | 4 | 5 | # Experiments with random outcomes 6 | ## Sample space & probabilities 7 | - sample space \Omega 8 | - set of all the possible outcomes of the experiment 9 | - sample points \omega 10 | - Elements of \Omega 11 | - events 12 | - subsets of \Omega 13 | - \F 14 | - collection of events in \Omega 15 | - probability measure / probability distribution P 16 | - func from \F to \R 17 | - P(A) 18 | - prob of event A 19 | - Kolmogorov's axioms 20 | - 0 <= P(A) <= 1, \any A 21 | - P(\Omega) = 1, P(\empty) = 0 22 | - if A_1, A_2, A_3, ... pairwise disjoint events 23 | - P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i) 24 | - P(A_1 \cup A_2 \cup ... \cup A_n) = P(A_1) + P(A_2) + ... + P(A_n) 25 | - probability space (\Omega, \F, P) 26 | - mutually exclusive 27 | - A_i \cap A_j = \empty 28 | - cartesian product spaces 29 | - A_1 x A_2 x ...
x A_n = {(x_1, x_2, ..., x_n) | x_i \in A_i, i \in [1,n]} 30 | - set of ordered n-tuples with the i-th element from A_i 31 | 32 | ## random sampling 33 | - sampling with & without replacement 34 | - ordered & unordered sample 35 | 36 | ## consequence of the rules of probability 37 | - P(A) + P(A^c) = 1 38 | - monotonicity of probability 39 | - if A \subset B then P(A) <= P(B) 40 | - inclusion - exclusion 41 | - P(A \cup B) = P(A) + P(B) - P(A \cap B) 42 | - P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) 43 | - n-person hat problem: n persons have their hats mixed up; what is the probability that no one gets his/her own hat? How does this probability behave as n -> \inf? 44 | - define event A_i = {person i gets his/her hat} 45 | - P(\bigcap_{i=1}^{n}A_i^c) = 1 - P(\bigcup_{i=1}^n A_i) 46 | - P(A_i1 \cap A_i2 \cap ... \cap A_ik) = P(i_1, ..., i_k get their own hats) = \frac{(n-k)!}{n!} (given k hats assigned correctly, number of ways the remaining (n-k) hats can be assigned to the rest of the guests) 47 | - \sum_k P(A_i1 \cap A_i2 \cap ... \cap A_ik) = \binom{n}{k} \frac{(n-k)!}{n!} = \frac{1}{k!} 48 | - P(\bigcup_{i=1}^n A_i) = 1 - 1/2! + 1/3! + ... + (-1)^{n+1} 1/n! 49 | - P(\bigcap_{i=1}^{n}A_i^c) = 1/2! - 1/3! + ... + (-1)^n 1/n! = \sum_{k=0}^n (-1)^k/k! 50 | - if n -> \inf, then P(\bigcap_{i=1}^{n}A_i^c) -> e^{-1} 51 | 52 | ## random variables 53 | - random variable X is a function from \Omega into the real numbers 54 | - X is degenerate if P(X = b) = 1 55 | - probability distribution of X is P{X \in B} for sets B of real numbers 56 | - X is a discrete random variable if there exists a finite or countably infinite set {k_1, k_2, ...} of real numbers such that \sum_i P(X = k_i) = 1 57 | - probability mass function p.m.f.
of a discrete random variable is p_X(k) = p(k) = Pr(X = k) for all possible values k of X 58 | 59 | # Conditional probability and independence 60 | ## conditional probability 61 | - The conditional probability of A given B is P(A | B) = P(AB) / P(B) 62 | - Multiplication rule for n events 63 | - P(A_1A_2...A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1A_2)...P(A_n|A_1A_2...A_{n-1}) 64 | - a finite collection of events {B_1, ..., B_n} is a partition of \Omega if B_iB_j = \empty whenever i != j and \bigcup_{i=1}^n B_i = \Omega 65 | 66 | ## bayes' formula 67 | - P(B | A) = P(AB) / P(A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)} 68 | - general version of bayes' formula 69 | - P(B_k|A) = P(AB_k)/P(A) = \frac{P(A|B_k)P(B_k)}{\sum_{i=1}^n P(A|B_i)P(B_i)} 70 | 71 | ## independence 72 | - A is independent of B if P(A|B) = P(A), equivalently P(AB) = P(A)P(B) 73 | - if A, B independent, the same is true for A^c and B^c, A^c and B, A and B^c 74 | - X_1, ..., X_n are random variables on the same probability space; they are independent if P(X_1 \in B_1, X_2 \in B_2, ..., X_n \in B_n) = \prod_{k=1}^n P(X_k \in B_k) 75 | 76 | ## independent trials 77 | - **Bernoulli distribution** 78 | - records the result of a single trial with 2 possible outcomes 79 | - 0 <= p <= 1, X ~ Ber(p) with success probability p if X \in {0, 1} and P(X = 1) = p and P(X = 0) = 1-p 80 | - e.g. a sequence of n independent trials 81 | - Pr(X_1 = 0, X_2 = 1, X_3 = X_4 = 0) = p(1-p)^3 82 | - E[X] = p 83 | - Var(X) = p(1-p) 84 | - **Binomial distribution** 85 | - X \sim Bin(n, p) 86 | - Let X be the number of successes in n indep trials, with success probability p; X_i denotes the outcome of trial i 87 | - X = X_1 + X_2 + ...
+ X_n 88 | - Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k} 89 | - E[X] = np 90 | - Var(X) = np(1-p) 91 | - **geometric distribution** 92 | - X \sim Geom(p) 93 | - infinite sequence of indep trials 94 | - X is the number of trials needed to see the first success 95 | - P(X = k) = P(X_1 = 0, X_2 = 0, ..., X_{k-1} = 0, X_k = 1) = (1-p)^{k-1}p 96 | - E[X] = 1/p 97 | - Var(X) = (1-p)/p^2 98 | ## Further topics 99 | - conditional independence 100 | - P(A_i1 A_i2 ... A_ik | B) = P(A_i1 | B) P(A_i2 | B) ... P(A_ik | B) 101 | - e.g. Suppose 9/10 coins are fair, 1/10 coins are biased with tail probability 3/5 102 | - A_1 = first flip yields tail, A_2 = second flip yields tail 103 | - successive flips of **a given coin** are independent 104 | - P(A_1|F) = P(A_2|F) = 1/2, P(A_1|B) = P(A_2|B) = 3/5 105 | - P(A_1A_2|F) = P(A_1|F)P(A_2|F), P(A_1A_2|B) = P(A_1|B)P(A_2|B) 106 | - P(A_1A_2) = P(A_1A_2|F)P(F) + P(A_1A_2|B)P(B) 107 | - hypergeometric distribution 108 | - X \sim Hypergeom(N, N_A, n) 109 | - The result of each draw (the elements of the population being sampled) can be classified into one of two mutually exclusive categories 110 | - The probability of a success changes on each draw, as each draw decreases the population 111 | - X takes values in the set [0, n] 112 | - k is the number of successes / type A items 113 | - P(X = k) = \frac{\binom{N_A}{k} \binom{N - N_A}{n-k}}{\binom{N}{n}} 114 | - sample n items without replacement: choose k items from the N_A type A items, and n-k from the N-N_A type B items 115 | - the birthday problem 116 | - How large should a randomly selected group of people be to guarantee that with probability at least 1/2 there are two people with the same birthday? 117 | - Take a random sample of size k 118 | - p_k = Pr(there is a repetition in the sample); how large should k be to have p_k > 1/2? 119 | - A_k = the first k picks are all distinct 120 | - p(A_k) = \frac{365 * 364 * ...
* (365 - (k-1))}{365^k} 121 | - p_k = 1 - p(A_k) 122 | 123 | # random variables 124 | ## probability distribution of random variables 125 | - discrete: Bernoulli, binomial, geometric 126 | - probability density function p.d.f 127 | - P(X <= b) = \int_{-\inf}^{b} f(x) dx 128 | - if a random variable X has density function f then point values have probability zero 129 | - P(X = c) = \int_c^c f(x) dx = 0 \any c 130 | - f(x) \geq 0 for \any x \in \R 131 | - \int_{-\inf}^{\inf} f(x) dx = 1 132 | - **uniform distribution** 133 | - X \sim Unif[a, b] 134 | - f(x) = 1/(b-a) if x \in [a,b] 135 | 0 otherwise 136 | - P(c <= X <= d) = \int_c^d 1/(b-a) dx 137 | - the value f(x) of a density function is not a probability, but it gives probabilities of sets by integration 138 | - P(a < X < a + \epsilon) \simeq f(a) * \epsilon 139 | - E[X] = (a+b)/2 140 | - Var(X) = (b-a)^2/12 141 | ## cumulative distribution function c.d.f 142 | - F(s) = P(X <= s) \any s \in \R 143 | - P(a < X <= b) = P(X <= b) - P(X <= a) = F(b) - F(a) 144 | - For a discrete random variable 145 | - F(s) = P(X <= s) = \sum_{k:k <= s} P(X = k) 146 | - For a continuous random variable 147 | - F(s) = P(X <= s) = \int_{-\inf}^s f(x) dx 148 | - find pmf/pdf from cdf 149 | - if F is piecewise constant, then X is discrete. Possible values of X are where F has jumps. P(X = x) = magnitude of the jump of F at x 150 | - if F is continuous and F'(x) exists everywhere, except possibly at finitely many points, then X is continuous, f(x) = F'(x).
If F is not differentiable at x, then f(x) can be set arbitrarily 151 | - property of cdf 152 | - monotonicity: if s < t then F(s) <= F(t) 153 | - right continuity: for each t \in \R, F(t) = lim_{s -> t^+} F(s) 154 | - lim_{t -> -\inf} F(t) = 0, lim_{t -> \inf} F(t) = 1 155 | - P(X < a) = lim_{s -> a^-} F(s) 156 | 157 | ## Expectation 158 | - expectation / first moment of a discrete variable: u = E[X] = \sum_{k} kP(X = k) 159 | - expectation of a continuous random variable: E[X] = \int_{-\inf}^{\inf} xf(x) dx 160 | - St. Petersburg paradox: flip a coin, if head, win 2 dollars and game is over; if tail, prize is doubled and flip again. 161 | - Let Y denote the prize 162 | - P(Y = 2^n) = 2^{-n} 163 | - E[Y] = \sum_{n=1}^{\inf} 2^n 2^{-n} = \sum 1 = \inf 164 | - undefined expectation 165 | - you and I flip a fair coin until we see the first head 166 | - let n denote the number of flips needed; if n odd, you pay me 2^n, otherwise I pay you 2^n 167 | - P(X = 2^n) = 2^{-n}, for odd n >= 1 168 | - P(X = -2^n) = 2^{-n} for even n >= 1 169 | - E[X] = 2^1 * 2^{-1} + (-2^2) * 2^{-2} + ... = 1 - 1 + 1 - 1 ... 170 | - the expectation does not exist 171 | - expectation of a function of a random variable 172 | - discrete: E[g(X)] = \sum_k g(k) P(X = k) 173 | - continuous: E[g(X)] = \int_{-\inf}^{\inf} g(x) f(x) dx 174 | - a stick of length l is broken at a uniformly chosen random location. What is the expected length of the longer piece?
175 | - g(x) = l-x if 0 <= x <= l/2 176 | x if l/2 < x <= l 177 | - E[g(X)] = \int_0^l g(x) f(x) dx 178 | = \int_0^{l/2} (l-x)/l dx + 179 | \int_{l/2}^{l} x/l dx 180 | = 3l/4 181 | - the n-th moment of X is E[X^n] = \sum_k k^n P(X = k) 182 | - median / 0.5-th quantile of X is any m that satisfies P(X >= m) >= 1/2, P(X <= m) >= 1/2 183 | - first quartile: p = 0.25, third quartile: p = 0.75 184 | - p-th quantile is any x satisfying P(X <= x) >= p, P(X >= x) >= 1-p 185 | - median of X 186 | - find m with P(X <= m) = 1/2 187 | ## variance 188 | - Var(X) = E[(X - u)^2] = \sigma^2 189 | = E[X^2] - (E[X])^2 190 | - standard deviation SD(X) = \sigma 191 | - discrete: Var(X) = \sum_k (k-u)^2 P(X = k) 192 | - continuous: Var(X) = \int_{-\inf}^{\inf} (x-u)^2 f(x)dx 193 | - for an indicator random variable, Var[I_A] = P(A) P(A^c) 194 | - E(aX+b) = aE[X] + b 195 | - Var(aX+b) = a^2Var(X) 196 | ![Properties of Random Variables](images/properties_of_random_variables.png) 197 | 198 | ## Gaussian distribution 199 | - Z \sim N(0, 1), a random variable Z has standard normal distribution / standard Gaussian distribution if Z has density function \phi(x) = 1/\sqrt(2\pi) e^{-x^2/2} 200 | - bell shaped curve 201 | - c.d.f \Phi(x) = 1/\sqrt(2 \pi) \int_{-\inf}^{x} e^{-s^2/2} ds 202 | - X \sim N(u, \sigma^2) iff. f(x) = 1/\sqrt(2\pi \sigma^2)e^{-(x-u)^2/2\sigma^2} 203 | - if X \sim N(u, \sigma^2), then Z = (X - u)/\sigma \sim N(0, 1) 204 | - if 1 <= k < l are integers and E[X^l] is finite, then E[X^k] is also finite 205 | 206 | # Approximations of the binomial distribution 207 | ## normal approximation 208 | - central limit theorem 209 | - e.g.
pmf of Bin(n, p) distribution can be close to the bell curve of the normal distribution 210 | - law of rare events 211 | - When p is small, Bin(n, p) is close to Poisson(np) 212 | - CLT for binomial 213 | - the pmf of S_n \sim Bin(n, p) approximates the density function of X \sim N(np, np(1-p)) as n becomes large 214 | - let p be fixed, then lim_{n -> \inf} P(a <= (S_n - np)/\sqrt(np(1-p)) <= b) = \int_a^b 1/\sqrt(2\pi) e^{-x^2/2} dx 215 | - Suppose S_n \sim Bin(n, p) with n large and p not too close to 0 or 1, or np(1-p) > 10; then P(a <= (S_n - np)/\sqrt(np(1-p)) <= b) is close to \Phi(b) - \Phi(a) 216 | - first approximate the binomial with the normal distribution, then approximate the c.d.f of the normal distribution using the table in the appendix 217 | - three sigma rule: \Phi(3) - \Phi(-3) \simeq 0.9974 218 | - continuity correction 219 | - compared to P(k_1 <= S_n <= k_2), P(k_1 - 1/2 <= S_n <= k_2 + 1/2) is a better approximation 220 | 221 | ## law of large numbers 222 | - law of large numbers for binomial random variables 223 | - \any fixed \epsilon > 0, lim_{n -> \inf} P(|S_n/n - p| < \epsilon) = 1 224 | - CLT describes the error in the law of large numbers 225 | - S_n / n = p + \sigma/\sqrt(n) * (S_n - np)/(\sigma\sqrt(n)) \simeq p + \sigma/\sqrt(n) Z 226 | - decomposes S_n/n into a sum of p and a random error 227 | - for large n this random error is approximately normal with standard deviation \sigma/\sqrt(n) 228 | 229 | ## applications of the normal approximation 230 | - want to estimate p for a biased coin 231 | - law of large numbers: flip n times, count S_n, take \hat(p) = S_n/n as the estimate for p 232 | - P(|\hat(p) - p| < \epsilon) = P(|S_n/n - p| < \epsilon) 233 | = P(-n\epsilon < S_n - np < n\epsilon) 234 | = (divide both sides by \sqrt(np(1-p))) 235 | \simeq 2\Phi(\epsilon\sqrt(n)/\sqrt(p(1-p))) - 1 236 | >= 2\Phi(2\epsilon\sqrt(n)) - 1 237 | - use it to solve problems like 'how many times should we flip a coin so that \hat(p) is within 0.05 of the true p, with probability at least 0.99?'
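The sample-size question above can be answered by inverting the bound 2\Phi(2e\sqrt(n)) - 1 numerically (it uses the worst case p(1-p) <= 1/4). A minimal sketch; `phi` and `flips_needed` are illustrative helper names, not from the notes:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal c.d.f. Phi(x), via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def flips_needed(eps, conf):
    """Smallest n with 2*Phi(2*eps*sqrt(n)) - 1 >= conf, i.e. enough flips
    to have |p_hat - p| < eps with probability >= conf for any true p,
    since p(1-p) <= 1/4."""
    n = 1
    while 2.0 * phi(2.0 * eps * sqrt(n)) - 1.0 < conf:
        n += 1
    return n
```

For eps = 0.05 and confidence 0.99 this should land near the closed-form answer (z_{0.995}/0.1)^2 = (2.576/0.1)^2, i.e. in the mid-600s.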
238 | - confidence intervals 239 | - (\hat(p) - \epsilon, \hat(p) + \epsilon) contains the true p with probability at least r; 100r% is the confidence level 240 | - e.g. 'find the 95% confidence interval' 241 | - use 2\Phi(2\epsilon\sqrt(n)) - 1 > 0.95 to solve for \epsilon 242 | - maximum likelihood estimator 243 | - \hat(p) = S_n / n 244 | - once S_n = k has been observed, can use the pmf of S_n to compare how likely outcome k is under different values of p 245 | - polling 246 | - actually sampling without replacement - hypergeometric 247 | - but sampling with replacement leads to indep trials and a binomial distribution for the number of successes 248 | - if the sample size n is small compared to the population, then even if sampling with replacement, meeting the same person twice has low chance 249 | - could use Bin(n,p) for polling 250 | - Hypergeom(N, N_A, n) converges to Bin(n, p) as N -> \inf and N_A/N -> p 251 | - random walk 252 | - let X_1, X_2, X_3, ... be indep random variables s.t. P(X_j = 1) = p, P(X_j = -1) = 1-p 253 | - S_0 = 0 254 | - S_n = X_1 + X_2 + ... + X_n 255 | - X_j is the j-th step, S_n is the position after n steps 256 | - the random sequence S_0, S_1, S_2, ... is a simple random walk 257 | - if p = 1/2, then S_n is a symmetric simple random walk, otherwise asymmetric 258 | - T_n = number of times the coin came up heads 259 | - S_n = T_n - (n - T_n) = 2T_n - n 260 | - T_n \sim Bin(n, p) 261 | - E[S_n] = n(2p-1) 262 | - Var[S_n] = 4np(1-p) 263 | 264 | ## Poisson approximation (discrete) 265 | - X \sim Poisson(\lambda) 266 | - \lambda > 0, X has Poisson distribution if X is a nonnegative integer with 267 | - P(X = k) = e^{-\lambda} \lambda^k/k!, k \in \N 268 | - E[X] = \lambda 269 | - Var[X] = \lambda 270 | - law of rare events 271 | - if successes are rare in a sequence of indep trials, then the number of successes is approximated by Poisson 272 | - Let S_n \sim Bin(n, \lambda/n), \lambda/n < 1, then 273 | - lim_{n -> \inf} P(S_n = k) = e^{-\lambda} \lambda^k/k!
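The law of rare events above can be checked numerically; a minimal sketch comparing the Bin(n, \lambda/n) pmf with its Poisson(\lambda) limit (the function names are illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    """P(S_n = k) for S_n ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# With lam = 3 and n = 1000, the error bound in the next bullet
# guarantees every pointwise difference is at most n*p^2 = lam^2/n = 0.009.
```

For example, binom_pmf(1000, 0.003, 2) and poisson_pmf(3, 2) already agree to a few decimal places, in line with the |P(X in A) - P(Y in A)| <= np^2 bound.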
274 | - Let X \sim Bin(n, p) and Y \sim Poisson(np); then for \any subset A \subset {0,1,2,...}, we have |P(X \in A) - P(Y \in A)| <= np^2 275 | - Poisson approximation for counting rare events 276 | - X = num of rare events that are not strongly dependent on each other 277 | - then X is approximately Poisson(\lambda) 278 | - P(X = k) \simeq e^{-\lambda} \lambda^k/k! 279 | - normal and poisson approximation of the binomial 280 | - when np(1-p) > 10 -> use normal 281 | - when np^2 small -> use poisson 282 | 283 | ## Exponential distribution 284 | - X \sim Exp(\lambda) 285 | - \lambda > 0, X has exponential distribution with rate \lambda if X has density function f(x) = \lambda e^{-\lambda x} for x >= 0, and 0 for x < 0 286 | - c.d.f 287 | - F(t) = \int_0^t \lambda e^{-\lambda x} dx = 1 - e^{-\lambda t}, t >= 0 288 | - P(X > t) = 1 - P(X <= t) = e^{-\lambda t} 289 | - E[X] = 1/\lambda 290 | - Var[X] = 1/\lambda^2 291 | - memoryless property 292 | - for any s, t > 0 293 | - P(X > t+s | X > t) = P(X > s) 294 | - e.g. the lifetime of some machine can be modeled by Exp(\lambda) 295 | - regardless of how long the machine has been in operation, the distribution of the remaining time is the same as that of the original lifetime 296 | - behaves as if it were brand new 297 | - no other distribution with a continuous p.d.f on [0, \inf) satisfies the memoryless property 298 | - approximation 299 | - model the time when the first customer arrives on a discrete time scale 300 | - the probability that at least one customer arrives in a time interval of length 1/n is \lambda/n, for large n 301 | - for k = 1,2,3...
if the first customer arrives during [(k-1)/n, k/n], set T_n = k/n 302 | - P(T_n = k/n) = (1-\lambda/n)^{k-1} \lambda/n --> nT_n \sim Geom(\lambda/n) 303 | - \lim_{n -> \inf} P(T_n > t) = e^{-\lambda t}, t >= 0 304 | 305 | # joint distribution of random variables 306 | ## joint distribution of discrete random variables 307 | - X_1, X_2, ..., X_n are discrete random variables 308 | - joint probability mass function 309 | - p(k_1, k_2, ..., k_n) = P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n) 310 | - let p(k_1, ..., k_n) be the joint probability mass function of (X_1, ..., X_n) 311 | - the probability mass function of X_j / marginal probability mass function of X_j 312 | - p_{X_j}(k) = \sum_{l_1, ..., l_{j-1}, l_{j+1}, ..., l_n} p(l_1, ..., l_{j-1}, k, l_{j+1}, ..., l_n) 313 | - multinomial distribution 314 | - n, r positive integers 315 | - p_1, p_2, ..., p_r positive reals 316 | - p_1 + p_2 + ... + p_r = 1 317 | - if the possible values are integer vectors (k_1, ..., k_r) such that 318 | - k_j >= 0 319 | - k_1 + ... + k_r = n 320 | - (X_1, ..., X_r) has multinomial distribution 321 | - joint probability mass function 322 | - P(X_1 = k_1, X_2 = k_2, ..., X_r = k_r) = \binom{n}{k_1, k_2, ..., k_r}p_1^{k_1} ... p_r^{k_r} 323 | - (X_1, ..., X_r) ~ Mult(n, r, p_1, ..., p_r) 324 | 325 | ## joint continuous random variables 326 | - X_1, ..., X_n are jointly continuous if there exists a 327 | - joint density function f on R^n s.t. for subsets B \subset R^n 328 | - P((X_1, ..., X_n) \in B) = \int ... \int_B f(x_1, ..., x_n) dx_1 ... dx_n 329 | - let f be the joint density function of X_1, ..., X_n. 330 | - then each random variable X_j has a density function f_{X_j} that can be obtained by integrating away the other variables from f 331 | - f_{X_j}(x) = \int ... \int (n-1 integrals) f(x_1, ..., x_{j-1}, x, x_{j+1}, ..., x_n) dx_1 ...
dx_{j-1} dx_{j+1} ... dx_n 332 | 333 | ## joint distributions and independence 334 | ### discrete 335 | - let p(k_1, ..., k_n) be the joint probability mass function of the discrete random variables X_1, ..., X_n 336 | - let p_{X_j}(k) = P(X_j = k) be the marginal probability mass function of X_j 337 | - X_1, ..., X_n are independent iff. 338 | - p(k_1, ..., k_n) = p_{X_1}(k_1) ... p_{X_n}(k_n) 339 | - for all possible values k_1, ..., k_n 340 | 341 | ### continuous 342 | - if X_1, ..., X_n have joint density function 343 | - f(x_1, x_2, ..., x_n) = f_{X_1}(x_1)f_{X_2}(x_2)...f_{X_n}(x_n) 344 | - then X_1, ..., X_n are independent 345 | - vice versa 346 | 347 | - suppose X_1, ..., X_{m+n} are independent random variables 348 | - define random variables Y = f(X_1, ..., X_m), Z = g(X_{m+1}, ..., X_{m+n}) 349 | - then Y and Z are independent random variables 350 | 351 | ## joint cumulative distribution function 352 | - discrete random variables 353 | - joint probability mass function 354 | - continuous random variables 355 | - joint probability density function 356 | - joint cumulative distribution function 357 | - F(s_1, ..., s_n) = P(X_1 <= s_1, ..., X_n <= s_n) 358 | - F(x, y) = P(X <= x, Y <= y) = \int_{-\infty}^x \int_{-\infty}^y f(s, t) dt ds 359 | - X_1, ..., X_n are independent iff. 360 | - F(x_1, x_2, ..., x_n) = \prod_{k=1}^n F_{X_k}(x_k) 361 | 362 | # Tail bounds and limit theorems 363 | ## estimating tail probabilities 364 | - Markov's inequality 365 | - let X be a nonnegative random variable 366 | - for any c > 0 367 | - P(X >= c) <= E[X]/c 368 | - Chebyshev's inequality 369 | - X has finite mean \mu and a finite variance \sigma^2 370 | - for any c > 0 371 | - P(|X - \mu| >= c) <= \sigma^2/c^2 372 | 373 | ## law of large numbers 374 | - suppose we have iid random variables X_1, X_2, ... 375 | - with finite mean E[X_1] = \mu 376 | - finite variance Var(X_1) = \sigma^2 377 | - Let S_n = X_1 + ...
+ X_n 378 | - for any fixed \epsilon > 0 we have 379 | - lim_{n -> \infty} P(|S_n/n - \mu| < \epsilon) = 1 380 | 381 | ## central limit theorem 382 | - suppose we have iid random variables X_1, X_2, ... 383 | - with finite mean E[X_1] = \mu 384 | - finite variance Var(X_1) = \sigma^2 385 | - Let S_n = X_1 + ... + X_n 386 | - for any fixed finite a and b 387 | - lim_{n -> \infty} P(a <= \frac{S_n - n\mu}{\sigma \sqrt{n}} <= b) = \Phi(b) - \Phi(a) = the integral of the standard normal density from a to b -------------------------------------------------------------------------------- /probability/Practical_Guide_To_Quant.md: -------------------------------------------------------------------------------- 1 | ## Chapter 2 Brain Teasers 2 | - starts with a simplified version 3 | ### Screwy pirates 4 | - 100 coins divided among 5 pirates 5 | - 2 pirates, a and b 6 | - b proposes: b gets 100, a gets 0 7 | - 3 pirates, a, b, c 8 | - a will support c if offered 1 coin (a gets 0 in the 2-pirate case) 9 | - a: 1, b: 0, c: 99 10 | - 4 pirates, abcd 11 | - b supports d 12 | - a: 0, b: 1, c: 0, d: 99 13 | - 5 pirates, abcde 14 | - a: 1, b: 0, c: 1, d: 0, e: 98 15 | 16 | ### Tiger and sheep 17 | - 1 tiger, 1 sheep 18 | - eat 19 | - 2 tigers, ab 20 | - not eat 21 | - 3 tigers, abc 22 | - a eats 23 | - 4 tigers, abcd 24 | - if a eats, a becomes a sheep and 3 tigers (bcd) remain 25 | - in the 3-tiger case the sheep gets eaten 26 | - so 27 | - a won't eat 28 | - 100 tigers (even), not eat 29 | 30 | ### River crossing 31 | - CD cross, D back, 3min 32 | - AB cross, C back, 2min 33 | - CD cross, 2min 34 | 35 | 36 | ### Card game 37 | - 2 cards 38 | 39 | ### Defective ball 40 | - 4 + 4 + 4 41 | 42 | ### Horse race 43 | - 5 groups, 5 races, get an ordering in each group 44 | - pick the tops from each group, 1 race; then 3 groups have potential answers, 3+2+1 candidates 45 | - race 1,2,3,6,7, then add 11 and race 46 | 47 | ## Chapter 4 Probability Theory 48 | ### Coin toss game 49 | - remove one coin from A 50 | - E1: A more coins 51 | - E2: equal coins 52 | - E3: A fewer
coins 53 | - P(E1) = P(E3) = x, P(E2) = y; then 2x + y = 1 54 | - result = x + y/2 = 0.5 55 | 56 | ### Card game 57 | - 1/13 * 48/51 + 1/13 * 44/51 + ... + 1/13 * 0/51 58 | = 1/(13*51) * (0 + 4 + 8 + ... + 48) 59 | = 1/(13*51) * (0 + 48) * 13 / 2 60 | = 24/51 61 | = 8/17 62 | 63 | 64 | ### Drunk passenger 65 | - E1: seat #1 taken before #100 66 | - E2: seat #100 taken before #1 67 | 68 | ### N points on a circle 69 | - 1/2^{N-1} chance that points 2, ..., N all lie in the semicircle starting at point 1 70 | - the same holds for each point i 71 | - N * 1/2^{N-1} 72 | 73 | ### poker hands 74 | - four-of-a-kind: 13 * 48 75 | - full house: 13 * 12 * 4 * 6 76 | - hand with two pairs: \binom(13,2) * 6 * 6 * 44 77 | 78 | ### hopping rabbit 79 | - stair(1) = 1 80 | - stair(2) = 2 81 | - stair(n) = stair(n-1) + stair(n-2) 82 | 83 | ### screwy pirates 84 | - for each random group of 5, there must be a lock that none of them has the key to, yet all other 6 pirates have the key 85 | - number of locks = \binom(11,5) 86 | - each lock has 6 keys, each pirate has \binom(11,5) * 6 / 11 keys 87 | 88 | ### chess tournament 89 | - conditional probability approach 90 | - in round 1, each player has probability 1/(2^n-1) of meeting player 1 91 | - 1 and 2 do not meet in round 1 with probability (2^n-2)/(2^n-1) 92 | - 1 and 2 do not meet in round 2 with probability (2^{n-1}-2)/(2^{n-1}-1) 93 | - multiply together, get 2^{n-1}/(2^n-1) 94 | - counting approach 95 | - 1 and 2 must be in different subgroups 96 | - 2^{n-1}/(2^n-1) 97 | 98 | ### application letters 99 | - let E_i be the event that envelope i is correct 100 | - P(E_i) = 1/5 101 | - P(E_iE_j) = 1/5 * 1/4 102 | - \sum P(E_iE_j) = 10 * 1/5 * 1/4 = 1/2 103 | - P(E_iE_jE_k) = 1/5 * 1/4 * 1/3 104 | - \sum P(E_iE_jE_k) = \binom(5, 3) * 1/5 * 1/4 * 1/3 = 1/3! 105 | - 1 - 1/2 + 1/3! - 1/4! + 1/5! 106 | 107 | ### birthday problem 108 | - Pr(nobody has the same birthday) < 1/2 109 | - 365 * 364 * ...
* (365-n+1)/365^n < 1/2 110 | 111 | ### 100th digit 112 | - binomial theorem: (x + y)^n = \sum_{k=0}^{n} \binom(n, k) x^k y^{n-k} 113 | - calculate (1-\sqrt(2))^n + (1+\sqrt(2))^n; the sum must be an integer 114 | - 0 < (1-\sqrt(2))^3000 << 10^{-100} 115 | 116 | ### cubic of integer 117 | - x = a + 10b 118 | - x^3 = (a + 10b)^3 = a^3 + 30a^2b + 300ab^2 + 1000b^3 119 | - last digit of x^3 depends on a^3, so a = 1 120 | - second-to-last digit of x^3 depends on 30a^2b = 30b; 3b must end in 1, thus b = 7 121 | - prob = 1/100 122 | 123 | ### boys and girls 124 | - part A 125 | - A = the family has at least one son 126 | - B = both are boys 127 | - {bb, bg, gb, gg} 128 | - Pr = 1/3 129 | 130 | - part B 131 | - 1/2 132 | 133 | ### all-girl world? 134 | X = # of boys before having a girl 135 | X = 0, 1, 2, ... 136 | average proportion of boys = \sum_{k=0}^{\infty} k/(k+1) * (1/2)^{k+1} 137 | 138 | ### unfair coin 139 | B: biased 140 | HS: 10 heads 141 | Pr(B|HS) = \frac{Pr(B \cap HS)}{Pr(HS)} 142 | Pr(HS) = Pr(F \cap HS) + Pr(B \cap HS) 143 | Pr(B \cap HS) = Pr(HS|B) * Pr(B) = 1 * 1/10^3 144 | 145 | Pr(B|HS) = \frac{1/10^3}{(1/2)^{10} * 999/1000 + 1/10^3} 146 | 147 | ### fair probability from an unfair coin 148 | Pr(H) = p 149 | Pr(HH) = p^2 150 | Pr(HT) = Pr(TH) = p(1-p) 151 | Pr(TT) = (1-p)^2 152 | 153 | throw it twice; if HH or TT, discard and throw again 154 | if HT, count as heads; if TH, count as tails 155 | 156 | ### dart game 157 | enumerate all possible outcomes of three throws 158 | 159 | ### birthday line 160 | assume I'm the n-th person 161 | P(n) = Pr(first n-1 persons have different birthdays) * Pr(my birthday is the same as one of theirs) 162 | = \frac{365 * 364 * ...
* (365-n+2)}{365^{n-1}} * \frac{n-1}{365} 163 | find the n such that P(n) > P(n-1) and P(n) > P(n+1) 164 | 165 | ### dice order 166 | Pr = Pr(increasing order | three different numbers) * Pr(three different numbers) 167 | = 1/6 * 5/6 * 4/6 168 | 169 | ### Monty hall problem 170 | if not switch, Pr(win) = 1/3 171 | if switch, Pr(win) = Pr(originally picked a goat) = 2/3 172 | 173 | ### Amoeba population 174 | Let P(E) be the probability that the amoeba population dies out. 175 | Let F1, F2, F3, F4 be those four individual outcomes 176 | P(E) = P(E|F1)P(F1) + P(E|F2)P(F2) + P(E|F3)P(F3) + P(E|F4)P(F4) 177 | = 1/4 + P(E)/4 + P(E)^2/4 + P(E)^3/4 178 | P(E) = \sqrt(2) - 1 179 | 180 | ### candies in a jar 181 | 182 | ### coin toss game 183 | Pr(A win) = Pr(xHT) + Pr(xxxHT) ... 184 | = P(A|H) * 1/2 + P(A|T) * 1/2 185 | P(A|T) = P(B) 186 | = 1-P(A) 187 | conditioned on B's toss 188 | P(A|H) = 1/2*0 + 1/2(1-P(A|H)) 189 | -> P(A|H) = 1/3 190 | P(A) = 4/9 191 | 192 | ### 4.4 Discrete & continuous distributions 193 | 194 | ### meeting probability 195 | Pr(|X-Y| <= 5) = shaded area in a square 196 | 197 | ### probability of a triangle 198 | pieces: x, y-x, 1-y 199 | assume x < y 200 | x + y-x > 1-y -> y > 1/2 201 | y-x + 1-y > x -> x < 1/2 202 | x + 1-y > y-x -> x + 1/2 > y -------------------------------------------------------------------------------- /probability/Xinfeng_Zhou_A_Practical_Guide_To_Quant.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/Xinfeng_Zhou_A_Practical_Guide_To_Quant.pdf -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/Chap1.md: -------------------------------------------------------------------------------- 1 | # Chapter 1 General Principles 2 | - Let us begin this book by exploring five general principles that will be extremely helpful in your
interview process. From my experience on both sides of the interview table, these general guidelines will better prepare you for job interviews and will likely make you a successful candidate. 3 | 4 | ## Build a broad knowledge base 5 | - The length and the style of quant interviews differ from firm to firm. Landing a quant job may mean enduring hours of bombardment with brain teasers, calculus, linear algebra, probability theory, statistics, derivative pricing, or programming problems. To be a successful candidate, you need to have broad knowledge in mathematics, finance and programming. 6 | 7 | - Will all these topics be relevant for your future quant job? Probably not. Each specific quant position often requires only limited knowledge in these domains. General problem solving skills may make more difference than specific knowledge. Then why are quantitative interviews so comprehensive? There are at least two reasons for this: 8 | 9 | - The first reason is that interviewers often have diverse backgrounds. Each interviewer has his or her own favorite topics that are often related to his or her own educational background or work experience. As a result, the topics you will be tested on are likely to be very broad. The second reason is more fundamental. Your problem solving skills, a crucial requirement for any quant job, are often positively correlated to the breadth of your knowledge. A basic understanding of a broad range of topics often helps you better analyze problems, explore alternative approaches, and come up with efficient solutions. Besides, your responsibility may not be restricted to your own projects. You will be expected to contribute as a member of a bigger team. Having broad knowledge will help you contribute to the team's success as well. 10 | 11 | - The key here is "basic understanding." Interviewers do not expect you to be an expert on a specific subject, unless it happens to be your PhD thesis.
The knowledge used in interviews, although broad, covers mainly essential concepts. This is exactly the reason why most of the books I refer to in the following chapters have the word "introduction" or "first" in the title. If I am allowed to give only one suggestion to a candidate, it will be: know the basics very well. 12 | 13 | ## Practice your interview skills 14 | - The interview process starts long before you step into an interview room. In a sense, the success or failure of your interview is often determined before the first question is asked. Your solutions to interview problems may fail to reflect your true intelligence and knowledge if you are unprepared. Although a complete review of quant interview problems is impossible and unnecessary, practice does improve your interview skills. Furthermore, many of the behavioral, technical and resume-related questions can be anticipated. So prepare yourself for potential questions long before you enter an interview room. 15 | 16 | ## Listen carefully 17 | - You should be an active listener in interviews so that you understand the problems well before you attempt to answer them. If any aspect of a problem is not clear to you, politely ask for clarification. If the problem is more than a couple of sentences, jot down the key words to help you remember all the information. For complex problems, interviewers often give away some clues when they explain the problem. Even the assumptions they give may include some information as to how to approach the problem. So listen carefully and make sure you get the necessary information. 18 | 19 | ## Speak your mind 20 | - When you analyze a problem and explore different ways to solve it, never do it silently. Clearly demonstrate your analysis and write down the important steps involved if necessary. This conveys your intelligence to the interviewer and shows that you are methodical and thorough.
If you go astray, the interaction will also give your interviewer the opportunity to correct the course and provide you with some hints. 21 | - Speaking your mind does not mean explaining every tiny detail. If some conclusions are obvious to you, simply state the conclusion without the trivial details. More often than not, the interviewer uses a problem to test a specific concept/approach. You should focus on demonstrating your understanding of the key concept/approach instead of dwelling on less relevant details. 22 | 23 | ## Make reasonable assumptions 24 | - In real job settings, you are unlikely to have all the necessary information or data you'd prefer to have before you build a model and make a decision. In interviews, interviewers may not give you all the necessary assumptions either. So it is up to you to make reasonable assumptions. The keyword here is reasonable. Explain your assumptions to the interviewer so that you will get immediate feedback. To solve quantitative problems, it is crucial that you can quickly make reasonable assumptions and design appropriate frameworks to solve problems based on the assumptions. 25 | 26 | - We are now ready to review basic concepts in quantitative finance subject areas and have fun solving real-world interview problems!
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.2.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.2.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.2.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.3.png 
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/2.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/2.4.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.2.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.1.png 
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.3.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.3.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.4.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.4.2.png 
-------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.4.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/4.5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/4.5.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/Table4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/Table4.1.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/Table4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/Table4.2.png -------------------------------------------------------------------------------- /probability/complete_practical_guide_to_quant/images/Table4.3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/complete_practical_guide_to_quant/images/Table4.3.png -------------------------------------------------------------------------------- /probability/images/properties_of_random_variables.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/QuintessaQ/CS_interview_cheatsheet/5d5c9a6d7cf0224f290af27855e49ce2be4c466e/probability/images/properties_of_random_variables.png -------------------------------------------------------------------------------- /quant_trader/info.md: -------------------------------------------------------------------------------- 1 | ## finance 2 | 3 | ## links 4 | https://www.quantstart.com/articles/Self-Study-Plan-for-Becoming-a-Quantitative-Trader-Part-I/ 5 | https://www.investopedia.com/options-basics-tutorial-4583012 -------------------------------------------------------------------------------- /system_design/Grokking the system design interview.md: -------------------------------------------------------------------------------- 1 | https://www.educative.io/courses/grokking-the-system-design-interview?affiliate_id=5749180081373184/ 2 | 3 | # names of large numbers 4 | - 1 KB = 1024 bytes ≈ 10^3 bytes | kilobyte 5 | - 1 MB = 1024 KB ≈ 10^6 bytes | megabyte 6 | - 1 GB ≈ 10^9 bytes | gigabyte 7 | - 1 TB ≈ 10^12 bytes | terabyte 8 | - 1 PB ≈ 10^15 bytes | petabyte 9 | 10 | # Interview Process 11 | - Scope the problem 12 | - Don’t make assumptions. 13 | - Ask clarifying questions to understand the constraints and use cases.
14 | - Steps 15 | - Requirements clarifications 16 | - System interface definition 17 | - Sketch up an abstract design 18 | - Building blocks of the system 19 | - Relationships between them 20 | - Steps 21 | - Back-of-the-envelope estimation 22 | - Defining data model 23 | - High-level design 24 | - Identify and address the bottlenecks 25 | - Use the fundamental principles of scalable system design 26 | - Steps 27 | - Detailed design 28 | - Identifying and resolving bottlenecks 29 | 30 | 31 | -------------------------------------------------------------------------------- /system_design/System Design.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | - start by asking you to design a system that performs a given task 3 | ### examples 4 | - Design a URL shortening service like bit.ly. 5 | - How would you implement the Google search? 6 | - Design a client-server application which allows people to play chess with one another. 7 | - How would you store the relations in a social network like Facebook and implement a feature where one user receives notifications when their friends like the same things as they do? 8 | 9 | ## Step1: System Design Process 10 | - use cases 11 | - e.g. Design a URL shortening service like bit.ly. 12 | - shortening: take an url -> return shorter url 13 | - redirection: take a short url -> redirect to the original url 14 | - custom url: let user custom their short url 15 | - analytics: allow people to look at usage statistics of the url 16 | - automatic link expiration 17 | - manual link removal: remove a short url used before 18 | - UI vs API 19 | - constraints 20 | - usage per second: e.g. assume not in top3 url service but in top10 21 | - start estimating from usage per month 22 | 23 | ## Step2: Abstract design 24 | - draw a simple diagram of your ideas 25 | - e.g. 
url shortening 26 | - application service layer 27 | - shortening service 28 | - generate new hash 29 | - check if it's in data storage 30 | - if not, generate new mapping 31 | - if yes, keep generating until an unused one is found 32 | - redirection service 33 | - retrieve the value given the hash 34 | - data storage layer, keeps track of the hash to url mapping 35 | - act like a big hash table 36 | - stores new mapping 37 | - retrieves value given a key 38 | - hashed_url = convert_to_base_62(md5(original_url + random_salt))[:6] 39 | 40 | ## Step3: Understanding bottleneck 41 | - traffic is probably not going to be hard; the data is more interesting 42 | 43 | ## Step4: Scalability 44 | - ideas 45 | - Vertical scaling 46 | - Horizontal scaling 47 | - Caching 48 | - Load balancing 49 | - Database replication 50 | - Database partitioning 51 | - clones 52 | - every server contains exactly the same codebase and does not store any user-related data, like sessions or profile pictures, on local disk or memory. 53 | - Sessions need to be stored in a centralized data store which is accessible to all your application servers. 54 | - a code change is sent to all your servers without one server still serving old code, since every server serves the same codebase 55 | - servers can now horizontally scale and you can already serve thousands of concurrent requests 56 | - database 57 | - you can stay with MySQL and use it like a NoSQL database 58 | - or you can switch to an easier-to-scale NoSQL database like MongoDB or CouchDB, using NoSQL instead of scaling a relational database 59 | - cache 60 | - A cache is a simple key-value store and it should reside as a buffering layer between your application and your data storage. 61 | - Whenever your application has to read data it should at first try to retrieve the data from your cache.
62 | - only if it’s not in the cache should it then try to get the data from the main data source 63 | - Cached Database Queries 64 | - A hashed version of your query is the cache key 65 | - issues 66 | - expiration: it is hard to delete a cached result when you cache a complex query 67 | - When one piece of data changes (for example a table cell) you need to delete all cached queries that may include that table cell. 68 | - Cached Objects 69 | - store the complete instance of the class or the assembled dataset in the cache 70 | - easily get rid of the object whenever something changes; this makes the overall operation of your code faster and more logical. 71 | - asynchronism 72 | - Async #1 73 | - doing the time-consuming work in advance and serving the finished work with a low request time. 74 | - Async #2 75 | - start the task when the customer is in the bakery and tell him to come back the next day. For a web service, that means handling tasks asynchronously. 76 | - A user comes to your website and starts a very compute-intensive task which would take several minutes to finish. So the frontend of your website sends a job onto a job queue and immediately signals back to the user: your job is in progress, please continue to browse the page 77 | 78 | 79 | 80 | ## Topics 81 | ### Concurrency 82 | - Do you understand threads, deadlock, and starvation? Do you know how to parallelize algorithms? Do you understand consistency and coherence? 83 | 84 | ### Networking 85 | - Do you roughly understand IPC and TCP/IP? Do you know the difference between throughput and latency, and when each is the relevant factor? 86 | 87 | ### Abstraction 88 | - You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?
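The read-through cache described in the cache notes above ("try the cache first, only on a miss go to the main data source") can be sketched in a few lines of Python. This is a minimal in-memory illustration; the `ReadThroughCache` class, its TTL handling, and the `db` dict are hypothetical stand-ins for a real cache layer such as Memcached or Redis in front of a real database:

```python
import time

class ReadThroughCache:
    """Minimal read-through cache: try the cache first, fall back to the
    main data source on a miss, and store the result for later reads."""

    def __init__(self, ttl_seconds=60):
        self._store = {}          # key -> (value, expiry timestamp)
        self._ttl = ttl_seconds

    def get(self, key, load_from_source):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.time() < expires_at:
                return value              # cache hit
            del self._store[key]          # entry expired, drop it
        value = load_from_source(key)     # cache miss: hit the database
        self._store[key] = (value, time.time() + self._ttl)
        return value

    def invalidate(self, key):
        # "cached objects" advice: drop the whole object when its data changes
        self._store.pop(key, None)

# hypothetical data source standing in for the real database
db = {"user:1": {"name": "alice"}}
cache = ReadThroughCache(ttl_seconds=30)
profile = cache.get("user:1", lambda k: db[k])
```

The `invalidate` method mirrors the "cached objects" point above: when the underlying data changes, evict the whole cached object instead of hunting down every cached query result that might contain it.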
89 | 90 | ### Real-World Performance 91 | - You should be familiar with the speed of everything your computer can do, including the relative performance of RAM, disk, SSD and your network. 92 | 93 | ### Estimation 94 | - Estimation, especially in the form of a back-of-the-envelope calculation, is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write. 95 | 96 | ### Availability and Reliability 97 | - Are you thinking about how things can fail, especially in a distributed environment? Do you know how to design a system to cope with network failures? Do you understand durability? 98 | 99 | 100 | ## random links 101 | - https://www.palantir.com/2011/10/how-to-rock-a-systems-design-interview/ 102 | 103 | -------------------------------------------------------------------------------- /system_design/design instagram: -------------------------------------------------------------------------------- 1 | 2 | # Design instagram 3 | ## what is instagram 4 | - For the sake of this exercise, we plan to design a simpler version of Instagram, where a user can share photos and can also follow other users. The ‘News Feed’ for each user will consist of top photos of all the people the user follows. 5 | 6 | ## Requirements and Goals of the System 7 | - Functional Requirements 8 | - Users should be able to upload/download/view photos. 9 | - Users can perform searches based on photo/video titles. 10 | - Users can follow other users. 11 | - The system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows. 12 | - Non-functional Requirements 13 | - Our service needs to be highly available. 14 | - The acceptable latency of the system is 200ms for News Feed generation. 15 | - Consistency can take a hit (in the interest of availability); if a user doesn’t see a photo for a while, it should be fine.
16 | - The system should be highly reliable; any uploaded photo or video should never be lost. 17 | 18 | ## Some Design Considerations 19 | - The system would be read-heavy, so we will focus on building a system that can retrieve photos quickly. 20 | - Practically, users can upload as many photos as they like. Efficient management of storage should be a crucial factor while designing this system. 21 | - Low latency is expected while viewing photos. 22 | - Data should be 100% reliable. If a user uploads a photo, the system will guarantee that it will never be lost. 23 | 24 | ## Capacity Estimation and Constraints 25 | - Let’s assume we have 500M total users, with 1M daily active users. 26 | - 2M new photos every day, 23 new photos every second. 27 | - Average photo file size => 200KB 28 | - Total space required for 1 day of photos 29 | - 2M * 200KB => 400 GB 30 | - Total space required for 10 years: 31 | - 400GB * 365 (days a year) * 10 (years) ~= 1425TB 32 | 33 | ## High Level System Design 34 | - At a high-level, we need to support two scenarios, one to upload photos and the other to view/search photos. 35 | - Our service would need some object storage servers to store photos and also some database servers to store metadata information about the photos. 36 | 37 | ## Database Schema 38 | - We need to store data about users, their uploaded photos, and people they follow. - Photo table will store all data related to a photo; we need to have an index on (PhotoID, CreationDate) since we need to fetch recent photos first. 39 | - photo table 40 | ``` 41 | photoID: int - key 42 | userID : int 43 | photo_path: char[256] 44 | photo_latitude: int 45 | photo_longitude: int 46 | creation_date: date 47 | 48 | ``` 49 | - user table 50 | ``` 51 | userID: int - key 52 | name: char[20] 53 | email: char[30] 54 | creation_date: date 55 | last_login: date 56 | user_follow: users 57 | ``` 58 | - We need to store relationships between users and photos, to know who owns which photo. 
We also need to store the list of people a user follows. For both of these tables, we can use a wide-column datastore like Cassandra. For the ‘UserPhoto’ table, the ‘key’ would be ‘UserID’ and the ‘value’ would be the list of ‘PhotoIDs’ the user owns, stored in different columns. We will have a similar scheme for the ‘UserFollow’ table. 59 | - Cassandra, like key-value stores in general, always maintains a certain number of replicas to offer reliability. Also, in such data stores, deletes don’t get applied instantly; data is retained for a certain number of days (to support undeleting) before getting removed from the system permanently. 60 | 61 | ## Data Size Estimation 62 | - Let’s estimate how much data will be going into each table and how much total storage we will need for 10 years. 63 | - **User**: Assuming each “int” and “dateTime” is four bytes, each row in the User table will be 68 bytes: 64 | - UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes 65 | - If we have 500 million users, we will need 32GB of total storage. 66 | - 500 million * 68 ~= 32GB 67 | 68 | - **Photo**: Each row in the Photo table will be 284 bytes: 69 | - PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes 70 | - If 2M new photos get uploaded every day, we will need 0.5GB of storage for one day: 71 | - 2M * 284 bytes ~= 0.5GB per day 72 | - For 10 years we will need 1.88TB of storage. 73 | 74 | - **UserFollow**: Each row in the UserFollow table will consist of 8 bytes. If we have 500 million users and on average each user follows 500 users.
We would need 1.82TB of storage for the UserFollow table: 75 | - 500 million users * 500 followers * 8 bytes ~= 1.82TB 76 | - Total space required for all tables for 10 years will be 3.7TB: 77 | 78 | 32GB + 1.88TB + 1.82TB ~= 3.7TB 79 | 80 | ## component design 81 | - Photo uploads (or writes) can be slow as they have to go to the disk, whereas reads will be faster, especially if they are being served from cache. 82 | - Uploading users can consume all the available connections, as uploading is a slow process. This means that ‘reads’ cannot be served if the system gets busy with all the write requests. Before designing our system, we should keep in mind that web servers have a connection limit. 83 | - If we assume that a web server can have a maximum of 500 connections at any time, then it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck, we can split reads and writes into separate services. We will have dedicated servers for reads and different servers for writes to ensure that uploads don’t hog the system. 84 | - Separating photos’ read and write requests will also allow us to scale and optimize each of these operations independently. 85 | 86 | ## Reliability and Redundancy 87 | - Losing files is not an option for our service. Therefore, we will store multiple copies of each file so that if one storage server dies we can retrieve the photo from the other copy present on a different storage server. 88 | - This same principle also applies to other components of the system. If we want to have high availability of the system, we need to have multiple replicas of services running in the system, so that if a few services go down, the system still remains available and running. Redundancy removes the single point of failure in the system.
89 | - If only one instance of a service is required to run at any point, we can run a redundant secondary copy of the service that is not serving any traffic, but it can take over when the primary has a problem. 90 | - Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production and one fails or degrades, the system can failover to the healthy copy. Failover can happen automatically or require manual intervention. 91 | 92 | ## data sharding 93 | ### Partitioning based on UserID 94 | - we’ll find the shard number by UserID % 10 and then store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID. 95 | - How can we generate PhotoIDs? 96 | - Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we will append the ShardID to each PhotoID, it will be unique throughout our system. 97 | - issues 98 | - How would we handle hot users? Several people follow such hot users and a lot of other people see any photo they upload. 99 | - Some users will have a lot of photos compared to others, thus making a non-uniform distribution of storage. 100 | - What if we cannot store all pictures of a user on one shard? If we distribute photos of a user onto multiple shards, will it cause higher latencies? 101 | - Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving a high load, etc. 102 | 103 | ### Partitioning based on PhotoID 104 | - If we can generate unique PhotoIDs first and then find a shard number through “PhotoID % 10”, the above problems will be solved. We would not need to append the ShardID to the PhotoID in this case, as the PhotoID will itself be unique throughout the system. 105 | - How can we generate PhotoIDs?
106 | - Here we cannot have an auto-incrementing sequence in each shard to define PhotoID because we need to know PhotoID first to find the shard where it will be stored. One solution could be that we dedicate a separate database instance to generate auto-incrementing IDs. If our PhotoID can fit into 64 bits, we can define a table containing only a 64 bit ID field. So whenever we would like to add a photo in our system, we can insert a new row in this table and take that ID to be our PhotoID of the new photo. 107 | - Wouldn’t this key generating DB be a single point of failure? 108 | - Yes, it would be. A workaround for that could be defining two such databases with one generating even numbered IDs and the other odd numbered. 109 | - How can we plan for the future growth of our system? 110 | - We can have a large number of logical partitions to accommodate future data growth, such that in the beginning, multiple logical partitions reside on a single physical database server. Since each database server can have multiple database instances on it, we can have separate databases for each logical partition on any server. So whenever we feel that a particular database server has a lot of data, we can migrate some logical partitions from it to another server. We can maintain a config file (or a separate database) that can map our logical partitions to database servers; this will enable us to move partitions around easily. Whenever we want to move a partition, we only have to update the config file to announce the change. 111 | 112 | ## Ranking and News Feed Generation 113 | - What are the different approaches for sending News Feed contents to the users? 114 | - Pull 115 | - Clients can pull the News Feed contents from the server on a regular basis or manually whenever they need it. 
Possible problems with this approach are a) new data might not be shown to the users until clients issue a pull request, and b) most of the time pull requests will result in an empty response if there is no new data. 116 | - Push 117 | - Servers can push new data to the users as soon as it is available. To efficiently manage this, users have to maintain a Long Poll request with the server for receiving the updates. A possible problem with this approach arises with a user who follows a lot of people, or a celebrity user who has millions of followers; in these cases the server has to push updates quite frequently. 118 | - Hybrid 119 | - We can adopt a hybrid approach. We can move all the users who have a high number of follows to a pull-based model and only push data to those users who have a few hundred (or thousand) follows. Another approach could be that the server pushes updates to all the users at no more than a certain frequency, letting users with a lot of follows/updates regularly pull data. 120 | 121 | ## News Feed Creation with Sharded Data 122 | - One of the most important requirements to create the News Feed for any given user is to fetch the latest photos from all people the user follows. For this, we need to have a mechanism to sort photos by their time of creation. To efficiently do this, we can make photo creation time part of the PhotoID. As we will have a primary index on PhotoID, it will be quite quick to find the latest PhotoIDs. 123 | - We can use epoch time for this. Let’s say our PhotoID will have two parts; the first part will represent epoch time and the second part will be an auto-incrementing sequence. So to make a new PhotoID, we can take the current epoch time and append an auto-incrementing ID from our key-generating DB. We can figure out the shard number from this PhotoID (PhotoID % 10) and store the photo there. 124 | - What could be the size of our PhotoID?
Let’s say our epoch time starts today; how many bits would we need to store the number of seconds for the next 50 years? 125 | - 86400 sec/day * 365 (days a year) * 50 (years) => 1.6 billion seconds 126 | - We would need 31 bits to store this number. Since, on average, we expect 23 new photos per second, we can allocate 9 bits to store the auto-incremented sequence. So every second we can store (2^9 => 512) new photos. We can reset our auto-incrementing sequence every second. -------------------------------------------------------------------------------- /system_design/design url shortening: -------------------------------------------------------------------------------- 1 | # Designing a URL Shortening service like TinyURL 2 | ## Why do we need URL shortening? 3 | - URL shortening is used for optimizing links across devices, tracking individual links to analyze audience and campaign performance, and hiding affiliated original URLs. 4 | 5 | ## Requirements and Goals of the System 6 | ### Functional Requirements 7 | - Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications. 8 | - When users access a short link, our service should redirect them to the original link. 9 | - Users should optionally be able to pick a custom short link for their URL. 10 | - Links will expire after a standard default timespan. Users should be able to specify the expiration time. 11 | 12 | ### Non-Functional Requirements 13 | - The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing. 14 | - URL redirection should happen in real-time with minimal latency. 15 | - Shortened links should not be guessable (not predictable). 16 | 17 | ### Extended Requirements 18 | - Analytics; e.g., how many times did a redirection happen?
19 | - Our service should also be accessible through REST APIs by other services. 20 | 21 | ## Capacity Estimation and Constraints 22 | - Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between reads and writes. 23 | 24 | ### Traffic estimates 25 | - Assuming we will have 500M new URL shortenings per month, with a 100:1 read/write ratio, we can expect 50B redirections during the same period: 26 | 100 * 500M => 50B 27 | - What would be the Queries Per Second (QPS) for our system? New URL shortenings per second: 28 | 500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s 29 | - Considering the 100:1 read/write ratio, URL redirections per second will be: 30 | 100 * 200 URLs/s = 20K/s 31 | 32 | ### Storage estimates 33 | - Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion: 34 | 500 million * 5 years * 12 months = 30 billion 35 | 36 | - Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate; we will dig into it later). We will need 15TB of total storage: 37 | 30 billion * 500 bytes = 15 TB 38 | 39 | ### Memory estimates 40 | - If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. 41 | - Since we have 20K requests per second, we will be getting 1.7 billion requests per day: 42 | 20K * 3600 seconds * 24 hours = ~1.7 billion 43 | - To cache 20% of these requests, we will need 170GB of memory. 44 | 0.2 * 1.7 billion * 500 bytes = ~170GB 45 | - One thing to note here is that since there will be a lot of duplicate requests (of the same URL), our actual memory usage will be less than 170GB.
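The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. A sketch in Python, using only the assumptions stated in this section (500M new URLs/month, 100:1 read/write ratio, 500-byte objects, 5-year retention, 80-20 caching):

```python
# Assumptions from the estimates above
new_urls_per_month = 500_000_000
read_write_ratio = 100
object_size_bytes = 500
retention_years = 5

# Traffic
writes_per_sec = new_urls_per_month / (30 * 24 * 3600)       # ~200 URLs/s
reads_per_sec = read_write_ratio * writes_per_sec            # ~20K/s

# Storage over the full retention period
total_objects = new_urls_per_month * 12 * retention_years    # 30 billion
total_storage_tb = total_objects * object_size_bytes / 1e12  # 15 TB

# Memory: cache 20% of daily requests, per the 80-20 rule
requests_per_day = reads_per_sec * 3600 * 24                 # ~1.7 billion
cache_gb = 0.2 * requests_per_day * object_size_bytes / 1e9  # ~170 GB
```

Working the estimate as explicit arithmetic like this makes it easy to re-run the whole chain when one assumption (say, the read/write ratio) changes during the interview.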
46 | 47 | ## System APIs 48 | - The following could be the definitions of the APIs for creating and deleting URLs: 49 | ``` 50 | createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None) 51 | ``` 52 | - Parameters 53 | - api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota. 54 | - original_url (string): Original URL to be shortened. 55 | - custom_alias (string): Optional custom key for the URL. 56 | - user_name (string): Optional user name to be used in the encoding. 57 | - expire_date (string): Optional expiration date for the shortened URL. 58 | - Returns: (string) 59 | - A successful insertion returns the shortened URL; otherwise, it returns an error code. 60 | - ``deleteURL(api_dev_key, url_key)`` 61 | Where “url_key” is a string representing the shortened URL to be deleted. A successful deletion returns ‘URL Removed’. 62 | - How do we detect and prevent abuse? 63 | - A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key). 64 | 65 | ## Database Design 66 | - A few observations about the nature of the data we will store: 67 | - We need to store billions of records. 68 | - Each object we store is small (less than 1K). 69 | - There are no relationships between records—other than storing which user created a URL. 70 | - Our service is read-heavy. 71 | - Database Schema 72 | - We would need two tables: one for storing information about the URL mappings, and one for the data of the user who created the short link.
73 | - url mapping: hash char[16] 74 | - original_url char[512] 75 | - creation_date 76 | - expiration_date 77 | - user_id 78 | - user info 79 | - name 80 | - email 81 | - register_date 82 | - last_login_time 83 | - What kind of database should we use? 84 | - Since we anticipate storing billions of rows, and we don’t need relationships between objects – a NoSQL store like DynamoDB, Cassandra or Riak is a better choice. 85 | - A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details. 86 | 87 | ## Basic System Design and Algorithm 88 | ### encoding actual url 89 | - We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. 90 | - MD5 91 | - The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. 92 | - One basic requirement of any cryptographic hash function is that it should be computationally infeasible to find two distinct messages that hash to the same value. MD5 fails this requirement catastrophically; such collisions can be found in seconds on an ordinary home computer. 93 | - This encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]), and if we add ‘+’ and ‘/’ we can use Base64 encoding. 94 | - Using base64 encoding, a 6-letter key would result in 64^6 = ~68.7 billion possible strings 95 | - Using base64 encoding, an 8-letter key would result in 64^8 = ~281 trillion possible strings 96 | - If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string of more than 21 characters (since each base64 character encodes 6 bits of the hash value). 97 | - Now we only have space for 8 characters per short key; how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication; to resolve that, we can choose some other characters out of the encoding string or swap some characters.
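A minimal sketch of this hash-then-truncate scheme (the function name and the 6-character truncation are illustrative; the duplication caveat above still applies):

```python
import base64
import hashlib

def short_key(original_url: str, length: int = 6) -> str:
    """Hash the URL with MD5, base64-encode the 128-bit digest,
    and keep only the first `length` characters as the short key."""
    digest = hashlib.md5(original_url.encode("utf-8")).digest()  # 16 bytes
    encoded = base64.b64encode(digest).decode("ascii")           # 24 chars (22 + padding)
    return encoded[:length]

# Truncation means distinct URLs can collide, so a real service must
# still check the database for the key and retry/tweak on duplicates.
print(short_key("http://www.educative.io/distributed.php?id=design"))
```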
98 | - issues 99 | - If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable. 100 | - What if parts of the URL are URL-encoded? e.g., http://www.educative.io/distributed.php?id=design, and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except for the URL encoding. 101 | - workarounds 102 | - We can append an increasing sequence number to each input URL to make it unique, and then generate a hash of it. We don’t need to store this sequence number in the database, though. A possible problem with this approach is the ever-increasing sequence number: can it overflow? Appending an increasing sequence number will also impact the performance of the service. 103 | - Another solution could be to append the user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one. 104 | - ![Request flow for shortening of a URL](images/shortening.png) 105 | ### Generating keys offline 106 | - We can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast. Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. 107 | - can concurrency cause problems? 108 | - As soon as a key is used, it should be marked in the database to ensure it doesn’t get reused. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. 109 | - For simplicity, as soon as KGS loads some keys in memory, it can move them to the used-keys table. This ensures each server gets unique keys.
If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys – which could be acceptable, given the huge number of keys we have. 110 | - KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server. 111 | - Can each app server cache some keys from key-DB? 112 | - Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This can be acceptable since we have 68B unique six-letter keys. 113 | - How would we perform a key lookup? 114 | - We can look up the key in our database to get the full URL. If it’s present in the DB, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the response. If that key is not present in our system, issue an “HTTP 404 Not Found” status or redirect the user back to the homepage. 115 | - It is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema). 116 | 117 | ## Data Partitioning and Replication 118 | - Range-Based Partitioning 119 | - We can store URLs in separate partitions based on the first letter of the hash key. Hence we save all the URLs starting with the letter ‘A’ (and ‘a’) in one partition, save those that start with the letter ‘B’ in another partition, and so on. This approach is called range-based partitioning. We can even combine certain less frequently occurring letters into one database partition. We should come up with a static partitioning scheme so that we can always store/find a URL in a predictable manner. 120 | - The main problem with this approach is that it can lead to unbalanced DB servers.
For example, we decide to put all URLs starting with letter ‘E’ into a DB partition, but later we realize that we have too many URLs that start with the letter ‘E’. 121 | 122 | - Hash-Based Partitioning 123 | - In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object. 124 | - Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any ‘key’ to a number between [1…256]), and this number would represent the partition in which we store our object. 125 | 126 | ## Cache 127 | - How much cache memory should we have? 128 | - We can start with 20% of daily traffic and, based on clients’ usage pattern, we can adjust how many cache servers we need. As estimated above, we need 170GB memory to cache 20% of daily traffic. Since a modern-day server can have 256GB memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs. 129 | 130 | - Which cache eviction policy would best fit our needs? 131 | - When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. We can use a Linked Hash Map or a similar data structure to store our URLs and Hashes, which will also keep track of the URLs that have been accessed recently. 132 | 133 | - How can each cache replica be updated? 134 | - Whenever there is a cache miss, our servers would be hitting a backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it. 
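The “Linked Hash Map” mentioned above is exactly how a compact LRU cache is usually implemented; a toy sketch (the class name and capacity are illustrative, not part of the design):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache for short-key -> original-URL lookups.
    OrderedDict plays the role of the linked hash map: it remembers
    usage order, so the least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                      # cache miss -> fall back to the DB
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, url):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = url
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(2)
cache.put("abc123", "https://example.com/a")
cache.put("def456", "https://example.com/b")
cache.get("abc123")                          # touch: "abc123" is now most recent
cache.put("ghi789", "https://example.com/c") # evicts "def456"
```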
135 | - ![Request flow for accessing a shortened URL](images/accessing.png) 136 | 137 | ## Load Balancer (LB) 138 | - We can add a load-balancing layer at three places in our system: 139 | - Between Clients and Application servers 140 | - Between Application Servers and database servers 141 | - Between Application Servers and Cache servers 142 | - Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and will stop sending any traffic to it. 143 | - A problem with Round Robin LB is that it doesn’t take the server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries the backend server about its load and adjusts traffic based on that. 144 | 145 | ## Purging or DB cleanup 146 | - Should entries stick around forever, or should they be purged? If a user-specified expiration time is reached, what should happen to the link? 147 | - If we choose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will make sure that only expired links are deleted; some expired links may live longer, but they will never be returned to users. 148 | - Whenever a user tries to access an expired link, we can delete the link and return an error to the user. 149 | - A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low. 150 | - We can have a default expiration time for each link (e.g., two years).
151 | - After removing an expired link, we can put the key back in the key-DB to be reused. 152 | - Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever. 153 | 154 | ## Telemetry 155 | - How many times has a short URL been used, and what were the user locations? How would we store these statistics? If it is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests? 156 | - Some statistics worth tracking: country of the visitor, date and time of access, web page that referred the click, and browser or platform from where the page was accessed. 157 | 158 | 159 | ## Security and Permissions 160 | - Can users create private URLs or allow a particular set of users to access a URL? 161 | - We can store the permission level (public/private) with each URL in the database. We can also create a separate table to store UserIDs that have permission to see a specific URL. If a user does not have permission and tries to access a URL, we can send an error (HTTP 401) back. 162 | - Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the ‘Hash’ (or the KGS-generated ‘key’). The columns will store the UserIDs of those users that have permission to see the URL. -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/basics.md: -------------------------------------------------------------------------------- 1 | Basics 2 | ==== 3 | 4 | # text 5 | Whenever we are designing a large system, we need to consider a few things: 6 | 7 | What are the different architectural pieces that can be used? 8 | How do these pieces work with each other? 9 | How can we best utilize these pieces: what are the right tradeoffs?
10 | Investing in scaling before it is needed is generally not a smart business proposition; however, some forethought into the design can save valuable time and resources in the future. In the following chapters, we will try to define some of the core building blocks of scalable systems. Familiarity with these concepts will greatly help in understanding distributed system concepts. In the next section, we will go through Consistent Hashing, CAP Theorem, Load Balancing, Caching, Data Partitioning, Indexes, Proxies, Queues, Replication, and choosing between SQL vs. NoSQL. 11 | 12 | Let’s start with the Key Characteristics of Distributed Systems. 13 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/caching.md: -------------------------------------------------------------------------------- 1 | Caching 2 | ==== 3 | # keypoints 4 | - Take advantage of the locality of reference principle: recently requested data is likely to be requested again. 5 | - Caches exist at all levels in the architecture, but are often found at the level nearest to the front end. 6 | 7 | ## Application server cache 8 | - Cache placed on a request layer node. 9 | - When a request layer node is expanded to many nodes 10 | - Load balancer randomly distributes requests across the nodes. 11 | - The same request can go to different nodes. 12 | - Increases cache misses. 13 | - Solutions: 14 | - Global caches 15 | - Distributed caches 16 | 17 | ## Distributed cache 18 | - Each request layer node owns part of the cached data. 19 | - Entire cache is divided up using a consistent hashing function. 20 | - Pro 21 | - Cache space can be increased easily by adding more nodes to the request pool. 22 | - Con 23 | - A missing node leads to the loss of its part of the cache. 24 | 25 | ## Global cache 26 | - A server or file store that is faster than the original store, and accessible by all request layer nodes. 27 | - Two common forms 28 | - Cache server handles cache miss.
29 | - Used by most applications. 30 | - Request nodes handle cache miss. 31 | - Have a large percentage of the hot data set in the cache. 32 | - An architecture where the files stored in the cache are static and shouldn’t be evicted. 33 | - The application logic understands the eviction strategy or hot spots better than the cache. 34 | 35 | ## Content delivery network (CDN) 36 | - For sites serving large amounts of static media. 37 | - Process 38 | - A request first asks the CDN for a piece of static media. 39 | - CDN serves that content if it has it locally available. 40 | - If content isn’t available, CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user. 41 | - If the system is not large enough for a CDN, it can be built like this: 42 | - Serve static media off a separate subdomain using a lightweight HTTP server (e.g. Nginx). 43 | - Cut over the DNS from this subdomain to a CDN later. 44 | 45 | ## Cache invalidation 46 | - Keep cache coherent with the source of truth. Invalidate cache when the source of truth has changed. 47 | - Write-through cache 48 | - Data is written into the cache and permanent storage at the same time. 49 | - Pro 50 | - Fast retrieval, complete data consistency, robust to system disruptions. 51 | - Con 52 | - Higher latency for write operations. 53 | - Write-around cache 54 | - Data is written to permanent storage, not cache. 55 | - Pro 56 | - Avoids flooding the cache with written data that is never re-read. 57 | - Con 58 | - A query for recently written data creates a cache miss and higher latency. 59 | - Write-back cache 60 | - Data is only written to cache. 61 | - Write to the permanent storage is done later on. 62 | - Pro 63 | - Low latency, high throughput for write-intensive applications. 64 | - Con 65 | - Risk of data loss in case of system disruptions.
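The three write policies above differ only in when the permanent store sees the write; a minimal sketch (the dicts stand in for a real cache and storage backend):

```python
class WriteThroughCache:
    """Write goes to the cache and permanent storage synchronously."""
    def __init__(self, store):
        self.store, self.cache = store, {}
    def write(self, key, value):
        self.cache[key] = value
        self.store[key] = value          # the same request pays for both writes

class WriteAroundCache:
    """Write bypasses the cache; the first read after a write is a miss."""
    def __init__(self, store):
        self.store, self.cache = store, {}
    def write(self, key, value):
        self.store[key] = value          # cache not flooded by cold writes
    def read(self, key):
        if key not in self.cache:        # miss -> fetch from permanent storage
            self.cache[key] = self.store[key]
        return self.cache[key]

class WriteBackCache:
    """Write goes to the cache only; storage is updated on a later flush."""
    def __init__(self, store):
        self.store, self.cache, self.dirty = store, {}, set()
    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)              # lost if the node dies before flushing
    def flush(self):
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

The trade-offs in the list map directly onto the code: write-through pays double latency per write, write-around makes the next read a miss, and write-back risks losing the `dirty` set on a crash.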
66 | 67 | ## Cache eviction policies 68 | - FIFO: first in first out 69 | - LIFO: last in first out 70 | - LRU: least recently used 71 | - MRU: most recently used 72 | - LFU: least frequently used 73 | - RR: random replacement 74 | 75 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/cap_theorem.md: -------------------------------------------------------------------------------- 1 | [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) 2 | ==== 3 | - it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP) 4 | - Consistency 5 | - All nodes see the same data at the same time 6 | - achieved by updating several nodes before further reads 7 | - every read receives the most recent write or an error 8 | - Availability 9 | - every request receives a response on success/failure 10 | - achieved by replicating the data across different servers 11 | - Partition tolerance 12 | - system continues to work despite message loss or partial failure 13 | - can sustain any amount of network failure 14 | - the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes 15 | - CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability 16 | - CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made. 17 | - [ACID](https://en.wikipedia.org/wiki/ACID) databases choose consistency over availability. 18 | - [BASE](https://en.wikipedia.org/wiki/Eventual_consistency) systems choose availability over consistency. 
19 | 20 | # text 21 | - The CAP theorem states that it is impossible for a distributed software system to simultaneously provide more than two out of three of the following guarantees (CAP): Consistency, Availability, and Partition tolerance. When we design a distributed system, trading off among CAP is almost the first thing we want to consider. The CAP theorem says that while designing a distributed system we can pick only two of the following three options: 22 | - Consistency 23 | - All nodes see the same data at the same time. Consistency is achieved by updating several nodes before allowing further reads. 24 | - Availability 25 | - Every request gets a response on success/failure. Availability is achieved by replicating the data across different servers. 26 | - Partition tolerance 27 | - The system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages. 28 | ![cap](../images/cap.png) 29 | - We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failures. We can only build a system that has any two of these three properties. This is because, to be consistent, all nodes should see the same set of updates in the same order. But if the network suffers a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available.
30 | 31 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/consistent_hashing.md: -------------------------------------------------------------------------------- 1 | Consistent Hashing 2 | ==== 3 | # keypoints 4 | 5 | - Distributed Hash Table (DHT) 6 | - index = hash_function(key) 7 | - distributed caching system 8 | - n cache servers, if index = key % n 9 | - problem 10 | - not horizontally scalable 11 | - when adding a new cache host, all existing mappings are broken 12 | - may not be load balanced 13 | 14 | ## consistent hashing 15 | - minimize reorganization when nodes are added or removed 16 | - only k/n keys need to be remapped (k = number of keys, n = number of servers) 17 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/data_partitioning.md: -------------------------------------------------------------------------------- 1 | Data Partitioning 2 | ==== 3 | # keypoints 4 | - break up a big database (DB) into many smaller parts 5 | - after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines 6 | 7 | ## Partitioning Methods 8 | - Horizontal partitioning (range based partitioning, data sharding) 9 | - put different rows into different tables 10 | - e.g. 0-1k, 1k-2k, ... 11 | - problem 12 | - if range for partition not chosen carefully, could have unbalanced servers 13 | - Vertical Partitioning 14 | - store tables related to a specific feature in one server 15 | - e.g.
server1: insta pics, server2: user info..; 16 | - problem 17 | - if it keeps growing, it may be necessary to further partition a feature-specific DB across various servers 18 | - Directory Based Partitioning 19 | - create a lookup service which knows your current partitioning scheme 20 | - to find out where a particular data entity resides, query the directory server that holds the mapping between each tuple key and its DB server 21 | 22 | ## Partitioning Criteria 23 | - Key or Hash-based partitioning 24 | - apply a hash function to some key attributes of the entity we are storing -> partition number 25 | - e.g. ID % 100 if we have 100 partitions 26 | - should ensure uniform allocation 27 | - problem 28 | - adding new servers might require rehashing -> downtime for the service 29 | - List partitioning 30 | - each partition assigned a list of values 31 | - to insert a new record, find the partition with the corresponding key 32 | - Round-robin partitioning 33 | - i^th tuple assigned to partition i % n 34 | - Composite partitioning 35 | - combine the above schemes 36 | - e.g. list partitioning -> hash based partitioning 37 | - e.g. consistent hashing = hash + list partitioning 38 | - when a hash table is resized, only n/m keys need to be remapped on average, where n is the number of keys and m is the number of slots 39 | 40 | ## Common Problems of Data Partitioning 41 | - Joins and Denormalization 42 | - if a database is partitioned and spread across multiple machines, it is often not feasible to perform joins 43 | - workaround 44 | - denormalize the database so that queries that previously required joins can be performed from a single table 45 | - but denormalization leads to data inconsistency 46 | - Referential integrity 47 | - enforcing data integrity constraints in a partitioned database is difficult, e.g.
foreign keys 48 | - Rebalancing 49 | - reasons to change the partitioning scheme 50 | - data distribution not uniform 51 | - a lot of load on a partition 52 | - solution 53 | - create more DB partitions or rebalance existing partitions 54 | - will incur downtime 55 | - could use directory based partitioning 56 | 57 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/indexes.md: -------------------------------------------------------------------------------- 1 | Indexes 2 | ==== 3 | # keypoints 4 | - a data structure that can be perceived as a table of contents that points us to the location where actual data lives 5 | - Improve the performance of search queries. 6 | - Decrease the write performance because indices need to be updated. This performance degradation applies to all insert, update, and delete operations. 7 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/key_characteristics_of_distributed_systems.md: -------------------------------------------------------------------------------- 1 | Key Characteristics of Distributed Systems 2 | ==== 3 | 4 | # keypoints 5 | ## Scalability 6 | - The capability of a system to grow and manage increased demand. 7 | - A system that can continuously evolve to support a growing amount of work is scalable. 8 | - Horizontal scaling: by adding more servers into the pool of resources. 9 | - Vertical scaling: by adding more resources (CPU, RAM, storage, etc.) to an existing server. This approach comes with downtime and an upper limit. 10 | 11 | ## Reliability 12 | - Reliability is the probability that a system will keep delivering its service without failure over a given period. 13 | - A distributed system is reliable if it keeps delivering its service even when one or multiple components fail. 14 | - Reliability is achieved through redundancy of components and data (remove every single point of failure).
15 | 16 | ## Availability 17 | - Availability is the time a system remains operational to perform its required function in a specific period. 18 | - Measured by the percentage of time that a system remains operational under normal conditions. 19 | - A reliable system is available. 20 | - An available system is not necessarily reliable. 21 | - A system with a security hole is available when there is no security attack. 22 | 23 | ## Efficiency 24 | - Latency: response time, the delay to obtain the first piece of data. 25 | - Bandwidth: throughput, amount of data delivered in a given time. 26 | 27 | ## Serviceability / Manageability 28 | - Ease of operating and maintaining the system. 29 | - Simplicity and speed with which a system can be repaired or maintained. 30 | 31 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/load_balancing.md: -------------------------------------------------------------------------------- 1 | Load Balancing (LB) 2 | ==== 3 | # keypoints 4 | Helps scale horizontally across an ever-increasing number of servers.
5 | 6 | ## LB locations 7 | - Between user and web server 8 | - Between web servers and an internal platform layer (application servers, cache servers) 9 | - Between internal platform layer and database 10 | 11 | ## Algorithms 12 | - Least connection 13 | - Least response time 14 | - Least bandwidth 15 | - Round robin 16 | - Weighted round robin 17 | - IP hash 18 | 19 | ## Implementation 20 | - Smart clients 21 | - Hardware load balancers 22 | - Software load balancers 23 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/long_polling_websockets_serversent_events.md: -------------------------------------------------------------------------------- 1 | Long-Polling vs WebSockets vs Server-Sent Events 2 | ==== 3 | 4 | # keypoints 5 | - communication protocols 6 | - long-polling 7 | - WebSockets 8 | - Server-Sent Events 9 | - between a client like a web browser and a web server 10 | - sequence of events for a regular HTTP request 11 | - client opens a connection, requests data from server 12 | - server calculates response 13 | - server sends response back to the client 14 | 15 | ## Ajax Polling 16 | - client repeatedly polls/requests a server for data 17 | - If no data is available, an empty response is returned 18 | - steps 19 | - client opens a connection, requests data from the server using regular HTTP. 20 | - requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 21 | - server calculates the response and sends it back 22 | - client repeats the above three steps periodically 23 | - problem 24 | - client keeps asking the server for new data, a lot of responses are empty -> HTTP overhead 25 | 26 | ## HTTP Long-Polling 27 | - server pushes information to client whenever the data is available.
28 | - client requests as in normal polling, but expects that the server may not respond immediately 29 | - if server has no data available, then hold request instead of sending empty response until a timeout 30 | - once data available, full response sent 31 | - client immediately re-requests, so server almost always has a waiting request 32 | - client has to reconnect periodically after connection closed due to timeouts 33 | 34 | ## WebSockets 35 | - persistent connection between client and server 36 | - both parties can send data at any time 37 | - establishes WebSocket connection through WebSocket handshake 38 | - if it succeeds, client and server can exchange data 39 | - enables communication with low overheads 40 | - real-time data transfer 41 | 42 | ## Server-Sent Events (SSEs) 43 | - client establishes a persistent & long-term connection with the server 44 | - client requires another tech/protocol to send data to server 45 | - steps 46 | - client requests data using regular HTTP 47 | - requested webpage opens a connection to server 48 | - server sends data to client if new info available 49 | - best when real-time traffic needed 50 | - or server generates data in a loop 51 | 52 | # text 53 | - Long-Polling, WebSockets, and Server-Sent Events are popular communication protocols between a client like a web browser and a web server. First, let’s start with understanding what a standard HTTP web request looks like. The following is the sequence of events for a regular HTTP request: 54 | - The client opens a connection and requests data from the server. 55 | - The server calculates the response. 56 | - The server sends the response back to the client on the opened request. 57 | ![HTTP_protocol](../images/HTTP_protocol.png) 58 | 59 | ## Ajax Polling 60 | - Polling is a standard technique used by the vast majority of AJAX applications. The basic idea is that the client repeatedly polls (or requests) a server for data. The client makes a request and waits for the server to respond with data.
If no data is available, an empty response is returned. 61 | - The client opens a connection and requests data from the server using regular HTTP. 62 | - The requested webpage sends requests to the server at regular intervals (e.g., 0.5 seconds). 63 | - The server calculates the response and sends it back, just like regular HTTP traffic. 64 | - The client repeats the above three steps periodically to get updates from the server. 65 | - The problem with Polling is that the client has to keep asking the server for any new data. As a result, a lot of responses are empty, creating HTTP overhead. 66 | ![Ajax Polling Protocol](../images/ajax.png) 67 | 68 | ## HTTP Long-Polling 69 | - This is a variation of the traditional polling technique that allows the server to push information to a client whenever the data is available. With Long-Polling, the client requests information from the server exactly as in normal polling, but with the expectation that the server may not respond immediately. That’s why this technique is sometimes referred to as a “Hanging GET”. 70 | - If the server does not have any data available for the client, instead of sending an empty response, the server holds the request and waits until some data becomes available. 71 | - Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server so that the server will almost always have an available waiting request that it can use to deliver data in response to an event. 72 | - The basic life cycle of an application using HTTP Long-Polling is as follows: 73 | - The client makes an initial request using regular HTTP and then waits for a response. 74 | - The server delays its response until an update is available or a timeout has occurred. 75 | - When an update is available, the server sends a full response to the client.
76 | - The client typically sends a new long-poll request, either immediately upon receiving a response or after a pause to allow an acceptable latency period. 77 | - Each Long-Poll request has a timeout. The client has to reconnect periodically after the connection is closed due to timeouts. 78 | ![Long Polling Protocol](../images/long_polling.png) 79 | 80 | ## WebSockets 81 | - WebSocket provides full-duplex communication channels over a single TCP connection. It provides a persistent connection between a client and a server that both parties can use to start sending data at any time. The client establishes a WebSocket connection through a process known as the WebSocket handshake. If the process succeeds, then the server and client can exchange data in both directions at any time. The WebSocket protocol enables communication between a client and a server with lower overheads, facilitating real-time data transfer from and to the server. This is made possible by providing a standardized way for the server to send content to the browser without being asked by the client and allowing for messages to be passed back and forth while keeping the connection open. In this way, a two-way (bi-directional) ongoing conversation can take place between a client and a server. 82 | ![WebSockets Protocol](../images/websockets.png) 83 | 84 | ## Server-Sent Events (SSEs) 85 | - Under SSEs the client establishes a persistent and long-term connection with the server. The server uses this connection to send data to a client. If the client wants to send data to the server, it would require the use of another technology/protocol to do so. 86 | - Client requests data from a server using regular HTTP. 87 | - The requested webpage opens a connection to the server. 88 | - The server sends the data to the client whenever there’s new information available.
89 | - SSEs are best when we need real-time traffic from the server to the client or if the server is generating data in a loop and will be sending multiple events to the client. 90 | ![Server Sent Events Protocol](../images/sse.png) -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/proxies.md: -------------------------------------------------------------------------------- 1 | Proxies 2 | ==== 3 | 4 | # keypoints 5 | - A proxy server is an intermediary piece of hardware / software sitting between client and backend server. 6 | - Filter requests 7 | - Log requests 8 | - Transform requests 9 | - adding/removing headers 10 | - encrypting/decrypting 11 | - compressing a resource 12 | - cache 13 | - if multiple clients access a particular request, proxy server can cache it 14 | 15 | ## Proxy Server Types 16 | - Open Proxy 17 | - accessible by any Internet user 18 | - Anonymous Proxy 19 | - reveals its identity as a server but does not disclose the initial IP address 20 | - Transparent Proxy 21 | - identifies itself 22 | - with the support of HTTP headers, the first IP address can be viewed 23 | - can cache the websites 24 | - Reverse Proxy 25 | - retrieves resources on behalf of a client from servers 26 | - then returned to the client 27 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/redundancy_replication.md: -------------------------------------------------------------------------------- 1 | Redundancy & Replication 2 | ==== 3 | # keypoints 4 | - Redundancy 5 | - **duplication of critical data or services** with the intention of increased reliability of the system. 6 | - remove single point of failure 7 | - if we have two servers and one fails, system can failover to the other one. 8 | - primary-replica relationship 9 | - between the original and the copies.
10 | - primary gets all updates 11 | - updates then ripple through to the replica servers 12 | - replica outputs a message if it received the update successfully 13 | - Shared-nothing architecture 14 | - Each node can operate independently of one another. 15 | - No central service managing state or orchestrating activities. 16 | - New servers can be added without special conditions or knowledge. 17 | - No single point of failure. 18 | 19 | 20 | -------------------------------------------------------------------------------- /system_design/glossary_of_system_design/sql_nosql.md: -------------------------------------------------------------------------------- 1 | SQL vs. NoSQL 2 | ==== 3 | # keypoints 4 | ## sql (relational databases) 5 | - structured 6 | - have predefined schemas 7 | - e.g. phone books that store phone numbers and addresses 8 | - store data in rows and columns 9 | - row contains information about one entity 10 | - column contains separate data points 11 | 12 | ## NoSQL (non-relational databases) 13 | - unstructured, distributed 14 | - have a dynamic schema 15 | - e.g. file folders that hold everything from a person’s address to their Facebook ‘likes’ 16 | 17 | ## Common types of NoSQL 18 | ### Key-value stores 19 | - Array of key-value pairs. The "key" is an attribute name. 20 | - Redis, Voldemort, Dynamo. 21 | 22 | ### Document databases 23 | - Data is stored in documents. 24 | - Documents are grouped in collections. 25 | - Each document can have an entirely different structure. 26 | - CouchDB, MongoDB. 27 | 28 | ### Wide-column / columnar databases 29 | - Column families - containers for rows. 30 | - No need to know all the columns up front. 31 | - Each row can have a different number of columns. 32 | - Cassandra, HBase.
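The "entirely different structure" point above can be sketched with a toy in-memory document collection. This is plain Python standing in for a real document store; `ToyCollection`, `insert`, and `find` are hypothetical names for illustration, not a real client API:

```python
# Toy document "collection": each document is a dict, and documents in the
# same collection may have completely different fields (dynamic schema).
class ToyCollection:
    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(dict(doc))

    def find(self, **criteria):
        # Return documents whose fields match every criterion; a document
        # missing a queried field simply doesn't match (no NULL columns needed).
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

users = ToyCollection()
users.insert({"name": "alice", "city": "NYC"})
users.insert({"name": "bob", "likes": ["hiking"]})  # different shape, still fine

print(users.find(name="alice"))  # [{'name': 'alice', 'city': 'NYC'}]
```

Note that nothing had to be declared up front: the second insert adds a `likes` field the first document never had, which is exactly what a fixed relational schema would not allow without an `ALTER TABLE`.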
33 | 34 | ### Graph database 35 | - Data is stored in graph structures 36 | - Nodes: entities 37 | - Properties: information about the entities 38 | - Lines: connections between the entities 39 | - Neo4J, InfiniteGraph 40 | 41 | ## Differences between SQL and NoSQL 42 | ### Storage 43 | - SQL: store data in tables. 44 | - NoSQL: have different data storage models. 45 | - key-value 46 | - document 47 | - graph 48 | - columnar 49 | 50 | ### Schema 51 | - SQL 52 | - Each record conforms to a fixed schema. 53 | - each row must have data for each column 54 | - Schema can be altered, but it requires modifying the whole database and going offline. 55 | - NoSQL: 56 | - Schemas are dynamic. 57 | - each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’ 58 | 59 | ### Querying 60 | - SQL 61 | - Use SQL (structured query language) for defining and manipulating the data. 62 | - NoSQL 63 | - Queries are focused on a collection of documents. 64 | - UnQL (unstructured query language). 65 | - Different databases have different syntax. 66 | 67 | ### Scalability 68 | - SQL 69 | - Vertically scalable (by increasing the horsepower: memory, CPU, etc.) and expensive. 70 | - Horizontally scalable (across multiple servers), but it can be challenging and time-consuming. 71 | - NoSQL 72 | - Horizontally scalable (by adding more servers) and cheap. 73 | 74 | ### ACID 75 | - Atomicity, consistency, isolation, durability 76 | - SQL 77 | - ACID compliant 78 | - Data reliability 79 | - Guarantee of transactions 80 | - NoSQL 81 | - Most sacrifice ACID compliance for performance and scalability. 82 | 83 | ## Which one to use? 84 | ### SQL 85 | - Ensure ACID compliance. 86 | - Reduce anomalies. 87 | - Protect database integrity. 88 | - Data is structured and unchanging. 89 | 90 | ### NoSQL 91 | - Data has little or no structure. 92 | - Make the most of cloud computing and storage. 93 | - Cloud-based storage requires data to be easily spread across multiple servers to scale up.
94 | - Rapid development. 95 | - Frequent updates to the data structure. 96 | --------------------------------------------------------------------------------
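The fixed-schema side of this comparison can be seen concretely with Python's built-in `sqlite3` module, used here as a stand-in relational database; the `users` table and its columns are just an illustration:

```python
import sqlite3

# SQL side: a fixed schema — every row has the same columns, enforced up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, city TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "NYC"))

# A row that violates the declared schema is rejected at write time.
rejected = False
try:
    conn.execute("INSERT INTO users (name) VALUES (NULL)")
except sqlite3.IntegrityError:
    rejected = True  # NOT NULL constraint enforced by the database

rows = conn.execute("SELECT name, city FROM users").fetchall()
print(rows)  # [('alice', 'NYC')]
```

This is the integrity guarantee the "Which one to use?" section points to: the database itself refuses malformed data, which reduces anomalies but means any change to the structure requires an explicit schema migration.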