├── images
├── hld_youtube.png
├── trie_merged.png
├── ig_read_writes.png
├── ig_redundancy.png
├── sequence-diag.png
├── trie_storage.png
├── trie_typeahead.png
├── twitter_epoch.png
├── designing_cloud_mqs.png
├── quad_tree_diagram.png
├── twitter_high_level.png
├── uber_system_design.png
├── designing_cloud_client.png
├── designing_cloud_detailed.png
├── detailed_design_youtube.png
├── twitter_component_design.png
├── twitter_like_high_level.png
├── designing_cloud_high_level.png
├── distributed_logging_design.png
├── instagram_high_level_design.png
├── designing_webcrawler_high_level.png
├── rate_limiter_high_level_design.png
├── sharding_based_on_tweet_object.png
├── designing_crawler_detailed_component.png
├── distributed_logging_datacenter_level.png
├── fixed_window_atomicity.svg
├── fixed_window_problem.svg
└── fixed_rolling_window.svg
├── codecov.yml
├── .github
├── workflows
│ ├── action.yml
│ └── ipynb_tests.yml
└── stale.yml
├── .travis.yml
├── requirements.txt
├── .gitignore
├── OOP-design
├── images
│ ├── amazon-s-shoppingcart.svg
│ └── amazon-s-checkout.svg
├── designing_amazon_online_shopping_system.md
└── designing_amazon_online_shopping_system.ipynb
├── README.md
├── designing_twitter_search.md
├── distributed_logging.md
├── designing_twitter_search.ipynb
├── designing_typeahead_suggestion.md
├── distributed_logging.ipynb
├── designing_instagram.md
├── designing_uber_backend.md
├── designing-cloud-storage.md
├── designing_webcrawler.md
├── designing_api_rate_limiter.md
├── designing_twitter.md
└── designing_ticketmaster.md
/images/hld_youtube.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/hld_youtube.png
--------------------------------------------------------------------------------
/images/trie_merged.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/trie_merged.png
--------------------------------------------------------------------------------
/images/ig_read_writes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/ig_read_writes.png
--------------------------------------------------------------------------------
/images/ig_redundancy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/ig_redundancy.png
--------------------------------------------------------------------------------
/images/sequence-diag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/sequence-diag.png
--------------------------------------------------------------------------------
/images/trie_storage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/trie_storage.png
--------------------------------------------------------------------------------
/images/trie_typeahead.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/trie_typeahead.png
--------------------------------------------------------------------------------
/images/twitter_epoch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/twitter_epoch.png
--------------------------------------------------------------------------------
/images/designing_cloud_mqs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/designing_cloud_mqs.png
--------------------------------------------------------------------------------
/images/quad_tree_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/quad_tree_diagram.png
--------------------------------------------------------------------------------
/images/twitter_high_level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/twitter_high_level.png
--------------------------------------------------------------------------------
/images/uber_system_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/uber_system_design.png
--------------------------------------------------------------------------------
/images/designing_cloud_client.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/designing_cloud_client.png
--------------------------------------------------------------------------------
/images/designing_cloud_detailed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/designing_cloud_detailed.png
--------------------------------------------------------------------------------
/images/detailed_design_youtube.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/detailed_design_youtube.png
--------------------------------------------------------------------------------
/images/twitter_component_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/twitter_component_design.png
--------------------------------------------------------------------------------
/images/twitter_like_high_level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/twitter_like_high_level.png
--------------------------------------------------------------------------------
/images/designing_cloud_high_level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/designing_cloud_high_level.png
--------------------------------------------------------------------------------
/images/distributed_logging_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/distributed_logging_design.png
--------------------------------------------------------------------------------
/images/instagram_high_level_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/instagram_high_level_design.png
--------------------------------------------------------------------------------
/images/designing_webcrawler_high_level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/designing_webcrawler_high_level.png
--------------------------------------------------------------------------------
/images/rate_limiter_high_level_design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/rate_limiter_high_level_design.png
--------------------------------------------------------------------------------
/images/sharding_based_on_tweet_object.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/sharding_based_on_tweet_object.png
--------------------------------------------------------------------------------
/images/designing_crawler_detailed_component.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/designing_crawler_detailed_component.png
--------------------------------------------------------------------------------
/images/distributed_logging_datacenter_level.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gitgik/distributed-system-design/HEAD/images/distributed_logging_datacenter_level.png
--------------------------------------------------------------------------------
/codecov.yml:
--------------------------------------------------------------------------------
1 | coverage:
2 |   status:
3 |     project:
4 |       default:
5 |         target: auto
6 |         threshold: 10
7 |     patch:
8 |       default:
9 |         target: 0%
10 |
--------------------------------------------------------------------------------
/.github/workflows/action.yml:
--------------------------------------------------------------------------------
1 | name: Check Markdown links
2 |
3 | on: push
4 |
5 | jobs:
6 |   markdown-link-check:
7 |     runs-on: ubuntu-latest
8 |     steps:
9 |       - uses: actions/checkout@master
10 |         with:
11 |           fetch-depth: 1
12 |       - uses: gaurav-nelson/github-action-markdown-link-check@0.6.0
13 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 |
3 | cache:
4 |   directories:
5 |     - $HOME/.cache/pip
6 |
7 | python:
8 |   - 3.6
9 |
10 | before_install:
11 |   - pip install --upgrade pip
12 |   - pip install --upgrade setuptools wheel nose coverage codecov
13 |
14 | install:
15 |   - pip install nbval
16 |   - pip install -r requirements.txt
17 |
18 | script:
19 |   - pytest --nbval
20 |
21 | after_success:
22 |   - codecov
23 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # Core testing
2 | pytest
3 | nbval
4 |
5 | # Common utilities & jupyter/nbval underlying dependencies
6 | six
7 | python-dateutil
8 | ipykernel # For executing notebooks
9 | nbformat # For reading/writing notebook structure
10 | jupyter_core # Common core utilities for Jupyter
11 | # Network/Security - often pulled in by requests/urllib3 or similar
12 | certifi
13 | idna
14 | urllib3
15 | # cryptography might be needed by pyOpenSSL or other security aspects of web requests
16 | cryptography
17 | pyOpenSSL
18 |
--------------------------------------------------------------------------------
/.github/stale.yml:
--------------------------------------------------------------------------------
1 | # Number of days of inactivity before an issue becomes stale
2 | daysUntilStale: 30
3 | # Number of days of inactivity before a stale issue is closed
4 | daysUntilClose: 7
5 | # Issues with these labels will never be considered stale
6 | exemptLabels:
7 |   - pinned
8 |   - security
9 | # Label to use when marking an issue as stale
10 | staleLabel: wontfix
11 | # Comment to post when marking an issue as stale. Set to `false` to disable
12 | markComment: >
13 |   This issue has been automatically marked as stale because it has not had
14 |   recent activity. It will be closed if no further activity occurs. Thank you
15 |   for your contributions.
16 | # Comment to post when closing a stale issue. Set to `false` to disable
17 | closeComment: false
18 |
--------------------------------------------------------------------------------
/.github/workflows/ipynb_tests.yml:
--------------------------------------------------------------------------------
1 | name: Notebook tests
2 |
3 | on: [push, pull_request]
4 |
5 | jobs:
6 |   test-notebooks:
7 |     runs-on: ubuntu-latest
8 |
9 |     # Test on multiple Python versions
10 |     strategy:
11 |       matrix:
12 |         python-version: ["3.9", "3.10", "3.11"]
13 |
14 |     steps:
15 |       - name: Checkout repository
16 |         uses: actions/checkout@v4
17 |       - name: Set up Python ${{ matrix.python-version }}
18 |         uses: actions/setup-python@v5
19 |         with:
20 |           python-version: ${{ matrix.python-version }}
21 |           # Caching for faster dependency installation on subsequent runs
22 |           cache: 'pip'
23 |       - name: Install dependencies
24 |         run: |
25 |           python -m pip install --upgrade pip
26 |           pip install -r requirements.txt
27 |       - name: Test notebooks with pytest
28 |         run: |
29 |           pytest --nbval
30 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
--------------------------------------------------------------------------------
/images/fixed_window_atomicity.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/OOP-design/images/amazon-s-shoppingcart.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Distributed System Design
2 |
3 | [](https://app.codacy.com/gh/gitgik/distributed-system-design/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)
4 |
5 | A curated collection of approaches to creating large scale distributed systems during interviews.
6 |
7 | System Design Problems:
8 |
9 | 1. [Design Instagram](designing_instagram.md)
10 | 2. [Design Twitter](designing_twitter.md)
11 | 3. [Design Twitter Search](designing_twitter_search.md)
12 | 4. [Design Instagram Feed](designing_instagram_feed.md)
13 | 5. [Design Youtube or Netflix](designing_youtube_or_netflix.md)
14 | 6. [Design Uber or Lyft](designing_uber_backend.md)
15 | 7. [Design a Typeahead Suggestion like Google Search](designing_typeahead_suggestion.md)
16 | 8. [Design an API Rate Limiter](designing_api_rate_limiter.md)
17 | 9. [Design an E-ticketing service like Ticketmaster](designing_ticketmaster.md)
18 | 10. [Design a Web Crawler](designing_webcrawler.md)
19 | 11. [Design Cloud File Storage like Dropbox/GDrive](designing-cloud-storage.md)
20 | 12. [Distributed Logging](distributed_logging.md)
21 |
22 |
23 |
24 | Object Oriented Design Problems:
25 |
26 | 1. [Design Amazon: Online Shopping System](OOP-design/designing_amazon_online_shopping_system.md)
27 |
28 | ## Step 0: Intro
29 |
30 | A lot of software engineers struggle with system design interviews (SDIs), primarily for three reasons:
31 | 
32 | - The unstructured nature of SDIs: they are mostly open-ended design problems that don't have a standard answer.
33 | - Their lack of experience in developing large-scale systems.
34 | - They didn't prepare for SDIs.
35 |
36 | At top companies such as Google, FB, Amazon, Microsoft, etc., candidates who don't perform above average have a limited chance to get an offer.
37 |
38 | Good performance always results in a better offer (higher position and salary), since it shows the candidate's ability to handle a complex system.
39 |
40 | The steps below will help guide you through solving multiple complex design problems.
41 |
42 | ## Step 1: Requirements Clarifications
43 |
44 | It's always a good idea to know the exact scope of the problem we are solving.
45 | Design questions are mostly open-ended, which is why clarifying ambiguities early in the interview becomes critical. Since we have 30-45 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on.
46 |
47 | Let's look at an example: Twitter-like service.
48 |
49 | Here are some questions that should be answered before moving on to the next steps:
50 |
51 | - Will users be able to post tweets and follow other people?
52 | - Should we also design the creation and display of a user's timeline?
53 | - Will the tweets contain photos and videos?
54 | - Are we focusing on the backend only or are we developing the front-end too?
55 | - Will users be able to search tweets?
56 | - Will there be any push notification for new or important tweets?
57 | - Do we need to display hot trending topics?
58 |
59 |
60 |
61 | ## Step 2: System Interface definition
62 |
63 | Define what APIs are expected from the system. This ensures we haven't gotten any requirements wrong
64 | and establishes the exact contract expected from the system.
65 |
66 | Example:
67 |
68 | ```python
69 | post_tweet(
70 |     user_id,
71 |     tweet_data,
72 |     tweet_location,
73 |     user_location,
74 |     timestamp, ...
75 | )
76 | ```
77 |
78 | ```python
79 | generate_timeline(user_id, current_time, user_location, ...)
80 | ```
81 |
82 | ```python
83 | mark_tweet_favorite(user_id, tweet_id, timestamp, ...)
84 | ```
85 |
86 | ## Step 3: Back-of-the-envelope estimation
87 |
88 | It's always a good idea to estimate the scale of the system we're going to design. This will help later when we focus on scaling, partitioning, load balancing and caching.
89 |
90 | - What scale is expected from the system? (no. of tweets/sec, no. of views/sec ...)
91 | - How much storage will we need? We will have different numbers if users can have photos and videos in their tweets.
92 | - What network bandwidth usage are we expecting? Crucial in deciding how we will manage traffic and balance load between servers.
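
To make the arithmetic concrete, here is a small back-of-the-envelope sketch in Python; all the numbers are illustrative assumptions, not requirements:

```python
# Back-of-the-envelope sketch for a Twitter-like service (illustrative numbers only).
DAILY_TWEETS = 400_000_000     # assumed new tweets per day
AVG_TWEET_BYTES = 300          # assumed average tweet size
SECONDS_PER_DAY = 24 * 3600

write_qps = DAILY_TWEETS / SECONDS_PER_DAY
daily_storage_gb = DAILY_TWEETS * AVG_TWEET_BYTES / 1e9
five_year_storage_tb = daily_storage_gb * 365 * 5 / 1e3

print(f"~{write_qps:,.0f} writes/sec")
print(f"~{daily_storage_gb:.0f} GB/day, ~{five_year_storage_tb:.0f} TB over 5 years")
```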
93 |
94 | ## Step 4: Defining data model
95 |
96 | This will clarify how the data flows among different components in the system.
97 | Later, it will guide us towards data partitioning and management. The candidate should identify various system entities, how they interact with each other, and different aspects of data management like storage, transportation, encryption, etc.
98 |
99 | Examples for a Twitter-like service:
100 |
101 | - User: user_id, name, email, date_of_birth, created_at, last_login, etc.
102 | - Tweet: tweet_id, content, tweet_location, number_of_likes, timestamp, etc.
103 | - UserFollowers: user_id1, user_id2
104 | - FavoriteTweets: user_id, tweet_id, timestamp
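
A minimal sketch of how these entities could be captured in code, assuming the field names listed above (the types are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    user_id: int
    name: str
    email: str
    date_of_birth: datetime
    created_at: datetime
    last_login: datetime

@dataclass
class Tweet:
    tweet_id: int
    content: str
    tweet_location: str
    number_of_likes: int
    timestamp: datetime

@dataclass
class UserFollower:
    user_id1: int
    user_id2: int

@dataclass
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime
```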
105 |
106 | > Which database system should we use? Will NoSQL best fit our needs, or should we use a MySQL-like relational DB? What kind of block storage should we use to store photos and videos?
107 |
108 | ## Step 5: High-level Design
109 |
110 | Draw a block diagram with 5-6 boxes representing the core system components. We should identify enough components that are needed to solve the problem from end-to-end.
111 |
112 | For a Twitter-like service, at a high level, we need multiple application servers to serve all
113 | read/write requests. Load balancers (LB) should be placed in front of them for traffic distribution.
114 |
115 | Assuming we'll have more read than write traffic, we can decide to have separate servers for
116 | handling these scenarios.
117 |
118 | On the backend, we need an efficient Database that can store all tweets and can support a
119 | huge number of reads.
120 | We also need a distributed file storage system for storing static media like photos and videos.
121 |
122 | 
123 |
124 | ## Step 6: Detailed design
125 |
126 | Dig deeper into 2-3 components; the interviewer's feedback should always guide you on what parts of the system need further discussion.
127 |
128 | - we should present different approaches,
129 | - present their pros and cons,
130 | - and explain why we will prefer one approach over the other.
131 |
132 | There's no single right answer.
133 |
134 | > The only important thing is to consider tradeoffs between different options while keeping system constraints in mind.
135 |
136 | Questions to consider include:
137 |
138 | - Since we will store massive amounts of data, how should we partition our data to distribute it across multiple DBs? Should we try to store all of a user's data on the same DB? What issues could this cause?
139 | - How will we handle hot users who tweet a lot or follow a lot of people?
140 | - Since users' timelines will contain the most recent tweets, should we try to store our data in a way that is optimized for scanning the latest tweets?
141 | - How much and at what parts should we introduce cache to speed things up?
142 | - What components need better load balancing?
143 |
144 | ## Step 7: Identifying and resolving bottlenecks
145 |
146 | Discuss as many bottlenecks as possible and the different approaches to mitigate them.
147 |
148 | - Is there a single point of failure? What are we doing to mitigate this?
149 | - Do we have enough replicas of data that if we lose a few servers, we can still serve our users?
150 | - Similarly, do we have enough copies of different services running such that a few failures will not cause a total system shutdown?
151 | - How are we monitoring performance? Do we get alerts whenever critical system components fail or performance degrades?
152 |
--------------------------------------------------------------------------------
/designing_twitter_search.md:
--------------------------------------------------------------------------------
1 | # Designing Twitter Search
2 |
3 | We'll design a service that can effectively store and query user tweets.
4 |
5 | ## 1. Requirements and System Goals
6 | - Assume Twitter has 1.5 billion total users with 800 million daily active users.
7 | - On average Twitter gets 400 million tweets every day.
8 | - Average size of a tweet is 300 bytes.
9 | - Assume 500M searches a day.
10 | - The search query will consist of multiple words combined with AND/OR.
11 |
12 |
13 | ## 2. Capacity Estimation and Constraints
14 |
15 | ```
16 | 400 million new tweets each day,
17 | Each tweet is on average 300 bytes
18 | 400M * 300 => 120GB/day
19 |
20 | Total storage per second:
21 | 120 GB / (24 hours * 3600 seconds) ~= 1.38 MB/second
22 | ```
23 |
24 |
25 | ## 3. System APIs
26 | We can have REST APIs expose the functionality of our service.
27 |
28 | ```python
29 |
30 | search(
31 |     api_dev_key: string, # The API developer key of a registered account, this will be used for things like throttling users based on their allocated quota.
32 |     search_terms: string, # A string containing the search terms.
33 |     max_results_to_return: number, # Number of tweets to return.
34 |     sort: number, # optional sort mode: Latest first (0, default), Best matched (1), Most liked (2)
35 |     page_token: string, # This token specifies a page in the result set that should be returned.
36 | )
37 | ```
38 | Returns: (JSON)
39 | ```
40 | A JSON containing info about a list of tweets matching the search query.
41 | Each result entry can have the user ID & name, tweet text, tweet ID, creation time, number of likes, etc.
42 | ```
43 |
44 |
45 |
46 | ## 4. Detailed Component Design
47 | 1. Since we have a huge amount of data, we need to have a data partitioning scheme that'll efficiently distribute the data across multiple servers.
48 |
49 |
50 | 5 year plan
51 | ```
52 | 120 GB/day * 365 days * 5 years ~= 200TB
53 |
54 | ```
55 |
56 | We never want to be more than 80% full at any time, so we'll need approximately 250TB storage. Assuming we also need to keep an extra copy for fault tolerance, then, our total storage will be 500 TB.
57 |
58 | Assuming modern servers store up to 5TB of data, we'd need 100 such servers to hold all the data for the next 5 years.
59 |
60 | Let's start with a simplistic design where we store tweets in a PostgreSQL DB. Assume a table with two columns: TweetID and TweetText.
61 | Partitioning can be based on TweetID. If our TweetIDs are unique system wide, we can define a hash function that can map a TweetID to a storage server where we can store that tweet object.
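
A minimal sketch of that hash-based placement; the server count is taken from the estimate above, and the use of MD5 as the hash function is an assumption:

```python
import hashlib

NUM_STORAGE_SERVERS = 100  # assumed fleet size from the 5-year estimate above

def storage_server_for(tweet_id: int) -> int:
    """Map a system-wide unique TweetID to one of the storage servers."""
    digest = hashlib.md5(str(tweet_id).encode()).hexdigest()
    return int(digest, 16) % NUM_STORAGE_SERVERS

print(storage_server_for(987654321))
```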
62 |
63 | #### How can we create system wide unique TweetIDs?
64 | If we're getting 400M tweets per day, how many tweets will we get in the next five years?
65 | ```
66 | 400 M * 365 * 5 years => 730 billion tweets
67 | ```
68 | We'll need a 5-byte number to identify TweetIDs uniquely. Assume we have a service that will generate unique TweetIDs whenever we need to store an object. We can feed the TweetID to our hash function to find the storage server and store the tweet object there.
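
One possible sketch of such a TweetID-generation service, combining an epoch timestamp in milliseconds with a per-generator sequence number; the exact bit layout here is an assumption for illustration, not part of the original design:

```python
import threading
import time

class TweetIdGenerator:
    """Generates roughly time-ordered, unique IDs as (epoch_ms << 16) | sequence."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self._last_ms:
                self._sequence += 1   # same millisecond: bump the sequence (assumed to fit in 16 bits)
            else:
                self._last_ms, self._sequence = now_ms, 0
            return (now_ms << 16) | self._sequence

gen = TweetIdGenerator()
print(gen.next_id())
```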
69 |
70 | 2. **Index:** Since our tweet queries will consist of words, let's build an index that can tell us which words occur in which tweet objects.
71 |
72 |
73 | Assume:
74 | - Index all English words,
75 | - Add some well-known nouns like people's names, city names, etc.
76 | - We have 300K English words, 200K nouns, Total 500K.
77 | - Average length of a word = 5 characters.
78 |
79 | ```
80 | If we keep our index in memory, we need:
81 |
82 | 500K * 5 => 2.5 MB
83 | ```
84 |
85 | Assume:
86 | - We keep the index in memory for all tweets from our last two years.
87 | ```
88 | Since we'll get 730 Billion tweets in the next 5 years,
89 |
90 | 292 Billion (2 years of tweets) * 5 bytes => 1460 GB
91 | ```
92 |
93 | So our index would be like a big distributed hash table, where 'key' would be the word and 'value' will be a list of TweetIDs of all those tweets which contain that word.
94 |
95 | Assume:
96 | - Average of 40 words in each tweet,
97 | - 15 words will need indexing in each tweet, since we won't be indexing prepositions and other small words (the, in, an, and)
98 |
99 | > This means that each TweetID will be stored 15 times in our index.
100 |
101 | so total memory we will need to store our index:
102 | ```
103 | (1460 GB * 15) + 2.5 MB ~= 21 TB
104 | ```
105 | > Assuming a high-end server holds 144GB of memory, we would need 152 such servers to hold our index.
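
A tiny in-memory sketch of the inverted index described above; the tokenizer and stop-word list are simplifications:

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "an", "and", "a", "of", "to"}  # assumed small-word filter

# word -> list of TweetIDs containing that word
index: dict[str, list[int]] = defaultdict(list)

def index_tweet(tweet_id: int, text: str) -> None:
    for word in set(text.lower().split()):
        if word not in STOP_WORDS:
            index[word].append(tweet_id)

index_tweet(1, "the cat sat in the sun")
index_tweet(2, "a cat and a dog")
print(index["cat"])   # [1, 2]
```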
106 |
107 | ## Sharding our Data
108 |
109 |
110 | #### Sharding based on Words:
111 | While building the index, we'll iterate through all words of a tweet and calculate the hash of each word to find the server where it would be indexed. To find all tweets containing a specific word, we have to query only the server that contains that word.
112 |
113 | Issues with this approach:
114 | - What if a word becomes hot? There will be lots of queries (high load) on the server holding that word, affecting the service's performance.
115 | - Over time, some words can end up storing a lot of TweetIDs compared to others, therefore maintaining a uniform distribution of words while tweets are growing is tricky.
116 |
117 | To recover from this, we can repartition our data or use [Consistent Hashing](https://en.wikipedia.org/wiki/Consistent_hashing)
118 |
119 |
120 | #### Sharding based on tweet object
121 | While storing, we will pass the TweetID to our hash function to find the server and index all words of the tweet on that server.
122 | While querying for a particular word, we'll query all servers, and each server will return a set of TweetIDs. A centralized server will aggregate these results to return them to the user.
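
A sketch of the scatter-gather query flow for this sharding scheme, with plain dicts standing in for the per-server indexes:

```python
# Each "index server" holds its own word -> TweetIDs mapping (stand-ins for real servers).
index_servers = [
    {"cat": [101, 104]},           # server 0
    {"cat": [205], "dog": [207]},  # server 1
    {"sun": [301]},                # server 2
]

def search_word(word: str) -> list[int]:
    """Query every index server and aggregate the TweetIDs (scatter-gather)."""
    results: list[int] = []
    for server in index_servers:
        results.extend(server.get(word, []))
    return sorted(results)

print(search_word("cat"))  # [101, 104, 205]
```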
123 |
124 | 
125 |
126 | ## 6. Fault Tolerance
127 | We can have a secondary replica of each server and if the primary one dies, it can take control after the failover.
128 | Both primary and secondary servers will have the same copy of the index.
129 |
130 | How can we efficiently retrieve a mapping between tweets and the index servers? We have to build a reverse index that maps each TweetID to its index server. We'll keep this in the Index-Builder server.
131 |
132 | - build a Hashtable, where key = index server number and value = HashSet containing all TweetIDs being kept at that index server.
133 | - A HashSet will help us to add/remove tweets from our index quickly.
134 |
135 | So whenever an index server has to rebuild itself, it can simply ask the Index-Builder server for all tweets it needs to store and then fetch those tweets to build the index. We should also have a replica of the Index-builder server for fault tolerance.
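
A small sketch of the Index-Builder's bookkeeping described above; the function names are illustrative:

```python
from collections import defaultdict

# index server number -> set of TweetIDs held on that server (kept by the Index-Builder)
server_to_tweets: dict[int, set[int]] = defaultdict(set)

def record_placement(server: int, tweet_id: int) -> None:
    server_to_tweets[server].add(tweet_id)

def tweets_for_rebuild(server: int) -> set[int]:
    """Everything a recovering index server must re-fetch to rebuild its index."""
    return server_to_tweets[server]

record_placement(3, 101)
record_placement(3, 102)
print(tweets_for_rebuild(3))   # {101, 102}
```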
136 |
137 | ## 7. Caching
138 | We can introduce a cache server in front of our DB. We can also use Memcached, which can store all hot tweets in memory. Application servers, before hitting the backend DB, can quickly check if the cache has that tweet. Based on clients' usage patterns, we can adjust how many cache servers we need. For the cache eviction policy, Least Recently Used (LRU) seems suitable.
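
A minimal sketch of an LRU tweet cache that application servers could consult before hitting the DB; the capacity and sample data are assumptions:

```python
from collections import OrderedDict

class LRUTweetCache:
    def __init__(self, capacity: int = 1000):
        self._capacity = capacity
        self._cache: OrderedDict[int, str] = OrderedDict()

    def get(self, tweet_id: int):
        if tweet_id not in self._cache:
            return None
        self._cache.move_to_end(tweet_id)      # mark as most recently used
        return self._cache[tweet_id]

    def put(self, tweet_id: int, tweet_text: str) -> None:
        self._cache[tweet_id] = tweet_text
        self._cache.move_to_end(tweet_id)
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)    # evict the least recently used entry

cache = LRUTweetCache(capacity=2)
cache.put(1, "hello")
cache.put(2, "world")
cache.get(1)
cache.put(3, "!")          # evicts tweet 2, the least recently used
print(list(cache._cache))  # [1, 3]
```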
139 |
140 | ## 8. Load Balancing
141 | Add LB layers at two places:
142 | 1. Between Clients and Application servers,
143 | 2. Between Application servers and Backend server.
144 |
145 | LB approach:
146 | - Use a round robin approach: distribute incoming requests equally among servers.
147 | - Simple to implement with no overhead.
148 | - If a server is dead, the LB will take it out of rotation and stop sending traffic to it.
149 | - Problem: if a server is overloaded or slow, the LB will not stop sending new requests to it. To fix this, a more intelligent LB solution can be placed that periodically queries servers about their load and adjusts traffic based on that.
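
A toy sketch of the round-robin rotation described above; the server names are placeholders:

```python
import itertools

servers = ["app-server-1", "app-server-2", "app-server-3"]  # assumed pool
rotation = itertools.cycle(servers)

def route_next_request() -> str:
    """Send the next incoming request to the next server in the rotation."""
    return next(rotation)

for _ in range(5):
    print(route_next_request())
```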
150 |
--------------------------------------------------------------------------------
/images/fixed_window_problem.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/OOP-design/designing_amazon_online_shopping_system.md:
--------------------------------------------------------------------------------
1 | # Designing Amazon - Online Shopping System
2 | Let's design an online retail store.
3 | For the sake of this problem, we'll focus on Amazon's retail business where users can buy/sell products.
4 |
5 |
6 | ## Requirements and Goals of the System
7 | 1. Users should be able to:
8 | - Add new products to sell.
9 | - Search products by name or category.
10 | - Buy products only if they are registered members.
11 | - Remove/modify product items in their shopping cart.
12 | - Checkout and buy items in the shopping cart.
13 | - Rate and review a product.
14 | - Specify a shipping address where their order will be delivered.
15 | - Cancel an order if it hasn't been shipped.
16 | - Pay via debit or credit cards.
17 | - Track their shipment to see the current state of their order.
18 | 2. The system should be able to:
19 | - Send a notification whenever the shipping status of the order changes.
20 |
21 | ## Use Case Diagram
22 | We have four main actors in the system:
23 |
24 | - **Admin:** Mainly responsible for account management, adding and modifying new product categories.
25 |
26 | - **Guest:** All guests can search the catalog, add/remove items on the shopping cart, and also become registered members.
27 | - **Member:** In addition to what guests can do, members can place orders and add new products to sell.
28 | - **System:** Mainly responsible for sending notifications for orders and shipping updates.
29 |
30 |
31 | Top use cases therefore include:
32 | 1. Add/Update products: whenever a product is added/modified, update the catalog.
33 | 2. Search for products by their name or category.
34 | 3. Add/remove product items from shopping cart.
35 | 4. Checkout to buy a product item in the shopping cart.
36 | 5. Make a payment to place an order.
37 | 6. Add a new product category.
38 | 7. Send notifications about order shipment updates to members.
39 |
40 |
41 | ### Code
42 |
43 | First we define the enums, datatypes and constants that'll be used by the rest of the classes:
44 |
45 |
46 | ```python
47 | from enum import Enum
48 |
49 |
50 | class AccountStatus(Enum):
51 |     ACTIVE, BLOCKED, BANNED, COMPROMISED, ARCHIVED, UNKNOWN = 1, 2, 3, 4, 5, 6
52 |
53 | class OrderStatus(Enum):
54 |     UNSHIPPED, SHIPPED, PENDING, COMPLETED, CANCELED, REFUND_APPLIED = 1, 2, 3, 4, 5, 6
55 |
56 | class ShipmentStatus(Enum):
57 |     PENDING, SHIPPED, DELIVERED, ON_HOLD = 1, 2, 3, 4
58 |
59 | class PaymentStatus(Enum):
60 |     UNPAID, PENDING, COMPLETED, FILLED, DECLINED = 1, 2, 3, 4, 5
61 |     CANCELLED, ABANDONED, SETTLING, SETTLED, REFUNDED = 6, 7, 8, 9, 10
62 |
63 | ```
64 |
65 | #### Account, Customer, Admin and Guest classes
66 | These classes represent different people that interact with the system.
67 |
68 |
69 | ```python
70 | from abc import ABC, abstractmethod
71 |
72 |
73 | class Account:
74 |     """Python strives to adhere to the Uniform Access Principle.
75 |
76 |     So there's no need for getter and setter methods.
77 |     """
78 |
79 |     def __init__(self, username, password, name, email, phone, shipping_address, status: AccountStatus):
80 |         # "private" attributes
81 |         self._username = username
82 |         self._password = password
83 |         self._email = email
84 |         self._phone = phone
85 |         self._shipping_address = shipping_address
86 |         self._status = status
87 |         self._credit_cards = []
88 |         self._bank_accounts = []
89 |
90 |     def add_product(self, product):
91 |         pass
92 |
93 |     def add_product_review(self, review):
94 |         pass
95 |
96 |     def reset_password(self):
97 |         pass
98 |
99 |
100 | class Customer(ABC):
101 |     def __init__(self, cart, order):
102 |         self._cart = cart
103 |         self._order = order
104 |
105 |     def get_shopping_cart(self):
106 |         return self._cart
107 |
108 |     def add_item_to_cart(self, item):
109 |         raise NotImplementedError
110 |
111 |     def remove_item_from_cart(self, item):
112 |         raise NotImplementedError
113 |
114 |
115 | class Guest(Customer):
116 |     def register_account(self):
117 |         pass
118 |
119 |
120 | class Member(Customer):
121 |     def __init__(self, account: Account):
122 |         self._account = account
123 |
124 |     def place_order(self, order):
125 |         pass
126 |
127 | ```
128 |
129 |
130 | ```python
131 | # Test class definition
132 | g = Guest(cart="Cart1", order="Order1")
133 | print(hasattr(g, "remove_item_from_cart"))
134 | print(isinstance(g, Customer))
135 | ```
136 |
137 | True
138 | True
139 |
140 |
141 | #### Product Category, Product and Product Review
142 | The classes below are related to a product:
143 |
144 |
145 | ```python
146 | class Product:
147 |     def __init__(self, product_id, name, description, price, category, available_item_count):
148 |         self._product_id = product_id
149 |         self._name = name
150 |         self._price = price
151 |         self._category = category
152 |         self._available_item_count = available_item_count
153 |
154 |     def update_price(self, new_price):
155 |         self._price = new_price
156 |
157 |
158 | class ProductCategory:
159 |     def __init__(self, name, description):
160 |         self._name = name
161 |         self._description = description
162 |
163 |
164 | class ProductReview:
165 |     def __init__(self, rating, review, reviewer):
166 |         self._rating = rating
167 |         self._review = review
168 |         self._reviewer = reviewer
169 |
170 | ```
171 |
172 | #### ShoppingCart, Item, Order and OrderLog
173 | Users will add items to the shopping cart and place an order to buy all the items in the cart.
174 |
175 |
176 | ```python
177 | class Item:
178 |     def __init__(self, item_id, quantity, price):
179 |         self._item_id = item_id
180 |         self._quantity = quantity
181 |         self._price = price
182 |
183 |     def update_quantity(self, quantity):
184 |         self._quantity = quantity
185 |
186 |     def __repr__(self):
187 |         return f"ItemID:<{self._item_id}>"
188 |
189 |
190 | class ShoppingCart:
191 |     """We can still access the items directly via `_items` instead of having a getter method.
192 |     """
193 |     def __init__(self):
194 |         self._items = []
195 |
196 |     def add_item(self, item):
197 |         self._items.append(item)
198 |
199 |     def remove_item(self, item):
200 |         self._items.remove(item)
201 |
202 |     def update_item_quantity(self, item, quantity):
203 |         pass
204 |
205 | ```
206 |
207 |
208 | ```python
209 | item = Item(item_id=1, quantity=2, price=300)
210 | cart = ShoppingCart()
211 | cart.add_item(item)
212 | ```
213 |
214 |
215 | ```python
216 | # shopping cart now has items
217 | cart._items
218 | ```
219 |
220 |
221 |
222 |
223 | [ItemID:<1>]
224 |
225 |
226 |
227 |
228 | ```python
229 | import datetime
230 |
231 |
232 | class OrderLog:
233 |     def __init__(self, order_number, status=OrderStatus.PENDING):
234 |         self._order_number = order_number
235 |         self._creation_date = datetime.date.today()
236 |         self._status = status
237 |
238 |
239 | class Order:
240 |     def __init__(self, order_number, status=OrderStatus.PENDING):
241 |         self._order_number = order_number
242 |         self._status = status
243 |         self._order_date = datetime.date.today()
244 |         self._order_log = []
245 |
246 |     def send_for_shipment(self):
247 |         pass
248 |
249 |     def make_payment(self, payment):
250 |         pass
251 |
252 |     def add_order_log(self, order_log):
253 |         pass
254 | ```
255 |
256 | #### Shipment and Notification
257 | After successfully placing an order and processing the payment, a shipment record will be created.
258 | Let's define the Shipment and Notification classes:
259 |
260 |
261 | ```python
262 | import datetime
263 |
264 |
265 | class ShipmentLog:
266 |     def __init__(self, shipment_id, status=ShipmentStatus.PENDING):
267 |         self._shipment_id = shipment_id
268 |         self._shipment_status = status
269 |
270 |
271 | class Shipment:
272 |     def __init__(self, shipment_id, shipment_method, eta=None, shipment_logs=None):
273 |         self._shipment_id = shipment_id
274 |         self._shipment_date = datetime.date.today()
275 |         self._eta = eta
276 |         self._shipment_logs = shipment_logs or []  # avoid a shared mutable default
277 |
278 |
279 | class Notification(ABC):
280 |     def __init__(self, notification_id, content):
281 |         self._notification_id = notification_id
282 |         self._created_on = datetime.datetime.now()
283 |         self._content = content
284 |
285 | ```
286 |
--------------------------------------------------------------------------------
/distributed_logging.md:
--------------------------------------------------------------------------------
1 | # Designing Distributed Logging System
2 |
3 | One of the most challenging aspects of debugging distributed systems is understanding system behavior in the period leading up to a bug.
4 | As we all know by now, a distributed system is made up of microservices calling each other to complete an operation.
5 | Multiple services can talk to each other to complete a single business requirement.
6 |
7 | In this architecture, logs are accumulated on each machine running the microservice, and a single microservice can be deployed to hundreds of nodes. In an architectural setup where multiple microservices are interdependent, the failure of one service can result in failures of other services. If we do not have well-organized logging, we might not be able to determine the root cause of a failure.
8 |
9 | ## Understanding the system
10 | ### Restrain Log Size
11 | At any given time, the distributed system logs hundreds of concurrent messages.
12 | The number of logs increases over time, but not all of them are important enough to be logged.
13 | To solve this, logs have to be structured: we need to decide what to log into the system at the application or logging level.
14 |
15 | ### Log sampling
16 | Storage and processing resources are a constraint. We must determine which messages we should log into the system so as to control the volume of logs generated.
17 |
18 | High-throughput systems will emit lots of messages from the same set of events. Instead of logging all the messages, we can use a sampler service that only logs a smaller set of messages from a larger chunk. The sampler service can use various sampling algorithms, such as adaptive and priority sampling, to log events. For large systems with thousands of microservices and billions of events per second, an appropriate sampling strategy is essential to keep the volume of logs manageable.
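
A small sketch of the sampling idea; the per-level sampling rates are assumptions for illustration:

```python
import random

# Assumed per-level sampling rates: errors always kept, debug heavily sampled.
SAMPLE_RATES = {"DEBUG": 0.01, "INFO": 0.1, "WARNING": 0.5, "ERROR": 1.0, "CRITICAL": 1.0}

def should_log(level: str) -> bool:
    """Probabilistically decide whether to keep a message of the given severity."""
    return random.random() < SAMPLE_RATES.get(level, 1.0)

kept = sum(should_log("INFO") for _ in range(10_000))
print(f"kept roughly {kept} of 10,000 INFO messages")
```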
19 |
20 | ### Structured logging
21 | The first benefit of structured logs is better interoperability between log readers and writers.
22 | Use structured logging to make the job of the log-processing system easier.
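
A minimal sketch of structured (JSON) logging built on Python's standard `logging` module; the field names are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object, easy for log processors to parse."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"service": "checkout"})
```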
23 |
24 | ### Categorization
25 | The following severity levels are commonly used in logging:
26 | - `DEBUG`
27 | - `INFO`
28 | - `WARNING`
29 | - `ERROR`
30 | - `CRITICAL`
31 |
32 | ## Requirements
33 | ### Functional requirements
34 | - Writing logs: the microservices should be able to write into the logging system.
35 | - Query-based logs: It should be effortless to search for logs.
36 | - The logs should reside in distributed storage for easy access.
37 | - The logging mechanism should be secure and not vulnerable. Access to logs should be restricted to authenticated users, with only the necessary read-only permissions granted.
38 | - The system should avoid logging sensitive information like credit card numbers, passwords, and so on.
39 | - Since logging is an I/O-heavy operation, the system should avoid logging excessive information. Logging all information is unnecessary; it only takes up more space and impacts performance.
40 | - Avoid logging personally identifiable information (PII) such as names, addresses, emails, etc.
41 |
42 |
43 | ### Non-functional requirements
44 | - **Low latency:** logging is a resource-intensive operation that's significantly slower than CPU operations. To ensure low latency, the logging system should be designed so that logging does not block or delay a service's main application process.
45 | - **Scalability:** Our logging system must be scalable. It should efficiently handle increasing log volumes over time and support a growing number of concurrent users.
46 | - **Availability:** The logging system should be highly available to ensure data is consistently logged without interruption.
47 |
48 | ## Components to use
49 | We will use the following components:
50 | - **Pub-Sub system:** we will use a publish-subscribe system to efficiently handle the large volume of logs.
51 | - **Distributed Search:** we will employ distributed search to query logs efficiently.
52 |
53 | >A distributed search system is a search architecture that efficiently handles large datasets and high query loads by spreading search operations across multiple servers or nodes. It has the following components:
54 | >1. **Crawler:** This component fetches the content and creates documents.
55 | >2. **Indexer:** Builds a searchable index from the fetched documents.
56 | >3. **Searcher:** Responds to user queries by running searches on the index created by the indexer.
57 |
58 | - **Logging Accumulator:** This component will collect logs from each node and store them in a central location, allowing for easy retrieval of logs related to specific events without needing to query each individual node.
59 | - **Blob Storage:** The blob storage provides a scalable and reliable storage for large volumes of data.
60 | - **Log Indexer:** Due to the increasing number of log files, efficient searching is crucial. The log indexer utilizes distributed search techniques to index and make logs searchable, ensuring fast retrieval of relevant information.
61 | - **Visualizer:** The visualizer component provides a unified view of all logs. It enables users to analyze and monitor system behavior and performance through visual representation and analytics.
62 |
63 |
64 | ## API Design
65 | We design for reads and writes
66 |
67 |
68 | Read
69 | ```python
70 | searching(keyword)
71 | ```
72 | This API call returns a list of logs that contain the keyword.
73 |
74 | Write
75 | ```python
76 | write_logs(unique_ID, message)
77 | ```
78 | This API call writes the log message against a unique key.
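
A hedged sketch of what these two calls could look like, with simple in-memory stand-ins for the pub-sub system and the search index described above:

```python
from collections import defaultdict

# In-memory stand-ins for the pub-sub topic and the search index (sketch only).
pubsub_topic: list[tuple[str, str]] = []          # (unique_id, message) pairs
search_index: dict[str, list[str]] = defaultdict(list)

def write_logs(unique_id: str, message: str) -> None:
    """Publish one log entry, keyed by a unique ID, to the pub-sub topic."""
    pubsub_topic.append((unique_id, message))
    for word in message.lower().split():
        search_index[word].append(message)        # the log indexer would normally do this

def searching(keyword: str) -> list[str]:
    """Return all indexed log messages containing the keyword."""
    return search_index.get(keyword.lower(), [])

write_logs("app1-svc2-1700000000", "payment service timeout")
print(searching("timeout"))
```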
79 |
80 |
81 | ## High Level System Design
82 |
83 | 
84 |
85 | ## Component Design
86 |
87 | ### Logging at Various Levels in a Server
88 | In a server environment, logging occurs across various services and applications, each producing logs crucial for monitoring and troubleshooting.
89 |
90 | #### Server Level
91 | - **Multiple Applications:** A server hosts multiple apps, such as App1, App2, etc., each running various microservices, e.g. user authentication, fetching the cart, storage, etc. in an e-commerce context.
92 | - **Logging Structure:** Each service within the application generates logs identified by an ID comprising the application ID, service ID, and timestamp, ensuring unique identification and event causality determination.
93 |
94 | #### Logging Process
95 | Each service will push logs into the log accumulator service.
96 | The service will store the logs logically and push the logs to a pub-sub system.
97 |
98 | We use the pub-sub system to handle the scalability challenge by efficiently managing and distributing a large volume of logs across the system.
99 |
100 | #### Ensuring Low Latency and Performance
101 | - **Asynchronous Logging:** Logs are sent asynchronously via low-priority threads to avoid impacting the performance of critical processes. This also ensures continuous availability of services without any disruptions caused by logging activities (see the sketch after this list).
102 | - **Data Loss Awareness:** Logging large volumes of messages can lead to potential data loss. To balance user-perceived latency with data persistence, log services often use RAM and save data asynchronously. To minimize data loss, we will add more log accumulators to handle increasing concurrent users.
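
A sketch of asynchronous logging using the standard library's queue-based handlers, so the application thread never blocks on log I/O; the handler choices are illustrative:

```python
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue(-1)                 # unbounded in this sketch

# The application logger only enqueues records (cheap, non-blocking).
queue_handler = logging.handlers.QueueHandler(log_queue)
app_logger = logging.getLogger("app")
app_logger.addHandler(queue_handler)
app_logger.setLevel(logging.INFO)

# A background listener drains the queue and does the slow I/O.
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()

app_logger.info("user checkout completed")
listener.stop()                                          # flush any remaining records
```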
103 |
104 | #### Log Retention
105 | Logs also have an expiration date. We can delete regular logs after a few days or months. Compliance logs are usually stored for up to five years. It depends on the requirements of the application.
106 |
107 | Another crucial component, therefore, is an expiration checker. It will verify which logs have to be deleted.
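
A toy sketch of such an expiration checker; the retention periods are examples, not requirements:

```python
from datetime import datetime, timedelta

# Assumed retention periods per log kind.
RETENTION = {"regular": timedelta(days=30), "compliance": timedelta(days=5 * 365)}

def expired_logs(logs, now=None):
    """Return the log records whose retention period has elapsed."""
    now = now or datetime.utcnow()
    return [log for log in logs if now - log["created_at"] > RETENTION[log["kind"]]]

logs = [
    {"id": 1, "kind": "regular", "created_at": datetime.utcnow() - timedelta(days=90)},
    {"id": 2, "kind": "compliance", "created_at": datetime.utcnow() - timedelta(days=90)},
]
print([log["id"] for log in expired_logs(logs)])   # [1]
```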
108 |
109 | ### Data Center Level
110 | All servers in the data center transmit logs to a publish-subscribe architecture.
111 | By utilizing a horizontally-scalable pub-sub framework, we can effectively manage large log volumes.
112 |
113 | Implementing multiple pub-sub instances within each data center enhances scalability and prevents throughput limitations and bottlenecks. Subsequently, the pub-sub system routes the log data to blob storage.
114 |
115 | 
116 |
117 | Now, data in the pub-sub system is temporary and gets deleted after a few days, once it has been moved to archival storage.
118 | However, while the data is still present in the pub-sub system, we can utilize it using the following services:
119 | - **Alerts Service:** This service identifies alerts and errors and notifies the appropriate stakeholders if a critical error is detected, or sends a message to a monitoring tool, ensuring timely awareness of important alerts. The service will also monitor logs for suspicious activities or security incidents, triggering alerts or automated responses to mitigate threats.
120 | - **Analytics service:** This service analyzes trends and patterns in the logged data to provide insights into system performance, user behavior, or operational metrics.
121 |
--------------------------------------------------------------------------------
/images/fixed_rolling_window.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/designing_twitter_search.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Designing Twitter Search\n",
8 | "\n",
9 | "We'll design a service that can effectively store and query user tweets."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## 1. Requirements and System Goals\n",
17 | "- Assume Twitter has 1.5 billion total users with 800 million daily active users.\n",
18 | "- On average Twitter gets 400 million tweets every day.\n",
19 | "- Average size of a tweet is 300 bytes.\n",
20 | "- Assume 500M searches a day.\n",
21 | "- The search query will consist of multiple words combined with AND/OR.\n"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "## 2. Capacity Estimation and Constraints\n",
29 | "\n",
30 | "```\n",
31 | " 400 million new tweets each day,\n",
32 | " Each tweet is on average 300 bytes \n",
33 | " 400M * 300 => 120GB/day\n",
34 | " \n",
35 | " Total storage per second:\n",
36 | " 120 GB / (24 hours / 3600 seconds) ~= 1.38MB/second\n",
37 | "```\n"
38 | ]
39 | },
40 | {
41 | "cell_type": "markdown",
42 | "metadata": {},
43 | "source": [
44 | "## 3. System APIs\n",
45 | "We can have REST APIs expose the functionality of our service.\n",
46 | "\n",
47 | "```python\n",
48 | "\n",
49 | "search(\n",
50 | " api_dev_key: string, # The API developer key of a registered account, this will be used for things like throttling users based on their allocated quota.\n",
51 | " search_terms: string, # A string containing the search terms.\n",
52 | " max_results_to_return: number, # Number of tweets to return.\n",
53 | " sort: number, # optional sort mode: Last first(0 - default), Best mached (1), Most liked (2)\n",
54 | " page_token: string, # This token specifies a page in the result set that should be returned.\n",
55 | ")\n",
56 | "```\n",
57 | "Returns: (JSON)\n",
58 | "```\n",
59 | "A JSON containing info about a list of tweets matching the search query.\n",
60 | "Each result entry can have the user ID & name, tweet text, tweet ID, creation time, number of likes, etc.\n",
61 | "```\n",
62 | "\n"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "metadata": {},
68 | "source": [
69 | "## 4. Detailed Component Design\n",
70 | "1. Since we have a huge amount of data, we need to have a data partitioning scheme that'll efficiently distribute the data across multiple servers.\n",
71 | "\n",
72 | "\n",
73 | "5 year plan\n",
74 | "```\n",
75 | " 120 GB/day * 365 days * 5 years ~= 200TB\n",
76 | " \n",
77 | "```\n",
78 | "\n",
79 | "We never want to be more than 80% full at any time, so we'll need approximately 250TB storage. Assuming we also need to keep an extra copy for fault tolerance, then, our total storage will be 500 TB.\n",
80 | "\n",
81 | "Assuming modern servers store up to 5TB of data, we'd need 100 such servers to hold all the data for the next 5 years.\n",
82 | "\n",
83 | "Let's start with simplistic design where we store tweets in a PostgreSQL DB. Assume a table with two columns: TweetID, and TweetText. \n",
84 | "Partitioning can be based on TweetID. If our TweetIDs are unique system wide, we can define a hash function that can map a TweetID to a storage server where we can store that tweet object.\n",
85 | "\n",
86 | "#### How can we create system wide unique TweetIDs?\n",
87 | "If we're getting 400M tweets per day, then in the next five years?\n",
88 | "```\n",
89 | " 400 M * 365 * 5 years => 730 billion tweets\n",
90 | "```\n",
91 | "We'll need 5 bytes number to identify TweetIDs uniquely. Assume we have a service that will generate unique TweetIDs whenever we need to store an object. We can feed the TweetID to our hash function to find the storage server and store the tweet object there.\n",
92 | "\n",
93 | "2. **Index:** Since our tweet queries will consist of words, let's build the index that can tell us which words comes in which tweet object.\n",
94 | "\n",
95 | "\n",
96 | "Assume:\n",
97 | "- Index all English words,\n",
98 | "- Add some famous nouns like People names, city names, etc\n",
99 | "- We have 300K English words, 200K nouns, Total 500K.\n",
100 | "- Average length of a word = 5 characters.\n",
101 | "\n",
102 | "```\n",
103 | " If we keep our index in memory, we need:\n",
104 | " \n",
105 | " 500K * 5 => 2.5 MB\n",
106 | "```\n",
107 | "\n",
108 | "Assume:\n",
109 | " - We keep the index in memory for all tweets from our last two years. \n",
110 | "```\n",
111 | " Since we'll get 730 Billion tweets in the next 5 years,\n",
112 | " \n",
113 | " 292Billion (2 year tweets) * 5 => 1460 GB\n",
114 | "```\n",
115 | "\n",
116 | "So our index would be like a big distributed hash table, where 'key' would be the word and 'value' will be a list of TweetIDs of all those tweets which contain that word.\n",
117 | "\n",
118 | "Assume:\n",
119 | " - Average of 40 words in each tweet,\n",
120 | " - 15 words will need indexing in each tweet, since we won't be indexing prepositions and other small words (the, in, an, and)\n",
121 | "\n",
122 | "> This means that each TweetID will be stored 15 times in our index. \n",
123 | "\n",
124 | "so total memory we will need to store our index:\n",
125 | "```\n",
126 | " (1460 * 15) + 2.5MB ~= 21 TB\n",
127 | "```\n",
128 | "> Assuming a high-end server holds 144GB of memory, we would need 152 such servers to hold our index.\n",
129 | "\n",
130 | "## Sharding our Data\n",
131 | "\n",
132 | "\n",
133 | "#### Sharding based on Words:\n",
134 | "While building the index, we'll iterate through all words of a tweet and calculate the hash of each word to find the server where it would be indexed. To find all tweets containing a specific word we have to query only server which contains this word.\n",
135 | "\n",
136 | "Issues with this approach:\n",
137 | "- If a word becomes hot? There will be lots of queries (high load) on the server holding the word, affecting the service performance.\n",
138 | "- Over time, some words can end up storing a lot of TweetIDs compared to others, therefore maintaining a uniform distribution of words while tweets are growing is tricky.\n",
139 | "\n",
140 | "To recover from this, we can repartition our data or use [Consistent Hashing](https://en.wikipedia.org/wiki/Consistent_hashing)\n",
141 | "\n",
142 | "\n",
143 | "#### Sharding based on tweet object\n",
144 | "While storing, we will pass the TweetID to our hash function to find the server and index all words of the tweet on that server.\n",
145 | "While querying for a particular word, we'll query all servers, and each server will return a set of TweetIDs. A centralized server will aggregate these results to return them to the user. \n",
146 | "\n",
147 | ""
148 | ]
149 | },
150 | {
151 | "cell_type": "markdown",
152 | "metadata": {},
153 | "source": [
154 | "## 6. Fault Tolerance\n",
155 | "We can have a secondary replica of each server and if the primary one dies, it can take control after the failover.\n",
156 | "Both primary and secondary servers will have the same copy of the index. \n",
157 | "\n",
158 | "How can we efficiently retrieve a mapping between tweets and the index server? We have to build a reverse index that will map all the tweetID to their index server. We'll keep this in the Index-Builder server.\n",
159 | "\n",
160 | "- build a Hashtable, where key = index server number and value = HashSet containing all TweetIDs being kept at that index server.\n",
161 | "- A HashSet will help us to add/remove tweets from our index quickly.\n",
162 | "\n",
163 | "So whenever an index server has to rebuild itself, it can simply ask the Index-Builder server for all tweets it needs to store and then fetch those tweets to build the index. We should also have a replica of the Index-builder server for fault tolerance. "
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "## 7. Caching\n",
171 | "We can introduce a cache server in front of our DB. We can also use Memcached, which can store all hot tweets in memory. App servers before hitting the backend DB, can quickly check if the cache has that tweet. Based on clients' usage patterns, we can adjust how many cache servers we need. For cache eviction policy, Least Recently Used (LRU) seems suitable."
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "## 8. Load Balancing\n",
179 | "Add LB layers at two places:\n",
180 | "1. Between Clients and Application servers,\n",
181 | "2. Between Application servers and Backend server.\n",
182 | "\n",
183 | "LB approach:\n",
184 | "- Use round robin approach: distrubute incoming requests equally among servers.\n",
185 | "- Simple to implement and no overhead\n",
186 | "- If as server is dead, LB will take it out of rotation and stop sending traffic to it\n",
187 | "- Problem is if a server is overloaded, or slow, the LB will not stop sending new requests to it. To fix this, a more intelligent LB solution can be placed that periodically queries the server about the load and adjust traffic based on that."
188 | ]
189 | }
190 | ],
191 | "metadata": {
192 | "kernelspec": {
193 | "display_name": "Python 3",
194 | "language": "python",
195 | "name": "python3"
196 | },
197 | "language_info": {
198 | "codemirror_mode": {
199 | "name": "ipython",
200 | "version": 3
201 | },
202 | "file_extension": ".py",
203 | "mimetype": "text/x-python",
204 | "name": "python",
205 | "nbconvert_exporter": "python",
206 | "pygments_lexer": "ipython3",
207 | "version": "3.7.4"
208 | }
209 | },
210 | "nbformat": 4,
211 | "nbformat_minor": 2
212 | }
213 |
--------------------------------------------------------------------------------
/OOP-design/images/amazon-s-checkout.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/designing_typeahead_suggestion.md:
--------------------------------------------------------------------------------
1 | # Designing Typeahead Suggestion
2 |
3 | Typeahead is a real-time suggestion service which recommends terms to users as they enter text for searching.
4 |
5 | As the user types into the search box, it tries to predict the query based on the characters the user has entered, and gives a list of suggestions to complete the query.
6 |
7 | It's not about speeding up the user's search but about helping the user articulate their search queries better.
8 |
9 |
10 | ## 1. Requirements and Goals of the System
11 | **Functional requirements:** As the user types in their search query, our service should suggest the top 10 terms starting with whatever the user has typed.
12 |
13 | **Non-functional requirements:** The suggestions should appear in real time, within about 200ms.
14 |
15 |
16 | ## 2. Basic System Design and Algorithm
17 |
18 | The problem to solve is that we have a lot of strings we need to store in such a way that the user can search with any prefix. The service will suggest the terms that match with the prefix. For example, if our DB contains the terms (cat, cap, captain, capital), and the user has typed in `cap`, then the system should suggest `cap`, `captain` and `capital`.
19 |
20 | To serve a lot of queries with minimal latency, we can't depend on the DB for this; we need to store our index in memory in a highly efficient data structure – a Trie (pronounced "try").
21 |
22 | 
23 |
24 | If the user types `cap`, then the service can traverse the trie and go to the node **P**, to find all the terms that start with this prefix. (i.e cap-ital, cap-tain, cap-tion).
25 |
26 | We can also merge nodes that have only one branch to save memory.
27 |
28 | 
29 |
30 |
31 | **Should we have a case insensitive trie?** For simplicity, let's assume our data is case insensitive.
32 |
33 | #### How do we find top 10 suggestions?
34 | We can store the count of searches that terminated at each node. For example, if users have searched for `captain` 20 times and `caption` 50 times, we can store these counts with the last character of each phrase. Now if the user types `cap`, we know that the most searched word under 'cap' is `caption`.
35 |
36 | > So to find the top suggestions for a given prefix, we can traverse the sub-tree under it.
37 |
38 |
39 |
40 | #### Given a prefix, how long will it take to traverse its sub-tree?
41 | Given the amount of text we need to index, we should expect a huge tree. Traversing such a tree will take a long time. Since we have strict latency requirements, we need to improve the efficiency of our solution.
42 | - Store top 10 suggestions for each node. This will require extra storage.
43 | - We can optimize our storage by storing only references to the terminal node rather than storing the entire phrase.
44 | - Store frequency with each reference to keep track of top suggestions (see the sketch below).
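
A minimal sketch of such a trie node (a hypothetical structure; the top-10 lists are computed offline, bottom-up, as described in the build step further down):

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}         # char -> TrieNode
        self.count = 0             # searches that terminated at this node
        self.top_suggestions = []  # up to 10 (frequency, term) pairs, highest first

def insert(root, term, frequency=1):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.count += frequency

def compute_top_suggestions(node, prefix='', k=10):
    # Bottom-up pass: combine this node's own count with its children's lists.
    candidates = [(node.count, prefix)] if node.count else []
    for ch, child in node.children.items():
        candidates.extend(compute_top_suggestions(child, prefix + ch, k))
    node.top_suggestions = heapq.nlargest(k, candidates)
    return node.top_suggestions

def suggestions(root, prefix):
    # Walk down to the prefix node and return its precomputed suggestions.
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [term for _, term in node.top_suggestions]
```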
45 |
46 |
47 |
48 | #### How do we build this trie?
49 | We can efficiently build it bottom-up.
50 | - Each parent node will recursively call child nodes to calculate top suggestions and their counts.
51 | - The parent nodes will combine their suggestions from all their children nodes to determine their top suggestions.
52 |
53 |
54 |
55 | #### How do we update the trie?
56 | Assume 5 billion daily searches
57 | ```
58 | 5B searches / 86400 sec a day ~= 60K searches/sec
59 | ```
60 | Updating the trie for every search will be extremely resource intensive and this can hamper our read requests too.
61 | Solution: Update the trie offline at certain intervals.
62 |
63 | As new queries come in, log them and track their frequency of occurrence. We can log every 1000th query: for example, if we don't want to show a term that has been searched fewer than 1000 times, it's safe to log only every 1000th search query.
64 |
65 | We can use [Map Reduce (MR)](https://en.wikipedia.org/wiki/MapReduce) to process all the logging data periodically, say every hour.
66 | - The MR jobs will calculate the frequencies of all search terms in the past hour.
67 | - Then update the trie with the new data. We take the current snapshot of the trie and update it offline with the new terms and their frequencies.
68 |
69 | We have two options for updating offline:
70 | 1. Make a copy of the trie from each server and update the copy offline. Once done, switch to the copy and discard the old one.
71 | 2. Set up each trie server in a master-slave configuration. Update the slave while the master serves incoming traffic. Once the update is complete, make the slave the new master. Later, update the old master, which can then start serving traffic too.
72 |
73 |
74 |
75 | #### **How do we update the frequencies of suggestions?**
76 | We are storing frequencies of suggestions for each node, so we need to update them too.
77 |
78 | We can update only differences in frequencies, instead of recounting all search terms from scratch.
79 |
80 | Approach: Use **[Exponential Moving Average(EMA)](https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average)**
81 | - This will allow us to give more weight to the latest data.
82 | - For example, if we keep counts for searches done over the last 10 days, we subtract the counts from the time period *no longer included* and add the counts for the newly included time period, as sketched below.
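
A minimal EMA update sketch (the smoothing factor `alpha` is an assumed parameter, not something specified in this design):

```python
def update_frequency(old_ema, new_count, alpha=0.3):
    # Exponential moving average: recent counts get more weight than older ones.
    return alpha * new_count + (1 - alpha) * old_ema
```

For example, a term with an EMA of 100 that was searched 160 times in the latest period moves to `0.3 * 160 + 0.7 * 100 = 118`.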
83 |
84 |
85 | **Inserting a new term and updating corresponding frequencies**
86 |
87 | After inserting a new term in the trie, we will go to the term's terminal node and increase its frequency.
88 | Since we are storing top 10 suggestions for a node, it's possible that the new term is also in the top 10 suggestions of a few other nodes.
89 | We'll therefore update the top 10 queries of those nodes too.
90 |
91 | Traverse back from the node all the way to the root. For each parent, we check if the new query is part of the top 10. If so, we update the corresponding frequency. If not, we check if the new query's frequency is high enough to be in the top 10; if so, we insert the new term and remove the term with the lowest frequency.
92 |
93 |
94 |
95 | #### **How do we remove a term from the trie?**
96 |
97 | Let's say we have to remove some term because it's highly offensive or for some legal issue.
98 | We can do that when the periodic updates happen. Meanwhile, we can also add a filtering layer on each server which will remove any such term before sending them to users.
99 |
100 |
101 |
102 |
103 | #### **What could be different ranking criteria for suggestions?**
104 | In addition to a simple count for ranking terms, we can use factors such as user's location, freshness, language, demographics, personal history, etc.
105 |
106 | ## 3. Permanent Storage of the Trie
107 |
108 | #### **How to store a trie in a file to rebuild it easily.**
109 | This enables us to rebuild the trie if a server goes down. To store it, start with the root node and write out each node's character together with how many children it has; right after each node, write all of its children (a depth-first, pre-order traversal).
110 |
111 | Let's assume we have the following trie:
112 |
113 | 
114 |
115 | With the mentioned storage scheme, we can store the above trie as:
116 | ```
117 | C2,A2,R1,T,P,O1,D
118 | ```
119 | From this, we can easily rebuild the trie.
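
A minimal round-trip sketch of that format (the node structure and the exact words in the pictured trie are assumptions; a node's child count is written only when it has children, as in the example):

```python
class Node:
    def __init__(self, char, children=None):
        self.char = char
        self.children = children or []

def serialize(node, out):
    # Pre-order: write the node's character plus its child count, then its children.
    out.append(node.char + (str(len(node.children)) if node.children else ''))
    for child in node.children:
        serialize(child, out)
    return out

def deserialize(tokens):
    # Rebuild the trie by consuming tokens in the same pre-order.
    it = iter(tokens)
    def build():
        token = next(it)
        char, count = token[0], int(token[1:]) if len(token) > 1 else 0
        return Node(char, [build() for _ in range(count)])
    return build()

# One trie shape that matches the example string:
trie = Node('C', [Node('A', [Node('R', [Node('T')]), Node('P')]), Node('O', [Node('D')])])
print(','.join(serialize(trie, [])))  # C2,A2,R1,T,P,O1,D
```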
120 |
121 | ## 4. Capacity Estimation
122 |
123 | Since there will be a lot of duplicates in 5 billion daily queries, we can assume that only 20% of these will be unique.
124 | If we only want to index the most frequently searched of those terms, let's assume we end up with 100 million unique queries on which to build the index.
125 |
126 | #### Storage Estimates
127 | Assume: average query = 3 words, each averaging 5 characters, then average query size = 15 characters.
128 |
129 | ```
130 | Each character = 2 bytes
131 | We need 30 bytes to store each query.
132 |
133 | 100 million * 30 bytes => 3 GB
134 |
135 | ```
136 | We can expect this data to grow everyday, but we are also removing terms that are not being searched anymore.
137 | If we assume we have 2% unique new queries every day, total storage for a year is:
138 | ```
139 | 3GB + (3GB * 0.02 * 365 days) ==> ~25GB
140 |
141 | ```
142 |
143 | ## 5. Data Partition
144 | Although the index can fit on a single server, we still need to partition it to meet our requirements of low latency and higher efficiency.
145 |
146 | #### Partition based on maximum memory capacity of the server
147 | We can store data on a server as long as it has memory available.
148 | Whenever a sub-tree cannot fit in a server, we break our partition there and assign that range to this server.
149 | We then move on to the next server and repeat the process.
150 | If Server 1 stored `A to AABC`, then Server 2 will store `AABD` onwards. If our second server could store up to `BXA`, the next server will start from `BXB`, and so on.
151 |
152 | ```
153 | Server 1: A-AABC
154 | Server 2: AABD-BXA
155 | Server 3: BXB-CDA
156 | ...
157 | Server N: ...
158 | ```
159 | For queries, if a user types `A`, we query Server 1 and 2 to find top suggestions. If they type `AAA` we only query Server 1.
160 |
161 | #### Load Balancing
162 | We can have a LB to store the above mappings and redirect traffic accordingly.
163 |
164 |
165 |
166 | ## 6. Cache
167 |
168 | Caching the top searched terms will be extremely helpful in our service.
169 |
170 | > There will be a small percentage of queries (20/80 rule) that will be responsible for most of the traffic.
171 |
172 | We can have separate cache servers in front of the trie servers holding most frequently searched terms and their typeahead suggestions. Application servers should check these cache servers before hitting the trie servers to see if they have the desired searched terms. This will save us time to traverse the trie.
173 |
174 | We can also build a Machine Learning (ML) model that can try to predict the engagement on each suggestion based on simple counting, personalization, or trending data, and cache these terms beforehand.
175 |
176 |
177 | ## 7. The Client
178 | 1. The client should only try hitting the server if the user has not pressed any key for 50ms.
179 | 2. If the user is constantly typing, the client can cancel the in-progress request.
180 | 3. The client can wait first until the user enters a couple of characters, say 3.
181 | 4. Client can pre-fetch data from server to save future requests.
182 | 5. Client can store recent history locally, because of the high probability of being reused.
183 | 6. The server can push some part of their cache to a CDN for efficiency.
184 | 7. The client should establish the connection to the server as soon as the user opens the search bar, so that when the user types the first letter, the client doesn't waste time connecting.
185 |
186 | ## 8. Replication and Fault Tolerance.
187 | We should have replicas for our trie servers both for load balancing and also for fault tolerance.
188 | Since our trie servers use a master-slave configuration, if the master dies, the slave can take over after failover. Any server that comes back up can rebuild the trie from the last snapshot.
189 |
190 |
191 | ## 9. Personalization
192 | Users will receive some typeahead suggestions based on their historical searches, location, language, etc. We can store the personal history of each user separately on the server and also cache them on the client. The server can add these personalized terms in the final set before sending it to the user. Personalized searches should always come before others.
193 |
--------------------------------------------------------------------------------
/distributed_logging.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "f490e2af-7bb3-4c66-94b8-81e59f8bd065",
6 | "metadata": {},
7 | "source": [
8 | "# Designing Distributed Logging System\n",
9 | "\n",
10 | "One of the most challenging aspects of debugging distributed systems is understanding system behavior in the period leading up to a bug.\n",
11 | "As we all know by now, a distributed system is made up of microservices calling each other to complete an operation.\n",
12 | "Multiple services can talk to each other to complete a single business requirement.\n",
13 | "\n",
14 | "In this architecture, logs are accumulated in each machine running the microservice. A single microservice can also be deployed to hundreds of nodes. In an archirectural setup where multiple microservices are interdependent, and failure of one service can result in failures of other services. If we do not have well organized logging, we might not be able to determine the root cause of failure."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "id": "24dfcd83-0590-4e59-96ce-80f049fe9771",
20 | "metadata": {},
21 | "source": [
22 | "## Understanding the system\n",
23 | "### Restrain Log Size\n",
24 | "At any given time, the distributed system logs hundreds of concurrent messages. \n",
25 | "The number of logs increases over time. But, not all logs are important enough to be logged.\n",
26 | "To solve this, logs have to be structured. We need to decide what to log into the system on the application or logging level.\n",
27 | "\n",
28 | "### Log sampling\n",
29 | "Storage and processing resources is a constraint. We must determine which messages we should log into the system so as to control volume of logs generated.\n",
30 | "\n",
31 | "High-throughput systems will emit lots of messages from the same set of events. Instead of logging all the messages, we can use a sampler service that only logs a smaller set of messages from a larger chunk. The sampler service can use various sampling algorithms such as adaptive and priority sampling to log events. For large systems with thousands of microservices and billions of events per seconds, an appropriate \n",
32 | "\n",
33 | "### Structured logging\n",
34 | "The first benefit of structured logs is better interoperability between log readers and writers.\n",
35 | "Use structured logging to make the job of log processing system easier. \n",
36 | "\n",
37 | "### Categorization\n",
38 | "The following severity levels are commonly used in logging:\n",
39 | "- `DEBUG`\n",
40 | "- `INFO`\n",
41 | "- `WARNING`\n",
42 | "- `ERROR`\n",
43 | "- `CRITICAL`"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "id": "8cb6105b-a01d-450e-a11c-1544ee40deb2",
49 | "metadata": {},
50 | "source": [
51 | "## Requirements\n",
52 | "### Functional requirements\n",
53 | "- Writing logs: the microservices should be able to write into the logging system.\n",
54 | "- Query-based logs: It should be effortless to search for logs.\n",
55 | "- The logs should reside in distributed storage for easy access.\n",
56 | "- The logging mechanism should be secure and not vulnerable. Access to logs should be for authenticated users and necessary read-only permissions granted to everyone.\n",
57 | "- The system should avoid logging sensitive information like credit cards numbers, passwords, and so on.\n",
58 | "- Since logging is a I/O-heavy operation, the system should avoid logging excessive information. Logging all information is unnecessary. It only takes up more space and impacts performance.\n",
59 | "- Avoid logging personally identifiable information (PII) such as names, addresses, emails, etc.\n",
60 | "\n",
61 | "\n",
62 | "### Non-functional requirements\n",
63 | "- **Low latency:** logging is a resource-intensive operation that's significantly slower than CPU operations. To ensure low latency, the logging system should be designed so that logging does not block or delay a service's main application process.\n",
64 | "- **Scalability:** Our logging system must be scalable. It should efficiently handle increasing log volumes over time and support a growing number of concurrent users.\n",
65 | "- **Availability:** The logging system should be highly available to ensure data is consistently logged without interruption."
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "id": "341f2edc-95d0-4eeb-98f0-821a773a3617",
71 | "metadata": {},
72 | "source": [
73 | "## Components to use\n",
74 | "We will use the following components:\n",
75 | "- **Pub-Sub system:** we will use a publish-subscribe system to efficiently handle the large volume of logs.\n",
76 | "- **Distributed Search:** we will employ distributed search to query logs efficiently.\n",
77 | "\n",
78 | ">A distributed search system is a search architecture that efficiently handles large dataset and high query loads by spreading search operations across multiple servers or nodes. It has the following components:\n",
79 | ">1. **Crawler:** This component fetches the content and creates documents.\n",
80 | ">2. **Indexer:** Builds a searchable index from the fetched documents.\n",
81 | ">3. **Searcher:** Responds to user queries by running searches on the index created by the indexer.\n",
82 | "\n",
83 | "- **Logging Accumulator:** This component will collect logs from each node and store them in a central location, allowing for easy retrieval of logs related to specific events without needing to query each individual node.\n",
84 | "- **Blob Storage:** The blob storage provides a scalable and reliable storage for large volumes of data.\n",
85 | "- **Log Indexer:** Due to the increasing number of log files, efficient searching is crucial. The log indexer utilizes distributed search techniques to index and make logs searchable, ensuring fast retrieval of relevant information.\n",
86 | "- **Visualizer:** The visualizer component provides a unified view of all logs. It enables users to analyze and monitor system behavior and performance through visual representation and analytics.\n"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "id": "e34cb42e-89fe-483d-91d8-b7899e552932",
92 | "metadata": {},
93 | "source": [
94 | "## API Design\n",
95 | "We design for reads and writes\n",
96 | "\n",
97 | "\n",
98 | "Read\n",
99 | "```python\n",
100 | "searching(keyword)\n",
101 | "```\n",
102 | "This API call returns a list of logs that contain the keyword.\n",
103 | "\n",
104 | "Write\n",
105 | "```python\n",
106 | "write_logs(unique_ID, message)\n",
107 | "```\n",
108 | "This API call writes the log message against against a unique key.\n"
109 | ]
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "id": "14d6a78d-6560-47e2-b963-bced4fdedb40",
114 | "metadata": {},
115 | "source": [
116 | "## High Level System Design\n",
117 | "\n",
118 | ""
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "id": "298644e9-a2b0-487a-aa5e-30c236d3d0a9",
124 | "metadata": {},
125 | "source": [
126 | "## Component Design \n",
127 | "\n",
128 | "### Logging at Various Levels in a Server\n",
129 | "In a server environment, logging occurs across various services and application, each producing logs crucial for monitoring and troubleshooting.\n",
130 | "\n",
131 | "#### Server Level\n",
132 | "- **Multiple Applications:** A server hosts multiple apps, such as App1, App2, etc. Each running various microservices with user authentication, fetching the cart, storage etc in an e-commerce context.\n",
133 | "- **Logging Structure:** each service within the application generates logs identified by an ID conprising application ID, service ID, and timestamp, ensuring unique identification and event causality determination.\n",
134 | "\n",
135 | "#### Logging Process\n",
136 | "Each service will push logs into the log accumulator service.\n",
137 | "The service will store the logs logically and push the logs to a pub-sub system.\n",
138 | "\n",
139 | "We use the pub-sub system to handle scalability challenge by efficiently managing and distributing a large volume of logs across the system.\n",
140 | "\n",
141 | "#### Ensuring Low Latency and Performance\n",
142 | "- **Asynchronous Logging:** Logs are sent asynchronously via low-priority threads to avoid impacting the performance of critical processes. This also ensure continuous availability of services without any disruptions caused by logging activities.\n",
143 | "- **Data Loss Awareness:** Logging large volumes of messsages can lead to potential data loss. To balance user-perceived latency with data peristence, log services often use RAM and save data asynchronously. To minimize data loss, we will add more log accumulators to handle increasing concurrent users."
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "id": "848e3ef9-100b-4a34-9176-f588c280b973",
149 | "metadata": {},
150 | "source": [
151 | "#### Log Retention\n",
152 | "Logs also have an expiration date. We can delete regular logs after a few days or months. Comliance logs are usually stored for up to five years. If depends on the requirements of the application.\n",
153 | "\n",
154 | "Another crucial component therefore is to have an expiration checker. It will verity the logs that have to be deleted"
155 | ]
156 | },
157 | {
158 | "cell_type": "markdown",
159 | "id": "8faf0361-06b5-4dad-acc8-f2660172aa17",
160 | "metadata": {},
161 | "source": [
162 | "### Data Center Level\n",
163 | "All servers in the data center transmit logs to a publish-subscribe architecture.\n",
164 | "By utilizing a horizontally-scalable pub-sub framework, we can effectively manager large log volumes. \n",
165 | "\n",
166 | "Implementing multiple pub-sub instance within each data center enhances scalability and prevents throughput limitations and bottlenecks. Subsequently, the pub-sub system routes the log data to blob storage.\n",
167 | "\n",
168 | "\n",
169 | "\n",
170 | "Now, data in the pub-sub system is temporary and get deleted after a few days before being moved to archival storage. \n",
171 | "However, while the data is still present in the pub-sub system, we can utilize it using the following services:\n",
172 | "- **Alerts Service:** This service identifies alerts and errors and notifies the appropriate stakeholders if a critical error is detected, or sends a message to a monitoring tool, ensuring timely awareness of important alerts. The service will also monitor logs for suspicious activities or security incidents, triggering alerts or automated responses to mitigate threats.\n",
173 | "- **Analytics service:** This service analyzes trends and patterns in the logged data to provide insights into system perf, user behavior, or operational metrics."
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "id": "38164d1c-2c45-4b25-864d-463a5955577f",
180 | "metadata": {},
181 | "outputs": [],
182 | "source": []
183 | }
184 | ],
185 | "metadata": {
186 | "kernelspec": {
187 | "display_name": "Python 3 (ipykernel)",
188 | "language": "python",
189 | "name": "python3"
190 | },
191 | "language_info": {
192 | "codemirror_mode": {
193 | "name": "ipython",
194 | "version": 3
195 | },
196 | "file_extension": ".py",
197 | "mimetype": "text/x-python",
198 | "name": "python",
199 | "nbconvert_exporter": "python",
200 | "pygments_lexer": "ipython3",
201 | "version": "3.11.7"
202 | }
203 | },
204 | "nbformat": 4,
205 | "nbformat_minor": 5
206 | }
207 |
--------------------------------------------------------------------------------
/designing_instagram.md:
--------------------------------------------------------------------------------
1 | # Designing Instagram
2 |
3 | Let's design a photo-sharing service like IG, where users upload photos to share them with other users.
4 |
5 | Instagram enables its users to upload and share their photos and videos with other users. Users can choose to share information publicly or privately. Anything shared publicly can be seen by any other user, whereas privately shared content can only be accessed by a specified set of people.
6 |
7 | We plan to design a simpler version of Instagram, where a user can share photos and can also follow other users.
8 |
9 | ## 1. Requirements and Goals of the System
10 |
11 | #### Functional requirements
12 | - Users should be able to upload/download/view photos
13 | - Users can perform searches based on photo/video titles
14 | - Users can follow other users
15 | - The system should generate a News Feed consisting of top photos from all the people the user follows
16 |
17 | #### Non-functional requirements
18 | - The service needs to be highly available
19 | - The acceptable latency is 200ms for News Feed generation
20 | - The system should be highly reliable; any uploaded photo/video should never be lost.
21 |
22 | ## 2. Capacity Estimation and Constraints
23 |
24 | The system would be read-heavy, so we'll focus on building a system that can retrieve photos quickly.
25 |
26 | - Assume we have 500M total users, with 1M daily active users.
27 | - 2M new photos every day, 23 new photos per second.
28 | - Average photo file size ~= 200KB
29 | - Total space required for 1 day of photos =>
30 | ```
31 | 2M * 200KB => 400GB
32 | ```
33 | - Total space for 10 years:
34 | ```
35 | 400GB * 365 days * 10 years ~= 1425 TB => 1.4 Petabytes
36 | ```
37 |
38 | ## 3. High Level System Design
39 |
40 | At a high level, we need to support two scenarios: uploading photos and viewing/searching photos.
41 |
42 | We need object storage servers to store photos and also some DB servers to store metadata information about the photos.
43 |
44 | 
45 |
46 | ## 4. Database Schema
47 |
48 | > The DB schema will help understand data flow among various components and later guide us towards data partitioning.
49 |
50 | We need to store user data, their photos, and people they follow.
51 |
52 | Photo table
53 |
54 | >
55 |
56 | | Photo |
57 | | --- |
58 | | PhotoID: int (PK) |
59 | | UserID: int |
| PhotoPath: varchar(256) |
60 | | PhotoLatitude: int |
61 | | PhotoLongitude: int |
62 | | UserLatitude: int |
63 | | UserLongitude: int |
64 | | CreationDate: datetime |
65 |
66 | >
67 |
68 | | User |
69 | | --- |
70 | | UserID: int (PK) |
71 | | Name: varchar(20) |
| Email: varchar(32) |
72 | | DOB: datetime |
73 | | CreatedAt: datetime |
74 | | LastLogin: datetime |
75 |
76 | >
77 |
78 | |UserFollow | |
79 | |---|---|
80 | | PK | UserID1: int |
81 | | PK | UserID2: int|
82 |
83 |
84 | We could use an RDBMS like MySQL since we require joins. But relational DBs come with their challenges, especially when scaling. So we can store the schema in a distributed wide-column NoSQL datastore like [Cassandra](https://en.wikipedia.org/wiki/Apache_Cassandra).
85 | All the photo metadata can go to a table where the `'key'` is the `PhotoID` and the `'value'` would be an object containing Photo related details.
86 | Cassandra and most key-value stores maintain a certain number of replicas to offer reliability. Also in these data stores, deletes don't get applied instantly, data is retained for a few days to support undeleting, before getting removed permanently.
87 |
88 | We can store the actual photos in a distributed file storage system like [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) or [S3](https://en.wikipedia.org/wiki/Amazon_S3).
89 |
90 |
91 | ## 5. Data Size Estimation
92 |
93 | Let's estimate how much storage we'll need for the next 10 years.
94 |
95 | ### User
96 |
97 | Assuming each int and datetime is 4 bytes, each row in User table will have:
98 | ```
99 | UserID(4 bytes) + Name(20 bytes) + Email(32 bytes) + DOB(4 bytes) +
100 | CreatedAt(4 bytes) + LastLogin(4 bytes) = 68 bytes
101 | ```
102 | We have 500 million users:
103 | ```
104 | 500 million * 68 bytes ~= 32 GB
105 | ```
106 |
107 | ### Photo
108 |
109 | Each row in Photos table will have:
110 | ```
111 | PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes
112 | ```
113 | We get 2M photos every day, so for one day we need:
114 | ```
115 | 2 M * 284 bytes ~= 0.5 GB per day
116 |
117 | For 10 years we'll need:
118 | 0.5GB per day * 365 days * 10 years => 1.88 TB
119 | ```
120 |
121 | ### UserFollow
122 |
123 | Each row will have 8 bytes. Assume on average, each user follows 500 users, We would need 1.82 TB of storage for the UserFollow Table:
124 | ```
125 | 8 bytes * 500 followers * 500M users ~= 1.82 TB
126 | ```
127 | Total space required for the DB tables for 10 years will be:
128 | ```
129 | 32 GB + 1.88 TB + 1.82 TB ~= 3.7 TB
130 | ```
131 |
132 | ## 6. Component Design
133 |
134 | Photo uploads (or writes) can be slow as they have to go to the disk, while reads will be faster, especially if they are being served from cache.
135 |
136 | Uploading users can consume all available connections, as uploading is a slow process, meaning reads can't be served if the system gets busy with all the write requests.
137 |
138 | We should keep in mind that all web servers have a connection limit. If we assume that a web server can have a maximum of 500 connections at any time, then it can't have more than 500 concurrent reads and uploads. To handle this bottleneck, we can split reads and writes into separate services: dedicated servers for reads and different servers for writes/uploads, to ensure they don't hog the system.
139 |
140 | > Also, separating reads from writes will allow us to scale and optimize each operation independently.
141 |
142 | 
143 |
144 | ## 7. Reliability and Redundancy
145 |
146 | Losing files is not an option for our service.
147 |
148 | We'll store multiple copies of each file so that if one storage server dies, we can retrieve another copy on a different storage server.
149 |
150 | This principle also applies to the rest of the system. If we want high availability, we need to have multiple replicas of services running, so that if a few services go down, the system remains available and running.
151 |
152 | > Redundancy removes single points of failure in the system: if a component fails, a redundant copy can take over after failover.
153 |
154 | If there are two instances of the same service running on production and one fails, the system can failover to the healthy copy. Failover can happen automatically or be done manually.
155 |
156 | 
157 |
158 | ## 8. Data Sharding
159 |
160 | ### a. Partitioning based on UserID
161 |
162 | We can shard based on UserID, so that we keep all photos of a user on the same shard. If one DB shard is 1TB, we need 4 shards to store 3.7TB of data. Assume for better performance and scalability, we keep 10 shards.
163 |
164 | We'll find the shard number by doing (UserID % 10) and storing the data there. To uniquely identify each photo in the system, we can append shard number to each PhotoID.
165 |
166 | **How do we generate PhotoIDs?** Each DB shard can have its own auto-increment sequence for PhotoIDs and since we will append ShardID with each PhotoID, it will make it unique throughout the system.
167 |
168 | Issues with this approach:
169 | - How would we handle hot users? IG celebrities have a lot of followers, meaning many people see any photo they upload.
170 | - Some users will have more photos than others, so data will be unevenly distributed in the partitions.
171 | - Storing all photos of a user on one shard can cause issues like unavailability if that shard is down, or higher latency if it's serving high loads.
172 |
173 |
174 | ### b. Partitioning based on PhotoID
175 |
176 | If we generate unique PhotoIDs first, then find a shard number using
177 | (PhotoID % 10), the above problems will be solved.
178 | We won't need to append a ShardID to the PhotoID since the PhotoID will itself be unique throughout the system.
179 |
180 | **How to generate PhotoIDs?** We can dedicate a separate DB instance to generating auto-incrementing IDs. If our PhotoID can fit into 64 bits, we can define a table containing only a 64-bit ID field. Whenever we want to add a photo, we insert a new row in this table and use that ID as the new photo's PhotoID.
181 |
182 | **Wouldn't this key-generating DB be a single point of failure?**
183 | Yes.
184 |
185 | A workaround for this is to define two such DBs:
186 | - one generates even numbered IDs
187 | - the other generates odd numbered IDs
188 |
189 | ```
190 | KeyGeneratingServer1:
191 | auto-increment-increment = 2
192 | auto-increment-offset = 1
193 |
194 | KeyGeneratingServer2:
195 | auto-increment-increment = 2
196 | auto-increment-offset = 2
197 | ```
198 | Then, we can put a load balancer in front of both DBs to round robin between them and to deal with downtime.
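
A minimal sketch of ID generation and shard selection under this scheme (the `KeyGenerator` class is an in-memory stand-in for the two key-generating DB instances configured above):

```python
import itertools

NUM_SHARDS = 10

class KeyGenerator:
    # Stand-in for one key-generating DB (offset 1 -> odd IDs, offset 2 -> even IDs).
    def __init__(self, offset, increment=2):
        self._ids = itertools.count(start=offset, step=increment)

    def next_id(self):
        return next(self._ids)

# A load balancer round-robins between the two generators.
generators = itertools.cycle([KeyGenerator(1), KeyGenerator(2)])

def new_photo_id():
    return next(generators).next_id()

def shard_for_photo(photo_id):
    # With PhotoID-based partitioning, the shard is derived from the ID itself.
    return photo_id % NUM_SHARDS
```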
199 |
200 | **Alternatively**, we can have a standalone Key Generation Service (KGS) that generates random six letter strings beforehand and stores them in a database (let’s call it key-DB)
201 |
202 | ## 9. Ranking and NewsFeed Generation
203 |
204 | We need to fetch the latest, most popular photos of the people the user follows.
205 |
206 | - First, get a list of people the user follows and fetch metadata info of latest 100 photos for each
207 | - Submit all photos to a ranking algorithm, which will determine the top 100 photos (based on recency, likes, etc.)
208 | - Return them to the user as news feed.
209 |
210 | To improve the efficiency, we can pre-generate the News Feed and store it in a separate table.
211 |
212 | #### **Pre-generating the News Feed**:
213 |
214 | - Dedicate servers that continuously generate users' News Feeds and store them in a **`UserNewsFeed`** table. When any user needs the latest photos, we simply query this table.
215 | - When servers need to generate again, they will first query the `UserNewsFeed` table to find the last time the News Feed was generated. Then, new News Feed data will be generated from that time onwards.
216 |
217 | #### **How do we send News Feed content to users?**
218 |
219 | **1. Pull**: Clients pull content from the server at regular intervals or manually.
220 | Problems:
221 | - New data won't show until the client issues a pull request
222 | - Most of the time, pull requests will result in an empty response if there's no new data, frustrating the user.
223 |
224 | **2. Push:** Servers push new data to users as soon as it is available. Users maintain a long poll request with the server. A possible problem is, a user who follows a lot of people or a celebrity user who has millions of followers; the server will have to push lots of updates quite frequently, straining the server.
225 |
226 | **3. Hybrid:**
227 | - We can adopt a hybrid of the two approaches. Users who follow a lot of people will use a pull-based model; we only push data to those users who follow fewer than 1000 people.
228 | - Alternatively, the server pushes updates to all users no more often than a certain frequency, letting users who receive a lot of updates pull data regularly.
229 |
230 |
231 |
232 | ## 10. Cache and Load balancing
233 |
234 | Our service will need a massive-scale photo delivery system to serve the globally distributed users.
235 |
236 | We'll push content closer to the user using a large number of geographically distributed CDNs.
237 |
238 | We can also have a cache for metadata servers to cache hot DB rows. Memcached can store this data, and application servers can check the cache before hitting the actual DB.
239 | For cache eviction, we can use Least Recently Used (LRU), where we discard the least recently viewed rows from the cache memory first.
240 |
241 |
242 | #### **How can we build a more intelligent cache?**
243 |
244 | If we go with the 80-20 rule, 20% of photo reads generate 80% of the traffic. This means that certain photos are so popular that the majority of people view/search them. Therefore, we can try caching 20% of the daily read volume of photos and metadata.
245 |
--------------------------------------------------------------------------------
/designing_uber_backend.md:
--------------------------------------------------------------------------------
1 | # Designing Uber Backend
2 | Let's design a ride-sharing service like Uber, connecting a passenger who needs a ride with a driver who has a car.
3 |
4 | Uber enables its customers to book drivers for taxi rides. Uber drivers use their cars to drive customers around. Both customers and drivers communicate with each other through their smartphones using the Uber app.
5 |
6 | Similar Services: Lyft
7 |
8 | ## 1. Requirements and Goals of the System
9 |
10 | There are two types of users in our system: Drivers and Customers.
11 |
12 | - Drivers need to regularly notify the service about their current location and their availability to pick passengers
13 | - Passengers get to see all the nearby available drivers
14 | - Customers can request a ride; this notifies nearby drivers that a customer is ready to be picked up
15 | - Once a driver and a customer accept a ride, they can constantly see each other's current location until the trip finishes.
16 | - Upon reaching the destination, the driver marks the journey complete to be available for the next ride.
17 |
18 |
19 | ## 2. Capacity Estimation and Constraints
20 | - Assume we have 300 million customers and 1 million daily active customers, and 500K daily active drivers.
21 | - Assume 1 million daily rides
22 | - Let's assume that all active drivers notify their current location every 3 seconds.
23 | - Once a customer puts in a request for a ride, the system should be able to contact drivers in real-time.
24 |
25 | ## 3. Basic System Design and Algorithm
26 |
27 |
28 | ### a. Grids
29 | We can divide the whole city map into smaller grids to group driver locations into smaller sets. Each grid will store all the drivers' locations under a specific range of longitude and latitude.
30 | This will enable us to query only a few grids to find nearby drivers. Based on a customer's location, we can find all neighbouring grids and then query these grids to find nearby drivers.
31 |
32 | #### What could be a reasonable grid size?
33 | Let's assume that GridID (say, a 4 byte number) would uniquely identify grids in our system.
34 |
35 | Grid size could be equal to the distance we want to query since we also want to reduce the number of grids. We can search within the customer's grid which contains their location and the neighbouring eight grids. Since our grids will be statically defined, (from the fixed grid size), we can easily find the grid number of any driver (lat, long) and its neighbouring grids.
36 |
37 | In the DB, we can store the GridID with each location and have an index on it, too, for faster searching.
38 |
39 |
40 | ### b. Dynamic size Grids
41 | Let's assume we don't want to have more than 100 driver locations in a grid so that we can have faster searching. So, whenever a grid reaches this limit, we break it down into four grids of equal size, and distribute drivers among them (of course according to the driver's current location). This means that the city center will have a lot of grids, whereas the outskirts of the city will have large grids with few drivers.
42 |
43 | #### What data structure can hold this information?
44 | A tree in which each node has 4 children.
45 | Each node will represent a grid and will contain info about all the drivers' locations in that grid. If a node reaches our limit of 100 driver locations, we will break it down to create 4 child nodes under it and distribute driver locations among them. In this way, all the leaf nodes will represent grids that can't be broken down further. So leaf nodes will keep a list of driver locations with them.
46 | This tree structure is called a [QuadTree](https://en.wikipedia.org/wiki/Quadtree).
47 |
48 | 
49 |
50 | #### How do we build a quad tree?
51 | We'll start with one node that represents the whole city in one grid. Each Uber-available city will have its own quad tree.
52 | Since it will have more than 100 locations, we will break it down into 4 nodes and distribute locations among them. We will keep repeating the process with each child node until there are no nodes left with more than 100 driver locations.
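
A minimal QuadTree sketch (a hypothetical structure; it splits a grid into four equal quadrants once it holds more than 100 driver locations):

```python
MAX_DRIVERS_PER_GRID = 100

class QuadTreeNode:
    def __init__(self, min_lat, min_lng, max_lat, max_lng):
        self.bounds = (min_lat, min_lng, max_lat, max_lng)
        self.drivers = {}      # driver_id -> (lat, lng); populated only on leaf nodes
        self.children = []     # empty for a leaf, otherwise exactly four child nodes

    def contains(self, lat, lng):
        min_lat, min_lng, max_lat, max_lng = self.bounds
        return min_lat <= lat < max_lat and min_lng <= lng < max_lng

    def insert(self, driver_id, lat, lng):
        if self.children:
            for child in self.children:
                if child.contains(lat, lng):
                    return child.insert(driver_id, lat, lng)
            return
        self.drivers[driver_id] = (lat, lng)
        if len(self.drivers) > MAX_DRIVERS_PER_GRID:
            self._subdivide()

    def _subdivide(self):
        # Split this grid into four equal quadrants and redistribute its drivers.
        min_lat, min_lng, max_lat, max_lng = self.bounds
        mid_lat, mid_lng = (min_lat + max_lat) / 2, (min_lng + max_lng) / 2
        self.children = [
            QuadTreeNode(min_lat, min_lng, mid_lat, mid_lng),
            QuadTreeNode(min_lat, mid_lng, mid_lat, max_lng),
            QuadTreeNode(mid_lat, min_lng, max_lat, mid_lng),
            QuadTreeNode(mid_lat, mid_lng, max_lat, max_lng),
        ]
        drivers, self.drivers = self.drivers, {}
        for driver_id, (lat, lng) in drivers.items():
            self.insert(driver_id, lat, lng)
```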
53 |
54 | #### How will we find the grid for a given location?
55 | We start with the root node (city) and search downward to find the required node/grid. At each step we will see if the current node we are visiting has children. If it has, we move to the child node that contains our desired location and repeat the process. We stop only if the node does not have any children, meaning that's our desired node (grid).
56 |
57 | #### How will we find neighboring grids of a given grid?
58 | **Approach 1:** Since our leaf nodes contain a list of locations, we can connect all leaf nodes with a doubly linked list. This way we can iterate forward or backwards among neighbouring leaf nodes to find our desired driver locations.
59 |
60 | **Approach 2:** Find it through parent nodes. We can keep a pointer in each node to access its parent, and since each parent node has pointers to all its children, we can easily find siblings of a node. We can keep expanding our search for neighboring grids by going up through parent pointers.
61 |
62 | Issues with our Dynamic Grid solution:
63 |
64 | * Since all active drivers are reporting their locations every 3 seconds, we need to update the QuadTree to reflect that. It will take a lot of time and resources if we have to update it for every change in the driver's coordinates.
65 | * If the new position does not belong in the current grid, we have to remove the driver from the current grid and remove/reinsert the user to the correct grid. After this move, if the new grid reaches the maximum limit of drivers, we have to repartition it.
66 | * We need to have a quick mechanism to propagate the current location of all nearby drivers to any active customer in that area. Also, when a ride is in progress, our system needs to notify both the driver and passenger about the current location of the car.
67 |
68 | > Although our QuadTree helps us find nearby drivers quickly, a fast update in the tree is not guaranteed.
69 |
70 | #### Do we need to modify our QuadTree every time a driver reports their location?
71 | If we don't, the tree will have some old data and won't reflect drivers' current locations correctly. Since all active drivers report their location every 3 seconds, there will be a lot more updates happening to our tree than queries for nearby drivers.
72 |
73 | Enter hash table!
74 |
75 | > We can keep the latest position reported by all drivers in a hash table and update our QuadTree a little less frequently.
76 |
77 | Let's assume we guarantee that a driver's current location will be reflected in the QuadTree within 15 seconds. Meanwhile we will maintain a hash table that will store the current location reported by drivers; let's call this **DriverLocationHT**.
78 |
79 | #### How much memory do we need for DriverLocationHT?
80 | We need to store DriverID, their present and old location, in the hash table. So we need a total of 35 bytes to store one record.
81 |
82 | 1. DriverID (3 bytes - 1 million drivers)
83 | 2. Old latitude (8 bytes)
84 | 3. Old longitude (8 bytes)
85 | 4. New latitude (8 bytes)
86 | 5. New longitude (8 bytes)

Total => 3 + (8 * 4) = 35 bytes
87 |
88 | If we have one million drivers, we need:
89 |
90 | ```
91 | 1 million * 35 bytes ==> 35MB (ignoring hash table overhead)
92 | ```
93 |
94 | #### How much Bandwidth?
95 | To receive location updates from all active drivers, we get DriverID and their location (3 + 16 bytes => 19 bytes). We do this every 3 seconds from 500K active drivers:
96 | ```
97 | 19 bytes * 500K drivers ==> 9.5MB per 3 sec.
98 | ```
99 |
100 | ### Do we need to distribute DriverLocationHT Hash Table onto multiple servers?
101 | The memory and bandwidth requirements can be easily handled by one server, but for scalability, performance, and fault tolerance, we should distribute DriverLocationHT onto multiple servers. We can distribute based on the DriverID to make the distribution completely random. Let's call the machines holding DriverLocationHT the Driver Location servers.
102 |
103 | The servers will:
104 | - As soon as they receive a driver location update, broadcast that information to all interested customers.
105 | - Notify the respective QuadTree server to refresh the driver's location. This happens every 15 seconds.
106 |
107 |
108 | ### Broadcasting driver's location to customers
109 | We can have a **Push Model** where the server pushes all the positions to all the relevant customers.
110 | - A dedicated Notification Service that can broadcast current locations of drivers to all the interested customers.
111 | - Build our Notification service on a publisher/subscriber model. When a customer opens the Uber app, they query the server to find nearby drivers. On the back-end, before returning the list of drivers to the customer, we subscribe the customer to all the updates from those nearby drivers.
112 | - We can maintain a list of customers interested in knowing a driver's location, and whenever we have an update in DriverLocationHT for that driver, we can broadcast the current location of the driver to all subscribed customers. This way, our system ensures that we always show the driver's current position to the customer.
113 |
114 | #### Memory needed to store customer subscriptions
115 | Assume 5 customers subscribe to 1 driver. Let's assume we store this information in a hash table to update it efficiently. We need to store driver and customer IDs to maintain subscriptions.
116 |
117 | Assume we need 3 bytes for DriverID, 8 bytes for CustomerID:
118 | ```
119 | (500K drivers * 3 bytes) + (500K * 5 customers * 8 bytes) ~= 21MB
120 | ```
121 |
122 | #### How much bandwidth will we need for the broadcast?
123 | For every active driver, we have 5 subscribing customers, so the total subscribers are:
124 | ```
125 | 5 * 500K => 2.5M
126 | ```
127 | To all these customers, we need to send DriverID(3 bytes) + Location(16 bytes) every second, so we need the following bandwidth:
128 | ```
129 | (3 + 16) bytes * 2.5M ==> 47.5 MB/s
130 | ```
131 |
132 | #### How can we efficiently implement Notification Service?
133 | We can use either HTTP long polling or push notifications.
134 |
135 |
136 | #### How about Clients pull nearby driver information from server?
137 | - Clients can send their current location, and the server will find all the nearby drivers from the QuadTree to return them to the client.
138 | - Upon receiving this information, the client can update their screen to reflect current positions of drivers.
139 | - Clients will query every five seconds to limit the number of round trips to the server.
140 | - This solution looks simpler compared to the push model described above.
141 |
142 |
143 | #### Do we need to repartition a grid as soon as it reaches maximum limit?
144 | We can have a grid shrink/grow an extra 10% before we partition/merge them. This will decrease the load for a grid partition or merge on high traffic grids.
145 |
146 | 
147 |
148 | ### "Requesting a Ride" use case
149 | 1. The customer will put a request for a ride
150 | 2. One of the Aggregator servers will take the request and ask the QuadTree servers to return nearby drivers.
151 | 3. The Aggregator server collects all the results and sorts them by ratings.
152 | 4. The Aggregator server will send a notification to the top (say three) drivers simultaneously, whichever driver accepts the request first will be assigned the ride. The other drivers will receive a cancellation request.
153 | 5. If none of the three drivers respond, the Aggregator will request a ride from the next three drivers from the list.
154 | 6. The customer is notified once the driver accepts a request.
155 |
156 | ## 4. Fault Tolerance and Replication
157 |
158 | #### What if a Driver Location server or Notification server dies?
159 | We need replicas of these servers, so that if the primary dies, we can fail over to the secondary server. We can also store data in persistent storage like SSDs that provide fast IOs; this ensures that if both primary and secondary servers die, we can recover the data from persistent storage.
160 |
161 | ## 5. Ranking Drivers
162 | We can rank search results not just by proximity but also by popularity or relevance.
163 |
164 |
165 | #### How can we return top rated drivers within a given radius?
166 | Let's assume we keep track of the overall ratings in our database and QuadTree. An aggregated number can represent this popularity in our system.
167 |
168 | For example, while searching for the top 10 drivers within a given radius, we can ask each partition of QuadTree to return the top 10 drivers with a maximum rating. The aggregator server can then determine the top 10 drivers among all drivers returned.
169 |
--------------------------------------------------------------------------------
/designing-cloud-storage.md:
--------------------------------------------------------------------------------
1 | # Designing a Cloud Storage Service like Dropbox or Google Drive
2 | Cloud file storage enables users to store their data on remote servers.
3 |
4 | Why cloud storage?
5 | 1. **Availability:** Users can access their files from any device, anywhere, anytime.
6 | 2. **Durability and Reliability:** Cloud storage ensures that users never lose their data by storing their data on different geographically located servers.
7 | 3. **Scalability:** Users will never have to worry about running out of storage space, as long as they are ready to pay for it.
8 |
9 |
10 |
11 | ## 1. Requirements and System Goals
12 | What do we want to achieve from a cloud storage system?
13 | 1. Users should be able to upload/download files from any device
14 | 2. Users should be able to share files and folders with other users
15 | 3. The service should support large files
16 | 4. The service should allow syncing of files between devices. Updating a file on a device should get synchronized on all other devices.
17 | 5. ACID-ity on all file operations should be enforced.
18 | 6. The service should support offline editing, and as soon as users come online, all changes should be synced.
19 | 7. The service should support snapshotting of data, so that users can go back to previous versions of it.
20 |
21 | #### Design Considerations
22 | - We should expect large read and write volumes, with the read/write ratio being about the same.
23 | - Files must be stored in small chunks. This has a lot of benefits, such as if a user fails to upload a file, then only the failing chunk will be retried instead of the entire file.
24 | - We can reduce data exchange by transferring only updated chunks. And, for small changes, clients can intelligently upload the diffs instead of the whole chunk
25 | - The system can remove duplicate chunks, to save storage space and bandwidth.
26 | - We can prevent lots of round trips to the server if clients keep a local copy of metadata (file name, size etc)
27 |
28 | ## 2. Capacity Estimation
29 | ```
30 | * Assume: 100 million users, 20 million DAU (daily active users)
31 | * Assume: Each user has on average two devices
32 | * Assume: Each user has on average about 100 files/photos/videos, we have
33 |
34 | Total files => 100,000,000 users * 100 files = 10 billion files
35 |
36 | * Assume: The average file size => 100KB, Total storage would be:
37 | 0.1MB * 10B files => 1 PB(Petabyte)
38 | ```
39 |
40 |
41 |
42 | ## 3. High Level Design
43 |
44 | The user will have a folder as their workspace on their device. Any file/photo/folder placed inside this folder will be uploaded to the cloud. If changes are made to it, they will be reflected in the same way in the cloud storage.
45 | - We need to store files and metadata (file name, size, dir, etc) and who the file is shared with.
46 | - We need block servers to help the clients upload/download files to our service
47 | - We need metadata servers to facilitate updating file and user metadata.
48 | - We need a synchronization server to notify clients whenever an update happens so they can start synchronizing the updated files.
49 | - We need to keep metadata of files updated in a NoSQL database.
50 |
51 | 
52 |
53 | ## 4. Component Design
54 | The major components of the system are as follows:
55 |
56 | ### a. Client
57 | The client monitors the workspace folder on the user's device and syncs changes to the remote cloud storage.
58 | The main operations for the client are:
59 | 1. Upload and download files
60 | 2. Detect file changes in workspace folder
61 | 3. Resolve conflict due to offline or concurrent updates.
62 |
63 | #### Handling efficient file transfer
64 | We can break files into small chunks, say 4MB. We can transfer only modified chunks instead of entire files. To pick an optimal chunk size, we can consider the following:
65 | - Input/output operations per sec (IOPS) for the storage devices in our backend.
66 | - Network bandwidth.
67 | - Average file size in our storage.
68 |
69 | We should also keep a record of each file and the chunks that make up that file in our metadata servers.
70 |
71 | A copy of metadata can also be kept with the client to enable offline updates and save round trips to update the remote metadata.
72 |
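A minimal chunking sketch (assuming a fixed 4MB chunk size and SHA-256 fingerprints; these hashes are what the metadata DB and the deduplication step compare):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB

def chunk_file(path):
    # Split a file into fixed-size chunks and fingerprint each one.
    chunks = []
    with open(path, 'rb') as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append({'index': index, 'hash': hashlib.sha256(data).hexdigest(), 'size': len(data)})
            index += 1
    return chunks

def changed_chunks(old_chunks, new_chunks):
    # Only chunks whose hash differs (or that are new) need to be uploaded.
    old_hashes = {c['index']: c['hash'] for c in old_chunks}
    return [c for c in new_chunks if old_hashes.get(c['index']) != c['hash']]
```
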
73 | #### Syncing with other clients
74 | We can use HTTP long polling to request info from the server. If the server has no new data for this client, instead of sending an empty response, it holds the request open and waits for response information to become available. Once new info is available, the server immediately sends a HTTP response to the client, completing the open request.
75 |
76 | #### Major parts of the Client
77 | 
78 |
79 | 1. **Internal Metadata DB:** to keep track of all files, chunks, versions, and locations in the file system.
80 | 2. **Chunker:** will split files into chunks, and reconstruct a file from its chunks.
81 | 3. **Watcher:** will monitor workspace folder and notify the indexer of user action (e.g CRUD operations), as well as listen for incoming sync changes broadcasted by `Sync Service`.
82 | 4. **Indexer:** will process events from the watcher and update the client DB with the necessary chunk/update information on files. Once chunks are synced to the cloud, the indexer can communicate with the `remote Sync Service` to broadcast changes to other clients and update the `Remote Metadata DB`.
83 |
84 | On client communication frequency:
85 | > The client should exponentially back off connection retries to a busy/slow server, and mobile clients should sync on demand to save on the user's bandwidth and space.
86 |
87 |
88 | ### b. Metadata DB
89 | The Metadata database can be a relational database like MySQL or a NoSQL DB like DynamoDB.
90 | The Sync Service should be able to provide a consistent view of the files through a DB, especially if the file is being edited by more than one user.
91 |
92 | If we go with NoSQL for its scalability and performance, we can support ACID properties programmatically in the logic of our Sync Service.
93 |
94 | The objects to be saved in the Metadata NoSQL DB are as follows:
95 | - Chunks
96 | - Files
97 | - User
98 | - Devices
99 | - Workspace Sync Folders
100 |
101 |
102 | ### c. Sync Service
103 | This component will process file updates made by a client and apply changes to other subscribed clients.
104 | It will sync local DB for the client with the info store in the remote Metadata DB.
105 |
106 | **Consistency and reliability:** When the Sync Service receives an update request, has a verification process. This process first checks with the Metadata DB for consistency before and proceeding with the update, ensuring data integrity. This step helps prevent conflicts and inconsistencies that could come about from concurrent updates from multiple clients.
107 |
108 | **Efficient Data Transfer:** By transmitting only the diffs between file versions instead of the entire file, bandwidth consumption and cloud storage usage are minimized. This approach is especially beneficial for large files and frequent-update scenarios.
109 |
110 | **Optimized storage:** The server and clients can calculate a hash using a collision-resistant algorithm (SHAs, checksums, or even Merkle trees) to decide whether to update a copy of a chunk or not. On the server, if we already have a chunk with the same hash, we don't need to create another copy; we can reuse the existing chunk. The Sync Service will intelligently identify and reuse existing chunks, reducing redundancy and conserving storage space.
111 |
112 | **Scalability through messaging middleware:** Adding a messaging middleware between clients and the Sync Service will allow us to provide scalable message queuing and change notifications, supporting a high number of clients using pull or push strategies.
113 | Multiple Sync Service instances can receive requests from a global request queue, and the messaging middleware will balance the load across them.
114 |
115 | **Future-Proofing for Growth:** By designing the system with scalability and efficiency in mind, it can accommodate increasing demands as the user base grows or usage patterns evolve. This proactive approach minimizes the need for major architectural changes or performance optimizations down the line, leading to a more sustainable and adaptable system architecture.
116 |
117 | ### d. Message Queuing Service
118 | This component supports asynchronous communication between the client and the Sync Service, and efficiently stores any number of messages in a highly available, reliable, and scalable queue.
119 |
120 | The service will have two queues:
121 | 1. **A Request Queue:** a global queue which will receive clients' requests to update the Metadata DB.
122 | From there, the Sync Service will take the message to update metadata.
123 |
124 | 2. **A Response Queue:** will correspond to an individual subscribed client, and will deliver update messages to that client. Each message will be deleted from the queue once received by a client. Therefore, we need to create a separate Response Queue for each subscribed client.
125 |
126 | 
127 |
128 | ## 5. File Processing Workflow
129 | When Client A updates a file that's shared with Client B and C, they should receive updates too. If other clients
130 | are not online at the time of update, the Message Queue Service keeps the update notifications in response queues until they come online.
131 |
132 | 1. Client A uploads chunks to cloud storage.
133 | 2. Client A updates metadata and commits changes.
134 | 3. Client A gets confirmation and notifications are sent to Client B and C about the changes.
135 | 4. Client B and C receive metadata changes and download updated chunks.
136 |
137 | ## 6. Data Deduplication
138 | We'll use this technique to remove duplicate copies of data and cut down storage costs.
139 | For each incoming chunk, we calculate a hash of it and compare it with hashes of the existing chunks to see if we have a similar chunk that's already saved.
140 |
141 | Two ways to do this:
142 |
143 | a. **In-line deduplication:** do hash calculations in real-time as clients enter the data on the device. If an existing chunk has the same hash as a new chunk, we store a reference to the existing chunk as metadata. This prevents us from making a full copy of the chunk, saving on network bandwidth and storage usage.
144 |
145 | b. **Post-process deduplication:** store new chunks and later some process analyzes the data looking for duplicated chunks. The benefit here is that clients don't need to wait for hash lookups to complete storing data. This ensures there's no degradation in storage performance. The drawback is that duplicate data will consume bandwidth, and we will also unnecessarily store it, but only for a short time.
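
A minimal sketch of in-line deduplication, using an in-memory hash index to stand in for the chunk metadata and block storage:

```python
import hashlib

chunk_index = {}  # chunk hash -> reference to the stored chunk (metadata)
chunk_store = {}  # reference -> raw bytes (stands in for block storage)

def store_chunk(data: bytes) -> str:
    """Store a chunk only if an identical one doesn't exist yet; return its reference."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in chunk_index:
        return chunk_index[digest]  # duplicate: reuse the existing chunk, no new copy
    chunk_store[digest] = data      # new chunk: persist it
    chunk_index[digest] = digest
    return digest
```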
146 |
147 | ## 7. Partitioning Metadata DB
148 | To scale our metadata DB, we can partition it using various partition schemes:
149 |
150 | We can use Range-based partitioning where we store files/chunks in separate partitions based on the first letter of the file path. However, this can lead to unbalanced servers, where partitions that start with frequently occurring letters will have more files than those that don't.
151 |
152 | For Hash-based partitioning, we can take a hash of the object and use it to determine the DB partition to save the object. A hash on the `FileID` of the File object we are storing can be used to determine the partition to store the object.
153 |
154 | The hashing function will distribute objects into different partitions by mapping them to a number between `[1,...,256]`, and this number will be the partition where we store the object. To prevent overloading some partitions, we can use `Consistent Hashing`.
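
A sketch of the hash-based scheme, mapping a `FileID` to one of 256 partitions (in practice the plain modulo would be replaced by a consistent-hashing ring to avoid rebalancing pain):

```python
import hashlib

NUM_PARTITIONS = 256

def partition_for(file_id: str) -> int:
    """Map a FileID to a partition number in [1, 256]."""
    digest = hashlib.md5(file_id.encode()).hexdigest()
    return (int(digest, 16) % NUM_PARTITIONS) + 1

# All metadata for a given file lands on the same partition.
print(partition_for("file-1a2b3c"))
```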
155 |
156 | ## 8. Load Balancer
157 | We can add the load balancer at two places:
158 | 1. Between Client and Block servers
159 | 2. Between Client and Metadata servers
160 |
161 | 
162 |
163 | We can have a round robin load balancer that distributes incoming requests equally among backend servers. But if a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB strategy can be implemented: the LB queries a backend server for its current load before sending traffic to it, and adjusts traffic to that server accordingly.
164 |
165 | ## 9. Caching
166 | To deal with hot, frequently used files/chunks, we can create a cache for block storage. We'll store whole chunks,
167 | and the system can check if the cache has the desired chunk before hitting Block storage.
168 |
169 | An LRU eviction policy can be used to discard the least recently used chunk first.
170 | We can also introduce a cache for metadata DB for hot metadata records.
171 |
--------------------------------------------------------------------------------
/designing_webcrawler.md:
--------------------------------------------------------------------------------
1 | # Designing a Web Crawler
2 |
3 | Let's design a Web Crawler that will browse and download the World Wide Web.
4 |
5 | ## What's a Web Crawler?
6 | It's a software program which browses the WWW in a methodical and automated manner, collecting documents by recursively fetching links from a set of starting pages.
7 |
8 | Search engines use web crawling as a means to provide up-to-date data. Search engines download all pages and create an index on them to perform faster searches.
9 |
10 | Other uses of web crawlers:
11 | - Test web pages and links for valid syntax and structure.
12 | - To search for copyright infringements.
13 | - To maintain mirror sites for popular web sites.
14 | - To monitor sites to see when their content changes.
15 |
16 | ## 1. Requirements and Goals of the System
17 | **Scalability:** Our service needs to be scalable, since we'll be fetching hundreds of millions of web documents.
18 |
19 | **Extensibility:** Our service should be designed in a way that allows newer functionality to be added to it. It should be able to accommodate newer document formats that need to be downloaded and processed in the future.
20 |
21 | ## 2. Design Considerations
22 | We should be asking a few questions here:
23 |
24 | #### Is it a crawler for only HTML pages? Or should we fetch and store other media types like images, videos, etc.?
25 | It's important to clarify this because it will change the design. If we're writing a general-purpose crawler, we might want to break down the parsing module into different sets of modules: one for HTML, another for videos,..., so basically each module handling a given media type.
26 |
27 | For this design, let's assume our web crawler will deal with HTML only.
28 |
29 | #### What protocols apart from HTTP are we looking at? FTP?
30 | Let's assume HTTP for now. Again, it should not be hard to extend it to other protocols.
31 |
32 | #### What is the expected number of pages we will crawl? How big will be the URL Database?
33 | Let's assume we need to crawl 1 billion websites. Since one site can contain many URLs, assume an upper bound of `15 billion web pages`.
34 |
35 |
36 | #### Robots Exclusion Protocol?
37 | Some web crawlers implement the Robots Exclusion Protocol, which allows Webmasters to declare parts of their sites off limits to crawlers. The Robots Exclusion Protocol requires a web crawler to fetch a file called `robots.txt`, which contains these declarations for that site, before downloading any real content from it.
38 |
39 |
40 | ## 3. Capacity Estimation and Constraints
41 | If we crawl 15B pages in 4 weeks, how many pages will we need to fetch per second?
42 |
43 | ```text
44 | 15B / (4 weeks * 7 days * 86400 sec) ~= 6200 web pages/sec
45 | ```
46 |
47 | **What about storage?** Page sizes vary. But since we are dealing with HTML only, let's assume an average page size of 100KB. With each page, if we're storing 500 bytes of metadata, the total storage we would need is:
48 |
49 | ```text
50 | 15B * (100KB + 500 bytes)
51 | 15 B * 100.5 KB ~= 1.5 Petabytes
52 | ```
53 |
54 | We don't want to go beyond 70% capacity of our storage system, so the total storage we will need is:
55 |
56 | ```text
57 | 1.5 petabytes / 0.7 ==> 2.14 Petabytes
58 | ```
59 |
60 | ## 4. High Level Design
61 | The basic algorithm of a web crawler is this:
62 |
63 | 1. Taking in a list of seed URLs as input, pick a URL from the unvisited URL list.
64 | 2. Find the URL host-name's IP address.
65 | 3. Establish a connection to the host to download its corresponding documents.
66 | 4. Parse the document's contents to look for new URLs.
67 | 5. Add the new URLs to the list of unvisited URLs.
68 | 6. Process the downloaded document, e.g., store it or index its contents.
69 | 7. Go back to step 1.
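
A toy, single-threaded version of this loop, assuming the third-party `requests` library and a naive regex link extractor (a real crawler would add politeness, robots.txt handling, prioritization, and persistence):

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

def store_document(url, html):
    pass  # placeholder for step 6: store/index the document

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)  # unvisited URL list; FIFO gives breadth-first order
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # step 1: pick a URL
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text  # steps 2-3: connect and download
        except requests.RequestException:
            continue
        store_document(url, html)
        for link in re.findall(r'href="([^"]+)"', html):  # step 4: extract new URLs
            frontier.append(urljoin(url, link))           # step 5: add to unvisited list
```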
70 |
71 | ### How to Crawl
72 |
73 | Breadth first or depth first?
74 | Breadth-first search (BFS) is usually used. We can also use depth-first search (DFS), especially when the crawler has already established a connection with a website. In this situation, the crawler can just DFS all the URLs within that website to save some handshaking overhead.
75 |
76 | **Path-ascending crawling:** Path-ascending crawling helps discover hidden or isolated resources. In this scheme, a crawler would ascend to every path in each URL like so:
77 | ```text
78 | given a seed URL of http://xyz.com/a/b/one.html
79 |
80 | it will attempt to crawl /a/b/, /a/ and /
81 | ```
82 |
83 | ### Difficulties in implementing an efficient web crawler
84 | #### 1. Large volume of web pages
85 | A large volume implies that the web crawler can only download a fraction of the web pages, so it's critical that the web crawler is intelligent enough to prioritize its downloads.
86 |
87 | #### 2. Rate of change on web pages
88 | Web pages change frequently. By the time the crawler is downloading the last page from a site, that page may have changed, or a new page may have been added.
89 |
90 | **Components of a bare minimum crawler:**
91 | 1. **URL frontier:** stores a list of URLs to download and prioritize which URLs should be crawled first.
92 | 2. **HTTP Fetcher:** to retrieve a web page from the host's server.
93 | 3. **Extractor:** to extract links from HTML documents.
94 | 4. **Duplicate Remover:** to make sure the same content is not extracted twice.
95 | 5. **Datastore:** to store retrieved pages, URLs and other metadata.
96 |
97 | 
98 |
99 | ## 5. Detailed Component Design
100 | Assume the crawler is running on a single server, where multiple worker threads perform all the steps needed to download and process a document in a loop.
101 |
102 | **Step 1:** Remove an absolute URL from the shared URL frontier for downloading. The URL begins with a scheme (e.g. HTTP) which identifies the network protocol that should be used to download it.
103 | We can implement these protocols in a modular way for extensibility, so that later if our crawler needs to support more protocols, it can easily be done.
104 |
105 | **Step 2:** Based on the URL's scheme, the worker calls the appropriate protocol module to download the document.
106 |
107 | **Step 3:** After downloading, the document is written into a Document Input Stream (DIS). This will enable other modules to re-read the document multiple times.
108 |
109 | **Step 4:** The worker invokes the dedupe test to see whether this document (associated with a different URL) has already been seen before. If so, the document is not processed any further and the worker thread removes the next URL from the frontier.
110 |
111 | **Step 5:** Process the downloaded document. Each doc can have a different MIME type like HTML page, Image, Video, etc. We can implement these MIME schemes in a modular way, to allow for extensibility when our crawler needs to support more types. The worker invokes the process method of each processing module associated with that MIME type.
112 |
113 | **Step 6:** The HTML processing module will extract links from the page. Each link is converted into an absolute URL and tested against a user-supplied filter to determine if it should be downloaded. If the URL passes the filter, the worker performs the URL-dedupe test, which checks if the URL has been seen before. If it's new, it is added to the URL frontier.
114 |
115 | 
116 |
117 | Let's see how each component can be distributed onto multiple machines:
118 |
119 | #### a. The URL frontier
120 | This is the data structure that contains all the URLs that are queued to be downloaded. We crawl by performing a breadth-first traversal of the Web, starting from the pages in the seed set. We can use a FIFO queue to implement this.
121 |
122 | Since we have a huge list of URLs to crawl, we can distribute our URL frontier into multiple servers. Let's assume on each server we have multiple threads performing the crawling tasks. Our hash function maps each URL to the server responsible for crawling it.
123 |
124 | Constraints for a distributed URL frontier:
125 | - The crawler should not overload a server by downloading a lot of pages.
126 | - Multiple machines should not connect to a single web server.
127 |
128 | > For each server, we can have distinct FIFO sub-queues, where each worker thread will remove URLs for crawling.
129 |
130 | Once a new URL is added, we determine which sub-queue it belongs to by using the URL's canonical hostname. Our hash function will map each **hostname** to a **thread number**. Together, these two points imply that only one worker thread will download documents from a given web server, and, by using a FIFO queue, it won't overload that web server.
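
A sketch of that mapping, hashing the canonical hostname so that all URLs from the same web server land in the same worker's sub-queue (the thread count is illustrative):

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKER_THREADS = 16  # illustrative

def sub_queue_for(url: str) -> int:
    """All URLs from one host map to the same sub-queue, i.e. the same worker thread."""
    hostname = urlparse(url).hostname or ""
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKER_THREADS

print(sub_queue_for("http://xyz.com/a/b/one.html"))
print(sub_queue_for("http://xyz.com/a/"))  # same sub-queue as above
```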
131 |
132 | ##### How big will our URL frontier be?
133 | The size would be in the 100s of millions of URLs. Therefore, we need to store the URLs on disk. We can implement our queues in such a way that they have separate buffers for enqueuing and dequeuing. Enqueuing buffer, once filled, will be dumped to the disk. Dequeuing buffers will keep a cache of URLs that need to be visited; periodically reading from the disk to fill the buffer.
134 |
135 | #### b. The Fetcher Module
136 | This will download the document corresponding to a given URL using the appropriate network protocol like HTTP. Webmasters create a `robots.txt` file to make certain parts of their websites off limits for the crawler.
137 | To avoid downloading this file on every request, our HTTP protocol module can maintain a cache mapping host-names to their robots exclusion rules.
138 |
139 | #### c. Document input stream
140 | We cache the document locally using DIS to avoid downloading the document multiple times.
141 |
142 | A DIS is an input stream that caches the doc's entire contents in memory. It also provides methods to re-read the document. Larger documents can be temporarily written to a backing file.
143 |
144 | Each worker will have a DIS, which it reuses from document to document. After extracting a URL from the frontier, the worker passes that URL to the relevant protocol module (in our case, for HTTP) which initializes the DIS from a network connection to contain the document's contents. The worker then passes the DIS to all relevant processing modules.
145 |
146 | #### d. Document Dedupe test
147 | To prevent processing a doc more than once, we perform a dedupe test on each doc to remove duplication.
148 |
149 | We can calculate a 64-bit checksum for every processed document (e.g., by truncating an MD5 or SHA hash) and store it in a database. For each new document, we compare its checksum to all previously calculated checksums to see if the doc has been seen before.
150 |
151 | ##### How big will be the checksum store
152 | We need to keep a unique set containing checksums of all previously processed documents. Considering 15 billion distinct web pages, we would need about:
153 | ```text
154 | 15B * 8 bytes => 120 GB
155 | ```
156 | We can have a small LRU cache on each server with everything backed by persistent storage.
157 |
158 | Steps:
159 | - Check if the checksum is present in the cache.
160 | - If not, check if the checksum is in the backing storage.
161 | - If found, ignore the document.
162 | - Otherwise, add the checksum to the cache and backing storage.
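
A sketch of the dedupe test, deriving a 64-bit fingerprint from a SHA-256 digest and consulting the cache before the backing store (both are simplified to in-memory sets here):

```python
import hashlib

checksum_cache = set()  # a small LRU cache per server in the real design
checksum_store = set()  # persistent backing storage in the real design

def fingerprint(document: bytes) -> int:
    """64-bit checksum: the first 8 bytes of a SHA-256 digest."""
    return int.from_bytes(hashlib.sha256(document).digest()[:8], "big")

def seen_before(document: bytes) -> bool:
    fp = fingerprint(document)
    if fp in checksum_cache or fp in checksum_store:
        return True  # duplicate document; skip further processing
    checksum_cache.add(fp)
    checksum_store.add(fp)
    return False
```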
163 |
164 | #### e. URL filters
165 | URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded. We can define filters to restrict URLs by domain, prefix or protocol type.
166 |
167 | Before adding the URL to the frontier, the worker thread consults the user-supplied URL filter.
168 |
169 | #### f. Domain name resolution
170 | Before we contact a web server, the crawler must use DNS to map the web server's hostname to an IP address. DNS name resolution will be a big bottleneck for our crawler given the number of URLs we are working with. To avoid repeated requests, we can cache DNS results by building our own local DNS server.
171 |
172 | #### g. URL dedupe test
173 | While extracting links, we will encounter multiple links to the same document. To avoid downloading and processing a doc multiple times, we use a URL dedupe test on each extracted link before adding it to the URL frontier.
174 |
175 | We can store fixed-size checksums for all the URLs seen by our crawler in a database. To reduce the number of operations performed on the DB, we can keep an in-memory cache of popular URLs on each host, shared by all threads. This is because links to some URLs are quite common, so caching will lead to a high in-memory hit rate.
176 |
177 | To keep a unique set of checksums of all previously seen URLs, we would need:
178 | ```text
179 | 15 Billion URLS * 4 bytes => 60 GB
180 | ```
181 |
182 | #### h. Checkpointing
183 | A crawl of the entire Web takes weeks to complete. To guard against failures, our crawler can write regular snapshots of its state to disk. An interrupted or aborted crawl can then easily be restarted from the latest checkpoint.
184 |
185 |
186 | ## 6. Fault Tolerance
187 | We should use consistent hashing for distribution among crawler servers. Consistent hashing will help in:
188 | - replacing dead hosts
189 | - distributing load among crawling servers
190 |
191 | All crawling servers will perform regular checkpointing and store their FIFO queues to disk. If a server goes down, we can replace it. During the replacement, consistent hashing will shift its load to the remaining servers with spare capacity.
192 |
--------------------------------------------------------------------------------
/OOP-design/designing_amazon_online_shopping_system.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Designing Amazon - Online Shopping System\n",
8 | "Let's design an online retail store.\n",
9 | "For the sake of this problem, we'll focus on Amazon's retail business where users can buy/sell products.\n",
10 | "\n",
11 | "\n",
12 | "## Requirements and Goals of the System\n",
13 | "1. Users should be able to:\n",
14 | " - Add new products to sell.\n",
15 | " - Search products by name or category.\n",
16 | " - Buy products only if they are registered members.\n",
17 | " - Remove/modify product items in their shopping cart.\n",
18 | " - Checkout and buy items in the shopping cart.\n",
19 | " - Rate and review a product.\n",
20 | " - Specify a shipping address where their order will be delivered.\n",
21 | " - Cancel and order if it hasn't been shipped.\n",
22 | " - Pay via debit or credit cards\n",
23 | " - Track their shipment to see the current state of their order.\n",
24 | "2. The system should be able to:\n",
25 | " - Send a notification whenever the shipping status of the order changes."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "## Use Case Diagram\n",
33 | "We have four main actors in the system:\n",
34 | "\n",
35 | "- **Admin:** Mainly responsible for account management, adding and modifying new product categories.\n",
36 | "\n",
37 | "- **Guest:** All guests can search the catalog, add/remove items on the shopping cart, and also become registered members.\n",
38 | "- **Member:** In addition to what guests can do, members can place orders and add new products to sell\n",
39 | "- **System:** Mainly responsible for sending notifications for orders and shipping updates.\n",
40 | "\n",
41 | "\n",
42 | "Top use cases therefore include:\n",
43 | "1. Add/Update products: whenever a product is added/modified, update the catalog.\n",
44 | "2. Search for products by their name or category.\n",
45 | "3. Add/remove product items from shopping cart.\n",
46 | "4. Checkout to buy a product item in the shopping cart.\n",
47 | "5. Make a payment to place an order.\n",
48 | "6. Add a new product category.\n",
49 | "7. Send notifications about order shipment updates to members. \n"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "### Code\n",
57 | "\n",
58 | "First we define the enums, datatypes and constants that'll be used by the rest of the classes:"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 2,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "from enum import Enum\n",
68 | "\n",
69 | "\n",
70 | "class AccountStatus(Enum):\n",
71 | " ACTIVE, BLOCKED, BANNED, COMPROMISED, ARCHIVED, UNKNOWN = 1, 2, 3, 4, 5, 6\n",
72 | "\n",
73 | "class OrderStatus(Enum):\n",
74 | " UNSHIPPED, SHIPPED, PENDING, COMPLETED, CANCELED, REFUND_APPLIED = 1, 2, 3, 4, 5, 6\n",
75 | "\n",
76 | "class ShipmentStatus(Enum):\n",
77 | " PENDING, SHIPPED, DELIVERED, ON_HOLD = 1, 2, 3, 4\n",
78 | " \n",
79 | "class PaymentStatus(Enum):\n",
80 | " UNPAID, PENDING, COMPLETED, FILLED, DECLINED = 1, 2, 3, 4, 5\n",
81 | " CANCELLED, ABANDONED, SETTLING, SETTLED, REFUNDED = 6, 7, 8, 9, 10\n",
82 | " "
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "#### Account, Customer, Admin and Guest classes \n",
90 | "These classes represent different people that interact with the system."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 3,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "from abc import ABC, abstractmethod\n",
100 | "\n",
101 | "\n",
102 | "class Account:\n",
103 | " \"\"\"Python strives to adhere to Uniform Access Principle. \n",
104 | " \n",
105 | " So there's no need for getter and setter methods. \n",
106 | " \"\"\"\n",
107 | " \n",
108 | " def __init__(self, username, password, name, email, phone, shipping_address, status:AccountStatus):\n",
109 | " # \"private\" attributes \n",
110 | " self._username = username\n",
111 | " self._password = password\n",
112 | " self._email = email\n",
113 | " self._phone = phone\n",
114 | " self._shipping_address = shipping_address\n",
115 | " self._status = status.ACTIVE\n",
116 | " self._credit_cards = []\n",
117 | " self._bank_accounts = []\n",
118 | " \n",
119 | " def add_product(self, product):\n",
120 | " pass\n",
121 | " \n",
122 | " def add_product_review(self, review):\n",
123 | " pass\n",
124 | " \n",
125 | " def reset_password(self):\n",
126 | " pass\n",
127 | "\n",
128 | "\n",
129 | "class Customer(ABC):\n",
130 | " def __init__(self, cart, order):\n",
131 | " self._cart = cart\n",
132 | " self._order = order\n",
133 | " \n",
134 | " def get_shopping_cart(self):\n",
135 | " return self.cart\n",
136 | " \n",
137 | " def add_item_to_cart(self, item):\n",
138 | " raise NotImplementedError\n",
139 | " \n",
140 | " def remove_item_from_cart(self, item):\n",
141 | " raise NotImplementedError\n",
142 | " \n",
143 | "\n",
144 | "class Guest(Customer):\n",
145 | " def register_account(self):\n",
146 | " pass\n",
147 | "\n",
148 | "\n",
149 | "class Member(Customer):\n",
150 | " def __init__(self, account:Account):\n",
151 | " self._account = account\n",
152 | " \n",
153 | " def place_order(self, order):\n",
154 | " pass\n",
155 | " "
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 4,
161 | "metadata": {},
162 | "outputs": [
163 | {
164 | "name": "stdout",
165 | "output_type": "stream",
166 | "text": [
167 | "True\n",
168 | "True\n"
169 | ]
170 | }
171 | ],
172 | "source": [
173 | "# Test class definition\n",
174 | "g = Guest(cart=\"Cart1\", order=\"Order1\")\n",
175 | "print(hasattr(g, \"remove_item_from_cart\"))\n",
176 | "print(isinstance(g, Customer))"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "#### Product Category, Product and Product Review\n",
184 | "The classes below are related to a product:"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 5,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "class Product:\n",
194 | " def __init__(self, product_id, name, description, price, category, available_item_count):\n",
195 | " self._product_id = product_id\n",
196 | " self._name = name\n",
197 | " self._price = price\n",
198 | " self._category = category\n",
199 | " self._available_item_count = 0\n",
200 | " \n",
201 | " def update_price(self, new_price):\n",
202 | " self._price = new_price\n",
203 | " \n",
204 | " \n",
205 | "class ProductCategory:\n",
206 | " def __init__(self, name, description):\n",
207 | " self._name = name\n",
208 | " self._description = description\n",
209 | " \n",
210 | "\n",
211 | "class ProductReview:\n",
212 | " def __init__(self, rating, review, reviewer):\n",
213 | " self._rating = rating\n",
214 | " self._review = review\n",
215 | " self._reviewer = reviewer\n"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "#### ShoppingCart, Item, Order and OrderLog\n",
223 | "Users will add items to the shopping cart and place an order to buy all the items in the cart."
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": 6,
229 | "metadata": {},
230 | "outputs": [],
231 | "source": [
232 | "class Item:\n",
233 | " def __init__(self, item_id, quantity, price):\n",
234 | " self._item_id = item_id\n",
235 | " self._quantity = quantity\n",
236 | " self._price = price\n",
237 | " \n",
238 | " def update_quantity(self, quantity):\n",
239 | " self._quantity = quantity\n",
240 | " \n",
241 | " def __repr__(self):\n",
242 | " return f\"ItemID:<{self._item_id}>\" \n",
243 | "\n",
244 | "\n",
245 | "class ShoppingCart:\n",
246 | " \"\"\"We can still access items by calling items instead of having getter method\n",
247 | " \"\"\"\n",
248 | " def __init__(self):\n",
249 | " self._items = []\n",
250 | " \n",
251 | " def add_item(self, item):\n",
252 | " self._items.append(item)\n",
253 | " \n",
254 | " def remove_item(self, item):\n",
255 | " self._items.remove(item)\n",
256 | " \n",
257 | " def update_item_quantity(self, item, quantity):\n",
258 | " pass\n"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": 7,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "item = Item(item_id=1, quantity=2, price=300)\n",
268 | "cart = ShoppingCart()\n",
269 | "cart.add_item(item)"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 8,
275 | "metadata": {},
276 | "outputs": [
277 | {
278 | "data": {
279 | "text/plain": [
280 | "[ItemID:<1>]"
281 | ]
282 | },
283 | "execution_count": 8,
284 | "metadata": {},
285 | "output_type": "execute_result"
286 | }
287 | ],
288 | "source": [
289 | "# shopping cart now has items\n",
290 | "cart._items"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 9,
296 | "metadata": {},
297 | "outputs": [],
298 | "source": [
299 | "import datetime\n",
300 | "\n",
301 | "\n",
302 | "class OrderLog:\n",
303 | " def __init__(self, order_number, status=OrderStatus.PENDING):\n",
304 | " self._order_number = order_number\n",
305 | " self._creation_date = datetime.date.today()\n",
306 | " self._status = status\n",
307 | " \n",
308 | "\n",
309 | "class Order:\n",
310 | " def __init__(self, order_number, status=OrderStatus.PENDING):\n",
311 | " self._order_number = order_number\n",
312 | " self._status = status\n",
313 | " self._order_date = datetime.date.today()\n",
314 | " self._order_log = []\n",
315 | " \n",
316 | " def send_for_shipment(self):\n",
317 | " pass\n",
318 | " \n",
319 | " def make_payment(self, payment):\n",
320 | " pass\n",
321 | " \n",
322 | " def add_order_log(self, order_log):\n",
323 | " pass"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "#### Shipment and Notification\n",
331 | "After successfully placing an order and processing the payment, a shipment record will be created.\n",
332 | "Let's define the Shipment and Notification classes:"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": 10,
338 | "metadata": {},
339 | "outputs": [],
340 | "source": [
341 | "import datetime\n",
342 | "\n",
343 | "\n",
344 | "class ShipmentLog:\n",
345 | " def __init__(self, shipment_id, status=ShipmentStatus.PENDING):\n",
346 | " self._shipment_id = shipment_id\n",
347 | " self.shipment_status = status\n",
348 | "\n",
349 | "\n",
350 | "class Shipment:\n",
351 | " def __init__(self, shipment_id, shipment_method, eta=None, shipment_logs=[]):\n",
352 | " self._shipment_id = shipment_id\n",
353 | " self._shipment_date = datetime.date.today()\n",
354 | " self._eta = eta\n",
355 | " self._shipment_logs = shipment_logs\n",
356 | " \n",
357 | "\n",
358 | "class Notification(ABC):\n",
359 | " def __init__(self, notification_id, content):\n",
360 | " self._notification_id = notification_id\n",
361 | " self._created_on = datetime.datetime.now()\n",
362 | " self._content = content\n",
363 | " "
364 | ]
365 | }
366 | ],
367 | "metadata": {
368 | "kernelspec": {
369 | "display_name": "Python 3",
370 | "language": "python",
371 | "name": "python3"
372 | },
373 | "language_info": {
374 | "codemirror_mode": {
375 | "name": "ipython",
376 | "version": 3
377 | },
378 | "file_extension": ".py",
379 | "mimetype": "text/x-python",
380 | "name": "python",
381 | "nbconvert_exporter": "python",
382 | "pygments_lexer": "ipython3",
383 | "version": "3.7.4"
384 | }
385 | },
386 | "nbformat": 4,
387 | "nbformat_minor": 2
388 | }
389 |
--------------------------------------------------------------------------------
/designing_api_rate_limiter.md:
--------------------------------------------------------------------------------
1 | # Designing an API Rate Limiter
2 |
3 | An API rate limiter will throttle users based on the number of requests they are sending.
4 |
5 | ## Why Rate Limiting?
6 | Rate limiting is all about traffic/load management. Some clients can misbehave and trigger a retry storm; others might request the same information over and over again, adding unnecessary traffic and load to the system. This can degrade the performance of the service and the system's ability to reliably handle incoming requests.
7 |
8 | Rate limiting helps to protect services against abusive behaviors targeting the application such as retry storms, Denial-of-service (DOS) attacks, brute-force password attempts, brute-force credit card transactions, etc.
9 |
10 | Rate limiting saves the company money by eliminating service and infrastructure costs that would have otherwise been used to handle spamming.
11 |
12 | Here are some scenarios to show the importance of Rate limiting our API/Services:
13 |
14 | - **Prevents service degradation:** It reduces traffic spikes so that the service stays reliable for all users.
15 |
16 | - **Misbehaving clients:** Sometimes, clients can overwhelm servers by sending a large number of requests, either intentionally or unintentionally.
17 |
18 | - **Security:** Limiting the number of times a user should authenticate with a wrong password.
19 |
20 | - **Preventing abusive and bad design practices:** Without API limits, developers of client apps might request the same info over and over again.
21 |
22 | - **Revenue:** Services can limit operations based on the tier of their customer's service and thus create a revenue model off the rate limiting. To go beyond the set limit, the user has to buy higher limits.
23 |
24 |
25 | ## 1. Requirements and System Goals
26 |
27 | #### Functional requirements
28 | 1. Limit the number of requests an entity can send to an API within a time window.
29 | 2. The user should get an error whenever they cross the defined threshold, whether the limit is enforced within a single server or across a combination of servers.
30 |
31 | #### Non-Functional requirements
32 | 1. The system should be highly available, always protecting our API service from external attacks.
33 | 2. The rate limiter should NOT introduce substantial latencies affecting the user experience.
34 |
35 | ## 2. Throttling Types
36 |
37 | - ***Hard Throttling*** – Number of API requests cannot exceed the throttle limit.
38 |
39 | - ***Soft Throttling*** – Allow the API request limit to be exceeded by some percentage. E.g., if the rate limit is 100 messages/minute with a 10% exceed-limit, our rate limiter will allow up to 110 messages per minute.
40 |
41 | - ***Dynamic Throttling (Priority throttling)*** – The number of requests can exceed the limit if the system has some free resources available. The system can progressively throttle requests based on some predefined priority.
42 |
43 | ## 3. Algorithms used for Rate Limiting
44 |
45 | #### Fixed Window Algorithm
46 | Here, the time window is considered from the start of the time-unit to the end of the time-unit.
47 | For instance, a period will be considered 0-60 sec for a minute regardless of the time frame at which the API request has been made.
48 |
49 | The diagram below shows that, with a rate limit of 2 messages per second, we will only throttle message 'm5'.
50 |
51 | 
52 |
53 | #### Rolling Window Algorithm
54 | The time window is considered from the fraction of time at which the request is made plus the time window length.
55 |
56 | With a rate limit of 2 messages per second, consider two messages sent at the 300th millisecond (m1) and the 400th millisecond (m2): we count them within the window starting at the 300th millisecond of that second and ending at the 300th millisecond of the next second (making up one second).
57 |
58 | As a result, we will therefore throttle m3 and m4, as shown below.
59 |
60 |
61 |
62 |
63 | ## 4. High Level Design
64 |
65 | Once a new request arrives, the Web server first asks the Rate Limiter to decide if it will be served or throttled. If the request is not throttled, then it's passed to the API servers.
66 |
67 | 
68 |
69 |
70 | ## 5. Basic System Design and Algorithm
71 |
72 | Assume for each user, our rate limiter allows 3 requests per minute.
73 |
74 | For each user, store:
75 | - a request count (how many requests the user has made)
76 | - a timestamp when we started counting
77 |
78 | We can keep this in a hashtable, where:
79 |
80 | ```python
81 | # Key (userID): Value {count, start_time}
82 | hashtable = {
83 | 'userId0': {
84 | 'count': 3, 'start_time': 1574866492
85 | },
86 | 'userId1': {
87 | 'count': 1, 'start_time': 1574873716
88 | },
89 | ...
90 | }
91 | ```
92 |
93 | When a new request comes in, the rate limiter will perform the following steps:
94 |
95 | 1. If the `userID` does not exist in the hash-table,
96 | - insert it,
97 | - set `count` to 1 and set `start_time` to current epoch time
98 | 2. Otherwise, find the existing record of the userID, and
99 | - if `current_time - start_time >= 1 minute`, reset `start_time` to be current time,
100 | - set `count` to 1 and allow the request
101 | 3. If `current_time - start_time < 1 minute` and
102 | - If `count < 3`, increment the count and allow the request.
103 | - If `count == 3`, reject the request.
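
A sketch of these steps as a single-process, in-memory rate limiter (the window is one minute, matching the steps above):

```python
import time

RATE_LIMIT = 3       # allowed requests per window
WINDOW_SECONDS = 60  # fixed one-minute window

hashtable = {}       # userID -> {'count': int, 'start_time': epoch seconds}

def allow_request(user_id: str) -> bool:
    now = int(time.time())
    record = hashtable.get(user_id)
    if record is None or now - record['start_time'] >= WINDOW_SECONDS:
        # New user, or the previous window has elapsed: start a fresh window.
        hashtable[user_id] = {'count': 1, 'start_time': now}
        return True
    if record['count'] < RATE_LIMIT:
        record['count'] += 1
        return True
    return False  # limit reached within the current window
```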
104 |
105 | #### Problems with this Fixed Window Algorithm
106 | 1. We are resetting the `start_time` at the end of every minute, which means we can potentially allow twice the number of requests per minute.
107 |
108 | Imagine if a user sends 3 requests at the last second of a minute, they can immediately send 3 more requests at the very first second of the next minute, resulting in 6 requests in a span of two seconds.
109 |
110 | To fix this loophole, we'll use the sliding-window algorithm.
111 |
112 | 
113 |
114 | 2. Atomicity: The read-then-write process can create a race condition. Imagine a given user's current count = 2. If two separate processes served two new requests and both read the count before either updated it, each process would erroneously think the user had one more request to go before hitting the rate limit.
115 |
116 | 
117 |
118 |
119 | #### Solutions
120 | We can use a K-V store like Redis to store our key-value and solve the atomicity problem using [Redis lock](https://redis.io/topics/distlock) for the duration of the read-update operation.
121 | This however, would slow down concurrent requests from the same user and introduce another layer of complexity.
122 |
123 |
124 |
125 | #### How much memory to store all the user data?
126 | Assume the userID takes 8 bytes, epoch time needs 4 bytes, 2 bytes for count:
127 | ```
128 | 8 + 2 + 4 = 14 bytes
129 | ```
130 | Let's assume our hash-table has an overhead of 30 bytes for each record. If we need to track 10 million users at any time:
131 | ```
132 | Total memory = (14 + 30) bytes * 10 million users = 440,000,000 bytes => 440MB memory.
133 | ```
134 |
135 | If we assume that we need a 4-byte number to lock each user's record to solve the atomicity problems
136 | ```
137 | Total memory = (4 bytes for lock + 14 + 30) bytes * 10 million users = 480,000,000 bytes => 480MB memory.
138 | ```
139 |
140 | This can fit in a single server. However, we wouldn't want to route all traffic through a single machine because of availability issues: if that one server goes down for any reason, our only instance of the rate limiter service goes down with it.
141 |
142 | For instance, when rate limiting 1M users at 10 req/sec, this would be about 10 million queries per second for our rate limiter. This would be too much for a single server to handle. Practically, we can use Redis or Memcached in a distributed setup. We'll store all the data in remote Redis servers, and all the rate limiter servers will read/update these servers before serving or throttling any request.
143 |
144 |
145 |
146 | ## 6. Sliding Window Algorithm
147 |
148 | We can maintain a sliding window if we can keep track of each request per user.
149 |
150 | We will store the timestamp of each request in a [Redis Sorted Set](https://redis.io/docs/data-types/sorted-sets/).
151 |
152 | ```python
153 | hash_table = {
154 | # userID: { Sorted Set }
155 | 'userID-0': {1574860105, 1574881109, 1574890217 },
156 | 'userID-1': {1574866488, 1574866493, 1574866499}
157 | ...
158 | }
159 | ```
160 | Assume our rate limiter allows 3 requests/sec per user.
161 |
162 | When a new request comes in, the rate limiter will perform the following steps:
163 |
164 | 1. Remove all timestamps from Sorted Set older than `1 second`.
165 | 2. Count the elements in the set. Reject the request if the count is greater than or equal to our throttling limit (3 in our case).
166 | 3. Insert current time in the sorted set and accept the request.
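
A sketch of these steps using the `redis-py` client, with a pipeline to group the cleanup and count (key names are illustrative, and the read-then-write race discussed earlier still applies):

```python
import time
import uuid

import redis  # redis-py client

r = redis.Redis()
LIMIT = 3   # requests allowed per window
WINDOW = 1  # window length in seconds

def allow_request(user_id: str) -> bool:
    key = f"ratelimit:{user_id}"                 # illustrative key name
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - WINDOW)  # step 1: drop timestamps older than the window
    pipe.zcard(key)                              # step 2: count what's left
    _, count = pipe.execute()
    if count >= LIMIT:
        return False                             # reject: limit already reached
    r.zadd(key, {str(uuid.uuid4()): now})        # step 3: record this request and accept it
    r.expire(key, WINDOW * 2)                    # let idle keys age out
    return True
```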
167 |
168 | #### Memory for storing user data?
169 | Assume UserId takes 8 bytes, each epoch time takes 4 bytes.
170 |
171 | Now suppose we need a rate limiting of 500 requests per hour. Assume 20 bytes of overhead for hash-table and 20 bytes for sorted set.
172 |
173 | ```
174 | 8 bytes UserID + (4 + 20 bytes sorted set overhead) * 500 + 20 bytes hash-table overhead ~= 12KB
175 | ```
176 |
177 | If we need to track 10 million users at any time:
178 | ```
179 | Total memory = 12KB * 10 million users ~= 120 GB
180 | ```
181 |
182 | For 10M users, sliding window takes a lot of memory compared to fixed window; this won't scale well. We can combine the above two algorithms to optimize our memory usage.
183 |
184 |
185 | ## 7. Sliding Window + Counters
186 | What if we keep track of request counts for each user using multiple fixed time windows?
187 |
188 | For example, if the rate limit is hourly, keep a count for **each minute** and calculate the sum of all counters in the past hour when we receive a new request.
189 |
190 | This reduces our memory footprint. Consider a rate-limit at 500 requests/hour, with an additional limit of 10 requests/minute. *This means that when the sum of the counters with timestamps in the past hour `> 500`, the user has exceeded the rate limit.*
191 |
192 | In addition, the user can't send more than 10 requests per minute. This would be a reasonable and practical consideration, as none of the real users would send frequent requests. Even if they do, they'll see success with retries since their limits get reset every minute.
193 |
194 |
195 | We can store our counters in a [Redis Hash](https://redis.io/docs/data-types/hashes/) because it's very efficient for storing fewer than 100 keys.
196 | With each request, we increment a counter in the hash and also set the hash to [expire](https://redis.io/commands/ttl) an hour later. We will normalize each timestamp to the start of its minute.
197 |
198 |
199 | ```
200 | Rate limiting allowing 3 requests per minute for User1
201 |
202 | [Allow request] 7:00:00 AM ---- "User1": {1574860100: 1}
203 | [Allow request] 7:01:05 AM ---- "User1": { 1574860100: 1, 1574860160: 1}
204 | [Allow request] 7:01:20 AM ---- "User1": { 1574860100: 1, 1574860160: 2}
205 | [Allow request] 7:01:20 AM ---- "User1": { 1574860100: 1, 1574860160: 3}
206 | [Reject request] 7:01:45 AM ---- "User1": { 1574860100: 1, 1574860160: 3}
207 | [Allow request] 7:02:20 AM ---- "User1": { 1574860160: 3, 1574860220: 1}
208 | ```
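
A sketch of the per-minute counters as a plain dictionary; in the real design, a Redis Hash with an hour-long TTL plays this role:

```python
import time

LIMIT_PER_HOUR = 500
counters = {}  # userID -> {minute_epoch: request count}

def allow_request(user_id: str) -> bool:
    now = int(time.time())
    minute = now - (now % 60)                 # normalize to the start of the minute
    user = counters.setdefault(user_id, {})
    for stale in [t for t in user if t <= now - 3600]:
        del user[stale]                       # Redis would handle this via EXPIRE
    if sum(user.values()) >= LIMIT_PER_HOUR:  # sum of counters in the past hour
        return False
    user[minute] = user.get(minute, 0) + 1    # a per-minute cap (e.g. 10) could be checked here too
    return True
```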
209 |
210 | #### How much memory to store all user data?
211 |
212 | We'll need:
213 | ```
214 |
215 | UserID = 8 bytes
216 | Counter = 2 bytes
217 | Epoch time = 4 bytes
218 |
219 | Since we keep a count per minute, at max, we need 60 entries per user
220 | 8 + (4 + 2 + 20 Redis-hash overhead) * 60 entries + 20 Hash-table overhead = 1.6KB
221 | ```
222 |
223 | If we need to track 10 million users at any time:
224 | ```
225 |
226 | Total memory = 1.6KB * 10 million => 16 GB
227 | ( ~87% less memory than the simple sliding window algorithm )
228 |
229 | ```
230 |
231 |
232 | ## 8. Data Sharding and Caching
233 | We can shard by `UserID` to distribute user data across different partitions.
234 |
235 | For fault tolerance and replication we should use Consistent Hashing. Consistent hashing is a very useful strategy for distributed caching system and DHTs. It allows us to distribute data across a cluster in such a way that will minimize reorganization when nodes are added or removed (resizing).
236 |
237 |
238 |
239 | ## Caching
240 | We can get huge benefits from caching recent active users.
241 |
242 | Our app servers can quickly check if the cache has the desired record before hitting backend servers. Our rate limiter can benefit from the **Write-back cache** by updating all counters and timestamps in cache only. The write to the permanent storage can be done at fixed intervals. This way we can ensure minimum latency added to the user's request by the rate limiter. The reads can always hit the cache first; which will be extremely useful once the user has hit the rate limit and the rate limiter will only be reading data without any updates.
243 |
244 | Least Recently Used (LRU) can be a reasonable eviction policy for our cache.
245 |
246 |
248 |
249 |
250 |
251 | ## 9. Throttling Response
252 | We can return a 429 status code: Too Many Requests whenever the user exceeds the rate limit.
253 |
254 | | Header Name | Description |
255 | | :------------------- | :---------------------------------------------------------------------------- |
256 | | RateLimit-Exceeded | The specific limit that has been exceeded. |
257 | | Retry-After | The number of seconds that the client should wait before retrying a request. |
258 |
259 |
260 | ```
261 | HTTP/1.1 429 Too Many Requests
262 | Transfer-Encoding: chunked
263 | Retry-After: 5
264 | request-id: ff4751b7-d289-01a0-a6dd-9b3541c077fe
265 | RateLimit-Exceeded: 60
266 | Cache-Control: private
267 | Content-Type: application/json;charset=utf-8
268 | ```
269 | ```json
270 | {
271 | "error": {
272 | "code": "ClientThrottled",
273 | "message": "Client application is over its resource limit."
274 | }
275 | }
276 | ```
--------------------------------------------------------------------------------
/designing_twitter.md:
--------------------------------------------------------------------------------
1 | # Designing Twitter
2 |
3 | Twitter is an online social networking service where users post and read short 140-character messages called "tweets". Registered users can post and read tweets, but unregistered users can only read them.
4 |
5 |
6 | ## 1. Requirements and System Goals
7 |
8 | ### Functional Requirements
9 | - Users should be able to post new tweets.
10 | - A user should be able to follow other users.
11 | - Users should be able to mark tweets as favorites.
12 | - Tweets can contain photos and videos.
13 | - A user should have a timeline consisting of top tweets from all the people the user follows.
14 |
15 | ### Non-functional Requirements
16 | - Our service needs to be highly available.
17 | - Acceptable latency of the system is 200ms for timeline generation.
18 | - Consistency can take a hit (in the interest of availability); if user doesn't see a tweet for a while, it should be fine.
19 |
20 | ### Extended Requirements
21 | - Searching tweets.
22 | - Replying to a tweet.
23 | - Trending topics - current hot topics.
24 | - Tagging other users.
25 | - Tweet notification.
26 | - Suggestions on who to follow.
27 |
28 | ## Capacity Estimation and Constraints
29 |
30 | Let's assume we have 1 billion users, with 200 million daily active users (DAU).
31 | Also assume we have 100 million new tweets every day, and on average each user follows 200 people.
32 |
33 | **How many favorites per day?** If on average, each user favorites 5 tweets per day, we have:
34 | ```
35 | 200M users * 5 => 1 billion favorites.
36 | ```
37 |
38 | **How many total tweet-views?** Let's assume on average a user visits their timeline twice a day and visits 5 other people's pages. On each page if a user sees 20 tweets, then the no. of views our system will generate is:
39 | ```
40 | 200M DAU * ((2 + 5) * 20 tweets) => 28B/day
41 | ```
42 |
43 | #### Storage Estimates
44 | Let's say each tweet has 140 characters and we need two bytes to store a character without compression. Assume we need 30 bytes to store metadata with each tweet (like ID, timestamps, etc.). Total storage we would need is:
45 | ```
46 | 100M new daily tweets * ((140 * 2) + 30) bytes => ~30 GB/day
47 | ```
48 |
49 | Not all tweets will have media, let's assume that on average every fifth tweet has a photo and every tenth a video. Let's also assume on average, a photo = 0.5MB and a video = 5MB. This will lead us to have:
50 | ```
51 | (100M/5 photos * 0.5MB) + (100M/10 videos * 5MB) ~= 60 TB/day
52 | ```
53 |
54 | #### Bandwidth Estimates
55 | Since total ingress is 60TB per day, it translates to:
56 | ```
57 | 60TB / (24 * 60 * 60) ~= 690 MB/sec
58 | ```
59 | Remember we have 28 billion tweet views per day. We must show the photo of every tweet (if it has one), but let's assume that users watch every 3rd video they see in their timeline. So, total egress will be:
60 |
61 | ```
62 | (28Billion * 280 bytes) / 86400 of text ==> 93MB/s
63 | + (28Billion/5 * 0.5MB) / 86400 of photos ==> ~32GB/s
64 | + (28Billion/10/3 * 5MB) / 86400 of videos ==> ~54GB/s
65 |
66 | Total ~= 85GB/sec
67 | ```
68 |
69 |
70 | ## 3. System APIs
71 |
72 | We can have a REST API to expose the functionality of our service.
73 |
74 | ```python
75 | tweet(
76 | api_dev_key, # (string): The API developer key. Used to throttle users based on their allocated quota.
77 | tweet_data, # (string): The text of the tweet, typically up to 140 characters.
78 | tweet_location, # (string): Optional location (lat, long) this Tweet refers to.
79 | user_location, # (string): optional user's location.
80 | media_ids, # (list): Optional list of media_ids to associate with the tweet. (All media - photos, videos - need to be uploaded separately.)
81 | )
82 | ```
83 | Returns: (string)
84 | A successful post will return the URL to access that tweet. Otherwise, return an appropriate HTTP error.
85 |
86 |
87 | ## 4. High Level System Design
88 | We need a system that can efficiently store all the new tweets,
89 | i.e
90 | - store `100M/86400sec => 1150 tweets per second`
91 | - and read `28billion/86400s => 325,000 tweets per second`.
92 |
93 | It's clear from the requirements that the system will be **read-heavy**.
94 |
95 | At a high level:
96 | - we need multiple application servers to serve all these requests with load balancers in front of them for traffic distribution.
97 | - On the backend, we need an efficient datastore that will store all the new tweets and can support huge read numbers.
98 | - We also need file storage to store photos and videos.
99 |
100 | This traffic will be distributed unevenly throughout the day, though; at peak time we expect at least a few thousand write requests and around 1M read requests per second.
101 | **We should keep this in mind while designing the architecture of our system.**
102 |
103 |
104 | 
105 |
106 |
107 | ## 5. Database Schema
108 | We need to store data about users, their tweets, their favorite tweets, and people they follow.
109 |
110 | 
111 |
112 | For choosing between SQL or NoSQL, check out Designing Instagram from the README.
113 |
114 | ## 6. Data Sharding
115 |
116 | We have a huge number of tweets every day. We need to distribute our data onto multiple machines such that reads/writes are efficient.
117 |
118 |
119 | #### Sharding based on UserID
120 | We can try storing a user's data on one server. While storing:
121 | - Pass a UserID to our hash function that will map the user to a DB server where we'll store all of their data. (tweets, favorites, follows, etc.)
122 | - While querying for their data, we can ask the hash function where to find it and read it from there.
123 |
124 | Issues:
125 | - What if a user becomes hot? There will be lots of queries on the server holding that user. This high load will affect the service's performance.
126 | - Over time, some users will have more data compared to others. Maintaining a uniform distribution of growing data is quite difficult.
127 |
128 | #### Sharding based on TweetID
129 | - Hash function maps each TweetID to a random server where we store that tweet.
130 | - Searching a tweet will query all servers, and each server will return a set of tweets.
131 | - A centralized server will aggregate the results to return them to the user.
132 |
133 | **To generate a user's timeline:**
134 | 1. App server will find all the people the user follows.
135 | 2. App server will send a query to all DB servers to find tweets from these people.
136 | 3. Each DB server will find tweets for each user, sort them by recency and return top tweets.
137 | 4. App server will merge all results and sort them again to return the top results to the user.
138 |
139 | This solves the problem of hot users.
140 |
141 | Issues with this approach:
142 | - We have to query all DB partitions to find tweets for a user, leading to higher latencies.
143 |
144 | > We can improve the performance by caching hot tweets in front of the DB servers.
145 |
146 | #### Sharding based on Tweet creation time
147 |
148 | Storing tweets based on creation timestamp will help us to fetch all top tweets quickly and we only have to query a very small set of database servers.
149 |
150 | Issues:
151 | - Traffic load won't be distributed. E.g. when writing, all new tweets will go to one DB server while the remaining DB servers sit idle. When reading, the database server holding the latest data will have a high load compared to servers holding old data.
152 |
153 | #### Sharding by TweetID + Tweet creation time
154 | Each tweetID should be universally unique and contain a timestamp.
155 | We can use epoch time for this.
156 |
157 | First part of TweetID will be a timestamp, second part an auto-incrementing number. We can then figure out the shard number from this TweetId and store it there.
158 |
159 | What could be the size of our TweetID?
160 | If our epoch time started today, the number of bits we need to store the number of seconds for the next 50 years:
161 |
162 | ```
163 | Number of seconds for the next 50 years:
164 | 86400 sec/day * 365 days * 50 years ==> 1.6 billion seconds.
165 |
166 | ```
167 |
168 | 
169 |
170 | We would need 31 bits to store the epoch time.
171 | ```
172 | 2 ^ 31 ==> ~ 2 billion seconds
173 | ```
174 |
175 | Since on average we expect 1150 new tweets every second i.e
176 | ```
177 | (100M daily / 86400 seconds) ==> 1150 tweets/sec
178 | ```
179 | we can allocate 17 bits to store auto incremented sequence. This makes our tweetID 48 bits long.
180 |
181 | Every second, we can store 2^17(130k) new tweets.
182 | We can reset our auto-incrementing sequence every second. For fault tolerance and better performance, we can have two DB servers to generate auto-incrementing keys: one generating odd-numbered keys and the other even-numbered keys.
183 |
184 | If we assume our current epoch seconds begins now, TweetIDs will look like this:
185 |
186 |
187 | ```python
188 | epoch = 1571691220
189 | print('epoch - autoincrement')
190 | for i in range(1,5):
191 | print(f'{epoch}-{i:06}')
192 | ```
193 |
194 | epoch - autoincrement
195 | 1571691220-000001
196 | 1571691220-000002
197 | 1571691220-000003
198 | 1571691220-000004
199 |
200 |
201 | If we make our TweetID 64bits (8 bytes) long, we can easily store tweets for the next 100 years.
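
A sketch of packing the two parts into a single integer ID, following the bit widths estimated above (the shard count is illustrative):

```python
SEQUENCE_BITS = 17  # room for ~130K tweets per second
NUM_SHARDS = 256    # illustrative

def make_tweet_id(epoch_seconds: int, sequence: int) -> int:
    """Pack creation time and a per-second sequence number into one integer ID."""
    return (epoch_seconds << SEQUENCE_BITS) | sequence

def creation_time(tweet_id: int) -> int:
    return tweet_id >> SEQUENCE_BITS  # no secondary index on creation time needed

def shard_for(tweet_id: int) -> int:
    return tweet_id % NUM_SHARDS      # same idea as hashing on TweetID

tid = make_tweet_id(1571691220, 1)
print(tid, creation_time(tid), shard_for(tid))
```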
202 |
203 | In the approach above, we still have to query all servers for timeline generation, but the reads/writes will be substantially quicker.
204 |
205 | - Since we don't have any secondary index (on creation time) this will reduce write latency.
206 | - While reading, we don't need to filter on creation time as our primary key has the epoch time included in it.
207 |
208 | ## 7. Caching
209 |
210 | We can have a cache for DB servers to cache hot tweets and users.
211 | Memcache can be used here to store the whole tweet objects.
212 |
213 | > Application servers before hitting the DB, can quickly check if the cache has desired tweets.
214 |
215 | #### Cache replacement policy
216 | When cache is full, we want to replace a tweet with a newer/hotter tweet. Least Recently Used(LRU) policy can be used to discard the least recently viewed tweet first.
217 |
218 | #### Intelligent caching?
219 | If we go with the 80-20 rule, 20% of tweets generate 80% of read traffic, meaning certain tweets are so popular that a majority of people read them. Therefore, we can try to cache 20% of the daily read volume from each shard.
220 |
221 | #### Caching latest data?
222 | Let's say if 80% of our users see tweets from the last 3 days only, we can cache all tweets from past 3 days.
223 |
224 | We are getting 100M tweets or 30GB of new data every day (without photos or videos). If we want to store all the tweets from the last 3 days, we will need about 100GB of memory. This data can easily fit into one cache server, but we should replicate it onto multiple servers to distribute the read traffic and reduce the load on cache servers.
225 |
226 | Whenever we are generating a user's timeline, we can ask the cache servers if they have all the recent tweets for that user. If yes, we can simply return all the data from the cache. If we don't have enough, we have to query the backend servers to fetch the data. We can also cache photos/videos from the last 3 days.
227 |
228 | ### Cache structure
229 | Our cache would be a hash table.
230 | - The key would be `OwnerID` and the value would be a doubly linked list containing all tweets from that user in the past 3 days.
231 | - Since we retrieve the most recent tweets first, we can insert new tweets at the head of the linked list.
232 | - Older tweets will be at the tail of the linked list.
233 | - We can remove tweets from the tail to make space for newer tweets.
234 |
235 | 
236 |
237 | ## 8. Load Balancing
238 | We can add a load balancing layer at 3 places:
239 | 1. Between clients and application servers
240 | 2. Between app servers and DB replication servers
241 | 3. Between Aggregation servers and Cache servers
242 |
243 | We can adopt a simple round-robin approach to distribute incoming requests equally among servers.
244 | Benefits of this LB approach:
245 | - Simple to implement with no overhead.
246 | - If a server is dead, the LB will take it out of the rotation and stop sending traffic to it.
247 |
248 | > The problem with round robin is that it doesn't know if a server is overloaded with requests or if it's slow; it won't stop sending requests to that server. To fix this, the LB can periodically query the backend server about its load and adjust traffic based on that.
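
For illustration only, a toy round-robin selector with a pluggable health check (the server names and the `is_healthy` hook are hypothetical) might look like this:

```python
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = servers
        self._cycle = itertools.cycle(servers)

    def next_server(self, is_healthy=lambda s: True):
        """Return the next healthy server, skipping ones that fail the health check."""
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print(balancer.next_server())   # app-1
print(balancer.next_server())   # app-2
```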
249 |
250 | ## 9. Extended Requirements
251 |
252 | **Trending Topics:** We can cache the most frequently occurring hashtags or search queries in the last N seconds and update them every M seconds. We can rank trending topics based on their frequency, number of retweets, etc.
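
One possible sketch of this, assuming an in-memory sliding window (the window size and helper names are made up for illustration):

```python
import time
from collections import Counter, deque

WINDOW_SECONDS = 300                    # assumed "last N seconds" window
mentions = deque()                      # (timestamp, hashtag), oldest on the left

def record_hashtag(tag: str, now: float = None) -> None:
    ts = time.time() if now is None else now
    mentions.append((ts, tag))

def trending(top_k: int = 10, now: float = None) -> list:
    now = time.time() if now is None else now
    while mentions and mentions[0][0] < now - WINDOW_SECONDS:
        mentions.popleft()              # drop mentions outside the window
    return Counter(tag for _, tag in mentions).most_common(top_k)
```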
253 |
254 | **Suggestions on who to follow:** We can suggest friends of the people someone follows. We can go two or three degrees of separation down to find famous people for the suggestions, giving preference to people with more followers.
255 |
256 | We can use machine learning to offer suggestions based on recurring patterns, like common followers (if the person is following this user), common location or interests, etc.
257 |
258 | **How do we serve feeds:** We can pre-generate the feed to improve efficiency; for details see [Ranking and timeline generation under Designing Instagram](designing_instagram.md).
259 |
260 |
261 | ## 10. Monitoring
262 | We should constantly collect data to get an insight into how our system is doing. We can collect:
263 | - New tweets per day/second.
264 | - What time the daily peak is.
265 | - Timeline delivery stats: how many tweets per second our service is delivering.
266 | - Average latency seen by the user when refreshing their timeline.
267 |
268 | By monitoring these, we will realize if we need more replication, load balancing or caching.
269 |
270 | ## 11. Availability
271 | Since our system is read-heavy, we can have multiple secondary database servers for each DB partition. Secondary servers will be used for read traffic only. All writes will first go to the primary server and then be replicated to the secondary servers. This also gives us fault tolerance: whenever the primary server goes down, we can fail over to a secondary server.
272 |
--------------------------------------------------------------------------------
/designing_ticketmaster.md:
--------------------------------------------------------------------------------
1 | # Design E-Ticketing System
2 |
3 | Let's design an online E-ticketing system that sells movie tickets.
4 |
5 | A movie ticket booking system provides its customers the ability to purchase theatre seats online. It allows customers to browse movies currently being played and to book available seats, anywhere, anytime.
6 |
7 | ## 1. Requirements and System goals
8 |
9 | ### Functional Requirements
10 | - The service should list different cities where its affiliated cinemas are located.
11 | - When a user selects a city, the service should display the movies released in that particular city.
12 | - When a user selects a movie, the service should display the cinemas running the movie plus the available show times.
13 | - Users should be able to choose a show at a cinema and book tickets.
14 | - The service should be able to show the user the seating arrangement of the cinema hall. The user should be able to select multiple seats according to their preference.
15 | - The user should be able to distinguish available seats from booked ones.
16 | - Users should be able to put a hold on seats (for 5 minutes) while they make their payment.
17 | - Users should be able to wait if there is a chance that seats might become available (when holds by other users expire).
18 | - Waiting customers should be serviced in a fair, first-come, first-served manner.
19 |
20 | ### Non-Functional Requirements
21 | - The service should be highly concurrent. There will be multiple booking requests for the same seat at any particular point in time.
22 | - The system has financial transactions, meaning it should be secure and the DB should be ACID compliant.
23 | - Assume traffic will spike on popular/much-awaited movie releases and the seats would fill up pretty fast, so the service should be highly scalable and highly available to keep up with the surge in traffic.
24 |
25 |
26 | ### Design Considerations
27 | 1. Assume that our service doesn't require authentication.
28 | 2. No handling of partial ticket orders. Either users get all the tickets they want or they get nothing.
29 | 3. Fairness is mandatory.
30 | 4. To prevent system abuse, restrict users from booking more than 10 seats at a time.
31 |
32 | ## 2. Capacity Estimation
33 |
34 | > **Traffic estimates:** 3 billion monthly page views, sells 10 million tickets a month.
35 |
36 | > **Storage estimates:**
37 | 500 cities, on average each city has 10 cinemas, each with 300 seats, 3 shows daily.
38 |
39 | Let's assume each seat booking needs 50 bytes (IDs, NumberOfSeats, ShowID, MovieID, SeatNumbers, SeatStatus, Timestamp, etc) to store in the DB.
40 | We need to store information about movies and cinemas; assume another 50 bytes.
41 |
42 | So, to store data about all shows of all cinemas of all cities for one day:
43 |
44 | ```
45 | 500 cities * 10 cinemas * 300 seats * 3 shows * (50 + 50) bytes = 450 MB / day
46 | ```
47 | To store data for 5 years, we'd need around
48 | ```
49 | 450 MB/day * 365 * 5 = 821.25 GB
50 | ```
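
Writing the same estimate out as a quick calculation, so the numbers above can be checked:

```python
cities, cinemas_per_city, seats_per_hall, shows_per_day = 500, 10, 300, 3
bytes_per_booking = 50 + 50          # booking data + movie/cinema data

daily = cities * cinemas_per_city * seats_per_hall * shows_per_day * bytes_per_booking
print(daily / 10**6)                 # ~450 MB per day
print(daily * 365 * 5 / 10**9)       # ~821 GB for 5 years
```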
51 |
52 | ## 3. System APIs
53 | Let's use REST APIs to expose the functionality of our service.
54 |
55 |
56 | ### Searching movies
57 | ```python
58 | search_movies(
59 |     api_dev_key: str,          # The API developer key. This will be used to throttle users
60 |                                # based on their allocated quota.
61 |     keyword: str,              # Keyword to search on.
62 |     city: str,                 # City to filter movies by.
63 |     lat_long: str,             # Latitude and longitude to filter by.
64 |     radius: int,               # Radius of the area in which we want to search for events.
65 |     start_date: datetime,      # Filter with a starting datetime.
66 |     end_date: datetime,        # Filter with an ending datetime.
67 |     postal_code: int,          # Filter movies by postal code / zipcode.
68 |     include_spell_check: str,  # Whether to include spell checking (Enum: yes or no).
69 |     result_per_page: int,      # Number of results to return per page. Max = 30.
70 |     sorting_order: str,        # Sorting order of the search result.
71 |                                # Allowable values: 'name,asc', 'name,desc', 'date,asc',
72 |                                # 'date,desc', 'distance,asc', 'name,date,asc', 'name,date,desc'
73 | )
74 |
75 | ```
76 | Returns: (JSON)
77 | ```json
78 | [
79 |   {
80 |     "MovieID": 1,
81 |     "ShowID": 1,
82 |     "Title": "Klaus",
83 |     "Description": "Christmas animation about the origin of Santa Claus",
84 |     "Duration": 97,
85 |     "Genre": "Animation/Comedy",
86 |     "Language": "English",
87 |     "ReleaseDate": "8th Nov. 2019",
88 |     "Country": "USA",
89 |     "StartTime": "14:00",
90 |     "EndTime": "16:00",
91 |     "Seats":
92 |     [
93 |       {
94 |         "Type": "Regular",
95 |         "Price": 14.99,
96 |         "Status": "Almost Full"
97 |       },
98 |       {
99 |         "Type": "Premium",
100 |         "Price": 24.99,
101 |         "Status": "Available"
102 |       }
103 |     ]
104 |   },
105 |   {
106 |     "MovieID": 2,
107 |     "ShowID": 2,
108 |     "Title": "The Two Popes",
109 |     "Description": "Biographical drama film",
110 |     "Duration": 125,
111 |     "Genre": "Drama/Comedy",
112 |     "Language": "English",
113 |     "ReleaseDate": "31st Aug. 2019",
114 |     "Country": "USA",
115 |     "StartTime": "19:00",
116 |     "EndTime": "21:10",
117 |     "Seats":
118 |     [
119 |       {
120 |         "Type": "Regular",
121 |         "Price": 14.99,
122 |         "Status": "Full"
123 |       },
124 |       {
125 |         "Type": "Premium",
126 |         "Price": 24.99,
127 |         "Status": "Almost Full"
128 |       }
129 |     ]
130 |   }
131 | ]
132 | ```
133 | ### Reserving Seats
134 | ```python
135 | reserve_seats(
136 |     api_dev_key: str,            # API developer key.
137 |     session_id: str,             # User session ID to track this reservation.
138 |                                  # Once the reservation time of 5 minutes expires,
139 |                                  # the user's reservation on the server will be removed using this ID.
140 |     movie_id: str,               # Movie to reserve.
141 |     show_id: str,                # Show to reserve.
142 |     seats_to_reserve: List[int]  # A list containing the seat IDs to reserve.
143 | )
144 | ```
145 |
146 | Returns: (JSON)
147 | ```
148 | The status of the reservation, which would be one of the following:
149 | 1. Reservation Successful,
150 | 2. Reservation Failed - Show Full
151 | 3. Reservation Failed - Retry, as other users are holding reserved seats.
152 | ```
153 |
154 | ## 4. DB Design
155 |
156 | 1. Each **City** can have multiple **Cinema**s
157 | 2. Each **Cinema** can have multiple **Cinema_Hall**s.
158 | 3. Each **Movie** will have **Show**s and each Show will have multiple **Booking**s.
159 | 4. A **User** can have multiple **Booking**s.
160 |
161 |
162 |
163 |
164 | 
165 |
166 | ## 5. High Level Design
167 | From a bird's eye view:
168 | - Web servers manage users' sessions.
169 | - Application servers handle all the ticket management, store data in the DB, and work with cache servers to process reservations.
172 |
173 | 
174 |
175 | ## 6. Detailed Component Design
176 |
177 | Let's explore the workflow where no seats are available to reserve, but not all seats have been booked yet (some users are holding seats in the reservation pool and have not completed their booking).
178 | - The user is taken to a waiting page and waits until the required seats get freed from the reservation pool. What happens next depends on the outcome:
179 |   - If the required number of seats become available, the user is taken to the theatre page to choose seats.
180 |   - If, while waiting, all seats get booked, or there are fewer seats in the reservation pool than the user intends to book, the user is shown an error message.
181 |   - The user cancels the waiting and is taken back to the movie search page.
182 |   - At maximum, a user waits for an hour; after that, the user's session expires and they are taken back to the movie search page.
183 |
184 | If seats are reserved successfully, the user has 5 minutes to pay for the reservation. After payment, the booking is marked complete. If the user isn't able to pay within 5 minutes, all the reserved seats are freed from the reservation pool and become available to other users.
185 |
186 | #### How do we keep track of all active reservations that haven't been booked yet, and of waiting customers?
187 | We need two daemon services:
188 |
189 | **a. Active Reservation Service**
190 |
191 | This will keep track of all active reservations and remove expired ones from the system.
192 |
193 | We can keep all the reservations of a show in memory in a [Linked HashMap](https://www.geeksforgeeks.org/linkedhashmap-class-java-examples/), in addition to keeping the data in the DB.
194 | - We will need this doubly linked data structure so we can jump to any reservation position and remove it when the booking is complete.
195 | - The head of the HashMap will always point to the oldest record. Since we have an expiry time associated with each reservation, a reservation can be expired when its timeout is reached.
196 |
197 | To store every reservation for every show, we can have a HashTable where the `key` = `ShowID` and `value` = Linked HashMap containing `BookingID` and creation `Timestamp`.
198 |
199 | In the DB:
200 | - We store the reservation in the `Booking` table.
201 | - The expiry time will be in the `Timestamp` column.
202 | - The `Status` field will have a value of `Reserved(1)`; as soon as a booking is complete, we update the status to `Booked(2)`.
203 | - Once the status is changed, we remove the reservation record from the Linked HashMap of the relevant show.
204 | - When a reservation expires, we remove it from the `Booking` table or mark it `Expired(3)`, and remove it from memory as well.
205 |
206 | The ActiveReservationService will work with the external financial service to process user payments. When a booking is completed or a reservation expires, the WaitingUserService will get a signal so that any waiting customer can be served.
207 |
208 | ```python
209 |
210 | # The HashTable keeping track of all active reservations
211 | hash_table = {
212 |     # ShowID: Linked HashMap of BookingID -> creation timestamp (insertion-ordered)
213 |     'showID1': {
214 |         1: 1575465935,
215 |         2: 1575465940,
216 |         3: 1575466950,
217 |     },
218 |     'showID2': { ... },
219 | }
220 | ```
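
A minimal sketch of the expiry sweep, assuming a plain insertion-ordered `OrderedDict` stands in for the Linked HashMap (the TTL constant and function name are illustrative):

```python
import time
from collections import OrderedDict

RESERVATION_TTL = 5 * 60   # 5-minute hold

# ShowID -> OrderedDict of BookingID -> creation timestamp (oldest first)
active_reservations = {'showID1': OrderedDict([(1, 1575465935), (2, 1575465940)])}

def expire_reservations(show_id: str, now: float = None) -> list:
    """Pop expired reservations from the head; return their BookingIDs."""
    now = time.time() if now is None else now
    expired = []
    reservations = active_reservations.get(show_id, OrderedDict())
    while reservations:
        booking_id, created_at = next(iter(reservations.items()))  # oldest entry
        if created_at + RESERVATION_TTL > now:
            break                        # head not expired -> the rest aren't either
        reservations.popitem(last=False)
        expired.append(booking_id)       # also mark Expired(3) in the Booking table
    return expired
```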
221 |
222 | **b. Waiting User Service**
223 |
224 | - This service will keep track of waiting users in a Linked HashMap or a TreeMap.
225 | - This lets us jump to any user in the list and remove them when they cancel their request.
226 | - Since seats are granted on a first-come, first-served basis, the head of the Linked HashMap will always point to the longest-waiting user, so that whenever seats become available, we can serve users in a fair manner.
227 |
228 | We'll have a HashTable to store all waiting users for every show:
229 | Key = `ShowID`, value = Linked HashMap containing `UserID`s and their wait start time.
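
A small sketch of that fairness guarantee, again assuming an insertion-ordered map per show (the helper names are hypothetical):

```python
from collections import OrderedDict

# ShowID -> OrderedDict of UserID -> wait start time (longest-waiting user first)
waiting_users: dict = {}

def add_waiting_user(show_id: str, user_id: int, started_at: float) -> None:
    waiting_users.setdefault(show_id, OrderedDict())[user_id] = started_at

def cancel_waiting(show_id: str, user_id: int) -> None:
    waiting_users.get(show_id, OrderedDict()).pop(user_id, None)

def next_waiting_user(show_id: str):
    """Serve users first-come, first-served: the head of the map has waited longest."""
    queue = waiting_users.get(show_id)
    return queue.popitem(last=False) if queue else None
```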
230 |
231 | Clients can use long polling to keep themselves updated on their reservation status. Whenever seats become available, the server can use this open request to notify the user.
232 |
233 | #### Reservation Expiration
234 | On the server, the Active Reservation Service keeps track of expiry of active connections (based on reservation time).
235 |
236 | On the client, we will show a timer (for the expiration time), which could be a little out of sync with the server. We can add a buffer of 5 seconds on the server so that the client never times out after the server does; otherwise a purchase could fail even though the user thinks they still have time.
237 |
238 |
239 | # 7. Concurrency
240 | We need to handle concurrency such that no two users are able to book the same seat.
241 |
242 | We can use transactions in SQL databases to isolate each booking, locking the rows before we update them. If we read rows first, we'll take a write lock on them so that they can't be updated by anyone else.
243 |
244 | Once the DB transaction is committed and successful, we can start tracking the reservation in the Active Reservation Service.
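To make the locking idea concrete, here is a self-contained sketch using SQLite from the standard library with a simplified `Show_Seats` table; a production deployment would more likely use row-level locks (e.g. `SELECT ... FOR UPDATE`) in a server database, so treat this only as an illustration of the transaction flow:

```python
import sqlite3

# In-memory demo: explicit BEGIN/COMMIT so we control the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Show_Seats (show_id INT, seat_id INT, status TEXT)")
conn.executemany("INSERT INTO Show_Seats VALUES (1, ?, 'Available')",
                 [(s,) for s in range(1, 11)])

def reserve(show_id: int, seat_ids: list) -> bool:
    """Reserve all requested seats atomically, or none of them."""
    placeholders = ",".join("?" * len(seat_ids))
    conn.execute("BEGIN IMMEDIATE")          # take the write lock before reading
    available = conn.execute(
        f"SELECT COUNT(*) FROM Show_Seats "
        f"WHERE show_id = ? AND seat_id IN ({placeholders}) AND status = 'Available'",
        [show_id, *seat_ids]).fetchone()[0]
    if available != len(seat_ids):           # another user already holds a seat
        conn.execute("ROLLBACK")
        return False
    conn.execute(
        f"UPDATE Show_Seats SET status = 'Reserved' "
        f"WHERE show_id = ? AND seat_id IN ({placeholders})",
        [show_id, *seat_ids])
    conn.execute("COMMIT")
    return True

print(reserve(1, [3, 4]))   # True
print(reserve(1, [4, 5]))   # False -- seat 4 is already reserved
```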
245 |
246 | # 8. Fault Tolerance
247 | If the Active Reservation Service crashes, we can rebuild its state by reading all active reservations from the Booking table.
248 |
249 | Another option is to have a **master-slave configuration** so that, when the master crashes, the slave can take over. We are not storing waiting users in the DB, so when the Waiting User Service crashes, we don't have any means to recover that data unless we have a master-slave setup.
250 |
251 | We can also have the same master-slave setup for DBs to make them fault tolerant.
252 |
253 | # 9. Data Partitioning
254 | Partitioning by MovieID will result in all Shows of a Movie being on a single server.
255 | For a hot movie, this could cause a lot of load on that server. A better approach would be to partition based on ShowID; this way, the load gets distributed among different servers.
256 |
257 | We can use Consistent Hashing to allocate application servers for both (ActiveReservationService and WaitingUserService) based on `ShowID`. This way, all waiting users of a particular show will be handled by a certain set of servers.
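
A compact consistent-hashing ring keyed by `ShowID` could look like the sketch below (the virtual-node count and server names are placeholders):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, show_id: str) -> str:
        """All requests for a ShowID land on the same server (until the ring changes)."""
        idx = bisect.bisect(self.keys, self._hash(show_id)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["reservation-1", "reservation-2", "reservation-3"])
print(ring.server_for("showID1"))
```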
258 |
259 | **When a reservation expires, the server holding that reservation will:**
260 |
261 | 1. Update the DB to remove the expired Booking (or mark it `Expired(3)`) and update the seat status in the `Show_Seats` table.
262 | 2. Remove the reservation from Linked HashMap.
263 | 3. Notify the user that their reservation expired.
264 | 4. Broadcast a message to the `WaitingUserService` servers that are holding waiting users of that Show to find the longest-waiting user. The Consistent Hashing scheme will tell us which servers are holding these users.
265 | 5. Send a message to the `WaitingUserService` to go ahead and process the longest-waiting user if the required seats have become available.
266 |
267 | **When a reservation is successful:**
268 | 1. The server holding that reservation will send a message to all servers holding waiting users of that Show.
269 | 2. Upon receiving the above message, these servers will query the DB (or a DB cache) to find how many seats are available.
270 | 3. The servers can then expire all waiting users who want to reserve more seats than are currently available.
271 | 4. To do this, the WaitingUserService has to iterate through the Linked HashMap of all the waiting users to remove them.
272 |
--------------------------------------------------------------------------------