├── .gitignore ├── src ├── get-started │ ├── intro.md │ ├── submit-a-pull-request.md │ ├── import-tikv-into-an-ide.md │ ├── write-and-run-unit-tests.md │ ├── build-tikv-from-source.md │ └── debug-and-profile.md ├── understanding-tikv │ ├── storage │ │ ├── encode.md │ │ ├── intro.md │ │ ├── rocksdb.md │ │ ├── io-rate-limiter.md │ │ └── import-and-export.md │ ├── transaction │ │ ├── 1pc.md │ │ ├── cdc.md │ │ ├── intro.md │ │ ├── stale-read.md │ │ ├── async-command.md │ │ ├── async-commit.md │ │ ├── percolator │ │ │ ├── command.md │ │ │ ├── intro.md │ │ │ ├── latch-and-scheduler.md │ │ │ ├── tso.md │ │ │ └── encode.md │ │ └── pessimistic-transaction.md │ ├── high-availability │ │ ├── raftstore │ │ │ ├── fsm.md │ │ │ ├── intro.md │ │ │ └── actor-model.md │ │ ├── intro.md │ │ ├── raft │ │ │ ├── raft-rs.md │ │ │ ├── leader-lease.md │ │ │ ├── hibernate-region.md │ │ │ ├── byzantine-failure.md │ │ │ └── intro.md │ │ └── multi-raft │ │ │ ├── intro.md │ │ │ ├── split.md │ │ │ └── merge.md │ ├── coprocessor │ │ ├── coprocessor-plugin.md │ │ ├── tidb-expression-executor.md │ │ └── intro.md │ ├── overview │ │ ├── intro.md │ │ ├── raw-kv.md │ │ └── transaction-kv.md │ ├── scalability │ │ ├── intro.md │ │ ├── region.md │ │ └── scheduling.md │ └── intro.md ├── contribute-to-tikv │ ├── intro.md │ ├── report-a-bug.md │ ├── committer-guide.md │ ├── contribute-code.md │ ├── write-document.md │ ├── community-guideline.md │ ├── request-for-comments.md │ ├── review-a-pull-request.md │ └── code-style-and-quality-guide.md ├── media │ ├── perf_flame.png │ ├── gdb_tikv_ut.png │ ├── inside-copr.png │ ├── copr-overview.png │ ├── basic-architecture.png │ ├── scalability-region.png │ ├── scalability-data-sharding.png │ └── tikv-distributed-execution.png ├── README.md └── SUMMARY.md ├── book.toml ├── .github ├── workflows │ ├── dco.yml │ └── publish.yml └── pull_request_template.md ├── README.md └── LICENSE /.gitignore: 
-------------------------------------------------------------------------------- 1 | book 2 | -------------------------------------------------------------------------------- /src/get-started/intro.md: -------------------------------------------------------------------------------- 1 | # Get Started 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/storage/encode.md: -------------------------------------------------------------------------------- 1 | # Encode 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/storage/intro.md: -------------------------------------------------------------------------------- 1 | # Storage 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/1pc.md: -------------------------------------------------------------------------------- 1 | # 1PC 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/cdc.md: -------------------------------------------------------------------------------- 1 | # CDC 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/intro.md: -------------------------------------------------------------------------------- 1 | # Contribute to TiKV 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/report-a-bug.md: -------------------------------------------------------------------------------- 1 | # Report a Bug 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/storage/rocksdb.md: -------------------------------------------------------------------------------- 1 | # RocksDB 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/committer-guide.md: 
-------------------------------------------------------------------------------- 1 | # Committer Guide 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/contribute-code.md: -------------------------------------------------------------------------------- 1 | # Contribute code 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/write-document.md: -------------------------------------------------------------------------------- 1 | # Write document 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/intro.md: -------------------------------------------------------------------------------- 1 | # Transaction 2 | -------------------------------------------------------------------------------- /src/get-started/submit-a-pull-request.md: -------------------------------------------------------------------------------- 1 | # Submit a Pull Request 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raftstore/fsm.md: -------------------------------------------------------------------------------- 1 | # FSM 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/stale-read.md: -------------------------------------------------------------------------------- 1 | # Stale Read 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/community-guideline.md: -------------------------------------------------------------------------------- 1 | # Community Guideline 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/request-for-comments.md: -------------------------------------------------------------------------------- 1 | # Request for Comments 2 | 
-------------------------------------------------------------------------------- /src/get-started/import-tikv-into-an-ide.md: -------------------------------------------------------------------------------- 1 | # Import TiKV into an IDE 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/intro.md: -------------------------------------------------------------------------------- 1 | # High Availability 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raft/raft-rs.md: -------------------------------------------------------------------------------- 1 | # raft-rs 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/storage/io-rate-limiter.md: -------------------------------------------------------------------------------- 1 | # IO Rate Limiter 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/async-command.md: -------------------------------------------------------------------------------- 1 | # Async Command 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/async-commit.md: -------------------------------------------------------------------------------- 1 | # Async Commit 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/percolator/command.md: -------------------------------------------------------------------------------- 1 | # Command 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/percolator/intro.md: -------------------------------------------------------------------------------- 1 | # Percolator 2 | --------------------------------------------------------------------------------
/src/contribute-to-tikv/review-a-pull-request.md: -------------------------------------------------------------------------------- 1 | # Review a Pull Request 2 | -------------------------------------------------------------------------------- /src/get-started/write-and-run-unit-tests.md: -------------------------------------------------------------------------------- 1 | # Write and Run Unit Tests 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/multi-raft/intro.md: -------------------------------------------------------------------------------- 1 | # Multi-Raft 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raftstore/intro.md: -------------------------------------------------------------------------------- 1 | # RaftStore 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/storage/import-and-export.md: -------------------------------------------------------------------------------- 1 | # Import and Export 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/coprocessor/coprocessor-plugin.md: -------------------------------------------------------------------------------- 1 | # Coprocessor Plugin 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raft/leader-lease.md: -------------------------------------------------------------------------------- 1 | # Leader Lease 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raftstore/actor-model.md: -------------------------------------------------------------------------------- 1 | # Actor Model 2 | -------------------------------------------------------------------------------- 
/src/understanding-tikv/high-availability/raft/hibernate-region.md: -------------------------------------------------------------------------------- 1 | # Hibernate Region 2 | -------------------------------------------------------------------------------- /src/contribute-to-tikv/code-style-and-quality-guide.md: -------------------------------------------------------------------------------- 1 | # Code Style and Quality Guide 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/coprocessor/tidb-expression-executor.md: -------------------------------------------------------------------------------- 1 | # TiDB Expression Executor 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raft/byzantine-failure.md: -------------------------------------------------------------------------------- 1 | # Byzantine Failure 2 | -------------------------------------------------------------------------------- /src/understanding-tikv/overview/intro.md: -------------------------------------------------------------------------------- 1 | # Overview 2 | 3 | > Talk about the architecture 4 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/percolator/latch-and-scheduler.md: -------------------------------------------------------------------------------- 1 | # Latch and Scheduler 2 | -------------------------------------------------------------------------------- /src/media/perf_flame.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/perf_flame.png -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/raft/intro.md: -------------------------------------------------------------------------------- 1 | # Raft 2 | 3 | > what 
is Raft, how it works 4 | -------------------------------------------------------------------------------- /src/understanding-tikv/overview/raw-kv.md: -------------------------------------------------------------------------------- 1 | # Raw KV 2 | 3 | > How raw get and raw put are processed 4 | -------------------------------------------------------------------------------- /src/media/gdb_tikv_ut.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/gdb_tikv_ut.png -------------------------------------------------------------------------------- /src/media/inside-copr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/inside-copr.png -------------------------------------------------------------------------------- /src/media/copr-overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/copr-overview.png -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/multi-raft/split.md: -------------------------------------------------------------------------------- 1 | # Split 2 | 3 | > region epoch, conf version 4 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/percolator/tso.md: -------------------------------------------------------------------------------- 1 | # TSO 2 | 3 | > how pd tso works, how global/local tso works 4 | -------------------------------------------------------------------------------- /src/media/basic-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/basic-architecture.png 
-------------------------------------------------------------------------------- /src/media/scalability-region.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/scalability-region.png -------------------------------------------------------------------------------- /src/understanding-tikv/high-availability/multi-raft/merge.md: -------------------------------------------------------------------------------- 1 | # Merge 2 | 3 | > PrepareMerge, CommitMerge, RollbackMerge 4 | -------------------------------------------------------------------------------- /src/media/scalability-data-sharding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/scalability-data-sharding.png -------------------------------------------------------------------------------- /src/media/tikv-distributed-execution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tikv/tikv-dev-guide/HEAD/src/media/tikv-distributed-execution.png -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/pessimistic-transaction.md: -------------------------------------------------------------------------------- 1 | # Pessimistic Transaction 2 | 3 | > pessimistic lock, deadlock detector 4 | -------------------------------------------------------------------------------- /src/understanding-tikv/transaction/percolator/encode.md: -------------------------------------------------------------------------------- 1 | # Encode 2 | 3 | > mvcc encode, memcomparable encode, key adjustment on region split 4 | -------------------------------------------------------------------------------- /book.toml: -------------------------------------------------------------------------------- 1 | [book] 2 |
title = "TiKV Development Guide" 3 | 4 | [output.html] 5 | git-repository-url = "https://github.com/tisonkun/tikv-dev-guide" 6 | -------------------------------------------------------------------------------- /src/understanding-tikv/overview/transaction-kv.md: -------------------------------------------------------------------------------- 1 | # Transaction KV 2 | 3 | > How 2PC works, how prewrite and commit are processed. (only focus on the optimistic transaction) 4 | -------------------------------------------------------------------------------- /.github/workflows/dco.yml: -------------------------------------------------------------------------------- 1 | name: DCO Check 2 | 3 | on: [pull_request] 4 | 5 | jobs: 6 | check: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - uses: tisonkun/actions-dco@v1.1 10 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | ### What issue does this PR solve? 6 | 7 | - close #xxxx 8 | 9 | ### What is changed: 10 | -------------------------------------------------------------------------------- /src/README.md: -------------------------------------------------------------------------------- 1 | # TiKV Development Guide 2 | 3 | This guide is meant to help document how TiKV works, as well as to help new contributors get involved in TiKV development. 4 | 5 | There are three parts to this guide: 6 | 7 | 1. [Get Started](get-started/intro.md) 8 | 2. [Contribute to TiKV](contribute-to-tikv/intro.md) 9 | 3. 
[Understanding TiKV](understanding-tikv/intro.md) 10 | -------------------------------------------------------------------------------- /.github/workflows/publish.yml: -------------------------------------------------------------------------------- 1 | name: Publish 2 | 3 | on: 4 | push: 5 | branches: [main] 6 | 7 | jobs: 8 | deploy: 9 | runs-on: ubuntu-18.04 10 | steps: 11 | - uses: actions/checkout@v2 12 | 13 | - name: Setup mdBook 14 | uses: peaceiris/actions-mdbook@v1 15 | with: 16 | mdbook-version: '0.4.8' 17 | 18 | - run: mdbook build 19 | 20 | - name: Deploy 21 | uses: peaceiris/actions-gh-pages@v3 22 | if: github.ref == 'refs/heads/main' 23 | with: 24 | github_token: ${{ secrets.GITHUB_TOKEN }} 25 | publish_dir: ./book 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TiKV Development Guide 2 | 3 | This repository contains the source of the TiKV Development Guide. 4 | 5 | ## Requirements 6 | 7 | Building the book requires [mdBook](https://github.com/rust-lang-nursery/mdBook). To get it: 8 | 9 | ```bash 10 | $ cargo install mdbook 11 | ``` 12 | 13 | ## Preview 14 | 15 | To preview the book, type: 16 | 17 | ```bash 18 | $ mdbook serve 19 | ``` 20 | 21 | By default, it will create a server listening on [http://localhost:3000](http://localhost:3000); you can open it with a browser to preview the book. 22 | 23 | ## Building 24 | 25 | To build the book, type: 26 | 27 | ```bash 28 | $ mdbook build 29 | ``` 30 | 31 | The output will be in the `book` subdirectory. To check it out, open it in 32 | your web browser.
33 | 34 | _Firefox:_ 35 | ```bash 36 | $ firefox book/index.html # Linux 37 | $ open -a "Firefox" book/index.html # OS X 38 | $ Start-Process "firefox.exe" .\book\index.html # Windows (PowerShell) 39 | $ start firefox.exe .\book\index.html # Windows (Cmd) 40 | ``` 41 | 42 | _Chrome:_ 43 | ```bash 44 | $ google-chrome book/index.html # Linux 45 | $ open -a "Google Chrome" book/index.html # OS X 46 | $ Start-Process "chrome.exe" .\book\index.html # Windows (PowerShell) 47 | $ start chrome.exe .\book\index.html # Windows (Cmd) 48 | ``` 49 | -------------------------------------------------------------------------------- /src/understanding-tikv/scalability/intro.md: -------------------------------------------------------------------------------- 1 | # Scalability 2 | 3 | In the database field, `Scalability` is the term we use to describe the capability of a system to handle a growing amount of storage or computation. Even if a system works reliably and fast at a small scale today, that doesn't mean it will necessarily work well in the future, especially when the increased load exceeds what the system can process. In modern systems, the amount of data we have to process can far outgrow our original expectations, so scalability is a critical consideration in the design of a database. 4 | 5 | There are two main types of scalability: 6 | 7 | 1. Vertical scaling, also known as __scaling up__, means adding resources to a single node in a system, typically by upgrading the CPU, memory, or storage to make it a more powerful single computer. Vertical scaling is limited by the technology of the semiconductor industry, and the cost per hertz of CPU or byte of memory/storage increases dramatically as it approaches the technical limit. 8 | 9 | 2. Horizontal scaling, also known as __scaling out__, means adding more machines to a system and distributing the load across multiple smaller machines.
As computer prices have dropped and performance has continued to increase, high-performance computing applications have adopted low-cost commodity systems for their tasks. System architects may deploy hundreds of small computers in a cluster to obtain aggregate computing power far greater than that of a single stand-alone computer. Moreover, with the widespread use of [Cloud computing](https://en.wikipedia.org/wiki/Cloud_computing) technology, horizontal scalability is necessary for a system to adapt to the elasticity of the Cloud. 10 | 11 | A system whose performance improves after adding hardware, proportionally to the added quantity, is said to be a scalable system. In practice, such scalability depends on the ability to scale horizontally. 12 | 13 | TiKV is a highly scalable key-value store, especially compared with stand-alone key-value stores like [RocksDB](https://rocksdb.org/) and [LevelDB](https://github.com/google/leveldb). 14 | 15 | To be scalable, TiKV needs to solve the following problems: 16 | 17 | 1. Partitioning: how to break data up into partitions, also known as _sharding_, to fully utilize the resources of every node in the cluster. In TiKV, a partition is called a `Region`. 18 | 19 | 2. Scheduling: how to distribute the Regions across the cluster to balance the load among all nodes and eliminate hot spots and other bottlenecks. 20 | 21 | In the rest of this chapter, we will talk about `Region` and `Scheduling` in TiKV. 22 | -------------------------------------------------------------------------------- /src/understanding-tikv/intro.md: -------------------------------------------------------------------------------- 1 | # Understanding TiKV 2 | 3 | TiKV is a distributed, transactional key-value database. It is fully ACID compliant and provides automatic horizontal scalability, global data consistency, geo-replication, and many other features. It can be used as a building block for other high-level services.
4 | 5 | In this chapter, we will introduce the design and implementation of TiKV. We hope that through this chapter, you can develop a deep understanding of TiKV and build your knowledge of distributed programming. 6 | 7 | ## History 8 | 9 | In the middle of 2015, we decided to build a database to solve MySQL's scaling problems. At that time, the most common way to increase MySQL's scalability was to build a proxy on top of MySQL that distributes the load more efficiently, but we didn't think that was the best way. 10 | 11 | As far as we knew, proxy-based solutions have the following problems: 12 | 13 | * Building a proxy on top of the MySQL servers cannot guarantee ACID compliance. Notably, cross-node transactions are not supported natively. 14 | * It poses great challenges for business flexibility because the users have to worry about the data distribution and design their sharding keys carefully to avoid inefficient queries. 15 | * The high availability and data consistency of MySQL cannot be guaranteed easily based on the traditional Source-Replica replication. 16 | 17 | Although building a proxy based on MySQL directly might be easy at the beginning, we still decided to choose another, more difficult path: to build a distributed, MySQL-compatible database from scratch. 18 | 19 | Fortunately, Google had met the same problem and had already published some papers describing how they built [Spanner](http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf) and [F1](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41344.pdf) to solve it. Spanner is a globally distributed, externally consistent database, and F1 is a distributed SQL database based on Spanner. Inspired by Spanner and F1, we knew we could do the same thing. So we started to build TiDB, a stateless MySQL layer like F1. After we released TiDB, we knew we needed an underlying Spanner-like database, so we began to develop TiKV.
20 | 21 | ## Architecture 22 | 23 | The diagram below shows the architecture of TiKV: 24 | 25 | ![basic-architecture](../media/basic-architecture.png) 26 | 27 | In this illustration, there are four TiKV instances in the cluster, and each instance uses one key-value engine to save data. On top of the key-value engine, TiKV uses the [Raft](https://raft.github.io/) consensus algorithm to replicate the data. In practice, we use at least three replicas to keep data safe and consistent, and these replicas form a Raft group. 28 | 29 | We use the traditional multiversion concurrency control (MVCC) mechanism and have built a distributed transaction layer above the Raft layer. We also provide a Coprocessor framework so that users can push down their computing logic to the storage layer. 30 | 31 | All network communication goes through gRPC so that contributors can develop their own clients easily. 32 | 33 | The whole cluster is managed and scheduled by a central service: the Placement Driver (PD). 34 | 35 | As you can see, the hierarchy of TiKV is clear and easy to understand, and we will give more detailed explanations in later chapters.
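The MVCC mechanism mentioned above can be sketched with a small toy model. This is not TiKV's actual storage code — real TiKV stores versions in RocksDB with a memcomparable encoding — but the core trick shown here, appending an inverted commit timestamp to the user key so that newer versions of a key sort first, is the same idea. All names below are illustrative.

```rust
use std::collections::BTreeMap;

/// Toy MVCC store: each user key maps to multiple (commit_ts, value)
/// versions. A read at timestamp `ts` returns the newest version whose
/// commit_ts <= ts, so old snapshots stay readable while new writes land.
struct MvccStore {
    // Map key: (user_key, !commit_ts). Bitwise-NOT of the timestamp makes
    // newer versions of the same user key sort first.
    data: BTreeMap<(Vec<u8>, u64), Vec<u8>>,
}

impl MvccStore {
    fn new() -> Self {
        MvccStore { data: BTreeMap::new() }
    }

    fn put(&mut self, key: &[u8], commit_ts: u64, value: &[u8]) {
        self.data.insert((key.to_vec(), !commit_ts), value.to_vec());
    }

    fn get(&self, key: &[u8], read_ts: u64) -> Option<&[u8]> {
        // Seek to (key, !read_ts): the first entry at or after it that has
        // the same user key is the newest version with commit_ts <= read_ts.
        self.data
            .range((key.to_vec(), !read_ts)..)
            .next()
            .filter(|((k, _), _)| k.as_slice() == key)
            .map(|(_, v)| v.as_slice())
    }
}

fn main() {
    let mut store = MvccStore::new();
    store.put(b"a", 10, b"v1");
    store.put(b"a", 20, b"v2");
    assert_eq!(store.get(b"a", 15), Some(&b"v1"[..])); // snapshot at ts=15 sees v1
    assert_eq!(store.get(b"a", 25), Some(&b"v2"[..])); // ts=25 sees the newer v2
    assert_eq!(store.get(b"a", 5), None); // nothing committed before ts=5
    println!("mvcc reads ok");
}
```

A writer committing at a new timestamp never disturbs readers holding an older `read_ts`, which is what lets the transaction layer serve consistent snapshots without read locks.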
36 | -------------------------------------------------------------------------------- /src/SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | [TiKV Development Guide](README.md) 4 | 5 | --- 6 | 7 | * [Get Started](get-started/intro.md) 8 | * [Build TiKV from Source](get-started/build-tikv-from-source.md) 9 | * [Import TiKV into an IDE](get-started/import-tikv-into-an-ide.md) 10 | * [Write and Run Unit Tests](get-started/write-and-run-unit-tests.md) 11 | * [Debug and Profile](get-started/debug-and-profile.md) 12 | * [Submit a Pull Request](get-started/submit-a-pull-request.md) 13 | 14 | * [Contribute to TiKV](contribute-to-tikv/intro.md) 15 | * [Community Guideline](contribute-to-tikv/community-guideline.md) 16 | * [Committer Guide](contribute-to-tikv/committer-guide.md) 17 | * [Report a Bug](contribute-to-tikv/report-a-bug.md) 18 | * [Contribute code](contribute-to-tikv/contribute-code.md) 19 | * [Review a Pull Request](contribute-to-tikv/review-a-pull-request.md) 20 | * [Request for Comments](contribute-to-tikv/request-for-comments.md) 21 | * [Code Style and Quality Guide](contribute-to-tikv/code-style-and-quality-guide.md) 22 | * [Write document](contribute-to-tikv/write-document.md) 23 | 24 | * [Understanding TiKV](understanding-tikv/intro.md) 25 | * [Overview](understanding-tikv/overview/intro.md) 26 | * [Raw KV](understanding-tikv/overview/raw-kv.md) 27 | * [Transaction KV](understanding-tikv/overview/transaction-kv.md) 28 | * [Scalability](understanding-tikv/scalability/intro.md) 29 | * [Region](understanding-tikv/scalability/region.md) 30 | * [Scheduling](understanding-tikv/scalability/scheduling.md) 31 | * [High Availability](understanding-tikv/high-availability/intro.md) 32 | * [Raft](understanding-tikv/high-availability/raft/intro.md) 33 | * [raft-rs](understanding-tikv/high-availability/raft/raft-rs.md) 34 | * [Leader Lease](understanding-tikv/high-availability/raft/leader-lease.md) 35 | * 
[Hibernate Region](understanding-tikv/high-availability/raft/hibernate-region.md) 36 | * [Multi-Raft](understanding-tikv/high-availability/multi-raft/intro.md) 37 | * [Split](understanding-tikv/high-availability/multi-raft/split.md) 38 | * [Merge](understanding-tikv/high-availability/multi-raft/merge.md) 39 | * [RaftStore](understanding-tikv/high-availability/raftstore/intro.md) 40 | * [Actor Model](understanding-tikv/high-availability/raftstore/actor-model.md) 41 | * [FSM](understanding-tikv/high-availability/raftstore/fsm.md) 42 | * [Transaction](understanding-tikv/transaction/intro.md) 43 | * [Percolator](understanding-tikv/transaction/percolator/intro.md) 44 | * [TSO](understanding-tikv/transaction/percolator/tso.md) 45 | * [Encode](understanding-tikv/transaction/percolator/encode.md) 46 | * [Command](understanding-tikv/transaction/percolator/command.md) 47 | * [Latch and Scheduler](understanding-tikv/transaction/percolator/latch-and-scheduler.md) 48 | * [Pessimistic Transaction](understanding-tikv/transaction/pessimistic-transaction.md) 49 | * [Async Commit](understanding-tikv/transaction/async-commit.md) 50 | * [1PC](understanding-tikv/transaction/1pc.md) 51 | * [Stale Read](understanding-tikv/transaction/stale-read.md) 52 | * [CDC](understanding-tikv/transaction/cdc.md) 53 | * [Storage](understanding-tikv/storage/intro.md) 54 | * [RocksDB](understanding-tikv/storage/rocksdb.md) 55 | * [Encode](understanding-tikv/storage/encode.md) 56 | * [Import and Export](understanding-tikv/storage/import-and-export.md) 57 | * [IO Rate Limiter](understanding-tikv/storage/io-rate-limiter.md) 58 | * [Coprocessor](understanding-tikv/coprocessor/intro.md) 59 | * [TiDB Expression Executor](understanding-tikv/coprocessor/tidb-expression-executor.md) 60 | * [Coprocessor Plugin](understanding-tikv/coprocessor/coprocessor-plugin.md) 61 | -------------------------------------------------------------------------------- /src/understanding-tikv/scalability/region.md: 
-------------------------------------------------------------------------------- 1 | # Region 2 | 3 | **Sharding** is one of the most essential characteristics for a storage system to be scalable (or more specifically, [horizontally scalable](https://en.wikipedia.org/wiki/Scalability)). By breaking data up into partitions, a storage system can distribute data across computers in the cluster to fully utilize the resources of every single node. Meanwhile, it can migrate partitions to re-balance computing and storage loads when nodes are added to the cluster, gaining a performance improvement proportional to the increase in the number of nodes, i.e., achieving **Scalability**. 4 | 5 | ![scalability-data-sharding](../../media/scalability-data-sharding.png) 6 | *Diagram 1, Data Sharding [1]* 7 | 8 | In TiKV, a partition is called a `Region`, which is the data unit to be distributed and migrated among TiKV nodes. A region is also the data unit to be replicated by [Raft](https://raft.github.io/), to achieve [**High-Availability**](../high-availability/intro.md). A region is likewise a `Raft Group` in the Raft algorithm, composed of one or more `Peer`s, where a peer is one of the replicas of a partition. 9 | 10 | ![scalability-region](../../media/scalability-region.png) 11 | *Diagram 2, Region & Replication [1]* 12 | 13 | How does TiKV split data into regions? As the data element of TiKV is the key-value pair, we first decide which keys should be put into a region. In general, there are two approaches: 14 | 15 | * Sharding by hash: splitting by the hash of keys (e.g. [consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing)). It is easier to implement, as the location of keys can be calculated by clients. But it is more difficult to scale, consistent hashing makes things more complicated, and it is inefficient for range queries. 16 | 17 | * Sharding by range: splitting by the range of keys. It is simple, good for scans, easy to scale by splitting and merging, and can even switch to sharding by hash easily.
But it needs additional management of the partition metadata.

TiKV splits data by key range. Each region contains a continuous range of keys. The [**Placement Driver (PD) server**](https://github.com/tikv/pd) stores the `[start_key, end_key)` and other [metadata](https://github.com/pingcap/kvproto/blob/release-5.2/proto/metapb.proto#L64-L76) of each region, and performs the [scheduling](scheduling.md) of regions.

Then, we determine how many key-value pairs should be stored in one region. The size of a region should not be too small, otherwise the cost of managing too many regions would be high. Meanwhile, the size should not be too large, or else region migration would be expensive and time-consuming.

By default, each region is expected to be about 96MB in size (see [region-split-size](https://docs.pingcap.com/tidb/stable/tikv-configuration-file#region-split-size)). Large regions over 144MB (see [region-max-size](https://docs.pingcap.com/tidb/stable/tikv-configuration-file#region-max-size)) will be split into two or more regions of about 96MB each. Small adjacent regions under 20MB (see [max-merge-region-size](https://docs.pingcap.com/tidb/stable/pd-configuration-file#max-merge-region-size)) will be merged into one.

Moreover, each region is expected to contain roughly 960000 keys (see [region-split-keys](https://docs.pingcap.com/tidb/stable/tikv-configuration-file#region-split-keys)), because calculating the region size requires scanning all keys in the region. Big regions with more than 1440000 keys (see [region-max-keys](https://docs.pingcap.com/tidb/stable/tikv-configuration-file#region-max-keys)) will be split, while regions with more than 200000 keys (see [max-merge-region-keys](https://docs.pingcap.com/tidb/stable/pd-configuration-file#max-merge-region-keys)) will **NOT** be merged.
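Sharding by range also makes routing straightforward: given region metadata sorted by start key, a client (or PD) can locate the region owning a key with a binary search. Below is a minimal sketch of that lookup; the `region` type and field names are illustrative stand-ins, not the actual metapb definitions.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// region mirrors the idea of the [start_key, end_key) metadata PD keeps
// for each region. These names are illustrative, not the metapb types.
type region struct {
	id       uint64
	startKey []byte
	endKey   []byte // empty endKey means "unbounded on the right"
}

// regionFor finds the region covering key by binary search over regions
// sorted by startKey. Returns 0 if no region covers the key (illustrative
// "not found" value).
func regionFor(regions []region, key []byte) uint64 {
	// First region whose startKey is strictly greater than key.
	i := sort.Search(len(regions), func(i int) bool {
		return bytes.Compare(regions[i].startKey, key) > 0
	})
	if i == 0 {
		return 0 // key sorts before the first region
	}
	r := regions[i-1]
	if len(r.endKey) == 0 || bytes.Compare(key, r.endKey) < 0 {
		return r.id
	}
	return 0
}

func main() {
	regions := []region{
		{1, []byte(""), []byte("k10")},
		{2, []byte("k10"), []byte("k20")},
		{3, []byte("k20"), []byte("")},
	}
	fmt.Println(regionFor(regions, []byte("k15"))) // falls in ["k10", "k20")
	fmt.Println(regionFor(regions, []byte("k25"))) // falls in ["k20", +inf)
}
```

This is exactly what hash sharding cannot offer for scans: with range sharding, a range query touches only the few regions whose ranges overlap the query, found with the same binary search.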
## Reference

[1] Wenxuan, PingCAP, ["DEEP DIVE TIKV"](https://docs.google.com/presentation/d/1zF7yRiGt643orPHRFjorUfE8ZTTGwEw9Rdxv6mEc3eQ/edit?usp=sharing)
--------------------------------------------------------------------------------
/src/get-started/build-tikv-from-source.md:
--------------------------------------------------------------------------------
# Build TiKV from Source

TiKV is mostly written in Rust and has components written in C++ (RocksDB, gRPC). We are currently using the Rust [nightly toolchain](https://rust-lang.github.io/rustup/concepts/toolchains.html). To provide consistency, we use linters and automated formatting tools.

### Prerequisites

To build TiKV you'll need to have at least the following installed:

* [`git`](https://git-scm.com/) - Version control
* [`rustup`] - Rust installer and toolchain manager
* [`make`](https://www.gnu.org/software/make/) - Build tool (run common workflows)
* [`cmake`](https://cmake.org/) - Build tool (required for [gRPC])
* [`awk`](https://www.gnu.org/software/gawk/manual/gawk.html) - Pattern scanning/processing language
* C++ compiler - [`gcc`](https://gcc.gnu.org/) 4.9+ (required for [gRPC])

If you are targeting platforms other than x86_64 Linux, you'll also need:

* [`llvm` and `clang`](http://releases.llvm.org/download.html) - Used to generate bindings for different platforms and build native libraries (required for grpcio and rocksdb).

*(The latest versions of the above tools should work in most cases. If you encounter any trouble building TiKV, try upgrading to the latest versions.
If that does not help, do not hesitate to [ask](#ask-for-help).)*

### Getting the repository

```
git clone https://github.com/tikv/tikv.git
cd tikv
# Future instructions assume you are in this directory
```

### Building and testing

TiKV includes a [`Makefile`](https://github.com/tikv/tikv/blob/master/Makefile) that has common workflows and sets up a standard build environment. You can also use [`cargo`], as you do in many other Rust projects. To run `cargo` commands in the same environment as the `Makefile` and avoid re-compilations due to environment changes, you can prefix the command with [`scripts/env`](https://github.com/tikv/tikv/blob/master/scripts/env), for example: `./scripts/env cargo build`.

Furthermore, when building with `make`, `cargo` is configured to use [pipelined compilation](https://internals.rust-lang.org/t/evaluating-pipelined-rustc-compilation/10199) to increase the parallelism of the build. To turn on pipelining while using `cargo` directly, set the environment variable: `export CARGO_BUILD_PIPELINING=true`.

To build TiKV:

```bash
make build
```

During interactive development, you may prefer using `cargo check`, which will parse, borrow check, and lint your code, but not actually compile it:

```bash
./scripts/env cargo check --all
```

It is particularly handy alongside `cargo-watch`, which runs a command each time you change a file.

```bash
cargo install cargo-watch
cargo watch -s "./scripts/env cargo check --all"
```

When you're ready to test out your changes, use the `dev` task. It will format your codebase, build with `clippy` enabled, and run tests. In most cases, this should complete without any failure before you create a pull request. Unfortunately, some tests fail intermittently or cannot pass on your platform. If you're unsure, just [ask](#ask-for-help)!
```bash
make dev
```

You can run the test suite alone, or just run a specific test:

```bash
# Run the full suite
make test
# Run a specific test
./scripts/test $TESTNAME -- --nocapture
```

TiKV follows the Rust community coding style. We use [rustfmt](https://github.com/rust-lang/rustfmt) and [clippy](https://github.com/rust-lang/rust-clippy) to automatically format and lint our code. These tools also run in our CI. They are part of `make dev` as well, and you can run them alone:

```bash
# Run rustfmt
make format
# Run clippy (note that some lints are ignored, so `cargo clippy` will give many false positives)
make clippy
```

See the [Rust Style Guide](https://github.com/rust-lang/rfcs/blob/master/style-guide/README.md) and the [Rust API Guidelines](https://rust-lang-nursery.github.io/api-guidelines/) for details on the conventions.

Please follow this style to make TiKV easy to review, maintain, and develop.

### Build for Debugging

To reduce compilation time, TiKV builds do not include full debugging information by default: `release` and `bench` builds include no debuginfo, while `dev` and `test` builds include line numbers only. The easiest way to enable debuginfo is to precede build commands with `RUSTFLAGS=-Cdebuginfo=1` (for line numbers), or `RUSTFLAGS=-Cdebuginfo=2` (for full debuginfo). For example:

```bash
RUSTFLAGS=-Cdebuginfo=2 make dev
RUSTFLAGS=-Cdebuginfo=2 ./scripts/env cargo build
```

### Ask for help

If you encounter any problem during your journey, do not hesitate to reach out on the [TiDB Internals forum](https://internals.tidb.io/).
[`rustup`]: https://rustup.rs/
[`cargo`]: https://doc.rust-lang.org/cargo/
[gRPC]: https://github.com/grpc/grpc
[rustfmt]: https://github.com/rust-lang/rustfmt
[clippy]: https://github.com/rust-lang/rust-clippy
--------------------------------------------------------------------------------
/src/get-started/debug-and-profile.md:
--------------------------------------------------------------------------------
# Debug and Profile

In the previous chapter, we introduced how to build TiKV from source. In this chapter, we will focus on how to debug and profile TiKV from a developer's point of view.

## Prerequisites

* rust-gdb or rust-lldb

  [GDB](https://www.sourceware.org/gdb) and [LLDB](https://lldb.llvm.org/) are commonly used for debugging a program.
  * `rust-gdb` and `rust-lldb` are both installed together with `rustup`; however, they depend on `GDB` and `LLDB`, which need to be installed by yourself:
    ```bash
    Ubuntu: sudo apt-get install gdb/lldb
    CentOS: sudo yum install gdb/lldb
    ```
  * `GDB` and `LLDB` can also be used to debug Rust programs directly.
  * Basically, `rust-gdb` is a wrapper that loads external Python pretty-printing scripts into GDB. This is useful (and somewhat necessary) when debugging more complex Rust programs because it significantly improves the display of Rust data types. `rust-lldb` is similar. So `rust-gdb` and `rust-lldb` are recommended.
  * Choosing between `rust-gdb` and `rust-lldb` depends on the platform you are using and your familiarity with these tools. If you are new to debugging tools, `rust-lldb` is recommended on macOS, and `rust-gdb` is recommended on Linux distributions such as Ubuntu and CentOS.
* perf

  [Perf](https://perf.wiki.kernel.org/index.php/Main_Page) is a common Linux profiler. It's powerful: it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing).
It can be installed as follows:
  ```bash
  Ubuntu: sudo apt-get install linux-tools
  CentOS: sudo yum install perf
  ```

*For simplicity, we will introduce debugging with rust-gdb; the audience can also use rust-lldb.*

## Debug TiKV with rust-gdb

### Debug a unit test binary in TiKV

1. Build the unit test binary. For example, suppose we want to debug the test case [test_raw_get_key_ttl](https://github.com/tikv/tikv/blob/a7d1595f5486616be34e0cf2bbf372edb5f0e85a/src/storage/mod.rs#L5352-L5356).

   Firstly, we can get the binary file with a cargo command, like:
   ```bash
   cargo test -p tikv test_raw_get_key_ttl
   ```
   A binary file located at `target/debug/deps/tikv-some-hash` will be produced.

2. Debug the binary with rust-gdb:

   ```bash
   rust-gdb --args target/debug/deps/tikv-4a32c89a00a366cb test_raw_get_key_ttl
   ```
3. Now the standard GDB interface is shown, and we can debug the unit test with GDB commands. Here are some simple ones:

   * `r` (run) to start the program.
   * `b` (break) `file_name:line_number` to set a breakpoint.
   * `p` (print) `args` to print `args`.
   * `l` (list) to show the source code around the breakpoint.
   * `s` (step) to step into the function.
   * `n` (next) to step over the current line.
   * `c` (continue) to continue the program.
   * `watch` to set a data watchpoint.
An example of debugging a unit test named `test_raw_batch_get` is as follows:
* Build the `tikv` unit test binary with `cargo test -p tikv test_raw_batch_get`; the binary is located in `target/debug/deps/tikv-`
* Launch the binary with `rust-gdb`:
```
rust-gdb --args target/debug/deps/tikv- test_raw_batch_get
```
* Debug:

![gdb-tikv-ut](../media/gdb_tikv_ut.png)

As the marks in the above screenshot show, a breakpoint is first set at line `4650` of file `src/storage/mod.rs`, with the condition `api_version == 2`, which means the program only pauses when it hits this point and the variable `api_version` equals 2. Then `run` is executed and the program starts to run. The following steps are some examples of using GDB commands to step over and print.

### Debug a TiKV cluster with a specified tikv-server binary

1. Build the tikv-server binary following the guide in the [previous chapter](./build-tikv-from-source.md).
2. The binary files are located at `${TIKV_SOURCE_CODE}/target/debug/tikv-server`. A debug binary is recommended, as it keeps much useful debug information, such as source code, line numbers, and local variables.
3. TiUP is recommended for deploying a TiKV cluster. It's easy to deploy a local TiKV cluster with tiup playground; please refer to [Get started in 5 minutes](https://tikv.org/docs/5.1/concepts/tikv-in-5-minutes/#set-up-a-local-tikv-cluster-with-the-default-options). With TiUP, we can also specify the tikv-server binary file during deployment. The following is an example:

```bash
TIKV_BIN=~/tikv/target/release/tikv-server

tiup playground v5.0.4 --mode tikv-slim \
    --kv 3 --kv.binpath ${TIKV_BIN} --kv.config ./tikv_rawkv.toml
```

4. Now we get one TiKV cluster with three TiKV virtual nodes and one PD node. We can use `rust-gdb` to attach to the `tikv-server` process.
```bash
rust-gdb attach `pid of tikv-server`
```

The pid of tikv-server can be obtained with the following command:

```bash
ps -aux | grep tikv-server
```
Now the standard GDB interface is shown. The following steps are just the same as debugging a unit test binary.

## Profiling TiKV
When we want to find the CPU bottleneck of a program, we can use the [perf Linux profiler](https://www.brendangregg.com/perf.html) to find out which procedures consume CPU time, and how much. It can also be used for profiling TiKV. [FlameGraph](http://www.brendangregg.com/flamegraphs.html) can be used to visualize the stack traces as interactive SVGs. FlameGraph can be downloaded from here:
```
git clone https://github.com/brendangregg/FlameGraph.git
```

Here is one example:

1. Record performance data with `perf`:
```
perf record -g -p `pidof tikv-server`
```
2. Generate a text report:
```
perf report
```

3. Parse the perf data with a script:
```
perf script -i perf.data &> perf.unfold
```
4. Generate the flame graph:
```
./stackcollapse-perf.pl perf.unfold &> perf.folded

./flamegraph.pl perf.folded > perf.svg
```
5. We can open the `svg` file with `Chrome` or other browsers. With the flame graph, we can see the performance data more intuitively.

![flame_graph](../media/perf_flame.png)
--------------------------------------------------------------------------------
/src/understanding-tikv/coprocessor/intro.md:
--------------------------------------------------------------------------------
# TiKV Coprocessor

## Why Coprocessor?

TiKV is a distributed key-value storage engine made for TiDB. When TiDB executes a query, basically, it needs to retrieve and scan full rows from TiKV.
Consider the following query:

![TiKV Distributed Execution](../../media/tikv-distributed-execution.png)

Without the TiKV Coprocessor, TiDB needs to retrieve all rows from TiKV, and then scan and filter them on the TiDB side, even if we only have a single number as the query result. In order to reduce the network traffic for a query, TiDB pushes some computations down to the TiKV Coprocessor, which runs queries on the TiKV side.

A TiKV cluster is composed of multiple TiKV nodes, and thus the TiKV Coprocessor cannot run all queries which TiDB supports. On the one hand, TiKV itself only holds part of the data which a query needs. Therefore, for an aggregation query, each TiKV node can only calculate partial sums, and TiDB needs to aggregate the partial sums into the final sum. On the other hand, the TiKV Coprocessor only supports a limited number of executors. Complex operations like join cannot be done on TiKV.

In this article, we will introduce the basics of the TiKV Coprocessor from a developer's perspective.

## Overview of TiKV Coprocessor

The TiKV Coprocessor is able to handle multiple kinds of requests. Previously, we mentioned "push-down execution". This is handled by the DAG handler in the TiKV Coprocessor. The Coprocessor also supports Checksum and Analyze requests. At the same time, the TiKV Coprocessor is not limited to TiDB queries. In Coprocessor v2, developers can dynamically load Coprocessor plugins into a TiKV node, which can process any execution requests that need access to key-value data stored on the TiKV side.

The code for the TiKV Coprocessor is stored in the [src/coprocessor](https://github.com/tikv/tikv/tree/master/src/coprocessor) directory, and the folders beginning with `tidb_query_` in [components](https://github.com/tikv/tikv/tree/master/components) are used by the TiKV Coprocessor to run TiDB queries.

Here we focus on the DAG handler in the TiKV Coprocessor.
Running queries on the TiKV side requires two things: a plan to run the query, and the data needed by the query.

![TiKV Coprocessor Executors](../../media/copr-overview.png)

For a single query, TiDB will send the plan down to TiKV (the control flow), and TiKV will scan rows of the table from the local storage engine (the data flow). The TiKV Coprocessor executes the plan with data from the local storage engine, and sends the result back to TiDB.

The plan and the data exist in different forms throughout the query execution process. In the following parts, we will focus on how they exist on TiKV in each stage.

## The Plan

When TiDB needs to run a query on the TiKV Coprocessor, it will encode the plan in the form of protobuf messages. The schema of the query plan itself is defined in the [tipb](https://github.com/pingcap/tipb) repository. Inside tipb, we have all SQL expressions supported by TiKV (and TiDB), each with a unique ID, and we define the components used by a query, like executors and aggregators.

The plan is then sent to the gRPC service on the TiKV side, which requires another protobuf schema definition. The definition for that is in the [kvproto](https://github.com/pingcap/kvproto/blob/master/proto/coprocessor.proto) repository. The plan is encoded in the `data` field of the request. Each Coprocessor request will specify the key range (or region) to operate on.

```protobuf
message Request {
    kvrpcpb.Context context = 1;
    int64 tp = 2;
    bytes data = 3;
    uint64 start_ts = 7;
    repeated KeyRange ranges = 4;
    // ...
}
```

As a developer, you may want to see what plan is pushed down to TiKV when running a query. This can easily be done with an SQL `explain` statement.
```plain
MySQL [test]> explain select count(*) from test where x > 10;
+-----------------------------+---------+-----------+---------------+-------------------------------------------------+
| id                          | estRows | task      | access object | operator info                                   |
+-----------------------------+---------+-----------+---------------+-------------------------------------------------+
| StreamAgg_17                | 1.00    | root      |               | funcs:count(Column#4)->Column#2                 |
| └─TableReader_18            | 1.00    | root      |               | data:StreamAgg_9                                |
|   └─StreamAgg_9             | 1.00    | cop[tikv] |               | funcs:count(1)->Column#4                        |
|     └─TableRangeScan_16     | 3333.33 | cop[tikv] | table:test    | range:(10,+inf], keep order:false, stats:pseudo |
+-----------------------------+---------+-----------+---------------+-------------------------------------------------+
4 rows in set (0.002 sec)
```

A push-down plan is marked as `cop[tikv]` in the `task` column.

## The Data

### On-Disk Format

TiKV stores its data by rows in its local KV storage engine (as of now, RocksDB or Titan). On top of the local KV storage engine, there is an MVCC transactional layer, called TxnKV, which the Coprocessor reads data from. The key in TxnKV is composed of the table information and the primary key of a row, and the other columns of the row are stored as the value in the storage engine. For example,

```plain
t10_r1 --> ["TiDB", "SQL Layer", 10]
t10_r2 --> ["TiKV", "KV Engine", 20]
t10_r3 --> ["PD", "Manager", 30]
```

The TiDB row format (v2) is described in the [A new storage row format for efficient decoding](https://github.com/pingcap/tidb/blob/master/docs/design/2018-07-19-row-format.md) RFC.

### In-memory Format

The executors (to be introduced later) will scan the data from TxnKV in small batches, and store them in memory. Contrary to the on-disk format, data is stored in a columnar way in memory.
The memory format is called "Chunk format" in TiDB, and is in some ways very similar to the Apache Arrow format. The implementation details are described in the [Using chunk format in coprocessor framework](https://github.com/tikv/rfcs/blob/master/text/0043-copr-chunk.md) RFC, and in [this slide](https://docs.google.com/presentation/d/1fUQTJ6gEscHUag9OhIIePL9uiIYJ61TSpfor-pajBoE/edit#slide=id.g446c4deb4d_0_341). The format is implemented in [components/tidb_query_datatype](https://github.com/tikv/tikv/tree/master/components/tidb_query_datatype/src/codec/data_type), in the files beginning with `chunked_vec_`.

## Inside Coprocessor

![TiKV Coprocessor Internals](../../media/inside-copr.png)

The TiKV Coprocessor contains the necessary components to handle a cop read request, which include the expression framework, the aggregators, and the executors. We will cover these parts in the following chapters.
--------------------------------------------------------------------------------
/src/understanding-tikv/scalability/scheduling.md:
--------------------------------------------------------------------------------
# Scheduling

This section introduces the scheduling mechanism.

The scheduling mechanism works mainly by summarizing the collected real-time cluster status (including region heartbeats and store heartbeats), judging according to different strategies whether scheduling needs to be generated, and then generating a scheduling `Operator`, which is sent to TiKV to perform the scheduling. The `Operator` struct looks like:

```go
// Operator contains execution steps generated by scheduler.
type Operator struct {
	desc        string
	brief       string
	regionID    uint64
	regionEpoch *metapb.RegionEpoch
	kind        OpKind
	steps       []OpStep
	stepsTime   []int64 // step finish time
	currentStep int32
	status      OpStatusTracker
	level       core.PriorityLevel
}
```

Each scheduling `Operator` is used to perform a single region migration, which consists of several `OpStep`s: adding a peer, transferring the Raft group leader, and removing a peer. An `Operator` records the ID of the region it operates on, the name of the strategy that produced it, its priority level, etc. An `Operator` in PD may be generated by one of two components: the `checker` or the `scheduler`. A generated operator is stored in a map, and is returned to the corresponding TiKV through the response when the region heartbeat comes. Let us first look at how the checker and the scheduler work.

## Checker

After the PD server starts, a [background worker](https://github.com/tikv/pd/blob/release-5.2/server/cluster/coordinator.go#L90-L158) polls all regions and checks the health status of each region:

```go
func (c *coordinator) patrolRegions() {
	timer := time.NewTimer(c.cluster.GetPatrolRegionInterval())
	for {
		select {
		case <-timer.C:
			timer.Reset(c.cluster.GetPatrolRegionInterval())
		case <-c.ctx.Done():
			log.Info("patrol regions has been stopped")
			return
		}

		regions := c.cluster.ScanRegions(key, nil, patrolScanRegionLimit)
		for _, region := range regions {
			if c.checkRegion(region) {
				break
			}
		}
	}
}
```

In this function, `checkRegion` is executed to determine whether the region needs to be scheduled. If a scheduling `Operator` is generated, it will be sent to TiKV through the heartbeat of this region.
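The delivery path sketched above (a checker or scheduler generates an operator, the operator waits in a map keyed by region ID, and the next heartbeat of that region carries it back in the response) can be illustrated as follows. The types here are simplified stand-ins for illustration, not PD's actual `Operator` or controller definitions.

```go
package main

import "fmt"

// opStep and operator are simplified stand-ins for PD's OpStep and Operator.
type opStep string

type operator struct {
	regionID uint64
	desc     string
	steps    []opStep
}

// operatorController parks generated operators until the matching region
// heartbeat arrives, then hands them back for the heartbeat response.
type operatorController struct {
	waiting map[uint64]*operator
}

func newController() *operatorController {
	return &operatorController{waiting: map[uint64]*operator{}}
}

// addWaitingOperator is what a checker or scheduler calls after deciding
// that a region needs to move.
func (c *operatorController) addWaitingOperator(op *operator) {
	c.waiting[op.regionID] = op
}

// onRegionHeartbeat pops the pending operator for this region, if any;
// the caller attaches it to the heartbeat response sent back to TiKV.
func (c *operatorController) onRegionHeartbeat(regionID uint64) *operator {
	op, ok := c.waiting[regionID]
	if !ok {
		return nil
	}
	delete(c.waiting, regionID)
	return op
}

func main() {
	c := newController()
	c.addWaitingOperator(&operator{
		regionID: 42,
		desc:     "balance-region",
		steps:    []opStep{"add peer on store 4", "transfer leader to store 4", "remove peer on store 1"},
	})
	if op := c.onRegionHeartbeat(42); op != nil {
		fmt.Println(op.desc, len(op.steps)) // the operator is delivered with its steps
	}
}
```

Note how the operator is consumed on delivery: each generated operator drives exactly one region migration, matching the one-operator-per-region behavior described above.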
The initialization of the checkers can be found in `coordinator.go`, which mainly contains [four checkers](https://github.com/tikv/pd/blob/release-5.2/server/schedule/checker_controller.go#L64-L107): `RuleChecker`, `MergeChecker`, `JointStateChecker` and `SplitChecker`.

`RuleChecker` is the most critical checker. It checks whether a region has down peers or offline peers. It also checks whether the number of replicas of the current region matches the number of replicas specified in the [`Placement Rules`](https://docs.pingcap.com/tidb/stable/configure-placement-rules#placement-rules). If those conditions are met, it triggers the logic to add the missing replicas or delete the redundant ones. In addition, `RuleChecker` checks whether the current replicas of this region are placed in the most reasonable places, and if not, moves them to more reasonable ones.

`MergeChecker` checks whether the current region meets the merge conditions, such as whether the size of the region is less than `max-merge-region-size`, whether the number of keys in the region is less than `max-merge-region-keys`, and whether there has been no split operation recently. If these conditions are met, an adjacent region is selected to try to merge the two regions.

## Scheduler

Let's first take a look at how the scheduler runs. Schedulers run concurrently. Each scheduler has a [scheduler controller](https://github.com/tikv/pd/blob/release-5.2/server/cluster/coordinator.go#L762-L789), which runs in a background worker:
Each scheduler has a [scheduler controller](https://github.com/tikv/pd/blob/release-5.2/server/cluster/coordinator.go#L762-L789), which running in a background worker: 60 | 61 | ```go 62 | func (c *coordinator) runScheduler(s *scheduleController) { 63 | defer logutil.LogPanic() 64 | defer c.wg.Done() 65 | defer s.Cleanup(c.cluster) 66 | 67 | timer := time.NewTimer(s.GetInterval()) 68 | defer timer.Stop() 69 | 70 | for { 71 | select { 72 | case <-timer.C: 73 | timer.Reset(s.GetInterval()) 74 | if !s.AllowSchedule() { 75 | continue 76 | } 77 | if op := s.Schedule(); len(op) > 0 { 78 | added := c.opController.AddWaitingOperator(op...) 79 | log.Debug("add operator", zap.Int("added", added), zap.Int("total", len(op)), zap.String("scheduler", s.GetName())) 80 | } 81 | 82 | case <-s.Ctx().Done(): 83 | log.Info("scheduler has been stopped", 84 | zap.String("scheduler-name", s.GetName()), 85 | errs.ZapError(s.Ctx().Err())) 86 | return 87 | } 88 | } 89 | } 90 | ``` 91 | 92 | Similar to the checker, when the PD starts, the specified scheduler will be added according to the configuration. Each scheduler runs in a goroutine by executing `go runScheduler`, and then executes the `Schedule()` function at a dynamically adjusted time interval. 93 | 94 | There are two things that a function has to do. The first is to execute the scheduling logic of the corresponding scheduler to determine whether to generate an `Operator`, and the other is to determine the time interval for the next execution of `Schedule()`. 95 | 96 | PD contains many schedulers. For details, you can check the `server/schedulers` package, which contains the implementation of all schedulers. The schedulers that PD will run by default include `balance-region-scheduler`, `balance-leader-scheduler`, and `balance-hot-region-scheduler`. 
Let's take a look at the specific functions of these schedulers:

* The [`balance-region-scheduler`](https://github.com/tikv/pd/blob/release-5.2/server/schedulers/balance_region.go#L136-L144) calculates a score based on the total region size on a store and the amount of available space, and then, according to the calculated scores, generates balance-region `Operator`s that distribute regions evenly across the stores. The available space is considered here because, in practice, different stores may have different storage capacities.
* The [`balance-leader-scheduler`](https://github.com/tikv/pd/blob/release-5.2/server/schedulers/balance_leader.go#L137-L153) is similar to the balance-region-scheduler. It calculates a score based on the leader count, and then generates balance-leader `Operator`s that distribute the leaders as evenly as possible across the stores.
* The [`balance-hot-region-scheduler`](https://github.com/tikv/pd/blob/release-5.2/server/schedulers/hot_region.go#L152-L155) implements the logic of hot spot scheduling. For TiDB, if hot spots exist and only a few stores carry them, the overall resource utilization of the system is lowered, and a system bottleneck easily forms. Therefore, in response to this situation, PD needs to detect the hot spots and, by generating the corresponding scheduling operators, scatter them across all stores as much as possible, so that all stores can share the read and write pressure.

There are some other schedulers to choose from. Each scheduler in PD implements an interface called [Scheduler](https://github.com/tikv/pd/blob/release-5.2/server/schedule/scheduler.go#L33-L45):

```go
// Scheduler is an interface to schedule resources.
type Scheduler interface {
	http.Handler
	GetName() string
	// GetType should in accordance with the name passing to schedule.RegisterScheduler()
	GetType() string
	EncodeConfig() ([]byte, error)
	GetMinInterval() time.Duration
	GetNextInterval(interval time.Duration) time.Duration
	Prepare(cluster opt.Cluster) error
	Cleanup(cluster opt.Cluster)
	Schedule(cluster opt.Cluster) []*operator.Operator
	IsScheduleAllowed(cluster opt.Cluster) bool
}
```

The most important member is the `Schedule()` interface function, which implements the specific scheduling logic of each scheduler.

In addition, the interface function `IsScheduleAllowed()` determines whether the scheduler is allowed to execute its scheduling logic. Before executing the scheduling logic, each scheduler first calls this function to determine whether the scheduling rate limit has been exceeded. Specifically, in the code, the caller is `AllowSchedule()`, which in turn calls the `IsScheduleAllowed()` method implemented by each scheduler.

PD can control the speed at which schedulers generate operators by setting limits, but a limit here simply maintains a window size, and different operator types have their own window sizes. For example, the `balance-region` and `balance-hot-region` schedulers generate operators related to region migration, and such operators have the type `OpRegion`. We can control operators of the `OpRegion` type by adjusting the `region-schedule-limit` parameter. The specific operator type definitions can be found in the file [`operator.go`](https://github.com/tikv/pd/blob/release-5.2/server/schedule/operator/operator.go). An operator may belong to multiple types. For example, an operator generated by `balance-hot-region` may belong to both `OpRegion` and `OpHotRegion`.
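The per-type window described above can be sketched as a small admission check: an operator carries a bitmask of kinds, and it is admitted only if every kind it carries still has room in its window. The kind values and limiter shape below are illustrative, not PD's actual implementation.

```go
package main

import "fmt"

// opKind is a bitmask, mirroring the idea (not the exact values) of PD's
// OpKind: one operator can carry several kinds at once, e.g. OpRegion|OpHotRegion.
type opKind uint32

const (
	opRegion opKind = 1 << iota
	opLeader
	opHotRegion
)

// limiter keeps a window of in-flight operators per kind; an operator is
// admitted only if every limited kind it carries is under its window size
// (think region-schedule-limit, leader-schedule-limit).
type limiter struct {
	limits  map[opKind]int
	running map[opKind]int
}

func (l *limiter) allow(kind opKind) bool {
	for k, window := range l.limits {
		if kind&k != 0 && l.running[k] >= window {
			return false
		}
	}
	return true
}

// start admits an operator and charges it against every limited kind it carries.
func (l *limiter) start(kind opKind) bool {
	if !l.allow(kind) {
		return false
	}
	for k := range l.limits {
		if kind&k != 0 {
			l.running[k]++
		}
	}
	return true
}

func main() {
	l := &limiter{
		limits:  map[opKind]int{opRegion: 2, opLeader: 4},
		running: map[opKind]int{},
	}
	fmt.Println(l.start(opRegion | opHotRegion)) // true: region window has room
	fmt.Println(l.start(opRegion))               // true: second slot in the window
	fmt.Println(l.start(opRegion))               // false: region window is full
}
```

An operator of type `OpRegion|OpHotRegion` is counted against the `OpRegion` window here, which is why lowering `region-schedule-limit` also throttles hot-region migration operators.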
## More

This section mainly introduced the main operation flow of PD scheduling. For more details, you can continue by studying the corresponding code. Also, you are welcome to contribute: see the [good first issues](https://github.com/tikv/pd/contribute).

See more information about PD implementation on [its Wiki page](https://github.com/tikv/pd/wiki).
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | --------------------------------------------------------------------------------