├── README.md
├── binlog_group.md
├── cats.md
├── group_replication_jitter.md
├── images
├── 149ca62fd014bd12b60c77573a49757d.gif
├── 185beb2c6e0524d93abde3b25ecedc61.png
├── 19a92bf7065627b28a3403c000eba095.png
├── 290e969038561faf1ace6e108945b2d6.png
├── 2eba13133f9eb007d89459cce5d4055b.png
├── 3a5d10caf25003e7ea4dcace59a181f6.gif
├── 4309ffee12d2eed2548845d1e1d2e848.png
├── 47c39411c3240713c75e848ef5ef4b59.png
├── 47f963cdd950abd91a459ffb66a3744e.png
├── 4ca52ffeebc49306e76c74ed9062257d.png
├── 4cc389ee95fbae485f1e014aad393aa8.gif
├── 4f0ea97ad117848a71148849705e311e.png
├── 5136495bbdfbe2cefac98d74bd36a88f.png
├── 55787cfe149ebf1c574aa1da4eea678c.png
├── 5c6f81f69eac7f61744ba3bc035b29e7.png
├── 63cf3b8bd556c6b1ce5a0436883c8f7b.png
├── 6670e70b6d5f0f5152f643c153d13487.png
├── 66a031ab14d56289d0987c65c73323af.png
├── 68eb47a320bce3bb717a2b690da4573d.png
├── 6a6091b51c602923e5fe8b33e4422882.png
├── 77d0c0bdc5ce8574c6ad319864abb032.gif
├── 853d21533f748c1c56a4151869a82a27.gif
├── 8f9080ee71094d948ab7592b449954bb.png
├── a54faa33502b8c17066b1e2af09bdbb0.png
├── b5a20f07b4812ea9f34ffb8e8783b73a.png
├── b89dd5988d7d7ead6923dbc2d20e146c.png
├── bb111919295b7678530a1adcfa8b7d29.png
├── cba8723ae722e8d2d13b94e0cf1fda7a.png
├── ccb771014f600402fee72ca7134aea10.gif
├── ce8a6d4b9cd6df4c48cf914fae8a70d2.png
├── da7d6c8c12d18b915018939970d2b911.png
├── e2554f0ea244f337c0e66ea34bf53edf.gif
├── f25b3e3bc94bed108b0c454413f79873.png
├── f67a3249ad8b040417f195c3ca11f795.png
├── image-20240829080659528.png
├── image-20240829081950365.png
├── image-20240829092940314.png
├── image-20240829092959775.png
├── image-20240829094221268.png
├── image-20240829100432417.png
├── image-20240829101111288.png
├── image-20240829101222447.png
├── image-20240829101254601.png
├── image-20240829101332034.png
├── image-20240829101534550.png
├── image-20240829101612063.png
├── image-20240829101632142.png
├── image-20240829101650730.png
├── image-20240829101712694.png
├── image-20240829102323261.png
├── image-20240829102344815.png
├── image-20240829102642856.png
├── image-20240829102703396.png
├── image-20240829103131857.png
├── image-20240829103236734.png
├── image-20240829103259992.png
├── image-20240829104222478.png
├── image-20240829104512068.png
├── image-20240829104533718.png
├── image-20240829104554155.png
├── image-20240829104639402.png
├── image-20240829104851205.png
├── image-20240829105258689.png
├── image-20240829105418671.png
├── image-20240829105441723.png
├── image-20240829105506722.png
├── image-20240829110009589.png
├── image-20240829110903709.png
├── image-20240829110922449.png
├── image-20240829110943245.png
├── image-20240829113916829.png
├── image-20240829113948830.png
├── image-20240829114017540.png
├── image-20240829114037360.png
├── image-20240829114100020.png
├── image-20240829151823981.png
├── image-20240904142951908.png
├── image-20240904201934894.png
├── image-20240905163847317.png
├── image-20240905163915360.png
├── image-20240905163959058.png
├── image-20240905222509513.png
├── image-bulk-insert-degrade.png
├── image-bulk-insert-optimize.png
├── image-degrade.png
├── image-join-degrade.png
├── image-join-improve.png
└── tcpcopy.png
├── innodb_storage.md
├── isolation.md
├── logical_reasoning_samples.md
├── new_group_replication.md
├── paxos.md
├── paxos_log.md
├── performance_degradation.md
├── pgo.md
├── professional_journey.md
├── scalability.md
├── static_array.md
├── sysbench_perf_degradation.md
└── sysbench_vs_benchmarksql.md
/README.md:
--------------------------------------------------------------------------------
1 | # Blogs
2 |
3 | This repository extracts some insightful technical points from the book "The Art of Problem-Solving in Software Engineering: How to Make MySQL Better".
4 |
5 | # Professional Journey:
6 |
7 | [A Brief Overview of My Professional Journey](professional_journey.md)
8 |
9 | # Table of Contents
10 |
11 | [Analysis of the Root Causes of MySQL Performance Degradation](sysbench_perf_degradation.md)
12 |
13 | [How can the scalability of MySQL be improved?](scalability.md)
14 |
15 | [How to explain why Repeatable Read surprisingly outperforms Read Committed?](isolation.md)
16 |
17 | [The Significant Differences Between BenchmarkSQL and SysBench](sysbench_vs_benchmarksql.md)
18 |
19 | [Profile-guided optimization](pgo.md)
20 |
21 | [Performance Degradation in Query Execution Plans](performance_degradation.md)
22 |
23 | [Enhancing the InnoDB Storage Engine](innodb_storage.md)
24 |
25 | [Improving Binlog Group Commit Scalability](binlog_group.md)
26 |
27 | [Evaluating Performance Gains in MySQL Lock Scheduling Algorithms](cats.md)
28 |
29 | [The Powerful Impact of Static Arrays in MySQL Modifications](static_array.md)
30 |
31 | [Lack of Durability: Paxos Not Committed to Disk](paxos_log.md)
32 |
33 | [How to Mitigate Performance Fluctuations in MySQL Group Replication?](group_replication_jitter.md)
34 |
35 | [Logical Reasoning Samples](logical_reasoning_samples.md)
36 |
37 | [Traditional Group Replication vs New Group Replication](new_group_replication.md)
38 |
39 | [The Problems with Paxos Variant Algorithms in Group Replication](paxos.md)
40 |
41 | # Book Link:
42 |
43 | [The Art of Problem-Solving in Software Engineering: How to Make MySQL Better](https://github.com/enhancedformysql/The-Art-of-Problem-Solving-in-Software-Engineering_How-to-Make-MySQL-Better).
44 |
45 | # Errata
46 |
47 | If you find any errata in the book, [please open a new issue](https://github.com/enhancedformysql/The-Art-of-Problem-Solving-in-Software-Engineering_How-to-Make-MySQL-Better/issues).
48 |
--------------------------------------------------------------------------------
/binlog_group.md:
--------------------------------------------------------------------------------
1 | # Improving Binlog Group Commit Scalability
2 |
3 | The binlog group commit mechanism is quite complex, and this complexity makes it challenging to identify its inherent performance problems.
4 |
5 | First, performance problems were captured with the *perf* tool during a TPC-C test at 500 concurrency, as shown in the following figure:
6 |
7 | 
8 |
9 | Figure 1. *_pthread_mutex_cond_lock* bottleneck reveals performance problems.
10 |
11 | It is evident that *_pthread_mutex_cond_lock* is a significant bottleneck, accounting for approximately 9.5% of the overhead. Although *perf* does not directly pinpoint the exact problem, it indicates the presence of this bottleneck.
12 |
13 | To address the problem, an in-depth exploration of MySQL internals was conducted to uncover the factors contributing to this performance bottleneck. A conventional binary search approach with minimal logging was used to identify functions or code segments that incur significant overhead during execution. The minimal logging approach was chosen to reduce performance interference while diagnosing the root cause of the problem. Excessive logging can disrupt performance analysis, and while some may use MySQL's internal mechanisms for troubleshooting, these often introduce substantial performance overhead themselves.
14 |
15 | After thorough investigation, the bottleneck was identified within the following code segment.
16 |
17 | ```c++
18 | /*
19 | If the queue was not empty, we're a follower and wait for the
20 | leader to process the queue. If we were holding a mutex, we have
21 | to release it before going to sleep.
22 | */
23 | if (!leader) {
24 | CONDITIONAL_SYNC_POINT_FOR_TIMESTAMP("before_follower_wait");
25 | mysql_mutex_lock(&m_lock_done);
26 | ...
27 | ulonglong start_wait_time = my_micro_time();
28 | while (thd->tx_commit_pending) {
29 | if (stage == COMMIT_ORDER_FLUSH_STAGE) {
30 | mysql_cond_wait(&m_stage_cond_commit_order, &m_lock_done);
31 | } else {
32 | mysql_cond_wait(&m_stage_cond_binlog, &m_lock_done);
33 | }
34 | }
35 | ulonglong end_wait_time = my_micro_time();
36 | ulonglong wait_time = end_wait_time - start_wait_time;
37 | if (wait_time > 100000) {
38 | fprintf(stderr, "wait too long:%llu\n", wait_time);
39 | }
40 | mysql_mutex_unlock(&m_lock_done);
41 | return false;
42 | }
43 | ```
44 |
45 | Numerous occurrences of 'wait too long' output indicated that the bottleneck had been exposed. To investigate why 'wait too long' was being reported, the logging was extended and modified accordingly. See the specific code below:
46 |
47 | ```c++
48 | /*
49 | If the queue was not empty, we're a follower and wait for the
50 | leader to process the queue. If we were holding a mutex, we have
51 | to release it before going to sleep.
52 | */
53 | if (!leader) {
54 | CONDITIONAL_SYNC_POINT_FOR_TIMESTAMP("before_follower_wait");
55 | mysql_mutex_lock(&m_lock_done);
56 | ...
57 | ulonglong start_wait_time = my_micro_time();
58 | while (thd->tx_commit_pending) {
59 | if (stage == COMMIT_ORDER_FLUSH_STAGE) {
60 | mysql_cond_wait(&m_stage_cond_commit_order, &m_lock_done);
61 | } else {
62 | mysql_cond_wait(&m_stage_cond_binlog, &m_lock_done);
63 | }
64 | fprintf(stderr, "wake up thread:%p,total wait time:%llu, stage:%d\n",
65 | thd, my_micro_time() - start_wait_time, stage);
66 | }
67 | ulonglong end_wait_time = my_micro_time();
68 | ulonglong wait_time = end_wait_time - start_wait_time;
69 | if (wait_time > 100000) {
70 | fprintf(stderr, "wait too long:%llu for thread:%p\n", wait_time, thd);
71 | }
72 | mysql_mutex_unlock(&m_lock_done);
73 | return false;
74 | }
75 | ```
76 |
77 | After another round of testing, a peculiar phenomenon was observed: when 'wait too long' messages appeared, the 'wake up thread' logs showed that many user threads were awakened multiple times.
78 |
79 | The problem was traced to the *thd->tx_commit_pending* value not changing, causing threads to repeatedly re-enter the wait process. Further inspection revealed the conditions under which this variable becomes false, as illustrated in the following code:
80 |
81 | ```c++
82 | void Commit_stage_manager::signal_done(THD *queue, StageID stage) {
83 | mysql_mutex_lock(&m_lock_done);
84 | for (THD *thd = queue; thd; thd = thd->next_to_commit) {
85 | thd->tx_commit_pending = false;
86 | thd->rpl_thd_ctx.binlog_group_commit_ctx().reset();
87 | }
88 | /* if thread belong to commit order wake only commit order queue threads */
89 | if (stage == COMMIT_ORDER_FLUSH_STAGE)
90 | mysql_cond_broadcast(&m_stage_cond_commit_order);
91 | else
92 | mysql_cond_broadcast(&m_stage_cond_binlog);
93 | mysql_mutex_unlock(&m_lock_done);
94 | }
95 | ```
96 |
97 | From the code, it is evident that *thd->tx_commit_pending* is set to false in the *signal_done* function. The *mysql_cond_broadcast* function then activates all waiting threads, leading to a situation similar to a thundering herd problem. When all previously waiting user threads are activated, they check if tx_commit_pending has been set to false. If it has, they proceed with processing; otherwise, they continue waiting.
98 |
99 | Despite the complexity of the binlog group commit mechanism, a straightforward analysis identifies the root cause: threads that should not be activated are being triggered, leading to unnecessary context switches with each activation.
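
As an illustration of the point-to-point activation idea adopted below, here is a standalone sketch (ordinary C++ threads, not MySQL code; the names `Follower`, `leader_signal_done`, and `follower_wait` are purely illustrative). Each follower waits on its own condition variable, so the leader wakes only threads whose commit has actually completed, and no thread is activated more than once:

```c++
// Standalone sketch: each follower owns a mutex, a condition variable, and a
// commit_pending flag (the analogue of thd->tx_commit_pending).
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Follower {
  std::mutex m;
  std::condition_variable cv;
  bool commit_pending = true;
};

// Point-to-point activation: set the flag and signal only that follower.
void leader_signal_done(std::vector<Follower> &followers) {
  for (auto &f : followers) {
    {
      std::lock_guard<std::mutex> lk(f.m);
      f.commit_pending = false;
    }
    f.cv.notify_one();  // wakes exactly this waiter, exactly once
  }
}

void follower_wait(Follower &f, int id) {
  std::unique_lock<std::mutex> lk(f.m);
  // The predicate is already true when the thread resumes, so there is no
  // broadcast-style re-check-and-sleep loop.
  f.cv.wait(lk, [&f] { return !f.commit_pending; });
  std::printf("follower %d resumed once\n", id);
}

int main() {
  std::vector<Follower> followers(4);
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(follower_wait, std::ref(followers[i]), i);
  leader_signal_done(followers);
  for (auto &t : threads) t.join();
  return 0;
}
```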
100 |
101 | During one test, additional statistics were collected on the number of times user threads entered the wait state. The details are shown in the following figure:
102 |
103 |
104 |
105 | Figure 2. Statistics of threads activated 1, 2, 3 times.
106 |
107 | Waiting once is normal and indicates 100% efficiency. Waiting twice suggests 50% efficiency, and waiting three times indicates 33.3% efficiency. Based on the figure, the overall activation efficiency is calculated to be 52.7%.
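
One natural reading, consistent with the per-thread figures above, is that the overall activation efficiency is the number of necessary wake-ups (one per thread) divided by the total number of wake-ups:

$$\text{efficiency} = \frac{\sum_{k} n_k}{\sum_{k} k \cdot n_k}$$

where $n_k$ denotes the number of user threads that were activated $k$ times before proceeding.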
108 |
109 | The ideal solution would be a multicast activation mechanism with 100% efficiency, in which only the user threads whose tx_commit_pending has been set to false are activated. However, implementing this requires a deep understanding of the complex logic behind binlog group commit.
110 |
111 | In this case, a point-to-point activation mechanism is used, achieving 100% efficiency but introducing significant system call overhead. The following figure illustrates the relationship between TPC-C throughput and concurrency before and after optimization.
112 |
113 |
114 |
115 | Figure 3. Impact of group commit optimization with innodb_thread_concurrency=128.
116 |
117 | From the figure, it is evident that with innodb_thread_concurrency=128, the optimization of binlog group commit significantly improves throughput under high concurrency.
118 |
119 | It's important to note that this optimization's effectiveness can vary depending on factors such as configuration settings and specific scenarios. However, overall, it notably improves throughput, especially in high concurrency conditions.
120 |
121 | Below is the comparison of TPC-C throughput and concurrency before and after optimization using standard configurations:
122 |
123 |
124 |
125 | Figure 4. Impact of group commit optimization using standard configurations.
126 |
127 | From the figure, it is clear that the improvement is less pronounced than in the previous test, but there is still an overall gain. Extensive testing indicates that the worse the scalability of MySQL, the more significant the effectiveness of the binlog group commit optimization.
128 |
129 | At the same time, the previously identified bottleneck of *_pthread_mutex_cond_lock* has been significantly alleviated after optimization, as shown in the following figure:
130 |
131 | 
132 |
133 | Figure 5. Mitigation of *_pthread_mutex_cond_lock* bottleneck.
134 |
135 | In summary, this optimization helps address scalability problems associated with binlog group commit.
136 |
137 | ## References:
138 |
139 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
--------------------------------------------------------------------------------
/cats.md:
--------------------------------------------------------------------------------
1 | # Evaluating Performance Gains in MySQL Lock Scheduling Algorithms
2 |
3 | Scheduling is crucial in computer system design. The right policy can significantly reduce mean response time without needing faster machines, effectively improving performance for free. Scheduling also optimizes other metrics, such as user fairness and differentiated service levels, ensuring some job classes have lower mean delays than others [1].
4 |
5 | MySQL 8.0 uses the Contention-Aware Transaction Scheduling (CATS) algorithm to prioritize transactions waiting for locks. When multiple transactions compete for the same lock, CATS determines the priority based on scheduling weight, calculated by the number of transactions a given transaction blocks. The transaction blocking the most others gets higher priority; if weights are equal, the longest waiting transaction goes first.
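
As a rough illustration of this scheduling rule, the following simplified sketch (not the actual InnoDB code; `Waiter` and `pick_next` are illustrative names) selects the waiting transaction with the largest scheduling weight and breaks ties in favor of the longest-waiting transaction:

```c++
// Simplified sketch of the CATS selection rule described above.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Waiter {
  uint64_t weight;      // number of transactions this waiter blocks
  uint64_t wait_start;  // smaller value = started waiting earlier
  uint64_t trx_id;
};

// Grant the lock to the waiter blocking the most transactions;
// on equal weights, prefer the one that has waited the longest.
const Waiter *pick_next(const std::vector<Waiter> &waiters) {
  if (waiters.empty()) return nullptr;
  return &*std::max_element(
      waiters.begin(), waiters.end(), [](const Waiter &a, const Waiter &b) {
        if (a.weight != b.weight) return a.weight < b.weight;
        return a.wait_start > b.wait_start;  // older wait wins the tie
      });
}
```

Note that if all weights are made equal, the tie-break on waiting time degenerates into plain FIFO order, which is why the worklog quoted below describes FCFS as easy to emulate by setting equal schedule weights.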
6 |
7 | A deadlock occurs when multiple transactions cannot proceed because each holds a lock needed by another, causing all involved to wait indefinitely without releasing their locks.
8 |
9 | After understanding the MySQL lock scheduling algorithm, let's examine how this algorithm affects throughput. Before testing, it is necessary to understand the previous FIFO algorithm and how to restore it. For relevant details, refer to the git log explanations provided below.
10 |
11 | ```
12 | This WL improves the implementation of CATS to the point where the FCFS will be redundant (as often slower, and easy to "emulate" by setting equal schedule weights in CATS), so it removes FCFS from the code, further simplifying the lock_sys's logic.
13 | ```
14 |
15 | Based on the above description, restoring the FIFO lock scheduling algorithm in MySQL is straightforward. Subsequently, throughput was tested in the improved MySQL 8.0.32 using SysBench Pareto distribution scenarios at varying concurrency levels. Details are provided in the following figure.
16 |
17 |
18 |
19 | Figure 1. Impact of CATS on throughput at various concurrency levels.
20 |
21 | From the figure, it can be seen that the throughput of the CATS algorithm significantly exceeds that of the FIFO algorithm. To compare these two algorithms in terms of user response time, refer to the following figure.
22 |
23 |
24 |
25 | Figure 2. Impact of CATS on response time at various concurrency levels.
26 |
27 | From the figure, it can be seen that the CATS algorithm provides significantly better user response times.
28 |
29 | Furthermore, deadlock error statistics were compared during the Pareto distribution tests; details can be found in the following figure.
30 |
31 |
32 |
33 | Figure 3. Impact of CATS on ignored errors at various concurrency levels.
34 |
35 | Comparative analysis shows that the CATS algorithm significantly reduces deadlocks. This reduction in deadlocks likely plays a key role in improving performance. The theoretical basis for this correlation is as follows [2]:
36 |
37 | *Under a high-contention setting, the throughput of the target system will be determined by the concurrency control mechanism of the target system: systems which can release locks earlier or reduce the number of aborts will have advantages in such a setting.*
38 |
39 | The above test results align closely with MySQL's official findings. The following figure, based on official tests [3], demonstrates the significant effectiveness of the CATS algorithm.
40 |
41 | 
42 |
43 | Figure 4. Comparison of CATS and FIFO in TPS and mean latency: insights from the MySQL blog.
44 |
45 | Additionally, MySQL's official requirements for implementing the CATS algorithm are stringent. Specific details are provided in the following figure:
46 |
47 | 
48 |
49 | Figure 5. Requirements of the official worklog for CATS.
50 |
51 | Therefore, according to these requirements, adopting the CATS algorithm should not cause performance degradation in any scenario. The matter might seem settled here, but the summary of the CATS paper [1] raises some doubts. Details are provided in the following figure:
52 |
53 | 
54 |
55 | Figure 6. Doubts about the CATS paper.
56 |
57 | From the information above, it can be inferred that either the industry has overlooked potential flaws in FIFO, or the paper's assessment is flawed, and FIFO does not have the serious problems suggested. This contradiction highlights a critical problem: one of these conclusions must be flawed; both cannot be correct.
58 |
59 | Contradictions often present valuable opportunities for in-depth problem analysis and resolution. They highlight areas where existing understanding may be challenged or where new insights can be gained.
60 |
61 | This time, testing on the improved MySQL 8.0.27 revealed significant bottlenecks under severe conflicts, along with a large number of error logs in the MySQL error log file. Below is a partial screenshot:
64 |
65 | 
66 |
67 | Figure 7. Partial screenshot of numerous error logs.
68 |
69 | Continuing the analysis of the corresponding code, the specifics are as follows:
70 |
71 | ```c++
72 | void Deadlock_notifier::notify(const ut::vector<const trx_t *> &trxs_on_cycle,
73 | const trx_t *victim_trx) {
74 | ut_ad(locksys::owns_exclusive_global_latch());
75 | start_print();
76 | const auto n = trxs_on_cycle.size();
77 | for (size_t i = 0; i < n; ++i) {
78 | const trx_t *trx = trxs_on_cycle[i];
79 | const trx_t *blocked_trx = trxs_on_cycle[0 < i ? i - 1 : n - 1];
80 | const lock_t *blocking_lock =
81 | lock_has_to_wait_in_queue(blocked_trx->lock.wait_lock, trx);
82 | ut_a(blocking_lock);
83 | print_title(i, "TRANSACTION");
84 | print(trx, 3000);
85 | print_title(i, "HOLDS THE LOCK(S)");
86 | print(blocking_lock);
87 | print_title(i, "WAITING FOR THIS LOCK TO BE GRANTED");
88 | print(trx->lock.wait_lock);
89 | }
90 | const auto victim_it =
91 | std::find(trxs_on_cycle.begin(), trxs_on_cycle.end(), victim_trx);
92 | ut_ad(victim_it != trxs_on_cycle.end());
93 | const auto victim_pos = std::distance(trxs_on_cycle.begin(), victim_it);
94 | ut::ostringstream buff;
95 | buff << "*** WE ROLL BACK TRANSACTION (" << (victim_pos + 1) << ")\n";
96 | print(buff.str().c_str());
97 | DBUG_PRINT("ib_lock", ("deadlock detected"));
98 | ...
99 | lock_deadlock_found = true;
100 | }
101 | ```
102 |
103 | From the code analysis, it's clear that deadlocks lead to a substantial amount of log output. The ignored errors observed during testing are connected to these deadlocks. The CATS algorithm helps reduce the number of ignored errors, resulting in fewer log outputs. This problem can be consistently reproduced.
104 |
105 | Given this context, several considerations emerge:
106 |
107 | 1. **Impact on Performance Testing:** The extensive error logs and the resulting disruptions could potentially skew the performance evaluation, leading to inaccurate assessments of the system's capabilities.
108 | 2. **Effectiveness of the CATS Algorithm:** The performance improvement of the CATS algorithm may need re-evaluation. If the extensive output of error logs significantly impacts performance, its actual effectiveness may not be as high as initially believed.
109 |
110 | To eliminate this interference, either set `innodb_print_all_deadlocks=OFF`, or remove all logging from the `Deadlock_notifier::notify` function and recompile MySQL; then run SysBench read-write tests with a Pareto distribution. Details are provided in the following figure:
111 |
112 |
113 |
114 | Figure 8. Impact of CATS on throughput at various concurrency levels for improved MySQL 8.0.27 after eliminating interference.
115 |
116 | From the figure, it is evident that there has been a significant change in throughput comparison. In scenarios with severe conflicts, the CATS algorithm slightly outperforms the FIFO algorithm, but the difference is minimal and much less pronounced than in previous tests. Note that these tests were conducted on the improved MySQL 8.0.27.
117 |
118 | Let's conduct performance comparison tests on the improved MySQL 8.0.32, with deadlock log interference removed, using Pareto distribution.
119 |
120 |
121 |
122 | Figure 9. Impact of CATS on throughput at various concurrency levels for improved MySQL 8.0.32 after eliminating interference.
123 |
124 | From the figure, it is evident that removing the interference leaves only a slight performance difference. This small difference makes it understandable why the supposed severity of the FIFO scheduling problems is difficult to notice. The bias in the conclusions of the CATS authors and MySQL officials is likely due to interference from the extensive deadlock log output.
125 |
126 | Using the same 32 warehouses as in the CATS algorithm paper, TPC-C tests were conducted at various concurrency levels. MySQL was based on the improved MySQL 8.0.27, and BenchmarkSQL was modified to support 100 concurrent transactions per warehouse.
127 |
128 |
129 |
130 | Figure 10. Impact of CATS on throughput at different concurrency levels under NUMA after eliminating interference, according to the CATS paper.
131 |
132 | From the figure, it's evident that the CATS algorithm performs worse than the FIFO algorithm. To avoid NUMA-related interference, MySQL was bound to NUMA node 0 for a new round of throughput versus concurrency tests.
133 |
134 |
135 |
136 | Figure 11. Impact of CATS on throughput at different concurrency levels under SMP after eliminating interference, according to the CATS paper.
137 |
138 | In this round of testing, the FIFO algorithm continued to outperform the CATS algorithm. The decline in performance of the CATS algorithm in BenchmarkSQL TPC-C testing compared to improvements in SysBench Pareto testing can be attributed to the following reasons:
139 |
140 | 1. **Additional Overhead**: The CATS algorithm inherently introduces some extra overhead.
141 | 2. **NUMA Environment Problems**: The CATS algorithm may not perform optimally in NUMA environments.
142 | 3. **Conflict Severity**: The conflict severity in TPC-C testing is less pronounced than in SysBench Pareto testing.
143 | 4. **Different Concurrency Scenarios**: SysBench creates concurrency scenarios that differ significantly from those in BenchmarkSQL.
144 |
145 | Finally, standard TPC-C testing was performed again with 1000 warehouses at varying concurrency levels. Specific details are shown in the following figure:
146 |
147 |
148 |
149 | Figure 12. Impact of CATS on BenchmarkSQL throughput after eliminating interference.
150 |
151 | From the figure, it is evident that there is little difference between the two algorithms in low-conflict scenarios. In other words, the CATS algorithm does not offer significant benefits in situations with fewer conflicts.
152 |
153 | Overall, while CATS shows some improvement in Pareto testing, it is less pronounced than expected. The CATS algorithm significantly reduces transaction deadlocks, potentially resulting in less performance degradation than the FIFO algorithm. When deadlock logs are suppressed, the difference between these algorithms is minimal, clarifying the confusion surrounding the CATS algorithm's performance [4].
154 |
155 | Database performance testing is inherently complex and error-prone [5]. It cannot be judged by data alone and requires thorough investigation to ensure logical consistency.
156 |
157 | ## References
158 |
159 | [1] B. Tian, J. Huang, B. Mozafari, and G. Schoenebeck. Contention-aware lock scheduling for transactional databases. PVLDB, 11(5), 2018.
160 |
161 | [2] Y. Wang, M. Yu, Y. Hui, F. Zhou, Y. Huang, R. Zhu, et al. 2022. A study of database performance sensitivity to experiment settings. Proceedings of the VLDB Endowment, vol. 15, no. 7.
162 |
163 | [3] Sunny Bains. 2017. Contention-Aware Transaction Scheduling Arriving in InnoDB to Boost Performance. https://dev.mysql.com/blog-archive/.
164 |
165 | [4] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
166 |
167 | [5] M. Raasveldt, P. Holanda, T. Gubner, and H. Muhleisen. Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing. In 7th International Workshop on Testing Database Systems, DBTest, 2:1--2:6, 2018.
--------------------------------------------------------------------------------
/group_replication_jitter.md:
--------------------------------------------------------------------------------
1 | # How to Mitigate Performance Fluctuations in MySQL Group Replication Under Abnormal Conditions?
2 |
3 | ## 1.1 Enhancing the Failure Detection Mechanism
4 |
5 | Accurately detecting node failure is challenging due to the FLP impossibility result, which states that consensus is impossible in a purely asynchronous system if even one process can fail. The difficulty arises because a server can't distinguish if another server has failed or is just "very slow" when it receives no messages [1]. Fortunately, most practical systems are not purely asynchronous, so the FLP result doesn't apply. To circumvent this, additional assumptions about system synchrony are made, allowing for the design of protocols that maintain safety and provide liveness under certain conditions. One common method is to use an inaccurate local failure detector.
6 |
7 |
8 |
9 | Figure 1. The asynchronous message passing model borrowed from the Mencius paper.
10 |
11 | The figure above illustrates the asynchronous message passing model. Each failure detector monitors servers and maintains a list of suspected faulty servers. These detectors can make mistakes, such as suspecting a running server has crashed. If later corrected, the server can be removed from the suspected list. Protocols using failure detectors must always ensure safety despite these errors and guarantee progress when the detectors remain accurate for long periods.
12 |
13 | Group Replication's failure detection mechanism identifies and expels non-communicating members. This increases the likelihood of the group containing a majority of functioning members, ensuring correct client request processing. All group members regularly exchange messages. If a member doesn't receive messages from another for 5 seconds, it suspects that member. If the suspicion is not resolved, the member is expelled. The expelled member remains unaware of its status and sees other members as unreachable. If it reconnects, it learns of its expulsion through an updated membership view.
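
The suspicion rule itself is simple enough to capture in a few lines. The sketch below (plain C++, not xcom code; `FailureDetector`, `on_message`, and `suspects` are illustrative names) records when each member was last heard from and suspects any member that has been silent for longer than the timeout (5 seconds in the description above):

```c++
// Minimal standalone sketch of the timeout-based suspicion rule.
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

class FailureDetector {
 public:
  explicit FailureDetector(std::chrono::seconds timeout) : timeout_(timeout) {}

  // Called whenever any message is received from 'member'.
  void on_message(const std::string &member) { last_heard_[member] = Clock::now(); }

  // Members silent for longer than the timeout are placed on the suspect list.
  std::vector<std::string> suspects() const {
    std::vector<std::string> out;
    const auto now = Clock::now();
    for (const auto &kv : last_heard_)
      if (now - kv.second > timeout_) out.push_back(kv.first);
    return out;
  }

 private:
  std::chrono::seconds timeout_;
  std::unordered_map<std::string, Clock::time_point> last_heard_;
};
```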
14 |
15 | After understanding the above content, let's analyze common types of view change events:
16 |
17 | 1. **Node is Killed**
18 |
19 | In a Linux system, when a node is killed, the TCP layer typically sends a reset (RST) packet to notify other nodes of the connection problem. Paxos communication can use this RST packet to identify the node's termination. However, MySQL does not handle this specifically and relies on the standard timeout mechanism.
20 |
21 | 2. **Node is Network-Partitioned**
22 |
23 | Detecting whether a node is network-partitioned or simply slow is challenging. In such cases, timeout mechanisms are used, as it is difficult to definitively distinguish between these situations.
24 |
25 | 3. **Node is Gracefully Taken Offline**
26 |
27 | Normally, informing other nodes by sending a command should be straightforward. However, MySQL has not managed this aspect well.
28 |
29 | 4. **Adding a new node to the cluster**
30 |
31 | Adding a new node requires consensus and involves a final installation view synchronization. Although some performance fluctuations are expected, severe fluctuations indicate poor handling of the node addition process.
32 |
33 | Whenever a change that needs replication occurs, the group must achieve consensus. This applies to regular transactions, group membership changes, and certain internal messaging to maintain group consistency. Consensus requires a majority of group members to agree on a decision. Without a majority, the group cannot progress and blocks because it cannot secure a quorum.
34 |
35 | Quorum may be lost due to multiple involuntary failures, causing a majority of servers to be abruptly removed. In a group of 5 servers, if 3 servers become unresponsive simultaneously, the majority is lost, which prevents reaching quorum.
36 |
37 | Conversely, if servers exit the group voluntarily, they can instruct the group to reconfigure itself. A server leaving the group notifies others, allowing proper reconfiguration. This maintains membership consistency and recalculates the majority. For example, if 3 out of 5 servers leave one by one, informing the group, the membership can adjust from 5 to 2 while securing quorum during the process [2].
38 |
39 | After understanding the working mechanism of view change, one can then examine how MySQL handles it.
40 |
41 | In cases of node failure or network partitioning, MySQL's handling approach is similar. Testing was conducted with one MySQL secondary killed. Details of the test can be seen in the following figure.
42 |
43 |
44 |
45 | Figure 2. Significant throughput fluctuations when a node is killed.
46 |
47 | From the figure, it is evident that when the MySQL secondary is killed, the MySQL primary's throughput fluctuates significantly, with a drop to zero lasting over 20 seconds. Ideally, in a three-node cluster, if one node is killed, the remaining two nodes should still form a majority, preventing a prolonged zero-throughput problem. This suggests that MySQL may not effectively manage the majority quorum and failure detection mechanisms.
48 |
49 | When a MySQL secondary is gracefully taken offline, the throughput typically behaves as follows:
50 |
51 |
52 |
53 | Figure 3. Throughput drops to zero at intervals when a node is shut down.
54 |
55 | The figure shows that allowing a MySQL node to be gracefully taken offline causes throughput to drop to zero at several points, indicating problems with the failure detection mechanism.
56 |
57 | What happens when a MySQL node is added to a Group Replication cluster?
58 |
59 |
60 |
61 | Figure 4. Throughput drop of approximately 10 seconds when a node is added.
62 |
63 | From the figure, it is evident that the node addition process resulted in a throughput drop of approximately 10 seconds. This indicates that MySQL did not handle the node addition process effectively.
64 |
65 | To address these problems in Group Replication, improving the probing mechanism is crucial for enhancing fault detection accuracy. Without this improvement, throughput can be significantly disrupted, making further performance enhancements challenging.
66 |
67 | Regarding the probe mechanism, the following improvements have been made.
68 |
69 | 1. **Ensure Fair Execution for Probe Coroutines**
70 |
71 | During the processing of large transactions, the Paxos protocol handles substantial writeset data, monopolizing the processing resources of the single-threaded coroutine model. This leaves limited opportunities for the probe detection coroutine to update critical information. As a result, outdated probe data can lead to incorrect judgments.
72 |
73 | To address this, the solution is to amortize data processing by splitting large transactions into multiple stages. This approach ensures that the probe detection coroutine gets more equitable opportunities to execute and update information promptly, enhancing the accuracy of fault detection.
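
The sketch below captures the amortization idea in isolation (hypothetical names; `yield_to_scheduler` stands in for the coroutine framework's yield primitive, and the slice size is arbitrary): the writeset is consumed in bounded slices, and control is returned to the scheduler between slices so the probe coroutine can run promptly.

```c++
// Illustrative sketch of amortizing large-writeset processing.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t kSliceBytes = 64 * 1024;  // arbitrary per-turn budget

void process_writeset_amortized(
    const std::vector<unsigned char> &writeset,
    const std::function<void(const unsigned char *, std::size_t)> &consume,
    const std::function<void()> &yield_to_scheduler) {
  for (std::size_t off = 0; off < writeset.size(); off += kSliceBytes) {
    const std::size_t len = std::min(kSliceBytes, writeset.size() - off);
    consume(writeset.data() + off, len);  // handle one bounded slice
    yield_to_scheduler();                 // let the probe coroutine execute
  }
}
```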
74 |
75 | 2. **Improved Wakeup Delay Function**
76 |
77 | Check the **wakeup_delay** function in MySQL, as shown in the code below:
78 |
79 | ```c++
80 | static double wakeup_delay(double old) {
81 | double const minimum_threshold = 0.1;
82 | #ifdef EXECUTOR_TASK_AGGRESSIVE_NO_OP
83 | double const maximum_threshold = 1.0;
84 | #else
85 | double const maximum_threshold = 20.0;
86 | #endif /* EXECUTOR_TASK_AGGRESSIVE_NO_OP */
87 | double retval = 0.0;
88 | if (0.0 == old) {
89 | double m = median_time();
90 | double const fuzz = 5.0;
91 | IFDBG(D_BUG, FN; NDBG(m, f));
92 | // Guard against unreasonable estimates of median consensus time
93 | if (m <= 0.0) m = minimum_threshold;
94 | if (m > maximum_threshold / fuzz) m = (maximum_threshold / fuzz) / 2.0;
95 | retval = minimum_threshold + fuzz * m + m * xcom_drand48();
96 | } else {
97 | retval = old * 1.4142136; /* Exponential backoff */
98 | }
99 | /* If we exceed maximum, choose a random value in the max/2..max interval */
100 | if (retval > maximum_threshold) {
101 | double const low = maximum_threshold / 2.0;
102 | retval = low + xcom_drand48() * (maximum_threshold - low);
103 | }
104 | IFDBG(D_BUG, FN; NDBG(retval, f));
105 | return retval;
106 | }
107 | ```
108 |
109 | From the code, it is evident that the calculated delay time is too rigid. This inflexibility is a key reason for performance fluctuations, as the primary may wait too long after a node exits. To address this, adjusting the relevant constants based on the environment is essential for adapting to complex and variable network conditions.
110 |
111 | 3. **Split the wakeup_delay function to adapt to different environments**
112 |
113 | For example, when checking if propose messages have been accepted, utilize the original *wakeup_delay* function, as shown in the code below:
114 |
115 | ```c++
116 | while (!finished(ep->p)) { /* Try to get a value accepted */
117 | /* We will wake up periodically, and whenever a message arrives */
118 | TIMED_TASK_WAIT(&ep->p->rv, ep->delay = wakeup_delay(ep->delay));
119 | ...
120 | ```
121 |
122 | In the function *get_xcom_message*, the *wakeup_delay_for_perf* function is used, as shown in the code below:
123 |
124 | ```c++
125 | DECL_ENV
126 | ...
127 | while (!finished(*p)) {
128 | ...
129 | if (!((*p)->force_delivery)) {
130 | ep->delay = wakeup_delay_for_perf(ep->delay, 0.003);
131 | } else {
132 | ep->delay = wakeup_delay_for_perf(ep->delay, 0.1);
133 | }
134 | IFDBG(D_NONE, FN; NDBG(ep->delay, f));
135 | TIMED_TASK_WAIT(&(*p)->rv, ep->delay);
136 | *p = get_cache(msgno);
137 | dump_debug_exec_state();
138 | }
139 | FINALLY
140 | IFDBG(D_NONE, FN; SYCEXP(msgno); PTREXP(*p); NDBG(ep->wait, u);
141 | SYCEXP(msgno));
142 | TASK_END;
143 | }
144 | ```
145 |
146 | In the *wakeup_delay_for_perf* function, a more aggressive strategy can be employed, such as reducing the waiting time further.
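
The exact shape of *wakeup_delay_for_perf* matters less than the idea. One plausible form, reusing the helpers from the *wakeup_delay* code shown above (a sketch under assumptions, not the actual implementation; the 1.0-second cap in particular is assumed), is:

```c++
/* Sketch only: the second argument replaces the hard-coded minimum threshold
   so that hot paths such as get_xcom_message can wait much more aggressively. */
static double wakeup_delay_for_perf(double old, double min_threshold) {
  double const maximum_threshold = 1.0; /* assumed tighter cap than 20.0 */
  double retval = 0.0;
  if (0.0 == old) {
    double m = median_time();
    /* Guard against unreasonable estimates of median consensus time */
    if (m <= 0.0 || m > min_threshold) m = min_threshold;
    retval = min_threshold + m * xcom_drand48();
  } else {
    retval = old * 1.4142136; /* Exponential backoff, as in wakeup_delay */
  }
  /* If we exceed maximum, choose a random value in the max/2..max interval */
  if (retval > maximum_threshold) {
    double const low = maximum_threshold / 2.0;
    retval = low + xcom_drand48() * (maximum_threshold - low);
  }
  return retval;
}
```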
147 |
148 | 4. **Incorporate the round-trip time (RTT) from the network into the wakeup_delay**
149 |
150 | The purpose of this is to enhance the accuracy of network probing activities.
151 |
152 | 5. **Distinguish between a node being killed and a network partition**
153 |
154 | In Linux systems, when a node is killed, TCP sends reset packets to the other nodes in the cluster, helping distinguish between node terminations and network partition faults. Integrating information about abnormal node terminations into Paxos' decision-making logic allows for more accurate judgments, addressing the problem of prolonged throughput drops experienced during abrupt node terminations.
155 |
156 | With the implementation of the above mechanisms, probing accuracy has been significantly enhanced. Combined with the degradation mechanism described next, this ensures relatively stable throughput even under abnormal conditions.
157 |
158 | ## 1.2 Leverage the Degradation Mechanism to Address Prolonged Waiting Problems
159 |
160 | The degradation mechanism makes decisions based on a majority of the responsive nodes once a node has remained unresponsive for a short delay. While this mechanism is not new and is already part of the Mencius interaction, MySQL has not effectively leveraged it to handle exceptional situations.
161 |
162 | One drawback of the degradation mechanism is that it increases network interactions, including the prepare phase, leading to a performance decrease. However, its advantage lies in significantly improving throughput compared to how MySQL handles faults. In theory, as long as network latency between majority nodes is low, the degradation mechanism can be highly effective.
163 |
164 | The following figure compares the throughput of SysBench read-write tests before and after improvements, following a node being killed.
165 |
166 |
167 |
168 | Figure 5. Significant throughput improvement observed when a node is killed.
169 |
170 | From the figure, it's evident that the native Group Replication experiences prolonged throughput drops, which are unacceptable to users. In the improved Group Replication, throughput decreases from 20,000 to 14,000 transactions per second due to the degradation process. Although this decrease is noticeable, users consider it acceptable as it represents a significant improvement over the native Group Replication.
171 |
172 | Let's continue to examine the throughput comparison over time before and after improvements following the normal shutdown of a particular node, as shown in the following figure:
173 |
174 |
175 |
176 | Figure 6. Significant throughput improvement observed when a node is closed.
177 |
178 | From the figure, it's clear that the improved Group Replication provides much more stable throughput compared to the native version. Although minor fluctuations occur during view changes due to internal synchronization, the improved Group Replication's throughput performance is deemed acceptable by users. In contrast, the frequent throughput drops in the native Group Replication are considered unacceptable.
179 |
180 | Once again, let's compare the throughput over time before and after improvements when adding a MySQL secondary to the cluster, as shown in the following figure:
181 |
182 |
183 |
184 | Figure 7. Significant throughput improvement observed when adding a node to cluster.
185 |
186 | From the figure, it is evident that the native Group Replication experiences throughput drops of around 10 seconds, whereas the improved Group Replication shows only a slight decrease in throughput with minimal impact on performance.
187 |
188 | Overall, the problems with native Group Replication in abnormal scenarios can be effectively solved [3].
189 |
190 | ## References:
191 |
192 | [1] Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: building efficient replicated state machines for WANs. In Proc. 8th USENIX OSDI, pages 369--384, San Diego, CA, Dec. 2008.
193 |
194 | [2] https://dev.mysql.com/doc/refman/8.0/en/.
195 |
196 | [3] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
197 |
198 |
--------------------------------------------------------------------------------
/images/149ca62fd014bd12b60c77573a49757d.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/149ca62fd014bd12b60c77573a49757d.gif
--------------------------------------------------------------------------------
/images/185beb2c6e0524d93abde3b25ecedc61.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/185beb2c6e0524d93abde3b25ecedc61.png
--------------------------------------------------------------------------------
/images/19a92bf7065627b28a3403c000eba095.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/19a92bf7065627b28a3403c000eba095.png
--------------------------------------------------------------------------------
/images/290e969038561faf1ace6e108945b2d6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/290e969038561faf1ace6e108945b2d6.png
--------------------------------------------------------------------------------
/images/2eba13133f9eb007d89459cce5d4055b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/2eba13133f9eb007d89459cce5d4055b.png
--------------------------------------------------------------------------------
/images/3a5d10caf25003e7ea4dcace59a181f6.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/3a5d10caf25003e7ea4dcace59a181f6.gif
--------------------------------------------------------------------------------
/images/4309ffee12d2eed2548845d1e1d2e848.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/4309ffee12d2eed2548845d1e1d2e848.png
--------------------------------------------------------------------------------
/images/47c39411c3240713c75e848ef5ef4b59.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/47c39411c3240713c75e848ef5ef4b59.png
--------------------------------------------------------------------------------
/images/47f963cdd950abd91a459ffb66a3744e.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/47f963cdd950abd91a459ffb66a3744e.png
--------------------------------------------------------------------------------
/images/4ca52ffeebc49306e76c74ed9062257d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/4ca52ffeebc49306e76c74ed9062257d.png
--------------------------------------------------------------------------------
/images/4cc389ee95fbae485f1e014aad393aa8.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/4cc389ee95fbae485f1e014aad393aa8.gif
--------------------------------------------------------------------------------
/images/4f0ea97ad117848a71148849705e311e.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/4f0ea97ad117848a71148849705e311e.png
--------------------------------------------------------------------------------
/images/5136495bbdfbe2cefac98d74bd36a88f.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/5136495bbdfbe2cefac98d74bd36a88f.png
--------------------------------------------------------------------------------
/images/55787cfe149ebf1c574aa1da4eea678c.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/55787cfe149ebf1c574aa1da4eea678c.png
--------------------------------------------------------------------------------
/images/5c6f81f69eac7f61744ba3bc035b29e7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/5c6f81f69eac7f61744ba3bc035b29e7.png
--------------------------------------------------------------------------------
/images/63cf3b8bd556c6b1ce5a0436883c8f7b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/63cf3b8bd556c6b1ce5a0436883c8f7b.png
--------------------------------------------------------------------------------
/images/6670e70b6d5f0f5152f643c153d13487.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/6670e70b6d5f0f5152f643c153d13487.png
--------------------------------------------------------------------------------
/images/66a031ab14d56289d0987c65c73323af.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/66a031ab14d56289d0987c65c73323af.png
--------------------------------------------------------------------------------
/images/68eb47a320bce3bb717a2b690da4573d.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/68eb47a320bce3bb717a2b690da4573d.png
--------------------------------------------------------------------------------
/images/6a6091b51c602923e5fe8b33e4422882.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/6a6091b51c602923e5fe8b33e4422882.png
--------------------------------------------------------------------------------
/images/77d0c0bdc5ce8574c6ad319864abb032.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/77d0c0bdc5ce8574c6ad319864abb032.gif
--------------------------------------------------------------------------------
/images/853d21533f748c1c56a4151869a82a27.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/853d21533f748c1c56a4151869a82a27.gif
--------------------------------------------------------------------------------
/images/8f9080ee71094d948ab7592b449954bb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/8f9080ee71094d948ab7592b449954bb.png
--------------------------------------------------------------------------------
/images/a54faa33502b8c17066b1e2af09bdbb0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/a54faa33502b8c17066b1e2af09bdbb0.png
--------------------------------------------------------------------------------
/images/b5a20f07b4812ea9f34ffb8e8783b73a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/b5a20f07b4812ea9f34ffb8e8783b73a.png
--------------------------------------------------------------------------------
/images/b89dd5988d7d7ead6923dbc2d20e146c.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/b89dd5988d7d7ead6923dbc2d20e146c.png
--------------------------------------------------------------------------------
/images/bb111919295b7678530a1adcfa8b7d29.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/bb111919295b7678530a1adcfa8b7d29.png
--------------------------------------------------------------------------------
/images/cba8723ae722e8d2d13b94e0cf1fda7a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/cba8723ae722e8d2d13b94e0cf1fda7a.png
--------------------------------------------------------------------------------
/images/ccb771014f600402fee72ca7134aea10.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/ccb771014f600402fee72ca7134aea10.gif
--------------------------------------------------------------------------------
/images/ce8a6d4b9cd6df4c48cf914fae8a70d2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/ce8a6d4b9cd6df4c48cf914fae8a70d2.png
--------------------------------------------------------------------------------
/images/da7d6c8c12d18b915018939970d2b911.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/da7d6c8c12d18b915018939970d2b911.png
--------------------------------------------------------------------------------
/images/e2554f0ea244f337c0e66ea34bf53edf.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/e2554f0ea244f337c0e66ea34bf53edf.gif
--------------------------------------------------------------------------------
/images/f25b3e3bc94bed108b0c454413f79873.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/f25b3e3bc94bed108b0c454413f79873.png
--------------------------------------------------------------------------------
/images/f67a3249ad8b040417f195c3ca11f795.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/f67a3249ad8b040417f195c3ca11f795.png
--------------------------------------------------------------------------------
/images/image-20240829080659528.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829080659528.png
--------------------------------------------------------------------------------
/images/image-20240829081950365.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829081950365.png
--------------------------------------------------------------------------------
/images/image-20240829092940314.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829092940314.png
--------------------------------------------------------------------------------
/images/image-20240829092959775.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829092959775.png
--------------------------------------------------------------------------------
/images/image-20240829094221268.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829094221268.png
--------------------------------------------------------------------------------
/images/image-20240829100432417.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829100432417.png
--------------------------------------------------------------------------------
/images/image-20240829101111288.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101111288.png
--------------------------------------------------------------------------------
/images/image-20240829101222447.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101222447.png
--------------------------------------------------------------------------------
/images/image-20240829101254601.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101254601.png
--------------------------------------------------------------------------------
/images/image-20240829101332034.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101332034.png
--------------------------------------------------------------------------------
/images/image-20240829101534550.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101534550.png
--------------------------------------------------------------------------------
/images/image-20240829101612063.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101612063.png
--------------------------------------------------------------------------------
/images/image-20240829101632142.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101632142.png
--------------------------------------------------------------------------------
/images/image-20240829101650730.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101650730.png
--------------------------------------------------------------------------------
/images/image-20240829101712694.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829101712694.png
--------------------------------------------------------------------------------
/images/image-20240829102323261.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829102323261.png
--------------------------------------------------------------------------------
/images/image-20240829102344815.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829102344815.png
--------------------------------------------------------------------------------
/images/image-20240829102642856.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829102642856.png
--------------------------------------------------------------------------------
/images/image-20240829102703396.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829102703396.png
--------------------------------------------------------------------------------
/images/image-20240829103131857.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829103131857.png
--------------------------------------------------------------------------------
/images/image-20240829103236734.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829103236734.png
--------------------------------------------------------------------------------
/images/image-20240829103259992.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829103259992.png
--------------------------------------------------------------------------------
/images/image-20240829104222478.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829104222478.png
--------------------------------------------------------------------------------
/images/image-20240829104512068.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829104512068.png
--------------------------------------------------------------------------------
/images/image-20240829104533718.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829104533718.png
--------------------------------------------------------------------------------
/images/image-20240829104554155.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829104554155.png
--------------------------------------------------------------------------------
/images/image-20240829104639402.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829104639402.png
--------------------------------------------------------------------------------
/images/image-20240829104851205.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829104851205.png
--------------------------------------------------------------------------------
/images/image-20240829105258689.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829105258689.png
--------------------------------------------------------------------------------
/images/image-20240829105418671.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829105418671.png
--------------------------------------------------------------------------------
/images/image-20240829105441723.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829105441723.png
--------------------------------------------------------------------------------
/images/image-20240829105506722.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829105506722.png
--------------------------------------------------------------------------------
/images/image-20240829110009589.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829110009589.png
--------------------------------------------------------------------------------
/images/image-20240829110903709.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829110903709.png
--------------------------------------------------------------------------------
/images/image-20240829110922449.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829110922449.png
--------------------------------------------------------------------------------
/images/image-20240829110943245.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829110943245.png
--------------------------------------------------------------------------------
/images/image-20240829113916829.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829113916829.png
--------------------------------------------------------------------------------
/images/image-20240829113948830.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829113948830.png
--------------------------------------------------------------------------------
/images/image-20240829114017540.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829114017540.png
--------------------------------------------------------------------------------
/images/image-20240829114037360.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829114037360.png
--------------------------------------------------------------------------------
/images/image-20240829114100020.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829114100020.png
--------------------------------------------------------------------------------
/images/image-20240829151823981.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240829151823981.png
--------------------------------------------------------------------------------
/images/image-20240904142951908.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240904142951908.png
--------------------------------------------------------------------------------
/images/image-20240904201934894.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240904201934894.png
--------------------------------------------------------------------------------
/images/image-20240905163847317.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240905163847317.png
--------------------------------------------------------------------------------
/images/image-20240905163915360.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240905163915360.png
--------------------------------------------------------------------------------
/images/image-20240905163959058.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240905163959058.png
--------------------------------------------------------------------------------
/images/image-20240905222509513.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-20240905222509513.png
--------------------------------------------------------------------------------
/images/image-bulk-insert-degrade.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-bulk-insert-degrade.png
--------------------------------------------------------------------------------
/images/image-bulk-insert-optimize.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-bulk-insert-optimize.png
--------------------------------------------------------------------------------
/images/image-degrade.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-degrade.png
--------------------------------------------------------------------------------
/images/image-join-degrade.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-join-degrade.png
--------------------------------------------------------------------------------
/images/image-join-improve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/image-join-improve.png
--------------------------------------------------------------------------------
/images/tcpcopy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/enhancedformysql/blogs/a8ad24bd92f399abf4a2541e875aaf8fc0ed1022/images/tcpcopy.png
--------------------------------------------------------------------------------
/innodb_storage.md:
--------------------------------------------------------------------------------
1 | ## Enhancing the InnoDB Storage Engine
2 |
3 | ### 1.1 MVCC ReadView: Identified Problems
4 |
5 | A key component of any MVCC scheme is the mechanism for quickly determining which tuples are visible to which transactions. A transaction's snapshot is created by building a ReadView (RV) vector that holds the TXIDs of all concurrent transactions smaller than the transaction's TXID. The cost of acquiring a snapshot increases linearly with the number of concurrent transactions, even if the transaction only reads tuples written by a single committed transaction, highlighting a known scalability limitation [1].
6 |
7 | After understanding the scalability problems with the MVCC ReadView mechanism, let's examine how MySQL implements MVCC ReadView. Under the Read Committed isolation level, during the process of reading data, the InnoDB storage engine triggers the acquisition of the ReadView. A screenshot of part of the ReadView data structure is shown below:
8 |
9 | ```c++
10 | private:
11 | // Disable copying
12 | ReadView(const ReadView &);
13 | ReadView &operator=(const ReadView &);
14 | private:
15 | /** The read should not see any transaction with trx id >= this
16 | value. In other words, this is the "high water mark". */
17 | trx_id_t m_low_limit_id;
18 | /** The read should see all trx ids which are strictly
19 | smaller (<) than this value. In other words, this is the
20 | low water mark". */
21 | trx_id_t m_up_limit_id;
22 | /** trx id of creating transaction, set to TRX_ID_MAX for free
23 | views. */
24 | trx_id_t m_creator_trx_id;
25 | /** Set of RW transactions that was active when this snapshot
26 | was taken */
27 | ids_t m_ids;
28 | /** The view does not need to see the undo logs for transactions
29 | whose transaction number is strictly smaller (<) than this value:
30 | they can be removed in purge if not needed by other views */
31 | trx_id_t m_low_limit_no;
32 | ...
33 | ```
34 |
35 | Here, *m_ids* is a data structure of type *ids_t*, which closely resembles *std::vector*. See the specific explanation below:
36 |
37 | ```c++
38 | /** This is similar to a std::vector but it is not a drop
39 | in replacement. It is specific to ReadView. */
40 | class ids_t {
41 | typedef trx_ids_t::value_type value_type;
42 | /**
43 | Constructor */
44 | ids_t() : m_ptr(), m_size(), m_reserved() {}
45 | /**
46 | Destructor */
47 | ~ids_t() { ut::delete_arr(m_ptr); }
48 | /** Try and increase the size of the array. Old elements are copied across.
49 | It is a no-op if n is < current size.
50 | @param n Make space for n elements */
51 | void reserve(ulint n);
52 | ...
53 | ```
54 |
55 | The algorithm for MVCC ReadView visibility determination is implemented in the *changes_visible* function shown below:
56 |
57 | ```c++
58 | /** Check whether the changes by id are visible.
59 | @param[in] id transaction id to check against the view
60 | @param[in] name table name
61 | @return whether the view sees the modifications of id. */
62 | [[nodiscard]] bool changes_visible(trx_id_t id,
63 | const table_name_t &name) const {
64 | ut_ad(id > 0);
65 | if (id < m_up_limit_id || id == m_creator_trx_id) {
66 | return (true);
67 | }
68 | check_trx_id_sanity(id, name);
69 | if (id >= m_low_limit_id) {
70 | return (false);
71 | } else if (m_ids.empty()) {
72 | return (true);
73 | }
74 | const ids_t::value_type *p = m_ids.data();
75 | return (!std::binary_search(p, p + m_ids.size(), id));
76 | }
77 | ```
78 |
79 | From the code, it can be seen that the visibility algorithm works efficiently when concurrency is low. However, as concurrency increases, the efficiency of using binary search to determine visibility significantly decreases, particularly in NUMA environments.
80 |
81 | ### 1.2 Solutions for Enhancing MVCC ReadView Scalability
82 |
83 | There are two fundamental approaches to improving scalability here [2]:
84 |
85 | *First, finding an algorithm that improves the complexity, so that each additional connection does not increase the snapshot computation costs linearly.*
86 |
87 | *Second, perform less work for each connection, hopefully reducing the total time taken so much that even at high connection counts the total time is still small enough to not matter much (i.e. reduce the constant factor).*
88 |
89 | For the first solution, adopting a multi-version visibility algorithm based on Commit Sequence Numbers (CSN) offers benefits [7]: *the cost of taking snapshots can be reduced by converting snapshots into CSNs instead of maintaining a transaction ID list.* Specifically, under the Read Committed isolation level, there's no need to replicate an active transaction list for each read operation, thereby improving scalability.
90 |
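To illustrate the idea, below is a minimal conceptual sketch of CSN-based visibility (it is not MySQL code; the names `global_csn`, `commit_csn`, and `changes_visible_csn` are illustrative, and concurrency control around the map is omitted). A snapshot becomes just the commit sequence number observed at snapshot time, so no active-transaction list has to be copied per read:

```c++
#include <atomic>
#include <cstdint>
#include <unordered_map>

using trx_id_t = uint64_t;
using csn_t = uint64_t;

std::atomic<csn_t> global_csn{0};                 // advanced on every commit
std::unordered_map<trx_id_t, csn_t> commit_csn;   // committed transactions only

void commit(trx_id_t id) { commit_csn[id] = global_csn.fetch_add(1) + 1; }

csn_t take_snapshot() { return global_csn.load(); }  // O(1), no list copy

// A row version written by `writer` is visible if the writer committed
// before the snapshot was taken.
bool changes_visible_csn(trx_id_t writer, csn_t snapshot_csn) {
  auto it = commit_csn.find(writer);
  return it != commit_csn.end() && it->second <= snapshot_csn;
}
```
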
91 | Considering the complexity of implementation, this book opts for the second solution, which directly modifies the MVCC ReadView data structure to mitigate MVCC ReadView scalability problems.
92 |
93 | ### 1.3 Improvements to the MVCC ReadView Data Structure
94 |
95 | In the ReadView structure, the original approach used a vector to store the list of active transactions. Now, it has been changed to the following data structure:
96 |
97 | ```c++
98 | class ReadView {
99 | ...
100 | private:
101 | // Disable copying
102 | ReadView &operator=(const ReadView &);
103 | public:
104 | bool skip_view_list{false};
105 | private:
106 | unsigned char top_active[MAX_TOP_ACTIVE_BYTES];
107 | trx_id_t m_short_min_id;
108 | trx_id_t m_short_max_id;
109 | bool m_has_short_actives;
110 | /** The read should not see any transaction with trx id >= this
111 | value. In other words, this is the "high water mark". */
112 | trx_id_t m_low_limit_id;
113 | /** The read should see all trx ids which are strictly
114 | smaller (<) than this value. In other words, this is the low water mark". */
115 | trx_id_t m_up_limit_id;
116 | /** trx id of creating transaction, set to TRX_ID_MAX for free views. */
117 | trx_id_t m_creator_trx_id;
118 | /** Set of RW transactions that was active when this snapshot
119 | was taken */
120 | ids_t m_long_ids;
121 | ...
122 | ```
123 |
124 | Furthermore, corresponding code modifications were made in the related interface functions, as changes to the data structure necessitate adjustments to the internal code within these functions.
125 |
126 | This new MVCC ReadView data structure can be seen as a hybrid data structure, as shown in the following figure [3].
127 |
128 | 
129 |
130 | Figure 1. A new hybrid data structure suitable for active transaction list in MVCC ReadView.
131 |
132 | Typically, online transactions are short rather than long, and transaction IDs increase continuously. To leverage these characteristics, a hybrid data structure is used: a static array for consecutive short transaction IDs and a vector for long transactions. With a 2048-byte array, up to 16,384 consecutive active transaction IDs can be stored, each bit representing a transaction ID.
133 |
134 | The minimum short transaction ID is used to differentiate between short and long transactions. IDs smaller than this minimum go into the long transaction vector, while IDs equal to or greater than it are placed in the short transaction array. For an ID in changes_visible, if it is below the minimum short transaction ID, a direct query is made to the vector, which is efficient due to the generally small number of long transactions. If the ID is equal to or above the minimum short transaction ID, a bitwise query is performed, with a time complexity of O(1), compared to the previous O(log n) complexity. This improvement enhances efficiency and reduces cache migration between NUMA nodes, as O(1) queries typically complete within a single CPU time slice.
135 |
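The following sketch illustrates how such a hybrid structure can answer visibility queries (a simplified illustration of the scheme described above, not the actual patched code; `kMaxTopActiveBytes` and the omitted bounds handling are illustrative):

```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using trx_id_t = uint64_t;
constexpr size_t kMaxTopActiveBytes = 2048;   // 2048 bytes -> 16,384 bits/trx ids

struct HybridReadView {
  unsigned char top_active[kMaxTopActiveBytes] = {};  // bit set => trx still active
  trx_id_t short_min_id = 0;               // smallest short (recent) active trx id
  trx_id_t up_limit_id = 0;                // low-water mark, as in ReadView
  trx_id_t low_limit_id = 0;               // high-water mark, as in ReadView
  std::vector<trx_id_t> long_ids;          // sorted ids of long-running transactions

  bool changes_visible(trx_id_t id) const {
    if (id < up_limit_id) return true;     // committed before all active trx
    if (id >= low_limit_id) return false;  // started after the snapshot
    if (id < short_min_id)                 // long transaction: small sorted vector
      return !std::binary_search(long_ids.begin(), long_ids.end(), id);
    const trx_id_t off = id - short_min_id;           // short transaction: O(1) bit test
    return (top_active[off / 8] & (1u << (off % 8))) == 0;
  }
};
```
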
136 | In addition to the previously mentioned transformation, similar modifications were applied to the global transaction active list. The original data structure used for this list is shown in the following code snippet:
137 |
138 | ```c++
139 | /** Array of Read write transaction IDs for MVCC snapshot. A ReadView would
140 | take a snapshot of these transactions whose changes are not visible to it.
141 | We should remove transactions from the list before committing in memory and
142 | releasing locks to ensure right order of removal and consistent snapshot. */
143 | trx_ids_t rw_trx_ids;
144 | ```
145 |
146 | Now it has been changed to the following data structure:
147 |
148 | ```c++
149 | /** Array of Read write transaction IDs for MVCC snapshot. A ReadView would
150 | take a snapshot of these transactions whose changes are not visible to it.
151 | We should remove transactions from the list before committing in memory and
152 | releasing locks to ensure right order of removal and consistent snapshot. */
153 | trx_ids_t long_rw_trx_ids;
154 | unsigned char short_rw_trx_ids_bitmap[MAX_SHORT_ACTIVE_BYTES];
155 | int short_rw_trx_valid_number;
156 | trx_id_t min_short_valid_id;
157 | trx_id_t max_short_valid_id;
158 | ```
159 |
160 | In the *short_rw_trx_ids_bitmap* structure, *MAX_SHORT_ACTIVE_BYTES* is set to 65536, theoretically accommodating up to 524,288 consecutive short transaction IDs. If the limit is exceeded, the oldest short transaction IDs are converted into long transactions and stored in *long_rw_trx_ids*. Global long and short transactions are distinguished by *min_short_valid_id*: IDs smaller than this value are treated as global long transactions, while IDs equal to or greater are considered global short transactions.
161 |
162 | During the copying process from the global active transaction list, the *short_rw_trx_ids_bitmap* structure, which uses only one bit per transaction ID, allows for much higher copying efficiency compared to the native MySQL solution. For example, with 1000 active transactions, the native MySQL version would require copying at least 8000 bytes, whereas the optimized solution may only need a few hundred bytes. This results in a significant improvement in copying efficiency.
163 |
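A rough sketch of how the global bitmap can be maintained, and why snapshot copying becomes cheap, is shown below (again a simplified illustration rather than the actual code; overflow of the bitmap into `long_rw_trx_ids` and all locking are omitted):

```c++
#include <cstddef>
#include <cstdint>

using trx_id_t = uint64_t;
constexpr size_t kMaxShortActiveBytes = 65536;   // 65,536 bytes -> 524,288 bits

struct GlobalActiveTrx {
  unsigned char short_bitmap[kMaxShortActiveBytes] = {};
  trx_id_t min_short_valid_id = 0;
  trx_id_t max_short_valid_id = 0;

  void add(trx_id_t id) {                  // a new read-write transaction starts
    const trx_id_t off = id - min_short_valid_id;
    short_bitmap[off / 8] |= (1u << (off % 8));
    if (id > max_short_valid_id) max_short_valid_id = id;
  }
  void remove(trx_id_t id) {               // the transaction commits: clear its bit
    const trx_id_t off = id - min_short_valid_id;
    short_bitmap[off / 8] &= ~(1u << (off % 8));
  }
  // A ReadView only needs the bytes between the minimum and maximum valid ids.
  // With 1,000 active transactions spread over a few thousand consecutive ids,
  // this is a few hundred bytes instead of 8,000 bytes of 8-byte trx ids.
  size_t snapshot_bytes() const {
    return (max_short_valid_id - min_short_valid_id) / 8 + 1;
  }
};
```
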
164 | After implementing these modifications, performance comparison tests were conducted to evaluate the effectiveness of the MVCC ReadView optimization. The figure below shows a comparison of TPC-C throughput with varying concurrency levels, before and after modifying the MVCC ReadView data structure.
165 |
166 |
167 |
168 | Figure 2. Performance comparison before and after adopting the new hybrid data structure in NUMA.
169 |
170 | From the figure, it is evident that this transformation primarily optimized scalability and improved MySQL's peak throughput in NUMA environments. Further performance comparisons before and after optimization can be analyzed using tools like *perf*. Below is a screenshot from *perf* at 300 concurrency, prior to optimization:
171 |
172 | 
173 |
174 | Figure 3. Latch-related bottleneck observed in *perf* screenshot.
175 |
176 | From the figure, it can be seen that the first two bottlenecks were significant, accounting for approximately 33% of the overhead. After optimization, the *perf* screenshot at 300 concurrency is as follows:
177 |
178 | 
179 |
180 | Figure 4. Significant alleviation of latch-related bottleneck.
181 |
182 | After optimization, as shown in the screenshot above, the proportions of the previous top two bottlenecks have been significantly reduced.
183 |
184 | Why does changing the MVCC ReadView data structure significantly enhance scalability? This is because accessing these structures involves acquiring a global latch. Optimizing the data structure accelerates access to critical resources, reducing concurrency conflicts and minimizing cache migration across NUMA nodes.
185 |
186 | The native MVCC ReadView uses a vector to store the list of active transactions. In high-concurrency scenarios, this list can become large, leading to a larger working set. In NUMA environments, both querying and copying the list become slower, and the work may exceed a single CPU time slice, resulting in significant context-switching costs. The theoretical basis for this aspect is as follows [4]:
187 |
188 | *Context-switches that occur in the middle of a logical operation evict a possibly larger working set from the cache. When the suspended thread resumes execution, it wastes time restoring the evicted working set.*
189 |
190 | Throughput improvement under the ARM architecture is evaluated next. Details are shown in the following figure:
191 |
192 |
193 |
194 | Figure 5. Throughput improvement under the ARM architecture.
195 |
196 | From the figure, it is evident that there is also a significant improvement under the ARM architecture. Extensive test data confirms that the MVCC ReadView optimization yields clear benefits in NUMA environments, regardless of whether the architecture is ARM or x86.
197 |
198 | How much improvement can this optimization achieve in an SMP environment?
199 |
200 |
201 |
202 | Figure 6. Performance comparison before and after adopting the new hybrid data structure in SMP.
203 |
204 | From the figure, it can be observed that after binding to NUMA node 0, the improvement from the MVCC ReadView optimization is not significant. This suggests that the optimization primarily enhances scalability in NUMA architectures.
205 |
206 | In practical MySQL usage, preventing excessive user threads from entering the InnoDB storage engine can significantly reduce the size of the global active transaction list. This transaction throttling mechanism complements the MVCC ReadView optimization effectively, improving overall performance. Combined with double latch avoidance, discussed in the next section, the TPC-C test results in the following figure clearly demonstrate these improvements.
207 |
208 |
209 |
210 | Figure 7. Maximum TPC-C throughput in BenchmarkSQL with transaction throttling mechanisms.
211 |
212 | ### 1.4 Avoiding Double Latch Problems
213 |
214 | During testing after the MVCC ReadView optimization, a noticeable decline in throughput was observed under extremely high concurrency conditions. The specific details are shown in the following figure:
215 |
216 |
217 |
218 | Figure 8. Performance degradation at concurrency levels exceeding 500.
219 |
220 | From the figure, it can be seen that throughput significantly decreases once concurrency exceeds 500. The problem was traced to frequent acquisitions of the *trx-sys* latch, as shown in the code snippet below:
221 |
222 | ```c++
223 | } else if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
224 | MVCC::is_view_active(trx->read_view)) {
225 | mutex_enter(&trx_sys->mutex);
226 | trx_sys->mvcc->view_close(trx->read_view, true);
227 | mutex_exit(&trx_sys->mutex);
228 | }
229 | ```
230 |
231 | The other code snippet is shown below:
232 |
233 | ```c++
234 | if (lock_type != TL_IGNORE && trx->n_mysql_tables_in_use == 0) {
235 | trx->isolation_level =
236 | innobase_trx_map_isolation_level(thd_get_trx_isolation(thd));
237 | if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
238 | MVCC::is_view_active(trx->read_view)) {
239 | /* At low transaction isolation levels we let
240 | each consistent read set its own snapshot */
241 | mutex_enter(&trx_sys->mutex);
242 | trx_sys->mvcc->view_close(trx->read_view, true);
243 | mutex_exit(&trx_sys->mutex);
244 | }
245 | }
246 | ```
247 |
248 | InnoDB introduces a global trx-sys latch during the view close process, impacting scalability under high concurrency. To address this, an attempt was made to remove the global latch. One of the modifications is shown in the code snippet below:
249 |
250 | ```c++
251 | } else if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
252 | MVCC::is_view_active(trx->read_view)) {
253 | trx_sys->mvcc->view_close(trx->read_view, false);
254 | }
255 | ```
256 |
257 | The other modification is shown in the code snippet below:
258 |
259 | ```c++
260 | if (lock_type != TL_IGNORE && trx->n_mysql_tables_in_use == 0) {
261 | trx->isolation_level =
262 | innobase_trx_map_isolation_level(thd_get_trx_isolation(thd));
263 | if (trx->isolation_level <= TRX_ISO_READ_COMMITTED &&
264 | MVCC::is_view_active(trx->read_view)) {
265 | /* At low transaction isolation levels we let
266 | each consistent read set its own snapshot */
267 | trx_sys->mvcc->view_close(trx->read_view, false);
268 | }
269 | }
270 | ```
271 |
272 | Using the MVCC ReadView optimized version, TPC-C throughput was compared before and after the modifications. Details are shown in the following figure:
273 |
274 |
275 |
276 | Figure 9. Performance improvement after eliminating the double latch bottleneck.
277 |
278 | From the figure, it is evident that the modifications significantly improved scalability under high-concurrency conditions. To understand the reasons for this improvement, let's use the *perf* tool for further investigation. Below is the *perf* screenshot at 2000 concurrency before the modifications:
279 |
280 | 
281 |
282 | Figure 10. Latch-related bottleneck observed in *perf* screenshot.
283 |
284 | From the figure, it is evident that the latch-related bottlenecks are quite pronounced. After the code modifications, here is the *perf* screenshot at 3000 concurrency:
285 |
286 | 
287 |
288 | Figure 11. Significant alleviation of latch-related bottleneck.
289 |
290 | Even with higher concurrency, such as 3000, the bottlenecks are not pronounced. This suggests that the optimizations have effectively alleviated the latch-related performance problems, improving scalability under extreme conditions.
291 |
292 | Excluding the global latch before and after the *view_close* function call improves scalability, while retaining it severely degrades scalability under high concurrency. Although the *view_close* function operates efficiently within its critical section, frequently acquiring the global *trx-sys* latch—which is shared across the entire *trx-sys* subsystem—causes significant contention and head-of-line blocking. This is known as the 'double latch' problem, since both *view_open* and *view_close* require the global *trx-sys* latch. Notably, removing the latch from the final stage or using a new latch can significantly mitigate this problem.
293 |
294 |
295 |
296 | ## References:
297 |
298 | [1] Adnan Alhomssi and Viktor Leis. 2023. Scalable and Robust Snapshot Isolation for High-Performance Storage Engines. Proc. VLDB Endow. 16, 6 (2023), 1426–1438.
299 |
300 | [2] Andres Freund. 2020. Improving Postgres Connection Scalability: Snapshots. techcommunity.microsoft.com.
301 |
302 | [3] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
303 |
304 | [4] Harizopoulos, S. and Ailamaki, A. 2003. A case for staged database systems. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). Asilomar, CA.
--------------------------------------------------------------------------------
/isolation.md:
--------------------------------------------------------------------------------
1 | ## How to explain why Repeatable Read surprisingly outperforms Read Committed?
2 |
3 | Transaction isolation is fundamental to database processing, represented by the 'I' in the ACID acronym. The isolation level determines the balance between performance and the reliability, consistency, and predictability of results when multiple transactions concurrently make changes and queries. Commonly used isolation levels are Read Committed, Repeatable Read, and Serializable. By default, InnoDB uses Repeatable Read.
4 |
5 | InnoDB employs distinct locking strategies for each isolation level, impacting query locking behavior under concurrent conditions. Depending on the isolation level, queries may need to wait for locks currently held by other sessions before execution begins [1]. There's a common perception that stricter isolation levels can degrade performance. How does MySQL perform in practical scenarios?
6 |
7 | Tests were conducted across the Serializable, Repeatable Read (RR), and Read Committed (RC) isolation levels using two benchmark types: the SysBench uniform test and the SysBench Pareto test. The SysBench uniform test simulates low-conflict scenarios, while the SysBench Pareto test models high-conflict situations. Because the SysBench Pareto test generated excessive deadlock logs that significantly interfered with performance analysis, these logs were suppressed by modifying the source code to ensure fair testing conditions. Moreover, a modified version of the MySQL testing program was used for accuracy, rather than the original version.
8 |
9 | The figure below presents results from the SysBench uniform test, where concurrency increases from 50 to 800 in doubling increments. Given the few conflicts in this test type, there is little variation in throughput among the three transaction isolation levels at low concurrency levels. However, beyond 400 concurrency, the throughput of the Serializable isolation level exhibits a notable decline.
10 |
11 |
12 |
13 | Figure 1. SysBench read-write performance comparison with low conflicts under different isolation levels.
14 |
15 | Below 400 concurrency, the differences are minor because of fewer conflicts in the uniform test. With fewer conflicts, the impact of lock strategies under different transaction isolation levels is reduced. However, Read Committed is mainly constrained by frequent acquisition of MVCC ReadView, resulting in performance inferior to Repeatable Read.
16 |
17 | Continuing with the SysBench test under Pareto distribution conditions, specific comparative test results can be seen in the following figure.
18 |
19 |
20 |
21 | Figure 2. SysBench read-write performance comparison with high conflicts under different isolation levels.
22 |
23 | The figure clearly illustrates that in scenarios with significant conflicts, performance differences due to lock strategies under different transaction isolation levels are pronounced. As anticipated, higher transaction isolation levels generally exhibit lower throughput, particularly under severe conflict conditions.
24 |
25 | In scenarios with few conflicts, performance is primarily constrained by the overhead of acquiring ReadView in MVCC. This is because, under the Read Committed isolation level, MySQL must copy the entire active transaction list each time it reads from the global active transaction list, whereas under Repeatable Read, it only needs to obtain a copy of the active transaction list at the start of the transaction.
26 |
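The difference can be put into rough numbers with a small back-of-the-envelope sketch (illustrative values only; the per-copy cost is assumed to be proportional to the number of active transactions):

```c++
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t active_trx = 1000;    // concurrent read-write transactions
  const std::size_t reads_per_trx = 10;   // consistent reads per transaction
  const std::size_t bytes_per_id = 8;     // size of one trx id

  // Read Committed builds a fresh ReadView for every consistent read;
  // Repeatable Read builds one ReadView at the start of the transaction.
  std::size_t rc_bytes = reads_per_trx * active_trx * bytes_per_id;
  std::size_t rr_bytes = 1 * active_trx * bytes_per_id;
  std::printf("RC copies ~%zu bytes per transaction, RR copies ~%zu bytes\n",
              rc_bytes, rr_bytes);
  return 0;
}
```
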
27 | In summary, in low-conflict tests like SysBench uniform, the overhead of MVCC ReadView is the predominant bottleneck, outweighing lock overhead. Consequently, Repeatable Read performs better than Read Committed. Conversely, in high-conflict tests like SysBench Pareto, lock overhead becomes the primary bottleneck, resulting in Read Committed outperforming Repeatable Read.
28 |
29 | ## References:
30 |
31 | [1] https://dev.mysql.com/doc/refman/8.0/en/.
32 |
33 | [2] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
--------------------------------------------------------------------------------
/logical_reasoning_samples.md:
--------------------------------------------------------------------------------
1 | # Logical Reasoning in Network Problems
2 |
3 | ## 1.1 Classic Case 1
4 |
5 | Many software professionals lack in-depth knowledge of TCP/IP and the logical reasoning it requires, which often leads to ordinary problems being misidentified as mysterious ones. Some are discouraged by the complexity of TCP/IP networking literature, while others are misled by confusing details in Wireshark. For instance, a DBA facing performance problems might misinterpret packet capture data in Wireshark, erroneously concluding that TCP retransmissions are the cause.
6 |
7 | 
8 |
9 | Figure 1. Packet capture screenshot provided by DBA suspecting retransmission problems.
10 |
11 | Since retransmission is suspected, it's essential to understand its nature. Retransmission fundamentally involves timeout retransmission. To confirm if retransmission is indeed the cause, time-related information is necessary, which is not provided in the screenshot above. After requesting a new screenshot from the DBA, the timestamp information was included.
12 |
13 | 
14 |
15 | Figure 2. Packet capture screenshot with time information added.
16 |
17 | When analyzing network packets, timestamp information is crucial for accurate logical reasoning. A time difference in the microsecond range between two duplicate packets suggests either a timeout retransmission or duplicate packet capture. In a typical LAN environment with a Round-trip Time (RTT) of around 100 microseconds, where TCP retransmissions require at least one RTT, a retransmission occurring at just 1/100th of the RTT likely indicates duplicate packet capture rather than an actual timeout retransmission.
18 |
19 | ## 1.2 Classic Case 2
20 |
21 | Another classic case illustrates the importance of logical reasoning in network problem analysis.
22 |
23 | One day, a business developer came rushing over, saying that a scheduled script using the MySQL database middleware had failed in the early morning hours with no response. Upon hearing about the problem, I checked the error logs of the MySQL database middleware but found no valuable clues. So, I asked the developers if they could reproduce the problem, knowing that once reproducible, a problem becomes easier to solve.
24 |
25 | The developers tried multiple times to reproduce the problem but were unsuccessful. However, they made a new discovery: they found that executing the same SQL queries during the day resulted in different response times compared to the early morning. They suspected that when the SQL response was slow, the MySQL database middleware was blocking the session and not returning results to the client.
26 |
27 | Based on this insight, the database operations team was asked to modify the script's SQL to simulate a slow SQL response. As a result, the MySQL database middleware returned the results without encountering the hang problem seen in the early morning hours.
28 |
29 | For a while, the root cause could not be identified, and in the meantime developers discovered a functional problem with the MySQL database middleware. As a result, developers and DBA operations became even more convinced that the MySQL database middleware was delaying responses. In reality, these problems were not related to the response times of the MySQL database middleware.
30 |
31 | From the events of the first day, the problem did indeed occur. Everyone involved tried to pinpoint the cause, making various guesses, but the true reason remained elusive.
32 |
33 | The next day, developers reported that the script problem reoccurred in the early morning, yet they couldn't reproduce it during the day. Developers, feeling pressured as the script was soon to be used online, complained about the situation. My only suggestion was for them to use the script during the day to avoid problems in the early morning. With all suspicions focused on the MySQL database middleware, it was challenging to analyze the problem from other perspectives.
34 |
35 | As a developer responsible for the MySQL database middleware, such mysterious problems cannot be easily overlooked. Ignoring them could impact subsequent use of the MySQL database middleware, and there is also pressure from leadership to solve the problem promptly. Finally, it was decided to implement a low-cost packet capture analysis solution: during the execution of the script in the early morning, packet captures would be performed on the server to analyze what was happening at that time. The goal was to determine if the MySQL database middleware either failed to send a response at all or if it did send a response that the client script did not receive. Once it could be confirmed that the MySQL database middleware did send a response, the problem would not be attributed to the MySQL database middleware developers.
36 |
37 | On the third day, developers reported that the early morning problem did not recur, and packet capture analysis confirmed this. After careful consideration, it seemed unlikely that the problem was solely with the MySQL database middleware: frequent occurrences in the early morning and rare occurrences during the day were puzzling. The only course of action was to wait for the problem to occur again and analyze it based on the packet captures.
38 |
39 | On the fourth day, the problem did not surface again.
40 |
41 | However, on the fifth day, the problem finally reappeared, bringing hope for resolution.
42 |
43 | The packet capture files were numerous. The developers were first asked to provide the timestamp when the problem occurred, and the extensive packet capture data was then searched to identify the SQL query that caused the problem. The final result is as follows:
44 |
45 | 
46 |
47 | Figure 3. Key packet information captured for problem resolution.
48 |
49 | From the packet capture content above (captured from the server), it appears that the SQL query was sent at 3 AM. The MySQL database middleware took 630 seconds (03:10:30.899249-03:00:00.353157) to return the SQL response to the client, indicating that the MySQL database middleware did indeed respond to the SQL query. However, just 238 microseconds later (03:10:30.899487-03:10:30.899249), the server's TCP layer received a reset packet, which is suspiciously quick. It's important to note that this reset packet cannot be immediately assumed to be from the client.
50 |
51 | Firstly, it is necessary to confirm who sent the reset packet—either it was sent by the client or by an intermediate device along the way. Since packet capture was performed only on the server side, information about the client's packet situation is not available. By analyzing the packet capture files from the server side and applying logical reasoning, the aim is to identify the root cause of the problem.
52 |
53 | If the assumption is made that the client sent a reset, it would imply that the client's TCP layer no longer recognizes the TCP state of this connection—transitioning from an established state to a nonexistent one. This change in TCP state would notify the client application of a connection problem, causing the client script to immediately error out. However, in reality, the client script is still waiting for the response to come back. Therefore, the assumption that the client sent a reset does not hold true—the client did not send a reset. The client's connection is still active, but on the server side, the corresponding connection has been terminated by the reset.
54 |
55 | Who sent the reset, then? The primary suspect is Amazon's cloud environment. Based on this packet capture analysis, the DBA operations queried Amazon customer service and received the following information:
56 |
57 | 
58 |
59 | Figure 4. Final response from Amazon customer service.
60 |
61 | Customer service's response aligns with the analysis results, indicating that Amazon's ELB (Elastic Load Balancer, similar to LVS) forcibly terminated the TCP session. According to their feedback, if a response exceeds the 350-second threshold (observed as 630 seconds in the packet capture), Amazon's ELB device sends a reset to the responding party (in this case, the server). The client scripts deployed by the developers did not receive the reset and mistakenly assumed the server connection was still active. The official recommendation for such cases is to use TCP keepalive mechanisms to mitigate these problems.
62 |
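For reference, enabling TCP keepalive on the client side of such long-lived connections might look like the following sketch on Linux (the values are illustrative only; the thresholds should stay well below the load balancer's idle timeout, 350 seconds in this case):

```c++
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable keepalive probes on an already-connected socket so that an idle
// connection is either kept alive or detected as dead before an intermediate
// device silently resets it.
void enable_keepalive(int fd) {
  int on = 1;
  setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
  int idle = 120;     // seconds of idleness before the first probe
  int interval = 30;  // seconds between probes
  int count = 3;      // failed probes before the connection is declared dead
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
}
```
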
63 | With the official response obtained, the problem was considered fully solved.
64 |
65 | This specific case illustrates how online problems can be highly complex, requiring the capture of critical information—in this instance, packet capture data—to understand the situation as it occurred. Through logical reasoning and the application of reductio ad absurdum, the root cause was identified.
--------------------------------------------------------------------------------
/new_group_replication.md:
--------------------------------------------------------------------------------
1 | # Comparing Performance with Traditional Group Replication
2 |
3 | The figure below compares TPC-C throughput against concurrency levels in different modes. The deployment setup is as follows: Both MySQL primary and secondary are deployed on the same machine, with NUMA binding isolation to prevent computational interference. Separate SSDs are used for the primary and secondary to ensure no I/O operation interference.
4 |
5 |
6 |
7 | Figure 1. Effects of the new Group Replication single-primary mode design.
8 |
9 | From the figure, it is evident that the new Group Replication single-primary mode design comprehensively outperforms the traditional mode of Group Replication.
10 |
11 | # References:
12 |
13 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
--------------------------------------------------------------------------------
/paxos.md:
--------------------------------------------------------------------------------
1 | # The Problems with Paxos Variant Algorithms
2 |
3 | ### 1 The Problems with Implementing Mencius in Group Replication
4 |
5 | Group Replication employs Mencius, a multi-leader state machine replication protocol derived from Paxos, at its core. Mencius is novel in that it not only partitions sequence numbers but also addresses key performance problems like adapting to changing client loads and asymmetric network bandwidth.
6 |
7 | Mencius achieves this by using a simplified version of consensus called simple consensus. This allows servers with low client load to skip their turns without needing majority agreement. By opportunistically piggybacking SKIP messages on other messages, Mencius enables servers to skip turns with minimal communication and computation overhead, allowing it to efficiently adapt to client and network load variance [1].
8 |
9 | Unfortunately, Group Replication did not adhere to the above design. The cost of waiting for SKIP information remains significant, leading to Group Replication experiencing potential throughput fluctuations and longer-than-expected response times, especially in cross-datacenter deployment scenarios.
10 |
11 | ### 2 Why Add Its Own Implementation of Multi-Paxos?
12 |
13 | The fact that MySQL introduced a new Multi-Paxos algorithm in addition to the existing Mencius algorithm indicates either an inadequacy in the Mencius implementation or inherent problems with the Mencius algorithm itself.
14 |
15 | Regarding the Mencius algorithm, the following aspects are particularly noteworthy [1]:
16 |
17 | *By opportunistically piggybacking SKIP messages on other messages, Mencius allows servers to skip turns with little or no communication and computation overhead. This allows Mencius to adapt inexpensively to client and network load variance.*
18 |
19 | It can be inferred that the Mencius algorithm performs well even under severe leader imbalance, as both theoretical validation and practical evidence support this. Therefore, the problems are likely due to an inadequate implementation of the Mencius algorithm.
20 |
21 | When there are no theoretical problems with the Mencius algorithm, introducing a new Multi-Paxos algorithm is not an elegant solution and brings several challenges:
22 |
23 | 1. **High Maintenance Cost**: Maintaining and testing two sets of codebases doubles the workload for this part.
24 | 2. **Regression Testing Challenges**: In practice, the new algorithm has led to several regression problems, some of which are difficult to address.
25 | 3. **Partial Problem-Solving**: The new algorithm may only partially address the requirements. In Group Replication's single-primary mode, it might not be universally applicable, as consistent read and write operations require all nodes to continuously communicate information.
26 |
27 | ### 3 The Specific Implementation of Paxos Skip Optimization
28 |
29 | First, let's investigate the performance problems of the MySQL Mencius algorithm implementation. The following figure illustrates the network interaction status when the Mencius algorithm operates stably with a network delay of 10ms:
30 |
31 | 
32 |
33 | Figure 1. Insights into the Mencius protocol from packet capture data.
34 |
35 | The green box in the figure indicates that the time interval between two consecutive Paxos instances reached 24ms. This suggests that MySQL's implementation of the Mencius algorithm does not complete consensus within a single round-trip time (RTT).
36 |
37 | Next, let's refer to the Mencius algorithm paper *"State Machine Replication for Wide Area Networks"* [2]. The specific details of the network testing environment are as follows:
38 |
39 | 
40 |
41 | From the green box, it is evident that the network latency tested in the paper is RTT=100ms. Let's now examine the relevant information on Paxos processing provided in the paper.
42 |
43 | 
44 |
45 | Figure 2. Consensus mechanism of the Mencius protocol as indicated in the Mencius paper.
46 |
47 | Based on the figure, it can be inferred that if only one node generates a request, the Mencius protocol consensus requires 100ms, equivalent to one round-trip time (RTT). This indicates that from the Group Replication primary node's perspective, Mencius consensus can be achieved within a single RTT. With the theoretical feasibility clarified, the following discussion will focus on optimizing Group Replication's Mencius communication.
48 |
49 | The theoretical basis for optimizing Mencius includes [1]:
50 |
51 | *Skipping is the core technique that makes Mencius efficient.*
52 |
53 | The specific Paxos network interaction diagram after Paxos skip optimization is shown in the following figure:
54 |
55 | 
56 |
57 | Figure 3. Mechanism of Paxos skip optimization.
58 |
59 | When a Paxos node receives an *accept_op* message from the Paxos leader and has no messages to propose itself, it can include skip information when sending the *ack_accept_op* message. This informs other nodes that the current node will not propose any messages in this round. During normal stable operation, every *accept_op* message can be handled this way by Paxos nodes.
60 |
61 | In the specific implementation, the impact of pipelining must also be considered. During the Paxos skip optimization process, it is necessary to record these skip actions to avoid interference between communications of different Paxos instances.
62 |
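A highly simplified sketch of the skip piggybacking idea is shown below (not the actual XCom code; the message and field names follow the description above, and retransmission, view changes, and leader failover are ignored):

```c++
#include <cstdint>
#include <set>

using instance_no = uint64_t;

struct AckAcceptOp {
  instance_no no;
  bool skip_my_turn;  // "I will not propose anything for my slot in this round"
};

struct PaxosNode {
  bool has_pending_client_request = false;
  std::set<instance_no> skipped;  // remember skips per instance (pipelining)

  AckAcceptOp on_accept_op(instance_no no) {
    AckAcceptOp ack{no, false};
    if (!has_pending_client_request) {
      ack.skip_my_turn = true;  // piggyback the skip on the ordinary ack
      skipped.insert(no);       // record it so concurrent instances do not interfere
    }
    return ack;
  }
};
```
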
63 | Finally, under a network delay scenario of 10ms, evaluating the effectiveness of Paxos skip optimization shows significant benefits. Here is a comparison of TPC-C throughput at different concurrency levels before and after Paxos skip optimization:
64 |
65 |
66 |
67 | Figure 4. Impact of Paxos skip optimization on BenchmarkSQL tests with 10ms latency.
68 |
69 | From the figure, it's clear that Paxos skip optimization significantly improves performance with a 10ms network latency. Extensive TPC-C testing confirms that this optimization improves performance for Group Replication, whether using a single primary or multiple primaries, and supports consistent reads and writes.
70 |
71 | Paxos skip optimization reduces code complexity by an order of magnitude compared to Multi-Paxos implementations with a single leader. It also minimizes regression testing problems and simplifies maintenance.
72 |
73 | Overall, leveraging theoretical and logical solutions elegantly addresses this problem more effectively than the current native MySQL implementation [3].
74 |
75 | ## References:
76 |
77 | [1] Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: building efficient replicated state machines for WANs. In Proc. 8th USENIX OSDI, pages 369--384, San Diego, CA, Dec. 2008.
78 |
79 | [2] Yanhua Mao. 2010. State Machine Replication for Wide Area Networks. Doctor of Philosophy in Computer Science. UNIVERSITY OF CALIFORNIA, SAN DIEGO.
80 |
81 | [3] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
82 |
83 |
--------------------------------------------------------------------------------
/paxos_log.md:
--------------------------------------------------------------------------------
1 | # Lack of Durability: Paxos Not Committed to Disk
2 |
3 | Below is Meta company engineers' view on Group Replication [1]:
4 |
5 | *Another significant and deliberate choice was to not use Group Replication offered by MySQL 5.7 onwards. While there are significant advancements offered by the protocol (such as the multi-primary mode), using Group Replication in our deployment presented significant challenges. It is based on a variant of Paxos which does not use a persistent log. Entries are written to the log only after they are considered committed via in-memory consensus rounds. The leader election algorithm is local and deterministic. It also doesn't consider lag when choosing a new leader and brief network blips trigger a computationally expensive recovery algorithm. This option could have worked but not without excessive engineering effort to fix the drawbacks.*
6 |
7 | The main problem with Group Replication, as outlined in the above content, is the lack of persistence in Paxos messages. The initial design background includes:
8 |
9 | 1. Suboptimal SSD hardware performance.
10 |
11 | 2. Paxos log persistence not meeting the requirements for Group Replication with multiple primaries.
12 |
13 | Without persistent storage in certification databases, Group Replication cannot use Paxos message persistence for crash recovery. After MySQL restarts, there is no corresponding certification database, making continued processing of persisted Paxos messages prone to inconsistent states.
14 |
15 | Based on Group Replication single-primary mode using SSDs, SysBench write-only tests were used to examine the impact of adding Paxos persistence on throughput. Please refer to the specific figure below:
16 |
17 |
18 |
19 | Figure 1. Performance overhead of Paxos log persistence in SysBench write-only tests.
20 |
21 | From the figure, it can be seen that after adding Paxos persistence, there is a slight decrease in performance under low concurrency, which is an expected result. However, under high concurrency, the difference is not significant. This is because, under high concurrency, Paxos uses a batching mechanism that allows multiple transaction records to be logged into the Paxos instance together, thereby reducing I/O pressure.
22 |
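The effect of batching on persistence cost can be sketched as follows (an illustration of the principle, not the actual Group Replication code): many pending transaction payloads are appended to the Paxos log with one write path and a single fsync, so the per-transaction I/O cost shrinks as the batch grows.

```c++
#include <cstdio>
#include <string>
#include <unistd.h>
#include <vector>

// Persist a whole batch of payloads with a single fsync.
void persist_batch(std::FILE *paxos_log, const std::vector<std::string> &payloads) {
  for (const auto &p : payloads) {
    std::fwrite(p.data(), 1, p.size(), paxos_log);  // buffered append per record
  }
  std::fflush(paxos_log);    // push the whole batch to the OS
  fsync(fileno(paxos_log));  // one durable write for the entire batch
}
```
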
23 | Next, let's examine the comparative response times.
24 |
25 |
26 |
27 | Figure 2. After adding Paxos log persistence, the average response time in SysBench write-only tests increased.
28 |
29 | The figure shows response times for 50 to 200 concurrent scenarios. The increase in average response time with Paxos log persistence is acceptable. SysBench write-only tests stress Group Replication significantly, while TPC-C tests, due to their read operations, reduce the write pressure on Group Replication. For comparisons based on Group Replication in single-primary mode, using SSDs, and BenchmarkSQL for TPC-C throughput at various concurrency levels, please refer to the figure below.
30 |
31 |
32 |
33 | Figure 3. Performance overhead of Paxos log persistence in BenchmarkSQL tests.
34 |
35 | The figure shows that, at low concurrency levels, the version with Paxos log persistence has slightly lower throughput, though the impact is much smaller compared to SysBench write-only tests. Based on the results of the various tests, it can be concluded that, under current SSD hardware conditions, employing Paxos log persistence is a viable solution [2].
36 |
37 | ## References:
38 |
39 | [1] Ritwik Yadav and Anirban Rahut. 2023. FlexiRaft: Flexible Quorums with Raft. The Conference on Innovative Data Systems Research (CIDR) (2023).
40 |
41 | [2] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
42 |
43 |
--------------------------------------------------------------------------------
/performance_degradation.md:
--------------------------------------------------------------------------------
1 | # Solved Performance Degradation in Query Execution Plans
2 |
3 | During secondary development on MySQL 8.0.27, TPC-C tests with BenchmarkSQL became unstable: throughput declined rapidly over time, which complicated optimization work. Because we trusted the official release, we initially overlooked this problem despite the difficulties it caused in testing.
4 |
5 | Only after a user reported a significant performance drop following an upgrade did we begin to take it seriously. This reliable feedback indicated that while MySQL 8.0.25 performed well, upgrading to MySQL 8.0.29 led to a substantial decline, confirming that there was a genuine performance problem.
6 |
7 | Simultaneously, it was confirmed that the performance degradation problem in MySQL 8.0.27 was the same as in MySQL 8.0.29. MySQL 8.0.27 had undergone two scalability optimizations specifically for trx-sys, which theoretically should have increased throughput. Reviewing the impact of latch sharding in trx-sys on performance:
8 |
9 |
10 |
11 | Figure 1. Impact of latch sharding in trx-sys under different concurrency levels.
12 |
13 | Let's continue examining the comparison of throughput and concurrency between trx-sys latch sharding optimization and the MySQL 8.0.27 release version. Specific details are shown in the following figure:
14 |
15 |
16 |
17 | Figure 2. Performance degradation in MySQL 8.0.27 release version.
18 |
19 | From the figure, it is evident that the performance degradation of the MySQL 8.0.27 release version is significant under low concurrency conditions, with a noticeable drop in peak performance. This aligns with user feedback regarding decreased throughput and is easily reproducible using BenchmarkSQL.
20 |
21 | The MySQL 8.0.27 release version already had this problem, whereas the earlier MySQL 8.0.25 release version did not. Using this information, the goal was to identify the specific git commit that caused the performance degradation. Finding the responsible commit is a complex process that typically involves binary search. After extensive testing, it was initially narrowed down to a specific commit. However, this commit contained tens of thousands of lines of code, making it nearly impossible to pinpoint the exact segment causing the problem. It was later discovered that this commit was a collective merge from a particular branch, which allowed for further breakdown and ultimately identified the following commit as the root cause:
22 |
23 | ```c++
24 | commit 9a13c1c6971f4bd56d143179ecfb34cca8ecc018
25 | Author: Steinar H. Gunderson
26 | Date: Tue Jun 8 15:14:35 2021 +0200
27 |
28 | Bug #32976857: REMOVE QEP_TAB_STANDALONE [range optimizer, noclose]
29 |
30 | Remove the QEP_TAB dependency from test_quick_select() (ie., the range
31 | optimizer).
32 |
33 | Change-Id: Ie0fcce71dfc813920711c43c3d62635dae0d7d20
34 | ```
35 |
36 | Using the commit information, two versions were compiled, and SQL queries that performed exceptionally slowly in TPC-C tests were identified. The execution plans of these slow SQL queries were analyzed using *'explain'*. Specific details are shown in the following figure:
37 |
38 | 
39 |
40 | Figure 3. Abnormalities indicated by rows in *'explain'*.
41 |
42 | From the figure, it can be seen that most of the execution plans are identical, except for the *'rows'* column. In the normal version, the *'rows'* column shows just over 200, whereas in the problematic version, it shows over 1,000,000. After continuously simplifying the SQL, a highly representative SQL query was finally identified. Specific details are shown in the following figure:
43 |
44 | 
45 |
46 | Figure 4. Significant discrepancies between SQL execution results and *'explain'* output.
47 |
48 | Based on the *Filter* information obtained from '*explain*', the last query shown in the figure was constructed. The figure reveals that while the last query returned only 193 rows, '*explain*' displayed over 1.17 million rows for *'rows'*. This discrepancy highlights a complex problem, as execution plans are not always fully understood by all MySQL developers. Fortunately, identifying the commit responsible for the performance degradation provided a critical foundation for solving the problem. Although solving the problem was relatively straightforward with this information, analyzing the root cause from the SQL statement itself proved to be far more challenging.
49 |
50 | Let's continue with an in-depth analysis of this problem. The following figure displays the '*explain*' result for a specific SQL query:
51 |
52 | 
53 |
54 | Figure 5. Sample SQL query representing the problem.
55 |
56 | From the figure, it can be seen that the number of rows is still large, indicating that this SQL query is representative.
57 |
58 | Two different debug versions of MySQL were compiled: one with anomalies and one normal. Debug versions were used to capture useful function call relationships through debug traces. When executing the problematic SQL statement on the version with anomalies, the relevant debug trace information is as follows:
59 |
60 | 
61 |
62 | Figure 6. Debug trace information for the abnormal version.
63 |
64 | Similarly, for the normal version, the relevant debug trace information is as follows:
65 |
66 | 
67 |
68 | Figure 7. Debug trace information for the normal version.
69 |
70 | Comparing the two figures above, it is noticeable that the normal version includes additional content within the green box, indicating that conditions are applied in the normal version, whereas the abnormal version lacks these conditions. To understand why the abnormal version is missing these conditions, it is necessary to add additional trace information in the *get_full_func_mm_tree* function to capture specific details about the cause of this difference.
71 |
72 | After adding extra trace information, the debug trace result for the abnormal version is as follows:
73 |
74 | 
75 |
76 | Figure 8. Supplementary debug trace information for the abnormal version.
77 |
78 | The debug trace result for the normal version is as follows:
79 |
80 | 
81 |
82 | Figure 9. Supplementary debug trace information for the normal version.
83 |
84 | Upon comparing the two figures above, significant differences are observed. In the normal version, the value of *param_comp* is 16140901064495857660, while in the abnormal version, it is 16140901064495857661, differing by 1. To understand this discrepancy, let's first examine how the *param_comp* value is calculated, as detailed in the following code snippet:
85 |
86 | ```c++
87 | static SEL_TREE *get_full_func_mm_tree(THD *thd, RANGE_OPT_PARAM *param,
88 | table_map prev_tables,
89 | table_map read_tables,
90 | table_map current_table,
91 | bool remove_jump_scans, Item *predicand,
92 | Item_func *op, Item *value, bool inv) {
93 | SEL_TREE *tree = nullptr;
94 | SEL_TREE *ftree = nullptr;
95 | const table_map param_comp = ~(prev_tables | read_tables | current_table);
96 | DBUG_TRACE;
97 | ...
98 | ```
99 |
100 | From the code, it's evident that *param_comp* is calculated using a bitwise OR operation on three variables, followed by a bitwise NOT operation. The difference of 1 suggests that at least one of these variables differs, helping to narrow down the problem.
101 |
102 | The calculation involves three *table_map* variables whose values are long 64-bit masks, so the arithmetic is tedious to reproduce by hand and is not detailed here.
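To illustrate the reasoning with small, made-up *table_map* values (not the actual 64-bit values from the traces), the snippet below shows why a difference of exactly 1 in *param_comp* means that one of the three inputs lost its lowest table bit:

```c++
#include <cstdint>
#include <cstdio>

int main() {
  using table_map = uint64_t;
  const table_map current_table = 0x2;  // hypothetical bit for the current table

  // Normal version: the const/read table bit (bit 0) is passed in.
  const table_map prev_tables = 0x0, read_tables = 0x1;
  const table_map normal = ~(prev_tables | read_tables | current_table);

  // Abnormal version: the same call arrives without bit 0.
  const table_map read_tables_missing = 0x0;
  const table_map abnormal = ~(prev_tables | read_tables_missing | current_table);

  // abnormal - normal == 1, matching the difference seen in Figures 8 and 9.
  std::printf("normal   = %llu\nabnormal = %llu\ndiff     = %llu\n",
              (unsigned long long)normal, (unsigned long long)abnormal,
              (unsigned long long)(abnormal - normal));
  return 0;
}
```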
103 |
104 | The key point is that debug tracing revealed critical differences. Coupled with the information provided by identifying the Git commit responsible for the performance discrepancy, analyzing the root cause is no longer difficult.
105 |
106 | Here is the final fix patch, detailed as follows:
107 |
108 | 
109 |
110 | Figure 10. Final patch for solving performance degradation in query execution plans.
111 |
112 | The patch reintroduces the *const_table* and *read_tables* variables (related to the *table_map* variables discussed above) when calling the *test_quick_select* function. This ensures that filtering conditions in the execution plan are not overlooked.
113 |
114 | After applying the above patch to MySQL 8.0.27, the performance degradation problem was solved. A test comparing TPC-C throughput at various concurrency levels, both before and after applying the patch, was conducted. Specific details are shown in the following figure:
115 |
116 |
117 |
118 | Figure 11. Effects of the patch on solving performance degradation.
119 |
120 | From the figure, it is evident that after applying the patch, throughput and peak performance have significantly improved under low concurrency conditions. However, under high concurrency conditions, throughput not only failed to increase but actually decreased, likely due to scalability bottlenecks in MVCC ReadView.
121 |
122 | After addressing the MVCC ReadView scalability problem, the impact of this patch was reassessed, as detailed in the figure below:
123 |
124 |
125 |
126 | Figure 12. Actual effects of the patch after addressing the MVCC ReadView scalability problem.
127 |
128 | From the figure, it is evident that this patch has significantly improved MySQL's throughput. This case demonstrates that scalability problems can disrupt certain optimizations. To scientifically assess the effectiveness of an optimization, it is essential to address most scalability problems beforehand to achieve a more accurate evaluation.
129 |
130 | Finally, let's examine the results of the long-term stability testing for TPC-C. The following figure shows the results of an 8-hour test under 100 concurrency, with throughput captured at the end of each hour n (where 1 ≤ n ≤ 8).
131 |
132 |
133 |
134 | Figure 13. Comparison of stability tests: MySQL 8.0.27 vs. improved MySQL 8.0.27.
135 |
136 | From the figure, it is evident that after applying the patch, the rate of throughput decline has been significantly mitigated. The MySQL 8.0.27 version experienced a dramatic throughput decline, failing to meet the stability requirements of TPC-C testing. However, after applying the patch, MySQL's performance returned to normal.
137 |
138 | Addressing this problem directly presents considerable challenges, particularly for MySQL developers unfamiliar with query execution plans. Using logical reasoning and a systematic approach to identify and address code differences before and after the problem arose is a more elegant problem-solving method, though it is complex.
139 |
140 | It is noteworthy that no regression testing problems were encountered after applying the patch, demonstrating high stability and providing a solid foundation for future performance improvements. Currently, MySQL 8.0.38 still hasn't solved this problem, suggesting potential shortcomings in MySQL's testing system. Given the complexity of MySQL databases, users should exercise caution when upgrading and consider using tools like TCPCopy [2] to avoid potential regression testing problems.
141 |
142 | ## References:
143 |
144 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
145 |
146 | [2] https://github.com/session-replay-tools/tcpcopy.
--------------------------------------------------------------------------------
/pgo.md:
--------------------------------------------------------------------------------
1 | # PGO
2 |
3 | Profile-guided optimization (PGO) typically improves program execution efficiency. The following figure illustrates how PGO improves the throughput of a standalone MySQL instance under various concurrency levels, following the resolution of MySQL MVCC ReadView scalability problems.
4 |
5 |
6 |
7 | Figure 1. Impact of PGO after solving MVCC ReadView scalability problems.
8 |
9 | From the figure, it is evident that PGO has a notable impact.
10 |
11 | For MySQL 8.0.27 with PGO, throughput decreases under high concurrency conditions. The specific details are shown in the figure below:
12 |
13 |
14 |
15 | Figure 2. Performance comparison tests before and after using PGO in MySQL 8.0.27.
16 |
17 | The results above indicate that, for PGO to deliver its full potential in MySQL, scalability problems must be addressed first. It should be noted that both comparative tests above were conducted in mainstream NUMA environments. When MySQL is bound to a single NUMA node, creating an SMP environment, the following figure shows the relationship between TPC-C throughput and concurrency levels before and after PGO.
18 |
19 |
20 |
21 | Figure 3. Performance comparison tests before and after using PGO in MySQL 8.0.27 under SMP.
22 |
23 | From the figure, it can be seen that PGO consistently improves throughput in SMP environments, without decreasing as concurrency levels increase. The following figure compares the performance improvement of PGO between NUMA and SMP environments.
24 |
25 |
26 |
27 | Figure 4. Performance of PGO optimization in different environments.
28 |
29 | From the figure, it is evident that PGO achieves a maximum performance improvement of up to 30% in SMP environments, whereas in NUMA environments, the performance improvement decreases as concurrency increases. This suggests that PGO has greater potential in SMP environments.
30 |
31 | Continuing the analysis, let's examine how PGO performs in a Group Replication cluster environment compared to a standalone MySQL instance. The following diagram depicts a simplified queue model of Group Replication.
32 |
33 | 
34 |
35 | Figure 5. A simplified queue model of Group Replication.
36 |
37 | Because PGO cannot optimize the network portion, the primary's local processing accounts for a smaller share of total transaction time than in a standalone MySQL instance. According to Amdahl's Law, the performance gains from PGO will therefore be less pronounced than for a standalone instance; generally, as network latency increases, the improvement from PGO diminishes.
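As a rough illustration (the numbers below are assumptions chosen only for the calculation, not measurements): if PGO speeds up the primary's local CPU portion by a factor of *s*, and that portion accounts for a fraction *p* of end-to-end transaction latency, Amdahl's Law gives:

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + p/s}
```

For example, with p = 0.6 and s = 1.3 (a 30% local gain), the overall speedup is about 1.16, i.e. roughly a 16% end-to-end improvement; as network latency grows, p shrinks and the visible benefit shrinks with it.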
38 |
39 | The following figure compares the throughput improvement of a standalone MySQL instance and Group Replication using PGO.
40 |
41 |
42 |
43 | Figure 6. PGO Performance Improvement in Group Replication vs. Standalone MySQL.
44 |
45 | From the figure, it can be observed that the performance improvement from PGO in a Group Replication cluster environment is generally less than that of a standalone MySQL instance.
46 |
47 | In conclusion, PGO can be summarized as follows:
48 |
49 | 1. For MySQL, PGO is a worthwhile optimization that theoretically improves performance comprehensively, especially in SMP environments.
50 | 2. In NUMA environments, addressing scalability problems is necessary to achieve significant benefits from PGO.
51 | 3. PGO is less effective in a Group Replication cluster compared to a standalone MySQL instance.
52 |
53 | # References:
54 |
55 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
56 |
57 |
--------------------------------------------------------------------------------
/professional_journey.md:
--------------------------------------------------------------------------------
1 | # A Brief Overview of My Professional Journey
2 |
3 | In 2004, I transitioned from civil engineering to computer science. Despite having a weak foundation and facing a steep learning curve, I ultimately managed to meet the minimum academic standards.
4 |
5 | When I began searching for a job, I encountered numerous challenges. Interviews revealed how crucial algorithm skills were, and as graduation approached, I landed an algorithm-related role with leaders from Tsinghua University. To improve, I immersed myself in MIT’s English algorithm courses, slowly overcoming my fear of algorithms.
6 |
7 | In 2008, I joined NetEase as an algorithm engineer. Though the project I worked on didn’t succeed, it strengthened my foundational skills. By 2009, I was part of a team tasked with developing NetEase’s advertising system to replace Google’s DoubleClick. The high technical demands required innovative solutions. Fortunately, a colleague built a simple version of *tcpcopy* (only 300 lines of code), which allowed us to avoid hundreds of potential problems. For five years, under my management, this advertising system, built from scratch, never experienced a major failure. If ChatGPT helps bridge the communication gap between non-English and English-speaking countries, then *tcpcopy* shortens the gap between developers and production issues. Both are powerful tools that simplify the resolution of complex challenges.
8 |
9 | In 2011, I took on the task of making *tcpcopy* a generalized solution. Despite my limited knowledge of TCP, through logical reasoning and four years of effort, I successfully transformed it and made it open-source. While *tcpcopy* was adopted by many companies domestically, focusing solely on this project meant I missed certain opportunities for financial freedom.
10 |
11 | By 2015, with *tcpcopy* nearly complete, I shifted my focus to databases. In 2016, I entered the MySQL middleware field, gradually deepening my understanding of database systems. By 2019, I fully transitioned into MySQL development, solving numerous challenges and discovering new potential in myself.
12 |
13 | On July 31, 2024, I voluntarily left my job to fully dedicate myself to writing a book. This book encapsulates 20 years of experience and focuses on the art of elegantly solving problems. Though I’ve written only 50,000 lines of effective code throughout my career, 90% of my time was spent solving problems, with logical reasoning being my strongest asset.
14 |
15 | Over these two decades, my ability to think logically has continually enhanced my foundational knowledge, turning me into an expert problem solver. Solving problems is far more valuable than writing code, and successfully overcoming difficult challenges is what makes someone stand out in a company.
16 |
17 | In the future, I plan to share the insights from this book with the MySQL community, as I believe it offers a clear guide to optimizing MySQL performance and improvements.
--------------------------------------------------------------------------------
/scalability.md:
--------------------------------------------------------------------------------
1 | # How can the scalability of MySQL be improved?
2 |
3 | ## Current state of MySQL 5.7
4 |
5 | MySQL 5.7 is not ideal in terms of scalability. The following figure illustrates the relationship between TPC-C throughput and concurrency in MySQL 5.7.39 under a specific configuration. This includes setting the transaction isolation level to Read Committed and adjusting the *innodb_spin_wait_delay* parameter to mitigate throughput degradation.
6 |
7 |
8 |
9 | Figure 1. Scalability problems in MySQL 5.7.39 during BenchmarkSQL testing.
10 |
11 | From the figure, it is evident that scalability problems significantly limit the increase in MySQL throughput. For example, beyond 100 concurrency, the throughput begins to decline.
12 |
13 | To address the aforementioned performance collapse issue, Percona's thread pool was employed. The following figure illustrates the relationship between TPC-C throughput and concurrency after configuring the Percona thread pool.
14 |
15 |
16 |
17 | Figure 2. Percona thread pool mitigates scalability problems in MySQL 5.7.39.
18 |
19 | Although the thread pool introduces some overhead and peak performance has decreased, it has mitigated the issue of performance collapse under high concurrency.
20 |
21 | ## Current state of MySQL 8.0
22 |
23 | Let's take a look at the efforts MySQL 8.0 has made regarding scalability.
24 |
25 | ### Redo Log Optimization
26 |
27 | The first major improvement is redo log optimization [3].
28 |
29 | ```c++
30 | commit 6be2fa0bdbbadc52cc8478b52b69db02b0eaff40
31 | Author: Paweł Olchawa
32 | Date: Wed Feb 14 09:33:42 2018 +0100
33 |
34 | WL#10310 Redo log optimization: dedicated threads and concurrent log buffer.
35 |
36 | 0. Log buffer became a ring buffer, data inside is no longer shifted.
37 | 1. User threads are able to write concurrently to log buffer.
38 | 2. Relaxed order of dirty pages in flush lists - no need to synchronize
39 | the order in which dirty pages are added to flush lists.
40 | 3. Concurrent MTR commits can interleave on different stages of commits.
41 | 4. Introduced dedicated log threads which keep writing log buffer:
42 | * log_writer: writes log buffer to system buffers,
43 | * log_flusher: flushes system buffers to disk.
44 | As soon as they finished writing (flushing) and there is new data to
45 | write (flush), they start next write (flush).
46 | 5. User threads no longer write / flush log buffer to disk, they only
47 | wait by spinning or on event for notification. They do not have to
48 | compete for the responsibility of writing / flushing.
49 | 6. Introduced a ring buffer of events (one per log-block) which are used
50 | by user threads to wait for written/flushed redo log to avoid:
51 | * contention on single event
52 | * false wake-ups of all waiting threads whenever some write/flush
53 | has finished (we can wake-up only those waiting in related blocks)
54 | 7. Introduced dedicated notifier threads not to delay next writes/fsyncs:
55 | * log_write_notifier: notifies user threads about written redo,
56 | * log_flush_notifier: notifies user threads about flushed redo.
57 | 8. Master thread no longer has to flush log buffer.
58 | ...
59 | 30. Mysql test runner received a new feature (thanks to Marcin):
60 | --exec_in_background.
61 | Review: RB#15134
62 | Reviewers:
63 | - Marcin Babij ,
64 | - Debarun Banerjee .
65 | Performance tests:
66 | - Dimitri Kravtchuk ,
67 | - Daniel Blanchard ,
68 | - Amrendra Kumar .
69 | QA and MTR tests:
70 | - Vinay Fisrekar .
71 | ```
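The essence of the change is that user threads stop writing and flushing the log themselves and only wait for a dedicated writer to catch up. The sketch below illustrates that division of labor only; it is a simplification and not the InnoDB implementation (for instance, real user threads copy into the ring buffer concurrently without a single mutex, and the writer waits on events rather than sleeping):

```c++
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class MiniRedoLog {
 public:
  // User thread: hand over a record and get back its LSN.
  uint64_t append(const std::string &rec) {
    std::lock_guard<std::mutex> lk(m_);
    buffer_.push_back(rec);
    return ++reserved_lsn_;
  }

  // User thread: wait until the dedicated writer has flushed up to `lsn`.
  void wait_flushed(uint64_t lsn) {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return flushed_lsn_ >= lsn; });
  }

  // Dedicated writer thread: drain the buffer, write/fsync, publish the LSN.
  void writer_loop(const std::atomic<bool> &stop) {
    while (!stop) {
      std::vector<std::string> batch;
      uint64_t upto;
      {
        std::lock_guard<std::mutex> lk(m_);
        batch.swap(buffer_);
        upto = reserved_lsn_;
      }
      if (batch.empty()) {  // a real implementation waits on an event instead
        std::this_thread::sleep_for(std::chrono::microseconds(50));
        continue;
      }
      // ... write `batch` to the OS page cache and fsync it here ...
      {
        std::lock_guard<std::mutex> lk(m_);
        flushed_lsn_ = upto;
      }
      cv_.notify_all();  // wake every user thread waiting in related range
    }
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<std::string> buffer_;
  uint64_t reserved_lsn_ = 0;
  uint64_t flushed_lsn_ = 0;
};
```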
72 |
73 | A test comparing TPC-C throughput with different levels of concurrency before and after optimization was conducted. Specific details are shown in the following figure:
74 |
75 |
76 |
77 | Figure 3. Impact of redo log optimization under different concurrency levels.
78 |
79 | The results in the figure show a significant improvement in throughput at a concurrency level of 100.
80 |
81 | ### Optimizing Lock-Sys Through Latch Sharding
82 |
83 | The second major improvement is lock-sys optimization [5].
84 |
85 | ```c++
86 | commit 1d259b87a63defa814e19a7534380cb43ee23c48
87 | Author: Jakub Łopuszański
88 | Date: Wed Feb 5 14:12:22 2020 +0100
89 |
90 | WL#10314 - InnoDB: Lock-sys optimization: sharded lock_sys mutex
91 |
92 | The Lock-sys orchestrates access to tables and rows. Each table, and each row,
93 | can be thought of as a resource, and a transaction may request access right for
94 | a resource. As two transactions operating on a single resource can lead to
95 | problems if the two operations conflict with each other, Lock-sys remembers
96 | lists of already GRANTED lock requests and checks new requests for conflicts in
97 | which case they have to start WAITING for their turn.
98 |
99 | Lock-sys stores both GRANTED and WAITING lock requests in lists known as queues.
100 | To allow concurrent operations on these queues, we need a mechanism to latch
101 | these queues in safe and quick fashion.
102 |
103 | In the past a single latch protected access to all of these queues.
104 | This scaled poorly, and the managment of queues become a bottleneck.
105 | In this WL, we introduce a more granular approach to latching.
106 |
107 | Reviewed-by: Pawel Olchawa
108 | Reviewed-by: Debarun Banerjee
109 | RB:23836
110 | ```
111 |
112 | Using BenchmarkSQL to compare TPC-C throughput across concurrency levels before and after the lock-sys optimization, the specific results are shown in the following figure:
113 |
114 |
115 |
116 | Figure 4. Impact of lock-sys optimization under different concurrency levels.
117 |
118 | From the figure, it can be seen that optimizing lock-sys significantly improves throughput under high concurrency conditions, while the effect is less pronounced under low concurrency due to fewer conflicts.
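The idea behind the optimization can be sketched as follows (a minimal illustration of latch sharding, not the actual InnoDB lock_sys code): instead of one global mutex guarding every lock queue, an array of mutexes is used and the shard is chosen by hashing the resource identifier, so requests touching unrelated rows or tables no longer serialize on a single latch.

```c++
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

class ShardedLatch {
 public:
  static constexpr std::size_t kShards = 64;

  // Pick the shard that guards the queue of this resource (row/table id).
  std::mutex &shard_for(uint64_t resource_id) {
    return shards_[resource_id % kShards];
  }

  // Run `fn` while holding only the shard latch for this resource.
  template <class Fn>
  void with_lock(uint64_t resource_id, Fn &&fn) {
    std::lock_guard<std::mutex> guard(shard_for(resource_id));
    fn();
  }

 private:
  std::array<std::mutex, kShards> shards_;
};
```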
119 |
120 | ### Latch Sharding for trx-sys
121 |
122 | The third major improvement is latch sharding for trx-sys.
123 |
124 | ```c++
125 | commit bc95476c0156070fd5cedcfd354fa68ce3c95bdb
126 | Author: Paweł Olchawa
127 | Date: Tue May 25 18:12:20 2021 +0200
128 |
129 | BUG#32832196 SINGLE RW_TRX_SET LEADS TO CONTENTION ON TRX_SYS MUTEX
130 |
131 | 1. Introduced shards, each with rw_trx_set and dedicated mutex.
132 | 2. Extracted modifications to rw_trx_set outside its original critical sections
133 | (removal had to be extracted outside trx_erase_lists).
134 | 3. Eliminated allocation on heap inside TrxUndoRsegs.
135 | 4. [BUG-FIX] The trx->state and trx->start_time became converted to std::atomic<>
136 | fields to avoid risk of torn reads on egzotic platforms.
137 | 5. Added assertions which ensure that thread operating on transaction has rights
138 | to do so (to show there is no possible race condition).
139 |
140 | RB: 26314
141 | Reviewed-by: Jakub Łopuszański jakub.lopuszanski@oracle.com
142 | ```
143 |
144 | Using BenchmarkSQL to compare TPC-C throughput across concurrency levels before and after this optimization, the specific results are shown in the following figure:
145 |
146 |
147 |
148 | Figure 5. Impact of latch sharding in trx-sys under different concurrency levels.
149 |
150 | From the figure, it can be seen that this improvement significantly enhances TPC-C throughput, reaching its peak at 200 concurrency. It is worth noting that the impact diminishes at 300 concurrency, primarily due to ongoing scalability problems in the trx-sys subsystem related to MVCC ReadView.
151 |
152 | ## Refining MySQL 8.0
153 |
154 | The remaining improvements are our independent enhancements.
155 |
156 | ### Enhancements to MVCC ReadView
157 |
158 | The first major improvement is the enhancement of the MVCC ReadView data structure [1].
159 |
160 | Performance comparison tests were conducted to evaluate the effectiveness of the MVCC ReadView optimization. The figure below shows a comparison of TPC-C throughput with varying concurrency levels, before and after modifying the MVCC ReadView data structure.
161 |
162 |
163 |
164 | Figure 6. Performance comparison before and after adopting the new hybrid data structure in NUMA.
165 |
166 | From the figure, it is evident that this transformation primarily optimized scalability and improved MySQL's peak throughput in NUMA environments.
167 |
168 | ### Avoiding Double Latch Problems
169 |
170 | The second major improvement we made is addressing the double latch problem, where 'double latch' refers to the fact that both view_open and view_close acquire the global trx-sys latch [1].
171 |
172 | Using the MVCC ReadView-optimized version as the baseline, TPC-C throughput was compared before and after the double-latch modifications. Details are shown in the following figure:
173 |
174 |
175 |
176 | Figure 7. Performance improvement after eliminating the double latch bottleneck.
177 |
178 | From the figure, it is evident that the modifications significantly improved scalability under high-concurrency conditions.
179 |
180 | ### Transaction Throttling Mechanism
181 |
182 | The final improvement is the implementation of a transaction throttling mechanism to guard against performance collapse under extreme concurrency [1] [2] [4].
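The core idea can be sketched as a simple admission control gate (hypothetical code, not the actual implementation): at most a fixed number of user threads are allowed into the transaction system at once, and the rest wait at the entrance instead of piling up on internal latches.

```c++
#include <condition_variable>
#include <mutex>

class TrxAdmission {
 public:
  explicit TrxAdmission(int max_active) : slots_(max_active) {}

  // Called before a thread enters the transaction system.
  void enter() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return slots_ > 0; });
    --slots_;
  }

  // Called after the transaction finishes.
  void leave() {
    {
      std::lock_guard<std::mutex> lk(m_);
      ++slots_;
    }
    cv_.notify_one();
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  int slots_;
};

// Usage (conceptual): TrxAdmission throttle(512);
// throttle.enter(); /* execute the transaction */ throttle.leave();
```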
183 |
184 | The following figure depicts the TPC-C scalability stress test conducted after implementing transaction throttling. The test was performed with NUMA disabled in the BIOS, limiting entry into the transaction system to at most 512 user threads.
185 |
186 |
187 |
188 | Figure 8. Maximum TPC-C throughput in BenchmarkSQL with transaction throttling mechanisms.
189 |
190 | From the figure, it is evident that implementing transaction throttling mechanisms significantly improves MySQL's scalability.
191 |
192 | ## Summary
193 |
194 | Overall, it is entirely feasible for MySQL to maintain performance without collapse at tens of thousands of concurrent connections in low-conflict scenarios of BenchmarkSQL TPC-C testing.
195 |
196 | # References:
197 |
198 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
199 |
200 | [2] https://dev.mysql.com/blog-archive/the-new-mysql-thread-pool/.
201 |
202 | [3] Paweł Olchawa. 2018. MySQL 8.0: New Lock free, scalable WAL design. dev.mysql.com/blog-archive.
203 |
204 | [4] Xiangyao Yu. An evaluation of concurrency control with one thousand cores. PhD thesis, Massachusetts Institute of Technology, 2015.
205 |
206 | [5] https://dev.mysql.com/doc/refman/8.0/en/.
207 |
--------------------------------------------------------------------------------
/static_array.md:
--------------------------------------------------------------------------------
1 | # The Powerful Impact of Static Arrays in MySQL Modifications
2 |
3 | First, let's clarify the terms "*sequence_number*" and "*last_committed*":
4 |
5 | - **sequence_number**: This is an automatically incremented value used to track the order of transactions during Group Replication operation. Each transaction is assigned a unique *sequence_number* during operation.
6 | - **last_committed**: This value indicates the sequence number of the last committed transaction that a new transaction depends on. For a transaction to proceed during replay on a MySQL secondary, the transaction must wait until the one with a *sequence_number* equal to *last_committed* has been fully replayed.
7 |
8 | For example, in the transaction highlighted in the green box below, with *sequence_number=12759* and *last_committed=12757*, the *last_committed=12757* indicates that the transaction with *sequence_number=12759* must wait until the transaction with *sequence_number=12757* has been fully replayed before it can proceed.
9 |
10 | 
11 |
12 | Figure 1. Typical examples of *sequence_number* and *last_committed*.
13 |
14 | Once *sequence_number* and *last_committed* are understood, the calculation of the *last_committed* value can be explored. Typically, this value is derived from the transaction's writeset, which details the rows modified by the transaction. Each row in the writeset is represented by a key corresponding to a table row. In the writeset:
15 |
16 | - For update operations, there are two elements with the same key.
17 | - For insert and delete operations, there is one element.
18 | - The writeset for DDL transactions is empty, indicating that DDL operations must be replayed serially.
19 |
20 | In Group Replication, when processing a transaction's writeset, the applier thread examines the certification database for transactions that have modified the same records as those in the writeset. If such transactions are found, the applier thread determines the latest *sequence_number* that is smaller than the current transaction's *sequence_number*, which becomes the transaction's *last_committed* value. This ensures transactions are replayed in the correct order to maintain data consistency.
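The rule described above can be sketched as follows (a greatly simplified, hypothetical illustration; the real certification database also performs conflict detection and garbage collection): the database maps each row key to the sequence_number of the last transaction that modified it, and last_committed is the largest such value found across the current transaction's writeset.

```c++
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

uint64_t certify(std::unordered_map<std::string, uint64_t> &cert_db,
                 const std::vector<std::string> &writeset,
                 uint64_t sequence_number) {
  uint64_t last_committed = 0;
  for (const std::string &key : writeset) {
    auto it = cert_db.find(key);
    if (it != cert_db.end() && it->second > last_committed) {
      // A previous transaction touched this row; we must replay after it.
      last_committed = it->second;
    }
    // Record this transaction as the latest writer of the row.
    cert_db[key] = sequence_number;
  }
  return last_committed;
}
```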
21 |
22 | Before diving deeper into the analysis, let's review what the applier thread does:
23 |
24 | 1. Calculating *last_committed* based on the certification database.
25 | 2. Writing transaction events to the relay log file.
26 |
27 | Below is a flame graph generated from capturing performance data of the applier thread:
28 |
29 | 
30 |
31 | Figure 2. Flame graph of performance data for the applier thread.
32 |
33 | From the flame graph, it is evident that the *'add_item'* operation in the certification database consumes 29.80% of the computation time, with half of this time spent on hash table operations. The inefficiency of the hash table results in high CPU resource consumption for calculating last_committed, and delays in writing transaction events to disk.
34 |
35 | To address this bottleneck and improve disk write performance, the hash table's overhead must be reduced. Since direct improvements to the hash table are challenging, a new data structure is necessary.
36 |
37 | Based on the design of Group Replication in single-primary mode, a redesigned data structure has been developed to replace the previous hash table approach in the certification database. This new structure aims to eliminate delays in calculating last_committed and ensure timely writing of transaction events to disk. See the specific code below for the new data structure:
38 |
39 | ```c++
40 | #define REPLAY_CAL_HASH_ITEMS (REPLAY_CAL_HASH_ITEM / 8)
41 | #define MAX_RELATIVE_SEQUENCE_NUMBER 65535
42 | #define REPLAY_CAL_ARRAY 65536
43 | #define REPLAY_CAL_HASH_ITEM 4088
44 | typedef struct {
45 | int number;
46 | int size;
47 | unsigned char values[REPLAY_CAL_HASH_ITEM];
48 | } replay_cal_hash_item;
49 |
50 | class Certifier : public Certifier_interface {
51 | ...
52 | private:
53 | replay_cal_hash_item replayed_cal_array[REPLAY_CAL_ARRAY];
54 | ...
55 | ```
56 |
57 | To store the information necessary for calculating *last_committed*, a static array named *replayed_cal_array* is used. This array contains 65,536 elements, each representing a bucket slot with a *replay_cal_hash_item*. The *replay_cal_hash_item* structure includes:
58 |
59 | - **number**: Indicates the current count of elements within the *replay_cal_hash_item*, tracking how many elements are in use.
60 | - **size**: Specifies the maximum capacity of the *replay_cal_hash_item*, defining the upper limit of elements it can accommodate.
61 | - **values**: An array of 4,088 unsigned char elements that stores data.
62 |
63 | The **values** member is used to store 511 entries, with each entry occupying 8 bytes. Each entry consists of:
64 |
65 | - **Key Value**: 6 bytes.
66 | - **Relative Sequence Number**: 2 bytes.
67 |
68 | For specific details, refer to the figure below [1]:
69 |
70 | 
71 |
72 | Figure 3. A new data structure suitable for calculating last_committed.
73 |
74 | The *key* undergoes base64 conversion into an 8-byte integer. This 8-byte integer is divided as follows:
75 |
76 | - **Index for replayed_cal_array**: The first two bytes serve as an index for the *replayed_cal_array*.
77 | - **Value**: The remaining six bytes are stored in the first six bytes of each 8-byte entry.
78 |
79 | Regarding the storage of *sequence_number*:
80 |
81 | - Only the relative value of the *sequence_number* is stored, calculated as the current *sequence_number* minus a base sequence value.
82 | - Instead of requiring 8 bytes, this relative *sequence_number* needs only 2 bytes.
83 | - This 2-byte relative *sequence_number* is stored in the last two bytes of each 8-byte entry.
84 |
85 | This setup optimizes storage by using a compact representation of the *key* and storing only the necessary relative *sequence_number*, ensuring efficient memory use within the *replay_cal_hash_item* structure.
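A sketch of the entry layout described above (hypothetical helper, shown for a little-endian machine; the exact byte order is an implementation detail): the 8-byte key hash is split into a 2-byte bucket index for *replayed_cal_array* and a 6-byte remainder, and the 2-byte relative sequence number completes the 8-byte entry stored in *values*.

```c++
#include <cstdint>
#include <cstring>

// One 8-byte entry inside replay_cal_hash_item::values:
// bytes 0..5 hold the 6-byte key remainder, bytes 6..7 the relative sequence number.
struct PackedEntry {
  unsigned char bytes[8];
};

void pack_entry(uint64_t key_hash, uint16_t relative_seq,
                uint16_t *bucket_index, PackedEntry *entry) {
  *bucket_index = static_cast<uint16_t>(key_hash >> 48);        // 2-byte array index
  const uint64_t remainder = key_hash & 0x0000FFFFFFFFFFFFULL;  // remaining 6 bytes
  std::memcpy(entry->bytes, &remainder, 6);                     // low 6 bytes (little-endian)
  std::memcpy(entry->bytes + 6, &relative_seq, 2);              // 2-byte relative sequence
}
```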
86 |
87 | The algorithm based on the new data structure is illustrated in the figure below, highlighting the following key points:
88 |
89 | 1. Fully utilizes the characteristics of *keys* and the monotonic increase of *sequence* numbers, compressing storage space effectively, resulting in very high memory usage efficiency for the new data structure.
90 | 2. Sets an upper limit on the stored information. Once the threshold is exceeded, a process similar to checkpointing is triggered, and the current transaction is set for serial replay.
91 | 3. The content of the new data structure is relatively small, with a high cache hit rate. Moreover, within *replay_cal_hash_item*, the *values* are stored contiguously, making it very cache-friendly.
92 |
93 | 
94 |
95 | Figure 4. A new algorithm suitable for calculating last_committed.
96 |
97 | It should be noted that the new data structure occupies a memory footprint of 256MB (65536 \* 4096 bytes), which is significantly smaller compared to the several gigabytes or even tens of gigabytes typically required by traditional certification databases during benchmarking. This modest memory usage lays a solid foundation for optimizing the performance of the entire applier thread.
98 |
99 | After optimization, the applier thread has significantly accelerated its computation of the last_committed value, resulting in a considerable improvement in the overall processing speed of the applier thread. The following is a flame graph generated by capturing *perf* data of the applier thread using the improved version.
100 |
101 | 
102 |
103 | Figure 5. Flame graph of performance data for the applier thread after optimization.
104 |
105 | From the graph, it can be observed that the CPU processing overhead for *Certifier::certify* has significantly reduced. Specifically, *quick_add_item* now accounts for only 12.85% of the overhead, whereas previously, when throughput was lower, *add_item* consumed 29.80%. This highlights a significant performance improvement achieved by adopting the new data structure.
106 |
107 | Based on extensive TPC-C testing statistics, the following optimization conclusions can be drawn: Before optimization, the applier thread's disk throughput supported approximately 500,000 tpmC. After optimization, with more CPU time available to process disk writes, the applier thread's disk throughput now supports approximately 1,000,000 tpmC.
108 |
109 | This improvement not only enhances the overall processing capability of the applier thread but also accelerates the cleaning of outdated writeset information. According to tests, each cleaning operation now takes milliseconds. As a result, it effectively mitigates the performance fluctuations inherent in native Group Replication, further improving stability.
110 |
111 | From this case study, the reasons for performance improvement can be summarized as follows:
112 |
113 | 1. **Static Array for Values**: Using a static array for *values* in *replay_cal_hash_item* enhances search efficiency due to contiguous memory access, making it very cache-friendly.
114 | 2. **Reduced Data Storage**: The amount of stored data has been significantly reduced. Previously, it might have required gigabytes of storage, whereas now it only requires 256MB. Smaller memory footprints generally lead to higher efficiency.
115 | 3. **Fixed Memory Space**: The allocated memory space is fixed and does not require dynamic allocation. Previous frequent memory allocations and deallocations were detrimental to high performance due to the synchronous nature of memory operations.
116 | 4. **Efficient Certification Cleanup**: Certification cleanup can achieve millisecond-level performance. During certification cleanup, only zeroing operations are needed for the *number* values among the 65,536 *replay_cal_hash_item* items.
117 |
118 | By implementing a better data structure based on Group Replication's single-primary mode to achieve the same last_committed calculation functionality, the applier thread's maximum processing capability can be significantly enhanced, and performance fluctuations can be eliminated.
119 |
120 | # Summary
121 |
122 | Comparing the original hash table with the static array, the static array alone is several dozen times faster, but due to Amdahl's Law, the applier thread's overall processing capacity only doubles.
123 |
124 | # References:
125 |
126 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
127 |
--------------------------------------------------------------------------------
/sysbench_perf_degradation.md:
--------------------------------------------------------------------------------
1 | # Mitigating the Increasing Performance Decline
2 |
3 | Users tend to notice a decline in low-concurrency performance more easily, while improvements in high-concurrency performance are often harder to perceive. Therefore, maintaining low-concurrency performance is crucial, as it directly affects user experience and the willingness to upgrade [1].
4 |
5 | According to extensive user feedback, after upgrading to MySQL 8.0, users have generally perceived a decline in performance, particularly in batch insert and join operations. This downward trend has become more evident in higher versions of MySQL. Additionally, some MySQL enthusiasts and testers have reported performance degradation in multiple sysbench tests after upgrading.
6 |
7 | Can these performance issues be avoided? Or, more specifically, how should we scientifically assess the ongoing trend of performance decline? These are important questions to consider.
8 |
9 | Although the official team continues to optimize, the gradual deterioration of performance cannot be overlooked. In certain scenarios, there may appear to be improvements, but this does not mean that performance in all scenarios is equally optimized. Moreover, it's also easy to optimize performance for specific scenarios at the cost of degrading performance in other areas.
10 |
11 | ## 1 The Root Causes of MySQL Performance Decline
12 |
13 | In general, as more features are added, the codebase grows, and with the continuous expansion of functionality, performance becomes increasingly difficult to control.
14 |
15 | MySQL developers often fail to notice the decline in performance, as each addition to the codebase results in only a very small decrease in performance. However, over time, these small declines accumulate, leading to a significant cumulative effect, which causes users to perceive a noticeable performance degradation in newer versions of
16 | MySQL.
17 |
18 | For example, the following figure shows the performance of a simple single join operation, with MySQL 8.0.40 showing a performance decline compared to MySQL 8.0.27:
19 |
20 |
21 |
22 | Figure 1. Significant decline in join performance in MySQL 8.0.40.
23 |
24 | The following figure shows the batch insert performance test under single concurrency, with the performance decline of MySQL 8.0.40 compared to version 5.7.44:
25 |
26 |
27 |
28 | Figure 2. Significant decline in bulk insert performance in MySQL 8.0.40.
29 |
30 | From the two graphs above, it is clear that MySQL 8.0.40 performs noticeably worse in both scenarios.
31 |
32 | Next, let's analyze the root cause of the performance degradation in MySQL from the code level. Below is the ***PT_insert_values_list::contextualize*** function in MySQL 8.0:
33 |
34 | ```c++
35 | bool PT_insert_values_list::contextualize(Parse_context *pc) {
36 | if (super::contextualize(pc)) return true;
37 | for (List_item *item_list : many_values) {
38 | for (auto it = item_list->begin(); it != item_list->end(); ++it) {
39 | if ((*it)->itemize(pc, &*it)) return true;
40 | }
41 | }
42 |
43 | return false;
44 | }
45 | ```
46 |
47 | The corresponding ***PT_insert_values_list::contextualize*** function in MySQL 5.7 is as follows:
48 |
49 | ```c++
50 | bool PT_insert_values_list::contextualize(Parse_context *pc)
51 | {
52 | if (super::contextualize(pc))
53 | return true;
54 | List_iterator<List_item> it1(many_values);
55 | List_item *item_list;
56 | while ((item_list= it1++))
57 | {
58 | List_iterator<Item> it2(*item_list);
59 | Item *item;
60 | while ((item= it2++))
61 | {
62 | if (item->itemize(pc, &item))
63 | return true;
64 | it2.replace(item);
65 | }
66 | }
67 |
68 | return false;
69 | }
70 | ```
71 |
72 | From the code comparison, MySQL 8.0 appears to have more elegant code, seemingly making progress.
73 |
74 | Unfortunately, many times it is precisely the motivations behind these code improvements that lead to performance degradation. The MySQL official team replaced the previous ***List*** data structure with a ***deque***, which has become one of the root causes of the gradual performance degradation. Let's take a look at the ***deque***
75 | documentation:
76 |
77 | ```
78 | std::deque (double-ended queue) is an indexed sequence container that allows fast insertion and deletion at both its
79 | beginning and its end. In addition, insertion and deletion at either end of a deque never invalidates pointers or
80 | references to the rest of the elements.
81 |
82 | As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence
83 | of individually allocated fixed-size arrays, with additional bookkeeping, which means indexed access to deque must
84 | perform two pointer dereferences, compared to vector's indexed access which performs only one.
85 |
86 | The storage of a deque is automatically expanded and contracted as needed. Expansion of a deque is cheaper than the
87 | expansion of a std::vector because it does not involve copying of the existing elements to a new memory location. On
88 | the other hand, deques typically have large minimal memory cost; a deque holding just one element has to allocate its
89 | full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes,
90 | whichever is larger, on 64-bit libc++).
91 |
92 | The complexity (efficiency) of common operations on deques is as follows:
93 | Random access - constant O(1).
94 | Insertion or removal of elements at the end or beginning - constant O(1).
95 | Insertion or removal of elements - linear O(n).
96 | ```
97 |
98 | As shown in the above description, in extreme cases, retaining a single element requires allocating the entire array, resulting in very low memory efficiency. For example, in bulk inserts, where a large number of records need to be inserted, the official implementation stores each record in a separate deque. Even if the record content is minimal, a deque must still be allocated. The MySQL deque implementation allocates 1KB of memory for each deque to support fast lookups.
99 |
100 | ```
101 | The implementation is the same as classic std::deque: Elements are held in blocks of about 1 kB each.
102 | ```
103 |
104 | The official implementation uses blocks of about 1KB to store elements and their index information. Even when individual records are small, a large number of records means the addresses being accessed can become non-contiguous, leading to poor cache friendliness. This design was intended to improve cache friendliness, but it has not been fully effective.
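The effect can be demonstrated with a small, self-contained experiment (illustrative only; the exact numbers depend on the standard library): a counting allocator shows that a deque holding a single element still allocates a full block plus its bookkeeping, while a vector allocates only a small buffer.

```c++
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

static std::size_t g_allocated = 0;  // bytes requested by the container

template <class T>
struct CountingAlloc {
  using value_type = T;
  CountingAlloc() = default;
  template <class U>
  CountingAlloc(const CountingAlloc<U> &) {}
  T *allocate(std::size_t n) {
    g_allocated += n * sizeof(T);
    return static_cast<T *>(::operator new(n * sizeof(T)));
  }
  void deallocate(T *p, std::size_t) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const CountingAlloc<T> &, const CountingAlloc<U> &) { return true; }
template <class T, class U>
bool operator!=(const CountingAlloc<T> &, const CountingAlloc<U> &) { return false; }

int main() {
  g_allocated = 0;
  { std::deque<int, CountingAlloc<int>> d; d.push_back(42); }
  const std::size_t deque_bytes = g_allocated;

  g_allocated = 0;
  { std::vector<int, CountingAlloc<int>> v; v.push_back(42); }
  const std::size_t vector_bytes = g_allocated;

  // Typical result: hundreds of bytes for the one-element deque
  // versus a handful of bytes for the one-element vector.
  std::printf("deque:  %zu bytes\nvector: %zu bytes\n", deque_bytes, vector_bytes);
  return 0;
}
```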
105 |
106 | It is worth noting that the original implementation used a List data structure, where memory was allocated from a memory pool, providing a certain level of cache friendliness. Although random access into a List is less efficient, the elements here are mostly traversed sequentially, and sequential access to pool-allocated List elements performs well.
107 |
108 | During the upgrade to MySQL 8.0, users observed a significant decline in batch insert performance, and one of the main causes was the substantial change in underlying data structures.
109 |
110 | Additionally, while the official team improved the redo log mechanism, this also led to a decrease in MTR commit operation efficiency. Compared to MySQL 5.7, the added code significantly reduces the performance of individual commits, even though overall write throughput has been greatly improved.
111 |
112 | Let's examine the core ***execute*** operation of MTR commit in MySQL 5.7.44:
113 |
114 | ```c++
115 | /** Write the redo log record, add dirty pages to the flush list and
116 | release the resources. */
117 | void mtr_t::Command::execute()
118 | {
119 | ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);
120 | if (const ulint len = prepare_write()) {
121 | finish_write(len);
122 | }
123 | if (m_impl->m_made_dirty) {
124 | log_flush_order_mutex_enter();
125 | }
126 | /* It is now safe to release the log mutex because the
127 | flush_order mutex will ensure that we are the first one
128 | to insert into the flush list. */
129 | log_mutex_exit();
130 | m_impl->m_mtr->m_commit_lsn = m_end_lsn;
131 | release_blocks();
132 | if (m_impl->m_made_dirty) {
133 | log_flush_order_mutex_exit();
134 | }
135 | release_all();
136 | release_resources();
137 | }
138 | ```
139 |
140 | Let's examine the core ***execute*** operation of MTR commit in MySQL 8.0.40:
141 |
142 | ```c++
143 | /** Write the redo log record, add dirty pages to the flush list and
144 | release the resources. */
145 | void mtr_t::Command::execute() {
146 | ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);
147 | #ifndef UNIV_HOTBACKUP
148 | ulint len = prepare_write();
149 | if (len > 0) {
150 | mtr_write_log_t write_log;
151 | write_log.m_left_to_write = len;
152 | auto handle = log_buffer_reserve(*log_sys, len);
153 | write_log.m_handle = handle;
154 | write_log.m_lsn = handle.start_lsn;
155 | m_impl->m_log.for_each_block(write_log);
156 | ut_ad(write_log.m_left_to_write == 0);
157 | ut_ad(write_log.m_lsn == handle.end_lsn);
158 | log_wait_for_space_in_log_recent_closed(*log_sys, handle.start_lsn);
159 | DEBUG_SYNC_C("mtr_redo_before_add_dirty_blocks");
160 | add_dirty_blocks_to_flush_list(handle.start_lsn, handle.end_lsn);
161 | log_buffer_close(*log_sys, handle);
162 | m_impl->m_mtr->m_commit_lsn = handle.end_lsn;
163 | } else {
164 | DEBUG_SYNC_C("mtr_noredo_before_add_dirty_blocks");
165 | add_dirty_blocks_to_flush_list(0, 0);
166 | }
167 | #endif /* !UNIV_HOTBACKUP */
168 | ```
169 |
170 | By comparison, it is clear that in MySQL 8.0.40, the execute operation in MTR commit has become much more complex, with more steps involved. This complexity is one of the main causes of the decline in low-concurrency write performance.
171 |
172 | In particular, the operations ***m_impl-\>m_log.for_each_block(write_log)*** and
173 | **log_wait_for_space_in_log_recent_closed(*log_sys, handle.start_lsn)** have significant overhead. These changes were made to enhance high-concurrency performance, but they came at the cost of low-concurrency performance.
174 |
175 | The redo log's prioritization of high-concurrency workloads results in poor performance for low-concurrency workloads. Although the introduction of ***innodb_log_writer_threads*** was intended to mitigate low-concurrency performance issues, it does not affect the execution of the functions above. Since MTR commits are frequent and each commit has become more complex, performance still drops significantly.
176 |
177 | Let's take a look at the impact of the instant add/drop feature on performance. Below is the ***rec_init_offsets_comp_ordinary*** function in MySQL 5.7:
178 |
179 | ```c++
180 | /******************************************************//**
181 | Determine the offset to each field in a leaf-page record
182 | in ROW_FORMAT=COMPACT. This is a special case of
183 | rec_init_offsets() and rec_get_offsets_func(). */
184 | UNIV_INLINE MY_ATTRIBUTE((nonnull))
185 | void
186 | rec_init_offsets_comp_ordinary(
187 | /*===========================*/
188 | const rec_t* rec, /*!< in: physical record in
189 | ROW_FORMAT=COMPACT */
190 | bool temp, /*!< in: whether to use the
191 | format for temporary files in
192 | index creation */
193 | const dict_index_t* index, /*!< in: record descriptor */
194 | ulint* offsets)/*!< in/out: array of offsets;
195 | in: n=rec_offs_n_fields(offsets) */
196 | {
197 | ulint i = 0;
198 | ulint offs = 0;
199 | ulint any_ext = 0;
200 | ulint n_null = index->n_nullable;
201 | const byte* nulls = temp
202 | ? rec - 1
203 | : rec - (1 + REC_N_NEW_EXTRA_BYTES);
204 | const byte* lens = nulls - UT_BITS_IN_BYTES(n_null);
205 | ulint null_mask = 1;
206 |
207 | ...
208 | ut_ad(temp || dict_table_is_comp(index->table));
209 |
210 | if (temp && dict_table_is_comp(index->table)) {
211 | /* No need to do adjust fixed_len=0. We only need to
212 | adjust it for ROW_FORMAT=REDUNDANT. */
213 | temp = false;
214 | }
215 | /* read the lengths of fields 0..n */
216 | do {
217 | const dict_field_t* field
218 | = dict_index_get_nth_field(index, i);
219 | const dict_col_t* col
220 | = dict_field_get_col(field);
221 | ulint len;
222 | ...
223 | ```
224 |
225 | The ***rec_init_offsets_comp_ordinary*** function in MySQL 8.0.40 is as follows:
226 |
227 | ```c++
228 | /** Determine the offset to each field in a leaf-page record in
229 | ROW_FORMAT=COMPACT. This is a special case of rec_init_offsets() and
230 | rec_get_offsets().
231 | ...
232 | */
233 | inline void rec_init_offsets_comp_ordinary(const rec_t *rec, bool temp,
234 | const dict_index_t *index,
235 | ulint *offsets) {
236 | ...
237 | const byte *nulls = nullptr;
238 | const byte *lens = nullptr;
239 | uint16_t n_null = 0;
240 | enum REC_INSERT_STATE rec_insert_state = REC_INSERT_STATE::NONE;
241 | uint8_t row_version = UINT8_UNDEFINED;
242 | uint16_t non_default_fields = 0;
243 |
244 | if (temp) {
245 | rec_insert_state = rec_init_null_and_len_temp(
246 | rec, index, &nulls, &lens, &n_null, non_default_fields, row_version);
247 | } else {
248 | rec_insert_state = rec_init_null_and_len_comp(
249 | rec, index, &nulls, &lens, &n_null, non_default_fields, row_version);
250 | }
251 |
252 | ut_ad(temp || dict_table_is_comp(index->table));
253 | if (temp) {
254 | if (dict_table_is_comp(index->table)) {
255 | /* No need to do adjust fixed_len=0. We only need to
256 | adjust it for ROW_FORMAT=REDUNDANT. */
257 | temp = false;
258 | } else {
259 | /* Redundant temp row. Old instant record is logged as version 0.*/
260 | if (rec_insert_state == INSERTED_BEFORE_INSTANT_ADD_OLD_IMPLEMENTATION ||
261 | rec_insert_state == INSERTED_AFTER_INSTANT_ADD_OLD_IMPLEMENTATION) {
262 | rec_insert_state = INSERTED_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION;
263 | ut_ad(row_version == UINT8_UNDEFINED);
264 | }
265 | }
266 | }
267 |
268 | /* read the lengths of fields 0..n */
269 | ulint offs = 0;
270 | ulint any_ext = 0;
271 | ulint null_mask = 1;
272 | uint16_t i = 0;
273 | do {
274 | /* Fields are stored on disk in version they are added in and are
275 | maintained in fields_array in the same order. Get the right field. */
276 | const dict_field_t *field = index->get_physical_field(i);
277 | const dict_col_t *col = field->col;
278 | uint64_t len;
279 |
280 | switch (rec_insert_state) {
281 | case INSERTED_INTO_TABLE_WITH_NO_INSTANT_NO_VERSION:
282 | ut_ad(!index->has_instant_cols_or_row_versions());
283 | break;
284 |
285 | case INSERTED_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION: {
286 | ut_ad(row_version == UINT8_UNDEFINED || row_version == 0);
287 | ut_ad(index->has_row_versions() || temp);
288 | /* Record has to be interpreted in v0. */
289 | row_version = 0;
290 | }
291 | [[fallthrough]];
292 | case INSERTED_AFTER_UPGRADE_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION:
293 | case INSERTED_AFTER_INSTANT_ADD_NEW_IMPLEMENTATION: {
294 | ...
295 | } break;
296 | case INSERTED_BEFORE_INSTANT_ADD_OLD_IMPLEMENTATION:
297 | case INSERTED_AFTER_INSTANT_ADD_OLD_IMPLEMENTATION: {
298 | ...
299 | } break;
300 |
301 | default:
302 | ut_ad(false);
303 | }
304 | ...
305 | ```
306 |
307 | From the above code, it is clear that with the introduction of the instant add/drop column feature, the ***rec_init_offsets_comp_ordinary*** function has become noticeably more complex, introducing more function calls and adding a switch statement that severely impacts cache optimization. Since this function is called frequently, it directly impacts the performance of index updates, batch inserts, and joins, resulting in a major performance hit.
308 |
309 | Moreover, the performance decline in MySQL 8.0 is not limited to the above; many other areas contribute to the overall degradation, especially code that hinders the expansion of inline functions. For example, the following code affects inline expansion:
310 |
311 | ```c++
312 | void validate_rec_offset(const dict_index_t *index, const ulint *offsets,
313 | ulint n, ut::Location L) {
314 | ut_ad(rec_offs_validate(nullptr, nullptr, offsets));
315 | if (n >= rec_offs_n_fields(offsets)) {
316 | #ifndef UNIV_NO_ERR_MSGS
317 | dump_metadata_dict_table(index->table);
318 | auto num_fields = static_cast<ulonglong>(rec_offs_n_fields(offsets));
319 | ib::fatal(L, ER_IB_DICT_INVALID_COLUMN_POSITION, ulonglong{n}, num_fields);
320 | #endif /* !UNIV_NO_ERR_MSGS */
321 | }
322 | }
323 | ```
324 |
325 | According to our tests, the ***ib::fatal*** statement severely interferes with inlining. For frequently called functions, it is advisable to avoid constructs that prevent the compiler from inlining them.
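One common mitigation, shown here only as a generic sketch (GCC/Clang attributes; this is not the official MySQL fix), is to move the fatal path into a separate non-inlined "cold" function so that the hot-path check stays small enough for the compiler to keep inlining:

```c++
#include <cstdio>
#include <cstdlib>

// Cold, never-inlined error path: the heavyweight reporting and abort logic
// lives here so it does not bloat the hot caller.
[[noreturn]] __attribute__((noinline, cold)) void die_invalid_position(
    unsigned long long n, unsigned long long n_fields) {
  std::fprintf(stderr, "invalid column position %llu >= %llu\n", n, n_fields);
  std::abort();
}

// The hot-path check stays tiny, so it (and its callers) remain good
// inlining candidates even though a fatal path exists.
inline void validate_position(unsigned long long n, unsigned long long n_fields) {
  if (__builtin_expect(n >= n_fields, 0)) {
    die_invalid_position(n, n_fields);
  }
}
```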
326 |
327 | Next, let's look at a similar issue. The ***row_sel_store_mysql_field*** function is called frequently, with ***row_sel_field_store_in_mysql_format*** being a hotspot function within it. The specific code is as follows:
328 |
329 | ```c++
330 | // clang-format off
331 | /** Convert a field in the Innobase format to a field in the MySQL format.
332 | @param[out] mysql_rec Record in the MySQL format
333 | @param[in,out] prebuilt Prebuilt struct
334 | @param[in] rec InnoDB record; must be protected by a page
335 | latch
336 | @param[in] rec_index Index of rec
337 | @param[in] prebuilt_index prebuilt->index
338 | @param[in] offsets Array returned by rec_get_offsets()
339 | @param[in] field_no templ->rec_field_no or
340 | templ->clust_rec_field_no or
341 | templ->icp_rec_field_no or sec field no if
342 | clust_templ_for_sec is true
343 | @param[in] templ row template
344 | @param[in] sec_field_no Secondary index field no if the secondary index
345 | record but the prebuilt template is in
346 | clustered index format and used only for end
347 | range comparison.
348 | @param[in] lob_undo the LOB undo information.
349 | @param[in,out] blob_heap If not null then use this heap for BLOBs */
350 | // clang-format on
351 | [[nodiscard]] static bool row_sel_store_mysql_field(
352 | byte *mysql_rec, row_prebuilt_t *prebuilt, const rec_t *rec,
353 | const dict_index_t *rec_index, const dict_index_t *prebuilt_index,
354 | const ulint *offsets, ulint field_no, const mysql_row_templ_t *templ,
355 | ulint sec_field_no, lob::undo_vers_t *lob_undo, mem_heap_t *&blob_heap) {
356 | DBUG_TRACE;
357 | ...
358 | } else {
359 | /* Field is stored in the row. */
360 |
361 | data = rec_get_nth_field_instant(rec, offsets, field_no, index_used, &len);
362 |
363 | if (len == UNIV_SQL_NULL) {
364 | /* MySQL assumes that the field for an SQL
365 | NULL value is set to the default value. */
366 | ut_ad(templ->mysql_null_bit_mask);
367 |
368 | UNIV_MEM_ASSERT_RW(prebuilt->default_rec + templ->mysql_col_offset,
369 | templ->mysql_col_len);
370 | mysql_rec[templ->mysql_null_byte_offset] |=
371 | (byte)templ->mysql_null_bit_mask;
372 | memcpy(mysql_rec + templ->mysql_col_offset,
373 | (const byte *)prebuilt->default_rec + templ->mysql_col_offset,
374 | templ->mysql_col_len);
375 | return true;
376 | }
377 |
378 | if (DATA_LARGE_MTYPE(templ->type) || DATA_GEOMETRY_MTYPE(templ->type)) {
379 | ...
380 | mem_heap_t *heap{};
381 |
382 | if (blob_heap == nullptr) {
383 | blob_heap = mem_heap_create(UNIV_PAGE_SIZE, UT_LOCATION_HERE);
384 | }
385 |
386 | heap = blob_heap;
387 | data = static_cast<byte *>(mem_heap_dup(heap, data, len));
388 | }
389 |
390 | /* Reassign the clustered index field no. */
391 | if (clust_templ_for_sec) {
392 | field_no = clust_field_no;
393 | }
394 |
395 | row_sel_field_store_in_mysql_format(mysql_rec + templ->mysql_col_offset,
396 | templ, rec_index, field_no, data, len,
397 | sec_field_no);
398 | ...
399 | ```
400 |
401 | The ***row_sel_field_store_in_mysql_format*** function ultimately calls ***row_sel_field_store_in_mysql_format_func***.
402 |
403 | ```c++
404 | /** Convert a non-SQL-NULL field from Innobase format to MySQL format. */
405 | static inline void row_sel_field_store_in_mysql_format(
406 | byte *dest, const mysql_row_templ_t *templ, const dict_index_t *idx,
407 | ulint field, const byte *src, ulint len, ulint sec) {
408 | row_sel_field_store_in_mysql_format_func(
409 | dest, templ, idx, IF_DEBUG(field, ) src, len IF_DEBUG(, sec));
410 | }
411 | ```
412 |
413 | The ***row_sel_field_store_in_mysql_format_func*** function cannot be inlined due to the presence of the ***ib::fatal*** code.
414 |
415 | ```c++
416 | void row_sel_field_store_in_mysql_format_func(
417 | byte *dest, const mysql_row_templ_t *templ, const dict_index_t *index,
418 | IF_DEBUG(ulint field_no, ) const byte *data,
419 | ulint len IF_DEBUG(, ulint sec_field)) {
420 | byte *ptr;
421 |
422 | ...
423 |
424 | if (templ->is_multi_val) {
425 | ib::fatal(UT_LOCATION_HERE, ER_CONVERT_MULTI_VALUE)
426 | << "Table name: " << index->table->name
427 | << " Index name: " << index->name;
428 | }
429 | ```
430 |
431 | Inefficient functions like these, called tens of millions of times per second, can severely impact join performance.
432 |
433 | Let's continue exploring the reasons for the performance decline. The following official optimization, while it does improve certain queries, is in fact one of the root causes of the degradation seen in ordinary join operations.
434 |
435 | ```
436 | commit ffe1726f2542505e486c4bcd516c30f36c8ed5f6 (HEAD)
437 | Author: Knut Anders Hatlen
438 | Date: Wed Dec 21 14:29:02 2022 +0100
439 |
440 | Bug#34891365: MySQL 8.0.29+ is slower than MySQL 8.0.28 with
441 | queries using JOINS
442 |
443 | The one-row cache in EQRefIterator was disabled for queries using
444 | streaming aggregation in the fix for bug#33674441. This was done in
445 | order to fix a number of wrong result bugs, but it turned out to have
446 | too big impact on the performance of many queries.
447 |
448 | This patch re-enables the cache for the above queries, and fixes the
449 | original wrong result bugs in a different way. There were two
450 | underlying problems that caused the wrong results:
451 |
452 | 1) AggregateIterator::Init() does not restore the table buffers to the
453 | state they had after the last read from the child iterator in the
454 | previous invocation. The table buffers would have the content from the
455 | first row in the last group that was returned from the iterator in the
456 | previous invocation, rather than the contents of the last row read by
457 | the child iterator, and this made the cache in EQRefIterator return
458 | wrong values. Fixed by restoring the table buffers in
459 | AggregateIterator::Init() if the previous invocation had modified
460 | them.
461 |
462 | 2) When the inner tables of an outer join are null-complemented, the
463 | table buffers are modified to contain NULL values, thereby disturbing
464 | the cached value for any EQRefIterator reading from one of the
465 | null-complemented tables. Fixed by making StoreFromTableBuffers()
466 | store the actual values contained in the table buffer instead of the
467 | null values, if the table is accessed by EQRefIterator.
468 | LoadIntoTableBuffers() is taught to restore those values, but
469 | additionally set the null flags on the columns after restoring them.
470 |
471 | The hypergraph optimizer had another workaround for these wrong
472 | results (it added a streaming step right before the aggregation). This
473 | workaround is also removed now.
474 |
475 | Change-Id: I554a90213cae60749dd6407e48a745bc71578e0c
476 | ```
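
To make the mechanism concrete, here is a deliberately simplified, hypothetical sketch of the one-row cache idea behind EQRefIterator: if the join key equals the key of the previous probe, the cached row is reused and the index is not touched. The class and names are ours, and everything the real iterator must handle (table buffers, NULL-complemented rows, aggregation restarts) is omitted:

```c++
// Hypothetical sketch of a one-row cache for an eq_ref-style index lookup.
// Illustrative only; this is not the actual EQRefIterator implementation.
#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct Row {
  std::string payload;
};

class EqRefLookup {
 public:
  explicit EqRefLookup(const std::map<int64_t, Row> &index) : index_(index) {}

  // Returns the row matching `key`, probing the index only when the key
  // differs from the previous call (the "one-row cache").
  const Row *Read(int64_t key) {
    if (cached_key_ && *cached_key_ == key) {
      return cached_row_;  // cache hit: no index probe
    }
    auto it = index_.find(key);
    cached_key_ = key;
    cached_row_ = (it == index_.end()) ? nullptr : &it->second;
    return cached_row_;
  }

 private:
  const std::map<int64_t, Row> &index_;
  std::optional<int64_t> cached_key_;
  const Row *cached_row_ = nullptr;
};
```

The wrong-result bugs described in the commit arise when other parts of the executor (an aggregation restart, or NULL-complementing of outer-join rows) overwrite the buffers such a cache refers to; the patch restores the buffer contents instead of disabling the cache, which is why queries that had lost the cache regain their join performance.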
477 |
478 | MySQL's issues go beyond these. As shown in the analyses above, the performance decline in MySQL is not without cause. A series of small problems, when accumulated, can lead to noticeable performance degradation that users experience. However, these issues are often difficult to identify, making them even harder to resolve.
479 |
480 | The adage that 'premature optimization is the root of all evil' does not apply to MySQL development. Database development is a complex process, and neglecting performance over time makes subsequent performance improvements significantly more challenging.
481 |
482 | ## 2 Solutions to Mitigate MySQL Performance Decline
483 |
484 | The main reasons for the decline in write performance are MTR (mini-transaction) commit overhead, the instant add/drop column feature, and several other factors. These are difficult to optimize in traditional ways, but users can compensate for the drop through PGO (profile-guided optimization). With a proper strategy, performance can generally be kept stable.
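For readers unfamiliar with the PGO workflow, the toy program below (ours, unrelated to MySQL's build system, which wires the equivalent flags through CMake) shows the two-pass flow with GCC or Clang: compile with -fprofile-generate, run a representative workload, then recompile with -fprofile-use so the compiler can lay out hot paths and make better inlining decisions:

```c++
// Toy PGO example, illustrative only. Typical two-pass build:
//   g++ -O2 -fprofile-generate pgo_demo.cc -o pgo_demo && ./pgo_demo
//   g++ -O2 -fprofile-use pgo_demo.cc -o pgo_demo_optimized
#include <cstdint>
#include <cstdio>

// A branchy hot function: the profile tells the compiler which side of the
// branch dominates, improving code layout and inlining decisions.
static uint64_t step(uint64_t x) {
  if (x % 97 == 0) {
    return x / 97;  // rare path
  }
  return x * 2654435761ULL + 1;  // hot path
}

int main() {
  uint64_t v = 1;
  for (int i = 0; i < 100000000; i++) {
    v = step(v);
  }
  std::printf("%llu\n", static_cast<unsigned long long>(v));
  return 0;
}
```

For MySQL itself the same idea applies at a much larger scale: the profiling run should resemble the production workload, otherwise the optimized layout may favor the wrong paths.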
485 |
486 | For batch insert performance degradation, our open-source version [2] replaces the official deque with an improved list implementation. This primarily addresses memory efficiency issues and can partially alleviate performance decline. By combining PGO optimization with our open-source version, batch insert performance can approach that of MySQL 5.7.
487 |
488 |
489 |
490 | Figure 3. Optimized MySQL 8.0.40 with PGO performs roughly on par with version 5.7.
491 |
492 | Users can also leverage multiple threads for concurrent batch processing, fully utilizing the improved concurrency of the redo log, which can significantly boost batch insert performance.
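As a rough sketch of this approach (the table definition, connection parameters, and batch sizes below are placeholders, not recommendations), each client thread opens its own connection and loads a disjoint range of rows with multi-row INSERT statements through the MySQL C API:

```c++
// Illustrative sketch: parallel batch insert from the client side.
// Assumes a table such as: CREATE TABLE t1 (id INT PRIMARY KEY, val VARCHAR(32));
#include <mysql.h>

#include <algorithm>
#include <string>
#include <thread>
#include <vector>

static void insert_range(int first, int last) {
  MYSQL *conn = mysql_init(nullptr);
  if (mysql_real_connect(conn, "127.0.0.1", "user", "password", "test", 3306,
                         nullptr, 0) == nullptr) {
    mysql_close(conn);
    return;
  }
  const int kRowsPerStmt = 1000;  // multi-row INSERTs amortize round trips
  for (int i = first; i < last; i += kRowsPerStmt) {
    const int upper = std::min(i + kRowsPerStmt, last);
    std::string sql = "INSERT INTO t1(id, val) VALUES ";
    for (int id = i; id < upper; ++id) {
      sql += "(" + std::to_string(id) + ",'x')";
      if (id + 1 < upper) sql += ",";
    }
    if (mysql_query(conn, sql.c_str()) != 0) break;  // stop this worker on error
  }
  mysql_close(conn);
}

int main() {
  const int kThreads = 8;         // tune to CPU, redo-log, and I/O capacity
  const int kTotalRows = 800000;  // total rows to load
  std::vector<std::thread> workers;
  for (int t = 0; t < kThreads; ++t) {
    const int per = kTotalRows / kThreads;
    workers.emplace_back(insert_range, t * per, (t + 1) * per);
  }
  for (auto &w : workers) {
    w.join();
  }
  return 0;
}
```

Because each thread commits independently, the server can overlap and group their redo-log writes, which is where the improved redo-log concurrency of MySQL 8.0 pays off.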
493 |
494 | Regarding the index-update slowdown, the additional code is unavoidable, but PGO can help mitigate the problem; our PGO build [2] alleviates it significantly.
495 |
496 | For read performance, particularly join performance, we have made substantial improvements, including fixing inline issues and making other optimizations. With the addition of PGO, join performance can be increased by over 30% compared to the official version.
497 |
498 |
499 |
500 | Figure 4. Using PGO, along with our optimizations, can lead to significant improvements in join performance.
501 |
502 | We will continue to invest time in optimizing low-concurrency performance. This is a long process, and numerous areas still need improvement.
503 |
504 | The open-source version is available for testing, and we will keep working to improve MySQL performance.
505 |
506 | ## References
507 |
508 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
509 |
510 | [2] [Enhanced for MySQL · GitHub](https://github.com/enhancedformysql)
511 |
--------------------------------------------------------------------------------
/sysbench_vs_benchmarksql.md:
--------------------------------------------------------------------------------
1 | # The Significant Differences Between BenchmarkSQL and SysBench
2 |
3 | Using the case of optimizing lock-sys as an example, this section evaluates the significant differences between the SysBench tool and BenchmarkSQL in MySQL performance testing [1].
4 |
5 | First, use SysBench's standard read/write tests to evaluate the optimization of lock-sys.
6 |
7 |
8 |
9 | Figure 1. Comparison of SysBench read-write tests before and after lock-sys optimization.
10 |
11 | From the figure, it can be observed that after optimization, the overall performance of the SysBench tests has actually decreased.
12 |
13 | Next, using BenchmarkSQL to test this optimization, the results are shown in the following figure.
14 |
15 |
16 |
17 | Figure 2. Comparison of BenchmarkSQL tests before and after lock-sys optimization.
18 |
19 | From the figure, it can be seen that the results of BenchmarkSQL's TPC-C test indicate that the lock-sys optimization is effective. Why does such a significant difference occur? Let's analyze the differences in characteristics between these testing tools to understand why their tests differ.
20 |
21 | SysBench read-write testing is characterized by fast, simple SQL statements. Under the same nominal concurrency, SysBench typically keeps fewer transactions concurrently active than BenchmarkSQL does. Therefore, when facing a latch-queue bottleneck such as lock-sys, high concurrency in SysBench may correspond to only low concurrency in BenchmarkSQL terms, and at such low effective concurrency the lock-sys optimization yields little visible benefit.
22 |
23 | BenchmarkSQL, a widely used TPC-C testing tool, distributes user threads more evenly across various modules, reducing susceptibility to aggregation effects. In high-concurrency situations, optimizing lock-sys can significantly reduce latch conflicts and minimize impact on other queues, thereby improving throughput. BenchmarkSQL's TPC-C testing is better suited for uncovering deeper concurrency problems in MySQL compared to SysBench.
24 |
25 | This analysis uses deductive reasoning to explore the differences between SysBench and BenchmarkSQL. It demonstrates that poor performance in SysBench tests does not necessarily indicate poor performance in production environments, and vice versa. This discrepancy arises because SysBench test environments often differ significantly from real-world production environments. Consequently, SysBench test results should be used for scenario-specific performance comparisons rather than as comprehensive indicators of production capabilities.
26 |
27 | It is worth noting that the performance testing and comparison in this book are mainly based on TPC-C, following the rationale below [2]:
28 |
29 | *TPC benchmark C also known as TPC-C which is the leading online transaction processing (OLTP) benchmark has been used to perform the comparison.*
30 |
31 |
32 |
33 | ## References
34 |
35 | [1] Bin Wang (2024). The Art of Problem-Solving in Software Engineering: How to Make MySQL Better.
36 |
37 | [2] R. N. Avula and C. Zou. Performance evaluation of TPC-C benchmark on various cloud providers, Proc. 11th IEEE Annu. Ubiquitous Comput. Electron. Mobile Commun. Conf. (UEMCON), pp. 226-233, Oct. 2020.
--------------------------------------------------------------------------------