├── README.md
├── figures
├── bf-tree-batch-write.gif
├── bf-tree-buffer-writes.gif
├── bf-tree-cache-records.gif
├── bf-tree-grow-larger.gif
├── buffer-pool-alloc.gif
├── buffer-pool-evict.gif
├── buffer-pool-grow.gif
├── buffer-pool-lru.gif
└── perf-figure.png
├── paper.pdf
├── poster-vldb.pdf
└── slides-vldb.pptx
/README.md:
--------------------------------------------------------------------------------
1 | # Bf-Tree
2 |
3 | Paper: [Latest](paper.pdf), [VLDB 2024](https://www.vldb.org/pvldb/vol17/p3442-hao.pdf)
4 |
5 | Slides: [PowerPoint](slides-vldb.pptx), [pdf]()
6 |
7 | Poster: [VLDB 2024](poster-vldb.pdf)
8 |
9 | Code: [see below](#show-me-the-code)
10 |
11 |
12 |
13 | ![Performance overview](figures/perf-figure.png)
14 |
15 |
16 | **Bf-Tree is a modern read-write-optimized concurrent larger-than-memory range index.**
17 |
18 | - **Modern**: designed for modern SSDs, implemented with modern programming languages (Rust).
19 |
20 | - **Read-write-optimized**: 2.5× faster than RocksDB for scans, 6× faster than a B-Tree for writes, and 2× faster than both B-Trees and LSM-Trees for point lookups -- for small records (e.g., ~100 bytes).
21 |
22 | - **Concurrent**: scales to high thread counts.
23 |
24 | - **Larger-than-memory**: scales to data sets larger than memory.
25 |
26 | - **Range index**: records are sorted.
27 |
28 | The core of Bf-Tree is the mini-page abstraction and the buffer pool that supports it.
29 |
30 | Thank you for your interest in Bf-Tree; I'm all ears for feedback!
31 |
32 | ## Mini-pages
33 |
34 | #### 1. As a record-level cache
35 | Mini-pages cache individual records rather than whole pages. This finer granularity makes it more efficient to identify and keep hot records.
36 | ![Mini-pages cache hot records](figures/bf-tree-cache-records.gif)
37 |
38 | #### 2. As a write buffer
39 | Mini-pages absorb incoming writes and flush them to disk in batches.
40 | ![Mini-pages buffer writes](figures/bf-tree-buffer-writes.gif)
41 |
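The batching behavior above can be sketched in a few lines of Rust. This is a simplified model with made-up names (`WriteBuffer`, `flush_threshold`), not Bf-Tree's actual code: writes accumulate in memory and are handed back as one batch once enough bytes are buffered.

```rust
/// Sketch of a mini-page acting as a write buffer: records accumulate
/// in memory and are flushed to the base disk page in one batch once
/// the buffered bytes cross a threshold. Illustrative only.
pub struct WriteBuffer {
    records: Vec<(Vec<u8>, Vec<u8>)>, // buffered (key, value) pairs
    buffered_bytes: usize,
    flush_threshold: usize,
}

impl WriteBuffer {
    pub fn new(flush_threshold: usize) -> Self {
        WriteBuffer { records: Vec::new(), buffered_bytes: 0, flush_threshold }
    }

    /// Buffer a write; returns the batch to flush if the threshold is hit.
    pub fn insert(
        &mut self,
        key: Vec<u8>,
        value: Vec<u8>,
    ) -> Option<Vec<(Vec<u8>, Vec<u8>)>> {
        self.buffered_bytes += key.len() + value.len();
        self.records.push((key, value));
        if self.buffered_bytes >= self.flush_threshold {
            self.buffered_bytes = 0;
            // One batched disk write instead of many small ones.
            Some(std::mem::take(&mut self.records))
        } else {
            None
        }
    }
}
```

The real mini-page also serves reads from the buffered records and merges the batch into the base page on flush; this sketch only shows the batching itself.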
42 | #### 3. Grow/shrink in size
43 | Mini-pages grow and shrink in size, so memory usage closely tracks what is actually cached.
44 |
45 | ![Mini-pages grow larger](figures/bf-tree-grow-larger.gif)
46 |
47 | #### 4. Flush to disk
48 | Mini-pages are flushed to disk when they are too large or too cold.
49 |
50 | ![Mini-pages batch write to disk](figures/bf-tree-batch-write.gif)
51 |
52 | ## Buffer pool for mini-pages
53 |
54 | The buffer pool is a circular buffer; the allocated space is delimited by a head and a tail address.
55 |
56 | #### 1. Allocate mini-pages
57 | ![Buffer pool allocation](figures/buffer-pool-alloc.gif)
58 |
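The head/tail scheme can be sketched as follows. This is a minimal model with illustrative names (`BufferPool`, `allocate`, `evict`); the real pool additionally handles wrap-around writes, concurrency, and variable-length mini-page headers. Logical addresses only ever increase, and the physical offset is the logical address modulo the buffer capacity.

```rust
/// A simplified circular buffer pool. `head` and `tail` are logical
/// addresses; the live region is [head, tail).
pub struct BufferPool {
    capacity: usize, // total bytes in the circular buffer
    head: usize,     // logical address of the oldest live mini-page
    tail: usize,     // logical address where the next mini-page goes
}

impl BufferPool {
    pub fn new(capacity: usize) -> Self {
        BufferPool { capacity, head: 0, tail: 0 }
    }

    /// Bytes currently allocated between head and tail.
    pub fn used(&self) -> usize {
        self.tail - self.head
    }

    /// Allocate `size` bytes for a mini-page by bumping the tail.
    /// Returns None if the pool is full: the caller must first evict
    /// mini-pages at the head (see the next section).
    pub fn allocate(&mut self, size: usize) -> Option<usize> {
        if self.used() + size > self.capacity {
            return None;
        }
        let addr = self.tail;
        self.tail += size;
        Some(addr)
    }

    /// Evict the oldest mini-page by advancing the head.
    pub fn evict(&mut self, size: usize) {
        self.head += size;
    }

    /// Map a logical address to a physical offset in the ring.
    pub fn physical_offset(&self, addr: usize) -> usize {
        addr % self.capacity
    }
}
```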
59 | #### 2. Evict mini-pages when full
60 | ![Buffer pool eviction](figures/buffer-pool-evict.gif)
61 |
62 | #### 3. Track hot mini-pages
63 | A naive circular buffer is a FIFO queue; we turn it into an LRU approximation with a second-chance region.
64 |
65 | Mini-pages in the second-chance region are:
66 | - Copied to the tail address when accessed (copy-on-access)
67 | - Evicted to disk if not accessed while in the region
68 |
69 | ![Buffer pool LRU approximation](figures/buffer-pool-lru.gif)
70 |
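The copy-or-evict decision above is a classic second-chance check on an access bit. A minimal sketch (the `MiniPage`/`Fate` names are illustrative, not the real types):

```rust
/// What happens to a mini-page leaving the second-chance region.
#[derive(Debug, PartialEq)]
pub enum Fate {
    CopiedToTail,  // hot: recycled to the tail of the ring
    EvictedToDisk, // cold: flushed and dropped from memory
}

pub struct MiniPage {
    pub accessed: bool, // set on every read/write of the mini-page
}

/// Second-chance check: an accessed mini-page is copied to the tail
/// (and must prove itself again); an untouched one is evicted.
pub fn second_chance(page: &mut MiniPage) -> Fate {
    if page.accessed {
        page.accessed = false; // clear the bit for the next pass
        Fate::CopiedToTail
    } else {
        Fate::EvictedToDisk
    }
}
```

A page that keeps getting accessed keeps cycling back to the tail, which is what approximates LRU on top of the FIFO ring.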
71 | #### 4. Grow mini-pages
72 | Mini-pages are copied to a larger mini-page when they need to grow.
73 | The old space is added to a free list for future allocations.
74 |
75 | ![Buffer pool grow](figures/buffer-pool-grow.gif)
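The free-list bookkeeping can be sketched as a size-keyed map of freed slots. This is an assumption-laden illustration (the `FreeList` type and its methods are made up for this sketch, not Bf-Tree's actual structures):

```rust
use std::collections::BTreeMap;

/// Sketch of the free list that recycles the old slots of grown
/// mini-pages: freed slots are keyed by size, and allocation takes
/// the smallest slot that is large enough.
pub struct FreeList {
    // size -> logical addresses of free slots of that size
    slots: BTreeMap<usize, Vec<usize>>,
}

impl FreeList {
    pub fn new() -> Self {
        FreeList { slots: BTreeMap::new() }
    }

    /// Record a freed slot (e.g., the old home of a grown mini-page).
    pub fn give(&mut self, addr: usize, size: usize) {
        self.slots.entry(size).or_default().push(addr);
    }

    /// Take a free slot of at least `size` bytes, if any exists.
    pub fn take(&mut self, size: usize) -> Option<usize> {
        // Best-fit: the smallest size class that still fits.
        let key = *self.slots.range(size..).find(|(_, v)| !v.is_empty())?.0;
        self.slots.get_mut(&key)?.pop()
    }
}
```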
76 |
77 |
78 | ## Show me the code!
79 |
80 | Bf-Tree is currently internal to Microsoft. I'll post an update here once we have figured out the next steps.
81 |
82 |
83 |
84 | ## Is Bf-Tree deployed in production?
85 |
86 | No. The prototype I implemented is not deployed in production yet.
87 |
88 | However, the circular buffer pool design is very similar to FASTER's hybrid log, which is deployed in production at Microsoft.
89 |
90 | ## What are the drawbacks of Bf-Tree?
91 |
92 | - Bf-Tree only works well on modern SSDs where parallel random 4KB writes achieve throughput similar to sequential writes. Not all SSDs have this property, but it is common among modern ones.
93 |
94 | - Bf-Tree is heavily optimized for small records (e.g., 100 bytes, a common size for secondary indexes). For large records, performance is similar to B-Trees or LSM-Trees.
95 |
96 | - Bf-Tree's buffer pool is more complex than those of B-Trees and LSM-Trees, as it needs to handle variable-length mini-pages. But in my opinion it is simpler than the Bw-Tree, which has been implemented and deployed in production.
97 |
98 | Add your opinions here!
99 |
100 | ## Future directions?
101 |
102 | - Bf-Tree writes disk pages in place, which can burden SSD garbage collection. If this turns out to be a problem, we should consider log-structured writes to disk pages.
103 |
104 | - Bf-Tree's mini-page eviction/promotion policy is dead simple. More advanced policies could improve fairness and hit rate and reduce copying. Our current paper focuses on the mini-page/buffer pool mechanisms; exploring advanced policies is left as future work.
105 |
106 | - Better async. Bf-Tree relies on OS threads to interleave I/O operations, which many people believe is not ideal. Implementing Bf-Tree with user-space async I/O (e.g., tokio) might be a way to publish a paper.
107 |
108 | - Go lock/latch-free. Bf-Tree is lock-based and carefully designed so that no deadlock is possible. Adding `lock-free` to the paper title is cool -- if you are a lock-free believer.
109 |
110 | - Future hardware. It's not hard to imagine [10 more papers](https://blog.haoxp.xyz/posts/research-statement/#todays-problem-vs-tomorrows-problem) applying Bf-Tree to future hardware such as CXL, RDMA, PM, GPUs, SmartNICs, etc.
111 |
112 | ## I got questions!
113 |
114 | If you encounter any problems or have questions about implementation details, I'm more than happy to help and give you some hints!
115 | Feel free to reach out to me at xiangpeng.hao@wisc.edu or open an issue here.
116 |
117 |
118 | ## Errata
119 |
120 | Some notable changes:
121 |
122 | - We have fixed a legend typo in Figure 1.
123 |
124 | ## Cite Bf-Tree
125 |
126 | ```bibtex
127 | @article{bf-tree,
128 | title={Bf-Tree: A Modern Read-Write-Optimized Concurrent Larger-Than-Memory Range Index},
129 | author={Hao, Xiangpeng and Chandramouli, Badrish},
130 | journal={Proceedings of the VLDB Endowment},
131 | volume={17},
132 | number={11},
133 | pages={3442--3455},
134 | year={2024},
135 | publisher={VLDB Endowment}
136 | }
137 | ```
138 |
139 | ## Community discussions
140 |
141 | - https://x.com/badrishc/status/1828290910431703365
142 | - https://x.com/MarkCallaghanDB/status/1827906983619649562
143 | - https://x.com/MarkCallaghanDB/status/1828466545347252694
144 | - https://discord.com/channels/824628143205384202/1278432084070891523/1278432617858990235
145 | - https://x.com/FilasienoF/status/1830986520808833175
146 |
147 | Add yours here!
148 |
149 |
--------------------------------------------------------------------------------
/figures/bf-tree-batch-write.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/bf-tree-batch-write.gif
--------------------------------------------------------------------------------
/figures/bf-tree-buffer-writes.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/bf-tree-buffer-writes.gif
--------------------------------------------------------------------------------
/figures/bf-tree-cache-records.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/bf-tree-cache-records.gif
--------------------------------------------------------------------------------
/figures/bf-tree-grow-larger.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/bf-tree-grow-larger.gif
--------------------------------------------------------------------------------
/figures/buffer-pool-alloc.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/buffer-pool-alloc.gif
--------------------------------------------------------------------------------
/figures/buffer-pool-evict.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/buffer-pool-evict.gif
--------------------------------------------------------------------------------
/figures/buffer-pool-grow.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/buffer-pool-grow.gif
--------------------------------------------------------------------------------
/figures/buffer-pool-lru.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/buffer-pool-lru.gif
--------------------------------------------------------------------------------
/figures/perf-figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/figures/perf-figure.png
--------------------------------------------------------------------------------
/paper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/paper.pdf
--------------------------------------------------------------------------------
/poster-vldb.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/poster-vldb.pdf
--------------------------------------------------------------------------------
/slides-vldb.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XiangpengHao/bf-tree-docs/55a68e9df19c7882115e6fe92c4907c0b1f7ef41/slides-vldb.pptx
--------------------------------------------------------------------------------