├── README.md ├── ae ├── .gitignore ├── bin │ ├── libpdp.so │ ├── octopus │ │ ├── conf.xml │ │ ├── dmfs │ │ ├── libnrfsc.so │ │ └── mpibw │ ├── perf │ └── xstore │ │ ├── .gitignore │ │ ├── cs.toml │ │ ├── fserver │ │ ├── libjemalloc.so.2 │ │ ├── master │ │ ├── server │ │ └── config.toml │ │ ├── ycsb │ │ └── ycsb-model.toml ├── check-and-kill-running.sh ├── clear.sh ├── figure-10a.py ├── figure-10b.py ├── figure-10c.py ├── figure-10d.py ├── figure-11.py ├── figure-12a.py ├── figure-12b.py ├── figure-13a.py ├── figure-13b.py ├── figure-14a.py ├── figure-14b.py ├── figure-15a.py ├── figure-15b.py ├── figure-16.py ├── figure-17.py ├── figure-8.py ├── figure-9.py ├── output-example.tar.xz ├── run-all.sh └── scripts │ ├── bootstrap-octopus.py │ ├── bootstrap.py │ ├── ins-pch.sh │ ├── limit-mem.sh │ ├── octopus_eval.py │ ├── reset-memc.sh │ ├── term_eval.py │ └── xstore_eval.py ├── app ├── octopus.zip └── xstore.zip ├── driver ├── driver.patch └── mlnx-ofed-kernel-5.8-2.0.3.0.zip └── libterm ├── .gitignore ├── 3rd-party └── include │ └── better-enum.hh ├── CMakeLists.txt ├── ibverbs-pdp ├── config.hh ├── ctx.hh ├── global.cc ├── global.hh ├── mr.cc ├── mr.hh ├── pdp.cc ├── qp.cc ├── qp.hh ├── rpc.cc └── rpc.hh ├── include ├── bitmap.hh ├── compile.hh ├── const.hh ├── data-conn.hh ├── lock.hh ├── logging.hh ├── node.hh ├── page-bitmap.hh ├── print-stack.hh ├── queue.hh ├── rdma.hh ├── sms.hh ├── ud-conn.hh ├── util.hh └── zipf.hh ├── kmod ├── Makefile ├── pch.c └── test-pch.cc └── test └── perf.cc /README.md: -------------------------------------------------------------------------------- 1 | # TeRM: Extending RDMA-Attached Memory with SSD 2 | 3 | This is the open-source repository for our paper 4 | **TeRM: Extending RDMA-Attached Memory with SSD** on [FAST'24](https://www.usenix.org/conference/fast24/presentation/yang-zhe) and [ACM Transactions on Storage](https://dl.acm.org/doi/10.1145/3700772). 5 | 6 | Notably, the codename of TeRM is `PDP` (one step further beyond ODP). 7 | 8 | # Directory structure 9 | ``` 10 | TeRM 11 | |---- ae # artifact evaluation files 12 | |---- bin # binaries generated by source code 13 | |---- scripts # common scripts 14 | |---- figure-*.py # scripts to execute experiments 15 | |---- run-all.sh # run all experiments 16 | |---- app # source code of octopus and xstore with some bugs fixed 17 | |---- driver # TeRM's driver 18 | |---- driver.patch # patches to the official driver 19 | |---- mlnx-*.zip # the patched driver 20 | |---- libterm # TeRM's userspace shared library 21 | ``` 22 | 23 | # How to build 24 | ## Environment 25 | 26 | - OS: Ubuntu 22.04.2 LTS 27 | - Kernel: Linux 5.19.0-50-generic 28 | - OFED driver: 5.8-2.0.3 29 | 30 | We recommend the same environment used in our development. 31 | You may need to customize the source code for different enviroment. 32 | The environment is mainly required for the driver. 33 | We only need the patched driver on the server side. 34 | 35 | ## Dependencies 36 | ``` 37 | sudo apt install libfmt-dev libaio-dev libboost-coroutine-dev libmemcached-dev libgoogle-glog-dev libgflags-dev 38 | ``` 39 | 40 | ## Settings 41 | 42 | We hard coded some settings in the source code. Please modify them according to your cluster settings. 43 | 44 | 1. memcached. 45 | TeRM uses memcached to synchronize cluster metadata. 46 | Please install memcached in your cluster and modify the ip and port in `ae/scripts/reset-memc.sh`, `libterm/ibverbs-pdp/global.cc`, and `libterm/include/node.hh`. 47 | 48 | 2. CPU affinity. 
49 | The source code is in `class Schedule` of file `libterm/include/util.hh`.
50 | Please modify the constants according to your CPU hardware.
51 | 
52 | ## Build the driver
53 | The patched driver is required on the server side.
54 | There are two ways to build the driver; the second uses an out-of-the-box driver zip file that we provide.
55 | 
56 | 1. Download the source code of the driver from the official website.
57 | Apply the official backport patches first, and then apply the modifications listed in `driver/driver.patch`.
58 | Then, build the driver.
59 | Please note that we apply the minimum set of patches that makes the driver work in our environment, rather than all of them. Do not `git apply` the `driver/driver.patch` directly, because line numbers may differ; parse it and apply the changes manually.
60 | 
61 | 2. Use `driver/mlnx-ofed-kernel-5.8-2.0.3.0.zip`. Unzip it and run the contained `build.sh`.
62 | 
63 | ## Build libterm
64 | We provide `CMakeLists.txt` for building.
65 | It produces two outputs: the userspace shared library `libpdp.so` and a program `perf`.
66 | Please copy the two files to `ae/bin` before running the AE scripts.
67 | ```
68 | $ cd libterm
69 | $ mkdir -p build && cd build
70 | $ cmake .. -DCMAKE_BUILD_TYPE=Release # Release for compiler optimizations and high performance
71 | $ make -j
72 | ```
73 | 
74 | # How to use
75 | 1. Replace the driver `*.ko` files with the modified ones on the server side and restart the `openibd` service.
76 | 2. Restart the `memcached` instance. We provide a script `ae/scripts/reset-memc.sh` to do so.
77 | 3. `mmap` an SSD in the RDMA program with `MAP_SHARED` and `ibv_reg_mr` the memory area as an `ODP MR`.
78 | 4. Set `LD_PRELOAD=libpdp.so` on all nodes to enable TeRM. Also set the environment variables `PDP_server_mmap_dev=nvmeXnY` for the SSD backend and `PDP_server_memory_gb=Z` for the size of the mapped area. Set `PDP_is_server=1` on the server side only (see the sketch below).
79 | 5. Run the RDMA application.
80 | 
81 | libterm accepts a series of environment variables for configuration. Please refer to `libterm/ibverbs-pdp/global.cc` for more details.
82 | 
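As a rough end-to-end illustration of steps 3-5 on the server side, the sketch below `mmap`s an SSD and registers the mapping as an `ODP MR`. It is not part of this repository: the device path `/dev/nvme4n1`, the 64 GiB region size, the access flags, and the bare-bones verbs setup are placeholder assumptions; TeRM's own applications (`perf`, Octopus, XStore) already contain the equivalent logic.

```
/* A minimal sketch (not from this repo); run it with the TeRM library preloaded, e.g.
 *   LD_PRELOAD=libpdp.so PDP_is_server=1 PDP_server_mmap_dev=nvme4n1 \
 *   PDP_server_memory_gb=32 ./server
 */
#include <fcntl.h>
#include <infiniband/verbs.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 64UL << 30;                /* example size of the RDMA-attached region */
    int fd = open("/dev/nvme4n1", O_RDWR);  /* the SSD named in PDP_server_mmap_dev */
    if (fd < 0) { perror("open"); return 1; }

    /* Step 3a: map the SSD with MAP_SHARED. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Minimal verbs setup; a real application selects the RNIC explicitly. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = (devs && devs[0]) ? ibv_open_device(devs[0]) : NULL;
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) { fprintf(stderr, "verbs setup failed\n"); return 1; }

    /* Step 3b: register the mapped area as an ODP MR (IBV_ACCESS_ON_DEMAND).
     * With libpdp.so preloaded, this is the region that TeRM manages. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_ON_DEMAND);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* Step 5: create QPs, publish mr->rkey to clients, and run the application... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    munmap(buf, len);
    close(fd);
    return 0;
}
```

Clients only need `LD_PRELOAD=libpdp.so` (step 4); the remaining `PDP_*` knobs are documented in `libterm/ibverbs-pdp/global.cc`.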
83 | If you have further questions about or interest in this repository, please feel free to open an issue or contact me via email (yangzhe.ac AT outlook.com). You can find my GitHub at [yzim](https://github.com/yzim).
84 | 
85 | To cite our paper:
86 | ```
87 | @inproceedings {fast24-term,
88 | author = {Zhe Yang and Qing Wang and Xiaojian Liao and Youyou Lu and Keji Huang and Jiwu Shu},
89 | title = {{TeRM}: Extending {RDMA-Attached} Memory with {SSD}},
90 | booktitle = {22nd USENIX Conference on File and Storage Technologies (FAST 24)},
91 | year = {2024},
92 | isbn = {978-1-939133-38-0},
93 | address = {Santa Clara, CA},
94 | pages = {1--16},
95 | url = {https://www.usenix.org/conference/fast24/presentation/yang-zhe},
96 | publisher = {USENIX Association},
97 | month = feb
98 | }
99 | 
100 | @article{tos24-term,
101 | author = {Yang, Zhe and Wang, Qing and Liao, Xiaojian and Lu, Youyou and Huang, Keji and Shu, Jiwu},
102 | title = {Efficiently Enlarging RDMA-Attached Memory with SSD},
103 | year = {2024},
104 | publisher = {Association for Computing Machinery},
105 | address = {New York, NY, USA},
106 | issn = {1553-3077},
107 | url = {https://doi.org/10.1145/3700772},
108 | doi = {10.1145/3700772},
109 | journal = {ACM Trans. Storage},
110 | month = oct
111 | }
112 | ```
113 | 
--------------------------------------------------------------------------------
/ae/.gitignore:
--------------------------------------------------------------------------------
1 | output*/
2 | __pycache__
--------------------------------------------------------------------------------
/ae/bin/libpdp.so:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/libpdp.so
--------------------------------------------------------------------------------
/ae/bin/octopus/conf.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 4 | 1 5 | 10.0.2.184 6 | 7 |
8 | -------------------------------------------------------------------------------- /ae/bin/octopus/dmfs: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/octopus/dmfs -------------------------------------------------------------------------------- /ae/bin/octopus/libnrfsc.so: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/octopus/libnrfsc.so -------------------------------------------------------------------------------- /ae/bin/octopus/mpibw: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/octopus/mpibw -------------------------------------------------------------------------------- /ae/bin/perf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/perf -------------------------------------------------------------------------------- /ae/bin/xstore/.gitignore: -------------------------------------------------------------------------------- 1 | cpu.txt 2 | time_spot.py 3 | -------------------------------------------------------------------------------- /ae/bin/xstore/cs.toml: -------------------------------------------------------------------------------- 1 | [general_config] 2 | port = 8888 3 | 4 | [[client]] 5 | id = 1 6 | host = "node166" 7 | 8 | [[client]] 9 | id = 2 10 | host = "node168" 11 | 12 | [master] 13 | host = "node181" 14 | 15 | 16 | ## server config 17 | 18 | [[server]] 19 | host = "node184" 20 | 21 | [server_config] 22 | db_type = "ycsbh" -------------------------------------------------------------------------------- /ae/bin/xstore/fserver: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/fserver -------------------------------------------------------------------------------- /ae/bin/xstore/libjemalloc.so.2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/libjemalloc.so.2 -------------------------------------------------------------------------------- /ae/bin/xstore/master: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/master -------------------------------------------------------------------------------- /ae/bin/xstore/server/config.toml: -------------------------------------------------------------------------------- 1 | [memory] 2 | page = "30G" 3 | rdma_heap = "2G" 4 | 5 | [rpc] 6 | network = "ud" 7 | -------------------------------------------------------------------------------- /ae/bin/xstore/ycsb: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/ycsb -------------------------------------------------------------------------------- /ae/bin/xstore/ycsb-model.toml: 
-------------------------------------------------------------------------------- 1 | [[stages]] 2 | type = "lr" 3 | parameter = 1 4 | 5 | [[stages]] 6 | type = "lr" 7 | ## typically, they use much larger models 8 | 9 | 10 | ## YCSB large 200M 11 | parameter = 100000 -------------------------------------------------------------------------------- /ae/check-and-kill-running.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | running=0 4 | pattern="'run-all.sh|tmp/memcached|scripts/bootstrap.py|scripts/bootstrap-octopus.py|scripts/bootstrap-xstore.py|figure|octopus|xstore|mpibw|dmfs|fserver|master|ycsb|perf'" 5 | check_cmd="pgrep -fa $pattern" 6 | 7 | echo "--- node166 ---" 8 | ret=`ssh node166 $check_cmd` 9 | ssh node166 $check_cmd 10 | [ -z "$ret" ] || running=1 11 | echo 12 | 13 | echo "--- node168 ---" 14 | ret=`ssh node168 $check_cmd` 15 | ssh node168 $check_cmd 16 | [ -z "$ret" ] || running=1 17 | echo 18 | 19 | echo "--- node184 ---" 20 | ret=`ssh node184 $check_cmd` 21 | ssh node184 $check_cmd 22 | [ -z "$ret" ] || running=1 23 | echo 24 | 25 | echo "--- node181 ---" 26 | ret=`bash -c "$check_cmd"` 27 | bash -c "$check_cmd" 28 | [ -z "$ret" ] || running=1 29 | echo 30 | 31 | [ $running -eq 0 ] && echo "not found" && exit 32 | 33 | read -p "Confirm to kill? (Y/N): " confirm && [[ $confirm == [yY] || $confirm == [yY][eE][sS] ]] || exit 34 | 35 | kill_cmd="pkill -f $pattern" 36 | echo "--- node166 ---" 37 | ssh node166 $kill_cmd 38 | echo "--- node168 ---" 39 | ssh node168 $kill_cmd 40 | echo "--- node184 ---" 41 | ssh node184 $kill_cmd 42 | echo "--- node181 ---" 43 | bash -c "$kill_cmd" 44 | 45 | echo "killed" 46 | -------------------------------------------------------------------------------- /ae/clear.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | rm -rf ./output 4 | -------------------------------------------------------------------------------- /ae/figure-10a.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-10a", 6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "rpc_tiering_promote", "pdp"], 7 | "sz_unit": "256", 8 | "xlabel": "", 9 | "xdata": ["Read 256B"], 10 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot", "+magic"] 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-10b.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-10b", 6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "rpc_tiering_promote", "pdp"], 7 | "sz_unit": "4k", 8 | "nr_server_threads": 8, 9 | "xlabel": "", 10 | "xdata": ["Read 4KB"], 11 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot", "+magic"] 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() 17 | -------------------------------------------------------------------------------- /ae/figure-10c.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-10c", 6 | "mode": ["rpc_memcpy", 
"rpc_buffer", "rpc_direct", "rpc_tiering", "pdp"], 7 | "sz_unit": "256", 8 | "write_percent" : 100, 9 | "xlabel": "", 10 | "xdata": ["Write 256B"], 11 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot"] 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() -------------------------------------------------------------------------------- /ae/figure-10d.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-10d", 6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "pdp"], 7 | "sz_unit": "4k", 8 | "write_percent" : 100, 9 | "xlabel": "", 10 | "xdata": ["Write 4KB"], 11 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot"] 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() 17 | -------------------------------------------------------------------------------- /ae/figure-11.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-11", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "dynamic": True, 8 | "xlabel": "Time (s)", 9 | "xdata": [i for i in range(1, 121)], 10 | "legend": ["PIN", "ODP", "RPC", "TeRM"], 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-12a.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-12a", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "skewness_100": [0, 40, 80, 90, 99], 8 | "xlabel": "Zipfian theta", 9 | "xdata": ["0", "0.4", "0.8", "0.9", "0.99"], 10 | "legend": ["PIN", "ODP", "RPC", "TeRM"] 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-12b.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-12b", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "write_percent": [0, 25, 50, 75, 100], 8 | "xlabel": "Write Ratio", 9 | "xdata": ["0%", "25%", "50%", "75%", "100%"], 10 | "legend": ["PIN", "ODP", "RPC", "TeRM"] 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | 17 | -------------------------------------------------------------------------------- /ae/figure-13a.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-13a", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "nr_client_threads": [1, 2, 4, 8, 16, 32, 64], 8 | "xlabel": "Number of Client Threads", 9 | "xdata": ["1", "2", "4", "8", "16", "32", "64"], 10 | "legend": ["PIN", "ODP", "RPC", "TeRM"] 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-13b.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 
5 | "name" : "figure-13b", 6 | "mode" : ["rpc", "pdp"], 7 | "nr_server_threads": [1, 2, 4, 8, 16, 32], 8 | "xlabel" : "Number of Server Threads", 9 | "xdata" : ["1", "2", "4", "8", "16", "32"], 10 | "legend": ["RPC", "TeRM"] 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-14a.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name" : "figure-14a", 6 | "mode" : ["odp", "rpc", "pdp"], 7 | "server_memory_gb": [13, 26, 38, 51, 58, 0], 8 | "skewness_100": 0, 9 | "xlabel": "DRAM Ratio (%)", 10 | "xdata": ["20", "40", "60", "80", "90", "100"], 11 | "legend": ["ODP", "RPC", "TeRM"] 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() 17 | -------------------------------------------------------------------------------- /ae/figure-14b.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name" : "figure-14b", 6 | "mode" : ["odp", "rpc", "pdp"], 7 | "server_memory_gb": [13, 26, 38, 51, 58, 0], 8 | "skewness_100": 99, 9 | "xlabel": "DRAM Ratio (%)", 10 | "xdata": ["20", "40", "60", "80", "90", "100"], 11 | "legend": ["ODP", "RPC", "TeRM"] 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() 17 | -------------------------------------------------------------------------------- /ae/figure-15a.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name" : "figure-15a", 6 | "mode" : ["odp", "rpc", "pdp"], # codename 7 | "ssd" : ["nvme4n1", "nvme1n1", "nvme2n1"], 8 | "sz_unit" : 256, 9 | "xlabel": "Read 256B", 10 | "xdata": ["SSD 1", "SSD 2", "SSD 3"], 11 | "legend": ["ODP", "RPC", "TeRM"] 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() 17 | -------------------------------------------------------------------------------- /ae/figure-15b.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name" : "figure-15b", 6 | "mode" : ["odp", "rpc", "pdp"], 7 | "ssd" : ["nvme4n1", "nvme1n1", "nvme2n1"], 8 | "xlabel": "Read 4KB", 9 | "xdata": ["SSD 1", "SSD 2", "SSD 3"], 10 | "legend": ["ODP", "RPC", "TeRM"] 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-16.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.octopus_eval as octopus_eval 3 | 4 | kwargs = { 5 | "name": "figure-16", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "write": [0, 1], 8 | "unit_size": ["4k", "16k"], 9 | "xlabel": "", 10 | "xdata": ["Read 4KB", "Read 16KB", "Write 4KB", "Write 16KB"], 11 | "legend": ["PIN", "ODP", "RPC", "TeRM"] 12 | } 13 | 14 | 15 | e = octopus_eval.Experiment(**kwargs) 16 | e.run() 17 | e.output() 18 | 19 | # rerun a data point 20 | # kwargs = { 21 | # "name": "figure-16", 22 | # "mode": ["rpc"], 23 | # "write": [1], 24 | # "unit_size": ["16k"], 25 | # "xlabel": "", # for output(), no need to edit 26 | # "xdata": 
["Write 16KB"], # for output(), no need to edit 27 | # "legend": ["RPC"] # for output(), no need to edit 28 | # } 29 | # e = octopus_eval.Experiment(**kwargs) 30 | # e.run() 31 | -------------------------------------------------------------------------------- /ae/figure-17.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.xstore_eval as xstore_eval 3 | 4 | kwargs = { 5 | "name": "figure-17", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "zipf_theta": [0, 0.4, 0.8, 0.9, 0.99], 8 | "xlabel": "Zipfian theta", 9 | "xdata": ["0", "0.4", "0.8", "0.9", "0.99"], 10 | "legend": ["PIN", "ODP", "RPC", "TeRM"] 11 | } 12 | 13 | e = xstore_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-8.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-8", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "sz_unit": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"], 8 | "xlabel": "Read Size", 9 | "xdata": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"], 10 | "legend": ["PIN", "ODP", "RPC", "TeRM"], 11 | } 12 | 13 | e = term_eval.Experiment(**kwargs) 14 | e.run() 15 | e.output() 16 | -------------------------------------------------------------------------------- /ae/figure-9.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import scripts.term_eval as term_eval 3 | 4 | kwargs = { 5 | "name": "figure-9", 6 | "mode": ["pin", "odp", "rpc", "pdp"], 7 | "sz_unit": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"], 8 | "write_percent": 100, 9 | "xlabel": "Write Size", 10 | "xdata": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"], 11 | "legend": ["PIN", "ODP", "RPC", "TeRM"], 12 | } 13 | 14 | e = term_eval.Experiment(**kwargs) 15 | e.run() 16 | e.output() 17 | -------------------------------------------------------------------------------- /ae/output-example.tar.xz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/output-example.tar.xz -------------------------------------------------------------------------------- /ae/run-all.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | ./clear.sh 4 | ./figure-8.py 5 | ./figure-9.py 6 | ./figure-10a.py 7 | ./figure-10b.py 8 | ./figure-10c.py 9 | ./figure-10d.py 10 | ./figure-11.py 11 | ./figure-12a.py 12 | ./figure-12b.py 13 | ./figure-13a.py 14 | ./figure-13b.py 15 | ./figure-14a.py 16 | ./figure-14b.py 17 | ./figure-15a.py 18 | ./figure-15b.py 19 | ./figure-16.py 20 | ./figure-17.py 21 | -------------------------------------------------------------------------------- /ae/scripts/bootstrap-octopus.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import toml 4 | import paramiko 5 | import getpass 6 | import argparse 7 | import time 8 | import select 9 | import subprocess 10 | 11 | import os 12 | 13 | from paramiko import SSHConfig 14 | 15 | config = SSHConfig() 16 | with open(os.path.expanduser("/etc/ssh/ssh_config")) as _file: 17 | config.parse(_file) 18 | 19 | 20 | class RunPrinter: 21 | def __init__(self, 
name, channel): 22 | self.channel = channel 23 | self.name = name 24 | self.dying = False 25 | self.server_inited = False 26 | 27 | def print_one(self): 28 | if self.channel.recv_ready(): 29 | res = self.channel.recv(1048576 * 10).decode().splitlines() 30 | for l in res: 31 | print("@%-10s" % self.name, l.strip()) 32 | if "Worker" in l: 33 | self.server_inited = True 34 | 35 | if self.channel.recv_stderr_ready(): 36 | res = self.channel.recv_stderr(1048576 * 10).decode().splitlines() 37 | for l in res: 38 | print("@%-10s" % self.name, l.strip()) 39 | 40 | if self.channel.exit_status_ready(): 41 | print("! exit ", self.name) 42 | return False 43 | 44 | return True 45 | 46 | 47 | def check_keywords(lines, keywords, black_keywords): 48 | match = [] 49 | for l in lines: 50 | black = False 51 | for bk in black_keywords: 52 | if l.find(bk) >= 0: 53 | black = True 54 | break 55 | if black: 56 | continue 57 | flag = True 58 | for k in keywords: 59 | flag = flag and (l.find(k) >= 0) 60 | if flag: 61 | # print("matched line: ",l) 62 | match.append(l) 63 | return len(match) 64 | 65 | 66 | class Envs: 67 | def __init__(self, f=""): 68 | self.envs = {} 69 | self.history = [] 70 | try: 71 | self.load(f) 72 | except: 73 | pass 74 | 75 | def set(self, envs): 76 | self.envs = envs 77 | self.history += envs.keys() 78 | 79 | def load(self, f): 80 | self.envs = pickle.load(open(f, "rb")) 81 | 82 | def add(self, name, env): 83 | self.history.append(name) 84 | self.envs[name] = env 85 | 86 | def append(self, name, env): 87 | self.envs[name] = (self.envs[name] + ":" + env) 88 | 89 | def delete(self, name): 90 | self.history.remove(name) 91 | del self.envs[name] 92 | 93 | def __str__(self): 94 | s = "" 95 | for name in self.history: 96 | s += ("export %s=%s;" % (name, self.envs[name])) 97 | return s 98 | 99 | def store(self, f): 100 | with open(f, 'wb') as handle: 101 | pickle.dump(self.envs, handle, protocol=pickle.HIGHEST_PROTOCOL) 102 | 103 | 104 | class ConnectProxy: 105 | def __init__(self, mac, user="", passp=""): 106 | if user == "": # use the server user as default 107 | user = getpass.getuser() 108 | self.ssh = paramiko.SSHClient() 109 | 110 | self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) 111 | self.user = user 112 | self.mac = mac 113 | self.sftp = None 114 | self.passp = passp 115 | 116 | def connect(self, pwd, passp=None, timeout=30): 117 | # user_config = config.lookup(self.mac) 118 | # if user_config: 119 | # print("found user_config", config) 120 | # print(user_config) 121 | # #print(user_config) 122 | # #cfg = {'hostname': self.mac, 'username': self.user} 123 | # #cfg['sock'] = paramiko.ProxyCommand(user_config['proxycommand']) 124 | # return self.ssh.connect(hostname = self.mac,username = self.user, password = pwd, 125 | # timeout = timeout, allow_agent=False,look_for_keys=False,passphrase=passp,sock=paramiko.ProxyCommand(user_config['proxycommand']),banner_timeout=400) 126 | 127 | # else: 128 | return self.ssh.connect(hostname=self.mac, username=self.user, password=pwd, 129 | timeout=timeout, allow_agent=False, look_for_keys=False, passphrase=passp) 130 | 131 | def connect_with_pkey(self, keyfile_name, timeout=10): 132 | print("connect with pkey") 133 | user_config = ssh_config.lookup(self.mac) 134 | print(user_config) 135 | if user_config: 136 | assert False 137 | 138 | self.ssh.connect(hostname=self.mac, username=self.user, 139 | key_filename=keyfile_name, timeout=timeout) 140 | 141 | def execute(self, cmd, pty=False, timeout=None, background=False): 142 | if not background: 
143 | return self.ssh.exec_command(cmd, get_pty=pty, timeout=timeout) 144 | else: 145 | print("exe", cmd, "in background") 146 | transport = self.ssh.get_transport() 147 | channel = transport.open_session() 148 | return channel.exec_command(cmd) 149 | 150 | def execute_w_channel(self, cmd): 151 | transport = self.ssh.get_transport() 152 | channel = transport.open_session() 153 | channel.get_pty() 154 | channel.exec_command(cmd) 155 | return channel 156 | 157 | def copy_file(self, f, dst_dir=""): 158 | if self.sftp == None: 159 | self.sftp = paramiko.SFTPClient.from_transport( 160 | self.ssh.get_transport()) 161 | self.sftp.put(f, dst_dir + "/" + f) 162 | 163 | def get_file(self, remote_path, f): 164 | if self.sftp == None: 165 | self.sftp = paramiko.SFTPClient.from_transport( 166 | self.ssh.get_transport()) 167 | self.sftp.get(remote_path + "/" + f, f) 168 | 169 | def close(self): 170 | if self.sftp != None: 171 | self.sftp.close() 172 | self.ssh.close() 173 | 174 | def copy_dir(self, source, target, verbose=False): 175 | 176 | if self.sftp == None: 177 | self.sftp = paramiko.SFTPClient.from_transport( 178 | self.ssh.get_transport()) 179 | 180 | if os.path.isfile(source): 181 | return self.copy_file(source, target) 182 | 183 | try: 184 | os.listdir(source) 185 | except: 186 | print("[%S] failed to put %s" % (self.mac, source)) 187 | return 188 | 189 | self.mkdir(target, ignore_existing=True) 190 | 191 | for item in os.listdir(source): 192 | if os.path.isfile(os.path.join(source, item)): 193 | try: 194 | self.sftp.put(os.path.join(source, item), 195 | '%s/%s' % (target, item)) 196 | print_v(verbose, "[%s] put %s done" % 197 | (self.mac, os.path.join(source, item))) 198 | except: 199 | print("[%s] put %s error" % 200 | (self.mac, os.path.join(source, item))) 201 | else: 202 | self.mkdir('%s/%s' % (target, item), ignore_existing=True) 203 | self.copy_dir(os.path.join(source, item), 204 | '%s/%s' % (target, item)) 205 | 206 | def mkdir(self, path, mode=511, ignore_existing=False): 207 | try: 208 | self.sftp.mkdir(path, mode) 209 | except IOError: 210 | if ignore_existing: 211 | pass 212 | else: 213 | raise 214 | 215 | 216 | class Courier2: 217 | def __init__(self, user=getpass.getuser(), pwd="123", hosts="hosts.xml", passp="", curdir=".", keyfile=""): 218 | self.user = user 219 | self.pwd = pwd 220 | self.keyfile = keyfile 221 | self.cached_host = "localhost" 222 | self.passp = passp 223 | 224 | self.curdir = curdir 225 | self.envs = Envs() 226 | 227 | def cd(self, dir): 228 | if os.path.isabs(dir): 229 | self.curdir = dir 230 | if self.curdir == "~": 231 | self.curdir = "." 
232 | else: 233 | self.curdir += ("/" + dir) 234 | 235 | def get_file(self, host, dst_dir, f, timeout=None): 236 | p = ConnectProxy(host, self.user) 237 | try: 238 | if len(self.keyfile): 239 | p.connect_with_pkey(self.keyfile, timeout) 240 | else: 241 | p.connect(self.pwd, timeout=timeout) 242 | except Exception as e: 243 | print("[get_file] connect to %s error: " % host, e) 244 | p.close() 245 | return False, None 246 | try: 247 | p.get_file(dst_dir, f) 248 | except Exception as e: 249 | print("[get_file] get %s @%s error " % (f, host), e) 250 | p.close() 251 | return False, None 252 | if p: 253 | p.close() 254 | 255 | return True, None 256 | 257 | def copy_file(self, host, f, dst_dir="~/", timeout=None): 258 | p = ConnectProxy(host, self.user) 259 | try: 260 | if len(self.keyfile): 261 | p.connect_with_pkey(self.keyfile, timeout) 262 | else: 263 | p.connect(self.pwd, timeout=timeout) 264 | except Exception as e: 265 | print("[copy_file] connect to %s error: " % host, e) 266 | p.close() 267 | return False, None 268 | try: 269 | p.copy_file(f, dst_dir) 270 | except Exception as e: 271 | print("[copy_file] copy %s error " % host, e) 272 | p.close() 273 | return False, None 274 | if p: 275 | p.close() 276 | 277 | return True, None 278 | 279 | def execute_w_channel(self, cmd, host, dir, timeout=None): 280 | p = ConnectProxy(host, self.user) 281 | try: 282 | if len(self.keyfile): 283 | p.connect_with_pkey(self.keyfile, timeout) 284 | else: 285 | p.connect(self.pwd, self.passp, timeout=timeout) 286 | except Exception as e: 287 | print("[pre execute] connect to %s error: " % host, e) 288 | p.close() 289 | return None, e 290 | 291 | try: 292 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd 293 | return p.execute_w_channel(ccmd) 294 | except: 295 | return None 296 | 297 | def pre_execute(self, cmd, host, pty=True, dir="", timeout=None, retry_count=3, background=False): 298 | if dir == "": 299 | dir = self.curdir 300 | 301 | p = ConnectProxy(host, self.user) 302 | try: 303 | if len(self.keyfile): 304 | p.connect_with_pkey(self.keyfile, timeout) 305 | else: 306 | p.connect(self.pwd, timeout=timeout) 307 | except Exception as e: 308 | print("[pre execute] connect to %s error: " % host, e) 309 | p.close() 310 | return None, e 311 | 312 | try: 313 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd 314 | if not background: 315 | _, stdout, _ = p.execute( 316 | ccmd, pty, timeout, background=background) 317 | return p, stdout 318 | else: 319 | c = p.execute(ccmd, pty, timeout, background=True) 320 | return p, c 321 | except Exception as e: 322 | print("[pre execute] execute cmd @ %s error: " % host, e) 323 | p.close() 324 | if retry_count > 0: 325 | if timeout: 326 | timeout += 2 327 | return self.pre_execute(cmd, host, pty, dir, timeout, retry_count - 1) 328 | 329 | def execute(self, cmd, host, pty=True, dir="", timeout=None, output=True, background=False): 330 | ret = [True, ""] 331 | p, stdout = self.pre_execute( 332 | cmd, host, pty, dir, timeout, background=background) 333 | if p and (stdout and output) and (not background): 334 | try: 335 | while not stdout.channel.closed: 336 | try: 337 | for line in iter(lambda: stdout.readline(2048), ""): 338 | if pty and (len(line) > 0): # ignore null lines 339 | print((("[%s]: " % host) + line), end="") 340 | else: 341 | ret[1] += (line + "\n") 342 | except UnicodeDecodeError as e: 343 | continue 344 | except Exception as e: 345 | break 346 | except Exception as e: 347 | print("[%s] execute with execption:" % host, e) 348 | if p and (not background): 
349 | p.close() 350 | # ret[1] = stdout 351 | else: 352 | ret[0] = False 353 | ret[1] = stdout 354 | return ret[0], ret[1] 355 | 356 | 357 | # cr.envs.set(base_env) 358 | 359 | 360 | def main(): 361 | arg_parser = argparse.ArgumentParser( 362 | description=''' Benchmark script for running the cluster''') 363 | arg_parser.add_argument( 364 | '-f', metavar='CONFIG', dest='config', default="run.toml", type=str, 365 | help='The configuration file to execute commands') 366 | arg_parser.add_argument('-b', '--black', nargs='+', 367 | help='hosts to ignore', required=False) 368 | arg_parser.add_argument( 369 | '-n', '--num', help='num-passes to run', default=128, type=int) 370 | 371 | args = arg_parser.parse_args() 372 | num = args.num 373 | 374 | config = toml.load(args.config) 375 | 376 | passes = config.get("pass", []) 377 | 378 | user = config.get("user", "root") 379 | pwd = config.get("pwd", "fast24ae") 380 | passp = config.get("passphrase", None) 381 | 382 | prefix_cmd = config.get("prefix_cmd", "") 383 | print(f"! prefix_cmd={prefix_cmd}") 384 | suffix_cmd = config.get("suffix_cmd", "") 385 | print(f"! suffix_cmd={suffix_cmd}") 386 | 387 | black_list = {} 388 | if args.black: 389 | for e in args.black: 390 | black_list[e] = True 391 | 392 | cr = Courier2(user, pwd, passp=passp) 393 | 394 | # first we sync files 395 | 396 | syncs = config.get("sync", []) 397 | for s in syncs: 398 | source = s["source"] 399 | targets = s["targets"] 400 | for t in targets: 401 | print("(sync files)", "scp -3 %s %s" % (source, t)) 402 | subprocess.call(("scp -3 %s %s" % (source, t)).split()) 403 | 404 | init_cmd = config.get("init_cmd") 405 | if init_cmd: 406 | os.system(init_cmd) 407 | 408 | printer_list = [] 409 | runned = 0 410 | for p in passes: 411 | if runned > num: 412 | break 413 | runned += 1 414 | print("! 
execute cmd @%s" % p["host"], p["cmd"]) 415 | 416 | if p["host"] in black_list: 417 | continue 418 | 419 | if p.get("local", "no") == "yes": 420 | subprocess.run(p["cmd"].split(" ")) 421 | pass 422 | else: 423 | channel = cr.execute_w_channel(prefix_cmd + " " + p["cmd"] + " " + suffix_cmd, 424 | p["host"], 425 | p["path"]) 426 | if p["name"] not in config.get("null", []): 427 | current_printer = RunPrinter(p["host"] + ":" + p["name"], channel) 428 | printer_list.append(current_printer) 429 | if p["name"] == "s0": 430 | goon = True 431 | while goon and not current_printer.server_inited: 432 | goon = current_printer.print_one() 433 | time.sleep(0.1) 434 | 435 | should_stop = False 436 | printer_go_on = True 437 | while printer_go_on: 438 | try: 439 | temp = [] 440 | for p in printer_list: 441 | if should_stop: 442 | p.dying = True 443 | p.channel.send(chr(3)) 444 | if p.print_one(): 445 | temp.append(p) 446 | else: 447 | should_stop = True 448 | printer_list = temp 449 | printer_go_on = True if len(printer_list) > 0 else False 450 | time.sleep(0.1) 451 | except KeyboardInterrupt: 452 | print("set should_stop") 453 | should_stop = True 454 | 455 | 456 | if __name__ == "__main__": 457 | main() 458 | -------------------------------------------------------------------------------- /ae/scripts/bootstrap.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import toml 4 | import paramiko 5 | import getpass 6 | import argparse 7 | import time 8 | import select 9 | import subprocess 10 | 11 | import os 12 | 13 | from paramiko import SSHConfig 14 | 15 | config = SSHConfig() 16 | with open(os.path.expanduser("/etc/ssh/ssh_config")) as _file: 17 | config.parse(_file) 18 | 19 | 20 | class RunPrinter: 21 | def __init__(self, name, channel): 22 | self.channel = channel 23 | self.name = name 24 | self.dying = False 25 | 26 | def print_one(self): 27 | if self.channel.recv_ready(): 28 | res = self.channel.recv(1048576).decode().splitlines() 29 | for l in res: 30 | print("@%-10s" % self.name, l.strip()) 31 | 32 | if self.channel.recv_stderr_ready(): 33 | res = self.channel.recv_stderr(1048576).decode().splitlines() 34 | for l in res: 35 | print("@%-10s" % self.name, l.strip()) 36 | 37 | if self.channel.exit_status_ready(): 38 | print("! 
exit ", self.name) 39 | return False 40 | 41 | return True 42 | 43 | 44 | def check_keywords(lines, keywords, black_keywords): 45 | match = [] 46 | for l in lines: 47 | black = False 48 | for bk in black_keywords: 49 | if l.find(bk) >= 0: 50 | black = True 51 | break 52 | if black: 53 | continue 54 | flag = True 55 | for k in keywords: 56 | flag = flag and (l.find(k) >= 0) 57 | if flag: 58 | # print("matched line: ",l) 59 | match.append(l) 60 | return len(match) 61 | 62 | 63 | class Envs: 64 | def __init__(self, f=""): 65 | self.envs = {} 66 | self.history = [] 67 | try: 68 | self.load(f) 69 | except: 70 | pass 71 | 72 | def set(self, envs): 73 | self.envs = envs 74 | self.history += envs.keys() 75 | 76 | def load(self, f): 77 | self.envs = pickle.load(open(f, "rb")) 78 | 79 | def add(self, name, env): 80 | self.history.append(name) 81 | self.envs[name] = env 82 | 83 | def append(self, name, env): 84 | self.envs[name] = (self.envs[name] + ":" + env) 85 | 86 | def delete(self, name): 87 | self.history.remove(name) 88 | del self.envs[name] 89 | 90 | def __str__(self): 91 | s = "" 92 | for name in self.history: 93 | s += ("export %s=%s;" % (name, self.envs[name])) 94 | return s 95 | 96 | def store(self, f): 97 | with open(f, 'wb') as handle: 98 | pickle.dump(self.envs, handle, protocol=pickle.HIGHEST_PROTOCOL) 99 | 100 | 101 | class ConnectProxy: 102 | def __init__(self, mac, user="", passp=""): 103 | if user == "": # use the server user as default 104 | user = getpass.getuser() 105 | self.ssh = paramiko.SSHClient() 106 | 107 | self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) 108 | self.user = user 109 | self.mac = mac 110 | self.sftp = None 111 | self.passp = passp 112 | 113 | def connect(self, pwd, passp=None, timeout=30): 114 | # user_config = config.lookup(self.mac) 115 | # if user_config: 116 | # print("found user_config", config) 117 | # print(user_config) 118 | # #print(user_config) 119 | # #cfg = {'hostname': self.mac, 'username': self.user} 120 | # #cfg['sock'] = paramiko.ProxyCommand(user_config['proxycommand']) 121 | # return self.ssh.connect(hostname = self.mac,username = self.user, password = pwd, 122 | # timeout = timeout, allow_agent=False,look_for_keys=False,passphrase=passp,sock=paramiko.ProxyCommand(user_config['proxycommand']),banner_timeout=400) 123 | 124 | # else: 125 | return self.ssh.connect(hostname=self.mac, username=self.user, password=pwd, 126 | timeout=timeout, allow_agent=False, look_for_keys=False, passphrase=passp) 127 | 128 | def connect_with_pkey(self, keyfile_name, timeout=10): 129 | print("connect with pkey") 130 | user_config = ssh_config.lookup(self.mac) 131 | print(user_config) 132 | if user_config: 133 | assert False 134 | 135 | self.ssh.connect(hostname=self.mac, username=self.user, 136 | key_filename=keyfile_name, timeout=timeout) 137 | 138 | def execute(self, cmd, pty=False, timeout=None, background=False): 139 | if not background: 140 | return self.ssh.exec_command(cmd, get_pty=pty, timeout=timeout) 141 | else: 142 | print("exe", cmd, "in background") 143 | transport = self.ssh.get_transport() 144 | channel = transport.open_session() 145 | return channel.exec_command(cmd) 146 | 147 | def execute_w_channel(self, cmd): 148 | transport = self.ssh.get_transport() 149 | channel = transport.open_session() 150 | channel.get_pty() 151 | channel.exec_command(cmd) 152 | return channel 153 | 154 | def copy_file(self, f, dst_dir=""): 155 | if self.sftp == None: 156 | self.sftp = paramiko.SFTPClient.from_transport( 157 | self.ssh.get_transport()) 
158 | self.sftp.put(f, dst_dir + "/" + f) 159 | 160 | def get_file(self, remote_path, f): 161 | if self.sftp == None: 162 | self.sftp = paramiko.SFTPClient.from_transport( 163 | self.ssh.get_transport()) 164 | self.sftp.get(remote_path + "/" + f, f) 165 | 166 | def close(self): 167 | if self.sftp != None: 168 | self.sftp.close() 169 | self.ssh.close() 170 | 171 | def copy_dir(self, source, target, verbose=False): 172 | 173 | if self.sftp == None: 174 | self.sftp = paramiko.SFTPClient.from_transport( 175 | self.ssh.get_transport()) 176 | 177 | if os.path.isfile(source): 178 | return self.copy_file(source, target) 179 | 180 | try: 181 | os.listdir(source) 182 | except: 183 | print("[%S] failed to put %s" % (self.mac, source)) 184 | return 185 | 186 | self.mkdir(target, ignore_existing=True) 187 | 188 | for item in os.listdir(source): 189 | if os.path.isfile(os.path.join(source, item)): 190 | try: 191 | self.sftp.put(os.path.join(source, item), 192 | '%s/%s' % (target, item)) 193 | print_v(verbose, "[%s] put %s done" % 194 | (self.mac, os.path.join(source, item))) 195 | except: 196 | print("[%s] put %s error" % 197 | (self.mac, os.path.join(source, item))) 198 | else: 199 | self.mkdir('%s/%s' % (target, item), ignore_existing=True) 200 | self.copy_dir(os.path.join(source, item), 201 | '%s/%s' % (target, item)) 202 | 203 | def mkdir(self, path, mode=511, ignore_existing=False): 204 | try: 205 | self.sftp.mkdir(path, mode) 206 | except IOError: 207 | if ignore_existing: 208 | pass 209 | else: 210 | raise 211 | 212 | 213 | class Courier2: 214 | def __init__(self, user=getpass.getuser(), pwd="123", hosts="hosts.xml", passp="", curdir=".", keyfile=""): 215 | self.user = user 216 | self.pwd = pwd 217 | self.keyfile = keyfile 218 | self.cached_host = "localhost" 219 | self.passp = passp 220 | 221 | self.curdir = curdir 222 | self.envs = Envs() 223 | 224 | def cd(self, dir): 225 | if os.path.isabs(dir): 226 | self.curdir = dir 227 | if self.curdir == "~": 228 | self.curdir = "." 
229 | else: 230 | self.curdir += ("/" + dir) 231 | 232 | def get_file(self, host, dst_dir, f, timeout=None): 233 | p = ConnectProxy(host, self.user) 234 | try: 235 | if len(self.keyfile): 236 | p.connect_with_pkey(self.keyfile, timeout) 237 | else: 238 | p.connect(self.pwd, timeout=timeout) 239 | except Exception as e: 240 | print("[get_file] connect to %s error: " % host, e) 241 | p.close() 242 | return False, None 243 | try: 244 | p.get_file(dst_dir, f) 245 | except Exception as e: 246 | print("[get_file] get %s @%s error " % (f, host), e) 247 | p.close() 248 | return False, None 249 | if p: 250 | p.close() 251 | 252 | return True, None 253 | 254 | def copy_file(self, host, f, dst_dir="~/", timeout=None): 255 | p = ConnectProxy(host, self.user) 256 | try: 257 | if len(self.keyfile): 258 | p.connect_with_pkey(self.keyfile, timeout) 259 | else: 260 | p.connect(self.pwd, timeout=timeout) 261 | except Exception as e: 262 | print("[copy_file] connect to %s error: " % host, e) 263 | p.close() 264 | return False, None 265 | try: 266 | p.copy_file(f, dst_dir) 267 | except Exception as e: 268 | print("[copy_file] copy %s error " % host, e) 269 | p.close() 270 | return False, None 271 | if p: 272 | p.close() 273 | 274 | return True, None 275 | 276 | def execute_w_channel(self, cmd, host, dir, timeout=None): 277 | p = ConnectProxy(host, self.user) 278 | try: 279 | if len(self.keyfile): 280 | p.connect_with_pkey(self.keyfile, timeout) 281 | else: 282 | p.connect(self.pwd, self.passp, timeout=timeout) 283 | except Exception as e: 284 | print("[pre execute] connect to %s error: " % host, e) 285 | p.close() 286 | return None, e 287 | 288 | try: 289 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd 290 | return p.execute_w_channel(ccmd) 291 | except: 292 | return None 293 | 294 | def pre_execute(self, cmd, host, pty=True, dir="", timeout=None, retry_count=3, background=False): 295 | if dir == "": 296 | dir = self.curdir 297 | 298 | p = ConnectProxy(host, self.user) 299 | try: 300 | if len(self.keyfile): 301 | p.connect_with_pkey(self.keyfile, timeout) 302 | else: 303 | p.connect(self.pwd, timeout=timeout) 304 | except Exception as e: 305 | print("[pre execute] connect to %s error: " % host, e) 306 | p.close() 307 | return None, e 308 | 309 | try: 310 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd 311 | if not background: 312 | _, stdout, _ = p.execute( 313 | ccmd, pty, timeout, background=background) 314 | return p, stdout 315 | else: 316 | c = p.execute(ccmd, pty, timeout, background=True) 317 | return p, c 318 | except Exception as e: 319 | print("[pre execute] execute cmd @ %s error: " % host, e) 320 | p.close() 321 | if retry_count > 0: 322 | if timeout: 323 | timeout += 2 324 | return self.pre_execute(cmd, host, pty, dir, timeout, retry_count - 1) 325 | 326 | def execute(self, cmd, host, pty=True, dir="", timeout=None, output=True, background=False): 327 | ret = [True, ""] 328 | p, stdout = self.pre_execute( 329 | cmd, host, pty, dir, timeout, background=background) 330 | if p and (stdout and output) and (not background): 331 | try: 332 | while not stdout.channel.closed: 333 | try: 334 | for line in iter(lambda: stdout.readline(2048), ""): 335 | if pty and (len(line) > 0): # ignore null lines 336 | print((("[%s]: " % host) + line), end="") 337 | else: 338 | ret[1] += (line + "\n") 339 | except UnicodeDecodeError as e: 340 | continue 341 | except Exception as e: 342 | break 343 | except Exception as e: 344 | print("[%s] execute with execption:" % host, e) 345 | if p and (not background): 
346 | p.close() 347 | # ret[1] = stdout 348 | else: 349 | ret[0] = False 350 | ret[1] = stdout 351 | return ret[0], ret[1] 352 | 353 | 354 | # cr.envs.set(base_env) 355 | 356 | 357 | def main(): 358 | arg_parser = argparse.ArgumentParser( 359 | description=''' Benchmark script for running the cluster''') 360 | arg_parser.add_argument( 361 | '-f', metavar='CONFIG', dest='config', default="run.toml", type=str, 362 | help='The configuration file to execute commands') 363 | arg_parser.add_argument('-b', '--black', nargs='+', 364 | help='hosts to ignore', required=False) 365 | arg_parser.add_argument( 366 | '-n', '--num', help='num-passes to run', default=128, type=int) 367 | 368 | args = arg_parser.parse_args() 369 | num = args.num 370 | 371 | config = toml.load(args.config) 372 | 373 | passes = config.get("pass", []) 374 | 375 | user = config.get("user", "root") 376 | pwd = config.get("pwd", "fast24ae") 377 | passp = config.get("passphrase", None) 378 | 379 | prefix_cmd = config.get("prefix_cmd", "") 380 | print(f"! prefix_cmd={prefix_cmd}") 381 | suffix_cmd = config.get("suffix_cmd", "") 382 | print(f"! suffix_cmd={suffix_cmd}") 383 | 384 | black_list = {} 385 | if args.black: 386 | for e in args.black: 387 | black_list[e] = True 388 | 389 | cr = Courier2(user, pwd, passp=passp) 390 | 391 | # first we sync files 392 | 393 | syncs = config.get("sync", []) 394 | for s in syncs: 395 | source = s["source"] 396 | targets = s["targets"] 397 | for t in targets: 398 | print("(sync files)", "scp -3 %s %s" % (source, t)) 399 | subprocess.call(("scp -3 %s %s" % (source, t)).split()) 400 | 401 | init_cmd = config.get("init_cmd") 402 | if init_cmd: 403 | os.system(init_cmd) 404 | 405 | printer = [] 406 | runned = 0 407 | for p in passes: 408 | if runned > num: 409 | break 410 | runned += 1 411 | print("! 
execute cmd @%s" % p["host"], p["cmd"]) 412 | 413 | if p["host"] in black_list: 414 | continue 415 | 416 | if p.get("local", "no") == "yes": 417 | subprocess.run(p["cmd"].split(" ")) 418 | pass 419 | else: 420 | channel = cr.execute_w_channel(prefix_cmd + " " + p["cmd"] + " " + suffix_cmd, 421 | p["host"], 422 | p["path"]) 423 | if p["name"] not in config.get("null", []): 424 | printer.append(RunPrinter(p["host"] + ":" + p["name"], channel)) 425 | time.sleep(1) 426 | 427 | should_stop = False 428 | printer_go_on = True 429 | while printer_go_on: 430 | try: 431 | temp = [] 432 | for p in printer: 433 | if should_stop: 434 | p.dying = True 435 | p.channel.send(chr(3)) 436 | print("terminating", p.name) 437 | if p.print_one(): 438 | temp.append(p) 439 | else: 440 | should_stop = True 441 | printer = temp 442 | printer_go_on = True if len(printer) > 0 else False 443 | time.sleep(0.1) 444 | except KeyboardInterrupt: 445 | print("set should_stop") 446 | should_stop = True 447 | 448 | 449 | if __name__ == "__main__": 450 | main() 451 | -------------------------------------------------------------------------------- /ae/scripts/ins-pch.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | cd /home/yz/workspace/TeRM/libterm/kmod 3 | make 4 | rmmod pch 5 | insmod pch.ko 6 | -------------------------------------------------------------------------------- /ae/scripts/limit-mem.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | function get_available { 4 | echo `free -g|awk 'NR==2 {print $7}'` 5 | } 6 | 7 | function set { 8 | kb=`gb_to_kb $1` 9 | modprobe brd rd_nr=1 rd_size=${kb} 10 | dd if=/dev/zero of=/dev/ram0 bs=1k count=${kb} 11 | } 12 | 13 | function clear { 14 | rmmod brd 15 | } 16 | 17 | function gb_to_kb { 18 | echo `expr $1 \* 1024 \* 1024` 19 | } 20 | 21 | # if [ ! $# -eq 1 ] 22 | # then 23 | # echo "$0 " 24 | # exit 1 25 | # fi 26 | 27 | if [ -z "$PDP_server_memory_gb" ] 28 | then 29 | PDP_server_memory_gb=0 30 | if [ $# -eq 1 ] 31 | then 32 | PDP_server_memory_gb=$1 33 | fi 34 | fi 35 | 36 | echo target memory: ${PDP_server_memory_gb}GB 37 | target=${PDP_server_memory_gb} 38 | 39 | if [ $target -eq 0 ] 40 | then 41 | clear 42 | exit 0 43 | fi 44 | 45 | if [ `get_available` -eq $target ] 46 | then 47 | echo already done. 
48 | exit 0 49 | fi 50 | 51 | clear 52 | 53 | avail=`get_available` 54 | if [ $avail -lt $target ] 55 | then 56 | echo "target ${target}GB > available ${avail}GB" 57 | exit 1 58 | fi 59 | 60 | set `expr $avail - $target` 61 | -------------------------------------------------------------------------------- /ae/scripts/octopus_eval.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | 4 | def run_test(mode = "pdp", write = 0, unit_size = "4k", log_dir = "."): 5 | server_memory_gb = 0 if mode == "pin" else 18 6 | 7 | os.system(f"mkdir -p {log_dir}") 8 | log_file = f"{log_dir}/write.{write},mode.{mode},unit.{unit_size}.log" 9 | 10 | test_file_content = f""" 11 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh" 12 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches; 13 | echo 100 > /proc/sys/vm/dirty_ratio; 14 | echo 0 > /proc/sys/kernel/numa_balancing; 15 | systemctl start opensmd; 16 | export LD_PRELOAD="/home/yz/workspace/TeRM/ae/bin/libpdp.so /home/yz/workspace/TeRM/ae/bin/octopus/libnrfsc.so"; 17 | export PDP_server_rpc_threads=16 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb={server_memory_gb}; 18 | export PDP_mode={mode};\"\"\" 19 | suffix_cmd = "" 20 | 21 | ## server process 22 | [[pass]] 23 | name = "s0" 24 | host = "node184" 25 | path = "/home/yz/workspace/TeRM/ae/bin/octopus" 26 | cmd = \"\"\"../../scripts/ins-pch.sh; 27 | systemctl restart openibd; 28 | ../../scripts/limit-mem.sh; 29 | export PDP_is_server=1; 30 | ./dmfs\"\"\" 31 | 32 | ## below are clients 33 | [[pass]] 34 | name = "c0" 35 | host = "node166" 36 | path = "/home/yz/workspace/TeRM/ae/bin/octopus" 37 | cmd = "mpiexec --allow-run-as-root --np 32 /bin/bash -c './mpibw --write={write} --unit_size={unit_size} --seconds=70 --zipf_theta=0.99 --file_size=1g' " 38 | """ 39 | with open(f"{log_dir}/test.toml", "w") as f: 40 | f.write(test_file_content) 41 | 42 | cmd = f""" 43 | date | tee -a {log_file}; 44 | /home/yz/workspace/TeRM/ae/scripts/bootstrap-octopus.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file} 45 | """ 46 | os.system(cmd) 47 | 48 | def extract_result(mode = "pdp", write = 0, unit_size = "4k", log_dir = "."): 49 | result = 0.0 50 | 51 | log_file = f"{log_dir}/write.{write},mode.{mode},unit.{unit_size}.log" 52 | with open(log_file, "r") as f: 53 | while True: 54 | line = f.readline() 55 | if not line: 56 | break 57 | if "Bandwidth = " not in line: 58 | continue 59 | 60 | tokens = re.findall("Bandwidth = (\d+\.\d+)", line) 61 | result = float(tokens[0]) 62 | 63 | return result 64 | 65 | def format_digits(n): 66 | return f"{n:,.2f}" 67 | 68 | def split_kwargs(kwargs): 69 | vector_kwargs = {} 70 | scalar_kwargs = {} 71 | for k, v in kwargs.items(): 72 | if type(v) == list: 73 | vector_kwargs[k] = v 74 | else: 75 | scalar_kwargs[k] = v 76 | return vector_kwargs, scalar_kwargs 77 | 78 | 79 | def run_batch(name : str, **kwargs): 80 | log_dir = f"./output/log/{name}" 81 | os.system(f"mkdir -p {log_dir}") 82 | 83 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs) 84 | 85 | from itertools import product 86 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values())) 87 | for idx, kv in enumerate(l): 88 | str = f"### Running ({idx + 1}/{len(l)}): {kv} ###" 89 | with open("./output/current-running.txt", "w") as f: 90 | f.write(f"{name}\n{str}\n") 91 | print(str) 92 | run_test(**kv, **scalar_kwargs, log_dir=log_dir) 93 | 94 | 95 | def output_batch(name : str, ydata = [], **kwargs): 96 | log_dir = 
f"./output/log/{name}" 97 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs) 98 | from itertools import product 99 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values())) 100 | output = "" 101 | 102 | for idx, kv in enumerate(l): 103 | result = extract_result(**kv, **scalar_kwargs, log_dir=log_dir) 104 | str = f"({idx + 1}/{len(l)}) {kv}: bandwidth (MB/s)=" 105 | ydata.append(result) 106 | str += f"{format_digits(result)}" 107 | output += str + "\n" 108 | 109 | print(output) 110 | with open(f"./output/{name}.txt", "w") as f: 111 | f.write(output) 112 | 113 | 114 | def draw_figure(name : str, xlabel : str, xdata : list, ydata : list, legend : list[str]): 115 | import matplotlib.pyplot as plt 116 | xlen = len(xdata) 117 | 118 | for idx, label in enumerate(legend): 119 | match label: 120 | case "PIN": 121 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-D", mfc="w", label = "PIN") 122 | case "ODP": 123 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-^", mfc="w", label = "ODP") 124 | case "RPC": 125 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-x", mfc="w", label = "RPC") 126 | case "TeRM": 127 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = "TeRM") 128 | case _: 129 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = label) 130 | 131 | plt.xlabel(xlabel) 132 | plt.ylabel("Bandwidth (MB/s)") 133 | plt.legend() 134 | print(f"written {name}.png") 135 | plt.savefig(f"./output/{name}.png") 136 | 137 | class Experiment: 138 | def __init__(self, name : str, 139 | xlabel: str, xdata: list, legend: list[str], 140 | **kwargs): 141 | self.name = name 142 | self.xlabel = xlabel 143 | self.xdata = xdata 144 | self.ydata = [] 145 | self.legend = legend 146 | self.kwargs = kwargs 147 | 148 | def run(self): 149 | run_batch(name=self.name, **self.kwargs) 150 | 151 | def output(self): 152 | output_batch(name=self.name, ydata=self.ydata, **self.kwargs) 153 | draw_figure(name=self.name, xlabel=self.xlabel, xdata=self.xdata, ydata=self.ydata, legend=self.legend) 154 | -------------------------------------------------------------------------------- /ae/scripts/reset-memc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | addr="10.0.2.181" 4 | port="23333" 5 | 6 | # kill old 7 | # ssh ${addr} "pkill -f memcached; memcached -u root -l ${addr} -p ${port} -c 10000 -d -P /tmp/memcached.pid" 8 | ssh ${addr} "cat /tmp/memcached.pid | xargs kill > /dev/null 2>&1; sleep 1; memcached -u root -l ${addr} -p ${port} -c 10000 -d -P /tmp/memcached.pid" 9 | ssh ${addr} "cat /tmp/memcached.pid | xargs kill > /dev/null 2>&1; sleep 1; memcached -u root -l ${addr} -p ${port} -c 10000 -d -P /tmp/memcached.pid" 10 | 11 | sleep 1 12 | 13 | # init 14 | echo -e "set nr_nodes 0 0 1\r\n0\r\nquit\r" | nc ${addr} ${port} 15 | -------------------------------------------------------------------------------- /ae/scripts/term_eval.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | 4 | def run_test(write_percent = 0, skewness_100 = 99, server_memory_gb = 32, ssd = "nvme4n1", 5 | mode = "pdp", nr_server_threads = 16, nr_client_threads = 64, sz_unit = "4k", 6 | verify = 0, dynamic = False, log_dir="."): 7 | if mode == "pin": 8 | server_memory_gb = 0 9 | running_seconds = 120 if dynamic else 70 10 | 11 | os.system(f"mkdir -p {log_dir}") 12 | log_file = 
f"{log_dir}/write.{write_percent},skew.{skewness_100},pm.{server_memory_gb},ssd.{ssd},mode.{mode},sth.{nr_server_threads},cth.{nr_client_threads},unit.{sz_unit}.log" 13 | 14 | comment_mark = "#" if nr_client_threads == 1 else "" 15 | 16 | str = f""" 17 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh" 18 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches; 19 | echo 100 > /proc/sys/vm/dirty_ratio; 20 | echo 0 > /proc/sys/kernel/numa_balancing; 21 | systemctl start opensmd; 22 | export LD_PRELOAD=/home/yz/workspace/TeRM/ae/bin/libpdp.so; 23 | export PDP_server_rpc_threads={nr_server_threads} PDP_server_mmap_dev={ssd} PDP_server_memory_gb={server_memory_gb}; 24 | export PDP_mode={mode};\"\"\" 25 | suffix_cmd = "--nr_nodes={2 if nr_client_threads == 1 else 3} --running_seconds={running_seconds} --sz_server_mr=64g --write_percent={write_percent} --verify={verify} --skewness_100={skewness_100} --sz_unit={sz_unit} --nr_client_threads={1 if nr_client_threads == 1 else int(nr_client_threads / 2)} --hotspot_switch_second={60 if dynamic else 0};" 26 | 27 | ## server process 28 | [[pass]] 29 | name = "s0" 30 | host = "node184" 31 | path = "/home/yz/workspace/TeRM/ae" 32 | cmd = \"\"\"./scripts/ins-pch.sh; 33 | systemctl restart openibd; 34 | ./scripts/limit-mem.sh 0; 35 | export PDP_is_server=1; 36 | ./bin/perf --node_id=0\"\"\" 37 | 38 | ## below are clients 39 | [[pass]] 40 | name = "c0" 41 | host = "node166" 42 | path = "/home/yz/workspace/TeRM/ae" 43 | cmd = "./bin/perf --node_id=1" 44 | 45 | {comment_mark} [[pass]] 46 | {comment_mark} name = "c1" 47 | {comment_mark} host = "node168" 48 | {comment_mark} path = "/home/yz/workspace/TeRM/ae" 49 | {comment_mark} cmd = "./bin/perf --node_id=2" 50 | 51 | """ 52 | with open(f"{log_dir}/test.toml", "w") as f: 53 | f.write(str) 54 | 55 | cmd = f""" 56 | date | tee -a {log_file}; 57 | /home/yz/workspace/TeRM/ae/scripts/bootstrap.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file} 58 | """ 59 | os.system(cmd) 60 | 61 | def run_octopus(mode = "pdp", rw = 0, unit_size = "4k", log_dir = "."): 62 | server_memory_gb = 0 if mode == "pin" else 18 63 | 64 | os.system(f"mkdir -p {log_dir}") 65 | log_file = f"{log_dir}/write.{rw},mode.{mode},unit.{unit_size}.log" 66 | 67 | test_file_content = f""" 68 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh" 69 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches; 70 | echo 100 > /proc/sys/vm/dirty_ratio; 71 | echo 0 > /proc/sys/kernel/numa_balancing; 72 | export LD_PRELOAD="/home/yz/workspace/TeRM/ae/bin/libpdp.so /home/yz/workspace/TeRM/ae/bin/octopus/libnrfsc.so"; 73 | export PDP_server_rpc_threads=16 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb={server_memory_gb}; 74 | export PDP_mode={mode};\"\"\" 75 | suffix_cmd = "" 76 | 77 | ## server process 78 | [[pass]] 79 | name = "s0" 80 | host = "node184" 81 | path = "/home/yz/workspace/TeRM/ae/bin/octopus" 82 | cmd = \"\"\"../../scripts/ins-pch.sh; 83 | systemctl restart openibd; 84 | ../../scripts/limit-mem.sh; 85 | export PDP_is_server=1; 86 | ./dmfs\"\"\" 87 | 88 | ## below are clients 89 | [[pass]] 90 | name = "c0" 91 | host = "node166" 92 | path = "/home/yz/workspace/TeRM/ae/bin/octopus" 93 | cmd = "mpiexec --allow-run-as-root --np 32 /bin/bash -c './mpibw --write={rw} --unit_size={unit_size} --seconds=70 --zipf_theta=0.99 --file_size=1g' " 94 | """ 95 | with open(f"{log_dir}/test.toml", "w") as f: 96 | f.write(test_file_content) 97 | 98 | cmd = f""" 99 | date | tee -a {log_file}; 100 | 
/home/yz/workspace/TeRM/ae/scripts/bootstrap-octopus.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file} 101 | """ 102 | os.system(cmd) 103 | 104 | def extract_throughput(write_percent = 0, skewness_100 = 99, server_memory_gb = 32, 105 | ssd = "nvme4n1", mode = "pdp", nr_server_threads = 16, 106 | nr_client_threads = 64, sz_unit = "4k", dynamic = False, log_dir = "."): 107 | 108 | if mode == "pin": 109 | server_memory_gb = 0 110 | 111 | sampling_start = 0 if dynamic else 5 112 | sampling_seconds = 120 if dynamic else 60 113 | 114 | log_file = f"{log_dir}/write.{write_percent},skew.{skewness_100},pm.{server_memory_gb},ssd.{ssd},mode.{mode},sth.{nr_server_threads},cth.{nr_client_threads},unit.{sz_unit}.log" 115 | with open(log_file, "r") as f: 116 | thpt_list = [] 117 | current_thpt_list = [] 118 | 119 | new_flag = False 120 | dual_client = False 121 | while True: 122 | line = f.readline() 123 | if not line: 124 | break 125 | if "epoch" not in line: 126 | continue 127 | if "epoch 1 " in line: 128 | if not new_flag: 129 | thpt_list = current_thpt_list 130 | current_thpt_list = [] 131 | else: 132 | dual_client = True 133 | new_flag = True 134 | else: 135 | new_flag = False 136 | 137 | line = line.replace(",", "") 138 | tokens = re.findall("cnt=(\d+)", line) 139 | current_thpt_list.append(int(tokens[0])) 140 | thpt_list = current_thpt_list 141 | if dual_client: 142 | thpt_list = [thpt_list[i] + thpt_list[i + 1] for i in range(0, 2 * (sampling_start + sampling_seconds), 2)] 143 | thpt_list = thpt_list[(sampling_start) : (sampling_start + sampling_seconds)] 144 | return thpt_list 145 | 146 | def extract_latency(write_percent = 0, skewness_100 = 99, server_memory_gb = 32, 147 | ssd = "nvme4n1", mode = "pdp", nr_server_threads = 16, 148 | nr_client_threads = 64, sz_unit = "4k", 149 | sampling_start = 5, sampling_seconds = 60, label="p99", log_dir = "."): 150 | 151 | if mode == "pin": 152 | server_memory_gb = 0 153 | 154 | log_file = f"{log_dir}/write.{write_percent},skew.{skewness_100},pm.{server_memory_gb},ssd.{ssd},mode.{mode},sth.{nr_server_threads},cth.{nr_client_threads},unit.{sz_unit}.log" 155 | with open(log_file, "r") as f: 156 | result_list = [] 157 | current_result_list = [] 158 | 159 | new_flag = False 160 | dual_client = False 161 | while True: 162 | line = f.readline() 163 | if not line: 164 | break 165 | if "epoch" not in line: 166 | continue 167 | if "epoch 1 " in line: 168 | if not new_flag: 169 | result_list = current_result_list 170 | current_result_list = [] 171 | else: 172 | dual_client = True 173 | new_flag = True 174 | else: 175 | new_flag = False 176 | 177 | line = line.replace(",", "") 178 | tokens = re.findall(f"{label}=(\d+)", line) 179 | current_result_list.append(int(tokens[0])) 180 | result_list = current_result_list 181 | if dual_client: 182 | result_list = [(result_list[i] + result_list[i + 1]) / 2 for i in range(0, 2 * (sampling_start + sampling_seconds), 2)] 183 | result_list = result_list[(sampling_start) : (sampling_start + sampling_seconds)] 184 | result_list = [i for i in result_list if i < 120_000_000_000] 185 | return result_list 186 | 187 | 188 | def avg(l): 189 | v = sum(l) / len(l) 190 | return v 191 | 192 | def format_digits(n): 193 | return f"{n:,.2f}" 194 | 195 | def split_kwargs(kwargs): 196 | vector_kwargs = {} 197 | scalar_kwargs = {} 198 | for k, v in kwargs.items(): 199 | if type(v) == list: 200 | vector_kwargs[k] = v 201 | else: 202 | scalar_kwargs[k] = v 203 | return vector_kwargs, scalar_kwargs 204 | 205 | 206 | def run_batch(name : str, 
**kwargs): 207 | log_dir = f"./output/log/{name}" 208 | os.system(f"mkdir -p {log_dir}") 209 | 210 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs) 211 | 212 | from itertools import product 213 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values())) 214 | for idx, kv in enumerate(l): 215 | str = f"### Running ({idx + 1}/{len(l)}): {kv} ###" 216 | with open("./output/current-running.txt", "w") as f: 217 | f.write(f"{name}\n{str}\n") 218 | print(str) 219 | run_test(**kv, **scalar_kwargs, log_dir=log_dir) 220 | 221 | 222 | def output_batch(name : str, dynamic = False, ydata = [], 223 | **kwargs): 224 | log_dir = f"./output/log/{name}" 225 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs) 226 | from itertools import product 227 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values())) 228 | output = "" 229 | 230 | for idx, kv in enumerate(l): 231 | thpt_list = extract_throughput(**kv, **scalar_kwargs, dynamic=dynamic, log_dir=log_dir) 232 | str = f"({idx + 1}/{len(l)}) {kv}: throughput (ops/s)=" 233 | if dynamic: 234 | str += f"{thpt_list}" 235 | ydata.append(thpt_list) 236 | else: 237 | avg_v = avg(thpt_list) 238 | ydata.append(avg_v) 239 | str += f"{format_digits(avg_v)}" 240 | output += str + "\n" 241 | 242 | print(output) 243 | with open(f"./output/{name}.txt", "w") as f: 244 | f.write(output) 245 | 246 | 247 | def draw_figure(name : str, xlabel : str, xdata : list, ydata : list, legend : list[str]): 248 | import matplotlib.pyplot as plt 249 | xlen = len(xdata) 250 | if name == "figure-11": 251 | plt.plot(xdata, ydata[0], label = legend[0]) 252 | plt.plot(xdata, ydata[1], label = legend[1]) 253 | plt.plot(xdata, ydata[2], label = legend[2]) 254 | plt.plot(xdata, ydata[3], label = legend[3]) 255 | else: 256 | for idx, label in enumerate(legend): 257 | match label: 258 | case "PIN": 259 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-D", mfc="w", label = "PIN") 260 | case "ODP": 261 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-^", mfc="w", label = "ODP") 262 | case "RPC": 263 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-x", mfc="w", label = "RPC") 264 | case "TeRM": 265 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = "TeRM") 266 | case _: 267 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = label) 268 | plt.xlabel(xlabel) 269 | plt.ylabel("Throughput (ops/s)") 270 | plt.legend() 271 | print(f"written {name}.png") 272 | plt.savefig(f"./output/{name}.png") 273 | 274 | class Experiment: 275 | def __init__(self, name : str, 276 | xlabel: str, xdata: list, legend: list[str], 277 | **kwargs): 278 | self.name = name 279 | self.xlabel = xlabel 280 | self.xdata = xdata 281 | self.ydata = [] 282 | self.legend = legend 283 | self.kwargs = kwargs 284 | 285 | def run(self): 286 | run_batch(name=self.name, **self.kwargs) 287 | 288 | def output(self): 289 | output_batch(name=self.name, ydata=self.ydata, **self.kwargs) 290 | draw_figure(name=self.name, xlabel=self.xlabel, xdata=self.xdata, ydata=self.ydata, legend=self.legend) 291 | -------------------------------------------------------------------------------- /ae/scripts/xstore_eval.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | 4 | def run_test(mode : str = "pdp", zipf_theta = 0, log_dir : str = "."): 5 | server_memory_gb = 0 if mode == "pin" else 16 6 | 7 | os.system(f"mkdir -p {log_dir}") 8 | log_file = 
f"{log_dir}/mode.{mode},zipf_theta.{zipf_theta}.log" 9 | 10 | test_file_content = f""" 11 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh" 12 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches; 13 | echo 100 > /proc/sys/vm/dirty_ratio; 14 | echo 0 > /proc/sys/kernel/numa_balancing; 15 | systemctl start opensmd; 16 | export LD_PRELOAD="/home/yz/workspace/TeRM/ae/bin/libpdp.so"; 17 | export LD_LIBRARY_PATH="/home/yz/workspace/TeRM/ae/bin/xstore"; 18 | export PDP_server_rpc_threads=16 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb={server_memory_gb}; 19 | export PDP_mode={mode};\"\"\" 20 | suffix_cmd = "--need_hash=false --cache_sz_m=327680 --server_host=node184 --total_accts=100'000'000 --eval_type=sc --workloads=ycsbc --concurrency=1 --use_master=true --uniform=false --running_time=60 --zipf_theta={zipf_theta} \ 21 | --undefok=concurrency,workloads,eval_type,total_accts,server_host,cache_sz_m,need_hash,use_master,uniform,running_time,zipf_theta" 22 | 23 | ## server process 24 | [[pass]] 25 | name = "s0" 26 | host = "node184" 27 | path = "/home/yz/workspace/TeRM/ae/bin/xstore" 28 | cmd = \"\"\"../../scripts/ins-pch.sh; 29 | systemctl restart openibd; 30 | ../../scripts/limit-mem.sh; 31 | export PDP_is_server=1; 32 | ./fserver --threads=16 --db_type=ycsb --id=0 --ycsb_num=100'000'000 --no_train=false --step=2 --model_config=./ycsb-model.toml --enable_odp=true\"\"\" 33 | 34 | ## clients 35 | [[pass]] 36 | name = "c0" 37 | host = "node166" 38 | path = "/home/yz/workspace/TeRM/ae/bin/xstore" 39 | cmd = "./ycsb --threads=32 --id=1" 40 | 41 | [[pass]] 42 | name = "c1" 43 | host = "node168" 44 | path = "/home/yz/workspace/TeRM/ae/bin/xstore" 45 | cmd = "./ycsb --threads=32 --id=2" 46 | 47 | # master process to collect results 48 | [[pass]] 49 | name = "m0" 50 | host = "node181" 51 | path = "/home/yz/workspace/TeRM/ae/bin/xstore" 52 | cmd = "sleep 5; ./master --client_config=./cs.toml --epoch=90 --nclients=2" 53 | """ 54 | with open(f"{log_dir}/test.toml", "w") as f: 55 | f.write(test_file_content) 56 | 57 | cmd = f""" 58 | date | tee -a {log_file}; 59 | /home/yz/workspace/TeRM/ae/scripts/bootstrap.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file} 60 | """ 61 | os.system(cmd) 62 | 63 | def extract_result(mode : str = "pdp", zipf_theta = 0, log_dir : str = "."): 64 | log_file = f"{log_dir}/mode.{mode},zipf_theta.{zipf_theta}.log" 65 | with open(log_file, "r") as f: 66 | result_list = [] 67 | current_result_list = [] 68 | 69 | while True: 70 | line = f.readline() 71 | if not line: 72 | break 73 | if "at epoch" not in line: 74 | continue 75 | if "epoch 0 " in line: 76 | result_list = current_result_list 77 | current_result_list = [] 78 | 79 | line = line.replace(",", "") 80 | tokens = re.findall(f"thpt: (\d+\.\d+)", line) 81 | current_result_list.append(float(tokens[0])) 82 | result_list = current_result_list 83 | avg_v = avg(result_list) 84 | 85 | return avg_v 86 | 87 | def avg(l : list): 88 | return sum(l) / len(l) 89 | 90 | def format_digits(n): 91 | return f"{n:,.2f}" 92 | 93 | def split_kwargs(kwargs): 94 | vector_kwargs = {} 95 | scalar_kwargs = {} 96 | for k, v in kwargs.items(): 97 | if type(v) == list: 98 | vector_kwargs[k] = v 99 | else: 100 | scalar_kwargs[k] = v 101 | return vector_kwargs, scalar_kwargs 102 | 103 | 104 | def run_batch(name : str, **kwargs): 105 | log_dir = f"./output/log/{name}" 106 | os.system(f"mkdir -p {log_dir}") 107 | 108 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs) 109 | 110 | from itertools import product 111 | l = 
list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values())) 112 | for idx, kv in enumerate(l): 113 | str = f"### Running ({idx + 1}/{len(l)}): {kv} ###" 114 | with open("./output/current-running.txt", "w") as f: 115 | f.write(f"{name}\n{str}\n") 116 | print(str) 117 | run_test(**kv, **scalar_kwargs, log_dir=log_dir) 118 | 119 | 120 | def output_batch(name : str, ydata = [], **kwargs): 121 | log_dir = f"./output/log/{name}" 122 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs) 123 | from itertools import product 124 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values())) 125 | output = "" 126 | 127 | for idx, kv in enumerate(l): 128 | result = extract_result(**kv, **scalar_kwargs, log_dir=log_dir) 129 | str = f"({idx + 1}/{len(l)}) {kv}: throughput (ops/s)=" 130 | ydata.append(result) 131 | str += f"{format_digits(result)}" 132 | output += str + "\n" 133 | 134 | print(output) 135 | with open(f"./output/{name}.txt", "w") as f: 136 | f.write(output) 137 | 138 | 139 | def draw_figure(name : str, xlabel : str, xdata : list, ydata : list, legend : list[str]): 140 | import matplotlib.pyplot as plt 141 | xlen = len(xdata) 142 | 143 | for idx, label in enumerate(legend): 144 | match label: 145 | case "PIN": 146 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-D", mfc="w", label = "PIN") 147 | case "ODP": 148 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-^", mfc="w", label = "ODP") 149 | case "RPC": 150 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-x", mfc="w", label = "RPC") 151 | case "TeRM": 152 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = "TeRM") 153 | case _: 154 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = label) 155 | 156 | plt.xlabel(xlabel) 157 | plt.ylabel("Throughput (ops/s)") 158 | plt.legend() 159 | print(f"written {name}.png") 160 | plt.savefig(f"./output/{name}.png") 161 | 162 | class Experiment: 163 | def __init__(self, name : str, 164 | xlabel: str, xdata: list, legend: list[str], 165 | **kwargs): 166 | self.name = name 167 | self.xlabel = xlabel 168 | self.xdata = xdata 169 | self.ydata = [] 170 | self.legend = legend 171 | self.kwargs = kwargs 172 | 173 | def run(self): 174 | run_batch(name=self.name, **self.kwargs) 175 | 176 | def output(self): 177 | output_batch(name=self.name, ydata=self.ydata, **self.kwargs) 178 | draw_figure(name=self.name, xlabel=self.xlabel, xdata=self.xdata, ydata=self.ydata, legend=self.legend) 179 | -------------------------------------------------------------------------------- /app/octopus.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/app/octopus.zip -------------------------------------------------------------------------------- /app/xstore.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/app/xstore.zip -------------------------------------------------------------------------------- /driver/driver.patch: -------------------------------------------------------------------------------- 1 | diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c 2 | index aead24c..a6f08f2 100644 3 | --- a/drivers/infiniband/core/umem_odp.c 4 | +++ b/drivers/infiniband/core/umem_odp.c 5 | @@ -49,6 +49,89 @@ 6 | 7 | #include 
"uverbs.h" 8 | 9 | +#ifdef PDP_MR 10 | +void set_or_clear_bitmap_reserved(void *addr, uint32_t bits, bool set) 11 | +{ 12 | + size_t bitmap_bytes = BITS_TO_LONGS(bits) * sizeof(long); 13 | + size_t i; 14 | + for (i = 0; i < bitmap_bytes; i += PAGE_SIZE) { 15 | + struct page *page = vmalloc_to_page(addr + i); 16 | + set ? SetPageReserved(page) : ClearPageReserved(page); 17 | + } 18 | +} 19 | + 20 | +static void *my_bitmap_alloc(unsigned int bits) 21 | +{ 22 | + size_t bytes = BITS_TO_LONGS(bits) * sizeof(long); 23 | + return vzalloc(round_up(bytes, PAGE_SIZE)); 24 | +} 25 | + 26 | +static void my_bitmap_free(void *p) 27 | +{ 28 | + vfree(p); 29 | +} 30 | + 31 | +static int init_pdp(struct ib_umem_odp *umem_odp, size_t npfns) 32 | +{ 33 | + char *magic_page_addr; 34 | + 35 | + umem_odp->nr_pfns = npfns; 36 | + 37 | + umem_odp->present_bitmap = my_bitmap_alloc(npfns); 38 | + // pr_info("umem_odp->present_bitmap=%px", umem_odp->present_bitmap); 39 | + if (!umem_odp->present_bitmap) { 40 | + return -ENOMEM; 41 | + } 42 | + if (!PAGE_ALIGNED(umem_odp->present_bitmap)) { 43 | + pr_err("pdp bitmap is not page-aligned!\n"); 44 | + my_bitmap_free(umem_odp->present_bitmap); 45 | + return -ENOMEM; 46 | + } 47 | + // a normal ODP does not need the magic page. 48 | + if (!umem_odp->is_pdp) 49 | + return 0; 50 | + if (umem_odp->page_shift != PAGE_SHIFT) { 51 | + pr_err("cannot support page_shift(%u) != PAGE_SHIFT(%u)!\n", umem_odp->page_shift, PAGE_SHIFT); 52 | + return -EFAULT; 53 | + } 54 | + umem_odp->magic_page = alloc_page(GFP_KERNEL); 55 | + if (!umem_odp->magic_page) { 56 | + my_bitmap_free(umem_odp->present_bitmap); 57 | + return -ENOMEM; 58 | + } 59 | + 60 | + magic_page_addr = kmap(umem_odp->magic_page); 61 | + memset(magic_page_addr, 0xfd, PAGE_SIZE); 62 | + kunmap(umem_odp->magic_page); 63 | + 64 | + umem_odp->magic_page_dma = ib_dma_map_page(umem_odp->umem.ibdev, umem_odp->magic_page, 65 | + 0, PAGE_SIZE, DMA_BIDIRECTIONAL) | ODP_READ_ALLOWED_BIT; 66 | + if (ib_dma_mapping_error(umem_odp->umem.ibdev, umem_odp->magic_page_dma)) { 67 | + my_bitmap_free(umem_odp->present_bitmap); 68 | + __free_page(umem_odp->magic_page); 69 | + return -EFAULT; 70 | + } 71 | + 72 | + set_or_clear_bitmap_reserved(umem_odp->present_bitmap, npfns, true); 73 | + 74 | + return 0; 75 | +} 76 | + 77 | +static void exit_pdp(struct ib_umem_odp *umem_odp) 78 | +{ 79 | + set_or_clear_bitmap_reserved(umem_odp->present_bitmap, umem_odp->nr_pfns, false); 80 | + 81 | + my_bitmap_free(umem_odp->present_bitmap); 82 | + // a normal ODP does not need the magic page. 
83 | + if (!umem_odp->is_pdp) { 84 | + return; 85 | + } 86 | + // we mask ODP_DMA_ADDR_MASK here, to ignore ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT 87 | + ib_dma_unmap_page(umem_odp->umem.ibdev, umem_odp->magic_page_dma & ODP_DMA_ADDR_MASK, PAGE_SIZE, DMA_BIDIRECTIONAL); 88 | + __free_page(umem_odp->magic_page); 89 | +} 90 | +#endif 91 | + 92 | static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp, 93 | const struct mmu_interval_notifier_ops *ops) 94 | { 95 | @@ -94,10 +177,20 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp, 96 | start, end - start, ops); 97 | if (ret) 98 | goto out_dma_list; 99 | + 100 | +#ifdef PDP_MR 101 | + ret = init_pdp(umem_odp, npfns); 102 | + if (ret) 103 | + goto out_mmu_notifier; 104 | +#endif 105 | } 106 | 107 | return 0; 108 | 109 | +#ifdef PDP_MR 110 | +out_mmu_notifier: 111 | + mmu_interval_notifier_remove(&umem_odp->notifier); 112 | +#endif 113 | out_dma_list: 114 | kvfree(umem_odp->dma_list); 115 | out_pfn_list: 116 | @@ -236,6 +329,12 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device, 117 | if (!umem_odp) 118 | return ERR_PTR(-ENOMEM); 119 | 120 | +#ifdef PDP_MR 121 | + umem_odp->is_pdp = (access & PDP_ACCESS_PDP); 122 | + pr_info("[%s]: addr=0x%lx, size=0x%lx, mode=%s\n", 123 | + __func__, addr, size, umem_odp->is_pdp ? "PDP" : "ODP"); 124 | +#endif 125 | + 126 | umem_odp->umem.ibdev = device; 127 | umem_odp->umem.length = size; 128 | umem_odp->umem.address = addr; 129 | @@ -278,6 +377,13 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp) 130 | mmu_interval_notifier_remove(&umem_odp->notifier); 131 | kvfree(umem_odp->dma_list); 132 | kvfree(umem_odp->pfn_list); 133 | +#ifdef PDP_MR 134 | + if (umem_odp->pde) { 135 | + proc_remove(umem_odp->pde); 136 | + umem_odp->pde = NULL; 137 | + } 138 | + exit_pdp(umem_odp); 139 | +#endif 140 | } 141 | put_pid(umem_odp->tgid); 142 | kfree(umem_odp); 143 | @@ -304,6 +410,19 @@ static int ib_umem_odp_map_dma_single_page( 144 | struct ib_device *dev = umem_odp->umem.ibdev; 145 | dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index]; 146 | 147 | +#ifdef PDP_MR 148 | + // if (access_mask & ODP_WRITE_ALLOWED_BIT) 149 | + set_bit(dma_index, umem_odp->present_bitmap); 150 | + if (!umem_odp->is_pdp && *dma_addr) { 151 | + /* 152 | + * If the page is already dma mapped it means it went through 153 | + * a non-invalidating trasition, like read-only to writable. 154 | + * Resync the flags. 155 | + */ 156 | + *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask; 157 | + return 0; 158 | + } 159 | +#else 160 | if (*dma_addr) { 161 | /* 162 | * If the page is already dma mapped it means it went through 163 | @@ -313,7 +432,10 @@ static int ib_umem_odp_map_dma_single_page( 164 | *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask; 165 | return 0; 166 | } 167 | +#endif 168 | 169 | +// yz: dma_addr might be magic! 170 | +// we simply map the page again. 
171 | *dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift, 172 | DMA_BIDIRECTIONAL); 173 | if (ib_dma_mapping_error(dev, *dma_addr)) { 174 | @@ -390,7 +512,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt, 175 | } 176 | 177 | range.hmm_pfns = &(umem_odp->pfn_list[pfn_start_idx]); 178 | - timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); 179 | + timeout = jiffies + msecs_to_jiffies(10000); // HMM_RANGE_DEFAULT_TIMEOUT 1000 180 | 181 | retry: 182 | current_seq = range.notifier_seq = 183 | @@ -426,7 +548,17 @@ retry: 184 | WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)); 185 | } else { 186 | if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) { 187 | +#ifdef PDP_MR 188 | + if (umem_odp->is_pdp) { 189 | + dma_addr_t *dma = &umem_odp->dma_list[dma_index]; 190 | + WARN_ON(*dma && *dma != umem_odp->magic_page_dma); 191 | + *dma = umem_odp->magic_page_dma; // ODP_READ_ALLOWED_BIT has been set. 192 | + } else { 193 | + WARN_ON(umem_odp->dma_list[dma_index]); 194 | + } 195 | +#else 196 | WARN_ON(umem_odp->dma_list[dma_index]); 197 | +#endif 198 | continue; 199 | } 200 | access_mask = ODP_READ_ALLOWED_BIT; 201 | @@ -487,6 +619,11 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt, 202 | idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift; 203 | dma = umem_odp->dma_list[idx]; 204 | 205 | +#ifdef PDP_MR 206 | + clear_bit(idx, umem_odp->present_bitmap); 207 | + if (umem_odp->is_pdp && is_magic_page_dma(umem_odp, dma)) continue; 208 | +#endif 209 | + 210 | /* The access flags guaranteed a valid DMA address in case was NULL */ 211 | if (dma) { 212 | unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT; 213 | @@ -509,7 +646,15 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt, 214 | */ 215 | set_page_dirty(head_page); 216 | } 217 | +#ifdef PDP_MR 218 | + if (umem_odp->is_pdp) { 219 | + umem_odp->dma_list[idx] = umem_odp->magic_page_dma; 220 | + } else { 221 | + umem_odp->dma_list[idx] = 0; 222 | + } 223 | +#else 224 | umem_odp->dma_list[idx] = 0; 225 | +#endif 226 | umem_odp->npages--; 227 | } 228 | } 229 | diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c 230 | index 368f3c4..4fdb739 100644 231 | --- a/drivers/infiniband/hw/mlx5/mr.c 232 | +++ b/drivers/infiniband/hw/mlx5/mr.c 233 | @@ -1442,6 +1442,69 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem, 234 | return &mr->ibmr; 235 | } 236 | 237 | +#ifdef PDP_MR 238 | +static vm_fault_t proc_fault(struct vm_fault *vmf) 239 | +{ 240 | + struct ib_umem_odp *odp = vmf->vma->vm_private_data; 241 | + void *kaddr = (void *)odp->present_bitmap + vmf->pgoff * PAGE_SIZE; 242 | + struct page *page = vmalloc_to_page(kaddr); 243 | + 244 | + get_page(page); 245 | + vmf->page = page; 246 | + 247 | + return 0; 248 | +} 249 | + 250 | +static struct vm_operations_struct proc_vm_ops = { 251 | + .fault = proc_fault 252 | +}; 253 | +static int proc_mmap(struct file *file, struct vm_area_struct *vma) 254 | +{ 255 | + struct ib_umem_odp *odp = pde_data(file_inode(file)); 256 | + 257 | + if (!odp->present_bitmap) { 258 | + pr_err("odp->present_bitmap=NULL\n"); 259 | + return -EINVAL; 260 | + } 261 | + pr_info("[odp mr bitmap]: mmap from virt=0x%px\n", odp->present_bitmap); 262 | + 263 | + vma->vm_private_data = odp; 264 | + vma->vm_ops = &proc_vm_ops; 265 | + 266 | + return 0; 267 | +} 268 | + 269 | +static ssize_t proc_read(struct file *file, char __user *buf, size_t count, 
loff_t *pos) 270 | +{ 271 | + struct ib_umem_odp *odp = pde_data(file_inode(file)); 272 | + if (copy_to_user(buf, &odp->nr_pfns, sizeof(odp->nr_pfns))) { 273 | + return -EINVAL; 274 | + } 275 | + 276 | + return sizeof(odp->nr_pfns); 277 | +} 278 | + 279 | +static const struct proc_ops proc_ops = { 280 | + .proc_mmap = proc_mmap, 281 | + .proc_read = proc_read 282 | +}; 283 | + 284 | +static int proc_init(u32 key, struct ib_umem_odp *odp) 285 | +{ 286 | + struct proc_dir_entry *entry; 287 | + char name[256]; 288 | + 289 | + sprintf(name, "pdp_bitmap_0x%x", key); 290 | + 291 | + entry = proc_create_data(name, S_IFREG | S_IRUGO, NULL, &proc_ops, odp); 292 | + if (!entry) { 293 | + return -ENOMEM; 294 | + } 295 | + odp->pde = entry; 296 | + 297 | + return 0; 298 | +} 299 | +#endif 300 | static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, 301 | u64 iova, int access_flags, 302 | struct ib_udata *udata) 303 | @@ -1493,6 +1556,13 @@ static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, 304 | err = mlx5_ib_init_odp_mr(mr); 305 | if (err) 306 | goto err_dereg_mr; 307 | + 308 | +#ifdef PDP_MR 309 | + err = proc_init(mr->mmkey.key, odp); 310 | + if (err) 311 | + goto err_dereg_mr; 312 | +#endif 313 | + 314 | return &mr->ibmr; 315 | 316 | err_dereg_mr: 317 | diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c 318 | index e3a3e7a..5274801 100644 319 | --- a/drivers/infiniband/hw/mlx5/odp.c 320 | +++ b/drivers/infiniband/hw/mlx5/odp.c 321 | @@ -36,6 +36,9 @@ 322 | #include 323 | #include 324 | 325 | +// yz fix bug 326 | +#include 327 | + 328 | #include "mlx5_ib.h" 329 | #include "cmd.h" 330 | #include "qp.h" 331 | @@ -167,6 +170,15 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries, 332 | 333 | for (i = 0; i < nentries; i++) { 334 | pa = odp->dma_list[idx + i]; 335 | +#ifdef PDP_MR 336 | + if (odp->is_pdp) { 337 | + WARN_ON(pa == 0); // with PDP, pa should never be 0! 338 | + if ((pa & ODP_READ_ALLOWED_BIT) == 0) { 339 | + // read is not allowed => magic dma 340 | + pa = odp->magic_page_dma; 341 | + } 342 | + } 343 | +#endif 344 | pas[i] = cpu_to_be64(umem_dma_to_mtt(pa)); 345 | } 346 | } 347 | @@ -269,12 +281,26 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni, 348 | * estimate the cost of another UMR vs. the cost of bigger 349 | * UMR. 350 | */ 351 | + 352 | + 353 | +#ifdef PDP_MR 354 | + if ((umem_odp->dma_list[idx] & (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) && 355 | + !(umem_odp->is_pdp && is_magic_page_dma(umem_odp, umem_odp->dma_list[idx]))) { 356 | +#else 357 | if (umem_odp->dma_list[idx] & 358 | (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) { 359 | +#endif 360 | if (!in_block) { 361 | blk_start_idx = idx; 362 | in_block = 1; 363 | } 364 | +#ifdef PDP_MR 365 | + if (umem_odp->is_pdp) { 366 | + umem_odp->dma_list[idx] &= (~ODP_READ_ALLOWED_BIT); 367 | + // clear the READ bit for magic_dma! READ is always set for a normal address 368 | + // we leave the dma_addr intact for correct unmap. 369 | + } 370 | +#endif 371 | 372 | /* Count page invalidations */ 373 | invalidations += idx - blk_start_idx + 1; 374 | @@ -284,7 +310,12 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni, 375 | if (in_block && umr_offset == 0) { 376 | mlx5_ib_update_xlt(mr, blk_start_idx, 377 | idx - blk_start_idx, 0, 378 | +#ifdef PDP_MR 379 | + // set instead of clear 380 | + (umem_odp->is_pdp ? 
0 : MLX5_IB_UPD_XLT_ZAP) | 381 | +#else 382 | MLX5_IB_UPD_XLT_ZAP | 383 | +#endif 384 | MLX5_IB_UPD_XLT_ATOMIC); 385 | in_block = 0; 386 | } 387 | @@ -293,7 +324,12 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni, 388 | if (in_block) 389 | mlx5_ib_update_xlt(mr, blk_start_idx, 390 | idx - blk_start_idx + 1, 0, 391 | +#ifdef PDP_MR 392 | + // set instead of clear 393 | + (umem_odp->is_pdp ? 0 : MLX5_IB_UPD_XLT_ZAP) | 394 | +#else 395 | MLX5_IB_UPD_XLT_ZAP | 396 | +#endif 397 | MLX5_IB_UPD_XLT_ATOMIC); 398 | 399 | mlx5_update_odp_stats(mr, invalidations, invalidations); 400 | @@ -568,17 +604,37 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp, 401 | 402 | if (odp->umem.writable && !downgrade) 403 | access_mask |= ODP_WRITE_ALLOWED_BIT; 404 | + 405 | + // yz fix: there is a bug here! 406 | + { 407 | + struct task_struct *owning_process = get_pid_task(odp->tgid, PIDTYPE_PID); 408 | + struct mm_struct *owning_mm = odp->umem.owning_mm; 409 | + 410 | + if (!owning_process) { 411 | + return -EINVAL; 412 | + } 413 | + if (!mmget_not_zero(owning_mm)) { 414 | + put_task_struct(owning_process); 415 | + return -EINVAL; 416 | + } 417 | 418 | - np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault); 419 | - if (np < 0) 420 | - return np; 421 | + np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault); 422 | + if (np < 0) { 423 | + mmput(owning_mm); 424 | + put_task_struct(owning_process); 425 | + return np; 426 | + } 427 | 428 | - /* 429 | - * No need to check whether the MTTs really belong to this MR, since 430 | - * ib_umem_odp_map_dma_and_lock already checks this. 431 | - */ 432 | - ret = mlx5_ib_update_xlt(mr, start_idx, np, page_shift, xlt_flags); 433 | - mutex_unlock(&odp->umem_mutex); 434 | + /* 435 | + * No need to check whether the MTTs really belong to this MR, since 436 | + * ib_umem_odp_map_dma_and_lock already checks this. 437 | + */ 438 | + ret = mlx5_ib_update_xlt(mr, start_idx, np, page_shift, xlt_flags); 439 | + mutex_unlock(&odp->umem_mutex); 440 | + 441 | + mmput(owning_mm); 442 | + put_task_struct(owning_process); 443 | + } 444 | 445 | if (ret < 0) { 446 | if (ret != -EAGAIN) 447 | diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h 448 | index 0844c1d..15aaadd 100644 449 | --- a/include/rdma/ib_umem_odp.h 450 | +++ b/include/rdma/ib_umem_odp.h 451 | @@ -9,6 +9,12 @@ 452 | #include 453 | #include 454 | 455 | +#define PDP_MR 456 | +#ifdef PDP_MR 457 | +#include 458 | +#define PDP_ACCESS_PDP (IB_ACCESS_RELAXED_ORDERING << 1) // caution if a future version uses this bit! 459 | +#endif 460 | + 461 | struct ib_umem_odp { 462 | struct ib_umem umem; 463 | struct mmu_interval_notifier notifier; 464 | @@ -42,6 +48,15 @@ struct ib_umem_odp { 465 | bool is_implicit_odp; 466 | 467 | unsigned int page_shift; 468 | + 469 | +#ifdef PDP_MR 470 | + bool is_pdp; // ODP / PDP 471 | + struct page *magic_page; 472 | + dma_addr_t magic_page_dma; // ODP_READ_ALLOWED_BIT is set! 
473 | + unsigned long *present_bitmap; // 1: mapped in the RNIC page table, 0: unmapped 474 | + struct proc_dir_entry *pde; 475 | + u32 nr_pfns; 476 | +#endif 477 | }; 478 | 479 | static inline struct ib_umem_odp *to_ib_umem_odp(struct ib_umem *umem) 480 | @@ -80,6 +95,16 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp) 481 | 482 | #define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) 483 | 484 | +#ifdef PDP_MR 485 | +static inline bool is_magic_page_dma(struct ib_umem_odp *umem_odp, dma_addr_t addr) 486 | +{ 487 | + WARN_ON(!umem_odp->is_pdp); // a normal ODP should never reach here. 488 | + 489 | + // we AND ODP_DMA_ADDR_MASK to ignore the lower two bits. 490 | + return (umem_odp->magic_page_dma & ODP_DMA_ADDR_MASK) == (addr & ODP_DMA_ADDR_MASK); 491 | +} 492 | +#endif 493 | + 494 | #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING 495 | 496 | struct ib_umem_odp * 497 | -------------------------------------------------------------------------------- /driver/mlnx-ofed-kernel-5.8-2.0.3.0.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/driver/mlnx-ofed-kernel-5.8-2.0.3.0.zip -------------------------------------------------------------------------------- /libterm/.gitignore: -------------------------------------------------------------------------------- 1 | # editor and compiler 2 | .cache 3 | .idea 4 | .vscode 5 | build 6 | cmake-build-debug 7 | __pycache__ 8 | 9 | # linux kernel 10 | *.cmd 11 | Module.symvers 12 | modules.order 13 | *.ko 14 | *.mod 15 | *.mod.c 16 | *.mod.o 17 | *.o 18 | .tmp* 19 | 20 | # log 21 | *.log 22 | ## ibdump 23 | *.pcap 24 | ## perf 25 | *.data 26 | -------------------------------------------------------------------------------- /libterm/CMakeLists.txt: -------------------------------------------------------------------------------- 1 | cmake_minimum_required(VERSION 3.10) 2 | project(nic-odp) 3 | 4 | set(CMAKE_CXX_STANDARD 20) 5 | set(CMAKE_EXPORT_COMPILE_COMMANDS ON) 6 | add_compile_definitions(RECORDER_VERBOSE=0 RECORDER_SINGLE_THREAD=0) 7 | 8 | option (SANITIZE "Turn on sanitization" OFF) 9 | if (SANITIZE) 10 | add_compile_options(-fsanitize=address) 11 | endif() 12 | 13 | # include 14 | include_directories(${PROJECT_SOURCE_DIR}/3rd-party/include) 15 | include_directories(${PROJECT_SOURCE_DIR}/include) 16 | 17 | # verbs-pdp 18 | file(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/ibverbs-pdp/*.cc) 19 | add_library(pdp SHARED ${SRCS}) 20 | find_package(fmt) 21 | target_link_libraries(pdp fmt::fmt-header-only ibverbs memcached uring glog boost_coroutine aio) 22 | 23 | # test 24 | file(GLOB SRCS ${PROJECT_SOURCE_DIR}/test/*.cc) 25 | foreach(SRC ${SRCS}) 26 | get_filename_component(SRC_NAME ${SRC} NAME_WE) 27 | add_executable(${SRC_NAME} ${SRC}) 28 | target_link_libraries(${SRC_NAME} fmt::fmt-header-only ibverbs gflags glog memcached) 29 | endforeach() 30 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/config.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | #ifndef RPC_SERVER_POLL_WITH_EVENT 4 | #define RPC_SERVER_POLL_WITH_EVENT 0 5 | #endif 6 | 7 | #ifndef UNIT_SIZE 8 | #define UNIT_SIZE SZ_1M 9 | #endif 10 | 11 | #ifndef MAX_RDMA_LENGTH 12 | #define MAX_RDMA_LENGTH (64ul * SZ_1K) 13 | #endif 14 | 15 | #ifndef WRITE_INLINE_SIZE 16 | #define WRITE_INLINE_SIZE SZ_4K 17 | #endif 18 | 19 
| //#define PAGE_BITMAP_PULL 1 // pull periodically 20 | #define PAGE_BITMAP_UPDATE 0 // incrementally in each request, not pull 21 | #define PAGE_BITMAP_HINT_READ 1 22 | #define MAX_NR_COROUTINES 64 23 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/ctx.hh: -------------------------------------------------------------------------------- 1 | #include "global.hh" 2 | #include 3 | 4 | namespace pdp { 5 | struct ContextBytes { 6 | ibv_gid gid; 7 | uint16_t lid; 8 | ContextBytes(ibv_gid gid, uint16_t lid) : gid(gid), lid(lid) {} 9 | explicit ContextBytes(const std::vector &vec_char) 10 | { 11 | memcpy(this, vec_char.data(), sizeof(ContextBytes)); 12 | } 13 | 14 | // for memcached KV 15 | std::string key() const 16 | { 17 | auto id = rdma::Context::RemoteIdentifier(gid, lid); 18 | return id.to_string(); 19 | } 20 | 21 | auto to_vec_char() const 22 | { 23 | return util::to_vec_char(*this); 24 | } 25 | }; 26 | 27 | class RemoteContext { 28 | public: 29 | using Identifier = rdma::Context::RemoteIdentifier; 30 | 31 | RemoteContext() = default; 32 | explicit RemoteContext(const ContextBytes &bytes) : gid_(bytes.gid), lid_(bytes.lid) {} 33 | 34 | auto identifier() const 35 | { 36 | return Identifier{gid_, lid_}; 37 | } 38 | 39 | private: 40 | ibv_gid gid_; 41 | uint16_t lid_; 42 | }; 43 | 44 | class RemoteContextManager : public Manager {}; 45 | 46 | class LocalContext : public rdma::Context { 47 | public: 48 | using Identifier = util::CachelineAligned; 49 | auto identifier() const { 50 | return native_pd(); 51 | } 52 | public: 53 | LocalContext(ibv_context *native_ctx, ibv_pd *native_pd) 54 | : Context(native_ctx, native_pd) {} 55 | 56 | ContextBytes to_bytes() const 57 | { 58 | return ContextBytes(gid_, lid_); 59 | } 60 | }; 61 | 62 | class LocalContextManager : public Manager {}; 63 | } 64 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/global.cc: -------------------------------------------------------------------------------- 1 | #include "global.hh" 2 | #include "mr.hh" 3 | #include "qp.hh" 4 | 5 | namespace pdp { 6 | GlobalVariables::GlobalVariables() 7 | : sms {"10.0.2.181", 23333}, 8 | configs { 9 | .mode = Mode::_from_string_nocase(util::getenv_string("PDP_mode").value_or("PDP").c_str()), 10 | .rpc_mode = util::getenv_bool("PDP_rpc_mode").value_or(false), 11 | .is_server = util::getenv_bool("PDP_is_server").value_or(false), 12 | .server_mmap_dev = util::getenv_string("PDP_server_mmap_dev").value_or(""), 13 | .server_io_path = ServerIoPath::_from_string_nocase(util::getenv_string("PDP_server_io_path").value_or("tiering").c_str()), 14 | .server_memory_gb = util::getenv_int("PDP_server_memory_gb").value_or(0), 15 | .server_rpc_threads = util::getenv_int("PDP_server_rpc_threads").value_or(32), 16 | .server_page_state = util::getenv_bool("PDP_server_page_state").value_or(true), 17 | .read_magic_pattern = util::getenv_bool("PDP_read_magic_pattern").value_or(true), 18 | .promote_hotspot = util::getenv_bool("PDP_promote_hotspot").value_or(true), 19 | .promotion_window_ms = (uint32_t)util::getenv_int("PDP_promotion_window_ms").value_or(1000), 20 | .pull_page_bitmap = util::getenv_bool("PDP_pull_page_bitmap").value_or(true), 21 | .advise_local_mr = util::getenv_bool("PDP_advise_local_mr").value_or(false), 22 | .predict_remote_mr = util::getenv_bool("PDP_predict_remote_mr").value_or(false) 23 | } 24 | { 25 | // modify 26 | switch (configs.mode) { 27 | case Mode::PIN: 28 | case Mode::ODP: { 29 
| configs.rpc_mode = false; 30 | configs.read_magic_pattern = false; 31 | configs.promote_hotspot = false; 32 | break; 33 | } 34 | case Mode::PDP: { 35 | configs.server_io_path = ServerIoPath::TIERING; 36 | configs.mode = Mode::PDP; 37 | configs.rpc_mode = false; 38 | configs.read_magic_pattern = true; 39 | configs.promote_hotspot = true; 40 | break; 41 | } 42 | case Mode::RPC: { 43 | configs.mode = Mode::RPC_MEMCPY; 44 | // fallthrough here 45 | } 46 | case Mode::RPC_MEMCPY: { 47 | configs.server_io_path = ServerIoPath::MEMCPY; 48 | configs.mode = Mode::PDP; 49 | configs.rpc_mode = true; 50 | configs.read_magic_pattern = true; 51 | configs.promote_hotspot = false; 52 | break; 53 | } 54 | case Mode::RPC_BUFFER: { 55 | configs.server_io_path = ServerIoPath::BUFFER; 56 | configs.mode = Mode::PDP; 57 | configs.rpc_mode = true; 58 | configs.read_magic_pattern = true; 59 | configs.promote_hotspot = false; 60 | break; 61 | } 62 | case Mode::RPC_DIRECT: { 63 | configs.server_io_path = ServerIoPath::DIRECT; 64 | configs.mode = Mode::PDP; 65 | configs.rpc_mode = true; 66 | configs.read_magic_pattern = true; 67 | configs.promote_hotspot = false; 68 | break; 69 | } 70 | case Mode::RPC_TIERING: { 71 | configs.server_io_path = ServerIoPath::TIERING; 72 | configs.mode = Mode::PDP; 73 | configs.rpc_mode = true; 74 | configs.read_magic_pattern = true; 75 | configs.promote_hotspot = false; 76 | break; 77 | } 78 | case Mode::RPC_TIERING_PROMOTE: { 79 | configs.server_io_path = ServerIoPath::TIERING; 80 | configs.mode = Mode::PDP; 81 | configs.rpc_mode = true; 82 | configs.read_magic_pattern = true; 83 | configs.promote_hotspot = true; 84 | break; 85 | } 86 | case Mode::MAGIC_MEMCPY: { 87 | configs.server_io_path = ServerIoPath::MEMCPY; 88 | configs.mode = Mode::PDP; 89 | configs.rpc_mode = false; 90 | configs.read_magic_pattern = true; 91 | configs.promote_hotspot = false; 92 | break; 93 | } 94 | case Mode::MAGIC_BUFFER: { 95 | configs.server_io_path = ServerIoPath::BUFFER; 96 | configs.mode = Mode::PDP; 97 | configs.rpc_mode = false; 98 | configs.read_magic_pattern = true; 99 | configs.promote_hotspot = false; 100 | break; 101 | } 102 | case Mode::MAGIC_DIRECT: { 103 | configs.server_io_path = ServerIoPath::DIRECT; 104 | configs.mode = Mode::PDP; 105 | configs.rpc_mode = false; 106 | configs.read_magic_pattern = true; 107 | configs.promote_hotspot = false; 108 | break; 109 | } 110 | case Mode::MAGIC_TIERING: { 111 | configs.server_io_path = ServerIoPath::TIERING; 112 | configs.mode = Mode::PDP; 113 | configs.rpc_mode = false; 114 | configs.read_magic_pattern = true; 115 | configs.promote_hotspot = false; 116 | break; 117 | } 118 | case Mode::PDP_MEMCPY: { 119 | configs.server_io_path = ServerIoPath::MEMCPY; 120 | configs.mode = Mode::PDP; 121 | configs.rpc_mode = false; 122 | configs.read_magic_pattern = true; 123 | configs.promote_hotspot = true; 124 | } 125 | 126 | default: break; 127 | } 128 | 129 | // if (configs.mode == +Mode::PIN && configs.server_memory_gb) { 130 | // pr_err("ignore PDP_server_memory_gb in PIN mode!"); 131 | // configs.server_memory_gb = 0; 132 | // } 133 | if (configs.mode == +Mode::PIN) { 134 | ASSERT(configs.server_memory_gb == 0, "PDP_server_memory_gb should not be set in PIN mode."); 135 | } 136 | 137 | if (configs.mode == +Mode::PDP) { 138 | ASSERT(!configs.server_mmap_dev.empty(), "PDP_server_mmap_dev must be set!"); 139 | } 140 | 141 | if (configs.promote_hotspot) { 142 | pr_warn("auto set PDP_server_page_state for PDP_promote_hotspot"); 143 | configs.server_page_state = 
true; 144 | } 145 | 146 | if (configs.read_magic_pattern) { 147 | pr_warn("auto set PDP_pull_page_bitmap for PDP_read_magic_pattern"); 148 | configs.pull_page_bitmap = true; 149 | } 150 | 151 | 152 | // print 153 | fmt_pr(info, "memcached: {}", sms.to_string()); 154 | fmt_pr(info, "{}", configs.to_string()); 155 | 156 | ctx_mgr = std::make_unique(); 157 | local_mr_mgr = std::make_unique(); 158 | remote_mr_mgr = std::make_unique(); 159 | qp_mgr = std::make_unique(); 160 | } 161 | } 162 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/global.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | // my headers 4 | #include "config.hh" 5 | 6 | #include 7 | #include 8 | #include 9 | #include 10 | 11 | #include 12 | // system 13 | #include 14 | 15 | // c++ headers 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | 22 | #define PDP_ACCESS_PDP (IBV_ACCESS_RELAXED_ORDERING << 1) 23 | 24 | namespace pdp 25 | { 26 | using create_qp_t = struct ibv_qp *(*)(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr); 27 | using modify_qp_t = int(*)(struct ibv_qp *qp, struct ibv_qp_attr *attr, int attr_mask); 28 | using poll_cq_t = int(*)(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc); 29 | using post_send_t = int(*)(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr); 30 | using post_recv_t = int(*)(struct ibv_qp *qp, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr); 31 | using reg_mr_t = struct ibv_mr *(*)(struct ibv_pd *pd, void *addr, size_t length, int access); 32 | using reg_mr_iova2_t = struct ibv_mr *(*)(struct ibv_pd *pd, void *addr, size_t length, 33 | uint64_t iova, unsigned int access); 34 | 35 | class RcQp; 36 | } 37 | 38 | namespace pdp { 39 | template 40 | class Manager { 41 | public: 42 | using resource_type = Resource; 43 | protected: 44 | std::unordered_map> map_; 45 | ReaderFriendlyLock lock_; 46 | using Map = decltype(map_); 47 | 48 | virtual void do_in_reg(typename Map::iterator it) {} 49 | public: 50 | Manager() = default; 51 | virtual ~Manager() = default; 52 | 53 | // thread-safe register function protected by the write_single lock, in order to construct the value exactly once. 54 | // id: the key 55 | // args: to construct the value iff the key does not exist 56 | template 57 | util::ObserverPtr reg_with_construct_args(const typename Resource::Identifier &id, Args&&... 
args) { 58 | std::unique_lock w_lk(lock_); 59 | 60 | if (const auto &p = unlocked_get(id)) { 61 | return p; 62 | } 63 | 64 | auto p = std::make_unique(std::forward(args)...); 65 | auto [it, inserted] = map_.emplace(id, std::move(p)); 66 | ASSERT(inserted, "fail to insert"); 67 | 68 | do_in_reg(it); 69 | return util::make_observer(it->second.get()); 70 | } 71 | 72 | template 73 | util::ObserverPtr reg_with_construct_lambda(const typename Resource::Identifier &id, const Lambda &lmd) 74 | { 75 | std::unique_lock w_lk(lock_); 76 | 77 | if (auto p = unlocked_get(id)) { 78 | return p; 79 | } 80 | 81 | auto [it, inserted] = map_.emplace(id, lmd()); 82 | ASSERT(inserted, "fail to insert"); 83 | 84 | do_in_reg(it); 85 | return util::make_observer(it->second.get()); 86 | } 87 | 88 | // unlocked get 89 | util::ObserverPtr unlocked_get(const typename Resource::Identifier &id) 90 | { 91 | if (auto it = map_.find(id); it != map_.end()) { 92 | return util::make_observer(it->second.get()); 93 | } 94 | return util::ObserverPtr(nullptr); 95 | } 96 | 97 | util::ObserverPtr get(const typename Resource::Identifier &id) { 98 | std::shared_lock r_lk(lock_); 99 | return unlocked_get(id); 100 | } 101 | }; 102 | 103 | // class Resource { 104 | // public: 105 | // class Bytes {}; 106 | // class Local {}; 107 | // class Remote {}; 108 | // class Manager {}; 109 | // }; 110 | 111 | class ContextManager: public Manager { 112 | public: 113 | util::ObserverPtr get_or_reg_by_pd(ibv_pd *pd) 114 | { 115 | util::ObserverPtr mr = get(pd); 116 | if (!mr) { 117 | mr = reg_with_construct_args(pd, pd->context, pd); 118 | } 119 | return mr; 120 | } 121 | }; 122 | 123 | namespace PageState { 124 | using Val = u8; 125 | enum : Val { 126 | kUncached = 0, 127 | kCached = 1, 128 | kDirty = 2, 129 | }; 130 | } 131 | 132 | class LocalMrManager; 133 | class RemoteMrManager; 134 | class QpManager; 135 | 136 | BETTER_ENUM(ServerIoPath, int, TIERING, MEMCPY, BUFFER, DIRECT); 137 | BETTER_ENUM(Mode, int, 138 | PDP, ODP, PIN, 139 | RPC, RPC_MEMCPY, RPC_BUFFER, RPC_DIRECT, RPC_TIERING, 140 | RPC_TIERING_PROMOTE, 141 | MAGIC_MEMCPY, MAGIC_BUFFER, MAGIC_DIRECT, MAGIC_TIERING, 142 | PDP_MEMCPY); 143 | 144 | struct GlobalVariables { 145 | private: 146 | GlobalVariables(); 147 | public: 148 | rdma::SharedMetaService sms; 149 | 150 | // configs 151 | struct { 152 | Mode mode; 153 | bool rpc_mode; 154 | 155 | const bool is_server; 156 | const std::string server_mmap_dev; 157 | ServerIoPath server_io_path; 158 | 159 | int server_memory_gb; 160 | const int server_rpc_threads; 161 | bool server_page_state; 162 | 163 | bool read_magic_pattern; 164 | bool promote_hotspot; 165 | const uint32_t promotion_window_ms; 166 | 167 | bool pull_page_bitmap; 168 | bool advise_local_mr; 169 | bool predict_remote_mr; 170 | 171 | std::string to_string() const { 172 | std::string out = "=== configs begin ===\n"; 173 | fmt::format_to(std::back_inserter(out), "mode: {}\n", mode._to_string()); 174 | fmt::format_to(std::back_inserter(out), "rpc_mode: {}\n", rpc_mode); 175 | fmt::format_to(std::back_inserter(out), "is_server: {}\n", is_server); 176 | fmt::format_to(std::back_inserter(out), "server_mmap_dev: {}\n", server_mmap_dev); 177 | fmt::format_to(std::back_inserter(out), "server_io_path: {}\n", server_io_path._to_string()); 178 | fmt::format_to(std::back_inserter(out), "server_memory_gb: {}\n", server_memory_gb); 179 | fmt::format_to(std::back_inserter(out), "server_rpc_threads: {}\n", server_rpc_threads); 180 | fmt::format_to(std::back_inserter(out), 
"server_page_state: {}\n", server_page_state); 181 | fmt::format_to(std::back_inserter(out), "read_magic_pattern: {}\n", read_magic_pattern); 182 | fmt::format_to(std::back_inserter(out), "promote_hotspot: {}\n", promote_hotspot); 183 | fmt::format_to(std::back_inserter(out), "promotion_window_ms: {}\n", promotion_window_ms); 184 | fmt::format_to(std::back_inserter(out), "pull_page_bitmap: {}\n", pull_page_bitmap); 185 | fmt::format_to(std::back_inserter(out), "advise_local_mr: {}\n", advise_local_mr); 186 | fmt::format_to(std::back_inserter(out), "predict_remote_mr: {}\n", predict_remote_mr); 187 | fmt::format_to(std::back_inserter(out), "--- configs end ---\n"); 188 | return out; 189 | } 190 | } configs; 191 | 192 | bool has_write_req = false; 193 | bool has_dirty_page = false; 194 | bool doing_promotion = false; 195 | 196 | // CTXs 197 | std::unique_ptr ctx_mgr; 198 | 199 | // MRs 200 | std::unique_ptr local_mr_mgr; 201 | std::unique_ptr remote_mr_mgr; 202 | 203 | // QPs 204 | std::unique_ptr qp_mgr; 205 | 206 | // original symbols from ibv 207 | struct { 208 | // QP-related: inited in create_qp() 209 | create_qp_t create_qp = nullptr; 210 | post_send_t post_send = nullptr; 211 | post_recv_t post_recv = nullptr; 212 | poll_cq_t poll_cq = nullptr; 213 | 214 | // inited in modify_qp 215 | modify_qp_t modify_qp = nullptr; 216 | 217 | // MR-related: initied in reg_mr_iova2() 218 | reg_mr_iova2_t reg_mr_iova2 = nullptr; 219 | reg_mr_t reg_mr = nullptr; 220 | } ori_ibv_symbols; 221 | 222 | static GlobalVariables& instance() 223 | { 224 | static GlobalVariables gv; 225 | return gv; 226 | } 227 | }; 228 | } 229 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/pdp.cc: -------------------------------------------------------------------------------- 1 | #include "mr.hh" 2 | #include "qp.hh" 3 | // c++ headers 4 | #include 5 | 6 | // system headers 7 | #include 8 | #include 9 | 10 | // my headers 11 | #include 12 | #include 13 | 14 | namespace pdp 15 | { 16 | struct ibv_mr *ori_reg_mr(struct ibv_pd *pd, void *addr, size_t length, 17 | int access) 18 | { 19 | const static auto &gv = GlobalVariables::instance(); 20 | assert(gv.ori_ibv_symbols.reg_mr_iova2); 21 | return gv.ori_ibv_symbols.reg_mr_iova2(pd, addr, length, (uintptr_t)addr, access); 22 | } 23 | 24 | /** 25 | * postprocess 26 | * All are the same as ibv_poll_cq() and only returns ibv_wc with IBV_SEND_SIGNALED set by the user. 27 | * Be careful that the CQ may not be empty when return < num_entries. 
28 | */ 29 | int poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc) 30 | { 31 | const static auto &gv = GlobalVariables::instance(); 32 | assert(gv.ori_ibv_symbols.poll_cq); 33 | // thread_local util::Recorder rec_poll("poll_nr_entries"); 34 | 35 | int ret = gv.ori_ibv_symbols.poll_cq(cq, num_entries, wc); 36 | if (ret <= 0) { 37 | return ret; 38 | } 39 | 40 | if (gv.configs.is_server) return ret; 41 | 42 | if (gv.configs.mode != +Mode::PDP) { 43 | // rec_poll.record_one(ret); 44 | return ret; 45 | } 46 | 47 | if (wc->opcode & IBV_WC_RECV) [[unlikely]] { 48 | // we do not modify post_recv 49 | return ret; 50 | } 51 | 52 | #if 1 53 | // pr_info("poll ret=%d.", ret); 54 | int nr_user_entries = 0; 55 | for (int i = 0; i < ret; i++) { 56 | auto prc = gv.qp_mgr->get(wc[i].qp_num); 57 | 58 | if (!prc) [[unlikely]] { 59 | wc[nr_user_entries++] = wc[i]; 60 | } else { 61 | bool signaled = prc->post_process_wc(wc); 62 | if (signaled) { 63 | wc[nr_user_entries++] = wc[i]; 64 | } 65 | // pr_debug("[pdp] a pdp wc done, signaled=%u, user_num_entries=%d", signaled, user_num_entries); 66 | } 67 | } 68 | #else 69 | bool is_prc = gv.qp_mgr->is_managed_qp(wc[i].qp_num); 70 | if (!is_prc) [[unlikely]] { 71 | wc[user_num_entries++] = wc[i]; 72 | } else { 73 | auto *user_send_wr = reinterpret_cast(wc->wr_id); 74 | bool signaled = user_send_wr->prc->post_process_wc(&wc[i]); 75 | if (signaled) { 76 | wc[user_num_entries++] = wc[i]; 77 | } 78 | } 79 | #endif 80 | 81 | return nr_user_entries; 82 | } 83 | 84 | // preprocess wr 85 | int post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, 86 | struct ibv_send_wr **bad_wr) 87 | { 88 | const static auto &gv = GlobalVariables::instance(); 89 | static util::LatencyRecorder lr_post_send{"post_send"}; 90 | util::LatencyRecorderHelper h{lr_post_send}; 91 | // static util::LatencyRecorder lr_1("1"); 92 | // pr_info("op=%u", wr->opcode); 93 | // lr_1.begin_one(); 94 | if (gv.configs.mode != +Mode::PDP) [[unlikely]] { 95 | return gv.ori_ibv_symbols.post_send(qp, wr, bad_wr); 96 | } 97 | 98 | // !prc: a shadow RCQP or a user UDQP 99 | const auto &prc = gv.qp_mgr->get(qp->qp_num); 100 | bool is_shadow_rc = !prc && qp->qp_type == IBV_QPT_RC; 101 | if (is_shadow_rc) [[unlikely]] { 102 | return gv.ori_ibv_symbols.post_send(qp, wr, bad_wr); 103 | } 104 | 105 | // check for all UD and User RC. 106 | // do not check for Shadow RC. 107 | if (gv.configs.is_server) { 108 | gv.local_mr_mgr->convert_pdp_mr_in_send_wr_chain(wr); 109 | } 110 | 111 | if (gv.configs.advise_local_mr) { 112 | gv.local_mr_mgr->advise_mr_in_wr_chain(wr); 113 | } 114 | 115 | // this is UD. 116 | if (!prc || gv.configs.is_server) [[unlikely]] { 117 | // gv.local_mr_mgr->pin_mr_in_recv_wr_chain((ibv_recv_wr *)wr); 118 | // gv.local_mr_mgr->advise_mr_in_wr_chain(wr); 119 | 120 | return gv.ori_ibv_symbols.post_send(qp, wr, bad_wr); 121 | } 122 | // lr_1.end_one(); 123 | return prc->post_send(wr, bad_wr); 124 | } 125 | 126 | int post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, 127 | struct ibv_recv_wr **bad_wr) 128 | { 129 | const static auto &gv = GlobalVariables::instance(); 130 | assert(gv.ori_ibv_symbols.post_recv); 131 | // thread_local util::LatencyRecorder lr_recv{"post_recv"}; 132 | // util::LatencyRecorder::Helper h{lr_recv}; 133 | 134 | if (gv.configs.mode == +Mode::PIN) { 135 | return gv.ori_ibv_symbols.post_recv(qp, wr, bad_wr); 136 | } 137 | 138 | if (qp->qp_type == IBV_QPT_UD) { 139 | // RECV: the remote writes to us. we pin mrs for UD so that the NIC does not drop the package. 
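        // pin_mr_in_recv_wr_chain() presumably faults in and pins the pages behind each recv
        // SGE up front, so a UD receive never triggers an ODP page fault in the NIC.
        // In plain ODP mode the recv is posted immediately after pinning (early return below);
        // in PDP mode we fall through to the optional advise step before posting.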
140 | gv.local_mr_mgr->pin_mr_in_recv_wr_chain(wr); 141 | if (gv.configs.mode == +Mode::ODP) { 142 | return gv.ori_ibv_symbols.post_recv(qp, wr, bad_wr); 143 | } 144 | } 145 | 146 | if (gv.configs.advise_local_mr) { 147 | gv.local_mr_mgr->advise_mr_in_wr_chain(wr); 148 | } 149 | 150 | return gv.ori_ibv_symbols.post_recv(qp, wr, bad_wr); 151 | } 152 | 153 | struct ibv_qp *create_qp(struct ibv_pd *pd, 154 | struct ibv_qp_init_attr *qp_init_attr) 155 | { 156 | const static auto ori_ibv_create_qp = (create_qp_t)dlvsym(RTLD_NEXT, "ibv_create_qp", "IBVERBS_1.1"); 157 | // static auto ori_ibv_create_qp = (create_qp_t)dlsym(RTLD_NEXT, "ibv_create_qp"); 158 | static auto &gv = GlobalVariables::instance(); 159 | static bool inited = false; 160 | 161 | if (!inited) { 162 | gv.ori_ibv_symbols.create_qp = ori_ibv_create_qp; 163 | gv.ori_ibv_symbols.post_send = pd->context->ops.post_send; 164 | gv.ori_ibv_symbols.post_recv = pd->context->ops.post_recv; 165 | gv.ori_ibv_symbols.poll_cq = pd->context->ops.poll_cq; 166 | inited = true; 167 | } 168 | 169 | assert(ori_ibv_create_qp); 170 | 171 | struct ibv_qp *uqp = ori_ibv_create_qp(pd, qp_init_attr); // man page: NULL if fails 172 | if (!uqp) return uqp; // fails 173 | 174 | assert(uqp->context == pd->context); 175 | if (gv.configs.mode == +Mode::PIN) return uqp; 176 | 177 | // a work-around patch for UD post_recv with ODP/PDP mrs. 178 | // Using a UD QP, a post_recv of ODP/PDP mr will drop a package directly during processing a page fault. 179 | // But many prototype systems use UD for RPC without considering package drops. 180 | // We pin memory in post_recv, so that the application can run. 181 | // Note! This works for ODP & PDP! 182 | uqp->context->ops.post_recv = post_recv; 183 | 184 | if (!gv.configs.mode == +Mode::PDP) return uqp; 185 | 186 | // we override post_send of all QPs, including UD, to check the mr. 187 | // Note! This works only for PDP! 188 | uqp->context->ops.poll_cq = poll_cq; 189 | uqp->context->ops.post_send = post_send; 190 | 191 | if (qp_init_attr->qp_type != IBV_QPT_RC) { 192 | // pr_info("create qp: type=0x%x, num=0x%x", uqp->qp_type, uqp->qp_num); 193 | return uqp; 194 | } 195 | 196 | // but we only create uqp for RC, to support READ/WRITE 197 | auto ctx = gv.ctx_mgr->get_or_reg_by_pd(uqp->pd); 198 | auto prc = gv.qp_mgr->reg_with_construct_args(uqp->qp_num, ctx, uqp, qp_init_attr); 199 | 200 | RcBytes rc_bytes = prc->to_rc_bytes(); 201 | 202 | // w/ gid 203 | gv.sms.put(rc_bytes.identifier().to_string(), rc_bytes.to_vec_char()); 204 | 205 | // w/o gid 206 | gv.sms.put(rc_bytes.identifier().wo_gid().to_string(), rc_bytes.to_vec_char()); 207 | 208 | return uqp; 209 | } 210 | 211 | // postprocess. put uqp_addr, sqp_. 212 | int modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, 213 | int attr_mask) 214 | { 215 | const static auto ori_ibv_modify_qp = (modify_qp_t)dlvsym(RTLD_NEXT, "ibv_modify_qp", "IBVERBS_1.1"); 216 | assert(ori_ibv_modify_qp != nullptr); 217 | 218 | static auto &gv = GlobalVariables::instance(); 219 | static bool inited = false; 220 | if (!inited) { 221 | gv.ori_ibv_symbols.modify_qp = ori_ibv_modify_qp; 222 | inited = true; 223 | } 224 | 225 | int ret = ori_ibv_modify_qp(qp, attr, attr_mask); 226 | if (!gv.configs.mode == +Mode::PDP) { 227 | return ret; 228 | } 229 | // man page: 0 on success, errno on failure 230 | if (ret) { 231 | return ret; 232 | } 233 | 234 | // change state? 
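    // Only modify_qp() calls that change the QP state are interesting here. When the user QP
    // reaches RTR, the destination identifier from ah_attr (with or without GID, depending on
    // is_global) is used to fetch the peer's RcBytes from memcached and connect the
    // corresponding internal QPs; attribute-only modifications need no extra handling.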
235 | if ((attr_mask & IBV_QP_STATE) == 0) { 236 | return ret; 237 | } 238 | 239 | const auto &prc = gv.qp_mgr->get(qp->qp_num); 240 | if (!prc) return ret; 241 | 242 | // init -> recv -> send 243 | switch (attr->qp_state) { 244 | case IBV_QPS_INIT: { 245 | assert(attr->port_num == 1); 246 | } 247 | break; 248 | case IBV_QPS_RTR: { 249 | rdma::QueuePair::Identifier remote_uqp_addr{{attr->ah_attr.grh.dgid, attr->ah_attr.dlid}, attr->dest_qp_num}; 250 | std::vector vec_char; 251 | if (attr->ah_attr.is_global) { 252 | vec_char = gv.sms.get(remote_uqp_addr.to_string()); 253 | } else { 254 | vec_char = gv.sms.get(remote_uqp_addr.wo_gid().to_string()); 255 | } 256 | RcBytes remote_mr_bytes{vec_char}; 257 | bool ok = prc->connect_to(remote_mr_bytes); 258 | assert(ok); 259 | } 260 | break; 261 | default: break; 262 | } 263 | 264 | return ret; 265 | } 266 | 267 | struct ibv_mr *reg_mr_iova2(struct ibv_pd *pd, void *addr, size_t length, 268 | uint64_t iova, unsigned int access) 269 | { 270 | const static auto ori_ibv_reg_mr_iova2 = (reg_mr_iova2_t)dlsym(RTLD_NEXT, "ibv_reg_mr_iova2"); 271 | assert(ori_ibv_reg_mr_iova2); 272 | static auto &gv = GlobalVariables::instance(); 273 | static bool inited = false; 274 | 275 | if (!inited) { 276 | gv.ori_ibv_symbols.reg_mr_iova2 = ori_ibv_reg_mr_iova2; 277 | gv.ori_ibv_symbols.reg_mr = ori_reg_mr; 278 | inited = true; 279 | } 280 | 281 | bool user_odp = access & IBV_ACCESS_ON_DEMAND; 282 | if (user_odp) { 283 | pr_info("user_odp!"); 284 | } 285 | 286 | if (user_odp && gv.configs.mode == +Mode::PIN) { 287 | pr_warn("!!! ATTENTION !!! ODP mr is requested but pdp_mode is pin"); 288 | access &= (~IBV_ACCESS_ON_DEMAND); 289 | user_odp = false; 290 | } 291 | 292 | struct ibv_mr *native_mr; 293 | for (int i = 0; i < 10; i++) { 294 | native_mr = gv.ori_ibv_symbols.reg_mr_iova2(pd, addr, length, iova, access); 295 | if (native_mr) break; 296 | } 297 | if (!native_mr) { 298 | pr_err("reg_mr: failed, addr=%p, len=0x%lx, error=%s!", addr, length, strerror(errno)); 299 | return native_mr; 300 | } 301 | 302 | auto ctx = gv.ctx_mgr->get_or_reg_by_pd(pd); 303 | auto rdma_mr = std::make_unique(ctx, native_mr); 304 | 305 | // PIN 306 | if (!user_odp) { 307 | // pr_info("%s", fmt::format("reg_mr: pin,addr={},len={:#x},lkey={:#x},pd={}", addr, length, \ 308 | // native_mr ? native_mr->lkey : -1, fmt::ptr(pd)).c_str()); 309 | 310 | // if (!gv.configs.mode == +Mode::PIN) { 311 | gv.local_mr_mgr->reg_with_construct_args(rdma_mr->lkey_, std::move(rdma_mr)); 312 | // } 313 | 314 | return native_mr; 315 | } 316 | 317 | // ODP 318 | if (gv.configs.mode == +Mode::ODP || !gv.configs.read_magic_pattern) { 319 | pr_info("reg_mr: odp, addr=%p, len=0x%lx, lkey=0x%x", addr, length, 320 | native_mr ? native_mr->lkey : -1); 321 | 322 | gv.local_mr_mgr->reg_with_construct_args(rdma_mr->lkey_, std::move(rdma_mr), nullptr); 323 | pr_info(); 324 | return native_mr; 325 | } 326 | 327 | // PDP 328 | // user_odp && gv.configs.mode.is_pdp 329 | 330 | assert((access & PDP_ACCESS_PDP) == 0); 331 | ASSERT((uint64_t)addr % PAGE_SIZE == 0, "addr must be aligned to 4KB"); 332 | 333 | // we create a PDP native_mr for each ODP native_mr, and return the PDP one 334 | // only the ODP native_mr is registered! 
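    // As a result the application holds the PDP MR's lkey/rkey and issues
    // one-sided verbs against it, while the internal ODP MR remains available
    // for the RPC/paging path; both handles are kept in local_mr_mgr under the
    // PDP lkey (see reg_with_construct_args() below).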
335 | struct ibv_mr *pdp_native_mr; 336 | for (int i = 0; i < 10; i++) { 337 | pdp_native_mr = gv.ori_ibv_symbols.reg_mr_iova2(pd, addr, length, iova, access | PDP_ACCESS_PDP); 338 | if (pdp_native_mr) break; 339 | if (i) pr_info("retry %d, error=%s", i, strerror(errno)); 340 | } 341 | assert(pdp_native_mr); 342 | 343 | // key-value 344 | // key: lkey of pdp_native_mr 345 | auto pdp_rdma_mr = std::make_unique(ctx, pdp_native_mr); 346 | pr_info(); 347 | gv.local_mr_mgr->reg_with_construct_args(pdp_rdma_mr->lkey_, std::move(rdma_mr), std::move(pdp_rdma_mr)); 348 | 349 | pr_info("reg_mr: pdp, addr=%p, len=0x%lx, pdp_lkey=0x%x, internal_odp_key=0x%x", addr, length, 350 | pdp_native_mr->lkey, native_mr->lkey); 351 | 352 | // return pdp_native_mr, so we can RDMA read! 353 | return pdp_native_mr; 354 | } 355 | } 356 | 357 | struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, 358 | struct ibv_qp_init_attr *qp_init_attr) 359 | { 360 | return pdp::create_qp(pd, qp_init_attr); 361 | } 362 | 363 | int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, 364 | int attr_mask) 365 | { 366 | return pdp::modify_qp(qp, attr, attr_mask); 367 | } 368 | 369 | struct ibv_mr *ibv_reg_mr_iova2(struct ibv_pd *pd, void *addr, size_t length, 370 | uint64_t iova, unsigned int access) 371 | { 372 | return pdp::reg_mr_iova2(pd, addr, length, iova, access); 373 | } 374 | 375 | #ifdef ibv_reg_mr 376 | #undef ibv_reg_mr 377 | #endif 378 | struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, 379 | int access) 380 | { 381 | return pdp::reg_mr_iova2(pd, addr, length, (uintptr_t)addr, access); 382 | } 383 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/qp.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | // mine 3 | #include "global.hh" 4 | #include "rpc.hh" 5 | #include 6 | #include 7 | // sys 8 | #include 9 | // boost 10 | // #include 11 | // c++ 12 | #include 13 | #include 14 | #include 15 | 16 | namespace pdp { 17 | class WrId { 18 | private: 19 | union { 20 | uint64_t u64_value_; 21 | struct { 22 | uint32_t wr_seq_; // max_send_wr uint32_t 23 | uint8_t user_signaled_ : 1; 24 | }; 25 | }; 26 | public: 27 | WrId(uint64_t u64_value) : u64_value_(u64_value) {} 28 | WrId(uint32_t wr_seq, bool user_signaled) : wr_seq_(wr_seq), 29 | user_signaled_(user_signaled) {} 30 | auto u64_value() const { return u64_value_; } 31 | auto wr_seq() const { return wr_seq_; } 32 | bool user_signaled() const { return user_signaled_; } 33 | }; 34 | 35 | struct RcBytes { 36 | rdma::QueuePair::Identifier user_qp; 37 | rdma::QueuePair::Identifier shadow_qp; 38 | rdma::QueuePair::Identifier bitmap_qp; 39 | 40 | using Identifier = rdma::QueuePair::Identifier; 41 | auto identifier() const 42 | { 43 | return user_qp; 44 | } 45 | RcBytes() = default; 46 | explicit RcBytes(const std::vector &vec_char) 47 | { 48 | memcpy(this, vec_char.data(), sizeof(RcBytes)); 49 | } 50 | auto to_vec_char() const 51 | { 52 | return util::to_vec_char(*this); 53 | } 54 | auto to_string() const 55 | { 56 | char str[256]; 57 | sprintf(str, "(%s){0x%08x,0x%08x,0x%08x}", user_qp.ctx_addr().to_string().c_str(), 58 | user_qp.qp_num(), shadow_qp.qp_num(), bitmap_qp.qp_num()); 59 | return std::string{str}; 60 | } 61 | 62 | auto to_qp_string() const 63 | { 64 | char str[256]; 65 | sprintf(str, "{0x%x,0x%x,0x%x}", 66 | user_qp.qp_num(), shadow_qp.qp_num(), bitmap_qp.qp_num()); 67 | return std::string{str}; 68 | } 69 | }; 70 | 71 | class 
CounterForRemoteMr; 72 | class RemoteMr; 73 | // pdp qp 74 | class RcQp { 75 | public: 76 | // pointers 77 | using Uptr = std::unique_ptr; 78 | 79 | // identifier 80 | using Identifier = util::CachelineAligned; // qpn 81 | Identifier identifier() const { return uqp_->qp_num; } 82 | 83 | struct UserSendWr { 84 | uint64_t original_wr_id{};// for poll_cq. user's original wr_ids. we need to return it in g_ori_poll_cq 85 | ibv_send_wr wr{};// for resubmission. user's wr for resubmission. signal is always set. sg_list is copied. wr_id is original 86 | std::vector sg; // for resubmission. 87 | RcQp *prc{}; 88 | uint32_t seq{}; 89 | bool via_rpc{}; 90 | bool user_signaled{}; 91 | #if PAGE_BITMAP_HINT_READ 92 | util::ObserverPtr remote_mr{}; 93 | bool rdma_read; // whether we tried rdma_read in submission 94 | #endif 95 | 96 | UserSendWr(const UserSendWr &rhs) 97 | { 98 | *this = rhs; 99 | wr.sg_list = sg.data(); 100 | } 101 | 102 | UserSendWr(RcQp *prc, uint32_t max_send_sge) : 103 | prc(prc), 104 | sg(max_send_sge) 105 | { 106 | wr.sg_list = sg.data(); 107 | } 108 | }; 109 | 110 | // member variables 111 | private: 112 | util::ObserverPtr ctx_; 113 | 114 | // UQP 115 | struct ibv_qp *uqp_ = nullptr; // user QP 116 | bool uqp_sq_sig_all_ = false; 117 | struct ibv_qp_cap uqp_cap_; 118 | 119 | // SQP 120 | std::unique_ptr sqp_; // shadow RC by pdp 121 | std::unique_ptr msg_conn_; 122 | 123 | // pdp_rc_ 124 | std::unique_ptr mr_qp_; // for mr prefetch, bitmap, ... 125 | 126 | std::vector user_send_wr_vec_; 127 | std::vector send_bitmap_; // 0 not used, 1 used. 128 | std::mutex mutex_; // protect 129 | std::unordered_map> access_counters_; 130 | 131 | // methods 132 | private: 133 | static bool buffer_is_magic(const char *data, uint32_t len); 134 | static int sge_identify_magic(const ibv_sge &sge, uint64_t remote_addr, uint64_t *fault_bitmap); 135 | 136 | // only records *wr itself, without wr->next!!! 
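    // (a caller that posts a chained wr list must walk wr->next and invoke it
    //  once per entry.)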
137 | void record_rdma_access(const ibv_send_wr *wr); 138 | 139 | public: 140 | RcQp(util::ObserverPtr ctx, struct ibv_qp *uqp, const ibv_qp_init_attr *init_attr); 141 | 142 | auto ctx() const { return ctx_; } 143 | auto msg_conn() const { return msg_conn_.get(); } 144 | auto mr_qp() const { return mr_qp_.get(); } 145 | RcBytes to_rc_bytes() const; 146 | 147 | auto get_user_qp_remote_identifier() { return rdma::QueuePair::Identifier{ctx_->remote_identifier(), uqp_->qp_num};} 148 | 149 | bool connect_to(const RcBytes &rc_bytes); 150 | 151 | bool post_process_wc(struct ibv_wc *wc); 152 | int post_send(struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr); 153 | 154 | std::optional find_send_seq(); 155 | void clear_send_seq(uint32_t seq); 156 | void set_send_seq(uint32_t seq); 157 | bool test_send_seq(uint32_t seq); 158 | 159 | void rpc_wr_async(ibv_send_wr *wr); 160 | Message rpc_wr_wait(ibv_send_wr *wr); 161 | }; 162 | 163 | class QpManager: public Manager { 164 | private: 165 | void do_in_reg(typename Map::iterator it) final; 166 | std::vector> rpc_qps_for_threads_; 167 | std::vector> rpc_server_threads_; 168 | int rpc_server_thread_id_current_{}; 169 | void rpc_worker_func(const std::stop_token &stop_token, int rpc_server_thread_id, coroutine::push_type& sink, int coro_id); 170 | }; 171 | } // namespace pdp 172 | -------------------------------------------------------------------------------- /libterm/ibverbs-pdp/rpc.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | // mine 3 | #include "global.hh" 4 | #include "bitmap.hh" 5 | // sys 6 | #include 7 | #include 8 | #include 9 | #include 10 | 11 | // 3rd 12 | #include 13 | #define BOOST_ALLOW_DEPRECATED_HEADERS 14 | #include 15 | #include 16 | // c++ 17 | #include 18 | #include 19 | #include 20 | 21 | #define MAX_NR_RDMA_PAGES ((MAX_RDMA_LENGTH + SZ_4K) / SZ_4K) 22 | 23 | namespace pdp { 24 | 25 | using boost::coroutines::coroutine; 26 | enum RpcOpcode : uint32_t { 27 | kReadSingle = 0, 28 | kReadSparce, 29 | kWriteSingle, 30 | kWriteInline, 31 | kRegisterCounter, 32 | kReturnBitmap, 33 | kReturnStatus, 34 | }; 35 | 36 | struct ReadSingle { 37 | uint64_t s_addr; // remote_addr for the requester 38 | uint32_t s_lkey; 39 | uint64_t c_addr; // remote_identifier for the requester 40 | uint32_t c_rkey; // for the responder 41 | uint32_t len; 42 | u8 update_bitmap; 43 | 44 | constexpr size_t size() const { return sizeof(ReadSingle); } 45 | 46 | std::string to_string() const 47 | { 48 | return fmt::format("read_single,s_addr={:#x},s_lkey={:#x},c_addr={:#x},c_rkey={:#x},len={:#x}", 49 | s_addr, s_lkey, c_addr, c_rkey, len); 50 | } 51 | }; 52 | 53 | struct ReadSparce { 54 | uint64_t s_addr; 55 | uint32_t s_lkey; 56 | uint64_t c_addr; 57 | uint32_t c_rkey; 58 | uint32_t len; 59 | uint64_t fault_bitmap[util::bits_to_u64s(MAX_NR_RDMA_PAGES)]; 60 | u8 update_bitmap; 61 | 62 | constexpr size_t size() const { return sizeof(ReadSparce); } 63 | 64 | std::string to_string() const 65 | { 66 | return fmt::format("read_space,s_addr={:#x},s_lkey={:#x},c_addr={:#x},c_rkey={:#x},len={:#x}", 67 | s_addr, s_lkey, c_addr, c_rkey, len); 68 | } 69 | }; 70 | 71 | struct WriteSingle { 72 | uint64_t s_addr; 73 | uint32_t s_lkey; 74 | uint64_t c_addr; 75 | uint32_t c_rkey; 76 | uint32_t len; 77 | u8 update_bitmap; 78 | 79 | constexpr size_t size() const { 80 | return sizeof(WriteSingle); 81 | } 82 | 83 | std::string to_string() const 84 | { 85 | return fmt::format("write_single, s_addr={:#x}, s_lkey={:#x}, c_addr={:#x}, 
c_rkey={:#x}, len={:#x}", 86 | s_addr, s_lkey, c_addr, c_rkey, len); 87 | } 88 | }; 89 | 90 | struct WriteInline { 91 | uint64_t s_addr; 92 | uint32_t s_lkey; 93 | uint32_t len; 94 | uint8_t data[PAGE_SIZE]; 95 | u8 update_bitmap; 96 | 97 | constexpr size_t size() const { 98 | return offsetof(WriteInline, data) + len; 99 | } 100 | 101 | constexpr size_t data_size() const { return len; } 102 | 103 | std::string to_string() const 104 | { 105 | return fmt::format("write_inline, s_addr={:#x}, s_lkey={:#x}, len={:#x}, data={:#x}", 106 | s_addr, s_lkey, len, *(uint64_t *)data); 107 | } 108 | }; 109 | 110 | struct RegisterCounter { 111 | uint32_t server_mr_key; 112 | rdma::RemoteMemoryRegion counter_mr; 113 | 114 | constexpr size_t size() const { return sizeof(RegisterCounter); } 115 | 116 | std::string to_string() const 117 | { 118 | return fmt::format("register_counter"); 119 | } 120 | }; 121 | 122 | struct ReturnBitmap { 123 | uint64_t present_bitmap[util::bits_to_u64s(MAX_NR_RDMA_PAGES)]; 124 | constexpr size_t size() const { return sizeof(ReturnBitmap); } 125 | std::string to_string() const 126 | { 127 | return fmt::format("return_bitmap"); 128 | } 129 | }; 130 | 131 | struct Message { 132 | uint32_t opcode; 133 | union { 134 | ReadSingle read_single; 135 | ReadSparce read_sparce; 136 | WriteSingle write_single; 137 | WriteInline write_inline; 138 | RegisterCounter register_counter; 139 | ReturnBitmap return_bitmap; 140 | }; 141 | 142 | [[nodiscard]] constexpr size_t size() const 143 | { 144 | switch (opcode) { 145 | case kReadSingle: 146 | return offsetof(Message, read_single) + read_single.size(); 147 | case kReadSparce: 148 | return offsetof(Message, read_sparce) + read_sparce.size(); 149 | case kWriteSingle: 150 | return offsetof(Message, write_single) + write_single.size(); 151 | case kWriteInline: 152 | return offsetof(Message, write_inline) + write_inline.size(); 153 | case kRegisterCounter: 154 | return offsetof(Message, register_counter) + register_counter.size(); 155 | case kReturnBitmap: 156 | return offsetof(Message, return_bitmap) + return_bitmap.size(); 157 | case kReturnStatus: 158 | return sizeof(opcode); 159 | default: 160 | return 0; 161 | } 162 | } 163 | 164 | [[nodiscard]] std::string to_string() const 165 | { 166 | std::string str = fmt::format("{{size={:#x},opcode=", size()); 167 | switch (opcode) { 168 | case kReadSingle: 169 | return str + read_single.to_string() + "}"; 170 | case kReadSparce: 171 | return str + read_sparce.to_string() + "}"; 172 | case kWriteSingle: 173 | return str + write_single.to_string() + "}"; 174 | case kWriteInline: 175 | return str + write_inline.to_string() + "}"; 176 | case kRegisterCounter: 177 | return str + register_counter.to_string() + "}"; 178 | case kReturnBitmap: 179 | return str + return_bitmap.to_string() + "}"; 180 | case kReturnStatus: 181 | return str + "return}"; 182 | default: 183 | return str + "unknown}"; 184 | } 185 | } 186 | }; 187 | 188 | class LocalMr; 189 | 190 | class RpcIo { 191 | static const size_t kBounceMrSize = MAX_NR_RDMA_PAGES * SZ_4K + SZ_4K; 192 | static const size_t kQueueDepth = kBounceMrSize / SZ_4K; 193 | private: 194 | rdma::LocalMemoryRegion::Uptr bounce_mr_; 195 | void *mr_base_addr_{}; 196 | 197 | int nvme_fd_buffer_; 198 | int nvme_fd_direct_; 199 | int pch_fd_; 200 | static inline thread_local std::vector io_context_vec_{std::vector(MAX_NR_COROUTINES)}; 201 | static inline thread_local std::vector inited_{std::vector(MAX_NR_COROUTINES)}; 202 | 203 | // io_uring ring_{}; 204 | 205 | std::vector> 
access_vec_; 206 | struct io_req_t { 207 | u64 offset; 208 | u32 length; 209 | char *buffer; 210 | bool direct; 211 | 212 | bool try_merge(const io_req_t &rhs) 213 | { 214 | if (offset + length == rhs.offset 215 | && buffer + length == rhs.buffer 216 | && direct == rhs.direct) { 217 | length += rhs.length; 218 | return true; 219 | } 220 | return false; 221 | } 222 | 223 | std::string to_string() const 224 | { 225 | return fmt::format("offset={:#x},length={:#x},buffer={},direct={}", 226 | offset, length, fmt::ptr(buffer), direct); 227 | } 228 | }; 229 | std::vector io_req_vec_; 230 | 231 | private: 232 | char *direct_sector_buffer() 233 | { 234 | return (char *)bounce_mr_->addr() + MAX_NR_RDMA_PAGES * PAGE_SIZE; 235 | } 236 | 237 | public: 238 | explicit RpcIo(util::ObserverPtr ctx); 239 | ~RpcIo() 240 | { 241 | close(nvme_fd_buffer_); 242 | close(nvme_fd_direct_); 243 | close(pch_fd_); 244 | // io_uring_queue_exit(&ring_); 245 | } 246 | 247 | auto bounce_mr() { return util::make_observer(bounce_mr_.get()); } 248 | 249 | // void do_io_by_uring(int nr_io_reqs, bool write); 250 | void do_io_by_psync(int nr_io_reqs, bool write, coroutine::push_type& sink, int coro_id); 251 | void subsector_write(void *buffer, u32 length, u64 offset, coroutine::push_type& sink, int coro_id); 252 | 253 | // rw_offset: in mr and ssd 254 | // mr_base_addr: virtual address, must be 4KB-aligned! 255 | // fault_bitmap: nullptr means all pages need to access 256 | std::span> 257 | io(u64 rw_offset, u32 rw_count, const uint64_t *fault_bitmap, bool write, LocalMr *mr, coroutine::push_type& sink, int coro_id); 258 | }; 259 | 260 | class MessageConnection { 261 | static const uint64_t kSignalBatch = 16; 262 | private: 263 | util::ObserverPtr rc_qp_; 264 | rdma::LocalMemoryRegion::Uptr send_mr_; 265 | rdma::LocalMemoryRegion::Uptr recv_mr_; 266 | uint64_t send_counter_{}; 267 | std::mutex mutex_; 268 | 269 | // PDP server only 270 | std::unique_ptr rpc_io_; 271 | 272 | public: 273 | using Uptr = std::unique_ptr; 274 | 275 | void lock() 276 | { 277 | mutex_.lock(); 278 | } 279 | 280 | void unlock() 281 | { 282 | mutex_.unlock(); 283 | } 284 | 285 | explicit MessageConnection(util::ObserverPtr rc_qp) 286 | : rc_qp_(rc_qp) 287 | { 288 | static const auto &gv = GlobalVariables::instance(); 289 | 290 | const reg_mr_t ori_reg_mr = gv.ori_ibv_symbols.reg_mr; 291 | 292 | rdma::Allocation send_alloc{sizeof(Message)}; 293 | rdma::Allocation recv_alloc{sizeof(Message)}; 294 | 295 | send_mr_ = std::make_unique(rc_qp_->ctx(), std::move(send_alloc), false, ori_reg_mr); 296 | recv_mr_ = std::make_unique(rc_qp_->ctx(), std::move(recv_alloc), false, ori_reg_mr); 297 | rc_qp_->receive(recv_mr_->pick_by_offset0(sizeof(Message))); 298 | 299 | if (!gv.configs.is_server) return; 300 | rpc_io_ = std::make_unique(rc_qp_->ctx()); 301 | } 302 | 303 | auto rc_qp() { return rc_qp_; } 304 | auto rpc_io() { return util::make_observer(rpc_io_.get()); } 305 | 306 | Message get(ibv_wc *wc = nullptr) 307 | { 308 | assert(rc_qp_); 309 | rc_qp_->wait_recv(wc); 310 | 311 | Message msg = *(Message *)recv_mr_->addr_; 312 | 313 | // post for the next 314 | rc_qp_->receive(recv_mr_->addr_, recv_mr_->lkey_, sizeof(Message)); 315 | 316 | return msg; 317 | } 318 | 319 | bool try_get(Message *msg, void *write_bounce_buffer) 320 | { 321 | int ret = rc_qp_->poll_recv(1); 322 | if (ret <= 0) { 323 | return false; 324 | } 325 | 326 | auto *src_msg = (Message *)recv_mr_->addr(); 327 | if (src_msg->opcode == RpcOpcode::kWriteInline) { 328 | memcpy(msg, src_msg, 
src_msg->size() - src_msg->write_inline.data_size()); 329 | size_t offset_in_page = msg->write_single.s_addr % PAGE_SIZE; 330 | memcpy((char *)write_bounce_buffer + offset_in_page, src_msg->write_inline.data, src_msg->write_inline.len); 331 | } else { 332 | memcpy(msg, src_msg, src_msg->size()); 333 | } 334 | 335 | rc_qp_->receive(recv_mr_->pick_by_offset0(sizeof(Message))); 336 | 337 | return true; 338 | } 339 | 340 | Message *send_mr_ptr() 341 | { 342 | return (Message *)send_mr_->addr(); 343 | } 344 | 345 | bool should_signal() const 346 | { 347 | return send_counter_ % kSignalBatch == 0; 348 | } 349 | 350 | void inc_send_counter() 351 | { 352 | send_counter_++; 353 | } 354 | 355 | // auto batch signal and wait 356 | void post_send_wr_auto_wait(rdma::Wr *wr) 357 | { 358 | if (should_signal()) { 359 | wr->set_signaled(true); 360 | } 361 | rc_qp_->post_send_wr(wr); 362 | // the caller may have set the signal 363 | if (wr->signaled()) { 364 | ibv_wc wc{}; 365 | rc_qp_->wait_send(&wc); 366 | ASSERT(wc.status == IBV_WC_SUCCESS, "fail to wait_send: %s", ibv_wc_status_str(wc.status)); 367 | } 368 | inc_send_counter(); 369 | } 370 | 371 | // auto batch signal and wait 372 | void send(const Message *msg) 373 | { 374 | if ((void *)msg != send_mr_->addr()) { 375 | memcpy(send_mr_->addr(), msg, msg->size()); 376 | } 377 | rc_qp_->send(send_mr_->pick_by_offset0(msg->size()), should_signal()); 378 | if (should_signal()) { 379 | ibv_wc wc{}; 380 | rc_qp_->wait_send(&wc); 381 | ASSERT(wc.status == IBV_WC_SUCCESS, "fail to wait_send: %d(%s), qp=%s, send_counter=%lu", wc.status, ibv_wc_status_str(wc.status), 382 | rc_qp()->identifier().to_string().c_str(), send_counter_); 383 | ASSERT(wc.opcode == IBV_WC_SEND, "unexpected wc.opcode=0x%x, qp=%s, send_counter=%lu", wc.opcode, rc_qp()->identifier().to_string().c_str(), send_counter_); 384 | } 385 | inc_send_counter(); 386 | } 387 | }; 388 | } 389 | -------------------------------------------------------------------------------- /libterm/include/bitmap.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #include 3 | 4 | namespace util { 5 | using ulong_t = uint64_t; 6 | static constexpr inline uint64_t kBitsPerByte = 8; 7 | static constexpr inline uint64_t kBitsPerLong = sizeof(ulong_t) * kBitsPerByte; 8 | static constexpr inline uint64_t kBitsPerU64 = sizeof(uint64_t) * kBitsPerByte; 9 | static constexpr inline uint64_t kPageSize = 4096; 10 | 11 | static bool test_bit(uint32_t nr, const uint64_t *addr) 12 | { 13 | return 1ul & (addr[nr / kBitsPerLong] >> (nr % kBitsPerLong)); 14 | } 15 | 16 | static inline void set_bit(uint32_t nr, uint64_t *addr) 17 | { 18 | addr[nr / kBitsPerLong] |= 1UL << (nr % kBitsPerLong); 19 | } 20 | 21 | static inline void clear_bit(uint32_t nr, uint64_t *addr) 22 | { 23 | addr[nr / kBitsPerLong] &= ~(1UL << (nr % kBitsPerLong)); 24 | } 25 | 26 | static constexpr uint64_t bits_to_longs(uint64_t bits) 27 | { 28 | return util::div_up(bits, kBitsPerLong); 29 | } 30 | 31 | static constexpr uint64_t bits_to_u64s(uint64_t bits) 32 | { 33 | return util::div_up(bits, kBitsPerU64); 34 | } 35 | 36 | static constexpr uint64_t bits_to_long_aligned_bytes(uint64_t bits) 37 | { 38 | return util::bits_to_longs(bits) * sizeof(ulong_t); 39 | } 40 | 41 | static uint64_t mr_length_to_pgs(uint64_t mr_length) 42 | { 43 | return util::div_up(mr_length, kPageSize); 44 | } 45 | 46 | class Bitmap 47 | { 48 | ulong_t *data_{}; 49 | uint32_t bits_{}; 50 | public: 51 | using Uptr = std::unique_ptr; 52 | 
Bitmap(uint32_t bits, bool set) : bits_(bits) 53 | { 54 | data_ = new ulong_t[bits_to_longs(bits_)]; 55 | if (set) { 56 | set_all(); 57 | } else { 58 | clear_all(); 59 | } 60 | } 61 | ~Bitmap() 62 | { 63 | delete[] data_; 64 | } 65 | bool test_bit(uint32_t nr) 66 | { 67 | return util::test_bit(nr, data_); 68 | } 69 | void set_bit(uint32_t nr) 70 | { 71 | util::set_bit(nr, data_); 72 | } 73 | void assign_bit(uint32_t nr, bool set) 74 | { 75 | if (set) { 76 | set_bit(nr); 77 | } else { 78 | clear_bit(nr); 79 | } 80 | } 81 | void clear_bit(uint32_t nr) 82 | { 83 | util::clear_bit(nr, data_); 84 | } 85 | void clear_all() 86 | { 87 | memset(data_, 0, sizeof(ulong_t) * bits_to_longs(bits_)); 88 | } 89 | void set_all() 90 | { 91 | memset(data_, 0xff, sizeof(ulong_t) * bits_to_longs(bits_)); 92 | } 93 | }; 94 | } -------------------------------------------------------------------------------- /libterm/include/compile.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | #include 4 | 5 | #define ENABLE_CACHELINE_ALIGN 1 6 | #if ENABLE_CACHELINE_ALIGN 7 | #define __cacheline_aligned alignas(64) 8 | #else 9 | #define __cacheline_aligned 10 | #endif 11 | 12 | #define __packed __attribute__((__packed__)) 13 | 14 | inline void compiler_barrier() { asm volatile("" ::: "memory"); } 15 | 16 | template 17 | concept explicit_ref = std::same_as>; 18 | 19 | #define DELETE_MOVE_CONSTRUCTOR_AND_ASSIGNMENT(Class) \ 20 | Class(Class&&) = delete; \ 21 | Class& operator=(Class&&) = delete; 22 | #define DELETE_COPY_CONSTRUCTOR_AND_ASSIGNMENT(Class) \ 23 | Class(const Class&) = delete; \ 24 | Class& operator=(const Class&) = delete; 25 | -------------------------------------------------------------------------------- /libterm/include/const.hh: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | #define SZ_K (1ul << 10) 4 | #define SZ_M (1ul << 20) 5 | #define SZ_1K (1ul << 10) 6 | #define SZ_2K (2ul << 10) 7 | #define SZ_4K (4ul << 10) 8 | #define SZ_64K (64ul << 10) 9 | #define SZ_256K (256ul << 10) 10 | #define SZ_512K (512ul << 10) 11 | #define SZ_1M (1ul << 20) 12 | #define SZ_2M (2ul << 20) 13 | #define SZ_4M (4ul << 20) 14 | #define SZ_256M (256ul << 20) 15 | #define SZ_1G (1ul << 30) 16 | #define SZ_2G (2ul << 30) 17 | 18 | #define PAGE_SHIFT 12 19 | #define PAGE_SIZE (1ul << PAGE_SHIFT) 20 | #define SECTOR_SIZE 512 21 | 22 | using u8 = uint8_t; 23 | using u16 = uint16_t; 24 | using u32 = uint32_t; 25 | using u64 = uint64_t; 26 | -------------------------------------------------------------------------------- /libterm/include/data-conn.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #include 3 | 4 | #include "rdma.hh" 5 | 6 | namespace rdma { 7 | class DataConnection final { 8 | public: 9 | using Uptr = std::unique_ptr; 10 | private: 11 | util::ObserverPtr rc_qp_; 12 | util::ObserverPtr local_mr_; 13 | std::unique_ptr remote_mr_; 14 | public: 15 | DataConnection(util::ObserverPtr rc_qp, util::ObserverPtr local_mr, 16 | std::unique_ptr remote_mr) 17 | : rc_qp_(rc_qp), 18 | local_mr_(local_mr), 19 | remote_mr_(std::move(remote_mr)) {} 20 | 21 | // DataConnection(const Sptr& data_conn) 22 | // : rc_qp_(data_conn->rc_qp_), local_mr_(data_conn->local_mr_), remote_mr_(data_conn->remote_mr_) 23 | // {} 24 | 25 | // async 26 | bool read(const MemorySge &local_mem, const MemorySge &remote_mem, bool signal) 27 | { 28 | return rc_qp_->read(local_mem, 
remote_mem, signal); 29 | } 30 | 31 | // async 32 | bool read(uint64_t local_addr, uint64_t remote_addr, uint32_t len) 33 | { 34 | pr_info("data-conn::read"); 35 | return rc_qp_->read(local_mr_->pick_by_addr(local_addr, len), 36 | remote_mr_->pick_by_addr(remote_addr, len), true); 37 | } 38 | 39 | bool write(const MemorySge &local_mem, const MemorySge &remote_mem, optional imm = std::nullopt) 40 | { 41 | return rc_qp_->write(local_mem, remote_mem, true, imm); 42 | } 43 | 44 | bool write(uint64_t local_addr, uint64_t remote_addr, uint32_t len, optional imm = std::nullopt) 45 | { 46 | return rc_qp_->write(local_mr_->pick_by_addr(local_addr, len), 47 | remote_mr_->pick_by_addr(remote_addr, len), true, imm); 48 | } 49 | 50 | void post_send_wr(Wr *wr) { rc_qp_->post_send_wr(wr); } 51 | 52 | int wait_send(ibv_wc *wc = nullptr) 53 | { 54 | return rc_qp_->wait_send(wc); 55 | } 56 | 57 | auto rc_qp() { return rc_qp_; } 58 | 59 | auto * local_mr() { return local_mr_.get(); } 60 | auto * remote_mr() { return remote_mr_.get(); } 61 | }; 62 | } 63 | -------------------------------------------------------------------------------- /libterm/include/lock.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | // mine 4 | #include "util.hh" 5 | #include 6 | 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | 13 | #include 14 | #include 15 | 16 | static inline uint64_t NowMicros() 17 | { 18 | static constexpr uint64_t kUsecondsPerSecond = 1000000; 19 | struct timeval tv; 20 | gettimeofday(&tv, nullptr); 21 | return static_cast(tv.tv_sec) * kUsecondsPerSecond + tv.tv_usec; 22 | } 23 | 24 | #define TIMER_START(x) \ 25 | const auto timer_##x = std::chrono::steady_clock::now() 26 | 27 | #define TIMER_STOP(x) \ 28 | x += std::chrono::duration_cast( \ 29 | std::chrono::steady_clock::now() - timer_##x) \ 30 | .count() 31 | 32 | template 33 | struct Timer 34 | { 35 | Timer(T &res) : start_time_(std::chrono::steady_clock::now()), res_(res) {} 36 | 37 | ~Timer() 38 | { 39 | res_ += std::chrono::duration_cast( 40 | std::chrono::steady_clock::now() - start_time_) 41 | .count(); 42 | } 43 | 44 | std::chrono::steady_clock::time_point start_time_; 45 | T &res_; 46 | }; 47 | 48 | #ifdef LOGGING 49 | #define TIMER_START_LOGGING(x) TIMER_START(x) 50 | #define TIMER_STOP_LOGGING(x) TIMER_STOP(x) 51 | #define COUNTER_ADD_LOGGING(x, y) x += (y) 52 | #else 53 | #define TIMER_START_LOGGING(x) 54 | #define TIMER_STOP_LOGGING(x) 55 | #define COUNTER_ADD_LOGGING(x, y) 56 | #endif 57 | 58 | // #define LOGGING 59 | 60 | #ifdef LOGGING 61 | #define LOG(fmt, ...) \ 62 | fprintf(stderr, "\033[1;31mLOG(<%s>:%d %s): \033[0m" fmt "\n", __FILE__, \ 63 | __LINE__, __func__, ##__VA_ARGS__) 64 | #else 65 | #define LOG(fmt, ...) 
66 | #endif 67 | 68 | class SpinLock 69 | { 70 | public: 71 | SpinLock() : mutex(false) {} 72 | SpinLock(std::string name) : mutex(false), name(name) {} 73 | 74 | bool try_lock() 75 | { 76 | bool expect = false; 77 | return mutex.compare_exchange_strong( 78 | expect, true, std::memory_order_release, std::memory_order_relaxed); 79 | } 80 | 81 | void lock() 82 | { 83 | uint64_t startOfContention = 0; 84 | bool expect = false; 85 | while (!mutex.compare_exchange_weak(expect, true, std::memory_order_release, 86 | std::memory_order_relaxed)) 87 | { 88 | expect = false; 89 | debugLongWaitAndDeadlock(&startOfContention); 90 | } 91 | if (startOfContention != 0) 92 | { 93 | contendedTime += NowMicros() - startOfContention; 94 | ++contendedAcquisitions; 95 | } 96 | } 97 | 98 | void unlock() { mutex.store(0, std::memory_order_release); } 99 | 100 | void report() 101 | { 102 | LOG("spinlock %s: contendedAcquisitions %lu contendedTime %lu us", 103 | name.c_str(), contendedAcquisitions, contendedTime); 104 | } 105 | 106 | private: 107 | std::atomic_bool mutex; 108 | std::string name; 109 | uint64_t contendedAcquisitions = 0; 110 | uint64_t contendedTime = 0; 111 | 112 | void debugLongWaitAndDeadlock(uint64_t *startOfContention) 113 | { 114 | if (*startOfContention == 0) 115 | { 116 | *startOfContention = NowMicros(); 117 | } 118 | else 119 | { 120 | uint64_t now = NowMicros(); 121 | if (now >= *startOfContention + 1000000) 122 | { 123 | LOG("%s SpinLock locked for one second; deadlock?", name.c_str()); 124 | } 125 | } 126 | } 127 | }; 128 | 129 | // read write_single lock 130 | class ReadWriteLock 131 | { 132 | // the lowest bit is used for writer 133 | public: 134 | bool TryReadLock() 135 | { 136 | uint64_t old_val = lock_value.load(std::memory_order_acquire); 137 | while (true) 138 | { 139 | if (old_val & 1 || old_val > 1024) 140 | { 141 | break; 142 | } 143 | uint64_t new_val = old_val + 2; 144 | bool cas = lock_value.compare_exchange_weak(old_val, new_val, 145 | std::memory_order_acq_rel, 146 | std::memory_order_acquire); 147 | if (cas) 148 | { 149 | return true; 150 | } 151 | } 152 | return false; 153 | } 154 | 155 | void lock_shared() 156 | { 157 | while (!TryReadLock()) 158 | ; 159 | } 160 | 161 | void unlock_shared() 162 | { 163 | uint64_t old_val = lock_value.load(std::memory_order_acquire); 164 | while (true) 165 | { 166 | if (old_val <= 1) 167 | { 168 | assert(old_val >= 2); 169 | return; 170 | } 171 | uint64_t new_val = old_val - 2; 172 | if (lock_value.compare_exchange_weak(old_val, new_val)) 173 | { 174 | break; 175 | } 176 | } 177 | } 178 | 179 | bool TryWriteLock() 180 | { 181 | uint64_t old_val = lock_value.load(std::memory_order_acquire); 182 | while (true) 183 | { 184 | if (old_val & 1) 185 | { 186 | return false; 187 | } 188 | uint64_t new_val = old_val | 1; 189 | bool cas = lock_value.compare_exchange_weak(old_val, new_val); 190 | if (cas) 191 | { 192 | break; 193 | } 194 | } 195 | // got write_single lock, waiting for readers 196 | while (lock_value.load(std::memory_order_acquire) != 1) 197 | { 198 | asm("nop"); 199 | } 200 | return true; 201 | } 202 | 203 | void lock() 204 | { 205 | while (!TryWriteLock()) 206 | ; 207 | } 208 | 209 | void unlock() 210 | { 211 | assert(lock_value == 1); 212 | lock_value.store(0); 213 | } 214 | 215 | private: 216 | std::atomic_uint_fast64_t lock_value{0}; 217 | }; 218 | 219 | #define CAS(ptr, oldval, newval) \ 220 | (__sync_bool_compare_and_swap(ptr, oldval, newval)) 221 | class ReaderFriendlyLock 222 | { 223 | std::vector lock_vec_; 224 | public: 
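    // Reader-biased design: every thread owns one slot in lock_vec_, so
    // lock_shared() only CASes the caller's own slot and readers never contend
    // with each other, while lock() has to claim every slot and is therefore
    // expensive but expected to be rare. Slot [1] records "already held by this
    // thread" so shared locking tolerates recursion (see the
    // pr_once(info, "recursive lock!") path).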
225 | DELETE_COPY_CONSTRUCTOR_AND_ASSIGNMENT(ReaderFriendlyLock); 226 | 227 | ReaderFriendlyLock(ReaderFriendlyLock &&rhs) noexcept 228 | { 229 | *this = std::move(rhs); 230 | } 231 | ReaderFriendlyLock& operator=(ReaderFriendlyLock &&rhs) { 232 | std::swap(this->lock_vec_, rhs.lock_vec_); 233 | return *this; 234 | } 235 | 236 | ReaderFriendlyLock() : lock_vec_(util::Schedule::max_nr_threads()) 237 | { 238 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i) 239 | { 240 | lock_vec_[i][0] = 0; 241 | lock_vec_[i][1] = 0; 242 | } 243 | } 244 | 245 | bool lock() 246 | { 247 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i) 248 | { 249 | while (!CAS(&lock_vec_[i][0], 0, 1)) 250 | { 251 | } 252 | } 253 | return true; 254 | } 255 | 256 | bool try_lock() 257 | { 258 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i) 259 | { 260 | if (!CAS(&lock_vec_[i][0], 0, 1)) 261 | { 262 | for (i--; i >= 0; i--) { 263 | compiler_barrier(); 264 | lock_vec_[i][0] = 0; 265 | } 266 | return false; 267 | } 268 | } 269 | return true; 270 | } 271 | 272 | bool try_lock_shared() 273 | { 274 | if (lock_vec_[util::Schedule::thread_id()][1]) { 275 | pr_once(info, "recursive lock!"); 276 | return true; 277 | } 278 | return CAS(&lock_vec_[util::Schedule::thread_id()][0], 0, 1); 279 | } 280 | 281 | bool lock_shared() 282 | { 283 | if (lock_vec_[util::Schedule::thread_id()][1]) { 284 | pr_once(info, "recursive lock!"); 285 | return true; 286 | } 287 | while (!CAS(&lock_vec_[util::Schedule::thread_id()][0], 0, 1)) 288 | { 289 | } 290 | lock_vec_[util::Schedule::thread_id()][1] = 1; 291 | return true; 292 | } 293 | 294 | void unlock() 295 | { 296 | compiler_barrier(); 297 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i) 298 | { 299 | lock_vec_[i][0] = 0; 300 | } 301 | } 302 | 303 | void unlock_shared() 304 | { 305 | compiler_barrier(); 306 | lock_vec_[util::Schedule::thread_id()][0] = 0; 307 | lock_vec_[util::Schedule::thread_id()][1] = 0; 308 | } 309 | }; 310 | -------------------------------------------------------------------------------- /libterm/include/logging.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #include "print-stack.hh" 3 | #include 4 | #include 5 | #include 6 | #include 7 | 8 | #define DYNAMIC_DEBUG 0 9 | 10 | #define COLOR_BLACK "\033[0;30m" 11 | #define COLOR_RED "\033[0;31m" 12 | #define COLOR_GREEN "\033[0;32m" 13 | #define COLOR_YELLOW "\033[0;33m" 14 | #define COLOR_BLUE "\033[0;34m" 15 | #define COLOR_MAGENTA "\033[0;35m" 16 | #define COLOR_CYAN "\033[0;36m" 17 | #define COLOR_WHITE "\033[0;37m" 18 | #define COLOR_DEFAULT "\033[0;39m" 19 | 20 | #define COLOR_BOLD_BLACK "\033[1;30m" 21 | #define COLOR_BOLD_RED "\033[1;31m" 22 | #define COLOR_BOLD_GREEN "\033[1;32m" 23 | #define COLOR_BOLD_YELLOW "\033[1;33m" 24 | #define COLOR_BOLD_BLUE "\033[1;34m" 25 | #define COLOR_BOLD_MAGENTA "\033[1;35m" 26 | #define COLOR_BOLD_CYAN "\033[1;36m" 27 | #define COLOR_BOLD_WHITE "\033[1;37m" 28 | #define COLOR_BOLD_DEFAULT "\033[1;39m" 29 | 30 | #define PT_RESET "\033[0m" 31 | #define PT_BOLD "\033[1m" 32 | #define PT_UNDERLINE "\033[4m" 33 | #define PT_BLINKING "\033[5m" 34 | #define PT_INVERSE "\033[7m" 35 | 36 | #define LIBPDP_PREFIX COLOR_CYAN "[libpdp] " PT_RESET 37 | #define pr_libpdp(fmt, args...) printf(LIBPDP_PREFIX fmt "\n", ##args) 38 | 39 | #define EXTRACT_FILENAME(PATH) (__builtin_strrchr("/" PATH, '/') + 1) 40 | #define __FILENAME__ EXTRACT_FILENAME(__FILE__) 41 | 42 | #define pr(fmt, args...) 
pr_libpdp(COLOR_GREEN "%s:%d: %s:" PT_RESET " (%d) " fmt, __FILENAME__, __LINE__, __func__, util::Schedule::thread_id(), ##args) 43 | #define pr_flf(file, line, func, fmt, args...) pr_libpdp(COLOR_GREEN "%s:%d: %s:" PT_RESET " (tid=%d) " fmt, file, line, func, util::Schedule::thread_id(), ##args) 44 | 45 | #define pr_info(fmt, args...) pr(fmt, ##args) 46 | #define pr_err(fmt, args...) pr(COLOR_RED fmt PT_RESET, ##args) 47 | #define pr_warn(fmt, args...) pr(COLOR_MAGENTA fmt PT_RESET, ##args) 48 | #define pr_emph(fmt, args...) pr(COLOR_YELLOW fmt PT_RESET, ##args) 49 | #if DYNAMIC_DEBUG 50 | #define pr_debug(fmt, args...) ({ \ 51 | static bool enable_debug = util::getenv_bool("ENABLE_DEBUG").value_or(false); \ 52 | if (enable_debug) pr(fmt, ##args); \ 53 | }) 54 | #else 55 | #define pr_debug(fmt, args...) 56 | #endif 57 | 58 | #define pr_once(level, format...) ({ \ 59 | static bool __warned = false; \ 60 | if (!__warned) [[unlikely]] { pr_##level(format); \ 61 | __warned = true;} \ 62 | }) 63 | 64 | #define pr_coro(level, format, args...) pr_##level("coro_id=%d, " format, coro_id, ##args) 65 | 66 | // #define ASSERT(...) 67 | #define ASSERT(cond, format, args...) ({ \ 68 | if (!(cond)) [[unlikely]] { \ 69 | pr_err(format, ##args); \ 70 | exit(EXIT_FAILURE); \ 71 | } \ 72 | }) 73 | #define ASSERT_PERROR(cond, format, args...) ASSERT(cond, "%s: " format, strerror(errno), ##args) 74 | 75 | #define fmt_pr(level, fmt_str, args...) pr_##level("%s", fmt::format(fmt_str, ##args).c_str()) 76 | -------------------------------------------------------------------------------- /libterm/include/node.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #include "rdma.hh" 3 | #include "sms.hh" 4 | #include "data-conn.hh" 5 | #include "ud-conn.hh" 6 | 7 | namespace rdma { 8 | 9 | class Thread; 10 | 11 | class Node { 12 | private: 13 | const char *kNrNodesKey = "nr_nodes"; 14 | const char *kNrPreparedKey = "nr_prepared"; 15 | const char *kNrConnectedKey = "nr_connected"; 16 | 17 | public: 18 | using Uptr = std::unique_ptr; 19 | SharedMetaService sms_{"10.0.2.181", 23333}; // shared_meta_service 20 | int nr_nodes_; 21 | int node_id_; 22 | bool registered_ = false; 23 | vector threads_; 24 | std::vector> ctxs_; 25 | 26 | Node(int nr_nodes, int node_id = -1) : nr_nodes_(nr_nodes), node_id_(node_id) {} 27 | 28 | void register_node() 29 | { 30 | std::cout << "registering node..." << std::flush; 31 | if (node_id_ == -1) 32 | node_id_ = sms_.inc(kNrNodesKey) - 1; 33 | 34 | std::cout << node_id_ << std::endl; 35 | if (is_server()) { 36 | sms_.put(kNrPreparedKey, {'0'}); 37 | sms_.put(kNrConnectedKey, {'0'}); 38 | } 39 | registered_ = true; 40 | } 41 | 42 | void unregister_node() 43 | { 44 | if (registered_) { 45 | sms_.dec(kNrNodesKey); 46 | pr_info("unregister node %d", node_id_); 47 | registered_ = false; 48 | } 49 | } 50 | 51 | bool is_server() 52 | { 53 | return node_id_ == 0; 54 | } 55 | 56 | void wait_for_all_nodes_prepared() 57 | { 58 | std::cout << __func__ << "... " << std::flush; 59 | sms_.inc(kNrPreparedKey); 60 | while (sms_.get_int(kNrPreparedKey) < nr_nodes_) { 61 | usleep(10000); 62 | } 63 | std::cout << "done" << std::endl; 64 | } 65 | 66 | void wait_for_all_nodes_connected() 67 | { 68 | std::cout << __func__ << "... 
" << std::flush; 69 | sms_.inc(kNrConnectedKey); 70 | while (sms_.get_int(kNrConnectedKey) < nr_nodes_) { 71 | usleep(10000); 72 | } 73 | std::cout << "done" << std::endl; 74 | } 75 | }; 76 | 77 | class Thread { 78 | util::ObserverPtr node_; 79 | util::ObserverPtr ctx_; 80 | int thread_id_; 81 | 82 | int nr_dst_nodes_; // including myself for simplicity 83 | int nr_dst_threads_; 84 | 85 | LocalMemoryRegion::Uptr local_mr_; 86 | 87 | struct RcInfo { 88 | MemoryRegion mr; 89 | QueuePair::Identifier data_addr; 90 | }; 91 | 92 | // [dst_node_id][dst_thread_id]; for clients, dst_node_id=0 (connected to server) 93 | vector> data_qp_; // [node_id][thread_id] 94 | vector> data_conn_; // [node_id][thread_id] 95 | 96 | std::unique_ptr ud_qp_; 97 | vector>> ud_ah_; // [dst_node_id][dst_thread_id] 98 | 99 | static string cat_to_string(int node_id, int thread_id) 100 | { 101 | return std::to_string(node_id) + ":" + std::to_string(thread_id); 102 | } 103 | 104 | string rc_push_key(int dst_node_id, int dst_thread_id) 105 | { 106 | return std::string("[rc-addr]") + cat_to_string(node_->node_id_, thread_id_) + "-" + cat_to_string(dst_node_id, dst_thread_id); 107 | } 108 | 109 | string rc_pull_key(int dst_node_id, int dst_thread_id) 110 | { 111 | return std::string("[rc-addr]") + cat_to_string(dst_node_id, dst_thread_id) + "-" + cat_to_string(node_->node_id_, thread_id_); 112 | } 113 | 114 | auto ud_push_key() 115 | { 116 | return std::string("[ud-addr]") + cat_to_string(node_->node_id_, thread_id_); 117 | } 118 | 119 | auto ud_pull_key(int dst_node_id, int dst_thread_id) 120 | { 121 | return std::string("[ud-addr]") + cat_to_string(dst_node_id, dst_thread_id); 122 | } 123 | 124 | public: 125 | // for clients, dst_nr_nodes should be 1. 126 | // for the server, dst_nr_nodes should be node_->nr_nodes_. 
127 | Thread(util::ObserverPtr node, util::ObserverPtr ctx, int thread_id, int dst_nr_nodes, int dst_nr_threads, LocalMemoryRegion::Uptr local_mr) 128 | : node_(std::move(node)), ctx_(ctx), 129 | thread_id_(thread_id), nr_dst_nodes_(dst_nr_nodes), nr_dst_threads_(dst_nr_threads), 130 | local_mr_(std::move(local_mr)) 131 | { 132 | data_qp_.resize(dst_nr_nodes); 133 | data_conn_.resize(dst_nr_nodes); 134 | ud_ah_.resize(dst_nr_nodes); 135 | for (int i = 0; i < dst_nr_nodes; i++) { 136 | for (int j = 0; j < dst_nr_threads; j++) { 137 | data_qp_[i].emplace_back(); 138 | data_conn_[i].emplace_back(); 139 | ud_ah_[i].emplace_back(); 140 | } 141 | } 142 | } 143 | 144 | void prepare_and_push_rc() 145 | { 146 | for (int i = 0; i < nr_dst_nodes_; i++) { 147 | if (i == node_->node_id_) continue; // myself 148 | for (int j = 0; j < nr_dst_threads_; j++) { 149 | data_qp_[i][j] = std::make_unique(ctx_); 150 | RcInfo thread_info{*local_mr_, data_qp_[i][j]->identifier()}; 151 | node_->sms_.put(rc_push_key(i, j), util::to_vec_char(thread_info)); 152 | } 153 | } 154 | } 155 | 156 | void prepare_and_push_ud() 157 | { 158 | ud_qp_ = std::make_unique(util::make_observer(ctx_.get())); 159 | node_->sms_.put(ud_push_key(), util::to_vec_char(ud_qp_->identifier())); 160 | } 161 | 162 | void pull_and_connect_rc() 163 | { 164 | for (int i = 0; i < nr_dst_nodes_; i++) { 165 | if (i == node_->node_id_) continue; // myself 166 | for (int j = 0; j < nr_dst_threads_; j++) { 167 | auto rc_info = util::from_vec_char(node_->sms_.get(rc_pull_key(i, j))); 168 | auto remote_mr = std::make_unique(rc_info.mr); 169 | data_qp_[i][j]->connect_to(rc_info.data_addr); 170 | auto p = std::make_unique(util::make_observer(data_qp_[i][j].get()), 171 | util::make_observer(local_mr_.get()), std::move(remote_mr)); 172 | data_conn_[i][j] = std::move(p); 173 | } 174 | } 175 | } 176 | 177 | void pull_and_setup_ud() 178 | { 179 | for (int i = 0; i < nr_dst_nodes_; i++) { 180 | if (i == node_->node_id_) continue; 181 | for (int j = 0; j < nr_dst_threads_; j++) { 182 | auto remote_addr = util::from_vec_char(node_->sms_.get(ud_pull_key(i, j))); 183 | ud_ah_[i][j] = std::make_unique(ctx_, remote_addr); 184 | ud_qp_->setup(); 185 | } 186 | } 187 | } 188 | 189 | util::ObserverPtr get_data_conn(int dst_node_id, int dst_thread_id) 190 | { 191 | return util::make_observer(data_conn_[dst_node_id][dst_thread_id].get()); 192 | } 193 | 194 | auto get_ud_conn(int dst_node_id, int dst_thread_id) 195 | { 196 | return std::make_unique(util::make_observer(ud_qp_.get()), util::make_observer(ud_ah_[dst_node_id][dst_thread_id].get())); 197 | } 198 | 199 | auto ud_qp() { return ud_qp_.get(); } 200 | auto local_mr() { return local_mr_.get(); } 201 | 202 | }; 203 | 204 | } 205 | -------------------------------------------------------------------------------- /libterm/include/page-bitmap.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | // mine 3 | #include 4 | #include 5 | // linux 6 | #include 7 | #include 8 | // c++ 9 | #include 10 | #include 11 | // c 12 | #include 13 | #include 14 | #include 15 | 16 | namespace pdp { 17 | class PageBitmap { 18 | protected: 19 | util::ulong_t *bitmap_{}; // passed and freed by a derived class 20 | uint64_t bytes_{}; // bitmap mr_bytes_ (bits / 8) 21 | public: 22 | PageBitmap() = default; 23 | 24 | virtual ~PageBitmap() 25 | { 26 | // the derived class will set bitmap_ to nullptr in its dtor 27 | if (bitmap_) { 28 | delete[] bitmap_; 29 | bitmap_ = nullptr; 30 | } 31 | } 32 | 33 
| bool test_present(uint64_t pg) const 34 | { 35 | ASSERT(pg / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg, bytes_ * util::kBitsPerByte); 36 | return util::test_bit(pg, bitmap_); 37 | } 38 | 39 | void set_present(uint64_t pg) 40 | { 41 | ASSERT(pg / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg, bytes_ * util::kBitsPerByte); 42 | 43 | util::set_bit(pg, bitmap_); 44 | } 45 | 46 | void clear_present(uint64_t pg) 47 | { 48 | ASSERT(pg / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg, bytes_ * util::kBitsPerByte); 49 | 50 | util::clear_bit(pg, bitmap_); 51 | } 52 | 53 | void assign(uint64_t pg, bool set) 54 | { 55 | if (test_present(pg) == set) return; 56 | if (set) { 57 | set_present(pg); 58 | } else { 59 | clear_present(pg); 60 | } 61 | } 62 | 63 | bool range_has_fault(uint64_t pg_begin, uint64_t pg_end) const 64 | { 65 | ASSERT(pg_end / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg_end, bytes_ * util::kBitsPerByte); 66 | 67 | 68 | // if (pg_end - pg_begin > 1) { 69 | // pr_err("error"); 70 | // } 71 | // return false; 72 | 73 | for (auto i = pg_begin; i < pg_end; i++) { 74 | if (!test_present(i)) { 75 | // pr_err("true"); 76 | return true; 77 | } 78 | } 79 | return false; 80 | } 81 | 82 | uint64_t range_count_fault(uint64_t pg_begin, uint64_t pg_end) const 83 | { 84 | ASSERT(pg_end / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg_end, bytes_ * util::kBitsPerByte); 85 | 86 | uint64_t count = 0; 87 | for (auto i = pg_begin; i < pg_end; i++) { 88 | if (!test_present(i)) { 89 | count++; 90 | } 91 | } 92 | return count; 93 | } 94 | }; 95 | 96 | class LocalPageBitmap final: public PageBitmap { 97 | util::ObserverPtr local_mr_; 98 | rdma::LocalMemoryRegion::Uptr bitmap_mr_; 99 | 100 | public: 101 | using Uptr = std::unique_ptr; 102 | 103 | LocalPageBitmap(util::ObserverPtr local_mr) 104 | : local_mr_(local_mr) 105 | { 106 | static auto &gv = GlobalVariables::instance(); 107 | char file[256]; 108 | sprintf(file, "/proc/pdp_bitmap_0x%x", local_mr_->lkey_); 109 | int fd = open(file, O_RDWR); 110 | assert(fd > 0); 111 | 112 | uint32_t nr_pfns; 113 | auto ret = pread(fd, &nr_pfns, sizeof(uint32_t), 0); 114 | assert(ret == sizeof(uint32_t)); 115 | 116 | bytes_ = util::bits_to_long_aligned_bytes(nr_pfns); 117 | bitmap_ = (util::ulong_t *)mmap(nullptr, bytes_, PROT_READ | PROT_WRITE, 118 | MAP_SHARED | MAP_POPULATE, fd, 0); 119 | assert(bitmap_ != MAP_FAILED); 120 | 121 | close(fd); 122 | 123 | assert(gv.ori_ibv_symbols.reg_mr); 124 | bitmap_mr_ = std::make_unique(local_mr->ctx(), (void *)bitmap_, bytes_, false, gv.ori_ibv_symbols.reg_mr); 125 | 126 | pr_info("mr: key=0x%x; bitmap: addr=%p, length=0x%lx, key=0x%x", local_mr->lkey_, bitmap_, bytes_, bitmap_mr_->lkey_); 127 | } 128 | 129 | ~LocalPageBitmap() final 130 | { 131 | munmap(bitmap_, bytes_); 132 | bitmap_ = nullptr; 133 | } 134 | 135 | auto bitmap_mr() const { return util::make_observer(bitmap_mr_.get()); } 136 | }; 137 | 138 | class RemotePageBitmap final : public PageBitmap { 139 | rdma::RemoteMemoryRegion remote_bitmap_mr_; // RDMA read from this remote mr. 140 | std::unique_ptr local_bitmap_mr_; // RDMA read to this local mr. 
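    // Pre-built RDMA READ wr that pulls remote_bitmap_mr_ into local_bitmap_mr_
    // in one shot. Illustrative usage (the real call site lives in the QP/MR code):
    //   rdma::Wr wr = remote_bitmap.pull_wr();
    //   rc_qp->post_send_wr(&wr);  rc_qp->wait_send();
    //   if (!remote_bitmap.range_has_fault(pg_begin, pg_end)) { /* one-sided READ is safe */ }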
141 | rdma::Wr pull_wr_; 142 | 143 | public: 144 | using Uptr = std::unique_ptr; 145 | 146 | DELETE_COPY_CONSTRUCTOR_AND_ASSIGNMENT(RemotePageBitmap); 147 | DELETE_MOVE_CONSTRUCTOR_AND_ASSIGNMENT(RemotePageBitmap); 148 | 149 | RemotePageBitmap(util::ObserverPtr ctx, rdma::RemoteMemoryRegion remote_bitmap_mr) 150 | : remote_bitmap_mr_(remote_bitmap_mr) 151 | { 152 | static auto &gv = GlobalVariables::instance(); 153 | 154 | bytes_ = remote_bitmap_mr_.length_; 155 | pr_info("remote_bitmap_mr: %s", remote_bitmap_mr.to_string().c_str()); 156 | 157 | ASSERT(remote_bitmap_mr_.addr64() % SZ_4K == 0, "remote_bitmap_mr.addr is not aligned to 4KB."); 158 | ASSERT(remote_bitmap_mr_.length() % sizeof(util::ulong_t) == 0, "remote_bitmap_mr.length is not aligned to 8 bytes."); 159 | 160 | bitmap_ = new util::ulong_t[bytes_ / sizeof(util::ulong_t)]; 161 | ASSERT(gv.ori_ibv_symbols.reg_mr, "gv.ori_ibv_symbols.reg_mr"); 162 | local_bitmap_mr_ = std::make_unique(ctx, bitmap_, bytes_, false, gv.ori_ibv_symbols.reg_mr); 163 | 164 | // we generate a wr and let the caller submit via a RC QP. 165 | pull_wr_ = rdma::Wr().set_op_read().set_signaled(true) 166 | .set_rdma(remote_bitmap_mr_.pick_all()) 167 | .set_sg(local_bitmap_mr_->pick_all()); 168 | } 169 | 170 | ~RemotePageBitmap() final 171 | { 172 | delete[] bitmap_; 173 | bitmap_ = nullptr; 174 | } 175 | 176 | rdma::Wr pull_wr() { return pull_wr_; } 177 | }; 178 | 179 | } 180 | -------------------------------------------------------------------------------- /libterm/include/print-stack.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | #include 4 | #include 5 | #include 6 | 7 | namespace util 8 | { 9 | // Credits: This lovely function is from **https://panthema.net/2008/0901-stacktrace-demangled/** 10 | static inline void print_stacktrace(FILE *out = stderr, unsigned int max_frames = 256) 11 | { 12 | // storage array for stack trace address data 13 | void *addrlist[max_frames + 1]; 14 | 15 | // retrieve current stack addresses 16 | int addrlen = backtrace(addrlist, sizeof(addrlist) / sizeof(void *)); 17 | 18 | if (addrlen == 0) 19 | { 20 | fprintf(out, " \n"); 21 | return; 22 | } 23 | 24 | // resolve addresses into strings containing "filename(function+address)", 25 | // this array must be free()-ed 26 | char **symbollist = backtrace_symbols(addrlist, addrlen); 27 | 28 | // allocate string which will be filled with the demangled function name 29 | size_t funcnamesize = 256; 30 | char *funcname = (char *)malloc(funcnamesize); 31 | 32 | // iterate over the returned symbol lines. skip the first, it is the 33 | // address of this function. 34 | for (int i = 1; i < addrlen; i++) 35 | { 36 | char *begin_name = 0, *begin_offset = 0, *end_offset = 0; 37 | 38 | // find parentheses and +address offset surrounding the mangled name: 39 | // ./module(function+0x15c) [0x8048a6d] 40 | for (char *p = symbollist[i]; *p; ++p) 41 | { 42 | if (*p == '(') 43 | begin_name = p; 44 | else if (*p == '+') 45 | begin_offset = p; 46 | else if (*p == ')' && begin_offset) 47 | { 48 | end_offset = p; 49 | break; 50 | } 51 | } 52 | 53 | if (begin_name && begin_offset && end_offset && begin_name < begin_offset) 54 | { 55 | *begin_name++ = '\0'; 56 | *begin_offset++ = '\0'; 57 | *end_offset = '\0'; 58 | 59 | // mangled name is now in [begin_name, begin_offset) and caller 60 | // offset in [begin_offset, end_offset). 
now apply 61 | // __cxa_demangle(): 62 | 63 | int status; 64 | char *ret = abi::__cxa_demangle(begin_name, 65 | funcname, &funcnamesize, &status); 66 | char str_buffer[4096]; 67 | if (status == 0) 68 | { 69 | funcname = ret; // use possibly realloc()-ed string 70 | sprintf(str_buffer, " [%d] %s : %s+%s\n", 71 | i, symbollist[i], funcname, begin_offset); 72 | } 73 | else 74 | { 75 | // demangling failed. Output function name as a C function with 76 | // no arguments. 77 | sprintf(str_buffer, " [%d] %s : %s()+%s\n", 78 | i, symbollist[i], begin_name, begin_offset); 79 | } 80 | std::cerr << str_buffer; 81 | } 82 | else 83 | { 84 | // couldn't parse the line? print the whole line. 85 | // fprintf(out, " %s\n", symbollist[i]); 86 | std::cerr << "counldnt parse the line: " << symbollist[i]; 87 | } 88 | } 89 | 90 | free(funcname); 91 | free(symbollist); 92 | } 93 | } 94 | -------------------------------------------------------------------------------- /libterm/include/queue.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #include 3 | #include 4 | #include 5 | 6 | namespace util { 7 | // template 8 | // class Queue { 9 | // int sz_ = 0; 10 | // 11 | // std::vector last_vec_; 12 | // 13 | // std::mutex mutex_; 14 | // std::condition_variable cv_; 15 | // 16 | // int head_ = 0; 17 | // int tail_ = 0; 18 | // 19 | // int max_sz() const { return sz_ + 1; } 20 | // 21 | // int size() const { return (tail_ + max_sz() - head_) % max_sz(); } 22 | // 23 | // bool empty() const { return size() == 0; } 24 | // 25 | // bool full() const { return size() == max_sz() - 1; } 26 | // 27 | // void inc_head() { head_ = (head_ + 1) % max_sz(); } 28 | // 29 | // void inc_tail() { tail_ = (tail_ + 1) % max_sz(); } 30 | // 31 | // public: 32 | // Queue() : Queue(128) {} 33 | // explicit Queue(int sz) : sz_(sz), last_vec_(sz + 1) {} 34 | // 35 | // void push(auto &&value) { 36 | // std::unique_lock lk(mutex_); 37 | // cv_.wait(lk, [this] { return !full(); }); 38 | // 39 | // last_vec_[tail_] = std::forward(value); 40 | // inc_tail(); 41 | // 42 | // lk.unlock(); 43 | // cv_.notify_all(); 44 | // } 45 | // 46 | // T pop() { 47 | // std::unique_lock lk(mutex_); 48 | // cv_.wait(lk, [this] { return !empty(); }); 49 | // 50 | // T value = std::move(last_vec_[head_]); 51 | // inc_head(); 52 | // 53 | // lk.unlock(); 54 | // cv_.notify_all(); 55 | // return value; 56 | // } 57 | // }; 58 | template 59 | class Queue { 60 | int max_size_{}; 61 | std::queue queue_; 62 | std::mutex mutex_; 63 | std::condition_variable cv_; 64 | public: 65 | int size() const { return queue_.size(); } 66 | bool empty() const { return queue_.size() == 0; } 67 | bool full() const { return size() == max_size_; } 68 | 69 | public: 70 | Queue() : Queue(128) {} 71 | explicit Queue(int max_size) : max_size_{max_size} {} 72 | Queue(Queue &&queue) 73 | : max_size_(queue.max_size_), 74 | queue_(std::move(queue.queue_)), 75 | mutex_(std::move(queue.mutex_)), 76 | cv_(std::move(queue.cv_)) 77 | { 78 | } 79 | 80 | void push(auto &&value) { 81 | std::unique_lock lk(mutex_); 82 | cv_.wait(lk, [this] { return !full(); }); 83 | queue_.push(std::forward(value)); 84 | lk.unlock(); 85 | 86 | cv_.notify_all(); 87 | } 88 | 89 | T pop() { 90 | std::unique_lock lk(mutex_); 91 | cv_.wait(lk, [this] { return !empty(); }); 92 | 93 | T value = std::move(queue_.front()); 94 | queue_.pop(); 95 | 96 | lk.unlock(); 97 | cv_.notify_all(); 98 | return value; 99 | } 100 | }; 101 | } 
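// Usage sketch (illustrative): Queue is a bounded blocking queue; push() blocks
// while max_size_ elements are queued and pop() blocks while the queue is empty,
// so a simple producer/consumer hand-off looks like:
//
//   util::Queue<int> q{8};                        // capacity of 8
//   std::jthread producer([&] {
//       for (int i = 0; i < 64; i++) q.push(i);   // blocks when full
//   });
//   std::jthread consumer([&] {
//       for (int i = 0; i < 64; i++) assert(q.pop() == i);   // FIFO order
//   });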
-------------------------------------------------------------------------------- /libterm/include/sms.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | #include "util.hh" 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | 10 | namespace rdma { 11 | using std::vector; 12 | using std::string; 13 | 14 | class SharedMetaService { 15 | std::string hostname_; 16 | in_port_t port_; 17 | memcache::Memcache memc_; 18 | std::mutex mutex_; 19 | 20 | public: 21 | SharedMetaService(const std::string &hostname, in_port_t port) 22 | : hostname_{hostname}, port_{port}, memc_(hostname, port) 23 | { 24 | if (!memc_.setBehavior(MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1)) { 25 | throw std::runtime_error("init memcached"); 26 | } 27 | } 28 | 29 | std::string to_string() const 30 | { 31 | return hostname_ + ":" + std::to_string(port_); 32 | } 33 | 34 | void put(const std::string &key, const std::vector &value) 35 | { 36 | std::unique_lock lk(mutex_); 37 | // while (true) { 38 | // memcached_return rc; 39 | // rc = memcached_set((memcached_st *)memc_.getImpl(), key.data(), key.size(), 40 | // value.data(), value.size(), (time_t)0, 0); 41 | // if (rc == MEMCACHED_SUCCESS) break; 42 | // usleep(400); 43 | // } 44 | while (!memc_.set(key, value, 0, 0)) { 45 | usleep(400); 46 | } 47 | // pr_info("key={%s}, value={%s}", key.c_str(), util::vec_char_to_string(value).c_str()); 48 | } 49 | 50 | std::vector get(const std::string &key) 51 | { 52 | std::vector ret_val; 53 | size_t value_size; 54 | uint32_t flags; 55 | memcached_return rc; 56 | 57 | std::unique_lock lk(mutex_); 58 | 59 | // while (true) { 60 | // char *value = memcached_get((memcached_st *)memc_.getImpl(), key.c_str(), key.length(), 61 | // &value_size, &flags, &rc); 62 | // if (value && rc == MEMCACHED_SUCCESS) { 63 | // ret_val.resize(value_size); 64 | // memcpy(ret_val.data(), value, value_size); 65 | // break; 66 | // } 67 | // usleep(400); 68 | // } 69 | 70 | do { 71 | bool success = memc_.get(key, ret_val); 72 | if (!success) { 73 | usleep(10000); 74 | } else { 75 | break; 76 | } 77 | } while (true); 78 | // pr_info("key={%s}, value={%s}", key.c_str(), util::vec_char_to_string(ret_val).c_str()); 79 | 80 | return ret_val; 81 | } 82 | 83 | int get_int(const std::string &key) 84 | { 85 | std::vector ret_val; 86 | ret_val = get(key); 87 | string str(ret_val.begin(), ret_val.end()); 88 | return std::stoi(str); 89 | } 90 | 91 | uint64_t inc(const std::string &key, bool auto_retry = true) 92 | { 93 | uint64_t res; 94 | 95 | do { 96 | bool success = memc_.increment(key, 1, &res); 97 | if (!success && auto_retry) { 98 | usleep(10000); 99 | } else { 100 | break; 101 | } 102 | } while (true); 103 | 104 | return res; 105 | } 106 | 107 | uint64_t dec(const std::string &key) 108 | { 109 | uint64_t res; 110 | while (!memc_.decrement(key, 1, &res)) { 111 | usleep(10000); 112 | } 113 | return res; 114 | } 115 | }; 116 | } // namespace pdp -------------------------------------------------------------------------------- /libterm/include/ud-conn.hh: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | #include "rdma.hh" 4 | 5 | namespace rdma { 6 | // attention! UD does not support connection. 7 | // we wrap an interface for better programming. 8 | // there is actually only one ud_qp per thread! 
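// i.e. a UdConnection merely pairs that shared UD QP with one destination's
// address handle; send() forwards to ud_qp_->send(ah, ...), giving callers a
// connection-like API while all datagrams still go through the single QP.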
9 | class UdConnection { 10 | util::ObserverPtr ud_qp_; 11 | util::ObserverPtr remote_ud_ah_; 12 | public: 13 | UdConnection(util::ObserverPtr ud_qp, util::ObserverPtr remote_ud_ah) 14 | : ud_qp_(ud_qp), remote_ud_ah_(remote_ud_ah) {} 15 | 16 | void send(const MemorySge &local_mem, bool signaled) 17 | { 18 | ud_qp_->send(remote_ud_ah_.get(), local_mem, signaled); 19 | } 20 | 21 | auto ud_qp() { return ud_qp_; } 22 | auto remote_ud_ah() { return remote_ud_ah_; } 23 | }; 24 | }; 25 | -------------------------------------------------------------------------------- /libterm/include/zipf.hh: -------------------------------------------------------------------------------- 1 | // Copyright 2014 Carnegie Mellon University 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | 15 | #pragma once 16 | 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | 24 | namespace util { 25 | struct zipf_gen_state { 26 | uint64_t n; // number of items (input) 27 | double theta; // skewness (input) in (0, 1); or, 0 = uniform, 1 = always 28 | // zero 29 | double alpha; // only depends on theta 30 | double thres; // only depends on theta 31 | uint64_t last_n; // last n used to calculate the following 32 | double dbl_n; 33 | double zetan; 34 | double eta; 35 | // unsigned short rand_state[3]; // prng state 36 | uint64_t rand_state; 37 | }; 38 | 39 | static double mehcached_rand_d(uint64_t *state) { 40 | // caution: this is maybe too non-random 41 | *state = (*state * 0x5deece66dUL + 0xbUL) & ((1UL << 48) - 1); 42 | return (double) *state / (double) ((1UL << 48) - 1); 43 | } 44 | 45 | static double mehcached_pow_approx(double a, double b) { 46 | // from 47 | // http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/ 48 | 49 | // calculate approximation with fraction of the exponent 50 | int e = (int) b; 51 | union { 52 | double d; 53 | int x[2]; 54 | } u = {a}; 55 | u.x[1] = 56 | (int) ((b - (double) e) * (double) (u.x[1] - 1072632447) + 1072632447.); 57 | u.x[0] = 0; 58 | 59 | // exponentiation by squaring with the exponent's integer part 60 | // double r = u.d makes everything much slower, not sure why 61 | // TODO: use popcount? 62 | double r = 1.; 63 | while (e) { 64 | if (e & 1) 65 | r *= a; 66 | a *= a; 67 | e >>= 1; 68 | } 69 | 70 | return r * u.d; 71 | } 72 | 73 | static void mehcached_zipf_init(struct zipf_gen_state *state, uint64_t n, 74 | double theta, uint64_t rand_seed) { 75 | assert(n > 0); 76 | if (theta > 0.992 && theta < 1) 77 | fprintf(stderr, 78 | "theta > 0.992 will be inaccurate due to approximation\n"); 79 | if (theta >= 1. && theta < 40.) { 80 | fprintf(stderr, "theta in [1., 40.) is not supported\n"); 81 | assert(false); 82 | } 83 | assert(theta == -1. || (theta >= 0. && theta < 1.) || theta >= 40.); 84 | assert(rand_seed < (1UL << 48)); 85 | memset(state, 0, sizeof(struct zipf_gen_state)); 86 | state->n = n; 87 | state->theta = theta; 88 | if (theta == -1.) 
89 | rand_seed = rand_seed % n; 90 | else if (theta > 0. && theta < 1.) { 91 | state->alpha = 1. / (1. - theta); 92 | state->thres = 1. + mehcached_pow_approx(0.5, theta); 93 | } else { 94 | state->alpha = 0.; // unused 95 | state->thres = 0.; // unused 96 | } 97 | state->last_n = 0; 98 | state->zetan = 0.; 99 | // state->rand_state[0] = (unsigned short)(rand_seed >> 0); 100 | // state->rand_state[1] = (unsigned short)(rand_seed >> 16); 101 | // state->rand_state[2] = (unsigned short)(rand_seed >> 32); 102 | state->rand_state = rand_seed; 103 | } 104 | 105 | static void mehcached_zipf_init_copy(struct zipf_gen_state *state, 106 | const struct zipf_gen_state *src_state, 107 | uint64_t rand_seed) { 108 | 109 | (void) mehcached_zipf_init_copy; 110 | assert(rand_seed < (1UL << 48)); 111 | memcpy(state, src_state, sizeof(struct zipf_gen_state)); 112 | // state->rand_state[0] = (unsigned short)(rand_seed >> 0); 113 | // state->rand_state[1] = (unsigned short)(rand_seed >> 16); 114 | // state->rand_state[2] = (unsigned short)(rand_seed >> 32); 115 | state->rand_state = rand_seed; 116 | } 117 | 118 | static void mehcached_zipf_change_n(struct zipf_gen_state *state, uint64_t n) { 119 | (void) mehcached_zipf_change_n; 120 | state->n = n; 121 | } 122 | 123 | static double mehcached_zeta(uint64_t last_n, double last_sum, uint64_t n, 124 | double theta) { 125 | if (last_n > n) { 126 | last_n = 0; 127 | last_sum = 0.; 128 | } 129 | while (last_n < n) { 130 | last_sum += 1. / mehcached_pow_approx((double) last_n + 1., theta); 131 | last_n++; 132 | } 133 | return last_sum; 134 | } 135 | 136 | static uint64_t mehcached_zipf_next(struct zipf_gen_state *state) { 137 | if (state->last_n != state->n) { 138 | if (state->theta > 0. && state->theta < 1.) { 139 | state->zetan = mehcached_zeta(state->last_n, state->zetan, state->n, 140 | state->theta); 141 | state->eta = 142 | (1. - mehcached_pow_approx(2. / (double) state->n, 143 | 1. - state->theta)) / 144 | (1. - mehcached_zeta(0, 0., 2, state->theta) / state->zetan); 145 | } 146 | state->last_n = state->n; 147 | state->dbl_n = (double) state->n; 148 | } 149 | 150 | if (state->theta == -1.) { 151 | uint64_t v = state->rand_state; 152 | if (++state->rand_state >= state->n) 153 | state->rand_state = 0; 154 | return v; 155 | } else if (state->theta == 0.) { 156 | double u = mehcached_rand_d(&state->rand_state); 157 | return (uint64_t) (state->dbl_n * u); 158 | } else if (state->theta >= 40.) { 159 | return 0UL; 160 | } else { 161 | // from J. Gray et al. Quickly generating billion-record synthetic 162 | // databases. In SIGMOD, 1994. 163 | 164 | // double u = erand48(state->rand_state); 165 | double u = mehcached_rand_d(&state->rand_state); 166 | double uz = u * state->zetan; 167 | if (uz < 1.) 168 | return 0UL; 169 | else if (uz < state->thres) 170 | return 1UL; 171 | else 172 | return (uint64_t) ( 173 | state->dbl_n * 174 | mehcached_pow_approx(state->eta * (u - 1.) + 1., state->alpha)); 175 | } 176 | } 177 | 178 | void mehcached_test_zipf(double theta) { 179 | 180 | (void) (mehcached_test_zipf); 181 | 182 | double zetan = 0.; 183 | const uint64_t n = 10000000000UL; 184 | uint64_t i; 185 | 186 | for (i = 0; i < n; i++) 187 | zetan += 1. / pow((double) i + 1., theta); 188 | 189 | struct zipf_gen_state state; 190 | if (theta < 1. || theta >= 40.) 191 | mehcached_zipf_init(&state, n, theta, 0); 192 | 193 | uint64_t num_key0 = 0; 194 | const uint64_t num_samples = 10000000UL; 195 | if (theta < 1. || theta >= 40.) 
{ 196 | for (i = 0; i < num_samples; i++) 197 | if (mehcached_zipf_next(&state) == 0) 198 | num_key0++; 199 | } 200 | 201 | printf("theta = %lf; using pow(): %.10lf", theta, 1. / zetan); 202 | if (theta < 1. || theta >= 40.) 203 | printf(", using approx-pow(): %.10lf", 204 | (double) num_key0 / (double) num_samples); 205 | printf("\n"); 206 | } 207 | 208 | class ZipfGen { 209 | private: 210 | struct zipf_gen_state state_; 211 | public: 212 | ZipfGen(uint64_t n, double theta, uint64_t rand_seed) 213 | { 214 | mehcached_zipf_init(&state_, n, theta, rand_seed); 215 | } 216 | 217 | uint64_t next() 218 | { 219 | return mehcached_zipf_next(&state_); 220 | } 221 | }; 222 | } // namespace util -------------------------------------------------------------------------------- /libterm/kmod/Makefile: -------------------------------------------------------------------------------- 1 | ccflags-y := -DM=$(M) 2 | 3 | obj-m += pch.o 4 | # lioh-y := proc.o 5 | 6 | all: 7 | make -C /lib/modules/$(shell uname -r)/build M=$(shell pwd) modules 8 | 9 | clean: 10 | make -C /lib/modules/$(shell uname -r)/build M=$(shell pwd) clean 11 | -------------------------------------------------------------------------------- /libterm/kmod/pch.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | 8 | #define PROC_NAME "pch" 9 | 10 | // taken from Linux source. 11 | static struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index) 12 | { 13 | struct page *page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD, 0); 14 | if (!page) 15 | return page; 16 | if (!xa_is_value(page)) { 17 | return find_subpage(page, index); 18 | } 19 | return NULL; 20 | } 21 | 22 | static unsigned char mincore_page(struct address_space *mapping, pgoff_t index) 23 | { 24 | struct page *page = find_get_incore_page(mapping, index); 25 | unsigned char val = 0; 26 | 27 | if (page) { 28 | if (PageUptodate(page)) { 29 | val |= (1 << 0); 30 | } 31 | if (PageDirty(page)) { 32 | val |= (1 << 1); 33 | } 34 | put_page(page); 35 | } 36 | 37 | return val; 38 | } 39 | 40 | 41 | static ssize_t 42 | pch_read(struct file *file, char __user *buf, size_t count, loff_t *pos) 43 | { 44 | struct file *mmap_fp = file->private_data; 45 | 46 | uint8_t *kern_buf = kvzalloc(PAGE_SIZE, GFP_KERNEL); 47 | if (!kern_buf) { 48 | return -EINVAL; 49 | } 50 | 51 | for (size_t iter_offset = 0; iter_offset < count; iter_offset += PAGE_SIZE) { 52 | size_t iter_count = min(count - iter_offset, PAGE_SIZE); 53 | pgoff_t start_idx = *pos + iter_offset; 54 | pgoff_t last_idx = start_idx + iter_count - 1; 55 | 56 | size_t off = 0; 57 | 58 | #if 0 59 | XA_STATE(xas, &mmap_fp->f_mapping->i_pages, start_idx); 60 | void *entry; 61 | rcu_read_lock(); 62 | // entry: is it page or folio? whatever. 
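// this disabled XArray walk only records whether an entry is present in the
// page cache, and `off` would drift past holes because only present entries
// are visited; the enabled fallback below (mincore_page) checks every index
// and also distinguishes uptodate (bit 0) from dirty (bit 1) pages.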
63 | xas_for_each(&xas, entry, last_idx) { 64 | uint8_t v = 1; 65 | if (xas_retry(&xas, entry) || xa_is_value(entry)) { 66 | v = 0; 67 | } 68 | kern_buf[off] = v; 69 | off++; 70 | } 71 | 72 | rcu_read_unlock(); 73 | #else 74 | for (pgoff_t idx = start_idx; idx <= last_idx; idx++) { 75 | kern_buf[off] = mincore_page(mmap_fp->f_mapping, idx); 76 | off++; 77 | } 78 | #endif 79 | 80 | if (copy_to_user(buf + iter_offset, kern_buf, iter_count)) { 81 | pr_err("failed to copy_to_user\n"); 82 | kvfree(kern_buf); 83 | return -EINVAL; 84 | } 85 | } 86 | kvfree(kern_buf); 87 | *pos += count; 88 | return count; 89 | } 90 | 91 | static ssize_t 92 | pch_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) 93 | { 94 | char path[32]; 95 | struct file *mmap_fp; 96 | 97 | if (file->private_data) { 98 | pr_err("private has been set.\n"); 99 | return -EINVAL; 100 | } 101 | 102 | if (count > 16) { 103 | pr_err("path too long\n"); 104 | return -EINVAL; 105 | } 106 | 107 | if (copy_from_user(path, buf, count)) { 108 | pr_err("fail to copy from user.\n"); 109 | return -EINVAL; 110 | } 111 | path[count] = '\0'; 112 | 113 | mmap_fp = filp_open(path, O_RDWR | O_LARGEFILE, 0); 114 | if (IS_ERR(mmap_fp)) { 115 | pr_err("fail to open %s\n", path); 116 | return PTR_ERR(mmap_fp); 117 | } 118 | 119 | // pr_info("pch: %s\n", path); 120 | 121 | file->private_data = mmap_fp; 122 | 123 | return count; 124 | } 125 | 126 | static int pch_release(struct inode *inode, struct file *file) 127 | { 128 | if (file->private_data) { 129 | filp_close(file->private_data, 0); 130 | } 131 | return 0; 132 | } 133 | 134 | static const struct proc_ops pch_ops = { 135 | .proc_read = pch_read, 136 | .proc_write = pch_write, 137 | .proc_release = pch_release, 138 | }; 139 | 140 | static int __init pch_init(void) 141 | { 142 | struct proc_dir_entry *entry; 143 | 144 | entry = proc_create(PROC_NAME, S_IFREG | S_IRUGO, NULL, &pch_ops); 145 | if (!entry) { 146 | pr_err("failed to create /proc/" PROC_NAME "\n"); 147 | return -ENOMEM; 148 | } 149 | 150 | return 0; 151 | } 152 | 153 | static void __exit pch_exit(void) 154 | { 155 | remove_proc_entry(PROC_NAME, NULL); 156 | } 157 | 158 | module_init(pch_init); 159 | module_exit(pch_exit); 160 | 161 | MODULE_LICENSE("GPL"); 162 | -------------------------------------------------------------------------------- /libterm/kmod/test-pch.cc: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | 8 | #define SZ_1G (1l << 30) 9 | // #define SZ_2G (2l << 30) 10 | #define PAGE_SIZE (4096) 11 | 12 | 13 | static inline uint64_t wall_time_ns() 14 | { 15 | // thread_local auto first_time = std::chrono::steady_clock::now(); 16 | // auto current = std::chrono::steady_clock::now(); 17 | // return std::chrono::duration_cast(current - first_time).count(); 18 | struct timespec ts{}; 19 | clock_gettime(CLOCK_REALTIME, &ts); 20 | return ts.tv_nsec + ts.tv_sec * 1'000'000'000; 21 | } 22 | 23 | int main() 24 | { 25 | int fd = open("/proc/pch-nvme1n1p2", O_RDONLY); 26 | assert(fd >= 0); 27 | uint8_t v; 28 | 29 | auto ts1 = wall_time_ns(); 30 | 31 | // char *buf = new char[SZ_1G / PAGE_SIZE]; 32 | // pread(fd, buf, SZ_1G, 0); 33 | // for (int i = 0; i < (SZ_1G / PAGE_SIZE); i++) { 34 | // printf("%u", buf[i]); 35 | // } 36 | for (size_t i = 0; i < SZ_1G; i += PAGE_SIZE) { 37 | auto ret = pread(fd, &v, 1, i); 38 | assert(ret == 1); 39 | printf("%u", v); 40 | } 41 | 42 | printf("\n"); 43 | auto ts2 = 
wall_time_ns(); 44 | printf("\n"); 45 | auto elapsed = ts2 - ts1; 46 | printf("total %luns, average %luns\n", elapsed, elapsed / (SZ_1G / PAGE_SIZE)); 47 | 48 | return 0; 49 | } 50 | -------------------------------------------------------------------------------- /libterm/test/perf.cc: -------------------------------------------------------------------------------- 1 | #include "rdma.hh" 2 | #include "util.hh" 3 | #include "node.hh" 4 | #include "zipf.hh" 5 | 6 | #include 7 | #include 8 | 9 | #include 10 | 11 | DEFINE_int32(nr_nodes, 2, "number of total nodes, including the server and clients."); 12 | DEFINE_bool(verify, false, ""); 13 | DEFINE_int32(nr_client_threads, 1, "# client threads"); 14 | DEFINE_string(sz_unit, "64", "access unit in bytes (OK with k,m,g suffix)"); 15 | DEFINE_string(sz_server_mr, "32g", "server mr size in bytes (OK with k,m,g suffix)"); 16 | DEFINE_int32(running_seconds, 120, "running seconds"); 17 | DEFINE_int32(node_id, -1, "node id (0: server, ...: client)"); 18 | DEFINE_int32(skewness_100, 99, "skewness * 100, -1: seq"); 19 | DEFINE_uint32(write_percent, 100, "write percent"); 20 | DEFINE_uint32(hotspot_switch_second, 0, "the second to switch hotspot"); 21 | 22 | constexpr int kNrServerThreads = 1; 23 | constexpr size_t kClientMrSize = SZ_4M; 24 | constexpr bool kClientMmap = false; 25 | constexpr bool kClientOdp = false; 26 | constexpr bool kServerMmap = true; 27 | constexpr bool kServerOdp = true; 28 | 29 | std::atomic_bool g_should_stop{}; 30 | std::string g_server_mmap_dev = util::getenv_string("PDP_server_mmap_dev").value_or(""); 31 | std::unique_ptr g_lat_dis; 32 | rdma::Node::Uptr g_node; 33 | 34 | void signal_int_handler(int sig) 35 | { 36 | g_should_stop = true; 37 | util::print_stacktrace(); 38 | 39 | if (g_node) { 40 | if (!g_node->is_server()) { 41 | } 42 | g_node->unregister_node(); 43 | g_node = nullptr; 44 | } 45 | 46 | // reset to the default for clean-up 47 | signal(sig, SIG_DFL); 48 | raise(sig); 49 | } 50 | 51 | static bool check_data(void *addr, size_t length, uint64_t expected_base) 52 | { 53 | ASSERT((uint64_t)addr % 8 == 0, "addr is not aligned to 8."); 54 | 55 | for (size_t i = 0; i < length; i += sizeof(uint64_t)) { 56 | uint64_t value = *(uint64_t *)((char *)addr + i); 57 | uint64_t expected = expected_base + i; 58 | if (value != expected) { 59 | return false; 60 | } 61 | } 62 | return true; 63 | } 64 | 65 | static void set_data(void *addr, size_t length, uint64_t base, bool print_progress) 66 | { 67 | ASSERT((uint64_t)addr % 8 == 0, "addr is not aligned to 8."); 68 | // #pragma omp parallel for 69 | for (size_t i = 0; i < length; i += sizeof(uint64_t)) { 70 | if (print_progress && i % SZ_1G == 0) { 71 | pr_info("set %lu GB.", i / SZ_1G); 72 | } 73 | *(uint64_t *)((char *)addr + i) = base + i; 74 | } 75 | } 76 | 77 | void initialize() 78 | { 79 | g_node = std::make_unique(FLAGS_nr_nodes, FLAGS_node_id); 80 | g_node->register_node(); 81 | 82 | if (g_node->is_server()) { 83 | std::cout << "Server" << std::endl; 84 | std::cout << "prepare" << std::endl; 85 | auto ctx = std::make_unique(0); 86 | 87 | ASSERT(!g_server_mmap_dev.empty(), "PDP_server_mmap_dev must be set!"); 88 | for (int i = 0; i < kNrServerThreads; i++) { 89 | rdma::Allocation alloc{(size_t)util::stoll_suffix(FLAGS_sz_server_mr), g_server_mmap_dev, !FLAGS_verify}; 90 | if (FLAGS_verify) { 91 | madvise(alloc.addr(), alloc.size(), MADV_SEQUENTIAL); 92 | set_data((char *)alloc.addr(), alloc.size(), 0, true); 93 | madvise(alloc.addr(), alloc.size(), MADV_RANDOM); 94 | } 95 | 
auto local_mr = std::make_unique(util::make_observer(ctx.get()), std::move(alloc), kServerOdp); 96 | g_node->threads_.emplace_back(util::make_observer(g_node.get()), util::make_observer(ctx.get()), i, FLAGS_nr_nodes, FLAGS_nr_client_threads, std::move(local_mr)); 97 | g_node->threads_.back().prepare_and_push_rc(); 98 | } 99 | 100 | g_node->ctxs_.emplace_back(std::move(ctx)); 101 | 102 | g_node->wait_for_all_nodes_prepared(); 103 | for (int i = 0; i < kNrServerThreads; i++) { 104 | g_node->threads_[i].pull_and_connect_rc(); 105 | } 106 | } else { 107 | std::cout << "Client" << std::endl; 108 | std::cout << "prepare" << std::endl; 109 | 110 | for (int i = 0; i < FLAGS_nr_client_threads; i++) { 111 | auto ctx = std::make_unique(0); 112 | auto local_mr = std::make_unique(util::make_observer(ctx.get()), rdma::Allocation{kClientMrSize}, kClientOdp); 113 | g_node->threads_.emplace_back(util::make_observer(g_node.get()), util::make_observer(ctx.get()), i, 1, kNrServerThreads, std::move(local_mr)); 114 | g_node->threads_.back().prepare_and_push_rc(); 115 | g_node->ctxs_.emplace_back(std::move(ctx)); 116 | } 117 | g_node->wait_for_all_nodes_prepared(); 118 | for (int i = 0; i < FLAGS_nr_client_threads; i++) { 119 | g_node->threads_[i].pull_and_connect_rc(); 120 | } 121 | } 122 | 123 | g_node->wait_for_all_nodes_connected(); 124 | } 125 | 126 | void cleanup() 127 | { 128 | // clean-up for the next execution 129 | g_node->unregister_node(); 130 | } 131 | 132 | void client_test(const std::stop_token &stop_token, int thread_id) 133 | { 134 | auto &thread = g_node->threads_[thread_id]; 135 | auto conn = thread.get_data_conn(0, 0); 136 | auto *local_mr = conn->local_mr(); 137 | auto *remote_mr = conn->remote_mr(); 138 | auto sz_unit = util::stoll_suffix(FLAGS_sz_unit); 139 | static util::LatencyRecorder lr_send("send"); 140 | static util::LatencyRecorder lr_wait("wait"); 141 | 142 | uint64_t nr_units = remote_mr->length() / sz_unit; 143 | 144 | util::Schedule::bind_client_fg_cpu(thread_id); 145 | 146 | ASSERT(FLAGS_skewness_100 < 100, "FLAGS_skewness_100 must < 100!"); 147 | bool seq = false; 148 | if (FLAGS_skewness_100 < 0) { 149 | seq = true; 150 | FLAGS_skewness_100 = 0; 151 | } 152 | double theta = double(FLAGS_skewness_100) / 100; 153 | util::ZipfGen zipf_gen(nr_units, theta, FLAGS_node_id * FLAGS_nr_client_threads + thread_id); 154 | 155 | unsigned int seed = thread_id; 156 | 157 | auto switch_tp = std::chrono::steady_clock::now() + std::chrono::seconds(FLAGS_hotspot_switch_second); 158 | 159 | for (uint64_t i = 0; /*i < 16 * sz_unit*/; i += sz_unit) { 160 | lr_send.begin_one(); 161 | uint64_t ts = util::wall_time_ns(); 162 | 163 | if (stop_token.stop_requested()) break; 164 | 165 | uint64_t v = zipf_gen.next(); 166 | if (FLAGS_hotspot_switch_second) { 167 | if (std::chrono::steady_clock::now() > switch_tp) { 168 | v = nr_units - 1 - v; 169 | } 170 | } 171 | v = util::hash_u64(v) % nr_units; 172 | if (seq) { 173 | v = (i / sz_unit) % nr_units; 174 | } 175 | 176 | uint64_t remote_offset = v * sz_unit; 177 | 178 | uint64_t local_addr = local_mr->addr64() + i % local_mr->length(); 179 | uint64_t remote_addr = remote_mr->addr64() + remote_offset; 180 | 181 | auto wr = rdma::Wr() 182 | .set_sg(local_addr, sz_unit, local_mr->lkey()) 183 | .set_rdma(remote_addr, remote_mr->rkey()) 184 | .set_signaled(true); 185 | 186 | uint64_t expected_base = util::hash_u64(remote_offset); 187 | 188 | bool is_read = (rand_r(&seed) % 100) >= FLAGS_write_percent; 189 | if (is_read) { 190 | wr.set_op_read(); 191 | } else { 
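// write path: when FLAGS_verify is set, pre-fill the local buffer with the
// pattern derived from the remote offset so the read-back below can check it.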
192 | wr.set_op_write(); 193 | if (FLAGS_verify) { 194 | set_data((void *)local_addr, sz_unit, expected_base, false); 195 | } 196 | } 197 | 198 | conn->post_send_wr(&wr); 199 | lr_wait.begin_one(); 200 | 201 | lr_send.end_one(); 202 | struct ibv_wc wc{}; 203 | conn->wait_send(&wc); 204 | assert(wc.status == IBV_WC_SUCCESS); 205 | 206 | if (FLAGS_verify) { 207 | if (is_read) { 208 | bool ok = check_data((void *)local_addr, sz_unit, remote_offset); 209 | ASSERT(ok, "check error"); 210 | } else { 211 | memset((void *)local_addr, 0x34, sz_unit); 212 | auto read_wr = rdma::Wr().set_op_read() 213 | .set_sg(local_addr, sz_unit, local_mr->lkey()) 214 | .set_rdma(remote_addr, remote_mr->rkey()) 215 | .set_signaled(true); 216 | conn->post_send_wr(&read_wr); 217 | conn->wait_send(&wc); 218 | ASSERT(wc.status == IBV_WC_SUCCESS, ""); 219 | 220 | bool ok = check_data((void *)local_addr, sz_unit, expected_base); 221 | ASSERT(ok, "check error"); 222 | } 223 | } 224 | 225 | lr_wait.end_one(); 226 | 227 | uint64_t lat = util::wall_time_ns() - ts; 228 | g_lat_dis->record(thread_id, lat); 229 | } 230 | } 231 | 232 | static void check_flags() 233 | { 234 | ASSERT(FLAGS_skewness_100 < 100, "skewness_100 must < 100"); 235 | ASSERT(FLAGS_write_percent <= 100, "write_percent must <= 100"); 236 | ASSERT(FLAGS_node_id >= 0 && FLAGS_node_id < FLAGS_nr_nodes, "FLAGS_node_id must between 0 and nr_nodes"); 237 | } 238 | 239 | static void system_cmd_func(const std::stop_token &stop_token, const std::string &cmd) 240 | { 241 | 242 | while (!stop_token.stop_requested()) { 243 | using namespace std::chrono_literals; 244 | auto tp = std::chrono::steady_clock::now(); 245 | std::ignore = system(cmd.c_str()); 246 | std::this_thread::sleep_until(tp + 1s); 247 | } 248 | } 249 | 250 | int main(int argc, char *argv[]) 251 | { 252 | 253 | gflags::SetUsageMessage(argv[0]); 254 | gflags::ParseCommandLineFlags(&argc, &argv, true); 255 | 256 | signal(SIGINT, signal_int_handler); 257 | signal(SIGSEGV, signal_int_handler); 258 | 259 | check_flags(); 260 | 261 | initialize(); 262 | 263 | if (g_node->is_server()) { 264 | pr_info("running..."); 265 | const std::string cmd = fmt::format(R"(top -bc -n1 -w512 | grep -e "{}" -e "mlx5_ib_page" | grep -v grep; cat /proc/vmstat | egrep "writeback|dirty")", argv[0]); 266 | 267 | // std::jthread system_cmd_thread([&cmd](const std::stop_token &stop_token) { 268 | // system_cmd_func(stop_token, cmd); 269 | // }); 270 | sleep(FLAGS_running_seconds + 2); 271 | } else { 272 | pr_info("# client threads=%d", FLAGS_nr_client_threads); 273 | 274 | std::vector test_threads(FLAGS_nr_client_threads); 275 | g_lat_dis = std::make_unique(FLAGS_nr_client_threads); 276 | 277 | for (int i = 0; i < FLAGS_nr_client_threads; i++) { 278 | test_threads[i] = std::jthread(client_test, i); 279 | } 280 | 281 | double elapsed = 0.0; 282 | util::Schedule::bind_client_fg_cpu(FLAGS_nr_client_threads + 1); 283 | 284 | uint32_t running_second = 0; 285 | while (!g_should_stop) { 286 | using namespace std::chrono_literals; 287 | auto tp = std::chrono::steady_clock::now(); 288 | if (running_second) { 289 | pr_emph("epoch %u (%.2lfs): %s", running_second, elapsed, g_lat_dis->report_and_clear().c_str()); 290 | } 291 | running_second++; 292 | if (running_second > FLAGS_running_seconds) { 293 | break; 294 | } 295 | std::this_thread::sleep_until(tp + 1s); 296 | 297 | auto d = std::chrono::steady_clock::now() - tp; 298 | elapsed = std::chrono::duration(d).count(); 299 | } 300 | } 301 | 302 | cleanup(); 303 | return 0; 304 | } 305 | 
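// Example invocation (hypothetical device name and cluster size; flags as defined above):
//   server (node 0): PDP_server_mmap_dev=nvme0n1 ./perf --node_id=0 --nr_nodes=2 --sz_server_mr=32g
//   client (node 1): ./perf --node_id=1 --nr_nodes=2 --nr_client_threads=8 --sz_unit=64 --skewness_100=99 --write_percent=100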
--------------------------------------------------------------------------------