├── README.md
├── ae
│   ├── .gitignore
│   ├── bin
│   │   ├── libpdp.so
│   │   ├── octopus
│   │   │   ├── conf.xml
│   │   │   ├── dmfs
│   │   │   ├── libnrfsc.so
│   │   │   └── mpibw
│   │   ├── perf
│   │   └── xstore
│   │       ├── .gitignore
│   │       ├── cs.toml
│   │       ├── fserver
│   │       ├── libjemalloc.so.2
│   │       ├── master
│   │       ├── server
│   │       │   └── config.toml
│   │       ├── ycsb
│   │       └── ycsb-model.toml
│   ├── check-and-kill-running.sh
│   ├── clear.sh
│   ├── figure-10a.py
│   ├── figure-10b.py
│   ├── figure-10c.py
│   ├── figure-10d.py
│   ├── figure-11.py
│   ├── figure-12a.py
│   ├── figure-12b.py
│   ├── figure-13a.py
│   ├── figure-13b.py
│   ├── figure-14a.py
│   ├── figure-14b.py
│   ├── figure-15a.py
│   ├── figure-15b.py
│   ├── figure-16.py
│   ├── figure-17.py
│   ├── figure-8.py
│   ├── figure-9.py
│   ├── output-example.tar.xz
│   ├── run-all.sh
│   └── scripts
│       ├── bootstrap-octopus.py
│       ├── bootstrap.py
│       ├── ins-pch.sh
│       ├── limit-mem.sh
│       ├── octopus_eval.py
│       ├── reset-memc.sh
│       ├── term_eval.py
│       └── xstore_eval.py
├── app
│   ├── octopus.zip
│   └── xstore.zip
├── driver
│   ├── driver.patch
│   └── mlnx-ofed-kernel-5.8-2.0.3.0.zip
└── libterm
    ├── .gitignore
    ├── 3rd-party
    │   └── include
    │       └── better-enum.hh
    ├── CMakeLists.txt
    ├── ibverbs-pdp
    │   ├── config.hh
    │   ├── ctx.hh
    │   ├── global.cc
    │   ├── global.hh
    │   ├── mr.cc
    │   ├── mr.hh
    │   ├── pdp.cc
    │   ├── qp.cc
    │   ├── qp.hh
    │   ├── rpc.cc
    │   └── rpc.hh
    ├── include
    │   ├── bitmap.hh
    │   ├── compile.hh
    │   ├── const.hh
    │   ├── data-conn.hh
    │   ├── lock.hh
    │   ├── logging.hh
    │   ├── node.hh
    │   ├── page-bitmap.hh
    │   ├── print-stack.hh
    │   ├── queue.hh
    │   ├── rdma.hh
    │   ├── sms.hh
    │   ├── ud-conn.hh
    │   ├── util.hh
    │   └── zipf.hh
    ├── kmod
    │   ├── Makefile
    │   ├── pch.c
    │   └── test-pch.cc
    └── test
        └── perf.cc
/README.md:
--------------------------------------------------------------------------------
1 | # TeRM: Extending RDMA-Attached Memory with SSD
2 |
3 | This is the open-source repository for our paper
4 | **TeRM: Extending RDMA-Attached Memory with SSD**, published at [FAST'24](https://www.usenix.org/conference/fast24/presentation/yang-zhe) and in [ACM Transactions on Storage](https://dl.acm.org/doi/10.1145/3700772).
5 |
6 | Notably, the codename of TeRM is `PDP` (one step beyond ODP).
7 |
8 | # Directory structure
9 | ```
10 | TeRM
11 | |---- ae # artifact evaluation files
12 | |---- bin # binaries generated by source code
13 | |---- scripts # common scripts
14 | |---- figure-*.py # scripts to execute experiments
15 | |---- run-all.sh # run all experiments
16 | |---- app # source code of octopus and xstore with some bugs fixed
17 | |---- driver # TeRM's driver
18 | |---- driver.patch # patches to the official driver
19 | |---- mlnx-*.zip # the patched driver
20 | |---- libterm # TeRM's userspace shared library
21 | ```
22 |
23 | # How to build
24 | ## Environment
25 |
26 | - OS: Ubuntu 22.04.2 LTS
27 | - Kernel: Linux 5.19.0-50-generic
28 | - OFED driver: 5.8-2.0.3
29 |
30 | We recommend using the same environment as in our development.
31 | You may need to customize the source code for a different environment.
32 | The environment requirement mainly comes from the driver,
33 | and only the server side needs the patched driver.
34 |
35 | ## Dependencies
36 | ```
37 | sudo apt install libfmt-dev libaio-dev libboost-coroutine-dev libmemcached-dev libgoogle-glog-dev libgflags-dev
38 | ```
39 |
40 | ## Settings
41 |
42 | We hard-coded some settings in the source code. Please modify them according to your cluster settings.
43 |
44 | 1. memcached.
45 | TeRM uses memcached to synchronize cluster metadata.
46 | Please install memcached in your cluster and modify the IP address and port in `ae/scripts/reset-memc.sh`, `libterm/ibverbs-pdp/global.cc`, and `libterm/include/node.hh` (see the example command after this list).
47 |
48 | 2. CPU affinity.
49 | The source code is in `class Schedule` of file `libterm/include/util.hh`.
50 | Please modify the constants according to your CPU hardware.
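
For reference, `ae/scripts/reset-memc.sh` restarts memcached roughly as follows. The address and port below are example values from our cluster; replace them with your own and keep them consistent with the files above.
```
$ sudo apt install memcached   # if memcached is not installed yet
$ memcached -u root -l 10.0.2.181 -p 23333 -c 10000 -d -P /tmp/memcached.pid
```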
51 |
52 | ## Build the driver
53 | The patched driver is required on the server side.
54 | There are two ways to build the driver; for the second option, we provide an out-of-the-box driver zip file.
55 |
56 | 1. Download the source code of the driver from the official website.
57 | Apply the official backport patches first and then apply the modifications listed in `driver/driver.patch`.
58 | Then, build the driver.
59 | Please note that we apply the minimum set of patches that makes the driver work in our environment, instead of all patches. Do not `git apply` the `driver/driver.patch` directly, because the line numbers may differ; read the patch and apply the changes manually.
60 |
61 | 2. Use `driver/mlnx-ofed-kernel-5.8-2.0.3.0.zip`. Unzip it and run the contained `build.sh`.
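
For the second option, the steps look roughly like this (the name of the extracted directory is an assumption; use whatever `unzip` reports):
```
$ cd driver
$ unzip mlnx-ofed-kernel-5.8-2.0.3.0.zip
$ cd mlnx-ofed-kernel-5.8-2.0.3.0   # assumed extracted directory name
$ ./build.sh
```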
62 |
63 | ## Build libterm
64 | We provide `CMakeLists.txt` for building.
65 | It produces two outputs: the userspace shared library `libpdp.so` and a program `perf`.
66 | Please copy both files to `ae/bin` before running the AE scripts (an example `cp` command follows the build commands below).
67 | ```
68 | $ cd libterm
69 | $ mkdir -p build && cd build
70 | $ cmake .. -DCMAKE_BUILD_TYPE=Release # Release for compiler optimizations and high performance
71 | $ make -j
72 | ```
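
For example, assuming the two outputs end up directly under `libterm/build` (adjust the paths if CMake places them in subdirectories):
```
$ cp libpdp.so perf ../../ae/bin/
```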
73 |
74 | # How to use
75 | 1. Replace the stock driver `*.ko` files on the server side with the modified ones and restart the `openibd` service.
76 | 2. Restart the `memcached` instance. We provide a script `ae/scripts/reset-memc.sh` to do so.
77 | 3. `mmap` an SSD in the RDMA program with `MAP_SHARED` and `ibv_reg_mr` the mapped area as an ODP MR (i.e., with the `IBV_ACCESS_ON_DEMAND` access flag).
78 | 4. Set `LD_PRELOAD=libpdp.so` on all nodes to enable TeRM. Also set the environment variables `PDP_server_mmap_dev=nvmeXnY` for the SSD backend and `PDP_server_memory_gb=Z` for the size of the mapped area. Set `PDP_is_server=1` on the server side only (see the example after this list).
79 | 5. Run the RDMA application.
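
A minimal sketch of steps 4 and 5 with example values; `your_rdma_server` and `your_rdma_client` are placeholders for your own RDMA application, and the AE scripts under `ae/scripts` show the exact invocations we use for our benchmarks:
```
# on the server side
$ export LD_PRELOAD=/path/to/TeRM/ae/bin/libpdp.so
$ export PDP_is_server=1 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb=32
# the AE scripts also set variables such as PDP_mode and PDP_server_rpc_threads;
# see libterm/ibverbs-pdp/global.cc for the full list
$ ./your_rdma_server

# on each client
$ export LD_PRELOAD=/path/to/TeRM/ae/bin/libpdp.so
$ ./your_rdma_client
```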
80 |
81 | libterm accepts a series of environment variables for configuration. Please refer to `libterm/ibverbs-pdp/global.cc` for more details.
82 |
83 | If you have further questions about or interest in the repository, please feel free to open an issue or contact me via email (yangzhe.ac AT outlook.com). You can find my GitHub profile at [yzim](https://github.com/yzim).
84 |
85 | To cite our paper:
86 | ```
87 | @inproceedings {fast24-term,
88 | author = {Zhe Yang and Qing Wang and Xiaojian Liao and Youyou Lu and Keji Huang and Jiwu Shu},
89 | title = {{TeRM}: Extending {RDMA-Attached} Memory with {SSD}},
90 | booktitle = {22nd USENIX Conference on File and Storage Technologies (FAST 24)},
91 | year = {2024},
92 | isbn = {978-1-939133-38-0},
93 | address = {Santa Clara, CA},
94 | pages = {1--16},
95 | url = {https://www.usenix.org/conference/fast24/presentation/yang-zhe},
96 | publisher = {USENIX Association},
97 | month = feb
98 | }
99 |
100 | @article{tos24-term,
101 | author = {Yang, Zhe and Wang, Qing and Liao, Xiaojian and Lu, Youyou and Huang, Keji and Shu, Jiwu},
102 | title = {Efficiently Enlarging RDMA-Attached Memory with SSD},
103 | year = {2024},
104 | publisher = {Association for Computing Machinery},
105 | address = {New York, NY, USA},
106 | issn = {1553-3077},
107 | url = {https://doi.org/10.1145/3700772},
108 | doi = {10.1145/3700772},
109 | journal = {ACM Trans. Storage},
110 | month = oct
111 | }
112 | ```
113 |
--------------------------------------------------------------------------------
/ae/.gitignore:
--------------------------------------------------------------------------------
1 | output*/
2 | __pycache__
3 |
--------------------------------------------------------------------------------
/ae/bin/libpdp.so:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/libpdp.so
--------------------------------------------------------------------------------
/ae/bin/octopus/conf.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | 1
5 | 10.0.2.184
6 |
7 |
8 |
--------------------------------------------------------------------------------
/ae/bin/octopus/dmfs:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/octopus/dmfs
--------------------------------------------------------------------------------
/ae/bin/octopus/libnrfsc.so:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/octopus/libnrfsc.so
--------------------------------------------------------------------------------
/ae/bin/octopus/mpibw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/octopus/mpibw
--------------------------------------------------------------------------------
/ae/bin/perf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/perf
--------------------------------------------------------------------------------
/ae/bin/xstore/.gitignore:
--------------------------------------------------------------------------------
1 | cpu.txt
2 | time_spot.py
3 |
--------------------------------------------------------------------------------
/ae/bin/xstore/cs.toml:
--------------------------------------------------------------------------------
1 | [general_config]
2 | port = 8888
3 |
4 | [[client]]
5 | id = 1
6 | host = "node166"
7 |
8 | [[client]]
9 | id = 2
10 | host = "node168"
11 |
12 | [master]
13 | host = "node181"
14 |
15 |
16 | ## server config
17 |
18 | [[server]]
19 | host = "node184"
20 |
21 | [server_config]
22 | db_type = "ycsbh"
--------------------------------------------------------------------------------
/ae/bin/xstore/fserver:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/fserver
--------------------------------------------------------------------------------
/ae/bin/xstore/libjemalloc.so.2:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/libjemalloc.so.2
--------------------------------------------------------------------------------
/ae/bin/xstore/master:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/master
--------------------------------------------------------------------------------
/ae/bin/xstore/server/config.toml:
--------------------------------------------------------------------------------
1 | [memory]
2 | page = "30G"
3 | rdma_heap = "2G"
4 |
5 | [rpc]
6 | network = "ud"
7 |
--------------------------------------------------------------------------------
/ae/bin/xstore/ycsb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/bin/xstore/ycsb
--------------------------------------------------------------------------------
/ae/bin/xstore/ycsb-model.toml:
--------------------------------------------------------------------------------
1 | [[stages]]
2 | type = "lr"
3 | parameter = 1
4 |
5 | [[stages]]
6 | type = "lr"
7 | ## typically, they use much larger models
8 |
9 |
10 | ## YCSB large 200M
11 | parameter = 100000
--------------------------------------------------------------------------------
/ae/check-and-kill-running.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | running=0
4 | pattern="'run-all.sh|tmp/memcached|scripts/bootstrap.py|scripts/bootstrap-octopus.py|scripts/bootstrap-xstore.py|figure|octopus|xstore|mpibw|dmfs|fserver|master|ycsb|perf'"
5 | check_cmd="pgrep -fa $pattern"
6 |
7 | echo "--- node166 ---"
8 | ret=`ssh node166 $check_cmd`
9 | ssh node166 $check_cmd
10 | [ -z "$ret" ] || running=1
11 | echo
12 |
13 | echo "--- node168 ---"
14 | ret=`ssh node168 $check_cmd`
15 | ssh node168 $check_cmd
16 | [ -z "$ret" ] || running=1
17 | echo
18 |
19 | echo "--- node184 ---"
20 | ret=`ssh node184 $check_cmd`
21 | ssh node184 $check_cmd
22 | [ -z "$ret" ] || running=1
23 | echo
24 |
25 | echo "--- node181 ---"
26 | ret=`bash -c "$check_cmd"`
27 | bash -c "$check_cmd"
28 | [ -z "$ret" ] || running=1
29 | echo
30 |
31 | [ $running -eq 0 ] && echo "not found" && exit
32 |
33 | read -p "Confirm to kill? (Y/N): " confirm && [[ $confirm == [yY] || $confirm == [yY][eE][sS] ]] || exit
34 |
35 | kill_cmd="pkill -f $pattern"
36 | echo "--- node166 ---"
37 | ssh node166 $kill_cmd
38 | echo "--- node168 ---"
39 | ssh node168 $kill_cmd
40 | echo "--- node184 ---"
41 | ssh node184 $kill_cmd
42 | echo "--- node181 ---"
43 | bash -c "$kill_cmd"
44 |
45 | echo "killed"
46 |
--------------------------------------------------------------------------------
/ae/clear.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/bash
2 |
3 | rm -rf ./output
4 |
--------------------------------------------------------------------------------
/ae/figure-10a.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-10a",
6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "rpc_tiering_promote", "pdp"],
7 | "sz_unit": "256",
8 | "xlabel": "",
9 | "xdata": ["Read 256B"],
10 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot", "+magic"]
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-10b.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-10b",
6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "rpc_tiering_promote", "pdp"],
7 | "sz_unit": "4k",
8 | "nr_server_threads": 8,
9 | "xlabel": "",
10 | "xdata": ["Read 4KB"],
11 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot", "+magic"]
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
17 |
--------------------------------------------------------------------------------
/ae/figure-10c.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-10c",
6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "pdp"],
7 | "sz_unit": "256",
8 | "write_percent" : 100,
9 | "xlabel": "",
10 | "xdata": ["Write 256B"],
11 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot"]
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
--------------------------------------------------------------------------------
/ae/figure-10d.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-10d",
6 | "mode": ["rpc_memcpy", "rpc_buffer", "rpc_direct", "rpc_tiering", "pdp"],
7 | "sz_unit": "4k",
8 | "write_percent" : 100,
9 | "xlabel": "",
10 | "xdata": ["Write 4KB"],
11 | "legend": ["RPC", "RPC_buffer", "RPC_direct", "+tiering", "+hotspot"]
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
17 |
--------------------------------------------------------------------------------
/ae/figure-11.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-11",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "dynamic": True,
8 | "xlabel": "Time (s)",
9 | "xdata": [i for i in range(1, 121)],
10 | "legend": ["PIN", "ODP", "RPC", "TeRM"],
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-12a.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-12a",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "skewness_100": [0, 40, 80, 90, 99],
8 | "xlabel": "Zipfian theta",
9 | "xdata": ["0", "0.4", "0.8", "0.9", "0.99"],
10 | "legend": ["PIN", "ODP", "RPC", "TeRM"]
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-12b.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-12b",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "write_percent": [0, 25, 50, 75, 100],
8 | "xlabel": "Write Ratio",
9 | "xdata": ["0%", "25%", "50%", "75%", "100%"],
10 | "legend": ["PIN", "ODP", "RPC", "TeRM"]
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
17 |
--------------------------------------------------------------------------------
/ae/figure-13a.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-13a",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "nr_client_threads": [1, 2, 4, 8, 16, 32, 64],
8 | "xlabel": "Number of Client Threads",
9 | "xdata": ["1", "2", "4", "8", "16", "32", "64"],
10 | "legend": ["PIN", "ODP", "RPC", "TeRM"]
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-13b.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name" : "figure-13b",
6 | "mode" : ["rpc", "pdp"],
7 | "nr_server_threads": [1, 2, 4, 8, 16, 32],
8 | "xlabel" : "Number of Server Threads",
9 | "xdata" : ["1", "2", "4", "8", "16", "32"],
10 | "legend": ["RPC", "TeRM"]
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-14a.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name" : "figure-14a",
6 | "mode" : ["odp", "rpc", "pdp"],
7 | "server_memory_gb": [13, 26, 38, 51, 58, 0],
8 | "skewness_100": 0,
9 | "xlabel": "DRAM Ratio (%)",
10 | "xdata": ["20", "40", "60", "80", "90", "100"],
11 | "legend": ["ODP", "RPC", "TeRM"]
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
17 |
--------------------------------------------------------------------------------
/ae/figure-14b.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name" : "figure-14b",
6 | "mode" : ["odp", "rpc", "pdp"],
7 | "server_memory_gb": [13, 26, 38, 51, 58, 0],
8 | "skewness_100": 99,
9 | "xlabel": "DRAM Ratio (%)",
10 | "xdata": ["20", "40", "60", "80", "90", "100"],
11 | "legend": ["ODP", "RPC", "TeRM"]
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
17 |
--------------------------------------------------------------------------------
/ae/figure-15a.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name" : "figure-15a",
6 | "mode" : ["odp", "rpc", "pdp"], # codename
7 | "ssd" : ["nvme4n1", "nvme1n1", "nvme2n1"],
8 | "sz_unit" : 256,
9 | "xlabel": "Read 256B",
10 | "xdata": ["SSD 1", "SSD 2", "SSD 3"],
11 | "legend": ["ODP", "RPC", "TeRM"]
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
17 |
--------------------------------------------------------------------------------
/ae/figure-15b.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name" : "figure-15b",
6 | "mode" : ["odp", "rpc", "pdp"],
7 | "ssd" : ["nvme4n1", "nvme1n1", "nvme2n1"],
8 | "xlabel": "Read 4KB",
9 | "xdata": ["SSD 1", "SSD 2", "SSD 3"],
10 | "legend": ["ODP", "RPC", "TeRM"]
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-16.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.octopus_eval as octopus_eval
3 |
4 | kwargs = {
5 | "name": "figure-16",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "write": [0, 1],
8 | "unit_size": ["4k", "16k"],
9 | "xlabel": "",
10 | "xdata": ["Read 4KB", "Read 16KB", "Write 4KB", "Write 16KB"],
11 | "legend": ["PIN", "ODP", "RPC", "TeRM"]
12 | }
13 |
14 |
15 | e = octopus_eval.Experiment(**kwargs)
16 | e.run()
17 | e.output()
18 |
19 | # rerun a data point
20 | # kwargs = {
21 | # "name": "figure-16",
22 | # "mode": ["rpc"],
23 | # "write": [1],
24 | # "unit_size": ["16k"],
25 | # "xlabel": "", # for output(), no need to edit
26 | # "xdata": ["Write 16KB"], # for output(), no need to edit
27 | # "legend": ["RPC"] # for output(), no need to edit
28 | # }
29 | # e = octopus_eval.Experiment(**kwargs)
30 | # e.run()
31 |
--------------------------------------------------------------------------------
/ae/figure-17.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.xstore_eval as xstore_eval
3 |
4 | kwargs = {
5 | "name": "figure-17",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "zipf_theta": [0, 0.4, 0.8, 0.9, 0.99],
8 | "xlabel": "Zipfian theta",
9 | "xdata": ["0", "0.4", "0.8", "0.9", "0.99"],
10 | "legend": ["PIN", "ODP", "RPC", "TeRM"]
11 | }
12 |
13 | e = xstore_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-8.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-8",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "sz_unit": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"],
8 | "xlabel": "Read Size",
9 | "xdata": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"],
10 | "legend": ["PIN", "ODP", "RPC", "TeRM"],
11 | }
12 |
13 | e = term_eval.Experiment(**kwargs)
14 | e.run()
15 | e.output()
16 |
--------------------------------------------------------------------------------
/ae/figure-9.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import scripts.term_eval as term_eval
3 |
4 | kwargs = {
5 | "name": "figure-9",
6 | "mode": ["pin", "odp", "rpc", "pdp"],
7 | "sz_unit": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"],
8 | "write_percent": 100,
9 | "xlabel": "Write Size",
10 | "xdata": ["64", "128", "256", "512", "1k", "2k", "4k", "8k", "16k"],
11 | "legend": ["PIN", "ODP", "RPC", "TeRM"],
12 | }
13 |
14 | e = term_eval.Experiment(**kwargs)
15 | e.run()
16 | e.output()
17 |
--------------------------------------------------------------------------------
/ae/output-example.tar.xz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/ae/output-example.tar.xz
--------------------------------------------------------------------------------
/ae/run-all.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/bash
2 |
3 | ./clear.sh
4 | ./figure-8.py
5 | ./figure-9.py
6 | ./figure-10a.py
7 | ./figure-10b.py
8 | ./figure-10c.py
9 | ./figure-10d.py
10 | ./figure-11.py
11 | ./figure-12a.py
12 | ./figure-12b.py
13 | ./figure-13a.py
14 | ./figure-13b.py
15 | ./figure-14a.py
16 | ./figure-14b.py
17 | ./figure-15a.py
18 | ./figure-15b.py
19 | ./figure-16.py
20 | ./figure-17.py
21 |
--------------------------------------------------------------------------------
/ae/scripts/bootstrap-octopus.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import toml
4 | import paramiko
5 | import getpass
6 | import argparse
7 | import time
8 | import select
9 | import subprocess
10 | import pickle
11 | import os
12 |
13 | from paramiko import SSHConfig
14 |
15 | config = SSHConfig()
16 | with open(os.path.expanduser("/etc/ssh/ssh_config")) as _file:
17 | config.parse(_file)
18 |
19 |
20 | class RunPrinter:
21 | def __init__(self, name, channel):
22 | self.channel = channel
23 | self.name = name
24 | self.dying = False
25 | self.server_inited = False
26 |
27 | def print_one(self):
28 | if self.channel.recv_ready():
29 | res = self.channel.recv(1048576 * 10).decode().splitlines()
30 | for l in res:
31 | print("@%-10s" % self.name, l.strip())
32 | if "Worker" in l:
33 | self.server_inited = True
34 |
35 | if self.channel.recv_stderr_ready():
36 | res = self.channel.recv_stderr(1048576 * 10).decode().splitlines()
37 | for l in res:
38 | print("@%-10s" % self.name, l.strip())
39 |
40 | if self.channel.exit_status_ready():
41 | print("! exit ", self.name)
42 | return False
43 |
44 | return True
45 |
46 |
47 | def check_keywords(lines, keywords, black_keywords):
48 | match = []
49 | for l in lines:
50 | black = False
51 | for bk in black_keywords:
52 | if l.find(bk) >= 0:
53 | black = True
54 | break
55 | if black:
56 | continue
57 | flag = True
58 | for k in keywords:
59 | flag = flag and (l.find(k) >= 0)
60 | if flag:
61 | # print("matched line: ",l)
62 | match.append(l)
63 | return len(match)
64 |
65 |
66 | class Envs:
67 | def __init__(self, f=""):
68 | self.envs = {}
69 | self.history = []
70 | try:
71 | self.load(f)
72 | except:
73 | pass
74 |
75 | def set(self, envs):
76 | self.envs = envs
77 | self.history += envs.keys()
78 |
79 | def load(self, f):
80 | self.envs = pickle.load(open(f, "rb"))
81 |
82 | def add(self, name, env):
83 | self.history.append(name)
84 | self.envs[name] = env
85 |
86 | def append(self, name, env):
87 | self.envs[name] = (self.envs[name] + ":" + env)
88 |
89 | def delete(self, name):
90 | self.history.remove(name)
91 | del self.envs[name]
92 |
93 | def __str__(self):
94 | s = ""
95 | for name in self.history:
96 | s += ("export %s=%s;" % (name, self.envs[name]))
97 | return s
98 |
99 | def store(self, f):
100 | with open(f, 'wb') as handle:
101 | pickle.dump(self.envs, handle, protocol=pickle.HIGHEST_PROTOCOL)
102 |
103 |
104 | class ConnectProxy:
105 | def __init__(self, mac, user="", passp=""):
106 | if user == "": # use the server user as default
107 | user = getpass.getuser()
108 | self.ssh = paramiko.SSHClient()
109 |
110 | self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
111 | self.user = user
112 | self.mac = mac
113 | self.sftp = None
114 | self.passp = passp
115 |
116 | def connect(self, pwd, passp=None, timeout=30):
117 | # user_config = config.lookup(self.mac)
118 | # if user_config:
119 | # print("found user_config", config)
120 | # print(user_config)
121 | # #print(user_config)
122 | # #cfg = {'hostname': self.mac, 'username': self.user}
123 | # #cfg['sock'] = paramiko.ProxyCommand(user_config['proxycommand'])
124 | # return self.ssh.connect(hostname = self.mac,username = self.user, password = pwd,
125 | # timeout = timeout, allow_agent=False,look_for_keys=False,passphrase=passp,sock=paramiko.ProxyCommand(user_config['proxycommand']),banner_timeout=400)
126 |
127 | # else:
128 | return self.ssh.connect(hostname=self.mac, username=self.user, password=pwd,
129 | timeout=timeout, allow_agent=False, look_for_keys=False, passphrase=passp)
130 |
131 | def connect_with_pkey(self, keyfile_name, timeout=10):
132 | print("connect with pkey")
133 | user_config = config.lookup(self.mac)
134 | print(user_config)
135 | if user_config:
136 | assert False
137 |
138 | self.ssh.connect(hostname=self.mac, username=self.user,
139 | key_filename=keyfile_name, timeout=timeout)
140 |
141 | def execute(self, cmd, pty=False, timeout=None, background=False):
142 | if not background:
143 | return self.ssh.exec_command(cmd, get_pty=pty, timeout=timeout)
144 | else:
145 | print("exe", cmd, "in background")
146 | transport = self.ssh.get_transport()
147 | channel = transport.open_session()
148 | return channel.exec_command(cmd)
149 |
150 | def execute_w_channel(self, cmd):
151 | transport = self.ssh.get_transport()
152 | channel = transport.open_session()
153 | channel.get_pty()
154 | channel.exec_command(cmd)
155 | return channel
156 |
157 | def copy_file(self, f, dst_dir=""):
158 | if self.sftp == None:
159 | self.sftp = paramiko.SFTPClient.from_transport(
160 | self.ssh.get_transport())
161 | self.sftp.put(f, dst_dir + "/" + f)
162 |
163 | def get_file(self, remote_path, f):
164 | if self.sftp == None:
165 | self.sftp = paramiko.SFTPClient.from_transport(
166 | self.ssh.get_transport())
167 | self.sftp.get(remote_path + "/" + f, f)
168 |
169 | def close(self):
170 | if self.sftp != None:
171 | self.sftp.close()
172 | self.ssh.close()
173 |
174 | def copy_dir(self, source, target, verbose=False):
175 |
176 | if self.sftp == None:
177 | self.sftp = paramiko.SFTPClient.from_transport(
178 | self.ssh.get_transport())
179 |
180 | if os.path.isfile(source):
181 | return self.copy_file(source, target)
182 |
183 | try:
184 | os.listdir(source)
185 | except:
186 | print("[%s] failed to put %s" % (self.mac, source))
187 | return
188 |
189 | self.mkdir(target, ignore_existing=True)
190 |
191 | for item in os.listdir(source):
192 | if os.path.isfile(os.path.join(source, item)):
193 | try:
194 | self.sftp.put(os.path.join(source, item),
195 | '%s/%s' % (target, item))
196 | if verbose: print("[%s] put %s done" %
197 | (self.mac, os.path.join(source, item)))
198 | except:
199 | print("[%s] put %s error" %
200 | (self.mac, os.path.join(source, item)))
201 | else:
202 | self.mkdir('%s/%s' % (target, item), ignore_existing=True)
203 | self.copy_dir(os.path.join(source, item),
204 | '%s/%s' % (target, item))
205 |
206 | def mkdir(self, path, mode=511, ignore_existing=False):
207 | try:
208 | self.sftp.mkdir(path, mode)
209 | except IOError:
210 | if ignore_existing:
211 | pass
212 | else:
213 | raise
214 |
215 |
216 | class Courier2:
217 | def __init__(self, user=getpass.getuser(), pwd="123", hosts="hosts.xml", passp="", curdir=".", keyfile=""):
218 | self.user = user
219 | self.pwd = pwd
220 | self.keyfile = keyfile
221 | self.cached_host = "localhost"
222 | self.passp = passp
223 |
224 | self.curdir = curdir
225 | self.envs = Envs()
226 |
227 | def cd(self, dir):
228 | if os.path.isabs(dir):
229 | self.curdir = dir
230 | if self.curdir == "~":
231 | self.curdir = "."
232 | else:
233 | self.curdir += ("/" + dir)
234 |
235 | def get_file(self, host, dst_dir, f, timeout=None):
236 | p = ConnectProxy(host, self.user)
237 | try:
238 | if len(self.keyfile):
239 | p.connect_with_pkey(self.keyfile, timeout)
240 | else:
241 | p.connect(self.pwd, timeout=timeout)
242 | except Exception as e:
243 | print("[get_file] connect to %s error: " % host, e)
244 | p.close()
245 | return False, None
246 | try:
247 | p.get_file(dst_dir, f)
248 | except Exception as e:
249 | print("[get_file] get %s @%s error " % (f, host), e)
250 | p.close()
251 | return False, None
252 | if p:
253 | p.close()
254 |
255 | return True, None
256 |
257 | def copy_file(self, host, f, dst_dir="~/", timeout=None):
258 | p = ConnectProxy(host, self.user)
259 | try:
260 | if len(self.keyfile):
261 | p.connect_with_pkey(self.keyfile, timeout)
262 | else:
263 | p.connect(self.pwd, timeout=timeout)
264 | except Exception as e:
265 | print("[copy_file] connect to %s error: " % host, e)
266 | p.close()
267 | return False, None
268 | try:
269 | p.copy_file(f, dst_dir)
270 | except Exception as e:
271 | print("[copy_file] copy %s error " % host, e)
272 | p.close()
273 | return False, None
274 | if p:
275 | p.close()
276 |
277 | return True, None
278 |
279 | def execute_w_channel(self, cmd, host, dir, timeout=None):
280 | p = ConnectProxy(host, self.user)
281 | try:
282 | if len(self.keyfile):
283 | p.connect_with_pkey(self.keyfile, timeout)
284 | else:
285 | p.connect(self.pwd, self.passp, timeout=timeout)
286 | except Exception as e:
287 | print("[pre execute] connect to %s error: " % host, e)
288 | p.close()
289 | return None, e
290 |
291 | try:
292 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd
293 | return p.execute_w_channel(ccmd)
294 | except:
295 | return None
296 |
297 | def pre_execute(self, cmd, host, pty=True, dir="", timeout=None, retry_count=3, background=False):
298 | if dir == "":
299 | dir = self.curdir
300 |
301 | p = ConnectProxy(host, self.user)
302 | try:
303 | if len(self.keyfile):
304 | p.connect_with_pkey(self.keyfile, timeout)
305 | else:
306 | p.connect(self.pwd, timeout=timeout)
307 | except Exception as e:
308 | print("[pre execute] connect to %s error: " % host, e)
309 | p.close()
310 | return None, e
311 |
312 | try:
313 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd
314 | if not background:
315 | _, stdout, _ = p.execute(
316 | ccmd, pty, timeout, background=background)
317 | return p, stdout
318 | else:
319 | c = p.execute(ccmd, pty, timeout, background=True)
320 | return p, c
321 | except Exception as e:
322 | print("[pre execute] execute cmd @ %s error: " % host, e)
323 | p.close()
324 | if retry_count > 0:
325 | if timeout:
326 | timeout += 2
327 | return self.pre_execute(cmd, host, pty, dir, timeout, retry_count - 1)
328 |
329 | def execute(self, cmd, host, pty=True, dir="", timeout=None, output=True, background=False):
330 | ret = [True, ""]
331 | p, stdout = self.pre_execute(
332 | cmd, host, pty, dir, timeout, background=background)
333 | if p and (stdout and output) and (not background):
334 | try:
335 | while not stdout.channel.closed:
336 | try:
337 | for line in iter(lambda: stdout.readline(2048), ""):
338 | if pty and (len(line) > 0): # ignore null lines
339 | print((("[%s]: " % host) + line), end="")
340 | else:
341 | ret[1] += (line + "\n")
342 | except UnicodeDecodeError as e:
343 | continue
344 | except Exception as e:
345 | break
346 | except Exception as e:
347 | print("[%s] execute with exception:" % host, e)
348 | if p and (not background):
349 | p.close()
350 | # ret[1] = stdout
351 | else:
352 | ret[0] = False
353 | ret[1] = stdout
354 | return ret[0], ret[1]
355 |
356 |
357 | # cr.envs.set(base_env)
358 |
359 |
360 | def main():
361 | arg_parser = argparse.ArgumentParser(
362 | description=''' Benchmark script for running the cluster''')
363 | arg_parser.add_argument(
364 | '-f', metavar='CONFIG', dest='config', default="run.toml", type=str,
365 | help='The configuration file to execute commands')
366 | arg_parser.add_argument('-b', '--black', nargs='+',
367 | help='hosts to ignore', required=False)
368 | arg_parser.add_argument(
369 | '-n', '--num', help='num-passes to run', default=128, type=int)
370 |
371 | args = arg_parser.parse_args()
372 | num = args.num
373 |
374 | config = toml.load(args.config)
375 |
376 | passes = config.get("pass", [])
377 |
378 | user = config.get("user", "root")
379 | pwd = config.get("pwd", "fast24ae")
380 | passp = config.get("passphrase", None)
381 |
382 | prefix_cmd = config.get("prefix_cmd", "")
383 | print(f"! prefix_cmd={prefix_cmd}")
384 | suffix_cmd = config.get("suffix_cmd", "")
385 | print(f"! suffix_cmd={suffix_cmd}")
386 |
387 | black_list = {}
388 | if args.black:
389 | for e in args.black:
390 | black_list[e] = True
391 |
392 | cr = Courier2(user, pwd, passp=passp)
393 |
394 | # first we sync files
395 |
396 | syncs = config.get("sync", [])
397 | for s in syncs:
398 | source = s["source"]
399 | targets = s["targets"]
400 | for t in targets:
401 | print("(sync files)", "scp -3 %s %s" % (source, t))
402 | subprocess.call(("scp -3 %s %s" % (source, t)).split())
403 |
404 | init_cmd = config.get("init_cmd")
405 | if init_cmd:
406 | os.system(init_cmd)
407 |
408 | printer_list = []
409 | runned = 0
410 | for p in passes:
411 | if runned > num:
412 | break
413 | runned += 1
414 | print("! execute cmd @%s" % p["host"], p["cmd"])
415 |
416 | if p["host"] in black_list:
417 | continue
418 |
419 | if p.get("local", "no") == "yes":
420 | subprocess.run(p["cmd"].split(" "))
421 | pass
422 | else:
423 | channel = cr.execute_w_channel(prefix_cmd + " " + p["cmd"] + " " + suffix_cmd,
424 | p["host"],
425 | p["path"])
426 | if p["name"] not in config.get("null", []):
427 | current_printer = RunPrinter(p["host"] + ":" + p["name"], channel)
428 | printer_list.append(current_printer)
429 | if p["name"] == "s0":
430 | goon = True
431 | while goon and not current_printer.server_inited:
432 | goon = current_printer.print_one()
433 | time.sleep(0.1)
434 |
435 | should_stop = False
436 | printer_go_on = True
437 | while printer_go_on:
438 | try:
439 | temp = []
440 | for p in printer_list:
441 | if should_stop:
442 | p.dying = True
443 | p.channel.send(chr(3))
444 | if p.print_one():
445 | temp.append(p)
446 | else:
447 | should_stop = True
448 | printer_list = temp
449 | printer_go_on = True if len(printer_list) > 0 else False
450 | time.sleep(0.1)
451 | except KeyboardInterrupt:
452 | print("set should_stop")
453 | should_stop = True
454 |
455 |
456 | if __name__ == "__main__":
457 | main()
458 |
--------------------------------------------------------------------------------
/ae/scripts/bootstrap.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import toml
4 | import paramiko
5 | import getpass
6 | import argparse
7 | import time
8 | import select
9 | import subprocess
10 | import pickle
11 | import os
12 |
13 | from paramiko import SSHConfig
14 |
15 | config = SSHConfig()
16 | with open(os.path.expanduser("/etc/ssh/ssh_config")) as _file:
17 | config.parse(_file)
18 |
19 |
20 | class RunPrinter:
21 | def __init__(self, name, channel):
22 | self.channel = channel
23 | self.name = name
24 | self.dying = False
25 |
26 | def print_one(self):
27 | if self.channel.recv_ready():
28 | res = self.channel.recv(1048576).decode().splitlines()
29 | for l in res:
30 | print("@%-10s" % self.name, l.strip())
31 |
32 | if self.channel.recv_stderr_ready():
33 | res = self.channel.recv_stderr(1048576).decode().splitlines()
34 | for l in res:
35 | print("@%-10s" % self.name, l.strip())
36 |
37 | if self.channel.exit_status_ready():
38 | print("! exit ", self.name)
39 | return False
40 |
41 | return True
42 |
43 |
44 | def check_keywords(lines, keywords, black_keywords):
45 | match = []
46 | for l in lines:
47 | black = False
48 | for bk in black_keywords:
49 | if l.find(bk) >= 0:
50 | black = True
51 | break
52 | if black:
53 | continue
54 | flag = True
55 | for k in keywords:
56 | flag = flag and (l.find(k) >= 0)
57 | if flag:
58 | # print("matched line: ",l)
59 | match.append(l)
60 | return len(match)
61 |
62 |
63 | class Envs:
64 | def __init__(self, f=""):
65 | self.envs = {}
66 | self.history = []
67 | try:
68 | self.load(f)
69 | except:
70 | pass
71 |
72 | def set(self, envs):
73 | self.envs = envs
74 | self.history += envs.keys()
75 |
76 | def load(self, f):
77 | self.envs = pickle.load(open(f, "rb"))
78 |
79 | def add(self, name, env):
80 | self.history.append(name)
81 | self.envs[name] = env
82 |
83 | def append(self, name, env):
84 | self.envs[name] = (self.envs[name] + ":" + env)
85 |
86 | def delete(self, name):
87 | self.history.remove(name)
88 | del self.envs[name]
89 |
90 | def __str__(self):
91 | s = ""
92 | for name in self.history:
93 | s += ("export %s=%s;" % (name, self.envs[name]))
94 | return s
95 |
96 | def store(self, f):
97 | with open(f, 'wb') as handle:
98 | pickle.dump(self.envs, handle, protocol=pickle.HIGHEST_PROTOCOL)
99 |
100 |
101 | class ConnectProxy:
102 | def __init__(self, mac, user="", passp=""):
103 | if user == "": # use the server user as default
104 | user = getpass.getuser()
105 | self.ssh = paramiko.SSHClient()
106 |
107 | self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
108 | self.user = user
109 | self.mac = mac
110 | self.sftp = None
111 | self.passp = passp
112 |
113 | def connect(self, pwd, passp=None, timeout=30):
114 | # user_config = config.lookup(self.mac)
115 | # if user_config:
116 | # print("found user_config", config)
117 | # print(user_config)
118 | # #print(user_config)
119 | # #cfg = {'hostname': self.mac, 'username': self.user}
120 | # #cfg['sock'] = paramiko.ProxyCommand(user_config['proxycommand'])
121 | # return self.ssh.connect(hostname = self.mac,username = self.user, password = pwd,
122 | # timeout = timeout, allow_agent=False,look_for_keys=False,passphrase=passp,sock=paramiko.ProxyCommand(user_config['proxycommand']),banner_timeout=400)
123 |
124 | # else:
125 | return self.ssh.connect(hostname=self.mac, username=self.user, password=pwd,
126 | timeout=timeout, allow_agent=False, look_for_keys=False, passphrase=passp)
127 |
128 | def connect_with_pkey(self, keyfile_name, timeout=10):
129 | print("connect with pkey")
130 | user_config = config.lookup(self.mac)
131 | print(user_config)
132 | if user_config:
133 | assert False
134 |
135 | self.ssh.connect(hostname=self.mac, username=self.user,
136 | key_filename=keyfile_name, timeout=timeout)
137 |
138 | def execute(self, cmd, pty=False, timeout=None, background=False):
139 | if not background:
140 | return self.ssh.exec_command(cmd, get_pty=pty, timeout=timeout)
141 | else:
142 | print("exe", cmd, "in background")
143 | transport = self.ssh.get_transport()
144 | channel = transport.open_session()
145 | return channel.exec_command(cmd)
146 |
147 | def execute_w_channel(self, cmd):
148 | transport = self.ssh.get_transport()
149 | channel = transport.open_session()
150 | channel.get_pty()
151 | channel.exec_command(cmd)
152 | return channel
153 |
154 | def copy_file(self, f, dst_dir=""):
155 | if self.sftp == None:
156 | self.sftp = paramiko.SFTPClient.from_transport(
157 | self.ssh.get_transport())
158 | self.sftp.put(f, dst_dir + "/" + f)
159 |
160 | def get_file(self, remote_path, f):
161 | if self.sftp == None:
162 | self.sftp = paramiko.SFTPClient.from_transport(
163 | self.ssh.get_transport())
164 | self.sftp.get(remote_path + "/" + f, f)
165 |
166 | def close(self):
167 | if self.sftp != None:
168 | self.sftp.close()
169 | self.ssh.close()
170 |
171 | def copy_dir(self, source, target, verbose=False):
172 |
173 | if self.sftp == None:
174 | self.sftp = paramiko.SFTPClient.from_transport(
175 | self.ssh.get_transport())
176 |
177 | if os.path.isfile(source):
178 | return self.copy_file(source, target)
179 |
180 | try:
181 | os.listdir(source)
182 | except:
183 | print("[%s] failed to put %s" % (self.mac, source))
184 | return
185 |
186 | self.mkdir(target, ignore_existing=True)
187 |
188 | for item in os.listdir(source):
189 | if os.path.isfile(os.path.join(source, item)):
190 | try:
191 | self.sftp.put(os.path.join(source, item),
192 | '%s/%s' % (target, item))
193 | if verbose: print("[%s] put %s done" %
194 | (self.mac, os.path.join(source, item)))
195 | except:
196 | print("[%s] put %s error" %
197 | (self.mac, os.path.join(source, item)))
198 | else:
199 | self.mkdir('%s/%s' % (target, item), ignore_existing=True)
200 | self.copy_dir(os.path.join(source, item),
201 | '%s/%s' % (target, item))
202 |
203 | def mkdir(self, path, mode=511, ignore_existing=False):
204 | try:
205 | self.sftp.mkdir(path, mode)
206 | except IOError:
207 | if ignore_existing:
208 | pass
209 | else:
210 | raise
211 |
212 |
213 | class Courier2:
214 | def __init__(self, user=getpass.getuser(), pwd="123", hosts="hosts.xml", passp="", curdir=".", keyfile=""):
215 | self.user = user
216 | self.pwd = pwd
217 | self.keyfile = keyfile
218 | self.cached_host = "localhost"
219 | self.passp = passp
220 |
221 | self.curdir = curdir
222 | self.envs = Envs()
223 |
224 | def cd(self, dir):
225 | if os.path.isabs(dir):
226 | self.curdir = dir
227 | if self.curdir == "~":
228 | self.curdir = "."
229 | else:
230 | self.curdir += ("/" + dir)
231 |
232 | def get_file(self, host, dst_dir, f, timeout=None):
233 | p = ConnectProxy(host, self.user)
234 | try:
235 | if len(self.keyfile):
236 | p.connect_with_pkey(self.keyfile, timeout)
237 | else:
238 | p.connect(self.pwd, timeout=timeout)
239 | except Exception as e:
240 | print("[get_file] connect to %s error: " % host, e)
241 | p.close()
242 | return False, None
243 | try:
244 | p.get_file(dst_dir, f)
245 | except Exception as e:
246 | print("[get_file] get %s @%s error " % (f, host), e)
247 | p.close()
248 | return False, None
249 | if p:
250 | p.close()
251 |
252 | return True, None
253 |
254 | def copy_file(self, host, f, dst_dir="~/", timeout=None):
255 | p = ConnectProxy(host, self.user)
256 | try:
257 | if len(self.keyfile):
258 | p.connect_with_pkey(self.keyfile, timeout)
259 | else:
260 | p.connect(self.pwd, timeout=timeout)
261 | except Exception as e:
262 | print("[copy_file] connect to %s error: " % host, e)
263 | p.close()
264 | return False, None
265 | try:
266 | p.copy_file(f, dst_dir)
267 | except Exception as e:
268 | print("[copy_file] copy %s error " % host, e)
269 | p.close()
270 | return False, None
271 | if p:
272 | p.close()
273 |
274 | return True, None
275 |
276 | def execute_w_channel(self, cmd, host, dir, timeout=None):
277 | p = ConnectProxy(host, self.user)
278 | try:
279 | if len(self.keyfile):
280 | p.connect_with_pkey(self.keyfile, timeout)
281 | else:
282 | p.connect(self.pwd, self.passp, timeout=timeout)
283 | except Exception as e:
284 | print("[pre execute] connect to %s error: " % host, e)
285 | p.close()
286 | return None, e
287 |
288 | try:
289 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd
290 | return p.execute_w_channel(ccmd)
291 | except:
292 | return None
293 |
294 | def pre_execute(self, cmd, host, pty=True, dir="", timeout=None, retry_count=3, background=False):
295 | if dir == "":
296 | dir = self.curdir
297 |
298 | p = ConnectProxy(host, self.user)
299 | try:
300 | if len(self.keyfile):
301 | p.connect_with_pkey(self.keyfile, timeout)
302 | else:
303 | p.connect(self.pwd, timeout=timeout)
304 | except Exception as e:
305 | print("[pre execute] connect to %s error: " % host, e)
306 | p.close()
307 | return None, e
308 |
309 | try:
310 | ccmd = ("cd %s" % dir) + ";" + str(self.envs) + cmd
311 | if not background:
312 | _, stdout, _ = p.execute(
313 | ccmd, pty, timeout, background=background)
314 | return p, stdout
315 | else:
316 | c = p.execute(ccmd, pty, timeout, background=True)
317 | return p, c
318 | except Exception as e:
319 | print("[pre execute] execute cmd @ %s error: " % host, e)
320 | p.close()
321 | if retry_count > 0:
322 | if timeout:
323 | timeout += 2
324 | return self.pre_execute(cmd, host, pty, dir, timeout, retry_count - 1)
325 |
326 | def execute(self, cmd, host, pty=True, dir="", timeout=None, output=True, background=False):
327 | ret = [True, ""]
328 | p, stdout = self.pre_execute(
329 | cmd, host, pty, dir, timeout, background=background)
330 | if p and (stdout and output) and (not background):
331 | try:
332 | while not stdout.channel.closed:
333 | try:
334 | for line in iter(lambda: stdout.readline(2048), ""):
335 | if pty and (len(line) > 0): # ignore null lines
336 | print((("[%s]: " % host) + line), end="")
337 | else:
338 | ret[1] += (line + "\n")
339 | except UnicodeDecodeError as e:
340 | continue
341 | except Exception as e:
342 | break
343 | except Exception as e:
344 | print("[%s] execute with exception:" % host, e)
345 | if p and (not background):
346 | p.close()
347 | # ret[1] = stdout
348 | else:
349 | ret[0] = False
350 | ret[1] = stdout
351 | return ret[0], ret[1]
352 |
353 |
354 | # cr.envs.set(base_env)
355 |
356 |
357 | def main():
358 | arg_parser = argparse.ArgumentParser(
359 | description=''' Benchmark script for running the cluster''')
360 | arg_parser.add_argument(
361 | '-f', metavar='CONFIG', dest='config', default="run.toml", type=str,
362 | help='The configuration file to execute commands')
363 | arg_parser.add_argument('-b', '--black', nargs='+',
364 | help='hosts to ignore', required=False)
365 | arg_parser.add_argument(
366 | '-n', '--num', help='num-passes to run', default=128, type=int)
367 |
368 | args = arg_parser.parse_args()
369 | num = args.num
370 |
371 | config = toml.load(args.config)
372 |
373 | passes = config.get("pass", [])
374 |
375 | user = config.get("user", "root")
376 | pwd = config.get("pwd", "fast24ae")
377 | passp = config.get("passphrase", None)
378 |
379 | prefix_cmd = config.get("prefix_cmd", "")
380 | print(f"! prefix_cmd={prefix_cmd}")
381 | suffix_cmd = config.get("suffix_cmd", "")
382 | print(f"! suffix_cmd={suffix_cmd}")
383 |
384 | black_list = {}
385 | if args.black:
386 | for e in args.black:
387 | black_list[e] = True
388 |
389 | cr = Courier2(user, pwd, passp=passp)
390 |
391 | # first we sync files
392 |
393 | syncs = config.get("sync", [])
394 | for s in syncs:
395 | source = s["source"]
396 | targets = s["targets"]
397 | for t in targets:
398 | print("(sync files)", "scp -3 %s %s" % (source, t))
399 | subprocess.call(("scp -3 %s %s" % (source, t)).split())
400 |
401 | init_cmd = config.get("init_cmd")
402 | if init_cmd:
403 | os.system(init_cmd)
404 |
405 | printer = []
406 | runned = 0
407 | for p in passes:
408 | if runned > num:
409 | break
410 | runned += 1
411 | print("! execute cmd @%s" % p["host"], p["cmd"])
412 |
413 | if p["host"] in black_list:
414 | continue
415 |
416 | if p.get("local", "no") == "yes":
417 | subprocess.run(p["cmd"].split(" "))
418 | pass
419 | else:
420 | channel = cr.execute_w_channel(prefix_cmd + " " + p["cmd"] + " " + suffix_cmd,
421 | p["host"],
422 | p["path"])
423 | if p["name"] not in config.get("null", []):
424 | printer.append(RunPrinter(p["host"] + ":" + p["name"], channel))
425 | time.sleep(1)
426 |
427 | should_stop = False
428 | printer_go_on = True
429 | while printer_go_on:
430 | try:
431 | temp = []
432 | for p in printer:
433 | if should_stop:
434 | p.dying = True
435 | p.channel.send(chr(3))
436 | print("terminating", p.name)
437 | if p.print_one():
438 | temp.append(p)
439 | else:
440 | should_stop = True
441 | printer = temp
442 | printer_go_on = True if len(printer) > 0 else False
443 | time.sleep(0.1)
444 | except KeyboardInterrupt:
445 | print("set should_stop")
446 | should_stop = True
447 |
448 |
449 | if __name__ == "__main__":
450 | main()
451 |
--------------------------------------------------------------------------------
/ae/scripts/ins-pch.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | cd /home/yz/workspace/TeRM/libterm/kmod
3 | make
4 | rmmod pch
5 | insmod pch.ko
6 |
--------------------------------------------------------------------------------
/ae/scripts/limit-mem.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | function get_available {
4 | echo `free -g|awk 'NR==2 {print $7}'`
5 | }
6 |
7 | function set {
8 | kb=`gb_to_kb $1`
9 | modprobe brd rd_nr=1 rd_size=${kb}
10 | dd if=/dev/zero of=/dev/ram0 bs=1k count=${kb}
11 | }
12 |
13 | function clear {
14 | rmmod brd
15 | }
16 |
17 | function gb_to_kb {
18 | echo `expr $1 \* 1024 \* 1024`
19 | }
20 |
21 | # if [ ! $# -eq 1 ]
22 | # then
23 | # echo "$0 "
24 | # exit 1
25 | # fi
26 |
27 | if [ -z "$PDP_server_memory_gb" ]
28 | then
29 | PDP_server_memory_gb=0
30 | if [ $# -eq 1 ]
31 | then
32 | PDP_server_memory_gb=$1
33 | fi
34 | fi
35 |
36 | echo target memory: ${PDP_server_memory_gb}GB
37 | target=${PDP_server_memory_gb}
38 |
39 | if [ $target -eq 0 ]
40 | then
41 | clear
42 | exit 0
43 | fi
44 |
45 | if [ `get_available` -eq $target ]
46 | then
47 | echo already done.
48 | exit 0
49 | fi
50 |
51 | clear
52 |
53 | avail=`get_available`
54 | if [ $avail -lt $target ]
55 | then
56 | echo "target ${target}GB > available ${avail}GB"
57 | exit 1
58 | fi
59 |
60 | set `expr $avail - $target`
61 |
--------------------------------------------------------------------------------
/ae/scripts/octopus_eval.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 | def run_test(mode = "pdp", write = 0, unit_size = "4k", log_dir = "."):
5 | server_memory_gb = 0 if mode == "pin" else 18
6 |
7 | os.system(f"mkdir -p {log_dir}")
8 | log_file = f"{log_dir}/write.{write},mode.{mode},unit.{unit_size}.log"
9 |
10 | test_file_content = f"""
11 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh"
12 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches;
13 | echo 100 > /proc/sys/vm/dirty_ratio;
14 | echo 0 > /proc/sys/kernel/numa_balancing;
15 | systemctl start opensmd;
16 | export LD_PRELOAD="/home/yz/workspace/TeRM/ae/bin/libpdp.so /home/yz/workspace/TeRM/ae/bin/octopus/libnrfsc.so";
17 | export PDP_server_rpc_threads=16 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb={server_memory_gb};
18 | export PDP_mode={mode};\"\"\"
19 | suffix_cmd = ""
20 |
21 | ## server process
22 | [[pass]]
23 | name = "s0"
24 | host = "node184"
25 | path = "/home/yz/workspace/TeRM/ae/bin/octopus"
26 | cmd = \"\"\"../../scripts/ins-pch.sh;
27 | systemctl restart openibd;
28 | ../../scripts/limit-mem.sh;
29 | export PDP_is_server=1;
30 | ./dmfs\"\"\"
31 |
32 | ## below are clients
33 | [[pass]]
34 | name = "c0"
35 | host = "node166"
36 | path = "/home/yz/workspace/TeRM/ae/bin/octopus"
37 | cmd = "mpiexec --allow-run-as-root --np 32 /bin/bash -c './mpibw --write={write} --unit_size={unit_size} --seconds=70 --zipf_theta=0.99 --file_size=1g' "
38 | """
39 | with open(f"{log_dir}/test.toml", "w") as f:
40 | f.write(test_file_content)
41 |
42 | cmd = f"""
43 | date | tee -a {log_file};
44 | /home/yz/workspace/TeRM/ae/scripts/bootstrap-octopus.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file}
45 | """
46 | os.system(cmd)
47 |
48 | def extract_result(mode = "pdp", write = 0, unit_size = "4k", log_dir = "."):
49 | result = 0.0
50 |
51 | log_file = f"{log_dir}/write.{write},mode.{mode},unit.{unit_size}.log"
52 | with open(log_file, "r") as f:
53 | while True:
54 | line = f.readline()
55 | if not line:
56 | break
57 | if "Bandwidth = " not in line:
58 | continue
59 |
60 | tokens = re.findall(r"Bandwidth = (\d+\.\d+)", line)
61 | result = float(tokens[0])
62 |
63 | return result
64 |
65 | def format_digits(n):
66 | return f"{n:,.2f}"
67 |
68 | def split_kwargs(kwargs):
69 | vector_kwargs = {}
70 | scalar_kwargs = {}
71 | for k, v in kwargs.items():
72 | if type(v) == list:
73 | vector_kwargs[k] = v
74 | else:
75 | scalar_kwargs[k] = v
76 | return vector_kwargs, scalar_kwargs
77 |
78 |
79 | def run_batch(name : str, **kwargs):
80 | log_dir = f"./output/log/{name}"
81 | os.system(f"mkdir -p {log_dir}")
82 |
83 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs)
84 |
85 | from itertools import product
86 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values()))
87 | for idx, kv in enumerate(l):
88 | str = f"### Running ({idx + 1}/{len(l)}): {kv} ###"
89 | with open("./output/current-running.txt", "w") as f:
90 | f.write(f"{name}\n{str}\n")
91 | print(str)
92 | run_test(**kv, **scalar_kwargs, log_dir=log_dir)
93 |
94 |
95 | def output_batch(name : str, ydata = [], **kwargs):
96 | log_dir = f"./output/log/{name}"
97 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs)
98 | from itertools import product
99 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values()))
100 | output = ""
101 |
102 | for idx, kv in enumerate(l):
103 | result = extract_result(**kv, **scalar_kwargs, log_dir=log_dir)
104 | str = f"({idx + 1}/{len(l)}) {kv}: bandwidth (MB/s)="
105 | ydata.append(result)
106 | str += f"{format_digits(result)}"
107 | output += str + "\n"
108 |
109 | print(output)
110 | with open(f"./output/{name}.txt", "w") as f:
111 | f.write(output)
112 |
113 |
114 | def draw_figure(name : str, xlabel : str, xdata : list, ydata : list, legend : list[str]):
115 | import matplotlib.pyplot as plt
116 | xlen = len(xdata)
117 |
118 | for idx, label in enumerate(legend):
119 | match label:
120 | case "PIN":
121 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-D", mfc="w", label = "PIN")
122 | case "ODP":
123 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-^", mfc="w", label = "ODP")
124 | case "RPC":
125 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-x", mfc="w", label = "RPC")
126 | case "TeRM":
127 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = "TeRM")
128 | case _:
129 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = label)
130 |
131 | plt.xlabel(xlabel)
132 | plt.ylabel("Bandwidth (MB/s)")
133 | plt.legend()
134 | print(f"written {name}.png")
135 | plt.savefig(f"./output/{name}.png")
136 |
137 | class Experiment:
138 | def __init__(self, name : str,
139 | xlabel: str, xdata: list, legend: list[str],
140 | **kwargs):
141 | self.name = name
142 | self.xlabel = xlabel
143 | self.xdata = xdata
144 | self.ydata = []
145 | self.legend = legend
146 | self.kwargs = kwargs
147 |
148 | def run(self):
149 | run_batch(name=self.name, **self.kwargs)
150 |
151 | def output(self):
152 | output_batch(name=self.name, ydata=self.ydata, **self.kwargs)
153 | draw_figure(name=self.name, xlabel=self.xlabel, xdata=self.xdata, ydata=self.ydata, legend=self.legend)
154 |
--------------------------------------------------------------------------------
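Note: the `Experiment` helper above is presumably what the Octopus `figure-*.py` scripts drive. A minimal usage sketch, with hypothetical parameter values and assuming the definitions above are in scope:
```
# Hypothetical sweep: run_batch() crosses every list-valued kwarg via
# itertools.product, so this example runs 2 modes x 3 unit sizes = 6 tests.
exp = Experiment(name="example-octopus",
                 xlabel="Unit size",
                 xdata=["4k", "16k", "64k"],
                 legend=["TeRM", "ODP"],          # must follow the order of `mode`
                 mode=["pdp", "odp"],             # vector kwarg -> swept
                 unit_size=["4k", "16k", "64k"],  # vector kwarg -> swept
                 write=0)                         # scalar kwarg -> fixed per run
exp.run()     # writes logs under ./output/log/example-octopus/
exp.output()  # parses bandwidth and writes ./output/example-octopus.{txt,png}
```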
/ae/scripts/reset-memc.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | addr="10.0.2.181"
4 | port="23333"
5 |
6 | # kill old
7 | # ssh ${addr} "pkill -f memcached; memcached -u root -l ${addr} -p ${port} -c 10000 -d -P /tmp/memcached.pid"
8 | ssh ${addr} "cat /tmp/memcached.pid | xargs kill > /dev/null 2>&1; sleep 1; memcached -u root -l ${addr} -p ${port} -c 10000 -d -P /tmp/memcached.pid"
9 | ssh ${addr} "cat /tmp/memcached.pid | xargs kill > /dev/null 2>&1; sleep 1; memcached -u root -l ${addr} -p ${port} -c 10000 -d -P /tmp/memcached.pid"
10 |
11 | sleep 1
12 |
13 | # init
14 | echo -e "set nr_nodes 0 0 1\r\n0\r\nquit\r" | nc ${addr} ${port}
15 |
--------------------------------------------------------------------------------
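The final line initializes the `nr_nodes` key used for cluster-metadata synchronization. For reference, an equivalent of that `nc` invocation in Python; the address and port are the hard-coded values above and should match your own cluster:
```
# memcached text protocol: set nr_nodes (flags=0, exptime=0, 1 byte) to "0".
import socket

addr, port = "10.0.2.181", 23333
with socket.create_connection((addr, port)) as s:
    s.sendall(b"set nr_nodes 0 0 1\r\n0\r\nquit\r\n")
    print(s.recv(128).decode().strip())  # expect "STORED"
```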
/ae/scripts/term_eval.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 | def run_test(write_percent = 0, skewness_100 = 99, server_memory_gb = 32, ssd = "nvme4n1",
5 | mode = "pdp", nr_server_threads = 16, nr_client_threads = 64, sz_unit = "4k",
6 | verify = 0, dynamic = False, log_dir="."):
7 | if mode == "pin":
8 | server_memory_gb = 0
9 | running_seconds = 120 if dynamic else 70
10 |
11 | os.system(f"mkdir -p {log_dir}")
12 | log_file = f"{log_dir}/write.{write_percent},skew.{skewness_100},pm.{server_memory_gb},ssd.{ssd},mode.{mode},sth.{nr_server_threads},cth.{nr_client_threads},unit.{sz_unit}.log"
13 |
14 | comment_mark = "#" if nr_client_threads == 1 else ""
15 |
16 | test_file_content = f"""
17 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh"
18 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches;
19 | echo 100 > /proc/sys/vm/dirty_ratio;
20 | echo 0 > /proc/sys/kernel/numa_balancing;
21 | systemctl start opensmd;
22 | export LD_PRELOAD=/home/yz/workspace/TeRM/ae/bin/libpdp.so;
23 | export PDP_server_rpc_threads={nr_server_threads} PDP_server_mmap_dev={ssd} PDP_server_memory_gb={server_memory_gb};
24 | export PDP_mode={mode};\"\"\"
25 | suffix_cmd = "--nr_nodes={2 if nr_client_threads == 1 else 3} --running_seconds={running_seconds} --sz_server_mr=64g --write_percent={write_percent} --verify={verify} --skewness_100={skewness_100} --sz_unit={sz_unit} --nr_client_threads={1 if nr_client_threads == 1 else int(nr_client_threads / 2)} --hotspot_switch_second={60 if dynamic else 0};"
26 |
27 | ## server process
28 | [[pass]]
29 | name = "s0"
30 | host = "node184"
31 | path = "/home/yz/workspace/TeRM/ae"
32 | cmd = \"\"\"./scripts/ins-pch.sh;
33 | systemctl restart openibd;
34 | ./scripts/limit-mem.sh 0;
35 | export PDP_is_server=1;
36 | ./bin/perf --node_id=0\"\"\"
37 |
38 | ## below are clients
39 | [[pass]]
40 | name = "c0"
41 | host = "node166"
42 | path = "/home/yz/workspace/TeRM/ae"
43 | cmd = "./bin/perf --node_id=1"
44 |
45 | {comment_mark} [[pass]]
46 | {comment_mark} name = "c1"
47 | {comment_mark} host = "node168"
48 | {comment_mark} path = "/home/yz/workspace/TeRM/ae"
49 | {comment_mark} cmd = "./bin/perf --node_id=2"
50 |
51 | """
52 | with open(f"{log_dir}/test.toml", "w") as f:
53 | f.write(test_file_content)
54 |
55 | cmd = f"""
56 | date | tee -a {log_file};
57 | /home/yz/workspace/TeRM/ae/scripts/bootstrap.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file}
58 | """
59 | os.system(cmd)
60 |
61 | def run_octopus(mode = "pdp", rw = 0, unit_size = "4k", log_dir = "."):
62 | server_memory_gb = 0 if mode == "pin" else 18
63 |
64 | os.system(f"mkdir -p {log_dir}")
65 | log_file = f"{log_dir}/write.{rw},mode.{mode},unit.{unit_size}.log"
66 |
67 | test_file_content = f"""
68 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh"
69 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches;
70 | echo 100 > /proc/sys/vm/dirty_ratio;
71 | echo 0 > /proc/sys/kernel/numa_balancing;
72 | export LD_PRELOAD="/home/yz/workspace/TeRM/ae/bin/libpdp.so /home/yz/workspace/TeRM/ae/bin/octopus/libnrfsc.so";
73 | export PDP_server_rpc_threads=16 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb={server_memory_gb};
74 | export PDP_mode={mode};\"\"\"
75 | suffix_cmd = ""
76 |
77 | ## server process
78 | [[pass]]
79 | name = "s0"
80 | host = "node184"
81 | path = "/home/yz/workspace/TeRM/ae/bin/octopus"
82 | cmd = \"\"\"../../scripts/ins-pch.sh;
83 | systemctl restart openibd;
84 | ../../scripts/limit-mem.sh;
85 | export PDP_is_server=1;
86 | ./dmfs\"\"\"
87 |
88 | ## below are clients
89 | [[pass]]
90 | name = "c0"
91 | host = "node166"
92 | path = "/home/yz/workspace/TeRM/ae/bin/octopus"
93 | cmd = "mpiexec --allow-run-as-root --np 32 /bin/bash -c './mpibw --write={rw} --unit_size={unit_size} --seconds=70 --zipf_theta=0.99 --file_size=1g' "
94 | """
95 | with open(f"{log_dir}/test.toml", "w") as f:
96 | f.write(test_file_content)
97 |
98 | cmd = f"""
99 | date | tee -a {log_file};
100 | /home/yz/workspace/TeRM/ae/scripts/bootstrap-octopus.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file}
101 | """
102 | os.system(cmd)
103 |
104 | def extract_throughput(write_percent = 0, skewness_100 = 99, server_memory_gb = 32,
105 | ssd = "nvme4n1", mode = "pdp", nr_server_threads = 16,
106 | nr_client_threads = 64, sz_unit = "4k", dynamic = False, log_dir = "."):
107 |
108 | if mode == "pin":
109 | server_memory_gb = 0
110 |
111 | sampling_start = 0 if dynamic else 5
112 | sampling_seconds = 120 if dynamic else 60
113 |
114 | log_file = f"{log_dir}/write.{write_percent},skew.{skewness_100},pm.{server_memory_gb},ssd.{ssd},mode.{mode},sth.{nr_server_threads},cth.{nr_client_threads},unit.{sz_unit}.log"
115 | with open(log_file, "r") as f:
116 | thpt_list = []
117 | current_thpt_list = []
118 |
119 | new_flag = False
120 | dual_client = False
121 | while True:
122 | line = f.readline()
123 | if not line:
124 | break
125 | if "epoch" not in line:
126 | continue
127 | if "epoch 1 " in line:
128 | if not new_flag:
129 | thpt_list = current_thpt_list
130 | current_thpt_list = []
131 | else:
132 | dual_client = True
133 | new_flag = True
134 | else:
135 | new_flag = False
136 |
137 | line = line.replace(",", "")
138 | tokens = re.findall(r"cnt=(\d+)", line)
139 | current_thpt_list.append(int(tokens[0]))
140 | thpt_list = current_thpt_list
141 | if dual_client:
142 | thpt_list = [thpt_list[i] + thpt_list[i + 1] for i in range(0, 2 * (sampling_start + sampling_seconds), 2)]
143 | thpt_list = thpt_list[(sampling_start) : (sampling_start + sampling_seconds)]
144 | return thpt_list
145 |
146 | def extract_latency(write_percent = 0, skewness_100 = 99, server_memory_gb = 32,
147 | ssd = "nvme4n1", mode = "pdp", nr_server_threads = 16,
148 | nr_client_threads = 64, sz_unit = "4k",
149 | sampling_start = 5, sampling_seconds = 60, label="p99", log_dir = "."):
150 |
151 | if mode == "pin":
152 | server_memory_gb = 0
153 |
154 | log_file = f"{log_dir}/write.{write_percent},skew.{skewness_100},pm.{server_memory_gb},ssd.{ssd},mode.{mode},sth.{nr_server_threads},cth.{nr_client_threads},unit.{sz_unit}.log"
155 | with open(log_file, "r") as f:
156 | result_list = []
157 | current_result_list = []
158 |
159 | new_flag = False
160 | dual_client = False
161 | while True:
162 | line = f.readline()
163 | if not line:
164 | break
165 | if "epoch" not in line:
166 | continue
167 | if "epoch 1 " in line:
168 | if not new_flag:
169 | result_list = current_result_list
170 | current_result_list = []
171 | else:
172 | dual_client = True
173 | new_flag = True
174 | else:
175 | new_flag = False
176 |
177 | line = line.replace(",", "")
178 | tokens = re.findall(rf"{label}=(\d+)", line)
179 | current_result_list.append(int(tokens[0]))
180 | result_list = current_result_list
181 | if dual_client:
182 | result_list = [(result_list[i] + result_list[i + 1]) / 2 for i in range(0, 2 * (sampling_start + sampling_seconds), 2)]
183 | result_list = result_list[(sampling_start) : (sampling_start + sampling_seconds)]
184 | result_list = [i for i in result_list if i < 120_000_000_000]
185 | return result_list
186 |
187 |
188 | def avg(l):
189 | v = sum(l) / len(l)
190 | return v
191 |
192 | def format_digits(n):
193 | return f"{n:,.2f}"
194 |
195 | def split_kwargs(kwargs):
196 | vector_kwargs = {}
197 | scalar_kwargs = {}
198 | for k, v in kwargs.items():
199 | if type(v) == list:
200 | vector_kwargs[k] = v
201 | else:
202 | scalar_kwargs[k] = v
203 | return vector_kwargs, scalar_kwargs
204 |
205 |
206 | def run_batch(name : str, **kwargs):
207 | log_dir = f"./output/log/{name}"
208 | os.system(f"mkdir -p {log_dir}")
209 |
210 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs)
211 |
212 | from itertools import product
213 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values()))
214 | for idx, kv in enumerate(l):
215 | str = f"### Running ({idx + 1}/{len(l)}): {kv} ###"
216 | with open("./output/current-running.txt", "w") as f:
217 | f.write(f"{name}\n{str}\n")
218 | print(str)
219 | run_test(**kv, **scalar_kwargs, log_dir=log_dir)
220 |
221 |
222 | def output_batch(name : str, dynamic = False, ydata = [],
223 | **kwargs):
224 | log_dir = f"./output/log/{name}"
225 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs)
226 | from itertools import product
227 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values()))
228 | output = ""
229 |
230 | for idx, kv in enumerate(l):
231 | thpt_list = extract_throughput(**kv, **scalar_kwargs, dynamic=dynamic, log_dir=log_dir)
232 | str = f"({idx + 1}/{len(l)}) {kv}: throughput (ops/s)="
233 | if dynamic:
234 | str += f"{thpt_list}"
235 | ydata.append(thpt_list)
236 | else:
237 | avg_v = avg(thpt_list)
238 | ydata.append(avg_v)
239 | str += f"{format_digits(avg_v)}"
240 | output += str + "\n"
241 |
242 | print(output)
243 | with open(f"./output/{name}.txt", "w") as f:
244 | f.write(output)
245 |
246 |
247 | def draw_figure(name : str, xlabel : str, xdata : list, ydata : list, legend : list[str]):
248 | import matplotlib.pyplot as plt
249 | xlen = len(xdata)
250 | if name == "figure-11":
251 | plt.plot(xdata, ydata[0], label = legend[0])
252 | plt.plot(xdata, ydata[1], label = legend[1])
253 | plt.plot(xdata, ydata[2], label = legend[2])
254 | plt.plot(xdata, ydata[3], label = legend[3])
255 | else:
256 | for idx, label in enumerate(legend):
257 | match label:
258 | case "PIN":
259 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-D", mfc="w", label = "PIN")
260 | case "ODP":
261 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-^", mfc="w", label = "ODP")
262 | case "RPC":
263 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-x", mfc="w", label = "RPC")
264 | case "TeRM":
265 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = "TeRM")
266 | case _:
267 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = label)
268 | plt.xlabel(xlabel)
269 | plt.ylabel("Throughput (ops/s)")
270 | plt.legend()
271 | print(f"written {name}.png")
272 | plt.savefig(f"./output/{name}.png")
273 |
274 | class Experiment:
275 | def __init__(self, name : str,
276 | xlabel: str, xdata: list, legend: list[str],
277 | **kwargs):
278 | self.name = name
279 | self.xlabel = xlabel
280 | self.xdata = xdata
281 | self.ydata = []
282 | self.legend = legend
283 | self.kwargs = kwargs
284 |
285 | def run(self):
286 | run_batch(name=self.name, **self.kwargs)
287 |
288 | def output(self):
289 | output_batch(name=self.name, ydata=self.ydata, **self.kwargs)
290 | draw_figure(name=self.name, xlabel=self.xlabel, xdata=self.xdata, ydata=self.ydata, legend=self.legend)
291 |
--------------------------------------------------------------------------------
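`run_batch()` and `output_batch()` above share one sweep mechanism: `split_kwargs()` separates list-valued kwargs from scalars, and `itertools.product` crosses the lists into one `run_test()`/`extract_throughput()` call per combination. A standalone illustration with made-up values:
```
# Standalone illustration of the sweep expansion (values are made up).
from itertools import product

kwargs = {"mode": ["pin", "odp", "rpc", "pdp"],  # vector -> swept
          "write_percent": [0, 50, 100],         # vector -> swept
          "server_memory_gb": 32}                # scalar -> fixed

vector_kwargs = {k: v for k, v in kwargs.items() if isinstance(v, list)}
scalar_kwargs = {k: v for k, v in kwargs.items() if not isinstance(v, list)}

runs = [dict(zip(vector_kwargs.keys(), values))
        for values in product(*vector_kwargs.values())]
print(len(runs))  # 4 modes x 3 write ratios = 12 runs
print(runs[0])    # {'mode': 'pin', 'write_percent': 0}
# run_batch() then invokes run_test(**runs[i], **scalar_kwargs, log_dir=...).
```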
/ae/scripts/xstore_eval.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 |
4 | def run_test(mode : str = "pdp", zipf_theta = 0, log_dir : str = "."):
5 | server_memory_gb = 0 if mode == "pin" else 16
6 |
7 | os.system(f"mkdir -p {log_dir}")
8 | log_file = f"{log_dir}/mode.{mode},zipf_theta.{zipf_theta}.log"
9 |
10 | test_file_content = f"""
11 | init_cmd = "/home/yz/workspace/TeRM/ae/scripts/reset-memc.sh"
12 | prefix_cmd = \"\"\"echo 3 > /proc/sys/vm/drop_caches;
13 | echo 100 > /proc/sys/vm/dirty_ratio;
14 | echo 0 > /proc/sys/kernel/numa_balancing;
15 | systemctl start opensmd;
16 | export LD_PRELOAD="/home/yz/workspace/TeRM/ae/bin/libpdp.so";
17 | export LD_LIBRARY_PATH="/home/yz/workspace/TeRM/ae/bin/xstore";
18 | export PDP_server_rpc_threads=16 PDP_server_mmap_dev=nvme4n1 PDP_server_memory_gb={server_memory_gb};
19 | export PDP_mode={mode};\"\"\"
20 | suffix_cmd = "--need_hash=false --cache_sz_m=327680 --server_host=node184 --total_accts=100'000'000 --eval_type=sc --workloads=ycsbc --concurrency=1 --use_master=true --uniform=false --running_time=60 --zipf_theta={zipf_theta} \
21 | --undefok=concurrency,workloads,eval_type,total_accts,server_host,cache_sz_m,need_hash,use_master,uniform,running_time,zipf_theta"
22 |
23 | ## server process
24 | [[pass]]
25 | name = "s0"
26 | host = "node184"
27 | path = "/home/yz/workspace/TeRM/ae/bin/xstore"
28 | cmd = \"\"\"../../scripts/ins-pch.sh;
29 | systemctl restart openibd;
30 | ../../scripts/limit-mem.sh;
31 | export PDP_is_server=1;
32 | ./fserver --threads=16 --db_type=ycsb --id=0 --ycsb_num=100'000'000 --no_train=false --step=2 --model_config=./ycsb-model.toml --enable_odp=true\"\"\"
33 |
34 | ## clients
35 | [[pass]]
36 | name = "c0"
37 | host = "node166"
38 | path = "/home/yz/workspace/TeRM/ae/bin/xstore"
39 | cmd = "./ycsb --threads=32 --id=1"
40 |
41 | [[pass]]
42 | name = "c1"
43 | host = "node168"
44 | path = "/home/yz/workspace/TeRM/ae/bin/xstore"
45 | cmd = "./ycsb --threads=32 --id=2"
46 |
47 | # master process to collect results
48 | [[pass]]
49 | name = "m0"
50 | host = "node181"
51 | path = "/home/yz/workspace/TeRM/ae/bin/xstore"
52 | cmd = "sleep 5; ./master --client_config=./cs.toml --epoch=90 --nclients=2"
53 | """
54 | with open(f"{log_dir}/test.toml", "w") as f:
55 | f.write(test_file_content)
56 |
57 | cmd = f"""
58 | date | tee -a {log_file};
59 | /home/yz/workspace/TeRM/ae/scripts/bootstrap.py -f {log_dir}/test.toml 2>&1 | tee -a {log_file}
60 | """
61 | os.system(cmd)
62 |
63 | def extract_result(mode : str = "pdp", zipf_theta = 0, log_dir : str = "."):
64 | log_file = f"{log_dir}/mode.{mode},zipf_theta.{zipf_theta}.log"
65 | with open(log_file, "r") as f:
66 | result_list = []
67 | current_result_list = []
68 |
69 | while True:
70 | line = f.readline()
71 | if not line:
72 | break
73 | if "at epoch" not in line:
74 | continue
75 | if "epoch 0 " in line:
76 | result_list = current_result_list
77 | current_result_list = []
78 |
79 | line = line.replace(",", "")
80 | tokens = re.findall(r"thpt: (\d+\.\d+)", line)
81 | current_result_list.append(float(tokens[0]))
82 | result_list = current_result_list
83 | avg_v = avg(result_list)
84 |
85 | return avg_v
86 |
87 | def avg(l : list):
88 | return sum(l) / len(l)
89 |
90 | def format_digits(n):
91 | return f"{n:,.2f}"
92 |
93 | def split_kwargs(kwargs):
94 | vector_kwargs = {}
95 | scalar_kwargs = {}
96 | for k, v in kwargs.items():
97 | if type(v) == list:
98 | vector_kwargs[k] = v
99 | else:
100 | scalar_kwargs[k] = v
101 | return vector_kwargs, scalar_kwargs
102 |
103 |
104 | def run_batch(name : str, **kwargs):
105 | log_dir = f"./output/log/{name}"
106 | os.system(f"mkdir -p {log_dir}")
107 |
108 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs)
109 |
110 | from itertools import product
111 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values()))
112 | for idx, kv in enumerate(l):
113 | str = f"### Running ({idx + 1}/{len(l)}): {kv} ###"
114 | with open("./output/current-running.txt", "w") as f:
115 | f.write(f"{name}\n{str}\n")
116 | print(str)
117 | run_test(**kv, **scalar_kwargs, log_dir=log_dir)
118 |
119 |
120 | def output_batch(name : str, ydata = [], **kwargs):
121 | log_dir = f"./output/log/{name}"
122 | vector_kwargs, scalar_kwargs = split_kwargs(kwargs)
123 | from itertools import product
124 | l = list(dict(zip(vector_kwargs.keys(), values)) for values in product(*vector_kwargs.values()))
125 | output = ""
126 |
127 | for idx, kv in enumerate(l):
128 | result = extract_result(**kv, **scalar_kwargs, log_dir=log_dir)
129 | str = f"({idx + 1}/{len(l)}) {kv}: throughput (ops/s)="
130 | ydata.append(result)
131 | str += f"{format_digits(result)}"
132 | output += str + "\n"
133 |
134 | print(output)
135 | with open(f"./output/{name}.txt", "w") as f:
136 | f.write(output)
137 |
138 |
139 | def draw_figure(name : str, xlabel : str, xdata : list, ydata : list, legend : list[str]):
140 | import matplotlib.pyplot as plt
141 | xlen = len(xdata)
142 |
143 | for idx, label in enumerate(legend):
144 | match label:
145 | case "PIN":
146 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-D", mfc="w", label = "PIN")
147 | case "ODP":
148 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-^", mfc="w", label = "ODP")
149 | case "RPC":
150 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-x", mfc="w", label = "RPC")
151 | case "TeRM":
152 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = "TeRM")
153 | case _:
154 | plt.plot(xdata, ydata[idx * xlen : (idx + 1) * xlen], "-o", label = label)
155 |
156 | plt.xlabel(xlabel)
157 | plt.ylabel("Throughput (ops/s)")
158 | plt.legend()
159 | print(f"written {name}.png")
160 | plt.savefig(f"./output/{name}.png")
161 |
162 | class Experiment:
163 | def __init__(self, name : str,
164 | xlabel: str, xdata: list, legend: list[str],
165 | **kwargs):
166 | self.name = name
167 | self.xlabel = xlabel
168 | self.xdata = xdata
169 | self.ydata = []
170 | self.legend = legend
171 | self.kwargs = kwargs
172 |
173 | def run(self):
174 | run_batch(name=self.name, **self.kwargs)
175 |
176 | def output(self):
177 | output_batch(name=self.name, ydata=self.ydata, **self.kwargs)
178 | draw_figure(name=self.name, xlabel=self.xlabel, xdata=self.xdata, ydata=self.ydata, legend=self.legend)
179 |
--------------------------------------------------------------------------------
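`extract_result()` above keeps only the `thpt:` samples from the last restart and averages them. A tiny self-contained sketch of the token extraction on a synthetic line (the real wording of the XStore master output may differ; only the `at epoch` marker and the `thpt: <float>` token matter):
```
# Synthetic sample line; mirrors the replace()/findall() steps in extract_result().
import re

line = "some-prefix at epoch 3 thpt: 1,234,567.89 reqs/s"  # made-up sample
line = line.replace(",", "")                # strip thousands separators first
tokens = re.findall(r"thpt: (\d+\.\d+)", line)
print(float(tokens[0]))                     # 1234567.89
```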
/app/octopus.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/app/octopus.zip
--------------------------------------------------------------------------------
/app/xstore.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/app/xstore.zip
--------------------------------------------------------------------------------
/driver/driver.patch:
--------------------------------------------------------------------------------
1 | diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
2 | index aead24c..a6f08f2 100644
3 | --- a/drivers/infiniband/core/umem_odp.c
4 | +++ b/drivers/infiniband/core/umem_odp.c
5 | @@ -49,6 +49,89 @@
6 |
7 | #include "uverbs.h"
8 |
9 | +#ifdef PDP_MR
10 | +void set_or_clear_bitmap_reserved(void *addr, uint32_t bits, bool set)
11 | +{
12 | + size_t bitmap_bytes = BITS_TO_LONGS(bits) * sizeof(long);
13 | + size_t i;
14 | + for (i = 0; i < bitmap_bytes; i += PAGE_SIZE) {
15 | + struct page *page = vmalloc_to_page(addr + i);
16 | + set ? SetPageReserved(page) : ClearPageReserved(page);
17 | + }
18 | +}
19 | +
20 | +static void *my_bitmap_alloc(unsigned int bits)
21 | +{
22 | + size_t bytes = BITS_TO_LONGS(bits) * sizeof(long);
23 | + return vzalloc(round_up(bytes, PAGE_SIZE));
24 | +}
25 | +
26 | +static void my_bitmap_free(void *p)
27 | +{
28 | + vfree(p);
29 | +}
30 | +
31 | +static int init_pdp(struct ib_umem_odp *umem_odp, size_t npfns)
32 | +{
33 | + char *magic_page_addr;
34 | +
35 | + umem_odp->nr_pfns = npfns;
36 | +
37 | + umem_odp->present_bitmap = my_bitmap_alloc(npfns);
38 | + // pr_info("umem_odp->present_bitmap=%px", umem_odp->present_bitmap);
39 | + if (!umem_odp->present_bitmap) {
40 | + return -ENOMEM;
41 | + }
42 | + if (!PAGE_ALIGNED(umem_odp->present_bitmap)) {
43 | + pr_err("pdp bitmap is not page-aligned!\n");
44 | + my_bitmap_free(umem_odp->present_bitmap);
45 | + return -ENOMEM;
46 | + }
47 | + // a normal ODP does not need the magic page.
48 | + if (!umem_odp->is_pdp)
49 | + return 0;
50 | + if (umem_odp->page_shift != PAGE_SHIFT) {
51 | + pr_err("cannot support page_shift(%u) != PAGE_SHIFT(%u)!\n", umem_odp->page_shift, PAGE_SHIFT);
52 | + return -EFAULT;
53 | + }
54 | + umem_odp->magic_page = alloc_page(GFP_KERNEL);
55 | + if (!umem_odp->magic_page) {
56 | + my_bitmap_free(umem_odp->present_bitmap);
57 | + return -ENOMEM;
58 | + }
59 | +
60 | + magic_page_addr = kmap(umem_odp->magic_page);
61 | + memset(magic_page_addr, 0xfd, PAGE_SIZE);
62 | + kunmap(umem_odp->magic_page);
63 | +
64 | + umem_odp->magic_page_dma = ib_dma_map_page(umem_odp->umem.ibdev, umem_odp->magic_page,
65 | + 0, PAGE_SIZE, DMA_BIDIRECTIONAL) | ODP_READ_ALLOWED_BIT;
66 | + if (ib_dma_mapping_error(umem_odp->umem.ibdev, umem_odp->magic_page_dma)) {
67 | + my_bitmap_free(umem_odp->present_bitmap);
68 | + __free_page(umem_odp->magic_page);
69 | + return -EFAULT;
70 | + }
71 | +
72 | + set_or_clear_bitmap_reserved(umem_odp->present_bitmap, npfns, true);
73 | +
74 | + return 0;
75 | +}
76 | +
77 | +static void exit_pdp(struct ib_umem_odp *umem_odp)
78 | +{
79 | + set_or_clear_bitmap_reserved(umem_odp->present_bitmap, umem_odp->nr_pfns, false);
80 | +
81 | + my_bitmap_free(umem_odp->present_bitmap);
82 | + // a normal ODP does not need the magic page.
83 | + if (!umem_odp->is_pdp) {
84 | + return;
85 | + }
86 | + // we mask ODP_DMA_ADDR_MASK here, to ignore ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT
87 | + ib_dma_unmap_page(umem_odp->umem.ibdev, umem_odp->magic_page_dma & ODP_DMA_ADDR_MASK, PAGE_SIZE, DMA_BIDIRECTIONAL);
88 | + __free_page(umem_odp->magic_page);
89 | +}
90 | +#endif
91 | +
92 | static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
93 | const struct mmu_interval_notifier_ops *ops)
94 | {
95 | @@ -94,10 +177,20 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
96 | start, end - start, ops);
97 | if (ret)
98 | goto out_dma_list;
99 | +
100 | +#ifdef PDP_MR
101 | + ret = init_pdp(umem_odp, npfns);
102 | + if (ret)
103 | + goto out_mmu_notifier;
104 | +#endif
105 | }
106 |
107 | return 0;
108 |
109 | +#ifdef PDP_MR
110 | +out_mmu_notifier:
111 | + mmu_interval_notifier_remove(&umem_odp->notifier);
112 | +#endif
113 | out_dma_list:
114 | kvfree(umem_odp->dma_list);
115 | out_pfn_list:
116 | @@ -236,6 +329,12 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
117 | if (!umem_odp)
118 | return ERR_PTR(-ENOMEM);
119 |
120 | +#ifdef PDP_MR
121 | + umem_odp->is_pdp = (access & PDP_ACCESS_PDP);
122 | + pr_info("[%s]: addr=0x%lx, size=0x%lx, mode=%s\n",
123 | + __func__, addr, size, umem_odp->is_pdp ? "PDP" : "ODP");
124 | +#endif
125 | +
126 | umem_odp->umem.ibdev = device;
127 | umem_odp->umem.length = size;
128 | umem_odp->umem.address = addr;
129 | @@ -278,6 +377,13 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
130 | mmu_interval_notifier_remove(&umem_odp->notifier);
131 | kvfree(umem_odp->dma_list);
132 | kvfree(umem_odp->pfn_list);
133 | +#ifdef PDP_MR
134 | + if (umem_odp->pde) {
135 | + proc_remove(umem_odp->pde);
136 | + umem_odp->pde = NULL;
137 | + }
138 | + exit_pdp(umem_odp);
139 | +#endif
140 | }
141 | put_pid(umem_odp->tgid);
142 | kfree(umem_odp);
143 | @@ -304,6 +410,19 @@ static int ib_umem_odp_map_dma_single_page(
144 | struct ib_device *dev = umem_odp->umem.ibdev;
145 | dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
146 |
147 | +#ifdef PDP_MR
148 | + // if (access_mask & ODP_WRITE_ALLOWED_BIT)
149 | + set_bit(dma_index, umem_odp->present_bitmap);
150 | + if (!umem_odp->is_pdp && *dma_addr) {
151 | + /*
152 | + * If the page is already dma mapped it means it went through
153 | + * a non-invalidating trasition, like read-only to writable.
154 | + * Resync the flags.
155 | + */
156 | + *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
157 | + return 0;
158 | + }
159 | +#else
160 | if (*dma_addr) {
161 | /*
162 | * If the page is already dma mapped it means it went through
163 | @@ -313,7 +432,10 @@ static int ib_umem_odp_map_dma_single_page(
164 | *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
165 | return 0;
166 | }
167 | +#endif
168 |
169 | +// yz: dma_addr might be magic!
170 | +// we simply map the page again.
171 | *dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
172 | DMA_BIDIRECTIONAL);
173 | if (ib_dma_mapping_error(dev, *dma_addr)) {
174 | @@ -390,7 +512,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
175 | }
176 |
177 | range.hmm_pfns = &(umem_odp->pfn_list[pfn_start_idx]);
178 | - timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
179 | + timeout = jiffies + msecs_to_jiffies(10000); // HMM_RANGE_DEFAULT_TIMEOUT 1000
180 |
181 | retry:
182 | current_seq = range.notifier_seq =
183 | @@ -426,7 +548,17 @@ retry:
184 | WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
185 | } else {
186 | if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) {
187 | +#ifdef PDP_MR
188 | + if (umem_odp->is_pdp) {
189 | + dma_addr_t *dma = &umem_odp->dma_list[dma_index];
190 | + WARN_ON(*dma && *dma != umem_odp->magic_page_dma);
191 | + *dma = umem_odp->magic_page_dma; // ODP_READ_ALLOWED_BIT has been set.
192 | + } else {
193 | + WARN_ON(umem_odp->dma_list[dma_index]);
194 | + }
195 | +#else
196 | WARN_ON(umem_odp->dma_list[dma_index]);
197 | +#endif
198 | continue;
199 | }
200 | access_mask = ODP_READ_ALLOWED_BIT;
201 | @@ -487,6 +619,11 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
202 | idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
203 | dma = umem_odp->dma_list[idx];
204 |
205 | +#ifdef PDP_MR
206 | + clear_bit(idx, umem_odp->present_bitmap);
207 | + if (umem_odp->is_pdp && is_magic_page_dma(umem_odp, dma)) continue;
208 | +#endif
209 | +
210 | /* The access flags guaranteed a valid DMA address in case was NULL */
211 | if (dma) {
212 | unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
213 | @@ -509,7 +646,15 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
214 | */
215 | set_page_dirty(head_page);
216 | }
217 | +#ifdef PDP_MR
218 | + if (umem_odp->is_pdp) {
219 | + umem_odp->dma_list[idx] = umem_odp->magic_page_dma;
220 | + } else {
221 | + umem_odp->dma_list[idx] = 0;
222 | + }
223 | +#else
224 | umem_odp->dma_list[idx] = 0;
225 | +#endif
226 | umem_odp->npages--;
227 | }
228 | }
229 | diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
230 | index 368f3c4..4fdb739 100644
231 | --- a/drivers/infiniband/hw/mlx5/mr.c
232 | +++ b/drivers/infiniband/hw/mlx5/mr.c
233 | @@ -1442,6 +1442,69 @@ static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
234 | return &mr->ibmr;
235 | }
236 |
237 | +#ifdef PDP_MR
238 | +static vm_fault_t proc_fault(struct vm_fault *vmf)
239 | +{
240 | + struct ib_umem_odp *odp = vmf->vma->vm_private_data;
241 | + void *kaddr = (void *)odp->present_bitmap + vmf->pgoff * PAGE_SIZE;
242 | + struct page *page = vmalloc_to_page(kaddr);
243 | +
244 | + get_page(page);
245 | + vmf->page = page;
246 | +
247 | + return 0;
248 | +}
249 | +
250 | +static struct vm_operations_struct proc_vm_ops = {
251 | + .fault = proc_fault
252 | +};
253 | +static int proc_mmap(struct file *file, struct vm_area_struct *vma)
254 | +{
255 | + struct ib_umem_odp *odp = pde_data(file_inode(file));
256 | +
257 | + if (!odp->present_bitmap) {
258 | + pr_err("odp->present_bitmap=NULL\n");
259 | + return -EINVAL;
260 | + }
261 | + pr_info("[odp mr bitmap]: mmap from virt=0x%px\n", odp->present_bitmap);
262 | +
263 | + vma->vm_private_data = odp;
264 | + vma->vm_ops = &proc_vm_ops;
265 | +
266 | + return 0;
267 | +}
268 | +
269 | +static ssize_t proc_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
270 | +{
271 | + struct ib_umem_odp *odp = pde_data(file_inode(file));
272 | + if (copy_to_user(buf, &odp->nr_pfns, sizeof(odp->nr_pfns))) {
273 | + return -EINVAL;
274 | + }
275 | +
276 | + return sizeof(odp->nr_pfns);
277 | +}
278 | +
279 | +static const struct proc_ops proc_ops = {
280 | + .proc_mmap = proc_mmap,
281 | + .proc_read = proc_read
282 | +};
283 | +
284 | +static int proc_init(u32 key, struct ib_umem_odp *odp)
285 | +{
286 | + struct proc_dir_entry *entry;
287 | + char name[256];
288 | +
289 | + sprintf(name, "pdp_bitmap_0x%x", key);
290 | +
291 | + entry = proc_create_data(name, S_IFREG | S_IRUGO, NULL, &proc_ops, odp);
292 | + if (!entry) {
293 | + return -ENOMEM;
294 | + }
295 | + odp->pde = entry;
296 | +
297 | + return 0;
298 | +}
299 | +#endif
300 | static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length,
301 | u64 iova, int access_flags,
302 | struct ib_udata *udata)
303 | @@ -1493,6 +1556,13 @@ static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length,
304 | err = mlx5_ib_init_odp_mr(mr);
305 | if (err)
306 | goto err_dereg_mr;
307 | +
308 | +#ifdef PDP_MR
309 | + err = proc_init(mr->mmkey.key, odp);
310 | + if (err)
311 | + goto err_dereg_mr;
312 | +#endif
313 | +
314 | return &mr->ibmr;
315 |
316 | err_dereg_mr:
317 | diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
318 | index e3a3e7a..5274801 100644
319 | --- a/drivers/infiniband/hw/mlx5/odp.c
320 | +++ b/drivers/infiniband/hw/mlx5/odp.c
321 | @@ -36,6 +36,9 @@
322 | #include
323 | #include
324 |
325 | +// yz fix bug
326 | +#include
327 | +
328 | #include "mlx5_ib.h"
329 | #include "cmd.h"
330 | #include "qp.h"
331 | @@ -167,6 +170,15 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
332 |
333 | for (i = 0; i < nentries; i++) {
334 | pa = odp->dma_list[idx + i];
335 | +#ifdef PDP_MR
336 | + if (odp->is_pdp) {
337 | + WARN_ON(pa == 0); // with PDP, pa should never be 0!
338 | + if ((pa & ODP_READ_ALLOWED_BIT) == 0) {
339 | + // read is not allowed => magic dma
340 | + pa = odp->magic_page_dma;
341 | + }
342 | + }
343 | +#endif
344 | pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
345 | }
346 | }
347 | @@ -269,12 +281,26 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
348 | * estimate the cost of another UMR vs. the cost of bigger
349 | * UMR.
350 | */
351 | +
352 | +
353 | +#ifdef PDP_MR
354 | + if ((umem_odp->dma_list[idx] & (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) &&
355 | + !(umem_odp->is_pdp && is_magic_page_dma(umem_odp, umem_odp->dma_list[idx]))) {
356 | +#else
357 | if (umem_odp->dma_list[idx] &
358 | (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
359 | +#endif
360 | if (!in_block) {
361 | blk_start_idx = idx;
362 | in_block = 1;
363 | }
364 | +#ifdef PDP_MR
365 | + if (umem_odp->is_pdp) {
366 | + umem_odp->dma_list[idx] &= (~ODP_READ_ALLOWED_BIT);
367 | + // clear the READ bit for magic_dma! READ is always set for a normal address
368 | + // we leave the dma_addr intact for correct unmap.
369 | + }
370 | +#endif
371 |
372 | /* Count page invalidations */
373 | invalidations += idx - blk_start_idx + 1;
374 | @@ -284,7 +310,12 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
375 | if (in_block && umr_offset == 0) {
376 | mlx5_ib_update_xlt(mr, blk_start_idx,
377 | idx - blk_start_idx, 0,
378 | +#ifdef PDP_MR
379 | + // set instead of clear
380 | + (umem_odp->is_pdp ? 0 : MLX5_IB_UPD_XLT_ZAP) |
381 | +#else
382 | MLX5_IB_UPD_XLT_ZAP |
383 | +#endif
384 | MLX5_IB_UPD_XLT_ATOMIC);
385 | in_block = 0;
386 | }
387 | @@ -293,7 +324,12 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
388 | if (in_block)
389 | mlx5_ib_update_xlt(mr, blk_start_idx,
390 | idx - blk_start_idx + 1, 0,
391 | +#ifdef PDP_MR
392 | + // set instead of clear
393 | + (umem_odp->is_pdp ? 0 : MLX5_IB_UPD_XLT_ZAP) |
394 | +#else
395 | MLX5_IB_UPD_XLT_ZAP |
396 | +#endif
397 | MLX5_IB_UPD_XLT_ATOMIC);
398 |
399 | mlx5_update_odp_stats(mr, invalidations, invalidations);
400 | @@ -568,17 +604,37 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
401 |
402 | if (odp->umem.writable && !downgrade)
403 | access_mask |= ODP_WRITE_ALLOWED_BIT;
404 | +
405 | + // yz fix: there is a bug here!
406 | + {
407 | + struct task_struct *owning_process = get_pid_task(odp->tgid, PIDTYPE_PID);
408 | + struct mm_struct *owning_mm = odp->umem.owning_mm;
409 | +
410 | + if (!owning_process) {
411 | + return -EINVAL;
412 | + }
413 | + if (!mmget_not_zero(owning_mm)) {
414 | + put_task_struct(owning_process);
415 | + return -EINVAL;
416 | + }
417 |
418 | - np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
419 | - if (np < 0)
420 | - return np;
421 | + np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
422 | + if (np < 0) {
423 | + mmput(owning_mm);
424 | + put_task_struct(owning_process);
425 | + return np;
426 | + }
427 |
428 | - /*
429 | - * No need to check whether the MTTs really belong to this MR, since
430 | - * ib_umem_odp_map_dma_and_lock already checks this.
431 | - */
432 | - ret = mlx5_ib_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
433 | - mutex_unlock(&odp->umem_mutex);
434 | + /*
435 | + * No need to check whether the MTTs really belong to this MR, since
436 | + * ib_umem_odp_map_dma_and_lock already checks this.
437 | + */
438 | + ret = mlx5_ib_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
439 | + mutex_unlock(&odp->umem_mutex);
440 | +
441 | + mmput(owning_mm);
442 | + put_task_struct(owning_process);
443 | + }
444 |
445 | if (ret < 0) {
446 | if (ret != -EAGAIN)
447 | diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
448 | index 0844c1d..15aaadd 100644
449 | --- a/include/rdma/ib_umem_odp.h
450 | +++ b/include/rdma/ib_umem_odp.h
451 | @@ -9,6 +9,12 @@
452 | #include
453 | #include
454 |
455 | +#define PDP_MR
456 | +#ifdef PDP_MR
457 | +#include
458 | +#define PDP_ACCESS_PDP (IB_ACCESS_RELAXED_ORDERING << 1) // caution if a future version uses this bit!
459 | +#endif
460 | +
461 | struct ib_umem_odp {
462 | struct ib_umem umem;
463 | struct mmu_interval_notifier notifier;
464 | @@ -42,6 +48,15 @@ struct ib_umem_odp {
465 | bool is_implicit_odp;
466 |
467 | unsigned int page_shift;
468 | +
469 | +#ifdef PDP_MR
470 | + bool is_pdp; // ODP / PDP
471 | + struct page *magic_page;
472 | + dma_addr_t magic_page_dma; // ODP_READ_ALLOWED_BIT is set!
473 | + unsigned long *present_bitmap; // 1: mapped in the RNIC page table, 0: unmapped
474 | + struct proc_dir_entry *pde;
475 | + u32 nr_pfns;
476 | +#endif
477 | };
478 |
479 | static inline struct ib_umem_odp *to_ib_umem_odp(struct ib_umem *umem)
480 | @@ -80,6 +95,16 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
481 |
482 | #define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
483 |
484 | +#ifdef PDP_MR
485 | +static inline bool is_magic_page_dma(struct ib_umem_odp *umem_odp, dma_addr_t addr)
486 | +{
487 | + WARN_ON(!umem_odp->is_pdp); // a normal ODP should never reach here.
488 | +
489 | + // we AND ODP_DMA_ADDR_MASK to ignore the lower two bits.
490 | + return (umem_odp->magic_page_dma & ODP_DMA_ADDR_MASK) == (addr & ODP_DMA_ADDR_MASK);
491 | +}
492 | +#endif
493 | +
494 | #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
495 |
496 | struct ib_umem_odp *
497 |
--------------------------------------------------------------------------------
/driver/mlnx-ofed-kernel-5.8-2.0.3.0.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/thustorage/TeRM/a94d44a59238d8d51266a7a2ea8bbf2daf3151d9/driver/mlnx-ofed-kernel-5.8-2.0.3.0.zip
--------------------------------------------------------------------------------
/libterm/.gitignore:
--------------------------------------------------------------------------------
1 | # editor and compiler
2 | .cache
3 | .idea
4 | .vscode
5 | build
6 | cmake-build-debug
7 | __pycache__
8 |
9 | # linux kernel
10 | *.cmd
11 | Module.symvers
12 | modules.order
13 | *.ko
14 | *.mod
15 | *.mod.c
16 | *.mod.o
17 | *.o
18 | .tmp*
19 |
20 | # log
21 | *.log
22 | ## ibdump
23 | *.pcap
24 | ## perf
25 | *.data
26 |
--------------------------------------------------------------------------------
/libterm/CMakeLists.txt:
--------------------------------------------------------------------------------
1 | cmake_minimum_required(VERSION 3.10)
2 | project(nic-odp)
3 |
4 | set(CMAKE_CXX_STANDARD 20)
5 | set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
6 | add_compile_definitions(RECORDER_VERBOSE=0 RECORDER_SINGLE_THREAD=0)
7 |
8 | option (SANITIZE "Turn on sanitization" OFF)
9 | if (SANITIZE)
10 | add_compile_options(-fsanitize=address)
11 | endif()
12 |
13 | # include
14 | include_directories(${PROJECT_SOURCE_DIR}/3rd-party/include)
15 | include_directories(${PROJECT_SOURCE_DIR}/include)
16 |
17 | # verbs-pdp
18 | file(GLOB_RECURSE SRCS ${PROJECT_SOURCE_DIR}/ibverbs-pdp/*.cc)
19 | add_library(pdp SHARED ${SRCS})
20 | find_package(fmt)
21 | target_link_libraries(pdp fmt::fmt-header-only ibverbs memcached uring glog boost_coroutine aio)
22 |
23 | # test
24 | file(GLOB SRCS ${PROJECT_SOURCE_DIR}/test/*.cc)
25 | foreach(SRC ${SRCS})
26 | get_filename_component(SRC_NAME ${SRC} NAME_WE)
27 | add_executable(${SRC_NAME} ${SRC})
28 | target_link_libraries(${SRC_NAME} fmt::fmt-header-only ibverbs gflags glog memcached)
29 | endforeach()
30 |
--------------------------------------------------------------------------------
/libterm/ibverbs-pdp/config.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 |
3 | #ifndef RPC_SERVER_POLL_WITH_EVENT
4 | #define RPC_SERVER_POLL_WITH_EVENT 0
5 | #endif
6 |
7 | #ifndef UNIT_SIZE
8 | #define UNIT_SIZE SZ_1M
9 | #endif
10 |
11 | #ifndef MAX_RDMA_LENGTH
12 | #define MAX_RDMA_LENGTH (64ul * SZ_1K)
13 | #endif
14 |
15 | #ifndef WRITE_INLINE_SIZE
16 | #define WRITE_INLINE_SIZE SZ_4K
17 | #endif
18 |
19 | //#define PAGE_BITMAP_PULL 1 // pull periodically
20 | #define PAGE_BITMAP_UPDATE 0 // incrementally in each request, not pull
21 | #define PAGE_BITMAP_HINT_READ 1
22 | #define MAX_NR_COROUTINES 64
23 |
--------------------------------------------------------------------------------
/libterm/ibverbs-pdp/ctx.hh:
--------------------------------------------------------------------------------
1 | #include "global.hh"
2 | #include
3 |
4 | namespace pdp {
5 | struct ContextBytes {
6 | ibv_gid gid;
7 | uint16_t lid;
8 | ContextBytes(ibv_gid gid, uint16_t lid) : gid(gid), lid(lid) {}
9 | explicit ContextBytes(const std::vector<char> &vec_char)
10 | {
11 | memcpy(this, vec_char.data(), sizeof(ContextBytes));
12 | }
13 |
14 | // for memcached KV
15 | std::string key() const
16 | {
17 | auto id = rdma::Context::RemoteIdentifier(gid, lid);
18 | return id.to_string();
19 | }
20 |
21 | auto to_vec_char() const
22 | {
23 | return util::to_vec_char(*this);
24 | }
25 | };
26 |
27 | class RemoteContext {
28 | public:
29 | using Identifier = rdma::Context::RemoteIdentifier;
30 |
31 | RemoteContext() = default;
32 | explicit RemoteContext(const ContextBytes &bytes) : gid_(bytes.gid), lid_(bytes.lid) {}
33 |
34 | auto identifier() const
35 | {
36 | return Identifier{gid_, lid_};
37 | }
38 |
39 | private:
40 | ibv_gid gid_;
41 | uint16_t lid_;
42 | };
43 |
44 | class RemoteContextManager : public Manager<RemoteContext> {};
45 |
46 | class LocalContext : public rdma::Context {
47 | public:
48 | using Identifier = util::CachelineAligned;
49 | auto identifier() const {
50 | return native_pd();
51 | }
52 | public:
53 | LocalContext(ibv_context *native_ctx, ibv_pd *native_pd)
54 | : Context(native_ctx, native_pd) {}
55 |
56 | ContextBytes to_bytes() const
57 | {
58 | return ContextBytes(gid_, lid_);
59 | }
60 | };
61 |
62 | class LocalContextManager : public Manager<LocalContext> {};
63 | }
64 |
--------------------------------------------------------------------------------
/libterm/ibverbs-pdp/global.cc:
--------------------------------------------------------------------------------
1 | #include "global.hh"
2 | #include "mr.hh"
3 | #include "qp.hh"
4 |
5 | namespace pdp {
6 | GlobalVariables::GlobalVariables()
7 | : sms {"10.0.2.181", 23333},
8 | configs {
9 | .mode = Mode::_from_string_nocase(util::getenv_string("PDP_mode").value_or("PDP").c_str()),
10 | .rpc_mode = util::getenv_bool("PDP_rpc_mode").value_or(false),
11 | .is_server = util::getenv_bool("PDP_is_server").value_or(false),
12 | .server_mmap_dev = util::getenv_string("PDP_server_mmap_dev").value_or(""),
13 | .server_io_path = ServerIoPath::_from_string_nocase(util::getenv_string("PDP_server_io_path").value_or("tiering").c_str()),
14 | .server_memory_gb = util::getenv_int("PDP_server_memory_gb").value_or(0),
15 | .server_rpc_threads = util::getenv_int("PDP_server_rpc_threads").value_or(32),
16 | .server_page_state = util::getenv_bool("PDP_server_page_state").value_or(true),
17 | .read_magic_pattern = util::getenv_bool("PDP_read_magic_pattern").value_or(true),
18 | .promote_hotspot = util::getenv_bool("PDP_promote_hotspot").value_or(true),
19 | .promotion_window_ms = (uint32_t)util::getenv_int("PDP_promotion_window_ms").value_or(1000),
20 | .pull_page_bitmap = util::getenv_bool("PDP_pull_page_bitmap").value_or(true),
21 | .advise_local_mr = util::getenv_bool("PDP_advise_local_mr").value_or(false),
22 | .predict_remote_mr = util::getenv_bool("PDP_predict_remote_mr").value_or(false)
23 | }
24 | {
25 | // modify
26 | switch (configs.mode) {
27 | case Mode::PIN:
28 | case Mode::ODP: {
29 | configs.rpc_mode = false;
30 | configs.read_magic_pattern = false;
31 | configs.promote_hotspot = false;
32 | break;
33 | }
34 | case Mode::PDP: {
35 | configs.server_io_path = ServerIoPath::TIERING;
36 | configs.mode = Mode::PDP;
37 | configs.rpc_mode = false;
38 | configs.read_magic_pattern = true;
39 | configs.promote_hotspot = true;
40 | break;
41 | }
42 | case Mode::RPC: {
43 | configs.mode = Mode::RPC_MEMCPY;
44 | // fallthrough here
45 | }
46 | case Mode::RPC_MEMCPY: {
47 | configs.server_io_path = ServerIoPath::MEMCPY;
48 | configs.mode = Mode::PDP;
49 | configs.rpc_mode = true;
50 | configs.read_magic_pattern = true;
51 | configs.promote_hotspot = false;
52 | break;
53 | }
54 | case Mode::RPC_BUFFER: {
55 | configs.server_io_path = ServerIoPath::BUFFER;
56 | configs.mode = Mode::PDP;
57 | configs.rpc_mode = true;
58 | configs.read_magic_pattern = true;
59 | configs.promote_hotspot = false;
60 | break;
61 | }
62 | case Mode::RPC_DIRECT: {
63 | configs.server_io_path = ServerIoPath::DIRECT;
64 | configs.mode = Mode::PDP;
65 | configs.rpc_mode = true;
66 | configs.read_magic_pattern = true;
67 | configs.promote_hotspot = false;
68 | break;
69 | }
70 | case Mode::RPC_TIERING: {
71 | configs.server_io_path = ServerIoPath::TIERING;
72 | configs.mode = Mode::PDP;
73 | configs.rpc_mode = true;
74 | configs.read_magic_pattern = true;
75 | configs.promote_hotspot = false;
76 | break;
77 | }
78 | case Mode::RPC_TIERING_PROMOTE: {
79 | configs.server_io_path = ServerIoPath::TIERING;
80 | configs.mode = Mode::PDP;
81 | configs.rpc_mode = true;
82 | configs.read_magic_pattern = true;
83 | configs.promote_hotspot = true;
84 | break;
85 | }
86 | case Mode::MAGIC_MEMCPY: {
87 | configs.server_io_path = ServerIoPath::MEMCPY;
88 | configs.mode = Mode::PDP;
89 | configs.rpc_mode = false;
90 | configs.read_magic_pattern = true;
91 | configs.promote_hotspot = false;
92 | break;
93 | }
94 | case Mode::MAGIC_BUFFER: {
95 | configs.server_io_path = ServerIoPath::BUFFER;
96 | configs.mode = Mode::PDP;
97 | configs.rpc_mode = false;
98 | configs.read_magic_pattern = true;
99 | configs.promote_hotspot = false;
100 | break;
101 | }
102 | case Mode::MAGIC_DIRECT: {
103 | configs.server_io_path = ServerIoPath::DIRECT;
104 | configs.mode = Mode::PDP;
105 | configs.rpc_mode = false;
106 | configs.read_magic_pattern = true;
107 | configs.promote_hotspot = false;
108 | break;
109 | }
110 | case Mode::MAGIC_TIERING: {
111 | configs.server_io_path = ServerIoPath::TIERING;
112 | configs.mode = Mode::PDP;
113 | configs.rpc_mode = false;
114 | configs.read_magic_pattern = true;
115 | configs.promote_hotspot = false;
116 | break;
117 | }
118 | case Mode::PDP_MEMCPY: {
119 | configs.server_io_path = ServerIoPath::MEMCPY;
120 | configs.mode = Mode::PDP;
121 | configs.rpc_mode = false;
122 | configs.read_magic_pattern = true;
123 | configs.promote_hotspot = true;
124 | }
125 |
126 | default: break;
127 | }
128 |
129 | // if (configs.mode == +Mode::PIN && configs.server_memory_gb) {
130 | // pr_err("ignore PDP_server_memory_gb in PIN mode!");
131 | // configs.server_memory_gb = 0;
132 | // }
133 | if (configs.mode == +Mode::PIN) {
134 | ASSERT(configs.server_memory_gb == 0, "PDP_server_memory_gb should not be set in PIN mode.");
135 | }
136 |
137 | if (configs.mode == +Mode::PDP) {
138 | ASSERT(!configs.server_mmap_dev.empty(), "PDP_server_mmap_dev must be set!");
139 | }
140 |
141 | if (configs.promote_hotspot) {
142 | pr_warn("auto set PDP_server_page_state for PDP_promote_hotspot");
143 | configs.server_page_state = true;
144 | }
145 |
146 | if (configs.read_magic_pattern) {
147 | pr_warn("auto set PDP_pull_page_bitmap for PDP_read_magic_pattern");
148 | configs.pull_page_bitmap = true;
149 | }
150 |
151 |
152 | // print
153 | fmt_pr(info, "memcached: {}", sms.to_string());
154 | fmt_pr(info, "{}", configs.to_string());
155 |
156 | ctx_mgr = std::make_unique<ContextManager>();
157 | local_mr_mgr = std::make_unique<LocalMrManager>();
158 | remote_mr_mgr = std::make_unique<RemoteMrManager>();
159 | qp_mgr = std::make_unique<QpManager>();
160 | }
161 | }
162 |
--------------------------------------------------------------------------------
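All knobs above are read from `PDP_*` environment variables, with the `value_or()` defaults shown in the initializer list. A hedged launcher sketch in Python, mirroring what the `prefix_cmd` blocks in the evaluation scripts export; the paths are placeholders and `./bin/perf --node_id=0` is the microbenchmark server command used in `term_eval.py`:
```
# Launch a server process under libpdp.so with the env knobs read by
# GlobalVariables(); values here are illustrative, paths are placeholders.
import os, subprocess

env = dict(os.environ,
           LD_PRELOAD="/path/to/libpdp.so",   # placeholder path
           PDP_is_server="1",                 # defaults to false
           PDP_mode="PDP",                    # defaults to "PDP"
           PDP_server_mmap_dev="nvme0n1",     # required when mode is PDP
           PDP_server_memory_gb="32",         # defaults to 0
           PDP_server_rpc_threads="16")       # defaults to 32
subprocess.run(["./bin/perf", "--node_id=0"], env=env, check=True)
```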
/libterm/ibverbs-pdp/global.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 |
3 | // my headers
4 | #include "config.hh"
5 |
6 | #include
7 | #include
8 | #include
9 | #include
10 |
11 | #include
12 | // system
13 | #include
14 |
15 | // c++ headers
16 | #include
17 | #include
18 | #include
19 | #include
20 | #include
21 |
22 | #define PDP_ACCESS_PDP (IBV_ACCESS_RELAXED_ORDERING << 1)
23 |
24 | namespace pdp
25 | {
26 | using create_qp_t = struct ibv_qp *(*)(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr);
27 | using modify_qp_t = int(*)(struct ibv_qp *qp, struct ibv_qp_attr *attr, int attr_mask);
28 | using poll_cq_t = int(*)(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);
29 | using post_send_t = int(*)(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr);
30 | using post_recv_t = int(*)(struct ibv_qp *qp, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr);
31 | using reg_mr_t = struct ibv_mr *(*)(struct ibv_pd *pd, void *addr, size_t length, int access);
32 | using reg_mr_iova2_t = struct ibv_mr *(*)(struct ibv_pd *pd, void *addr, size_t length,
33 | uint64_t iova, unsigned int access);
34 |
35 | class RcQp;
36 | }
37 |
38 | namespace pdp {
39 | template <typename Resource>
40 | class Manager {
41 | public:
42 | using resource_type = Resource;
43 | protected:
44 | std::unordered_map<typename Resource::Identifier, std::unique_ptr<Resource>> map_;
45 | ReaderFriendlyLock lock_;
46 | using Map = decltype(map_);
47 |
48 | virtual void do_in_reg(typename Map::iterator it) {}
49 | public:
50 | Manager() = default;
51 | virtual ~Manager() = default;
52 |
53 | // thread-safe register function, protected by the exclusive write lock so that the value is constructed exactly once.
54 | // id: the key
55 | // args: to construct the value iff the key does not exist
56 | template <typename... Args>
57 | util::ObserverPtr<Resource> reg_with_construct_args(const typename Resource::Identifier &id, Args&&... args) {
58 | std::unique_lock w_lk(lock_);
59 |
60 | if (const auto &p = unlocked_get(id)) {
61 | return p;
62 | }
63 |
64 | auto p = std::make_unique<Resource>(std::forward<Args>(args)...);
65 | auto [it, inserted] = map_.emplace(id, std::move(p));
66 | ASSERT(inserted, "fail to insert");
67 |
68 | do_in_reg(it);
69 | return util::make_observer(it->second.get());
70 | }
71 |
72 | template <typename Lambda>
73 | util::ObserverPtr<Resource> reg_with_construct_lambda(const typename Resource::Identifier &id, const Lambda &lmd)
74 | {
75 | std::unique_lock w_lk(lock_);
76 |
77 | if (auto p = unlocked_get(id)) {
78 | return p;
79 | }
80 |
81 | auto [it, inserted] = map_.emplace(id, lmd());
82 | ASSERT(inserted, "fail to insert");
83 |
84 | do_in_reg(it);
85 | return util::make_observer(it->second.get());
86 | }
87 |
88 | // unlocked get
89 | util::ObserverPtr<Resource> unlocked_get(const typename Resource::Identifier &id)
90 | {
91 | if (auto it = map_.find(id); it != map_.end()) {
92 | return util::make_observer(it->second.get());
93 | }
94 | return util::ObserverPtr<Resource>(nullptr);
95 | }
96 |
97 | util::ObserverPtr<Resource> get(const typename Resource::Identifier &id) {
98 | std::shared_lock r_lk(lock_);
99 | return unlocked_get(id);
100 | }
101 | };
102 |
103 | // class Resource {
104 | // public:
105 | // class Bytes {};
106 | // class Local {};
107 | // class Remote {};
108 | // class Manager {};
109 | // };
110 |
111 | class ContextManager: public Manager {
112 | public:
113 | util::ObserverPtr get_or_reg_by_pd(ibv_pd *pd)
114 | {
115 | util::ObserverPtr mr = get(pd);
116 | if (!mr) {
117 | mr = reg_with_construct_args(pd, pd->context, pd);
118 | }
119 | return mr;
120 | }
121 | };
122 |
123 | namespace PageState {
124 | using Val = u8;
125 | enum : Val {
126 | kUncached = 0,
127 | kCached = 1,
128 | kDirty = 2,
129 | };
130 | }
131 |
132 | class LocalMrManager;
133 | class RemoteMrManager;
134 | class QpManager;
135 |
136 | BETTER_ENUM(ServerIoPath, int, TIERING, MEMCPY, BUFFER, DIRECT);
137 | BETTER_ENUM(Mode, int,
138 | PDP, ODP, PIN,
139 | RPC, RPC_MEMCPY, RPC_BUFFER, RPC_DIRECT, RPC_TIERING,
140 | RPC_TIERING_PROMOTE,
141 | MAGIC_MEMCPY, MAGIC_BUFFER, MAGIC_DIRECT, MAGIC_TIERING,
142 | PDP_MEMCPY);
143 |
144 | struct GlobalVariables {
145 | private:
146 | GlobalVariables();
147 | public:
148 | rdma::SharedMetaService sms;
149 |
150 | // configs
151 | struct {
152 | Mode mode;
153 | bool rpc_mode;
154 |
155 | const bool is_server;
156 | const std::string server_mmap_dev;
157 | ServerIoPath server_io_path;
158 |
159 | int server_memory_gb;
160 | const int server_rpc_threads;
161 | bool server_page_state;
162 |
163 | bool read_magic_pattern;
164 | bool promote_hotspot;
165 | const uint32_t promotion_window_ms;
166 |
167 | bool pull_page_bitmap;
168 | bool advise_local_mr;
169 | bool predict_remote_mr;
170 |
171 | std::string to_string() const {
172 | std::string out = "=== configs begin ===\n";
173 | fmt::format_to(std::back_inserter(out), "mode: {}\n", mode._to_string());
174 | fmt::format_to(std::back_inserter(out), "rpc_mode: {}\n", rpc_mode);
175 | fmt::format_to(std::back_inserter(out), "is_server: {}\n", is_server);
176 | fmt::format_to(std::back_inserter(out), "server_mmap_dev: {}\n", server_mmap_dev);
177 | fmt::format_to(std::back_inserter(out), "server_io_path: {}\n", server_io_path._to_string());
178 | fmt::format_to(std::back_inserter(out), "server_memory_gb: {}\n", server_memory_gb);
179 | fmt::format_to(std::back_inserter(out), "server_rpc_threads: {}\n", server_rpc_threads);
180 | fmt::format_to(std::back_inserter(out), "server_page_state: {}\n", server_page_state);
181 | fmt::format_to(std::back_inserter(out), "read_magic_pattern: {}\n", read_magic_pattern);
182 | fmt::format_to(std::back_inserter(out), "promote_hotspot: {}\n", promote_hotspot);
183 | fmt::format_to(std::back_inserter(out), "promotion_window_ms: {}\n", promotion_window_ms);
184 | fmt::format_to(std::back_inserter(out), "pull_page_bitmap: {}\n", pull_page_bitmap);
185 | fmt::format_to(std::back_inserter(out), "advise_local_mr: {}\n", advise_local_mr);
186 | fmt::format_to(std::back_inserter(out), "predict_remote_mr: {}\n", predict_remote_mr);
187 | fmt::format_to(std::back_inserter(out), "--- configs end ---\n");
188 | return out;
189 | }
190 | } configs;
191 |
192 | bool has_write_req = false;
193 | bool has_dirty_page = false;
194 | bool doing_promotion = false;
195 |
196 | // CTXs
197 | std::unique_ptr<ContextManager> ctx_mgr;
198 |
199 | // MRs
200 | std::unique_ptr<LocalMrManager> local_mr_mgr;
201 | std::unique_ptr<RemoteMrManager> remote_mr_mgr;
202 |
203 | // QPs
204 | std::unique_ptr<QpManager> qp_mgr;
205 |
206 | // original symbols from ibv
207 | struct {
208 | // QP-related: inited in create_qp()
209 | create_qp_t create_qp = nullptr;
210 | post_send_t post_send = nullptr;
211 | post_recv_t post_recv = nullptr;
212 | poll_cq_t poll_cq = nullptr;
213 |
214 | // inited in modify_qp
215 | modify_qp_t modify_qp = nullptr;
216 |
217 | // MR-related: inited in reg_mr_iova2()
218 | reg_mr_iova2_t reg_mr_iova2 = nullptr;
219 | reg_mr_t reg_mr = nullptr;
220 | } ori_ibv_symbols;
221 |
222 | static GlobalVariables& instance()
223 | {
224 | static GlobalVariables gv;
225 | return gv;
226 | }
227 | };
228 | }
229 |
--------------------------------------------------------------------------------
/libterm/ibverbs-pdp/pdp.cc:
--------------------------------------------------------------------------------
1 | #include "mr.hh"
2 | #include "qp.hh"
3 | // c++ headers
4 | #include
5 |
6 | // system headers
7 | #include
8 | #include
9 |
10 | // my headers
11 | #include
12 | #include
13 |
14 | namespace pdp
15 | {
16 | struct ibv_mr *ori_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
17 | int access)
18 | {
19 | const static auto &gv = GlobalVariables::instance();
20 | assert(gv.ori_ibv_symbols.reg_mr_iova2);
21 | return gv.ori_ibv_symbols.reg_mr_iova2(pd, addr, length, (uintptr_t)addr, access);
22 | }
23 |
24 | /**
25 | * postprocess
26 | * All are the same as ibv_poll_cq() and only returns ibv_wc with IBV_SEND_SIGNALED set by the user.
27 | * Be careful that the CQ may not be empty when return < num_entries.
28 | */
29 | int poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc)
30 | {
31 | const static auto &gv = GlobalVariables::instance();
32 | assert(gv.ori_ibv_symbols.poll_cq);
33 | // thread_local util::Recorder rec_poll("poll_nr_entries");
34 |
35 | int ret = gv.ori_ibv_symbols.poll_cq(cq, num_entries, wc);
36 | if (ret <= 0) {
37 | return ret;
38 | }
39 |
40 | if (gv.configs.is_server) return ret;
41 |
42 | if (gv.configs.mode != +Mode::PDP) {
43 | // rec_poll.record_one(ret);
44 | return ret;
45 | }
46 |
47 | if (wc->opcode & IBV_WC_RECV) [[unlikely]] {
48 | // we do not modify post_recv
49 | return ret;
50 | }
51 |
52 | #if 1
53 | // pr_info("poll ret=%d.", ret);
54 | int nr_user_entries = 0;
55 | for (int i = 0; i < ret; i++) {
56 | auto prc = gv.qp_mgr->get(wc[i].qp_num);
57 |
58 | if (!prc) [[unlikely]] {
59 | wc[nr_user_entries++] = wc[i];
60 | } else {
61 | bool signaled = prc->post_process_wc(wc);
62 | if (signaled) {
63 | wc[nr_user_entries++] = wc[i];
64 | }
65 | // pr_debug("[pdp] a pdp wc done, signaled=%u, user_num_entries=%d", signaled, user_num_entries);
66 | }
67 | }
68 | #else
69 | bool is_prc = gv.qp_mgr->is_managed_qp(wc[i].qp_num);
70 | if (!is_prc) [[unlikely]] {
71 | wc[user_num_entries++] = wc[i];
72 | } else {
73 | auto *user_send_wr = reinterpret_cast(wc->wr_id);
74 | bool signaled = user_send_wr->prc->post_process_wc(&wc[i]);
75 | if (signaled) {
76 | wc[user_num_entries++] = wc[i];
77 | }
78 | }
79 | #endif
80 |
81 | return nr_user_entries;
82 | }
83 |
84 | // preprocess wr
85 | int post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
86 | struct ibv_send_wr **bad_wr)
87 | {
88 | const static auto &gv = GlobalVariables::instance();
89 | static util::LatencyRecorder lr_post_send{"post_send"};
90 | util::LatencyRecorderHelper h{lr_post_send};
91 | // static util::LatencyRecorder lr_1("1");
92 | // pr_info("op=%u", wr->opcode);
93 | // lr_1.begin_one();
94 | if (gv.configs.mode != +Mode::PDP) [[unlikely]] {
95 | return gv.ori_ibv_symbols.post_send(qp, wr, bad_wr);
96 | }
97 |
98 | // !prc: a shadow RCQP or a user UDQP
99 | const auto &prc = gv.qp_mgr->get(qp->qp_num);
100 | bool is_shadow_rc = !prc && qp->qp_type == IBV_QPT_RC;
101 | if (is_shadow_rc) [[unlikely]] {
102 | return gv.ori_ibv_symbols.post_send(qp, wr, bad_wr);
103 | }
104 |
105 | // check for all UD and User RC.
106 | // do not check for Shadow RC.
107 | if (gv.configs.is_server) {
108 | gv.local_mr_mgr->convert_pdp_mr_in_send_wr_chain(wr);
109 | }
110 |
111 | if (gv.configs.advise_local_mr) {
112 | gv.local_mr_mgr->advise_mr_in_wr_chain(wr);
113 | }
114 |
115 | // UD QP (no prc), or server side: fall through to the original post_send.
116 | if (!prc || gv.configs.is_server) [[unlikely]] {
117 | // gv.local_mr_mgr->pin_mr_in_recv_wr_chain((ibv_recv_wr *)wr);
118 | // gv.local_mr_mgr->advise_mr_in_wr_chain(wr);
119 |
120 | return gv.ori_ibv_symbols.post_send(qp, wr, bad_wr);
121 | }
122 | // lr_1.end_one();
123 | return prc->post_send(wr, bad_wr);
124 | }
125 |
126 | int post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
127 | struct ibv_recv_wr **bad_wr)
128 | {
129 | const static auto &gv = GlobalVariables::instance();
130 | assert(gv.ori_ibv_symbols.post_recv);
131 | // thread_local util::LatencyRecorder lr_recv{"post_recv"};
132 | // util::LatencyRecorder::Helper h{lr_recv};
133 |
134 | if (gv.configs.mode == +Mode::PIN) {
135 | return gv.ori_ibv_symbols.post_recv(qp, wr, bad_wr);
136 | }
137 |
138 | if (qp->qp_type == IBV_QPT_UD) {
139 | // RECV: the remote side sends to us. We pin MRs for UD so that the NIC does not drop the packet.
140 | gv.local_mr_mgr->pin_mr_in_recv_wr_chain(wr);
141 | if (gv.configs.mode == +Mode::ODP) {
142 | return gv.ori_ibv_symbols.post_recv(qp, wr, bad_wr);
143 | }
144 | }
145 |
146 | if (gv.configs.advise_local_mr) {
147 | gv.local_mr_mgr->advise_mr_in_wr_chain(wr);
148 | }
149 |
150 | return gv.ori_ibv_symbols.post_recv(qp, wr, bad_wr);
151 | }
152 |
153 | struct ibv_qp *create_qp(struct ibv_pd *pd,
154 | struct ibv_qp_init_attr *qp_init_attr)
155 | {
156 | const static auto ori_ibv_create_qp = (create_qp_t)dlvsym(RTLD_NEXT, "ibv_create_qp", "IBVERBS_1.1");
157 | // static auto ori_ibv_create_qp = (create_qp_t)dlsym(RTLD_NEXT, "ibv_create_qp");
158 | static auto &gv = GlobalVariables::instance();
159 | static bool inited = false;
160 |
161 | if (!inited) {
162 | gv.ori_ibv_symbols.create_qp = ori_ibv_create_qp;
163 | gv.ori_ibv_symbols.post_send = pd->context->ops.post_send;
164 | gv.ori_ibv_symbols.post_recv = pd->context->ops.post_recv;
165 | gv.ori_ibv_symbols.poll_cq = pd->context->ops.poll_cq;
166 | inited = true;
167 | }
168 |
169 | assert(ori_ibv_create_qp);
170 |
171 | struct ibv_qp *uqp = ori_ibv_create_qp(pd, qp_init_attr); // man page: NULL if fails
172 | if (!uqp) return uqp; // fails
173 |
174 | assert(uqp->context == pd->context);
175 | if (gv.configs.mode == +Mode::PIN) return uqp;
176 |
177 | // A workaround for UD post_recv with ODP/PDP MRs:
178 | // with a UD QP, a post_recv into an ODP/PDP MR drops the packet outright while the page fault is being handled.
179 | // Many prototype systems use UD for RPC without handling packet drops,
180 | // so we pin the memory in post_recv so that such applications still run.
181 | // Note: this workaround applies to both ODP and PDP.
182 | uqp->context->ops.post_recv = post_recv;
183 |
184 | if (gv.configs.mode != +Mode::PDP) return uqp;
185 |
186 | // we override post_send of all QPs, including UD, to check the mr.
187 | // Note! This works only for PDP!
188 | uqp->context->ops.poll_cq = poll_cq;
189 | uqp->context->ops.post_send = post_send;
190 |
191 | if (qp_init_attr->qp_type != IBV_QPT_RC) {
192 | // pr_info("create qp: type=0x%x, num=0x%x", uqp->qp_type, uqp->qp_num);
193 | return uqp;
194 | }
195 |
196 | // we only create a pdp RcQp (with shadow QPs) for RC QPs, to support one-sided READ/WRITE
197 | auto ctx = gv.ctx_mgr->get_or_reg_by_pd(uqp->pd);
198 | auto prc = gv.qp_mgr->reg_with_construct_args(uqp->qp_num, ctx, uqp, qp_init_attr);
199 |
200 | RcBytes rc_bytes = prc->to_rc_bytes();
201 |
202 | // w/ gid
203 | gv.sms.put(rc_bytes.identifier().to_string(), rc_bytes.to_vec_char());
204 |
205 | // w/o gid
206 | gv.sms.put(rc_bytes.identifier().wo_gid().to_string(), rc_bytes.to_vec_char());
207 |
208 | return uqp;
209 | }
210 |
211 | // postprocess: once the user QP reaches RTR, look up the remote RcBytes in SMS and connect the shadow QPs.
212 | int modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
213 | int attr_mask)
214 | {
215 | const static auto ori_ibv_modify_qp = (modify_qp_t)dlvsym(RTLD_NEXT, "ibv_modify_qp", "IBVERBS_1.1");
216 | assert(ori_ibv_modify_qp != nullptr);
217 |
218 | static auto &gv = GlobalVariables::instance();
219 | static bool inited = false;
220 | if (!inited) {
221 | gv.ori_ibv_symbols.modify_qp = ori_ibv_modify_qp;
222 | inited = true;
223 | }
224 |
225 | int ret = ori_ibv_modify_qp(qp, attr, attr_mask);
226 | if (gv.configs.mode != +Mode::PDP) {
227 | return ret;
228 | }
229 | // man page: 0 on success, errno on failure
230 | if (ret) {
231 | return ret;
232 | }
233 |
234 | // change state?
235 | if ((attr_mask & IBV_QP_STATE) == 0) {
236 | return ret;
237 | }
238 |
239 | const auto &prc = gv.qp_mgr->get(qp->qp_num);
240 | if (!prc) return ret;
241 |
242 | // init -> recv -> send
243 | switch (attr->qp_state) {
244 | case IBV_QPS_INIT: {
245 | assert(attr->port_num == 1);
246 | }
247 | break;
248 | case IBV_QPS_RTR: {
249 | rdma::QueuePair::Identifier remote_uqp_addr{{attr->ah_attr.grh.dgid, attr->ah_attr.dlid}, attr->dest_qp_num};
250 | std::vector vec_char;
251 | if (attr->ah_attr.is_global) {
252 | vec_char = gv.sms.get(remote_uqp_addr.to_string());
253 | } else {
254 | vec_char = gv.sms.get(remote_uqp_addr.wo_gid().to_string());
255 | }
256 | RcBytes remote_mr_bytes{vec_char};
257 | bool ok = prc->connect_to(remote_mr_bytes);
258 | assert(ok);
259 | }
260 | break;
261 | default: break;
262 | }
263 |
264 | return ret;
265 | }
266 |
267 | struct ibv_mr *reg_mr_iova2(struct ibv_pd *pd, void *addr, size_t length,
268 | uint64_t iova, unsigned int access)
269 | {
270 | const static auto ori_ibv_reg_mr_iova2 = (reg_mr_iova2_t)dlsym(RTLD_NEXT, "ibv_reg_mr_iova2");
271 | assert(ori_ibv_reg_mr_iova2);
272 | static auto &gv = GlobalVariables::instance();
273 | static bool inited = false;
274 |
275 | if (!inited) {
276 | gv.ori_ibv_symbols.reg_mr_iova2 = ori_ibv_reg_mr_iova2;
277 | gv.ori_ibv_symbols.reg_mr = ori_reg_mr;
278 | inited = true;
279 | }
280 |
281 | bool user_odp = access & IBV_ACCESS_ON_DEMAND;
282 | if (user_odp) {
283 | pr_info("user_odp!");
284 | }
285 |
286 | if (user_odp && gv.configs.mode == +Mode::PIN) {
287 | pr_warn("!!! ATTENTION !!! ODP mr is requested but pdp_mode is pin");
288 | access &= (~IBV_ACCESS_ON_DEMAND);
289 | user_odp = false;
290 | }
291 |
292 | struct ibv_mr *native_mr;
293 | for (int i = 0; i < 10; i++) {
294 | native_mr = gv.ori_ibv_symbols.reg_mr_iova2(pd, addr, length, iova, access);
295 | if (native_mr) break;
296 | }
297 | if (!native_mr) {
298 | pr_err("reg_mr: failed, addr=%p, len=0x%lx, error=%s!", addr, length, strerror(errno));
299 | return native_mr;
300 | }
301 |
302 | auto ctx = gv.ctx_mgr->get_or_reg_by_pd(pd);
303 | auto rdma_mr = std::make_unique(ctx, native_mr);
304 |
305 | // PIN
306 | if (!user_odp) {
307 | // pr_info("%s", fmt::format("reg_mr: pin,addr={},len={:#x},lkey={:#x},pd={}", addr, length, \
308 | // native_mr ? native_mr->lkey : -1, fmt::ptr(pd)).c_str());
309 |
310 | // if (!gv.configs.mode == +Mode::PIN) {
311 | gv.local_mr_mgr->reg_with_construct_args(rdma_mr->lkey_, std::move(rdma_mr));
312 | // }
313 |
314 | return native_mr;
315 | }
316 |
317 | // ODP
318 | if (gv.configs.mode == +Mode::ODP || !gv.configs.read_magic_pattern) {
319 | pr_info("reg_mr: odp, addr=%p, len=0x%lx, lkey=0x%x", addr, length,
320 | native_mr ? native_mr->lkey : -1);
321 |
322 | gv.local_mr_mgr->reg_with_construct_args(rdma_mr->lkey_, std::move(rdma_mr), nullptr);
323 | pr_info();
324 | return native_mr;
325 | }
326 |
327 | // PDP
328 | // user_odp && gv.configs.mode.is_pdp
329 |
330 | assert((access & PDP_ACCESS_PDP) == 0);
331 | ASSERT((uint64_t)addr % PAGE_SIZE == 0, "addr must be aligned to 4KB");
332 |
333 | // we create a PDP native_mr alongside each ODP native_mr, and return the PDP one to the user
334 | // only the ODP native_mr is registered!
335 | struct ibv_mr *pdp_native_mr;
336 | for (int i = 0; i < 10; i++) {
337 | pdp_native_mr = gv.ori_ibv_symbols.reg_mr_iova2(pd, addr, length, iova, access | PDP_ACCESS_PDP);
338 | if (pdp_native_mr) break;
339 | if (i) pr_info("retry %d, error=%s", i, strerror(errno));
340 | }
341 | assert(pdp_native_mr);
342 |
343 | // key-value
344 | // key: lkey of pdp_native_mr
345 | auto pdp_rdma_mr = std::make_unique(ctx, pdp_native_mr);
346 | pr_info();
347 | gv.local_mr_mgr->reg_with_construct_args(pdp_rdma_mr->lkey_, std::move(rdma_mr), std::move(pdp_rdma_mr));
348 |
349 | pr_info("reg_mr: pdp, addr=%p, len=0x%lx, pdp_lkey=0x%x, internal_odp_key=0x%x", addr, length,
350 | pdp_native_mr->lkey, native_mr->lkey);
351 |
352 | // return pdp_native_mr, so we can RDMA read!
353 | return pdp_native_mr;
354 | }
355 | }
356 |
357 | struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
358 | struct ibv_qp_init_attr *qp_init_attr)
359 | {
360 | return pdp::create_qp(pd, qp_init_attr);
361 | }
362 |
363 | int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
364 | int attr_mask)
365 | {
366 | return pdp::modify_qp(qp, attr, attr_mask);
367 | }
368 |
369 | struct ibv_mr *ibv_reg_mr_iova2(struct ibv_pd *pd, void *addr, size_t length,
370 | uint64_t iova, unsigned int access)
371 | {
372 | return pdp::reg_mr_iova2(pd, addr, length, iova, access);
373 | }
374 |
375 | #ifdef ibv_reg_mr
376 | #undef ibv_reg_mr
377 | #endif
378 | struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length,
379 | int access)
380 | {
381 | return pdp::reg_mr_iova2(pd, addr, length, (uintptr_t)addr, access);
382 | }
383 |
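pdp.cc works by interposing on the ibverbs entry points: the library exports its own `ibv_create_qp`, `ibv_modify_qp`, and `ibv_reg_mr`/`ibv_reg_mr_iova2`, resolves the original versioned symbols with `dlvsym(RTLD_NEXT, ..., "IBVERBS_1.1")` (or `dlsym` for `ibv_reg_mr_iova2`), and forwards to them after its own pre/post-processing. The following is a minimal sketch of only that interposition pattern for a single verb; the file name, build line, and log message are hypothetical and are not part of the repository.

```
// minimal-interpose.cc -- illustrative sketch only, not part of TeRM.
// Hypothetical build: g++ -std=c++17 -shared -fPIC minimal-interpose.cc -o libinterpose.so -ldl
// Hypothetical run:   LD_PRELOAD=./libinterpose.so ./your_rdma_app
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // dlvsym() is a GNU extension
#endif
#include <dlfcn.h>
#include <infiniband/verbs.h>
#include <cstdio>

extern "C" int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, int attr_mask)
{
    // Resolve libibverbs' own versioned symbol, the same way pdp.cc does.
    using modify_qp_fn = int (*)(struct ibv_qp *, struct ibv_qp_attr *, int);
    static auto ori = (modify_qp_fn)dlvsym(RTLD_NEXT, "ibv_modify_qp", "IBVERBS_1.1");

    int ret = ori(qp, attr, attr_mask);          // forward to the real implementation
    if (ret == 0 && (attr_mask & IBV_QP_STATE)) {
        std::printf("[interpose] qp 0x%x -> state %d\n", qp->qp_num, (int)attr->qp_state);
    }
    return ret;
}
```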
--------------------------------------------------------------------------------
/libterm/ibverbs-pdp/qp.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | // mine
3 | #include "global.hh"
4 | #include "rpc.hh"
5 | #include
6 | #include
7 | // sys
8 | #include
9 | // boost
10 | // #include
11 | // c++
12 | #include
13 | #include
14 | #include
15 |
16 | namespace pdp {
17 | class WrId {
18 | private:
19 | union {
20 | uint64_t u64_value_;
21 | struct {
22 | uint32_t wr_seq_; // max_send_wr uint32_t
23 | uint8_t user_signaled_ : 1;
24 | };
25 | };
26 | public:
27 | WrId(uint64_t u64_value) : u64_value_(u64_value) {}
28 | WrId(uint32_t wr_seq, bool user_signaled) : wr_seq_(wr_seq),
29 | user_signaled_(user_signaled) {}
30 | auto u64_value() const { return u64_value_; }
31 | auto wr_seq() const { return wr_seq_; }
32 | bool user_signaled() const { return user_signaled_; }
33 | };
34 |
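The `WrId` class above packs a send-WR sequence number and a user-signaled flag into the 64-bit `wr_id` field, so a completion can be mapped back to its bookkeeping slot. A minimal usage sketch, assuming the ibverbs and project headers stripped from the listing are available; the function names are illustrative only.

```
#include "qp.hh"   // the header above

// Encode a wr_id when posting; decode it again when the completion is polled.
uint64_t encode_wr_id(uint32_t slot, bool user_signaled)
{
    return pdp::WrId{slot, user_signaled}.u64_value();
}

void on_completion(const ibv_wc &wc)
{
    pdp::WrId id{wc.wr_id};
    if (id.user_signaled()) {
        // hand wc back to the user; id.wr_seq() indexes the bookkeeping slot
    }
}
```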
35 | struct RcBytes {
36 | rdma::QueuePair::Identifier user_qp;
37 | rdma::QueuePair::Identifier shadow_qp;
38 | rdma::QueuePair::Identifier bitmap_qp;
39 |
40 | using Identifier = rdma::QueuePair::Identifier;
41 | auto identifier() const
42 | {
43 | return user_qp;
44 | }
45 | RcBytes() = default;
46 | explicit RcBytes(const std::vector &vec_char)
47 | {
48 | memcpy(this, vec_char.data(), sizeof(RcBytes));
49 | }
50 | auto to_vec_char() const
51 | {
52 | return util::to_vec_char(*this);
53 | }
54 | auto to_string() const
55 | {
56 | char str[256];
57 | sprintf(str, "(%s){0x%08x,0x%08x,0x%08x}", user_qp.ctx_addr().to_string().c_str(),
58 | user_qp.qp_num(), shadow_qp.qp_num(), bitmap_qp.qp_num());
59 | return std::string{str};
60 | }
61 |
62 | auto to_qp_string() const
63 | {
64 | char str[256];
65 | sprintf(str, "{0x%x,0x%x,0x%x}",
66 | user_qp.qp_num(), shadow_qp.qp_num(), bitmap_qp.qp_num());
67 | return std::string{str};
68 | }
69 | };
70 |
71 | class CounterForRemoteMr;
72 | class RemoteMr;
73 | // pdp qp
74 | class RcQp {
75 | public:
76 | // pointers
77 | using Uptr = std::unique_ptr;
78 |
79 | // identifier
80 | using Identifier = util::CachelineAligned; // qpn
81 | Identifier identifier() const { return uqp_->qp_num; }
82 |
83 | struct UserSendWr {
84 | uint64_t original_wr_id{};// for poll_cq. user's original wr_ids. we need to return it in g_ori_poll_cq
85 | ibv_send_wr wr{};// for resubmission. user's wr for resubmission. signal is always set. sg_list is copied. wr_id is original
86 | std::vector sg; // for resubmission.
87 | RcQp *prc{};
88 | uint32_t seq{};
89 | bool via_rpc{};
90 | bool user_signaled{};
91 | #if PAGE_BITMAP_HINT_READ
92 | util::ObserverPtr remote_mr{};
93 | bool rdma_read; // whether we tried rdma_read in submission
94 | #endif
95 |
96 | UserSendWr(const UserSendWr &rhs)
97 | {
98 | *this = rhs;
99 | wr.sg_list = sg.data();
100 | }
101 |
102 | UserSendWr(RcQp *prc, uint32_t max_send_sge) :
103 | prc(prc),
104 | sg(max_send_sge)
105 | {
106 | wr.sg_list = sg.data();
107 | }
108 | };
109 |
110 | // member variables
111 | private:
112 | util::ObserverPtr ctx_;
113 |
114 | // UQP
115 | struct ibv_qp *uqp_ = nullptr; // user QP
116 | bool uqp_sq_sig_all_ = false;
117 | struct ibv_qp_cap uqp_cap_;
118 |
119 | // SQP
120 | std::unique_ptr sqp_; // shadow RC by pdp
121 | std::unique_ptr msg_conn_;
122 |
123 | // pdp_rc_
124 | std::unique_ptr mr_qp_; // for mr prefetch, bitmap, ...
125 |
126 | std::vector user_send_wr_vec_;
127 | std::vector send_bitmap_; // 0 not used, 1 used.
128 | std::mutex mutex_; // protect
129 | std::unordered_map> access_counters_;
130 |
131 | // methods
132 | private:
133 | static bool buffer_is_magic(const char *data, uint32_t len);
134 | static int sge_identify_magic(const ibv_sge &sge, uint64_t remote_addr, uint64_t *fault_bitmap);
135 |
136 | // only records *wr itself, without wr->next!!!
137 | void record_rdma_access(const ibv_send_wr *wr);
138 |
139 | public:
140 | RcQp(util::ObserverPtr ctx, struct ibv_qp *uqp, const ibv_qp_init_attr *init_attr);
141 |
142 | auto ctx() const { return ctx_; }
143 | auto msg_conn() const { return msg_conn_.get(); }
144 | auto mr_qp() const { return mr_qp_.get(); }
145 | RcBytes to_rc_bytes() const;
146 |
147 | auto get_user_qp_remote_identifier() { return rdma::QueuePair::Identifier{ctx_->remote_identifier(), uqp_->qp_num};}
148 |
149 | bool connect_to(const RcBytes &rc_bytes);
150 |
151 | bool post_process_wc(struct ibv_wc *wc);
152 | int post_send(struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr);
153 |
154 | std::optional find_send_seq();
155 | void clear_send_seq(uint32_t seq);
156 | void set_send_seq(uint32_t seq);
157 | bool test_send_seq(uint32_t seq);
158 |
159 | void rpc_wr_async(ibv_send_wr *wr);
160 | Message rpc_wr_wait(ibv_send_wr *wr);
161 | };
162 |
163 | class QpManager: public Manager {
164 | private:
165 | void do_in_reg(typename Map::iterator it) final;
166 | std::vector> rpc_qps_for_threads_;
167 | std::vector> rpc_server_threads_;
168 | int rpc_server_thread_id_current_{};
169 | void rpc_worker_func(const std::stop_token &stop_token, int rpc_server_thread_id, coroutine::push_type& sink, int coro_id);
170 | };
171 | } // namespace pdp
172 |
--------------------------------------------------------------------------------
/libterm/ibverbs-pdp/rpc.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | // mine
3 | #include "global.hh"
4 | #include "bitmap.hh"
5 | // sys
6 | #include
7 | #include
8 | #include
9 | #include
10 |
11 | // 3rd
12 | #include
13 | #define BOOST_ALLOW_DEPRECATED_HEADERS
14 | #include
15 | #include
16 | // c++
17 | #include
18 | #include
19 | #include
20 |
21 | #define MAX_NR_RDMA_PAGES ((MAX_RDMA_LENGTH + SZ_4K) / SZ_4K)
22 |
23 | namespace pdp {
24 |
25 | using boost::coroutines::coroutine;
26 | enum RpcOpcode : uint32_t {
27 | kReadSingle = 0,
28 | kReadSparce,
29 | kWriteSingle,
30 | kWriteInline,
31 | kRegisterCounter,
32 | kReturnBitmap,
33 | kReturnStatus,
34 | };
35 |
36 | struct ReadSingle {
37 | uint64_t s_addr; // remote_addr for the requester
38 | uint32_t s_lkey;
39 | uint64_t c_addr; // remote_identifier for the requester
40 | uint32_t c_rkey; // for the responder
41 | uint32_t len;
42 | u8 update_bitmap;
43 |
44 | constexpr size_t size() const { return sizeof(ReadSingle); }
45 |
46 | std::string to_string() const
47 | {
48 | return fmt::format("read_single,s_addr={:#x},s_lkey={:#x},c_addr={:#x},c_rkey={:#x},len={:#x}",
49 | s_addr, s_lkey, c_addr, c_rkey, len);
50 | }
51 | };
52 |
53 | struct ReadSparce {
54 | uint64_t s_addr;
55 | uint32_t s_lkey;
56 | uint64_t c_addr;
57 | uint32_t c_rkey;
58 | uint32_t len;
59 | uint64_t fault_bitmap[util::bits_to_u64s(MAX_NR_RDMA_PAGES)];
60 | u8 update_bitmap;
61 |
62 | constexpr size_t size() const { return sizeof(ReadSparce); }
63 |
64 | std::string to_string() const
65 | {
66 | return fmt::format("read_space,s_addr={:#x},s_lkey={:#x},c_addr={:#x},c_rkey={:#x},len={:#x}",
67 | s_addr, s_lkey, c_addr, c_rkey, len);
68 | }
69 | };
70 |
71 | struct WriteSingle {
72 | uint64_t s_addr;
73 | uint32_t s_lkey;
74 | uint64_t c_addr;
75 | uint32_t c_rkey;
76 | uint32_t len;
77 | u8 update_bitmap;
78 |
79 | constexpr size_t size() const {
80 | return sizeof(WriteSingle);
81 | }
82 |
83 | std::string to_string() const
84 | {
85 | return fmt::format("write_single, s_addr={:#x}, s_lkey={:#x}, c_addr={:#x}, c_rkey={:#x}, len={:#x}",
86 | s_addr, s_lkey, c_addr, c_rkey, len);
87 | }
88 | };
89 |
90 | struct WriteInline {
91 | uint64_t s_addr;
92 | uint32_t s_lkey;
93 | uint32_t len;
94 | uint8_t data[PAGE_SIZE];
95 | u8 update_bitmap;
96 |
97 | constexpr size_t size() const {
98 | return offsetof(WriteInline, data) + len;
99 | }
100 |
101 | constexpr size_t data_size() const { return len; }
102 |
103 | std::string to_string() const
104 | {
105 | return fmt::format("write_inline, s_addr={:#x}, s_lkey={:#x}, len={:#x}, data={:#x}",
106 | s_addr, s_lkey, len, *(uint64_t *)data);
107 | }
108 | };
109 |
110 | struct RegisterCounter {
111 | uint32_t server_mr_key;
112 | rdma::RemoteMemoryRegion counter_mr;
113 |
114 | constexpr size_t size() const { return sizeof(RegisterCounter); }
115 |
116 | std::string to_string() const
117 | {
118 | return fmt::format("register_counter");
119 | }
120 | };
121 |
122 | struct ReturnBitmap {
123 | uint64_t present_bitmap[util::bits_to_u64s(MAX_NR_RDMA_PAGES)];
124 | constexpr size_t size() const { return sizeof(ReturnBitmap); }
125 | std::string to_string() const
126 | {
127 | return fmt::format("return_bitmap");
128 | }
129 | };
130 |
131 | struct Message {
132 | uint32_t opcode;
133 | union {
134 | ReadSingle read_single;
135 | ReadSparce read_sparce;
136 | WriteSingle write_single;
137 | WriteInline write_inline;
138 | RegisterCounter register_counter;
139 | ReturnBitmap return_bitmap;
140 | };
141 |
142 | [[nodiscard]] constexpr size_t size() const
143 | {
144 | switch (opcode) {
145 | case kReadSingle:
146 | return offsetof(Message, read_single) + read_single.size();
147 | case kReadSparce:
148 | return offsetof(Message, read_sparce) + read_sparce.size();
149 | case kWriteSingle:
150 | return offsetof(Message, write_single) + write_single.size();
151 | case kWriteInline:
152 | return offsetof(Message, write_inline) + write_inline.size();
153 | case kRegisterCounter:
154 | return offsetof(Message, register_counter) + register_counter.size();
155 | case kReturnBitmap:
156 | return offsetof(Message, return_bitmap) + return_bitmap.size();
157 | case kReturnStatus:
158 | return sizeof(opcode);
159 | default:
160 | return 0;
161 | }
162 | }
163 |
164 | [[nodiscard]] std::string to_string() const
165 | {
166 | std::string str = fmt::format("{{size={:#x},opcode=", size());
167 | switch (opcode) {
168 | case kReadSingle:
169 | return str + read_single.to_string() + "}";
170 | case kReadSparce:
171 | return str + read_sparce.to_string() + "}";
172 | case kWriteSingle:
173 | return str + write_single.to_string() + "}";
174 | case kWriteInline:
175 | return str + write_inline.to_string() + "}";
176 | case kRegisterCounter:
177 | return str + register_counter.to_string() + "}";
178 | case kReturnBitmap:
179 | return str + return_bitmap.to_string() + "}";
180 | case kReturnStatus:
181 | return str + "return}";
182 | default:
183 | return str + "unknown}";
184 | }
185 | }
186 | };
187 |
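`Message` is the RPC wire format: a 4-byte opcode followed by one union member, and `size()` counts only the bytes of the active member. A hedged usage sketch follows; the helper name, addresses, and keys are made up for illustration and are not from the repository.

```
#include "rpc.hh"    // the header above
#include <cstdint>

// Build a kReadSingle request; only offsetof(Message, read_single) + sizeof(ReadSingle)
// bytes need to be copied into the send MR.
pdp::Message make_read_single(uint64_t s_addr, uint32_t s_lkey,
                              uint64_t c_addr, uint32_t c_rkey, uint32_t len)
{
    pdp::Message msg{};
    msg.opcode = pdp::kReadSingle;
    msg.read_single = {s_addr, s_lkey, c_addr, c_rkey, len, /*update_bitmap=*/1};
    // msg.size() now equals offsetof(Message, read_single) + sizeof(ReadSingle)
    return msg;
}
```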
188 | class LocalMr;
189 |
190 | class RpcIo {
191 | static const size_t kBounceMrSize = MAX_NR_RDMA_PAGES * SZ_4K + SZ_4K;
192 | static const size_t kQueueDepth = kBounceMrSize / SZ_4K;
193 | private:
194 | rdma::LocalMemoryRegion::Uptr bounce_mr_;
195 | void *mr_base_addr_{};
196 |
197 | int nvme_fd_buffer_;
198 | int nvme_fd_direct_;
199 | int pch_fd_;
200 | static inline thread_local std::vector io_context_vec_{std::vector(MAX_NR_COROUTINES)};
201 | static inline thread_local std::vector inited_{std::vector(MAX_NR_COROUTINES)};
202 |
203 | // io_uring ring_{};
204 |
205 | std::vector> access_vec_;
206 | struct io_req_t {
207 | u64 offset;
208 | u32 length;
209 | char *buffer;
210 | bool direct;
211 |
212 | bool try_merge(const io_req_t &rhs)
213 | {
214 | if (offset + length == rhs.offset
215 | && buffer + length == rhs.buffer
216 | && direct == rhs.direct) {
217 | length += rhs.length;
218 | return true;
219 | }
220 | return false;
221 | }
222 |
223 | std::string to_string() const
224 | {
225 | return fmt::format("offset={:#x},length={:#x},buffer={},direct={}",
226 | offset, length, fmt::ptr(buffer), direct);
227 | }
228 | };
229 | std::vector io_req_vec_;
230 |
231 | private:
232 | char *direct_sector_buffer()
233 | {
234 | return (char *)bounce_mr_->addr() + MAX_NR_RDMA_PAGES * PAGE_SIZE;
235 | }
236 |
237 | public:
238 | explicit RpcIo(util::ObserverPtr ctx);
239 | ~RpcIo()
240 | {
241 | close(nvme_fd_buffer_);
242 | close(nvme_fd_direct_);
243 | close(pch_fd_);
244 | // io_uring_queue_exit(&ring_);
245 | }
246 |
247 | auto bounce_mr() { return util::make_observer(bounce_mr_.get()); }
248 |
249 | // void do_io_by_uring(int nr_io_reqs, bool write);
250 | void do_io_by_psync(int nr_io_reqs, bool write, coroutine::push_type& sink, int coro_id);
251 | void subsector_write(void *buffer, u32 length, u64 offset, coroutine::push_type& sink, int coro_id);
252 |
253 | // rw_offset: in mr and ssd
254 | // mr_base_addr: virtual address, must be 4KB-aligned!
255 | // fault_bitmap: nullptr means all pages need to access
256 | std::span>
257 | io(u64 rw_offset, u32 rw_count, const uint64_t *fault_bitmap, bool write, LocalMr *mr, coroutine::push_type& sink, int coro_id);
258 | };
259 |
260 | class MessageConnection {
261 | static const uint64_t kSignalBatch = 16;
262 | private:
263 | util::ObserverPtr rc_qp_;
264 | rdma::LocalMemoryRegion::Uptr send_mr_;
265 | rdma::LocalMemoryRegion::Uptr recv_mr_;
266 | uint64_t send_counter_{};
267 | std::mutex mutex_;
268 |
269 | // PDP server only
270 | std::unique_ptr rpc_io_;
271 |
272 | public:
273 | using Uptr = std::unique_ptr;
274 |
275 | void lock()
276 | {
277 | mutex_.lock();
278 | }
279 |
280 | void unlock()
281 | {
282 | mutex_.unlock();
283 | }
284 |
285 | explicit MessageConnection(util::ObserverPtr rc_qp)
286 | : rc_qp_(rc_qp)
287 | {
288 | static const auto &gv = GlobalVariables::instance();
289 |
290 | const reg_mr_t ori_reg_mr = gv.ori_ibv_symbols.reg_mr;
291 |
292 | rdma::Allocation send_alloc{sizeof(Message)};
293 | rdma::Allocation recv_alloc{sizeof(Message)};
294 |
295 | send_mr_ = std::make_unique(rc_qp_->ctx(), std::move(send_alloc), false, ori_reg_mr);
296 | recv_mr_ = std::make_unique(rc_qp_->ctx(), std::move(recv_alloc), false, ori_reg_mr);
297 | rc_qp_->receive(recv_mr_->pick_by_offset0(sizeof(Message)));
298 |
299 | if (!gv.configs.is_server) return;
300 | rpc_io_ = std::make_unique(rc_qp_->ctx());
301 | }
302 |
303 | auto rc_qp() { return rc_qp_; }
304 | auto rpc_io() { return util::make_observer(rpc_io_.get()); }
305 |
306 | Message get(ibv_wc *wc = nullptr)
307 | {
308 | assert(rc_qp_);
309 | rc_qp_->wait_recv(wc);
310 |
311 | Message msg = *(Message *)recv_mr_->addr_;
312 |
313 | // post for the next
314 | rc_qp_->receive(recv_mr_->addr_, recv_mr_->lkey_, sizeof(Message));
315 |
316 | return msg;
317 | }
318 |
319 | bool try_get(Message *msg, void *write_bounce_buffer)
320 | {
321 | int ret = rc_qp_->poll_recv(1);
322 | if (ret <= 0) {
323 | return false;
324 | }
325 |
326 | auto *src_msg = (Message *)recv_mr_->addr();
327 | if (src_msg->opcode == RpcOpcode::kWriteInline) {
328 | memcpy(msg, src_msg, src_msg->size() - src_msg->write_inline.data_size());
329 | size_t offset_in_page = msg->write_single.s_addr % PAGE_SIZE;
330 | memcpy((char *)write_bounce_buffer + offset_in_page, src_msg->write_inline.data, src_msg->write_inline.len);
331 | } else {
332 | memcpy(msg, src_msg, src_msg->size());
333 | }
334 |
335 | rc_qp_->receive(recv_mr_->pick_by_offset0(sizeof(Message)));
336 |
337 | return true;
338 | }
339 |
340 | Message *send_mr_ptr()
341 | {
342 | return (Message *)send_mr_->addr();
343 | }
344 |
345 | bool should_signal() const
346 | {
347 | return send_counter_ % kSignalBatch == 0;
348 | }
349 |
350 | void inc_send_counter()
351 | {
352 | send_counter_++;
353 | }
354 |
355 | // auto batch signal and wait
356 | void post_send_wr_auto_wait(rdma::Wr *wr)
357 | {
358 | if (should_signal()) {
359 | wr->set_signaled(true);
360 | }
361 | rc_qp_->post_send_wr(wr);
362 | // the caller may have set the signal
363 | if (wr->signaled()) {
364 | ibv_wc wc{};
365 | rc_qp_->wait_send(&wc);
366 | ASSERT(wc.status == IBV_WC_SUCCESS, "fail to wait_send: %s", ibv_wc_status_str(wc.status));
367 | }
368 | inc_send_counter();
369 | }
370 |
371 | // auto batch signal and wait
372 | void send(const Message *msg)
373 | {
374 | if ((void *)msg != send_mr_->addr()) {
375 | memcpy(send_mr_->addr(), msg, msg->size());
376 | }
377 | rc_qp_->send(send_mr_->pick_by_offset0(msg->size()), should_signal());
378 | if (should_signal()) {
379 | ibv_wc wc{};
380 | rc_qp_->wait_send(&wc);
381 | ASSERT(wc.status == IBV_WC_SUCCESS, "fail to wait_send: %d(%s), qp=%s, send_counter=%lu", wc.status, ibv_wc_status_str(wc.status),
382 | rc_qp()->identifier().to_string().c_str(), send_counter_);
383 | ASSERT(wc.opcode == IBV_WC_SEND, "unexpected wc.opcode=0x%x, qp=%s, send_counter=%lu", wc.opcode, rc_qp()->identifier().to_string().c_str(), send_counter_);
384 | }
385 | inc_send_counter();
386 | }
387 | };
388 | }
389 |
--------------------------------------------------------------------------------
/libterm/include/bitmap.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | #include
3 |
4 | namespace util {
5 | using ulong_t = uint64_t;
6 | static constexpr inline uint64_t kBitsPerByte = 8;
7 | static constexpr inline uint64_t kBitsPerLong = sizeof(ulong_t) * kBitsPerByte;
8 | static constexpr inline uint64_t kBitsPerU64 = sizeof(uint64_t) * kBitsPerByte;
9 | static constexpr inline uint64_t kPageSize = 4096;
10 |
11 | static bool test_bit(uint32_t nr, const uint64_t *addr)
12 | {
13 | return 1ul & (addr[nr / kBitsPerLong] >> (nr % kBitsPerLong));
14 | }
15 |
16 | static inline void set_bit(uint32_t nr, uint64_t *addr)
17 | {
18 | addr[nr / kBitsPerLong] |= 1UL << (nr % kBitsPerLong);
19 | }
20 |
21 | static inline void clear_bit(uint32_t nr, uint64_t *addr)
22 | {
23 | addr[nr / kBitsPerLong] &= ~(1UL << (nr % kBitsPerLong));
24 | }
25 |
26 | static constexpr uint64_t bits_to_longs(uint64_t bits)
27 | {
28 | return util::div_up(bits, kBitsPerLong);
29 | }
30 |
31 | static constexpr uint64_t bits_to_u64s(uint64_t bits)
32 | {
33 | return util::div_up(bits, kBitsPerU64);
34 | }
35 |
36 | static constexpr uint64_t bits_to_long_aligned_bytes(uint64_t bits)
37 | {
38 | return util::bits_to_longs(bits) * sizeof(ulong_t);
39 | }
40 |
41 | static uint64_t mr_length_to_pgs(uint64_t mr_length)
42 | {
43 | return util::div_up(mr_length, kPageSize);
44 | }
45 |
46 | class Bitmap
47 | {
48 | ulong_t *data_{};
49 | uint32_t bits_{};
50 | public:
51 | using Uptr = std::unique_ptr;
52 | Bitmap(uint32_t bits, bool set) : bits_(bits)
53 | {
54 | data_ = new ulong_t[bits_to_longs(bits_)];
55 | if (set) {
56 | set_all();
57 | } else {
58 | clear_all();
59 | }
60 | }
61 | ~Bitmap()
62 | {
63 | delete[] data_;
64 | }
65 | bool test_bit(uint32_t nr)
66 | {
67 | return util::test_bit(nr, data_);
68 | }
69 | void set_bit(uint32_t nr)
70 | {
71 | util::set_bit(nr, data_);
72 | }
73 | void assign_bit(uint32_t nr, bool set)
74 | {
75 | if (set) {
76 | set_bit(nr);
77 | } else {
78 | clear_bit(nr);
79 | }
80 | }
81 | void clear_bit(uint32_t nr)
82 | {
83 | util::clear_bit(nr, data_);
84 | }
85 | void clear_all()
86 | {
87 | memset(data_, 0, sizeof(ulong_t) * bits_to_longs(bits_));
88 | }
89 | void set_all()
90 | {
91 | memset(data_, 0xff, sizeof(ulong_t) * bits_to_longs(bits_));
92 | }
93 | };
94 | }
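These helpers mirror the kernel-style bitmap API and back the page-presence tracking: one bit per 4 KiB page of an MR. A small sketch of the intended use, assuming `util.hh` (which provides `util::div_up`) and `const.hh` (for `SZ_1M`, shown later in this listing) are available; the scenario is illustrative.

```
#include "bitmap.hh"
#include "const.hh"

// Track which 4 KiB pages of a 1 MiB MR are currently present in memory.
int main()
{
    util::Bitmap present(util::mr_length_to_pgs(SZ_1M), /*set=*/false);
    present.set_bit(3);                    // page 3 was just faulted in from SSD
    bool hit = present.test_bit(3);        // true
    present.clear_bit(3);                  // page 3 evicted back to SSD
    return hit ? 0 : 1;
}
```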
--------------------------------------------------------------------------------
/libterm/include/compile.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 |
3 | #include
4 |
5 | #define ENABLE_CACHELINE_ALIGN 1
6 | #if ENABLE_CACHELINE_ALIGN
7 | #define __cacheline_aligned alignas(64)
8 | #else
9 | #define __cacheline_aligned
10 | #endif
11 |
12 | #define __packed __attribute__((__packed__))
13 |
14 | inline void compiler_barrier() { asm volatile("" ::: "memory"); }
15 |
16 | template
17 | concept explicit_ref = std::same_as[>;
18 |
19 | #define DELETE_MOVE_CONSTRUCTOR_AND_ASSIGNMENT(Class) \
20 | Class(Class&&) = delete; \
21 | Class& operator=(Class&&) = delete;
22 | #define DELETE_COPY_CONSTRUCTOR_AND_ASSIGNMENT(Class) \
23 | Class(const Class&) = delete; \
24 | Class& operator=(const Class&) = delete;
25 |
--------------------------------------------------------------------------------
/libterm/include/const.hh:
--------------------------------------------------------------------------------
1 | #include <cstdint> // the fixed-width typedefs below need this
2 |
3 | #define SZ_K (1ul << 10)
4 | #define SZ_M (1ul << 20)
5 | #define SZ_1K (1ul << 10)
6 | #define SZ_2K (2ul << 10)
7 | #define SZ_4K (4ul << 10)
8 | #define SZ_64K (64ul << 10)
9 | #define SZ_256K (256ul << 10)
10 | #define SZ_512K (512ul << 10)
11 | #define SZ_1M (1ul << 20)
12 | #define SZ_2M (2ul << 20)
13 | #define SZ_4M (4ul << 20)
14 | #define SZ_256M (256ul << 20)
15 | #define SZ_1G (1ul << 30)
16 | #define SZ_2G (2ul << 30)
17 |
18 | #define PAGE_SHIFT 12
19 | #define PAGE_SIZE (1ul << PAGE_SHIFT)
20 | #define SECTOR_SIZE 512
21 |
22 | using u8 = uint8_t;
23 | using u16 = uint16_t;
24 | using u32 = uint32_t;
25 | using u64 = uint64_t;
26 |
--------------------------------------------------------------------------------
/libterm/include/data-conn.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | #include
3 |
4 | #include "rdma.hh"
5 |
6 | namespace rdma {
7 | class DataConnection final {
8 | public:
9 | using Uptr = std::unique_ptr;
10 | private:
11 | util::ObserverPtr rc_qp_;
12 | util::ObserverPtr local_mr_;
13 | std::unique_ptr remote_mr_;
14 | public:
15 | DataConnection(util::ObserverPtr rc_qp, util::ObserverPtr local_mr,
16 | std::unique_ptr remote_mr)
17 | : rc_qp_(rc_qp),
18 | local_mr_(local_mr),
19 | remote_mr_(std::move(remote_mr)) {}
20 |
21 | // DataConnection(const Sptr& data_conn)
22 | // : rc_qp_(data_conn->rc_qp_), local_mr_(data_conn->local_mr_), remote_mr_(data_conn->remote_mr_)
23 | // {}
24 |
25 | // async
26 | bool read(const MemorySge &local_mem, const MemorySge &remote_mem, bool signal)
27 | {
28 | return rc_qp_->read(local_mem, remote_mem, signal);
29 | }
30 |
31 | // async
32 | bool read(uint64_t local_addr, uint64_t remote_addr, uint32_t len)
33 | {
34 | pr_info("data-conn::read");
35 | return rc_qp_->read(local_mr_->pick_by_addr(local_addr, len),
36 | remote_mr_->pick_by_addr(remote_addr, len), true);
37 | }
38 |
39 | bool write(const MemorySge &local_mem, const MemorySge &remote_mem, optional imm = std::nullopt)
40 | {
41 | return rc_qp_->write(local_mem, remote_mem, true, imm);
42 | }
43 |
44 | bool write(uint64_t local_addr, uint64_t remote_addr, uint32_t len, optional imm = std::nullopt)
45 | {
46 | return rc_qp_->write(local_mr_->pick_by_addr(local_addr, len),
47 | remote_mr_->pick_by_addr(remote_addr, len), true, imm);
48 | }
49 |
50 | void post_send_wr(Wr *wr) { rc_qp_->post_send_wr(wr); }
51 |
52 | int wait_send(ibv_wc *wc = nullptr)
53 | {
54 | return rc_qp_->wait_send(wc);
55 | }
56 |
57 | auto rc_qp() { return rc_qp_; }
58 |
59 | auto * local_mr() { return local_mr_.get(); }
60 | auto * remote_mr() { return remote_mr_.get(); }
61 | };
62 | }
63 |
--------------------------------------------------------------------------------
/libterm/include/lock.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 |
3 | // mine
4 | #include "util.hh"
5 | #include
6 |
7 | #include
8 | #include
9 | #include
10 | #include
11 | #include
12 |
13 | #include
14 | #include
15 |
16 | static inline uint64_t NowMicros()
17 | {
18 | static constexpr uint64_t kUsecondsPerSecond = 1000000;
19 | struct timeval tv;
20 | gettimeofday(&tv, nullptr);
21 | return static_cast(tv.tv_sec) * kUsecondsPerSecond + tv.tv_usec;
22 | }
23 |
24 | #define TIMER_START(x) \
25 | const auto timer_##x = std::chrono::steady_clock::now()
26 |
27 | #define TIMER_STOP(x) \
28 | x += std::chrono::duration_cast( \
29 | std::chrono::steady_clock::now() - timer_##x) \
30 | .count()
31 |
32 | template
33 | struct Timer
34 | {
35 | Timer(T &res) : start_time_(std::chrono::steady_clock::now()), res_(res) {}
36 |
37 | ~Timer()
38 | {
39 | res_ += std::chrono::duration_cast(
40 | std::chrono::steady_clock::now() - start_time_)
41 | .count();
42 | }
43 |
44 | std::chrono::steady_clock::time_point start_time_;
45 | T &res_;
46 | };
47 |
48 | #ifdef LOGGING
49 | #define TIMER_START_LOGGING(x) TIMER_START(x)
50 | #define TIMER_STOP_LOGGING(x) TIMER_STOP(x)
51 | #define COUNTER_ADD_LOGGING(x, y) x += (y)
52 | #else
53 | #define TIMER_START_LOGGING(x)
54 | #define TIMER_STOP_LOGGING(x)
55 | #define COUNTER_ADD_LOGGING(x, y)
56 | #endif
57 |
58 | // #define LOGGING
59 |
60 | #ifdef LOGGING
61 | #define LOG(fmt, ...) \
62 | fprintf(stderr, "\033[1;31mLOG(<%s>:%d %s): \033[0m" fmt "\n", __FILE__, \
63 | __LINE__, __func__, ##__VA_ARGS__)
64 | #else
65 | #define LOG(fmt, ...)
66 | #endif
67 |
68 | class SpinLock
69 | {
70 | public:
71 | SpinLock() : mutex(false) {}
72 | SpinLock(std::string name) : mutex(false), name(name) {}
73 |
74 | bool try_lock()
75 | {
76 | bool expect = false;
77 | return mutex.compare_exchange_strong(
78 | expect, true, std::memory_order_acquire, std::memory_order_relaxed); // acquire on success: the critical section must be ordered after the lock
79 | }
80 |
81 | void lock()
82 | {
83 | uint64_t startOfContention = 0;
84 | bool expect = false;
85 | while (!mutex.compare_exchange_weak(expect, true, std::memory_order_acquire,
86 | std::memory_order_relaxed))
87 | {
88 | expect = false;
89 | debugLongWaitAndDeadlock(&startOfContention);
90 | }
91 | if (startOfContention != 0)
92 | {
93 | contendedTime += NowMicros() - startOfContention;
94 | ++contendedAcquisitions;
95 | }
96 | }
97 |
98 | void unlock() { mutex.store(0, std::memory_order_release); }
99 |
100 | void report()
101 | {
102 | LOG("spinlock %s: contendedAcquisitions %lu contendedTime %lu us",
103 | name.c_str(), contendedAcquisitions, contendedTime);
104 | }
105 |
106 | private:
107 | std::atomic_bool mutex;
108 | std::string name;
109 | uint64_t contendedAcquisitions = 0;
110 | uint64_t contendedTime = 0;
111 |
112 | void debugLongWaitAndDeadlock(uint64_t *startOfContention)
113 | {
114 | if (*startOfContention == 0)
115 | {
116 | *startOfContention = NowMicros();
117 | }
118 | else
119 | {
120 | uint64_t now = NowMicros();
121 | if (now >= *startOfContention + 1000000)
122 | {
123 | LOG("%s SpinLock locked for one second; deadlock?", name.c_str());
124 | }
125 | }
126 | }
127 | };
128 |
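`SpinLock` exposes `lock()`, `unlock()`, and `try_lock()`, so it satisfies the standard Lockable requirements and composes with the RAII helpers in `<mutex>`. A brief usage sketch, assuming lock.hh (above) and its stripped standard includes are available; the counter and function names are illustrative.

```
#include <mutex>      // std::lock_guard
#include <cstdint>
// assumes lock.hh (above) is on the include path

SpinLock stats_lock{"stats"};
uint64_t nr_promotions = 0;

void count_promotion()
{
    std::lock_guard<SpinLock> guard(stats_lock);   // spins until acquired
    ++nr_promotions;
}   // unlock() runs in the guard's destructor; report() prints contention stats when LOGGING is defined
```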
129 | // read-write lock
130 | class ReadWriteLock
131 | {
132 | // the lowest bit is used for writer
133 | public:
134 | bool TryReadLock()
135 | {
136 | uint64_t old_val = lock_value.load(std::memory_order_acquire);
137 | while (true)
138 | {
139 | if (old_val & 1 || old_val > 1024)
140 | {
141 | break;
142 | }
143 | uint64_t new_val = old_val + 2;
144 | bool cas = lock_value.compare_exchange_weak(old_val, new_val,
145 | std::memory_order_acq_rel,
146 | std::memory_order_acquire);
147 | if (cas)
148 | {
149 | return true;
150 | }
151 | }
152 | return false;
153 | }
154 |
155 | void lock_shared()
156 | {
157 | while (!TryReadLock())
158 | ;
159 | }
160 |
161 | void unlock_shared()
162 | {
163 | uint64_t old_val = lock_value.load(std::memory_order_acquire);
164 | while (true)
165 | {
166 | if (old_val <= 1)
167 | {
168 | assert(false && "unlock_shared() called without a matching lock_shared()");
169 | return;
170 | }
171 | uint64_t new_val = old_val - 2;
172 | if (lock_value.compare_exchange_weak(old_val, new_val))
173 | {
174 | break;
175 | }
176 | }
177 | }
178 |
179 | bool TryWriteLock()
180 | {
181 | uint64_t old_val = lock_value.load(std::memory_order_acquire);
182 | while (true)
183 | {
184 | if (old_val & 1)
185 | {
186 | return false;
187 | }
188 | uint64_t new_val = old_val | 1;
189 | bool cas = lock_value.compare_exchange_weak(old_val, new_val);
190 | if (cas)
191 | {
192 | break;
193 | }
194 | }
195 | // got the write lock; wait for in-flight readers to drain
196 | while (lock_value.load(std::memory_order_acquire) != 1)
197 | {
198 | asm("nop");
199 | }
200 | return true;
201 | }
202 |
203 | void lock()
204 | {
205 | while (!TryWriteLock())
206 | ;
207 | }
208 |
209 | void unlock()
210 | {
211 | assert(lock_value == 1);
212 | lock_value.store(0);
213 | }
214 |
215 | private:
216 | std::atomic_uint_fast64_t lock_value{0};
217 | };
218 |
219 | #define CAS(ptr, oldval, newval) \
220 | (__sync_bool_compare_and_swap(ptr, oldval, newval))
221 | class ReaderFriendlyLock
222 | {
223 | std::vector lock_vec_;
224 | public:
225 | DELETE_COPY_CONSTRUCTOR_AND_ASSIGNMENT(ReaderFriendlyLock);
226 |
227 | ReaderFriendlyLock(ReaderFriendlyLock &&rhs) noexcept
228 | {
229 | *this = std::move(rhs);
230 | }
231 | ReaderFriendlyLock& operator=(ReaderFriendlyLock &&rhs) {
232 | std::swap(this->lock_vec_, rhs.lock_vec_);
233 | return *this;
234 | }
235 |
236 | ReaderFriendlyLock() : lock_vec_(util::Schedule::max_nr_threads())
237 | {
238 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i)
239 | {
240 | lock_vec_[i][0] = 0;
241 | lock_vec_[i][1] = 0;
242 | }
243 | }
244 |
245 | bool lock()
246 | {
247 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i)
248 | {
249 | while (!CAS(&lock_vec_[i][0], 0, 1))
250 | {
251 | }
252 | }
253 | return true;
254 | }
255 |
256 | bool try_lock()
257 | {
258 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i)
259 | {
260 | if (!CAS(&lock_vec_[i][0], 0, 1))
261 | {
262 | for (i--; i >= 0; i--) {
263 | compiler_barrier();
264 | lock_vec_[i][0] = 0;
265 | }
266 | return false;
267 | }
268 | }
269 | return true;
270 | }
271 |
272 | bool try_lock_shared()
273 | {
274 | if (lock_vec_[util::Schedule::thread_id()][1]) {
275 | pr_once(info, "recursive lock!");
276 | return true;
277 | }
278 | return CAS(&lock_vec_[util::Schedule::thread_id()][0], 0, 1);
279 | }
280 |
281 | bool lock_shared()
282 | {
283 | if (lock_vec_[util::Schedule::thread_id()][1]) {
284 | pr_once(info, "recursive lock!");
285 | return true;
286 | }
287 | while (!CAS(&lock_vec_[util::Schedule::thread_id()][0], 0, 1))
288 | {
289 | }
290 | lock_vec_[util::Schedule::thread_id()][1] = 1;
291 | return true;
292 | }
293 |
294 | void unlock()
295 | {
296 | compiler_barrier();
297 | for (int i = 0; i < util::Schedule::max_nr_threads(); ++i)
298 | {
299 | lock_vec_[i][0] = 0;
300 | }
301 | }
302 |
303 | void unlock_shared()
304 | {
305 | compiler_barrier();
306 | lock_vec_[util::Schedule::thread_id()][0] = 0;
307 | lock_vec_[util::Schedule::thread_id()][1] = 0;
308 | }
309 | };
310 |
--------------------------------------------------------------------------------
/libterm/include/logging.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | #include "print-stack.hh"
3 | #include
4 | #include
5 | #include
6 | #include
7 |
8 | #define DYNAMIC_DEBUG 0
9 |
10 | #define COLOR_BLACK "\033[0;30m"
11 | #define COLOR_RED "\033[0;31m"
12 | #define COLOR_GREEN "\033[0;32m"
13 | #define COLOR_YELLOW "\033[0;33m"
14 | #define COLOR_BLUE "\033[0;34m"
15 | #define COLOR_MAGENTA "\033[0;35m"
16 | #define COLOR_CYAN "\033[0;36m"
17 | #define COLOR_WHITE "\033[0;37m"
18 | #define COLOR_DEFAULT "\033[0;39m"
19 |
20 | #define COLOR_BOLD_BLACK "\033[1;30m"
21 | #define COLOR_BOLD_RED "\033[1;31m"
22 | #define COLOR_BOLD_GREEN "\033[1;32m"
23 | #define COLOR_BOLD_YELLOW "\033[1;33m"
24 | #define COLOR_BOLD_BLUE "\033[1;34m"
25 | #define COLOR_BOLD_MAGENTA "\033[1;35m"
26 | #define COLOR_BOLD_CYAN "\033[1;36m"
27 | #define COLOR_BOLD_WHITE "\033[1;37m"
28 | #define COLOR_BOLD_DEFAULT "\033[1;39m"
29 |
30 | #define PT_RESET "\033[0m"
31 | #define PT_BOLD "\033[1m"
32 | #define PT_UNDERLINE "\033[4m"
33 | #define PT_BLINKING "\033[5m"
34 | #define PT_INVERSE "\033[7m"
35 |
36 | #define LIBPDP_PREFIX COLOR_CYAN "[libpdp] " PT_RESET
37 | #define pr_libpdp(fmt, args...) printf(LIBPDP_PREFIX fmt "\n", ##args)
38 |
39 | #define EXTRACT_FILENAME(PATH) (__builtin_strrchr("/" PATH, '/') + 1)
40 | #define __FILENAME__ EXTRACT_FILENAME(__FILE__)
41 |
42 | #define pr(fmt, args...) pr_libpdp(COLOR_GREEN "%s:%d: %s:" PT_RESET " (%d) " fmt, __FILENAME__, __LINE__, __func__, util::Schedule::thread_id(), ##args)
43 | #define pr_flf(file, line, func, fmt, args...) pr_libpdp(COLOR_GREEN "%s:%d: %s:" PT_RESET " (tid=%d) " fmt, file, line, func, util::Schedule::thread_id(), ##args)
44 |
45 | #define pr_info(fmt, args...) pr(fmt, ##args)
46 | #define pr_err(fmt, args...) pr(COLOR_RED fmt PT_RESET, ##args)
47 | #define pr_warn(fmt, args...) pr(COLOR_MAGENTA fmt PT_RESET, ##args)
48 | #define pr_emph(fmt, args...) pr(COLOR_YELLOW fmt PT_RESET, ##args)
49 | #if DYNAMIC_DEBUG
50 | #define pr_debug(fmt, args...) ({ \
51 | static bool enable_debug = util::getenv_bool("ENABLE_DEBUG").value_or(false); \
52 | if (enable_debug) pr(fmt, ##args); \
53 | })
54 | #else
55 | #define pr_debug(fmt, args...)
56 | #endif
57 |
58 | #define pr_once(level, format...) ({ \
59 | static bool __warned = false; \
60 | if (!__warned) [[unlikely]] { pr_##level(format); \
61 | __warned = true;} \
62 | })
63 |
64 | #define pr_coro(level, format, args...) pr_##level("coro_id=%d, " format, coro_id, ##args)
65 |
66 | // #define ASSERT(...)
67 | #define ASSERT(cond, format, args...) ({ \
68 | if (!(cond)) [[unlikely]] { \
69 | pr_err(format, ##args); \
70 | exit(EXIT_FAILURE); \
71 | } \
72 | })
73 | #define ASSERT_PERROR(cond, format, args...) ASSERT(cond, "%s: " format, strerror(errno), ##args)
74 |
75 | #define fmt_pr(level, fmt_str, args...) pr_##level("%s", fmt::format(fmt_str, ##args).c_str())
76 |
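The logging macros wrap printf-style formatting with a colored `[libpdp]` prefix and the calling thread id, and `ASSERT`/`ASSERT_PERROR` abort the process with a formatted message. A short usage sketch, assuming `util.hh` (for `util::Schedule`) and the standard headers stripped from the listing are also included; the device path is a placeholder, not a value from the repository.

```
#include "logging.hh"
#include <fcntl.h>     // open()

void open_backing_ssd()
{
    int fd = open("/dev/nvme0n1", O_RDONLY);       // placeholder device path
    ASSERT_PERROR(fd >= 0, "cannot open the backing SSD");
    pr_info("opened backing device, fd=%d", fd);
    pr_once(warn, "printed at most once per call site");
    pr_debug("only visible when DYNAMIC_DEBUG=1 and ENABLE_DEBUG is set");
}
```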
--------------------------------------------------------------------------------
/libterm/include/node.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | #include "rdma.hh"
3 | #include "sms.hh"
4 | #include "data-conn.hh"
5 | #include "ud-conn.hh"
6 |
7 | namespace rdma {
8 |
9 | class Thread;
10 |
11 | class Node {
12 | private:
13 | const char *kNrNodesKey = "nr_nodes";
14 | const char *kNrPreparedKey = "nr_prepared";
15 | const char *kNrConnectedKey = "nr_connected";
16 |
17 | public:
18 | using Uptr = std::unique_ptr;
19 | SharedMetaService sms_{"10.0.2.181", 23333}; // shared_meta_service
20 | int nr_nodes_;
21 | int node_id_;
22 | bool registered_ = false;
23 | vector threads_;
24 | std::vector> ctxs_;
25 |
26 | Node(int nr_nodes, int node_id = -1) : nr_nodes_(nr_nodes), node_id_(node_id) {}
27 |
28 | void register_node()
29 | {
30 | std::cout << "registering node..." << std::flush;
31 | if (node_id_ == -1)
32 | node_id_ = sms_.inc(kNrNodesKey) - 1;
33 |
34 | std::cout << node_id_ << std::endl;
35 | if (is_server()) {
36 | sms_.put(kNrPreparedKey, {'0'});
37 | sms_.put(kNrConnectedKey, {'0'});
38 | }
39 | registered_ = true;
40 | }
41 |
42 | void unregister_node()
43 | {
44 | if (registered_) {
45 | sms_.dec(kNrNodesKey);
46 | pr_info("unregister node %d", node_id_);
47 | registered_ = false;
48 | }
49 | }
50 |
51 | bool is_server()
52 | {
53 | return node_id_ == 0;
54 | }
55 |
56 | void wait_for_all_nodes_prepared()
57 | {
58 | std::cout << __func__ << "... " << std::flush;
59 | sms_.inc(kNrPreparedKey);
60 | while (sms_.get_int(kNrPreparedKey) < nr_nodes_) {
61 | usleep(10000);
62 | }
63 | std::cout << "done" << std::endl;
64 | }
65 |
66 | void wait_for_all_nodes_connected()
67 | {
68 | std::cout << __func__ << "... " << std::flush;
69 | sms_.inc(kNrConnectedKey);
70 | while (sms_.get_int(kNrConnectedKey) < nr_nodes_) {
71 | usleep(10000);
72 | }
73 | std::cout << "done" << std::endl;
74 | }
75 | };
76 |
77 | class Thread {
78 | util::ObserverPtr node_;
79 | util::ObserverPtr ctx_;
80 | int thread_id_;
81 |
82 | int nr_dst_nodes_; // including myself for simplicity
83 | int nr_dst_threads_;
84 |
85 | LocalMemoryRegion::Uptr local_mr_;
86 |
87 | struct RcInfo {
88 | MemoryRegion mr;
89 | QueuePair::Identifier data_addr;
90 | };
91 |
92 | // [dst_node_id][dst_thread_id]; for clients, dst_node_id=0 (connected to server)
93 | vector> data_qp_; // [node_id][thread_id]
94 | vector> data_conn_; // [node_id][thread_id]
95 |
96 | std::unique_ptr ud_qp_;
97 | vector>> ud_ah_; // [dst_node_id][dst_thread_id]
98 |
99 | static string cat_to_string(int node_id, int thread_id)
100 | {
101 | return std::to_string(node_id) + ":" + std::to_string(thread_id);
102 | }
103 |
104 | string rc_push_key(int dst_node_id, int dst_thread_id)
105 | {
106 | return std::string("[rc-addr]") + cat_to_string(node_->node_id_, thread_id_) + "-" + cat_to_string(dst_node_id, dst_thread_id);
107 | }
108 |
109 | string rc_pull_key(int dst_node_id, int dst_thread_id)
110 | {
111 | return std::string("[rc-addr]") + cat_to_string(dst_node_id, dst_thread_id) + "-" + cat_to_string(node_->node_id_, thread_id_);
112 | }
113 |
114 | auto ud_push_key()
115 | {
116 | return std::string("[ud-addr]") + cat_to_string(node_->node_id_, thread_id_);
117 | }
118 |
119 | auto ud_pull_key(int dst_node_id, int dst_thread_id)
120 | {
121 | return std::string("[ud-addr]") + cat_to_string(dst_node_id, dst_thread_id);
122 | }
123 |
124 | public:
125 | // for clients, dst_nr_nodes should be 1.
126 | // for the server, dst_nr_nodes should be node_->nr_nodes_.
127 | Thread(util::ObserverPtr node, util::ObserverPtr ctx, int thread_id, int dst_nr_nodes, int dst_nr_threads, LocalMemoryRegion::Uptr local_mr)
128 | : node_(std::move(node)), ctx_(ctx),
129 | thread_id_(thread_id), nr_dst_nodes_(dst_nr_nodes), nr_dst_threads_(dst_nr_threads),
130 | local_mr_(std::move(local_mr))
131 | {
132 | data_qp_.resize(dst_nr_nodes);
133 | data_conn_.resize(dst_nr_nodes);
134 | ud_ah_.resize(dst_nr_nodes);
135 | for (int i = 0; i < dst_nr_nodes; i++) {
136 | for (int j = 0; j < dst_nr_threads; j++) {
137 | data_qp_[i].emplace_back();
138 | data_conn_[i].emplace_back();
139 | ud_ah_[i].emplace_back();
140 | }
141 | }
142 | }
143 |
144 | void prepare_and_push_rc()
145 | {
146 | for (int i = 0; i < nr_dst_nodes_; i++) {
147 | if (i == node_->node_id_) continue; // myself
148 | for (int j = 0; j < nr_dst_threads_; j++) {
149 | data_qp_[i][j] = std::make_unique(ctx_);
150 | RcInfo thread_info{*local_mr_, data_qp_[i][j]->identifier()};
151 | node_->sms_.put(rc_push_key(i, j), util::to_vec_char(thread_info));
152 | }
153 | }
154 | }
155 |
156 | void prepare_and_push_ud()
157 | {
158 | ud_qp_ = std::make_unique(util::make_observer(ctx_.get()));
159 | node_->sms_.put(ud_push_key(), util::to_vec_char(ud_qp_->identifier()));
160 | }
161 |
162 | void pull_and_connect_rc()
163 | {
164 | for (int i = 0; i < nr_dst_nodes_; i++) {
165 | if (i == node_->node_id_) continue; // myself
166 | for (int j = 0; j < nr_dst_threads_; j++) {
167 | auto rc_info = util::from_vec_char(node_->sms_.get(rc_pull_key(i, j)));
168 | auto remote_mr = std::make_unique(rc_info.mr);
169 | data_qp_[i][j]->connect_to(rc_info.data_addr);
170 | auto p = std::make_unique(util::make_observer(data_qp_[i][j].get()),
171 | util::make_observer(local_mr_.get()), std::move(remote_mr));
172 | data_conn_[i][j] = std::move(p);
173 | }
174 | }
175 | }
176 |
177 | void pull_and_setup_ud()
178 | {
179 | for (int i = 0; i < nr_dst_nodes_; i++) {
180 | if (i == node_->node_id_) continue;
181 | for (int j = 0; j < nr_dst_threads_; j++) {
182 | auto remote_addr = util::from_vec_char(node_->sms_.get(ud_pull_key(i, j)));
183 | ud_ah_[i][j] = std::make_unique(ctx_, remote_addr);
184 | ud_qp_->setup();
185 | }
186 | }
187 | }
188 |
189 | util::ObserverPtr get_data_conn(int dst_node_id, int dst_thread_id)
190 | {
191 | return util::make_observer(data_conn_[dst_node_id][dst_thread_id].get());
192 | }
193 |
194 | auto get_ud_conn(int dst_node_id, int dst_thread_id)
195 | {
196 | return std::make_unique(util::make_observer(ud_qp_.get()), util::make_observer(ud_ah_[dst_node_id][dst_thread_id].get()));
197 | }
198 |
199 | auto ud_qp() { return ud_qp_.get(); }
200 | auto local_mr() { return local_mr_.get(); }
201 |
202 | };
203 |
204 | }
205 |
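`Node` uses the memcached-backed `SharedMetaService` as a rendezvous: counters such as `nr_nodes`, `nr_prepared`, and `nr_connected` act as barriers during startup. A client-side sketch of that flow, assuming node.hh (above) and its stripped includes; the node count is arbitrary and the elided steps depend on the rest of the library.

```
#include "node.hh"

// Rendezvous sketch for a 3-node cluster; node 0 is the server.
int main()
{
    rdma::Node node(/*nr_nodes=*/3);
    node.register_node();                  // obtains a node id from memcached when none is given
    // ... create per-thread QPs/MRs and publish their metadata via node.sms_ ...
    node.wait_for_all_nodes_prepared();    // barrier: everyone has published metadata
    // ... pull metadata and connect QPs ...
    node.wait_for_all_nodes_connected();   // barrier: everyone is connected
    node.unregister_node();
    return 0;
}
```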
--------------------------------------------------------------------------------
/libterm/include/page-bitmap.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | // mine
3 | #include
4 | #include
5 | // linux
6 | #include
7 | #include
8 | // c++
9 | #include
10 | #include
11 | // c
12 | #include
13 | #include
14 | #include
15 |
16 | namespace pdp {
17 | class PageBitmap {
18 | protected:
19 | util::ulong_t *bitmap_{}; // passed and freed by a derived class
20 | uint64_t bytes_{}; // bitmap mr_bytes_ (bits / 8)
21 | public:
22 | PageBitmap() = default;
23 |
24 | virtual ~PageBitmap()
25 | {
26 | // the derived class will set bitmap_ to nullptr in its dtor
27 | if (bitmap_) {
28 | delete[] bitmap_;
29 | bitmap_ = nullptr;
30 | }
31 | }
32 |
33 | bool test_present(uint64_t pg) const
34 | {
35 | ASSERT(pg / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg, bytes_ * util::kBitsPerByte);
36 | return util::test_bit(pg, bitmap_);
37 | }
38 |
39 | void set_present(uint64_t pg)
40 | {
41 | ASSERT(pg / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg, bytes_ * util::kBitsPerByte);
42 |
43 | util::set_bit(pg, bitmap_);
44 | }
45 |
46 | void clear_present(uint64_t pg)
47 | {
48 | ASSERT(pg / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg, bytes_ * util::kBitsPerByte);
49 |
50 | util::clear_bit(pg, bitmap_);
51 | }
52 |
53 | void assign(uint64_t pg, bool set)
54 | {
55 | if (test_present(pg) == set) return;
56 | if (set) {
57 | set_present(pg);
58 | } else {
59 | clear_present(pg);
60 | }
61 | }
62 |
63 | bool range_has_fault(uint64_t pg_begin, uint64_t pg_end) const
64 | {
65 | ASSERT(pg_end / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg_end, bytes_ * util::kBitsPerByte);
66 |
67 |
68 | // if (pg_end - pg_begin > 1) {
69 | // pr_err("error");
70 | // }
71 | // return false;
72 |
73 | for (auto i = pg_begin; i < pg_end; i++) {
74 | if (!test_present(i)) {
75 | // pr_err("true");
76 | return true;
77 | }
78 | }
79 | return false;
80 | }
81 |
82 | uint64_t range_count_fault(uint64_t pg_begin, uint64_t pg_end) const
83 | {
84 | ASSERT(pg_end / util::kBitsPerByte < bytes_, "pg=0x%lx is out of bound max_pages=0x%lx", pg_end, bytes_ * util::kBitsPerByte);
85 |
86 | uint64_t count = 0;
87 | for (auto i = pg_begin; i < pg_end; i++) {
88 | if (!test_present(i)) {
89 | count++;
90 | }
91 | }
92 | return count;
93 | }
94 | };
95 |
96 | class LocalPageBitmap final: public PageBitmap {
97 | util::ObserverPtr local_mr_;
98 | rdma::LocalMemoryRegion::Uptr bitmap_mr_;
99 |
100 | public:
101 | using Uptr = std::unique_ptr;
102 |
103 | LocalPageBitmap(util::ObserverPtr local_mr)
104 | : local_mr_(local_mr)
105 | {
106 | static auto &gv = GlobalVariables::instance();
107 | char file[256];
108 | sprintf(file, "/proc/pdp_bitmap_0x%x", local_mr_->lkey_);
109 | int fd = open(file, O_RDWR);
110 | assert(fd > 0);
111 |
112 | uint32_t nr_pfns;
113 | auto ret = pread(fd, &nr_pfns, sizeof(uint32_t), 0);
114 | assert(ret == sizeof(uint32_t));
115 |
116 | bytes_ = util::bits_to_long_aligned_bytes(nr_pfns);
117 | bitmap_ = (util::ulong_t *)mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
118 | MAP_SHARED | MAP_POPULATE, fd, 0);
119 | assert(bitmap_ != MAP_FAILED);
120 |
121 | close(fd);
122 |
123 | assert(gv.ori_ibv_symbols.reg_mr);
124 | bitmap_mr_ = std::make_unique(local_mr->ctx(), (void *)bitmap_, bytes_, false, gv.ori_ibv_symbols.reg_mr);
125 |
126 | pr_info("mr: key=0x%x; bitmap: addr=%p, length=0x%lx, key=0x%x", local_mr->lkey_, bitmap_, bytes_, bitmap_mr_->lkey_);
127 | }
128 |
129 | ~LocalPageBitmap() final
130 | {
131 | munmap(bitmap_, bytes_);
132 | bitmap_ = nullptr;
133 | }
134 |
135 | auto bitmap_mr() const { return util::make_observer(bitmap_mr_.get()); }
136 | };
137 |
138 | class RemotePageBitmap final : public PageBitmap {
139 | rdma::RemoteMemoryRegion remote_bitmap_mr_; // RDMA read from this remote mr.
140 | std::unique_ptr local_bitmap_mr_; // RDMA read to this local mr.
141 | rdma::Wr pull_wr_;
142 |
143 | public:
144 | using Uptr = std::unique_ptr;
145 |
146 | DELETE_COPY_CONSTRUCTOR_AND_ASSIGNMENT(RemotePageBitmap);
147 | DELETE_MOVE_CONSTRUCTOR_AND_ASSIGNMENT(RemotePageBitmap);
148 |
149 | RemotePageBitmap(util::ObserverPtr ctx, rdma::RemoteMemoryRegion remote_bitmap_mr)
150 | : remote_bitmap_mr_(remote_bitmap_mr)
151 | {
152 | static auto &gv = GlobalVariables::instance();
153 |
154 | bytes_ = remote_bitmap_mr_.length_;
155 | pr_info("remote_bitmap_mr: %s", remote_bitmap_mr.to_string().c_str());
156 |
157 | ASSERT(remote_bitmap_mr_.addr64() % SZ_4K == 0, "remote_bitmap_mr.addr is not aligned to 4KB.");
158 | ASSERT(remote_bitmap_mr_.length() % sizeof(util::ulong_t) == 0, "remote_bitmap_mr.length is not aligned to 8 bytes.");
159 |
160 | bitmap_ = new util::ulong_t[bytes_ / sizeof(util::ulong_t)];
161 | ASSERT(gv.ori_ibv_symbols.reg_mr, "gv.ori_ibv_symbols.reg_mr");
162 | local_bitmap_mr_ = std::make_unique(ctx, bitmap_, bytes_, false, gv.ori_ibv_symbols.reg_mr);
163 |
164 | // we generate a wr and let the caller submit via a RC QP.
165 | pull_wr_ = rdma::Wr().set_op_read().set_signaled(true)
166 | .set_rdma(remote_bitmap_mr_.pick_all())
167 | .set_sg(local_bitmap_mr_->pick_all());
168 | }
169 |
170 | ~RemotePageBitmap() final
171 | {
172 | delete[] bitmap_;
173 | bitmap_ = nullptr;
174 | }
175 |
176 | rdma::Wr pull_wr() { return pull_wr_; }
177 | };
178 |
179 | }
180 |
--------------------------------------------------------------------------------
/libterm/include/print-stack.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 |
3 | #include
4 | #include
5 | #include
6 |
7 | namespace util
8 | {
9 | // Credits: This lovely function is from **https://panthema.net/2008/0901-stacktrace-demangled/**
10 | static inline void print_stacktrace(FILE *out = stderr, unsigned int max_frames = 256)
11 | {
12 | // storage array for stack trace address data
13 | void *addrlist[max_frames + 1];
14 |
15 | // retrieve current stack addresses
16 | int addrlen = backtrace(addrlist, sizeof(addrlist) / sizeof(void *));
17 |
18 | if (addrlen == 0)
19 | {
20 | fprintf(out, "  <empty, possibly corrupt>\n");
21 | return;
22 | }
23 |
24 | // resolve addresses into strings containing "filename(function+address)",
25 | // this array must be free()-ed
26 | char **symbollist = backtrace_symbols(addrlist, addrlen);
27 |
28 | // allocate string which will be filled with the demangled function name
29 | size_t funcnamesize = 256;
30 | char *funcname = (char *)malloc(funcnamesize);
31 |
32 | // iterate over the returned symbol lines. skip the first, it is the
33 | // address of this function.
34 | for (int i = 1; i < addrlen; i++)
35 | {
36 | char *begin_name = 0, *begin_offset = 0, *end_offset = 0;
37 |
38 | // find parentheses and +address offset surrounding the mangled name:
39 | // ./module(function+0x15c) [0x8048a6d]
40 | for (char *p = symbollist[i]; *p; ++p)
41 | {
42 | if (*p == '(')
43 | begin_name = p;
44 | else if (*p == '+')
45 | begin_offset = p;
46 | else if (*p == ')' && begin_offset)
47 | {
48 | end_offset = p;
49 | break;
50 | }
51 | }
52 |
53 | if (begin_name && begin_offset && end_offset && begin_name < begin_offset)
54 | {
55 | *begin_name++ = '\0';
56 | *begin_offset++ = '\0';
57 | *end_offset = '\0';
58 |
59 | // mangled name is now in [begin_name, begin_offset) and caller
60 | // offset in [begin_offset, end_offset). now apply
61 | // __cxa_demangle():
62 |
63 | int status;
64 | char *ret = abi::__cxa_demangle(begin_name,
65 | funcname, &funcnamesize, &status);
66 | char str_buffer[4096];
67 | if (status == 0)
68 | {
69 | funcname = ret; // use possibly realloc()-ed string
70 | sprintf(str_buffer, " [%d] %s : %s+%s\n",
71 | i, symbollist[i], funcname, begin_offset);
72 | }
73 | else
74 | {
75 | // demangling failed. Output function name as a C function with
76 | // no arguments.
77 | sprintf(str_buffer, " [%d] %s : %s()+%s\n",
78 | i, symbollist[i], begin_name, begin_offset);
79 | }
80 | std::cerr << str_buffer;
81 | }
82 | else
83 | {
84 | // couldn't parse the line? print the whole line.
85 | // fprintf(out, " %s\n", symbollist[i]);
86 | std::cerr << "couldn't parse the line: " << symbollist[i] << "\n";
87 | }
88 | }
89 |
90 | free(funcname);
91 | free(symbollist);
92 | }
93 | }
94 |
--------------------------------------------------------------------------------
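A minimal usage sketch (not part of the repository; include path and handler name are illustrative): `util::print_stacktrace` is meant to be called from a fatal-signal handler, which is also how `libterm/test/perf.cc` uses it before re-raising the signal.
```
// Sketch: print a demangled backtrace on SIGSEGV, then re-raise the signal.
#include <csignal>

#include "print-stack.hh"

static void on_fatal_signal(int sig)
{
    util::print_stacktrace();   // demangled frames are written to stderr
    signal(sig, SIG_DFL);       // restore the default action and re-raise
    raise(sig);
}

int main()
{
    signal(SIGSEGV, on_fatal_signal);
    // ... run the program; a crash now leaves a readable stack trace.
    return 0;
}
```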
/libterm/include/queue.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | #include <queue>
3 | #include <mutex>
4 | #include <condition_variable>
5 |
6 | namespace util {
7 | // template
8 | // class Queue {
9 | // int sz_ = 0;
10 | //
11 | // std::vector last_vec_;
12 | //
13 | // std::mutex mutex_;
14 | // std::condition_variable cv_;
15 | //
16 | // int head_ = 0;
17 | // int tail_ = 0;
18 | //
19 | // int max_sz() const { return sz_ + 1; }
20 | //
21 | // int size() const { return (tail_ + max_sz() - head_) % max_sz(); }
22 | //
23 | // bool empty() const { return size() == 0; }
24 | //
25 | // bool full() const { return size() == max_sz() - 1; }
26 | //
27 | // void inc_head() { head_ = (head_ + 1) % max_sz(); }
28 | //
29 | // void inc_tail() { tail_ = (tail_ + 1) % max_sz(); }
30 | //
31 | // public:
32 | // Queue() : Queue(128) {}
33 | // explicit Queue(int sz) : sz_(sz), last_vec_(sz + 1) {}
34 | //
35 | // void push(auto &&value) {
36 | // std::unique_lock lk(mutex_);
37 | // cv_.wait(lk, [this] { return !full(); });
38 | //
39 | // last_vec_[tail_] = std::forward(value);
40 | // inc_tail();
41 | //
42 | // lk.unlock();
43 | // cv_.notify_all();
44 | // }
45 | //
46 | // T pop() {
47 | // std::unique_lock lk(mutex_);
48 | // cv_.wait(lk, [this] { return !empty(); });
49 | //
50 | // T value = std::move(last_vec_[head_]);
51 | // inc_head();
52 | //
53 | // lk.unlock();
54 | // cv_.notify_all();
55 | // return value;
56 | // }
57 | // };
58 | template <typename T>
59 | class Queue {
60 | int max_size_{};
61 | std::queue<T> queue_;
62 | std::mutex mutex_;
63 | std::condition_variable cv_;
64 | public:
65 | int size() const { return queue_.size(); }
66 | bool empty() const { return queue_.size() == 0; }
67 | bool full() const { return size() == max_size_; }
68 |
69 | public:
70 | Queue() : Queue(128) {}
71 | explicit Queue(int max_size) : max_size_{max_size} {}
72 | // std::mutex and std::condition_variable are not movable; the moved-to queue
73 | // takes the buffered items and capacity but fresh synchronization primitives.
74 | Queue(Queue &&queue)
75 |     : max_size_(queue.max_size_),
76 |       queue_(std::move(queue.queue_))
77 | {
78 | }
79 |
80 | void push(auto &&value) {
81 | std::unique_lock lk(mutex_);
82 | cv_.wait(lk, [this] { return !full(); });
83 | queue_.push(std::forward<decltype(value)>(value));
84 | lk.unlock();
85 |
86 | cv_.notify_all();
87 | }
88 |
89 | T pop() {
90 | std::unique_lock lk(mutex_);
91 | cv_.wait(lk, [this] { return !empty(); });
92 |
93 | T value = std::move(queue_.front());
94 | queue_.pop();
95 |
96 | lk.unlock();
97 | cv_.notify_all();
98 | return value;
99 | }
100 | };
101 | }
--------------------------------------------------------------------------------
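A small usage sketch (illustrative only): `util::Queue` is a bounded blocking queue, so `push()` blocks once `max_size_` items are queued and `pop()` blocks while the queue is empty, which lets a producer hand work to a consumer without extra synchronization.
```
// Sketch: bounded hand-off between one producer and one consumer thread.
#include <thread>

#include "queue.hh"

int main()
{
    util::Queue<int> q(4);             // producers block beyond 4 queued items

    std::jthread consumer([&q] {
        for (int v = q.pop(); v != -1; v = q.pop()) {
            // ... process v ...
        }
    });

    std::jthread producer([&q] {
        for (int i = 0; i < 16; i++) q.push(i);
        q.push(-1);                    // sentinel: tell the consumer to stop
    });
}
```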
/libterm/include/sms.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 | #include "util.hh"
3 |
4 | #include <libmemcached/memcached.hpp>
5 | #include <mutex>
6 | #include <string>
7 | #include <vector>
8 | #include <unistd.h>
9 |
10 | namespace rdma {
11 | using std::vector;
12 | using std::string;
13 |
14 | class SharedMetaService {
15 | std::string hostname_;
16 | in_port_t port_;
17 | memcache::Memcache memc_;
18 | std::mutex mutex_;
19 |
20 | public:
21 | SharedMetaService(const std::string &hostname, in_port_t port)
22 | : hostname_{hostname}, port_{port}, memc_(hostname, port)
23 | {
24 | if (!memc_.setBehavior(MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1)) {
25 | throw std::runtime_error("init memcached");
26 | }
27 | }
28 |
29 | std::string to_string() const
30 | {
31 | return hostname_ + ":" + std::to_string(port_);
32 | }
33 |
34 | void put(const std::string &key, const std::vector<char> &value)
35 | {
36 | std::unique_lock lk(mutex_);
37 | // while (true) {
38 | // memcached_return rc;
39 | // rc = memcached_set((memcached_st *)memc_.getImpl(), key.data(), key.size(),
40 | // value.data(), value.size(), (time_t)0, 0);
41 | // if (rc == MEMCACHED_SUCCESS) break;
42 | // usleep(400);
43 | // }
44 | while (!memc_.set(key, value, 0, 0)) {
45 | usleep(400);
46 | }
47 | // pr_info("key={%s}, value={%s}", key.c_str(), util::vec_char_to_string(value).c_str());
48 | }
49 |
50 | std::vector<char> get(const std::string &key)
51 | {
52 | std::vector<char> ret_val;
53 | size_t value_size;
54 | uint32_t flags;
55 | memcached_return rc;
56 |
57 | std::unique_lock lk(mutex_);
58 |
59 | // while (true) {
60 | // char *value = memcached_get((memcached_st *)memc_.getImpl(), key.c_str(), key.length(),
61 | // &value_size, &flags, &rc);
62 | // if (value && rc == MEMCACHED_SUCCESS) {
63 | // ret_val.resize(value_size);
64 | // memcpy(ret_val.data(), value, value_size);
65 | // break;
66 | // }
67 | // usleep(400);
68 | // }
69 |
70 | do {
71 | bool success = memc_.get(key, ret_val);
72 | if (!success) {
73 | usleep(10000);
74 | } else {
75 | break;
76 | }
77 | } while (true);
78 | // pr_info("key={%s}, value={%s}", key.c_str(), util::vec_char_to_string(ret_val).c_str());
79 |
80 | return ret_val;
81 | }
82 |
83 | int get_int(const std::string &key)
84 | {
85 | std::vector<char> ret_val;
86 | ret_val = get(key);
87 | string str(ret_val.begin(), ret_val.end());
88 | return std::stoi(str);
89 | }
90 |
91 | uint64_t inc(const std::string &key, bool auto_retry = true)
92 | {
93 | uint64_t res;
94 |
95 | do {
96 | bool success = memc_.increment(key, 1, &res);
97 | if (!success && auto_retry) {
98 | usleep(10000);
99 | } else {
100 | break;
101 | }
102 | } while (true);
103 |
104 | return res;
105 | }
106 |
107 | uint64_t dec(const std::string &key)
108 | {
109 | uint64_t res;
110 | while (!memc_.decrement(key, 1, &res)) {
111 | usleep(10000);
112 | }
113 | return res;
114 | }
115 | };
116 | } // namespace rdma
--------------------------------------------------------------------------------
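For illustration only (the memcached address and key below are placeholders, not the artifact's configuration): `rdma::SharedMetaService` is a thin blocking wrapper around memcached that nodes use to exchange small byte blobs of cluster metadata; both `put()` and `get()` retry until they succeed, so `get()` effectively waits for another node to publish the key.
```
// Sketch: publish a blob under a key on one node and fetch it on another.
#include <string>
#include <vector>

#include "sms.hh"

int main()
{
    // placeholder hostname/port for the cluster's memcached instance
    rdma::SharedMetaService sms("10.0.0.1", 11211);

    std::string qp_info = "qpn=1234,lid=7";   // example payload
    std::vector<char> payload(qp_info.begin(), qp_info.end());
    sms.put("node0:thread0:qp", payload);     // typically done by the server

    // typically done by a client; blocks until the key exists
    std::vector<char> blob = sms.get("node0:thread0:qp");
    return std::string(blob.begin(), blob.end()) == qp_info ? 0 : 1;
}
```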
/libterm/include/ud-conn.hh:
--------------------------------------------------------------------------------
1 | #pragma once
2 |
3 | #include "rdma.hh"
4 |
5 | namespace rdma {
6 | // attention! UD does not support connection.
7 | // we wrap an interface for better programming.
8 | // there is actually only one ud_qp per thread!
9 | class UdConnection {
10 | util::ObserverPtr ud_qp_;
11 | util::ObserverPtr remote_ud_ah_;
12 | public:
13 | UdConnection(util::ObserverPtr ud_qp, util::ObserverPtr remote_ud_ah)
14 | : ud_qp_(ud_qp), remote_ud_ah_(remote_ud_ah) {}
15 |
16 | void send(const MemorySge &local_mem, bool signaled)
17 | {
18 | ud_qp_->send(remote_ud_ah_.get(), local_mem, signaled);
19 | }
20 |
21 | auto ud_qp() { return ud_qp_; }
22 | auto remote_ud_ah() { return remote_ud_ah_; }
23 | };
24 | } // namespace rdma
25 |
--------------------------------------------------------------------------------
/libterm/include/zipf.hh:
--------------------------------------------------------------------------------
1 | // Copyright 2014 Carnegie Mellon University
2 | //
3 | // Licensed under the Apache License, Version 2.0 (the "License");
4 | // you may not use this file except in compliance with the License.
5 | // You may obtain a copy of the License at
6 | //
7 | // http://www.apache.org/licenses/LICENSE-2.0
8 | //
9 | // Unless required by applicable law or agreed to in writing, software
10 | // distributed under the License is distributed on an "AS IS" BASIS,
11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | // See the License for the specific language governing permissions and
13 | // limitations under the License.
14 |
15 | #pragma once
16 |
17 | #include <cassert>
18 | #include <cstdint>
19 | #include <cstdio>
20 | #include <cstring>
21 | #include <cstdlib>
22 | #include <cmath>
23 |
24 | namespace util {
25 | struct zipf_gen_state {
26 | uint64_t n; // number of items (input)
27 | double theta; // skewness (input) in (0, 1); or, 0 = uniform, 1 = always
28 | // zero
29 | double alpha; // only depends on theta
30 | double thres; // only depends on theta
31 | uint64_t last_n; // last n used to calculate the following
32 | double dbl_n;
33 | double zetan;
34 | double eta;
35 | // unsigned short rand_state[3]; // prng state
36 | uint64_t rand_state;
37 | };
38 |
39 | static double mehcached_rand_d(uint64_t *state) {
40 | // caution: this is maybe too non-random
41 | *state = (*state * 0x5deece66dUL + 0xbUL) & ((1UL << 48) - 1);
42 | return (double) *state / (double) ((1UL << 48) - 1);
43 | }
44 |
45 | static double mehcached_pow_approx(double a, double b) {
46 | // from
47 | // http://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
48 |
49 | // calculate approximation with fraction of the exponent
50 | int e = (int) b;
51 | union {
52 | double d;
53 | int x[2];
54 | } u = {a};
55 | u.x[1] =
56 | (int) ((b - (double) e) * (double) (u.x[1] - 1072632447) + 1072632447.);
57 | u.x[0] = 0;
58 |
59 | // exponentiation by squaring with the exponent's integer part
60 | // double r = u.d makes everything much slower, not sure why
61 | // TODO: use popcount?
62 | double r = 1.;
63 | while (e) {
64 | if (e & 1)
65 | r *= a;
66 | a *= a;
67 | e >>= 1;
68 | }
69 |
70 | return r * u.d;
71 | }
72 |
73 | static void mehcached_zipf_init(struct zipf_gen_state *state, uint64_t n,
74 | double theta, uint64_t rand_seed) {
75 | assert(n > 0);
76 | if (theta > 0.992 && theta < 1)
77 | fprintf(stderr,
78 | "theta > 0.992 will be inaccurate due to approximation\n");
79 | if (theta >= 1. && theta < 40.) {
80 | fprintf(stderr, "theta in [1., 40.) is not supported\n");
81 | assert(false);
82 | }
83 | assert(theta == -1. || (theta >= 0. && theta < 1.) || theta >= 40.);
84 | assert(rand_seed < (1UL << 48));
85 | memset(state, 0, sizeof(struct zipf_gen_state));
86 | state->n = n;
87 | state->theta = theta;
88 | if (theta == -1.)
89 | rand_seed = rand_seed % n;
90 | else if (theta > 0. && theta < 1.) {
91 | state->alpha = 1. / (1. - theta);
92 | state->thres = 1. + mehcached_pow_approx(0.5, theta);
93 | } else {
94 | state->alpha = 0.; // unused
95 | state->thres = 0.; // unused
96 | }
97 | state->last_n = 0;
98 | state->zetan = 0.;
99 | // state->rand_state[0] = (unsigned short)(rand_seed >> 0);
100 | // state->rand_state[1] = (unsigned short)(rand_seed >> 16);
101 | // state->rand_state[2] = (unsigned short)(rand_seed >> 32);
102 | state->rand_state = rand_seed;
103 | }
104 |
105 | static void mehcached_zipf_init_copy(struct zipf_gen_state *state,
106 | const struct zipf_gen_state *src_state,
107 | uint64_t rand_seed) {
108 |
109 | (void) mehcached_zipf_init_copy;
110 | assert(rand_seed < (1UL << 48));
111 | memcpy(state, src_state, sizeof(struct zipf_gen_state));
112 | // state->rand_state[0] = (unsigned short)(rand_seed >> 0);
113 | // state->rand_state[1] = (unsigned short)(rand_seed >> 16);
114 | // state->rand_state[2] = (unsigned short)(rand_seed >> 32);
115 | state->rand_state = rand_seed;
116 | }
117 |
118 | static void mehcached_zipf_change_n(struct zipf_gen_state *state, uint64_t n) {
119 | (void) mehcached_zipf_change_n;
120 | state->n = n;
121 | }
122 |
123 | static double mehcached_zeta(uint64_t last_n, double last_sum, uint64_t n,
124 | double theta) {
125 | if (last_n > n) {
126 | last_n = 0;
127 | last_sum = 0.;
128 | }
129 | while (last_n < n) {
130 | last_sum += 1. / mehcached_pow_approx((double) last_n + 1., theta);
131 | last_n++;
132 | }
133 | return last_sum;
134 | }
135 |
136 | static uint64_t mehcached_zipf_next(struct zipf_gen_state *state) {
137 | if (state->last_n != state->n) {
138 | if (state->theta > 0. && state->theta < 1.) {
139 | state->zetan = mehcached_zeta(state->last_n, state->zetan, state->n,
140 | state->theta);
141 | state->eta =
142 | (1. - mehcached_pow_approx(2. / (double) state->n,
143 | 1. - state->theta)) /
144 | (1. - mehcached_zeta(0, 0., 2, state->theta) / state->zetan);
145 | }
146 | state->last_n = state->n;
147 | state->dbl_n = (double) state->n;
148 | }
149 |
150 | if (state->theta == -1.) {
151 | uint64_t v = state->rand_state;
152 | if (++state->rand_state >= state->n)
153 | state->rand_state = 0;
154 | return v;
155 | } else if (state->theta == 0.) {
156 | double u = mehcached_rand_d(&state->rand_state);
157 | return (uint64_t) (state->dbl_n * u);
158 | } else if (state->theta >= 40.) {
159 | return 0UL;
160 | } else {
161 | // from J. Gray et al. Quickly generating billion-record synthetic
162 | // databases. In SIGMOD, 1994.
163 |
164 | // double u = erand48(state->rand_state);
165 | double u = mehcached_rand_d(&state->rand_state);
166 | double uz = u * state->zetan;
167 | if (uz < 1.)
168 | return 0UL;
169 | else if (uz < state->thres)
170 | return 1UL;
171 | else
172 | return (uint64_t) (
173 | state->dbl_n *
174 | mehcached_pow_approx(state->eta * (u - 1.) + 1., state->alpha));
175 | }
176 | }
177 |
178 | static void mehcached_test_zipf(double theta) {
179 |
180 | (void) (mehcached_test_zipf);
181 |
182 | double zetan = 0.;
183 | const uint64_t n = 10000000000UL;
184 | uint64_t i;
185 |
186 | for (i = 0; i < n; i++)
187 | zetan += 1. / pow((double) i + 1., theta);
188 |
189 | struct zipf_gen_state state;
190 | if (theta < 1. || theta >= 40.)
191 | mehcached_zipf_init(&state, n, theta, 0);
192 |
193 | uint64_t num_key0 = 0;
194 | const uint64_t num_samples = 10000000UL;
195 | if (theta < 1. || theta >= 40.) {
196 | for (i = 0; i < num_samples; i++)
197 | if (mehcached_zipf_next(&state) == 0)
198 | num_key0++;
199 | }
200 |
201 | printf("theta = %lf; using pow(): %.10lf", theta, 1. / zetan);
202 | if (theta < 1. || theta >= 40.)
203 | printf(", using approx-pow(): %.10lf",
204 | (double) num_key0 / (double) num_samples);
205 | printf("\n");
206 | }
207 |
208 | class ZipfGen {
209 | private:
210 | struct zipf_gen_state state_;
211 | public:
212 | ZipfGen(uint64_t n, double theta, uint64_t rand_seed)
213 | {
214 | mehcached_zipf_init(&state_, n, theta, rand_seed);
215 | }
216 |
217 | uint64_t next()
218 | {
219 | return mehcached_zipf_next(&state_);
220 | }
221 | };
222 | } // namespace util
--------------------------------------------------------------------------------
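A brief usage sketch (the parameters are arbitrary): `util::ZipfGen` draws keys in `[0, n)` with skewness `theta`, where `theta = 0` is uniform and values approaching 1 concentrate accesses on a few hot keys; `libterm/test/perf.cc` builds its generator the same way from `--skewness_100 / 100`.
```
// Sketch: estimate how often the hottest key is drawn for a given skewness.
#include <cstdint>
#include <cstdio>

#include "zipf.hh"

int main()
{
    const uint64_t n = 1'000'000;                        // key space size
    util::ZipfGen gen(n, /*theta=*/0.99, /*rand_seed=*/42);

    const uint64_t samples = 1'000'000;
    uint64_t hits_key0 = 0;
    for (uint64_t i = 0; i < samples; i++) {
        if (gen.next() == 0) hits_key0++;
    }
    printf("key 0 frequency: %.4f\n", (double)hits_key0 / (double)samples);
    return 0;
}
```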
/libterm/kmod/Makefile:
--------------------------------------------------------------------------------
1 | ccflags-y := -DM=$(M)
2 |
3 | obj-m += pch.o
4 | # lioh-y := proc.o
5 |
6 | all:
7 | make -C /lib/modules/$(shell uname -r)/build M=$(shell pwd) modules
8 |
9 | clean:
10 | make -C /lib/modules/$(shell uname -r)/build M=$(shell pwd) clean
11 |
--------------------------------------------------------------------------------
/libterm/kmod/pch.c:
--------------------------------------------------------------------------------
1 | #include <linux/module.h>
2 | #include <linux/proc_fs.h>
3 | #include <linux/pagemap.h>
4 | #include <linux/mm.h>
5 | #include <linux/fs.h>
6 | #include <linux/uaccess.h>
7 |
8 | #define PROC_NAME "pch"
9 |
10 | // taken from Linux source.
11 | static struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
12 | {
13 | struct page *page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD, 0);
14 | if (!page)
15 | return page;
16 | if (!xa_is_value(page)) {
17 | return find_subpage(page, index);
18 | }
19 | return NULL;
20 | }
21 |
22 | static unsigned char mincore_page(struct address_space *mapping, pgoff_t index)
23 | {
24 | struct page *page = find_get_incore_page(mapping, index);
25 | unsigned char val = 0;
26 |
27 | if (page) {
28 | if (PageUptodate(page)) {
29 | val |= (1 << 0);
30 | }
31 | if (PageDirty(page)) {
32 | val |= (1 << 1);
33 | }
34 | put_page(page);
35 | }
36 |
37 | return val;
38 | }
39 |
40 |
41 | static ssize_t
42 | pch_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
43 | {
44 | struct file *mmap_fp = file->private_data;
45 |
46 | uint8_t *kern_buf = kvzalloc(PAGE_SIZE, GFP_KERNEL);
47 | if (!kern_buf) {
48 | return -EINVAL;
49 | }
50 |
51 | for (size_t iter_offset = 0; iter_offset < count; iter_offset += PAGE_SIZE) {
52 | size_t iter_count = min(count - iter_offset, PAGE_SIZE);
53 | pgoff_t start_idx = *pos + iter_offset;
54 | pgoff_t last_idx = start_idx + iter_count - 1;
55 |
56 | size_t off = 0;
57 |
58 | #if 0
59 | XA_STATE(xas, &mmap_fp->f_mapping->i_pages, start_idx);
60 | void *entry;
61 | rcu_read_lock();
62 | // entry: is it page or folio? whatever.
63 | xas_for_each(&xas, entry, last_idx) {
64 | uint8_t v = 1;
65 | if (xas_retry(&xas, entry) || xa_is_value(entry)) {
66 | v = 0;
67 | }
68 | kern_buf[off] = v;
69 | off++;
70 | }
71 |
72 | rcu_read_unlock();
73 | #else
74 | for (pgoff_t idx = start_idx; idx <= last_idx; idx++) {
75 | kern_buf[off] = mincore_page(mmap_fp->f_mapping, idx);
76 | off++;
77 | }
78 | #endif
79 |
80 | if (copy_to_user(buf + iter_offset, kern_buf, iter_count)) {
81 | pr_err("failed to copy_to_user\n");
82 | kvfree(kern_buf);
83 | return -EINVAL;
84 | }
85 | }
86 | kvfree(kern_buf);
87 | *pos += count;
88 | return count;
89 | }
90 |
91 | static ssize_t
92 | pch_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
93 | {
94 | char path[32];
95 | struct file *mmap_fp;
96 |
97 | if (file->private_data) {
98 | pr_err("private has been set.\n");
99 | return -EINVAL;
100 | }
101 |
102 | if (count > 16) {
103 | pr_err("path too long\n");
104 | return -EINVAL;
105 | }
106 |
107 | if (copy_from_user(path, buf, count)) {
108 | pr_err("fail to copy from user.\n");
109 | return -EINVAL;
110 | }
111 | path[count] = '\0';
112 |
113 | mmap_fp = filp_open(path, O_RDWR | O_LARGEFILE, 0);
114 | if (IS_ERR(mmap_fp)) {
115 | pr_err("fail to open %s\n", path);
116 | return PTR_ERR(mmap_fp);
117 | }
118 |
119 | // pr_info("pch: %s\n", path);
120 |
121 | file->private_data = mmap_fp;
122 |
123 | return count;
124 | }
125 |
126 | static int pch_release(struct inode *inode, struct file *file)
127 | {
128 | if (file->private_data) {
129 | filp_close(file->private_data, 0);
130 | }
131 | return 0;
132 | }
133 |
134 | static const struct proc_ops pch_ops = {
135 | .proc_read = pch_read,
136 | .proc_write = pch_write,
137 | .proc_release = pch_release,
138 | };
139 |
140 | static int __init pch_init(void)
141 | {
142 | struct proc_dir_entry *entry;
143 |
144 | entry = proc_create(PROC_NAME, S_IFREG | S_IRUGO, NULL, &pch_ops);
145 | if (!entry) {
146 | pr_err("failed to create /proc/" PROC_NAME "\n");
147 | return -ENOMEM;
148 | }
149 |
150 | return 0;
151 | }
152 |
153 | static void __exit pch_exit(void)
154 | {
155 | remove_proc_entry(PROC_NAME, NULL);
156 | }
157 |
158 | module_init(pch_init);
159 | module_exit(pch_exit);
160 |
161 | MODULE_LICENSE("GPL");
162 |
--------------------------------------------------------------------------------
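The userspace sketch below is illustrative only (the proc entry name and backing-file path are assumptions; `libterm/kmod/test-pch.cc` is the actual test program). It shows the protocol `pch_write`/`pch_read` implement: write the path of the mmap'ed backing file into the proc entry once, then `pread()` with the starting page index as the offset; each returned byte has bit 0 set if that page is resident and up-to-date in the page cache and bit 1 set if it is dirty.
```
// Sketch: query page-cache residency of the first 8 pages of a backing file.
// Needs root, since the proc entry is created read-only for ordinary users.
#include <cassert>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd = open("/proc/pch", O_RDWR);                  // assumed entry name
    assert(fd >= 0);

    const char *backing = "/mnt/ssd/mr.img";             // placeholder path (<= 16 chars)
    ssize_t wret = write(fd, backing, strlen(backing));  // bind this fd to the file
    assert(wret == (ssize_t)strlen(backing));

    unsigned char state[8];
    ssize_t rret = pread(fd, state, sizeof(state), /*starting page index=*/0);
    assert(rret == (ssize_t)sizeof(state));

    for (int i = 0; i < 8; i++) {
        printf("page %d: cached=%d dirty=%d\n", i, state[i] & 1, (state[i] >> 1) & 1);
    }
    close(fd);
    return 0;
}
```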
/libterm/kmod/test-pch.cc:
--------------------------------------------------------------------------------
1 | #include <cassert>
2 | #include <cstdint>
3 | #include <cstdio>
4 | #include <ctime>
5 | #include <fcntl.h>
6 | #include <unistd.h>
7 |
8 | #define SZ_1G (1l << 30)
9 | // #define SZ_2G (2l << 30)
10 | #define PAGE_SIZE (4096)
11 |
12 |
13 | static inline uint64_t wall_time_ns()
14 | {
15 | // thread_local auto first_time = std::chrono::steady_clock::now();
16 | // auto current = std::chrono::steady_clock::now();
17 | // return std::chrono::duration_cast(current - first_time).count();
18 | struct timespec ts{};
19 | clock_gettime(CLOCK_REALTIME, &ts);
20 | return ts.tv_nsec + ts.tv_sec * 1'000'000'000;
21 | }
22 |
23 | int main()
24 | {
25 | int fd = open("/proc/pch-nvme1n1p2", O_RDONLY);
26 | assert(fd >= 0);
27 | uint8_t v;
28 |
29 | auto ts1 = wall_time_ns();
30 |
31 | // char *buf = new char[SZ_1G / PAGE_SIZE];
32 | // pread(fd, buf, SZ_1G, 0);
33 | // for (int i = 0; i < (SZ_1G / PAGE_SIZE); i++) {
34 | // printf("%u", buf[i]);
35 | // }
36 | for (size_t i = 0; i < SZ_1G; i += PAGE_SIZE) {
37 | auto ret = pread(fd, &v, 1, i);
38 | assert(ret == 1);
39 | printf("%u", v);
40 | }
41 |
42 | printf("\n");
43 | auto ts2 = wall_time_ns();
44 | printf("\n");
45 | auto elapsed = ts2 - ts1;
46 | printf("total %luns, average %luns\n", elapsed, elapsed / (SZ_1G / PAGE_SIZE));
47 |
48 | return 0;
49 | }
50 |
--------------------------------------------------------------------------------
/libterm/test/perf.cc:
--------------------------------------------------------------------------------
1 | #include "rdma.hh"
2 | #include "util.hh"
3 | #include "node.hh"
4 | #include "zipf.hh"
5 |
6 | #include <gflags/gflags.h>
7 | #include <fmt/format.h>
8 |
9 | #include <csignal>
10 |
11 | DEFINE_int32(nr_nodes, 2, "number of total nodes, including the server and clients.");
12 | DEFINE_bool(verify, false, "");
13 | DEFINE_int32(nr_client_threads, 1, "# client threads");
14 | DEFINE_string(sz_unit, "64", "access unit in bytes (OK with k,m,g suffix)");
15 | DEFINE_string(sz_server_mr, "32g", "server mr size in bytes (OK with k,m,g suffix)");
16 | DEFINE_int32(running_seconds, 120, "running seconds");
17 | DEFINE_int32(node_id, -1, "node id (0: server, ...: client)");
18 | DEFINE_int32(skewness_100, 99, "skewness * 100, -1: seq");
19 | DEFINE_uint32(write_percent, 100, "write percent");
20 | DEFINE_uint32(hotspot_switch_second, 0, "the second to switch hotspot");
21 |
22 | constexpr int kNrServerThreads = 1;
23 | constexpr size_t kClientMrSize = SZ_4M;
24 | constexpr bool kClientMmap = false;
25 | constexpr bool kClientOdp = false;
26 | constexpr bool kServerMmap = true;
27 | constexpr bool kServerOdp = true;
28 |
29 | std::atomic_bool g_should_stop{};
30 | std::string g_server_mmap_dev = util::getenv_string("PDP_server_mmap_dev").value_or("");
31 | std::unique_ptr g_lat_dis;
32 | rdma::Node::Uptr g_node;
33 |
34 | void signal_int_handler(int sig)
35 | {
36 | g_should_stop = true;
37 | util::print_stacktrace();
38 |
39 | if (g_node) {
40 | if (!g_node->is_server()) {
41 | }
42 | g_node->unregister_node();
43 | g_node = nullptr;
44 | }
45 |
46 | // reset to the default for clean-up
47 | signal(sig, SIG_DFL);
48 | raise(sig);
49 | }
50 |
51 | static bool check_data(void *addr, size_t length, uint64_t expected_base)
52 | {
53 | ASSERT((uint64_t)addr % 8 == 0, "addr is not aligned to 8.");
54 |
55 | for (size_t i = 0; i < length; i += sizeof(uint64_t)) {
56 | uint64_t value = *(uint64_t *)((char *)addr + i);
57 | uint64_t expected = expected_base + i;
58 | if (value != expected) {
59 | return false;
60 | }
61 | }
62 | return true;
63 | }
64 |
65 | static void set_data(void *addr, size_t length, uint64_t base, bool print_progress)
66 | {
67 | ASSERT((uint64_t)addr % 8 == 0, "addr is not aligned to 8.");
68 | // #pragma omp parallel for
69 | for (size_t i = 0; i < length; i += sizeof(uint64_t)) {
70 | if (print_progress && i % SZ_1G == 0) {
71 | pr_info("set %lu GB.", i / SZ_1G);
72 | }
73 | *(uint64_t *)((char *)addr + i) = base + i;
74 | }
75 | }
76 |
77 | void initialize()
78 | {
79 | g_node = std::make_unique<rdma::Node>(FLAGS_nr_nodes, FLAGS_node_id);
80 | g_node->register_node();
81 |
82 | if (g_node->is_server()) {
83 | std::cout << "Server" << std::endl;
84 | std::cout << "prepare" << std::endl;
85 | auto ctx = std::make_unique(0);
86 |
87 | ASSERT(!g_server_mmap_dev.empty(), "PDP_server_mmap_dev must be set!");
88 | for (int i = 0; i < kNrServerThreads; i++) {
89 | rdma::Allocation alloc{(size_t)util::stoll_suffix(FLAGS_sz_server_mr), g_server_mmap_dev, !FLAGS_verify};
90 | if (FLAGS_verify) {
91 | madvise(alloc.addr(), alloc.size(), MADV_SEQUENTIAL);
92 | set_data((char *)alloc.addr(), alloc.size(), 0, true);
93 | madvise(alloc.addr(), alloc.size(), MADV_RANDOM);
94 | }
95 | auto local_mr = std::make_unique(util::make_observer(ctx.get()), std::move(alloc), kServerOdp);
96 | g_node->threads_.emplace_back(util::make_observer(g_node.get()), util::make_observer(ctx.get()), i, FLAGS_nr_nodes, FLAGS_nr_client_threads, std::move(local_mr));
97 | g_node->threads_.back().prepare_and_push_rc();
98 | }
99 |
100 | g_node->ctxs_.emplace_back(std::move(ctx));
101 |
102 | g_node->wait_for_all_nodes_prepared();
103 | for (int i = 0; i < kNrServerThreads; i++) {
104 | g_node->threads_[i].pull_and_connect_rc();
105 | }
106 | } else {
107 | std::cout << "Client" << std::endl;
108 | std::cout << "prepare" << std::endl;
109 |
110 | for (int i = 0; i < FLAGS_nr_client_threads; i++) {
111 | auto ctx = std::make_unique(0);
112 | auto local_mr = std::make_unique(util::make_observer(ctx.get()), rdma::Allocation{kClientMrSize}, kClientOdp);
113 | g_node->threads_.emplace_back(util::make_observer(g_node.get()), util::make_observer(ctx.get()), i, 1, kNrServerThreads, std::move(local_mr));
114 | g_node->threads_.back().prepare_and_push_rc();
115 | g_node->ctxs_.emplace_back(std::move(ctx));
116 | }
117 | g_node->wait_for_all_nodes_prepared();
118 | for (int i = 0; i < FLAGS_nr_client_threads; i++) {
119 | g_node->threads_[i].pull_and_connect_rc();
120 | }
121 | }
122 |
123 | g_node->wait_for_all_nodes_connected();
124 | }
125 |
126 | void cleanup()
127 | {
128 | // clean-up for the next execution
129 | g_node->unregister_node();
130 | }
131 |
132 | void client_test(const std::stop_token &stop_token, int thread_id)
133 | {
134 | auto &thread = g_node->threads_[thread_id];
135 | auto conn = thread.get_data_conn(0, 0);
136 | auto *local_mr = conn->local_mr();
137 | auto *remote_mr = conn->remote_mr();
138 | auto sz_unit = util::stoll_suffix(FLAGS_sz_unit);
139 | static util::LatencyRecorder lr_send("send");
140 | static util::LatencyRecorder lr_wait("wait");
141 |
142 | uint64_t nr_units = remote_mr->length() / sz_unit;
143 |
144 | util::Schedule::bind_client_fg_cpu(thread_id);
145 |
146 | ASSERT(FLAGS_skewness_100 < 100, "FLAGS_skewness_100 must be < 100!");
147 | bool seq = false;
148 | if (FLAGS_skewness_100 < 0) {
149 | seq = true;
150 | FLAGS_skewness_100 = 0;
151 | }
152 | double theta = double(FLAGS_skewness_100) / 100;
153 | util::ZipfGen zipf_gen(nr_units, theta, FLAGS_node_id * FLAGS_nr_client_threads + thread_id);
154 |
155 | unsigned int seed = thread_id;
156 |
157 | auto switch_tp = std::chrono::steady_clock::now() + std::chrono::seconds(FLAGS_hotspot_switch_second);
158 |
159 | for (uint64_t i = 0; /*i < 16 * sz_unit*/; i += sz_unit) {
160 | lr_send.begin_one();
161 | uint64_t ts = util::wall_time_ns();
162 |
163 | if (stop_token.stop_requested()) break;
164 |
165 | uint64_t v = zipf_gen.next();
166 | if (FLAGS_hotspot_switch_second) {
167 | if (std::chrono::steady_clock::now() > switch_tp) {
168 | v = nr_units - 1 - v;
169 | }
170 | }
171 | v = util::hash_u64(v) % nr_units;
172 | if (seq) {
173 | v = (i / sz_unit) % nr_units;
174 | }
175 |
176 | uint64_t remote_offset = v * sz_unit;
177 |
178 | uint64_t local_addr = local_mr->addr64() + i % local_mr->length();
179 | uint64_t remote_addr = remote_mr->addr64() + remote_offset;
180 |
181 | auto wr = rdma::Wr()
182 | .set_sg(local_addr, sz_unit, local_mr->lkey())
183 | .set_rdma(remote_addr, remote_mr->rkey())
184 | .set_signaled(true);
185 |
186 | uint64_t expected_base = util::hash_u64(remote_offset);
187 |
188 | bool is_read = (rand_r(&seed) % 100) >= FLAGS_write_percent;
189 | if (is_read) {
190 | wr.set_op_read();
191 | } else {
192 | wr.set_op_write();
193 | if (FLAGS_verify) {
194 | set_data((void *)local_addr, sz_unit, expected_base, false);
195 | }
196 | }
197 |
198 | conn->post_send_wr(&wr);
199 | lr_wait.begin_one();
200 |
201 | lr_send.end_one();
202 | struct ibv_wc wc{};
203 | conn->wait_send(&wc);
204 | assert(wc.status == IBV_WC_SUCCESS);
205 |
206 | if (FLAGS_verify) {
207 | if (is_read) {
208 | bool ok = check_data((void *)local_addr, sz_unit, remote_offset);
209 | ASSERT(ok, "check error");
210 | } else {
211 | memset((void *)local_addr, 0x34, sz_unit);
212 | auto read_wr = rdma::Wr().set_op_read()
213 | .set_sg(local_addr, sz_unit, local_mr->lkey())
214 | .set_rdma(remote_addr, remote_mr->rkey())
215 | .set_signaled(true);
216 | conn->post_send_wr(&read_wr);
217 | conn->wait_send(&wc);
218 | ASSERT(wc.status == IBV_WC_SUCCESS, "");
219 |
220 | bool ok = check_data((void *)local_addr, sz_unit, expected_base);
221 | ASSERT(ok, "check error");
222 | }
223 | }
224 |
225 | lr_wait.end_one();
226 |
227 | uint64_t lat = util::wall_time_ns() - ts;
228 | g_lat_dis->record(thread_id, lat);
229 | }
230 | }
231 |
232 | static void check_flags()
233 | {
234 | ASSERT(FLAGS_skewness_100 < 100, "skewness_100 must be < 100");
235 | ASSERT(FLAGS_write_percent <= 100, "write_percent must be <= 100");
236 | ASSERT(FLAGS_node_id >= 0 && FLAGS_node_id < FLAGS_nr_nodes, "FLAGS_node_id must be between 0 and nr_nodes");
237 | }
238 |
239 | static void system_cmd_func(const std::stop_token &stop_token, const std::string &cmd)
240 | {
241 |
242 | while (!stop_token.stop_requested()) {
243 | using namespace std::chrono_literals;
244 | auto tp = std::chrono::steady_clock::now();
245 | std::ignore = system(cmd.c_str());
246 | std::this_thread::sleep_until(tp + 1s);
247 | }
248 | }
249 |
250 | int main(int argc, char *argv[])
251 | {
252 |
253 | gflags::SetUsageMessage(argv[0]);
254 | gflags::ParseCommandLineFlags(&argc, &argv, true);
255 |
256 | signal(SIGINT, signal_int_handler);
257 | signal(SIGSEGV, signal_int_handler);
258 |
259 | check_flags();
260 |
261 | initialize();
262 |
263 | if (g_node->is_server()) {
264 | pr_info("running...");
265 | const std::string cmd = fmt::format(R"(top -bc -n1 -w512 | grep -e "{}" -e "mlx5_ib_page" | grep -v grep; cat /proc/vmstat | egrep "writeback|dirty")", argv[0]);
266 |
267 | // std::jthread system_cmd_thread([&cmd](const std::stop_token &stop_token) {
268 | // system_cmd_func(stop_token, cmd);
269 | // });
270 | sleep(FLAGS_running_seconds + 2);
271 | } else {
272 | pr_info("# client threads=%d", FLAGS_nr_client_threads);
273 |
274 | std::vector<std::jthread> test_threads(FLAGS_nr_client_threads);
275 | g_lat_dis = std::make_unique(FLAGS_nr_client_threads);
276 |
277 | for (int i = 0; i < FLAGS_nr_client_threads; i++) {
278 | test_threads[i] = std::jthread(client_test, i);
279 | }
280 |
281 | double elapsed = 0.0;
282 | util::Schedule::bind_client_fg_cpu(FLAGS_nr_client_threads + 1);
283 |
284 | uint32_t running_second = 0;
285 | while (!g_should_stop) {
286 | using namespace std::chrono_literals;
287 | auto tp = std::chrono::steady_clock::now();
288 | if (running_second) {
289 | pr_emph("epoch %u (%.2lfs): %s", running_second, elapsed, g_lat_dis->report_and_clear().c_str());
290 | }
291 | running_second++;
292 | if (running_second > FLAGS_running_seconds) {
293 | break;
294 | }
295 | std::this_thread::sleep_until(tp + 1s);
296 |
297 | auto d = std::chrono::steady_clock::now() - tp;
298 | elapsed = std::chrono::duration<double>(d).count();
299 | }
300 | }
301 |
302 | cleanup();
303 | return 0;
304 | }
305 |
--------------------------------------------------------------------------------