├── README.md
├── qos
│   └── status.rst
└── crimson
    └── status.rst

/README.md:
--------------------------------------------------------------------------------
# ceph-notes

Repository for posting rst files for public consumption that don't warrant committing to the main repository.
--------------------------------------------------------------------------------
/qos/status.rst:
--------------------------------------------------------------------------------
QOS Orientation
===============

Goal
----

Add support for serving low-throughput/high-priority clients at low latency
while maintaining high total throughput.

A secondary goal for this framework is to also solve the problem of
maintaining high background recovery and scrub throughput without disturbing
client throughput and latency when client IO is present.

Work So Far
-----------

MClock/DMClock
~~~~~~~~~~~~~~

mclock and dmclock [1] are algorithms for allocating IO resources among a set
of clients with different IO allocations/priorities. [2] is the current ceph
implementation of dmclock. [3] is an old PR adding the client-level hooks
and messages for propagating the dmclock parameters. [4] is a pending PR with
a cleaned-up reimplementation of the server side.

- [1] https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Gulati.pdf
- [2] https://github.com/ceph/dmclock
- [3] https://github.com/ceph/ceph/pull/20235
- [4] https://github.com/ceph/ceph/pull/30650

In Progress:

- [4] above replaces the existing OpQueue-based mclock implementations with a
  simpler one.

Next:

- Evaluate mclock behavior in scheduling client IO vs recovery/background IO
  (see Recovery below)

Throttles
~~~~~~~~~

A general prerequisite for providing QoS among competing sources of work is to
avoid accepting more work at any particular time than is required for the next
layer down to achieve the desired throughput. The first and most important
place where this shows up in rados is at the bluestore layer. Bluestore
currently will accept many IOs into the bluestore queue before it begins to
block. This means that if the next layer up receives a high-priority IO, it
cannot be served until all requests already in the bluestore queue (presumably
lower priority) have been served. (A minimal sketch of this admission-control
idea appears at the end of this subsection.)

In order to evaluate what the bluestore throttles *should* be set to, there is
now a set of tracepoints in bluestore [4] for measuring sampled per-IO throttle
values and latency, as well as modifications to the objectstore fio backend to
vary the throttle values over the course of a run, and a set of scripts in cbt
[5] for plotting the throttle/latency/throughput curves of the emitted
tracepoints.

Complete:

- [4] https://github.com/ceph/ceph/pull/29674

  * This PR is also a good primer on how to add lttng tracepoints in general.

- [5] https://github.com/ceph/cbt/pull/190

  * See fio_objectstore_tools/bluestore_throttle_tuning.rst in the cbt PR for
    more details on how to use these new tools.

Next Steps:

- Figure out how to automatically tune the throttles, as the above tools
  aren't reasonable for most users to use.

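For illustration only, here is a minimal sketch of the admission-control idea
above: cap the bytes admitted to the next layer so that a newly arriving
high-priority IO never sits behind a deep backlog of already-admitted,
lower-priority requests. This is not bluestore's actual throttle code; the
name ``ByteThrottle`` and its interface are made up for the example.

.. code-block:: cpp

   // Illustrative only: a minimal blocking byte throttle, not bluestore's
   // actual Throttle class. Capping the bytes admitted to the layer below
   // keeps its queue shallow, so a high-priority IO arriving later is not
   // stuck behind a deep backlog of already-admitted IOs.
   #include <condition_variable>
   #include <cstdint>
   #include <mutex>

   class ByteThrottle {  // hypothetical name
   public:
     explicit ByteThrottle(uint64_t max_bytes) : max_(max_bytes) {}

     // Block the submitter until this IO fits under the in-flight budget.
     // Assumes bytes <= max_bytes.
     void acquire(uint64_t bytes) {
       std::unique_lock<std::mutex> l(m_);
       cv_.wait(l, [&] { return cur_ + bytes <= max_; });
       cur_ += bytes;
     }

     // Called on IO completion; wakes blocked submitters.
     void release(uint64_t bytes) {
       std::lock_guard<std::mutex> l(m_);
       cur_ -= bytes;
       cv_.notify_all();
     }

   private:
     std::mutex m_;
     std::condition_variable cv_;
     uint64_t max_;
     uint64_t cur_ = 0;
   };

The tuning problem described above then amounts to picking the budget large
enough to keep the device saturated but small enough that queueing delay in
the layer below stays low.
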
Recovery (future work)
~~~~~~~~~~~~~~~~~~~~~~

It's possible that even without mclock, the existing WeightedPriorityQueue
implementation may suffice for controlling client IO vs recovery IO, given
appropriate throttle values.

Next Steps:

- Add a way of measuring recovery operation throughput and latency, likely
  using lttng tracepoints as in [4].
- Use cbt's existing recovery support to quantify recovery impact with and
  without appropriate throttle values.

--------------------------------------------------------------------------------
/crimson/status.rst:
--------------------------------------------------------------------------------
Goal
====

Crimson is designed to be a faster OSD, in the sense that:

- It targets fast storage devices, like NVMe storage, to take advantage of the
  high random-I/O performance and high throughput of the new hardware.
- It will be able to bypass the kernel when talking to networking/storage
  devices which support polling.
- It will be more computationally efficient, so the load is more evenly
  balanced in a multi-core system, and each core will be able to serve the
  same amount of IO with lower CPU usage.
- It will achieve these goals with a from-scratch design. It also utilizes
  modern techniques, like SPDK, DPDK and the Seastar framework, so it can
  1) bypass the kernel, 2) avoid memcpy, and 3) avoid lock contention.

Crimson will be a drop-in replacement for the classic OSD. Our long-term goal
is to redesign the object storage backend from scratch; this new backend is
dubbed "Seastore". The disk layout we will use for Seastore won't necessarily
be optimal for HDDs. But Seastore is still in its very early stage, and is
still missing a detailed design -- we don't have enough resources to work on
it yet. As an intermediate solution, we are working on adapting bluestore to
crimson-osd in the hope that we can use it for testing in the mid-term. And if
the new object storage backend does not work well with HDDs, we will continue
using bluestore in crimson just for supporting HDDs.

We are using a C++ framework named Seastar to implement Crimson, but we are
not porting the classic OSD to Seastar, as the design philosophy and various
considerations could be very different between them. So, again, please bear in
mind that porting the classic OSD to Seastar and adapting it to bluestore is
never our final goal.

The new Crimson OSD will be compatible with the classic OSD, so existing users
will be able to upgrade from classic OSD to Crimson OSD. In other words,
crimson-osd will:

- Support the librados protocol

  * Be able to talk to existing librados clients
  * Be able to join an existing Ceph cluster as an OSD

- Support most existing OSD commands

Note that some existing options won't apply to the Crimson OSD, and new
options will be added.

Current status
==============

Currently, we are using a memory-based object store for testing.

Functions
---------

The feature set of the OSD can be covered by different use cases (a minimal
librados client is sketched below for reference):

- rados benchmark

  * random/seq read/write

- RBD engine of the fio benchmark

  * random/seq read/write without exclusive locking (work in progress). This
    is enough to run the fio benchmark.

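For reference, the use cases above exercise the same paths as any librados
application. Below is a minimal sketch of such a client; the pool and object
names are made up and error handling is omitted. crimson-osd aims to serve
exactly this kind of unmodified client.

.. code-block:: cpp

   // Minimal librados C++ client (illustrative; pool/object names are made
   // up). An existing client like this should work against crimson-osd
   // without changes.
   #include <rados/librados.hpp>

   int main() {
     librados::Rados cluster;
     cluster.init("admin");             // connect as client.admin
     cluster.conf_read_file(nullptr);   // read the default ceph.conf
     cluster.connect();

     librados::IoCtx io;
     cluster.ioctx_create("test-pool", io);   // hypothetical pool name

     librados::bufferlist bl;
     bl.append("hello crimson");
     io.write_full("test-object", bl);        // replace the object's contents

     io.close();
     cluster.shutdown();
     return 0;
   }
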
Tests
-----

Currently, we test crimson-osd in the following ways:

- Unit tests for the different building blocks, like config, messenger, and
  mon client. These tests are now part of the "make check" run.
- A performance test for comparing the performance of a new change against
  that of the master branch. This test is driven by CBT, and is now integrated
  as a jenkins job, which can be launched using a trigger phrase on GitHub.

Future Work
===========

All operations supported by librados, including snap support
-----------------------------------------------------------------
This means we will be able to support RBD and RGW.

Log-based recovery + backfill
-----------------------------

A core feature of Ceph is online movement and recovery of data due to cluster
changes and failures. Broadly, the following features need to be added to
crimson to support those higher-level features:

* Operation logging on write
* Recovery from the above for briefly down OSDs (Recovery)
* Online reconstruction for new OSDs or OSDs down for a longer period
  (Backfill)

Scrub
-----

Online verification of stored data over a long period of time is another
staple of the current OSD -- this feature will need to be carried over to
crimson.

I/O operation state machine
---------------------------

How do client I/O operations coordinate with background operations? For
instance, read/write I/O should wait for recovery/scrub/backfill, etc. A
simplified sketch of this kind of coordination follows.

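The sketch below is deliberately simplified and uses plain
``std::shared_future`` rather than crimson's Seastar futures; ``ObjectGate``
and the other names are made up for the example. It only shows the shape of
the problem: a client op on an object must wait until background work on that
object has finished.

.. code-block:: cpp

   // Illustrative only: a toy model of "client IO waits for background work
   // on the same object". Names are hypothetical; crimson's real mechanism is
   // Seastar-future based, not thread-blocking.
   #include <future>
   #include <map>
   #include <mutex>
   #include <string>

   class ObjectGate {
   public:
     // Mark an object busy with recovery/scrub/backfill; the background task
     // keeps the returned promise and fulfills it when done with the object.
     std::promise<void> start_background(const std::string& oid) {
       std::lock_guard<std::mutex> l(m_);
       std::promise<void> p;
       busy_[oid] = p.get_future().share();
       return p;
     }

     void finish_background(const std::string& oid, std::promise<void>& p) {
       p.set_value();   // wake any client ops blocked on this object
       std::lock_guard<std::mutex> l(m_);
       busy_.erase(oid);
     }

     // Client read/write path: block until background work on the object ends.
     void wait_if_busy(const std::string& oid) {
       std::shared_future<void> f;
       {
         std::lock_guard<std::mutex> l(m_);
         auto it = busy_.find(oid);
         if (it == busy_.end()) return;
         f = it->second;
       }
       f.wait();
     }

   private:
     std::mutex m_;
     std::map<std::string, std::shared_future<void>> busy_;
   };

In crimson itself this coordination has to be expressed with Seastar futures,
so that a waiting operation yields to the reactor instead of blocking a
thread.
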
Use config-proxy in messenger
-----------------------------

We are using a poor man's solution -- a plain struct for holding msgr-related
settings -- and it is not updated by the option subcomponent.

Performance Review and CI integration
-------------------------------------

Currently we have preliminary CI support for catching significant performance
regressions, but we need to update the test harness and test cases whenever
necessary. Also, we need to:

- Review the performance to see if there is any regression not caught by the
  CI tests.
- Compare the performance of classic OSD and crimson-osd on a monthly basis.
- Adapt crimson-osd to the existing teuthology qa suite, once it is able to
  cover more rados operations, and hence support RBD and RGW clients.

Adapt to bluestore
------------------

We are now using a variant of memstore as the object store backend. Bluestore
will serve as the intermediate solution before we have seastore.

Seastore
--------

The next-generation objectstore optimized for NVMe devices. It will take a
long time before its GA. As a reference, bluestore started in early 2015, and
it was ready for production in late 2017. There will be three phases:

- Prototyping/Design:

  * Prototyping

    - define typical use cases, constraints and assumptions
    - evaluate different options and understand the tradeoffs, probably doing
      some experiments, before settling on a specific design

  * Design:

    - in-memory/on-disk data structures
    - a sequence diagram illustrating how to coordinate foreground IOPS and
      background maintenance tasks
    - define its interfaces for talking to the other parts of the OSD

- Implementation:

  * Integrate the object store with crimson-osd. If seastore cannot support
    HDDs well, it should be able to coexist with bluestore.
  * Stabilize the disk layout.

Seastar+SPDK evaluation/integration
-----------------------------------

- To evaluate different approaches to kernel-bypass techniques.
- To integrate SPDK into Seastar.
--------------------------------------------------------------------------------