├── README.md
├── qos
│   └── status.rst
└── crimson
    └── status.rst

/README.md:
--------------------------------------------------------------------------------
# ceph-notes

Repository for posting rst files for public consumption that don't warrant committing to the main repository.
--------------------------------------------------------------------------------
/qos/status.rst:
--------------------------------------------------------------------------------
QOS Orientation
===============

Goal
----

Add support for serving low-throughput/high-priority clients at low latency
while maintaining high total throughput.

A secondary goal for this framework is to also solve the problem of
maintaining high background recovery and scrub throughput without disturbing
client throughput and latency when client IO is present.

Work So Far
-----------

MClock/DMClock
~~~~~~~~~~~~~~

mclock and dmclock [1] are algorithms for allocating IO resources among a set
of clients with different IO allocations/priorities. [2] is the current ceph
implementation of dmclock. [3] is an old PR adding the client-level hooks
and messages for propagating the dmclock parameters. [4] is a pending PR with
a cleaned-up reimplementation of the server side.

- [1] https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Gulati.pdf
- [2] https://github.com/ceph/dmclock
- [3] https://github.com/ceph/ceph/pull/20235
- [4] https://github.com/ceph/ceph/pull/30650

In Progress:

- [4] above replaces the existing OpQueue-based mclock implementations with a
  simpler one.

Next:

- Evaluate mclock behavior in scheduling client IO vs recovery/background IO
  (see Recovery below)

Throttles
~~~~~~~~~

A general prerequisite for providing QoS among competing sources of work is to
avoid accepting more work at any particular time than is required for the next
layer down to achieve the desired throughput. The first and most important
place where this shows up in rados is at the bluestore layer. Bluestore
currently will accept many IOs into the bluestore queue before it begins to
block. This means that if the next layer up receives a high-priority IO, it
cannot be served until all requests already in the bluestore queue (presumably
lower priority) have been served. (A minimal sketch of this admission-control
idea appears at the end of this subsection.)

In order to evaluate what the bluestore throttles *should* be set to, there is
now a set of tracepoints in bluestore [4] for measuring sampled per-IO throttle
values and latency, as well as modifications to the objectstore fio backend to
vary the throttle values over the course of a run, and a set of scripts in cbt
[5] for plotting the throttle/latency/throughput curves of the emitted
tracepoints.

Complete:

- [4] https://github.com/ceph/ceph/pull/29674

  * This PR is also a good primer on how to add lttng tracepoints in general.

- [5] https://github.com/ceph/cbt/pull/190

  * See fio_objectstore_tools/bluestore_throttle_tuning.rst in the cbt PR for
    more details on how to use these new tools.

Next Steps:

- Figure out how to automatically tune the throttles, as the above tools
  aren't reasonable for most users to use.

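For illustration only, here is a minimal sketch of the admission-control idea
above: cap the bytes admitted to the next layer so that a newly arriving
high-priority IO never sits behind a deep backlog of already-admitted,
lower-priority requests. This is not bluestore's actual throttle code; the
name ``ByteThrottle`` and its interface are made up for the example.

.. code-block:: cpp

   // Illustrative only: a minimal blocking byte throttle, not bluestore's
   // actual Throttle class. Capping the bytes admitted to the layer below
   // keeps its queue shallow, so a high-priority IO arriving later is not
   // stuck behind a deep backlog of already-admitted IOs.
   #include <condition_variable>
   #include <cstdint>
   #include <mutex>

   class ByteThrottle {  // hypothetical name
   public:
     explicit ByteThrottle(uint64_t max_bytes) : max_(max_bytes) {}

     // Block the submitter until this IO fits under the in-flight budget.
     // Assumes bytes <= max_bytes.
     void acquire(uint64_t bytes) {
       std::unique_lock<std::mutex> l(m_);
       cv_.wait(l, [&] { return cur_ + bytes <= max_; });
       cur_ += bytes;
     }

     // Called on IO completion; wakes blocked submitters.
     void release(uint64_t bytes) {
       std::lock_guard<std::mutex> l(m_);
       cur_ -= bytes;
       cv_.notify_all();
     }

   private:
     std::mutex m_;
     std::condition_variable cv_;
     uint64_t max_;
     uint64_t cur_ = 0;
   };

The tuning problem described above then amounts to picking the budget large
enough to keep the device saturated but small enough that queueing delay in
the layer below stays low.
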
Recovery (future work)
~~~~~~~~~~~~~~~~~~~~~~

It's possible that even without mclock, the existing WeightedPriorityQueue
implementation may suffice for controlling client IO vs recovery IO, given
appropriate throttle values.

Next Steps:

- Add a way of measuring recovery operation throughput and latency, likely
  using lttng tracepoints as in [4].
- Use cbt's existing recovery support to quantify recovery impact with and
  without appropriate throttle values.

--------------------------------------------------------------------------------
/crimson/status.rst:
--------------------------------------------------------------------------------
Goal
====

Crimson is designed to be a faster OSD, in the sense that:

- It targets fast storage devices, like NVMe storage, to take advantage of the
  high random-I/O performance and high throughput of the new hardware.
- It will be able to bypass the kernel when talking to networking/storage
  devices which support polling.
- It will be more computationally efficient, so the load is more evenly
  balanced in a multi-core system, and each core will be able to serve the
  same amount of IO with lower CPU usage.
- It will achieve these goals with a from-scratch design. It also utilizes
  modern techniques, like SPDK, DPDK and the Seastar framework, so it can
  1) bypass the kernel, 2) avoid memcpy, and 3) avoid lock contention.

Crimson will be a drop-in replacement for the classic OSD. Our long-term goal
is to redesign the object storage backend from scratch; this new backend is
dubbed "Seastore". The disk layout we will use for Seastore won't necessarily
be optimal for HDDs. But Seastore is still in its very early stage, and is
still missing a detailed design -- we don't have enough resources to work on
it yet. As an intermediate solution, we are working on adapting bluestore to
crimson-osd in the hope that we can use it for testing in the mid-term. And if
the new object storage backend does not work well with HDDs, we will continue
using bluestore in crimson just for supporting HDDs.

We are using a C++ framework named Seastar to implement Crimson, but we are
not porting the classic OSD to Seastar, as the design philosophy and various
considerations could be very different between them. So, again, please bear in
mind that porting the classic OSD to Seastar and adapting it to bluestore is
never our final goal.

The new Crimson OSD will be compatible with the classic OSD, so existing users
will be able to upgrade from classic OSD to Crimson OSD. In other words,
crimson-osd will:

- Support the librados protocol

  * Be able to talk to existing librados clients
  * Be able to join an existing Ceph cluster as an OSD

- Support most existing OSD commands

Note that some existing options won't apply to the Crimson OSD, and new
options will be added.

Current status
==============

Currently, we are using a memory-based object store for testing.

Functions
---------

The feature set of the OSD can be covered by different use cases (a minimal
librados client is sketched below for reference):

- rados benchmark

  * random/seq read/write

- RBD engine of the fio benchmark

  * random/seq read/write without exclusive locking (work in progress). This
    is enough to run the fio benchmark.

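For reference, the use cases above exercise the same paths as any librados
application. Below is a minimal sketch of such a client; the pool and object
names are made up and error handling is omitted. crimson-osd aims to serve
exactly this kind of unmodified client.

.. code-block:: cpp

   // Minimal librados C++ client (illustrative; pool/object names are made
   // up). An existing client like this should work against crimson-osd
   // without changes.
   #include <rados/librados.hpp>

   int main() {
     librados::Rados cluster;
     cluster.init("admin");             // connect as client.admin
     cluster.conf_read_file(nullptr);   // read the default ceph.conf
     cluster.connect();

     librados::IoCtx io;
     cluster.ioctx_create("test-pool", io);   // hypothetical pool name

     librados::bufferlist bl;
     bl.append("hello crimson");
     io.write_full("test-object", bl);        // replace the object's contents

     io.close();
     cluster.shutdown();
     return 0;
   }
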
Tests
-----

Currently, we test crimson-osd in the following ways:

- Unit tests for the different building blocks, like config, messenger, and
  mon client. These tests are now part of the "make check" run.
- A performance test for comparing the performance of a new change against
  that of the master branch. This test is driven by CBT, and is now integrated
  as a jenkins job, which can be launched using a trigger phrase on GitHub.

Future Work
===========

All operations supported by librados, including snap support
-----------------------------------------------------------------
This means we will be able to support RBD and RGW.

Log-based recovery + backfill
-----------------------------

A core feature of Ceph is online movement and recovery of data due to cluster
changes and failures. Broadly, the following features need to be added to
crimson to support those higher-level features:

* Operation logging on write
* Recovery from the above for briefly down OSDs (Recovery)
* Online reconstruction for new OSDs or OSDs down for a longer period
  (Backfill)

Scrub
-----

Online verification of stored data over a long period of time is another
staple of the current OSD -- this feature will need to be carried over to
crimson.

I/O operation state machine
---------------------------

How do client I/O operations coordinate with background operations? For
instance, read/write I/O should wait for recovery/scrub/backfill, etc. A
simplified sketch of this kind of coordination follows.

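The sketch below is deliberately simplified and uses plain
``std::shared_future`` rather than crimson's Seastar futures; ``ObjectGate``
and the other names are made up for the example. It only shows the shape of
the problem: a client op on an object must wait until background work on that
object has finished.

.. code-block:: cpp

   // Illustrative only: a toy model of "client IO waits for background work
   // on the same object". Names are hypothetical; crimson's real mechanism is
   // Seastar-future based, not thread-blocking.
   #include <future>
   #include <map>
   #include <mutex>
   #include <string>

   class ObjectGate {
   public:
     // Mark an object busy with recovery/scrub/backfill; the background task
     // keeps the returned promise and fulfills it when done with the object.
     std::promise<void> start_background(const std::string& oid) {
       std::lock_guard<std::mutex> l(m_);
       std::promise<void> p;
       busy_[oid] = p.get_future().share();
       return p;
     }

     void finish_background(const std::string& oid, std::promise<void>& p) {
       p.set_value();   // wake any client ops blocked on this object
       std::lock_guard<std::mutex> l(m_);
       busy_.erase(oid);
     }

     // Client read/write path: block until background work on the object ends.
     void wait_if_busy(const std::string& oid) {
       std::shared_future<void> f;
       {
         std::lock_guard<std::mutex> l(m_);
         auto it = busy_.find(oid);
         if (it == busy_.end()) return;
         f = it->second;
       }
       f.wait();
     }

   private:
     std::mutex m_;
     std::map<std::string, std::shared_future<void>> busy_;
   };

In crimson itself this coordination has to be expressed with Seastar futures,
so that a waiting operation yields to the reactor instead of blocking a
thread.
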
Use config-proxy in messenger
-----------------------------

We are using a poor man's solution -- a plain struct for holding msgr-related
settings -- and it is not updated by the option subcomponent.

Performance Review and CI integration
-------------------------------------

Currently we have preliminary CI support for catching significant performance
regressions, but we need to update the test harness and test cases whenever
necessary. Also, we need to:

- Review the performance to see if there is any regression not caught by the
  CI tests.
- Compare the performance of classic OSD and crimson-osd on a monthly basis.
- Adapt crimson-osd to the existing teuthology qa suite, once it is able to
  cover more rados operations, and hence support RBD and RGW clients.

Adapt to bluestore
------------------

We are now using a variant of memstore as the object store backend. Bluestore
will serve as the intermediate solution before we have seastore.

Seastore
--------

The next-generation objectstore optimized for NVMe devices. It will take a
long time before its GA. As a reference, bluestore started in early 2015, and
it was ready for production in late 2017. There will be three phases:

- Prototyping/Design:

  * Prototyping

    - define typical use cases, constraints and assumptions
    - evaluate different options and understand the tradeoffs, probably doing
      some experiments, before settling on a specific design

  * Design:

    - in-memory/on-disk data structures
    - a sequence diagram illustrating how to coordinate foreground IOPS and
      background maintenance tasks
    - define its interfaces for talking to the other parts of the OSD

- Implementation:

  * Integrate the object store with crimson-osd. If seastore cannot support
    HDDs well, it should be able to coexist with bluestore.
  * Stabilize the disk layout.

Seastar+SPDK evaluation/integration
-----------------------------------

- To evaluate different approaches to kernel-bypass techniques.
- To integrate SPDK into Seastar.
--------------------------------------------------------------------------------