├── LICENSE
├── README.md
└── docs
    ├── CXL_Emu_Setup.md
    ├── CXL_Introduction.md
    ├── CXL_Usage.md
    ├── CXL_related_works.md
    ├── Evaluations.md
    ├── Xalloc.md
    └── k8s_with_cxl.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 Zhang Cao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# CXL-101
Contains some materials about CXL.


1. [**CXL Introduction**](docs/CXL_Introduction.md): Why CXL, and an introduction to the CXL 1.1, 2.0, and 3.0 specifications.
2. [**CXL-related Works**](docs/CXL_related_works.md): An introduction to some current works on CXL. (TBD)
3. [**CXL Emulation and Setup**](docs/CXL_Emu_Setup.md): How to emulate CXL-based memory devices, with a specific method for setting up the simulation environment.
4. [**CXL Usage**](docs/CXL_Usage.md): How to use CXL-based memory.
5. [**CXL-based Memory Evaluations**](docs/Evaluations.md): Evaluations of CXL-based memory (QEMU-emulated and NUMA-node-emulated).
6. [**Build an Easy-to-Use CXL-based Memory Lib in Rust**](docs/Xalloc.md): Simplify CXL-based memory access with a Rust library.
7. [**Enable k8s with CXL-based Memory**](docs/k8s_with_cxl.md): Explore how to enable current Kubernetes with CXL-based memory.
--------------------------------------------------------------------------------
/docs/CXL_Emu_Setup.md:
--------------------------------------------------------------------------------

CXL is a fairly new technology, and there aren't any commercial hardware options for CXL devices on the market yet. That's why it's crucial to have ways to emulate CXL devices for testing and development purposes.

There are primarily two methods for simulating CXL-based memory. The first uses QEMU for emulation, and the second uses NUMA node memory. Additionally, there are some lightweight methods for emulating CXL-based memory that focus mainly on performance considerations.

# QEMU Emulation

CXL devices can be emulated with the help of QEMU. As of 8/22/2023, the [mainline QEMU](https://www.qemu.org/docs/master/system/devices/cxl.html) has full support for creating CXL volatile memory devices as well as non-volatile memory devices. Also, the [Linux kernel](https://docs.kernel.org/driver-api/cxl/memory-devices.html) supports the CXL-related drivers. Thus, it's practical to set up a CXL-based memory device this way; a step-by-step guide can be found [here](https://memverge.com/cxl-qemuemulating-cxl-shared-memory-devices-in-qemu/).
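
Once the guest boots with the emulated device, a quick sanity check is to confirm that the CXL device and its device-DAX endpoint actually show up. Below is a minimal Rust sketch of such a check; the `/sys/bus/cxl/devices` path and the `daxX.Y` naming are assumptions based on the upstream CXL and device-DAX drivers, so adjust them to your kernel.

```rust
use std::fs;

/// List entries under a directory, returning an empty list if it does not exist.
fn list_dir(path: &str) -> Vec<String> {
    match fs::read_dir(path) {
        Ok(entries) => entries
            .filter_map(|e| e.ok())
            .map(|e| e.file_name().to_string_lossy().into_owned())
            .collect(),
        Err(_) => Vec::new(),
    }
}

fn main() {
    // Assumed sysfs location exposed by the kernel's CXL core driver.
    let cxl_devices = list_dir("/sys/bus/cxl/devices");
    println!("CXL bus devices: {:?}", cxl_devices);

    // Assumed device-DAX character devices (e.g. /dev/dax0.0) created once the
    // emulated memory has been bound to the dax driver.
    let dax_nodes: Vec<String> = list_dir("/dev")
        .into_iter()
        .filter(|name| name.starts_with("dax"))
        .collect();
    println!("device-DAX nodes: {:?}", dax_nodes);

    if cxl_devices.is_empty() && dax_nodes.is_empty() {
        eprintln!("No CXL or DAX devices found; check the QEMU command line and kernel config.");
    }
}
```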

# NUMA Memory Emulation

In a NUMA setup, processors within a computer system share local memory and collaborate. Think of NUMA as a kind of microprocessor cluster contained in a single box. Microsoft's Pond paper, which you can find [here](https://dl.acm.org/doi/pdf/10.1145/3575693.3578835), suggests a way to simulate CXL devices on 2-socket server systems based on two key characteristics of CXL-connected DRAM. The first characteristic is a latency of 150 ns, and the second is that no local CPU can directly access this CPU-less node. Essentially, they establish two virtual nodes, each aligned with a physical node: one has CPUs, while the other is CPU-less. The memory associated with the CPU-less node is designated as CXL memory. According to Pond, the latency introduced by this setup closely matches what CXL promises. You can find instructions on how to set up a Pond virtual machine with this design [here](https://github.com/vtess/Pond).
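
Once such a CPU-less node exists, a program can treat its memory as "CXL memory" simply by binding allocations to that node. Below is a minimal Rust sketch that does this through the raw `mbind` system call via `libc::syscall` (the C wrapper for `mbind` normally lives in libnuma rather than glibc). The node id `1` and the 64 MiB size are assumptions for illustration only.

```rust
use libc::{c_ulong, MAP_ANONYMOUS, MAP_PRIVATE, PROT_READ, PROT_WRITE, SYS_mbind};

// Linux memory-policy mode (see mbind(2)); defined locally for clarity.
const MPOL_BIND: c_ulong = 2;

/// Anonymously map `len` bytes and bind them to a single NUMA node.
/// Returns a pointer to the mapping, or None on failure.
unsafe fn alloc_on_node(len: usize, node: usize) -> Option<*mut u8> {
    let addr = libc::mmap(
        std::ptr::null_mut(),
        len,
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS,
        -1,
        0,
    );
    if addr == libc::MAP_FAILED {
        return None;
    }
    // Node mask with only `node` set; maxnode is the number of bits in the mask.
    let nodemask: c_ulong = 1 << node;
    let maxnode = (8 * std::mem::size_of::<c_ulong>()) as c_ulong;
    let ret = libc::syscall(
        SYS_mbind,
        addr,
        len as c_ulong,
        MPOL_BIND,
        &nodemask as *const c_ulong,
        maxnode,
        0 as c_ulong,
    );
    if ret != 0 {
        libc::munmap(addr, len);
        return None;
    }
    Some(addr as *mut u8)
}

fn main() {
    // Node 1 is assumed to be the CPU-less "CXL" node in the Pond-style setup.
    let size = 64 * 1024 * 1024; // 64 MiB
    match unsafe { alloc_on_node(size, 1) } {
        Some(p) => {
            // Touch the first byte so at least one page is actually faulted in on node 1.
            unsafe { p.write(42u8) };
            println!("bound {} bytes to node 1 at {:p}", size, p);
        }
        None => eprintln!("mbind failed; is node 1 present and does it have memory?"),
    }
}
```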

# Lightweight Emulation

There are also lighter-weight methods that let you approximate how a program would perform if it were running on CXL-based memory. [CXLMemSim](https://github.com/SlugLab/CXLMemSim) is a quick and efficient CXL.mem simulator designed for performance analysis. It utilizes a performance model based on performance monitoring events, which are widely supported by common processors. Additionally, it can simulate complex system topologies.
--------------------------------------------------------------------------------
/docs/CXL_Introduction.md:
--------------------------------------------------------------------------------
*This introduction is a summary of the reference paper, mixed with my own understanding of CXL.*

# Background

The Compute Express Link (CXL) is an open industry standard that defines a family of interconnect protocols between CPUs and devices. As a general device interconnect, CXL takes a broad definition of devices, including GPUs, GP-GPUs, FPGAs, as well as a wide range of purpose-built accelerators and storage devices. Traditionally, these devices use the PCIe serial interface. CXL also targets memory, which is traditionally connected to the CPU through the DDR parallel interface.

While PCIe and DDR have been great interfaces for a wide range of devices, they also come with some inherent limitations. These limitations lead to the following challenges that motivated the development and deployment of CXL.

## Challenge 1: coherent access to system and device memory.

**Summary: PCIe-attached devices (GPUs, SmartNICs, etc.) cannot coherently access host memory, and the host cannot coherently access the memory of PCIe-attached devices; there is no cache coherence between them.**

System memory is conventionally attached via DDR and cacheable by the CPU cache hierarchy. In contrast, accesses from PCIe devices to system memory happen through non-coherent reads/writes. A PCIe device cannot cache system memory to exploit temporal or spatial locality or to perform atomic sequences of operations. Similarly, memory attached to a PCIe device is accessed non-coherently from the host, with each access handled by the PCIe device. Non-coherent accesses work well for streaming I/O operations (such as network access or storage access). For accelerators (GPUs, SmartNICs, etc.), entire data structures are moved from system memory to the accelerator for specific functions before being moved back to main memory, and software mechanisms are deployed to prevent simultaneous accesses by CPUs and accelerator(s). This challenge also arises in the evolving area of processing-in-memory (PIM), which seeks to move computation close to data. There is currently no standardized approach for PIM devices to coherently access data that may be present in the CPU cache hierarchy.


## Challenge 2: memory scalability.

**Summary: due to the pin-inefficiency of DDR, the bandwidth of DDR-attached memory scales poorly, and DDR only supports DRAM-based memory.**

Demand for memory capacity and bandwidth increases proportionately to the exponential growth of compute. Unfortunately, DDR memory has not been keeping up with this demand. This limits **memory bandwidth** per CPU. A key reason for this mismatch in scaling is the pin-inefficiency of the parallel DDR interface. In principle, PCIe pins would be a great alternative due to their superior memory bandwidth per pin. For example, a x16 Gen5 PCIe port at 32 GT/s offers 256 GB/s with 64 signal pins, while DDR5-6400 offers 50 GB/s with ~200 signal pins. Unfortunately, PCIe does not support coherency. Thus, PCIe has not been able to replace DDR.

Another scaling challenge is that DRAM memory cost per bit has recently stayed flat. While there are multiple promising media types, including Managed DRAM, ReRAM, and 3DXP/Optane, the DDR standard relies on DRAM-specific commands for access and maintenance, which hinders adoption of new media types.


## Challenge 3: memory and compute inefficiency due to stranding.

**Summary: the memory stranding problem in data centers leads to low memory utilization.**


Today's data centers are inefficient due to stranded resources. A resource, such as memory, is stranded when idle capacity remains while another resource, such as compute, is fully used. The underlying cause is tight resource coupling, where compute, memory, and I/O devices belong to only one server. As a result, each server needs to be overprovisioned with memory and accelerators to handle workloads with peak capacity demands. For example, a server that hosts an application that needs more memory (or accelerators) than available cannot borrow memory (or accelerators) from another underutilized server in the same rack and must suffer the performance consequences of page misses. On the other hand, servers where all cores are used by workloads often have memory remaining unused.


## Challenge 4: fine-grained data sharing in distributed systems.

**Summary: we need a high-performance protocol that can handle fine-grained synchronization in distributed systems.**

Distributed systems frequently rely on fine-grained synchronization. The underlying updates are often small and latency sensitive. For example, distributed databases rely on kB-scale pages, and distributed consensus involves even smaller updates. Sharing data at such fine granularity means that the communication delay in typical datacenter networks dominates the wait time for updates and slows down these important use cases. For example, transmitting 4kB at 50GB/s takes under 2us, but communication delays exceed 10us on current networks. A coherent shared memory implementation can help cut communication delays down to sub-microsecond levels.


# CXL Specification

CXL has been developed to address these four and other challenges. CXL has evolved through three generations. Each generation specifies the interconnect and multiple protocols while remaining fully backward compatible. The CXL 1.0 (1.1) specification adds coherency and memory semantics on top of PCIe. This addresses Challenge 1 (coherency) and Challenge 2 (memory scaling). CXL 2.0 additionally addresses Challenge 3 (resource stranding) by enabling resource pooling across multiple hosts. CXL 3.0 addresses Challenge 3 on a larger scale with multiple levels of CXL switching. Furthermore, CXL 3.0 addresses Challenge 4 (distributed data sharing) by enabling fine-grained memory sharing across host boundaries.

## CXL 1.1

There are three protocols and also three types of devices.

### Protocols

CXL is implemented using three protocols, CXL.io, CXL.cache, and CXL.memory (aka CXL.mem), which are dynamically multiplexed on the PCIe PHY. Their functions are as follows:

1. CXL.io: device discovery, configuration, initialization, I/O virtualization, and DMA using non-coherent load-store semantics (like PCIe).
2. CXL.mem: allows the host to access device memory (HDM: Host-managed Device Memory).
    1. Memory expander: HDM-H (H: host-only coherence)
    2. Accelerator memory: HDM-D (D: device-managed coherence)
3. CXL.cache: enables a device to coherently cache host memory.


### Devices

Type 1 devices are accelerators such as SmartNICs that use coherency semantics along with PCIe-style DMA transfers. Thus, they implement only the CXL.io and CXL.cache protocols.

Type 2 devices are accelerators such as GP-GPUs and FPGAs with local memory that can be mapped in part to the cacheable system memory (CXL.mem: host accesses device memory). These devices also cache system memory for processing (CXL.cache: device accesses host memory). Thus, they implement the CXL.io, CXL.cache, and CXL.mem protocols.

Type 3 devices are used for memory bandwidth and capacity expansion and can be used to connect to different memory types, including supporting multiple memory tiers attached to the device. Thus, Type 3 devices implement only the CXL.io and CXL.mem protocols.

## CXL 2.0

CXL 2.0 enables resource pooling, which allows assigning the same resources to different hosts over time. The ability to reassign resources at run time solves resource stranding (Challenge 3), as it overcomes the tight coupling of resources to individual hosts. If one host runs a compute-intensive workload and does not use the device memory assigned from the pool, operators can reassign this device memory to another host, which might run a memory-intensive workload. The same pooling construct is applicable to other resources like accelerators.

**Caution: in CXL 2.0, a memory region can only be assigned to one host; it cannot be shared among multiple hosts.**

CXL 2.0 adds **Hot-Plug**, **Single Level Switching**, Quality-of-Service (QoS) for Memory, **Memory Pooling**, **Device Pooling**, and Global Persistent Flush (GPF).

Hot-Plug was not allowed in CXL 1.1, which precludes adding CXL resources after platform boot. CXL 2.0 enables standard PCIe hot-plug mechanisms, enabling traditional physical hot-plug and dynamic resource pooling.

## CXL 3.0

As noted above, CXL 3.0 scales pooling further with multiple levels of CXL switching and enables fine-grained memory sharing across host boundaries (Challenge 4). (Details TBD.)

# Reference

[An Introduction to the Compute Express Link (CXL) Interconnect](https://arxiv.org/ftp/arxiv/papers/2306/2306.11227.pdf)
--------------------------------------------------------------------------------
/docs/CXL_Usage.md:
--------------------------------------------------------------------------------

![CXL-usage](https://github.com/Tom-CaoZH/notes-pictures/blob/main/img/CXL-usage.png)

After conducting a thorough investigation, we have determined that the CXL software ecosystem is compatible with established PMem concepts and libraries. As depicted in the figure above, its usage closely resembles that of PMem. In general, there are three ways to utilize CXL-based memory, with two options for volatile memory and one for non-volatile memory:

1. For volatile memory:
    1. the character device (e.g. /dev/dax0.0), which can be memory-mapped directly (see the sketch below);
    2. the character device can be converted into a headless (CPU-less) NUMA node, in which case it can be used as ordinary NUMA node memory.
2. For non-volatile memory:
    1. it can be used by a PMem-enabled file system.
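
As a concrete illustration of option 1.1, the sketch below memory-maps a device-DAX character device from Rust via the `libc` crate. The device path, mapping size, and alignment are assumptions; device-DAX typically requires `MAP_SHARED` and mapping sizes that are multiples of the device's configured alignment (often 2 MiB).

```rust
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // Assumed device-DAX node backed by CXL memory; adjust to your system.
    let path = "/dev/dax0.0";
    // Assumed mapping size; device-DAX usually requires a multiple of the
    // device alignment (often 2 MiB).
    let len: usize = 2 * 1024 * 1024;

    let file = OpenOptions::new().read(true).write(true).open(path)?;

    let addr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED, // device-DAX mappings must be shared
            file.as_raw_fd(),
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(std::io::Error::last_os_error());
    }

    // The mapping is now plain load/store-accessible memory on the CXL device.
    let bytes = unsafe { std::slice::from_raw_parts_mut(addr as *mut u8, len) };
    bytes[0] = 0xAB;
    println!("first byte of the CXL mapping: {:#x}", bytes[0]);

    unsafe { libc::munmap(addr, len) };
    Ok(())
}
```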

There are also several libraries that can directly make use of CXL-based memory, such as [memkind](https://pmem.io/memkind/).

Reference:

[EXPLORING THE SOFTWARE ECOSYSTEM FOR COMPUTE EXPRESS LINK (CXL) MEMORY](https://pmem.io/blog/2023/05/exploring-the-software-ecosystem-for-compute-express-link-cxl-memory/)
--------------------------------------------------------------------------------
/docs/CXL_related_works.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Tom-CaoZH/CXL-101/ec0d4bde4405acb843603c5b2ec346a10cb488db/docs/CXL_related_works.md
--------------------------------------------------------------------------------
/docs/Evaluations.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Tom-CaoZH/CXL-101/ec0d4bde4405acb843603c5b2ec346a10cb488db/docs/Evaluations.md
--------------------------------------------------------------------------------
/docs/Xalloc.md:
--------------------------------------------------------------------------------
*Still under construction.*

This lib is used to allocate normal DRAM-based memory and CXL-based memory in Rust.

Generally, for normal DRAM-based memory, we add a wrapper on top of [jemalloc](https://github.com/tikv/jemallocator). For CXL-based memory, because it can be exposed as CPU-less NUMA-node memory, we support allocating from a specific NUMA node.

Repo is [here](https://github.com/Tom-CaoZH/xalloc).
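
As a rough illustration of the DRAM side of this design, the sketch below installs jemalloc as Rust's global allocator using the `tikv-jemallocator` crate linked above; everything allocated through the standard library then goes through jemalloc, and a CXL-aware allocator can be layered on top in the same way. This is only a sketch of the idea (crate version assumed), not the actual xalloc implementation.

```rust
// Cargo.toml (assumed): tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Route all Rust heap allocations through jemalloc. A CXL-aware allocator can
// wrap this type and redirect selected allocations to the CPU-less NUMA node.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // These allocations are now served by jemalloc.
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} u64s via jemalloc", v.len());
}
```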
--------------------------------------------------------------------------------
/docs/k8s_with_cxl.md:
--------------------------------------------------------------------------------
To enable Kubernetes with CXL-based memory, we need to verify two things:

1. whether the CXL-based memory is visible inside the container just as it is on the host;
2. whether the container can control CXL-based memory well.


## CXL-based Memory Visibility

Kubernetes relies on the kubelet to control the containers on a machine, and the kubelet talks to a container runtime through the CRI (Container Runtime Interface). A common CRI runtime is containerd, which is also the runtime underneath Docker. So in order to verify CXL-based memory visibility, we can verify Docker's. To do so, we only need to check whether the CXL-based memory of the host can be seen inside a Docker container. After setting up the simulation environment and installing Docker on the simulated machine, we find that it is indeed visible: the CXL-based memory appears inside Docker just as it does on the host.

## CXL-based Memory Control

Kubernetes relies on cgroups to control resources such as CPU and memory. So in order to enable Kubernetes to control CXL-based memory, one method is to make cgroups aware of CXL-based memory. For more details, you can refer to the information in the references.
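
One building block that already exists today is the cgroup v2 `cpuset` controller: when the CXL memory shows up as a CPU-less NUMA node, writing that node list to a container's `cpuset.mems` constrains which NUMA nodes its allocations may come from. The sketch below is a minimal Rust illustration of that existing mechanism, not the per-cgroup disaggregation design described in the references; the cgroup path and node ids are assumptions, and the cpuset controller must be enabled for that cgroup.

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Assumed cgroup v2 path of the target container's cgroup.
    let cgroup = "/sys/fs/cgroup/kubepods.slice/demo-container";
    // Assumed node ids: node 0 is regular DRAM, node 1 is the CPU-less CXL node.
    let allowed_nodes = "0-1";

    // Restrict (or extend) the NUMA nodes this cgroup may allocate memory from.
    // Requires the cpuset controller to be enabled in the parent's
    // cgroup.subtree_control and sufficient privileges.
    fs::write(format!("{cgroup}/cpuset.mems"), allowed_nodes)?;

    // Read back the effective setting to confirm it took effect.
    let effective = fs::read_to_string(format!("{cgroup}/cpuset.mems.effective"))?;
    println!("cpuset.mems.effective = {}", effective.trim());
    Ok(())
}
```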

References:

[Container Memory Interface](https://www.youtube.com/watch?v=ZArmCVN4uF0)

[Design of per cgroup memory disaggregation](https://asplos.dev/wordpress/2023/07/03/design-of-per-cgroup-memory-disaggregation/) from [Yiwei Yang](https://asplos.dev/)
--------------------------------------------------------------------------------