├── AUTHORS ├── COPYING ├── Makefile ├── README ├── README.md ├── pmbd.c └── pmbd.h /AUTHORS: -------------------------------------------------------------------------------- 1 | 2013 - Released the open-source version 0.9 (fchen) 2 | 2012 - Ported to Linux 3.2.1 (fchen) 3 | 2011 - Feng Chen (Intel) implemented version 1 of PMBD for Linux 2.6.34. 4 | -------------------------------------------------------------------------------- /COPYING: -------------------------------------------------------------------------------- 1 | /* 2 | * Intel Persistent Memory Block Driver 3 | * Copyright (c) <2011-2013>, Intel Corporation. 4 | * 5 | * This program is free software; you can redistribute it and/or modify it 6 | * under the terms and conditions of the GNU General Public License, 7 | * version 2, as published by the Free Software Foundation. 8 | * 9 | * This program is distributed in the hope it will be useful, but WITHOUT 10 | * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 12 | * more details. 13 | * 14 | * You should have received a copy of the GNU General Public License along with 15 | * this program; if not, write to the Free Software Foundation, Inc., 16 | * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. 17 | */ 18 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | NAME=pmbd 2 | obj-m = $(NAME).o 3 | KVERSION = $(shell uname -r) 4 | 5 | all: 6 | make -C /lib/modules/$(KVERSION)/build M=$(PWD) modules 7 | clean: 8 | make -C /lib/modules/$(KVERSION)/build M=$(PWD) clean 9 | 10 | install: 11 | if [ -f $(NAME).ko ]; then \ 12 | if ! cp $(NAME).ko /lib/modules/$(KVERSION)/kernel/drivers/block; \ 13 | then exit 1; fi; \ 14 | fi; 15 | /sbin/depmod -a 16 | 17 | uninstall: 18 | if [ -f /lib/modules/$(KVERSION)/kernel/drivers/block/$(NAME).ko ]; then \ 19 | if ! rm -f /lib/modules/$(KVERSION)/kernel/drivers/block/$(NAME).ko; \ 20 | then exit 1; fi; \ 21 | fi; 22 | /sbin/depmod -a 23 | 24 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | NOTE: PMBD was an Intel research project and is not currently being maintained. For current information and instructions related to persistent memory enabling in Linux, please refer to https://nvdimm.wiki.kernel.org/ 2 | 3 | =============================================================================== 4 | INTEL PERSISTENT MEMORY BLOCK DRIVER (PMBD) v0.9 5 | =============================================================================== 6 | 7 | This software implements a block device driver for persistent memory (PM). 8 | This module provides a block-based logical interface to manage PM that is 9 | physically attached to the system memory bus. 10 | 11 | The architecture is assumed as follows. Both DRAM and PM DIMMs are directly 12 | attached to the host memory bus. The PM space is presented to the operating 13 | system as a contiguous range of physical memory address space at the high end. 14 | 15 | There are three major design considerations: (1) Data protection - Private 16 | mapping is used to prevent stray pointers (caused by kernel/driver bugs) from 17 | accidentally wiping out persistent PM data.
(2) Data persistence - Non-temporal 18 | store and fence instructions are used to leverage the processor store buffer 19 | and avoid polluting the CPU cache. (3) Write ordering - Write barrier is 20 | supported to ensure correct order of writes. 21 | 22 | This module also includes other (experimental) features, such as PM speed 23 | emulation, checksum for page integrity, partial page updates, write 24 | verification, etc. Please refer to the help page of the module. 25 | 26 | 27 | =============================================================================== 28 | COMPILING AND INSTALLING THE PMBD DRIVER 29 | =============================================================================== 30 | 31 | 1. Compile the PMBD driver: 32 | 33 | $ make 34 | 35 | 2. Install the PMBD driver: 36 | 37 | $ sudo make install 38 | 39 | 3. Check available driver information: 40 | 41 | $ modinfo pmbd 42 | 43 | NOTE: This module runs with Linux kernel 3.2.1 or 2.6.34. For other kernel 44 | versions, please search for "KERNEL_VERSION" and change the code as needed. 45 | 46 | =============================================================================== 47 | QUICK USER'S GUIDE OF THE PMBD DRIVER 48 | =============================================================================== 49 | 50 | 1. Modify /etc/grub.conf to set the physical memory address range that 51 | is to be simulated as PM. 52 | 53 | Add the following to the boot option line: 54 | 55 | memmap=<PM_SIZE_GB>G$<DRAM_SIZE_GB>G numa=off 56 | 57 | NOTE: 58 | 59 | PM_SIZE_GB - the PM space size (in GBs) 60 | DRAM_SIZE_GB - the DRAM space size (in GBs) 61 | 62 | Example: 63 | 64 | Assuming a total memory capacity of 24GB, if we want to use 16GB as PM and 65 | 8GB as DRAM, the option should be "memmap=16G$8G". 66 | 67 | 2. Reboot and check if the memory size is set as expected. 68 | 69 | $ sudo reboot; exit 70 | $ free 71 | 72 | 3. Load the device driver module 73 | 74 | Load the driver module into the kernel with private mapping, non-temporal store, 75 | and write barrier enabled (*** RECOMMENDED CONFIG ***): 76 | 77 | $ modprobe pmbd mode="pmbd<PM_SIZE_GB>;hmo<DRAM_SIZE_GB>;hms<PM_SIZE_GB>; \ 78 | pmapY;ntsY;wbY;" 79 | 80 | Check the kernel message output: 81 | 82 | $ dmesg 83 | 84 | After loading the module, a block device (/dev/pma) should appear. From now 85 | on, it can be used like any other block device, with tools such as fdisk, mkfs, etc. 86 | 87 | 4. Unload the device driver 88 | 89 | $ rmmod pmbd 90 | 91 | =============================================================================== 92 | OTHER CONFIGURATION OPTIONS OF THE PERSISTENT MEMORY DEVICE DRIVER MODULE 93 | =============================================================================== 94 | 95 | usage: $ modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];..."
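For example, the recommended configuration for a 16GB PM space starting at the
8GB physical address (illustrative values only; they match the EXAMPLE section
further below) would be:

   $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;wbY;"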
96 | 97 | GENERAL OPTIONS: 98 | pmbd<#,#..> set pmbd size (GBs) 99 | HM|VM use high memory (HM default) or vmalloc (VM) 100 | hmo<#> high memory starting offset (GB) 101 | hms<#> high memory size (GBs) 102 | pmap<Y|N> use private mapping (Y) or not (N default) - (note: must 103 | enable HM and wrprotN) 104 | nts<Y|N> use non-temporal store (MOVNTQ) and sfence to do memcpy (Y), 105 | or regular memcpy (N default) 106 | wb<Y|N> use write barrier (Y) or not (N default) 107 | fua<Y|N> use WRITE_FUA (Y default) or not (N) (only effective for 108 | Linux 3.2.1) 109 | ntl<Y|N> use non-temporal load (MOVNTDQA) to do memcpy (Y), or 110 | regular memcpy (N default) - this option enforces the memory type 111 | of write combining 112 | 113 | 114 | SIMULATION: 115 | simmode<#,#..> apply the simulated specification to the whole device (0 default) or 116 | to the PM space only (1) 117 | rdlat<#,#..> set read access latency (ns) 118 | wrlat<#,#..> set write access latency (ns) 119 | rdbw<#,#..> set read bandwidth (MB/sec) (if set to 0, no emulation) 120 | wrbw<#,#..> set write bandwidth (MB/sec) (if set to 0, no emulation) 121 | rdsx<#,#..> set the relative slowdown (x) for read 122 | wrsx<#,#..> set the relative slowdown (x) for write 123 | rdpause<#,.> set a pause (cycles per 4KB) for each read 124 | wrpause<#,.> set a pause (cycles per 4KB) for each write 125 | adj<#> set an adjustment to the system overhead (nanoseconds) 126 | 127 | WRITE PROTECTION: 128 | wrprot<Y|N> use write protection for PM pages? (Y or N) 129 | wpmode<#,#,..> write protection mode: use PTE changes (0 default) or flip the 130 | CR0/WP bit (1) 131 | clflush<Y|N> use clflush to flush the CPU cache for each write to PM space? 132 | (Y or N) 133 | wrverify<Y|N> use write verification for PM pages? (Y or N) 134 | checksum<Y|N> use checksum to protect PM pages? (Y or N) 135 | bufsize<#,#,..> the buffer size (MBs) (0 - no buffer, at least 4MB) 136 | bufnum<#> the number of buffers for a PMBD device (16 buffers by default, at least 1 137 | if using a buffer, 0 - no buffer) 138 | bufstride<#> the number of contiguous blocks (4KB) mapped into one buffer 139 | (bucket size for round-robin mapping) (1024 by default) 140 | batch<#,#> the batch size (num of pages) for flushing the PMBD buffer (1 means 141 | no batching) 142 | 143 | MISC: 144 | mgb<Y|N> mergeable? (Y or N) 145 | lock<Y|N> lock the on-access page to serialize accesses? (Y or N) 146 | cache<WB|WC|UC> use which CPU cache policy? Write back (WB), Write Combined 147 | (WC), or Uncacheable (UC) 148 | subupdate<Y|N> only update the changed cachelines of a page? (Y or N) (check 149 | PMBD_CACHELINE_SIZE) 150 | timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat)? 151 | This will cause significant performance slowdown (Y or N) 152 | 153 | NOTE: 154 | (1) Options rdlat/wrlat only specify the minimum access times. Real access 155 | times can be higher. 156 | (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw will be ignored. 157 | (3) Option simmode1 applies the simulated specification to the PM space, 158 | rather than to the whole device, which may include a buffer. 159 | 160 | WARNING: 161 | (1) When using simmode1 to simulate a slow PM space, a soft lockup warning 162 | may appear. Use the "nosoftlockup" boot option to disable it. 163 | (2) Enabling timestat may cause performance degradation. 164 | (3) FUA is supported in Linux 3.2.1, but if a buffer is used (for PT-based 165 | protection), enabling FUA lowers performance due to double writes. 166 | (4) No support for changing CPU-cache-related PTE attributes for VM-based PMBD 167 | in Linux 3.2.1 (RCU stalls).
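For instance, a write-protected, buffered configuration using the options above
might look like the following (values are illustrative only, not a tested
recommendation):

   $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;wrprotY;clflushY;bufsize64;bufnum16;batch64;"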
168 | 169 | PROC ENTRIES: 170 | /proc/pmbd/pmbdcfg: config info about the PMBD devices 171 | /proc/pmbd/pmbdstat: statistics of the PMBD devices (if timestat is enabled) 172 | 173 | EXAMPLE: 174 | Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB: 175 | (1) Basic (Ramdisk): 176 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;" 177 | 178 | (2) Protected (with private mapping): 179 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;" 180 | 181 | (3) Protected and synced (with private mapping, non-temp store): 182 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;" 183 | 184 | (4) *** RECOMMENDED CONFIGURATION *** 185 | Protected, synced, and ordered (with private mapping, nt-store, write 186 | barrier): 187 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;wbY;" 188 | 189 | 190 | 191 | 192 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | #### NOTE: PMBD was an Intel research project and is not currently being maintained. For current information and instructions related to persistent memory enabling in Linux, please refer to https://nvdimm.wiki.kernel.org/ #### 2 | 3 | =============================================================================== 4 | INTEL PERSISTENT MEMORY BLOCK DRIVER (PMBD) v0.9 5 | =============================================================================== 6 | 7 | This software implements a block device driver for persistent memory (PM). 8 | This module provides a block-based logical interface to manage PM that is 9 | physically attached to the system memory bus. 10 | 11 | The architecture is assumed as follows. Both DRAM and PM DIMMs are directly 12 | attached to the host memory bus. The PM space is presented to the operating 13 | system as a contiguous range of physical memory address space at the high end. 14 | 15 | There are three major design considerations: (1) Data protection - Private 16 | mapping is used to prevent stray pointers (caused by kernel/driver bugs) from 17 | accidentally wiping out persistent PM data. (2) Data persistence - Non-temporal 18 | store and fence instructions are used to leverage the processor store buffer 19 | and avoid polluting the CPU cache. (3) Write ordering - Write barrier is 20 | supported to ensure correct order of writes. 21 | 22 | This module also includes other (experimental) features, such as PM speed 23 | emulation, checksum for page integrity, partial page updates, write 24 | verification, etc. Please refer to the help page of the module. 25 | 26 | 27 | =============================================================================== 28 | COMPILING AND INSTALLING THE PMBD DRIVER 29 | =============================================================================== 30 | 31 | 1. Compile the PMBD driver: 32 | 33 | $ make 34 | 35 | 2. Install the PMBD driver: 36 | 37 | $ sudo make install 38 | 39 | 3. Check available driver information: 40 | 41 | $ modinfo pmbd 42 | 43 | NOTE: This module runs with Linux kernel 3.2.1 or 2.6.34. For other kernel 44 | versions, please search for "KERNEL_VERSION" and change the code as needed. 45 | 46 | =============================================================================== 47 | QUICK USER'S GUIDE OF THE PMBD DRIVER 48 | =============================================================================== 49 | 50 | 1. Modify /etc/grub.conf to set the physical memory address range that 51 | is to be simulated as PM.
52 | 53 | Add the following to the boot option line: 54 | 55 | memmap=<PM_SIZE_GB>G$<DRAM_SIZE_GB>G numa=off 56 | 57 | NOTE: 58 | 59 | PM_SIZE_GB - the PM space size (in GBs) 60 | DRAM_SIZE_GB - the DRAM space size (in GBs) 61 | 62 | Example: 63 | 64 | Assuming a total memory capacity of 24GB, if we want to use 16GB as PM and 65 | 8GB as DRAM, the option should be "memmap=16G$8G". 66 | 67 | 2. Reboot and check if the memory size is set as expected. 68 | 69 | $ sudo reboot; exit 70 | $ free 71 | 72 | 3. Load the device driver module 73 | 74 | Load the driver module into the kernel with private mapping, non-temporal store, 75 | and write barrier enabled (*** RECOMMENDED CONFIG ***): 76 | 77 | $ modprobe pmbd mode="pmbd<PM_SIZE_GB>;hmo<DRAM_SIZE_GB>;hms<PM_SIZE_GB>; \ 78 | pmapY;ntsY;wbY;" 79 | 80 | Check the kernel message output: 81 | 82 | $ dmesg 83 | 84 | After loading the module, a block device (/dev/pma) should appear. From now 85 | on, it can be used like any other block device, with tools such as fdisk, mkfs, etc. 86 | 87 | 4. Unload the device driver 88 | 89 | $ rmmod pmbd 90 | 91 | =============================================================================== 92 | OTHER CONFIGURATION OPTIONS OF THE PERSISTENT MEMORY DEVICE DRIVER MODULE 93 | =============================================================================== 94 | 95 | usage: $ modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];..." 96 | 97 | GENERAL OPTIONS: 98 | pmbd<#,#..> set pmbd size (GBs) 99 | HM|VM use high memory (HM default) or vmalloc (VM) 100 | hmo<#> high memory starting offset (GB) 101 | hms<#> high memory size (GBs) 102 | pmap<Y|N> use private mapping (Y) or not (N default) - (note: must 103 | enable HM and wrprotN) 104 | nts<Y|N> use non-temporal store (MOVNTQ) and sfence to do memcpy (Y), 105 | or regular memcpy (N default) 106 | wb<Y|N> use write barrier (Y) or not (N default) 107 | fua<Y|N> use WRITE_FUA (Y default) or not (N) (only effective for 108 | Linux 3.2.1) 109 | ntl<Y|N> use non-temporal load (MOVNTDQA) to do memcpy (Y), or 110 | regular memcpy (N default) - this option enforces the memory type 111 | of write combining 112 | 113 | 114 | SIMULATION: 115 | simmode<#,#..> apply the simulated specification to the whole device (0 default) or 116 | to the PM space only (1) 117 | rdlat<#,#..> set read access latency (ns) 118 | wrlat<#,#..> set write access latency (ns) 119 | rdbw<#,#..> set read bandwidth (MB/sec) (if set to 0, no emulation) 120 | wrbw<#,#..> set write bandwidth (MB/sec) (if set to 0, no emulation) 121 | rdsx<#,#..> set the relative slowdown (x) for read 122 | wrsx<#,#..> set the relative slowdown (x) for write 123 | rdpause<#,.> set a pause (cycles per 4KB) for each read 124 | wrpause<#,.> set a pause (cycles per 4KB) for each write 125 | adj<#> set an adjustment to the system overhead (nanoseconds) 126 | 127 | WRITE PROTECTION: 128 | wrprot<Y|N> use write protection for PM pages? (Y or N) 129 | wpmode<#,#,..> write protection mode: use PTE changes (0 default) or flip the 130 | CR0/WP bit (1) 131 | clflush<Y|N> use clflush to flush the CPU cache for each write to PM space? 132 | (Y or N) 133 | wrverify<Y|N> use write verification for PM pages? (Y or N) 134 | checksum<Y|N> use checksum to protect PM pages?
(Y or N) 135 | bufsize<#,#,..> the buffer size (MBs) (0 - no buffer, at least 4MB) 136 | bufnum<#> the number of buffers for a PMBD device (16 buffers by default, at least 1 137 | if using a buffer, 0 - no buffer) 138 | bufstride<#> the number of contiguous blocks (4KB) mapped into one buffer 139 | (bucket size for round-robin mapping) (1024 by default) 140 | batch<#,#> the batch size (num of pages) for flushing the PMBD buffer (1 means 141 | no batching) 142 | 143 | MISC: 144 | mgb<Y|N> mergeable? (Y or N) 145 | lock<Y|N> lock the on-access page to serialize accesses? (Y or N) 146 | cache<WB|WC|UC> use which CPU cache policy? Write back (WB), Write Combined 147 | (WC), or Uncacheable (UC) 148 | subupdate<Y|N> only update the changed cachelines of a page? (Y or N) (check 149 | PMBD_CACHELINE_SIZE) 150 | timestat<Y|N> enable the detailed timing statistics (/proc/pmbd/pmbdstat)? 151 | This will cause significant performance slowdown (Y or N) 152 | 153 | NOTE: 154 | (1) Options rdlat/wrlat only specify the minimum access times. Real access 155 | times can be higher. 156 | (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw will be ignored. 157 | (3) Option simmode1 applies the simulated specification to the PM space, 158 | rather than to the whole device, which may include a buffer. 159 | 160 | WARNING: 161 | (1) When using simmode1 to simulate a slow PM space, a soft lockup warning 162 | may appear. Use the "nosoftlockup" boot option to disable it. 163 | (2) Enabling timestat may cause performance degradation. 164 | (3) FUA is supported in Linux 3.2.1, but if a buffer is used (for PT-based 165 | protection), enabling FUA lowers performance due to double writes. 166 | (4) No support for changing CPU-cache-related PTE attributes for VM-based PMBD 167 | in Linux 3.2.1 (RCU stalls). 168 | 169 | PROC ENTRIES: 170 | /proc/pmbd/pmbdcfg: config info about the PMBD devices 171 | /proc/pmbd/pmbdstat: statistics of the PMBD devices (if timestat is enabled) 172 | 173 | EXAMPLE: 174 | Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB: 175 | (1) Basic (Ramdisk): 176 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;" 177 | 178 | (2) Protected (with private mapping): 179 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;" 180 | 181 | (3) Protected and synced (with private mapping, non-temp store): 182 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;" 183 | 184 | (4) *** RECOMMENDED CONFIGURATION *** 185 | Protected, synced, and ordered (with private mapping, nt-store, write 186 | barrier): 187 | $ sudo modprobe pmbd mode="pmbd16;hmo8;hms16;pmapY;ntsY;wbY;" 188 | 189 | 190 | 191 | 192 | -------------------------------------------------------------------------------- /pmbd.c: -------------------------------------------------------------------------------- 1 | /* 2 | * Intel Persistent Memory Block Driver 3 | * Copyright (c) <2011-2013>, Intel Corporation. 4 | * 5 | * This program is free software; you can redistribute it and/or modify it 6 | * under the terms and conditions of the GNU General Public License, 7 | * version 2, as published by the Free Software Foundation. 8 | * 9 | * This program is distributed in the hope it will be useful, but WITHOUT 10 | * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 12 | * more details.
13 | * 14 | * You should have received a copy of the GNU General Public License along with 15 | * this program; if not, write to the Free Software Foundation, Inc., 16 | * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. 17 | */ 18 | 19 | /* 20 | * Intel Persistent Memory Block Driver (v0.9) 21 | * 22 | * Parts derived with changes from drivers/block/brd.c, lib/crc32.c, and 23 | * arch/x86/lib/mmx_32.c 24 | * 25 | * Intel Corporation 26 | * 03/24/2011 27 | */ 28 | 29 | 30 | /* 31 | ******************************************************************************* 32 | * Persistent Memory Block Device Driver 33 | * 34 | * USAGE: 35 | * % sudo modprobe pmbd mode="pmbd<#>;hmo<#>;hms<#>;[OPTION1];[OPTION2];..." 36 | * 37 | * GENERAL OPTIONS: 38 | * - pmbd<#,..>: a sequence of integer numbers setting PMBD device sizes (in 39 | * units of GBs). For example, mode="pmbd4,1" means creating a 40 | * 4GB and a 1GB PMBD device (/dev/pma and /dev/pmb). 41 | * 42 | * - HM|VM: choose one of two types of PMBD devices 43 | * - VM: vmalloc() based 44 | * - HM: HIGH_MEM based (default) 45 | * - In /boot/grub/grub.conf, add "mem=<n>G memmap=<m>G$<n>G" 46 | * to reserve the high m GBs for PM, starting from offset n 47 | * GBs in physical memory 48 | * 49 | * - hmo<#>: if HM is set, sets the starting physical mem address 50 | * (in units of GBs). 51 | * 52 | * - hms<#>: if HM is set, sets the remapping memory size (in GBs) 53 | * 54 | * - pmap: set private mapping (Y) or not (N default). It uses 55 | * pmap_atomic_pfn() to dynamically map/unmap the 56 | * to-be-accessed PM page for protection purposes. 57 | * This option must work with HM enabled. In the Linux boot 58 | * options, the "mem" option must be removed. 59 | * 60 | * - nts: set non-temporal store/sfence (Y) or not (N default). 61 | * 62 | * - wb: use write barrier (Y) or not (N default) 63 | * 64 | * - fua: use WRITE_FUA (Y default) or not (N) (only effective for 65 | * Linux 3.2.1) 66 | * 67 | * SIMULATION OPTIONS: 68 | * 69 | * - simmode<#,#..> set the simulation mode for each PMBD device 70 | * - 0 for simulating the whole device 71 | * - 1 for simulating the PM space only 72 | * Note that simulating the PM space may cause system soft 73 | * lockup warnings. To disable them, add "nosoftlockup" 74 | * to the boot options. 75 | * 76 | * - rdlat<#,#..>: a sequence of integer numbers setting emulated read 77 | * latencies (in units of nanoseconds) for reading each 78 | * sector. Each number corresponds to a device. Default 79 | * value is 0. 80 | * 81 | * - wrlat<#,#..>: set emulated write access latencies (see rdlat) 82 | * 83 | * - rdbw<#,#..>: a sequence of integer numbers setting emulated read 84 | * bandwidth (in units of MB/sec) for reading each sector. 85 | * Each number corresponds to a device. Default value is 0. 86 | * 87 | * - wrbw<#,#..>: set emulated write bandwidth (see rdbw) 88 | * 89 | * - rdsx<#,#..>: set the slowdown ratio (x) for reads as compared to DRAM 90 | * 91 | * - wrsx<#,#..>: set the slowdown ratio (x) for writes as compared to DRAM 92 | * 93 | * - rdpause<#,#..>: set the injected delay (cycles per page) for read (not 94 | * for emulation, just inject latencies for each read per page) 95 | * 96 | * - wrpause<#,#..>: set the injected delay (cycles per page) for write (not for 97 | * emulation, just inject latencies for each write per page). 98 | * 99 | * - adj<#>: offset the emulated latency by an estimated system overhead. 100 | * The default is 4us; however, this could vary from system to system.
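 *
 * For instance, an illustrative simulation-only setup (all values are
 * arbitrary, shown only to demonstrate the syntax of the options above):
 * emulate a single 4GB device backed by PM reserved at the 8GB physical
 * offset, with 200ns/1000ns read/write latencies and 1000/800 MB/sec
 * read/write bandwidth caps, applied to the PM space only:
 *
 * % sudo modprobe pmbd mode="pmbd4;hmo8;hms4;rdlat200;wrlat1000;rdbw1000;wrbw800;simmode1;"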
101 | * 102 | * WRITE PROTECTION: 103 | * 104 | * - wrprot: provide write protection on PM space by setting page 105 | * read-only (default: N). This option is incompatible with pmap. 106 | * 107 | * - wpmode<#,#,..> write protection mode: use the PTE change (0 default) or 108 | * switch CR0/WP bit (1) 109 | * 110 | * - wrverify: read out the data for verification after writing into PM 111 | * space 112 | * 113 | * - clflush: flush CPU cache or not (default: N) 114 | * 115 | * - checksum: use checksum to provide further protection from data 116 | * corruption (default: N) 117 | * 118 | * - lock: lock the on-access PM page to serialize accesses (default: Y) 119 | * 120 | * - bufsize<#,#,#.#...> -- the buffer size in MBs (for speeding up write 121 | * protection) 0 means no buffer, minimum size is 16 MBs 122 | * 123 | * - bufnum<#> the number of buffers for a pmbd device (16 buffers, at 124 | * least 1 if using buffering, 0 will disable buffer mode) 125 | * 126 | * - bufstride<#> the number of contiguous blocks(4KB) mapped into one 127 | * buffer (the bucket size for round-robin mapping) (1024 in default) 128 | * 129 | * - batch<#,#> the batch size (num of pages) for flushing PMBD buffer (1 130 | * means no batching) 131 | * 132 | * MISC OPTIONS: 133 | * 134 | * - subupdate only update changed cachelines of a page (check 135 | * PMBD_CACHELINE_SIZE, default: N) 136 | * 137 | * - mgb: setting mergeable or not (default: Y) 138 | * 139 | * - cache: 140 | * WB -- write back (both read/write cache used) 141 | * WC -- write combined (write through but cachable) 142 | * UM -- uncachable but write back 143 | * UC -- write through and uncachable 144 | * 145 | * - timestat enable the detailed timing statistics (/proc/pmbd/pmbdstat) or 146 | * not (default: N). This will cause significant performance loss. 147 | * 148 | * EXAMPLE: 149 | * mode="pmbd2,1;rdlat100,2000;wrlat500,4000;rdbw100,100;wrbw100,100;HM;hmo4;hms3; 150 | * mgbY;flushY;cacheWB;wrprotY;wrverifyY;checksumY;lockY;rammode0,1;bufsize16,0; 151 | * subupdateY;" 152 | * 153 | * Explanation: Create two PMBD devices, /dev/pma (2GB) and /dev/pmb (1GB). 154 | * Insert 100ns and 500ns for reading and writing a sector to /dev/pma, 155 | * respectively. Insert 2000ns and 4000ns for reading and writing a sector 156 | * to /dev/pmb. Make the read/write bandwidth for both devices 100MB/sec. 157 | * No system overhead adjustment is applied. We use 3GB high memory for the 158 | * PMBD devices, starting from 4GB physical memory address. Make it 159 | * mergeable, use writeback and flush CPU cache for the PM space, use write 160 | * protection for PM space by setting PM space read-only, verify each 161 | * write by reading out written data, use checksum to protect PM space, use 162 | * spinlock to protect from corruption caused by concurrent accesses, the 163 | * first device is applied without write protection, the second device is 164 | * applied with write protection, and use sub-page updates. 165 | * 166 | * NOTE: 167 | * - We can create no more than 26 devices, 4 partitions each. 
168 | * 169 | * FIXME: 170 | * (1) We use an unoccupied major device num (261) temporarily 171 | ******************************************************************************* 172 | */ 173 | 174 | #include 175 | #include 176 | #include 177 | #include 178 | #include 179 | #include 180 | #include 181 | #include 182 | #include 183 | #include 184 | #include 185 | #include 186 | #include 187 | #include 188 | #include 189 | #include 190 | #include 191 | #include 192 | #include 193 | #include 194 | #include 195 | #include 196 | #include 197 | #include "pmbd.h" 198 | 199 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 200 | #include 201 | #endif 202 | 203 | 204 | /* device configs */ 205 | static int max_part = 4; /* maximum num of partitions */ 206 | static int part_shift = 0; /* partition shift */ 207 | static LIST_HEAD(pmbd_devices); /* device list */ 208 | static DEFINE_MUTEX(pmbd_devices_mutex); /* device mutex */ 209 | 210 | /* /proc file system entry */ 211 | static struct proc_dir_entry* proc_pmbd = NULL; 212 | static struct proc_dir_entry* proc_pmbdstat = NULL; 213 | static struct proc_dir_entry* proc_pmbdcfg = NULL; 214 | 215 | /* pmbd device default configuration */ 216 | static unsigned g_pmbd_type = PMBD_CONFIG_HIGHMEM; /* vmalloc(PMBD_CONFIG_VMALLOC) or reserve highmem (PMBD_CONFIG_HIGHMEM default) */ 217 | static unsigned g_pmbd_pmap = FALSE; /* use pmap_atomic() to map/unmap space on demand */ 218 | static unsigned g_pmbd_nts = FALSE; /* use non-temporal store (movntq) */ 219 | static unsigned g_pmbd_wb = FALSE; /* use write barrier */ 220 | static unsigned g_pmbd_fua = TRUE; /* use fua support (Linux 3.2.1) */ 221 | static unsigned g_pmbd_mergeable = TRUE; /* mergeable or not */ 222 | static unsigned g_pmbd_cpu_cache_clflush= FALSE; /* flush CPU cache or not*/ 223 | static unsigned g_pmbd_wr_protect = FALSE; /* flip PTE R/W bits for write protection */ 224 | static unsigned g_pmbd_wr_verify = FALSE; /* read out written data for verification */ 225 | static unsigned g_pmbd_checksum = FALSE; /* do checksum on PM data */ 226 | static unsigned g_pmbd_lock = TRUE; /* do spinlock on accessing a PM page */ 227 | static unsigned g_pmbd_subpage_update = FALSE; /* do subpage update (only write changed content) */ 228 | static unsigned g_pmbd_timestat = FALSE; /* do a detailed timestamp breakdown statistics */ 229 | static unsigned g_pmbd_ntl = FALSE; /* use non-temporal load (movntdqa)*/ 230 | static unsigned long g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB; /* CPU cache flag (default - write back) */ 231 | 232 | /* high memory configs */ 233 | static unsigned long g_highmem_size = 0; /* size of the reserved physical mem space (bytes) */ 234 | static phys_addr_t g_highmem_phys_addr = 0; /* beginning of the reserved phy mem space (bytes)*/ 235 | static void* g_highmem_virt_addr = NULL; /* beginning of the reserve HIGH_MEM space */ 236 | static void* g_highmem_curr_addr = NULL; /* beginning of the available HIGH_MEM space for alloc*/ 237 | 238 | /* module parameters */ 239 | static unsigned g_pmbd_nr = 0; /* num of PMBD devices */ 240 | static unsigned long long g_pmbd_size[PMBD_MAX_NUM_DEVICES]; /* PMBD device sizes in units of GBs */ 241 | static unsigned long long g_pmbd_rdlat[PMBD_MAX_NUM_DEVICES]; /* access latency for read (nanosecs) */ 242 | static unsigned long long g_pmbd_wrlat[PMBD_MAX_NUM_DEVICES]; /* access latency for write nanosecs) */ 243 | static unsigned long long g_pmbd_rdbw[PMBD_MAX_NUM_DEVICES]; /* bandwidth for read (MB/sec) */ 244 | static unsigned long long 
g_pmbd_wrbw[PMBD_MAX_NUM_DEVICES]; /* bandwidth for write (MB/sec)*/ 245 | static unsigned long long g_pmbd_rdsx[PMBD_MAX_NUM_DEVICES]; /* read slowdown (x) */ 246 | static unsigned long long g_pmbd_wrsx[PMBD_MAX_NUM_DEVICES]; /* write slowdown (x)*/ 247 | static unsigned long long g_pmbd_rdpause[PMBD_MAX_NUM_DEVICES]; /* read pause (cycles per page) */ 248 | static unsigned long long g_pmbd_wrpause[PMBD_MAX_NUM_DEVICES]; /* write pause (cycles per page)*/ 249 | static unsigned long long g_pmbd_simmode[PMBD_MAX_NUM_DEVICES]; /* simulating PM space (1) or the whole device (0 default) */ 250 | static unsigned long long g_pmbd_adjust_ns = 0; /* nanosec of adjustment to offset system overhead */ 251 | static unsigned long long g_pmbd_rammode[PMBD_MAX_NUM_DEVICES]; /* do write optimization or not */ 252 | static unsigned long long g_pmbd_bufsize[PMBD_MAX_NUM_DEVICES]; /* the buffer size (in MBs) */ 253 | static unsigned long long g_pmbd_buffer_batch_size[PMBD_MAX_NUM_DEVICES]; /* the batch size (num of pages) for flushing PMBD buffer */ 254 | static unsigned long long g_pmbd_wpmode[PMBD_MAX_NUM_DEVICES]; /* write protection mode: PTE change (0 default) and CR0 Switch (1)*/ 255 | 256 | static unsigned long long g_pmbd_num_buffers = 0; /* number of individual buffers */ 257 | static unsigned long long g_pmbd_buffer_stride = 1024; /* number of contiguous PBNs belonging to the same buffer */ 258 | 259 | /* definition of functions */ 260 | static inline uint64_t cycle_to_ns(uint64_t cycle); 261 | static inline void sync_slowdown_cycles(uint64_t cycles); 262 | static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw); 263 | static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start); 264 | 265 | /* 266 | * ************************************************************************* 267 | * parse module parameters functions 268 | * ************************************************************************* 269 | */ 270 | static char *mode = ""; 271 | module_param(mode, charp, 444); 272 | MODULE_PARM_DESC(mode, USAGE_INFO); 273 | 274 | /* print pmbd configuration info */ 275 | static void pmbd_print_conf(void) 276 | { 277 | int i; 278 | #ifndef CONFIG_X86 279 | printk(KERN_INFO "pmbd: running on a non-x86 platform, check ioremap()...\n"); 280 | #endif 281 | printk(KERN_INFO "pmbd: cacheline_size=%d\n", PMBD_CACHELINE_SIZE); 282 | printk(KERN_INFO "pmbd: PMBD_SECTOR_SIZE=%lu, PMBD_PAGE_SIZE=%lu\n", PMBD_SECTOR_SIZE, PMBD_PAGE_SIZE); 283 | printk(KERN_INFO "pmbd: g_pmbd_type = %s\n", PMBD_USE_VMALLOC()? "VMALLOC" : "HIGH_MEM"); 284 | printk(KERN_INFO "pmbd: g_pmbd_mergeable = %s\n", PMBD_IS_MERGEABLE()? "YES" : "NO"); 285 | printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_clflush = %s\n", PMBD_USE_CLFLUSH()? "YES" : "NO"); 286 | printk(KERN_INFO "pmbd: g_pmbd_cpu_cache_flag = %s\n", PMBD_CPU_CACHE_FLAG()); 287 | printk(KERN_INFO "pmbd: g_pmbd_wr_protect = %s\n", PMBD_USE_WRITE_PROTECTION()? "YES" : "NO"); 288 | printk(KERN_INFO "pmbd: g_pmbd_wr_verify = %s\n", PMBD_USE_WRITE_VERIFICATION()? "YES" : "NO"); 289 | printk(KERN_INFO "pmbd: g_pmbd_checksum = %s\n", PMBD_USE_CHECKSUM()? "YES" : "NO"); 290 | printk(KERN_INFO "pmbd: g_pmbd_lock = %s\n", PMBD_USE_LOCK()? "YES" : "NO"); 291 | printk(KERN_INFO "pmbd: g_pmbd_subpage_update = %s\n", PMBD_USE_SUBPAGE_UPDATE()? 
"YES" : "NO"); 292 | printk(KERN_INFO "pmbd: g_pmbd_adjust_ns = %llu ns\n", g_pmbd_adjust_ns); 293 | printk(KERN_INFO "pmbd: g_pmbd_num_buffers = %llu\n", g_pmbd_num_buffers); 294 | printk(KERN_INFO "pmbd: g_pmbd_buffer_stride = %llu blocks\n", g_pmbd_buffer_stride); 295 | printk(KERN_INFO "pmbd: g_pmbd_timestat = %u \n", g_pmbd_timestat); 296 | printk(KERN_INFO "pmbd: HIGHMEM offset [%llu] size [%lu] Private Mapping (%s) (%s) (%s) Write Barrier(%s) FUA(%s)\n", 297 | g_highmem_phys_addr, g_highmem_size, (PMBD_USE_PMAP()? "Enabled" : "Disabled"), 298 | (PMBD_USE_NTS()? "Non-Temporal Store":"Temporal Store"), 299 | (PMBD_USE_NTL()? "Non-Temporal Load":"Temporal Load"), 300 | (PMBD_USE_WB()? "Enabled": "Disabled"), 301 | (PMBD_USE_FUA()? "Enabled":"Disabled")); 302 | 303 | /* for each pmbd device */ 304 | for (i = 0; i < g_pmbd_nr; i ++) { 305 | printk(KERN_INFO "pmbd: /dev/pm%c (%d)[%llu GB] read[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] write[%llu ns %llu MB/sec (%llux) (pause %llu cyc/pg)] [%s] [Buf: %llu MBs, batch %llu pages] [%s] [%s]\n", 306 | 'a'+i, i, g_pmbd_size[i], g_pmbd_rdlat[i], g_pmbd_rdbw[i], g_pmbd_rdsx[i], g_pmbd_rdpause[i], g_pmbd_wrlat[i], g_pmbd_wrbw[i], g_pmbd_wrsx[i], g_pmbd_wrpause[i],\ 307 | (g_pmbd_rammode[i] ? "RAM" : "PMBD"), g_pmbd_bufsize[i], g_pmbd_buffer_batch_size[i], \ 308 | (g_pmbd_simmode[i] ? "Simulating PM only" : "Simulating the whole device"), \ 309 | (PMBD_USE_PMAP() ? "PMAP" : (g_pmbd_wpmode[i] ? "WP-CR0/WP" : "WP-PTE"))); 310 | 311 | if (g_pmbd_simmode[i] > 0){ 312 | printk(KERN_INFO "pmbd: ********************************* WARNING **************************************\n"); 313 | printk(KERN_INFO "pmbd: Using simmode%llu to simulate a slowed-down PM space may cause system soft lockup.\n", g_pmbd_simmode[i]); 314 | printk(KERN_INFO "pmbd: To disable the warning message, please add \"nosoftlockup\" in the boot option. \n"); 315 | printk(KERN_INFO "pmbd: ********************************************************************************\n"); 316 | } 317 | } 318 | 319 | printk(KERN_INFO "pmbd: ****************************** WARNING ***********************************\n"); 320 | printk(KERN_INFO "pmbd: 1. Checksum mismatch can be detected but not handled \n"); 321 | printk(KERN_INFO "pmbd: 2. PMAP is incompatible with \"wrprotY\"\n"); 322 | printk(KERN_INFO "pmbd: **************************************************************************\n"); 323 | 324 | return; 325 | } 326 | 327 | /* 328 | * Parse a string with config for multiple devices (e.g. mode="pmbd4,1,3;") 329 | * @mode: input option string 330 | * @tag: the tag being looked for (e.g. 
pmbd) 331 | * @data: output in an array 332 | */ 333 | static int _pmbd_parse_multi(char* mode, char* tag, unsigned long long data[]) 334 | { 335 | int nr = 0; 336 | if (strlen(mode)) { 337 | char* head = mode; 338 | char* tail = mode; 339 | char* end = mode + strlen(mode); 340 | char tmp[128]; 341 | 342 | if ((head = strstr(mode, tag))) { 343 | head = head + strlen(tag); 344 | tail = head; 345 | while(head < end){ 346 | int len = 0; 347 | 348 | /* locate the position of the first non-number char */ 349 | for(tail = head; IS_DIGIT(*tail) && tail < end; tail++) {}; 350 | 351 | /* pick up the numbers */ 352 | len = tail - head; 353 | if(len > 0) { 354 | nr ++; 355 | if (nr > PMBD_MAX_NUM_DEVICES) { 356 | printk(KERN_ERR "pmbd: %s(%d) - too many (%d) device config for %s\n", 357 | __FUNCTION__, __LINE__, nr, tag); 358 | return -1; 359 | } 360 | strncpy(tmp, head, len); tmp[len] = '\0'; 361 | data[nr - 1] = simple_strtoull(tmp, NULL, 0); 362 | } 363 | 364 | /* check the next sequence of numbers */ 365 | for(; !IS_DIGIT(*tail) && tail < end; tail++) { 366 | /* if we meet the first alpha char or space, clause ends */ 367 | if(IS_ALPHA(*tail) || IS_SPACE(*tail)) 368 | goto done; 369 | }; 370 | 371 | /* move head to the next sequence of numbers */ 372 | head = tail; 373 | } 374 | } 375 | } 376 | done: 377 | return nr; 378 | } 379 | 380 | /* 381 | * Parse a string with config for all devices (e.g. mode="adj1000") 382 | * @mode: input option string 383 | * @tag: the tag being looked for (e.g. pmbd) 384 | * @data: output 385 | */ 386 | static int _pmbd_parse_single(char* mode, char* tag, unsigned long long* data) 387 | { 388 | if (strlen(mode)) { 389 | char* head = mode; 390 | char* tail = mode; 391 | char tmp[128]; 392 | 393 | if (strstr(mode, tag)) { 394 | head = strstr(mode, tag) + strlen(tag); 395 | for(tail=head; IS_DIGIT(*tail); tail++) {}; 396 | if(tail == head) { 397 | return -1; 398 | } else { 399 | int len = tail - head; 400 | strncpy(tmp, head, len); tmp[len] = '\0'; 401 | *data = simple_strtoull(tmp, NULL, 0); 402 | } 403 | } 404 | } 405 | return 0; 406 | } 407 | 408 | static void load_default_conf(void) 409 | { 410 | int i = 0; 411 | for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) 412 | g_pmbd_buffer_batch_size[i] = PMBD_BUFFER_BATCH_SIZE_DEFAULT; 413 | } 414 | 415 | /* parse the module parameters (mode) */ 416 | static void pmbd_parse_conf(void) 417 | { 418 | int i = 0; 419 | static unsigned enforce_cache_wc = FALSE; 420 | 421 | load_default_conf(); 422 | 423 | if (strlen(mode)) { 424 | unsigned long long data = 0; 425 | 426 | /* check pmbd size/usable */ 427 | if (strstr(mode, "pmbd")) { 428 | if( (g_pmbd_nr = _pmbd_parse_multi(mode, "pmbd", g_pmbd_size)) <= 0) 429 | goto fail; 430 | } else { 431 | printk(KERN_ERR "pmbd: no pmbd size set\n"); 432 | goto fail; 433 | } 434 | 435 | /* rdlat/wrlat (emulated read/write latency) in nanosec */ 436 | if (strstr(mode, "rdlat")) 437 | if (_pmbd_parse_multi(mode, "rdlat", g_pmbd_rdlat) < 0) 438 | goto fail; 439 | if (strstr(mode, "wrlat")) 440 | if (_pmbd_parse_multi(mode, "wrlat", g_pmbd_wrlat) < 0) 441 | goto fail; 442 | 443 | /* rdbw/wrbw (emulated read/write bandwidth) in MB/sec*/ 444 | if (strstr(mode, "rdbw")) 445 | if (_pmbd_parse_multi(mode, "rdbw", g_pmbd_rdbw) < 0) 446 | goto fail; 447 | if (strstr(mode, "wrbw")) 448 | if (_pmbd_parse_multi(mode, "wrbw", g_pmbd_wrbw) < 0) 449 | goto fail; 450 | 451 | /* rdsx/wrsx (emulated read/write slowdown X) */ 452 | if (strstr(mode, "rdsx")) 453 | if (_pmbd_parse_multi(mode, "rdsx", g_pmbd_rdsx) < 0) 
454 | goto fail; 455 | if (strstr(mode, "wrsx")) 456 | if (_pmbd_parse_multi(mode, "wrsx", g_pmbd_wrsx) < 0) 457 | goto fail; 458 | 459 | /* rdsx/wrsx (emulated read/write slowdown X) */ 460 | if (strstr(mode, "rdpause")) 461 | if (_pmbd_parse_multi(mode, "rdpause", g_pmbd_rdpause) < 0) 462 | goto fail; 463 | if (strstr(mode, "wrpause")) 464 | if (_pmbd_parse_multi(mode, "wrpause", g_pmbd_wrpause) < 0) 465 | goto fail; 466 | 467 | /* do write optimization */ 468 | if (strstr(mode, "rammode")){ 469 | printk(KERN_ERR "pmbd: rammode removed\n"); 470 | goto fail; 471 | if (_pmbd_parse_multi(mode, "rammode", g_pmbd_rammode) < 0) 472 | goto fail; 473 | } 474 | 475 | if (strstr(mode, "bufsize")){ 476 | if (_pmbd_parse_multi(mode, "bufsize", g_pmbd_bufsize) < 0) 477 | goto fail; 478 | for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) { 479 | if (g_pmbd_bufsize[i] > 0 && g_pmbd_bufsize[i] < PMBD_BUFFER_MIN_BUFSIZE){ 480 | printk(KERN_ERR "pmbd: bufsize cannot be smaller than %d MBs. Setting 0 to disable PMBD buffer.\n", PMBD_BUFFER_MIN_BUFSIZE); 481 | goto fail; 482 | } 483 | } 484 | } 485 | 486 | /* numbuf and bufstride*/ 487 | if (strstr(mode, "bufnum")) { 488 | if(_pmbd_parse_single(mode, "bufnum", &data) < 0) { 489 | printk(KERN_ERR "pmbd: incorrect bufnum (must be at least 1)\n"); 490 | goto fail; 491 | } else { 492 | g_pmbd_num_buffers = data; 493 | } 494 | } 495 | if (strstr(mode, "bufstride")) { 496 | if(_pmbd_parse_single(mode, "bufstride", &data) < 0) { 497 | printk(KERN_ERR "pmbd: incorrect bufstride (must be at least 1)\n"); 498 | goto fail; 499 | } else { 500 | g_pmbd_buffer_stride = data; 501 | } 502 | } 503 | 504 | /* check the nanoseconds of overhead to compensate */ 505 | if (strstr(mode, "adj")) { 506 | if(_pmbd_parse_single(mode, "adj", &data) < 0) { 507 | printk(KERN_ERR "pmbd: incorrect adj\n"); 508 | goto fail; 509 | } else { 510 | g_pmbd_adjust_ns = data; 511 | } 512 | } 513 | 514 | /* check PMBD device type */ 515 | if ((strstr(mode, "VM"))) { 516 | g_pmbd_type = PMBD_CONFIG_VMALLOC; 517 | } else if ((strstr(mode, "HM"))) { 518 | g_pmbd_type = PMBD_CONFIG_HIGHMEM; 519 | } 520 | 521 | /* use pmap*/ 522 | if ((strstr(mode, "pmapY"))) { 523 | g_pmbd_pmap = TRUE; 524 | } else if ((strstr(mode, "pmapN"))) { 525 | g_pmbd_pmap = FALSE; 526 | } 527 | if ((strstr(mode, "PMAP"))){ 528 | printk("WARNING: !!! pmbd: PMAP is not supported any more (use pmapY) !!!\n"); 529 | goto fail; 530 | } 531 | 532 | /* use nts*/ 533 | if ((strstr(mode, "ntsY"))) { 534 | g_pmbd_nts = TRUE; 535 | } else if ((strstr(mode, "ntsN"))) { 536 | g_pmbd_nts = FALSE; 537 | } 538 | if ((strstr(mode, "NTS"))){ 539 | printk("WARNING: !!! 
pmbd: NTS is not supported any more (use ntsY) !!!\n"); 540 | goto fail; 541 | } 542 | 543 | /* use ntl*/ 544 | if ((strstr(mode, "ntlY"))) { 545 | g_pmbd_ntl = TRUE; 546 | enforce_cache_wc = TRUE; 547 | } else if ((strstr(mode, "ntlN"))) { 548 | g_pmbd_ntl = FALSE; 549 | } 550 | 551 | /* timestat */ 552 | if ((strstr(mode, "timestatY"))) { 553 | g_pmbd_timestat = TRUE; 554 | } else if ((strstr(mode, "timestatN"))) { 555 | g_pmbd_timestat = FALSE; 556 | } 557 | 558 | 559 | /* write barrier */ 560 | if ((strstr(mode, "wbY"))) { 561 | g_pmbd_wb = TRUE; 562 | } else if ((strstr(mode, "wbN"))) { 563 | g_pmbd_wb = FALSE; 564 | } 565 | 566 | /* write barrier */ 567 | if ((strstr(mode, "fuaY"))) { 568 | g_pmbd_fua = TRUE; 569 | } else if ((strstr(mode, "fuaN"))) { 570 | g_pmbd_fua = FALSE; 571 | } 572 | 573 | 574 | /* check if HIGH_MEM PMBD is configured */ 575 | if (PMBD_USE_HIGHMEM()) { 576 | if (strstr(mode, "hmo") && strstr(mode, "hms")) { 577 | /* parse reserved HIGH_MEM offset */ 578 | if(_pmbd_parse_single(mode, "hmo", &data) < 0){ 579 | printk(KERN_ERR "pmbd: incorrect hmo\n"); 580 | g_highmem_phys_addr = 0; 581 | goto fail; 582 | } else { 583 | g_highmem_phys_addr = data * 1024 * 1024 * 1024; 584 | } 585 | 586 | /* parse reserved HIGH_MEM size */ 587 | if(_pmbd_parse_single(mode, "hms", &data) < 0 || data == 0){ 588 | printk(KERN_ERR "pmbd: incorrect hms\n"); 589 | g_highmem_size = 0; 590 | goto fail; 591 | } else { 592 | g_highmem_size = data * 1024 * 1024 * 1024; 593 | } 594 | } else { 595 | printk(KERN_ERR "pmbd: hmo or hms not set ***\n"); 596 | goto fail; 597 | } 598 | 599 | 600 | } 601 | 602 | 603 | /* check if mergeable */ 604 | if((strstr(mode,"mgbY"))) 605 | g_pmbd_mergeable = TRUE; 606 | else if((strstr(mode,"mgbN"))) 607 | g_pmbd_mergeable = FALSE; 608 | 609 | /* CPU cache flushing */ 610 | if((strstr(mode,"clflushY"))) 611 | g_pmbd_cpu_cache_clflush = TRUE; 612 | else if((strstr(mode,"clflushN"))) 613 | g_pmbd_cpu_cache_clflush = FALSE; 614 | 615 | /* CPU cache setting */ 616 | if((strstr(mode,"cacheWB"))) /* cache write back */ 617 | g_pmbd_cpu_cache_flag = _PAGE_CACHE_WB; 618 | else if((strstr(mode,"cacheWC"))) /* cache write combined (through) */ 619 | g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC; 620 | else if((strstr(mode,"cacheUM"))) /* cache cachable but write back */ 621 | g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC_MINUS; 622 | else if((strstr(mode,"cacheUC"))) /* cache uncablable */ 623 | g_pmbd_cpu_cache_flag = _PAGE_CACHE_UC; 624 | 625 | 626 | /* write protectable */ 627 | if((strstr(mode,"wrprotY"))) 628 | g_pmbd_wr_protect = TRUE; 629 | else if((strstr(mode,"wrprotN"))) 630 | g_pmbd_wr_protect = FALSE; 631 | 632 | /* write protectable */ 633 | if((strstr(mode,"wrverifyY"))) 634 | g_pmbd_wr_verify = TRUE; 635 | else if((strstr(mode,"wrverifyN"))) 636 | g_pmbd_wr_verify = FALSE; 637 | 638 | /* checksum */ 639 | if((strstr(mode,"checksumY"))) 640 | g_pmbd_checksum = TRUE; 641 | else if((strstr(mode,"checksumN"))) 642 | g_pmbd_checksum = FALSE; 643 | 644 | /* checksum */ 645 | if((strstr(mode,"lockY"))) 646 | g_pmbd_lock = TRUE; 647 | else if((strstr(mode,"lockN"))) 648 | g_pmbd_lock = FALSE; 649 | 650 | /* write protectable */ 651 | if((strstr(mode,"subupdateY"))) 652 | g_pmbd_subpage_update = TRUE; 653 | else if((strstr(mode,"subupdateN"))) 654 | g_pmbd_subpage_update = FALSE; 655 | 656 | 657 | /* batch */ 658 | if (strstr(mode, "batch")){ 659 | if (_pmbd_parse_multi(mode, "batch", g_pmbd_buffer_batch_size) < 0) 660 | goto fail; 661 | /* check if any batch size is set too 
small */ 662 | for (i = 0; i < PMBD_MAX_NUM_DEVICES; i ++) { 663 | if (g_pmbd_buffer_batch_size[i] < 1){ 664 | printk(KERN_ERR "pmbd: buffer batch size cannot be smaller than 1 page (default: 1024 pages)\n"); 665 | goto fail; 666 | } 667 | } 668 | } 669 | 670 | /* simmode */ 671 | if (strstr(mode, "simmode")){ 672 | if (_pmbd_parse_multi(mode, "simmode", g_pmbd_simmode) < 0) 673 | goto fail; 674 | } 675 | 676 | /* wpmode */ 677 | if (strstr(mode, "wpmode")){ 678 | if (_pmbd_parse_multi(mode, "wpmode", g_pmbd_wpmode) < 0) 679 | goto fail; 680 | } 681 | 682 | } else { 683 | goto fail; 684 | } 685 | 686 | /* apply some enforced configuration */ 687 | if (enforce_cache_wc) /* if ntl is used, we must use WC */ 688 | g_pmbd_cpu_cache_flag = _PAGE_CACHE_WC; 689 | 690 | /* Done, print input options */ 691 | pmbd_print_conf(); 692 | return; 693 | 694 | fail: 695 | printk(KERN_ERR "pmbd: wrong mode config! Check modinfo\n\n"); 696 | g_pmbd_nr = 0; 697 | return; 698 | } 699 | 700 | /* 701 | * ***************************************************************** 702 | * simple emulation API functions 703 | * pmbd_rdwr_pause - pause read/write for a specified cycles/page 704 | * pmbd_rdwr_slowdown - slowdown read/write proportionally to DRAM 705 | * *****************************************************************/ 706 | 707 | /* handle rdpause and wrpause options*/ 708 | static void pmbd_rdwr_pause(PMBD_DEVICE_T* pmbd, size_t bytes, unsigned rw) 709 | { 710 | uint64_t cycles = 0; 711 | uint64_t time_p1, time_p2; 712 | 713 | /* sanity check */ 714 | if (pmbd->rdpause == 0 && pmbd->wrpause == 0) 715 | return; 716 | 717 | /* start */ 718 | TIMESTAT_POINT(time_p1); 719 | 720 | /* calculate the cycles to pause */ 721 | if (rw == READ && pmbd->rdpause){ 722 | cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->rdpause), pmbd->rdpause); 723 | } else if (rw == WRITE && pmbd->wrpause){ 724 | cycles = MAX_OF((BYTE_TO_PAGE(bytes) * pmbd->wrpause), pmbd->wrpause); 725 | } 726 | 727 | /* slow down now */ 728 | if (cycles) 729 | sync_slowdown_cycles(cycles); 730 | 731 | TIMESTAT_POINT(time_p2); 732 | 733 | if(PMBD_USE_TIMESTAT()){ 734 | int cid = CUR_CPU_ID(); 735 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 736 | pmbd_stat->cycles_pause[rw][cid] += time_p2 - time_p1; 737 | } 738 | 739 | return; 740 | } 741 | 742 | 743 | /* handle rdsx and wrsx options */ 744 | static void pmbd_rdwr_slowdown(PMBD_DEVICE_T* pmbd, int rw, uint64_t start, uint64_t end) 745 | { 746 | uint64_t cycles = 0; 747 | uint64_t time_p1, time_p2; 748 | 749 | /* sanity check */ 750 | if ( !((rw == READ && pmbd->rdsx > 1) || (rw == WRITE && pmbd->wrsx > 1))) 751 | return; 752 | 753 | if (end < start){ 754 | printk(KERN_WARNING "pmbd: %s(%d) end (%llu) is earlier than start (%llu)\n", \ 755 | __FUNCTION__, __LINE__, (unsigned long long) start, (unsigned long long)end); 756 | return; 757 | } 758 | 759 | /* start */ 760 | TIMESTAT_POINT(time_p1); 761 | 762 | /*FIXME: should we allow to do async slowdown? */ 763 | cycles = (end-start)*((rw == READ) ? (pmbd->rdsx - 1) : (pmbd->wrsx -1)); 764 | 765 | /*FIXME: should we minus a slack here (80-100cycles)? 
*/ 766 | if (cycles) 767 | sync_slowdown_cycles(cycles); 768 | 769 | TIMESTAT_POINT(time_p2); 770 | 771 | /* updating statistics */ 772 | if(PMBD_USE_TIMESTAT()){ 773 | int cid = CUR_CPU_ID(); 774 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 775 | pmbd_stat->cycles_slowdown[rw][cid] += time_p2 - time_p1; 776 | } 777 | 778 | return; 779 | } 780 | 781 | 782 | /* 783 | * set page's cache flags 784 | * @vaddr: start virtual address 785 | * @num_pages: the range size 786 | */ 787 | static void set_pages_cache_flags(unsigned long vaddr, int num_pages) 788 | { 789 | switch (g_pmbd_cpu_cache_flag) { 790 | case _PAGE_CACHE_WB: 791 | printk(KERN_INFO "pmbd: set PM pages cache flags (WB)\n"); 792 | set_memory_wb(vaddr, num_pages); 793 | break; 794 | case _PAGE_CACHE_WC: 795 | printk(KERN_INFO "pmbd: set PM pages cache flags (WC)\n"); 796 | set_memory_wc(vaddr, num_pages); 797 | break; 798 | case _PAGE_CACHE_UC: 799 | printk(KERN_INFO "pmbd: set PM pages cache flags (UC)\n"); 800 | set_memory_uc(vaddr, num_pages); 801 | break; 802 | case _PAGE_CACHE_UC_MINUS: 803 | printk(KERN_INFO "pmbd: set PM pages cache flags (UM)\n"); 804 | set_memory_uc(vaddr, num_pages); 805 | break; 806 | default: 807 | set_memory_wb(vaddr, num_pages); 808 | printk(KERN_WARNING "pmbd: PM page attribute is not set - use WB\n"); 809 | break; 810 | } 811 | return; 812 | } 813 | 814 | 815 | /* 816 | * ************************************************************************* 817 | * PMAP - Private mapping interface APIs 818 | * ************************************************************************* 819 | * 820 | * The private mapping is for providing write protection -- only when we need 821 | * to access the PM page, we map it into the kernel virtual memory space, once 822 | * we finish using it, we unmap it, so the spatial and temporal window left for 823 | * bug attack is really small. 824 | * 825 | * Notes: pmap works similar to kmap_atomic*. It does the following: 826 | * (1) pmap_create(): allocate 128 pages with vmalloc, these 128 pte mapping is 827 | * saved to a backup place, and then be cleared to prevent accidental accesses. 828 | * Each page is assigned correspondingly to the CPU ID where the calling thread 829 | * is running on. So we support at most 128 CPU IDs. 830 | * (2) pmap_atomic_pfn(): map the specified pfn into the entry, whose index is 831 | * the ID of the CPU on which the current thread is running. The pfn is loaded 832 | * into the corresponding pte entry and the corresponding TLB entry is flushed 833 | * (3) punmap_atomic(): the specified pte entry is cleared, and the TLB entry 834 | * is flushed 835 | * (4) pmap_destroy(): the saved pte mapping of the 128 pages are restored, and 836 | * vfree() is called to release the 128 pages allocated through vmalloc(). 
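 *
 * Illustrative call sequence (a sketch based on the functions defined below;
 * the actual read/write paths also add locking, speed emulation, checksum
 * handling, and timing statistics around these calls):
 *
 *     void* va = pmap_atomic_pfn(pfn, pmbd, WRITE);   map the target PM page on this CPU
 *     memcpy(va + page_offset, src, bytes);           access it through the short-lived window
 *                                                     (nts_memcpy() is used instead when ntsY is set)
 *     punmap_atomic(va, pmbd, WRITE);                 clear the PTE and flush the TLB entry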
837 | * 838 | */ 839 | 840 | #define PMAP_NR_PAGES (128) 841 | static unsigned int pmap_nr_pages = 0; /* the total number of available pages for private mapping */ 842 | static void* pmap_va_start = NULL; /* the first PMAP virtual address */ 843 | static pte_t* pmap_ptep[PMAP_NR_PAGES]; /* the array of PTE entries */ 844 | static unsigned long pmap_pfn[PMAP_NR_PAGES]; /* the array of page frame numbers for restoring */ 845 | static pgprot_t pmap_prot[PMAP_NR_PAGES]; /* the array of page protection fields */ 846 | #define PMAP_VA(IDX) (pmap_va_start + (IDX) * PAGE_SIZE) 847 | #define PMAP_IDX(VA) (((unsigned long)(VA) - (unsigned long)pmap_va_start) >> PAGE_SHIFT) 848 | 849 | static inline void pmap_flush_tlb_single(unsigned long addr) 850 | { 851 | asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); 852 | } 853 | 854 | static inline void* update_pmap_pfn(unsigned long pfn, unsigned int idx) 855 | { 856 | void* va = PMAP_VA(idx); 857 | pte_t* ptep = pmap_ptep[idx]; 858 | pte_t old_pte = *ptep; 859 | pte_t new_pte = pfn_pte(pfn, pmap_prot[idx]); 860 | 861 | if (pte_val(old_pte) == pte_val(new_pte)) 862 | return va; 863 | 864 | /* update the pte entry */ 865 | set_pte_atomic(ptep, new_pte); 866 | // set_pte(ptep, new_pte); 867 | 868 | /* flush one single tlb */ 869 | __flush_tlb_one((unsigned long) va); 870 | // pmap_flush_tlb_single((unsigned long) va); 871 | 872 | /* return the old one for bkup */ 873 | return va; 874 | } 875 | 876 | static inline void clear_pmap_pfn(unsigned idx) 877 | { 878 | if (idx < pmap_nr_pages){ 879 | 880 | void* va = PMAP_VA(idx); 881 | pte_t* ptep = pmap_ptep[idx]; 882 | 883 | /* clear the mapping */ 884 | pte_clear(NULL, (unsigned long) va, ptep); 885 | __flush_tlb_one((unsigned long) va); 886 | 887 | } else { 888 | panic("%s(%d) illegal pmap idx\n", __FUNCTION__, __LINE__); 889 | } 890 | } 891 | 892 | static int pmap_atomic_init(void) 893 | { 894 | unsigned int i; 895 | 896 | /* checking */ 897 | if (pmap_va_start) 898 | panic("%s(%d) something is wrong\n", __FUNCTION__, __LINE__); 899 | 900 | /* allocate an array of dummy pages as pmap virtual addresses */ 901 | pmap_va_start = vmalloc(PAGE_SIZE * PMAP_NR_PAGES); 902 | if (!pmap_va_start){ 903 | printk(KERN_ERR "pmbd:%s(%d) pmap_va_start cannot be initialized\n", __FUNCTION__, __LINE__); 904 | return -ENOMEM; 905 | } 906 | pmap_nr_pages = PMAP_NR_PAGES; 907 | 908 | /* set pages' cache flags, this flag would be saved into pmap_prot 909 | * and will be applied together with the dynamically mapped page too (01/12/2012)*/ 910 | set_pages_cache_flags((unsigned long)pmap_va_start, pmap_nr_pages); 911 | 912 | /* save the dummy pages' ptep, pfn, and prot info */ 913 | printk(KERN_INFO "pmbd: saving dummy pmap entries\n"); 914 | for (i = 0; i < pmap_nr_pages; i ++){ 915 | pte_t old_pte; 916 | unsigned int level; 917 | void* va = PMAP_VA(i); 918 | 919 | /* get the ptep */ 920 | pte_t* ptep = lookup_address((unsigned long)(va), &level); 921 | 922 | /* sanity check */ 923 | if (!ptep) 924 | panic("%s(%d) mapping not found\n", __FUNCTION__, __LINE__); 925 | 926 | old_pte = *ptep; 927 | if (!pte_val(old_pte)) 928 | panic("%s(%d) invalid pte value\n", __FUNCTION__, __LINE__); 929 | 930 | if (level != PG_LEVEL_4K) 931 | panic("%s(%d) not PG_LEVEL_4K \n", __FUNCTION__, __LINE__); 932 | 933 | /* save dummy entries */ 934 | pmap_ptep[i] = ptep; 935 | pmap_pfn[i] = pte_pfn(old_pte); 936 | pmap_prot[i] = pte_pgprot(old_pte); 937 | 938 | /* printk(KERN_INFO "%s(%d): saving dummy pmap entries: %u va=%p pfn=%lx\n", \ 939 | 
__FUNCTION__, __LINE__, i, va, pmap_pfn[i]); 940 | */ 941 | } 942 | 943 | /* clear the pte to make it illegal to access */ 944 | for (i = 0; i < pmap_nr_pages; i ++) 945 | clear_pmap_pfn(i); 946 | 947 | return 0; 948 | } 949 | 950 | static void pmap_atomic_done(void) 951 | { 952 | int i; 953 | 954 | /* restore the dummy pages' pte */ 955 | printk(KERN_INFO "pmbd: restoring dummy pmap entries\n"); 956 | for (i = 0; i < pmap_nr_pages; i ++){ 957 | /* void* va = PMAP_VA(i); 958 | printk(KERN_INFO "%s(%d): restoring dummy pmap entries: %d va=%p pfn=%lx\n", \ 959 | __FUNCTION__, __LINE__, i, va, pmap_pfn[i]); 960 | */ 961 | /* restore the old pfn */ 962 | update_pmap_pfn(pmap_pfn[i], i); 963 | pmap_ptep[i]= NULL; 964 | pmap_pfn[i] = 0; 965 | } 966 | 967 | /* free the dummy pages*/ 968 | if (pmap_va_start) 969 | vfree(pmap_va_start); 970 | else 971 | panic("%s(%d): freeing dummy pages failed\n", __FUNCTION__, __LINE__); 972 | 973 | pmap_va_start = NULL; 974 | pmap_nr_pages = 0; 975 | return; 976 | } 977 | 978 | static void* pmap_atomic_pfn(unsigned long pfn, PMBD_DEVICE_T* pmbd, unsigned rw) 979 | { 980 | void* va = NULL; 981 | unsigned int idx = CUR_CPU_ID(); 982 | uint64_t time_p1 = 0; 983 | uint64_t time_p2 = 0; 984 | 985 | TIMESTAMP(time_p1); 986 | 987 | /* disable page fault temporarily */ 988 | pagefault_disable(); 989 | 990 | /* change the mapping to the specified pfn*/ 991 | va = update_pmap_pfn(pfn, idx); 992 | 993 | TIMESTAMP(time_p2); 994 | 995 | /* update time statistics */ 996 | if(PMBD_USE_TIMESTAT()){ 997 | int cid = CUR_CPU_ID(); 998 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 999 | pmbd_stat->cycles_pmap[rw][cid] += time_p2 - time_p1; 1000 | } 1001 | 1002 | return va; 1003 | } 1004 | 1005 | static void punmap_atomic(void* va, PMBD_DEVICE_T* pmbd, unsigned rw) 1006 | { 1007 | unsigned int idx = PMAP_IDX(va); 1008 | uint64_t time_p1 = 0; 1009 | uint64_t time_p2 = 0; 1010 | 1011 | TIMESTAMP(time_p1); 1012 | 1013 | /* clear the mapping */ 1014 | clear_pmap_pfn(idx); 1015 | 1016 | /* re-enable the page fault */ 1017 | pagefault_enable(); 1018 | 1019 | TIMESTAMP(time_p2); 1020 | 1021 | /* update time statistics */ 1022 | if(PMBD_USE_TIMESTAT()){ 1023 | int cid = CUR_CPU_ID(); 1024 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 1025 | pmbd_stat->cycles_punmap[rw][cid] += time_p2 - time_p1; 1026 | } 1027 | 1028 | return; 1029 | } 1030 | 1031 | /* create the dummy pmap space */ 1032 | static int pmap_create(void) 1033 | { 1034 | pmap_atomic_init(); 1035 | return 0; 1036 | } 1037 | 1038 | /* destroy the dummy pmap space */ 1039 | static void pmap_destroy(void) 1040 | { 1041 | pmap_atomic_done(); 1042 | return; 1043 | } 1044 | 1045 | /* 1046 | * ************************************************************************* 1047 | * Non-temporal memcpy 1048 | * ************************************************************************* 1049 | * Non-temporal memcpy does the following: 1050 | * (1) use movntq to copy into PM space 1051 | * (2) use sfence to flush the data to memory controller 1052 | * 1053 | * Compared to regular temporal memcpy, it provides several benefits here: 1054 | * (1) writes to PM bypass the CPU cache, which avoids polluting CPU cache 1055 | * (2) reads from PM still benefit from the CPU cache 1056 | * (3) sfence used for each write guarantees data will be flushed out of buffer 1057 | */ 1058 | 1059 | static void nts_memcpy_64bytes_v2(void* to, void* from, size_t size) 1060 | { 1061 | int i; 1062 | unsigned bs = 64; /* write unit size 8 bytes */ 1063 | 1064 | if (size < bs) 
1065 | panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs); 1066 | 1067 | if (((unsigned long) from & 64UL) || ((unsigned long)to & 64UL)) 1068 | panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__); 1069 | 1070 | /* start */ 1071 | kernel_fpu_begin(); 1072 | 1073 | /* do the non-temporal mov */ 1074 | for (i = 0; i < size; i += bs){ 1075 | __asm__ __volatile__ ( 1076 | "movdqa (%0), %%xmm0\n" 1077 | "movdqa 16(%0), %%xmm1\n" 1078 | "movdqa 32(%0), %%xmm2\n" 1079 | "movdqa 48(%0), %%xmm3\n" 1080 | "movntdq %%xmm0, (%1)\n" 1081 | "movntdq %%xmm1, 16(%1)\n" 1082 | "movntdq %%xmm2, 32(%1)\n" 1083 | "movntdq %%xmm3, 48(%1)\n" 1084 | : 1085 | : "r" (from), "r" (to) 1086 | : "memory"); 1087 | 1088 | to += bs; 1089 | from += bs; 1090 | } 1091 | 1092 | /* do sfence to push data out */ 1093 | __asm__ __volatile__ ( 1094 | " sfence\n" : : 1095 | ); 1096 | 1097 | /* end */ 1098 | kernel_fpu_end(); 1099 | 1100 | /*NOTE: we assume it would be multiple units of 64 bytes*/ 1101 | if (i != size) 1102 | panic("%s:%s:%d size (%zu) is in multiple units of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size); 1103 | 1104 | return; 1105 | } 1106 | 1107 | /* non-temporal store */ 1108 | static void nts_memcpy(void* to, void* from, size_t size) 1109 | { 1110 | if (size < 64){ 1111 | panic("no support for nt load smaller than 64 bytes yet\n"); 1112 | } else { 1113 | nts_memcpy_64bytes_v2(to, from, size); 1114 | } 1115 | } 1116 | 1117 | 1118 | static void ntl_memcpy_64bytes(void* to, void* from, size_t size) 1119 | { 1120 | int i; 1121 | unsigned bs = 64; /* write unit size 16 bytes */ 1122 | 1123 | if (size < bs) 1124 | panic("%s(%d) size (%zu) is smaller than %u\n", __FUNCTION__, __LINE__, size, bs); 1125 | 1126 | if (((unsigned long) from & 64UL) || ((unsigned long)to & 64UL)) 1127 | panic("%s(%d) not aligned\n", __FUNCTION__, __LINE__); 1128 | 1129 | /* start */ 1130 | kernel_fpu_begin(); 1131 | 1132 | /* do the non-temporal mov */ 1133 | for (i = 0; i < size; i += bs){ 1134 | __asm__ __volatile__ ( 1135 | "movntdqa (%0), %%xmm0\n" 1136 | "movntdqa 16(%0), %%xmm1\n" 1137 | "movntdqa 32(%0), %%xmm2\n" 1138 | "movntdqa 48(%0), %%xmm3\n" 1139 | "movdqa %%xmm0, (%1)\n" 1140 | "movdqa %%xmm1, 16(%1)\n" 1141 | "movdqa %%xmm2, 32(%1)\n" 1142 | "movdqa %%xmm3, 48(%1)\n" 1143 | : 1144 | : "r" (from), "r" (to) 1145 | : "memory"); 1146 | 1147 | to += bs; 1148 | from += bs; 1149 | } 1150 | 1151 | /* end */ 1152 | kernel_fpu_end(); 1153 | 1154 | /*NOTE: we assume it would be multiple units of 64 bytes (at least 512 bytes)*/ 1155 | if (i != size) 1156 | panic("%s:%s:%d size (%zu) is in multiple units of 64 bytes\n", __FILE__, __FUNCTION__, __LINE__, size); 1157 | 1158 | return; 1159 | } 1160 | 1161 | /* non-temporal load */ 1162 | static void ntl_memcpy(void* to, void* from, size_t size) 1163 | { 1164 | if (size < 64){ 1165 | panic("no support for nt load smaller than 128 bytes yet\n"); 1166 | } else { 1167 | ntl_memcpy_64bytes(to, from, size); 1168 | } 1169 | } 1170 | 1171 | 1172 | /* 1173 | * ************************************************************************* 1174 | * COPY TO/FROM PM 1175 | * ************************************************************************* 1176 | * 1177 | * NOTE: copying into PM needs particular care, we use different solution here: 1178 | * (1) pmap: we only map/unmap PM pages when we need to access, which provides 1179 | * us the most protection, for both reads and writes 1180 | * (2) non-pmap: we always map every page into the kernel space, however, we 1181 
| * put different protection for writes only. In both cases, PM pages are 1182 | * initialized as read-only 1183 | * - PTE manipulation: before each write, the page writable bit is enabled, and 1184 | * disabled right after the write operation is done. 1185 | * - CR0/WP switch: before each write, the WP bit in the CR0 register turned 1186 | * off, and turned back on right after the write operation is done. Once 1187 | * CR0/WP bit is turned off, the CPU would not check the writable bit in the 1188 | * TLB in local CPU. So it is a tricky way to hack and walk around this 1189 | * problem. 1190 | * 1191 | */ 1192 | 1193 | #define PMBD_PMAP_DUMMY_BASE_VA (4096) 1194 | #define PMBD_PMAP_VA_TO_PA(VA) (g_highmem_phys_addr + (VA) - PMBD_PMAP_DUMMY_BASE_VA) 1195 | /* 1196 | * copying from/to a contiguous PM space using pmap 1197 | * @ram_va: the RAM virtual address 1198 | * @pmbd_dummy_va: the dummy PM virtual address (for converting to phys addr) 1199 | * @rw: 0 - read, 1 - write 1200 | */ 1201 | 1202 | #define MEMCPY_TO_PMBD(dst, src, bytes) { if (PMBD_USE_NTS()) \ 1203 | nts_memcpy((dst), (src), (bytes)); \ 1204 | else \ 1205 | memcpy((dst), (src), (bytes));} 1206 | 1207 | #define MEMCPY_FROM_PMBD(dst, src, bytes) { if (PMBD_USE_NTL()) \ 1208 | ntl_memcpy((dst), (src), (bytes)); \ 1209 | else \ 1210 | memcpy((dst), (src), (bytes));} 1211 | 1212 | static inline int _memcpy_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* ram_va, void* pmbd_dummy_va, size_t bytes, unsigned rw, unsigned do_fua) 1213 | { 1214 | unsigned long flags = 0; 1215 | uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va); 1216 | 1217 | /* disable interrupt (PMAP entry is shared) */ 1218 | DISABLE_SAVE_IRQ(flags); 1219 | 1220 | /* do the real work */ 1221 | while(bytes){ 1222 | uint64_t time_p1 = 0; 1223 | uint64_t time_p2 = 0; 1224 | 1225 | unsigned long pfn = (pa >> PAGE_SHIFT); /* page frame number */ 1226 | unsigned off = pa & (~PAGE_MASK); /* offset in one page */ 1227 | unsigned size = MIN_OF((PAGE_SIZE - off), bytes);/* the size to copy */ 1228 | 1229 | /* map it */ 1230 | void * map = pmap_atomic_pfn(pfn, pmbd, rw); 1231 | void * pmbd_va = map + off; 1232 | 1233 | /* do memcopy */ 1234 | TIMESTAMP(time_p1); 1235 | if (rw == READ) { 1236 | MEMCPY_FROM_PMBD(ram_va, pmbd_va, size); 1237 | } else { 1238 | if (PMBD_USE_SUBPAGE_UPDATE()) { 1239 | /* if we do subpage write, write a cacheline each time */ 1240 | /* FIXME: we probably need to check the alignment here */ 1241 | size = MIN_OF(size, PMBD_CACHELINE_SIZE); 1242 | if (memcmp(pmbd_va, ram_va, size)){ 1243 | MEMCPY_TO_PMBD(pmbd_va, ram_va, size); 1244 | } 1245 | } else { 1246 | MEMCPY_TO_PMBD(pmbd_va, ram_va, size); 1247 | } 1248 | } 1249 | TIMESTAMP(time_p2); 1250 | 1251 | /* emulating slowdown*/ 1252 | if(PMBD_DEV_USE_SLOWDOWN(pmbd)) 1253 | pmbd_rdwr_slowdown((pmbd), rw, time_p1, time_p2); 1254 | 1255 | /* for write check if we need to do clflush or do FUA*/ 1256 | if (rw == WRITE){ 1257 | if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS())) 1258 | pmbd_clflush_range(pmbd, pmbd_va, (size)); 1259 | } 1260 | 1261 | /* if write combine is used, we need to do sfence (like in ntstore) */ 1262 | if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) 1263 | sfence(); 1264 | 1265 | /* update time statistics */ 1266 | if(PMBD_USE_TIMESTAT()){ 1267 | int cid = CUR_CPU_ID(); 1268 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 1269 | pmbd_stat->cycles_memcpy[rw][cid] += time_p2 - time_p1; 1270 | } 1271 | 1272 | /* unmap it */ 1273 | punmap_atomic(map, pmbd, 
rw); 1274 | 1275 | /* prepare the next iteration */ 1276 | ram_va += size; 1277 | bytes -= size; 1278 | pa += size; 1279 | } 1280 | 1281 | /* re-enable interrupt */ 1282 | ENABLE_RESTORE_IRQ(flags); 1283 | 1284 | return 0; 1285 | } 1286 | 1287 | static inline int memcpy_from_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes) 1288 | { 1289 | return _memcpy_pmbd_pmap(pmbd, dst, src, bytes, READ, FALSE); 1290 | } 1291 | 1292 | static inline int memcpy_to_pmbd_pmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua) 1293 | { 1294 | return _memcpy_pmbd_pmap(pmbd, src, dst, bytes, WRITE, do_fua); 1295 | } 1296 | 1297 | 1298 | /* 1299 | * memcpy from/to PM without using pmap 1300 | */ 1301 | 1302 | #define DISABLE_CR0_WP(CR0,FLAGS) {\ 1303 | if (PMBD_USE_WRITE_PROTECTION()){\ 1304 | DISABLE_SAVE_IRQ((FLAGS));\ 1305 | (CR0) = read_cr0();\ 1306 | write_cr0((CR0) & ~X86_CR0_WP);\ 1307 | }\ 1308 | } 1309 | #define ENABLE_CR0_WP(CR0,FLAGS) {\ 1310 | if (PMBD_USE_WRITE_PROTECTION()){\ 1311 | write_cr0((CR0));\ 1312 | ENABLE_RESTORE_IRQ((FLAGS));\ 1313 | }\ 1314 | } 1315 | 1316 | static inline int memcpy_from_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes) 1317 | { 1318 | uint64_t time_p1 = 0; 1319 | uint64_t time_p2 = 0; 1320 | 1321 | /* start memcpy */ 1322 | TIMESTAMP(time_p1); 1323 | #if 0 1324 | if (PMBD_DEV_USE_VMALLOC((pmbd))) 1325 | memcpy((dst), (src), (bytes)); 1326 | else if (PMBD_DEV_USE_HIGHMEM((pmbd))) 1327 | memcpy_fromio((dst), (src), (bytes)); 1328 | #endif 1329 | MEMCPY_FROM_PMBD(dst, src, bytes); 1330 | 1331 | TIMESTAMP(time_p2); 1332 | 1333 | /* emulating slowdown*/ 1334 | if(PMBD_DEV_USE_SLOWDOWN(pmbd)) 1335 | pmbd_rdwr_slowdown((pmbd), READ, time_p1, time_p2); 1336 | 1337 | /* update time statistics */ 1338 | if(PMBD_USE_TIMESTAT()){ 1339 | int cid = CUR_CPU_ID(); 1340 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 1341 | pmbd_stat->cycles_memcpy[READ][cid] += time_p2 - time_p1; 1342 | } 1343 | 1344 | return 0; 1345 | } 1346 | 1347 | static int memcpy_to_pmbd_nopmap(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua) 1348 | { 1349 | 1350 | unsigned long cr0 = 0; 1351 | unsigned long flags = 0; 1352 | size_t left = bytes; 1353 | 1354 | 1355 | /* get a bkup copy of the CR0 (to allow writable)*/ 1356 | if (PMBD_DEV_USE_WPMODE_CR0(pmbd)) 1357 | DISABLE_CR0_WP(cr0, flags); 1358 | 1359 | /* do the real work */ 1360 | while(left){ 1361 | size_t size = left; // the size to copy 1362 | uint64_t time_p1 = 0; 1363 | uint64_t time_p2 = 0; 1364 | 1365 | TIMESTAMP(time_p1); 1366 | /* do memcopy */ 1367 | if (PMBD_USE_SUBPAGE_UPDATE()) { 1368 | /* if we do subpage write, write a cacheline each time */ 1369 | size = MIN_OF(size, PMBD_CACHELINE_SIZE); 1370 | 1371 | if (memcmp(dst, src, size)){ 1372 | MEMCPY_TO_PMBD(dst, src, size); 1373 | } 1374 | } else { 1375 | MEMCPY_TO_PMBD(dst, src, size); 1376 | } 1377 | TIMESTAMP(time_p2); 1378 | 1379 | /* emulating slowdown*/ 1380 | if(PMBD_DEV_USE_SLOWDOWN(pmbd)) 1381 | pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2); 1382 | 1383 | /* if write, check if we need to do clflush or we do FUA */ 1384 | if (PMBD_USE_CLFLUSH() || (do_fua && PMBD_CPU_CACHE_USE_WB() && !PMBD_USE_NTS())) 1385 | pmbd_clflush_range(pmbd, dst, (size)); 1386 | 1387 | /* if write combine is used, we need to do sfence (like in ntstore) */ 1388 | if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) 1389 | sfence(); 1390 | 1391 | /* update time statistics */ 1392 | if(PMBD_USE_TIMESTAT()){ 1393 | int cid = 
CUR_CPU_ID(); 1394 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 1395 | pmbd_stat->cycles_memcpy[WRITE][cid] += time_p2 - time_p1; 1396 | } 1397 | 1398 | /* prepare the next iteration */ 1399 | dst += size; 1400 | src += size; 1401 | left -= size; 1402 | } 1403 | 1404 | /* restore the CR0 */ 1405 | if (PMBD_DEV_USE_WPMODE_CR0(pmbd)) 1406 | ENABLE_CR0_WP(cr0, flags); 1407 | 1408 | return 0; 1409 | } 1410 | 1411 | static int memcpy_to_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes, unsigned do_fua) 1412 | { 1413 | uint64_t start = 0; 1414 | uint64_t end = 0; 1415 | 1416 | /* start simulation timing */ 1417 | if (PMBD_DEV_SIM_PMBD((pmbd))) 1418 | start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), WRITE); 1419 | 1420 | /* do memcpy now */ 1421 | if (PMBD_USE_PMAP()){ 1422 | memcpy_to_pmbd_pmap(pmbd, dst, src, bytes, do_fua); 1423 | } else { 1424 | memcpy_to_pmbd_nopmap(pmbd, dst, src, bytes, do_fua); 1425 | } 1426 | 1427 | /* stop simulation timing */ 1428 | if (PMBD_DEV_SIM_PMBD((pmbd))) 1429 | end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), WRITE, start); 1430 | 1431 | /* pause write for a while*/ 1432 | pmbd_rdwr_pause(pmbd, bytes, WRITE); 1433 | 1434 | return 0; 1435 | } 1436 | 1437 | 1438 | 1439 | static int memcpy_from_pmbd(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes) 1440 | { 1441 | uint64_t start = 0; 1442 | uint64_t end = 0; 1443 | 1444 | /* start simulation timing */ 1445 | if (PMBD_DEV_SIM_PMBD((pmbd))) 1446 | start = emul_start((pmbd), BYTE_TO_SECTOR((bytes)), READ); 1447 | 1448 | /* do memcpy here */ 1449 | if (PMBD_USE_PMAP()){ 1450 | memcpy_from_pmbd_pmap(pmbd, dst, src, bytes); 1451 | }else{ 1452 | memcpy_from_pmbd_nopmap(pmbd, dst, src, bytes); 1453 | } 1454 | 1455 | /* stop simulation timing */ 1456 | if (PMBD_DEV_SIM_PMBD((pmbd))) 1457 | end = emul_end((pmbd), BYTE_TO_SECTOR((bytes)), READ, start); 1458 | 1459 | /* pause read for a while */ 1460 | pmbd_rdwr_pause(pmbd, bytes, READ); 1461 | 1462 | return 0; 1463 | } 1464 | 1465 | 1466 | 1467 | /* 1468 | * ************************************************************************* 1469 | * PMBD device buffer management 1470 | * ************************************************************************* 1471 | * 1472 | * Since write protection involves high performance overhead (due to TLB 1473 | * shootdown and other system locking, linked list scan overhead related with 1474 | * set_memory_* functions), we cannot change page table attributes for each 1475 | * incoming write to PM space. In order to battle this issue, we added a 1476 | * buffer to temporarily hold the incoming writes into a DRAM buffer, and 1477 | * launch a syncer daemon to periodically flush dirty pages from the buffer to 1478 | * the PM storage. This brings two benefits: first, more contiguous pages can 1479 | * be clustered together, and we only need to do one page attribute change for 1480 | * a cluster; second, high overhead is hidden in the background, since the 1481 | * writes become asynchronous now. 
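 *
 * As a rough sketch of the resulting write path (function names as defined
 * below): an incoming write lands in copy_to_pmbd_buffered(), which takes
 * pbi->lock and stages the data in a buffer block obtained from
 * pmbd_buffer_alloc_block(); later, either the syncer daemon or an allocator
 * that finds the buffer full calls pmbd_buffer_flush(), which sorts the dirty
 * blocks by PBN and writes each contiguous run back to PM with a single
 * _pmbd_buffer_flush_range() call.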
1482 | * 1483 | */ 1484 | 1485 | 1486 | /* support functions to sort the bbi entries */ 1487 | static int compare_bbi_sort_entries(const void* m, const void* n) 1488 | { 1489 | PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m; 1490 | PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n; 1491 | if (a->pbn < b->pbn) 1492 | return -1; 1493 | else if (a->pbn == b->pbn) 1494 | return 0; 1495 | else 1496 | return 1; 1497 | 1498 | } 1499 | 1500 | static void swap_bbi_sort_entries(void* m, void* n, int size) 1501 | { 1502 | PMBD_BSORT_ENTRY_T* a = (PMBD_BSORT_ENTRY_T*) m; 1503 | PMBD_BSORT_ENTRY_T* b = (PMBD_BSORT_ENTRY_T*) n; 1504 | PMBD_BSORT_ENTRY_T tmp; 1505 | tmp = *a; 1506 | *a = *b; 1507 | *b = tmp; 1508 | return; 1509 | } 1510 | 1511 | 1512 | /* 1513 | * get the aligned in-block offsets for a given request 1514 | * @pmbd: the pmbd device 1515 | * @sector: the starting offset (in sectors) of the incoming request 1516 | * @bytes: the size of the incoming request 1517 | * 1518 | * return: the in-block offset of the starting sector in the request 1519 | * 1520 | * Since the block size (4096 bytes) is larger than the sector size (512 bytes), 1521 | * if the incoming request is not completely aligned in units of blocks, then 1522 | * we need to pull the whole block from PM space into the buffer, and apply 1523 | * changes to partial blocks. This function is needed to calculate the offset 1524 | * for the beginning and ending sectors. 1525 | * 1526 | * For example: assuming sector size is 1024, buffer block size is 4096, sector 1527 | * is 5, size is 1024, then the returned start offset is 1 (the second sector 1528 | * in the buffer block), and the returned end offset is 2 (the third sector in 1529 | * the buffer block) 1530 | * 1531 | * offset_s -----v v--- offset_e 1532 | * ---------------------------------- 1533 | * | |*****| | | 1534 | * ---------------------------------- 1535 | * 1536 | */ 1537 | 1538 | static sector_t pmbd_buffer_aligned_request_start(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes) 1539 | { 1540 | sector_t sector_s = sector; 1541 | PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector_s); 1542 | sector_t block_s = PBN_TO_SECTOR(pmbd, pbn_s); /* the block's starting offset (in sector) */ 1543 | sector_t offset_s = 0; 1544 | if (sector_s >= block_s) /* if not aligned */ 1545 | offset_s = sector_s - block_s; 1546 | return offset_s; 1547 | } 1548 | 1549 | static sector_t pmbd_buffer_aligned_request_end(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes) 1550 | { 1551 | sector_t sector_e = sector + BYTE_TO_SECTOR(bytes) - 1; 1552 | PBN_T pbn_e = SECTOR_TO_PBN(pmbd, sector_e); 1553 | sector_t block_e = PBN_TO_SECTOR(pmbd, pbn_e); /* the block's starting offset (in sector) */ 1554 | sector_t offset_e = PBN_TO_SECTOR(pmbd, 1) - 1; 1555 | 1556 | if (sector_e >= block_e) /* if not aligned */ 1557 | offset_e = (sector_e - block_e); 1558 | return offset_e; 1559 | } 1560 | 1561 | 1562 | /* 1563 | * check and see if a physical block (pbn) is buffered 1564 | * @pmbd: pmbd device 1565 | * @pbn: buffer block number 1566 | * 1567 | * NOTE: The caller must hold the pbi->lock 1568 | */ 1569 | static PMBD_BBI_T* _pmbd_buffer_lookup(PMBD_BUFFER_T* buffer, PBN_T pbn) 1570 | { 1571 | PMBD_BBI_T* bbi = NULL; 1572 | PMBD_DEVICE_T* pmbd = buffer->pmbd; 1573 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 1574 | 1575 | if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) { 1576 | bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn); 1577 | } 1578 | return bbi; 1579 | } 1580 | 1581 | /* 1582 | * Alloc/flush buffer functions 1583 | */ 1584 | 
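/*
 * Worked example for pmbd_buffer_aligned_request_start/end(), assuming
 * 512-byte sectors and a 4096-byte buffer block (8 sectors per block):
 * for a request with sector=9 and bytes=1024, the enclosing block starts at
 * sector 8, so offset_s = 9 - 8 = 1; the last sector is 9 + 2 - 1 = 10, so
 * offset_e = 10 - 8 = 2. The request thus touches only sectors 1..2 inside
 * the block, and the rest of the block must first be read from PM (as
 * copy_to_pmbd_buffered() does for unbuffered partial blocks).
 */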
1585 | /* 1586 | * flushing a range of contiguous physical blocks from buffer to PM space 1587 | * @pmbd: pmbd device 1588 | * @pbn_s: the first physical block number to flush (start) 1589 | * @pbn_e: the last physical block number to flush (end) 1590 | * 1591 | * This function only flushes blocks from buffer to PM and unlink(free) the 1592 | * corresponding buffer blocks and physical PM blocks, and it does not update 1593 | * the buffer control info (num_dirty, pos_dirty). This is because after 1594 | * sorting, the processing order of buffer blocks (BBNs) may be different from 1595 | * the spatial order of the buffer blocks, which makes it impossible to move 1596 | * pos_dirty forward exactly one after one. In other words, pos_dirty only 1597 | * points to the end of the dirty range, and we may flush a dirty block in the 1598 | * middle of the range, rather than from the end first. 1599 | * 1600 | * NOTE: The caller must hold the flush_lock; only one thread is allowed to do 1601 | * this sync; we also assume all the physical blocks in the specified range are 1602 | * buffered. 1603 | * 1604 | */ 1605 | 1606 | static unsigned long _pmbd_buffer_flush_range(PMBD_BUFFER_T* buffer, PBN_T pbn_s, PBN_T pbn_e) 1607 | { 1608 | PBN_T pbn = 0; 1609 | unsigned long num_cleaned = 0; 1610 | PMBD_DEVICE_T* pmbd = buffer->pmbd; 1611 | void* dst = PMBD_BLOCK_VADDR(pmbd, pbn_s); 1612 | size_t bytes = PBN_TO_BYTE(pmbd, (pbn_e - pbn_s + 1)); 1613 | 1614 | /* NOTE: we are protected by the flush_lock here, no-one else here */ 1615 | 1616 | /* set the pages readwriteable */ 1617 | /* if we use CR0/WP to temporarily switch the writable permission, 1618 | * we don't have to change the PTE attributes directly */ 1619 | if (PMBD_DEV_USE_WPMODE_PTE(pmbd)) 1620 | pmbd_set_pages_rw(pmbd, dst, bytes, TRUE); 1621 | 1622 | 1623 | /* for each physical block, flush it from buffer to the PM space */ 1624 | for (pbn = pbn_s; pbn <= pbn_e; pbn ++){ 1625 | BBN_T bbn = 0; 1626 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 1627 | void* to = PMBD_BLOCK_VADDR(pmbd, pbn); 1628 | size_t size = pmbd->pb_size; 1629 | void* from = NULL; /* wait to get it in locked region */ 1630 | PMBD_BBI_T* bbi = NULL; /* wait to get it in locked region */ 1631 | 1632 | /* 1633 | * NOTE: This would not cause a deadlock, because the block 1634 | * here are already buffered, and these blocks would not call 1635 | * pmbd_buffer_alloc_block() 1636 | */ 1637 | spin_lock(&pbi->lock); /* lock the block */ 1638 | 1639 | /* get related buffer block info */ 1640 | if (PMBD_BLOCK_IS_BUFFERED(pmbd, pbn)) { 1641 | bbn = pbi->bbn; 1642 | bbi = PMBD_BUFFER_BBI(buffer, pbi->bbn); 1643 | from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn); 1644 | } else { 1645 | panic("pmbd: %s(%d) something wrong here \n", __FUNCTION__, __LINE__); 1646 | } 1647 | 1648 | /* sync data from buffer into PM first */ 1649 | if (PMBD_BUFFER_BBI_IS_DIRTY(buffer, bbn)) { 1650 | /* flush to PM */ 1651 | memcpy_to_pmbd(pmbd, to, from, size, FALSE); 1652 | 1653 | /* mark it as clean */ 1654 | PMBD_BUFFER_SET_BBI_CLEAN(buffer, bbn); 1655 | } 1656 | } 1657 | 1658 | /* set the pages back to read-only */ 1659 | if (PMBD_DEV_USE_WPMODE_PTE(pmbd)) 1660 | pmbd_set_pages_ro(pmbd, dst, bytes, TRUE); 1661 | 1662 | 1663 | /* finish the remaining work */ 1664 | for (pbn = pbn_s; pbn <= pbn_e; pbn ++){ 1665 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 1666 | void* to = PMBD_BLOCK_VADDR(pmbd, pbn); 1667 | size_t size = pmbd->pb_size; 1668 | BBN_T bbn = pbi->bbn; 1669 | void* from = PMBD_BUFFER_BLOCK(buffer, 
pbi->bbn); 1670 | 1671 | /* verify that the write operation succeeded */ 1672 | if(PMBD_USE_WRITE_VERIFICATION()) 1673 | pmbd_verify_wr_pages(pmbd, to, from, size); 1674 | 1675 | /* reset the bbi and pbi link info */ 1676 | PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, bbn); 1677 | PMBD_SET_BLOCK_UNBUFFERED(pmbd, pbn); 1678 | 1679 | /* unlock the block */ 1680 | spin_unlock(&pbi->lock); 1681 | 1682 | num_cleaned ++; 1683 | } 1684 | 1685 | /* generate checksum */ 1686 | if (PMBD_USE_CHECKSUM()) 1687 | pmbd_checksum_on_write(pmbd, dst, bytes); 1688 | 1689 | return num_cleaned; 1690 | } 1691 | 1692 | 1693 | /* 1694 | * core function of flushing the pmbd buffer 1695 | * @pmbd: pmbd device 1696 | * 1697 | * NOTE: this function performs the flushing in the following steps 1698 | * (1) get the flush lock (to allow only one to do flushing) 1699 | * (2) get the buffer_lock to protect the buffer control info (num_dirty, 1700 | * pos_dirty, pos_clean) 1701 | * (3) check if someone else has already done the flushing work while waiting 1702 | * for the lock 1703 | * (4) copy the buffer block info from pos_dirty to pos_clean to a temporary 1704 | * array 1705 | * (5) release the buffer_lock (to allow alloc to proceed, as long as free 1706 | * blocks exist) 1707 | * 1708 | * (6) sort the temporary array of buffer blocks in the order of their PBNs. 1709 | * This is because we need to organize sequences of contiguous physical blocks, 1710 | * so that we can use only one set_memory_* function for a sequence of memory 1711 | * pages, rather than once for each page. So the larger the sequence is, the 1712 | * more efficient it would be. 1713 | * (7) scan the sorted list, and form sequences of contiguous physical blocks, 1714 | * and call _pmbd_buffer_flush_range() to synchronize the sequences one by one 1715 | * 1716 | * (8) get the buffer_lock again 1717 | * (9) update the pos_dirty and num_dirty to reflect the recent changes 1718 | * (10) release the buffer_lock, and finally release the flush_lock 1719 | * 1720 | * NOTE: The caller must not hold flush_lock and buffer_lock, but can hold 1721 | * pbi->lock.
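 *
 * For illustration (hypothetical PBNs): if the entries copied into
 * bbi_sort_buffer in step (4) refer to PBNs {15, 7, 16, 30, 8}, the sort in
 * step (6) yields {7, 8, 15, 16, 30}, and step (7) flushes them as three
 * contiguous runs, [7..8], [15..16] and [30], using one
 * _pmbd_buffer_flush_range() call (and thus one page attribute change) per
 * run.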
1722 | * 1723 | */ 1724 | static unsigned long pmbd_buffer_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean) 1725 | { 1726 | BBN_T i = 0; 1727 | BBN_T bbn_s = 0; 1728 | BBN_T bbn_e = 0; 1729 | PBN_T first_pbn = 0; 1730 | PBN_T last_pbn = 0; 1731 | unsigned long num_cleaned = 0; 1732 | unsigned long num_scanned = 0; 1733 | PMBD_DEVICE_T* pmbd = buffer->pmbd; 1734 | PMBD_BSORT_ENTRY_T* bbi_sort_buffer = buffer->bbi_sort_buffer; 1735 | 1736 | /* lock the flush_lock to ensure no-one else can do flush in parallel */ 1737 | spin_lock(&buffer->flush_lock); 1738 | 1739 | /* now we lock the buffer to protect buffer control info */ 1740 | spin_lock(&buffer->buffer_lock); 1741 | 1742 | /* check if num_to_clean is too large */ 1743 | if (num_to_clean > buffer->num_dirty) 1744 | num_to_clean = buffer->num_dirty; 1745 | 1746 | /* check if the buffer is empty (someone else may have done the flushing job) */ 1747 | if (PMBD_BUFFER_IS_EMPTY(buffer) || num_to_clean == 0) { 1748 | spin_unlock(&buffer->buffer_lock); 1749 | goto done; 1750 | } 1751 | 1752 | /* set up the range of BBNs we need to check */ 1753 | bbn_s = buffer->pos_dirty; /* the first bbn */ 1754 | bbn_e = PMBD_BUFFER_PRIO_POS(buffer, buffer->pos_clean);/* the last bbn */ 1755 | 1756 | /* scan the buffer range and put it into the sort buffer */ 1757 | /* 1758 | * NOTE: bbn_s could be equal to PMBD_BUFFER_NEXT_POS(buffer, bbn_e), if 1759 | * the buffer is filled with dirty blocks, so we need to check num_scanned 1760 | * here. 1761 | * */ 1762 | for (i = bbn_s; 1763 | (i != PMBD_BUFFER_NEXT_POS(buffer, bbn_e)) || (num_scanned == 0); 1764 | i = PMBD_BUFFER_NEXT_POS(buffer, i)) { 1765 | /* 1766 | * FIXME: it may be possible that some blocks in the dirty 1767 | * block range are "clean", because after the block is 1768 | * allocated, and before it is being written, the block is 1769 | * marked as CLEAN, but it is allocated already. However, it is 1770 | * safe to attempt to flush it, because the pbi->lock would 1771 | * protect us. 1772 | * 1773 | * UPDATES: we changed the allocator code to mark it dirty as 1774 | * soon as the block is allocated. So the aforesaid situation 1775 | * would not happen anymore. 
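		 *
		 * With that change in place, a clean block inside the dirty
		 * range would indicate a broken invariant, which is why the
		 * branch below panics rather than silently skipping the block.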
1776 | */ 1777 | if(PMBD_BUFFER_BBI_IS_CLEAN(buffer, i)){ 1778 | /* found clean blocks */ 1779 | panic("ERR: %s(%d)%u: found clean block in the range of dirty blocks (bbn_s=%lu bbn_e=%lu, i=%lu, num_scanned=%lu num_to_clean=%lu num_dirty=%lu pos_dirty=%lu pos_clean=%lu)\n", 1780 | __FUNCTION__, __LINE__, __CURRENT_PID__,bbn_s, bbn_e, i, num_scanned, num_to_clean, buffer->num_dirty, buffer->pos_dirty, buffer->pos_clean); 1781 | continue; 1782 | } else { 1783 | PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, i); 1784 | PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + num_scanned; 1785 | 1786 | /* add it to the buffer for sorting */ 1787 | se->pbn = bbi->pbn; 1788 | se->bbn = i; 1789 | num_scanned ++; 1790 | 1791 | /* only clean num_to_clean blocks */ 1792 | if (num_scanned >= num_to_clean) 1793 | break; 1794 | } 1795 | } 1796 | /* unlock the buffer to let allocator continue */ 1797 | spin_unlock(&buffer->buffer_lock); 1798 | 1799 | /* if no valid dirty block to be cleaned*/ 1800 | if (num_scanned == 0) 1801 | goto done; 1802 | 1803 | /* 1804 | * sort the buffer to get sequences of contiguous blocks 1805 | */ 1806 | if (PMBD_DEV_USE_WPMODE_PTE(pmbd)) 1807 | sort(bbi_sort_buffer, num_scanned, sizeof(PMBD_BSORT_ENTRY_T), compare_bbi_sort_entries, swap_bbi_sort_entries); 1808 | 1809 | /* scan the sorted list to organize and flush the sequences of contiguous PBNs */ 1810 | for (i = 0; i < num_scanned; i ++) { 1811 | PMBD_BSORT_ENTRY_T* se = bbi_sort_buffer + i; 1812 | PMBD_BBI_T* bbi = PMBD_BUFFER_BBI(buffer, se->bbn); 1813 | if (i == 0) { 1814 | /* the first one */ 1815 | first_pbn = bbi->pbn; 1816 | last_pbn = bbi->pbn; 1817 | continue; 1818 | } else { 1819 | if (bbi->pbn == (last_pbn + 1) ) { 1820 | /* if blocks are contiguous */ 1821 | last_pbn = bbi->pbn; 1822 | continue; 1823 | } else { 1824 | /* if blocks are not contiguous */ 1825 | num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn); 1826 | 1827 | /* start a new sequence */ 1828 | first_pbn = bbi->pbn; 1829 | last_pbn = bbi->pbn; 1830 | continue; 1831 | } 1832 | } 1833 | } 1834 | 1835 | /* finish the last sequence of contiguous PBNs */ 1836 | num_cleaned += _pmbd_buffer_flush_range(buffer, first_pbn, last_pbn); 1837 | 1838 | /* update the buffer control info */ 1839 | spin_lock(&buffer->buffer_lock); 1840 | buffer->pos_dirty = PMBD_BUFFER_NEXT_N_POS(buffer, bbn_s, num_cleaned); /* move pos_dirty forward */ 1841 | buffer->num_dirty -= num_cleaned; /* decrement the counter*/ 1842 | spin_unlock(&buffer->buffer_lock); 1843 | 1844 | done: 1845 | spin_unlock(&buffer->flush_lock); 1846 | return num_cleaned; 1847 | } 1848 | 1849 | /* 1850 | * entry function of flushing buffer 1851 | * This function is called by both allocator and syncer 1852 | * @pmbd: pmbd device 1853 | * @num_to_clean: how many blocks to clean 1854 | * @i_am_syncer: indicate which caller is (TRUE for syncer and FALSE for allocator) 1855 | */ 1856 | static unsigned long pmbd_buffer_check_and_flush(PMBD_BUFFER_T* buffer, unsigned long num_to_clean, unsigned caller) 1857 | { 1858 | unsigned long num_cleaned = 0; 1859 | 1860 | /* 1861 | * Since there may exist more than one thread (e.g. alloc/flush or 1862 | * alloc/alloc) trying to flush the buffer, we need to first check if 1863 | * someone else has already done the job while waiting for the lock. If 1864 | * true, we don't have to proceed and flush it again. 
This improves the 1865 | * responsiveness of applications 1866 | */ 1867 | if (caller == CALLER_DESTROYER){ 1868 | /* if destroyer calls this function, just flush everything */ 1869 | goto do_it; 1870 | 1871 | } else if (caller == CALLER_SYNCER) { 1872 | /* if syncer calls this function and the buffer is empty, do nothing */ 1873 | spin_lock(&buffer->buffer_lock); 1874 | if (PMBD_BUFFER_IS_EMPTY(buffer)){ 1875 | spin_unlock(&buffer->buffer_lock); 1876 | goto done; 1877 | } 1878 | spin_unlock(&buffer->buffer_lock); 1879 | 1880 | } else if (caller == CALLER_ALLOCATOR){ 1881 | 1882 | /* if reader/writer calls this function, some blocks are freed, then 1883 | * we just do nothing */ 1884 | spin_lock(&buffer->buffer_lock); 1885 | if (!PMBD_BUFFER_IS_FULL(buffer)){ 1886 | spin_unlock(&buffer->buffer_lock); 1887 | goto done; 1888 | } 1889 | spin_unlock(&buffer->buffer_lock); 1890 | 1891 | } else { 1892 | panic("ERR: %s(%d) unknown caller id\n", __FUNCTION__, __LINE__); 1893 | } 1894 | 1895 | /* otherwise, we do flushing */ 1896 | do_it: 1897 | num_cleaned = pmbd_buffer_flush(buffer, num_to_clean); 1898 | 1899 | done: 1900 | return num_cleaned; 1901 | } 1902 | 1903 | /* 1904 | * Core function of allocating a buffer block 1905 | * 1906 | * We first grab the buffer_lock, and check to see if the buffer is full. If 1907 | * not, we allocate a buffer block, move the pos_clean, and update num_dirty, 1908 | * then release the buffer_lock. Since we already hold the pbi->lock, it is 1909 | * safe to release the lock and let other threads proceed (before we really 1910 | * write data into the buffer block), because no one else can read/write or 1911 | * access the same buffer block concurrently. If the buffer is full, we release 1912 | * the buffer_lock to allow others to proceed (because we may be blocked at 1913 | * flush_lock later), and then we call the function to synchronously flush the 1914 | * buffer. Note that someone else may be there already, so we may be blocked 1915 | * there, and if we find someone has already flushed the buffer, we need to 1916 | * grab the buffer_lock and check if there is available buffer block again. 1917 | * 1918 | * NOTE: The caller must hold the pbi->lock. 1919 | * 1920 | */ 1921 | static PMBD_BBI_T* pmbd_buffer_alloc_block(PMBD_BUFFER_T* buffer, PBN_T pbn) 1922 | { 1923 | BBN_T pos = 0; 1924 | PMBD_BBI_T* bbi = NULL; 1925 | PMBD_DEVICE_T* pmbd = buffer->pmbd; 1926 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 1927 | 1928 | /* lock the buffer control info (we will check and update it) */ 1929 | spin_lock(&buffer->buffer_lock); 1930 | 1931 | check_again: 1932 | /* check if the buffer is completely full, if yes, flush it to PM */ 1933 | if (PMBD_BUFFER_IS_FULL(buffer)) { 1934 | /* release the buffer_lock (someone may be doing flushing)*/ 1935 | spin_unlock(&buffer->buffer_lock); 1936 | 1937 | /* If the buffer is full, we must flush it synchronously. 1938 | * 1939 | * NOTE: this on-demand flushing can improve performance a lot, since 1940 | * the allocator has not to wait for waking up syncer to do this, which 1941 | * is much faster. Another merit is that it makes the application run 1942 | * more smoothly (it is abrupt if completely relying on syncer). Also 1943 | * note that we only flush a batch (e.g. 1024) of blocks, rather than 1944 | * all the buffer blocks, this is because we only need a few blocks to 1945 | * satisfy the application's own need, and this reduces the time that 1946 | * the application spends on allocation. 
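	 *
	 * As a rough sizing note (using the example numbers above): with a
	 * batch_size of 1024 and 4096-byte buffer blocks, one on-demand flush
	 * writes back at most about 4 MB before the allocator re-checks for a
	 * free slot, instead of draining the whole buffer.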
*/ 1947 | pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_ALLOCATOR); 1948 | 1949 | /* grab the lock and check the availability of free buffer blocks 1950 | * again, because someone may use up all the free buffer blocks, right 1951 | * after the buffer is flushed but before we can get one */ 1952 | spin_lock(&buffer->buffer_lock); 1953 | goto check_again; 1954 | } 1955 | 1956 | /* if buffer is not full, only reserve one spot first. 1957 | * 1958 | * NOTE that we do not have to do link and memcpy in the locked region, 1959 | * because pbi->lock guarantees that no-one else can use it now. This 1960 | * moves the high-cost operations out of the critical section */ 1961 | pos = buffer->pos_clean; 1962 | buffer->pos_clean = PMBD_BUFFER_NEXT_POS(buffer, buffer->pos_clean); 1963 | buffer->num_dirty ++; 1964 | 1965 | /* NOTE: we mark it "dirty" here, but actually the data has not been 1966 | * really written into the PMBD buffer block yet. This is safe, because 1967 | * we are protected by the pbi->lock */ 1968 | PMBD_BUFFER_SET_BBI_DIRTY(buffer, pos); 1969 | 1970 | /* now link them up (no-one else can see it) */ 1971 | bbi = PMBD_BUFFER_BBI(buffer, pos); 1972 | 1973 | bbi->pbn = pbn; 1974 | pbi->bbn = pos; 1975 | 1976 | /* unlock the buffer_lock and let others proceed */ 1977 | spin_unlock(&buffer->buffer_lock); 1978 | 1979 | return bbi; 1980 | } 1981 | 1982 | 1983 | /* 1984 | * syncer daemon worker function 1985 | */ 1986 | 1987 | static inline uint64_t pmbd_device_is_idle(PMBD_DEVICE_T* pmbd) 1988 | { 1989 | unsigned last_jiffies, now_jiffies; 1990 | uint64_t interval = 0; 1991 | 1992 | now_jiffies = jiffies; 1993 | PMBD_DEV_GET_ACCESS_TIME(pmbd, last_jiffies); 1994 | interval = jiffies_to_usecs(now_jiffies - last_jiffies); 1995 | 1996 | if (PMBD_DEV_IS_IDLE(pmbd, interval)) { 1997 | return interval; 1998 | } else { 1999 | return 0; 2000 | } 2001 | } 2002 | 2003 | static int pmbd_syncer_worker(void* data) 2004 | { 2005 | PMBD_BUFFER_T* buffer = (PMBD_BUFFER_T*) data; 2006 | 2007 | set_user_nice(current, 0); 2008 | 2009 | do { 2010 | unsigned do_flush = 0; 2011 | // unsigned long loop = 0; 2012 | uint64_t idle_usec = 0; 2013 | spin_lock(&buffer->buffer_lock); 2014 | 2015 | /* we start flushing, if 2016 | * (1) the num of dirty blocks hits the high watermark, or 2017 | * (2) the device has been idle for a while */ 2018 | if (PMBD_BUFFER_ABOVE_HW(buffer)) { 2019 | //printk("High watermark is hit\n"; 2020 | do_flush = 1; 2021 | } 2022 | // if (pmbd_device_is_idle(buffer->pmbd) && !PMBD_BUFFER_IS_EMPTY(buffer)) { 2023 | if ((idle_usec = pmbd_device_is_idle(buffer->pmbd)) && PMBD_BUFFER_ABOVE_LW(buffer)) { 2024 | //printk("Device is idle for %llu uSeconds\n", idle_usec); 2025 | do_flush = 1; 2026 | } 2027 | if (do_flush){ 2028 | unsigned long num_dirty = 0; 2029 | unsigned long num_cleaned = 0; 2030 | repeat: 2031 | num_dirty = buffer->num_dirty; 2032 | spin_unlock(&buffer->buffer_lock); 2033 | 2034 | /* start flushing 2035 | * 2036 | * NOTE: we only allocate a batch (e.g. 1024) of blocks each time. The 2037 | * purpose is to let the applications wait for free blocks, so that they can 2038 | * get a few free blocks and proceed, rather than waiting for the whole 2039 | * buffer gets flushed. Otherwise, the bandwidth would be lower and the 2040 | * applications cannot run smoothly. 
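			 *
			 * In short: the syncer wakes up roughly once per jiffy,
			 * starts flushing when the high watermark is crossed (or
			 * the device has been idle while the buffer is above the
			 * low watermark), keeps flushing one batch at a time
			 * until the buffer drops below the low watermark, and
			 * then goes back to sleep.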
2041 | */ 2042 | num_cleaned = pmbd_buffer_check_and_flush(buffer, buffer->batch_size, CALLER_SYNCER); 2043 | //printk("Syncer(%u) activated (%lu) - Before (%lu) Cleaned (%lu) After (%lu)\n", 2044 | // buffer->buffer_id, loop++, num_dirty, num_cleaned, buffer->num_dirty); 2045 | 2046 | /* continue to flush until we hit the low watermark */ 2047 | spin_lock(&buffer->buffer_lock); 2048 | if (PMBD_BUFFER_ABOVE_LW(buffer)) { 2049 | // if (buffer->num_dirty > 0) { 2050 | goto repeat; 2051 | } 2052 | } 2053 | spin_unlock(&buffer->buffer_lock); 2054 | 2055 | /* go to sleep */ 2056 | set_current_state(TASK_INTERRUPTIBLE); 2057 | schedule_timeout(1); 2058 | set_current_state(TASK_RUNNING); 2059 | 2060 | } while(!kthread_should_stop()); 2061 | return 0; 2062 | } 2063 | 2064 | static struct task_struct* pmbd_buffer_syncer_init(PMBD_BUFFER_T* buffer) 2065 | { 2066 | struct task_struct* tsk = NULL; 2067 | tsk = kthread_run(pmbd_syncer_worker, (void*) buffer, "nsyncer"); 2068 | if (!tsk) { 2069 | printk(KERN_ERR "pmbd: initializing buffer syncer failed\n"); 2070 | return NULL; 2071 | } 2072 | 2073 | buffer->syncer = tsk; 2074 | printk("pmbd: buffer syncer launched\n"); 2075 | return tsk; 2076 | } 2077 | 2078 | static int pmbd_buffer_syncer_stop(PMBD_BUFFER_T* buffer) 2079 | { 2080 | if (buffer->syncer){ 2081 | kthread_stop(buffer->syncer); 2082 | buffer->syncer = NULL; 2083 | printk(KERN_INFO "pmbd: buffer syncer stopped\n"); 2084 | } 2085 | return 0; 2086 | } 2087 | 2088 | /* 2089 | * read and write to PMBD with buffer 2090 | */ 2091 | static void copy_to_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes) 2092 | { 2093 | PBN_T pbn = 0; 2094 | void* from = src; 2095 | 2096 | /* 2097 | * get the start and end in-block offset 2098 | * 2099 | * NOTE: Since the buffer block (4096 bytes) can be larger than the 2100 | * sector(512 bytes), if incoming request is not completely aligned to 2101 | * buffer blocks, we need to read the full block from PM into the 2102 | * buffer block and apply writes to partial of the buffer block. Here, 2103 | * offset_s and offset_e are the start and end in-block offsets (in 2104 | * units of sectors) for the first and the last sector in the request, 2105 | * they may or may not appear in the same buffer block, depending on the 2106 | * request size. 2107 | */ 2108 | PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector); 2109 | PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1)); 2110 | sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes); 2111 | sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes); 2112 | 2113 | /* for each physical block */ 2114 | for (pbn = pbn_s; pbn <= pbn_e; pbn ++){ 2115 | void* to = NULL; 2116 | PMBD_BBI_T* bbi = NULL; 2117 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 2118 | sector_t sect_s = (pbn == pbn_s) ? offset_s : 0; /* sub-block access */ 2119 | sector_t sect_e = (pbn == pbn_e) ? 
offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */ 2120 | size_t size = SECTOR_TO_BYTE(sect_e - sect_s + 1); /* get the real size */ 2121 | PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn); 2122 | 2123 | /* lock the physical block first */ 2124 | spin_lock(&pbi->lock); 2125 | 2126 | /* check if the physical block is buffered */ 2127 | bbi = _pmbd_buffer_lookup(buffer, pbn); 2128 | 2129 | if (bbi){ 2130 | /* if the block is already buffered */ 2131 | to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s); 2132 | } else { 2133 | /* if not buffered, allocate one free buffer block */ 2134 | bbi = pmbd_buffer_alloc_block(buffer, pbn); 2135 | 2136 | /* if not aligned to a full block, we have to copy the whole 2137 | * block from the PM space to the buffer block first */ 2138 | if (size < pmbd->pb_size){ 2139 | memcpy_from_pmbd(pmbd, PMBD_BUFFER_BLOCK(buffer, pbi->bbn), PMBD_BLOCK_VADDR(pmbd, pbn), pmbd->pb_size); 2140 | } 2141 | to = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s); 2142 | } 2143 | 2144 | /* writing it into buffer */ 2145 | memcpy(to, from, size); 2146 | PMBD_BUFFER_SET_BBI_DIRTY(buffer, pbi->bbn); 2147 | 2148 | /* unlock the block */ 2149 | spin_unlock(&pbi->lock); 2150 | 2151 | from += size; 2152 | } 2153 | 2154 | return; 2155 | } 2156 | 2157 | static void copy_from_pmbd_buffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes) 2158 | { 2159 | PBN_T pbn = 0; 2160 | void* to = dst; 2161 | 2162 | /* get the start and end in-block offset */ 2163 | PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector); 2164 | PBN_T pbn_e = BYTE_TO_PBN(pmbd, SECTOR_TO_BYTE(sector) + bytes - 1); 2165 | sector_t offset_s = pmbd_buffer_aligned_request_start(pmbd, sector, bytes); 2166 | sector_t offset_e = pmbd_buffer_aligned_request_end(pmbd, sector, bytes); 2167 | 2168 | for (pbn = pbn_s; pbn <= pbn_e; pbn ++){ 2169 | /* Scan the incoming request and check each block, for each block, we 2170 | * check if it is in the buffer. If true, we read it from the buffer, 2171 | * otherwise, we read from the PM space. */ 2172 | 2173 | void* from = NULL; 2174 | PMBD_BBI_T* bbi = NULL; 2175 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 2176 | sector_t sect_s = (pbn == pbn_s) ? offset_s : 0; 2177 | sector_t sect_e = (pbn == pbn_e) ? 
offset_e : (PBN_TO_SECTOR(pmbd, 1) - 1);/* sub-block access */ 2178 | size_t size = SECTOR_TO_BYTE(sect_e - sect_s + 1); /* get the real size */ 2179 | PMBD_BUFFER_T* buffer = PBN_TO_PMBD_BUFFER(pmbd, pbn); 2180 | 2181 | /* lock the physical block first */ 2182 | spin_lock(&pbi->lock); 2183 | 2184 | /* check if the block is in the buffer */ 2185 | bbi = _pmbd_buffer_lookup(buffer, pbn); 2186 | 2187 | /* start reading data */ 2188 | if (bbi) { 2189 | /* if buffered, read it from the buffer */ 2190 | from = PMBD_BUFFER_BLOCK(buffer, pbi->bbn) + SECTOR_TO_BYTE(sect_s); 2191 | 2192 | /* read it out */ 2193 | memcpy(to, from, size); 2194 | 2195 | } else { 2196 | /* if not buffered, read it from PM space */ 2197 | from = PMBD_BLOCK_VADDR(pmbd, pbn) + SECTOR_TO_BYTE(sect_s); 2198 | 2199 | /* verify the checksum first */ 2200 | if (PMBD_USE_CHECKSUM()) 2201 | pmbd_checksum_on_read(pmbd, from, size); 2202 | 2203 | /* read it out*/ 2204 | memcpy_from_pmbd(pmbd, to, from, size); 2205 | } 2206 | 2207 | /* unlock the block */ 2208 | spin_unlock(&pbi->lock); 2209 | 2210 | to += size; 2211 | } 2212 | 2213 | return; 2214 | } 2215 | 2216 | /* 2217 | * buffer related space alloc/free functions 2218 | */ 2219 | static int pmbd_pbi_space_alloc(PMBD_DEVICE_T* pmbd) 2220 | { 2221 | int err = 0; 2222 | 2223 | /* allocate checksum space */ 2224 | pmbd->pbi_space = vmalloc(PMBD_TOTAL_PB_NUM(pmbd) * sizeof(PMBD_PBI_T)); 2225 | if (pmbd->pbi_space) { 2226 | PBN_T i; 2227 | for (i = 0; i < PMBD_TOTAL_PB_NUM(pmbd); i ++) { 2228 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, i); 2229 | PMBD_SET_BLOCK_UNBUFFERED(pmbd, i); 2230 | spin_lock_init(&pbi->lock); 2231 | } 2232 | printk(KERN_INFO "pmbd(%d): pbi space is initialized\n", pmbd->pmbd_id); 2233 | } else { 2234 | err = -ENOMEM; 2235 | } 2236 | 2237 | return err; 2238 | } 2239 | 2240 | static int pmbd_pbi_space_free(PMBD_DEVICE_T* pmbd) 2241 | { 2242 | if (pmbd->pbi_space){ 2243 | vfree(pmbd->pbi_space); 2244 | pmbd->pbi_space = NULL; 2245 | printk(KERN_INFO "pmbd(%d): pbi space is freed\n", pmbd->pmbd_id); 2246 | } 2247 | return 0; 2248 | } 2249 | 2250 | static PMBD_BUFFER_T* pmbd_buffer_create(PMBD_DEVICE_T* pmbd) 2251 | { 2252 | int i; 2253 | PMBD_BUFFER_T* buffer = kzalloc (sizeof(PMBD_BUFFER_T), GFP_KERNEL); 2254 | if (!buffer){ 2255 | goto fail; 2256 | } 2257 | 2258 | /* link to the pmbd device */ 2259 | buffer->pmbd = pmbd; 2260 | 2261 | /* set size */ 2262 | if (g_pmbd_bufsize[pmbd->pmbd_id] > PMBD_BUFFER_MIN_BUFSIZE) { 2263 | buffer->num_blocks = MB_TO_BYTES(g_pmbd_bufsize[pmbd->pmbd_id]) / pmbd->pb_size; 2264 | } else { 2265 | if (PMBD_DEV_USE_BUFFER(pmbd)) { 2266 | printk(KERN_INFO "pmbd(%d): WARNING - too small buffer size (%llu MBs). 
Buffer set to %d MBs\n", 2267 | pmbd->pmbd_id, g_pmbd_bufsize[pmbd->pmbd_id], PMBD_BUFFER_MIN_BUFSIZE); 2268 | } 2269 | buffer->num_blocks = MB_TO_BYTES(PMBD_BUFFER_MIN_BUFSIZE) / pmbd->pb_size; 2270 | } 2271 | 2272 | /* buffer space */ 2273 | buffer->buffer_space = vmalloc(buffer->num_blocks * pmbd->pb_size); 2274 | if (!buffer->buffer_space) 2275 | goto fail; 2276 | 2277 | /* BBI array */ 2278 | buffer->bbi_space = vmalloc(buffer->num_blocks * sizeof(PMBD_BBI_T)); 2279 | if (!buffer->bbi_space) 2280 | goto fail; 2281 | memset(buffer->bbi_space, 0, buffer->num_blocks * sizeof(PMBD_BBI_T)); 2282 | 2283 | /* temporary array of bbi for sorting */ 2284 | buffer->bbi_sort_buffer = vmalloc(buffer->num_blocks * sizeof(PMBD_BSORT_ENTRY_T)); 2285 | if (!buffer->bbi_sort_buffer) 2286 | goto fail; 2287 | 2288 | /* initialize the locks*/ 2289 | spin_lock_init(&buffer->buffer_lock); 2290 | spin_lock_init(&buffer->flush_lock); 2291 | 2292 | /* initialize the BBI array */ 2293 | for (i = 0; i < buffer->num_blocks; i ++){ 2294 | PMBD_BUFFER_SET_BBI_CLEAN(buffer, i); 2295 | PMBD_BUFFER_SET_BBI_UNBUFFERED(buffer, i); 2296 | } 2297 | 2298 | /* initialize the buffer control info */ 2299 | buffer->num_dirty = 0; 2300 | buffer->pos_dirty = 0; 2301 | buffer->pos_clean = 0; 2302 | buffer->batch_size = g_pmbd_buffer_batch_size[pmbd->pmbd_id]; 2303 | 2304 | /* launch the syncer daemon */ 2305 | pmbd_buffer_syncer_init(buffer); 2306 | if (!buffer->syncer) 2307 | goto fail; 2308 | 2309 | printk(KERN_INFO "pmbd: pmbd device buffer (%u) allocated (%lu blocks - block size %u bytes)\n", 2310 | buffer->buffer_id, buffer->num_blocks, pmbd->pb_size); 2311 | return buffer; 2312 | 2313 | fail: 2314 | if (buffer && buffer->bbi_sort_buffer) 2315 | vfree(buffer->bbi_sort_buffer); 2316 | if (buffer && buffer->bbi_space) 2317 | vfree(buffer->bbi_space); 2318 | if (buffer && buffer->buffer_space) 2319 | vfree(buffer->buffer_space); 2320 | if (buffer) 2321 | kfree(buffer); 2322 | printk(KERN_ERR "%s(%d) vzalloc failed\n", __FUNCTION__, __LINE__); 2323 | return NULL; 2324 | } 2325 | 2326 | static int pmbd_buffer_destroy(PMBD_BUFFER_T* buffer) 2327 | { 2328 | unsigned id = buffer->buffer_id; 2329 | 2330 | /* stop syncer first */ 2331 | pmbd_buffer_syncer_stop(buffer); 2332 | 2333 | /* flush the buffer to the PM space */ 2334 | pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER); 2335 | 2336 | /* FIXME: wait for the on-going operations to finish first? 
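	 * Note: pmbd_buffer_syncer_stop() above has already joined the syncer
	 * thread via kthread_stop(), and the CALLER_DESTROYER flush has written
	 * back all dirty blocks, so only requests still in flight through the
	 * block layer remain a concern here.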
*/ 2337 | if (buffer && buffer->bbi_sort_buffer) 2338 | vfree(buffer->bbi_sort_buffer); 2339 | if (buffer && buffer->bbi_space) 2340 | vfree(buffer->bbi_space); 2341 | if (buffer && buffer->buffer_space) 2342 | vfree(buffer->buffer_space); 2343 | if (buffer) 2344 | kfree(buffer); 2345 | printk(KERN_INFO "pmbd: pmbd device buffer (%u) space freed\n", id); 2346 | return 0; 2347 | } 2348 | 2349 | static int pmbd_buffers_create(PMBD_DEVICE_T* pmbd) 2350 | { 2351 | int i; 2352 | for (i = 0; i < pmbd->num_buffers; i ++){ 2353 | pmbd->buffers[i] = pmbd_buffer_create(pmbd); 2354 | if (pmbd->buffers[i] == NULL) 2355 | return -ENOMEM; 2356 | (pmbd->buffers[i])->buffer_id = i; 2357 | } 2358 | return 0; 2359 | } 2360 | 2361 | static int pmbd_buffers_destroy(PMBD_DEVICE_T* pmbd) 2362 | { 2363 | int i; 2364 | for (i = 0; i < pmbd->num_buffers; i ++){ 2365 | if(pmbd->buffers[i]){ 2366 | pmbd_buffer_destroy(pmbd->buffers[i]); 2367 | pmbd->buffers[i] = NULL; 2368 | } 2369 | } 2370 | return 0; 2371 | } 2372 | 2373 | static int pmbd_buffer_space_alloc(PMBD_DEVICE_T* pmbd) 2374 | { 2375 | int err = 0; 2376 | 2377 | if (pmbd->num_buffers <= 0) 2378 | return 0; 2379 | 2380 | /* allocate buffers array */ 2381 | pmbd->buffers = kzalloc (sizeof(PMBD_BUFFER_T*) * pmbd->num_buffers, GFP_KERNEL); 2382 | if (pmbd->buffers == NULL){ 2383 | err = -ENOMEM; 2384 | goto fail; 2385 | } 2386 | 2387 | /* allocate each buffer */ 2388 | err = pmbd_buffers_create(pmbd); 2389 | printk(KERN_INFO "pmbd: pmbd buffer space allocated.\n"); 2390 | fail: 2391 | return err; 2392 | } 2393 | 2394 | static int pmbd_buffer_space_free(PMBD_DEVICE_T* pmbd) 2395 | { 2396 | if (pmbd->num_buffers <=0) 2397 | return 0; 2398 | 2399 | pmbd_buffers_destroy(pmbd); 2400 | kfree(pmbd->buffers); 2401 | pmbd->buffers = NULL; 2402 | printk(KERN_INFO "pmbd: pmbd buffer space freed.\n"); 2403 | 2404 | return 0; 2405 | } 2406 | 2407 | 2408 | /* 2409 | * ************************************************************************* 2410 | * High memory based PMBD functions 2411 | * ************************************************************************* 2412 | * 2413 | * NOTE: 2414 | * (1) memcpy_fromio() and memcpy_intoio() are used for reading/writing PM, 2415 | * but it is unnecessary on x86 architectures. 2416 | * (2) Currently we only allocate the reserved space to multiple PMBDs once. 2417 | * No dynamic allocate/deallocate of the space is needed so far. 2418 | */ 2419 | 2420 | 2421 | static void* pmbd_highmem_map(void) 2422 | { 2423 | /* 2424 | * NOTE: we can also use ioremap_* functions to directly set memory 2425 | * page attributes when do remapping, but to make it consistent with 2426 | * using vmalloc(), we do ioremap_cache() and call set_memory_* later. 2427 | */ 2428 | 2429 | if (PMBD_USE_PMAP()){ 2430 | /* NOTE: If we use pmap(), we don't need to map the reserved 2431 | * physical memory into the kernel space. Instead we use 2432 | * pmap_atomic() to make and unmap the to-be-accessed pages on 2433 | * demand. Since such mapping is private to the processor, 2434 | * there is no need to change PTE, and TLB shootdown either. 2435 | * 2436 | * Also note that We use PMBD_PMAP_DUMMY_BASE_VA to make the rest 2437 | * of code happy with a valid virtual address. 
The real 2438 | * physical address is calculated as follows: 2439 | * g_highmem_phys_addr + (vaddr) - PMBD_PMAP_DUMMY_BASE_VA 2440 | * 2441 | * (updated 10/25/2011) 2442 | */ 2443 | 2444 | g_highmem_virt_addr = (void*) PMBD_PMAP_DUMMY_BASE_VA; 2445 | g_highmem_curr_addr = g_highmem_virt_addr; 2446 | printk(KERN_INFO "pmbd: PMAP enabled - setting g_highmem_virt_addr to a dummy address (%d)\n", PMBD_PMAP_DUMMY_BASE_VA); 2447 | return g_highmem_virt_addr; 2448 | 2449 | } else if ((g_highmem_virt_addr = ioremap_prot(g_highmem_phys_addr, g_highmem_size, g_pmbd_cpu_cache_flag))) { 2450 | 2451 | g_highmem_curr_addr = g_highmem_virt_addr; 2452 | printk(KERN_INFO "pmbd: high memory space remapped (offset: %llu MB, size=%lu MB, cache flag=%s)\n", 2453 | BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size), PMBD_CPU_CACHE_FLAG()); 2454 | return g_highmem_virt_addr; 2455 | 2456 | } else { 2457 | 2458 | printk(KERN_ERR "pmbd: %s(%d) - failed remapping high memory space (offset: %llu MB size=%lu MB)\n", 2459 | __FUNCTION__, __LINE__, BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size)); 2460 | return NULL; 2461 | } 2462 | } 2463 | 2464 | static void pmbd_highmem_unmap(void) 2465 | { 2466 | /* de-remap the high memory from kernel address space */ 2467 | /* NOTE: if we use pmap(), the g_highmem_virt_addr is fake */ 2468 | if (!PMBD_USE_PMAP()){ 2469 | if(g_highmem_virt_addr){ 2470 | iounmap(g_highmem_virt_addr); 2471 | g_highmem_virt_addr = NULL; 2472 | printk(KERN_INFO "pmbd: unmapping high mem space (offset: %llu MB, size=%lu MB)is unmapped\n", 2473 | BYTES_TO_MB(g_highmem_phys_addr), BYTES_TO_MB(g_highmem_size)); 2474 | } 2475 | } 2476 | return; 2477 | } 2478 | 2479 | static void* hmalloc(uint64_t bytes) 2480 | { 2481 | void* rtn = NULL; 2482 | 2483 | /* check if there is still available reserve high memory space */ 2484 | if (bytes <= PMBD_HIGHMEM_AVAILABLE_SPACE) { 2485 | rtn = g_highmem_curr_addr; 2486 | g_highmem_curr_addr += bytes; 2487 | } else { 2488 | printk(KERN_ERR "pmbd: %s(%d) - no available space (< %llu bytes) in reserved high memory\n", 2489 | __FUNCTION__, __LINE__, bytes); 2490 | } 2491 | return rtn; 2492 | } 2493 | 2494 | static int hfree(void* addr) 2495 | { 2496 | /* FIXME: no support for dynamic alloc/dealloc in HIGH_MEM space */ 2497 | return 0; 2498 | } 2499 | 2500 | 2501 | /* 2502 | * ************************************************************************* 2503 | * Device Emulation 2504 | * ************************************************************************* 2505 | * 2506 | * Our emulation is based on a simple model - access time and transfer time. 2507 | * 2508 | * emulated time = access time + (request size / bandwidth) 2509 | * inserted delay = emulated time - observed time 2510 | * 2511 | * (1) Access time is applied to each request. We check each request's real 2512 | * access time and pad it with an extra delay to meet the designated latency. 2513 | * This is a best-effort solution, which means we just guarantee that no 2514 | * request can be completed with a response time less than the specified 2515 | * latency, but the real access latencies could be higher. In addition, if the 2516 | * total number of threads is larger than the number of available processors, 2517 | * the simulated latencies could be higher, due to CPU saturation. 
2518 | * 2519 | * (2) Transfer time is calculated based on batches 2520 | * - A batch is a sequence of consecutive requests with a short interval in 2521 | * between; requests in a batch can be overlapped with each other (parallel 2522 | * jobs); there is a limit for the total amount of data and the duration of 2523 | * a batch 2524 | * - For each batch, we calculate its target emulated transfer time as 2525 | * "emul_trans_time = num_sectors/emul_bandwidth" and calculate a delay as 2526 | * "delay = emul_trans_time - real_trans_time" 2527 | * - The calculated delay is applied to each batch at the end 2528 | * - A lock is used to slow down all threads, because bandwidth is a 2529 | * system-wide specification. In this way, we serialize the threads 2530 | * accessing the device, which simulates that the device is busy on a task. 2531 | * 2532 | * (3) Two types of delays implemented 2533 | * - Sync delay: if delay is less than 10ms, we keep polling the TSC 2534 | * counter, which is basically "busy waiting", like spin-lock. This allows 2535 | * to reach precision of one hundred of cycles 2536 | * - Async delay: if delay is more than 10ms, we call msleep() to sleep for 2537 | * a while, which relinquish CPU control, which results in a low precision. 2538 | * The left-over delay is done with sync delay in nanosecs. Async delay 2539 | * cannot be used while holding a lock. 2540 | * 2541 | */ 2542 | 2543 | 2544 | static inline uint64_t DIV64_ROUND(uint64_t dividend, uint64_t divisor) 2545 | { 2546 | if (divisor > 0) { 2547 | uint32_t quot1 = dividend / divisor; 2548 | uint32_t mod = dividend % divisor; 2549 | uint32_t mult = mod * 2; 2550 | uint32_t quot2 = mult / divisor; 2551 | uint64_t result = quot1 + quot2; 2552 | return result; 2553 | } else { // FIXME: how to handle this? 2554 | printk(KERN_WARNING "pmbd: WARNING - %s(%d) divisor is zero\n", __FUNCTION__, __LINE__); 2555 | return 0; 2556 | } 2557 | } 2558 | 2559 | static inline unsigned int get_cpu_freq(void) 2560 | { 2561 | #if 0 2562 | unsigned int khz = cpufreq_quick_get(0); /* FIXME: use cpufreq_get() ??? */ 2563 | if (!khz) 2564 | khz = cpu_khz; 2565 | printk("khz=%u, cpu_khz=%u\n", khz, cpu_khz); 2566 | #endif 2567 | return cpu_khz; 2568 | } 2569 | 2570 | static inline uint64_t _cycle_to_ns(uint64_t cycle, unsigned int khz) 2571 | { 2572 | return cycle * 1000000 / khz; 2573 | } 2574 | 2575 | static inline uint64_t cycle_to_ns(uint64_t cycle) 2576 | { 2577 | unsigned int khz = get_cpu_freq(); 2578 | return _cycle_to_ns(cycle, khz); 2579 | } 2580 | 2581 | /* 2582 | * emulate the latency for a given request size/type on a device 2583 | * @num_sectors: num of sectors to read/write 2584 | * @rw: read or write 2585 | * @pmbd: the pmbd device 2586 | */ 2587 | static uint64_t cal_trans_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd) 2588 | { 2589 | uint64_t ns = 0; 2590 | uint64_t bw = (rw == READ) ? pmbd->rdbw : pmbd->wrbw; /* bandwidth */ 2591 | if (bw) { 2592 | uint64_t tmp = num_sectors * PMBD_SECTOR_SIZE; 2593 | uint64_t tt = 1000000000UL >> MB_SHIFT; 2594 | ns += DIV64_ROUND((tmp * tt), bw); 2595 | } 2596 | return ns; 2597 | } 2598 | 2599 | static uint64_t cal_access_time(unsigned int num_sectors, unsigned rw, PMBD_DEVICE_T* pmbd) 2600 | { 2601 | uint64_t ns = (rw == READ) ? 
pmbd->rdlat : pmbd->wrlat; /* access time */ 2602 | return ns; 2603 | } 2604 | 2605 | static inline void sync_slowdown(uint64_t ns) 2606 | { 2607 | uint64_t start, now; 2608 | unsigned int khz = get_cpu_freq(); 2609 | if (ns) { 2610 | /* 2611 | * We keep reading TSC counter to check if the delay has 2612 | * been passed and this prevents CPU from being scaled down, 2613 | * which provides a stable estimation of the elapsed time. 2614 | */ 2615 | TIMESTAMP(start); 2616 | while(1) { 2617 | TIMESTAMP(now); 2618 | if (_cycle_to_ns((now-start), khz) > ns) 2619 | break; 2620 | } 2621 | } 2622 | return; 2623 | } 2624 | 2625 | static inline void sync_slowdown_cycles(uint64_t cycles) 2626 | { 2627 | 2628 | uint64_t start, now; 2629 | if (cycles){ 2630 | /* 2631 | * We keep reading TSC counter to check if the delay has 2632 | * been passed and this prevents CPU from being scaled down, 2633 | * which provides a stable estimation of the elapsed time. 2634 | */ 2635 | TIMESTAMP(start); 2636 | while(1) { 2637 | TIMESTAMP(now); 2638 | if ((now - start) >= cycles) 2639 | break; 2640 | } 2641 | } 2642 | return; 2643 | } 2644 | 2645 | static inline void async_slowdown(uint64_t ns) 2646 | { 2647 | uint64_t ms = ns / 1000000; 2648 | uint64_t left = ns - (ms * 1000000); 2649 | /* do ms delay with sleep */ 2650 | msleep(ms); 2651 | 2652 | /* make up the sub-ms delay */ 2653 | sync_slowdown(left); 2654 | } 2655 | 2656 | #if 0 2657 | static inline void slowdown_us(unsigned long long us) 2658 | { 2659 | set_current_state(TASK_INTERRUPTIBLE); 2660 | schedule_timeout(us * HZ / 1000000); 2661 | } 2662 | #endif 2663 | 2664 | static void pmbd_slowdown(uint64_t ns, unsigned in_lock) 2665 | { 2666 | /* 2667 | * NOTE: if the delay is less than 10ms, we use sync_slowdown to keep 2668 | * polling the CPU cycle counter and busy waiting for the delay elapse; 2669 | * otherwise, we use msleep() to relinquish the CPU control. 2670 | */ 2671 | if (ns > MAX_SYNC_SLOWDOWN && !in_lock) 2672 | async_slowdown(ns); 2673 | else if (ns > 0) 2674 | sync_slowdown(ns); 2675 | 2676 | return; 2677 | } 2678 | 2679 | /* 2680 | * Emulating the transfer time for a batch of requests for specific bandwidth 2681 | * 2682 | * We group a bunch of consecutive requests as a "batch". In one batch, the 2683 | * interval between two consecutive requests should be small, and the total 2684 | * amount of accessed data should be a good size (not too small, not too 2685 | * large), the duration is reasonable (not too long). For each batch, we 2686 | * estimate the emulated transfer time and compare it with the real transfer 2687 | * time (the start and end time of the batch), if the real transfer time is 2688 | * less than the emulated time, we apply an extra delay to the end of batch for 2689 | * making up the difference. In this way we can make the bandwidth emulation 2690 | * closer to real situation. Note that, since requests from multiple threads 2691 | * could be processed in parallel, so we must slowdown ALL the threads 2692 | * accessing the PMBD device, thus, we use batch_lock to coordinate all threads. 
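 *
 * A worked example with assumed settings: if the emulated write bandwidth is
 * 500 MB/s and a batch ends after 2048 sectors (1 MB), cal_trans_time()
 * yields a target transfer time of roughly 2 ms; if the batch actually took
 * 0.8 ms, pmbd_slowdown() busy-waits for the remaining ~1.2 ms while
 * batch_lock is held, so every thread accessing the device observes the
 * slower emulated device.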
2693 | * 2694 | * @num_sectors: the num of sectors of the request 2695 | * @rw: read or write 2696 | * @pmbd: the involved pmbd device 2697 | * 2698 | */ 2699 | 2700 | static void pmbd_emul_transfer_time(int num_sectors, int rw, PMBD_DEVICE_T* pmbd) 2701 | { 2702 | uint64_t interval_ns = 0; 2703 | uint64_t duration_ns = 0; 2704 | unsigned new_batch = FALSE; 2705 | unsigned end_batch = FALSE; 2706 | uint64_t now_cycle = 0; 2707 | 2708 | spin_lock(&pmbd->batch_lock); 2709 | 2710 | /* get a timestamp for now */ 2711 | TIMESTAMP(now_cycle); 2712 | 2713 | /* if this is the first timestamp */ 2714 | if (pmbd->batch_start_cycle[rw] == 0) { 2715 | pmbd->batch_start_cycle[rw] = now_cycle; 2716 | pmbd->batch_end_cycle[rw] = now_cycle; 2717 | goto done; 2718 | } 2719 | 2720 | /* calculate the interval from the last request */ 2721 | if (now_cycle >= pmbd->batch_end_cycle[rw]){ 2722 | interval_ns = cycle_to_ns(now_cycle - pmbd->batch_end_cycle[rw]); 2723 | } else { 2724 | panic(KERN_ERR "%s(%d): timestamp in the past found.\n", __FUNCTION__, __LINE__); 2725 | } 2726 | 2727 | /* check the interval length (cannot be too distant) */ 2728 | if (interval_ns >= PMBD_BATCH_MAX_INTERVAL) { 2729 | /* interval is too big, break it to two batches */ 2730 | new_batch = TRUE; 2731 | end_batch = TRUE; 2732 | } else { 2733 | /* still in the same batch, good */ 2734 | pmbd->batch_sectors[rw] += num_sectors; 2735 | pmbd->batch_end_cycle[rw] = now_cycle; 2736 | } 2737 | 2738 | /* check current batch duration (cannot be too long) */ 2739 | duration_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]); 2740 | if (duration_ns >= PMBD_BATCH_MAX_DURATION) 2741 | end_batch = TRUE; 2742 | 2743 | /* check current batch data amount (cannot be too large) */ 2744 | if (pmbd->batch_sectors[rw] >= PMBD_BATCH_MAX_SECTORS) 2745 | end_batch = TRUE; 2746 | 2747 | /* if the batch ends, check and apply slow-down */ 2748 | if (end_batch) { 2749 | /* batch size must be large enough, if not, just skip it */ 2750 | if (pmbd->batch_sectors[rw] > PMBD_BATCH_MIN_SECTORS) { 2751 | uint64_t real_ns = cycle_to_ns(pmbd->batch_end_cycle[rw] - pmbd->batch_start_cycle[rw]); 2752 | uint64_t emul_ns = cal_trans_time(pmbd->batch_sectors[rw], rw, pmbd); 2753 | 2754 | if (emul_ns > real_ns) 2755 | pmbd_slowdown((emul_ns - real_ns), TRUE); 2756 | } 2757 | 2758 | pmbd->batch_sectors[rw] = 0; 2759 | pmbd->batch_start_cycle[rw] = now_cycle; 2760 | pmbd->batch_end_cycle[rw] = now_cycle; 2761 | } 2762 | 2763 | /* if a new batch begins, add the first request */ 2764 | if (new_batch) { 2765 | pmbd->batch_sectors[rw] = num_sectors; 2766 | pmbd->batch_start_cycle[rw] = now_cycle; 2767 | pmbd->batch_end_cycle[rw] = now_cycle; 2768 | } 2769 | 2770 | done: 2771 | spin_unlock(&pmbd->batch_lock); 2772 | return; 2773 | } 2774 | 2775 | /* 2776 | * Emulating access time for a request 2777 | * 2778 | * Different from emulating bandwidths, we emulate access time for each 2779 | * individual access. Right after we simulate the transfer time, we examine 2780 | * the real access time (including transfer time), if the real time is smaller 2781 | * than the specified access time, we slow down the request by applying a delay 2782 | * to make up the difference. Note that we do not use any lock to coordinate 2783 | * multiple threads for a system-wide "slowdown", but apply this delay on each 2784 | * request individually and separately. 
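 *
 * For example (hypothetical values): with wrlat = 2000 (ns), a write whose
 * measured time between the timestamps taken in emul_start() and emul_end()
 * is 1200 ns gets an extra ~800 ns busy-wait; a write that already took
 * 2500 ns is not delayed at all. Note that cal_access_time() returns
 * rdlat/wrlat regardless of the request size, and that the delay is applied
 * with in_lock == FALSE, so a remainder larger than the sync threshold may
 * be served by msleep().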
2785 | * 2786 | * Also note that since we basically use "busy-waiting", when the total number 2787 | * of threads exceeds or be close to the total number of processors, the 2788 | * simulated access time observed at application level could be longer than the 2789 | * specified access time due to high CPU usage. But for each request, after 2790 | * directly examining the duration of being in the make_request() function, the 2791 | * simulated access time is still very precise. 2792 | * 2793 | */ 2794 | static void pmbd_emul_access_time(uint64_t start, uint64_t end, int num_sectors, int rw, PMBD_DEVICE_T* pmbd) 2795 | { 2796 | /* 2797 | * Access time can be overlapped with each other, so there is no need 2798 | * to use a lock to serialize it. 2799 | * FIXME: should we apply this on each batch or each request? 2800 | */ 2801 | uint64_t real_ns = cycle_to_ns(end - start); 2802 | uint64_t emul_ns = cal_access_time(num_sectors, rw, pmbd); 2803 | 2804 | if (emul_ns > real_ns) 2805 | pmbd_slowdown((emul_ns - real_ns), FALSE); 2806 | 2807 | return; 2808 | } 2809 | 2810 | /* 2811 | * set the starting hook for PM emulation 2812 | * 2813 | * @pmbd: pmbd device 2814 | * @num_sectors: sectors being accessed 2815 | * @rw: READ/WRITE 2816 | * return value: the start cycle 2817 | */ 2818 | static uint64_t emul_start(PMBD_DEVICE_T* pmbd, int num_sectors, int rw) 2819 | { 2820 | uint64_t start = 0; 2821 | if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) { 2822 | /* start timer here */ 2823 | TIMESTAMP(start); 2824 | } 2825 | return start; 2826 | } 2827 | 2828 | /* 2829 | * set the stopping hook for PM emulation 2830 | * 2831 | * @pmbd: pmbd device 2832 | * @num_sectors: sectors being accessed 2833 | * @rw: READ/WRITE 2834 | * @start: the starting cycle 2835 | * return value: the end cycle 2836 | */ 2837 | static uint64_t emul_end(PMBD_DEVICE_T* pmbd, int num_sectors, int rw, uint64_t start) 2838 | { 2839 | uint64_t end = 0; 2840 | uint64_t end2 = 0; 2841 | /* 2842 | * NOTE: emulation can be done in two ways - (1) directly specify the 2843 | * read/write latencies and bandwidths (2) only specify a relative 2844 | * slowdown ratio (X), compared to DRAM. 2845 | * 2846 | * Also note that if rdsx/wrsx is set, we will ignore 2847 | * rdlat/wrlat/rdbw/wrbw. 2848 | */ 2849 | if (PMBD_DEV_USE_EMULATION(pmbd) && num_sectors > 0) { 2850 | /* 2851 | * NOTE: we first attempt to meet the target bandwidth and then 2852 | * latency. This means the actual bandwidth should be close 2853 | * to the emulated bandwidth, and then we guarantee that the 2854 | * latency would not be SMALLER than the target latency. 
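 *
 * In other words, when rdlat/wrlat are set, the effective per-request time
 * is approximately max(real time including the bandwidth make-up delay,
 * rdlat/wrlat): the batch-level bandwidth delay is applied first, and only
 * then is the elapsed time of this request compared against the latency
 * target, so a request never completes faster than the configured latency.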
2855 | */ 2856 | 2857 | /* emulate the bandwidth first */ 2858 | if (pmbd->rdbw > 0 && pmbd->wrbw > 0) { 2859 | /* emulate transfer time (bandwidth) */ 2860 | pmbd_emul_transfer_time(num_sectors, rw, pmbd); 2861 | } 2862 | 2863 | /* emulate the latency now */ 2864 | TIMESTAMP(end); 2865 | if (pmbd->rdlat > 0 || pmbd->wrlat > 0) { 2866 | /* emulate access time (latency) */ 2867 | pmbd_emul_access_time(start, end, num_sectors, rw, pmbd); 2868 | } 2869 | } 2870 | /* get the ending timestamp */ 2871 | TIMESTAMP(end2); 2872 | 2873 | return end2; 2874 | } 2875 | 2876 | /* 2877 | * ************************************************************************* 2878 | * PM space protection functions 2879 | * - clflush 2880 | * - write protection 2881 | * - write verification 2882 | * - checksum 2883 | * ************************************************************************* 2884 | */ 2885 | 2886 | /* 2887 | * flush designated cache lines in CPU cache 2888 | */ 2889 | 2890 | static inline void pmbd_clflush_all(PMBD_DEVICE_T* pmbd) 2891 | { 2892 | uint64_t time_p1 = 0; 2893 | uint64_t time_p2 = 0; 2894 | 2895 | TIMESTAMP(time_p1); 2896 | if (cpu_has_clflush){ 2897 | #ifdef CONFIG_X86 2898 | wbinvd_on_all_cpus(); 2899 | #else 2900 | printk(KERN_WARNING "pmbd: WARNING - %s(%d) flush_cache_all() not implemented\n", __FUNCTION__, __LINE__); 2901 | #endif 2902 | } 2903 | TIMESTAMP(time_p2); 2904 | 2905 | /* emulating slowdown */ 2906 | if(PMBD_DEV_USE_SLOWDOWN(pmbd)) 2907 | pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2); 2908 | 2909 | /* update time statistics */ 2910 | if(PMBD_USE_TIMESTAT()){ 2911 | int cid = CUR_CPU_ID(); 2912 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 2913 | pmbd_stat->cycles_clflushall[WRITE][cid] += time_p2 - time_p1; 2914 | } 2915 | return; 2916 | } 2917 | 2918 | static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes) 2919 | { 2920 | uint64_t time_p1 = 0; 2921 | uint64_t time_p2 = 0; 2922 | 2923 | TIMESTAMP(time_p1); 2924 | if (cpu_has_clflush){ 2925 | clflush_cache_range(dst, bytes); 2926 | } 2927 | TIMESTAMP(time_p2); 2928 | 2929 | /* emulating slowdown */ 2930 | if(PMBD_DEV_USE_SLOWDOWN(pmbd)) 2931 | pmbd_rdwr_slowdown((pmbd), WRITE, time_p1, time_p2); 2932 | 2933 | /* update time statistics */ 2934 | if(PMBD_USE_TIMESTAT()){ 2935 | int cid = CUR_CPU_ID(); 2936 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 2937 | pmbd_stat->cycles_clflush[WRITE][cid] += time_p2 - time_p1; 2938 | } 2939 | return; 2940 | } 2941 | 2942 | 2943 | /* 2944 | * Write-protection 2945 | * 2946 | * Being used as storage, PMBD needs to provide certain protection on accidental 2947 | * change caused by wild pointers. So we initialize all the PM pages as 2948 | * read-only; before we perform write operations into PM space, we set the 2949 | * pages writable, after done, we set it back to read-only. This would 2950 | * introduce extra overhead. However, this is a realistic solution to tackle 2951 | * wild pointer problem. 2952 | * 2953 | */ 2954 | 2955 | /* 2956 | * set PM pages to read-only 2957 | * @addr - the starting virtual address (PM space) 2958 | * @bytes - the range in bytes 2959 | * @on_access - this change command from request or during creating/destroying 2960 | */ 2961 | 2962 | static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access) 2963 | { 2964 | if (PMBD_USE_WRITE_PROTECTION()) { 2965 | /* FIXME: type conversion happens here */ 2966 | /* FIXME: add range and bytes check here?? 
- not so necessary */ 2967 | uint64_t time_p1 = 0; 2968 | uint64_t time_p2 = 0; 2969 | unsigned long offset = (unsigned long) addr; 2970 | unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset)); 2971 | int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1; 2972 | 2973 | if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1))) 2974 | printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n", 2975 | __FUNCTION__, __LINE__, vaddr, num_pages); 2976 | 2977 | TIMESTAMP(time_p1); 2978 | set_memory_ro(vaddr, num_pages); 2979 | TIMESTAMP(time_p2); 2980 | 2981 | /* update time statistics */ 2982 | // if(PMBD_USE_TIMESTAT() && on_access){ 2983 | if(PMBD_USE_TIMESTAT()){ 2984 | int cid = CUR_CPU_ID(); 2985 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 2986 | pmbd_stat->cycles_setpages_ro[WRITE][cid] += time_p2 - time_p1; 2987 | } 2988 | } 2989 | return; 2990 | } 2991 | 2992 | static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access) 2993 | { 2994 | if (PMBD_USE_WRITE_PROTECTION()) { 2995 | uint64_t time_p1 = 0; 2996 | uint64_t time_p2 = 0; 2997 | unsigned long offset = (unsigned long) addr; 2998 | unsigned long vaddr = PAGE_TO_VADDR(VADDR_TO_PAGE(offset)); 2999 | int num_pages = VADDR_TO_PAGE(offset + bytes - 1) - VADDR_TO_PAGE(offset) + 1; 3000 | 3001 | if(!(VADDR_IN_PMBD_SPACE(pmbd, addr) && VADDR_IN_PMBD_SPACE(pmbd, addr + bytes-1))) 3002 | printk(KERN_WARNING "pmbd: WARNING - %s(%d): PM space range exceeded (%lu : %d pages)\n", __FUNCTION__, __LINE__, vaddr, num_pages); 3003 | 3004 | TIMESTAMP(time_p1); 3005 | set_memory_rw(vaddr, num_pages); 3006 | TIMESTAMP(time_p2); 3007 | 3008 | /* update time statistics */ 3009 | // if(PMBD_USE_TIMESTAT() && on_access){ 3010 | if(PMBD_USE_TIMESTAT()){ 3011 | int cid = CUR_CPU_ID(); 3012 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 3013 | pmbd_stat->cycles_setpages_rw[WRITE][cid] += time_p2 - time_p1; 3014 | } 3015 | } 3016 | return; 3017 | } 3018 | 3019 | 3020 | /* 3021 | * Write verification (EXPERIMENTAL) 3022 | * 3023 | * Note: Even we do write protection by setting PM space read-only, there is 3024 | * still a short vulnerable window when we write pages into PM space - between 3025 | * the time when the pages are set RW and the time when the pages are set back 3026 | * to RO. So we need to verify that no data has been changed during this window 3027 | * by reading out the written data and comparing with the source data. 
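 *
 * Sketch of the verification path below: pmbd_verify_wr_pages_pmap() walks
 * the written range page by page, maps each PM page frame with
 * pmap_atomic_pfn(), memcmp()s it against the RAM source, and unmaps it;
 * interrupts are disabled around the walk. A mismatch makes
 * pmbd_verify_wr_pages() panic, since it indicates the data was corrupted
 * inside the writable window.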
3028 | * 3029 | */ 3030 | 3031 | 3032 | static inline int pmbd_verify_wr_pages_pmap(PMBD_DEVICE_T* pmbd, void* pmbd_dummy_va, void* ram_va, size_t bytes) 3033 | { 3034 | 3035 | unsigned long flags = 0; 3036 | 3037 | /*NOTE: we assume src is starting from 0 */ 3038 | uint64_t pa = (uint64_t) PMBD_PMAP_VA_TO_PA(pmbd_dummy_va); 3039 | 3040 | /* disable interrupt (FIXME: do we need to do this?)*/ 3041 | DISABLE_SAVE_IRQ(flags); 3042 | 3043 | /* do the real work */ 3044 | while(bytes){ 3045 | uint64_t pfn = (pa >> PAGE_SHIFT); // page frame number 3046 | unsigned off = pa & (~PAGE_MASK); // offset in one page 3047 | unsigned size = MIN_OF((PAGE_SIZE - off), bytes); // the size to copy 3048 | 3049 | /* map it */ 3050 | void * map = pmap_atomic_pfn(pfn, pmbd, WRITE); 3051 | void * pmbd_va = map + off; 3052 | 3053 | /* do memcopy */ 3054 | if (memcmp(pmbd_va, ram_va, size)){ 3055 | punmap_atomic(map, pmbd, WRITE); 3056 | goto bad; 3057 | } 3058 | 3059 | /* unmap it */ 3060 | punmap_atomic(map, pmbd, WRITE); 3061 | 3062 | /* prepare the next iteration */ 3063 | ram_va += size; 3064 | bytes -= size; 3065 | pa += size; 3066 | } 3067 | 3068 | /* re-enable interrupt */ 3069 | ENABLE_RESTORE_IRQ(flags); 3070 | return 0; 3071 | 3072 | bad: 3073 | ENABLE_RESTORE_IRQ(flags); 3074 | return -1; 3075 | } 3076 | 3077 | 3078 | static inline int pmbd_verify_wr_pages_nopmap(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes) 3079 | { 3080 | if (memcmp(pmbd_va, ram_va, bytes)) 3081 | return -1; 3082 | else 3083 | return 0; 3084 | } 3085 | 3086 | static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* pmbd_va, void* ram_va, size_t bytes) 3087 | { 3088 | int rtn = 0; 3089 | uint64_t time_p1, time_p2; 3090 | 3091 | TIMESTAT_POINT(time_p1); 3092 | 3093 | /* check it */ 3094 | if (PMBD_USE_PMAP()) 3095 | rtn = pmbd_verify_wr_pages_pmap(pmbd, pmbd_va, ram_va, bytes); 3096 | else 3097 | rtn = pmbd_verify_wr_pages_nopmap(pmbd, pmbd_va, ram_va, bytes); 3098 | 3099 | /* found mismatch */ 3100 | if (rtn < 0){ 3101 | panic("pmbd: *** writing into PM failed (error found) ***\n"); 3102 | return -1; 3103 | } 3104 | 3105 | TIMESTAT_POINT(time_p2); 3106 | 3107 | /* timestamp */ 3108 | if(PMBD_USE_TIMESTAT()){ 3109 | int cid = CUR_CPU_ID(); 3110 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 3111 | pmbd_stat->cycles_wrverify[WRITE][cid] += time_p2 - time_p1; 3112 | } 3113 | 3114 | return 0; 3115 | } 3116 | 3117 | /* 3118 | * Checksum (EXPERIMENTAL) 3119 | * 3120 | * Note: With write-protection and write verification, we can largely reduce 3121 | * the risk of PM data corruption caused by wild in-kernel pointers, however, 3122 | * it is still possible that some data gets corrupted (e.g. PM pages are 3123 | * maliciously changed to writable). Thus, we need to provide another layer of 3124 | * protection by checksuming the PM pages. When writing a page, we compute a 3125 | * checksum and write it into memory; When reading a page, we compute its 3126 | * checksum and compare it with the stored checksum. If a mismatch is found, 3127 | * it indicates that either PM data or the checksum has been corrupted. 3128 | * 3129 | * FIXME: 3130 | * (1) checksum should be stored in PM space, currently we just store it in RAM. 3131 | * (2) probably we should use the CPU cache to speed up and avoid reading the same 3132 | * chunk of data again. 3133 | * (3) currently we always allocate checksum space, whether we enable or disable it 3134 | * in the module config options; may need to make it more efficient in the future. 
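 *
 * Rough space cost (hypothetical device size, and assuming PMBD_CHECKSUM_T
 * is a 32-bit CRC, as crc32_my() suggests): with checksum_unit_size equal
 * to PAGE_SIZE (4 KB), a 16 GB device has 4,194,304 checksum units, so
 * checksum_space takes about 16 MB of vmalloc()'d RAM, plus one
 * checksum_unit_size scratch buffer (checksum_iomem_buf).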
3135 | * 3136 | */ 3137 | 3138 | 3139 | static int pmbd_checksum_space_alloc(PMBD_DEVICE_T* pmbd) 3140 | { 3141 | int err = 0; 3142 | 3143 | /* allocate checksum space */ 3144 | pmbd->checksum_space= vmalloc(PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T)); 3145 | if (pmbd->checksum_space){ 3146 | memset(pmbd->checksum_space, 0, (PMBD_CHECKSUM_TOTAL_NUM(pmbd) * sizeof(PMBD_CHECKSUM_T))); 3147 | printk(KERN_INFO "pmbd(%d): checksum space is allocated\n", pmbd->pmbd_id); 3148 | } else { 3149 | err = -ENOMEM; 3150 | } 3151 | 3152 | /* allocate checksum buffer space */ 3153 | pmbd->checksum_iomem_buf = vmalloc(pmbd->checksum_unit_size); 3154 | if (pmbd->checksum_iomem_buf){ 3155 | memset(pmbd->checksum_iomem_buf, 0, pmbd->checksum_unit_size); 3156 | printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is allocated\n", pmbd->pmbd_id); 3157 | } else { 3158 | err = -ENOMEM; 3159 | } 3160 | 3161 | return err; 3162 | } 3163 | 3164 | static int pmbd_checksum_space_free(PMBD_DEVICE_T* pmbd) 3165 | { 3166 | if (pmbd->checksum_space) { 3167 | vfree(pmbd->checksum_space); 3168 | pmbd->checksum_space = NULL; 3169 | printk(KERN_INFO "pmbd(%d): checksum space is freed\n", pmbd->pmbd_id); 3170 | } 3171 | if (pmbd->checksum_iomem_buf) { 3172 | vfree(pmbd->checksum_iomem_buf); 3173 | pmbd->checksum_iomem_buf = NULL; 3174 | printk(KERN_INFO "pmbd(%d): checksum iomem buffer space is freed\n", pmbd->pmbd_id); 3175 | } 3176 | return 0; 3177 | } 3178 | 3179 | 3180 | /* 3181 | * Derived from linux/lib/crc32.c GPL v2 3182 | */ 3183 | static unsigned int crc32_my(unsigned char const *p, unsigned int len) 3184 | { 3185 | int i; 3186 | unsigned int crc = 0; 3187 | while (len--) { 3188 | crc ^= *p++; 3189 | for (i = 0; i < 8; i++) 3190 | crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0); 3191 | } 3192 | return crc; 3193 | } 3194 | 3195 | static inline PMBD_CHECKSUM_T pmbd_checksum_func(void* data, size_t size) 3196 | { 3197 | return crc32_my(data, size); 3198 | } 3199 | 3200 | /* 3201 | * calculate the checksum for a chunksum unit 3202 | * @pmbd: the pmbd device 3203 | * @data: the virtual address of the target data (must be aligned to the 3204 | * checksum unit boundaries) 3205 | */ 3206 | 3207 | 3208 | static inline PMBD_CHECKSUM_T pmbd_cal_checksum(PMBD_DEVICE_T* pmbd, void* data) 3209 | { 3210 | void* vaddr = data; 3211 | size_t size = pmbd->checksum_unit_size; 3212 | PMBD_CHECKSUM_T chk = 0; 3213 | 3214 | #if 0 3215 | #ifndef CONFIG_X86 3216 | /* 3217 | * Note: If we are directly using vmalloc(), we don't have to copy it 3218 | * to the checksum buffer; however, if we are using High Memory, we should not 3219 | * directly dereference the ioremapped data (on non-x86 platform), so we have to 3220 | * first copy it to a temporary buffer, this extra copy would significantly 3221 | * slows down operations. We do this here is just to remove this extra copy on 3222 | * x86 platform. 
(see kernel/Documents/IO-mapping.txt) 3223 | * 3224 | */ 3225 | if (PMBD_DEV_USE_HIGHMEM(pmbd) && VADDR_IN_PMBD_SPACE(pmbd, data)) { 3226 | memcpy_fromio(pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size); 3227 | vaddr = pmbd->checksum_iomem_buf; 3228 | } 3229 | #endif 3230 | #endif 3231 | 3232 | if (pmbd->checksum_unit_size != PAGE_SIZE){ 3233 | panic("ERR: %s(%d) checksum unit size (%u) must be %lu\n", __FUNCTION__, __LINE__, pmbd->checksum_unit_size, PAGE_SIZE); 3234 | return 0; 3235 | } 3236 | 3237 | /* FIXME: do we really need to copy the data out first (if not pmap)*/ 3238 | memcpy_from_pmbd(pmbd, pmbd->checksum_iomem_buf, data, pmbd->checksum_unit_size); 3239 | 3240 | /* calculate the checksum */ 3241 | vaddr = pmbd->checksum_iomem_buf; 3242 | chk = pmbd_checksum_func(vaddr, size); 3243 | 3244 | return chk; 3245 | } 3246 | 3247 | static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes) 3248 | { 3249 | unsigned long i; 3250 | unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr); 3251 | unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1)); 3252 | 3253 | uint64_t time_p1, time_p2; 3254 | 3255 | TIMESTAT_POINT(time_p1); 3256 | 3257 | for (i = ck_id_s; i <= ck_id_e; i ++){ 3258 | void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i); 3259 | void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i); 3260 | 3261 | PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data); 3262 | memcpy(chk, &checksum, sizeof(PMBD_CHECKSUM_T)); 3263 | } 3264 | 3265 | TIMESTAT_POINT(time_p2); 3266 | 3267 | /* timestamp */ 3268 | if(PMBD_USE_TIMESTAT()){ 3269 | int cid = CUR_CPU_ID(); 3270 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 3271 | pmbd_stat->cycles_checksum[WRITE][cid] += time_p2 - time_p1; 3272 | } 3273 | return 0; 3274 | } 3275 | 3276 | static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes) 3277 | { 3278 | unsigned long i; 3279 | unsigned long ck_id_s = VADDR_TO_CHECKSUM_IDX(pmbd, vaddr); 3280 | unsigned long ck_id_e = VADDR_TO_CHECKSUM_IDX(pmbd, (vaddr + bytes - 1)); 3281 | 3282 | uint64_t time_p1, time_p2; 3283 | TIMESTAT_POINT(time_p1); 3284 | 3285 | for (i = ck_id_s; i <= ck_id_e; i ++){ 3286 | void* data = CHECKSUM_IDX_TO_VADDR(pmbd, i); 3287 | void* chk = CHECKSUM_IDX_TO_CKADDR(pmbd, i); 3288 | 3289 | PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, data); 3290 | if (memcmp(chk, &checksum, sizeof(PMBD_CHECKSUM_T))){ 3291 | printk(KERN_WARNING "pmbd(%d): checksum mismatch found!", pmbd->pmbd_id); 3292 | } 3293 | } 3294 | 3295 | TIMESTAT_POINT(time_p2); 3296 | 3297 | /* timestamp */ 3298 | if(PMBD_USE_TIMESTAT()){ 3299 | int cid = CUR_CPU_ID(); 3300 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 3301 | pmbd_stat->cycles_checksum[READ][cid] += time_p2 - time_p1; 3302 | } 3303 | 3304 | return 0; 3305 | } 3306 | 3307 | #if 0 3308 | /* WARN: Calculating checksum for a big PM space is slow and could lockup system*/ 3309 | static int pmbd_checksum_space_init(PMBD_DEVICE_T* pmbd) 3310 | { 3311 | unsigned long i; 3312 | PMBD_CHECKSUM_T checksum = pmbd_cal_checksum(pmbd, pmbd->mem_space); 3313 | unsigned long ck_s = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_FIRST_BYTE(pmbd)); 3314 | unsigned long ck_e = VADDR_TO_CHECKSUM_IDX(pmbd, PMBD_MEM_SPACE_LAT_BYTE(pmbd)); 3315 | 3316 | for (i = ck_s; i <= ck_e; i ++){ 3317 | void* dst = CHECKSUM_IDX_TO_CKADDR(pmbd, i); 3318 | memcpy(dst, &checksum, sizeof(PMBD_CHECKSUM_T)); 3319 | } 3320 | return 0; 3321 | } 3322 | #endif 3323 | 3324 | /* 3325 | * locks 3326 | * 3327 | * Note: We should prevent multiple 
threads from concurrently accessing the same 3328 | * chunk of data. For example, if two writes access the same page, the PM page 3329 | * could be corrupted with a merged content of two. So we allocate one spinlock 3330 | * for each 4KB PM page. When read/writing PM data, we lock the related pages 3331 | * and unlock them after done. 3332 | * 3333 | */ 3334 | 3335 | static int pmbd_lock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes) 3336 | { 3337 | if (PMBD_USE_LOCK()) { 3338 | PBN_T pbn = 0; 3339 | PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector); 3340 | PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1)); 3341 | 3342 | for (pbn = pbn_s; pbn <= pbn_e; pbn ++) { 3343 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 3344 | spin_lock(&pbi->lock); 3345 | } 3346 | } 3347 | return 0; 3348 | } 3349 | 3350 | static int pmbd_unlock_on_access(PMBD_DEVICE_T* pmbd, sector_t sector, size_t bytes) 3351 | { 3352 | if (PMBD_USE_LOCK()){ 3353 | PBN_T pbn = 0; 3354 | PBN_T pbn_s = SECTOR_TO_PBN(pmbd, sector); 3355 | PBN_T pbn_e = BYTE_TO_PBN(pmbd, (SECTOR_TO_BYTE(sector) + bytes - 1)); 3356 | 3357 | for (pbn = pbn_s; pbn <= pbn_e; pbn ++) { 3358 | PMBD_PBI_T* pbi = PMBD_BLOCK_PBI(pmbd, pbn); 3359 | spin_unlock(&pbi->lock); 3360 | } 3361 | } 3362 | return 0; 3363 | } 3364 | 3365 | /* 3366 | ************************************************************************** 3367 | * Unbuffered Read/write functions 3368 | ************************************************************************** 3369 | */ 3370 | static void copy_to_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *src, sector_t sector, size_t bytes, unsigned do_fua) 3371 | { 3372 | void *dst; 3373 | 3374 | dst = pmbd->mem_space + sector * pmbd->sector_size; 3375 | 3376 | /* lock the pages */ 3377 | pmbd_lock_on_access(pmbd, sector, bytes); 3378 | 3379 | /* set the pages writable */ 3380 | /* if we use CR0/WP to temporarily switch the writable permission, 3381 | * we don't have to change the PTE attributes directly */ 3382 | if (PMBD_DEV_USE_WPMODE_PTE(pmbd)) 3383 | pmbd_set_pages_rw(pmbd, dst, bytes, TRUE); 3384 | 3385 | /* do memcpy */ 3386 | memcpy_to_pmbd(pmbd, dst, src, bytes, do_fua); 3387 | 3388 | /* finish up */ 3389 | /* set the pages read-only */ 3390 | if (PMBD_DEV_USE_WPMODE_PTE(pmbd)) 3391 | pmbd_set_pages_ro(pmbd, dst, bytes, TRUE); 3392 | 3393 | /* verify that the write operation succeeded */ 3394 | if(PMBD_USE_WRITE_VERIFICATION()) 3395 | pmbd_verify_wr_pages(pmbd, dst, src, bytes); 3396 | 3397 | /* generate check sum */ 3398 | if (PMBD_USE_CHECKSUM()) 3399 | pmbd_checksum_on_write(pmbd, dst, bytes); 3400 | 3401 | /* unlock the pages */ 3402 | pmbd_unlock_on_access(pmbd, sector, bytes); 3403 | 3404 | return; 3405 | } 3406 | 3407 | 3408 | static void copy_from_pmbd_unbuffered(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes) 3409 | { 3410 | void *src = pmbd->mem_space + sector * pmbd->sector_size; 3411 | 3412 | /* lock the pages */ 3413 | pmbd_lock_on_access(pmbd, sector, bytes); 3414 | 3415 | /* check checksum first */ 3416 | if (PMBD_USE_CHECKSUM()) 3417 | pmbd_checksum_on_read(pmbd, src, bytes); 3418 | 3419 | /* read it out*/ 3420 | memcpy_from_pmbd(pmbd, dst, src, bytes); 3421 | 3422 | /* unlock the pages */ 3423 | pmbd_unlock_on_access(pmbd, sector, bytes); 3424 | 3425 | return; 3426 | } 3427 | 3428 | 3429 | /* 3430 | * ************************************************************************* 3431 | * Read/write functions 3432 | * 
************************************************************************* 3433 | */ 3434 | 3435 | static void copy_to_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes, unsigned do_fua) 3436 | { 3437 | if (PMBD_DEV_USE_BUFFER(pmbd)){ 3438 | copy_to_pmbd_buffered(pmbd, dst, sector, bytes); 3439 | if (do_fua){ 3440 | /* NOTE: 3441 | * When we use a FUA, if the buffer is enabled, we 3442 | * still write into the buffer first, but then we 3443 | * directly write into the PM space without using the 3444 | * buffer again. This is suboptimal (we need to write 3445 | * the data twice), however, it is better than changing 3446 | * the buffering code. 3447 | */ 3448 | copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua); 3449 | } 3450 | }else 3451 | copy_to_pmbd_unbuffered(pmbd, dst, sector, bytes, do_fua); 3452 | return; 3453 | } 3454 | 3455 | static void copy_from_pmbd(PMBD_DEVICE_T* pmbd, void *dst, sector_t sector, size_t bytes) 3456 | { 3457 | if (PMBD_DEV_USE_BUFFER(pmbd)) 3458 | copy_from_pmbd_buffered(pmbd, dst, sector, bytes); 3459 | else 3460 | copy_from_pmbd_unbuffered(pmbd, dst, sector, bytes); 3461 | return; 3462 | } 3463 | 3464 | static int pmbd_seg_read_write(PMBD_DEVICE_T* pmbd, struct page *page, unsigned int len, 3465 | unsigned int off, int rw, sector_t sector, unsigned do_fua) 3466 | { 3467 | void *mem; 3468 | int err = 0; 3469 | 3470 | mem = kmap_atomic(page, KM_USER0); 3471 | if (rw == READ) { 3472 | copy_from_pmbd(pmbd, mem + off, sector, len); 3473 | flush_dcache_page(page); 3474 | } else { 3475 | flush_dcache_page(page); 3476 | copy_to_pmbd(pmbd, mem + off, sector, len, do_fua); 3477 | } 3478 | kunmap_atomic(mem, KM_USER0); 3479 | 3480 | return err; 3481 | } 3482 | 3483 | static int pmbd_do_bvec(PMBD_DEVICE_T* pmbd, struct page *page, 3484 | unsigned int len, unsigned int off, int rw, sector_t sector, unsigned do_fua) 3485 | { 3486 | return pmbd_seg_read_write(pmbd, page, len, off, rw, sector, do_fua); 3487 | } 3488 | 3489 | /* 3490 | * Handling write barrier 3491 | * @pmbd: the pmbd device 3492 | * 3493 | * When the application sends fsync(), a bio labeled with WRITE_BARRIER would be 3494 | * received by pmbd_make_request(), and we need to stop accepting new incoming 3495 | * writes (by locking pmbd->wr_barrier_lock), and wait for the on-the-fly writes 3496 | * to complete (by checking pmbd->num_flying_wr), then if we use buffer, we flush 3497 | * the whole entire DRAM buffer with clflush enabled. If we do not use the buffer, 3498 | * we flush the CPU cache to let all the data securely be written into PM. 
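 *
 * Sequence in pmbd_write_barrier() below: (1) take wr_barrier_lock, which
 * stalls new writers (pmbd_make_request() briefly acquires/releases the
 * same lock before a write); (2) spin until num_flying_wr drops to zero;
 * (3) if buffering is enabled, flush every DRAM buffer; (4) depending on
 * the CPU cache mode, either rely on the ordering already provided by
 * non-temporal stores or clflush, or, if neither is in use under write-back
 * caching, drop the whole cache with pmbd_clflush_all() (wbinvd on all
 * CPUs); (5) release the lock.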
3499 | * 3500 | */ 3501 | 3502 | 3503 | static void __x86_mfence_all(void *arg) 3504 | { 3505 | unsigned long cache = (unsigned long)arg; 3506 | if (cache && boot_cpu_data.x86 >= 4) 3507 | mfence(); 3508 | } 3509 | 3510 | static void x86_mfence_all(unsigned long cache) 3511 | { 3512 | BUG_ON(irqs_disabled()); 3513 | on_each_cpu(__x86_mfence_all, (void*) cache, 1); 3514 | } 3515 | 3516 | static inline void pmbd_mfence_all(PMBD_DEVICE_T* pmbd) 3517 | { 3518 | x86_mfence_all(1); 3519 | } 3520 | 3521 | 3522 | static void __x86_sfence_all(void *arg) 3523 | { 3524 | unsigned long cache = (unsigned long)arg; 3525 | if (cache && boot_cpu_data.x86 >= 4) 3526 | sfence(); 3527 | } 3528 | 3529 | static void x86_sfence_all(unsigned long cache) 3530 | { 3531 | BUG_ON(irqs_disabled()); 3532 | on_each_cpu(__x86_sfence_all, (void*) cache, 1); 3533 | 3534 | } 3535 | 3536 | static inline void pmbd_sfence_all(PMBD_DEVICE_T* pmbd) 3537 | { 3538 | x86_sfence_all(1); 3539 | } 3540 | 3541 | static int pmbd_write_barrier(PMBD_DEVICE_T* pmbd) 3542 | { 3543 | unsigned i; 3544 | 3545 | /* blocking incoming writes */ 3546 | spin_lock(&pmbd->wr_barrier_lock); 3547 | 3548 | /* wait for all on-the-fly writes to finish first */ 3549 | while (atomic_read(&pmbd->num_flying_wr) != 0) 3550 | ; 3551 | 3552 | if (PMBD_DEV_USE_BUFFER(pmbd)){ 3553 | /* if buffer is used, flush the entire buffer */ 3554 | for (i = 0; i < pmbd->num_buffers; i ++){ 3555 | PMBD_BUFFER_T* buffer = pmbd->buffers[i]; 3556 | pmbd_buffer_check_and_flush(buffer, buffer->num_blocks, CALLER_DESTROYER); 3557 | } 3558 | } 3559 | 3560 | /* 3561 | * considering the following: 3562 | * UC (write-through): strong ordering, we do nothing 3563 | * UC-Minus: strong ordering (may be overridden by WC), we use sfence, do nothing 3564 | * WC (write-combining): sfence should be used after each write, so we do nothing 3565 | * WB (write-back): non-temporal store : sfence is used, do nothing 3566 | * clflush/mfence: mfence is used in clflush_cache_range(), do nothing 3567 | * nothing: wbinvd needed to drop the entire cache 3568 | */ 3569 | if (PMBD_CPU_CACHE_USE_WB()){ 3570 | if (PMBD_USE_NTS()){ 3571 | /* sfence is used after each movntq, so it is safe, we 3572 | * do nothing, just stop accepting any incoming requests */ 3573 | } else if (PMBD_USE_CLFLUSH()) { 3574 | /* if use clflush/mfence to sync I/O, we do nothing*/ 3575 | // pmbd_mfence_all(pmbd); 3576 | } else { 3577 | /* if no sync operations, we have to drop the entire cache */ 3578 | pmbd_clflush_all(pmbd); 3579 | } 3580 | } else if (PMBD_CPU_CACHE_USE_WC() || PMBD_CPU_CACHE_USE_UM()) { 3581 | /* if using WC, sfence should used already, so do nothing */ 3582 | 3583 | } else if (PMBD_CPU_CACHE_USE_UC()) { 3584 | /* strong ordering is used, no need to do anything else*/ 3585 | } else { 3586 | panic("%s(%d): something is wrong\n", __FUNCTION__, __LINE__); 3587 | } 3588 | 3589 | /* unblock incoming writes */ 3590 | spin_unlock(&pmbd->wr_barrier_lock); 3591 | return 0; 3592 | } 3593 | 3594 | 3595 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 3596 | // #define BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & REQ_FLUSH) == REQ_FLUSH) 3597 | // #define BIO_WR_BARRIER(BIO) ((BIO)->bi_rw & (REQ_FLUSH | REQ_FLUSH_SEQ)) 3598 | #define BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & WRITE_FLUSH) == WRITE_FLUSH) 3599 | #define BIO_WR_FUA(BIO) (((BIO)->bi_rw & WRITE_FUA) == WRITE_FUA) 3600 | #define BIO_WR_SYNC(BIO) (((BIO)->bi_rw & WRITE_SYNC) == WRITE_SYNC) 3601 | #elif LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 3602 | #define 
BIO_WR_BARRIER(BIO) (((BIO)->bi_rw & WRITE_BARRIER) == WRITE_BARRIER) 3603 | #define BIO_WR_SYNC(BIO) (((BIO)->bi_rw & WRITE_SYNC) == WRITE_SYNC) 3604 | #endif 3605 | 3606 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 3607 | #define MKREQ_RTN_TYPE void 3608 | #elif LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 3609 | #define MKREQ_RTN_TYPE int 3610 | #endif 3611 | 3612 | static MKREQ_RTN_TYPE pmbd_make_request(struct request_queue *q, struct bio *bio) 3613 | { 3614 | int i = 0; 3615 | int err = -EIO; 3616 | uint64_t start = 0; 3617 | uint64_t end = 0; 3618 | struct bio_vec *bvec; 3619 | int rw = bio_rw(bio); 3620 | sector_t sector = bio->bi_sector; 3621 | int num_sectors = bio_sectors(bio); 3622 | struct block_device *bdev = bio->bi_bdev; 3623 | PMBD_DEVICE_T *pmbd = bdev->bd_disk->private_data; 3624 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 3625 | unsigned bio_is_write_fua = FALSE; 3626 | unsigned bio_is_write_barrier = FALSE; 3627 | unsigned do_fua = FALSE; 3628 | uint64_t time_p1, time_p2, time_p3, time_p4, time_p5, time_p6; 3629 | time_p1 = time_p2 = time_p3 = time_p4 = time_p5 = time_p6 = 0; 3630 | 3631 | 3632 | TIMESTAT_POINT(time_p1); 3633 | // printk("ACCESS: %u %d %X %d\n", sector, num_sectors, bio->bi_rw, rw); 3634 | 3635 | /* update rw */ 3636 | if (rw == READA) 3637 | rw = READ; 3638 | if (rw != READ && rw != WRITE) 3639 | panic("pmbd: %s(%d) found request not read or write either\n", __FUNCTION__, __LINE__); 3640 | 3641 | /* handle write barrier (we don't do for BIO_WR_SYNC(bio) anymore*/ 3642 | if (BIO_WR_BARRIER(bio)){ 3643 | /* 3644 | * Note: Linux kernel 2.6.37 and later use file systems and FUA 3645 | * to ensure data reliability, rather than write barriers. 3646 | * See http://monolight.cc/2011/06/barriers-caches-filesystems 3647 | */ 3648 | bio_is_write_barrier = TRUE; 3649 | // printk(KERN_INFO "pmbd: received barrier request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw); 3650 | 3651 | if (PMBD_USE_WB()) 3652 | pmbd_write_barrier(pmbd); 3653 | } 3654 | 3655 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 3656 | if (BIO_WR_FUA(bio)){ 3657 | bio_is_write_fua = TRUE; 3658 | // printk(KERN_INFO "pmbd: received FUA request %u %d %lx %d\n", (unsigned int) sector, num_sectors, bio->bi_rw, rw); 3659 | 3660 | if (PMBD_USE_FUA()) 3661 | do_fua = TRUE; 3662 | } 3663 | #endif 3664 | 3665 | TIMESTAT_POINT(time_p2); 3666 | 3667 | /* blocking write until write barrier is done */ 3668 | if (rw == WRITE){ 3669 | spin_lock(&pmbd->wr_barrier_lock); 3670 | spin_unlock(&pmbd->wr_barrier_lock); 3671 | } 3672 | 3673 | /* increment on-the-fly writes counter */ 3674 | atomic_inc(&pmbd->num_flying_wr); 3675 | 3676 | /* starting emulation */ 3677 | if (PMBD_DEV_SIM_DEV(pmbd)) 3678 | start = emul_start(pmbd, num_sectors, rw); 3679 | 3680 | /* check if out of range */ 3681 | if (sector + (bio->bi_size >> SECTOR_SHIFT) > get_capacity(bdev->bd_disk)){ 3682 | printk(KERN_WARNING "pmbd: request exceeds the PMBD capacity\n"); 3683 | TIMESTAT_POINT(time_p3); 3684 | goto out; 3685 | } 3686 | 3687 | // printk("DEBUG: ACCESS %lu %d %d\n", sector, num_sectors, rw); 3688 | 3689 | /* 3690 | * NOTE: some applications (e.g. fdisk) call fsync() to request 3691 | * flushing dirty data from the buffer cache. 
In default, fsync() is 3692 | * linked to blkdev_fsync() in the def_blk_fops structure, and 3693 | * blkdev_fsync() will call blkdev_issue_flush(), which generates an 3694 | * empty bio carrying a write barrier down to the block device through 3695 | * generic_make_request(), which calls pmbd_make_request() in turn. If 3696 | * we don't set err=0 here, this error message would pass upwards back 3697 | * to the application. For example, fdisk will fail and reports error 3698 | * when trying to write the partition table before it exits. Thus we 3699 | * must reset the error code here if the bio is empty. Also note that 3700 | * we directly check the bio size, rather than using bio_wr_barrier(), 3701 | * to handle other cases. 3702 | * 3703 | */ 3704 | if (num_sectors == 0) { 3705 | err = 0; 3706 | TIMESTAT_POINT(time_p3); 3707 | goto out; 3708 | } 3709 | 3710 | /* update the access time*/ 3711 | PMBD_DEV_UPDATE_ACCESS_TIME(pmbd); 3712 | 3713 | TIMESTAT_POINT(time_p3); 3714 | 3715 | /* 3716 | * Do read/write now. We first perform the operation, then check how 3717 | * long it actually takes to finish the operation, then we calculate an 3718 | * emulated time for a given slow-down model, if the actual access time 3719 | * is less than the emulated time, we just make up the difference to 3720 | * emulate a slower device. 3721 | */ 3722 | bio_for_each_segment(bvec, bio, i) { 3723 | unsigned int len = bvec->bv_len; 3724 | err = pmbd_do_bvec(pmbd, bvec->bv_page, len, 3725 | bvec->bv_offset, rw, sector, do_fua); 3726 | if (err) 3727 | break; 3728 | sector += len >> SECTOR_SHIFT; 3729 | } 3730 | 3731 | out: 3732 | TIMESTAT_POINT(time_p4); 3733 | 3734 | bio_endio(bio, err); 3735 | 3736 | TIMESTAT_POINT(time_p5); 3737 | 3738 | /* ending emulation (simmode0)*/ 3739 | if (PMBD_DEV_SIM_DEV(pmbd)) 3740 | end = emul_end(pmbd, num_sectors, rw, start); 3741 | 3742 | /* decrement on-the-fly writes counter */ 3743 | atomic_dec(&pmbd->num_flying_wr); 3744 | 3745 | TIMESTAT_POINT(time_p6); 3746 | 3747 | /* update statistics data */ 3748 | spin_lock(&pmbd_stat->stat_lock); 3749 | if (rw == READ) { 3750 | pmbd_stat->num_requests_read ++; 3751 | pmbd_stat->num_sectors_read += num_sectors; 3752 | } else { 3753 | pmbd_stat->num_requests_write ++; 3754 | pmbd_stat->num_sectors_write += num_sectors; 3755 | } 3756 | if (bio_is_write_barrier) 3757 | pmbd_stat->num_write_barrier ++; 3758 | if (bio_is_write_fua) 3759 | pmbd_stat->num_write_fua ++; 3760 | spin_unlock(&pmbd_stat->stat_lock); 3761 | 3762 | /* cycles */ 3763 | if (PMBD_USE_TIMESTAT()){ 3764 | int cid = CUR_CPU_ID(); 3765 | pmbd_stat->cycles_total[rw][cid] += time_p6 - time_p1; 3766 | pmbd_stat->cycles_wb[rw][cid] += time_p2 - time_p1; /* write barrier */ 3767 | pmbd_stat->cycles_prepare[rw][cid] += time_p3 - time_p2; 3768 | pmbd_stat->cycles_work[rw][cid] += time_p4 - time_p3; 3769 | pmbd_stat->cycles_endio[rw][cid] += time_p5 - time_p4; 3770 | pmbd_stat->cycles_finish[rw][cid] += time_p6 - time_p5; 3771 | } 3772 | 3773 | #if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 3774 | return 0; 3775 | #endif 3776 | } 3777 | 3778 | 3779 | /* 3780 | ************************************************************************** 3781 | * Allocating memory space for PMBD device 3782 | ************************************************************************** 3783 | */ 3784 | 3785 | /* 3786 | * Set the page attributes for the PMBD backstore memory space 3787 | * - WB: cache enabled, write back (default) 3788 | * - WC: cache disabled, write through, speculative writes combined 
3789 | * - UC: cache disabled, write through, no write combined 3790 | * - UC-Minus: the same as UC 3791 | * 3792 | * REF: 3793 | * - http://www.kernel.org/doc/ols/2008/ols2008v2-pages-135-144.pdf 3794 | * - http://www.mjmwired.net/kernel/Documentation/x86/pat.txt 3795 | */ 3796 | 3797 | static int pmbd_set_pages_cache_flags(PMBD_DEVICE_T* pmbd) 3798 | { 3799 | if (pmbd->mem_space && pmbd->num_sectors) { 3800 | /* NOTE: we convert it here with no problem on 64-bit system */ 3801 | unsigned long vaddr = (unsigned long) pmbd->mem_space; 3802 | int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd); 3803 | 3804 | printk(KERN_INFO "pmbd: setting %s PTE flags (%lx:%d)\n", pmbd->pmbd_name, vaddr, num_pages); 3805 | set_pages_cache_flags(vaddr, num_pages); 3806 | printk(KERN_INFO "pmbd: setting %s PTE flags done.\n", pmbd->pmbd_name); 3807 | } 3808 | return 0; 3809 | } 3810 | 3811 | static int pmbd_reset_pages_cache_flags(PMBD_DEVICE_T* pmbd) 3812 | { 3813 | if (pmbd->mem_space){ 3814 | unsigned long vaddr = (unsigned long) pmbd->mem_space; 3815 | int num_pages = PMBD_MEM_TOTAL_PAGES(pmbd); 3816 | set_memory_wb(vaddr, num_pages); 3817 | printk(KERN_INFO "pmbd: %s pages cache flags are reset to WB\n", pmbd->pmbd_name); 3818 | } 3819 | return 0; 3820 | } 3821 | 3822 | 3823 | /* 3824 | * Allocate/free memory backstore space for PMBD devices 3825 | */ 3826 | static int pmbd_mem_space_alloc (PMBD_DEVICE_T* pmbd) 3827 | { 3828 | int err = 0; 3829 | 3830 | /* allocate PM memory space */ 3831 | if (PMBD_DEV_USE_VMALLOC(pmbd)){ 3832 | pmbd->mem_space = vmalloc (PMBD_MEM_TOTAL_BYTES(pmbd)); 3833 | } else if (PMBD_DEV_USE_HIGHMEM(pmbd)){ 3834 | pmbd->mem_space = hmalloc (PMBD_MEM_TOTAL_BYTES(pmbd)); 3835 | } 3836 | 3837 | if (pmbd->mem_space) { 3838 | #if 0 3839 | /* FIXME: No need to do this. 
It's slow, system could be locked up */ 3840 | memset(pmbd->mem_space, 0, pmbd->sectors * pmbd->sector_size); 3841 | #endif 3842 | printk(KERN_INFO "pmbd: /dev/%s is created [%lu : %llu MBs]\n", 3843 | pmbd->pmbd_name, (unsigned long) pmbd->mem_space, SECTORS_TO_MB(pmbd->num_sectors)); 3844 | } else { 3845 | printk(KERN_ERR "pmbd: %s(%d): PMBD space allocation failed\n", __FUNCTION__, __LINE__); 3846 | err = -ENOMEM; 3847 | } 3848 | return err; 3849 | } 3850 | 3851 | static int pmbd_mem_space_free(PMBD_DEVICE_T* pmbd) 3852 | { 3853 | /* free it up */ 3854 | if (pmbd->mem_space) { 3855 | if (PMBD_DEV_USE_VMALLOC(pmbd)) 3856 | vfree(pmbd->mem_space); 3857 | else if (PMBD_DEV_USE_HIGHMEM(pmbd)) { 3858 | hfree(pmbd->mem_space); 3859 | } 3860 | pmbd->mem_space = NULL; 3861 | } 3862 | return 0; 3863 | } 3864 | 3865 | /* pmbd->pmbd_stat */ 3866 | static int pmbd_stat_alloc(PMBD_DEVICE_T* pmbd) 3867 | { 3868 | int err = 0; 3869 | pmbd->pmbd_stat = (PMBD_STAT_T*)kzalloc(sizeof(PMBD_STAT_T), GFP_KERNEL); 3870 | if (pmbd->pmbd_stat){ 3871 | spin_lock_init(&pmbd->pmbd_stat->stat_lock); 3872 | } else { 3873 | printk(KERN_ERR "pmbd: %s(%d): PMBD space allocation failed\n", __FUNCTION__, __LINE__); 3874 | err = -ENOMEM; 3875 | } 3876 | return 0; 3877 | } 3878 | 3879 | static int pmbd_stat_free(PMBD_DEVICE_T* pmbd) 3880 | { 3881 | if(pmbd->pmbd_stat) { 3882 | kfree(pmbd->pmbd_stat); 3883 | pmbd->pmbd_stat = NULL; 3884 | } 3885 | return 0; 3886 | } 3887 | 3888 | /* /proc/pmbd/ */ 3889 | static int pmbd_proc_pmbdstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data) 3890 | { 3891 | int rtn; 3892 | if (offset > 0) { 3893 | *eof = 1; 3894 | rtn = 0; 3895 | } else { 3896 | //char local_buffer[1024]; 3897 | char* local_buffer = kzalloc(8192, GFP_KERNEL); 3898 | PMBD_DEVICE_T* pmbd, *next; 3899 | char rdwr_name[2][16] = {"read\0", "write\0"}; 3900 | local_buffer[0] = '\0'; 3901 | 3902 | list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) { 3903 | unsigned i, j; 3904 | BBN_T num_dirty = 0; 3905 | BBN_T num_blocks = 0; 3906 | PMBD_STAT_T* pmbd_stat = pmbd->pmbd_stat; 3907 | 3908 | /* FIXME: should we lock the buffer? 
(NOT NECESSARY)*/ 3909 | for (i = 0; i < pmbd->num_buffers; i ++){ 3910 | num_blocks += pmbd->buffers[i]->num_blocks; 3911 | num_dirty += pmbd->buffers[i]->num_dirty; 3912 | } 3913 | 3914 | /* print stuff now */ 3915 | spin_lock(&pmbd->pmbd_stat->stat_lock); 3916 | 3917 | sprintf(local_buffer+strlen(local_buffer), "num_dirty_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) num_dirty); 3918 | sprintf(local_buffer+strlen(local_buffer), "num_clean_blocks[%s] %u\n", pmbd->pmbd_name, (unsigned int) (num_blocks - num_dirty)); 3919 | sprintf(local_buffer+strlen(local_buffer), "num_sectors_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_read); 3920 | sprintf(local_buffer+strlen(local_buffer), "num_sectors_write[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_sectors_write); 3921 | sprintf(local_buffer+strlen(local_buffer), "num_requests_read[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_requests_read); 3922 | sprintf(local_buffer+strlen(local_buffer), "num_requests_write[%s] %llu\n",pmbd->pmbd_name, pmbd_stat->num_requests_write); 3923 | sprintf(local_buffer+strlen(local_buffer), "num_write_barrier[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_barrier); 3924 | sprintf(local_buffer+strlen(local_buffer), "num_write_fua[%s] %llu\n", pmbd->pmbd_name, pmbd_stat->num_write_fua); 3925 | 3926 | spin_unlock(&pmbd->pmbd_stat->stat_lock); 3927 | 3928 | // sprintf(local_buffer+strlen(local_buffer), "\n"); 3929 | 3930 | for (j = 0; j <= 1; j ++){ 3931 | int k=0; 3932 | 3933 | unsigned long long cycles_total = 0; 3934 | unsigned long long cycles_prepare = 0; 3935 | unsigned long long cycles_wb = 0; 3936 | unsigned long long cycles_work = 0; 3937 | unsigned long long cycles_endio = 0; 3938 | unsigned long long cycles_finish = 0; 3939 | 3940 | unsigned long long cycles_pmap = 0; 3941 | unsigned long long cycles_punmap = 0; 3942 | unsigned long long cycles_memcpy = 0; 3943 | unsigned long long cycles_clflush = 0; 3944 | unsigned long long cycles_clflushall = 0; 3945 | unsigned long long cycles_wrverify = 0; 3946 | unsigned long long cycles_checksum = 0; 3947 | unsigned long long cycles_pause = 0; 3948 | unsigned long long cycles_slowdown = 0; 3949 | unsigned long long cycles_setpages_ro = 0; 3950 | unsigned long long cycles_setpages_rw = 0; 3951 | 3952 | for (k = 0; k < PMBD_MAX_NUM_CPUS; k ++){ 3953 | cycles_total += pmbd_stat->cycles_total[j][k]; 3954 | cycles_prepare += pmbd_stat->cycles_prepare[j][k]; 3955 | cycles_wb += pmbd_stat->cycles_wb[j][k]; 3956 | cycles_work += pmbd_stat->cycles_work[j][k]; 3957 | cycles_endio += pmbd_stat->cycles_endio[j][k]; 3958 | cycles_finish += pmbd_stat->cycles_finish[j][k]; 3959 | 3960 | cycles_pmap += pmbd_stat->cycles_pmap[j][k]; 3961 | cycles_punmap += pmbd_stat->cycles_punmap[j][k]; 3962 | cycles_memcpy += pmbd_stat->cycles_memcpy[j][k]; 3963 | cycles_clflush += pmbd_stat->cycles_clflush[j][k]; 3964 | cycles_clflushall+=pmbd_stat->cycles_clflushall[j][k]; 3965 | cycles_wrverify += pmbd_stat->cycles_wrverify[j][k]; 3966 | cycles_checksum += pmbd_stat->cycles_checksum[j][k]; 3967 | cycles_pause += pmbd_stat->cycles_pause[j][k]; 3968 | cycles_slowdown += pmbd_stat->cycles_slowdown[j][k]; 3969 | cycles_setpages_ro+= pmbd_stat->cycles_setpages_ro[j][k]; 3970 | cycles_setpages_rw+= pmbd_stat->cycles_setpages_rw[j][k]; 3971 | } 3972 | 3973 | sprintf(local_buffer+strlen(local_buffer), "cycles_total_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_total); 3974 | sprintf(local_buffer+strlen(local_buffer), "cycles_prepare_%s[%s] %llu\n", rdwr_name[j], 
pmbd->pmbd_name, cycles_prepare); 3975 | sprintf(local_buffer+strlen(local_buffer), "cycles_wb_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wb); 3976 | sprintf(local_buffer+strlen(local_buffer), "cycles_work_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_work); 3977 | sprintf(local_buffer+strlen(local_buffer), "cycles_endio_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_endio); 3978 | sprintf(local_buffer+strlen(local_buffer), "cycles_finish_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_finish); 3979 | sprintf(local_buffer+strlen(local_buffer), "cycles_pmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pmap); 3980 | sprintf(local_buffer+strlen(local_buffer), "cycles_punmap_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_punmap); 3981 | sprintf(local_buffer+strlen(local_buffer), "cycles_memcpy_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_memcpy); 3982 | sprintf(local_buffer+strlen(local_buffer), "cycles_clflush_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflush); 3983 | sprintf(local_buffer+strlen(local_buffer), "cycles_clflushall_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_clflushall); 3984 | sprintf(local_buffer+strlen(local_buffer), "cycles_wrverify_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_wrverify); 3985 | sprintf(local_buffer+strlen(local_buffer), "cycles_checksum_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_checksum); 3986 | sprintf(local_buffer+strlen(local_buffer), "cycles_pause_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_pause); 3987 | sprintf(local_buffer+strlen(local_buffer), "cycles_slowdown_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_slowdown); 3988 | sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_ro_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_ro); 3989 | sprintf(local_buffer+strlen(local_buffer), "cycles_setpages_rw_%s[%s] %llu\n", rdwr_name[j], pmbd->pmbd_name, cycles_setpages_rw); 3990 | } 3991 | 3992 | #if 0 3993 | /* print something temporary for debugging purpose */ 3994 | if (0) { 3995 | spin_lock(&pmbd->tmp_lock); 3996 | printk("%llu %lu\n", pmbd->tmp_data, pmbd->tmp_num); 3997 | spin_unlock(&pmbd->tmp_lock); 3998 | } 3999 | #endif 4000 | } 4001 | 4002 | memcpy(buffer, local_buffer, strlen(local_buffer)); 4003 | rtn = strlen(local_buffer); 4004 | kfree(local_buffer); 4005 | } 4006 | return rtn; 4007 | } 4008 | 4009 | /* /proc/pmbdcfg */ 4010 | static int pmbd_proc_pmbdcfg_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data) 4011 | { 4012 | int rtn; 4013 | if (offset > 0) { 4014 | *eof = 1; 4015 | rtn = 0; 4016 | } else { 4017 | char* local_buffer = kzalloc(8192, GFP_KERNEL); 4018 | PMBD_DEVICE_T* pmbd, *next; 4019 | local_buffer[0] = '\0'; 4020 | 4021 | /* global configurations */ 4022 | sprintf(local_buffer+strlen(local_buffer), "MODULE OPTIONS: %s\n", mode); 4023 | sprintf(local_buffer+strlen(local_buffer), "\n"); 4024 | 4025 | sprintf(local_buffer+strlen(local_buffer), "max_part %d\n", max_part); 4026 | sprintf(local_buffer+strlen(local_buffer), "part_shift %d\n", part_shift); 4027 | 4028 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_type %u\n", g_pmbd_type); 4029 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_mergeable %u\n", g_pmbd_mergeable); 4030 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_clflush %u\n", g_pmbd_cpu_cache_clflush); 4031 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_cpu_cache_flag %lu\n", g_pmbd_cpu_cache_flag); 4032 | 
sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_protect %u\n", g_pmbd_wr_protect); 4033 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wr_verify %u\n", g_pmbd_wr_verify); 4034 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_checksum %u\n", g_pmbd_checksum); 4035 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_lock %u\n", g_pmbd_lock); 4036 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_subpage_update %u\n", g_pmbd_subpage_update); 4037 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_pmap %u\n", g_pmbd_pmap); 4038 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nts %u\n", g_pmbd_nts); 4039 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_ntl %u\n", g_pmbd_ntl); 4040 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_wb %u\n", g_pmbd_wb); 4041 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_fua %u\n", g_pmbd_fua); 4042 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_timestat %u\n", g_pmbd_timestat); 4043 | sprintf(local_buffer+strlen(local_buffer), "g_highmem_size %lu\n", g_highmem_size); 4044 | sprintf(local_buffer+strlen(local_buffer), "g_highmem_phys_addr %llu\n", (unsigned long long) g_highmem_phys_addr); 4045 | sprintf(local_buffer+strlen(local_buffer), "g_highmem_virt_addr %llu\n", (unsigned long long) g_highmem_virt_addr); 4046 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_nr %u\n", g_pmbd_nr); 4047 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_adjust_ns %llu\n", g_pmbd_adjust_ns); 4048 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_num_buffers %llu\n", g_pmbd_num_buffers); 4049 | sprintf(local_buffer+strlen(local_buffer), "g_pmbd_buffer_stride %llu\n", g_pmbd_buffer_stride); 4050 | sprintf(local_buffer+strlen(local_buffer), "\n"); 4051 | 4052 | /* device specific configurations */ 4053 | list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) { 4054 | int i = 0; 4055 | 4056 | sprintf(local_buffer+strlen(local_buffer), "pmbd_id[%s] %d\n", pmbd->pmbd_name, pmbd->pmbd_id); 4057 | sprintf(local_buffer+strlen(local_buffer), "num_sectors[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->num_sectors); 4058 | sprintf(local_buffer+strlen(local_buffer), "sector_size[%s] %u\n", pmbd->pmbd_name, pmbd->sector_size); 4059 | sprintf(local_buffer+strlen(local_buffer), "pmbd_type[%s] %u\n", pmbd->pmbd_name, pmbd->pmbd_type); 4060 | sprintf(local_buffer+strlen(local_buffer), "rammode[%s] %u\n", pmbd->pmbd_name, pmbd->rammode); 4061 | sprintf(local_buffer+strlen(local_buffer), "bufmode[%s] %u\n", pmbd->pmbd_name, pmbd->bufmode); 4062 | sprintf(local_buffer+strlen(local_buffer), "wpmode[%s] %u\n", pmbd->pmbd_name, pmbd->wpmode); 4063 | sprintf(local_buffer+strlen(local_buffer), "num_buffers[%s] %u\n", pmbd->pmbd_name, pmbd->num_buffers); 4064 | sprintf(local_buffer+strlen(local_buffer), "buffer_stride[%s] %u\n", pmbd->pmbd_name, pmbd->buffer_stride); 4065 | sprintf(local_buffer+strlen(local_buffer), "pb_size[%s] %u\n", pmbd->pmbd_name, pmbd->pb_size); 4066 | sprintf(local_buffer+strlen(local_buffer), "checksum_unit_size[%s] %u\n", pmbd->pmbd_name, pmbd->checksum_unit_size); 4067 | sprintf(local_buffer+strlen(local_buffer), "simmode[%s] %u\n", pmbd->pmbd_name, pmbd->simmode); 4068 | sprintf(local_buffer+strlen(local_buffer), "rdlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdlat); 4069 | sprintf(local_buffer+strlen(local_buffer), "wrlat[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrlat); 4070 | sprintf(local_buffer+strlen(local_buffer), "rdbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long 
long) pmbd->rdbw); 4071 | sprintf(local_buffer+strlen(local_buffer), "wrbw[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrbw); 4072 | sprintf(local_buffer+strlen(local_buffer), "rdsx[%s] %u\n", pmbd->pmbd_name, pmbd->rdsx); 4073 | sprintf(local_buffer+strlen(local_buffer), "wrsx[%s] %u\n", pmbd->pmbd_name, pmbd->wrsx); 4074 | sprintf(local_buffer+strlen(local_buffer), "rdpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->rdpause); 4075 | sprintf(local_buffer+strlen(local_buffer), "wrpause[%s] %llu\n", pmbd->pmbd_name, (unsigned long long) pmbd->wrpause); 4076 | 4077 | for (i = 0; i < pmbd->num_buffers; i ++){ 4078 | PMBD_BUFFER_T* buffer = pmbd->buffers[i]; 4079 | sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]buffer_id %u\n", i, pmbd->pmbd_name, buffer->buffer_id); 4080 | sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]num_blocks %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->num_blocks); 4081 | sprintf(local_buffer+strlen(local_buffer), "buffer%d[%s]batch_size %lu\n", i, pmbd->pmbd_name, (unsigned long) buffer->batch_size); 4082 | } 4083 | 4084 | } 4085 | 4086 | memcpy(buffer, local_buffer, strlen(local_buffer)); 4087 | rtn = strlen(local_buffer); 4088 | kfree(local_buffer); 4089 | } 4090 | return rtn; 4091 | } 4092 | 4093 | 4094 | 4095 | static int pmbd_proc_devstat_read(char* buffer, char** start, off_t offset, int count, int* eof, void* data) 4096 | { 4097 | int rtn; 4098 | char local_buffer[1024]; 4099 | if (offset > 0) { 4100 | *eof = 1; 4101 | rtn = 0; 4102 | } else { 4103 | sprintf(local_buffer, "N/A\n"); 4104 | memcpy(buffer, local_buffer, strlen(local_buffer)); 4105 | rtn = strlen(local_buffer); 4106 | } 4107 | return rtn; 4108 | } 4109 | 4110 | static int pmbd_proc_devstat_create(PMBD_DEVICE_T* pmbd) 4111 | { 4112 | /* create a /proc/pmbd/ entry */ 4113 | pmbd->proc_devstat = create_proc_entry(pmbd->pmbd_name, S_IRUGO, proc_pmbd); 4114 | if (pmbd->proc_devstat == NULL) { 4115 | remove_proc_entry(pmbd->pmbd_name, proc_pmbd); 4116 | printk(KERN_ERR "pmbd: cannot create /proc/pmbd/%s\n", pmbd->pmbd_name); 4117 | return -ENOMEM; 4118 | } 4119 | pmbd->proc_devstat->read_proc = pmbd_proc_devstat_read; 4120 | printk(KERN_INFO "pmbd: /proc/pmbd/%s created\n", pmbd->pmbd_name); 4121 | 4122 | return 0; 4123 | } 4124 | 4125 | static int pmbd_proc_devstat_destroy(PMBD_DEVICE_T* pmbd) 4126 | { 4127 | remove_proc_entry(pmbd->pmbd_name, proc_pmbd); 4128 | printk(KERN_INFO "pmbd: /proc/pmbd/%s removed\n", pmbd->pmbd_name); 4129 | return 0; 4130 | } 4131 | 4132 | static int pmbd_create (PMBD_DEVICE_T* pmbd, uint64_t sectors) 4133 | { 4134 | int err = 0; 4135 | 4136 | pmbd->num_sectors = sectors; 4137 | pmbd->sector_size = PMBD_SECTOR_SIZE; /* FIXME: now we use 512, do we need to change it? 
*/ 4138 | pmbd->pmbd_type = g_pmbd_type; 4139 | pmbd->checksum_unit_size = PAGE_SIZE; 4140 | pmbd->pb_size = PAGE_SIZE; 4141 | 4142 | spin_lock_init(&pmbd->batch_lock); 4143 | spin_lock_init(&pmbd->wr_barrier_lock); 4144 | 4145 | spin_lock_init(&pmbd->tmp_lock); 4146 | pmbd->tmp_data = 0; 4147 | pmbd->tmp_num = 0; 4148 | 4149 | /* allocate statistics info */ 4150 | if ((err = pmbd_stat_alloc(pmbd)) < 0) 4151 | goto error; 4152 | 4153 | /* allocate memory space */ 4154 | if ((err = pmbd_mem_space_alloc(pmbd)) < 0) 4155 | goto error; 4156 | 4157 | /* allocate buffer space */ 4158 | if ((err = pmbd_buffer_space_alloc(pmbd)) < 0) 4159 | goto error; 4160 | 4161 | /* allocate checksum space */ 4162 | if ((err = pmbd_checksum_space_alloc(pmbd)) < 0) 4163 | goto error; 4164 | 4165 | /* allocate block info space */ 4166 | if ((err = pmbd_pbi_space_alloc(pmbd)) < 0) 4167 | goto error; 4168 | 4169 | /* create a /proc/pmbd/ entry*/ 4170 | if ((err = pmbd_proc_devstat_create(pmbd)) < 0) 4171 | goto error; 4172 | 4173 | #if 0 4174 | /* FIXME: No need to do it. It's slow and could lock up the system*/ 4175 | pmbd_checksum_space_init(pmbd); 4176 | #endif 4177 | 4178 | /* set up the page attributes related with CPU cache 4179 | * if using vmalloc(), we need to set up the page cache flags (WB,WC,UC,UM); 4180 | * if using high memory, we set up the page cache flag with ioremap_prot(); 4181 | * WARN: In Linux 3.2.1, this function is slow and could cause system hangs. 4182 | */ 4183 | 4184 | if (PMBD_USE_VMALLOC()){ 4185 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 4186 | printk(KERN_ERR "pmbd: WARNING (3.2.1) changing CPU cache setting is slow!\n"); 4187 | #endif 4188 | pmbd_set_pages_cache_flags(pmbd); 4189 | } 4190 | 4191 | /* initialize PM pages read-only */ 4192 | if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION()) 4193 | pmbd_set_pages_ro(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE); 4194 | 4195 | printk(KERN_INFO "pmbd: %s created\n", pmbd->pmbd_name); 4196 | error: 4197 | return err; 4198 | } 4199 | 4200 | static int pmbd_destroy (PMBD_DEVICE_T* pmbd) 4201 | { 4202 | /* flush everything down */ 4203 | // FIXME: this implies flushing CPU cache 4204 | pmbd_write_barrier(pmbd); 4205 | 4206 | /* free /proc entry */ 4207 | pmbd_proc_devstat_destroy(pmbd); 4208 | 4209 | /* free buffer space */ 4210 | pmbd_buffer_space_free(pmbd); 4211 | 4212 | /* set PM pages writable */ 4213 | if (!PMBD_USE_PMAP() && PMBD_USE_WRITE_PROTECTION()) 4214 | pmbd_set_pages_rw(pmbd, pmbd->mem_space, PMBD_MEM_TOTAL_BYTES(pmbd), FALSE); 4215 | 4216 | /* reset memory attributes to WB */ 4217 | if (PMBD_USE_VMALLOC()) 4218 | pmbd_reset_pages_cache_flags(pmbd); 4219 | 4220 | /* free block info space */ 4221 | pmbd_pbi_space_free(pmbd); 4222 | 4223 | /* free checksum space */ 4224 | pmbd_checksum_space_free(pmbd); 4225 | 4226 | /* free memory backstore space */ 4227 | pmbd_mem_space_free(pmbd); 4228 | 4229 | /* free statistics data */ 4230 | pmbd_stat_free(pmbd); 4231 | 4232 | printk(KERN_INFO "pmbd: /dev/%s is destroyed (%llu MB)\n", pmbd->pmbd_name, SECTORS_TO_MB(pmbd->num_sectors)); 4233 | 4234 | pmbd->num_sectors = 0; 4235 | pmbd->sector_size = 0; 4236 | pmbd->checksum_unit_size = 0; 4237 | return 0; 4238 | } 4239 | 4240 | static int pmbd_free_pages(PMBD_DEVICE_T* pmbd) 4241 | { 4242 | return pmbd_destroy(pmbd); 4243 | } 4244 | 4245 | /* 4246 | ************************************************************************** 4247 | * /proc file system entries 4248 | 
************************************************************************** 4249 | */ 4250 | 4251 | static int pmbd_proc_create(void) 4252 | { 4253 | proc_pmbd= proc_mkdir("pmbd", 0); 4254 | if(proc_pmbd == NULL){ 4255 | printk(KERN_ERR "pmbd: %s(%d): cannot create /proc/pmbd\n", __FUNCTION__, __LINE__); 4256 | return -ENOMEM; 4257 | } 4258 | 4259 | proc_pmbdstat = create_proc_entry("pmbdstat", S_IRUGO, proc_pmbd); 4260 | if (proc_pmbdstat == NULL){ 4261 | remove_proc_entry("pmbdstat", proc_pmbd); 4262 | printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdstat\n"); 4263 | return -ENOMEM; 4264 | } 4265 | proc_pmbdstat->read_proc = pmbd_proc_pmbdstat_read; 4266 | printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat created\n"); 4267 | 4268 | proc_pmbdcfg = create_proc_entry("pmbdcfg", S_IRUGO, proc_pmbd); 4269 | if (proc_pmbdcfg == NULL){ 4270 | remove_proc_entry("pmbdcfg", proc_pmbd); 4271 | printk(KERN_ERR "pmbd: cannot create /proc/pmbd/pmbdcfg\n"); 4272 | return -ENOMEM; 4273 | } 4274 | proc_pmbdcfg->read_proc = pmbd_proc_pmbdcfg_read; 4275 | printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg created\n"); 4276 | 4277 | return 0; 4278 | } 4279 | 4280 | static int pmbd_proc_destroy(void) 4281 | { 4282 | remove_proc_entry("pmbdcfg", proc_pmbd); 4283 | printk(KERN_INFO "pmbd: /proc/pmbd/pmbdcfg is removed\n"); 4284 | 4285 | remove_proc_entry("pmbdstat", proc_pmbd); 4286 | printk(KERN_INFO "pmbd: /proc/pmbd/pmbdstat is removed\n"); 4287 | 4288 | remove_proc_entry("pmbd", 0); 4289 | printk(KERN_INFO "pmbd: /proc/pmbd is removed\n"); 4290 | return 0; 4291 | } 4292 | 4293 | /* 4294 | ************************************************************************** 4295 | * device driver interface hook functions 4296 | ************************************************************************** 4297 | */ 4298 | 4299 | static int pmbd_mergeable_bvec(struct request_queue *q, 4300 | struct bvec_merge_data *bvm, 4301 | struct bio_vec *biovec) { 4302 | static int flag = 0; 4303 | 4304 | if(PMBD_IS_MERGEABLE()) { 4305 | /* always merge */ 4306 | if (!flag) { 4307 | printk(KERN_INFO "pmbd: bio merging enabled\n"); 4308 | flag = 1; 4309 | } 4310 | return biovec->bv_len; 4311 | } else { 4312 | /* never merge */ 4313 | if (!flag) { 4314 | printk(KERN_INFO "pmbd: bio merging disabled\n"); 4315 | flag = 1; 4316 | } 4317 | if (!bvm->bi_size) { 4318 | return biovec->bv_len; 4319 | } else { 4320 | return 0; 4321 | } 4322 | } 4323 | } 4324 | 4325 | 4326 | #if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 4327 | static int pmbd_ioctl(struct block_device *bdev, fmode_t mode, 4328 | unsigned int cmd, unsigned long arg) 4329 | { 4330 | int err = 0; 4331 | printk(KERN_WARNING "pmbd: WARNING - ioctl not implemented\n"); 4332 | 4333 | #if 0 4334 | PMBD_DEVICE_T* pmbd = bdev->bd_disk->private_data; 4335 | loff_t size = PMBD_MEM_TOTAL_BYTES(pmbd); 4336 | switch (cmd){ 4337 | case BLKGETSIZE: 4338 | printk("pmbd: ioctl: BLKGETSIZE received\n"); 4339 | if ((size >> 9) > ~0UL) 4340 | return -EFBIG; 4341 | else 4342 | return put_ulong(arg, (size >> 9)); 4343 | case BLKGETSIZE64: 4344 | printk("pmbd: ioctl: BLKGETSIZE64 received\n"); 4345 | return put_u64(arg, size); 4346 | case BLKFLSBUF: 4347 | printk("pmbd(%d): pmbd_ioctl BLKFLSBUF received\n", pmbd->pmbd_id); 4348 | if (PMBD_USE_WB()) 4349 | pmbd_write_barrier(pmbd); 4350 | return err; 4351 | default: 4352 | return -ENOTTY; 4353 | } 4354 | #endif 4355 | 4356 | return err; 4357 | } 4358 | #endif 4359 | 4360 | 4361 | int pmbd_fsync(struct file* file, struct dentry* dentry, int datasync) 4362 | { 4363 
| printk(KERN_WARNING "pmbd: pmbd_fsync not implemented\n"); 4364 | 4365 | return 0; 4366 | } 4367 | 4368 | int pmbd_open(struct block_device* bdev, fmode_t mode) 4369 | { 4370 | printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) opened\n", bdev->bd_disk->disk_name); 4371 | return 0; 4372 | } 4373 | 4374 | int pmbd_release (struct gendisk* disk, fmode_t mode) 4375 | { 4376 | printk(KERN_DEBUG "pmbd: pmbd (/dev/%s) released\n", disk->disk_name); 4377 | return 0; 4378 | } 4379 | 4380 | static const struct block_device_operations pmbd_fops = { 4381 | .owner = THIS_MODULE, 4382 | #if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 4383 | .locked_ioctl = pmbd_ioctl, 4384 | #endif 4385 | // .open = pmbd_open, 4386 | // .release = pmbd_release, 4387 | }; 4388 | 4389 | /* 4390 | * NOTE: part of the following code is derived from linux/block/brd.c 4391 | */ 4392 | 4393 | 4394 | static PMBD_DEVICE_T *pmbd_alloc(int i) 4395 | { 4396 | PMBD_DEVICE_T *pmbd; 4397 | struct gendisk *disk; 4398 | 4399 | /* no more than 26 devices */ 4400 | if (i >= PMBD_MAX_NUM_DEVICES) 4401 | return NULL; 4402 | 4403 | /* alloc and set up pmbd object */ 4404 | pmbd = kzalloc(sizeof(*pmbd), GFP_KERNEL); 4405 | if (!pmbd) 4406 | goto out; 4407 | pmbd->pmbd_id = i; 4408 | pmbd->pmbd_queue = blk_alloc_queue(GFP_KERNEL); 4409 | sprintf(pmbd->pmbd_name, "pm%c", ('a' + i)); 4410 | pmbd->rdlat = g_pmbd_rdlat[i]; 4411 | pmbd->wrlat = g_pmbd_wrlat[i]; 4412 | pmbd->rdbw = g_pmbd_rdbw[i]; 4413 | pmbd->wrbw = g_pmbd_wrbw[i]; 4414 | pmbd->rdsx = g_pmbd_rdsx[i]; 4415 | pmbd->wrsx = g_pmbd_wrsx[i]; 4416 | pmbd->rdpause = g_pmbd_rdpause[i]; 4417 | pmbd->wrpause = g_pmbd_wrpause[i]; 4418 | pmbd->simmode = g_pmbd_simmode[i]; 4419 | pmbd->rammode = g_pmbd_rammode[i]; 4420 | pmbd->wpmode = g_pmbd_wpmode[i]; 4421 | pmbd->num_buffers = g_pmbd_num_buffers; 4422 | pmbd->buffer_stride = g_pmbd_buffer_stride; 4423 | pmbd->bufmode = (g_pmbd_bufsize[i] > 0 && g_pmbd_num_buffers > 0) ? 
TRUE : FALSE; 4424 | 4425 | if (!pmbd->pmbd_queue) 4426 | goto out_free_dev; 4427 | 4428 | /* hook functions */ 4429 | blk_queue_make_request(pmbd->pmbd_queue, pmbd_make_request); 4430 | #if LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 4431 | blk_queue_ordered(pmbd->pmbd_queue, QUEUE_ORDERED_TAG, NULL); 4432 | #endif 4433 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 4434 | /* set flush capability, otherwise, WRITE_FLUSH and WRITE_FUA will be filtered in 4435 | generic_make_request() */ 4436 | if (PMBD_USE_FUA() && PMBD_USE_WB()) 4437 | blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH | REQ_FUA); 4438 | else if (PMBD_USE_WB()) 4439 | blk_queue_flush(pmbd->pmbd_queue, REQ_FLUSH); 4440 | else if (PMBD_USE_FUA()) 4441 | blk_queue_flush(pmbd->pmbd_queue, REQ_FUA); 4442 | #endif 4443 | blk_queue_max_hw_sectors(pmbd->pmbd_queue, 1024); 4444 | blk_queue_bounce_limit(pmbd->pmbd_queue, BLK_BOUNCE_ANY); 4445 | blk_queue_merge_bvec(pmbd->pmbd_queue, pmbd_mergeable_bvec); 4446 | 4447 | disk = pmbd->pmbd_disk = alloc_disk(1 << part_shift); 4448 | if (!disk) 4449 | goto out_free_queue; 4450 | 4451 | disk->major = PMBD_MAJOR; 4452 | disk->first_minor = i << part_shift; 4453 | disk->fops = &pmbd_fops; 4454 | disk->private_data = pmbd; 4455 | disk->queue = pmbd->pmbd_queue; 4456 | strcpy(disk->disk_name, pmbd->pmbd_name); 4457 | set_capacity(disk, GB_TO_SECTORS(g_pmbd_size[i])); /* num of sectors */ 4458 | 4459 | /* allocate PM space */ 4460 | if (pmbd_create(pmbd, GB_TO_SECTORS(g_pmbd_size[i])) < 0) 4461 | goto out_free_queue; 4462 | 4463 | /* done */ 4464 | return pmbd; 4465 | 4466 | out_free_queue: 4467 | blk_cleanup_queue(pmbd->pmbd_queue); 4468 | out_free_dev: 4469 | kfree(pmbd); 4470 | out: 4471 | return NULL; 4472 | } 4473 | 4474 | static void pmbd_free(PMBD_DEVICE_T *pmbd) 4475 | { 4476 | put_disk(pmbd->pmbd_disk); 4477 | blk_cleanup_queue(pmbd->pmbd_queue); 4478 | pmbd_free_pages(pmbd); 4479 | kfree(pmbd); 4480 | } 4481 | 4482 | static void pmbd_del_one(PMBD_DEVICE_T *pmbd) 4483 | { 4484 | list_del(&pmbd->pmbd_list); 4485 | del_gendisk(pmbd->pmbd_disk); 4486 | pmbd_free(pmbd); 4487 | } 4488 | 4489 | static int check_kernel_version(void) 4490 | { 4491 | #if LINUX_VERSION_CODE == KERNEL_VERSION(3,2,1) 4492 | printk(KERN_INFO "pmbd: Linux kernel version 3.2.1 is detected\n"); 4493 | printk(KERN_INFO "pmbd: WARNING (3.2.1) FUA with PT-based protection (with buffer) incurs double-write overhead\n"); 4494 | printk(KERN_INFO "pmbd: WARNING (3.2.1) No support for changing CPU cache flags with vmalloc() based PMBD\n"); 4495 | return 0; 4496 | #elif LINUX_VERSION_CODE == KERNEL_VERSION(2,6,34) 4497 | printk(KERN_INFO "pmbd: *** Full support for Linux 2.6.34 ***\n"); 4498 | return 0; 4499 | #else 4500 | printk(KERN_INFO "pmbd: only support Linux kernel 2.6.34 and 3.2.1\n"); 4501 | return -1; 4502 | #endif 4503 | } 4504 | 4505 | static int __init pmbd_init(void) 4506 | { 4507 | int i, nr; 4508 | unsigned long range; 4509 | PMBD_DEVICE_T *pmbd, *next; 4510 | 4511 | /*check kernel version and print warning*/ 4512 | if (check_kernel_version() < 0) 4513 | return -EINVAL; 4514 | 4515 | /* parse input options */ 4516 | pmbd_parse_conf(); 4517 | 4518 | /* initialize pmap start*/ 4519 | pmap_create(); 4520 | 4521 | /* ioremap high memory space */ 4522 | if (PMBD_USE_HIGHMEM()) { 4523 | if (pmbd_highmem_map() == NULL) 4524 | return -ENOMEM; 4525 | } 4526 | 4527 | part_shift = 0; 4528 | if (max_part > 0) 4529 | part_shift = fls(max_part); 4530 | 4531 | if (g_pmbd_nr > 1UL << (MINORBITS - part_shift)) 4532 | return 
-EINVAL; 4533 | 4534 | if (g_pmbd_nr) { 4535 | nr = g_pmbd_nr; 4536 | range = g_pmbd_nr; 4537 | } else { 4538 | printk(KERN_ERR "pmbd: %s(%d) - g_pmbd_nr=%d\n", __FUNCTION__, __LINE__, g_pmbd_nr); 4539 | return -EINVAL; 4540 | } 4541 | 4542 | pmbd_proc_create(); 4543 | 4544 | if (register_blkdev(PMBD_MAJOR, PMBD_NAME)) 4545 | return -EIO; 4546 | else 4547 | printk(KERN_INFO "pmbd: registered device at major %d\n", PMBD_MAJOR); 4548 | 4549 | for (i = 0; i < nr; i++) { 4550 | pmbd = pmbd_alloc(i); 4551 | if (!pmbd) 4552 | goto out_free; 4553 | list_add_tail(&pmbd->pmbd_list, &pmbd_devices); 4554 | } 4555 | 4556 | /* point of no return */ 4557 | list_for_each_entry(pmbd, &pmbd_devices, pmbd_list) 4558 | add_disk(pmbd->pmbd_disk); 4559 | 4560 | printk(KERN_INFO "pmbd: module loaded\n"); 4561 | return 0; 4562 | 4563 | out_free: 4564 | list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) { 4565 | list_del(&pmbd->pmbd_list); 4566 | pmbd_free(pmbd); 4567 | } 4568 | unregister_blkdev(PMBD_MAJOR, PMBD_NAME); 4569 | 4570 | return -ENOMEM; 4571 | } 4572 | 4573 | 4574 | static void __exit pmbd_exit(void) 4575 | { 4576 | unsigned long range; 4577 | PMBD_DEVICE_T *pmbd, *next; 4578 | 4579 | range = g_pmbd_nr ? g_pmbd_nr : 1UL << (MINORBITS - part_shift); 4580 | 4581 | /* deactivate each pmbd instance*/ 4582 | list_for_each_entry_safe(pmbd, next, &pmbd_devices, pmbd_list) 4583 | pmbd_del_one(pmbd); 4584 | 4585 | /* deioremap high memory space */ 4586 | if (PMBD_USE_HIGHMEM()) { 4587 | pmbd_highmem_unmap(); 4588 | } 4589 | 4590 | /* destroy pmap entries */ 4591 | pmap_destroy(); 4592 | 4593 | unregister_blkdev(PMBD_MAJOR, PMBD_NAME); 4594 | 4595 | pmbd_proc_destroy(); 4596 | 4597 | printk(KERN_INFO "pmbd: module unloaded\n"); 4598 | return; 4599 | } 4600 | 4601 | /* module setup */ 4602 | MODULE_AUTHOR("Intel Corporation "); 4603 | MODULE_ALIAS("pmbd"); 4604 | MODULE_LICENSE("GPL v2"); 4605 | MODULE_VERSION("0.9"); 4606 | MODULE_ALIAS_BLOCKDEV_MAJOR(PMBD_MAJOR); 4607 | module_init(pmbd_init); 4608 | module_exit(pmbd_exit); 4609 | 4610 | /* THE END */ 4611 | 4612 | 4613 | -------------------------------------------------------------------------------- /pmbd.h: -------------------------------------------------------------------------------- 1 | /* 2 | * Intel Persistent Memory Block Driver 3 | * Copyright (c) <2011-2013>, Intel Corporation. 4 | * 5 | * This program is free software; you can redistribute it and/or modify it 6 | * under the terms and conditions of the GNU General Public License, 7 | * version 2, as published by the Free Software Foundation. 8 | * 9 | * This program is distributed in the hope it will be useful, but WITHOUT 10 | * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 | * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 12 | * more details. 13 | * 14 | * You should have received a copy of the GNU General Public License along with 15 | * this program; if not, write to the Free Software Foundation, Inc., 16 | * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. 
17 | */ 18 | 19 | /* 20 | * Intel Persistent Memory Block Driver (v0.9) 21 | * 22 | * pmbd.h 23 | * 24 | * Intel Corporation 25 | * 03/24/2011 26 | */ 27 | 28 | #ifndef PMBD_H 29 | #define PMBD_H 30 | 31 | #define PMBD_MAJOR 261 /* FIXME: temporarily use this */ 32 | #define PMBD_NAME "pmbd" /* pmbd module name */ 33 | #define PMBD_MAX_NUM_DEVICES 26 /* max num of devices */ 34 | #define PMBD_MAX_NUM_CPUS 32 /* max num of cpus*/ 35 | 36 | /* 37 | * type definitions 38 | */ 39 | typedef uint32_t PMBD_CHECKSUM_T;/* we use CRC32 to calculate checksum */ 40 | typedef sector_t BBN_T; /* buffer block number */ 41 | typedef sector_t PBN_T; /* physical block number */ 42 | 43 | 44 | /* 45 | * PMBD device buffer control structure 46 | * NOTE: 47 | * (1) buffer_space is an array of num_blocks blocks, the size of which is 48 | * defined as pmbd->pb_size 49 | * (2) bbi_space is an array of num_blocks bbi (buffer block info) units, 50 | * each of which contains the metadata of the corresponding block in the buffer 51 | 52 | * buffer space management variables: 53 | * num_dirty - total number of dirty blocks in buffer 54 | * pos_dirty - point to the end of the sequence of dirty blocks 55 | * pos_clean - point to the end of the sequence of clean blocks 56 | * 57 | * pos_dirty and pos_clean logically segment the buffer into 58 | * dirty/clean regions as follows. 59 | * 60 | * pos_dirty ----v v--- pos_clean 61 | * ---------------------------- 62 | * | clean |*DIRTY*| clean | 63 | * ---------------------------- 64 | * buffer_lock - protects reads/writes to the three fields above 65 | */ 66 | typedef struct pmbd_bbi { /* pmbd buffer block info (BBI) */ 67 | PBN_T pbn; /* physical block number in PM (converted from sector) */ 68 | unsigned dirty; /* dirty (1) or clean (0)*/ 69 | } PMBD_BBI_T; 70 | 71 | typedef struct pmbd_bsort_entry { /* pmbd buffer block info for sorting */ 72 | BBN_T bbn; /* buffer block number (in buffer)*/ 73 | PBN_T pbn; /* physical block number (in PMBD)*/ 74 | } PMBD_BSORT_ENTRY_T; 75 | 76 | typedef struct pmbd_buffer { 77 | unsigned buffer_id; 78 | struct pmbd_device* pmbd; /* the linked pmbd device */ 79 | 80 | BBN_T num_blocks; /* buffer space size (# of blocks) */ 81 | void* buffer_space; /* buffer space base vaddr address */ 82 | PMBD_BBI_T* bbi_space; /* array of buffer block info (BBI)*/ 83 | 84 | BBN_T num_dirty; /* num of dirty blocks */ 85 | BBN_T pos_dirty; /* the first dirty block */ 86 | BBN_T pos_clean; /* the first clean block */ 87 | spinlock_t buffer_lock; /* lock to protect metadata updates */ 88 | unsigned int batch_size; /* the batch size for flushing buffer pages */ 89 | 90 | struct task_struct* syncer; /* the syncer daemon */ 91 | 92 | spinlock_t flush_lock; /* lock to protect metadata updates */ 93 | PMBD_BSORT_ENTRY_T* bbi_sort_buffer;/* a temp array of the bbi for sorting */ 94 | } PMBD_BUFFER_T; 95 | 96 | /* 97 | * PM physical block information (each corresponding to a PM block) 98 | * 99 | * (1) if the physical block is buffered, bbn contains a valid buffer block 100 | * number (BBN) between 0 - (buffer->num_blocks-1), otherwise, it contains an 101 | * invalid value (buffer->num_blocks + 1) 102 | * (2) any access to the block (read/write/sync) must acquire this lock first to 103 | * prevent multiple concurrent accesses to the same PM block 104 | */ 105 | typedef struct pmbd_pbi{ 106 | BBN_T bbn; 107 | spinlock_t lock; 108 | } PMBD_PBI_T; 109 | 110 | typedef struct pmbd_stat{ 111 | /* stat_lock does not protect cycles_*[] counters */ 112 | spinlock_t stat_lock; /* protection lock */ 113 | 
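 /* (these counters back the /proc/pmbd/pmbdstat output; the cycles_*[] breakdowns below are collected only when the timestat option is enabled, see USAGE_INFO) */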
114 | unsigned last_access_jiffies; /* the timestamp of the most recent access */ 115 | uint64_t num_sectors_read; /* total num of sectors being read */ 116 | uint64_t num_sectors_write; /* total num of sectors being written */ 117 | uint64_t num_requests_read; /* total num of requests for read */ 118 | uint64_t num_requests_write; /* total num of requests for write */ 119 | uint64_t num_write_barrier; /* total num of write barriers received */ 120 | uint64_t num_write_fua; /* total num of FUA writes received */ 121 | 122 | /* cycles counters (enabled/disabled by timestat)*/ 123 | uint64_t cycles_total[2][PMBD_MAX_NUM_CPUS]; /* total cycles spent in make_request*/ 124 | uint64_t cycles_prepare[2][PMBD_MAX_NUM_CPUS]; /* total cycles for prepare in make_request*/ 125 | uint64_t cycles_wb[2][PMBD_MAX_NUM_CPUS]; /* total cycles for write barrier in make_request*/ 126 | uint64_t cycles_work[2][PMBD_MAX_NUM_CPUS]; /* total cycles for work in make_request*/ 127 | uint64_t cycles_endio[2][PMBD_MAX_NUM_CPUS]; /* total cycles for endio in make_request*/ 128 | uint64_t cycles_finish[2][PMBD_MAX_NUM_CPUS]; /* total cycles for finish-up in make_request*/ 129 | 130 | uint64_t cycles_pmap[2][PMBD_MAX_NUM_CPUS]; /* total cycles for private mapping*/ 131 | uint64_t cycles_punmap[2][PMBD_MAX_NUM_CPUS]; /* total cycles for private unmapping */ 132 | uint64_t cycles_memcpy[2][PMBD_MAX_NUM_CPUS]; /* total cycles for memcpy */ 133 | uint64_t cycles_clflush[2][PMBD_MAX_NUM_CPUS]; /* total cycles for clflush_range */ 134 | uint64_t cycles_clflushall[2][PMBD_MAX_NUM_CPUS];/* total cycles for clflush_all */ 135 | uint64_t cycles_wrverify[2][PMBD_MAX_NUM_CPUS]; /* total cycles for doing write verification */ 136 | uint64_t cycles_checksum[2][PMBD_MAX_NUM_CPUS]; /* total cycles for doing checksum */ 137 | uint64_t cycles_pause[2][PMBD_MAX_NUM_CPUS]; /* total cycles for pause */ 138 | uint64_t cycles_slowdown[2][PMBD_MAX_NUM_CPUS]; /* total cycles for slowdown*/ 139 | uint64_t cycles_setpages_ro[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to ro*/ 140 | uint64_t cycles_setpages_rw[2][PMBD_MAX_NUM_CPUS]; /*total cycles for set pages to rw*/ 141 | } PMBD_STAT_T; 142 | 143 | /* 144 | * pmbd_device structure (each corresponding to a pmbd instance) 145 | */ 146 | #define PBN_TO_PMBD_BUFFER_ID(PMBD, PBN) (((PBN)/(PMBD)->buffer_stride) % (PMBD)->num_buffers) 147 | #define PBN_TO_PMBD_BUFFER(PMBD, PBN) ((PMBD)->buffers[PBN_TO_PMBD_BUFFER_ID((PMBD), (PBN))]) 148 | 149 | typedef struct pmbd_device { 150 | int pmbd_id; /* dev id */ 151 | char pmbd_name[DISK_NAME_LEN];/* device name */ 152 | 153 | struct request_queue * pmbd_queue; 154 | struct gendisk * pmbd_disk; 155 | struct list_head pmbd_list; 156 | 157 | /* PM backstore space */ 158 | void* mem_space; /* pointer to the kernel mem space */ 159 | uint64_t num_sectors; /* PMBD device capacity (num of 512-byte sectors)*/ 160 | unsigned sector_size; /* 512 bytes */ 161 | 162 | /* configurations */ 163 | unsigned pmbd_type; /* vmalloc() or high_mem */ 164 | unsigned rammode; /* RAM mode (no write protection) or not */ 165 | unsigned bufmode; /* use buffer or not */ 166 | unsigned wpmode; /* write protection mode: PTE change (0) or CR0/WP bit switch (1)*/ 167 | 168 | /* buffer management */ 169 | PMBD_BUFFER_T** buffers; /* buffer control structure */ 170 | unsigned num_buffers; /* number of buffers */ 171 | unsigned buffer_stride; /* the number of contiguous blocks mapped to the same buffer */ 172 | 173 | 174 | 175 | /* physical block info (metadata) */ 176 | 
PMBD_PBI_T* pbi_space; /* physical block info space (each) */ 177 | unsigned pb_size; /* the unit size of each block (4096 in default) */ 178 | 179 | /* checksum */ 180 | PMBD_CHECKSUM_T* checksum_space; /* checksum array */ 181 | unsigned checksum_unit_size; /* checksum unit size (bytes) */ 182 | void* checksum_iomem_buf; /* one unit buffer for ioremapped PM */ 183 | 184 | /* emulating PM with injected latency */ 185 | unsigned simmode; /* simulating whole device (0) or PM only (1)*/ 186 | uint64_t rdlat; /* read access latency (in nanoseconds)*/ 187 | uint64_t wrlat; /* write access latency (in nanoseconds)*/ 188 | uint64_t rdbw; /* read bandwidth (MB/sec) */ 189 | uint64_t wrbw; /* write bandwidth (MB/sec) */ 190 | unsigned rdsx; /* read slowdown (X) */ 191 | unsigned wrsx; /* write slowdown (X) */ 192 | uint64_t rdpause; /* read pause (cycles per 4KB page) */ 193 | uint64_t wrpause; /* write pause (cycles per 4KB page) */ 194 | 195 | spinlock_t batch_lock; /* lock protecting batch_* fields */ 196 | uint64_t batch_start_cycle[2]; /* start time of the batch (cycles)*/ 197 | uint64_t batch_end_cycle[2]; /* end time of the batch (cycles) */ 198 | uint64_t batch_sectors[2]; /* the total num of sectors in the batch */ 199 | 200 | PMBD_STAT_T* pmbd_stat; /* statistics data */ 201 | struct proc_dir_entry* proc_devstat; /* the proc output */ 202 | 203 | spinlock_t wr_barrier_lock;/* for write barrier and other control */ 204 | atomic_t num_flying_wr; /* the counter of writes on the fly */ 205 | 206 | spinlock_t tmp_lock; 207 | uint64_t tmp_data; 208 | unsigned long tmp_num; 209 | } PMBD_DEVICE_T; 210 | 211 | /* 212 | * support definitions 213 | */ 214 | #define TRUE 1 215 | #define FALSE 0 216 | 217 | #define __CURRENT_PID__ (current->pid) 218 | #define CONFIG_PMBD_DEBUG 1 219 | //#define PRINTK_DEBUG_HDR "DEBUG %s(%d)%u - " 220 | //#define PRINTK_DEBUG_PAR __FUNCTION__, __LINE__, __CURRENT_PID__ 221 | //#define PRINTK_DEBUG_1 if(CONFIG_PMBD_DEBUG >= 1) printk 222 | //#define PRINTK_DEBUG_2 if(CONFIG_PMBD_DEBUG >= 2) printk 223 | //#define PRINTK_DEBUG_3 if(CONFIG_PMBD_DEBUG >= 3) printk 224 | 225 | #define MAX_OF(A, B) (((A) > (B))? (A) : (B)) 226 | #define MIN_OF(A, B) (((A) < (B))? 
(A) : (B)) 227 | 228 | #define SECTOR_SHIFT 9 229 | #define PAGE_SHIFT 12 230 | #define SECTOR_SIZE (1UL << SECTOR_SHIFT) 231 | //#define PAGE_SIZE (1UL << PAGE_SHIFT) 232 | #define SECTOR_MASK (~(SECTOR_SIZE-1)) 233 | #define PAGE_MASK (~(PAGE_SIZE-1)) 234 | #define PMBD_SECTOR_SIZE SECTOR_SIZE 235 | #define PMBD_PAGE_SIZE PAGE_SIZE 236 | #define KB_SHIFT 10 237 | #define MB_SHIFT 20 238 | #define GB_SHIFT 30 239 | #define MB_TO_BYTES(N) ((N) << MB_SHIFT) 240 | #define GB_TO_BYTES(N) ((N) << GB_SHIFT) 241 | #define BYTES_TO_MB(N) ((N) >> MB_SHIFT) 242 | #define BYTES_TO_GB(N) ((N) >> GB_SHIFT) 243 | #define MB_TO_SECTORS(N) ((N) << (MB_SHIFT - SECTOR_SHIFT)) 244 | #define GB_TO_SECTORS(N) ((N) << (GB_SHIFT - SECTOR_SHIFT)) 245 | #define SECTORS_TO_MB(N) ((N) >> (MB_SHIFT - SECTOR_SHIFT)) 246 | #define SECTORS_TO_GB(N) ((N) >> (GB_SHIFT - SECTOR_SHIFT)) 247 | #define SECTOR_TO_PAGE(N) ((N) >> (PAGE_SHIFT - SECTOR_SHIFT)) 248 | #define SECTOR_TO_BYTE(N) ((N) << SECTOR_SHIFT) 249 | #define BYTE_TO_SECTOR(N) ((N) >> SECTOR_SHIFT) 250 | #define PAGE_TO_SECTOR(N) ((N) << (PAGE_SHIFT - SECTOR_SHIFT)) 251 | #define BYTE_TO_PAGE(N) ((N) >> (PAGE_SHIFT)) 252 | 253 | #define IS_SPACE(C) (isspace(C) || (C) == '\0') 254 | #define IS_DIGIT(C) (isdigit(C) && (C) != '\0') 255 | #define IS_ALPHA(C) (isalpha(C) && (C) != '\0') 256 | 257 | #define DISABLE_SAVE_IRQ(FLAGS) {local_irq_save((FLAGS)); local_irq_disable();} 258 | #define ENABLE_RESTORE_IRQ(FLAGS) {local_irq_restore((FLAGS)); local_irq_enable();} 259 | #define CUR_CPU_ID() smp_processor_id() 260 | 261 | /* 262 | * PMBD related config 263 | */ 264 | 265 | #define PMBD_CONFIG_VMALLOC 0 /* vmalloc() based PMBD (default) */ 266 | #define PMBD_CONFIG_HIGHMEM 1 /* ioremap() based PMBD */ 267 | 268 | 269 | /* global config */ 270 | #define PMBD_IS_MERGEABLE() (g_pmbd_mergeable == TRUE) 271 | #define PMBD_USE_VMALLOC() (g_pmbd_type == PMBD_CONFIG_VMALLOC) 272 | #define PMBD_USE_HIGHMEM() (g_pmbd_type == PMBD_CONFIG_HIGHMEM) 273 | #define PMBD_USE_CLFLUSH() (g_pmbd_cpu_cache_clflush == TRUE) 274 | #define PMBD_CPU_CACHE_FLAG() ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB)? "WB" : \ 275 | ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC)? "WC" : \ 276 | ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC)? "UC" : \ 277 | ((g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS)? 
"UC-Minus" : "UNKNOWN")))) 278 | 279 | #define PMBD_CPU_CACHE_USE_WB() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WB) /* write back */ 280 | #define PMBD_CPU_CACHE_USE_WC() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_WC) /* write combining */ 281 | #define PMBD_CPU_CACHE_USE_UC() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC) /* uncachable */ 282 | #define PMBD_CPU_CACHE_USE_UM() (g_pmbd_cpu_cache_flag == _PAGE_CACHE_UC_MINUS) /* uncachable minus */ 283 | 284 | #define PMBD_USE_WRITE_PROTECTION() (g_pmbd_wr_protect == TRUE) 285 | #define PMBD_USE_WRITE_VERIFICATION() (g_pmbd_wr_verify == TRUE) 286 | #define PMBD_USE_CHECKSUM() (g_pmbd_checksum == TRUE) 287 | #define PMBD_USE_LOCK() (g_pmbd_lock == TRUE) 288 | #define PMBD_USE_SUBPAGE_UPDATE() (g_pmbd_subpage_update == TRUE) 289 | 290 | #define PMBD_USE_PMAP() (g_pmbd_pmap == TRUE && g_pmbd_type == PMBD_CONFIG_HIGHMEM) 291 | #define PMBD_USE_NTS() (g_pmbd_nts == TRUE) 292 | #define PMBD_USE_NTL() (g_pmbd_ntl == TRUE) 293 | #define PMBD_USE_WB() (g_pmbd_wb == TRUE) 294 | #define PMBD_USE_FUA() (g_pmbd_fua == TRUE) 295 | #define PMBD_USE_TIMESTAT() (g_pmbd_timestat == TRUE) 296 | 297 | #define TIMESTAMP(TS) rdtscll((TS)) 298 | #define TIMESTAT_POINT(TS) {(TS) = 0; if (PMBD_USE_TIMESTAT()) rdtscll((TS));} 299 | 300 | /* instanced based config */ 301 | #define PMBD_DEV_USE_VMALLOC(PMBD) ((PMBD)->pmbd_type == PMBD_CONFIG_VMALLOC) 302 | #define PMBD_DEV_USE_HIGHMEM(PMBD) ((PMBD)->pmbd_type == PMBD_CONFIG_HIGHMEM) 303 | #define PMBD_DEV_USE_BUFFER(PMBD) ((PMBD)->bufmode) 304 | #define PMBD_DEV_USE_WPMODE_PTE(PMBD) ((PMBD)->wpmode == 0) 305 | #define PMBD_DEV_USE_WPMODE_CR0(PMBD) ((PMBD)->wpmode == 1) 306 | 307 | #define PMBD_DEV_USE_EMULATION(PMBD) ((PMBD)->rdlat || (PMBD)->wrlat || (PMBD)->rdbw || (PMBD)->wrbw) 308 | #define PMBD_DEV_SIM_PMBD(PMBD) (PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 1) 309 | #define PMBD_DEV_SIM_DEV(PMBD) (PMBD_DEV_USE_EMULATION((PMBD)) && (PMBD)->simmode == 0) 310 | #define PMBD_DEV_USE_SLOWDOWN(PMBD) ((PMBD)->rdsx > 1 || (PMBD)->wrsx > 1) 311 | 312 | /* support functions */ 313 | #define PMBD_MEM_TOTAL_SECTORS(PMBD) ((PMBD)->num_sectors) 314 | #define PMBD_MEM_TOTAL_BYTES(PMBD) ((PMBD)->num_sectors * (PMBD)->sector_size) 315 | #define PMBD_MEM_TOTAL_PAGES(PMBD) (((PMBD)->num_sectors) >> (PAGE_SHIFT - SECTOR_SHIFT)) 316 | #define PMBD_MEM_SPACE_FIRST_BYTE(PMBD) ((PMBD)->mem_space) 317 | #define PMBD_MEM_SPACE_LAST_BYTE(PMBD) ((PMBD)->mem_space + PMBD_MEM_TOTAL_BYTES(PMBD) - 1) 318 | #define PMBD_CHECKSUM_TOTAL_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->checksum_unit_size) 319 | #define PMBD_LOCK_TOTAL_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->lock_unit_size) 320 | #define VADDR_IN_PMBD_SPACE(PMBD, ADDR) ((ADDR) >= PMBD_MEM_SPACE_FIRST_BYTE(PMBD) \ 321 | && (ADDR) <= PMBD_MEM_SPACE_LAST_BYTE(PMBD)) 322 | 323 | #define BYTE_TO_PBN(PMBD, BYTES) ((BYTES) / (PMBD)->pb_size) 324 | #define PBN_TO_BYTE(PMBD, PBN) ((PBN) * (PMBD)->pb_size) 325 | #define SECTOR_TO_PBN(PMBD, SECT) (BYTE_TO_PBN((PMBD), SECTOR_TO_BYTE(SECT))) 326 | #define PBN_TO_SECTOR(PMBD, PBN) (BYTE_TO_SECTOR(PBN_TO_BYTE((PMBD), (PBN)))) 327 | 328 | 329 | #define PMBD_CACHELINE_SIZE (64) /* FIXME: configure this machine by machine? 
(check x86_clflush_size)*/ 330 | 331 | /* buffer related functions */ 332 | #define CALLER_ALLOCATOR (0) 333 | #define CALLER_SYNCER (1) 334 | #define CALLER_DESTROYER (2) 335 | 336 | #define PMBD_BLOCK_VADDR(PMBD, PBN) ((PMBD)->mem_space + ((PMBD)->pb_size * (PBN))) 337 | #define PMBD_BLOCK_PBI(PMBD, PBN) ((PMBD)->pbi_space + (PBN)) 338 | #define PMBD_TOTAL_PB_NUM(PMBD) (PMBD_MEM_TOTAL_BYTES(PMBD) / (PMBD)->pb_size) 339 | #define PMBD_BLOCK_IS_BUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn < PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks) 340 | #define PMBD_SET_BLOCK_BUFFERED(PMBD, PBN, BBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = (BBN)) 341 | #define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PMBD_TOTAL_PB_NUM((PMBD)) + 3) 342 | //#define PMBD_SET_BLOCK_UNBUFFERED(PMBD, PBN) (PMBD_BLOCK_PBI((PMBD),(PBN))->bbn = PBN_TO_PMBD_BUFFER((PMBD), (PBN))->num_blocks + 1) 343 | 344 | #define PMBD_BUFFER_MIN_BUFSIZE (4) /* buffer size (in MBs) */ 345 | #define PMBD_BUFFER_BLOCK(BUF, BBN) ((BUF)->buffer_space + (BUF)->pmbd->pb_size*(BBN)) 346 | #define PMBD_BUFFER_BBI(BUF, BBN) ((BUF)->bbi_space + (BBN)) 347 | #define PMBD_BUFFER_BBI_INDEX(BUF, ADDR) ((ADDR)-(BUF)->bbi_space) 348 | #define PMBD_BUFFER_SET_BBI_CLEAN(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = FALSE) 349 | #define PMBD_BUFFER_SET_BBI_DIRTY(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty = TRUE) 350 | #define PMBD_BUFFER_BBI_IS_CLEAN(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == FALSE) 351 | #define PMBD_BUFFER_BBI_IS_DIRTY(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->dirty == TRUE) 352 | #define PMBD_BUFFER_SET_BBI_BUFFERED(BUF,BBN,PBN)((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = (PBN)) 353 | #define PMBD_BUFFER_SET_BBI_UNBUFFERED(BUF, BBN) ((PMBD_BUFFER_BBI((BUF), (BBN)))->pbn = PMBD_TOTAL_PB_NUM((BUF)->pmbd) + 2) 354 | 355 | #define PMBD_BUFFER_FLUSH_HW (0.7) /* high watermark */ 356 | #define PMBD_BUFFER_FLUSH_LW (0.1) /* low watermark */ 357 | #define PMBD_BUFFER_IS_FULL(BUF) ((BUF)->num_dirty >= (BUF)->num_blocks) 358 | #define PMBD_BUFFER_IS_EMPTY(BUF) ((BUF)->num_dirty == 0) 359 | #define PMBD_BUFFER_ABOVE_HW(BUF) ((BUF)->num_dirty >= (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW))) 360 | #define PMBD_BUFFER_BELOW_HW(BUF) ((BUF)->num_dirty < (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_HW))) 361 | #define PMBD_BUFFER_ABOVE_LW(BUF) ((BUF)->num_dirty >= (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW))) 362 | #define PMBD_BUFFER_BELOW_LW(BUF) ((BUF)->num_dirty < (((BUF)->num_blocks * PMBD_BUFFER_FLUSH_LW))) 363 | #define PMBD_BUFFER_BATCH_SIZE_DEFAULT (1024) /* the batch size for each flush */ 364 | 365 | #define PMBD_BUFFER_NEXT_POS(BUF, POS) (((POS)==((BUF)->num_blocks - 1))? 0 : ((POS)+1)) 366 | #define PMBD_BUFFER_PRIO_POS(BUF, POS) (((POS)== 0)? 
((BUF)->num_blocks - 1) : ((POS)-1)) 367 | #define PMBD_BUFFER_NEXT_N_POS(BUF,POS,N) (((POS)+(N))%((BUF)->num_blocks)) 368 | #define PMBD_BUFFER_PRIO_N_POS(BUF,POS,N) ((BUF)->num_blocks - (((N)+(BUF)->num_blocks-(POS))%(BUF)->num_blocks)) 369 | 370 | /* high memory */ 371 | #define PMBD_HIGHMEM_AVAILABLE_SPACE (g_highmem_virt_addr + g_highmem_size - g_highmem_curr_addr) 372 | 373 | /* emulation */ 374 | #define MAX_SYNC_SLOWDOWN (10000000) /* use async_slowdown, if larger than 10ms */ 375 | #define OVERHEAD_NANOSEC (100) 376 | #define PMBD_USLEEP(n) {set_current_state(TASK_INTERRUPTIBLE); \ 377 | schedule_timeout((n)*HZ/1000000);} 378 | 379 | /* statistics */ 380 | #define PMBD_BATCH_MAX_SECTORS (4096) /* maximum data amount requested in a batch */ 381 | #define PMBD_BATCH_MIN_SECTORS (256) /* maximum data amount requested in a batch */ 382 | #define PMBD_BATCH_MAX_INTERVAL (1000000) /* maximum interval between two requests in a batch*/ 383 | #define PMBD_BATCH_MAX_DURATION (10000000) /* maximum duration of a batch (ns)*/ 384 | 385 | /* write protection*/ 386 | #define VADDR_TO_PAGE(ADDR) ((ADDR) >> PAGE_SHIFT) 387 | #define PAGE_TO_VADDR(PAGE) ((PAGE) << PAGE_SHIFT) 388 | 389 | /* checksum */ 390 | #define VADDR_TO_CHECKSUM_IDX(PMBD, ADDR) (((ADDR) - (PMBD)->mem_space) / (PMBD)->checksum_unit_size) 391 | #define CHECKSUM_IDX_TO_VADDR(PMBD, IDX) ((PMBD)->mem_space + (IDX) * (PMBD)->checksum_unit_size) 392 | #define CHECKSUM_IDX_TO_CKADDR(PMBD, IDX) ((PMBD)->checksum_space + (IDX)) 393 | 394 | /* idle period timer */ 395 | #define PMBD_BUFFER_FLUSH_IDLE_TIMEOUT (2000) /* 1 millisecond */ 396 | #define PMBD_DEV_UPDATE_ACCESS_TIME(PMBD) {spin_lock(&(PMBD)->pmbd_stat->stat_lock); \ 397 | (PMBD)->pmbd_stat->last_access_jiffies = jiffies; \ 398 | spin_unlock(&(PMBD)->pmbd_stat->stat_lock);} 399 | #define PMBD_DEV_GET_ACCESS_TIME(PMBD, T) {spin_lock(&(PMBD)->pmbd_stat->stat_lock); \ 400 | (T) = (PMBD)->pmbd_stat->last_access_jiffies; \ 401 | spin_unlock(&(PMBD)->pmbd_stat->stat_lock);} 402 | #define PMBD_DEV_IS_IDLE(PMBD, IDLE) ((IDLE) > PMBD_BUFFER_FLUSH_IDLE_TIMEOUT) 403 | 404 | /* Help info */ 405 | #define USAGE_INFO \ 406 | "\n\n\ 407 | ============================================\n\ 408 | Intel Persistent Memory Block Driver (v0.9)\n\ 409 | ============================================\n\n\ 410 | usage: $ modprobe pmbd mode=\"pmbd<#>;hmo<#>;hms<#>;[Option1];[Option2];[Option3];..\"\n\ 411 | \n\ 412 | GENERAL OPTIONS: \n\ 413 | \t pmbd<#,#..> \t set PM block device size (GBs) \n\ 414 | \t HM|VM \t\t use high memory (HM default) or vmalloc (VM) \n\ 415 | \t hmo<#> \t high memory starting offset (GB) \n\ 416 | \t hms<#> \t high memory size (GBs) \n\ 417 | \t pmap \t use private mapping (Y) or not (N default) - (note: must enable HM and wrprotN) \n\ 418 | \t nts \t use non-temporal store (MOVNTQ) and sfence to do memcpy (Y), or regular memcpy (N default)\n\ 419 | \t wb \t use write barrier (Y) or not (N default)\n\ 420 | \t fua \t use WRITE_FUA (Y default) or not (N) (only effective for Linux 3.2.1)\n\ 421 | \t ntl \t use non-temporal load (MOVNTDQA) to do memcpy (Y), or regular memcpy (N default) - this option enforces memory type of write combining\n\ 422 | \n\ 423 | SIMULATION: \n\ 424 | \t simmode<#,#..> use the specified numbers to the whole device (0 default) or PM only (1)\n\ 425 | \t rdlat<#,#..> \t set read access latency (ns) \n\ 426 | \t wrlat<#,#..> \t set write access latency (ns)\n\ 427 | \t rdbw<#,#..> \t set read bandwidth (MB/sec) (if set 0, no emulation) \n\ 428 | \t 
wrbw<#,#..> \t set write bandwidth (MB/sec) (if set 0, no emulation) \n\ 429 | \t rdsx<#,#..> \t set the relative slowdown (x) for read \n\ 430 | \t wrsx<#,#..> \t set the relative slowdown (x) for write \n\ 431 | \t rdpause<#,.> \t set a pause (cycles per 4KB) for each read\n\ 432 | \t wrpause<#,.> \t set a pause (cycles per 4KB) for each write\n\ 433 | \t adj<#> \t set an adjustment to the system overhead (nanoseconds) \n\ 434 | \n\ 435 | WRITE PROTECTION: \n\ 436 | \t wrprot \t use write protection for PM pages? (Y or N)\n\ 437 | \t wpmode<#,#,..> write protection mode: use the PTE change (0 default) or switch CR0/WP bit (1) \n\ 438 | \t clflush \t use clflush to flush CPU cache for each write to PM space? (Y or N) \n\ 439 | \t wrverify \t use write verification for PM pages? (Y or N) \n\ 440 | \t checksum \t use checksum to protect PM pages? (Y or N)\n\ 441 | \t bufsize<#,#,..> the buffer size (MBs) (0 - no buffer, at least 4MB)\n\ 442 | \t bufnum<#> \t the number of buffers for a PMBD device (16 buffers, at least 1 if using buffer, 0 - no buffer) \n\ 443 | \t bufstride<#> \t the number of contiguous blocks (4KB) mapped into one buffer (bucket size for round-robin mapping) (1024 by default)\n\ 444 | \t batch<#,#> \t the batch size (num of pages) for flushing PMBD device buffer (1 means no batching) \n\ 445 | \n\ 446 | MISC: \n\ 447 | \t mgb \t mergeable? (Y or N) \n\ 448 | \t lock \t lock the on-access page to serialize accesses? (Y or N) \n\ 449 | \t cache use which CPU cache policy? Write back (WB), Write Combined (WC), or Uncachable (UC)\n\ 450 | \t subupdate only update the changed cachelines of a page? (Y or N) (check PMBD_CACHELINE_SIZE) \n\ 451 | \t timestat enable the detailed timing statistics (/proc/pmbd/pmbdstat)? (Y or N) (This will cause significant performance slowdown) \n\ 452 | \n\ 453 | NOTE: \n\ 454 | \t (1) Options rdlat/wrlat only specify the minimum access times. Real access times can be higher.\n\ 455 | \t (2) If rdsx/wrsx is specified, rdlat/wrlat/rdbw/wrbw will be ignored. \n\ 456 | \t (3) Option simmode1 applies the simulated specification to the PM space, rather than the whole device, which may have a buffer.\n\ 457 | \n\ 458 | WARNING: \n\ 459 | \t (1) When using simmode1 to simulate slow-speed PM space, a soft lockup warning may appear. 
Use \"nosoftlockup\" boot option to disable it.\n\ 460 | \t (2) Enabling timestat may cause performance degradation.\n\ 461 | \t (3) FUA is supported in Linux 3.2.1, but if buffer is used (for PT-based protection), enabling FUA lowers performance due to double writes.\n\ 462 | \t (4) No support for changing CPU cache related PTE attributes for VM-based PMBD in Linux 3.2.1 (RCU stalls).\n\ 463 | \n\ 464 | PROC ENTRIES: \n\ 465 | \t /proc/pmbd/pmbdcfg config info about the PMBD devices\n\ 466 | \t /proc/pmbd/pmbdstat statistics of the PMBD devices (if timestat is enabled)\n\ 467 | \n\ 468 | EXAMPLE: \n\ 469 | \t Assuming a 16GB PM space with physical memory addresses from 8GB to 24GB:\n\ 470 | \t (1) Basic (Ramdisk): \n\ 471 | \t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;\"\n\n\ 472 | \t (2) Protected (with private mapping): \n\ 473 | \t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;\"\n\n\ 474 | \t (3) Protected and synced (with private mapping, non-temp store): \n\ 475 | \t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;\"\n\n\ 476 | \t (4) *** RECOMMENDED CONFIG *** \n\ 477 | \t Protected, synced, and ordered (with private mapping, non-temp store, write barrier): \n\ 478 | \t $ sudo modprobe pmbd mode=\"pmbd16;hmo8;hms16;pmapY;ntsY;wbY;\"\n\ 479 | \n" 480 | 481 | /* functions */ 482 | static inline void pmbd_set_pages_ro(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access); 483 | static inline void pmbd_set_pages_rw(PMBD_DEVICE_T* pmbd, void* addr, uint64_t bytes, unsigned on_access); 484 | static inline void pmbd_clflush_range(PMBD_DEVICE_T* pmbd, void* dst, size_t bytes); 485 | static inline int pmbd_verify_wr_pages(PMBD_DEVICE_T* pmbd, void* dst, void* src, size_t bytes); 486 | static int pmbd_checksum_on_write(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes); 487 | static int pmbd_checksum_on_read(PMBD_DEVICE_T* pmbd, void* vaddr, size_t bytes); 488 | 489 | static inline int put_ulong(unsigned long arg, unsigned long val) 490 | { 491 | return put_user(val, (unsigned long __user *)arg); 492 | } 493 | static inline int put_u64(unsigned long arg, u64 val) 494 | { 495 | return put_user(val, (u64 __user *)arg); 496 | } 497 | 498 | static inline void mfence(void) 499 | { 500 | asm volatile("mfence": : :); 501 | } 502 | 503 | static inline void sfence(void) 504 | { 505 | asm volatile("sfence": : :); 506 | } 507 | 508 | #endif 509 | /* THEN END */ 510 | --------------------------------------------------------------------------------