├── Makefile
├── README-map-detector.md
├── README.md
├── cache_attack.cc
├── detect-mc-mapping-safe.sh
├── detect-mc-mapping-xor.sh
├── detect-mc-mapping.sh
├── gen_combination.py
├── mc-mapping-pagemap.c
├── mc-mapping.c
├── palloc-3.10.patch
├── palloc-3.13.patch
├── palloc-3.15.patch
├── palloc-3.6.patch
├── palloc-3.8.patch
├── palloc-4.14.patch
├── palloc-4.4.patch
├── palloc-4.9.patch
├── palloc-5.10.74.patch
├── palloc-5.15.patch
├── palloc-5.3.patch
├── palloc-5.4.83.patch
├── palloc-6.13.patch
└── palloc-6.6.patch
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | CC=gcc
 2 | CXX=g++
 3 | 
 4 | PGMS=mc-mapping mc-mapping-pagemap cache_attack
 5 | 
 6 | CFLAGS=-Wall -O2 -std=c11
 7 | CXXFLAGS=-Wall -O2 -std=c++11
 8 | 
 9 | all: $(PGMS)
10 | 
11 | mc-mapping: mc-mapping.c
12 | 	$(CC) $< $(CFLAGS) -o $@ -lrt -g
13 | 
14 | mc-mapping-pagemap: mc-mapping-pagemap.c
15 | 	$(CC) $< $(CFLAGS) -o $@ -lrt -lpthread -g
16 | 
17 | cache_attack: cache_attack.cc
18 | 	$(CXX) $< $(CXXFLAGS) -o $@ -lrt -lpthread -g
19 | 
20 | clean:
21 | 	rm -f *.o *~ $(PGMS)
--------------------------------------------------------------------------------
/README-map-detector.md:
--------------------------------------------------------------------------------
 1 | DRAM Controller Address Map detector
 2 | ====================================
 3 | 
 4 | First, build the detector as follows.
 5 | 
 6 | $ make mc-mapping
 7 | 
 8 | Next, run the following script to identify candidate bank bits.
 9 | It tests bits 6 through 29 and, for each bit, reports the measured
10 | average bandwidth of the microbenchmark (mc-mapping). If successful,
11 | the bits fall into two distinct groups. For example, the following is
12 | the output on an Intel Xeon W3530 (Nehalem) machine with 1ch 4GB DDR3
13 | DRAM (16 banks total = 2 ranks x 8 banks/rank), which was used in our
14 | RTAS'14 paper [1].
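Once the per-bit bandwidths are collected, picking out the "faster" group can be automated by splitting the measurements at the midpoint between the two clusters. The following is a minimal Python sketch of that grouping step (our own illustration, not part of this repository; `find_bank_bits` is a hypothetical helper, and the numbers are the W3530 results shown below):

```python
# Hypothetical helper: classify per-bit bandwidths into bank-bit candidates.
# The bandwidth values are taken from the W3530 example output in this README.
def find_bank_bits(bw_by_bit):
    lo, hi = min(bw_by_bit.values()), max(bw_by_bit.values())
    cut = (lo + hi) / 2.0  # split the two clusters at the midpoint
    return sorted(b for b, bw in bw_by_bit.items() if bw > cut)

measured = {6: 293.31, 7: 293.38, 8: 294.12, 9: 293.47, 10: 293.42,
            11: 293.39, 12: 780.64, 13: 783.43, 14: 293.37, 15: 293.51,
            16: 293.33, 17: 293.53, 18: 293.32, 19: 785.48, 20: 787.71}
print(find_bank_bits(measured))  # the bits in the "faster" cluster
```

This assumes the two clusters are well separated, which is exactly the condition under which the detector is considered successful.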
15 | 
16 | $ sudo ./detect-mc-mapping.sh
17 | mc-mapping: no process found
18 | Run a background task on core1-3
19 | Now run the test
20 | Bit6: 293.31
21 | Bit7: 293.38
22 | Bit8: 294.12
23 | Bit9: 293.47
24 | Bit10: 293.42
25 | Bit11: 293.39
26 | Bit12: 780.64 <--- faster
27 | Bit13: 783.43 <--- faster
28 | Bit14: 293.37
29 | Bit15: 293.51
30 | Bit16: 293.33
31 | Bit17: 293.53
32 | Bit18: 293.32
33 | Bit19: 785.48 <--- faster
34 | Bit20: 787.71 <--- faster
35 | 
36 | 
37 | Notice that bits 12, 13, 19, and 20 are noticeably different
38 | from the other bits. Since we already know from the DRAM
39 | specification that there are 16 banks, we can conclude that the
40 | four identified bits are used to address the DRAM banks.
41 | 
42 | !!!WARNING!!! Running 'detect-mc-mapping.sh' can cause system instability or
43 | even a crash because it writes directly through /dev/mem. It is therefore
44 | recommended to reboot the machine after running the script.
45 | 
46 | ## Handling XOR addressing
47 | 
48 | If the number of identified bits is larger than expected, the memory
49 | controller likely uses XOR addressing [2].
50 | 
51 | For example, the following is the output on an Intel Xeon E3-1230 (Haswell)
52 | machine with 1ch 4GB DDR3 DRAM (16 banks total = 2 ranks x 8 banks/rank).
53 | In this case, there are 8 bits in total (bits 13 through 20) that show
54 | higher bandwidth, although only 4 bits are expected. This is because
55 | the memory controller uses XOR address mapping.
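To see why twice as many bits stand out, consider how an XOR-mapped controller computes the bank index. The sketch below is our own illustration (not part of the tools), assuming the pair mapping that is derived later in this section: flipping either input of an XOR pair alone flips a bank bit, so all 8 inputs of the 4 pairs look like bank bits.

```python
# Illustrative model of XOR address mapping: each bank-selection bit is the
# XOR of two physical-address bits, e.g. bank0 = bit13 ^ bit17.
PAIRS = [(13, 17), (14, 18), (15, 19), (16, 20)]  # mapping found later in this README

def bank_index(paddr):
    idx = 0
    for i, (lo, hi) in enumerate(PAIRS):
        idx |= (((paddr >> lo) ^ (paddr >> hi)) & 1) << i
    return idx

base = 0x0
for bit in range(13, 21):  # toggling any single one of the 8 input bits...
    assert bank_index(base ^ (1 << bit)) != bank_index(base)  # ...changes the bank
```

Toggling both bits of a pair together, however, leaves the bank index unchanged, which is what the pair-scan step below exploits.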
56 | 
57 | $ sudo ./detect-mc-mapping.sh
58 | mc-mapping: no process found
59 | Run a background task on core1-3
60 | Now run the test
61 | Bit6: 347.57
62 | Bit7: 337.46
63 | Bit8: 335.15
64 | Bit9: 339.78
65 | Bit10: 344.88
66 | Bit11: 337.86
67 | Bit12: 337.51
68 | Bit13: 629.86 <--- faster
69 | Bit14: 613.87 <--- faster
70 | Bit15: 617.76 <--- faster
71 | Bit16: 608.52 <--- faster
72 | Bit17: 630.11 <--- faster
73 | Bit18: 628.36 <--- faster
74 | Bit19: 631.34 <--- faster
75 | Bit20: 628.51 <--- faster
76 | Bit21: 314.77
77 | Bit22: 315.19
78 | Bit23: 314.81
79 | Bit24: 309.67
80 | Bit25: 314.77
81 | Bit26: 315.27
82 | Bit27: 315.75
83 | Bit28: 315.51
84 | Bit29: 310.42
85 | 
86 | When XOR addressing is used, we additionally need to identify which
87 | pairs of bits are XOR-gated. We provide two scripts to aid this
88 | identification process. The following run is on the same E3-1230
89 | platform. It tests all pairs among the 8 identified bits. Again, the
90 | output forms two distinct groups. In this case, a lower bandwidth
91 | number means that the two bits are an XOR pair.
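The pair-scan logic above can be sketched as follows (our own illustration, not part of the tools): enumerate all bit pairs, which is exactly what gen_combination.py pipes into detect-mc-mapping-xor.sh, and flag the pairs whose measured bandwidth falls in the low cluster, since toggling both halves of an XOR pair together leaves the bank unchanged and keeps hammering a single bank. The pair bandwidths are from the E3-1230 output below.

```python
from itertools import combinations

def find_xor_pairs(bw_by_pair):
    lo, hi = min(bw_by_pair.values()), max(bw_by_pair.values())
    cut = (lo + hi) / 2.0  # split low/high clusters at the midpoint
    return sorted(p for p, bw in bw_by_pair.items() if bw < cut)

candidate_bits = [13, 14, 15, 16, 17, 18, 19, 20]
all_pairs = list(combinations(candidate_bits, 2))  # 28 pairs, as in the scan
# a few of the measured values from the E3-1230 pair-scan output:
measured = {(13, 17): 316.41, (14, 18): 315.18, (15, 19): 315.76,
            (16, 20): 310.28, (13, 14): 613.18, (13, 15): 631.86}
print(len(all_pairs), find_xor_pairs(measured))
```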
92 | 
93 | 
94 | $ sudo bash
95 | # ./gen_combination.py 13 14 15 16 17 18 19 20 | \
96 | ./detect-mc-mapping-xor.sh
97 | mc-mapping: no process found
98 | Run a background task on core1-3
99 | Now run the test
100 | Bit 13 <--> 14: 613.18
101 | Bit 13 <--> 15: 631.86
102 | Bit 13 <--> 16: 608.09
103 | Bit 13 <--> 17: 316.41 <-- XOR pair
104 | Bit 13 <--> 18: 629.86
105 | Bit 13 <--> 19: 629.47
106 | Bit 13 <--> 20: 629.84
107 | Bit 14 <--> 15: 629.40
108 | Bit 14 <--> 16: 629.84
109 | Bit 14 <--> 17: 628.44
110 | Bit 14 <--> 18: 315.18 <-- XOR pair
111 | Bit 14 <--> 19: 610.55
112 | Bit 14 <--> 20: 629.39
113 | Bit 15 <--> 16: 628.33
114 | Bit 15 <--> 17: 631.25
115 | Bit 15 <--> 18: 628.57
116 | Bit 15 <--> 19: 315.76 <-- XOR pair
117 | Bit 15 <--> 20: 630.71
118 | Bit 16 <--> 17: 628.28
119 | Bit 16 <--> 18: 630.09
120 | Bit 16 <--> 19: 628.60
121 | Bit 16 <--> 20: 310.28 <-- XOR pair
122 | Bit 17 <--> 18: 629.57
123 | Bit 17 <--> 19: 631.22
124 | Bit 17 <--> 20: 628.41
125 | Bit 18 <--> 19: 630.48
126 | Bit 18 <--> 20: 615.39
127 | Bit 19 <--> 20: 617.36
128 | 
129 | Hence, we conclude that the final bank-selecting mappings are
130 | (13 XOR 17), (14 XOR 18), (15 XOR 19), and (16 XOR 20).
131 | 
132 | ## Safe pagemap based detector [experimental]
133 | 
134 | We are currently testing a new detector that uses the safer pagemap interface instead of /dev/mem. The new detector can be found in the repository: mc-mapping-pagemap.c.
135 | 
136 | The following is the result of the new detector on the Nehalem platform we used in the original PALLOC paper, which clearly shows that bits 12, 13, 19, and 20 are used for the mapping.
137 | 
138 | $ make mc-mapping-pagemap
139 | gcc mc-mapping-pagemap.c -Wall -O2 -std=c11 -o mc-mapping-pagemap -lrt -lpthread -g
140 | 
141 | $ sudo chrt -f 1 ./mc-mapping-pagemap -p 0.7 -n 3
142 | mem_size (MB): 2756
143 | allocation complete.
144 | worker thread begins
145 | worker thread begins
146 | worker thread begins
147 | Bit6: 299.67 MB/s, 213.57 ns
148 | Bit7: 307.61 MB/s, 208.05 ns
149 | Bit8: 295.97 MB/s, 216.24 ns
150 | Bit9: 297.84 MB/s, 214.88 ns
151 | Bit10: 300.66 MB/s, 212.86 ns
152 | Bit11: 245.89 MB/s, 260.28 ns
153 | Bit12: 792.58 MB/s, 80.75 ns <--- faster
154 | Bit13: 789.23 MB/s, 81.09 ns <--- faster
155 | Bit14: 296.21 MB/s, 216.06 ns
156 | Bit15: 294.19 MB/s, 217.55 ns
157 | Bit16: 240.98 MB/s, 265.58 ns
158 | Bit17: 294.20 MB/s, 217.54 ns
159 | Bit18: 294.07 MB/s, 217.64 ns
160 | Bit19: 789.05 MB/s, 81.11 ns <--- faster
161 | Bit20: 789.15 MB/s, 81.10 ns <--- faster
162 | Bit21: 294.17 MB/s, 217.56 ns
163 | Bit22: 240.98 MB/s, 265.58 ns
164 | Bit23: 294.10 MB/s, 217.62 ns
165 | 
166 | We recommend using this (mc-mapping-pagemap) over the original mc-mapping. Note, however, that it currently does not support XOR mapping detection.
167 | 
168 | ## Address map database
169 | Address mappings that have been successfully identified on various platforms are collected in the following wiki page.
170 | 
171 | https://github.com/heechul/palloc/wiki/Address-map-database
172 | 
173 | ## Limitations and other detection methods
174 | Our reverse engineering method does not work with many modern memory controllers that use more sophisticated XOR mapping schemes.
175 | 
176 | If you are unable to detect the DRAM address mapping with the tools we provide here, consider checking the following alternatives.
177 | 
178 | DRAMA
179 | https://github.com/IAIK/drama
180 | 
181 | DRAMDig
182 | https://arxiv.org/pdf/2004.02354
183 | 
184 | Blacksmith
185 | https://github.com/comsec-group/blacksmith
186 | 
187 | References
188 | ==========
189 | 
190 | [1] H. Yun, R. Mancuso, Z. Wu, R. Pellizzoni, "PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms", _RTAS_, 2014.
191 | 
192 | [2] Z. Zhang, Z. Zhu, X.
Zhang, "A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality", _MICRO_, 2000.
193 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # PALLOC
 2 | 
 3 | PALLOC is a kernel-level memory allocator that exploits page-based virtual-to-physical address translation to selectively allocate the memory pages of each application to the desired DRAM banks. The goal of PALLOC is to control applications' memory placement so as to minimize memory performance unpredictability on multicore systems by eliminating bank sharing among applications that execute in parallel. PALLOC is a software-based solution that is fully compatible with existing COTS hardware platforms and is transparent to applications (i.e., no application code changes are needed).
 4 | 
 5 | ## Source code
 6 | 
 7 | The source code of the Linux 3.6.0 kernel with PALLOC support can be obtained as follows.
 8 | 
 9 | $ git clone --depth 1 -b palloc-3.6 https://github.com/heechul/linux.git
10 | 
11 | Or you can use one of the prepared patches for different Linux kernel versions.
12 | 
13 | To build the kernel with PALLOC, the following configuration option must be set.
14 | 
15 | CONFIG_CGROUP_PALLOC=y
16 | 
17 | ## Detecting DRAM bank bits (for DRAM bank partitioning)
18 | 
19 | See [README-map-detector.md](./README-map-detector.md)
20 | 
21 | For cache partitioning, simply use the cache set bits instead of the DRAM bank bits.
22 | 
23 | ## Usage
24 | 
25 | 1. Select physical address bits to be used for page coloring
26 | 
27 | - For normal address bits (e.g., Intel Nehalem)
28 | ```
29 | # echo 0x00183000 > /sys/kernel/debug/palloc/palloc_mask
30 | --> select bits 12, 13, 19, and 20.
(total bins: 2^4 = 16)
31 | ```
32 | - For XOR mapped address bits (e.g., Intel Haswell)
33 | ```
34 | # echo 0x0001e000 > /sys/kernel/debug/palloc/palloc_mask
35 | # echo xor 13 17 > /sys/kernel/debug/palloc/control
36 | # echo xor 14 18 > /sys/kernel/debug/palloc/control
37 | # echo xor 15 19 > /sys/kernel/debug/palloc/control
38 | # echo xor 16 20 > /sys/kernel/debug/palloc/control
39 | # echo 1 > /sys/kernel/debug/palloc/use_mc_xor
40 | --> select (13 XOR 17), (14 XOR 18), (15 XOR 19), and (16 XOR 20) (total bins: 2^4 = 16)
41 | ```
42 | - CGROUP partition setting
43 | ```
44 | # mount -t cgroup xxx /sys/fs/cgroup
45 | # mkdir /sys/fs/cgroup/part1
46 | # echo 0 > /sys/fs/cgroup/part1/cpuset.cpus
47 | # echo 0 > /sys/fs/cgroup/part1/cpuset.mems
48 | # echo 0-3 > /sys/fs/cgroup/part1/palloc.bins
49 | --> bins 0-3 are assigned to the part1 CGROUP.
50 | # echo $$ > /sys/fs/cgroup/part1/tasks
51 | --> from now on, all processes invoked from the shell use pages only from bins 0-3.
52 | ```
53 | - Enable PALLOC
54 | ```
55 | # echo 1 > /sys/kernel/debug/palloc/use_palloc
56 | --> enable palloc (otherwise the default buddy allocator will be used)
57 | ```
58 | - Other options
59 | ```
60 | # echo 1 > /sys/kernel/debug/palloc/debug_level
61 | --> enable debug messages visible through /sys/kernel/debug/tracing/trace. [Recommended]
62 | # echo 4 > /sys/kernel/debug/palloc/alloc_balance
63 | --> wait until at least 4 different colors are in the color cache. [Recommended]
64 | ```
65 | 2. Disable transparent huge pages in the kernel:
66 | ```
67 | # echo never > /sys/kernel/mm/transparent_hugepage/enabled
68 | --> palloc does not work with transparent huge pages; please disable them.
69 | ```
70 | 
71 | ## Papers
72 | 
73 | * Heechul Yun, Renato Mancuso, Zheng-Pei Wu, Rodolfo Pellizzoni. "PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms," _IEEE Intl.
Conference on Real-Time and Embedded Technology and Applications Symposium (RTAS)_, 2014. ([pdf](http://www.ittc.ku.edu/~heechul/papers/palloc-rtas2014.pdf), [ppt](http://www.slideshare.net/saiparan/palloc-rtas2014)) 74 | -------------------------------------------------------------------------------- /cache_attack.cc: -------------------------------------------------------------------------------- 1 | /** 2 | * 3 | * Copyright (C) 2018 Heechul Yun 4 | * 5 | * This file is distributed under the University of Illinois Open Source 6 | * License. See LICENSE.TXT for details. 7 | * 8 | */ 9 | 10 | /************************************************************************** 11 | * Conditional Compilation Options 12 | **************************************************************************/ 13 | 14 | /************************************************************************** 15 | * Included Files 16 | **************************************************************************/ 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | #include 28 | #include 29 | #include 30 | #include 31 | #include 32 | #include 33 | #include 34 | 35 | /************************************************************************** 36 | * Public Definitions 37 | **************************************************************************/ 38 | #define MAX_BIT (22) // [27:23] bits are used for iterations 39 | #define debug(f, ...) 
do { if(verbosity > 3) {printf("[%-9s] ", "DEBUG"); \ 40 | printf(f, __VA_ARGS__); }} while(0); 41 | 42 | /************************************************************************** 43 | * Global Variables 44 | **************************************************************************/ 45 | long g_mem_size; 46 | double g_fraction_of_physical_memory = 0.2; 47 | int g_cache_num_ways = 16; 48 | int L3_thresh_cycles = 200; 49 | 50 | void *g_mapping; 51 | ulong *g_frame_phys; 52 | 53 | int g_cpuid = 0; 54 | int g_pagemap_fd; 55 | 56 | int verbosity = 4; 57 | 58 | size_t num_reads = 500; 59 | 60 | using namespace std; 61 | 62 | /************************************************************************** 63 | * Public Function Prototypes 64 | **************************************************************************/ 65 | size_t getPhysicalMemorySize() { 66 | struct sysinfo info; 67 | sysinfo(&info); 68 | return (size_t) info.totalram * (size_t) info.mem_unit; 69 | } 70 | 71 | size_t frameNumberFromPagemap(size_t value) { 72 | return value & ((1ULL << 54) - 1); 73 | } 74 | 75 | ulong getPhysicalAddr(ulong virtual_addr) 76 | { 77 | u_int64_t value; 78 | off_t offset = (virtual_addr / 4096) * sizeof(value); 79 | int got = pread(g_pagemap_fd, &value, 8, offset); 80 | //printf("vaddr=%lu, value=0x%llx, got=%d\n", virtual_addr, value, got); 81 | assert(got == 8); 82 | 83 | // Check the "page present" flag. 
84 | assert(value & (1ULL << 63)); 85 | 86 | ulong frame_num = frameNumberFromPagemap(value); 87 | return (frame_num * 4096) | (virtual_addr & (4095)); 88 | } 89 | 90 | void setupMapping() { 91 | g_mem_size = 92 | (long)(g_fraction_of_physical_memory * getPhysicalMemorySize()); 93 | printf("mem_size (MB): %d\n", (int)(g_mem_size / 1024 / 1024)); 94 | 95 | /* map */ 96 | g_mapping = mmap(NULL, g_mem_size, PROT_READ | PROT_WRITE, 97 | MAP_POPULATE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); 98 | assert(g_mapping != (void *) -1); 99 | 100 | /* page virt -> phys translation table */ 101 | g_frame_phys = (ulong *)malloc(sizeof(long) * (g_mem_size / 0x1000)); 102 | 103 | /* initialize */ 104 | for (long i = 0; i < g_mem_size; i += 0x1000) { 105 | ulong vaddr, paddr; 106 | vaddr = (ulong)((ulong)g_mapping + i); 107 | *((ulong *)vaddr) = 0x0; 108 | paddr = getPhysicalAddr(vaddr); 109 | g_frame_phys[i/0x1000] = paddr; 110 | // printf("vaddr-paddr: %p-%p\n", (void *)vaddr, (void *)paddr); 111 | } 112 | printf("allocation complete.\n"); 113 | } 114 | 115 | void initPagemap() 116 | { 117 | g_pagemap_fd = open("/proc/self/pagemap", O_RDONLY); 118 | assert(g_pagemap_fd >= 0); 119 | } 120 | 121 | uint64_t rdtsc() { 122 | uint64_t a, d; 123 | asm volatile ("xor %%rax, %%rax\n" "cpuid"::: "rax", "rbx", "rcx", "rdx"); 124 | asm volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx"); 125 | a = (d << 32) | a; 126 | return a; 127 | } 128 | 129 | uint64_t rdtsc2() { 130 | uint64_t a, d; 131 | asm volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx"); 132 | asm volatile ("cpuid"::: "rax", "rbx", "rcx", "rdx"); 133 | a = (d << 32) | a; 134 | return a; 135 | } 136 | 137 | int run(long *list, int count) 138 | { 139 | long i = 0; 140 | while (list && i++ < count) { 141 | list = (long *)*list; 142 | } 143 | return i; 144 | } 145 | 146 | bool check_conflict(void *addr, set &EV) 147 | { 148 | uint64_t sum = 0; 149 | long *list_curr = NULL; 150 | long *list_head = NULL; 151 | int count = 0; 152 | for (void 
*vaddr: EV) { 153 | count ++; 154 | if (count == 1) { 155 | list_head = list_curr = (long *)vaddr; 156 | } 157 | *list_curr = (long)vaddr; 158 | list_curr = (long *)vaddr; 159 | 160 | if (count == (int)EV.size()) { 161 | *list_curr = (ulong) list_head; 162 | } 163 | } 164 | 165 | size_t t0 = rdtsc(); 166 | sum = run(list_head, num_reads); 167 | uint64_t res = (rdtsc2() - t0) / (num_reads); 168 | printf("took: %d cycles/iteration. sum=%ld\n", (int)res, (long)sum); 169 | if ((int)res > L3_thresh_cycles) 170 | return true; 171 | else 172 | return false; 173 | } 174 | 175 | bool find_EV(set &CS, set &EV) 176 | { 177 | set::iterator it; 178 | set CS2; 179 | 180 | EV.clear(); 181 | CS2 = CS; 182 | it = CS2.begin(); 183 | void *test_addr = *it; 184 | CS2.erase(it); 185 | 186 | printf("pass 1\n"); 187 | if (check_conflict(test_addr, CS2) == false) { 188 | return false; 189 | } 190 | 191 | printf("pass 2\n"); 192 | 193 | for (it = CS2.begin(); it != CS2.end(); it++) { 194 | void *addr = *it; 195 | printf("%p ", addr); 196 | set tmpS = CS2; 197 | tmpS.erase(addr); // tmpS = CS2 - addr 198 | if (check_conflict(test_addr, tmpS) == true) { 199 | printf("conflict. add to EV\n"); 200 | EV.insert(addr); 201 | } else { 202 | printf("no conflict. 
continue\n"); 203 | CS2.erase(it); 204 | } 205 | } 206 | 207 | printf("pass 3\n"); 208 | for (void *addr: CS) { 209 | if (check_conflict(test_addr, EV) == true) { 210 | EV.insert(addr); 211 | } 212 | } 213 | return true; 214 | } 215 | 216 | bool find_addresses(ulong match_mask, int max_shift, int min_count, set &CS) 217 | { 218 | ulong vaddr, paddr; 219 | 220 | for (long i = 0; i < g_mem_size; i += 0x1000) { 221 | vaddr = (ulong)((long)g_mapping + i) + (match_mask & 0xFFF); 222 | paddr = g_frame_phys[i/0x1000] + (match_mask & 0xFFF); 223 | if (!((paddr & ((1< 0) 225 | continue; 226 | /* found a match */ 227 | // printf("vaddr-paddr: %p-%p\n", (void *)vaddr, (void *)paddr); 228 | CS.insert((void *)vaddr); 229 | 230 | if ((int)CS.size() == min_count) { 231 | return true; 232 | } 233 | } 234 | } 235 | debug("failed: found (%d) / requested (%d) pages\n", (int)CS.size(), min_count); 236 | return false; 237 | } 238 | 239 | int main(int argc, char *argv[]) 240 | { 241 | set CS; 242 | set EV; 243 | 244 | initPagemap(); 245 | setupMapping(); 246 | 247 | find_addresses(0x0, MAX_BIT, atoi(argv[1]), CS); 248 | printf("created a list with %lu addresses\n", CS.size()); 249 | 250 | /* for (void *addr: CS) { */ 251 | /* printf("0x%p\n", addr); */ 252 | /* } */ 253 | 254 | find_EV(CS, EV); 255 | 256 | printf("EV (%d):\n", (int)EV.size()); 257 | for (void *addr: EV) { 258 | printf("0x%p,", addr); 259 | } 260 | printf("\n"); 261 | return 0; 262 | } 263 | -------------------------------------------------------------------------------- /detect-mc-mapping-safe.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Detect bank and rank bits 4 | # 5 | # (c) 2013 Heechul Yun 6 | # 7 | # Safe detection method using huge page (2MB). 8 | # It works on Nahelem but doesn't work haswell for unknown reason 9 | # 10 | 11 | killall -9 mc-mapping 12 | 13 | if ! 
mount | grep hugetlbfs; then
14 |     echo "run init-hugetlfs.sh"
15 |     exit 1
16 | fi
17 | 
18 | echo "Run a background task on core1-3"
19 | for cpu in 1 2 3; do
20 |     ./mc-mapping -c $cpu -i 100000000000 -b 0 >& /dev/null &
21 | done
22 | 
23 | sleep 1
24 | 
25 | echo "Now run the test"
26 | for b in `seq 6 20`; do
27 |     echo -n "Bit$b: "
28 |     ./mc-mapping -c 0 -i 9000000 -b $b 2> /dev/null | grep band | awk '{ print $2 }'
29 | done
30 | killall -9 mc-mapping
31 | 
--------------------------------------------------------------------------------
/detect-mc-mapping-xor.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | killall -9 mc-mapping
 4 | echo "Run a background task on core1-3"
 5 | ./mc-mapping -c 1 -i 100000000000 -b 0 -x >& /dev/null &
 6 | ./mc-mapping -c 2 -i 100000000000 -b 0 -x >& /dev/null &
 7 | ./mc-mapping -c 3 -i 100000000000 -b 0 -x >& /dev/null &
 8 | sleep 1
 9 | 
10 | echo "Now run the test"
11 | while read buf; do
12 |     lbit=`echo $buf | awk '{ print $1 }'`
13 |     rbit=`echo $buf | awk '{ print $2 }'`
14 |     echo -n "Bit $lbit <--> $rbit: "
15 |     ./mc-mapping -c 0 -i 9000000 -b $lbit -s $rbit -x 2> /dev/null | grep band | awk '{ print $2 }' || echo "N/A"
16 | done
17 | killall -9 mc-mapping
18 | 
--------------------------------------------------------------------------------
/detect-mc-mapping.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | killall -9 mc-mapping
 3 | echo "Run a background task on core1-3"
 4 | ./mc-mapping -c 1 -i 100000000000 -b 0 -x >& /dev/null &
 5 | ./mc-mapping -c 2 -i 100000000000 -b 0 -x >& /dev/null &
 6 | ./mc-mapping -c 3 -i 100000000000 -b 0 -x >& /dev/null &
 7 | sleep 1
 8 | 
 9 | echo "Now run the test"
10 | for b in `seq 6 23`; do
11 |     echo -n "Bit$b: "
12 |     ./mc-mapping -c 0 -i 9000000 -b $b -x | grep band | awk '{ print $2 }' || echo "N/A"
13 | done
14 | killall -9 mc-mapping
15 | 
--------------------------------------------------------------------------------
/gen_combination.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | from itertools import combinations
 3 | 
 4 | import sys
 5 | import getopt
 6 | 
 7 | def main():
 8 |     try:
 9 |         optlist, args = getopt.getopt(sys.argv[1:], 'h', ["help"])
10 |     except getopt.GetoptError as err:
11 |         print(str(err))
12 |         sys.exit(2)
13 | 
14 |     for opt, val in optlist:
15 |         if opt in ("-h", "--help"):
16 |             print("print possible combinations of given integer list")
17 |             print("e.g.) $" + sys.argv[0] + " 13 14 15 16")
18 |             sys.exit(0)
19 |         else:
20 |             assert False, "unhandled option"
21 | 
22 |     bits = [int(a) for a in args]
23 |     for f in combinations(bits, 2):
24 |         print(f[0], f[1])
25 | 
26 | 
27 | if __name__ == "__main__":
28 |     main()
--------------------------------------------------------------------------------
/mc-mapping-pagemap.c:
--------------------------------------------------------------------------------
 1 | /**
 2 |  * DRAM controller address mapping detector
 3 |  *
 4 |  * Copyright (C) 2013 Heechul Yun
 5 |  * Copyright (C) 2018 Heechul Yun
 6 |  *
 7 |  * This file is distributed under the University of Illinois Open Source
 8 |  * License. See LICENSE.TXT for details.
9 | * 10 | */ 11 | 12 | /************************************************************************** 13 | * Conditional Compilation Options 14 | **************************************************************************/ 15 | 16 | /************************************************************************** 17 | * Included Files 18 | **************************************************************************/ 19 | #define _GNU_SOURCE /* See feature_test_macros(7) */ 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | #include 28 | #include 29 | #include 30 | #include 31 | #include 32 | #include 33 | #include 34 | #include 35 | #include 36 | #include 37 | 38 | /************************************************************************** 39 | * Public Definitions 40 | **************************************************************************/ 41 | #define MAX_BIT (24) // [27:23] bits are used for iterations 42 | 43 | #define MAX(a,b) ((a>b)?(a):(b)) 44 | #define MIN(a,b) ((a>b)?(b):(a)) 45 | #define CEIL(val,unit) (((val + unit - 1)/unit)*unit) 46 | 47 | #define FATAL do { fprintf(stderr, "Error at line %d, file %s (%d) [%s]\n", \ 48 | __LINE__, __FILE__, errno, strerror(errno)); exit(1); } while(0) 49 | 50 | /************************************************************************** 51 | * Public Types 52 | **************************************************************************/ 53 | 54 | /************************************************************************** 55 | * Global Variables 56 | **************************************************************************/ 57 | long g_mem_size; 58 | double g_fraction_of_physical_memory = 0.2; 59 | int g_cache_num_ways = 16; 60 | 61 | void *g_mapping; 62 | ulong *g_frame_phys; 63 | 64 | int g_cpuid = 0; 65 | int g_pagemap_fd; 66 | 67 | /************************************************************************** 68 | * Public Function Prototypes 69 | 
**************************************************************************/ 70 | uint64_t get_elapsed(struct timespec *start, struct timespec *end) 71 | { 72 | uint64_t dur; 73 | if (start->tv_nsec > end->tv_nsec) 74 | dur = (uint64_t)(end->tv_sec - 1 - start->tv_sec) * 1000000000 + 75 | (1000000000 + end->tv_nsec - start->tv_nsec); 76 | else 77 | dur = (uint64_t)(end->tv_sec - start->tv_sec) * 1000000000 + 78 | (end->tv_nsec - start->tv_nsec); 79 | 80 | return dur; 81 | } 82 | 83 | // ---------------------------------------------- 84 | size_t getPhysicalMemorySize() { 85 | struct sysinfo info; 86 | sysinfo(&info); 87 | return (size_t) info.totalram * (size_t) info.mem_unit; 88 | } 89 | 90 | // ---------------------------------------------- 91 | size_t frameNumberFromPagemap(size_t value) { 92 | return value & ((1ULL << 54) - 1); 93 | } 94 | 95 | // ---------------------------------------------- 96 | ulong getPhysicalAddr(ulong virtual_addr) 97 | { 98 | u_int64_t value; 99 | off_t offset = (virtual_addr / 4096) * sizeof(value); 100 | int got = pread(g_pagemap_fd, &value, 8, offset); 101 | //printf("vaddr=%lu, value=0x%llx, got=%d\n", virtual_addr, value, got); 102 | assert(got == 8); 103 | 104 | // Check the "page present" flag. 
105 | assert(value & (1ULL << 63)); 106 | 107 | ulong frame_num = frameNumberFromPagemap(value); 108 | return (frame_num * 4096) | (virtual_addr & (4095)); 109 | } 110 | 111 | // ---------------------------------------------- 112 | void setupMapping() { 113 | g_mem_size = 114 | (long)(g_fraction_of_physical_memory * getPhysicalMemorySize()); 115 | printf("mem_size (MB): %d\n", (int)(g_mem_size / 1024 / 1024)); 116 | 117 | /* map */ 118 | g_mapping = mmap(NULL, g_mem_size, PROT_READ | PROT_WRITE, 119 | MAP_POPULATE | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); 120 | assert(g_mapping != (void *) -1); 121 | 122 | /* page virt -> phys translation table */ 123 | g_frame_phys = (ulong *)malloc(sizeof(long) * (g_mem_size / 0x1000)); 124 | 125 | /* initialize */ 126 | for (long i = 0; i < g_mem_size; i += 0x1000) { 127 | ulong vaddr, paddr; 128 | vaddr = (ulong)(g_mapping + i); 129 | *((ulong *)vaddr) = 0; 130 | paddr = getPhysicalAddr(vaddr); 131 | g_frame_phys[i/0x1000] = paddr; 132 | // printf("vaddr-paddr: %p-%p\n", (void *)vaddr, (void *)paddr); 133 | } 134 | printf("allocation complete.\n"); 135 | } 136 | 137 | 138 | // ---------------------------------------------- 139 | void initPagemap() 140 | { 141 | g_pagemap_fd = open("/proc/self/pagemap", O_RDONLY); 142 | assert(g_pagemap_fd >= 0); 143 | } 144 | 145 | // ---------------------------------------------- 146 | long utime() 147 | { 148 | struct timeval tv; 149 | gettimeofday(&tv, NULL); 150 | 151 | return (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000; 152 | } 153 | 154 | uint64_t nstime() 155 | { 156 | struct timespec ts; 157 | clock_gettime(CLOCK_REALTIME, &ts); 158 | return ts.tv_sec * 1000000000 + ts.tv_nsec; 159 | } 160 | 161 | /************************************************************************** 162 | * Implementation 163 | **************************************************************************/ 164 | 165 | long *create_list(ulong match_mask, int max_shift, int min_count) 166 | { 167 | ulong vaddr, paddr; 168 | 
int count = 0; 169 | long *list_curr = NULL; 170 | long *list_head = NULL; 171 | 172 | // printf("mask: 0x%lx, shift: %d\n", match_mask, max_shift); 173 | 174 | for (long i = 0; i < g_mem_size; i += 0x1000) { 175 | vaddr = (ulong)(g_mapping + i) + (match_mask & 0xFFF); 176 | paddr = g_frame_phys[i/0x1000] + (match_mask & 0xFFF); 177 | if (!((paddr & ((1< 0) 179 | continue; 180 | /* found a match */ 181 | // printf("vaddr-paddr: %p-%p\n", (void *)vaddr, (void *)paddr); 182 | count ++; 183 | 184 | if (count == 1) { 185 | list_head = list_curr = (long *)vaddr; 186 | } 187 | 188 | *list_curr = vaddr; 189 | list_curr = (long *)vaddr; 190 | 191 | if (count == min_count) { 192 | *list_curr = (ulong) list_head; 193 | // printf("#of entries in the list: %d\n", count); 194 | return list_head; 195 | } 196 | } 197 | } 198 | printf("failed: found (%d) / requested (%d) pages\n", count, min_count); 199 | return NULL; 200 | } 201 | 202 | int run(long *list, int count) 203 | { 204 | long i = 0; 205 | while (list && i++ < count) { 206 | list = (long *)*list; 207 | } 208 | return i; 209 | } 210 | 211 | void worker(void *param) 212 | { 213 | long *list = (long *)param; 214 | 215 | printf("worker thread begins\n"); 216 | 217 | while (list) { 218 | list = (long *)*list; 219 | } 220 | } 221 | 222 | int main(int argc, char* argv[]) 223 | { 224 | cpu_set_t cmask; 225 | int num_processors, n_corun = 1; 226 | int opt; 227 | int repeat = 1000000; 228 | 229 | pthread_t tid[16]; /* thread identifier */ 230 | pthread_attr_t attr; 231 | pthread_attr_init(&attr); 232 | num_processors = sysconf(_SC_NPROCESSORS_CONF); 233 | 234 | /* 235 | * get command line options 236 | */ 237 | while ((opt = getopt(argc, argv, "w:p:c:n:i:h")) != -1) { 238 | switch (opt) { 239 | case 'w': /* cache num ways */ 240 | g_cache_num_ways = strtol(optarg, NULL, 0); 241 | break; 242 | case 'p': /* set memory fraction */ 243 | g_fraction_of_physical_memory = strtof(optarg, NULL); 244 | break; 245 | case 'c': /* set CPU 
affinity */ 246 | g_cpuid = strtol(optarg, NULL, 0); 247 | break; 248 | case 'n': /* #of co-runners */ 249 | n_corun = strtol(optarg, NULL, 0); 250 | break; 251 | case 'i': /* iterations */ 252 | repeat = strtol(optarg, NULL, 0); 253 | printf("repeat=%d\n", repeat); 254 | break; 255 | } 256 | } 257 | 258 | initPagemap(); 259 | setupMapping(); 260 | 261 | #if 0 262 | struct sched_param param; 263 | /* try to use a real-time scheduler*/ 264 | param.sched_priority = 1; 265 | if(sched_setscheduler(0, SCHED_FIFO, ¶m) == -1) { 266 | perror("sched_setscheduler failed"); 267 | } 268 | #endif 269 | /* launch corun worker threads */ 270 | tid[0]= pthread_self(); 271 | long *corun_list[16]; 272 | 273 | /* thread affinity set */ 274 | for (int i = 0; i < MIN(1+n_corun, num_processors); i++) { 275 | if (i != 0) { 276 | corun_list[i] = create_list(0x0, MAX_BIT, g_cache_num_ways*2); 277 | pthread_create(&tid[i], &attr, (void *)worker, corun_list[i]); 278 | } 279 | CPU_ZERO(&cmask); 280 | CPU_SET((g_cpuid + i) % num_processors, &cmask); 281 | if (pthread_setaffinity_np(tid[i], sizeof(cpu_set_t), &cmask) < 0) 282 | perror("error"); 283 | } 284 | 285 | sleep(2); 286 | 287 | for (int bit = 6; bit < MAX_BIT; bit++){ 288 | /* initialize data */ 289 | ulong bank_mask = (1< 5 | * 6 | * This file is distributed under the University of Illinois Open Source 7 | * License. See LICENSE.TXT for details. 
 *
 */

/**************************************************************************
 * Conditional Compilation Options
 **************************************************************************/

/**************************************************************************
 * Included Files
 **************************************************************************/
#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <time.h>
#include <inttypes.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>

/**************************************************************************
 * Public Definitions
 **************************************************************************/
#define L3_NUM_WAYS 16          // cat /sys/devices/system/cpu/cpu0/cache/index3/ways..
#define NUM_ENTRIES (L3_NUM_WAYS*2) // # of list entries to iterate
#define ENTRY_SHIFT (24)        // [27:23] bits are used for iterations
#define ENTRY_DIST  (1<<ENTRY_SHIFT) // distance between the list entries

#define MAX(a,b) ((a>b)?(a):(b))
#define CEIL(val,unit) (((val + unit - 1)/unit)*unit)

#define FATAL do { fprintf(stderr, "Error at line %d, file %s (%d) [%s]\n", \
	__LINE__, __FILE__, errno, strerror(errno)); exit(1); } while(0)

/**************************************************************************
 * Public Types
 **************************************************************************/

/**************************************************************************
 * Global Variables
 **************************************************************************/
static int g_mem_size = NUM_ENTRIES * ENTRY_DIST;
static int* list;
static int next;

/**************************************************************************
 * Public Function Prototypes
 **************************************************************************/
uint64_t get_elapsed(struct timespec *start, struct timespec *end)
{
	uint64_t dur;
	if (start->tv_nsec > end->tv_nsec)
		dur = (uint64_t)(end->tv_sec - 1 - start->tv_sec) * 1000000000 +
			(1000000000 + end->tv_nsec - start->tv_nsec);
	else
		dur = (uint64_t)(end->tv_sec - start->tv_sec) * 1000000000 +
			(end->tv_nsec - start->tv_nsec);

	return dur;
}

/**************************************************************************
 * Implementation
 **************************************************************************/
int run(int iter)
{
	int i;
	int cnt = 0;
	for (i = 0; i < iter; i++) {
		next = list[next];
		cnt++;
	}
	return cnt;
}

int main(int argc, char* argv[])
{
	struct sched_param param;
	cpu_set_t cmask;
	int num_processors;
	int cpuid = 0;
	int use_dev_mem = 0;

	int *memchunk = NULL;
	int opt, prio;
	int i;

	int repeat = 1000;

	int page_shift = 0;
	int xor_page_shift = 0;

	/*
	 * get command line options
	 * (note: 'p:' added to the optstring; the priority case below was
	 * otherwise unreachable)
	 */
	while ((opt = getopt(argc, argv, "a:xb:s:o:m:c:p:i:l:h")) != -1) {
		switch (opt) {
		case 'b': /* bank bit */
			page_shift = strtol(optarg, NULL, 0);
			break;
		case 's': /* xor-bank bit */
			xor_page_shift = strtol(optarg, NULL, 0);
			break;
		case 'm': /* set memory size */
			g_mem_size = 1024 * strtol(optarg, NULL, 0);
			break;
		case 'x': /* mmap to /dev/mem; otherwise use a hugepage */
			use_dev_mem = 1;
			break;
		case 'c': /* set CPU affinity */
			cpuid = strtol(optarg, NULL, 0);
			num_processors = sysconf(_SC_NPROCESSORS_CONF);
			CPU_ZERO(&cmask);
			CPU_SET(cpuid % num_processors, &cmask);
			/* size argument is in bytes, not the CPU count */
			if (sched_setaffinity(0, sizeof(cpu_set_t), &cmask) < 0)
				perror("error");
			break;
		case 'p': /* set priority */
			prio = strtol(optarg, NULL, 0);
			if (setpriority(PRIO_PROCESS, 0, prio) < 0)
				perror("error");
			break;
		case 'i': /* iterations */
			repeat = strtol(optarg, NULL, 0);
			printf("repeat=%d\n", repeat);
			break;
		}
	}

	g_mem_size += (1 << page_shift);
	g_mem_size = CEIL(g_mem_size, ENTRY_DIST);

	/* alloc memory. align to a page boundary */
	if (use_dev_mem) {
		int fd = open("/dev/mem", O_RDWR | O_SYNC);
		void *addr = (void *) 0x1000000080000000;

		if (fd < 0) {
			perror("Open failed");
			exit(1);
		}

		memchunk = mmap(0, g_mem_size,
				PROT_READ | PROT_WRITE,
				MAP_SHARED,
				fd, (off_t)addr);
	} else {
		memchunk = mmap(0, g_mem_size,
				PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
				-1, 0);
	}

	if (memchunk == MAP_FAILED) {
		perror("failed to alloc");
		exit(1);
	}

	/* initialize data */
	int off_idx = (1 << page_shift) / 4;

	if (xor_page_shift > 0) {
		off_idx = ((1 << page_shift) + (1 << xor_page_shift)) / 4;
	}

#if 0
	if (page_shift > 0 || xor_page_shift > 0)
		off_idx++;
#else
	if (page_shift >= ENTRY_SHIFT || xor_page_shift >= ENTRY_SHIFT) {
		fprintf(stderr, "page_shift or xor_page_shift must be less than %d bits\n",
			ENTRY_SHIFT);
		exit(1);
	}
#endif

	list = &memchunk[off_idx];
	for (i = 0; i < NUM_ENTRIES; i++) {
		int idx = i * ENTRY_DIST / 4;
		if (i == (NUM_ENTRIES - 1))
			list[idx] = 0;
		else
			list[idx] = (i+1) * ENTRY_DIST / 4;
	}
	next = 0;
	printf("pshift: %d, XOR-pshift: %d\n", page_shift, xor_page_shift);

#if 1
	param.sched_priority = 10;
	if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
		perror("sched_setscheduler failed");
	}
#endif
	struct timespec start, end;

	clock_gettime(CLOCK_REALTIME, &start);

	/* actual access */
	int naccess = run(repeat);

	clock_gettime(CLOCK_REALTIME, &end);

	int64_t nsdiff = get_elapsed(&start, &end);
	double avglat = (double)nsdiff/naccess;

	printf("size: %d (%d KB)\n", g_mem_size, g_mem_size/1024);
	printf("duration %"PRId64"ns, #access %d\n", nsdiff, naccess);
	printf("average latency: %.2f ns\n", avglat);
	printf("bandwidth %.2f MB/s\n", (double)64*1000*naccess/nsdiff);

	return 0;
}
--------------------------------------------------------------------------------
/palloc-3.13.patch:
--------------------------------------------------------------------------------
 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
 2 | index b613ffd..b5ed1ec 100644
 3 | --- a/include/linux/cgroup_subsys.h
 4 | +++ b/include/linux/cgroup_subsys.h
 5 | @@ -50,6 +50,9 @@ SUBSYS(net_prio)
 6 | #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_HUGETLB)
 7 | SUBSYS(hugetlb)
 8 | #endif
 9 | -/*
10 | - * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
11 | - */ 12 | + 13 | +/* */ 14 | + 15 | +#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_PALLOC) 16 | +SUBSYS(palloc) 17 | +#endif 18 | diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h 19 | index bd791e4..2ccda37 100644 20 | --- a/include/linux/mmzone.h 21 | +++ b/include/linux/mmzone.h 22 | @@ -69,6 +69,14 @@ enum { 23 | # define is_migrate_cma(migratetype) false 24 | #endif 25 | 26 | +#ifdef CONFIG_CGROUP_PALLOC 27 | +/* Determine the number of bins according to the bits required for 28 | + each component of the address*/ 29 | +# define MAX_PALLOC_BITS 8 30 | +# define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 31 | +# define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 32 | +#endif 33 | + 34 | #define for_each_migratetype_order(order, type) \ 35 | for (order = 0; order < MAX_ORDER; order++) \ 36 | for (type = 0; type < MIGRATE_TYPES; type++) 37 | @@ -367,6 +375,14 @@ struct zone { 38 | #endif 39 | struct free_area free_area[MAX_ORDER]; 40 | 41 | +#ifdef CONFIG_CGROUP_PALLOC 42 | + /* 43 | + * Color page cache. for movable type free pages of order-0 44 | + */ 45 | + struct list_head color_list[MAX_PALLOC_BINS]; 46 | + COLOR_BITMAP(color_bitmap); 47 | +#endif 48 | + 49 | #ifndef CONFIG_SPARSEMEM 50 | /* 51 | * Flags for a pageblock_nr_pages block. See pageblock-flags.h. 
52 | diff --git a/include/linux/palloc.h b/include/linux/palloc.h 53 | new file mode 100644 54 | index 0000000..ec4c092 55 | --- /dev/null 56 | +++ b/include/linux/palloc.h 57 | @@ -0,0 +1,33 @@ 58 | +#ifndef _LINUX_PALLOC_H 59 | +#define _LINUX_PALLOC_H 60 | + 61 | +/* 62 | + * kernel/palloc.h 63 | + * 64 | + * PHysical memory aware allocator 65 | + */ 66 | + 67 | +#include 68 | +#include 69 | +#include 70 | +#include 71 | + 72 | +#ifdef CONFIG_CGROUP_PALLOC 73 | + 74 | +struct palloc { 75 | + struct cgroup_subsys_state css; 76 | + COLOR_BITMAP(cmap); 77 | +}; 78 | + 79 | +/* Retrieve the palloc group corresponding to this cgroup container */ 80 | +struct palloc *cgroup_ph(struct cgroup *cgrp); 81 | + 82 | +/* Retrieve the palloc group corresponding to this subsys */ 83 | +struct palloc * ph_from_subsys(struct cgroup_subsys_state * subsys); 84 | + 85 | +/* return #of palloc bins */ 86 | +int palloc_bins(void); 87 | + 88 | +#endif /* CONFIG_CGROUP_PALLOC */ 89 | + 90 | +#endif /* _LINUX_PALLOC_H */ 91 | diff --git a/init/Kconfig b/init/Kconfig 92 | index 4e5d96a..d4f53b7 100644 93 | --- a/init/Kconfig 94 | +++ b/init/Kconfig 95 | @@ -1075,6 +1075,12 @@ config DEBUG_BLK_CGROUP 96 | Enable some debugging help. Currently it exports additional stat 97 | files in a cgroup which can be useful for debugging. 98 | 99 | +config CGROUP_PALLOC 100 | + bool "Enable PALLOC" 101 | + help 102 | + Enables PALLOC: physical address based page allocator that 103 | + replaces the buddy allocator. 
104 | + 105 | endif # CGROUPS 106 | 107 | config CHECKPOINT_RESTORE 108 | diff --git a/mm/Makefile b/mm/Makefile 109 | index 305d10a..a17928f 100644 110 | --- a/mm/Makefile 111 | +++ b/mm/Makefile 112 | @@ -60,3 +60,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o 113 | obj-$(CONFIG_CLEANCACHE) += cleancache.o 114 | obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o 115 | obj-$(CONFIG_ZBUD) += zbud.o 116 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 117 | diff --git a/mm/page_alloc.c b/mm/page_alloc.c 118 | index 5248fe0..0043a50 100644 119 | --- a/mm/page_alloc.c 120 | +++ b/mm/page_alloc.c 121 | @@ -1,3 +1,4 @@ 122 | + 123 | /* 124 | * linux/mm/page_alloc.c 125 | * 126 | @@ -63,12 +64,194 @@ 127 | #include 128 | 129 | #include 130 | +#include 131 | #include 132 | #include 133 | #include "internal.h" 134 | 135 | + 136 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 137 | static DEFINE_MUTEX(pcp_batch_high_lock); 138 | +#ifdef CONFIG_CGROUP_PALLOC 139 | +#include 140 | + 141 | +int memdbg_enable = 0; 142 | +EXPORT_SYMBOL(memdbg_enable); 143 | + 144 | +static int sysctl_alloc_balance = 0; 145 | +/* palloc address bitmask */ 146 | +static unsigned long sysctl_palloc_mask = 0x0; 147 | + 148 | +static int mc_xor_bits[64]; 149 | +static int use_mc_xor = 0; 150 | +static int use_palloc = 0; 151 | + 152 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 153 | + 154 | +#define memdbg(lvl, fmt, ...) 
\ 155 | + do { \ 156 | + if(memdbg_enable >= lvl) \ 157 | + trace_printk(fmt, ##__VA_ARGS__); \ 158 | + } while(0) 159 | + 160 | +struct palloc_stat { 161 | + s64 max_ns; 162 | + s64 min_ns; 163 | + s64 tot_ns; 164 | + 165 | + s64 tot_cnt; 166 | + s64 iter_cnt; /* avg_iter = iter_cnt/tot_cnt */ 167 | + 168 | + s64 cache_hit_cnt; /* hit rate = cache_hit_cnt / cache_acc_cnt */ 169 | + s64 cache_acc_cnt; 170 | + 171 | + s64 flush_cnt; 172 | + 173 | + s64 alloc_balance; 174 | + s64 alloc_balance_timeout; 175 | + ktime_t start; /* start time of the current iteration */ 176 | +}; 177 | + 178 | +static struct { 179 | + u32 enabled; 180 | + int colors; 181 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2 - fail */ 182 | +} palloc; 183 | + 184 | +static void palloc_flush(struct zone *zone); 185 | + 186 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, 187 | + size_t cnt, loff_t *ppos) 188 | +{ 189 | + char buf[64]; 190 | + int i; 191 | + if (cnt > 63) cnt = 63; 192 | + if (copy_from_user(&buf, ubuf, cnt)) 193 | + return -EFAULT; 194 | + 195 | + if (!strncmp(buf, "reset", 5)) { 196 | + printk(KERN_INFO "reset statistics...\n"); 197 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 198 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 199 | + palloc.stat[i].min_ns = 0x7fffffff; 200 | + } 201 | + } else if (!strncmp(buf, "flush", 5)) { 202 | + struct zone *zone; 203 | + printk(KERN_INFO "flush color cache...\n"); 204 | + for_each_populated_zone(zone) { 205 | + unsigned long flags; 206 | + if (!zone) 207 | + continue; 208 | + spin_lock_irqsave(&zone->lock, flags); 209 | + palloc_flush(zone); 210 | + spin_unlock_irqrestore(&zone->lock, flags); 211 | + } 212 | + } else if (!strncmp(buf, "xor", 3)) { 213 | + int bit, xor_bit; 214 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 215 | + if ((bit > 0 && bit < 64) && 216 | + (xor_bit > 0 && xor_bit < 64) && 217 | + bit != xor_bit) 218 | + { 219 | + mc_xor_bits[bit] = xor_bit; 220 | + } 221 
| + } 222 | + 223 | + *ppos += cnt; 224 | + return cnt; 225 | +} 226 | + 227 | +static int palloc_show(struct seq_file *m, void *v) 228 | +{ 229 | + int i, tmp; 230 | + char *desc[] = { "Color", "Normal", "Fail" }; 231 | + char buf[256]; 232 | + for (i = 0; i < 3; i++) { 233 | + struct palloc_stat *stat = &palloc.stat[i]; 234 | + seq_printf(m, "statistics %s:\n", desc[i]); 235 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 236 | + stat->min_ns, 237 | + stat->max_ns, 238 | + (stat->tot_cnt) ? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 239 | + stat->tot_cnt); 240 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 241 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 242 | + (stat->cache_acc_cnt) ? 243 | + div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 244 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 245 | + (stat->tot_cnt) ? 246 | + div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 247 | + stat->iter_cnt, stat->tot_cnt); 248 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 249 | + 250 | + seq_printf(m, " balance: %lld | fail: %lld\n", 251 | + stat->alloc_balance, stat->alloc_balance_timeout); 252 | + } 253 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 254 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 255 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, 1< 0) 262 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 263 | + } 264 | + 265 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc) ? 
"enabled" : "disabled"); 266 | + return 0; 267 | +} 268 | +static int palloc_open(struct inode *inode, struct file *filp) 269 | +{ 270 | + return single_open(filp, palloc_show, NULL); 271 | +} 272 | + 273 | +static const struct file_operations palloc_fops = { 274 | + .open = palloc_open, 275 | + .write = palloc_write, 276 | + .read = seq_read, 277 | + .llseek = seq_lseek, 278 | + .release = single_release, 279 | +}; 280 | + 281 | +static int __init palloc_debugfs(void) 282 | +{ 283 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 284 | + struct dentry *dir; 285 | + int i; 286 | + 287 | + dir = debugfs_create_dir("palloc", NULL); 288 | + 289 | + /* statistics initialization */ 290 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 291 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 292 | + palloc.stat[i].min_ns = 0x7fffffff; 293 | + } 294 | + 295 | + if (!dir) 296 | + return PTR_ERR(dir); 297 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 298 | + goto fail; 299 | + if (!debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask)) 300 | + goto fail; 301 | + if (!debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor)) 302 | + goto fail; 303 | + if (!debugfs_create_u32("use_palloc", mode, dir, &use_palloc)) 304 | + goto fail; 305 | + if (!debugfs_create_u32("debug_level", mode, dir, &memdbg_enable)) 306 | + goto fail; 307 | + if (!debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance)) 308 | + goto fail; 309 | + return 0; 310 | +fail: 311 | + debugfs_remove_recursive(dir); 312 | + return -ENOMEM; 313 | +} 314 | + 315 | +late_initcall(palloc_debugfs); 316 | + 317 | +#endif /* CONFIG_CGROUP_PALLOC */ 318 | 319 | #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 320 | DEFINE_PER_CPU(int, numa_node); 321 | @@ -879,12 +1062,314 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) 322 | return 0; 323 | } 324 | 325 | +#ifdef CONFIG_CGROUP_PALLOC 326 | + 327 | +int palloc_bins(void) 328 | +{ 329 | + 
return min((1 << bitmap_weight(&sysctl_palloc_mask, 8*sizeof (unsigned long))), 330 | + MAX_PALLOC_BINS); 331 | +} 332 | + 333 | +static inline int page_to_color(struct page *page) 334 | +{ 335 | + int color = 0; 336 | + int idx = 0; 337 | + int c; 338 | + unsigned long paddr = page_to_phys(page); 339 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 340 | + if (use_mc_xor) { 341 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 342 | + color |= (1<> c) & 0x1) 345 | + color |= (1<lock must be hold before calling this function 364 | + */ 365 | +static void palloc_flush(struct zone *zone) 366 | +{ 367 | + int c; 368 | + struct page *page; 369 | + memdbg(2, "flush the ccache for zone %s\n", zone->name); 370 | + 371 | + while (1) { 372 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 373 | + if (!list_empty(&zone->color_list[c])) { 374 | + page = list_entry(zone->color_list[c].next, 375 | + struct page, lru); 376 | + list_del_init(&page->lru); 377 | + __free_one_page(page, zone, 0, get_pageblock_migratetype(page)); 378 | + zone->free_area[0].nr_free--; 379 | + } 380 | + 381 | + if (list_empty(&zone->color_list[c])) { 382 | + bitmap_clear(zone->color_bitmap, c, 1); 383 | + INIT_LIST_HEAD(&zone->color_list[c]); 384 | + } 385 | + } 386 | + 387 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 388 | + break; 389 | + } 390 | +} 391 | + 392 | +/* move a page (size=1< 2^order x pages of colored cache. 
*/ 397 | + 398 | + /* remove from zone->free_area[order].free_list[mt] */ 399 | + list_del(&page->lru); 400 | + zone->free_area[order].nr_free--; 401 | + 402 | + /* insert pages to zone->color_list[] (all order-0) */ 403 | + for (i = 0; i < (1<color_list[color] */ 406 | + memdbg(5, "- add pfn %ld (0x%08llx) to color_list[%d]\n", 407 | + page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 408 | + INIT_LIST_HEAD(&page[i].lru); 409 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 410 | + bitmap_set(zone->color_bitmap, color, 1); 411 | + zone->free_area[0].nr_free++; 412 | + rmv_page_order(&page[i]); 413 | + } 414 | + memdbg(4, "add order=%d zone=%s\n", order, zone->name); 415 | +} 416 | + 417 | +/* return a colored page (order-0) and remove it from the colored cache */ 418 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), 419 | + int order, 420 | + struct palloc_stat *stat) 421 | +{ 422 | + struct page *page; 423 | + COLOR_BITMAP(tmpmask); 424 | + int c; 425 | + unsigned int tmp_idx; 426 | + int found_w, want_w; 427 | + unsigned long rand_seed; 428 | + /* cache statistics */ 429 | + if (stat) stat->cache_acc_cnt++; 430 | + 431 | + /* find color cache entry */ 432 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 433 | + return NULL; 434 | + 435 | + bitmap_and(tmpmask, zone->color_bitmap, cmap, MAX_PALLOC_BINS); 436 | + 437 | + /* must have a balance. 
*/ 438 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 439 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 440 | + if (sysctl_alloc_balance && 441 | + found_w < want_w && 442 | + found_w < min(sysctl_alloc_balance, want_w) && 443 | + memdbg_enable) 444 | + { 445 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 446 | + if (dur.tv64 < 1000000) { 447 | + /* try to balance unless order=MAX-2 or 1ms has passed */ 448 | + memdbg(4, "found_w=%d want_w=%d order=%d elapsed=%lld ns\n", 449 | + found_w, want_w, order, dur.tv64); 450 | + stat->alloc_balance++; 451 | + 452 | + return NULL; 453 | + } 454 | + stat->alloc_balance_timeout++; 455 | + } 456 | + 457 | + /* choose a bit among the candidates */ 458 | + if (sysctl_alloc_balance && memdbg_enable) { 459 | + rand_seed = (unsigned long)stat->start.tv64; 460 | + } else { 461 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 462 | + if (rand_seed > MAX_PALLOC_BINS) 463 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 464 | + } 465 | + 466 | + tmp_idx = rand_seed % found_w; 467 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 468 | + if (tmp_idx-- <= 0) 469 | + break; 470 | + } 471 | + 472 | + 473 | + BUG_ON(c >= MAX_PALLOC_BINS); 474 | + BUG_ON(list_empty(&zone->color_list[c])); 475 | + 476 | + page = list_entry(zone->color_list[c].next, struct page, lru); 477 | + 478 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 479 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 480 | + 481 | + /* remove from the zone->color_list[color] */ 482 | + list_del(&page->lru); 483 | + if (list_empty(&zone->color_list[c])) 484 | + bitmap_clear(zone->color_bitmap, c, 1); 485 | + zone->free_area[0].nr_free--; 486 | + 487 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", 488 | + page_to_pfn(page), c); 489 | + 490 | + if (stat) stat->cache_hit_cnt++; 491 | + return page; 492 | +} 493 | + 494 | +static inline void 495 | +update_stat(struct palloc_stat *stat, struct page 
*page, int iters) 496 | +{ 497 | + ktime_t dur; 498 | + 499 | + if (memdbg_enable == 0) 500 | + return; 501 | + 502 | + dur = ktime_sub(ktime_get(), stat->start); 503 | + 504 | + if(dur.tv64 > 0) { 505 | + stat->min_ns = min(dur.tv64, stat->min_ns); 506 | + stat->max_ns = max(dur.tv64, stat->max_ns); 507 | + 508 | + stat->tot_ns += dur.tv64; 509 | + stat->iter_cnt += iters; 510 | + 511 | + stat->tot_cnt++; 512 | + 513 | + memdbg(2, "order %ld pfn %ld(0x%08llx) color %d iters %d in %lld ns\n", 514 | + page_order(page), page_to_pfn(page), (u64)page_to_phys(page), 515 | + (int)page_to_color(page), 516 | + iters, dur.tv64); 517 | + } else { 518 | + memdbg(5, "dur %lld is < 0\n", dur.tv64); 519 | + } 520 | +} 521 | + 522 | /* 523 | * Go through the free lists for the given migratetype and remove 524 | * the smallest available page from the freelists 525 | */ 526 | static inline 527 | struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 528 | + int migratetype) 529 | +{ 530 | + unsigned int current_order; 531 | + struct free_area *area; 532 | + struct list_head *curr, *tmp; 533 | + struct page *page; 534 | + 535 | + struct palloc *ph; 536 | + struct palloc_stat *c_stat = &palloc.stat[0]; 537 | + struct palloc_stat *n_stat = &palloc.stat[1]; 538 | + struct palloc_stat *f_stat = &palloc.stat[2]; 539 | + int iters = 0; 540 | + COLOR_BITMAP(tmpcmap); 541 | + unsigned long *cmap; 542 | + 543 | + if (memdbg_enable) 544 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 545 | + 546 | + if (!use_palloc) 547 | + goto normal_buddy_alloc; 548 | + 549 | + /* cgroup information */ 550 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_subsys_id]); 551 | + if (ph && bitmap_weight(ph->cmap, MAX_PALLOC_BINS) > 0) 552 | + cmap = ph->cmap; 553 | + else { 554 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 555 | + cmap = tmpcmap; 556 | + } 557 | + 558 | + page = NULL; 559 | + if (order == 0) { 560 | + /* find in the cache */ 561 | + memdbg(5, "check color 
cache (mt=%d)\n", migratetype); 562 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 563 | + 564 | + if (page) { 565 | + update_stat(c_stat, page, iters); 566 | + return page; 567 | + } 568 | + } 569 | + 570 | + if (order == 0) { 571 | + /* build color cache */ 572 | + iters++; 573 | + /* search the entire list. make color cache in the process */ 574 | + for (current_order = 0; 575 | + current_order < MAX_ORDER; ++current_order) 576 | + { 577 | + area = &(zone->free_area[current_order]); 578 | + if (list_empty(&area->free_list[migratetype])) 579 | + continue; 580 | + memdbg(3, " order=%d (nr_free=%ld)\n", 581 | + current_order, area->nr_free); 582 | + list_for_each_safe(curr, tmp, 583 | + &area->free_list[migratetype]) 584 | + { 585 | + iters++; 586 | + page = list_entry(curr, struct page, lru); 587 | + palloc_insert(zone, page, current_order); 588 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 589 | + if (page) { 590 | + update_stat(c_stat, page, iters); 591 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", 592 | + zone->name, 593 | + page_to_pfn(page)); 594 | + return page; 595 | + } 596 | + } 597 | + } 598 | + memdbg(1, "Failed to find a matching color\n"); 599 | + } else { 600 | + normal_buddy_alloc: 601 | + /* normal buddy */ 602 | + /* Find a page of the appropriate size in the preferred list */ 603 | + for (current_order = order; 604 | + current_order < MAX_ORDER; ++current_order) 605 | + { 606 | + area = &(zone->free_area[current_order]); 607 | + iters++; 608 | + if (list_empty(&area->free_list[migratetype])) 609 | + continue; 610 | + page = list_entry(area->free_list[migratetype].next, 611 | + struct page, lru); 612 | + 613 | + list_del(&page->lru); 614 | + rmv_page_order(page); 615 | + area->nr_free--; 616 | + expand(zone, page, order, 617 | + current_order, area, migratetype); 618 | + 619 | + update_stat(n_stat, page, iters); 620 | + return page; 621 | + } 622 | + } 623 | + /* no memory (color or normal) found in this zone */ 624 | + 
memdbg(1, "No memory in Zone %s: order %d mt %d\n", 625 | + zone->name, order, migratetype); 626 | + 627 | + return NULL; 628 | +} 629 | +#else /* !CONFIG_CGROUP_PALLOC */ 630 | + 631 | +static inline 632 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 633 | int migratetype) 634 | { 635 | unsigned int current_order; 636 | @@ -908,7 +1393,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 637 | 638 | return NULL; 639 | } 640 | - 641 | +#endif /* CONFIG_CGROUP_PALLOC */ 642 | 643 | /* 644 | * This array describes the order lists are fallen back to when 645 | @@ -1500,9 +1985,13 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, 646 | unsigned long flags; 647 | struct page *page; 648 | int cold = !!(gfp_flags & __GFP_COLD); 649 | + struct palloc * ph; 650 | + 651 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_subsys_id]); 652 | 653 | again: 654 | - if (likely(order == 0)) { 655 | + /* Skip PCP when physically-aware allocation is requested */ 656 | + if (likely(order == 0) && !ph) { 657 | struct per_cpu_pages *pcp; 658 | struct list_head *list; 659 | 660 | @@ -4039,6 +4528,15 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, 661 | static void __meminit zone_init_free_lists(struct zone *zone) 662 | { 663 | int order, t; 664 | + 665 | +#ifdef CONFIG_CGROUP_PALLOC 666 | + int c; 667 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 668 | + INIT_LIST_HEAD(&zone->color_list[c]); 669 | + } 670 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS); 671 | +#endif /* CONFIG_CGROUP_PALLOC */ 672 | + 673 | for_each_migratetype_order(order, t) { 674 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 675 | zone->free_area[order].nr_free = 0; 676 | @@ -6330,6 +6828,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 677 | return; 678 | zone = page_zone(pfn_to_page(pfn)); 679 | spin_lock_irqsave(&zone->lock, flags); 680 | +#ifdef CONFIG_CGROUP_PALLOC 681 | 
+ palloc_flush(zone); 682 | +#endif 683 | pfn = start_pfn; 684 | while (pfn < end_pfn) { 685 | if (!pfn_valid(pfn)) { 686 | diff --git a/mm/palloc.c b/mm/palloc.c 687 | new file mode 100644 688 | index 0000000..fd1c5ce 689 | --- /dev/null 690 | +++ b/mm/palloc.c 691 | @@ -0,0 +1,169 @@ 692 | +/* 693 | + * kernel/palloc.c 694 | + * 695 | + * Physical driven User Space Allocator info for a set of tasks. 696 | + */ 697 | + 698 | +#include 699 | +#include 700 | +#include 701 | +#include 702 | +#include 703 | +#include 704 | +#include 705 | +#include 706 | +#include 707 | +#include 708 | + 709 | +/* 710 | + * Check if a page is compliant to the policy defined for the given vma 711 | + */ 712 | +#ifdef CONFIG_CGROUP_PALLOC 713 | + 714 | +#define MAX_LINE_LEN (6*128) 715 | +/* 716 | + * Types of files in a palloc group 717 | + * FILE_PALLOC - contain list of palloc bins allowed 718 | +*/ 719 | +typedef enum { 720 | + FILE_PALLOC, 721 | +} palloc_filetype_t; 722 | + 723 | +/* 724 | + * Top level palloc - mask initialized to zero implying no restriction on 725 | + * physical pages 726 | +*/ 727 | + 728 | +static struct palloc top_palloc; 729 | + 730 | +/* Retrieve the palloc group corresponding to this cgroup container */ 731 | +struct palloc *cgroup_ph(struct cgroup *cgrp) 732 | +{ 733 | + return container_of(cgrp->subsys[palloc_subsys_id], 734 | + struct palloc, css); 735 | +} 736 | + 737 | +struct palloc * ph_from_subsys(struct cgroup_subsys_state * subsys) 738 | +{ 739 | + return container_of(subsys, struct palloc, css); 740 | +} 741 | + 742 | +/* 743 | + * Common write function for files in palloc cgroup 744 | + */ 745 | +static int update_bitmask(unsigned long *bitmap, const char *buf, int maxbits) 746 | +{ 747 | + int retval = 0; 748 | + 749 | + if (!*buf) 750 | + bitmap_clear(bitmap, 0, maxbits); 751 | + else 752 | + retval = bitmap_parselist(buf, bitmap, maxbits); 753 | + 754 | + return retval; 755 | +} 756 | + 757 | + 758 | +static int palloc_file_write(struct 
cgroup_subsys_state *css, struct cftype *cft, 759 | + const char *buf) 760 | +{ 761 | + int retval = 0; 762 | + struct palloc *ph = container_of(css, struct palloc, css); 763 | + 764 | + switch (cft->private) { 765 | + case FILE_PALLOC: 766 | + retval = update_bitmask(ph->cmap, buf, palloc_bins()); 767 | + printk(KERN_INFO "Bins : %s\n", buf); 768 | + break; 769 | + default: 770 | + retval = -EINVAL; 771 | + break; 772 | + 773 | + } 774 | + 775 | + return retval; 776 | +} 777 | + 778 | +static ssize_t palloc_file_read(struct cgroup_subsys_state *css, 779 | + struct cftype *cft, 780 | + struct file *file, 781 | + char __user *buf, 782 | + size_t nbytes, loff_t *ppos) 783 | +{ 784 | + struct palloc *ph = container_of(css, struct palloc, css); 785 | + char *page; 786 | + ssize_t retval = 0; 787 | + char *s; 788 | + 789 | + if (!(page = (char *)__get_free_page(GFP_TEMPORARY))) 790 | + return -ENOMEM; 791 | + 792 | + s = page; 793 | + 794 | + switch (cft->private) { 795 | + case FILE_PALLOC: 796 | + s += bitmap_scnlistprintf(s, PAGE_SIZE, ph->cmap, palloc_bins()); 797 | + printk(KERN_INFO "Bins : %s\n", s); 798 | + break; 799 | + default: 800 | + retval = -EINVAL; 801 | + goto out; 802 | + } 803 | + *s++ = '\n'; 804 | + 805 | + retval = simple_read_from_buffer(buf, nbytes, ppos, page, s - page); 806 | +out: 807 | + free_page((unsigned long)page); 808 | + return retval; 809 | +} 810 | + 811 | + 812 | +/* 813 | + * struct cftype: handler definitions for cgroup control files 814 | + * 815 | + * for the common functions, 'private' gives the type of the file 816 | + */ 817 | +static struct cftype files[] = { 818 | + { 819 | + .name = "bins", 820 | + .read = palloc_file_read, 821 | + .write_string = palloc_file_write, 822 | + .max_write_len = MAX_LINE_LEN, 823 | + .private = FILE_PALLOC, 824 | + }, 825 | + { } /* terminate */ 826 | +}; 827 | + 828 | +/* 829 | + * palloc_create - create a palloc group 830 | + */ 831 | +static struct cgroup_subsys_state *palloc_create(struct 
cgroup_subsys_state *css) 832 | +{ 833 | + struct palloc * ph_child; 834 | + ph_child = kmalloc(sizeof(struct palloc), GFP_KERNEL); 835 | + if(!ph_child) 836 | + return ERR_PTR(-ENOMEM); 837 | + 838 | + bitmap_clear(ph_child->cmap, 0, MAX_PALLOC_BINS); 839 | + return &ph_child->css; 840 | +} 841 | + 842 | + 843 | +/* 844 | + * Destroy an existing palloc group 845 | + */ 846 | +static void palloc_destroy(struct cgroup_subsys_state *css) 847 | +{ 848 | + struct palloc *ph = container_of(css, struct palloc, css); 849 | + kfree(ph); 850 | +} 851 | + 852 | +struct cgroup_subsys palloc_subsys = { 853 | + .name = "palloc", 854 | + .css_alloc = palloc_create, 855 | + .css_free = palloc_destroy, 856 | + .subsys_id = palloc_subsys_id, 857 | + .base_cftypes = files, 858 | +}; 859 | + 860 | +#endif /* CONFIG_CGROUP_PALLOC */ 861 | diff --git a/mm/vmstat.c b/mm/vmstat.c 862 | index 7249614..4fc4fab 100644 863 | --- a/mm/vmstat.c 864 | +++ b/mm/vmstat.c 865 | @@ -23,6 +23,8 @@ 866 | 867 | #include "internal.h" 868 | 869 | +#include 870 | + 871 | #ifdef CONFIG_VM_EVENT_COUNTERS 872 | DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; 873 | EXPORT_PER_CPU_SYMBOL(vm_event_states); 874 | @@ -868,6 +870,38 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 875 | struct zone *zone) 876 | { 877 | int order; 878 | +#ifdef CONFIG_CGROUP_PALLOC 879 | +#include 880 | + int color, mt; 881 | + int cnt, bins; 882 | + struct free_area *area; 883 | + struct list_head *curr; 884 | + 885 | + seq_printf(m, "-------\n"); 886 | + /* order by memory type */ 887 | + for (mt = 0; mt < MIGRATE_ISOLATE; mt++) { 888 | + seq_printf(m, "- %17s[%d]", "mt", mt); 889 | + for (order = 0; order < MAX_ORDER; order++) { 890 | + area = &(zone->free_area[order]); 891 | + cnt = 0; 892 | + list_for_each(curr, &area->free_list[mt]) 893 | + cnt++; 894 | + seq_printf(m, "%6d ", cnt); 895 | + } 896 | + seq_printf(m, "\n"); 897 | + } 898 | + /* order by color */ 899 | + seq_printf(m, 
"-------\n"); 900 | + bins = palloc_bins(); 901 | + 902 | + for (color = 0; color < bins; color++) { 903 | + seq_printf(m, "- color [%d:%0x]", color, color); 904 | + cnt = 0; 905 | + list_for_each(curr, &zone->color_list[color]) 906 | + cnt++; 907 | + seq_printf(m, "%6d\n", cnt); 908 | + } 909 | +#endif /* !CONFIG_CGROUP_PALLOC */ 910 | 911 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 912 | for (order = 0; order < MAX_ORDER; ++order) 913 | @@ -875,6 +909,7 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 914 | seq_putc(m, '\n'); 915 | } 916 | 917 | + 918 | /* 919 | * This walks the free areas for each zone. 920 | */ 921 | -------------------------------------------------------------------------------- /palloc-3.15.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h 2 | index 768fe44..8238c7c 100644 3 | --- a/include/linux/cgroup_subsys.h 4 | +++ b/include/linux/cgroup_subsys.h 5 | @@ -50,6 +50,9 @@ SUBSYS(net_prio) 6 | #if IS_ENABLED(CONFIG_CGROUP_HUGETLB) 7 | SUBSYS(hugetlb) 8 | #endif 9 | -/* 10 | - * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS. 
11 | - */ 12 | + 13 | +/* */ 14 | + 15 | +#if IS_ENABLED(CONFIG_CGROUP_PALLOC) 16 | +SUBSYS(palloc) 17 | +#endif 18 | diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h 19 | index fac5509..3656155 100644 20 | --- a/include/linux/mmzone.h 21 | +++ b/include/linux/mmzone.h 22 | @@ -69,6 +69,14 @@ enum { 23 | # define is_migrate_cma(migratetype) false 24 | #endif 25 | 26 | +#ifdef CONFIG_CGROUP_PALLOC 27 | +/* Determine the number of bins according to the bits required for 28 | + each component of the address*/ 29 | +# define MAX_PALLOC_BITS 8 30 | +# define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 31 | +# define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 32 | +#endif 33 | + 34 | #define for_each_migratetype_order(order, type) \ 35 | for (order = 0; order < MAX_ORDER; order++) \ 36 | for (type = 0; type < MIGRATE_TYPES; type++) 37 | @@ -370,6 +378,14 @@ struct zone { 38 | #endif 39 | struct free_area free_area[MAX_ORDER]; 40 | 41 | +#ifdef CONFIG_CGROUP_PALLOC 42 | + /* 43 | + * Color page cache. for movable type free pages of order-0 44 | + */ 45 | + struct list_head color_list[MAX_PALLOC_BINS]; 46 | + COLOR_BITMAP(color_bitmap); 47 | +#endif 48 | + 49 | #ifndef CONFIG_SPARSEMEM 50 | /* 51 | * Flags for a pageblock_nr_pages block. See pageblock-flags.h. 
52 | diff --git a/include/linux/palloc.h b/include/linux/palloc.h 53 | new file mode 100644 54 | index 0000000..ec4c092 55 | --- /dev/null 56 | +++ b/include/linux/palloc.h 57 | @@ -0,0 +1,33 @@ 58 | +#ifndef _LINUX_PALLOC_H 59 | +#define _LINUX_PALLOC_H 60 | + 61 | +/* 62 | + * kernel/palloc.h 63 | + * 64 | + * PHysical memory aware allocator 65 | + */ 66 | + 67 | +#include 68 | +#include 69 | +#include 70 | +#include 71 | + 72 | +#ifdef CONFIG_CGROUP_PALLOC 73 | + 74 | +struct palloc { 75 | + struct cgroup_subsys_state css; 76 | + COLOR_BITMAP(cmap); 77 | +}; 78 | + 79 | +/* Retrieve the palloc group corresponding to this cgroup container */ 80 | +struct palloc *cgroup_ph(struct cgroup *cgrp); 81 | + 82 | +/* Retrieve the palloc group corresponding to this subsys */ 83 | +struct palloc * ph_from_subsys(struct cgroup_subsys_state * subsys); 84 | + 85 | +/* return #of palloc bins */ 86 | +int palloc_bins(void); 87 | + 88 | +#endif /* CONFIG_CGROUP_PALLOC */ 89 | + 90 | +#endif /* _LINUX_PALLOC_H */ 91 | diff --git a/init/Kconfig b/init/Kconfig 92 | index 765018c..84776c1 100644 93 | --- a/init/Kconfig 94 | +++ b/init/Kconfig 95 | @@ -1089,6 +1089,12 @@ config DEBUG_BLK_CGROUP 96 | Enable some debugging help. Currently it exports additional stat 97 | files in a cgroup which can be useful for debugging. 98 | 99 | +config CGROUP_PALLOC 100 | + bool "Enable PALLOC" 101 | + help 102 | + Enables PALLOC: physical address based page allocator that 103 | + replaces the buddy allocator. 
104 | + 105 | endif # CGROUPS 106 | 107 | config CHECKPOINT_RESTORE 108 | diff --git a/mm/Makefile b/mm/Makefile 109 | index b484452..cc1e594 100644 110 | --- a/mm/Makefile 111 | +++ b/mm/Makefile 112 | @@ -63,3 +63,5 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o 113 | obj-$(CONFIG_ZBUD) += zbud.o 114 | obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 115 | obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 116 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 117 | + 118 | diff --git a/mm/page_alloc.c b/mm/page_alloc.c 119 | index 5dba293..b5b0f09 100644 120 | --- a/mm/page_alloc.c 121 | +++ b/mm/page_alloc.c 122 | @@ -1,3 +1,4 @@ 123 | + 124 | /* 125 | * linux/mm/page_alloc.c 126 | * 127 | @@ -63,12 +64,194 @@ 128 | #include 129 | 130 | #include 131 | +#include 132 | #include 133 | #include 134 | #include "internal.h" 135 | 136 | + 137 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 138 | static DEFINE_MUTEX(pcp_batch_high_lock); 139 | +#ifdef CONFIG_CGROUP_PALLOC 140 | +#include 141 | + 142 | +int memdbg_enable = 0; 143 | +EXPORT_SYMBOL(memdbg_enable); 144 | + 145 | +static int sysctl_alloc_balance = 0; 146 | +/* palloc address bitmask */ 147 | +static unsigned long sysctl_palloc_mask = 0x0; 148 | + 149 | +static int mc_xor_bits[64]; 150 | +static int use_mc_xor = 0; 151 | +static int use_palloc = 0; 152 | + 153 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 154 | + 155 | +#define memdbg(lvl, fmt, ...) 
\ 156 | + do { \ 157 | + if(memdbg_enable >= lvl) \ 158 | + trace_printk(fmt, ##__VA_ARGS__); \ 159 | + } while(0) 160 | + 161 | +struct palloc_stat { 162 | + s64 max_ns; 163 | + s64 min_ns; 164 | + s64 tot_ns; 165 | + 166 | + s64 tot_cnt; 167 | + s64 iter_cnt; /* avg_iter = iter_cnt/tot_cnt */ 168 | + 169 | + s64 cache_hit_cnt; /* hit rate = cache_hit_cnt / cache_acc_cnt */ 170 | + s64 cache_acc_cnt; 171 | + 172 | + s64 flush_cnt; 173 | + 174 | + s64 alloc_balance; 175 | + s64 alloc_balance_timeout; 176 | + ktime_t start; /* start time of the current iteration */ 177 | +}; 178 | + 179 | +static struct { 180 | + u32 enabled; 181 | + int colors; 182 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2 - fail */ 183 | +} palloc; 184 | + 185 | +static void palloc_flush(struct zone *zone); 186 | + 187 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, 188 | + size_t cnt, loff_t *ppos) 189 | +{ 190 | + char buf[64]; 191 | + int i; 192 | + if (cnt > 63) cnt = 63; 193 | + if (copy_from_user(&buf, ubuf, cnt)) 194 | + return -EFAULT; 195 | + 196 | + if (!strncmp(buf, "reset", 5)) { 197 | + printk(KERN_INFO "reset statistics...\n"); 198 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 199 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 200 | + palloc.stat[i].min_ns = 0x7fffffff; 201 | + } 202 | + } else if (!strncmp(buf, "flush", 5)) { 203 | + struct zone *zone; 204 | + printk(KERN_INFO "flush color cache...\n"); 205 | + for_each_populated_zone(zone) { 206 | + unsigned long flags; 207 | + if (!zone) 208 | + continue; 209 | + spin_lock_irqsave(&zone->lock, flags); 210 | + palloc_flush(zone); 211 | + spin_unlock_irqrestore(&zone->lock, flags); 212 | + } 213 | + } else if (!strncmp(buf, "xor", 3)) { 214 | + int bit, xor_bit; 215 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 216 | + if ((bit > 0 && bit < 64) && 217 | + (xor_bit > 0 && xor_bit < 64) && 218 | + bit != xor_bit) 219 | + { 220 | + mc_xor_bits[bit] = xor_bit; 221 | + } 222 
| + } 223 | + 224 | + *ppos += cnt; 225 | + return cnt; 226 | +} 227 | + 228 | +static int palloc_show(struct seq_file *m, void *v) 229 | +{ 230 | + int i, tmp; 231 | + char *desc[] = { "Color", "Normal", "Fail" }; 232 | + char buf[256]; 233 | + for (i = 0; i < 3; i++) { 234 | + struct palloc_stat *stat = &palloc.stat[i]; 235 | + seq_printf(m, "statistics %s:\n", desc[i]); 236 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 237 | + stat->min_ns, 238 | + stat->max_ns, 239 | + (stat->tot_cnt) ? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 240 | + stat->tot_cnt); 241 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 242 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 243 | + (stat->cache_acc_cnt) ? 244 | + div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 245 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 246 | + (stat->tot_cnt) ? 247 | + div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 248 | + stat->iter_cnt, stat->tot_cnt); 249 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 250 | + 251 | + seq_printf(m, " balance: %lld | fail: %lld\n", 252 | + stat->alloc_balance, stat->alloc_balance_timeout); 253 | + } 254 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 255 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 256 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, 1<<tmp); 257 | + bitmap_scnlistprintf(buf, 256, &sysctl_palloc_mask, sizeof(unsigned long)*8); 258 | + seq_printf(m, "bits: %s\n", buf); 259 | + 260 | + seq_printf(m, "XOR bits: %s\n", (use_mc_xor) ? "enabled" : "disabled"); 261 | + for (i = 0; i < 64; i++) { 262 | + if (mc_xor_bits[i] > 0) 263 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 264 | + } 265 | + 266 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc) ?
"enabled" : "disabled"); 267 | + return 0; 268 | +} 269 | +static int palloc_open(struct inode *inode, struct file *filp) 270 | +{ 271 | + return single_open(filp, palloc_show, NULL); 272 | +} 273 | + 274 | +static const struct file_operations palloc_fops = { 275 | + .open = palloc_open, 276 | + .write = palloc_write, 277 | + .read = seq_read, 278 | + .llseek = seq_lseek, 279 | + .release = single_release, 280 | +}; 281 | + 282 | +static int __init palloc_debugfs(void) 283 | +{ 284 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 285 | + struct dentry *dir; 286 | + int i; 287 | + 288 | + dir = debugfs_create_dir("palloc", NULL); 289 | + 290 | + /* statistics initialization */ 291 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 292 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 293 | + palloc.stat[i].min_ns = 0x7fffffff; 294 | + } 295 | + 296 | + if (!dir) 297 | + return PTR_ERR(dir); 298 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 299 | + goto fail; 300 | + if (!debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask)) 301 | + goto fail; 302 | + if (!debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor)) 303 | + goto fail; 304 | + if (!debugfs_create_u32("use_palloc", mode, dir, &use_palloc)) 305 | + goto fail; 306 | + if (!debugfs_create_u32("debug_level", mode, dir, &memdbg_enable)) 307 | + goto fail; 308 | + if (!debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance)) 309 | + goto fail; 310 | + return 0; 311 | +fail: 312 | + debugfs_remove_recursive(dir); 313 | + return -ENOMEM; 314 | +} 315 | + 316 | +late_initcall(palloc_debugfs); 317 | + 318 | +#endif /* CONFIG_CGROUP_PALLOC */ 319 | 320 | #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 321 | DEFINE_PER_CPU(int, numa_node); 322 | @@ -907,12 +1090,314 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) 323 | return 0; 324 | } 325 | 326 | +#ifdef CONFIG_CGROUP_PALLOC 327 | + 328 | +int palloc_bins(void) 329 | +{ 330 | + 
return min((1 << bitmap_weight(&sysctl_palloc_mask, 8*sizeof (unsigned long))), 331 | + MAX_PALLOC_BINS); 332 | +} 333 | + 334 | +static inline int page_to_color(struct page *page) 335 | +{ 336 | + int color = 0; 337 | + int idx = 0; 338 | + int c; 339 | + unsigned long paddr = page_to_phys(page); 340 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 341 | + if (use_mc_xor) { 342 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 343 | + color |= (1<<idx); 344 | + } else { 345 | + if ((paddr >> c) & 0x1) 346 | + color |= (1<<idx); 347 | + } 348 | + idx++; 349 | + } 350 | + return color; 351 | +} 352 | + 353 | +/* debug */ 354 | +static inline unsigned long list_count(struct list_head *head) 355 | +{ 356 | + unsigned long n = 0; 357 | + struct list_head *curr; 358 | + list_for_each(curr, head) 359 | + n++; 360 | + return n; 361 | +} 362 | + 363 | +/* 364 | + * zone->lock must be hold before calling this function 365 | + */ 366 | +static void palloc_flush(struct zone *zone) 367 | +{ 368 | + int c; 369 | + struct page *page; 370 | + memdbg(2, "flush the ccache for zone %s\n", zone->name); 371 | + 372 | + while (1) { 373 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 374 | + if (!list_empty(&zone->color_list[c])) { 375 | + page = list_entry(zone->color_list[c].next, 376 | + struct page, lru); 377 | + list_del_init(&page->lru); 378 | + __free_one_page(page, zone, 0, get_pageblock_migratetype(page)); 379 | + zone->free_area[0].nr_free--; 380 | + } 381 | + 382 | + if (list_empty(&zone->color_list[c])) { 383 | + bitmap_clear(zone->color_bitmap, c, 1); 384 | + INIT_LIST_HEAD(&zone->color_list[c]); 385 | + } 386 | + } 387 | + 388 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 389 | + break; 390 | + } 391 | +} 392 | + 393 | +/* move a page (size=1<<order) to the colored cache */ 394 | +static void palloc_insert(struct zone *zone, struct page *page, int order) 395 | +{ 396 | + int i, color; 397 | + /* 1 page (2^order) -> 2^order x pages of colored cache. */ 398 | + 399 | + /* remove from zone->free_area[order].free_list[mt] */ 400 | + list_del(&page->lru); 401 | + zone->free_area[order].nr_free--; 402 | + 403 | + /* insert pages to zone->color_list[] (all order-0) */ 404 | + for (i = 0; i < (1<<order); i++) { 405 | + color = page_to_color(&page[i]); 406 | + /* add to zone->color_list[color] */ 407 | + memdbg(5, "- add pfn %ld (0x%08llx) to color_list[%d]\n", 408 | + page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 409 | + INIT_LIST_HEAD(&page[i].lru); 410 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 411 | + bitmap_set(zone->color_bitmap, color, 1); 412 | + zone->free_area[0].nr_free++; 413 | + rmv_page_order(&page[i]); 414 | + } 415 | + memdbg(4, "add order=%d zone=%s\n", order, zone->name); 416 | +} 417 | + 418 | +/* return a colored page (order-0) and remove it from the colored cache */ 419 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), 420 | + int order, 421 | + struct palloc_stat *stat) 422 | +{ 423 | + struct page *page; 424 | + COLOR_BITMAP(tmpmask); 425 | + int c; 426 | + unsigned int tmp_idx; 427 | + int found_w, want_w; 428 | + unsigned long rand_seed; 429 | + /* cache statistics */ 430 | + if (stat) stat->cache_acc_cnt++; 431 | + 432 | + /* find color cache entry */ 433 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 434 | + return NULL; 435 | + 436 | + bitmap_and(tmpmask, zone->color_bitmap, cmap, MAX_PALLOC_BINS); 437 | + 438 | + /* must have a balance.
*/ 439 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 440 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 441 | + if (sysctl_alloc_balance && 442 | + found_w < want_w && 443 | + found_w < min(sysctl_alloc_balance, want_w) && 444 | + memdbg_enable) 445 | + { 446 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 447 | + if (dur.tv64 < 1000000) { 448 | + /* try to balance unless order=MAX-2 or 1ms has passed */ 449 | + memdbg(4, "found_w=%d want_w=%d order=%d elapsed=%lld ns\n", 450 | + found_w, want_w, order, dur.tv64); 451 | + stat->alloc_balance++; 452 | + 453 | + return NULL; 454 | + } 455 | + stat->alloc_balance_timeout++; 456 | + } 457 | + 458 | + /* choose a bit among the candidates */ 459 | + if (sysctl_alloc_balance && memdbg_enable) { 460 | + rand_seed = (unsigned long)stat->start.tv64; 461 | + } else { 462 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 463 | + if (rand_seed > MAX_PALLOC_BINS) 464 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 465 | + } 466 | + 467 | + tmp_idx = rand_seed % found_w; 468 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 469 | + if (tmp_idx-- <= 0) 470 | + break; 471 | + } 472 | + 473 | + 474 | + BUG_ON(c >= MAX_PALLOC_BINS); 475 | + BUG_ON(list_empty(&zone->color_list[c])); 476 | + 477 | + page = list_entry(zone->color_list[c].next, struct page, lru); 478 | + 479 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 480 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 481 | + 482 | + /* remove from the zone->color_list[color] */ 483 | + list_del(&page->lru); 484 | + if (list_empty(&zone->color_list[c])) 485 | + bitmap_clear(zone->color_bitmap, c, 1); 486 | + zone->free_area[0].nr_free--; 487 | + 488 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", 489 | + page_to_pfn(page), c); 490 | + 491 | + if (stat) stat->cache_hit_cnt++; 492 | + return page; 493 | +} 494 | + 495 | +static inline void 496 | +update_stat(struct palloc_stat *stat, struct page 
*page, int iters) 497 | +{ 498 | + ktime_t dur; 499 | + 500 | + if (memdbg_enable == 0) 501 | + return; 502 | + 503 | + dur = ktime_sub(ktime_get(), stat->start); 504 | + 505 | + if(dur.tv64 > 0) { 506 | + stat->min_ns = min(dur.tv64, stat->min_ns); 507 | + stat->max_ns = max(dur.tv64, stat->max_ns); 508 | + 509 | + stat->tot_ns += dur.tv64; 510 | + stat->iter_cnt += iters; 511 | + 512 | + stat->tot_cnt++; 513 | + 514 | + memdbg(2, "order %ld pfn %ld(0x%08llx) color %d iters %d in %lld ns\n", 515 | + page_order(page), page_to_pfn(page), (u64)page_to_phys(page), 516 | + (int)page_to_color(page), 517 | + iters, dur.tv64); 518 | + } else { 519 | + memdbg(5, "dur %lld is < 0\n", dur.tv64); 520 | + } 521 | +} 522 | + 523 | /* 524 | * Go through the free lists for the given migratetype and remove 525 | * the smallest available page from the freelists 526 | */ 527 | static inline 528 | struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 529 | + int migratetype) 530 | +{ 531 | + unsigned int current_order; 532 | + struct free_area *area; 533 | + struct list_head *curr, *tmp; 534 | + struct page *page; 535 | + 536 | + struct palloc *ph; 537 | + struct palloc_stat *c_stat = &palloc.stat[0]; 538 | + struct palloc_stat *n_stat = &palloc.stat[1]; 539 | + struct palloc_stat *f_stat = &palloc.stat[2]; 540 | + int iters = 0; 541 | + COLOR_BITMAP(tmpcmap); 542 | + unsigned long *cmap; 543 | + 544 | + if (memdbg_enable) 545 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 546 | + 547 | + if (!use_palloc) 548 | + goto normal_buddy_alloc; 549 | + 550 | + /* cgroup information */ 551 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 552 | + if (ph && bitmap_weight(ph->cmap, MAX_PALLOC_BINS) > 0) 553 | + cmap = ph->cmap; 554 | + else { 555 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 556 | + cmap = tmpcmap; 557 | + } 558 | + 559 | + page = NULL; 560 | + if (order == 0) { 561 | + /* find in the cache */ 562 | + memdbg(5, "check color 
cache (mt=%d)\n", migratetype); 563 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 564 | + 565 | + if (page) { 566 | + update_stat(c_stat, page, iters); 567 | + return page; 568 | + } 569 | + } 570 | + 571 | + if (order == 0) { 572 | + /* build color cache */ 573 | + iters++; 574 | + /* search the entire list. make color cache in the process */ 575 | + for (current_order = 0; 576 | + current_order < MAX_ORDER; ++current_order) 577 | + { 578 | + area = &(zone->free_area[current_order]); 579 | + if (list_empty(&area->free_list[migratetype])) 580 | + continue; 581 | + memdbg(3, " order=%d (nr_free=%ld)\n", 582 | + current_order, area->nr_free); 583 | + list_for_each_safe(curr, tmp, 584 | + &area->free_list[migratetype]) 585 | + { 586 | + iters++; 587 | + page = list_entry(curr, struct page, lru); 588 | + palloc_insert(zone, page, current_order); 589 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 590 | + if (page) { 591 | + update_stat(c_stat, page, iters); 592 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", 593 | + zone->name, 594 | + page_to_pfn(page)); 595 | + return page; 596 | + } 597 | + } 598 | + } 599 | + memdbg(1, "Failed to find a matching color\n"); 600 | + } else { 601 | + normal_buddy_alloc: 602 | + /* normal buddy */ 603 | + /* Find a page of the appropriate size in the preferred list */ 604 | + for (current_order = order; 605 | + current_order < MAX_ORDER; ++current_order) 606 | + { 607 | + area = &(zone->free_area[current_order]); 608 | + iters++; 609 | + if (list_empty(&area->free_list[migratetype])) 610 | + continue; 611 | + page = list_entry(area->free_list[migratetype].next, 612 | + struct page, lru); 613 | + 614 | + list_del(&page->lru); 615 | + rmv_page_order(page); 616 | + area->nr_free--; 617 | + expand(zone, page, order, 618 | + current_order, area, migratetype); 619 | + 620 | + update_stat(n_stat, page, iters); 621 | + return page; 622 | + } 623 | + } 624 | + /* no memory (color or normal) found in this zone */ 625 | + 
memdbg(1, "No memory in Zone %s: order %d mt %d\n", 626 | + zone->name, order, migratetype); 627 | + 628 | + return NULL; 629 | +} 630 | +#else /* !CONFIG_CGROUP_PALLOC */ 631 | + 632 | +static inline 633 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 634 | int migratetype) 635 | { 636 | unsigned int current_order; 637 | @@ -936,7 +1421,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 638 | 639 | return NULL; 640 | } 641 | - 642 | +#endif /* CONFIG_CGROUP_PALLOC */ 643 | 644 | /* 645 | * This array describes the order lists are fallen back to when 646 | @@ -1528,9 +2013,13 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, 647 | unsigned long flags; 648 | struct page *page; 649 | int cold = !!(gfp_flags & __GFP_COLD); 650 | + struct palloc * ph; 651 | + 652 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 653 | 654 | again: 655 | - if (likely(order == 0)) { 656 | + /* Skip PCP when physically-aware allocation is requested */ 657 | + if (likely(order == 0) && !ph) { 658 | struct per_cpu_pages *pcp; 659 | struct list_head *list; 660 | 661 | @@ -4096,6 +4585,15 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, 662 | static void __meminit zone_init_free_lists(struct zone *zone) 663 | { 664 | int order, t; 665 | + 666 | +#ifdef CONFIG_CGROUP_PALLOC 667 | + int c; 668 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 669 | + INIT_LIST_HEAD(&zone->color_list[c]); 670 | + } 671 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS); 672 | +#endif /* CONFIG_CGROUP_PALLOC */ 673 | + 674 | for_each_migratetype_order(order, t) { 675 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 676 | zone->free_area[order].nr_free = 0; 677 | @@ -6420,6 +6918,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 678 | return; 679 | zone = page_zone(pfn_to_page(pfn)); 680 | spin_lock_irqsave(&zone->lock, flags); 681 | +#ifdef CONFIG_CGROUP_PALLOC 682 | + 
palloc_flush(zone); 683 | +#endif 684 | pfn = start_pfn; 685 | while (pfn < end_pfn) { 686 | if (!pfn_valid(pfn)) { 687 | diff --git a/mm/palloc.c b/mm/palloc.c 688 | new file mode 100644 689 | index 0000000..b9acae1 690 | --- /dev/null 691 | +++ b/mm/palloc.c 692 | @@ -0,0 +1,166 @@ 693 | +/* 694 | + * kernel/palloc.c 695 | + * 696 | + * Physical driven User Space Allocator info for a set of tasks. 697 | + */ 698 | + 699 | +#include 700 | +#include 701 | +#include 702 | +#include 703 | +#include 704 | +#include 705 | +#include 706 | +#include 707 | +#include 708 | +#include 709 | + 710 | +/* 711 | + * Check if a page is compliant to the policy defined for the given vma 712 | + */ 713 | +#ifdef CONFIG_CGROUP_PALLOC 714 | + 715 | +#define MAX_LINE_LEN (6*128) 716 | +/* 717 | + * Types of files in a palloc group 718 | + * FILE_PALLOC - contain list of palloc bins allowed 719 | +*/ 720 | +typedef enum { 721 | + FILE_PALLOC, 722 | +} palloc_filetype_t; 723 | + 724 | +/* 725 | + * Top level palloc - mask initialized to zero implying no restriction on 726 | + * physical pages 727 | +*/ 728 | + 729 | +static struct palloc top_palloc; 730 | + 731 | +/* Retrieve the palloc group corresponding to this cgroup container */ 732 | +struct palloc *cgroup_ph(struct cgroup *cgrp) 733 | +{ 734 | + return container_of(cgrp->subsys[palloc_cgrp_id], 735 | + struct palloc, css); 736 | +} 737 | + 738 | +struct palloc * ph_from_subsys(struct cgroup_subsys_state * subsys) 739 | +{ 740 | + return container_of(subsys, struct palloc, css); 741 | +} 742 | + 743 | +/* 744 | + * Common write function for files in palloc cgroup 745 | + */ 746 | +static int update_bitmask(unsigned long *bitmap, const char *buf, int maxbits) 747 | +{ 748 | + int retval = 0; 749 | + 750 | + if (!*buf) 751 | + bitmap_clear(bitmap, 0, maxbits); 752 | + else 753 | + retval = bitmap_parselist(buf, bitmap, maxbits); 754 | + 755 | + return retval; 756 | +} 757 | + 758 | + 759 | +static int palloc_file_write(struct 
cgroup_subsys_state *css, struct cftype *cft, 760 | + char *buffer) 761 | +{ 762 | + int retval = 0; 763 | + struct palloc *ph = container_of(css, struct palloc, css); 764 | + 765 | + switch (cft->private) { 766 | + case FILE_PALLOC: 767 | + retval = update_bitmask(ph->cmap, buffer, palloc_bins()); 768 | + printk(KERN_INFO "Bins : %s\n", buffer); 769 | + break; 770 | + default: 771 | + retval = -EINVAL; 772 | + break; 773 | + } 774 | + 775 | + return retval; 776 | +} 777 | + 778 | +static int palloc_file_read(struct seq_file *sf, void *v) 779 | +{ 780 | + struct cgroup_subsys_state *css = seq_css(sf); 781 | + struct cftype *cft = seq_cft(sf); 782 | + struct palloc *ph = container_of(css, struct palloc, css); 783 | + char *page; 784 | + ssize_t retval = 0; 785 | + char *s; 786 | + 787 | + if (!(page = (char *)__get_free_page(GFP_TEMPORARY|__GFP_ZERO))) 788 | + return -ENOMEM; 789 | + 790 | + s = page; 791 | + 792 | + switch (cft->private) { 793 | + case FILE_PALLOC: 794 | + s += bitmap_scnlistprintf(s, PAGE_SIZE, ph->cmap, palloc_bins()); 795 | + *s++ = '\n'; 796 | + printk(KERN_INFO "Bins : %s", page); 797 | + break; 798 | + default: 799 | + retval = -EINVAL; 800 | + goto out; 801 | + } 802 | + 803 | + retval = seq_printf(sf, "%s", page); 804 | +out: 805 | + free_page((unsigned long)page); 806 | + return retval; 807 | +} 808 | + 809 | + 810 | +/* 811 | + * struct cftype: handler definitions for cgroup control files 812 | + * 813 | + * for the common functions, 'private' gives the type of the file 814 | + */ 815 | +static struct cftype files[] = { 816 | + { 817 | + .name = "bins", 818 | + .seq_show = palloc_file_read, 819 | + .write_string = palloc_file_write, 820 | + .max_write_len = MAX_LINE_LEN, 821 | + .private = FILE_PALLOC, 822 | + }, 823 | + { } /* terminate */ 824 | +}; 825 | + 826 | +/* 827 | + * palloc_create - create a palloc group 828 | + */ 829 | +static struct cgroup_subsys_state *palloc_create(struct cgroup_subsys_state *css) 830 | +{ 831 | + struct 
palloc * ph_child; 832 | + ph_child = kmalloc(sizeof(struct palloc), GFP_KERNEL); 833 | + if(!ph_child) 834 | + return ERR_PTR(-ENOMEM); 835 | + 836 | + bitmap_clear(ph_child->cmap, 0, MAX_PALLOC_BINS); 837 | + return &ph_child->css; 838 | +} 839 | + 840 | + 841 | +/* 842 | + * Destroy an existing palloc group 843 | + */ 844 | +static void palloc_destroy(struct cgroup_subsys_state *css) 845 | +{ 846 | + struct palloc *ph = container_of(css, struct palloc, css); 847 | + kfree(ph); 848 | +} 849 | + 850 | +struct cgroup_subsys palloc_cgrp_subsys = { 851 | + .name = "palloc", 852 | + .css_alloc = palloc_create, 853 | + .css_free = palloc_destroy, 854 | + .id = palloc_cgrp_id, 855 | + .base_cftypes = files, 856 | +}; 857 | + 858 | +#endif /* CONFIG_CGROUP_PALLOC */ 859 | diff --git a/mm/vmstat.c b/mm/vmstat.c 860 | index 302dd07..b5753cd 100644 861 | --- a/mm/vmstat.c 862 | +++ b/mm/vmstat.c 863 | @@ -23,6 +23,8 @@ 864 | 865 | #include "internal.h" 866 | 867 | +#include <linux/palloc.h> 868 | + 869 | #ifdef CONFIG_VM_EVENT_COUNTERS 870 | DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; 871 | EXPORT_PER_CPU_SYMBOL(vm_event_states); 872 | @@ -876,6 +878,38 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 873 | struct zone *zone) 874 | { 875 | int order; 876 | +#ifdef CONFIG_CGROUP_PALLOC 877 | +#include <linux/palloc.h> 878 | + int color, mt; 879 | + int cnt, bins; 880 | + struct free_area *area; 881 | + struct list_head *curr; 882 | + 883 | + seq_printf(m, "-------\n"); 884 | + /* order by memory type */ 885 | + for (mt = 0; mt < MIGRATE_ISOLATE; mt++) { 886 | + seq_printf(m, "- %17s[%d]", "mt", mt); 887 | + for (order = 0; order < MAX_ORDER; order++) { 888 | + area = &(zone->free_area[order]); 889 | + cnt = 0; 890 | + list_for_each(curr, &area->free_list[mt]) 891 | + cnt++; 892 | + seq_printf(m, "%6d ", cnt); 893 | + } 894 | + seq_printf(m, "\n"); 895 | + } 896 | + /* order by color */ 897 | + seq_printf(m, "-------\n"); 898 | + bins = palloc_bins(); 899 | + 900 | +
for (color = 0; color < bins; color++) { 901 | + seq_printf(m, "- color [%d:%0x]", color, color); 902 | + cnt = 0; 903 | + list_for_each(curr, &zone->color_list[color]) 904 | + cnt++; 905 | + seq_printf(m, "%6d\n", cnt); 906 | + } 907 | +#endif /* !CONFIG_CGROUP_PALLOC */ 908 | 909 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 910 | for (order = 0; order < MAX_ORDER; ++order) 911 | @@ -883,6 +917,7 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 912 | seq_putc(m, '\n'); 913 | } 914 | 915 | + 916 | /* 917 | * This walks the free areas for each zone. 918 | */ 919 | -------------------------------------------------------------------------------- /palloc-4.14.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h 2 | index 1a96fda..9235fae 100644 3 | --- a/include/linux/cgroup_subsys.h 4 | +++ b/include/linux/cgroup_subsys.h 5 | @@ -84,3 +84,7 @@ SUBSYS(debug) 6 | /* 7 | * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS. 
8 | */ 9 | + 10 | +#if IS_ENABLED(CONFIG_CGROUP_PALLOC) 11 | +SUBSYS(palloc) 12 | +#endif 13 | diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h 14 | index 2b79965..f9fe950 100644 15 | --- a/include/linux/mmzone.h 16 | +++ b/include/linux/mmzone.h 17 | @@ -69,6 +69,14 @@ enum { 18 | # define is_migrate_cma(migratetype) false 19 | #endif 20 | 21 | +#ifdef CONFIG_CGROUP_PALLOC 22 | +/* Determine the number of bins according to the bits required for 23 | + each component of the address */ 24 | +#define MAX_PALLOC_BITS 8 25 | +#define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 26 | +#define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 27 | +#endif 28 | + 29 | #define for_each_migratetype_order(order, type) \ 30 | for (order = 0; order < MAX_ORDER; order++) \ 31 | for (type = 0; type < MIGRATE_TYPES; type++) 32 | @@ -478,6 +486,14 @@ struct zone { 33 | /* free areas of different sizes */ 34 | struct free_area free_area[MAX_ORDER]; 35 | 36 | +#ifdef CONFIG_CGROUP_PALLOC 37 | + /* 38 | + * Color page cache for movable type free pages of order-0 39 | + */ 40 | + struct list_head color_list[MAX_PALLOC_BINS]; 41 | + COLOR_BITMAP(color_bitmap); 42 | +#endif 43 | + 44 | /* zone flags, see below */ 45 | unsigned long flags; 46 | 47 | diff --git a/include/linux/palloc.h b/include/linux/palloc.h 48 | new file mode 100644 49 | index 0000000..7236e31 50 | --- /dev/null 51 | +++ b/include/linux/palloc.h 52 | @@ -0,0 +1,33 @@ 53 | +#ifndef _LINUX_PALLOC_H 54 | +#define _LINUX_PALLOC_H 55 | + 56 | +/* 57 | + * kernel/palloc.h 58 | + * 59 | + * Physical Memory Aware Allocator 60 | + */ 61 | + 62 | +#include 63 | +#include 64 | +#include 65 | +#include 66 | + 67 | +#ifdef CONFIG_CGROUP_PALLOC 68 | + 69 | +struct palloc { 70 | + struct cgroup_subsys_state css; 71 | + COLOR_BITMAP(cmap); 72 | +}; 73 | + 74 | +/* Retrieve the palloc group corresponding to this cgroup container */ 75 | +struct palloc *cgroup_ph(struct cgroup *cgrp); 76 | + 77 | +/* Retrieve the palloc 
group corresponding to this subsys */ 78 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys); 79 | + 80 | +/* Return number of palloc bins */ 81 | +int palloc_bins(void); 82 | + 83 | +#endif /* CONFIG_CGROUP_PALLOC */ 84 | + 85 | +#endif /* _LINUX_PALLOC_H */ 86 | diff --git a/init/Kconfig b/init/Kconfig 87 | index 8221272..fdb1ff7 100644 88 | --- a/init/Kconfig 89 | +++ b/init/Kconfig 90 | @@ -1160,6 +1160,13 @@ config CGROUP_WRITEBACK 91 | depends on MEMCG && BLK_CGROUP 92 | default y 93 | 94 | +config CGROUP_PALLOC 95 | + bool "Enable PALLOC" 96 | + help 97 | + Enable PALLOC. PALLOC is a color-aware page-based physical memory 98 | + allocator which replaces the buddy allocator for order-zero page 99 | + allocations. 100 | + 101 | endif # CGROUPS 102 | 103 | config CHECKPOINT_RESTORE 104 | diff --git a/mm/Makefile b/mm/Makefile 105 | index e3a53f5..a830289 100644 106 | --- a/mm/Makefile 107 | +++ b/mm/Makefile 108 | @@ -74,6 +74,7 @@ obj-$(CONFIG_ZPOOL) += zpool.o 109 | obj-$(CONFIG_ZBUD) += zbud.o 110 | obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 111 | obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 112 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 113 | obj-$(CONFIG_CMA) += cma.o 114 | obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 115 | obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o 116 | diff --git a/mm/page_alloc.c b/mm/page_alloc.c 117 | index 023e12c6e5cb..bb2a9123b6be 100644 118 | --- a/mm/page_alloc.c 119 | +++ b/mm/page_alloc.c 120 | @@ -69,12 +69,203 @@ 121 | #include 122 | 123 | #include 124 | +#include 125 | #include 126 | #include 127 | #include "internal.h" 128 | 129 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 130 | static DEFINE_MUTEX(pcp_batch_high_lock); 131 | + 132 | +#ifdef CONFIG_CGROUP_PALLOC 133 | +#include 134 | + 135 | +int memdbg_enable = 0; 136 | +EXPORT_SYMBOL(memdbg_enable); 137 | + 138 | +static int sysctl_alloc_balance = 0; 139 | + 140 | +/* PALLOC address bitmask */ 141 | +static 
unsigned long sysctl_palloc_mask = 0x0; 142 | + 143 | +static int mc_xor_bits[64]; 144 | +static int use_mc_xor = 0; 145 | +static int use_palloc = 0; 146 | + 147 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 148 | + 149 | +#define memdbg(lvl, fmt, ...) \ 150 | + do { \ 151 | + if(memdbg_enable >= lvl) \ 152 | + trace_printk(fmt, ##__VA_ARGS__); \ 153 | + } while(0) 154 | + 155 | +struct palloc_stat { 156 | + s64 max_ns; 157 | + s64 min_ns; 158 | + s64 tot_ns; 159 | + 160 | + s64 tot_cnt; 161 | + s64 iter_cnt; /* avg_iter = iter_cnt / tot_cnt */ 162 | + 163 | + s64 cache_hit_cnt; /* hit_rate = cache_hit_cnt / cache_acc_cnt */ 164 | + s64 cache_acc_cnt; 165 | + 166 | + s64 flush_cnt; 167 | + 168 | + s64 alloc_balance; 169 | + s64 alloc_balance_timeout; 170 | + ktime_t start; /* Start time of the current iteration */ 171 | +}; 172 | + 173 | +static struct { 174 | + u32 enabled; 175 | + int colors; 176 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2- fail */ 177 | +} palloc; 178 | + 179 | +static void palloc_flush(struct zone *zone); 180 | + 181 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) 182 | +{ 183 | + char buf[64]; 184 | + int i; 185 | + 186 | + if (cnt > 63) cnt = 63; 187 | + if (copy_from_user(&buf, ubuf, cnt)) 188 | + return -EFAULT; 189 | + 190 | + if (!strncmp(buf, "reset", 5)) { 191 | + printk(KERN_INFO "reset statistics...\n"); 192 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 193 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 194 | + palloc.stat[i].min_ns = 0x7fffffff; 195 | + } 196 | + } else if (!strncmp(buf, "flush", 5)) { 197 | + struct zone *zone; 198 | + printk(KERN_INFO "flush color cache...\n"); 199 | + for_each_populated_zone(zone) { 200 | + unsigned long flags; 201 | + if (!zone) 202 | + continue; 203 | + spin_lock_irqsave(&zone->lock, flags); 204 | + palloc_flush(zone); 205 | + spin_unlock_irqrestore(&zone->lock, flags); 206 | + } 207 | + } else if 
(!strncmp(buf, "xor", 3)) { 208 | + int bit, xor_bit; 209 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 210 | + if ((bit > 0 && bit < 64) && (xor_bit > 0 && xor_bit < 64) && bit != xor_bit) { 211 | + mc_xor_bits[bit] = xor_bit; 212 | + } 213 | + } 214 | + 215 | + *ppos += cnt; 216 | + 217 | + return cnt; 218 | +} 219 | + 220 | +static int palloc_show(struct seq_file *m, void *v) 221 | +{ 222 | + int i, tmp; 223 | + char *desc[] = { "Color", "Normal", "Fail" }; 224 | + char buf[256]; 225 | + 226 | + for (i = 0; i < 3; i++) { 227 | + struct palloc_stat *stat = &palloc.stat[i]; 228 | + seq_printf(m, "statistics %s:\n", desc[i]); 229 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 230 | + stat->min_ns, 231 | + stat->max_ns, 232 | + (stat->tot_cnt)? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 233 | + stat->tot_cnt); 234 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 235 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 236 | + (stat->cache_acc_cnt)? div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 237 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 238 | + (stat->tot_cnt)? div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 239 | + stat->iter_cnt, stat->tot_cnt); 240 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 241 | + 242 | + seq_printf(m, " balance: %lld | fail: %lld\n", 243 | + stat->alloc_balance, stat->alloc_balance_timeout); 244 | + } 245 | + 246 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 247 | + 248 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 249 | + 250 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, (1 << tmp)); 251 | + 252 | + scnprintf(buf, 256, "%*pbl", (int)(sizeof(unsigned long) * 8), &sysctl_palloc_mask); 253 | + 254 | + seq_printf(m, "bits: %s\n", buf); 255 | + 256 | + seq_printf(m, "XOR bits: %s\n", (use_mc_xor)? 
"enabled" : "disabled"); 257 | + 258 | + for (i = 0; i < 64; i++) { 259 | + if (mc_xor_bits[i] > 0) 260 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 261 | + } 262 | + 263 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc)? "enabled" : "disabled"); 264 | + 265 | + return 0; 266 | +} 267 | + 268 | +static int palloc_open(struct inode *inode, struct file *filp) 269 | +{ 270 | + return single_open(filp, palloc_show, NULL); 271 | +} 272 | + 273 | +static const struct file_operations palloc_fops = { 274 | + .open = palloc_open, 275 | + .write = palloc_write, 276 | + .read = seq_read, 277 | + .llseek = seq_lseek, 278 | + .release = single_release, 279 | +}; 280 | + 281 | +static int __init palloc_debugfs(void) 282 | +{ 283 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 284 | + struct dentry *dir; 285 | + int i; 286 | + 287 | + dir = debugfs_create_dir("palloc", NULL); 288 | + 289 | + /* Statistics Initialization */ 290 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 291 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 292 | + palloc.stat[i].min_ns = 0x7fffffff; 293 | + } 294 | + 295 | + if (!dir) 296 | + return PTR_ERR(dir); 297 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 298 | + goto fail; 299 | + if (!debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask)) 300 | + goto fail; 301 | + if (!debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor)) 302 | + goto fail; 303 | + if (!debugfs_create_u32("use_palloc", mode, dir, &use_palloc)) 304 | + goto fail; 305 | + if (!debugfs_create_u32("debug_level", mode, dir, &memdbg_enable)) 306 | + goto fail; 307 | + if (!debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance)) 308 | + goto fail; 309 | + 310 | + return 0; 311 | + 312 | +fail: 313 | + debugfs_remove_recursive(dir); 314 | + return -ENOMEM; 315 | +} 316 | + 317 | +late_initcall(palloc_debugfs); 318 | + 319 | +#endif /* CONFIG_CGROUP_PALLOC */ 320 | + 321 | #define 
MIN_PERCPU_PAGELIST_FRACTION (8) 322 | 323 | #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 324 | @@ -1795,6 +1986,323 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags 325 | clear_page_pfmemalloc(page); 326 | } 327 | 328 | +#ifdef CONFIG_CGROUP_PALLOC 329 | + 330 | +int palloc_bins(void) 331 | +{ 332 | + return min((1 << bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long) * 8)), MAX_PALLOC_BINS); 333 | +} 334 | + 335 | +static inline int page_to_color(struct page *page) 336 | +{ 337 | + int color = 0; 338 | + int idx = 0; 339 | + int c; 340 | + 341 | + unsigned long paddr = page_to_phys(page); 342 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 343 | + if (use_mc_xor) { 344 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 345 | + color |= (1 << idx); 346 | + } else { 347 | + if ((paddr >> c) & 0x1) 348 | + color |= (1 << idx); 349 | + } 350 | + 351 | + idx++; 352 | + } 353 | + 354 | + return color; 355 | +} 356 | + 357 | +/* Debug */ 358 | +static inline unsigned long list_count(struct list_head *head) 359 | +{ 360 | + unsigned long n = 0; 361 | + struct list_head *curr; 362 | + 363 | + list_for_each(curr, head) 364 | + n++; 365 | + 366 | + return n; 367 | +} 368 | + 369 | +/* Move all color_list pages into free_area[0].freelist[2] 370 | + * zone->lock must be held before calling this function 371 | + */ 372 | +static void palloc_flush(struct zone *zone) 373 | +{ 374 | + int c; 375 | + struct page *page; 376 | + 377 | + memdbg(2, "Flush the color-cache for zone %s\n", zone->name); 378 | + 379 | + while(1) { 380 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 381 | + if (!list_empty(&zone->color_list[c])) { 382 | + page = list_entry(zone->color_list[c].next, struct page, lru); 383 | + list_del_init(&page->lru); 384 | + __free_one_page(page, page_to_pfn(page), zone, 0, get_pageblock_migratetype(page)); 385 | + zone->free_area[0].nr_free--; 386 | + } 387 | + 388 | + if 
(list_empty(&zone->color_list[c])) { 389 | + bitmap_clear(zone->color_bitmap, c, 1); 390 | + INIT_LIST_HEAD(&zone->color_list[c]); 391 | + } 392 | + } 393 | + 394 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 395 | + break; 396 | + } 397 | +} 398 | + 399 | +/* Move a page (size = 1 << order) into order-0 colored cache */ 400 | +static void palloc_insert(struct zone *zone, struct page *page, int order) 401 | +{ 402 | + int i, color; 403 | + 404 | + /* 1 page (2^order) -> 2^order x pages of colored cache. 405 | + Remove from zone->free_area[order].free_list[mt] */ 406 | + list_del(&page->lru); 407 | + zone->free_area[order].nr_free--; 408 | + 409 | + /* Insert pages to zone->color_list[] (all order-0) */ 410 | + for (i = 0; i < (1 << order); i++) { 411 | + color = page_to_color(&page[i]); 412 | + 413 | + /* Add to zone->color_list[color] */ 414 | + memdbg(5, "- Add pfn %ld (0x%08llx) to color_list[%d]\n", page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 415 | + 416 | + INIT_LIST_HEAD(&page[i].lru); 417 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 418 | + bitmap_set(zone->color_bitmap, color, 1); 419 | + zone->free_area[0].nr_free++; 420 | + rmv_page_order(&page[i]); 421 | + } 422 | + 423 | + memdbg(4, "Add order=%d zone=%s\n", order, zone->name); 424 | + 425 | + return; 426 | +} 427 | + 428 | +/* Return a colored page (order-0) and remove it from the colored cache */ 429 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), int order, struct palloc_stat *stat) 430 | +{ 431 | + struct page *page; 432 | + COLOR_BITMAP(tmpmask); 433 | + int c; 434 | + unsigned int tmp_idx; 435 | + int found_w, want_w; 436 | + unsigned long rand_seed; 437 | + 438 | + /* Cache Statistics */ 439 | + if (stat) stat->cache_acc_cnt++; 440 | + 441 | + /* Find color cache entry */ 442 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 443 | + return NULL; 444 | + 445 | + bitmap_and(tmpmask, 
zone->color_bitmap, cmap, MAX_PALLOC_BINS); 446 | + 447 | + /* Must have a balance */ 448 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 449 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 450 | + 451 | + if (sysctl_alloc_balance && (found_w < want_w) && (found_w < min(sysctl_alloc_balance, want_w)) && memdbg_enable) { 452 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 453 | + if (dur < 1000000) { 454 | + /* Try to balance unless order = MAX-2 or 1ms has passed */ 455 | + memdbg(4, "found_w=%d want_w=%d order=%d elapsed=%lld ns\n", found_w, want_w, order, dur); 456 | + 457 | + stat->alloc_balance++; 458 | + 459 | + return NULL; 460 | + } 461 | + 462 | + stat->alloc_balance_timeout++; 463 | + } 464 | + 465 | + /* Choose a bit among the candidates */ 466 | + if (sysctl_alloc_balance && memdbg_enable) { 467 | + rand_seed = (unsigned long)stat->start; 468 | + } else { 469 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 470 | + 471 | + if (rand_seed > MAX_PALLOC_BINS) 472 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 473 | + } 474 | + 475 | + tmp_idx = rand_seed % found_w; 476 | + 477 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 478 | + if (tmp_idx-- <= 0) 479 | + break; 480 | + } 481 | + 482 | + BUG_ON(c >= MAX_PALLOC_BINS); 483 | + BUG_ON(list_empty(&zone->color_list[c])); 484 | + 485 | + page = list_entry(zone->color_list[c].next, struct page, lru); 486 | + 487 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 488 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 489 | + 490 | + /* Remove the page from the zone->color_list[color] */ 491 | + list_del(&page->lru); 492 | + 493 | + if (list_empty(&zone->color_list[c])) 494 | + bitmap_clear(zone->color_bitmap, c, 1); 495 | + 496 | + zone->free_area[0].nr_free--; 497 | + 498 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", page_to_pfn(page), c); 499 | + 500 | + if (stat) stat->cache_hit_cnt++; 501 | + 502 | + return page; 503 | +} 
504 | + 505 | +static inline void update_stat(struct palloc_stat *stat, struct page *page, int iters) 506 | +{ 507 | + ktime_t dur; 508 | + 509 | + if (memdbg_enable == 0) 510 | + return; 511 | + 512 | + dur = ktime_sub(ktime_get(), stat->start); 513 | + 514 | + if (dur > 0) { 515 | + stat->min_ns = min(dur, stat->min_ns); 516 | + stat->max_ns = max(dur, stat->max_ns); 517 | + 518 | + stat->tot_ns += dur; 519 | + stat->iter_cnt += iters; 520 | + 521 | + stat->tot_cnt++; 522 | + 523 | + memdbg(2, "order %ld pfn %ld (0x%08llx) color %d iters %d in %lld ns\n", 524 | + (long int)page_order(page), (long int)page_to_pfn(page), (u64)page_to_phys(page), 525 | + (int)page_to_color(page), iters, dur); 526 | + } else { 527 | + memdbg(5, "dur %lld is < 0\n", dur); 528 | + } 529 | + 530 | + return; 531 | +} 532 | + 533 | +/* 534 | + * Go through the free lists for the given migratetype and remove 535 | + * the smallest available page from the freelists 536 | + */ 537 | +static inline 538 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 539 | + int migratetype) 540 | +{ 541 | + unsigned int current_order; 542 | + struct free_area *area; 543 | + struct list_head *curr, *tmp; 544 | + struct page *page; 545 | + 546 | + struct palloc *ph; 547 | + struct palloc_stat *c_stat = &palloc.stat[0]; 548 | + struct palloc_stat *n_stat = &palloc.stat[1]; 549 | + struct palloc_stat *f_stat = &palloc.stat[2]; 550 | + 551 | + int iters = 0; 552 | + COLOR_BITMAP(tmpcmap); 553 | + unsigned long *cmap; 554 | + 555 | + if (memdbg_enable) 556 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 557 | + 558 | + if (!use_palloc) 559 | + goto normal_buddy_alloc; 560 | + 561 | + /* cgroup information */ 562 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 563 | + 564 | + if (ph && bitmap_weight(ph->cmap, MAX_PALLOC_BINS) > 0) 565 | + cmap = ph->cmap; 566 | + else { 567 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 568 | + cmap = tmpcmap; 569 | + } 570 | 
+ 571 | + page = NULL; 572 | + if (order == 0) { 573 | + /* Find page in the color cache */ 574 | + memdbg(5, "check color cache (mt=%d)\n", migratetype); 575 | + 576 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 577 | + 578 | + if (page) { 579 | + update_stat(c_stat, page, iters); 580 | + return page; 581 | + } 582 | + } 583 | + 584 | + if (order == 0) { 585 | + /* Build Color Cache */ 586 | + iters++; 587 | + 588 | + /* Search the entire list. Make color cache in the process */ 589 | + for (current_order = 0; current_order < MAX_ORDER; ++current_order) { 590 | + area = &(zone->free_area[current_order]); 591 | + 592 | + if (list_empty(&area->free_list[migratetype])) 593 | + continue; 594 | + 595 | + memdbg(3, " order=%d (nr_free=%ld)\n", current_order, area->nr_free); 596 | + 597 | + list_for_each_safe(curr, tmp, &area->free_list[migratetype]) { 598 | + iters++; 599 | + page = list_entry(curr, struct page, lru); 600 | + palloc_insert(zone, page, current_order); 601 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 602 | + 603 | + if (page) { 604 | + update_stat(c_stat, page, iters); 605 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", zone->name, page_to_pfn(page)); 606 | + 607 | + return page; 608 | + } 609 | + } 610 | + } 611 | + 612 | + memdbg(1, "Failed to find a matching color\n"); 613 | + } else { 614 | +normal_buddy_alloc: 615 | + /* Normal Buddy Algorithm */ 616 | + /* Find a page of the specified size in the preferred list */ 617 | + for (current_order = order; current_order < MAX_ORDER; ++current_order) { 618 | + area = &(zone->free_area[current_order]); 619 | + iters++; 620 | + 621 | + if (list_empty(&area->free_list[migratetype])) 622 | + continue; 623 | + 624 | + page = list_entry(area->free_list[migratetype].next, struct page, lru); 625 | + 626 | + list_del(&page->lru); 627 | + rmv_page_order(page); 628 | + area->nr_free--; 629 | + expand(zone, page, order, current_order, area, migratetype); 630 | + 631 | + update_stat(n_stat, page, 
iters); 632 | + 633 | + return page; 634 | + } 635 | + } 636 | + 637 | + /* No memory (colored or normal) found in this zone */ 638 | + memdbg(1, "No memory in Zone %s: order %d mt %d\n", zone->name, order, migratetype); 639 | + 640 | + return NULL; 641 | +} 642 | + 643 | +#else /* CONFIG_CGROUP_PALLOC */ 644 | + 645 | /* 646 | * Go through the free lists for the given migratetype and remove 647 | * the smallest available page from the freelists 648 | @@ -1825,6 +2333,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 649 | return NULL; 650 | } 651 | 652 | +#endif /* CONFIG_CGROUP_PALLOC */ 653 | 654 | /* 655 | * This array describes the order lists are fallen back to when 656 | @@ -2812,8 +3321,11 @@ struct page *rmqueue(struct zone *preferred_zone, 657 | { 658 | unsigned long flags; 659 | struct page *page; 660 | + struct palloc *ph; 661 | 662 | - if (likely(order == 0)) { 663 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 664 | + /* Skip PCP when physical memory aware allocation is requested */ 665 | + if (likely(order == 0) && !ph) { 666 | page = rmqueue_pcplist(preferred_zone, zone, order, 667 | gfp_flags, migratetype); 668 | goto out; 669 | @@ -5358,6 +5870,17 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, 670 | static void __meminit zone_init_free_lists(struct zone *zone) 671 | { 672 | unsigned int order, t; 673 | + 674 | +#ifdef CONFIG_CGROUP_PALLOC 675 | + int c; 676 | + 677 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 678 | + INIT_LIST_HEAD(&zone->color_list[c]); 679 | + } 680 | + 681 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS); 682 | +#endif /* CONFIG_CGROUP_PALLOC */ 683 | + 684 | for_each_migratetype_order(order, t) { 685 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 686 | zone->free_area[order].nr_free = 0; 687 | @@ -7714,6 +8237,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 688 | offline_mem_sections(pfn, end_pfn); 689 
| zone = page_zone(pfn_to_page(pfn)); 690 | spin_lock_irqsave(&zone->lock, flags); 691 | + 692 | +#ifdef CONFIG_CGROUP_PALLOC 693 | + palloc_flush(zone); 694 | +#endif 695 | + 696 | pfn = start_pfn; 697 | while (pfn < end_pfn) { 698 | if (!pfn_valid(pfn)) { 699 | diff --git a/mm/palloc.c b/mm/palloc.c 700 | new file mode 100644 701 | index 0000000..bc6a341 702 | --- /dev/null 703 | +++ b/mm/palloc.c 704 | @@ -0,0 +1,173 @@ 705 | +/** 706 | + * kernel/palloc.c 707 | + * 708 | + * Color Aware Physical Memory Allocator User-Space Information 709 | + * 710 | + */ 711 | + 712 | +#include 713 | +#include 714 | +#include 715 | +#include 716 | +#include 717 | +#include 718 | +#include 719 | +#include 720 | +#include 721 | +#include 722 | + 723 | +/** 724 | + * Check if a page is compliant with the policy defined for the given vma 725 | + */ 726 | +#ifdef CONFIG_CGROUP_PALLOC 727 | + 728 | +#define MAX_LINE_LEN (6 * 128) 729 | + 730 | +/** 731 | + * Type of files in a palloc group 732 | + * FILE_PALLOC - contains list of palloc bins allowed 733 | + */ 734 | +typedef enum { 735 | + FILE_PALLOC, 736 | +} palloc_filetype_t; 737 | + 738 | +/** 739 | + * Retrieve the palloc group corresponding to this cgroup container 740 | + */ 741 | +struct palloc *cgroup_ph(struct cgroup *cgrp) 742 | +{ 743 | + return container_of(cgrp->subsys[palloc_cgrp_id], struct palloc, css); 744 | +} 745 | + 746 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys) 747 | +{ 748 | + return container_of(subsys, struct palloc, css); 749 | +} 750 | + 751 | +/** 752 | + * Common write function for files in palloc cgroup 753 | + */ 754 | +static int update_bitmask(unsigned long *bitmap, const char *buf, int maxbits) 755 | +{ 756 | + int retval = 0; 757 | + 758 | + if (!*buf) 759 | + bitmap_clear(bitmap, 0, maxbits); 760 | + else 761 | + retval = bitmap_parselist(buf, bitmap, maxbits); 762 | + 763 | + return retval; 764 | +} 765 | + 766 | +static ssize_t palloc_file_write(struct kernfs_open_file 
*of, char *buf, size_t nbytes, loff_t off) 767 | +{ 768 | + struct cgroup_subsys_state *css; 769 | + struct cftype *cft; 770 | + int retval = 0; 771 | + struct palloc *ph; 772 | + 773 | + css = of_css(of); 774 | + cft = of_cft(of); 775 | + ph = container_of(css, struct palloc, css); 776 | + 777 | + switch (cft->private) { 778 | + case FILE_PALLOC: 779 | + retval = update_bitmask(ph->cmap, buf, palloc_bins()); 780 | + printk(KERN_INFO "Bins : %s\n", buf); 781 | + break; 782 | + 783 | + default: 784 | + retval = -EINVAL; 785 | + break; 786 | + } 787 | + 788 | + return retval? :nbytes; 789 | +} 790 | + 791 | +static int palloc_file_read(struct seq_file *sf, void *v) 792 | +{ 793 | + struct cgroup_subsys_state *css = seq_css(sf); 794 | + struct cftype *cft = seq_cft(sf); 795 | + struct palloc *ph = container_of(css, struct palloc, css); 796 | + char *page; 797 | + ssize_t retval = 0; 798 | + char *s; 799 | + 800 | + if (!(page = (char *)__get_free_page( __GFP_ZERO))) 801 | + return -ENOMEM; 802 | + 803 | + s = page; 804 | + 805 | + switch (cft->private) { 806 | + case FILE_PALLOC: 807 | + s += scnprintf(s, PAGE_SIZE, "%*pbl", (int)palloc_bins(), ph->cmap); 808 | + *s++ = '\n'; 809 | + printk(KERN_INFO "Bins : %s", page); 810 | + break; 811 | + 812 | + default: 813 | + retval = -EINVAL; 814 | + goto out; 815 | + } 816 | + 817 | + seq_printf(sf, "%s", page); 818 | + 819 | +out: 820 | + free_page((unsigned long)page); 821 | + return retval; 822 | +} 823 | + 824 | +/** 825 | + * struct cftype : handler definitions for cgroup control files 826 | + * 827 | + * for the common functions, 'private' gives the type of the file 828 | + */ 829 | +static struct cftype files[] = { 830 | + { 831 | + .name = "bins", 832 | + .seq_show = palloc_file_read, 833 | + .write = palloc_file_write, 834 | + .max_write_len = MAX_LINE_LEN, 835 | + .private = FILE_PALLOC, 836 | + }, 837 | + {} 838 | +}; 839 | + 840 | + 841 | +/** 842 | + * palloc_create - create a palloc group 843 | + */ 844 | 
+static struct cgroup_subsys_state *palloc_create(struct cgroup_subsys_state *css) 845 | +{ 846 | + struct palloc *ph_child; 847 | + 848 | + ph_child = kmalloc(sizeof(struct palloc), GFP_KERNEL); 849 | + 850 | + if (!ph_child) 851 | + return ERR_PTR(-ENOMEM); 852 | + 853 | + bitmap_clear(ph_child->cmap, 0, MAX_PALLOC_BINS); 854 | + 855 | + return &ph_child->css; 856 | +} 857 | + 858 | +/** 859 | + * Destroy an existing palloc group 860 | + */ 861 | +static void palloc_destroy(struct cgroup_subsys_state *css) 862 | +{ 863 | + struct palloc *ph = container_of(css, struct palloc, css); 864 | + 865 | + kfree(ph); 866 | +} 867 | + 868 | +struct cgroup_subsys palloc_cgrp_subsys = { 869 | + .name = "palloc", 870 | + .css_alloc = palloc_create, 871 | + .css_free = palloc_destroy, 872 | + .id = palloc_cgrp_id, 873 | + .dfl_cftypes = files, 874 | + .legacy_cftypes = files, 875 | +}; 876 | + 877 | +#endif /* CONFIG_CGROUP_PALLOC */ 878 | diff --git a/mm/vmstat.c b/mm/vmstat.c 879 | index c54fd29..2c8e1d1 100644 880 | --- a/mm/vmstat.c 881 | +++ b/mm/vmstat.c 882 | @@ -28,6 +28,10 @@ 883 | #include 884 | #include 885 | 886 | +#ifdef CONFIG_CGROUP_PALLOC 887 | +#include 888 | +#endif 889 | + 890 | #include "internal.h" 891 | 892 | #ifdef CONFIG_VM_EVENT_COUNTERS 893 | @@ -937,6 +941,44 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 894 | { 895 | int order; 896 | 897 | +#ifdef CONFIG_CGROUP_PALLOC 898 | + int color, mt, cnt, bins; 899 | + struct free_area *area; 900 | + struct list_head *curr; 901 | + 902 | + seq_printf(m, "--------\n"); 903 | + 904 | + /* Order by memory type */ 905 | + for (mt = 0; mt < MIGRATE_ISOLATE; mt++) { 906 | + seq_printf(m, "-%17s[%d]", "mt", mt); 907 | + for (order = 0; order < MAX_ORDER; order++) { 908 | + area = &(zone->free_area[order]); 909 | + cnt = 0; 910 | + 911 | + list_for_each(curr, &area->free_list[mt]) 912 | + cnt++; 913 | + 914 | + seq_printf(m, "%6d ", cnt); 915 | + } 916 | + 917 | + seq_printf(m, "\n"); 918 | + } 
919 | + 920 | + /* Order by color */ 921 | + seq_printf(m, "--------\n"); 922 | + bins = palloc_bins(); 923 | + 924 | + for (color = 0; color < bins; color++) { 925 | + seq_printf(m, "- color [%d:%0x]", color, color); 926 | + cnt = 0; 927 | + 928 | + list_for_each(curr, &zone->color_list[color]) 929 | + cnt++; 930 | + 931 | + seq_printf(m, "%6d\n", cnt); 932 | + } 933 | +#endif /* CONFIG_CGROUP_PALLOC */ 934 | + 935 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 936 | for (order = 0; order < MAX_ORDER; ++order) 937 | seq_printf(m, "%6lu ", zone->free_area[order].nr_free); 938 | -------------------------------------------------------------------------------- /palloc-4.4.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h 2 | index 1a96fda..9235fae 100644 3 | --- a/include/linux/cgroup_subsys.h 4 | +++ b/include/linux/cgroup_subsys.h 5 | @@ -84,3 +84,7 @@ SUBSYS(debug) 6 | /* 7 | * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS. 
8 | */ 9 | + 10 | +#if IS_ENABLED(CONFIG_CGROUP_PALLOC) 11 | +SUBSYS(palloc) 12 | +#endif 13 | diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h 14 | index 2b79965..f9fe950 100644 15 | --- a/include/linux/mmzone.h 16 | +++ b/include/linux/mmzone.h 17 | @@ -69,6 +69,14 @@ enum { 18 | # define is_migrate_cma(migratetype) false 19 | #endif 20 | 21 | +#ifdef CONFIG_CGROUP_PALLOC 22 | +/* Determine the number of bins according to the bits required for 23 | + each component of the address */ 24 | +#define MAX_PALLOC_BITS 8 25 | +#define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 26 | +#define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 27 | +#endif 28 | + 29 | #define for_each_migratetype_order(order, type) \ 30 | for (order = 0; order < MAX_ORDER; order++) \ 31 | for (type = 0; type < MIGRATE_TYPES; type++) 32 | @@ -478,6 +486,14 @@ struct zone { 33 | /* free areas of different sizes */ 34 | struct free_area free_area[MAX_ORDER]; 35 | 36 | +#ifdef CONFIG_CGROUP_PALLOC 37 | + /* 38 | + * Color page cache for movable type free pages of order-0 39 | + */ 40 | + struct list_head color_list[MAX_PALLOC_BINS]; 41 | + COLOR_BITMAP(color_bitmap); 42 | +#endif 43 | + 44 | /* zone flags, see below */ 45 | unsigned long flags; 46 | 47 | diff --git a/include/linux/palloc.h b/include/linux/palloc.h 48 | new file mode 100644 49 | index 0000000..7236e31 50 | --- /dev/null 51 | +++ b/include/linux/palloc.h 52 | @@ -0,0 +1,33 @@ 53 | +#ifndef _LINUX_PALLOC_H 54 | +#define _LINUX_PALLOC_H 55 | + 56 | +/* 57 | + * kernel/palloc.h 58 | + * 59 | + * Physical Memory Aware Allocator 60 | + */ 61 | + 62 | +#include 63 | +#include 64 | +#include 65 | +#include 66 | + 67 | +#ifdef CONFIG_CGROUP_PALLOC 68 | + 69 | +struct palloc { 70 | + struct cgroup_subsys_state css; 71 | + COLOR_BITMAP(cmap); 72 | +}; 73 | + 74 | +/* Retrieve the palloc group corresponding to this cgroup container */ 75 | +struct palloc *cgroup_ph(struct cgroup *cgrp); 76 | + 77 | +/* Retrieve the palloc 
group corresponding to this subsys */ 78 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys); 79 | + 80 | +/* Return number of palloc bins */ 81 | +int palloc_bins(void); 82 | + 83 | +#endif /* CONFIG_CGROUP_PALLOC */ 84 | + 85 | +#endif /* _LINUX_PALLOC_H */ 86 | diff --git a/init/Kconfig b/init/Kconfig 87 | index 8221272..fdb1ff7 100644 88 | --- a/init/Kconfig 89 | +++ b/init/Kconfig 90 | @@ -1160,6 +1160,13 @@ config CGROUP_WRITEBACK 91 | depends on MEMCG && BLK_CGROUP 92 | default y 93 | 94 | +config CGROUP_PALLOC 95 | + bool "Enable PALLOC" 96 | + help 97 | + Enable PALLOC. PALLOC is a color-aware page-based physical memory 98 | + allocator which replaces the buddy allocator for order-zero page 99 | + allocations. 100 | + 101 | endif # CGROUPS 102 | 103 | config CHECKPOINT_RESTORE 104 | diff --git a/mm/Makefile b/mm/Makefile 105 | index e3a53f5..a830289 100644 106 | --- a/mm/Makefile 107 | +++ b/mm/Makefile 108 | @@ -74,6 +74,7 @@ obj-$(CONFIG_ZPOOL) += zpool.o 109 | obj-$(CONFIG_ZBUD) += zbud.o 110 | obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 111 | obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 112 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 113 | obj-$(CONFIG_CMA) += cma.o 114 | obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 115 | obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o 116 | diff --git a/mm/page_alloc.c b/mm/page_alloc.c 117 | index b0c3180..4a1bc06 100644 118 | --- a/mm/page_alloc.c 119 | +++ b/mm/page_alloc.c 120 | @@ -64,12 +64,203 @@ 121 | #include 122 | 123 | #include 124 | +#include 125 | #include 126 | #include 127 | #include "internal.h" 128 | 129 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 130 | static DEFINE_MUTEX(pcp_batch_high_lock); 131 | + 132 | +#ifdef CONFIG_CGROUP_PALLOC 133 | +#include 134 | + 135 | +int memdbg_enable = 0; 136 | +EXPORT_SYMBOL(memdbg_enable); 137 | + 138 | +static int sysctl_alloc_balance = 0; 139 | + 140 | +/* PALLOC address bitmask */ 141 | +static unsigned 
long sysctl_palloc_mask = 0x0; 142 | + 143 | +static int mc_xor_bits[64]; 144 | +static int use_mc_xor = 0; 145 | +static int use_palloc = 0; 146 | + 147 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 148 | + 149 | +#define memdbg(lvl, fmt, ...) \ 150 | + do { \ 151 | + if(memdbg_enable >= lvl) \ 152 | + trace_printk(fmt, ##__VA_ARGS__); \ 153 | + } while(0) 154 | + 155 | +struct palloc_stat { 156 | + s64 max_ns; 157 | + s64 min_ns; 158 | + s64 tot_ns; 159 | + 160 | + s64 tot_cnt; 161 | + s64 iter_cnt; /* avg_iter = iter_cnt / tot_cnt */ 162 | + 163 | + s64 cache_hit_cnt; /* hit_rate = cache_hit_cnt / cache_acc_cnt */ 164 | + s64 cache_acc_cnt; 165 | + 166 | + s64 flush_cnt; 167 | + 168 | + s64 alloc_balance; 169 | + s64 alloc_balance_timeout; 170 | + ktime_t start; /* Start time of the current iteration */ 171 | +}; 172 | + 173 | +static struct { 174 | + u32 enabled; 175 | + int colors; 176 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2- fail */ 177 | +} palloc; 178 | + 179 | +static void palloc_flush(struct zone *zone); 180 | + 181 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) 182 | +{ 183 | + char buf[64]; 184 | + int i; 185 | + 186 | + if (cnt > 63) cnt = 63; 187 | + if (copy_from_user(&buf, ubuf, cnt)) 188 | + return -EFAULT; 189 | + 190 | + if (!strncmp(buf, "reset", 5)) { 191 | + printk(KERN_INFO "reset statistics...\n"); 192 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 193 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 194 | + palloc.stat[i].min_ns = 0x7fffffff; 195 | + } 196 | + } else if (!strncmp(buf, "flush", 5)) { 197 | + struct zone *zone; 198 | + printk(KERN_INFO "flush color cache...\n"); 199 | + for_each_populated_zone(zone) { 200 | + unsigned long flags; 201 | + if (!zone) 202 | + continue; 203 | + spin_lock_irqsave(&zone->lock, flags); 204 | + palloc_flush(zone); 205 | + spin_unlock_irqrestore(&zone->lock, flags); 206 | + } 207 | + } else if 
(!strncmp(buf, "xor", 3)) { 208 | + int bit, xor_bit; 209 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 210 | + if ((bit > 0 && bit < 64) && (xor_bit > 0 && xor_bit < 64) && bit != xor_bit) { 211 | + mc_xor_bits[bit] = xor_bit; 212 | + } 213 | + } 214 | + 215 | + *ppos += cnt; 216 | + 217 | + return cnt; 218 | +} 219 | + 220 | +static int palloc_show(struct seq_file *m, void *v) 221 | +{ 222 | + int i, tmp; 223 | + char *desc[] = { "Color", "Normal", "Fail" }; 224 | + char buf[256]; 225 | + 226 | + for (i = 0; i < 3; i++) { 227 | + struct palloc_stat *stat = &palloc.stat[i]; 228 | + seq_printf(m, "statistics %s:\n", desc[i]); 229 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 230 | + stat->min_ns, 231 | + stat->max_ns, 232 | + (stat->tot_cnt)? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 233 | + stat->tot_cnt); 234 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 235 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 236 | + (stat->cache_acc_cnt)? div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 237 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 238 | + (stat->tot_cnt)? div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 239 | + stat->iter_cnt, stat->tot_cnt); 240 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 241 | + 242 | + seq_printf(m, " balance: %lld | fail: %lld\n", 243 | + stat->alloc_balance, stat->alloc_balance_timeout); 244 | + } 245 | + 246 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 247 | + 248 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 249 | + 250 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, (1 << tmp)); 251 | + 252 | + scnprintf(buf, 256, "%*pbl", (int)(sizeof(unsigned long) * 8), &sysctl_palloc_mask); 253 | + 254 | + seq_printf(m, "bits: %s\n", buf); 255 | + 256 | + seq_printf(m, "XOR bits: %s\n", (use_mc_xor)? 
"enabled" : "disabled"); 257 | + 258 | + for (i = 0; i < 64; i++) { 259 | + if (mc_xor_bits[i] > 0) 260 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 261 | + } 262 | + 263 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc)? "enabled" : "disabled"); 264 | + 265 | + return 0; 266 | +} 267 | + 268 | +static int palloc_open(struct inode *inode, struct file *filp) 269 | +{ 270 | + return single_open(filp, palloc_show, NULL); 271 | +} 272 | + 273 | +static const struct file_operations palloc_fops = { 274 | + .open = palloc_open, 275 | + .write = palloc_write, 276 | + .read = seq_read, 277 | + .llseek = seq_lseek, 278 | + .release = single_release, 279 | +}; 280 | + 281 | +static int __init palloc_debugfs(void) 282 | +{ 283 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 284 | + struct dentry *dir; 285 | + int i; 286 | + 287 | + dir = debugfs_create_dir("palloc", NULL); 288 | + 289 | + /* Statistics Initialization */ 290 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 291 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 292 | + palloc.stat[i].min_ns = 0x7fffffff; 293 | + } 294 | + 295 | + if (!dir) 296 | + return PTR_ERR(dir); 297 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 298 | + goto fail; 299 | + if (!debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask)) 300 | + goto fail; 301 | + if (!debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor)) 302 | + goto fail; 303 | + if (!debugfs_create_u32("use_palloc", mode, dir, &use_palloc)) 304 | + goto fail; 305 | + if (!debugfs_create_u32("debug_level", mode, dir, &memdbg_enable)) 306 | + goto fail; 307 | + if (!debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance)) 308 | + goto fail; 309 | + 310 | + return 0; 311 | + 312 | +fail: 313 | + debugfs_remove_recursive(dir); 314 | + return -ENOMEM; 315 | +} 316 | + 317 | +late_initcall(palloc_debugfs); 318 | + 319 | +#endif /* CONFIG_CGROUP_PALLOC */ 320 | + 321 | #define 
MIN_PERCPU_PAGELIST_FRACTION (8) 322 | 323 | #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 324 | @@ -1424,6 +1615,323 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags, 325 | return 0; 326 | } 327 | 328 | +#ifdef CONFIG_CGROUP_PALLOC 329 | + 330 | +int palloc_bins(void) 331 | +{ 332 | + return min((1 << bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long) * 8)), MAX_PALLOC_BINS); 333 | +} 334 | + 335 | +static inline int page_to_color(struct page *page) 336 | +{ 337 | + int color = 0; 338 | + int idx = 0; 339 | + int c; 340 | + 341 | + unsigned long paddr = page_to_phys(page); 342 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 343 | + if (use_mc_xor) { 344 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 345 | + color |= (1 << idx); 346 | + } else { 347 | + if ((paddr >> c) & 0x1) 348 | + color |= (1 << idx); 349 | + } 350 | + 351 | + idx++; 352 | + } 353 | + 354 | + return color; 355 | +} 356 | + 357 | +/* Debug */ 358 | +static inline unsigned long list_count(struct list_head *head) 359 | +{ 360 | + unsigned long n = 0; 361 | + struct list_head *curr; 362 | + 363 | + list_for_each(curr, head) 364 | + n++; 365 | + 366 | + return n; 367 | +} 368 | + 369 | +/* Move all color_list pages into free_area[0].freelist[2] 370 | + * zone->lock must be held before calling this function 371 | + */ 372 | +static void palloc_flush(struct zone *zone) 373 | +{ 374 | + int c; 375 | + struct page *page; 376 | + 377 | + memdbg(2, "Flush the color-cache for zone %s\n", zone->name); 378 | + 379 | + while(1) { 380 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 381 | + if (!list_empty(&zone->color_list[c])) { 382 | + page = list_entry(zone->color_list[c].next, struct page, lru); 383 | + list_del_init(&page->lru); 384 | + __free_one_page(page, page_to_pfn(page), zone, 0, get_pageblock_migratetype(page)); 385 | + zone->free_area[0].nr_free--; 386 | + } 387 | + 388 | + if (list_empty(&zone->color_list[c])) { 389 | + 
bitmap_clear(zone->color_bitmap, c, 1); 390 | + INIT_LIST_HEAD(&zone->color_list[c]); 391 | + } 392 | + } 393 | + 394 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 395 | + break; 396 | + } 397 | +} 398 | + 399 | +/* Move a page (size = 1 << order) into order-0 colored cache */ 400 | +static void palloc_insert(struct zone *zone, struct page *page, int order) 401 | +{ 402 | + int i, color; 403 | + 404 | + /* 1 page (2^order) -> 2^order x pages of colored cache. 405 | + Remove from zone->free_area[order].free_list[mt] */ 406 | + list_del(&page->lru); 407 | + zone->free_area[order].nr_free--; 408 | + 409 | + /* Insert pages to zone->color_list[] (all order-0) */ 410 | + for (i = 0; i < (1 << order); i++) { 411 | + color = page_to_color(&page[i]); 412 | + 413 | + /* Add to zone->color_list[color] */ 414 | + memdbg(5, "- Add pfn %ld (0x%08llx) to color_list[%d]\n", page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 415 | + 416 | + INIT_LIST_HEAD(&page[i].lru); 417 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 418 | + bitmap_set(zone->color_bitmap, color, 1); 419 | + zone->free_area[0].nr_free++; 420 | + rmv_page_order(&page[i]); 421 | + } 422 | + 423 | + memdbg(4, "Add order=%d zone=%s\n", order, zone->name); 424 | + 425 | + return; 426 | +} 427 | + 428 | +/* Return a colored page (order-0) and remove it from the colored cache */ 429 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), int order, struct palloc_stat *stat) 430 | +{ 431 | + struct page *page; 432 | + COLOR_BITMAP(tmpmask); 433 | + int c; 434 | + unsigned int tmp_idx; 435 | + int found_w, want_w; 436 | + unsigned long rand_seed; 437 | + 438 | + /* Cache Statistics */ 439 | + if (stat) stat->cache_acc_cnt++; 440 | + 441 | + /* Find color cache entry */ 442 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 443 | + return NULL; 444 | + 445 | + bitmap_and(tmpmask, zone->color_bitmap, cmap, MAX_PALLOC_BINS); 446 | + 
447 | + /* Must have a balance */ 448 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 449 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 450 | + 451 | + if (sysctl_alloc_balance && (found_w < want_w) && (found_w < min(sysctl_alloc_balance, want_w)) && memdbg_enable) { 452 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 453 | + if (dur.tv64 < 1000000) { 454 | + /* Try to balance unless order = MAX-2 or 1ms has passed */ 455 | + memdbg(4, "found_w=%d want_w=%d order=%d elapsed=%lld ns\n", found_w, want_w, order, dur.tv64); 456 | + 457 | + stat->alloc_balance++; 458 | + 459 | + return NULL; 460 | + } 461 | + 462 | + stat->alloc_balance_timeout++; 463 | + } 464 | + 465 | + /* Choose a bit among the candidates */ 466 | + if (sysctl_alloc_balance && memdbg_enable) { 467 | + rand_seed = (unsigned long)stat->start.tv64; 468 | + } else { 469 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 470 | + 471 | + if (rand_seed > MAX_PALLOC_BINS) 472 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 473 | + } 474 | + 475 | + tmp_idx = rand_seed % found_w; 476 | + 477 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 478 | + if (tmp_idx-- <= 0) 479 | + break; 480 | + } 481 | + 482 | + BUG_ON(c >= MAX_PALLOC_BINS); 483 | + BUG_ON(list_empty(&zone->color_list[c])); 484 | + 485 | + page = list_entry(zone->color_list[c].next, struct page, lru); 486 | + 487 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 488 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 489 | + 490 | + /* Remove the page from the zone->color_list[color] */ 491 | + list_del(&page->lru); 492 | + 493 | + if (list_empty(&zone->color_list[c])) 494 | + bitmap_clear(zone->color_bitmap, c, 1); 495 | + 496 | + zone->free_area[0].nr_free--; 497 | + 498 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", page_to_pfn(page), c); 499 | + 500 | + if (stat) stat->cache_hit_cnt++; 501 | + 502 | + return page; 503 | +} 504 | + 505 | +static inline void 
update_stat(struct palloc_stat *stat, struct page *page, int iters) 506 | +{ 507 | + ktime_t dur; 508 | + 509 | + if (memdbg_enable == 0) 510 | + return; 511 | + 512 | + dur = ktime_sub(ktime_get(), stat->start); 513 | + 514 | + if (dur.tv64 > 0) { 515 | + stat->min_ns = min(dur.tv64, stat->min_ns); 516 | + stat->max_ns = max(dur.tv64, stat->max_ns); 517 | + 518 | + stat->tot_ns += dur.tv64; 519 | + stat->iter_cnt += iters; 520 | + 521 | + stat->tot_cnt++; 522 | + 523 | + memdbg(2, "order %ld pfn %ld (0x%08llx) color %d iters %d in %lld ns\n", 524 | + (long int)page_order(page), (long int)page_to_pfn(page), (u64)page_to_phys(page), 525 | + (int)page_to_color(page), iters, dur.tv64); 526 | + } else { 527 | + memdbg(5, "dur %lld is < 0\n", dur.tv64); 528 | + } 529 | + 530 | + return; 531 | +} 532 | + 533 | +/* 534 | + * Go through the free lists for the given migratetype and remove 535 | + * the smallest available page from the freelists 536 | + */ 537 | +static inline 538 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 539 | + int migratetype) 540 | +{ 541 | + unsigned int current_order; 542 | + struct free_area *area; 543 | + struct list_head *curr, *tmp; 544 | + struct page *page; 545 | + 546 | + struct palloc *ph; 547 | + struct palloc_stat *c_stat = &palloc.stat[0]; 548 | + struct palloc_stat *n_stat = &palloc.stat[1]; 549 | + struct palloc_stat *f_stat = &palloc.stat[2]; 550 | + 551 | + int iters = 0; 552 | + COLOR_BITMAP(tmpcmap); 553 | + unsigned long *cmap; 554 | + 555 | + if (memdbg_enable) 556 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 557 | + 558 | + if (!use_palloc) 559 | + goto normal_buddy_alloc; 560 | + 561 | + /* cgroup information */ 562 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 563 | + 564 | + if (ph && bitmap_weight(ph->cmap, MAX_PALLOC_BINS) > 0) 565 | + cmap = ph->cmap; 566 | + else { 567 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 568 | + cmap = tmpcmap; 569 | + } 570 | + 
571 | + page = NULL; 572 | + if (order == 0) { 573 | + /* Find page in the color cache */ 574 | + memdbg(5, "check color cache (mt=%d)\n", migratetype); 575 | + 576 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 577 | + 578 | + if (page) { 579 | + update_stat(c_stat, page, iters); 580 | + return page; 581 | + } 582 | + } 583 | + 584 | + if (order == 0) { 585 | + /* Build Color Cache */ 586 | + iters++; 587 | + 588 | + /* Search the entire list. Make color cache in the process */ 589 | + for (current_order = 0; current_order < MAX_ORDER; ++current_order) { 590 | + area = &(zone->free_area[current_order]); 591 | + 592 | + if (list_empty(&area->free_list[migratetype])) 593 | + continue; 594 | + 595 | + memdbg(3, " order=%d (nr_free=%ld)\n", current_order, area->nr_free); 596 | + 597 | + list_for_each_safe(curr, tmp, &area->free_list[migratetype]) { 598 | + iters++; 599 | + page = list_entry(curr, struct page, lru); 600 | + palloc_insert(zone, page, current_order); 601 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 602 | + 603 | + if (page) { 604 | + update_stat(c_stat, page, iters); 605 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", zone->name, page_to_pfn(page)); 606 | + 607 | + return page; 608 | + } 609 | + } 610 | + } 611 | + 612 | + memdbg(1, "Failed to find a matching color\n"); 613 | + } else { 614 | +normal_buddy_alloc: 615 | + /* Normal Buddy Algorithm */ 616 | + /* Find a page of the specified size in the preferred list */ 617 | + for (current_order = order; current_order < MAX_ORDER; ++current_order) { 618 | + area = &(zone->free_area[current_order]); 619 | + iters++; 620 | + 621 | + if (list_empty(&area->free_list[migratetype])) 622 | + continue; 623 | + 624 | + page = list_entry(area->free_list[migratetype].next, struct page, lru); 625 | + 626 | + list_del(&page->lru); 627 | + rmv_page_order(page); 628 | + area->nr_free--; 629 | + expand(zone, page, order, current_order, area, migratetype); 630 | + 631 | + update_stat(n_stat, page, 
iters); 632 | + 633 | + return page; 634 | + } 635 | + } 636 | + 637 | + /* No memory (colored or normal) found in this zone */ 638 | + memdbg(1, "No memory in Zone %s: order %d mt %d\n", zone->name, order, migratetype); 639 | + 640 | + return NULL; 641 | +} 642 | + 643 | +#else /* CONFIG_CGROUP_PALLOC */ 644 | + 645 | /* 646 | * Go through the free lists for the given migratetype and remove 647 | * the smallest available page from the freelists 648 | @@ -1455,6 +1963,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 649 | return NULL; 650 | } 651 | 652 | +#endif /* CONFIG_CGROUP_PALLOC */ 653 | 654 | /* 655 | * This array describes the order lists are fallen back to when 656 | @@ -2218,8 +2727,12 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, 657 | unsigned long flags; 658 | struct page *page; 659 | bool cold = ((gfp_flags & __GFP_COLD) != 0); 660 | + struct palloc *ph; 661 | + 662 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 663 | 664 | - if (likely(order == 0)) { 665 | + /* Skip PCP when physical memory aware allocation is requested */ 666 | + if (likely(order == 0) && !ph) { 667 | struct per_cpu_pages *pcp; 668 | struct list_head *list; 669 | 670 | @@ -4742,6 +5255,17 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, 671 | static void __meminit zone_init_free_lists(struct zone *zone) 672 | { 673 | unsigned int order, t; 674 | + 675 | +#ifdef CONFIG_CGROUP_PALLOC 676 | + int c; 677 | + 678 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 679 | + INIT_LIST_HEAD(&zone->color_list[c]); 680 | + } 681 | + 682 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS); 683 | +#endif /* CONFIG_CGROUP_PALLOC */ 684 | + 685 | for_each_migratetype_order(order, t) { 686 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 687 | zone->free_area[order].nr_free = 0; 688 | @@ -7079,6 +7603,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 689 | return; 690 | 
zone = page_zone(pfn_to_page(pfn)); 691 | spin_lock_irqsave(&zone->lock, flags); 692 | + 693 | +#ifdef CONFIG_CGROUP_PALLOC 694 | + palloc_flush(zone); 695 | +#endif 696 | + 697 | pfn = start_pfn; 698 | while (pfn < end_pfn) { 699 | if (!pfn_valid(pfn)) { 700 | diff --git a/mm/palloc.c b/mm/palloc.c 701 | new file mode 100644 702 | index 0000000..bc6a341 703 | --- /dev/null 704 | +++ b/mm/palloc.c 705 | @@ -0,0 +1,173 @@ 706 | +/** 707 | + * kernel/palloc.c 708 | + * 709 | + * Color Aware Physical Memory Allocator User-Space Information 710 | + * 711 | + */ 712 | + 713 | +#include 714 | +#include 715 | +#include 716 | +#include 717 | +#include 718 | +#include 719 | +#include 720 | +#include 721 | +#include 722 | +#include 723 | + 724 | +/** 725 | + * Check if a page is compliant with the policy defined for the given vma 726 | + */ 727 | +#ifdef CONFIG_CGROUP_PALLOC 728 | + 729 | +#define MAX_LINE_LEN (6 * 128) 730 | + 731 | +/** 732 | + * Type of files in a palloc group 733 | + * FILE_PALLOC - contains list of palloc bins allowed 734 | + */ 735 | +typedef enum { 736 | + FILE_PALLOC, 737 | +} palloc_filetype_t; 738 | + 739 | +/** 740 | + * Retrieve the palloc group corresponding to this cgroup container 741 | + */ 742 | +struct palloc *cgroup_ph(struct cgroup *cgrp) 743 | +{ 744 | + return container_of(cgrp->subsys[palloc_cgrp_id], struct palloc, css); 745 | +} 746 | + 747 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys) 748 | +{ 749 | + return container_of(subsys, struct palloc, css); 750 | +} 751 | + 752 | +/** 753 | + * Common write function for files in palloc cgroup 754 | + */ 755 | +static int update_bitmask(unsigned long *bitmap, const char *buf, int maxbits) 756 | +{ 757 | + int retval = 0; 758 | + 759 | + if (!*buf) 760 | + bitmap_clear(bitmap, 0, maxbits); 761 | + else 762 | + retval = bitmap_parselist(buf, bitmap, maxbits); 763 | + 764 | + return retval; 765 | +} 766 | + 767 | +static ssize_t palloc_file_write(struct kernfs_open_file 
*of, char *buf, size_t nbytes, loff_t off) 768 | +{ 769 | + struct cgroup_subsys_state *css; 770 | + struct cftype *cft; 771 | + int retval = 0; 772 | + struct palloc *ph; 773 | + 774 | + css = of_css(of); 775 | + cft = of_cft(of); 776 | + ph = container_of(css, struct palloc, css); 777 | + 778 | + switch (cft->private) { 779 | + case FILE_PALLOC: 780 | + retval = update_bitmask(ph->cmap, buf, palloc_bins()); 781 | + printk(KERN_INFO "Bins : %s\n", buf); 782 | + break; 783 | + 784 | + default: 785 | + retval = -EINVAL; 786 | + break; 787 | + } 788 | + 789 | + return retval? :nbytes; 790 | +} 791 | + 792 | +static int palloc_file_read(struct seq_file *sf, void *v) 793 | +{ 794 | + struct cgroup_subsys_state *css = seq_css(sf); 795 | + struct cftype *cft = seq_cft(sf); 796 | + struct palloc *ph = container_of(css, struct palloc, css); 797 | + char *page; 798 | + ssize_t retval = 0; 799 | + char *s; 800 | + 801 | + if (!(page = (char *)__get_free_page(GFP_TEMPORARY | __GFP_ZERO))) 802 | + return -ENOMEM; 803 | + 804 | + s = page; 805 | + 806 | + switch (cft->private) { 807 | + case FILE_PALLOC: 808 | + s += scnprintf(s, PAGE_SIZE, "%*pbl", (int)palloc_bins(), ph->cmap); 809 | + *s++ = '\n'; 810 | + printk(KERN_INFO "Bins : %s", page); 811 | + break; 812 | + 813 | + default: 814 | + retval = -EINVAL; 815 | + goto out; 816 | + } 817 | + 818 | + seq_printf(sf, "%s", page); 819 | + 820 | +out: 821 | + free_page((unsigned long)page); 822 | + return retval; 823 | +} 824 | + 825 | +/** 826 | + * struct cftype : handler definitions for cgroup control files 827 | + * 828 | + * for the common functions, 'private' gives the type of the file 829 | + */ 830 | +static struct cftype files[] = { 831 | + { 832 | + .name = "bins", 833 | + .seq_show = palloc_file_read, 834 | + .write = palloc_file_write, 835 | + .max_write_len = MAX_LINE_LEN, 836 | + .private = FILE_PALLOC, 837 | + }, 838 | + {} 839 | +}; 840 | + 841 | + 842 | +/** 843 | + * palloc_create - create a palloc group 844 | + 
*/ 845 | +static struct cgroup_subsys_state *palloc_create(struct cgroup_subsys_state *css) 846 | +{ 847 | + struct palloc *ph_child; 848 | + 849 | + ph_child = kmalloc(sizeof(struct palloc), GFP_KERNEL); 850 | + 851 | + if (!ph_child) 852 | + return ERR_PTR(-ENOMEM); 853 | + 854 | + bitmap_clear(ph_child->cmap, 0, MAX_PALLOC_BINS); 855 | + 856 | + return &ph_child->css; 857 | +} 858 | + 859 | +/** 860 | + * Destroy an existing palloc group 861 | + */ 862 | +static void palloc_destroy(struct cgroup_subsys_state *css) 863 | +{ 864 | + struct palloc *ph = container_of(css, struct palloc, css); 865 | + 866 | + kfree(ph); 867 | +} 868 | + 869 | +struct cgroup_subsys palloc_cgrp_subsys = { 870 | + .name = "palloc", 871 | + .css_alloc = palloc_create, 872 | + .css_free = palloc_destroy, 873 | + .id = palloc_cgrp_id, 874 | + .dfl_cftypes = files, 875 | + .legacy_cftypes = files, 876 | +}; 877 | + 878 | +#endif /* CONFIG_CGROUP_PALLOC */ 879 | diff --git a/mm/vmstat.c b/mm/vmstat.c 880 | index c54fd29..2c8e1d1 100644 881 | --- a/mm/vmstat.c 882 | +++ b/mm/vmstat.c 883 | @@ -28,6 +28,10 @@ 884 | #include 885 | #include 886 | 887 | +#ifdef CONFIG_CGROUP_PALLOC 888 | +#include <linux/palloc.h> 889 | +#endif 890 | + 891 | #include "internal.h" 892 | 893 | #ifdef CONFIG_VM_EVENT_COUNTERS 894 | @@ -937,6 +941,44 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 895 | { 896 | int order; 897 | 898 | +#ifdef CONFIG_CGROUP_PALLOC 899 | + int color, mt, cnt, bins; 900 | + struct free_area *area; 901 | + struct list_head *curr; 902 | + 903 | + seq_printf(m, "--------\n"); 904 | + 905 | + /* Order by memory type */ 906 | + for (mt = 0; mt < MIGRATE_ISOLATE; mt++) { 907 | + seq_printf(m, "-%17s[%d]", "mt", mt); 908 | + for (order = 0; order < MAX_ORDER; order++) { 909 | + area = &(zone->free_area[order]); 910 | + cnt = 0; 911 | + 912 | + list_for_each(curr, &area->free_list[mt]) 913 | + cnt++; 914 | + 915 | + seq_printf(m, "%6d ", cnt); 916 | + } 917 | + 918 | + seq_printf(m, "\n"); 
919 | + } 920 | + 921 | + /* Order by color */ 922 | + seq_printf(m, "--------\n"); 923 | + bins = palloc_bins(); 924 | + 925 | + for (color = 0; color < bins; color++) { 926 | + seq_printf(m, "- color [%d:%0x]", color, color); 927 | + cnt = 0; 928 | + 929 | + list_for_each(curr, &zone->color_list[color]) 930 | + cnt++; 931 | + 932 | + seq_printf(m, "%6d\n", cnt); 933 | + } 934 | +#endif /* CONFIG_CGROUP_PALLOC */ 935 | + 936 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 937 | for (order = 0; order < MAX_ORDER; ++order) 938 | seq_printf(m, "%6lu ", zone->free_area[order].nr_free); 939 | -------------------------------------------------------------------------------- /palloc-4.9.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h 2 | index 7f4a2a5a..b74c6b8b 100644 3 | --- a/include/linux/cgroup_subsys.h 4 | +++ b/include/linux/cgroup_subsys.h 5 | @@ -70,3 +70,7 @@ SUBSYS(debug) 6 | /* 7 | * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS. 
8 | */ 9 | + 10 | +#if IS_ENABLED(CONFIG_CGROUP_PALLOC) 11 | +SUBSYS(palloc) 12 | +#endif 13 | diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h 14 | index 2bb7ea2f..bb2ce4fa 100644 15 | --- a/include/linux/mmzone.h 16 | +++ b/include/linux/mmzone.h 17 | @@ -74,6 +74,14 @@ extern char * const migratetype_names[MIGRATE_TYPES]; 18 | # define is_migrate_cma_page(_page) false 19 | #endif 20 | 21 | +#ifdef CONFIG_CGROUP_PALLOC 22 | +/* Determine the number of bins according to the bits required for 23 | + each component of the address */ 24 | +#define MAX_PALLOC_BITS 8 25 | +#define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 26 | +#define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 27 | +#endif 28 | + 29 | #define for_each_migratetype_order(order, type) \ 30 | for (order = 0; order < MAX_ORDER; order++) \ 31 | for (type = 0; type < MIGRATE_TYPES; type++) 32 | @@ -449,6 +457,14 @@ struct zone { 33 | /* free areas of different sizes */ 34 | struct free_area free_area[MAX_ORDER]; 35 | 36 | +#ifdef CONFIG_CGROUP_PALLOC 37 | + /* 38 | + * Color page cache for movable type free pages of order-0 39 | + */ 40 | + struct list_head color_list[MAX_PALLOC_BINS]; 41 | + COLOR_BITMAP(color_bitmap); 42 | +#endif 43 | + 44 | /* zone flags, see below */ 45 | unsigned long flags; 46 | 47 | diff --git a/include/linux/palloc.h b/include/linux/palloc.h 48 | index e69de29b..7236e313 100644 49 | --- a/include/linux/palloc.h 50 | +++ b/include/linux/palloc.h 51 | @@ -0,0 +1,33 @@ 52 | +#ifndef _LINUX_PALLOC_H 53 | +#define _LINUX_PALLOC_H 54 | + 55 | +/* 56 | + * kernel/palloc.h 57 | + * 58 | + * Physical Memory Aware Allocator 59 | + */ 60 | + 61 | +#include 62 | +#include 63 | +#include 64 | +#include 65 | + 66 | +#ifdef CONFIG_CGROUP_PALLOC 67 | + 68 | +struct palloc { 69 | + struct cgroup_subsys_state css; 70 | + COLOR_BITMAP(cmap); 71 | +}; 72 | + 73 | +/* Retrieve the palloc group corresponding to this cgroup container */ 74 | +struct palloc *cgroup_ph(struct cgroup 
*cgrp); 75 | + 76 | +/* Retrieve the palloc group corresponding to this subsys */ 77 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys); 78 | + 79 | +/* Return number of palloc bins */ 80 | +int palloc_bins(void); 81 | + 82 | +#endif /* CONFIG_CGROUP_PALLOC */ 83 | + 84 | +#endif /* _LINUX_PALLOC_H */ 85 | diff --git a/init/Kconfig b/init/Kconfig 86 | index 0d6f1ba6..dd2ebe33 100644 87 | --- a/init/Kconfig 88 | +++ b/init/Kconfig 89 | @@ -1265,6 +1265,13 @@ config SOCK_CGROUP_DATA 90 | bool 91 | default n 92 | 93 | +config CGROUP_PALLOC 94 | + bool "Enable PALLOC" 95 | + help 96 | + Enable PALLOC. PALLOC is a color-aware page-based physical memory 97 | + allocator which replaces the buddy allocator for order-zero page 98 | + allocations. 99 | + 100 | endif # CGROUPS 101 | 102 | config CHECKPOINT_RESTORE 103 | diff --git a/mm/Makefile b/mm/Makefile 104 | index 3e05161b..19208197 100644 105 | --- a/mm/Makefile 106 | +++ b/mm/Makefile 107 | @@ -92,6 +92,7 @@ obj-$(CONFIG_ZBUD) += zbud.o 108 | obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 109 | obj-$(CONFIG_Z3FOLD) += z3fold.o 110 | obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 111 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 112 | obj-$(CONFIG_CMA) += cma.o 113 | obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 114 | obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o 115 | diff --git a/mm/page_alloc.c b/mm/page_alloc.c 116 | index 9d59596b..1ccfc67c 100644 117 | --- a/mm/page_alloc.c 118 | +++ b/mm/page_alloc.c 119 | @@ -66,12 +66,203 @@ 120 | #include 121 | 122 | #include 123 | +#include <linux/palloc.h> 124 | #include 125 | #include 126 | #include "internal.h" 127 | 128 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 129 | static DEFINE_MUTEX(pcp_batch_high_lock); 130 | + 131 | +#ifdef CONFIG_CGROUP_PALLOC 132 | +#include <linux/debugfs.h> 133 | + 134 | +int memdbg_enable = 0; 135 | +EXPORT_SYMBOL(memdbg_enable); 136 | + 137 | +static int sysctl_alloc_balance = 0; 138 | + 139 | +/* PALLOC address bitmask */ 
140 | +static unsigned long sysctl_palloc_mask = 0; 141 | + 142 | +static int mc_xor_bits[64]; 143 | +static int use_mc_xor = 0; 144 | +static int use_palloc = 0; 145 | + 146 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 147 | + 148 | +#define memdbg(lvl, fmt, ...) \ 149 | + do { \ 150 | + if(memdbg_enable >= lvl) \ 151 | + trace_printk(fmt, ##__VA_ARGS__); \ 152 | + } while(0) 153 | + 154 | +struct palloc_stat { 155 | + s64 max_ns; 156 | + s64 min_ns; 157 | + s64 tot_ns; 158 | + 159 | + s64 tot_cnt; 160 | + s64 iter_cnt; /* avg_iter = iter_cnt / tot_cnt */ 161 | + 162 | + s64 cache_hit_cnt; /* hit_rate = cache_hit_cnt / cache_acc_cnt */ 163 | + s64 cache_acc_cnt; 164 | + 165 | + s64 flush_cnt; 166 | + 167 | + s64 alloc_balance; 168 | + s64 alloc_balance_timeout; 169 | + ktime_t start; /* Start time of the current iteration */ 170 | +}; 171 | + 172 | +static struct { 173 | + u32 enabled; 174 | + int colors; 175 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2- fail */ 176 | +} palloc; 177 | + 178 | +static void palloc_flush(struct zone *zone); 179 | + 180 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) 181 | +{ 182 | + char buf[64]; 183 | + int i; 184 | + 185 | + if (cnt > 63) cnt = 63; 186 | + if (copy_from_user(&buf, ubuf, cnt)) 187 | + return -EFAULT; 188 | + 189 | + if (!strncmp(buf, "reset", 5)) { 190 | + printk(KERN_INFO "reset statistics...\n"); 191 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 192 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 193 | + palloc.stat[i].min_ns = 0x7fffffff; 194 | + } 195 | + } else if (!strncmp(buf, "flush", 5)) { 196 | + struct zone *zone; 197 | + printk(KERN_INFO "flush color cache...\n"); 198 | + for_each_populated_zone(zone) { 199 | + unsigned long flags; 200 | + if (!zone) 201 | + continue; 202 | + spin_lock_irqsave(&zone->lock, flags); 203 | + palloc_flush(zone); 204 | + spin_unlock_irqrestore(&zone->lock, flags); 205 | + } 206 | + } 
else if (!strncmp(buf, "xor", 3)) { 207 | + int bit, xor_bit; 208 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 209 | + if ((bit > 0 && bit < 64) && (xor_bit > 0 && xor_bit < 64) && bit != xor_bit) { 210 | + mc_xor_bits[bit] = xor_bit; 211 | + } 212 | + } 213 | + 214 | + *ppos += cnt; 215 | + 216 | + return cnt; 217 | +} 218 | + 219 | +static int palloc_show(struct seq_file *m, void *v) 220 | +{ 221 | + int i, tmp; 222 | + char *desc[] = { "Color", "Normal", "Fail" }; 223 | + char buf[256]; 224 | + 225 | + for (i = 0; i < 3; i++) { 226 | + struct palloc_stat *stat = &palloc.stat[i]; 227 | + seq_printf(m, "statistics %s:\n", desc[i]); 228 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 229 | + stat->min_ns, 230 | + stat->max_ns, 231 | + (stat->tot_cnt)? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 232 | + stat->tot_cnt); 233 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 234 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 235 | + (stat->cache_acc_cnt)? div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 236 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 237 | + (stat->tot_cnt)? div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 238 | + stat->iter_cnt, stat->tot_cnt); 239 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 240 | + 241 | + seq_printf(m, " balance: %lld | fail: %lld\n", 242 | + stat->alloc_balance, stat->alloc_balance_timeout); 243 | + } 244 | + 245 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 246 | + 247 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 248 | + 249 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, (1 << tmp)); 250 | + 251 | + scnprintf(buf, 256, "%*pbl", (int)(sizeof(unsigned long) * 8), &sysctl_palloc_mask); 252 | + 253 | + seq_printf(m, "bits: %s\n", buf); 254 | + 255 | + seq_printf(m, "XOR bits: %s\n", (use_mc_xor)? 
"enabled" : "disabled"); 256 | + 257 | + for (i = 0; i < 64; i++) { 258 | + if (mc_xor_bits[i] > 0) 259 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 260 | + } 261 | + 262 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc)? "enabled" : "disabled"); 263 | + 264 | + return 0; 265 | +} 266 | + 267 | +static int palloc_open(struct inode *inode, struct file *filp) 268 | +{ 269 | + return single_open(filp, palloc_show, NULL); 270 | +} 271 | + 272 | +static const struct file_operations palloc_fops = { 273 | + .open = palloc_open, 274 | + .write = palloc_write, 275 | + .read = seq_read, 276 | + .llseek = seq_lseek, 277 | + .release = single_release, 278 | +}; 279 | + 280 | +static int __init palloc_debugfs(void) 281 | +{ 282 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 283 | + struct dentry *dir; 284 | + int i; 285 | + 286 | + dir = debugfs_create_dir("palloc", NULL); 287 | + 288 | + /* Statistics Initialization */ 289 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 290 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 291 | + palloc.stat[i].min_ns = 0x7fffffff; 292 | + } 293 | + 294 | + if (!dir) 295 | + return PTR_ERR(dir); 296 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 297 | + goto fail; 298 | + if (!debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask)) 299 | + goto fail; 300 | + if (!debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor)) 301 | + goto fail; 302 | + if (!debugfs_create_u32("use_palloc", mode, dir, &use_palloc)) 303 | + goto fail; 304 | + if (!debugfs_create_u32("debug_level", mode, dir, &memdbg_enable)) 305 | + goto fail; 306 | + if (!debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance)) 307 | + goto fail; 308 | + 309 | + return 0; 310 | + 311 | +fail: 312 | + debugfs_remove_recursive(dir); 313 | + return -ENOMEM; 314 | +} 315 | + 316 | +late_initcall(palloc_debugfs); 317 | + 318 | +#endif /* CONFIG_CGROUP_PALLOC */ 319 | + 320 | #define 
MIN_PERCPU_PAGELIST_FRACTION (8) 321 | 322 | #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 323 | @@ -1805,6 +1996,306 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags 324 | clear_page_pfmemalloc(page); 325 | } 326 | 327 | +#ifdef CONFIG_CGROUP_PALLOC 328 | + 329 | +int palloc_bins(void) 330 | +{ 331 | + return min((1 << bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long) * 8)), MAX_PALLOC_BINS); 332 | +} 333 | + 334 | +static inline int page_to_color(struct page *page) 335 | +{ 336 | + int color = 0; 337 | + int idx = 0; 338 | + int c; 339 | + 340 | + unsigned long paddr = page_to_phys(page); 341 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 342 | + if (use_mc_xor) { 343 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 344 | + color |= (1 << idx); 345 | + } else { 346 | + if ((paddr >> c) & 0x1) 347 | + color |= (1 << idx); 348 | + } 349 | + 350 | + idx++; 351 | + } 352 | + 353 | + return color; 354 | +} 355 | + 356 | +/* Debug */ 357 | +static inline unsigned long list_count(struct list_head *head) 358 | +{ 359 | + unsigned long n = 0; 360 | + struct list_head *curr; 361 | + 362 | + list_for_each(curr, head) 363 | + n++; 364 | + 365 | + return n; 366 | +} 367 | + 368 | +/* Move all color_list pages into free_area[0].freelist[2] 369 | + * zone->lock must be held before calling this function 370 | + */ 371 | +static void palloc_flush(struct zone *zone) 372 | +{ 373 | + int c; 374 | + struct page *page; 375 | + 376 | + memdbg(2, "Flush the color-cache for zone %s\n", zone->name); 377 | + 378 | + while(1) { 379 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 380 | + if (!list_empty(&zone->color_list[c])) { 381 | + page = list_entry(zone->color_list[c].next, struct page, lru); 382 | + list_del_init(&page->lru); 383 | + __free_one_page(page, page_to_pfn(page), zone, 0, get_pageblock_migratetype(page)); 384 | + zone->free_area[0].nr_free--; 385 | + } 386 | + 387 | + if 
(list_empty(&zone->color_list[c])) { 388 | + bitmap_clear(zone->color_bitmap, c, 1); 389 | + INIT_LIST_HEAD(&zone->color_list[c]); 390 | + } 391 | + } 392 | + 393 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 394 | + break; 395 | + } 396 | +} 397 | + 398 | +/* Move a page (size = 1 << order) into order-0 colored cache */ 399 | +static void palloc_insert(struct zone *zone, struct page *page, int order) 400 | +{ 401 | + int i, color; 402 | + 403 | + /* 1 page (2^order) -> 2^order x pages of colored cache. 404 | + Remove from zone->free_area[order].free_list[mt] */ 405 | + list_del(&page->lru); 406 | + zone->free_area[order].nr_free--; 407 | + 408 | + /* Insert pages to zone->color_list[] (all order-0) */ 409 | + for (i = 0; i < (1 << order); i++) { 410 | + color = page_to_color(&page[i]); 411 | + 412 | + /* Add to zone->color_list[color] */ 413 | + memdbg(5, "- Add pfn %ld (0x%08llx) to color_list[%d]\n", page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 414 | + 415 | + INIT_LIST_HEAD(&page[i].lru); 416 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 417 | + bitmap_set(zone->color_bitmap, color, 1); 418 | + zone->free_area[0].nr_free++; 419 | + } 420 | + rmv_page_order(page); 421 | + 422 | + memdbg(4, "Add order=%d zone=%s\n", order, zone->name); 423 | + 424 | + return; 425 | +} 426 | + 427 | +/* Return a colored page (order-0) and remove it from the colored cache */ 428 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), int order, struct palloc_stat *stat) 429 | +{ 430 | + struct page *page; 431 | + COLOR_BITMAP(tmpmask); 432 | + int c; 433 | + unsigned int tmp_idx; 434 | + int found_w, want_w; 435 | + unsigned long rand_seed; 436 | + 437 | + /* Cache Statistics */ 438 | + if (stat) stat->cache_acc_cnt++; 439 | + 440 | + /* Find color cache entry */ 441 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 442 | + return NULL; 443 | + 444 | + bitmap_and(tmpmask, 
zone->color_bitmap, cmap, MAX_PALLOC_BINS); 445 | + 446 | + /* Must have a balance */ 447 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 448 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 449 | + 450 | + if (sysctl_alloc_balance && (found_w < want_w) && (found_w < min(sysctl_alloc_balance, want_w)) && memdbg_enable) { 451 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 452 | + if (dur.tv64 < 1000000) { 453 | + /* Try to balance unless order = MAX-2 or 1ms has passed */ 454 | + memdbg(4, "found_w=%d want_w=%d order=%d elapsed=%lld ns\n", found_w, want_w, order, dur.tv64); 455 | + 456 | + stat->alloc_balance++; 457 | + 458 | + return NULL; 459 | + } 460 | + 461 | + stat->alloc_balance_timeout++; 462 | + } 463 | + 464 | + /* Choose a bit among the candidates */ 465 | + if (sysctl_alloc_balance && memdbg_enable) { 466 | + rand_seed = (unsigned long)stat->start.tv64; 467 | + } else { 468 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 469 | + 470 | + if (rand_seed > MAX_PALLOC_BINS) 471 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 472 | + } 473 | + 474 | + tmp_idx = rand_seed % found_w; 475 | + 476 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 477 | + if (tmp_idx-- <= 0) 478 | + break; 479 | + } 480 | + 481 | + BUG_ON(c >= MAX_PALLOC_BINS); 482 | + BUG_ON(list_empty(&zone->color_list[c])); 483 | + 484 | + page = list_entry(zone->color_list[c].next, struct page, lru); 485 | + 486 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 487 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 488 | + 489 | + /* Remove the page from the zone->color_list[color] */ 490 | + list_del(&page->lru); 491 | + 492 | + if (list_empty(&zone->color_list[c])) 493 | + bitmap_clear(zone->color_bitmap, c, 1); 494 | + 495 | + zone->free_area[0].nr_free--; 496 | + 497 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", page_to_pfn(page), c); 498 | + 499 | + if (stat) stat->cache_hit_cnt++; 500 | + 501 | + return 
page; 502 | +} 503 | + 504 | +static inline void update_stat(struct palloc_stat *stat, struct page *page, int iters) 505 | +{ 506 | + ktime_t dur; 507 | + 508 | + if (memdbg_enable == 0) 509 | + return; 510 | + 511 | + dur = ktime_sub(ktime_get(), stat->start); 512 | + 513 | + if (dur.tv64 > 0) { 514 | + stat->min_ns = min(dur.tv64, stat->min_ns); 515 | + stat->max_ns = max(dur.tv64, stat->max_ns); 516 | + 517 | + stat->tot_ns += dur.tv64; 518 | + stat->iter_cnt += iters; 519 | + 520 | + stat->tot_cnt++; 521 | + 522 | + memdbg(2, "order %ld pfn %ld (0x%08llx) color %d iters %d in %lld ns\n", 523 | + (long int)page_order(page), (long int)page_to_pfn(page), (u64)page_to_phys(page), 524 | + (int)page_to_color(page), iters, dur.tv64); 525 | + } else { 526 | + memdbg(5, "dur %lld is < 0\n", dur.tv64); 527 | + } 528 | + 529 | + return; 530 | +} 531 | + 532 | +/* 533 | + * Go through the free lists for the given migratetype and remove 534 | + * the smallest available page from the freelists 535 | + */ 536 | +static inline 537 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 538 | + int migratetype) 539 | +{ 540 | + unsigned int current_order; 541 | + struct free_area *area; 542 | + struct list_head *curr, *tmp; 543 | + struct page *page = NULL; 544 | + 545 | + struct palloc *ph; 546 | + struct palloc_stat *c_stat = &palloc.stat[0]; 547 | + struct palloc_stat *n_stat = &palloc.stat[1]; 548 | + struct palloc_stat *f_stat = &palloc.stat[2]; 549 | + 550 | + int iters = 0; 551 | + COLOR_BITMAP(tmpcmap); 552 | + unsigned long *cmap; 553 | + 554 | + if (memdbg_enable) 555 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 556 | + 557 | + if (!use_palloc || order > 0) 558 | + goto normal_buddy_alloc; 559 | + 560 | + /* cgroup information */ 561 | + memdbg(4, "current: %s | prio: %d\n", current->comm, current->prio); 562 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 563 | + if (ph && bitmap_weight(ph->cmap, 
MAX_PALLOC_BINS) > 0) 564 | + cmap = ph->cmap; 565 | + else { 566 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 567 | + cmap = tmpcmap; 568 | + } 569 | + 570 | + /* Find page in the color cache */ 571 | + memdbg(5, "check color cache (mt=%d)\n", migratetype); 572 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 573 | + if (page) { 574 | + update_stat(c_stat, page, iters); 575 | + return page; 576 | + } 577 | + 578 | + /* Search the entire list. Make color cache in the process */ 579 | + for (current_order = 0; current_order < MAX_ORDER; ++current_order) { 580 | + memdbg(3, "[ITER: %d] Parsing order: %d\n", iters, current_order); 581 | + area = &(zone->free_area[current_order]); 582 | + if (list_empty(&area->free_list[migratetype])) 583 | + continue; 584 | + 585 | + memdbg(3, " order=%d (nr_free=%ld)\n", current_order, area->nr_free); 586 | + list_for_each_safe(curr, tmp, &area->free_list[migratetype]) { 587 | + iters++; 588 | + page = list_entry(curr, struct page, lru); 589 | + palloc_insert(zone, page, current_order); 590 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 591 | + 592 | + if (page) { 593 | + update_stat(c_stat, page, iters); 594 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", zone->name, page_to_pfn(page)); 595 | + return page; 596 | + } 597 | + } 598 | + } 599 | + 600 | + /* Fall back to buddy-allocator */ 601 | + memdbg(1, "Failed to find a matching color\n"); 602 | + 603 | +normal_buddy_alloc: 604 | + /* Find a page of the specified size in the preferred list */ 605 | + for (current_order = order; current_order < MAX_ORDER; ++current_order) { 606 | + area = &(zone->free_area[current_order]); 607 | + page = list_first_entry_or_null(&area->free_list[migratetype], 608 | + struct page, lru); 609 | + if (!page) 610 | + continue; 611 | + list_del(&page->lru); 612 | + rmv_page_order(page); 613 | + area->nr_free--; 614 | + expand(zone, page, order, current_order, area, migratetype); 615 | + set_pcppage_migratetype(page, migratetype); 616 | + 
return page; 617 | + } 618 | + 619 | + /* No memory (colored or normal) found in this zone */ 620 | + memdbg(1, "No memory in Zone %s: order %d mt %d\n", zone->name, order, migratetype); 621 | + 622 | + return NULL; 623 | +} 624 | + 625 | +#else /* CONFIG_CGROUP_PALLOC */ 626 | + 627 | /* 628 | * Go through the free lists for the given migratetype and remove 629 | * the smallest available page from the freelists 630 | @@ -1835,6 +2326,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 631 | return NULL; 632 | } 633 | 634 | +#endif /* CONFIG_CGROUP_PALLOC */ 635 | 636 | /* 637 | * This array describes the order lists are fallen back to when 638 | @@ -2546,7 +3038,7 @@ int __isolate_free_page(struct page *page, unsigned int order) 639 | struct zone *zone; 640 | int mt; 641 | 642 | - BUG_ON(!PageBuddy(page)); 643 | + WARN_ON(!PageBuddy(page)); 644 | 645 | zone = page_zone(page); 646 | mt = get_pageblock_migratetype(page); 647 | @@ -2624,8 +3116,11 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, 648 | unsigned long flags; 649 | struct page *page; 650 | bool cold = ((gfp_flags & __GFP_COLD) != 0); 651 | + struct palloc *ph; 652 | 653 | - if (likely(order == 0)) { 654 | + /* Skip PCP when physical memory aware allocation is requested */ 655 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 656 | + if (likely(order == 0) && !ph) { 657 | struct per_cpu_pages *pcp; 658 | struct list_head *list; 659 | 660 | @@ -5137,6 +5632,17 @@ not_early: 661 | static void __meminit zone_init_free_lists(struct zone *zone) 662 | { 663 | unsigned int order, t; 664 | + 665 | +#ifdef CONFIG_CGROUP_PALLOC 666 | + int c; 667 | + 668 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 669 | + INIT_LIST_HEAD(&zone->color_list[c]); 670 | + } 671 | + 672 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS); 673 | +#endif /* CONFIG_CGROUP_PALLOC */ 674 | + 675 | for_each_migratetype_order(order, t) { 676 | 
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 677 | zone->free_area[order].nr_free = 0; 678 | @@ -7459,6 +7965,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 679 | return; 680 | zone = page_zone(pfn_to_page(pfn)); 681 | spin_lock_irqsave(&zone->lock, flags); 682 | + 683 | +#ifdef CONFIG_CGROUP_PALLOC 684 | + palloc_flush(zone); 685 | +#endif 686 | + 687 | pfn = start_pfn; 688 | while (pfn < end_pfn) { 689 | if (!pfn_valid(pfn)) { 690 | @@ -7477,7 +7988,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 691 | } 692 | 693 | BUG_ON(page_count(page)); 694 | - BUG_ON(!PageBuddy(page)); 695 | + WARN_ON(!PageBuddy(page)); 696 | order = page_order(page); 697 | #ifdef CONFIG_DEBUG_VM 698 | pr_info("remove from free list %lx %d %lx\n", 699 | diff --git a/mm/palloc.c b/mm/palloc.c 700 | index e69de29b..bc6a3415 100644 701 | --- a/mm/palloc.c 702 | +++ b/mm/palloc.c 703 | @@ -0,0 +1,173 @@ 704 | +/** 705 | + * kernel/palloc.c 706 | + * 707 | + * Color Aware Physical Memory Allocator User-Space Information 708 | + * 709 | + */ 710 | + 711 | +#include 712 | +#include 713 | +#include 714 | +#include 715 | +#include 716 | +#include 717 | +#include 718 | +#include 719 | +#include 720 | +#include 721 | + 722 | +/** 723 | + * Check if a page is compliant with the policy defined for the given vma 724 | + */ 725 | +#ifdef CONFIG_CGROUP_PALLOC 726 | + 727 | +#define MAX_LINE_LEN (6 * 128) 728 | + 729 | +/** 730 | + * Type of files in a palloc group 731 | + * FILE_PALLOC - contains list of palloc bins allowed 732 | + */ 733 | +typedef enum { 734 | + FILE_PALLOC, 735 | +} palloc_filetype_t; 736 | + 737 | +/** 738 | + * Retrieve the palloc group corresponding to this cgroup container 739 | + */ 740 | +struct palloc *cgroup_ph(struct cgroup *cgrp) 741 | +{ 742 | + return container_of(cgrp->subsys[palloc_cgrp_id], struct palloc, css); 743 | +} 744 | + 745 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys) 
746 | +{ 747 | + return container_of(subsys, struct palloc, css); 748 | +} 749 | + 750 | +/** 751 | + * Common write function for files in palloc cgroup 752 | + */ 753 | +static int update_bitmask(unsigned long *bitmap, const char *buf, int maxbits) 754 | +{ 755 | + int retval = 0; 756 | + 757 | + if (!*buf) 758 | + bitmap_clear(bitmap, 0, maxbits); 759 | + else 760 | + retval = bitmap_parselist(buf, bitmap, maxbits); 761 | + 762 | + return retval; 763 | +} 764 | + 765 | +static ssize_t palloc_file_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) 766 | +{ 767 | + struct cgroup_subsys_state *css; 768 | + struct cftype *cft; 769 | + int retval = 0; 770 | + struct palloc *ph; 771 | + 772 | + css = of_css(of); 773 | + cft = of_cft(of); 774 | + ph = container_of(css, struct palloc, css); 775 | + 776 | + switch (cft->private) { 777 | + case FILE_PALLOC: 778 | + retval = update_bitmask(ph->cmap, buf, palloc_bins()); 779 | + printk(KERN_INFO "Bins : %s\n", buf); 780 | + break; 781 | + 782 | + default: 783 | + retval = -EINVAL; 784 | + break; 785 | + } 786 | + 787 | + return retval? 
:nbytes; 788 | +} 789 | + 790 | +static int palloc_file_read(struct seq_file *sf, void *v) 791 | +{ 792 | + struct cgroup_subsys_state *css = seq_css(sf); 793 | + struct cftype *cft = seq_cft(sf); 794 | + struct palloc *ph = container_of(css, struct palloc, css); 795 | + char *page; 796 | + ssize_t retval = 0; 797 | + char *s; 798 | + 799 | + if (!(page = (char *)__get_free_page(GFP_TEMPORARY | __GFP_ZERO))) 800 | + return -ENOMEM; 801 | + 802 | + s = page; 803 | + 804 | + switch (cft->private) { 805 | + case FILE_PALLOC: 806 | + s += scnprintf(s, PAGE_SIZE, "%*pbl", (int)palloc_bins(), ph->cmap); 807 | + *s++ = '\n'; 808 | + printk(KERN_INFO "Bins : %s", page); 809 | + break; 810 | + 811 | + default: 812 | + retval = -EINVAL; 813 | + goto out; 814 | + } 815 | + 816 | + seq_printf(sf, "%s", page); 817 | + 818 | +out: 819 | + free_page((unsigned long)page); 820 | + return retval; 821 | +} 822 | + 823 | +/** 824 | + * struct cftype : handler definitions for cgroup control files 825 | + * 826 | + * for the common functions, 'private' gives the type of the file 827 | + */ 828 | +static struct cftype files[] = { 829 | + { 830 | + .name = "bins", 831 | + .seq_show = palloc_file_read, 832 | + .write = palloc_file_write, 833 | + .max_write_len = MAX_LINE_LEN, 834 | + .private = FILE_PALLOC, 835 | + }, 836 | + {} 837 | +}; 838 | + 839 | + 840 | +/** 841 | + * palloc_create - create a palloc group 842 | + */ 843 | +static struct cgroup_subsys_state *palloc_create(struct cgroup_subsys_state *css) 844 | +{ 845 | + struct palloc *ph_child; 846 | + 847 | + ph_child = kmalloc(sizeof(struct palloc), GFP_KERNEL); 848 | + 849 | + if (!ph_child) 850 | + return ERR_PTR(-ENOMEM); 851 | + 852 | + bitmap_clear(ph_child->cmap, 0, MAX_PALLOC_BINS); 853 | + 854 | + return &ph_child->css; 855 | +} 856 | + 857 | +/** 858 | + * Destroy an existing palloc group 859 | + */ 860 | +static void palloc_destroy(struct cgroup_subsys_state *css) 861 | +{ 862 | + struct palloc *ph = container_of(css, 
struct palloc, css); 863 | + 864 | + kfree(ph); 865 | +} 866 | + 867 | +struct cgroup_subsys palloc_cgrp_subsys = { 868 | + .name = "palloc", 869 | + .css_alloc = palloc_create, 870 | + .css_free = palloc_destroy, 871 | + .id = palloc_cgrp_id, 872 | + .dfl_cftypes = files, 873 | + .legacy_cftypes = files, 874 | +}; 875 | + 876 | +#endif /* CONFIG_CGROUP_PALLOC */ 877 | diff --git a/mm/vmstat.c b/mm/vmstat.c 878 | index 5f658b6a..90ed8cf0 100644 879 | --- a/mm/vmstat.c 880 | +++ b/mm/vmstat.c 881 | @@ -28,6 +28,10 @@ 882 | #include 883 | #include 884 | 885 | +#ifdef CONFIG_CGROUP_PALLOC 886 | +#include 887 | +#endif 888 | + 889 | #include "internal.h" 890 | 891 | #ifdef CONFIG_VM_EVENT_COUNTERS 892 | @@ -1145,6 +1149,44 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 893 | { 894 | int order; 895 | 896 | +#ifdef CONFIG_CGROUP_PALLOC 897 | + int color, mt, cnt, bins; 898 | + struct free_area *area; 899 | + struct list_head *curr; 900 | + 901 | + seq_printf(m, "--------\n"); 902 | + 903 | + /* Order by memory type */ 904 | + for (mt = 0; mt < MIGRATE_ISOLATE; mt++) { 905 | + seq_printf(m, "-%17s[%d]", "mt", mt); 906 | + for (order = 0; order < MAX_ORDER; order++) { 907 | + area = &(zone->free_area[order]); 908 | + cnt = 0; 909 | + 910 | + list_for_each(curr, &area->free_list[mt]) 911 | + cnt++; 912 | + 913 | + seq_printf(m, "%6d ", cnt); 914 | + } 915 | + 916 | + seq_printf(m, "\n"); 917 | + } 918 | + 919 | + /* Order by color */ 920 | + seq_printf(m, "--------\n"); 921 | + bins = palloc_bins(); 922 | + 923 | + for (color = 0; color < bins; color++) { 924 | + seq_printf(m, "- color [%d:%0x]", color, color); 925 | + cnt = 0; 926 | + 927 | + list_for_each(curr, &zone->color_list[color]) 928 | + cnt++; 929 | + 930 | + seq_printf(m, "%6d\n", cnt); 931 | + } 932 | +#endif /* CONFIG_CGROUP_PALLOC */ 933 | + 934 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 935 | for (order = 0; order < MAX_ORDER; ++order) 936 | seq_printf(m, "%6lu ", 
zone->free_area[order].nr_free); 937 | -------------------------------------------------------------------------------- /palloc-5.10.74.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h 2 | index acb77dcff..37a2eaa39 100644 3 | --- a/include/linux/cgroup_subsys.h 4 | +++ b/include/linux/cgroup_subsys.h 5 | @@ -71,3 +71,7 @@ SUBSYS(debug) 6 | /* 7 | * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS. 8 | */ 9 | + 10 | +#if IS_ENABLED(CONFIG_CGROUP_PALLOC) 11 | +SUBSYS(palloc) 12 | +#endif 13 | diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h 14 | index 63b550403..53c625713 100644 15 | --- a/include/linux/mmzone.h 16 | +++ b/include/linux/mmzone.h 17 | @@ -82,6 +82,14 @@ static inline bool is_migrate_movable(int mt) 18 | return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE; 19 | } 20 | 21 | +#ifdef CONFIG_CGROUP_PALLOC 22 | +/* Determine the number of bins according to the bits required for 23 | + each component of the address */ 24 | +#define MAX_PALLOC_BITS 8 25 | +#define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 26 | +#define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 27 | +#endif 28 | + 29 | #define for_each_migratetype_order(order, type) \ 30 | for (order = 0; order < MAX_ORDER; order++) \ 31 | for (type = 0; type < MIGRATE_TYPES; type++) 32 | @@ -525,6 +533,14 @@ struct zone { 33 | /* free areas of different sizes */ 34 | struct free_area free_area[MAX_ORDER]; 35 | 36 | +#ifdef CONFIG_CGROUP_PALLOC 37 | + /* 38 | + * Color page cache for movable type free pages of order-0 39 | + */ 40 | + struct list_head color_list[MAX_PALLOC_BINS]; 41 | + COLOR_BITMAP(color_bitmap); 42 | +#endif 43 | + 44 | /* zone flags, see below */ 45 | unsigned long flags; 46 | 47 | diff --git a/init/Kconfig b/init/Kconfig 48 | index fc4c9f416..e4c879c12 100644 49 | --- a/init/Kconfig 50 | +++ b/init/Kconfig 51 | @@ -1121,6 
+1121,13 @@ config SOCK_CGROUP_DATA 52 | bool 53 | default n 54 | 55 | +config CGROUP_PALLOC 56 | + bool "Enable PALLOC" 57 | + help 58 | + Enable PALLOC. PALLOC is a color-aware page-based physical memory 59 | + allocator which replaces the buddy allocator for order-zero page 60 | + allocations. 61 | + 62 | endif # CGROUPS 63 | 64 | menuconfig NAMESPACES 65 | diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c 66 | index b4ada2e9f..533b934fe 100644 67 | --- a/kernel/cgroup/cgroup.c 68 | +++ b/kernel/cgroup/cgroup.c 69 | @@ -5647,10 +5647,12 @@ int __init cgroup_init_early(void) 70 | RCU_INIT_POINTER(init_task.cgroups, &init_css_set); 71 | 72 | for_each_subsys(ss, i) { 73 | +#if 0 74 | WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id, 75 | "invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p id:name=%d:%s\n", 76 | i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free, 77 | ss->id, ss->name); 78 | +#endif 79 | WARN(strlen(cgroup_subsys_name[i]) > MAX_CGROUP_TYPE_NAMELEN, 80 | "cgroup_subsys_name %s too long\n", cgroup_subsys_name[i]); 81 | 82 | diff --git a/mm/Makefile b/mm/Makefile 83 | index d73aed0fc..11f9f1d0c 100644 84 | --- a/mm/Makefile 85 | +++ b/mm/Makefile 86 | @@ -104,6 +104,7 @@ obj-$(CONFIG_ZBUD) += zbud.o 87 | obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 88 | obj-$(CONFIG_Z3FOLD) += z3fold.o 89 | obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 90 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 91 | obj-$(CONFIG_CMA) += cma.o 92 | obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 93 | obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o 94 | diff --git a/mm/page_alloc.c b/mm/page_alloc.c 95 | index 299688722..d1b26353b 100644 96 | --- a/mm/page_alloc.c 97 | +++ b/mm/page_alloc.c 98 | @@ -72,6 +72,7 @@ 99 | #include 100 | 101 | #include 102 | +#include 103 | #include 104 | #include 105 | #include "internal.h" 106 | @@ -108,6 +109,191 @@ typedef int __bitwise fpi_t; 107 | 108 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch 
fields */ 109 | static DEFINE_MUTEX(pcp_batch_high_lock); 110 | + 111 | +#ifdef CONFIG_CGROUP_PALLOC 112 | +#include 113 | + 114 | +int memdbg_enable = 0; 115 | +EXPORT_SYMBOL(memdbg_enable); 116 | + 117 | +static int sysctl_alloc_balance = 0; 118 | + 119 | +/* PALLOC address bitmask */ 120 | +static unsigned long sysctl_palloc_mask = 0x0; 121 | + 122 | +static int mc_xor_bits[64]; 123 | +static int use_mc_xor = 0; 124 | +static int use_palloc = 0; 125 | + 126 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 127 | + 128 | +#define memdbg(lvl, fmt, ...) \ 129 | + do { \ 130 | + if(memdbg_enable >= lvl) \ 131 | + trace_printk(fmt, ##__VA_ARGS__); \ 132 | + } while(0) 133 | + 134 | +struct palloc_stat { 135 | + s64 max_ns; 136 | + s64 min_ns; 137 | + s64 tot_ns; 138 | + 139 | + s64 tot_cnt; 140 | + s64 iter_cnt; /* avg_iter = iter_cnt / tot_cnt */ 141 | + 142 | + s64 cache_hit_cnt; /* hit_rate = cache_hit_cnt / cache_acc_cnt */ 143 | + s64 cache_acc_cnt; 144 | + 145 | + s64 flush_cnt; 146 | + 147 | + s64 alloc_balance; 148 | + s64 alloc_balance_timeout; 149 | + ktime_t start; /* Start time of the current iteration */ 150 | +}; 151 | + 152 | +static struct { 153 | + u32 enabled; 154 | + int colors; 155 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2- fail */ 156 | +} palloc; 157 | + 158 | +static void palloc_flush(struct zone *zone); 159 | + 160 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) 161 | +{ 162 | + char buf[64]; 163 | + int i; 164 | + 165 | + if (cnt > 63) cnt = 63; 166 | + if (copy_from_user(&buf, ubuf, cnt)) 167 | + return -EFAULT; 168 | + 169 | + if (!strncmp(buf, "reset", 5)) { 170 | + printk(KERN_INFO "reset statistics...\n"); 171 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 172 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 173 | + palloc.stat[i].min_ns = 0x7fffffff; 174 | + } 175 | + } else if (!strncmp(buf, "flush", 5)) { 176 | + struct zone *zone; 177 | + 
printk(KERN_INFO "flush color cache...\n"); 178 | + for_each_populated_zone(zone) { 179 | + unsigned long flags; 180 | + if (!zone) 181 | + continue; 182 | + spin_lock_irqsave(&zone->lock, flags); 183 | + palloc_flush(zone); 184 | + spin_unlock_irqrestore(&zone->lock, flags); 185 | + } 186 | + } else if (!strncmp(buf, "xor", 3)) { 187 | + int bit, xor_bit; 188 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 189 | + if ((bit > 0 && bit < 64) && (xor_bit > 0 && xor_bit < 64) && bit != xor_bit) { 190 | + mc_xor_bits[bit] = xor_bit; 191 | + } 192 | + } 193 | + 194 | + *ppos += cnt; 195 | + 196 | + return cnt; 197 | +} 198 | + 199 | +static int palloc_show(struct seq_file *m, void *v) 200 | +{ 201 | + int i, tmp; 202 | + char *desc[] = { "Color", "Normal", "Fail" }; 203 | + char buf[256]; 204 | + 205 | + for (i = 0; i < 3; i++) { 206 | + struct palloc_stat *stat = &palloc.stat[i]; 207 | + seq_printf(m, "statistics %s:\n", desc[i]); 208 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 209 | + stat->min_ns, 210 | + stat->max_ns, 211 | + (stat->tot_cnt)? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 212 | + stat->tot_cnt); 213 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 214 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 215 | + (stat->cache_acc_cnt)? div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 216 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 217 | + (stat->tot_cnt)? 
div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 218 | + stat->iter_cnt, stat->tot_cnt); 219 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 220 | + 221 | + seq_printf(m, " balance: %lld | fail: %lld\n", 222 | + stat->alloc_balance, stat->alloc_balance_timeout); 223 | + } 224 | + 225 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 226 | + 227 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 228 | + 229 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, (1 << tmp)); 230 | + 231 | + scnprintf(buf, 256, "%*pbl", (int)(sizeof(unsigned long) * 8), &sysctl_palloc_mask); 232 | + 233 | + seq_printf(m, "bits: %s\n", buf); 234 | + 235 | + seq_printf(m, "XOR bits: %s\n", (use_mc_xor)? "enabled" : "disabled"); 236 | + 237 | + for (i = 0; i < 64; i++) { 238 | + if (mc_xor_bits[i] > 0) 239 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 240 | + } 241 | + 242 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc)? "enabled" : "disabled"); 243 | + 244 | + return 0; 245 | +} 246 | + 247 | +static int palloc_open(struct inode *inode, struct file *filp) 248 | +{ 249 | + return single_open(filp, palloc_show, NULL); 250 | +} 251 | + 252 | +static const struct file_operations palloc_fops = { 253 | + .open = palloc_open, 254 | + .write = palloc_write, 255 | + .read = seq_read, 256 | + .llseek = seq_lseek, 257 | + .release = single_release, 258 | +}; 259 | + 260 | +static int __init palloc_debugfs(void) 261 | +{ 262 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 263 | + struct dentry *dir; 264 | + int i; 265 | + 266 | + dir = debugfs_create_dir("palloc", NULL); 267 | + 268 | + /* Statistics Initialization */ 269 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 270 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 271 | + palloc.stat[i].min_ns = 0x7fffffff; 272 | + } 273 | + 274 | + if (IS_ERR(dir)) /* on 5.10, debugfs_create_dir() returns ERR_PTR on failure, not NULL */ 275 | + return PTR_ERR(dir); 276 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 277 | + goto fail; 278 | + 
debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask); 279 | + debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor); 280 | + debugfs_create_u32("use_palloc", mode, dir, &use_palloc); 281 | + debugfs_create_u32("debug_level", mode, dir, &memdbg_enable); 282 | + debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance); 283 | + 284 | + return 0; 285 | + 286 | +fail: 287 | + debugfs_remove_recursive(dir); 288 | + return -ENOMEM; 289 | +} 290 | + 291 | +late_initcall(palloc_debugfs); 292 | + 293 | +#endif /* CONFIG_CGROUP_PALLOC */ 294 | + 295 | #define MIN_PERCPU_PAGELIST_FRACTION (8) 296 | 297 | #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID 298 | @@ -2302,6 +2488,328 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags 299 | clear_page_pfmemalloc(page); 300 | } 301 | 302 | +#ifdef CONFIG_CGROUP_PALLOC 303 | + 304 | +int palloc_bins(void) 305 | +{ 306 | + return min((1 << bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long) * 8)), MAX_PALLOC_BINS); 307 | +} 308 | + 309 | +static inline int page_to_color(struct page *page) 310 | +{ 311 | + int color = 0; 312 | + int idx = 0; 313 | + int c; 314 | + 315 | + unsigned long paddr = page_to_phys(page); 316 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 317 | + if (use_mc_xor) { 318 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 319 | + color |= (1 << idx); 320 | + } else { 321 | + if ((paddr >> c) & 0x1) 322 | + color |= (1 << idx); 323 | + } 324 | + 325 | + idx++; 326 | + } 327 | + 328 | + return color; 329 | +} 330 | + 331 | +/* Debug */ 332 | +static inline unsigned long list_count(struct list_head *head) 333 | +{ 334 | + unsigned long n = 0; 335 | + struct list_head *curr; 336 | + 337 | + list_for_each(curr, head) 338 | + n++; 339 | + 340 | + return n; 341 | +} 342 | + 343 | +/* Move all color_list pages into free_area[0].freelist[2] 344 | + * zone->lock must be held before calling this function 345 | + */ 346 | 
+static void palloc_flush(struct zone *zone) 347 | +{ 348 | + int c; 349 | + struct page *page; 350 | + 351 | + memdbg(2, "Flush the color-cache for zone %s\n", zone->name); 352 | + 353 | + while(1) { 354 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 355 | + if (!list_empty(&zone->color_list[c])) { 356 | + page = list_entry(zone->color_list[c].next, struct page, lru); 357 | + list_del_init(&page->lru); 358 | + __free_one_page(page, page_to_pfn(page), zone, 0, get_pageblock_migratetype(page), 0); // FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL 359 | + zone->free_area[0].nr_free--; 360 | + } 361 | + 362 | + if (list_empty(&zone->color_list[c])) { 363 | + bitmap_clear(zone->color_bitmap, c, 1); 364 | + INIT_LIST_HEAD(&zone->color_list[c]); 365 | + } 366 | + } 367 | + 368 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 369 | + break; 370 | + } 371 | +} 372 | + 373 | + 374 | +static inline void rmv_page_order(struct page *page) 375 | +{ 376 | + __ClearPageBuddy(page); 377 | + set_page_private(page, 0); 378 | +} 379 | + 380 | +/* Move a page (size = 1 << order) into order-0 colored cache */ 381 | +static void palloc_insert(struct zone *zone, struct page *page, int order) 382 | +{ 383 | + int i, color; 384 | + 385 | + /* 1 page (2^order) -> 2^order x pages of colored cache. 
386 | + Remove from zone->free_area[order].free_list[mt] */ 387 | + list_del(&page->lru); 388 | + zone->free_area[order].nr_free--; 389 | + 390 | + /* Insert pages to zone->color_list[] (all order-0) */ 391 | + for (i = 0; i < (1 << order); i++) { 392 | + color = page_to_color(&page[i]); 393 | + 394 | + /* Add to zone->color_list[color] */ 395 | + memdbg(5, "- Add pfn %ld (0x%08llx) to color_list[%d]\n", page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 396 | + 397 | + INIT_LIST_HEAD(&page[i].lru); 398 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 399 | + bitmap_set(zone->color_bitmap, color, 1); 400 | + zone->free_area[0].nr_free++; 401 | + rmv_page_order(&page[i]); 402 | + } 403 | + 404 | + memdbg(4, "Add order=%d zone=%s\n", order, zone->name); 405 | + 406 | + return; 407 | +} 408 | + 409 | +/* Return a colored page (order-0) and remove it from the colored cache */ 410 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), int order, struct palloc_stat *stat) 411 | +{ 412 | + struct page *page; 413 | + COLOR_BITMAP(tmpmask); 414 | + int c; 415 | + unsigned int tmp_idx; 416 | + int found_w, want_w; 417 | + unsigned long rand_seed; 418 | + 419 | + /* Cache Statistics */ 420 | + if (stat) stat->cache_acc_cnt++; 421 | + 422 | + /* Find color cache entry */ 423 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 424 | + return NULL; 425 | + 426 | + bitmap_and(tmpmask, zone->color_bitmap, cmap, MAX_PALLOC_BINS); 427 | + 428 | + /* Must have a balance */ 429 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 430 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 431 | + 432 | + if (sysctl_alloc_balance && (found_w < want_w) && (found_w < min(sysctl_alloc_balance, want_w)) && memdbg_enable) { 433 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 434 | + if (dur < 1000000) { 435 | + /* Try to balance unless order = MAX-2 or 1ms has passed */ 436 | + memdbg(4, "found_w=%d want_w=%d order=%d 
elapsed=%lld ns\n", found_w, want_w, order, dur); 437 | + 438 | + stat->alloc_balance++; 439 | + 440 | + return NULL; 441 | + } 442 | + 443 | + stat->alloc_balance_timeout++; 444 | + } 445 | + 446 | + /* Choose a bit among the candidates */ 447 | + if (sysctl_alloc_balance && memdbg_enable) { 448 | + rand_seed = (unsigned long)stat->start; 449 | + } else { 450 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 451 | + 452 | + if (rand_seed > MAX_PALLOC_BINS) 453 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 454 | + } 455 | + 456 | + tmp_idx = rand_seed % found_w; 457 | + 458 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 459 | + if (tmp_idx-- <= 0) 460 | + break; 461 | + } 462 | + 463 | + BUG_ON(c >= MAX_PALLOC_BINS); 464 | + BUG_ON(list_empty(&zone->color_list[c])); 465 | + 466 | + page = list_entry(zone->color_list[c].next, struct page, lru); 467 | + 468 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 469 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 470 | + 471 | + /* Remove the page from the zone->color_list[color] */ 472 | + list_del(&page->lru); 473 | + 474 | + if (list_empty(&zone->color_list[c])) 475 | + bitmap_clear(zone->color_bitmap, c, 1); 476 | + 477 | + zone->free_area[0].nr_free--; 478 | + 479 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", page_to_pfn(page), c); 480 | + 481 | + if (stat) stat->cache_hit_cnt++; 482 | + 483 | + return page; 484 | +} 485 | + 486 | +static inline void update_stat(struct palloc_stat *stat, struct page *page, int iters) 487 | +{ 488 | + ktime_t dur; 489 | + 490 | + if (memdbg_enable == 0) 491 | + return; 492 | + 493 | + dur = ktime_sub(ktime_get(), stat->start); 494 | + 495 | + if (dur > 0) { 496 | + stat->min_ns = min(dur, stat->min_ns); 497 | + stat->max_ns = max(dur, stat->max_ns); 498 | + 499 | + stat->tot_ns += dur; 500 | + stat->iter_cnt += iters; 501 | + 502 | + stat->tot_cnt++; 503 | + 504 | + memdbg(2, "order %ld pfn %ld (0x%08llx) color 
%d iters %d in %lld ns\n", 505 | + (long int)buddy_order(page), (long int)page_to_pfn(page), (u64)page_to_phys(page), 506 | + (int)page_to_color(page), iters, dur); 507 | + } else { 508 | + memdbg(5, "dur %lld is < 0\n", dur); 509 | + } 510 | + 511 | + return; 512 | +} 513 | + 514 | +/* 515 | + * Go through the free lists for the given migratetype and remove 516 | + * the smallest available page from the freelists 517 | + */ 518 | +static inline 519 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 520 | + int migratetype) 521 | +{ 522 | + unsigned int current_order; 523 | + struct free_area *area; 524 | + struct list_head *curr, *tmp; 525 | + struct page *page; 526 | + 527 | + struct palloc *ph; 528 | + struct palloc_stat *c_stat = &palloc.stat[0]; 529 | + struct palloc_stat *n_stat = &palloc.stat[1]; 530 | + struct palloc_stat *f_stat = &palloc.stat[2]; 531 | + 532 | + int iters = 0; 533 | + COLOR_BITMAP(tmpcmap); 534 | + unsigned long *cmap; 535 | + 536 | + if (memdbg_enable) 537 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 538 | + 539 | + if (!use_palloc) 540 | + goto normal_buddy_alloc; 541 | + 542 | + /* cgroup information */ 543 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 544 | + 545 | + if (ph && bitmap_weight(ph->cmap, MAX_PALLOC_BINS) > 0) 546 | + cmap = ph->cmap; 547 | + else { 548 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 549 | + cmap = tmpcmap; 550 | + } 551 | + 552 | + page = NULL; 553 | + if (order == 0) { 554 | + /* Find page in the color cache */ 555 | + memdbg(5, "check color cache (mt=%d)\n", migratetype); 556 | + 557 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 558 | + 559 | + if (page) { 560 | + update_stat(c_stat, page, iters); 561 | + return page; 562 | + } 563 | + } 564 | + 565 | + if (order == 0) { 566 | + /* Build Color Cache */ 567 | + iters++; 568 | + 569 | + /* Search the entire list. 
Make color cache in the process */ 570 | + for (current_order = 0; current_order < MAX_ORDER; ++current_order) { 571 | + area = &(zone->free_area[current_order]); 572 | + 573 | + if (list_empty(&area->free_list[migratetype])) 574 | + continue; 575 | + 576 | + memdbg(3, " order=%d (nr_free=%ld)\n", current_order, area->nr_free); 577 | + 578 | + list_for_each_safe(curr, tmp, &area->free_list[migratetype]) { 579 | + iters++; 580 | + page = list_entry(curr, struct page, lru); 581 | + palloc_insert(zone, page, current_order); 582 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 583 | + 584 | + if (page) { 585 | + update_stat(c_stat, page, iters); 586 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", zone->name, page_to_pfn(page)); 587 | + 588 | + return page; 589 | + } 590 | + } 591 | + } 592 | + 593 | + memdbg(1, "Failed to find a matching color\n"); 594 | + } else { 595 | +normal_buddy_alloc: 596 | + /* Normal Buddy Algorithm */ 597 | + /* Find a page of the specified size in the preferred list */ 598 | + for (current_order = order; current_order < MAX_ORDER; ++current_order) { 599 | + area = &(zone->free_area[current_order]); 600 | + iters++; 601 | + 602 | + page = get_page_from_free_area(area, migratetype); 603 | + if (!page) 604 | + continue; 605 | + del_page_from_free_list(page, zone, current_order); 606 | + // del_page_from_free_area(page, area); 607 | + expand(zone, page, order, current_order, migratetype); 608 | + set_pcppage_migratetype(page, migratetype); 609 | + 610 | + update_stat(n_stat, page, iters); 611 | + 612 | + return page; 613 | + } 614 | + } 615 | + 616 | + /* No memory (colored or normal) found in this zone */ 617 | + memdbg(1, "No memory in Zone %s: order %d mt %d\n", zone->name, order, migratetype); 618 | + 619 | + return NULL; 620 | +} 621 | + 622 | +#else /* CONFIG_CGROUP_PALLOC */ 623 | + 624 | /* 625 | * Go through the free lists for the given migratetype and remove 626 | * the smallest available page from the freelists 627 | @@ 
-2329,6 +2837,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 628 | return NULL; 629 | } 630 | 631 | +#endif /* CONFIG_CGROUP_PALLOC */ 632 | 633 | /* 634 | * This array describes the order lists are fallen back to when 635 | @@ -3430,8 +3939,14 @@ struct page *rmqueue(struct zone *preferred_zone, 636 | { 637 | unsigned long flags; 638 | struct page *page; 639 | - 640 | + struct palloc *ph; 641 | +#ifdef CONFIG_CGROUP_PALLOC 642 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 643 | + /* Skip PCP when physical memory aware allocation is requested */ 644 | + if (likely(order == 0) && !ph) { 645 | +#else 646 | if (likely(order == 0)) { 647 | +#endif 648 | /* 649 | * MIGRATE_MOVABLE pcplist could have the pages on CMA area and 650 | * we need to skip it when CMA area isn't allowed. 651 | @@ -6184,6 +6699,17 @@ void __ref memmap_init_zone_device(struct zone *zone, 652 | static void __meminit zone_init_free_lists(struct zone *zone) 653 | { 654 | unsigned int order, t; 655 | + 656 | +#ifdef CONFIG_CGROUP_PALLOC 657 | + int c; 658 | + 659 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 660 | + INIT_LIST_HEAD(&zone->color_list[c]); 661 | + } 662 | + 663 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS); 664 | +#endif /* CONFIG_CGROUP_PALLOC */ 665 | + 666 | for_each_migratetype_order(order, t) { 667 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); 668 | zone->free_area[order].nr_free = 0; 669 | @@ -8780,6 +9306,10 @@ void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) 670 | offline_mem_sections(pfn, end_pfn); 671 | zone = page_zone(pfn_to_page(pfn)); 672 | spin_lock_irqsave(&zone->lock, flags); 673 | + 674 | +#ifdef CONFIG_CGROUP_PALLOC 675 | + palloc_flush(zone); 676 | +#endif 677 | while (pfn < end_pfn) { 678 | page = pfn_to_page(pfn); 679 | /* 680 | diff --git a/mm/vmstat.c b/mm/vmstat.c 681 | index 698bc0bc1..e14d8b7e0 100644 682 | --- a/mm/vmstat.c 683 | +++ b/mm/vmstat.c 684 | @@ -29,6 
+29,10 @@ 685 | #include 686 | #include 687 | 688 | +#ifdef CONFIG_CGROUP_PALLOC 689 | +#include 690 | +#endif 691 | + 692 | #include "internal.h" 693 | 694 | #define NUMA_STATS_THRESHOLD (U16_MAX - 2) 695 | @@ -1412,6 +1416,44 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat, 696 | { 697 | int order; 698 | 699 | +#ifdef CONFIG_CGROUP_PALLOC 700 | + int color, mt, cnt, bins; 701 | + struct free_area *area; 702 | + struct list_head *curr; 703 | + 704 | + seq_printf(m, "--------\n"); 705 | + 706 | + /* Order by memory type */ 707 | + for (mt = 0; mt < MIGRATE_TYPES; mt++) { 708 | + seq_printf(m, "-%17s[%d]", "mt", mt); 709 | + for (order = 0; order < MAX_ORDER; order++) { 710 | + area = &(zone->free_area[order]); 711 | + cnt = 0; 712 | + 713 | + list_for_each(curr, &area->free_list[mt]) 714 | + cnt++; 715 | + 716 | + seq_printf(m, "%6d ", cnt); 717 | + } 718 | + 719 | + seq_printf(m, "\n"); 720 | + } 721 | + 722 | + /* Order by color */ 723 | + seq_printf(m, "--------\n"); 724 | + bins = palloc_bins(); 725 | + 726 | + for (color = 0; color < bins; color++) { 727 | + seq_printf(m, "- color [%d:%0x]", color, color); 728 | + cnt = 0; 729 | + 730 | + list_for_each(curr, &zone->color_list[color]) 731 | + cnt++; 732 | + 733 | + seq_printf(m, "%6d\n", cnt); 734 | + } 735 | +#endif /* CONFIG_CGROUP_PALLOC */ 736 | + 737 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); 738 | for (order = 0; order < MAX_ORDER; ++order) 739 | seq_printf(m, "%6lu ", zone->free_area[order].nr_free); 740 | -------------------------------------------------------------------------------- /palloc-5.15.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h 2 | index acb77dcff..37a2eaa39 100644 3 | --- a/include/linux/cgroup_subsys.h 4 | +++ b/include/linux/cgroup_subsys.h 5 | @@ -71,3 +71,7 @@ SUBSYS(debug) 6 | /* 7 | * DO NOT ADD ANY SUBSYSTEM WITHOUT 
EXPLICIT ACKS FROM CGROUP MAINTAINERS. 8 | */ 9 | + 10 | +#if IS_ENABLED(CONFIG_CGROUP_PALLOC) 11 | +SUBSYS(palloc) 12 | +#endif 13 | --- a/include/linux/mmzone.h 14 | +++ b/include/linux/mmzone.h 15 | @@ -82,6 +82,14 @@ static inline bool is_migrate_movable(int mt) 16 | return is_migrate_cma(mt) || mt == MIGRATE_MOVABLE; 17 | } 18 | 19 | +#ifdef CONFIG_CGROUP_PALLOC 20 | +/* Determine the number of bins according to the bits required for 21 | + each component of the address */ 22 | +#define MAX_PALLOC_BITS 8 23 | +#define MAX_PALLOC_BINS (1 << MAX_PALLOC_BITS) 24 | +#define COLOR_BITMAP(name) DECLARE_BITMAP(name, MAX_PALLOC_BINS) 25 | +#endif 26 | + 27 | #define for_each_migratetype_order(order, type) \ 28 | for (order = 0; order < MAX_ORDER; order++) \ 29 | for (type = 0; type < MIGRATE_TYPES; type++) 30 | @@ -521,6 +529,14 @@ struct zone { 31 | /* free areas of different sizes */ 32 | struct free_area free_area[MAX_ORDER]; 33 | 34 | +#ifdef CONFIG_CGROUP_PALLOC 35 | + /* 36 | + * Color page cache for movable type free pages of order-0 37 | + */ 38 | + struct list_head color_list[MAX_PALLOC_BINS]; 39 | + COLOR_BITMAP(color_bitmap); 40 | +#endif 41 | + 42 | /* zone flags, see below */ 43 | unsigned long flags; 44 | 45 | --- /dev/null 46 | +++ b/include/linux/palloc.h 47 | @@ -0,0 +1,33 @@ 48 | +#ifndef _LINUX_PALLOC_H 49 | +#define _LINUX_PALLOC_H 50 | + 51 | +/* 52 | + * kernel/palloc.h 53 | + * 54 | + * Physical Memory Aware Allocator 55 | + */ 56 | + 57 | +#include 58 | +#include 59 | +#include 60 | +#include 61 | + 62 | +#ifdef CONFIG_CGROUP_PALLOC 63 | + 64 | +struct palloc { 65 | + struct cgroup_subsys_state css; 66 | + COLOR_BITMAP(cmap); 67 | +}; 68 | + 69 | +/* Retrieve the palloc group corresponding to this cgroup container */ 70 | +struct palloc *cgroup_ph(struct cgroup *cgrp); 71 | + 72 | +/* Retrieve the palloc group corresponding to this subsys */ 73 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys); 74 | + 75 | +/* Return number 
of palloc bins */ 76 | +int palloc_bins(void); 77 | + 78 | +#endif /* CONFIG_CGROUP_PALLOC */ 79 | + 80 | +#endif /* _LINUX_PALLOC_H */ 81 | --- a/init/Kconfig 82 | +++ b/init/Kconfig 83 | @@ -1041,6 +1041,13 @@ config SOCK_CGROUP_DATA 84 | bool 85 | default n 86 | 87 | +config CGROUP_PALLOC 88 | + bool "Enable PALLOC" 89 | + help 90 | + Enable PALLOC. PALLOC is a color-aware page-based physical memory 91 | + allocator which replaces the buddy allocator for order-zero page 92 | + allocations. 93 | + 94 | endif # CGROUPS 95 | 96 | menuconfig NAMESPACES 97 | --- a/kernel/cgroup/cgroup.c 98 | +++ b/kernel/cgroup/cgroup.c 99 | @@ -5701,10 +5701,12 @@ int __init cgroup_init_early(void) 100 | RCU_INIT_POINTER(init_task.cgroups, &init_css_set); 101 | 102 | for_each_subsys(ss, i) { 103 | +#if 0 104 | WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id, 105 | "invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p id:name=%d:%s\n", 106 | i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free, 107 | ss->id, ss->name); 108 | +#endif 109 | WARN(strlen(cgroup_subsys_name[i]) > MAX_CGROUP_TYPE_NAMELEN, 110 | "cgroup_subsys_name %s too long\n", cgroup_subsys_name[i]); 111 | 112 | --- a/mm/Makefile 113 | +++ b/mm/Makefile 114 | @@ -94,6 +94,7 @@ obj-$(CONFIG_ZBUD) += zbud.o 115 | obj-$(CONFIG_ZSMALLOC) += zsmalloc.o 116 | obj-$(CONFIG_Z3FOLD) += z3fold.o 117 | obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o 118 | +obj-$(CONFIG_CGROUP_PALLOC) += palloc.o 119 | obj-$(CONFIG_CMA) += cma.o 120 | obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o 121 | obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o 122 | --- a/mm/page_alloc.c 123 | +++ b/mm/page_alloc.c 124 | @@ -71,6 +71,7 @@ 125 | #include 126 | 127 | #include 128 | +#include 129 | #include 130 | #include 131 | #include "internal.h" 132 | @@ -78,6 +79,191 @@ 133 | 134 | /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ 135 | static DEFINE_MUTEX(pcp_batch_high_lock); 136 | + 137 | +#ifdef 
CONFIG_CGROUP_PALLOC 138 | +#include 139 | + 140 | +int memdbg_enable = 0; 141 | +EXPORT_SYMBOL(memdbg_enable); 142 | + 143 | +static int sysctl_alloc_balance = 0; 144 | + 145 | +/* PALLOC address bitmask */ 146 | +static unsigned long sysctl_palloc_mask = 0x0; 147 | + 148 | +static int mc_xor_bits[64]; 149 | +static int use_mc_xor = 0; 150 | +static int use_palloc = 0; 151 | + 152 | +DEFINE_PER_CPU(unsigned long, palloc_rand_seed); 153 | + 154 | +#define memdbg(lvl, fmt, ...) \ 155 | + do { \ 156 | + if(memdbg_enable >= lvl) \ 157 | + trace_printk(fmt, ##__VA_ARGS__); \ 158 | + } while(0) 159 | + 160 | +struct palloc_stat { 161 | + s64 max_ns; 162 | + s64 min_ns; 163 | + s64 tot_ns; 164 | + 165 | + s64 tot_cnt; 166 | + s64 iter_cnt; /* avg_iter = iter_cnt / tot_cnt */ 167 | + 168 | + s64 cache_hit_cnt; /* hit_rate = cache_hit_cnt / cache_acc_cnt */ 169 | + s64 cache_acc_cnt; 170 | + 171 | + s64 flush_cnt; 172 | + 173 | + s64 alloc_balance; 174 | + s64 alloc_balance_timeout; 175 | + ktime_t start; /* Start time of the current iteration */ 176 | +}; 177 | + 178 | +static struct { 179 | + u32 enabled; 180 | + int colors; 181 | + struct palloc_stat stat[3]; /* 0 - color, 1 - normal, 2- fail */ 182 | +} palloc; 183 | + 184 | +static void palloc_flush(struct zone *zone); 185 | + 186 | +static ssize_t palloc_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *ppos) 187 | +{ 188 | + char buf[64]; 189 | + int i; 190 | + 191 | + if (cnt > 63) cnt = 63; 192 | + if (copy_from_user(&buf, ubuf, cnt)) 193 | + return -EFAULT; 194 | + 195 | + if (!strncmp(buf, "reset", 5)) { 196 | + printk(KERN_INFO "reset statistics...\n"); 197 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 198 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 199 | + palloc.stat[i].min_ns = 0x7fffffff; 200 | + } 201 | + } else if (!strncmp(buf, "flush", 5)) { 202 | + struct zone *zone; 203 | + printk(KERN_INFO "flush color cache...\n"); 204 | + for_each_populated_zone(zone) { 205 
| + unsigned long flags; 206 | + if (!zone) 207 | + continue; 208 | + spin_lock_irqsave(&zone->lock, flags); 209 | + palloc_flush(zone); 210 | + spin_unlock_irqrestore(&zone->lock, flags); 211 | + } 212 | + } else if (!strncmp(buf, "xor", 3)) { 213 | + int bit, xor_bit; 214 | + sscanf(buf + 4, "%d %d", &bit, &xor_bit); 215 | + if ((bit > 0 && bit < 64) && (xor_bit > 0 && xor_bit < 64) && bit != xor_bit) { 216 | + mc_xor_bits[bit] = xor_bit; 217 | + } 218 | + } 219 | + 220 | + *ppos += cnt; 221 | + 222 | + return cnt; 223 | +} 224 | + 225 | +static int palloc_show(struct seq_file *m, void *v) 226 | +{ 227 | + int i, tmp; 228 | + char *desc[] = { "Color", "Normal", "Fail" }; 229 | + char buf[256]; 230 | + 231 | + for (i = 0; i < 3; i++) { 232 | + struct palloc_stat *stat = &palloc.stat[i]; 233 | + seq_printf(m, "statistics %s:\n", desc[i]); 234 | + seq_printf(m, " min(ns)/max(ns)/avg(ns)/tot_cnt: %lld %lld %lld %lld\n", 235 | + stat->min_ns, 236 | + stat->max_ns, 237 | + (stat->tot_cnt)? div64_u64(stat->tot_ns, stat->tot_cnt) : 0, 238 | + stat->tot_cnt); 239 | + seq_printf(m, " hit rate: %lld/%lld (%lld %%)\n", 240 | + stat->cache_hit_cnt, stat->cache_acc_cnt, 241 | + (stat->cache_acc_cnt)? div64_u64(stat->cache_hit_cnt*100, stat->cache_acc_cnt) : 0); 242 | + seq_printf(m, " avg iter: %lld (%lld/%lld)\n", 243 | + (stat->tot_cnt)? 
div64_u64(stat->iter_cnt, stat->tot_cnt) : 0, 244 | + stat->iter_cnt, stat->tot_cnt); 245 | + seq_printf(m, " flush cnt: %lld\n", stat->flush_cnt); 246 | + 247 | + seq_printf(m, " balance: %lld | fail: %lld\n", 248 | + stat->alloc_balance, stat->alloc_balance_timeout); 249 | + } 250 | + 251 | + seq_printf(m, "mask: 0x%lx\n", sysctl_palloc_mask); 252 | + 253 | + tmp = bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long)*8); 254 | + 255 | + seq_printf(m, "weight: %d (bins: %d)\n", tmp, (1 << tmp)); 256 | + 257 | + scnprintf(buf, 256, "%*pbl", (int)(sizeof(unsigned long) * 8), &sysctl_palloc_mask); 258 | + 259 | + seq_printf(m, "bits: %s\n", buf); 260 | + 261 | + seq_printf(m, "XOR bits: %s\n", (use_mc_xor)? "enabled" : "disabled"); 262 | + 263 | + for (i = 0; i < 64; i++) { 264 | + if (mc_xor_bits[i] > 0) 265 | + seq_printf(m, " %3d <-> %3d\n", i, mc_xor_bits[i]); 266 | + } 267 | + 268 | + seq_printf(m, "Use PALLOC: %s\n", (use_palloc)? "enabled" : "disabled"); 269 | + 270 | + return 0; 271 | +} 272 | + 273 | +static int palloc_open(struct inode *inode, struct file *filp) 274 | +{ 275 | + return single_open(filp, palloc_show, NULL); 276 | +} 277 | + 278 | +static const struct file_operations palloc_fops = { 279 | + .open = palloc_open, 280 | + .write = palloc_write, 281 | + .read = seq_read, 282 | + .llseek = seq_lseek, 283 | + .release = single_release, 284 | +}; 285 | + 286 | +static int __init palloc_debugfs(void) 287 | +{ 288 | + umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; 289 | + struct dentry *dir; 290 | + int i; 291 | + 292 | + dir = debugfs_create_dir("palloc", NULL); 293 | + 294 | + /* Statistics Initialization */ 295 | + for (i = 0; i < ARRAY_SIZE(palloc.stat); i++) { 296 | + memset(&palloc.stat[i], 0, sizeof(struct palloc_stat)); 297 | + palloc.stat[i].min_ns = 0x7fffffff; 298 | + } 299 | + 300 | + if (!dir) 301 | + return PTR_ERR(dir); 302 | + if (!debugfs_create_file("control", mode, dir, NULL, &palloc_fops)) 303 | + goto fail; 304 | + 
debugfs_create_u64("palloc_mask", mode, dir, (u64 *)&sysctl_palloc_mask); 305 | + debugfs_create_u32("use_mc_xor", mode, dir, &use_mc_xor); 306 | + debugfs_create_u32("use_palloc", mode, dir, &use_palloc); 307 | + debugfs_create_u32("debug_level", mode, dir, &memdbg_enable); 308 | + debugfs_create_u32("alloc_balance", mode, dir, &sysctl_alloc_balance); 309 | + 310 | + return 0; 311 | + 312 | +fail: 313 | + debugfs_remove_recursive(dir); 314 | + return -ENOMEM; 315 | +} 316 | + 317 | +late_initcall(palloc_debugfs); 318 | + 319 | +#endif /* CONFIG_CGROUP_PALLOC */ 320 | + 321 | #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) 322 | 323 | struct pagesets { 324 | @@ -2176,6 +2362,338 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags 325 | clear_page_pfmemalloc(page); 326 | } 327 | 328 | +#ifdef CONFIG_CGROUP_PALLOC 329 | + 330 | +int palloc_bins(void) 331 | +{ 332 | + return min((1 << bitmap_weight(&sysctl_palloc_mask, sizeof(unsigned long) * 8)), MAX_PALLOC_BINS); 333 | +} 334 | + 335 | +static inline int page_to_color(struct page *page) 336 | +{ 337 | + int color = 0; 338 | + int idx = 0; 339 | + int c; 340 | + 341 | + unsigned long paddr = page_to_phys(page); 342 | + for_each_set_bit(c, &sysctl_palloc_mask, sizeof(unsigned long) * 8) { 343 | + if (use_mc_xor) { 344 | + if (((paddr >> c) & 0x1) ^ ((paddr >> mc_xor_bits[c]) & 0x1)) 345 | + color |= (1 << idx); 346 | + } else { 347 | + if ((paddr >> c) & 0x1) 348 | + color |= (1 << idx); 349 | + } 350 | + 351 | + idx++; 352 | + } 353 | + 354 | + return color; 355 | +} 356 | + 357 | +/* Debug */ 358 | +static inline unsigned long list_count(struct list_head *head) 359 | +{ 360 | + unsigned long n = 0; 361 | + struct list_head *curr; 362 | + 363 | + list_for_each(curr, head) 364 | + n++; 365 | + 366 | + return n; 367 | +} 368 | + 369 | +/* Move all color_list pages into free_area[0].freelist[2] 370 | + * zone->lock must be held before calling this function 371 | + */ 372 | +static void 
palloc_flush(struct zone *zone) 373 | +{ 374 | + int c; 375 | + struct page *page; 376 | + 377 | + memdbg(2, "Flush the color-cache for zone %s\n", zone->name); 378 | + 379 | + while(1) { 380 | + for (c = 0; c < MAX_PALLOC_BINS; c++) { 381 | + if (!list_empty(&zone->color_list[c])) { 382 | + page = list_entry(zone->color_list[c].next, struct page, lru); 383 | + list_del_init(&page->lru); 384 | + __free_one_page(page, page_to_pfn(page), zone, 0, get_pageblock_migratetype(page), FPI_NONE); 385 | + zone->free_area[0].nr_free--; 386 | + } 387 | + 388 | + if (list_empty(&zone->color_list[c])) { 389 | + bitmap_clear(zone->color_bitmap, c, 1); 390 | + INIT_LIST_HEAD(&zone->color_list[c]); 391 | + } 392 | + } 393 | + 394 | + if (bitmap_weight(zone->color_bitmap, MAX_PALLOC_BINS) == 0) 395 | + break; 396 | + } 397 | +} 398 | + 399 | + 400 | +static inline void rmv_page_order(struct page *page) 401 | +{ 402 | + __ClearPageBuddy(page); 403 | + set_page_private(page, 0); 404 | +} 405 | + 406 | +/* Move a page (size = 1 << order) into order-0 colored cache */ 407 | +static void palloc_insert(struct zone *zone, struct page *page, int order) 408 | +{ 409 | + int i, color; 410 | + 411 | + /* 1 page (2^order) -> 2^order x pages of colored cache. 
412 | + Remove from zone->free_area[order].free_list[mt] */ 413 | + list_del(&page->lru); 414 | + zone->free_area[order].nr_free--; 415 | + 416 | + /* Insert pages to zone->color_list[] (all order-0) */ 417 | + for (i = 0; i < (1 << order); i++) { 418 | + color = page_to_color(&page[i]); 419 | + 420 | + /* Add to zone->color_list[color] */ 421 | + memdbg(5, "- Add pfn %ld (0x%08llx) to color_list[%d]\n", page_to_pfn(&page[i]), (u64)page_to_phys(&page[i]), color); 422 | + 423 | + INIT_LIST_HEAD(&page[i].lru); 424 | + list_add_tail(&page[i].lru, &zone->color_list[color]); 425 | + bitmap_set(zone->color_bitmap, color, 1); 426 | + zone->free_area[0].nr_free++; 427 | + rmv_page_order(&page[i]); 428 | + } 429 | + 430 | + memdbg(4, "Add order=%d zone=%s\n", order, zone->name); 431 | + 432 | + return; 433 | +} 434 | + 435 | +/* Return a colored page (order-0) and remove it from the colored cache */ 436 | +static inline struct page *palloc_find_cmap(struct zone *zone, COLOR_BITMAP(cmap), int order, struct palloc_stat *stat) 437 | +{ 438 | + struct page *page; 439 | + COLOR_BITMAP(tmpmask); 440 | + int c; 441 | + unsigned int tmp_idx; 442 | + int found_w, want_w; 443 | + unsigned long rand_seed; 444 | + 445 | + /* Cache Statistics */ 446 | + if (stat) stat->cache_acc_cnt++; 447 | + 448 | + /* Find color cache entry */ 449 | + if (!bitmap_intersects(zone->color_bitmap, cmap, MAX_PALLOC_BINS)) 450 | + return NULL; 451 | + 452 | + bitmap_and(tmpmask, zone->color_bitmap, cmap, MAX_PALLOC_BINS); 453 | + 454 | + /* Must have a balance */ 455 | + found_w = bitmap_weight(tmpmask, MAX_PALLOC_BINS); 456 | + want_w = bitmap_weight(cmap, MAX_PALLOC_BINS); 457 | + 458 | + if (sysctl_alloc_balance && (found_w < want_w) && (found_w < min(sysctl_alloc_balance, want_w)) && memdbg_enable) { 459 | + ktime_t dur = ktime_sub(ktime_get(), stat->start); 460 | + if (dur < 1000000) { 461 | + /* Try to balance unless order = MAX-2 or 1ms has passed */ 462 | + memdbg(4, "found_w=%d want_w=%d order=%d 
elapsed=%lld ns\n", found_w, want_w, order, dur); 463 | + 464 | + stat->alloc_balance++; 465 | + 466 | + return NULL; 467 | + } 468 | + 469 | + stat->alloc_balance_timeout++; 470 | + } 471 | + 472 | + /* Choose a bit among the candidates */ 473 | + if (sysctl_alloc_balance && memdbg_enable) { 474 | + rand_seed = (unsigned long)stat->start; 475 | + } else { 476 | + rand_seed = per_cpu(palloc_rand_seed, smp_processor_id())++; 477 | + 478 | + if (rand_seed > MAX_PALLOC_BINS) 479 | + per_cpu(palloc_rand_seed, smp_processor_id()) = 0; 480 | + } 481 | + 482 | + tmp_idx = rand_seed % found_w; 483 | + 484 | + for_each_set_bit(c, tmpmask, MAX_PALLOC_BINS) { 485 | + if (tmp_idx-- <= 0) 486 | + break; 487 | + } 488 | + 489 | + BUG_ON(c >= MAX_PALLOC_BINS); 490 | + BUG_ON(list_empty(&zone->color_list[c])); 491 | + 492 | + page = list_entry(zone->color_list[c].next, struct page, lru); 493 | + 494 | + memdbg(1, "Found colored page pfn %ld color %d seed %ld found/want %d/%d\n", 495 | + page_to_pfn(page), c, rand_seed, found_w, want_w); 496 | + 497 | + /* Remove the page from the zone->color_list[color] */ 498 | + list_del(&page->lru); 499 | + 500 | + if (list_empty(&zone->color_list[c])) 501 | + bitmap_clear(zone->color_bitmap, c, 1); 502 | + 503 | + zone->free_area[0].nr_free--; 504 | + 505 | + memdbg(5, "- del pfn %ld from color_list[%d]\n", page_to_pfn(page), c); 506 | + 507 | + if (stat) stat->cache_hit_cnt++; 508 | + 509 | + return page; 510 | +} 511 | + 512 | +static inline void update_stat(struct palloc_stat *stat, struct page *page, int iters) 513 | +{ 514 | + ktime_t dur; 515 | + 516 | + if (memdbg_enable == 0) 517 | + return; 518 | + 519 | + dur = ktime_sub(ktime_get(), stat->start); 520 | + 521 | + if (dur > 0) { 522 | + stat->min_ns = min(dur, stat->min_ns); 523 | + stat->max_ns = max(dur, stat->max_ns); 524 | + 525 | + stat->tot_ns += dur; 526 | + stat->iter_cnt += iters; 527 | + 528 | + stat->tot_cnt++; 529 | + 530 | + memdbg(2, "order %ld pfn %ld (0x%08llx) color 
%d iters %d in %lld ns\n", 531 | + (long int)buddy_order(page), (long int)page_to_pfn(page), (u64)page_to_phys(page), 532 | + (int)page_to_color(page), iters, dur); 533 | + } else { 534 | + memdbg(5, "dur %lld is < 0\n", dur); 535 | + } 536 | + 537 | + return; 538 | +} 539 | + 540 | +/* 541 | + * Go through the free lists for the given migratetype and remove 542 | + * the smallest available page from the freelists 543 | + */ 544 | +static inline 545 | +struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 546 | + int migratetype) 547 | +{ 548 | + unsigned int current_order; 549 | + struct free_area *area; 550 | + struct list_head *curr, *tmp; 551 | + struct page *page; 552 | + 553 | + struct palloc *ph; 554 | + struct palloc_stat *c_stat = &palloc.stat[0]; 555 | + struct palloc_stat *n_stat = &palloc.stat[1]; 556 | + struct palloc_stat *f_stat = &palloc.stat[2]; 557 | + 558 | + int iters = 0; 559 | + COLOR_BITMAP(tmpcmap); 560 | + unsigned long *cmap; 561 | + 562 | + if (memdbg_enable) 563 | + c_stat->start = n_stat->start = f_stat->start = ktime_get(); 564 | + 565 | + if (!use_palloc) 566 | + goto normal_buddy_alloc; 567 | + 568 | + /* cgroup information */ 569 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 570 | + 571 | + if (ph && bitmap_weight(ph->cmap, MAX_PALLOC_BINS) > 0) 572 | + cmap = ph->cmap; 573 | + else { 574 | + bitmap_fill(tmpcmap, MAX_PALLOC_BINS); 575 | + cmap = tmpcmap; 576 | + } 577 | + 578 | + page = NULL; 579 | + if (order == 0) { 580 | + /* Find page in the color cache */ 581 | + memdbg(5, "check color cache (mt=%d)\n", migratetype); 582 | + 583 | + page = palloc_find_cmap(zone, cmap, 0, c_stat); 584 | + 585 | + if (page) { 586 | + update_stat(c_stat, page, iters); 587 | + return page; 588 | + } 589 | + } 590 | + 591 | + if (order == 0) { 592 | + /* Build Color Cache */ 593 | + iters++; 594 | + 595 | + /* Search the entire list. 
Make color cache in the process */ 596 | + for (current_order = 0; current_order < MAX_ORDER; ++current_order) { 597 | + area = &(zone->free_area[current_order]); 598 | + 599 | + if (list_empty(&area->free_list[migratetype])) 600 | + continue; 601 | + 602 | + memdbg(3, " order=%d (nr_free=%ld)\n", current_order, area->nr_free); 603 | + 604 | + list_for_each_safe(curr, tmp, &area->free_list[migratetype]) { 605 | + iters++; 606 | + page = list_entry(curr, struct page, lru); 607 | + palloc_insert(zone, page, current_order); 608 | + page = palloc_find_cmap(zone, cmap, current_order, c_stat); 609 | + 610 | + if (page) { 611 | + update_stat(c_stat, page, iters); 612 | + memdbg(1, "Found at Zone %s pfn 0x%lx\n", zone->name, page_to_pfn(page)); 613 | + 614 | + return page; 615 | + } 616 | + } 617 | + } 618 | + 619 | + memdbg(1, "Failed to find a matching color\n"); 620 | + } else { 621 | +normal_buddy_alloc: 622 | + /* Normal Buddy Algorithm */ 623 | + /* Find a page of the specified size in the preferred list */ 624 | + for (current_order = order; current_order < MAX_ORDER; ++current_order) { 625 | + area = &(zone->free_area[current_order]); 626 | + iters++; 627 | + 628 | +/* if (list_empty(&area->free_list[migratetype])) 629 | + continue; 630 | + 631 | + page = list_entry(area->free_list[migratetype].next, struct page, lru); 632 | + 633 | + list_del(&page->lru); 634 | + rmv_page_order(page); 635 | + area->nr_free--; 636 | + expand(zone, page, order, current_order, area, migratetype); 637 | +*/ 638 | + 639 | + page = get_page_from_free_area(area, migratetype); 640 | + if (!page) 641 | + continue; 642 | + del_page_from_free_list(page, zone, current_order); 643 | + expand(zone, page, order, current_order, migratetype); 644 | + set_pcppage_migratetype(page, migratetype); 645 | + 646 | + update_stat(n_stat, page, iters); 647 | + 648 | + return page; 649 | + } 650 | + } 651 | + 652 | + /* No memory (colored or normal) found in this zone */ 653 | + memdbg(1, "No memory in Zone 
%s: order %d mt %d\n", zone->name, order, migratetype); 654 | + 655 | + return NULL; 656 | +} 657 | + 658 | +#else /* CONFIG_CGROUP_PALLOC */ 659 | + 660 | /* 661 | * Go through the free lists for the given migratetype and remove 662 | * the smallest available page from the freelists 663 | @@ -2203,6 +2721,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, 664 | return NULL; 665 | } 666 | 667 | +#endif /* CONFIG_CGROUP_PALLOC */ 668 | 669 | /* 670 | * This array describes the order lists are fallen back to when 671 | @@ -3268,8 +3787,16 @@ struct page *rmqueue(struct zone *preferred_zone, 672 | { 673 | unsigned long flags; 674 | struct page *page; 675 | + struct palloc *ph; 676 | 677 | +#ifdef CONFIG_CGROUP_PALLOC 678 | + ph = ph_from_subsys(current->cgroups->subsys[palloc_cgrp_id]); 679 | + /* Skip PCP when physical memory aware allocation is requested */ 680 | + if (likely(pcp_allowed_order(order)) && !ph) { 681 | +#else 682 | if (likely(pcp_allowed_order(order))) { 683 | + 684 | +#endif 685 | /* 686 | * MIGRATE_MOVABLE pcplist could have the pages on CMA area and 687 | * we need to skip it when CMA area isn't allowed. 
688 | @@ -6042,6 +6574,17 @@ void __ref memmap_init_zone_device(struct zone *zone,
689 | static void __meminit zone_init_free_lists(struct zone *zone)
690 | {
691 | unsigned int order, t;
692 | +
693 | +#ifdef CONFIG_CGROUP_PALLOC
694 | + int c;
695 | +
696 | + for (c = 0; c < MAX_PALLOC_BINS; c++) {
697 | + INIT_LIST_HEAD(&zone->color_list[c]);
698 | + }
699 | +
700 | + bitmap_zero(zone->color_bitmap, MAX_PALLOC_BINS);
701 | +#endif /* CONFIG_CGROUP_PALLOC */
702 | +
703 | for_each_migratetype_order(order, t) {
704 | INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
705 | zone->free_area[order].nr_free = 0;
706 | @@ -8606,5 +9144,10 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
707 | offline_mem_sections(pfn, end_pfn);
708 | zone = page_zone(pfn_to_page(pfn));
709 | spin_lock_irqsave(&zone->lock, flags);
710 | +
711 | +#ifdef CONFIG_CGROUP_PALLOC
712 | + palloc_flush(zone);
713 | +#endif
714 | +
715 | while (pfn < end_pfn) {
716 | page = pfn_to_page(pfn);
717 | /*
718 | diff --git a/mm/palloc.c b/mm/palloc.c
719 | --- /dev/null
720 | +++ b/mm/palloc.c
721 | @@ -0,0 +1,173 @@
722 | +/**
723 | + * kernel/palloc.c
724 | + *
725 | + * Color Aware Physical Memory Allocator User-Space Information
726 | + *
727 | + */
728 | +
729 | +#include <linux/types.h>
730 | +#include <linux/cgroup.h>
731 | +#include <linux/kernel.h>
732 | +#include <linux/slab.h>
733 | +#include <linux/palloc.h>
734 | +#include <linux/fs.h>
735 | +#include <linux/seq_file.h>
736 | +#include <linux/module.h>
737 | +#include <linux/bitmap.h>
738 | +#include <linux/err.h>
739 | +
740 | +/**
741 | + * Check if a page is compliant with the policy defined for the given vma
742 | + */
743 | +#ifdef CONFIG_CGROUP_PALLOC
744 | +
745 | +#define MAX_LINE_LEN (6 * 128)
746 | +
747 | +/**
748 | + * Type of files in a palloc group
749 | + * FILE_PALLOC - contains list of palloc bins allowed
750 | + */
751 | +typedef enum {
752 | + FILE_PALLOC,
753 | +} palloc_filetype_t;
754 | +
755 | +/**
756 | + * Retrieve the palloc group corresponding to this cgroup container
757 | + */
758 | +struct palloc *cgroup_ph(struct cgroup *cgrp)
759 | +{
760 | + return container_of(cgrp->subsys[palloc_cgrp_id], struct palloc, css);
761 | +}
762 | +
763 | +struct palloc *ph_from_subsys(struct cgroup_subsys_state *subsys)
764 | +{
765 | + return container_of(subsys, struct palloc, css);
766 | +}
767 | +
768 | +/**
769 | + * Common write function for files in palloc cgroup
770 | + */
771 | +static int update_bitmask(unsigned long *bitmap, const char *buf, int maxbits)
772 | +{
773 | + int retval = 0;
774 | +
775 | + if (!*buf)
776 | + bitmap_clear(bitmap, 0, maxbits);
777 | + else
778 | + retval = bitmap_parselist(buf, bitmap, maxbits);
779 | +
780 | + return retval;
781 | +}
782 | +
783 | +static ssize_t palloc_file_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off)
784 | +{
785 | + struct cgroup_subsys_state *css;
786 | + struct cftype *cft;
787 | + int retval = 0;
788 | + struct palloc *ph;
789 | +
790 | + css = of_css(of);
791 | + cft = of_cft(of);
792 | + ph = container_of(css, struct palloc, css);
793 | +
794 | + switch (cft->private) {
795 | + case FILE_PALLOC:
796 | + retval = update_bitmask(ph->cmap, buf, palloc_bins());
797 | + printk(KERN_INFO "Bins : %s\n", buf);
798 | + break;
799 | +
800 | + default:
801 | + retval = -EINVAL;
802 | + break;
803 | + }
804 | +
805 | + return retval ?: nbytes;
806 | +}
807 | +
808 | +static int palloc_file_read(struct seq_file *sf, void *v)
809 | +{
810 | + struct cgroup_subsys_state *css = seq_css(sf);
811 | + struct cftype *cft = seq_cft(sf);
812 | + struct palloc *ph = container_of(css, struct palloc, css);
813 | + char *page;
814 | + ssize_t retval = 0;
815 | + char *s;
816 | +
817 | + if (!(page = (char *)__get_free_page(GFP_KERNEL | __GFP_ZERO)))
818 | + return -ENOMEM;
819 | +
820 | + s = page;
821 | +
822 | + switch (cft->private) {
823 | + case FILE_PALLOC:
824 | + s += scnprintf(s, PAGE_SIZE, "%*pbl", (int)palloc_bins(), ph->cmap);
825 | + *s++ = '\n';
826 | + printk(KERN_INFO "Bins : %s", page);
827 | + break;
828 | +
829 | + default:
830 | + retval = -EINVAL;
831 | + goto out;
832 | + }
833 | +
834 | + seq_printf(sf, "%s", page);
835 | +
836 | +out:
837 | + free_page((unsigned long)page);
838 | + return retval;
839 | +}
840 | +
841 | +/**
842 | + * struct cftype : handler definitions for cgroup control files
843 | + *
844 | + * for the common functions, 'private' gives the type of the file
845 | + */
846 | +static struct cftype files[] = {
847 | + {
848 | + .name = "bins",
849 | + .seq_show = palloc_file_read,
850 | + .write = palloc_file_write,
851 | + .max_write_len = MAX_LINE_LEN,
852 | + .private = FILE_PALLOC,
853 | + },
854 | + {}
855 | +};
856 | +
857 | +
858 | +/**
859 | + * palloc_create - create a palloc group
860 | + */
861 | +static struct cgroup_subsys_state *palloc_create(struct cgroup_subsys_state *css)
862 | +{
863 | + struct palloc *ph_child;
864 | +
865 | + ph_child = kmalloc(sizeof(struct palloc), GFP_KERNEL);
866 | +
867 | + if (!ph_child)
868 | + return ERR_PTR(-ENOMEM);
869 | +
870 | + bitmap_clear(ph_child->cmap, 0, MAX_PALLOC_BINS);
871 | +
872 | + return &ph_child->css;
873 | +}
874 | +
875 | +/**
876 | + * Destroy an existing palloc group
877 | + */
878 | +static void palloc_destroy(struct cgroup_subsys_state *css)
879 | +{
880 | + struct palloc *ph = container_of(css, struct palloc, css);
881 | +
882 | + kfree(ph);
883 | +}
884 | +
885 | +struct cgroup_subsys palloc_cgrp_subsys = {
886 | + .name = "palloc",
887 | + .css_alloc = palloc_create,
888 | + .css_free = palloc_destroy,
889 | + .id = palloc_cgrp_id,
890 | + .dfl_cftypes = files,
891 | + .legacy_cftypes = files,
892 | +};
893 | +
894 | +#endif /* CONFIG_CGROUP_PALLOC */
895 | --- a/mm/vmstat.c
896 | +++ b/mm/vmstat.c
897 | @@ -29,6 +29,10 @@
898 | #include <linux/page_ext.h>
899 | #include <linux/page_owner.h>
900 | 
901 | +#ifdef CONFIG_CGROUP_PALLOC
902 | +#include <linux/palloc.h>
903 | +#endif
904 | +
905 | #include "internal.h"
906 | 
907 | #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
908 | @@ -1353,6 +1357,44 @@ static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
909 | {
910 | int order;
911 | 
912 | +#ifdef CONFIG_CGROUP_PALLOC
913 | + int color, mt, cnt, bins;
914 | + struct free_area *area;
915 | + struct list_head *curr;
916 | +
917 | + seq_printf(m, "--------\n");
918 | +
919 | + /* Order by memory type */
920 | + for (mt = 0; mt < MIGRATE_TYPES; mt++) {
921 | + seq_printf(m, "-%17s[%d]", "mt", mt);
922 | + for (order = 0; order < MAX_ORDER; order++) {
923 | + area = &(zone->free_area[order]);
924 | + cnt = 0;
925 | +
926 | + list_for_each(curr, &area->free_list[mt])
927 | + cnt++;
928 | +
929 | + seq_printf(m, "%6d ", cnt);
930 | + }
931 | +
932 | + seq_printf(m, "\n");
933 | + }
934 | +
935 | + /* Order by color */
936 | + seq_printf(m, "--------\n");
937 | + bins = palloc_bins();
938 | +
939 | + for (color = 0; color < bins; color++) {
940 | + seq_printf(m, "- color [%d:%0x]", color, color);
941 | + cnt = 0;
942 | +
943 | + list_for_each(curr, &zone->color_list[color])
944 | + cnt++;
945 | +
946 | + seq_printf(m, "%6d\n", cnt);
947 | + }
948 | +#endif /* CONFIG_CGROUP_PALLOC */
949 | +
950 | seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
951 | for (order = 0; order < MAX_ORDER; ++order)
952 | seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
953 | 
--------------------------------------------------------------------------------
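For reference, here is a hedged sketch of how the `palloc` cgroup added by this patch is typically driven from user space. The mount point and the partition name `part1` are arbitrary examples; the `palloc.bins` file name follows from the subsystem name (`palloc`) plus the `bins` cftype defined in `mm/palloc.c`. Only the bin-count arithmetic actually runs here; the cgroup commands need root on a kernel built with CONFIG_CGROUP_PALLOC, so they are left commented out.

```shell
#!/bin/sh
# Sketch: partitioning DRAM banks with the palloc cgroup interface.
# Assumption: 4 detected bank-address bits (as in the RTAS'14 example),
# which PALLOC turns into 2^4 = 16 page-color bins.
BANK_BITS=4
BINS=$((1 << BANK_BITS))
echo "total palloc bins: $BINS"

# Requires root and CONFIG_CGROUP_PALLOC; shown for illustration only:
#   mount -t cgroup -o palloc none /sys/fs/cgroup/palloc
#   mkdir /sys/fs/cgroup/palloc/part1
#   echo 0-3 > /sys/fs/cgroup/palloc/part1/palloc.bins   # allow bins 0..3 only
#   echo $$ > /sys/fs/cgroup/palloc/part1/tasks          # move this shell in
```

Writing `0-3` exercises `palloc_file_write` above: the range string is parsed by `bitmap_parselist()` into the group's `cmap`, and reading `palloc.bins` back prints the same bitmap via the `%*pbl` format.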