├── .gitignore ├── LICENSE ├── README.md ├── drivers ├── .gitignore ├── Makefile ├── fastswap.c ├── fastswap_dram.c ├── fastswap_dram.h ├── fastswap_rdma.c └── fastswap_rdma.h ├── farmemserver ├── .gitignore ├── Makefile └── rmserver.c └── kernel ├── config-4.11.0-041100-generic └── kernel.patch /.gitignore: -------------------------------------------------------------------------------- 1 | *.o 2 | *.o.d 3 | *.mod.o 4 | *.mod.c 5 | *.ko 6 | *.cmd 7 | *.tmp_versions 8 | *.pyc 9 | modules.order 10 | Module.symvers 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU LESSER GENERAL PUBLIC LICENSE 2 | Version 2.1, February 1999 3 | 4 | Copyright (C) 1991, 1999 Free Software Foundation, Inc. 5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 8 | 9 | (This is the first released version of the Lesser GPL. It also counts 10 | as the successor of the GNU Library Public License, version 2, hence 11 | the version number 2.1.) 12 | 13 | Preamble 14 | 15 | The licenses for most software are designed to take away your 16 | freedom to share and change it. By contrast, the GNU General Public 17 | Licenses are intended to guarantee your freedom to share and change 18 | free software--to make sure the software is free for all its users. 19 | 20 | This license, the Lesser General Public License, applies to some 21 | specially designated software packages--typically libraries--of the 22 | Free Software Foundation and other authors who decide to use it. You 23 | can use it too, but we suggest you first think carefully about whether 24 | this license or the ordinary General Public License is the better 25 | strategy to use in any particular case, based on the explanations below. 
26 | 27 | When we speak of free software, we are referring to freedom of use, 28 | not price. Our General Public Licenses are designed to make sure that 29 | you have the freedom to distribute copies of free software (and charge 30 | for this service if you wish); that you receive source code or can get 31 | it if you want it; that you can change the software and use pieces of 32 | it in new free programs; and that you are informed that you can do 33 | these things. 34 | 35 | To protect your rights, we need to make restrictions that forbid 36 | distributors to deny you these rights or to ask you to surrender these 37 | rights. These restrictions translate to certain responsibilities for 38 | you if you distribute copies of the library or if you modify it. 39 | 40 | For example, if you distribute copies of the library, whether gratis 41 | or for a fee, you must give the recipients all the rights that we gave 42 | you. You must make sure that they, too, receive or can get the source 43 | code. If you link other code with the library, you must provide 44 | complete object files to the recipients, so that they can relink them 45 | with the library after making changes to the library and recompiling 46 | it. And you must show them these terms so they know their rights. 47 | 48 | We protect your rights with a two-step method: (1) we copyright the 49 | library, and (2) we offer you this license, which gives you legal 50 | permission to copy, distribute and/or modify the library. 51 | 52 | To protect each distributor, we want to make it very clear that 53 | there is no warranty for the free library. Also, if the library is 54 | modified by someone else and passed on, the recipients should know 55 | that what they have is not the original version, so that the original 56 | author's reputation will not be affected by problems that might be 57 | introduced by others. 58 | 59 | Finally, software patents pose a constant threat to the existence of 60 | any free program. 
We wish to make sure that a company cannot 61 | effectively restrict the users of a free program by obtaining a 62 | restrictive license from a patent holder. Therefore, we insist that 63 | any patent license obtained for a version of the library must be 64 | consistent with the full freedom of use specified in this license. 65 | 66 | Most GNU software, including some libraries, is covered by the 67 | ordinary GNU General Public License. This license, the GNU Lesser 68 | General Public License, applies to certain designated libraries, and 69 | is quite different from the ordinary General Public License. We use 70 | this license for certain libraries in order to permit linking those 71 | libraries into non-free programs. 72 | 73 | When a program is linked with a library, whether statically or using 74 | a shared library, the combination of the two is legally speaking a 75 | combined work, a derivative of the original library. The ordinary 76 | General Public License therefore permits such linking only if the 77 | entire combination fits its criteria of freedom. The Lesser General 78 | Public License permits more lax criteria for linking other code with 79 | the library. 80 | 81 | We call this license the "Lesser" General Public License because it 82 | does Less to protect the user's freedom than the ordinary General 83 | Public License. It also provides other free software developers Less 84 | of an advantage over competing non-free programs. These disadvantages 85 | are the reason we use the ordinary General Public License for many 86 | libraries. However, the Lesser license provides advantages in certain 87 | special circumstances. 88 | 89 | For example, on rare occasions, there may be a special need to 90 | encourage the widest possible use of a certain library, so that it becomes 91 | a de-facto standard. To achieve this, non-free programs must be 92 | allowed to use the library. 
A more frequent case is that a free 93 | library does the same job as widely used non-free libraries. In this 94 | case, there is little to gain by limiting the free library to free 95 | software only, so we use the Lesser General Public License. 96 | 97 | In other cases, permission to use a particular library in non-free 98 | programs enables a greater number of people to use a large body of 99 | free software. For example, permission to use the GNU C Library in 100 | non-free programs enables many more people to use the whole GNU 101 | operating system, as well as its variant, the GNU/Linux operating 102 | system. 103 | 104 | Although the Lesser General Public License is Less protective of the 105 | users' freedom, it does ensure that the user of a program that is 106 | linked with the Library has the freedom and the wherewithal to run 107 | that program using a modified version of the Library. 108 | 109 | The precise terms and conditions for copying, distribution and 110 | modification follow. Pay close attention to the difference between a 111 | "work based on the library" and a "work that uses the library". The 112 | former contains code derived from the library, whereas the latter must 113 | be combined with the library in order to run. 114 | 115 | GNU LESSER GENERAL PUBLIC LICENSE 116 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 117 | 118 | 0. This License Agreement applies to any software library or other 119 | program which contains a notice placed by the copyright holder or 120 | other authorized party saying it may be distributed under the terms of 121 | this Lesser General Public License (also called "this License"). 122 | Each licensee is addressed as "you". 123 | 124 | A "library" means a collection of software functions and/or data 125 | prepared so as to be conveniently linked with application programs 126 | (which use some of those functions and data) to form executables. 
127 | 128 | The "Library", below, refers to any such software library or work 129 | which has been distributed under these terms. A "work based on the 130 | Library" means either the Library or any derivative work under 131 | copyright law: that is to say, a work containing the Library or a 132 | portion of it, either verbatim or with modifications and/or translated 133 | straightforwardly into another language. (Hereinafter, translation is 134 | included without limitation in the term "modification".) 135 | 136 | "Source code" for a work means the preferred form of the work for 137 | making modifications to it. For a library, complete source code means 138 | all the source code for all modules it contains, plus any associated 139 | interface definition files, plus the scripts used to control compilation 140 | and installation of the library. 141 | 142 | Activities other than copying, distribution and modification are not 143 | covered by this License; they are outside its scope. The act of 144 | running a program using the Library is not restricted, and output from 145 | such a program is covered only if its contents constitute a work based 146 | on the Library (independent of the use of the Library in a tool for 147 | writing it). Whether that is true depends on what the Library does 148 | and what the program that uses the Library does. 149 | 150 | 1. You may copy and distribute verbatim copies of the Library's 151 | complete source code as you receive it, in any medium, provided that 152 | you conspicuously and appropriately publish on each copy an 153 | appropriate copyright notice and disclaimer of warranty; keep intact 154 | all the notices that refer to this License and to the absence of any 155 | warranty; and distribute a copy of this License along with the 156 | Library. 157 | 158 | You may charge a fee for the physical act of transferring a copy, 159 | and you may at your option offer warranty protection in exchange for a 160 | fee. 161 | 162 | 2. 
You may modify your copy or copies of the Library or any portion 163 | of it, thus forming a work based on the Library, and copy and 164 | distribute such modifications or work under the terms of Section 1 165 | above, provided that you also meet all of these conditions: 166 | 167 | a) The modified work must itself be a software library. 168 | 169 | b) You must cause the files modified to carry prominent notices 170 | stating that you changed the files and the date of any change. 171 | 172 | c) You must cause the whole of the work to be licensed at no 173 | charge to all third parties under the terms of this License. 174 | 175 | d) If a facility in the modified Library refers to a function or a 176 | table of data to be supplied by an application program that uses 177 | the facility, other than as an argument passed when the facility 178 | is invoked, then you must make a good faith effort to ensure that, 179 | in the event an application does not supply such function or 180 | table, the facility still operates, and performs whatever part of 181 | its purpose remains meaningful. 182 | 183 | (For example, a function in a library to compute square roots has 184 | a purpose that is entirely well-defined independent of the 185 | application. Therefore, Subsection 2d requires that any 186 | application-supplied function or table used by this function must 187 | be optional: if the application does not supply it, the square 188 | root function must still compute square roots.) 189 | 190 | These requirements apply to the modified work as a whole. If 191 | identifiable sections of that work are not derived from the Library, 192 | and can be reasonably considered independent and separate works in 193 | themselves, then this License, and its terms, do not apply to those 194 | sections when you distribute them as separate works. 
But when you 195 | distribute the same sections as part of a whole which is a work based 196 | on the Library, the distribution of the whole must be on the terms of 197 | this License, whose permissions for other licensees extend to the 198 | entire whole, and thus to each and every part regardless of who wrote 199 | it. 200 | 201 | Thus, it is not the intent of this section to claim rights or contest 202 | your rights to work written entirely by you; rather, the intent is to 203 | exercise the right to control the distribution of derivative or 204 | collective works based on the Library. 205 | 206 | In addition, mere aggregation of another work not based on the Library 207 | with the Library (or with a work based on the Library) on a volume of 208 | a storage or distribution medium does not bring the other work under 209 | the scope of this License. 210 | 211 | 3. You may opt to apply the terms of the ordinary GNU General Public 212 | License instead of this License to a given copy of the Library. To do 213 | this, you must alter all the notices that refer to this License, so 214 | that they refer to the ordinary GNU General Public License, version 2, 215 | instead of to this License. (If a newer version than version 2 of the 216 | ordinary GNU General Public License has appeared, then you can specify 217 | that version instead if you wish.) Do not make any other change in 218 | these notices. 219 | 220 | Once this change is made in a given copy, it is irreversible for 221 | that copy, so the ordinary GNU General Public License applies to all 222 | subsequent copies and derivative works made from that copy. 223 | 224 | This option is useful when you wish to copy part of the code of 225 | the Library into a program that is not a library. 226 | 227 | 4. 
You may copy and distribute the Library (or a portion or 228 | derivative of it, under Section 2) in object code or executable form 229 | under the terms of Sections 1 and 2 above provided that you accompany 230 | it with the complete corresponding machine-readable source code, which 231 | must be distributed under the terms of Sections 1 and 2 above on a 232 | medium customarily used for software interchange. 233 | 234 | If distribution of object code is made by offering access to copy 235 | from a designated place, then offering equivalent access to copy the 236 | source code from the same place satisfies the requirement to 237 | distribute the source code, even though third parties are not 238 | compelled to copy the source along with the object code. 239 | 240 | 5. A program that contains no derivative of any portion of the 241 | Library, but is designed to work with the Library by being compiled or 242 | linked with it, is called a "work that uses the Library". Such a 243 | work, in isolation, is not a derivative work of the Library, and 244 | therefore falls outside the scope of this License. 245 | 246 | However, linking a "work that uses the Library" with the Library 247 | creates an executable that is a derivative of the Library (because it 248 | contains portions of the Library), rather than a "work that uses the 249 | library". The executable is therefore covered by this License. 250 | Section 6 states terms for distribution of such executables. 251 | 252 | When a "work that uses the Library" uses material from a header file 253 | that is part of the Library, the object code for the work may be a 254 | derivative work of the Library even though the source code is not. 255 | Whether this is true is especially significant if the work can be 256 | linked without the Library, or if the work is itself a library. The 257 | threshold for this to be true is not precisely defined by law. 
258 | 259 | If such an object file uses only numerical parameters, data 260 | structure layouts and accessors, and small macros and small inline 261 | functions (ten lines or less in length), then the use of the object 262 | file is unrestricted, regardless of whether it is legally a derivative 263 | work. (Executables containing this object code plus portions of the 264 | Library will still fall under Section 6.) 265 | 266 | Otherwise, if the work is a derivative of the Library, you may 267 | distribute the object code for the work under the terms of Section 6. 268 | Any executables containing that work also fall under Section 6, 269 | whether or not they are linked directly with the Library itself. 270 | 271 | 6. As an exception to the Sections above, you may also combine or 272 | link a "work that uses the Library" with the Library to produce a 273 | work containing portions of the Library, and distribute that work 274 | under terms of your choice, provided that the terms permit 275 | modification of the work for the customer's own use and reverse 276 | engineering for debugging such modifications. 277 | 278 | You must give prominent notice with each copy of the work that the 279 | Library is used in it and that the Library and its use are covered by 280 | this License. You must supply a copy of this License. If the work 281 | during execution displays copyright notices, you must include the 282 | copyright notice for the Library among them, as well as a reference 283 | directing the user to the copy of this License. 
Also, you must do one 284 | of these things: 285 | 286 | a) Accompany the work with the complete corresponding 287 | machine-readable source code for the Library including whatever 288 | changes were used in the work (which must be distributed under 289 | Sections 1 and 2 above); and, if the work is an executable linked 290 | with the Library, with the complete machine-readable "work that 291 | uses the Library", as object code and/or source code, so that the 292 | user can modify the Library and then relink to produce a modified 293 | executable containing the modified Library. (It is understood 294 | that the user who changes the contents of definitions files in the 295 | Library will not necessarily be able to recompile the application 296 | to use the modified definitions.) 297 | 298 | b) Use a suitable shared library mechanism for linking with the 299 | Library. A suitable mechanism is one that (1) uses at run time a 300 | copy of the library already present on the user's computer system, 301 | rather than copying library functions into the executable, and (2) 302 | will operate properly with a modified version of the library, if 303 | the user installs one, as long as the modified version is 304 | interface-compatible with the version that the work was made with. 305 | 306 | c) Accompany the work with a written offer, valid for at 307 | least three years, to give the same user the materials 308 | specified in Subsection 6a, above, for a charge no more 309 | than the cost of performing this distribution. 310 | 311 | d) If distribution of the work is made by offering access to copy 312 | from a designated place, offer equivalent access to copy the above 313 | specified materials from the same place. 314 | 315 | e) Verify that the user has already received a copy of these 316 | materials or that you have already sent this user a copy. 
317 | 318 | For an executable, the required form of the "work that uses the 319 | Library" must include any data and utility programs needed for 320 | reproducing the executable from it. However, as a special exception, 321 | the materials to be distributed need not include anything that is 322 | normally distributed (in either source or binary form) with the major 323 | components (compiler, kernel, and so on) of the operating system on 324 | which the executable runs, unless that component itself accompanies 325 | the executable. 326 | 327 | It may happen that this requirement contradicts the license 328 | restrictions of other proprietary libraries that do not normally 329 | accompany the operating system. Such a contradiction means you cannot 330 | use both them and the Library together in an executable that you 331 | distribute. 332 | 333 | 7. You may place library facilities that are a work based on the 334 | Library side-by-side in a single library together with other library 335 | facilities not covered by this License, and distribute such a combined 336 | library, provided that the separate distribution of the work based on 337 | the Library and of the other library facilities is otherwise 338 | permitted, and provided that you do these two things: 339 | 340 | a) Accompany the combined library with a copy of the same work 341 | based on the Library, uncombined with any other library 342 | facilities. This must be distributed under the terms of the 343 | Sections above. 344 | 345 | b) Give prominent notice with the combined library of the fact 346 | that part of it is a work based on the Library, and explaining 347 | where to find the accompanying uncombined form of the same work. 348 | 349 | 8. You may not copy, modify, sublicense, link with, or distribute 350 | the Library except as expressly provided under this License. 
Any 351 | attempt otherwise to copy, modify, sublicense, link with, or 352 | distribute the Library is void, and will automatically terminate your 353 | rights under this License. However, parties who have received copies, 354 | or rights, from you under this License will not have their licenses 355 | terminated so long as such parties remain in full compliance. 356 | 357 | 9. You are not required to accept this License, since you have not 358 | signed it. However, nothing else grants you permission to modify or 359 | distribute the Library or its derivative works. These actions are 360 | prohibited by law if you do not accept this License. Therefore, by 361 | modifying or distributing the Library (or any work based on the 362 | Library), you indicate your acceptance of this License to do so, and 363 | all its terms and conditions for copying, distributing or modifying 364 | the Library or works based on it. 365 | 366 | 10. Each time you redistribute the Library (or any work based on the 367 | Library), the recipient automatically receives a license from the 368 | original licensor to copy, distribute, link with or modify the Library 369 | subject to these terms and conditions. You may not impose any further 370 | restrictions on the recipients' exercise of the rights granted herein. 371 | You are not responsible for enforcing compliance by third parties with 372 | this License. 373 | 374 | 11. If, as a consequence of a court judgment or allegation of patent 375 | infringement or for any other reason (not limited to patent issues), 376 | conditions are imposed on you (whether by court order, agreement or 377 | otherwise) that contradict the conditions of this License, they do not 378 | excuse you from the conditions of this License. If you cannot 379 | distribute so as to satisfy simultaneously your obligations under this 380 | License and any other pertinent obligations, then as a consequence you 381 | may not distribute the Library at all. 
For example, if a patent 382 | license would not permit royalty-free redistribution of the Library by 383 | all those who receive copies directly or indirectly through you, then 384 | the only way you could satisfy both it and this License would be to 385 | refrain entirely from distribution of the Library. 386 | 387 | If any portion of this section is held invalid or unenforceable under any 388 | particular circumstance, the balance of the section is intended to apply, 389 | and the section as a whole is intended to apply in other circumstances. 390 | 391 | It is not the purpose of this section to induce you to infringe any 392 | patents or other property right claims or to contest validity of any 393 | such claims; this section has the sole purpose of protecting the 394 | integrity of the free software distribution system which is 395 | implemented by public license practices. Many people have made 396 | generous contributions to the wide range of software distributed 397 | through that system in reliance on consistent application of that 398 | system; it is up to the author/donor to decide if he or she is willing 399 | to distribute software through any other system and a licensee cannot 400 | impose that choice. 401 | 402 | This section is intended to make thoroughly clear what is believed to 403 | be a consequence of the rest of this License. 404 | 405 | 12. If the distribution and/or use of the Library is restricted in 406 | certain countries either by patents or by copyrighted interfaces, the 407 | original copyright holder who places the Library under this License may add 408 | an explicit geographical distribution limitation excluding those countries, 409 | so that distribution is permitted only in or among countries not thus 410 | excluded. In such case, this License incorporates the limitation as if 411 | written in the body of this License. 412 | 413 | 13. 
The Free Software Foundation may publish revised and/or new 414 | versions of the Lesser General Public License from time to time. 415 | Such new versions will be similar in spirit to the present version, 416 | but may differ in detail to address new problems or concerns. 417 | 418 | Each version is given a distinguishing version number. If the Library 419 | specifies a version number of this License which applies to it and 420 | "any later version", you have the option of following the terms and 421 | conditions either of that version or of any later version published by 422 | the Free Software Foundation. If the Library does not specify a 423 | license version number, you may choose any version ever published by 424 | the Free Software Foundation. 425 | 426 | 14. If you wish to incorporate parts of the Library into other free 427 | programs whose distribution conditions are incompatible with these, 428 | write to the author to ask for permission. For software which is 429 | copyrighted by the Free Software Foundation, write to the Free 430 | Software Foundation; we sometimes make exceptions for this. Our 431 | decision will be guided by the two goals of preserving the free status 432 | of all derivatives of our free software and of promoting the sharing 433 | and reuse of software generally. 434 | 435 | NO WARRANTY 436 | 437 | 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO 438 | WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 439 | EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR 440 | OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY 441 | KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE 442 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 443 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE 444 | LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME 445 | THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 
446 | 447 | 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN 448 | WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY 449 | AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU 450 | FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR 451 | CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE 452 | LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING 453 | RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A 454 | FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF 455 | SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH 456 | DAMAGES. 457 | 458 | END OF TERMS AND CONDITIONS 459 | 460 | How to Apply These Terms to Your New Libraries 461 | 462 | If you develop a new library, and you want it to be of the greatest 463 | possible use to the public, we recommend making it free software that 464 | everyone can redistribute and change. You can do so by permitting 465 | redistribution under these terms (or, alternatively, under the terms of the 466 | ordinary General Public License). 467 | 468 | To apply these terms, attach the following notices to the library. It is 469 | safest to attach them to the start of each source file to most effectively 470 | convey the exclusion of warranty; and each file should have at least the 471 | "copyright" line and a pointer to where the full notice is found. 472 | 473 | {description} 474 | Copyright (C) {year} {fullname} 475 | 476 | This library is free software; you can redistribute it and/or 477 | modify it under the terms of the GNU Lesser General Public 478 | License as published by the Free Software Foundation; either 479 | version 2.1 of the License, or (at your option) any later version. 480 | 481 | This library is distributed in the hope that it will be useful, 482 | but WITHOUT ANY WARRANTY; without even the implied warranty of 483 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU 484 | Lesser General Public License for more details. 485 | 486 | You should have received a copy of the GNU Lesser General Public 487 | License along with this library; if not, write to the Free Software 488 | Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 489 | USA 490 | 491 | Also add information on how to contact you by electronic and paper mail. 492 | 493 | You should also get your employer (if you work as a programmer) or your 494 | school, if any, to sign a "copyright disclaimer" for the library, if 495 | necessary. Here is a sample; alter the names: 496 | 497 | Yoyodyne, Inc., hereby disclaims all copyright interest in the 498 | library `Frob' (a library for tweaking knobs) written by James Random 499 | Hacker. 500 | 501 | {signature of Ty Coon}, 1 April 1990 502 | Ty Coon, President of Vice 503 | 504 | That's all there is to it! 505 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Fastswap 2 | This repository contains the source code for fastswap (drivers), the far 3 | memory server (farmemserver) and the Linux kernel patches that fastswap needs 4 | (kernel). The kernel patch must be applied on top of Linux kernel 4.11.0. 5 | 6 | We assume Ubuntu 16.04 is being used, but other distributions might work as well. 7 | Further, we assume this directory is cloned in your home directory as 'fastswap'. 8 | 9 | We only support Mellanox NICs. The best supported NIC is ConnectX-3. However, 10 | if you find issues on ConnectX-4 or ConnectX-5, please let us know and we will 11 | try to address the issues. 12 | 13 | ## Compiling and installing the fastswap kernel (client node) 14 | 15 | First you need a copy of the source for kernel 4.11 with SHA 16 | a351e9b9fc24e982ec2f0e76379a49826036da12. We outline the high-level steps here. 
17 | 18 | cd ~ 19 | wget https://github.com/torvalds/linux/archive/a351e9b9fc24e982ec2f0e76379a49826036da12.zip 20 | mv a351e9b9fc24e982ec2f0e76379a49826036da12.zip linux-4.11.zip 21 | unzip linux-4.11.zip 22 | cd linux-4.11 23 | git init . 24 | git add . 25 | git commit -m "first commit" 26 | 27 | Now use the provided patch and apply it against your copy of linux-4.11, and use 28 | the generic Ubuntu config file for kernel 4.11. You can get the config file 29 | from the internet, or you can use the one we provide. 30 | 31 | git apply ~/fastswap/kernel/kernel.patch 32 | cp ~/fastswap/kernel/config-4.11.0-041100-generic ~/linux-4.11/.config 33 | 34 | Make sure you have the necessary prerequisites to compile the kernel, and compile 35 | it: 36 | 37 | sudo apt-get install git build-essential kernel-package fakeroot libncurses5-dev libssl-dev ccache bison flex 38 | make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-fastswap 39 | 40 | Once it's done, your deb packages should be one directory above; you can simply 41 | install them all: 42 | 43 | cd .. 44 | sudo dpkg -i *.deb 45 | 46 | ## OFED setup (client node and far memory node) 47 | 48 | We only support Mellanox OFED drivers, and we recommend version 4.1, 4.2 or 49 | 4.3. You must have an OFED installation working before proceeding. A good test 50 | before continuing is being able to run both ib\_read\_lat and ib\_write\_lat 51 | between your far memory node and your client. We assume the far memory node has 52 | IP $farmemip and the client has IP $clientip. Note these IPs must be bound to 53 | RDMA NICs. 54 | 55 | ## Far memory server (far memory node) 56 | 57 | To build and run the far memory server, do: 58 | 59 | cd farmemserver 60 | make 61 | ./rmserver 50000 62 | 63 | You should see a message saying "listening on port 50000". 64 | 65 | ## Swap device configuration (client node) 66 | 67 | In order to ``activate`` the swap system in Linux, you must have a swap device 68 | pre-registered. 
This swap device won't receive any data traffic; we only need 69 | it to activate the swap code paths in the kernel. Further, the amount of far memory 70 | the client will be able to access is determined by the swap device size. So if 71 | you want 32GB (the default value) of far memory available, your swap device *must* 72 | be 32GB or smaller. When you type ``free`` in the terminal, you should see 73 | that Swap has 32GB of space available in the total column. 74 | 75 | ## Fastswap driver (client node) 76 | 77 | To build the drivers: 78 | 79 | cd drivers 80 | make BACKEND=RDMA 81 | 82 | If you have a custom installation path for the Mellanox OFED drivers, modify 83 | the OFA\_DIR variable in the Makefile accordingly. The compilation will only 84 | succeed if you booted on a fastswap kernel. 85 | 86 | Now we will load the fastswap drivers. 87 | 88 | sudo insmod fastswap_rdma.ko sport=50000 sip="$farmemip" cip="$clientip" nq=8 89 | sudo insmod fastswap.ko 90 | 91 | sport is the port where the far memory server is running, sip is the far memory 92 | node IP, cip is this node's IP (the client), and nq must be set to the number of CPUs 93 | available in the system. If you type dmesg and see "ctrl is ready for reqs", 94 | then the connection was successful! 95 | 96 | A good next step would be to try out our CFM framework: https://github.com/clusterfarmem/cfm 97 | 98 | ## DRAM backend 99 | 100 | You can use the DRAM backend for experimentation. Compile and load as follows: 101 | 102 | cd drivers 103 | make BACKEND=DRAM 104 | sudo insmod fastswap_dram.ko 105 | sudo insmod fastswap.ko 106 | 107 | You still need to have a swap device enabled, but data won't flow there. By default, 108 | the DRAM backend will allocate 32GB of memory.
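As a sketch of the swap device setup described above, a plain swap file works; the path `/swapfile` and the 32G size below are illustrative assumptions (sized to match the default far memory amount), and the commands require root:

```shell
# Illustrative sketch only: /swapfile and 32G are assumptions, not part of fastswap.
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h   # the Swap row should now report 32G in the total column
```

Remember that fastswap intercepts pages through frontswap, so no data actually lands in this file; it only activates the kernel's swap code paths.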
109 | 110 | ## Further reading 111 | For more information, please refer to our [paper](https://dl.acm.org/doi/abs/10.1145/3342195.3387522) accepted at [EUROSYS 2020](https://www.eurosys2020.org/) 112 | 113 | ## Questions 114 | For additional questions please contact us at amaro a@t berkeley.edu 115 | -------------------------------------------------------------------------------- /drivers/.gitignore: -------------------------------------------------------------------------------- 1 | cscope.* 2 | -------------------------------------------------------------------------------- /drivers/Makefile: -------------------------------------------------------------------------------- 1 | OFA_DIR ?= /usr/src/ofa_kernel/default 2 | OFA_INCLUDE := $(OFA_DIR)/include 3 | OFA_SYMVERS := $(OFA_DIR)/Module.symvers 4 | 5 | ifneq ($(LINUXINCLUDE),) 6 | # kbuild part of makefile 7 | 8 | LINUXINCLUDE := \ 9 | -I$(OFA_INCLUDE) \ 10 | ${LINUXINCLUDE} 11 | 12 | else 13 | # normal makefile 14 | 15 | export KBUILD_EXTRA_SYMBOLS=$(OFA_SYMVERS) 16 | KDIR ?= /lib/modules/`uname -r`/build 17 | 18 | default: 19 | $(MAKE) -C $(KDIR) M=$$PWD 20 | 21 | clean: 22 | $(MAKE) -C $(KDIR) M=$$PWD clean 23 | endif 24 | 25 | obj-m := fastswap.o 26 | ifeq ($(BACKEND),RDMA) 27 | obj-m += fastswap_rdma.o 28 | CFLAGS_fastswap.o=-DBACKEND=2 29 | else 30 | obj-m += fastswap_dram.o 31 | CFLAGS_fastswap.o=-DBACKEND=1 32 | endif 33 | -------------------------------------------------------------------------------- /drivers/fastswap.c: -------------------------------------------------------------------------------- 1 | #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 2 | 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | 13 | #define B_DRAM 1 14 | #define B_RDMA 2 15 | 16 | #ifndef BACKEND 17 | #error "Need to define BACKEND flag" 18 | #endif 19 | 20 | #if BACKEND == B_DRAM 21 | #define DRAM 22 | #include "fastswap_dram.h" 23 | #elif BACKEND == 
B_RDMA 24 | #define RDMA 25 | #include "fastswap_rdma.h" 26 | #else 27 | #error "BACKEND can only be 1 (DRAM) or 2 (RDMA)" 28 | #endif 29 | 30 | static int sswap_store(unsigned type, pgoff_t pageid, 31 | struct page *page) 32 | { 33 | if (sswap_rdma_write(page, pageid << PAGE_SHIFT)) { 34 | pr_err("could not store page remotely\n"); 35 | return -1; 36 | } 37 | 38 | return 0; 39 | } 40 | 41 | /* 42 | * return 0 if page is returned 43 | * return -1 otherwise 44 | */ 45 | static int sswap_load_async(unsigned type, pgoff_t pageid, struct page *page) 46 | { 47 | if (unlikely(sswap_rdma_read_async(page, pageid << PAGE_SHIFT))) { 48 | pr_err("could not read page remotely\n"); 49 | return -1; 50 | } 51 | 52 | return 0; 53 | } 54 | 55 | static int sswap_load(unsigned type, pgoff_t pageid, struct page *page) 56 | { 57 | if (unlikely(sswap_rdma_read_sync(page, pageid << PAGE_SHIFT))) { 58 | pr_err("could not read page remotely\n"); 59 | return -1; 60 | } 61 | 62 | return 0; 63 | } 64 | 65 | static int sswap_poll_load(int cpu) 66 | { 67 | return sswap_rdma_poll_load(cpu); 68 | } 69 | 70 | static void sswap_invalidate_page(unsigned type, pgoff_t offset) 71 | { 72 | return; 73 | } 74 | 75 | static void sswap_invalidate_area(unsigned type) 76 | { 77 | pr_err("sswap_invalidate_area\n"); 78 | } 79 | 80 | static void sswap_init(unsigned type) 81 | { 82 | pr_info("sswap_init end\n"); 83 | } 84 | 85 | static struct frontswap_ops sswap_frontswap_ops = { 86 | .init = sswap_init, 87 | .store = sswap_store, 88 | .load = sswap_load, 89 | .poll_load = sswap_poll_load, 90 | .load_async = sswap_load_async, 91 | .invalidate_page = sswap_invalidate_page, 92 | .invalidate_area = sswap_invalidate_area, 93 | 94 | }; 95 | 96 | static int __init sswap_init_debugfs(void) 97 | { 98 | return 0; 99 | } 100 | 101 | static int __init init_sswap(void) 102 | { 103 | frontswap_register_ops(&sswap_frontswap_ops); 104 | if (sswap_init_debugfs()) 105 | pr_err("sswap debugfs failed\n"); 106 | 107 | 
pr_info("sswap module loaded\n"); 108 | return 0; 109 | } 110 | 111 | static void __exit exit_sswap(void) 112 | { 113 | pr_info("unloading sswap\n"); 114 | } 115 | 116 | module_init(init_sswap); 117 | module_exit(exit_sswap); 118 | 119 | MODULE_LICENSE("GPL v2"); 120 | MODULE_DESCRIPTION("Fastswap driver"); 121 | -------------------------------------------------------------------------------- /drivers/fastswap_dram.c: -------------------------------------------------------------------------------- 1 | #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 2 | 3 | #include 4 | #include 5 | #include "fastswap_dram.h" 6 | 7 | #define ONEGB (1024UL*1024*1024) 8 | #define REMOTE_BUF_SIZE (ONEGB * 32) /* must match what server is allocating */ 9 | 10 | static void *drambuf; 11 | 12 | int sswap_rdma_write(struct page *page, u64 roffset) 13 | { 14 | void *page_vaddr; 15 | 16 | page_vaddr = kmap_atomic(page); 17 | copy_page((void *) (drambuf + roffset), page_vaddr); 18 | kunmap_atomic(page_vaddr); 19 | return 0; 20 | } 21 | EXPORT_SYMBOL(sswap_rdma_write); 22 | 23 | int sswap_rdma_poll_load(int cpu) 24 | { 25 | return 0; 26 | } 27 | EXPORT_SYMBOL(sswap_rdma_poll_load); 28 | 29 | int sswap_rdma_read_async(struct page *page, u64 roffset) 30 | { 31 | void *page_vaddr; 32 | 33 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); 34 | VM_BUG_ON_PAGE(!PageLocked(page), page); 35 | VM_BUG_ON_PAGE(PageUptodate(page), page); 36 | 37 | page_vaddr = kmap_atomic(page); 38 | copy_page(page_vaddr, (void *) (drambuf + roffset)); 39 | kunmap_atomic(page_vaddr); 40 | 41 | SetPageUptodate(page); 42 | unlock_page(page); 43 | return 0; 44 | } 45 | EXPORT_SYMBOL(sswap_rdma_read_async); 46 | 47 | int sswap_rdma_read_sync(struct page *page, u64 roffset) 48 | { 49 | return sswap_rdma_read_async(page, roffset); 50 | } 51 | EXPORT_SYMBOL(sswap_rdma_read_sync); 52 | 53 | int sswap_rdma_drain_loads_sync(int cpu, int target) 54 | { 55 | return 1; 56 | } 57 | EXPORT_SYMBOL(sswap_rdma_drain_loads_sync); 58 | 59 | static 
void __exit sswap_dram_cleanup_module(void) 60 | { 61 | vfree(drambuf); 62 | } 63 | 64 | static int __init sswap_dram_init_module(void) 65 | { 66 | pr_info("start: %s\n", __FUNCTION__); 67 | pr_info("will use DRAM backend\n"); 68 | 69 | drambuf = vzalloc(REMOTE_BUF_SIZE); 70 | pr_info("vzalloc'ed %lu bytes for dram backend\n", REMOTE_BUF_SIZE); 71 | 72 | pr_info("DRAM backend is ready for reqs\n"); 73 | return 0; 74 | } 75 | 76 | module_init(sswap_dram_init_module); 77 | module_exit(sswap_dram_cleanup_module); 78 | 79 | MODULE_LICENSE("GPL v2"); 80 | MODULE_DESCRIPTION("DRAM backend"); 81 | -------------------------------------------------------------------------------- /drivers/fastswap_dram.h: -------------------------------------------------------------------------------- 1 | #if !defined(_SSWAP_DRAM_H) 2 | #define _SSWAP_DRAM_H 3 | 4 | #include 5 | #include 6 | 7 | 8 | int sswap_rdma_read_async(struct page *page, u64 roffset); 9 | int sswap_rdma_read_sync(struct page *page, u64 roffset); 10 | int sswap_rdma_write(struct page *page, u64 roffset); 11 | int sswap_rdma_poll_load(int cpu); 12 | int sswap_rdma_drain_loads_sync(int cpu, int target); 13 | 14 | #endif 15 | -------------------------------------------------------------------------------- /drivers/fastswap_rdma.c: -------------------------------------------------------------------------------- 1 | #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt 2 | 3 | #include "fastswap_rdma.h" 4 | #include 5 | #include 6 | 7 | static struct sswap_rdma_ctrl *gctrl; 8 | static int serverport; 9 | static int numqueues; 10 | static int numcpus; 11 | static char serverip[INET_ADDRSTRLEN]; 12 | static char clientip[INET_ADDRSTRLEN]; 13 | static struct kmem_cache *req_cache; 14 | module_param_named(sport, serverport, int, 0644); 15 | module_param_named(nq, numqueues, int, 0644); 16 | module_param_string(sip, serverip, INET_ADDRSTRLEN, 0644); 17 | module_param_string(cip, clientip, INET_ADDRSTRLEN, 0644); 18 | 19 | // TODO:
destroy ctrl 20 | 21 | #define CONNECTION_TIMEOUT_MS 60000 22 | #define QP_QUEUE_DEPTH 256 23 | /* we don't really use recv wrs, so any small number should do */ 24 | #define QP_MAX_RECV_WR 4 25 | /* we mainly do send wrs */ 26 | #define QP_MAX_SEND_WR (4096) 27 | #define CQ_NUM_CQES (QP_MAX_SEND_WR) 28 | #define POLL_BATCH_HIGH (QP_MAX_SEND_WR / 4) 29 | 30 | static void sswap_rdma_addone(struct ib_device *dev) 31 | { 32 | pr_info("sswap_rdma_addone() = %s\n", dev->name); 33 | } 34 | 35 | static void sswap_rdma_removeone(struct ib_device *ib_device, void *client_data) 36 | { 37 | pr_info("sswap_rdma_removeone()\n"); 38 | } 39 | 40 | static struct ib_client sswap_rdma_ib_client = { 41 | .name = "sswap_rdma", 42 | .add = sswap_rdma_addone, 43 | .remove = sswap_rdma_removeone 44 | }; 45 | 46 | static struct sswap_rdma_dev *sswap_rdma_get_device(struct rdma_queue *q) 47 | { 48 | struct sswap_rdma_dev *rdev = NULL; 49 | 50 | if (!q->ctrl->rdev) { 51 | rdev = kzalloc(sizeof(*rdev), GFP_KERNEL); 52 | if (!rdev) { 53 | pr_err("no memory\n"); 54 | goto out_err; 55 | } 56 | 57 | rdev->dev = q->cm_id->device; 58 | 59 | pr_info("selecting device %s\n", rdev->dev->name); 60 | 61 | rdev->pd = ib_alloc_pd(rdev->dev, 0); 62 | if (IS_ERR(rdev->pd)) { 63 | pr_err("ib_alloc_pd\n"); 64 | goto out_free_dev; 65 | } 66 | 67 | if (!(rdev->dev->attrs.device_cap_flags & 68 | IB_DEVICE_MEM_MGT_EXTENSIONS)) { 69 | pr_err("memory registrations not supported\n"); 70 | goto out_free_pd; 71 | } 72 | 73 | q->ctrl->rdev = rdev; 74 | } 75 | 76 | return q->ctrl->rdev; 77 | 78 | out_free_pd: 79 | ib_dealloc_pd(rdev->pd); 80 | out_free_dev: 81 | kfree(rdev); 82 | out_err: 83 | return NULL; 84 | } 85 | 86 | static void sswap_rdma_qp_event(struct ib_event *e, void *c) 87 | { 88 | pr_info("sswap_rdma_qp_event\n"); 89 | } 90 | 91 | static int sswap_rdma_create_qp(struct rdma_queue *queue) 92 | { 93 | struct sswap_rdma_dev *rdev = queue->ctrl->rdev; 94 | struct ib_qp_init_attr init_attr; 95 | int ret; 96 | 
97 | pr_info("start: %s\n", __FUNCTION__); 98 | 99 | memset(&init_attr, 0, sizeof(init_attr)); 100 | init_attr.event_handler = sswap_rdma_qp_event; 101 | init_attr.cap.max_send_wr = QP_MAX_SEND_WR; 102 | init_attr.cap.max_recv_wr = QP_MAX_RECV_WR; 103 | init_attr.cap.max_recv_sge = 1; 104 | init_attr.cap.max_send_sge = 1; 105 | init_attr.sq_sig_type = IB_SIGNAL_REQ_WR; 106 | init_attr.qp_type = IB_QPT_RC; 107 | init_attr.send_cq = queue->cq; 108 | init_attr.recv_cq = queue->cq; 109 | /* just to check if we are compiling against the right headers */ 110 | init_attr.create_flags = IB_QP_EXP_CREATE_ATOMIC_BE_REPLY & 0; 111 | 112 | ret = rdma_create_qp(queue->cm_id, rdev->pd, &init_attr); 113 | if (ret) { 114 | pr_err("rdma_create_qp failed: %d\n", ret); 115 | return ret; 116 | } 117 | 118 | queue->qp = queue->cm_id->qp; 119 | return ret; 120 | } 121 | 122 | static void sswap_rdma_destroy_queue_ib(struct rdma_queue *q) 123 | { 124 | struct sswap_rdma_dev *rdev; 125 | struct ib_device *ibdev; 126 | 127 | pr_info("start: %s\n", __FUNCTION__); 128 | 129 | rdev = q->ctrl->rdev; 130 | ibdev = rdev->dev; 131 | //rdma_destroy_qp(q->ctrl->cm_id); 132 | ib_free_cq(q->cq); 133 | } 134 | 135 | static int sswap_rdma_create_queue_ib(struct rdma_queue *q) 136 | { 137 | struct ib_device *ibdev = q->ctrl->rdev->dev; 138 | int ret; 139 | int comp_vector = 0; 140 | 141 | pr_info("start: %s\n", __FUNCTION__); 142 | 143 | if (q->qp_type == QP_READ_ASYNC) 144 | q->cq = ib_alloc_cq(ibdev, q, CQ_NUM_CQES, 145 | comp_vector, IB_POLL_SOFTIRQ); 146 | else 147 | q->cq = ib_alloc_cq(ibdev, q, CQ_NUM_CQES, 148 | comp_vector, IB_POLL_DIRECT); 149 | 150 | if (IS_ERR(q->cq)) { 151 | ret = PTR_ERR(q->cq); 152 | goto out_err; 153 | } 154 | 155 | ret = sswap_rdma_create_qp(q); 156 | if (ret) 157 | goto out_destroy_ib_cq; 158 | 159 | return 0; 160 | 161 | out_destroy_ib_cq: 162 | ib_free_cq(q->cq); 163 | out_err: 164 | return ret; 165 | } 166 | 167 | static int sswap_rdma_addr_resolved(struct rdma_queue 
*q) 168 | { 169 | struct sswap_rdma_dev *rdev = NULL; 170 | int ret; 171 | 172 | pr_info("start: %s\n", __FUNCTION__); 173 | 174 | rdev = sswap_rdma_get_device(q); 175 | if (!rdev) { 176 | pr_err("no device found\n"); 177 | return -ENODEV; 178 | } 179 | 180 | ret = sswap_rdma_create_queue_ib(q); 181 | if (ret) { 182 | return ret; 183 | } 184 | 185 | ret = rdma_resolve_route(q->cm_id, CONNECTION_TIMEOUT_MS); 186 | if (ret) { 187 | pr_err("rdma_resolve_route failed\n"); 188 | sswap_rdma_destroy_queue_ib(q); 189 | } 190 | 191 | return ret; 192 | } 193 | 194 | static int sswap_rdma_route_resolved(struct rdma_queue *q, 195 | struct rdma_conn_param *conn_params) 196 | { 197 | struct rdma_conn_param param = {}; 198 | int ret; 199 | 200 | param.qp_num = q->qp->qp_num; 201 | param.flow_control = 1; 202 | param.responder_resources = 16; 203 | param.initiator_depth = 16; 204 | param.retry_count = 7; 205 | param.rnr_retry_count = 7; 206 | param.private_data = NULL; 207 | param.private_data_len = 0; 208 | 209 | pr_info("max_qp_rd_atom=%d max_qp_init_rd_atom=%d\n", 210 | q->ctrl->rdev->dev->attrs.max_qp_rd_atom, 211 | q->ctrl->rdev->dev->attrs.max_qp_init_rd_atom); 212 | 213 | ret = rdma_connect(q->cm_id, &param); 214 | if (ret) { 215 | pr_err("rdma_connect failed (%d)\n", ret); 216 | sswap_rdma_destroy_queue_ib(q); 217 | } 218 | 219 | return ret; 220 | } 221 | 222 | static int sswap_rdma_conn_established(struct rdma_queue *q) 223 | { 224 | pr_info("connection established\n"); 225 | return 0; 226 | } 227 | 228 | static int sswap_rdma_cm_handler(struct rdma_cm_id *cm_id, 229 | struct rdma_cm_event *ev) 230 | { 231 | struct rdma_queue *queue = cm_id->context; 232 | int cm_error = 0; 233 | 234 | pr_info("cm_handler msg: %s (%d) status %d id %p\n", rdma_event_msg(ev->event), 235 | ev->event, ev->status, cm_id); 236 | 237 | switch (ev->event) { 238 | case RDMA_CM_EVENT_ADDR_RESOLVED: 239 | cm_error = sswap_rdma_addr_resolved(queue); 240 | break; 241 | case RDMA_CM_EVENT_ROUTE_RESOLVED: 242 |
cm_error = sswap_rdma_route_resolved(queue, &ev->param.conn); 243 | break; 244 | case RDMA_CM_EVENT_ESTABLISHED: 245 | queue->cm_error = sswap_rdma_conn_established(queue); 246 | /* complete cm_done regardless of success/failure */ 247 | complete(&queue->cm_done); 248 | return 0; 249 | case RDMA_CM_EVENT_REJECTED: 250 | pr_err("connection rejected\n"); 251 | break; 252 | case RDMA_CM_EVENT_ADDR_ERROR: 253 | case RDMA_CM_EVENT_ROUTE_ERROR: 254 | case RDMA_CM_EVENT_CONNECT_ERROR: 255 | case RDMA_CM_EVENT_UNREACHABLE: 256 | pr_err("CM error event %d\n", ev->event); 257 | cm_error = -ECONNRESET; 258 | break; 259 | case RDMA_CM_EVENT_DISCONNECTED: 260 | case RDMA_CM_EVENT_ADDR_CHANGE: 261 | case RDMA_CM_EVENT_TIMEWAIT_EXIT: 262 | pr_err("CM connection closed %d\n", ev->event); 263 | break; 264 | case RDMA_CM_EVENT_DEVICE_REMOVAL: 265 | /* device removal is handled via the ib_client API */ 266 | break; 267 | default: 268 | pr_err("CM unexpected event: %d\n", ev->event); 269 | break; 270 | } 271 | 272 | if (cm_error) { 273 | queue->cm_error = cm_error; 274 | complete(&queue->cm_done); 275 | } 276 | 277 | return 0; 278 | } 279 | 280 | inline static int sswap_rdma_wait_for_cm(struct rdma_queue *queue) 281 | { 282 | wait_for_completion_interruptible_timeout(&queue->cm_done, 283 | msecs_to_jiffies(CONNECTION_TIMEOUT_MS) + 1); 284 | return queue->cm_error; 285 | } 286 | 287 | static int sswap_rdma_init_queue(struct sswap_rdma_ctrl *ctrl, 288 | int idx) 289 | { 290 | struct rdma_queue *queue; 291 | int ret; 292 | 293 | pr_info("start: %s\n", __FUNCTION__); 294 | 295 | queue = &ctrl->queues[idx]; 296 | queue->ctrl = ctrl; 297 | init_completion(&queue->cm_done); 298 | atomic_set(&queue->pending, 0); 299 | spin_lock_init(&queue->cq_lock); 300 | queue->qp_type = get_queue_type(idx); 301 | 302 | queue->cm_id = rdma_create_id(&init_net, sswap_rdma_cm_handler, queue, 303 | RDMA_PS_TCP, IB_QPT_RC); 304 | if (IS_ERR(queue->cm_id)) { 305 | pr_err("failed to create cm id: %ld\n", 
PTR_ERR(queue->cm_id)); 306 | return -ENODEV; 307 | } 308 | 309 | queue->cm_error = -ETIMEDOUT; 310 | 311 | ret = rdma_resolve_addr(queue->cm_id, &ctrl->srcaddr, &ctrl->addr, 312 | CONNECTION_TIMEOUT_MS); 313 | if (ret) { 314 | pr_err("rdma_resolve_addr failed: %d\n", ret); 315 | goto out_destroy_cm_id; 316 | } 317 | 318 | ret = sswap_rdma_wait_for_cm(queue); 319 | if (ret) { 320 | pr_err("sswap_rdma_wait_for_cm failed\n"); 321 | goto out_destroy_cm_id; 322 | } 323 | 324 | return 0; 325 | 326 | out_destroy_cm_id: 327 | rdma_destroy_id(queue->cm_id); 328 | return ret; 329 | } 330 | 331 | static void sswap_rdma_stop_queue(struct rdma_queue *q) 332 | { 333 | rdma_disconnect(q->cm_id); 334 | } 335 | 336 | static void sswap_rdma_free_queue(struct rdma_queue *q) 337 | { 338 | rdma_destroy_qp(q->cm_id); 339 | ib_free_cq(q->cq); 340 | rdma_destroy_id(q->cm_id); 341 | } 342 | 343 | static int sswap_rdma_init_queues(struct sswap_rdma_ctrl *ctrl) 344 | { 345 | int ret, i; 346 | for (i = 0; i < numqueues; ++i) { 347 | ret = sswap_rdma_init_queue(ctrl, i); 348 | if (ret) { 349 | pr_err("failed to initialize queue: %d\n", i); 350 | goto out_free_queues; 351 | } 352 | } 353 | 354 | return 0; 355 | 356 | out_free_queues: 357 | for (i--; i >= 0; i--) { 358 | sswap_rdma_stop_queue(&ctrl->queues[i]); 359 | sswap_rdma_free_queue(&ctrl->queues[i]); 360 | } 361 | 362 | return ret; 363 | } 364 | 365 | static void sswap_rdma_stopandfree_queues(struct sswap_rdma_ctrl *ctrl) 366 | { 367 | int i; 368 | for (i = 0; i < numqueues; ++i) { 369 | sswap_rdma_stop_queue(&ctrl->queues[i]); 370 | sswap_rdma_free_queue(&ctrl->queues[i]); 371 | } 372 | } 373 | 374 | static int sswap_rdma_parse_ipaddr(struct sockaddr_in *saddr, char *ip) 375 | { 376 | u8 *addr = (u8 *)&saddr->sin_addr.s_addr; 377 | size_t buflen = strlen(ip); 378 | 379 | pr_info("start: %s\n", __FUNCTION__); 380 | 381 | if (buflen > INET_ADDRSTRLEN) 382 | return -EINVAL; 383 | if (in4_pton(ip, buflen, addr, '\0', NULL) == 0) 384 |
return -EINVAL; 385 | saddr->sin_family = AF_INET; 386 | return 0; 387 | } 388 | 389 | static int sswap_rdma_create_ctrl(struct sswap_rdma_ctrl **c) 390 | { 391 | int ret; 392 | struct sswap_rdma_ctrl *ctrl; 393 | pr_info("will try to connect to %s:%d\n", serverip, serverport); 394 | 395 | *c = kzalloc(sizeof(struct sswap_rdma_ctrl), GFP_KERNEL); 396 | if (!*c) { 397 | pr_err("no mem for ctrl\n"); 398 | return -ENOMEM; 399 | } 400 | ctrl = *c; 401 | 402 | ctrl->queues = kzalloc(sizeof(struct rdma_queue) * numqueues, GFP_KERNEL); 403 | ret = sswap_rdma_parse_ipaddr(&(ctrl->addr_in), serverip); 404 | if (ret) { 405 | pr_err("sswap_rdma_parse_ipaddr failed: %d\n", ret); 406 | return -EINVAL; 407 | } 408 | ctrl->addr_in.sin_port = cpu_to_be16(serverport); 409 | 410 | ret = sswap_rdma_parse_ipaddr(&(ctrl->srcaddr_in), clientip); 411 | if (ret) { 412 | pr_err("sswap_rdma_parse_ipaddr failed: %d\n", ret); 413 | return -EINVAL; 414 | } 415 | /* no need to set the port on the srcaddr */ 416 | 417 | return sswap_rdma_init_queues(ctrl); 418 | } 419 | 420 | static void __exit sswap_rdma_cleanup_module(void) 421 | { 422 | sswap_rdma_stopandfree_queues(gctrl); 423 | ib_unregister_client(&sswap_rdma_ib_client); 424 | kfree(gctrl); 425 | gctrl = NULL; 426 | if (req_cache) { 427 | kmem_cache_destroy(req_cache); 428 | } 429 | } 430 | 431 | static void sswap_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc) 432 | { 433 | struct rdma_req *req = 434 | container_of(wc->wr_cqe, struct rdma_req, cqe); 435 | struct rdma_queue *q = cq->cq_context; 436 | struct ib_device *ibdev = q->ctrl->rdev->dev; 437 | 438 | if (unlikely(wc->status != IB_WC_SUCCESS)) { 439 | pr_err("sswap_rdma_write_done status is not success, it is=%d\n", wc->status); 440 | //q->write_error = wc->status; 441 | } 442 | ib_dma_unmap_page(ibdev, req->dma, PAGE_SIZE, DMA_TO_DEVICE); 443 | 444 | atomic_dec(&q->pending); 445 | kmem_cache_free(req_cache, req); 446 | } 447 | 448 | static void sswap_rdma_read_done(struct ib_cq 
*cq, struct ib_wc *wc) 449 | { 450 | struct rdma_req *req = 451 | container_of(wc->wr_cqe, struct rdma_req, cqe); 452 | struct rdma_queue *q = cq->cq_context; 453 | struct ib_device *ibdev = q->ctrl->rdev->dev; 454 | 455 | if (unlikely(wc->status != IB_WC_SUCCESS)) 456 | pr_err("sswap_rdma_read_done status is not success, it is=%d\n", wc->status); 457 | 458 | ib_dma_unmap_page(ibdev, req->dma, PAGE_SIZE, DMA_FROM_DEVICE); 459 | 460 | SetPageUptodate(req->page); 461 | unlock_page(req->page); 462 | complete(&req->done); 463 | atomic_dec(&q->pending); 464 | kmem_cache_free(req_cache, req); 465 | } 466 | 467 | inline static int sswap_rdma_post_rdma(struct rdma_queue *q, struct rdma_req *qe, 468 | struct ib_sge *sge, u64 roffset, enum ib_wr_opcode op) 469 | { 470 | struct ib_send_wr *bad_wr; 471 | struct ib_rdma_wr rdma_wr = {}; 472 | int ret; 473 | 474 | BUG_ON(qe->dma == 0); 475 | 476 | sge->addr = qe->dma; 477 | sge->length = PAGE_SIZE; 478 | sge->lkey = q->ctrl->rdev->pd->local_dma_lkey; 479 | 480 | /* TODO: add a chain of WR, we already have a list so should be easy 481 | * to just post requests in batches */ 482 | rdma_wr.wr.next = NULL; 483 | rdma_wr.wr.wr_cqe = &qe->cqe; 484 | rdma_wr.wr.sg_list = sge; 485 | rdma_wr.wr.num_sge = 1; 486 | rdma_wr.wr.opcode = op; 487 | rdma_wr.wr.send_flags = IB_SEND_SIGNALED; 488 | rdma_wr.remote_addr = q->ctrl->servermr.baseaddr + roffset; 489 | rdma_wr.rkey = q->ctrl->servermr.key; 490 | 491 | atomic_inc(&q->pending); 492 | ret = ib_post_send(q->qp, &rdma_wr.wr, &bad_wr); 493 | if (unlikely(ret)) { 494 | pr_err("ib_post_send failed: %d\n", ret); 495 | } 496 | 497 | return ret; 498 | } 499 | 500 | static void sswap_rdma_recv_remotemr_done(struct ib_cq *cq, struct ib_wc *wc) 501 | { 502 | struct rdma_req *qe = 503 | container_of(wc->wr_cqe, struct rdma_req, cqe); 504 | struct rdma_queue *q = cq->cq_context; 505 | struct sswap_rdma_ctrl *ctrl = q->ctrl; 506 | struct ib_device *ibdev = q->ctrl->rdev->dev; 507 | 508 | if 
(unlikely(wc->status != IB_WC_SUCCESS)) { 509 | pr_err("sswap_rdma_recv_done status is not success\n"); 510 | return; 511 | } 512 | ib_dma_unmap_single(ibdev, qe->dma, sizeof(struct sswap_rdma_memregion), 513 | DMA_FROM_DEVICE); 514 | pr_info("servermr baseaddr=%llx, key=%u\n", ctrl->servermr.baseaddr, 515 | ctrl->servermr.key); 516 | complete_all(&qe->done); 517 | } 518 | 519 | static int sswap_rdma_post_recv(struct rdma_queue *q, struct rdma_req *qe, 520 | size_t bufsize) 521 | { 522 | struct ib_recv_wr *bad_wr; 523 | struct ib_recv_wr wr = {}; 524 | struct ib_sge sge; 525 | int ret; 526 | 527 | sge.addr = qe->dma; 528 | sge.length = bufsize; 529 | sge.lkey = q->ctrl->rdev->pd->local_dma_lkey; 530 | 531 | wr.next = NULL; 532 | wr.wr_cqe = &qe->cqe; 533 | wr.sg_list = &sge; 534 | wr.num_sge = 1; 535 | 536 | ret = ib_post_recv(q->qp, &wr, &bad_wr); 537 | if (ret) { 538 | pr_err("ib_post_recv failed: %d\n", ret); 539 | } 540 | return ret; 541 | } 542 | 543 | /* allocates a sswap rdma request, creates a dma mapping for it in 544 | * req->dma, and synchronizes the dma mapping in the direction of 545 | * the dma map. 546 | * Don't touch the page with cpu after creating the request for it! 
547 | * Deallocates the request if there was an error */ 548 | inline static int get_req_for_page(struct rdma_req **req, struct ib_device *dev, 549 | struct page *page, enum dma_data_direction dir) 550 | { 551 | int ret; 552 | 553 | ret = 0; 554 | *req = kmem_cache_alloc(req_cache, GFP_ATOMIC); 555 | if (unlikely(!*req)) { 556 | pr_err("no memory for req\n"); 557 | ret = -ENOMEM; 558 | goto out; 559 | } 560 | 561 | (*req)->page = page; 562 | init_completion(&(*req)->done); 563 | 564 | (*req)->dma = ib_dma_map_page(dev, page, 0, PAGE_SIZE, dir); 565 | if (unlikely(ib_dma_mapping_error(dev, (*req)->dma))) { 566 | pr_err("ib_dma_mapping_error\n"); 567 | ret = -ENOMEM; 568 | kmem_cache_free(req_cache, *req); 569 | goto out; 570 | } 571 | 572 | ib_dma_sync_single_for_device(dev, (*req)->dma, PAGE_SIZE, dir); 573 | out: 574 | return ret; 575 | } 576 | 577 | /* the buffer needs to come from kernel (not high memory) */ 578 | inline static int get_req_for_buf(struct rdma_req **req, struct ib_device *dev, 579 | void *buf, size_t size, 580 | enum dma_data_direction dir) 581 | { 582 | int ret; 583 | 584 | ret = 0; 585 | *req = kmem_cache_alloc(req_cache, GFP_ATOMIC); 586 | if (unlikely(!*req)) { 587 | pr_err("no memory for req\n"); 588 | ret = -ENOMEM; 589 | goto out; 590 | } 591 | 592 | init_completion(&(*req)->done); 593 | 594 | (*req)->dma = ib_dma_map_single(dev, buf, size, dir); 595 | if (unlikely(ib_dma_mapping_error(dev, (*req)->dma))) { 596 | pr_err("ib_dma_mapping_error\n"); 597 | ret = -ENOMEM; 598 | kmem_cache_free(req_cache, *req); 599 | goto out; 600 | } 601 | 602 | ib_dma_sync_single_for_device(dev, (*req)->dma, size, dir); 603 | out: 604 | return ret; 605 | } 606 | 607 | inline static void sswap_rdma_wait_completion(struct ib_cq *cq, 608 | struct rdma_req *qe) 609 | { 610 | ndelay(1000); 611 | while (!completion_done(&qe->done)) { 612 | ndelay(250); 613 | ib_process_cq_direct(cq, 1); 614 | } 615 | } 616 | 617 | /* polls queue until we reach target completed wrs or
qp is empty */ 618 | static inline int poll_target(struct rdma_queue *q, int target) 619 | { 620 | unsigned long flags; 621 | int completed = 0; 622 | 623 | while (completed < target && atomic_read(&q->pending) > 0) { 624 | spin_lock_irqsave(&q->cq_lock, flags); 625 | completed += ib_process_cq_direct(q->cq, target - completed); 626 | spin_unlock_irqrestore(&q->cq_lock, flags); 627 | cpu_relax(); 628 | } 629 | 630 | return completed; 631 | } 632 | 633 | static inline int drain_queue(struct rdma_queue *q) 634 | { 635 | unsigned long flags; 636 | 637 | while (atomic_read(&q->pending) > 0) { 638 | spin_lock_irqsave(&q->cq_lock, flags); 639 | ib_process_cq_direct(q->cq, 16); 640 | spin_unlock_irqrestore(&q->cq_lock, flags); 641 | cpu_relax(); 642 | } 643 | 644 | return 1; 645 | } 646 | 647 | static inline int write_queue_add(struct rdma_queue *q, struct page *page, 648 | u64 roffset) 649 | { 650 | struct rdma_req *req; 651 | struct ib_device *dev = q->ctrl->rdev->dev; 652 | struct ib_sge sge = {}; 653 | int ret, inflight; 654 | 655 | while ((inflight = atomic_read(&q->pending)) >= QP_MAX_SEND_WR - 8) { 656 | BUG_ON(inflight > QP_MAX_SEND_WR); 657 | poll_target(q, 2048); 658 | pr_info_ratelimited("back pressure writes"); 659 | } 660 | 661 | ret = get_req_for_page(&req, dev, page, DMA_TO_DEVICE); 662 | if (unlikely(ret)) 663 | return ret; 664 | 665 | req->cqe.done = sswap_rdma_write_done; 666 | ret = sswap_rdma_post_rdma(q, req, &sge, roffset, IB_WR_RDMA_WRITE); 667 | 668 | return ret; 669 | } 670 | 671 | static inline int begin_read(struct rdma_queue *q, struct page *page, 672 | u64 roffset) 673 | { 674 | struct rdma_req *req; 675 | struct ib_device *dev = q->ctrl->rdev->dev; 676 | struct ib_sge sge = {}; 677 | int ret, inflight; 678 | 679 | /* back pressure in-flight reads, can't send more than 680 | * QP_MAX_SEND_WR at a time */ 681 | while ((inflight = atomic_read(&q->pending)) >= QP_MAX_SEND_WR) { 682 | BUG_ON(inflight > QP_MAX_SEND_WR); /* only valid case is == */ 
683 | poll_target(q, 8); 684 | pr_info_ratelimited("back pressure happened on reads\n"); 685 | } 686 | 687 | ret = get_req_for_page(&req, dev, page, DMA_FROM_DEVICE); 688 | if (unlikely(ret)) 689 | return ret; 690 | 691 | req->cqe.done = sswap_rdma_read_done; 692 | ret = sswap_rdma_post_rdma(q, req, &sge, roffset, IB_WR_RDMA_READ); 693 | return ret; 694 | } 695 | 696 | int sswap_rdma_write(struct page *page, u64 roffset) 697 | { 698 | int ret; 699 | struct rdma_queue *q; 700 | 701 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); 702 | 703 | q = sswap_rdma_get_queue(smp_processor_id(), QP_WRITE_SYNC); 704 | ret = write_queue_add(q, page, roffset); 705 | BUG_ON(ret); 706 | drain_queue(q); 707 | return ret; 708 | } 709 | EXPORT_SYMBOL(sswap_rdma_write); 710 | 711 | static int sswap_rdma_recv_remotemr(struct sswap_rdma_ctrl *ctrl) 712 | { 713 | struct rdma_req *qe; 714 | int ret; 715 | struct ib_device *dev; 716 | 717 | pr_info("start: %s\n", __FUNCTION__); 718 | dev = ctrl->rdev->dev; 719 | 720 | ret = get_req_for_buf(&qe, dev, &(ctrl->servermr), sizeof(ctrl->servermr), 721 | DMA_FROM_DEVICE); 722 | if (unlikely(ret)) 723 | goto out; 724 | 725 | qe->cqe.done = sswap_rdma_recv_remotemr_done; 726 | 727 | ret = sswap_rdma_post_recv(&(ctrl->queues[0]), qe, sizeof(struct sswap_rdma_memregion)); 728 | 729 | if (unlikely(ret)) 730 | goto out_free_qe; 731 | 732 | /* this delay doesn't really matter, only happens once */ 733 | sswap_rdma_wait_completion(ctrl->queues[0].cq, qe); 734 | 735 | out_free_qe: 736 | kmem_cache_free(req_cache, qe); 737 | out: 738 | return ret; 739 | } 740 | 741 | /* page is unlocked when the wr is done.
742 | * posts an RDMA read on this cpu's qp */ 743 | int sswap_rdma_read_async(struct page *page, u64 roffset) 744 | { 745 | struct rdma_queue *q; 746 | int ret; 747 | 748 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); 749 | VM_BUG_ON_PAGE(!PageLocked(page), page); 750 | VM_BUG_ON_PAGE(PageUptodate(page), page); 751 | 752 | q = sswap_rdma_get_queue(smp_processor_id(), QP_READ_ASYNC); 753 | ret = begin_read(q, page, roffset); 754 | return ret; 755 | } 756 | EXPORT_SYMBOL(sswap_rdma_read_async); 757 | 758 | int sswap_rdma_read_sync(struct page *page, u64 roffset) 759 | { 760 | struct rdma_queue *q; 761 | int ret; 762 | 763 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); 764 | VM_BUG_ON_PAGE(!PageLocked(page), page); 765 | VM_BUG_ON_PAGE(PageUptodate(page), page); 766 | 767 | q = sswap_rdma_get_queue(smp_processor_id(), QP_READ_SYNC); 768 | ret = begin_read(q, page, roffset); 769 | return ret; 770 | } 771 | EXPORT_SYMBOL(sswap_rdma_read_sync); 772 | 773 | int sswap_rdma_poll_load(int cpu) 774 | { 775 | struct rdma_queue *q = sswap_rdma_get_queue(cpu, QP_READ_SYNC); 776 | return drain_queue(q); 777 | } 778 | EXPORT_SYMBOL(sswap_rdma_poll_load); 779 | 780 | /* idx is absolute id (i.e. 
can be greater than the number of cpus) */ 781 | inline enum qp_type get_queue_type(unsigned int idx) 782 | { 783 | /* queues are laid out as [read_sync x numcpus][read_async x numcpus][write_sync x numcpus] */ 784 | if (idx < numcpus) 785 | return QP_READ_SYNC; 786 | else if (idx < numcpus * 2) 787 | return QP_READ_ASYNC; 788 | else if (idx < numcpus * 3) 789 | return QP_WRITE_SYNC; 790 | 791 | BUG(); 792 | return QP_READ_SYNC; 793 | } 794 | 795 | inline struct rdma_queue *sswap_rdma_get_queue(unsigned int cpuid, 796 | enum qp_type type) 797 | { 798 | BUG_ON(gctrl == NULL); 799 | 800 | switch (type) { 801 | case QP_READ_SYNC: 802 | return &gctrl->queues[cpuid]; 803 | case QP_READ_ASYNC: 804 | return &gctrl->queues[cpuid + numcpus]; 805 | case QP_WRITE_SYNC: 806 | return &gctrl->queues[cpuid + numcpus * 2]; 807 | default: 808 | BUG(); 809 | }; 810 | } 811 | 812 | static int __init sswap_rdma_init_module(void) 813 | { 814 | int ret; 815 | 816 | pr_info("start: %s\n", __FUNCTION__); 817 | pr_info("* RDMA BACKEND *\n"); 818 | 819 | numcpus = num_online_cpus(); 820 | numqueues = numcpus * 3; 821 | 822 | req_cache = kmem_cache_create("sswap_req_cache", sizeof(struct rdma_req), 0, 823 | SLAB_TEMPORARY | SLAB_HWCACHE_ALIGN, NULL); 824 | 825 | if (!req_cache) { 826 | pr_err("no memory for cache allocation\n"); 827 | return -ENOMEM; 828 | } 829 | 830 | ib_register_client(&sswap_rdma_ib_client); 831 | ret = sswap_rdma_create_ctrl(&gctrl); 832 | if (ret) { 833 | pr_err("could not create ctrl\n"); 834 | ib_unregister_client(&sswap_rdma_ib_client); 835 | return -ENODEV; 836 | } 837 | 838 | ret = sswap_rdma_recv_remotemr(gctrl); 839 | if (ret) { 840 | pr_err("could not setup remote memory region\n"); 841 | ib_unregister_client(&sswap_rdma_ib_client); 842 | return -ENODEV; 843 | } 844 | 845 | pr_info("ctrl is ready for reqs\n"); 846 | return 0; 847 | } 848 | 849 | module_init(sswap_rdma_init_module); 850 | module_exit(sswap_rdma_cleanup_module); 851 | 852 | MODULE_LICENSE("GPL v2"); 853 | MODULE_DESCRIPTION("Fastswap RDMA backend"); 854 |
-------------------------------------------------------------------------------- /drivers/fastswap_rdma.h: -------------------------------------------------------------------------------- 1 | #if !defined(_SSWAP_RDMA_H) 2 | #define _SSWAP_RDMA_H 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | 14 | enum qp_type { 15 | QP_READ_SYNC, 16 | QP_READ_ASYNC, 17 | QP_WRITE_SYNC 18 | }; 19 | 20 | struct sswap_rdma_dev { 21 | struct ib_device *dev; 22 | struct ib_pd *pd; 23 | }; 24 | 25 | struct rdma_req { 26 | struct completion done; 27 | struct list_head list; 28 | struct ib_cqe cqe; 29 | u64 dma; 30 | struct page *page; 31 | }; 32 | 33 | struct sswap_rdma_ctrl; 34 | 35 | struct rdma_queue { 36 | struct ib_qp *qp; 37 | struct ib_cq *cq; 38 | spinlock_t cq_lock; 39 | enum qp_type qp_type; 40 | 41 | struct sswap_rdma_ctrl *ctrl; 42 | 43 | struct rdma_cm_id *cm_id; 44 | int cm_error; 45 | struct completion cm_done; 46 | 47 | atomic_t pending; 48 | }; 49 | 50 | struct sswap_rdma_memregion { 51 | u64 baseaddr; 52 | u32 key; 53 | }; 54 | 55 | struct sswap_rdma_ctrl { 56 | struct sswap_rdma_dev *rdev; // TODO: move this to queue 57 | struct rdma_queue *queues; 58 | struct sswap_rdma_memregion servermr; 59 | 60 | union { 61 | struct sockaddr addr; 62 | struct sockaddr_in addr_in; 63 | }; 64 | 65 | union { 66 | struct sockaddr srcaddr; 67 | struct sockaddr_in srcaddr_in; 68 | }; 69 | }; 70 | 71 | struct rdma_queue *sswap_rdma_get_queue(unsigned int idx, enum qp_type type); 72 | enum qp_type get_queue_type(unsigned int idx); 73 | int sswap_rdma_read_async(struct page *page, u64 roffset); 74 | int sswap_rdma_read_sync(struct page *page, u64 roffset); 75 | int sswap_rdma_write(struct page *page, u64 roffset); 76 | int sswap_rdma_poll_load(int cpu); 77 | 78 | #endif 79 | -------------------------------------------------------------------------------- /farmemserver/.gitignore: 
-------------------------------------------------------------------------------- 1 | rmserver 2 | -------------------------------------------------------------------------------- /farmemserver/Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: clean 2 | 3 | CFLAGS := -Wall -O2 -g -ggdb -Werror 4 | LDLIBS := ${LDLIBS} -lrdmacm -libverbs -lpthread 5 | CC := g++ 6 | 7 | APPS := rmserver 8 | 9 | all: ${APPS} 10 | 11 | clean: 12 | rm -f ${APPS} 13 | -------------------------------------------------------------------------------- /farmemserver/rmserver.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | #include 6 | 7 | #define TEST_NZ(x) do { if ( (x)) die("error: " #x " failed (returned non-zero)." ); } while (0) 8 | #define TEST_Z(x) do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0) 9 | 10 | const size_t BUFFER_SIZE = 1024 * 1024 * 1024 * 32l; 11 | const unsigned int NUM_PROCS = 8; 12 | const unsigned int NUM_QUEUES_PER_PROC = 3; 13 | const unsigned int NUM_QUEUES = NUM_PROCS * NUM_QUEUES_PER_PROC; 14 | 15 | struct device { 16 | struct ibv_pd *pd; 17 | struct ibv_context *verbs; 18 | }; 19 | 20 | struct queue { 21 | struct ibv_qp *qp; 22 | struct ibv_cq *cq; 23 | struct rdma_cm_id *cm_id; 24 | struct ctrl *ctrl; 25 | enum { 26 | INIT, 27 | CONNECTED 28 | } state; 29 | }; 30 | 31 | struct ctrl { 32 | struct queue *queues; 33 | struct ibv_mr *mr_buffer; 34 | void *buffer; 35 | struct device *dev; 36 | 37 | struct ibv_comp_channel *comp_channel; 38 | }; 39 | 40 | struct memregion { 41 | uint64_t baseaddr; 42 | uint32_t key; 43 | }; 44 | 45 | static void die(const char *reason); 46 | 47 | static int alloc_control(); 48 | static int on_connect_request(struct rdma_cm_id *id, struct rdma_conn_param *param); 49 | static int on_connection(struct queue *q); 50 | static int on_disconnect(struct queue *q); 51 | 
static int on_event(struct rdma_cm_event *event); 52 | static void destroy_device(struct ctrl *ctrl); 53 | 54 | static struct ctrl *gctrl = NULL; 55 | static unsigned int queue_ctr = 0; 56 | 57 | int main(int argc, char **argv) 58 | { 59 | struct sockaddr_in addr = {}; 60 | struct rdma_cm_event *event = NULL; 61 | struct rdma_event_channel *ec = NULL; 62 | struct rdma_cm_id *listener = NULL; 63 | uint16_t port = 0; 64 | 65 | if (argc != 2) { 66 | die("Need to specify a port number to listen"); 67 | } 68 | 69 | addr.sin_family = AF_INET; 70 | addr.sin_port = htons(atoi(argv[1])); 71 | 72 | TEST_NZ(alloc_control()); 73 | 74 | TEST_Z(ec = rdma_create_event_channel()); 75 | TEST_NZ(rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP)); 76 | TEST_NZ(rdma_bind_addr(listener, (struct sockaddr *)&addr)); 77 | TEST_NZ(rdma_listen(listener, NUM_QUEUES + 1)); 78 | port = ntohs(rdma_get_src_port(listener)); 79 | printf("listening on port %d.\n", port); 80 | 81 | for (unsigned int i = 0; i < NUM_QUEUES; ++i) { 82 | printf("waiting for queue connection: %d\n", i); 83 | struct queue *q = &gctrl->queues[i]; 84 | 85 | // handle connection requests 86 | while (rdma_get_cm_event(ec, &event) == 0) { 87 | struct rdma_cm_event event_copy; 88 | 89 | memcpy(&event_copy, event, sizeof(*event)); 90 | rdma_ack_cm_event(event); 91 | 92 | if (on_event(&event_copy) || q->state == queue::CONNECTED) 93 | break; 94 | } 95 | } 96 | 97 | printf("done connecting all queues\n"); 98 | 99 | // handle disconnects, etc. 
100 | while (rdma_get_cm_event(ec, &event) == 0) { 101 | struct rdma_cm_event event_copy; 102 | 103 | memcpy(&event_copy, event, sizeof(*event)); 104 | rdma_ack_cm_event(event); 105 | 106 | if (on_event(&event_copy)) 107 | break; 108 | } 109 | 110 | rdma_destroy_event_channel(ec); 111 | rdma_destroy_id(listener); 112 | destroy_device(gctrl); 113 | return 0; 114 | } 115 | 116 | void die(const char *reason) 117 | { 118 | fprintf(stderr, "%s - errno: %d\n", reason, errno); 119 | exit(EXIT_FAILURE); 120 | } 121 | 122 | int alloc_control() 123 | { 124 | gctrl = (struct ctrl *) malloc(sizeof(struct ctrl)); 125 | TEST_Z(gctrl); 126 | memset(gctrl, 0, sizeof(struct ctrl)); 127 | 128 | gctrl->queues = (struct queue *) malloc(sizeof(struct queue) * NUM_QUEUES); 129 | TEST_Z(gctrl->queues); 130 | memset(gctrl->queues, 0, sizeof(struct queue) * NUM_QUEUES); 131 | for (unsigned int i = 0; i < NUM_QUEUES; ++i) { 132 | gctrl->queues[i].ctrl = gctrl; 133 | gctrl->queues[i].state = queue::INIT; 134 | } 135 | 136 | 137 | return 0; 138 | } 139 | 140 | static device *get_device(struct queue *q) 141 | { 142 | struct device *dev = NULL; 143 | 144 | if (!q->ctrl->dev) { 145 | dev = (struct device *) malloc(sizeof(*dev)); 146 | TEST_Z(dev); 147 | dev->verbs = q->cm_id->verbs; 148 | TEST_Z(dev->verbs); 149 | dev->pd = ibv_alloc_pd(dev->verbs); 150 | TEST_Z(dev->pd); 151 | 152 | struct ctrl *ctrl = q->ctrl; 153 | ctrl->buffer = malloc(BUFFER_SIZE); 154 | TEST_Z(ctrl->buffer); 155 | 156 | TEST_Z(ctrl->mr_buffer = ibv_reg_mr( 157 | dev->pd, 158 | ctrl->buffer, 159 | BUFFER_SIZE, 160 | IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ)); 161 | 162 | printf("registered memory region of %zu bytes\n", BUFFER_SIZE); 163 | q->ctrl->dev = dev; 164 | } 165 | 166 | return q->ctrl->dev; 167 | } 168 | 169 | static void destroy_device(struct ctrl *ctrl) 170 | { 171 | TEST_Z(ctrl->dev); 172 | 173 | ibv_dereg_mr(ctrl->mr_buffer); 174 | free(ctrl->buffer); 175 | 
ibv_dealloc_pd(ctrl->dev->pd); 176 | free(ctrl->dev); 177 | ctrl->dev = NULL; 178 | } 179 | 180 | static void create_qp(struct queue *q) 181 | { 182 | struct ibv_qp_init_attr qp_attr = {}; 183 | 184 | qp_attr.send_cq = q->cq; 185 | qp_attr.recv_cq = q->cq; 186 | qp_attr.qp_type = IBV_QPT_RC; 187 | qp_attr.cap.max_send_wr = 10; 188 | qp_attr.cap.max_recv_wr = 10; 189 | qp_attr.cap.max_send_sge = 1; 190 | qp_attr.cap.max_recv_sge = 1; 191 | 192 | TEST_NZ(rdma_create_qp(q->cm_id, q->ctrl->dev->pd, &qp_attr)); 193 | q->qp = q->cm_id->qp; 194 | } 195 | 196 | int on_connect_request(struct rdma_cm_id *id, struct rdma_conn_param *param) 197 | { 198 | 199 | struct rdma_conn_param cm_params = {}; 200 | struct ibv_device_attr attrs = {}; 201 | struct queue *q = &gctrl->queues[queue_ctr++]; 202 | 203 | TEST_Z(q->state == queue::INIT); 204 | printf("%s\n", __FUNCTION__); 205 | 206 | id->context = q; 207 | q->cm_id = id; 208 | 209 | struct device *dev = get_device(q); 210 | create_qp(q); 211 | 212 | TEST_NZ(ibv_query_device(dev->verbs, &attrs)); 213 | 214 | printf("attrs: max_qp=%d, max_qp_wr=%d, max_cq=%d max_cqe=%d \ 215 | max_qp_rd_atom=%d, max_qp_init_rd_atom=%d\n", attrs.max_qp, 216 | attrs.max_qp_wr, attrs.max_cq, attrs.max_cqe, 217 | attrs.max_qp_rd_atom, attrs.max_qp_init_rd_atom); 218 | 219 | printf("ctrl attrs: initiator_depth=%d responder_resources=%d\n", 220 | param->initiator_depth, param->responder_resources); 221 | 222 | // the following should hold for initiator_depth: 223 | // initiator_depth <= max_qp_init_rd_atom, and 224 | // initiator_depth <= param->initiator_depth 225 | cm_params.initiator_depth = param->initiator_depth; 226 | // the following should hold for responder_resources: 227 | // responder_resources <= max_qp_rd_atom, and 228 | // responder_resources >= param->responder_resources 229 | cm_params.responder_resources = param->responder_resources; 230 | cm_params.rnr_retry_count = param->rnr_retry_count; 231 | cm_params.flow_control = 
param->flow_control; 232 | 233 | TEST_NZ(rdma_accept(q->cm_id, &cm_params)); 234 | 235 | return 0; 236 | } 237 | 238 | int on_connection(struct queue *q) 239 | { 240 | printf("%s\n", __FUNCTION__); 241 | struct ctrl *ctrl = q->ctrl; 242 | 243 | TEST_Z(q->state == queue::INIT); 244 | 245 | if (q == &ctrl->queues[0]) { 246 | struct ibv_send_wr wr = {}; 247 | struct ibv_send_wr *bad_wr = NULL; 248 | struct ibv_sge sge = {}; 249 | struct memregion servermr = {}; 250 | 251 | printf("connected. sending memory region info.\n"); 252 | printf("MR key=%u base vaddr=%p\n", ctrl->mr_buffer->rkey, ctrl->mr_buffer->addr); 253 | 254 | servermr.baseaddr = (uint64_t) ctrl->mr_buffer->addr; 255 | servermr.key = ctrl->mr_buffer->rkey; 256 | 257 | wr.opcode = IBV_WR_SEND; 258 | wr.sg_list = &sge; 259 | wr.num_sge = 1; 260 | wr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE; 261 | 262 | sge.addr = (uint64_t) &servermr; 263 | sge.length = sizeof(servermr); 264 | 265 | TEST_NZ(ibv_post_send(q->qp, &wr, &bad_wr)); 266 | 267 | // TODO: poll here 268 | } 269 | 270 | q->state = queue::CONNECTED; 271 | return 0; 272 | } 273 | 274 | int on_disconnect(struct queue *q) 275 | { 276 | printf("%s\n", __FUNCTION__); 277 | 278 | if (q->state == queue::CONNECTED) { 279 | q->state = queue::INIT; 280 | rdma_destroy_qp(q->cm_id); 281 | rdma_destroy_id(q->cm_id); 282 | } 283 | 284 | return 0; 285 | } 286 | 287 | int on_event(struct rdma_cm_event *event) 288 | { 289 | printf("%s\n", __FUNCTION__); 290 | struct queue *q = (struct queue *) event->id->context; 291 | 292 | switch (event->event) { 293 | case RDMA_CM_EVENT_CONNECT_REQUEST: 294 | return on_connect_request(event->id, &event->param.conn); 295 | case RDMA_CM_EVENT_ESTABLISHED: 296 | return on_connection(q); 297 | case RDMA_CM_EVENT_DISCONNECTED: 298 | on_disconnect(q); 299 | return 1; 300 | default: 301 | printf("unknown event: %s\n", rdma_event_str(event->event)); 302 | return 1; 303 | } 304 | } 305 | 306 | 
-------------------------------------------------------------------------------- /kernel/kernel.patch: -------------------------------------------------------------------------------- 1 | diff --git a/include/linux/frontswap.h b/include/linux/frontswap.h 2 | index 1d18af03..6a15babc 100644 3 | --- a/include/linux/frontswap.h 4 | +++ b/include/linux/frontswap.h 5 | @@ -10,6 +10,8 @@ struct frontswap_ops { 6 | void (*init)(unsigned); /* this swap type was just swapon'ed */ 7 | int (*store)(unsigned, pgoff_t, struct page *); /* store a page */ 8 | int (*load)(unsigned, pgoff_t, struct page *); /* load a page */ 9 | + int (*load_async)(unsigned, pgoff_t, struct page *); /* load a page async */ 10 | + int (*poll_load)(int); /* poll cpu for one load */ 11 | void (*invalidate_page)(unsigned, pgoff_t); /* page no longer needed */ 12 | void (*invalidate_area)(unsigned); /* swap type just swapoff'ed */ 13 | struct frontswap_ops *next; /* private pointer to next ops */ 14 | @@ -26,6 +28,8 @@ extern bool __frontswap_test(struct swap_info_struct *, pgoff_t); 15 | extern void __frontswap_init(unsigned type, unsigned long *map); 16 | extern int __frontswap_store(struct page *page); 17 | extern int __frontswap_load(struct page *page); 18 | +extern int __frontswap_load_async(struct page *page); 19 | +extern int __frontswap_poll_load(int cpu); 20 | extern void __frontswap_invalidate_page(unsigned, pgoff_t); 21 | extern void __frontswap_invalidate_area(unsigned); 22 | 23 | @@ -92,6 +96,22 @@ static inline int frontswap_load(struct page *page) 24 | return -1; 25 | } 26 | 27 | +static inline int frontswap_load_async(struct page *page) 28 | +{ 29 | + if (frontswap_enabled()) 30 | + return __frontswap_load_async(page); 31 | + 32 | + return -1; 33 | +} 34 | + 35 | +static inline int frontswap_poll_load(int cpu) 36 | +{ 37 | + if (frontswap_enabled()) 38 | + return __frontswap_poll_load(cpu); 39 | + 40 | + return -1; 41 | +} 42 | + 43 | static inline void frontswap_invalidate_page(unsigned 
type, pgoff_t offset) 44 | { 45 | if (frontswap_enabled()) 46 | diff --git a/include/linux/swap.h b/include/linux/swap.h 47 | index 45e91dd6..c052b901 100644 48 | --- a/include/linux/swap.h 49 | +++ b/include/linux/swap.h 50 | @@ -156,7 +156,7 @@ enum { 51 | SWP_SCANNING = (1 << 11), /* refcount in scan_swap_map */ 52 | }; 53 | 54 | -#define SWAP_CLUSTER_MAX 32UL 55 | +#define SWAP_CLUSTER_MAX 64UL 56 | #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX 57 | 58 | #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */ 59 | @@ -332,6 +332,7 @@ extern void kswapd_stop(int nid); 60 | 61 | /* linux/mm/page_io.c */ 62 | extern int swap_readpage(struct page *); 63 | +extern int swap_readpage_sync(struct page *); 64 | extern int swap_writepage(struct page *page, struct writeback_control *wbc); 65 | extern void end_swap_bio_write(struct bio *bio); 66 | extern int __swap_writepage(struct page *page, struct writeback_control *wbc, 67 | diff --git a/mm/frontswap.c b/mm/frontswap.c 68 | index fec8b504..6cdab53d 100644 69 | --- a/mm/frontswap.c 70 | +++ b/mm/frontswap.c 71 | @@ -264,8 +264,8 @@ int __frontswap_store(struct page *page) 72 | */ 73 | if (__frontswap_test(sis, offset)) { 74 | __frontswap_clear(sis, offset); 75 | - for_each_frontswap_ops(ops) 76 | - ops->invalidate_page(type, offset); 77 | + //for_each_frontswap_ops(ops) 78 | + // ops->invalidate_page(type, offset); 79 | } 80 | 81 | /* Try to store in each implementation, until one succeeds. 
*/ 82 | @@ -325,6 +325,50 @@ int __frontswap_load(struct page *page) 83 | } 84 | EXPORT_SYMBOL(__frontswap_load); 85 | 86 | +int __frontswap_load_async(struct page *page) 87 | +{ 88 | + int ret = -1; 89 | + swp_entry_t entry = { .val = page_private(page), }; 90 | + int type = swp_type(entry); 91 | + struct swap_info_struct *sis = swap_info[type]; 92 | + pgoff_t offset = swp_offset(entry); 93 | + struct frontswap_ops *ops; 94 | + 95 | + VM_BUG_ON(!frontswap_ops); 96 | + VM_BUG_ON(!PageLocked(page)); 97 | + VM_BUG_ON(sis == NULL); 98 | + 99 | + if (!__frontswap_test(sis, offset)) 100 | + return -1; 101 | + 102 | + /* Try loading from each implementation, until one succeeds. */ 103 | + for_each_frontswap_ops(ops) { 104 | + ret = ops->load_async(type, offset, page); 105 | + if (!ret) /* successful load */ 106 | + break; 107 | + } 108 | + if (ret == 0) 109 | + inc_frontswap_loads(); 110 | + 111 | + return ret; 112 | +} 113 | +EXPORT_SYMBOL(__frontswap_load_async); 114 | + 115 | +int __frontswap_poll_load(int cpu) 116 | +{ 117 | + struct frontswap_ops *ops; 118 | + 119 | + VM_BUG_ON(!frontswap_ops); 120 | + 121 | + /* Try loading from each implementation, until one succeeds. */ 122 | + for_each_frontswap_ops(ops) 123 | + return ops->poll_load(cpu); 124 | + 125 | + BUG(); 126 | + return -1; 127 | +} 128 | +EXPORT_SYMBOL(__frontswap_poll_load); 129 | + 130 | /* 131 | * Invalidate any data from frontswap associated with the specified swaptype 132 | * and offset so that a subsequent "get" will fail. 
133 | @@ -332,7 +376,7 @@ EXPORT_SYMBOL(__frontswap_load); 134 | void __frontswap_invalidate_page(unsigned type, pgoff_t offset) 135 | { 136 | struct swap_info_struct *sis = swap_info[type]; 137 | - struct frontswap_ops *ops; 138 | + //struct frontswap_ops *ops; 139 | 140 | VM_BUG_ON(!frontswap_ops); 141 | VM_BUG_ON(sis == NULL); 142 | @@ -340,8 +384,8 @@ void __frontswap_invalidate_page(unsigned type, pgoff_t offset) 143 | if (!__frontswap_test(sis, offset)) 144 | return; 145 | 146 | - for_each_frontswap_ops(ops) 147 | - ops->invalidate_page(type, offset); 148 | + //for_each_frontswap_ops(ops) 149 | + // ops->invalidate_page(type, offset); 150 | __frontswap_clear(sis, offset); 151 | inc_frontswap_invalidates(); 152 | } 153 | @@ -480,6 +524,25 @@ unsigned long frontswap_curr_pages(void) 154 | } 155 | EXPORT_SYMBOL(frontswap_curr_pages); 156 | 157 | +static int show_curr_pages(struct seq_file *m, void *v) 158 | +{ 159 | + seq_printf(m, "%lu", frontswap_curr_pages()); 160 | + return 0; 161 | +} 162 | + 163 | +static int curr_pages_open(struct inode *inode, struct file *file) 164 | +{ 165 | + return single_open(file, show_curr_pages, NULL); 166 | +} 167 | + 168 | +static const struct file_operations fops = { 169 | + .llseek = seq_lseek, 170 | + .open = curr_pages_open, 171 | + .owner = THIS_MODULE, 172 | + .read = seq_read, 173 | + .release = single_release, 174 | +}; 175 | + 176 | static int __init init_frontswap(void) 177 | { 178 | #ifdef CONFIG_DEBUG_FS 179 | @@ -492,6 +555,7 @@ static int __init init_frontswap(void) 180 | &frontswap_failed_stores); 181 | debugfs_create_u64("invalidates", S_IRUGO, 182 | root, &frontswap_invalidates); 183 | + debugfs_create_file("curr_pages", S_IRUGO, root, NULL, &fops); 184 | #endif 185 | return 0; 186 | } 187 | diff --git a/mm/memcontrol.c b/mm/memcontrol.c 188 | index 2bd7541d..53027aaa 100644 189 | --- a/mm/memcontrol.c 190 | +++ b/mm/memcontrol.c 191 | @@ -94,6 +94,8 @@ int do_swap_account __read_mostly; 192 | #define 
do_swap_account 0 193 | #endif 194 | 195 | +#define FASTSWAP_RECLAIM_CPU 7 196 | + 197 | /* Whether legacy memory+swap accounting is active */ 198 | static bool do_memsw_account(void) 199 | { 200 | @@ -1842,12 +1844,23 @@ static void reclaim_high(struct mem_cgroup *memcg, 201 | } while ((memcg = parent_mem_cgroup(memcg))); 202 | } 203 | 204 | +#define MAX_RECLAIM_OFFLOAD 2048UL 205 | static void high_work_func(struct work_struct *work) 206 | { 207 | - struct mem_cgroup *memcg; 208 | + struct mem_cgroup *memcg = container_of(work, struct mem_cgroup, high_work); 209 | + unsigned long high = memcg->high; 210 | + unsigned long nr_pages = page_counter_read(&memcg->memory); 211 | + unsigned long reclaim; 212 | + 213 | + if (nr_pages > high) { 214 | + reclaim = min(nr_pages - high, MAX_RECLAIM_OFFLOAD); 215 | + 216 | + /* reclaim_high only reclaims iff nr_pages > high */ 217 | + reclaim_high(memcg, reclaim, GFP_KERNEL); 218 | + } 219 | 220 | - memcg = container_of(work, struct mem_cgroup, high_work); 221 | - reclaim_high(memcg, CHARGE_BATCH, GFP_KERNEL); 222 | + if (page_counter_read(&memcg->memory) > memcg->high) 223 | + schedule_work_on(FASTSWAP_RECLAIM_CPU, &memcg->high_work); 224 | } 225 | 226 | /* 227 | @@ -1865,6 +1878,7 @@ void mem_cgroup_handle_over_high(void) 228 | memcg = get_mem_cgroup_from_mm(current->mm); 229 | reclaim_high(memcg, nr_pages, GFP_KERNEL); 230 | css_put(&memcg->css); 231 | + 232 | current->memcg_nr_pages_over_high = 0; 233 | } 234 | 235 | @@ -1878,6 +1892,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, 236 | unsigned long nr_reclaimed; 237 | bool may_swap = true; 238 | bool drained = false; 239 | + unsigned long high_limit; 240 | + unsigned long curr_pages; 241 | + unsigned long excess; 242 | 243 | if (mem_cgroup_is_root(memcg)) 244 | return 0; 245 | @@ -2006,14 +2023,20 @@ done_restock: 246 | * reclaim, the cost of mismatch is negligible. 
247 | */ 248 | do { 249 | - if (page_counter_read(&memcg->memory) > memcg->high) { 250 | - /* Don't bother a random interrupted task */ 251 | - if (in_interrupt()) { 252 | - schedule_work(&memcg->high_work); 253 | - break; 254 | + high_limit = memcg->high; 255 | + curr_pages = page_counter_read(&memcg->memory); 256 | + 257 | + if (curr_pages > high_limit) { 258 | + excess = curr_pages - high_limit; 259 | + /* regardless of whether we use app cpu or worker, we evict 260 | + * at most MAX_RECLAIM_OFFLOAD pages at a time */ 261 | + if (excess > MAX_RECLAIM_OFFLOAD && !in_interrupt()) { 262 | + current->memcg_nr_pages_over_high += MAX_RECLAIM_OFFLOAD; 263 | + set_notify_resume(current); 264 | + } else { 265 | + schedule_work_on(FASTSWAP_RECLAIM_CPU, &memcg->high_work); 266 | } 267 | - current->memcg_nr_pages_over_high += batch; 268 | - set_notify_resume(current); 269 | + 270 | break; 271 | } 272 | } while ((memcg = parent_mem_cgroup(memcg))); 273 | @@ -5081,7 +5104,6 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, 274 | char *buf, size_t nbytes, loff_t off) 275 | { 276 | struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); 277 | - unsigned long nr_pages; 278 | unsigned long high; 279 | int err; 280 | 281 | @@ -5092,12 +5114,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, 282 | 283 | memcg->high = high; 284 | 285 | - nr_pages = page_counter_read(&memcg->memory); 286 | - if (nr_pages > high) 287 | - try_to_free_mem_cgroup_pages(memcg, nr_pages - high, 288 | - GFP_KERNEL, true); 289 | - 290 | + /* concurrent eviction on shrink */ 291 | memcg_wb_domain_size_changed(memcg); 292 | + schedule_work_on(FASTSWAP_RECLAIM_CPU, &memcg->high_work); 293 | return nbytes; 294 | } 295 | 296 | diff --git a/mm/memory.c b/mm/memory.c 297 | index 235ba51b..2e7b3f80 100644 298 | --- a/mm/memory.c 299 | +++ b/mm/memory.c 300 | @@ -68,6 +68,8 @@ 301 | #include 302 | #include 303 | #include 304 | +#include 305 | +#include 306 | 307 | #include 308 | 
#include 309 | @@ -2698,6 +2700,7 @@ int do_swap_page(struct vm_fault *vmf) 310 | } 311 | goto out; 312 | } 313 | + 314 | delayacct_set_flag(DELAYACCT_PF_SWAPIN); 315 | page = lookup_swap_cache(entry); 316 | if (!page) { 317 | diff --git a/mm/page_io.c b/mm/page_io.c 318 | index 23f6d0d3..40cddf6a 100644 319 | --- a/mm/page_io.c 320 | +++ b/mm/page_io.c 321 | @@ -3,7 +3,7 @@ 322 | * 323 | * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds 324 | * 325 | - * Swap reorganised 29.12.95, 326 | + * Swap reorganised 29.12.95, 327 | * Asynchronous swapping added 30.12.95. Stephen Tweedie 328 | * Removed race in async swapping. 14.4.1996. Bruno Haible 329 | * Add swap of shared pages through the page cache. 20.2.1998. Stephen Tweedie 330 | @@ -338,11 +338,8 @@ int swap_readpage(struct page *page) 331 | VM_BUG_ON_PAGE(!PageSwapCache(page), page); 332 | VM_BUG_ON_PAGE(!PageLocked(page), page); 333 | VM_BUG_ON_PAGE(PageUptodate(page), page); 334 | - if (frontswap_load(page) == 0) { 335 | - SetPageUptodate(page); 336 | - unlock_page(page); 337 | + if (frontswap_load_async(page) == 0) 338 | goto out; 339 | - } 340 | 341 | if (sis->flags & SWP_FILE) { 342 | struct file *swap_file = sis->swap_file; 343 | @@ -379,6 +376,17 @@ out: 344 | return ret; 345 | } 346 | 347 | +int swap_readpage_sync(struct page *page) 348 | +{ 349 | + VM_BUG_ON_PAGE(!PageSwapCache(page), page); 350 | + VM_BUG_ON_PAGE(!PageLocked(page), page); 351 | + VM_BUG_ON_PAGE(PageUptodate(page), page); 352 | + 353 | + BUG_ON(frontswap_load(page)); 354 | + 355 | + return 0; 356 | +} 357 | + 358 | int swap_set_page_dirty(struct page *page) 359 | { 360 | struct swap_info_struct *sis = page_swap_info(page); 361 | diff --git a/mm/swap_state.c b/mm/swap_state.c 362 | index 473b71e0..0d6b4b3f 100644 363 | --- a/mm/swap_state.c 364 | +++ b/mm/swap_state.c 365 | @@ -19,6 +19,7 @@ 366 | #include 367 | #include 368 | #include 369 | +#include 370 | 371 | #include 372 | 373 | @@ -168,7 +169,7 @@ void 
__delete_from_swap_cache(struct page *page) 374 | * @page: page we want to move to swap 375 | * 376 | * Allocate swap space for the page and add the page to the 377 | - * swap cache. Caller needs to hold the page lock. 378 | + * swap cache. Caller needs to hold the page lock. 379 | */ 380 | int add_to_swap(struct page *page, struct list_head *list) 381 | { 382 | @@ -241,9 +242,9 @@ void delete_from_swap_cache(struct page *page) 383 | put_page(page); 384 | } 385 | 386 | -/* 387 | - * If we are the only user, then try to free up the swap cache. 388 | - * 389 | +/* 390 | + * If we are the only user, then try to free up the swap cache. 391 | + * 392 | * Its ok to check for PageSwapCache without the page lock 393 | * here because we are going to recheck again inside 394 | * try_to_free_swap() _with_ the lock. 395 | @@ -257,7 +258,7 @@ static inline void free_swap_cache(struct page *page) 396 | } 397 | } 398 | 399 | -/* 400 | +/* 401 | * Perform a free_page(), also freeing any swap cache associated with 402 | * this page if it is the last user of the page. 
403 | */ 404 | @@ -426,6 +427,19 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, 405 | return retpage; 406 | } 407 | 408 | +struct page *read_swap_cache_sync(swp_entry_t entry, gfp_t gfp_mask, 409 | + struct vm_area_struct *vma, unsigned long addr) 410 | +{ 411 | + bool page_was_allocated; 412 | + struct page *retpage = __read_swap_cache_async(entry, gfp_mask, 413 | + vma, addr, &page_was_allocated); 414 | + 415 | + if (page_was_allocated) 416 | + swap_readpage_sync(retpage); 417 | + 418 | + return retpage; 419 | +} 420 | + 421 | static unsigned long swapin_nr_pages(unsigned long offset) 422 | { 423 | static unsigned long prev_offset; 424 | @@ -492,12 +506,17 @@ static unsigned long swapin_nr_pages(unsigned long offset) 425 | struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, 426 | struct vm_area_struct *vma, unsigned long addr) 427 | { 428 | - struct page *page; 429 | + struct page *page, *faultpage; 430 | unsigned long entry_offset = swp_offset(entry); 431 | unsigned long offset = entry_offset; 432 | unsigned long start_offset, end_offset; 433 | unsigned long mask; 434 | - struct blk_plug plug; 435 | + int cpu; 436 | + 437 | + preempt_disable(); 438 | + cpu = smp_processor_id(); 439 | + faultpage = read_swap_cache_sync(entry, gfp_mask, vma, addr); 440 | + preempt_enable(); 441 | 442 | mask = swapin_nr_pages(offset) - 1; 443 | if (!mask) 444 | @@ -509,22 +528,25 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, 445 | if (!start_offset) /* First page is swap header. 
*/ 446 | start_offset++; 447 | 448 | - blk_start_plug(&plug); 449 | for (offset = start_offset; offset <= end_offset ; offset++) { 450 | + if (offset == entry_offset) 451 | + continue; 452 | + 453 | /* Ok, do the async read-ahead now */ 454 | page = read_swap_cache_async(swp_entry(swp_type(entry), offset), 455 | gfp_mask, vma, addr); 456 | if (!page) 457 | continue; 458 | - if (offset != entry_offset) 459 | - SetPageReadahead(page); 460 | + 461 | + SetPageReadahead(page); 462 | put_page(page); 463 | } 464 | - blk_finish_plug(&plug); 465 | 466 | lru_add_drain(); /* Push any new pages onto the LRU now */ 467 | + /* prefetch pages generate interrupts and are handled async */ 468 | skip: 469 | - return read_swap_cache_async(entry, gfp_mask, vma, addr); 470 | + frontswap_poll_load(cpu); 471 | + return faultpage; 472 | } 473 | 474 | int init_swap_address_space(unsigned int type, unsigned long nr_pages) 475 | diff --git a/mm/vmscan.c b/mm/vmscan.c 476 | index bc8031ef..eba9777f 100644 477 | --- a/mm/vmscan.c 478 | +++ b/mm/vmscan.c 479 | @@ -964,8 +964,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, 480 | unsigned nr_ref_keep = 0; 481 | unsigned nr_unmap_fail = 0; 482 | 483 | - cond_resched(); 484 | - 485 | while (!list_empty(page_list)) { 486 | struct address_space *mapping; 487 | struct page *page; 488 | @@ -975,8 +973,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, 489 | bool lazyfree = false; 490 | int ret = SWAP_SUCCESS; 491 | 492 | - cond_resched(); 493 | - 494 | page = lru_to_page(page_list); 495 | list_del(&page->lru); 496 | 497 | @@ -1208,6 +1204,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, 498 | case PAGE_SUCCESS: 499 | if (PageWriteback(page)) 500 | goto keep; 501 | + 502 | if (PageDirty(page)) 503 | goto keep; 504 | 505 | @@ -1330,6 +1327,7 @@ keep: 506 | stat->nr_ref_keep = nr_ref_keep; 507 | stat->nr_unmap_fail = nr_unmap_fail; 508 | } 509 | + 510 | return nr_reclaimed; 511 | } 512 | 
513 | --------------------------------------------------------------------------------