├── COPYING
├── Kernel
│   ├── README
│   └── debug-optionally-skip-clearpage-2.6.29-rc8-mmotm-090323.patch
├── Libnuma
│   └── lts-01-libnuma-fix-numa_num_thread_cpus+nodes.patch
├── Makefile
├── README
├── Scripts
│   ├── README
│   ├── fmt_vmstats
│   ├── pft-multirun
│   ├── pft_lockstats
│   ├── pft_per_node
│   ├── pft_plot.py
│   ├── pft_profile
│   ├── pft_task_thread
│   ├── pft_vmstats
│   ├── rmshm
│   └── runpft
├── TODO
├── pft.c
└── version.h

/COPYING:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 
28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 
62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 
102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 
133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. 
You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 
196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 
229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 
256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 
287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 
317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. 
If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 
386 | those licensors and authors. 387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 
421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. 
If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 
486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. "Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 
512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. 
If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 
578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 
613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | 635 | Copyright (C) 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see . 649 | 650 | Also add information on how to contact you by electronic and paper mail. 
651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | Copyright (C) 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | . 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | . 675 | 676 | -------------------------------------------------------------------------------- /Kernel/README: -------------------------------------------------------------------------------- 1 | Pft Kernel patches: 2 | 3 | Optional kernel patches for certain features of pft_mpol 4 | -------------------------------------------------------------------------------- /Kernel/debug-optionally-skip-clearpage-2.6.29-rc8-mmotm-090323.patch: -------------------------------------------------------------------------------- 1 | N.B., this patch must not be merged! It opens a big security hole 2 | by mapping unzeroed pages into user space. 
3 | 4 | Temp hack to allow the page fault test program to disable __GFP_ZERO 5 | for allocating the test memory [anon, shmem]. This will eliminate 6 | clear_page() which dominates the profiles, exposing more of the 7 | allocator behavior [we hope]. 8 | 9 | This version atop 2.6.29-rc8-mmotm-090323-2234 10 | 11 | Signed-off-by: Lee Schermerhorn 12 | 13 | arch/x86/include/asm/page.h | 2 +- 14 | include/linux/mempolicy.h | 3 ++- 15 | include/linux/mm.h | 2 ++ 16 | include/linux/pagemap.h | 13 +++++++++++++ 17 | mm/mempolicy.c | 19 ++++++++++++++++++- 18 | mm/shmem.c | 3 +++ 19 | 6 files changed, 39 insertions(+), 3 deletions(-) 20 | 21 | Index: linux-2.6.29-rc8-mmotm-090323-2234/mm/mempolicy.c 22 | =================================================================== 23 | --- linux-2.6.29-rc8-mmotm-090323-2234.orig/mm/mempolicy.c 2009-04-03 11:20:35.000000000 -0400 24 | +++ linux-2.6.29-rc8-mmotm-090323-2234/mm/mempolicy.c 2009-04-03 11:29:17.000000000 -0400 25 | @@ -926,6 +926,17 @@ static struct page *new_vma_page(struct 26 | } 27 | #endif 28 | 29 | +static void vma_set_no_clear(unsigned long addr) 30 | +{ 31 | + struct vm_area_struct *vma = find_vma_intersection(current->mm, 32 | + addr, addr+1); 33 | + if (vma) { 34 | + vma->vm_flags |= VM_NOCLEAR; 35 | + if (vma->vm_file) 36 | + mapping_set_no_clear(vma->vm_file->f_mapping); 37 | + } 38 | +} 39 | + 40 | static long do_mbind(unsigned long start, unsigned long len, 41 | unsigned short mode, unsigned short mode_flags, 42 | nodemask_t *nmask, unsigned long flags) 43 | @@ -937,9 +948,15 @@ static long do_mbind(unsigned long start 44 | int err; 45 | LIST_HEAD(pagelist); 46 | 47 | - if (flags & ~(unsigned long)(MPOL_MF_STRICT | 48 | + if (flags & ~(unsigned long)(MPOL_MF_STRICT | MPOL_MF_NOCLEAR | 49 | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) 50 | return -EINVAL; 51 | + 52 | + if (flags & MPOL_MF_NOCLEAR) { 53 | + vma_set_no_clear(start); 54 | + return 0; 55 | + } 56 | + 57 | if ((flags & MPOL_MF_MOVE_ALL) && 
!capable(CAP_SYS_NICE)) 58 | return -EPERM; 59 | 60 | Index: linux-2.6.29-rc8-mmotm-090323-2234/mm/shmem.c 61 | =================================================================== 62 | --- linux-2.6.29-rc8-mmotm-090323-2234.orig/mm/shmem.c 2009-04-03 11:20:35.000000000 -0400 63 | +++ linux-2.6.29-rc8-mmotm-090323-2234/mm/shmem.c 2009-04-03 11:32:13.000000000 -0400 64 | @@ -1149,6 +1149,9 @@ static struct page *shmem_alloc_page(gfp 65 | pvma.vm_ops = NULL; 66 | pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx); 67 | 68 | + if (mapping_no_clear(info->vfs_inode.i_mapping)) 69 | + gfp &= ~__GFP_ZERO; 70 | + 71 | /* 72 | * alloc_page_vma() will drop the shared policy reference 73 | */ 74 | Index: linux-2.6.29-rc8-mmotm-090323-2234/include/linux/pagemap.h 75 | =================================================================== 76 | --- linux-2.6.29-rc8-mmotm-090323-2234.orig/include/linux/pagemap.h 2009-04-03 11:20:35.000000000 -0400 77 | +++ linux-2.6.29-rc8-mmotm-090323-2234/include/linux/pagemap.h 2009-04-03 11:21:00.000000000 -0400 78 | @@ -25,8 +25,21 @@ enum mapping_flags { 79 | #ifdef CONFIG_UNEVICTABLE_LRU 80 | AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ 81 | #endif 82 | + AS_NO_CLEAR = __GFP_BITS_SHIFT + 4, // temp for page fault testing 83 | }; 84 | 85 | +static inline void mapping_set_no_clear(struct address_space *mapping) 86 | +{ 87 | + set_bit(AS_NO_CLEAR, &mapping->flags); 88 | +} 89 | + 90 | +static inline int mapping_no_clear(struct address_space *mapping) 91 | +{ 92 | + if (likely(mapping)) 93 | + return test_bit(AS_NO_CLEAR, &mapping->flags); 94 | + return !mapping; 95 | +} 96 | + 97 | static inline void mapping_set_error(struct address_space *mapping, int error) 98 | { 99 | if (unlikely(error)) { 100 | Index: linux-2.6.29-rc8-mmotm-090323-2234/arch/x86/include/asm/page.h 101 | =================================================================== 102 | --- 
linux-2.6.29-rc8-mmotm-090323-2234.orig/arch/x86/include/asm/page.h 2009-04-03 11:20:35.000000000 -0400 103 | +++ linux-2.6.29-rc8-mmotm-090323-2234/arch/x86/include/asm/page.h 2009-04-03 11:23:13.000000000 -0400 104 | @@ -30,7 +30,7 @@ static inline void copy_user_page(void * 105 | } 106 | 107 | #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \ 108 | - alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr) 109 | + alloc_page_vma(GFP_HIGHUSER | (vma_noclear(vma) ? 0 : __GFP_ZERO) | movableflags, vma, vaddr) 110 | #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE 111 | 112 | #define __pa(x) __phys_addr((unsigned long)(x)) 113 | Index: linux-2.6.29-rc8-mmotm-090323-2234/include/linux/mm.h 114 | =================================================================== 115 | --- linux-2.6.29-rc8-mmotm-090323-2234.orig/include/linux/mm.h 2009-04-03 11:15:14.000000000 -0400 116 | +++ linux-2.6.29-rc8-mmotm-090323-2234/include/linux/mm.h 2009-04-03 11:24:03.000000000 -0400 117 | @@ -104,6 +104,8 @@ extern unsigned int kobjsize(const void 118 | #define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */ 119 | #define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ 120 | #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ 121 | +#define VM_NOCLEAR 0x80000000 /* Debug: don't zero/clear pages in this vma */ 122 | +#define vma_noclear(VMA) ((VMA)->vm_flags & VM_NOCLEAR) 123 | 124 | #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ 125 | #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS 126 | Index: linux-2.6.29-rc8-mmotm-090323-2234/include/linux/mempolicy.h 127 | =================================================================== 128 | --- linux-2.6.29-rc8-mmotm-090323-2234.orig/include/linux/mempolicy.h 2009-04-03 11:19:46.000000000 -0400 129 | +++ linux-2.6.29-rc8-mmotm-090323-2234/include/linux/mempolicy.h 2009-04-03 11:25:26.000000000 -0400 130 | @@ -42,7 +42,8 @@ enum { 131 | #define 
MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ 132 | #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ 133 | #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ 134 | -#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */ 135 | +#define MPOL_MF_NOCLEAR (1<<3) /* Debug: don't clear pages in vma */ 136 | +#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */ 137 | 138 | /* 139 | * Internal flags that share the struct mempolicy flags word with 140 | -------------------------------------------------------------------------------- /Libnuma/lts-01-libnuma-fix-numa_num_thread_cpus+nodes.patch: -------------------------------------------------------------------------------- 1 | Against: numactl-2.0.3 package 2 | 3 | libnuma in the 2.0.3 numactl package contained a mismatch between the 4 | names of the "numa_num_*_{cpus|nodes}()" functions used in the source 5 | code and man page, and the names used in the version.ldscript and the 6 | numa.h header. This rendered the functions inaccessible. 7 | 8 | This patch resolves the mismatch. We had to choose one of the names, 9 | "*_task_*" or "*_thread_*", for all usage. The man page and source 10 | code had used the "thread" variant, while the header and ldscript used 11 | the "task" variant. By using "task" in place of thread everywhere, 12 | including the man page, function descriptions therein become true 13 | independent of thread model: "system scope" vs "process scope" in 14 | Pthreads terminology. 15 | 16 | Because libnuma provides a Linux-specific API, the term "task" 17 | should be well understood by programmers, and should cause no 18 | portability concerns. Because the "numa_num_*_{cpus|nodes}()" 19 | functions are inaccessible in the current version of the library 20 | and these are the only exported functions whose name was 21 | changed, there should be no compatibility concerns.
22 | 23 | 24 | libnuma.c | 22 ++++++------- 25 | numa.3 | 98 ++++++++++++++++++++++++++++++++++---------------------------- 26 | 2 files changed, 66 insertions(+), 54 deletions(-) 27 | 28 | Index: numactl-2.0.3/libnuma.c 29 | =================================================================== 30 | --- numactl-2.0.3.orig/libnuma.c 2009-06-10 08:30:03.000000000 -0400 31 | +++ numactl-2.0.3/libnuma.c 2010-04-02 10:03:58.000000000 -0400 32 | @@ -431,7 +431,7 @@ read_mask(char *s, struct bitmask *bmp) 33 | * /proc/self/status. 34 | */ 35 | static void 36 | -set_thread_constraints(void) 37 | +set_task_constraints(void) 38 | { 39 | int hicpu = sysconf(_SC_NPROCESSORS_CONF)-1; 40 | int i; 41 | @@ -567,7 +567,7 @@ set_sizes(void) 42 | set_nodemask_size(); /* size of kernel nodemask_t */ 43 | set_numa_max_cpu(); /* size of kernel cpumask_t */ 44 | set_configured_cpus(); /* cpus listed in /sys/devices/system/cpu */ 45 | - set_thread_constraints(); /* cpus and nodes for current thread */ 46 | + set_task_constraints(); /* cpus and nodes for current thread */ 47 | } 48 | 49 | int 50 | @@ -596,13 +596,13 @@ numa_num_possible_cpus(void) 51 | } 52 | 53 | int 54 | -numa_num_thread_nodes(void) 55 | +numa_num_task_nodes(void) 56 | { 57 | return maxprocnode+1; 58 | } 59 | 60 | int 61 | -numa_num_thread_cpus(void) 62 | +numa_num_task_cpus(void) 63 | { 64 | return maxproccpu+1; 65 | } 66 | @@ -1667,7 +1667,7 @@ numa_parse_nodestring(char *s) 67 | int maxnode = numa_max_node(); 68 | int invert = 0, relative = 0; 69 | int conf_nodes = numa_num_configured_nodes(); 70 | - int thread_nodes = numa_num_thread_nodes(); 71 | + int task_nodes = numa_num_task_nodes(); 72 | char *end; 73 | struct bitmask *mask; 74 | 75 | @@ -1690,7 +1690,7 @@ numa_parse_nodestring(char *s) 76 | int i; 77 | if (!strcmp(s,"all")) { 78 | int i; 79 | - for (i = 0; i < thread_nodes; i++) 80 | + for (i = 0; i < task_nodes; i++) 81 | numa_bitmask_setbit(mask, i); 82 | s+=4; 83 | break; 84 | @@ -1715,7 +1715,7 @@ 
numa_parse_nodestring(char *s) 85 | numa_warn(W_nodeparse, "missing node argument %s\n", s); 86 | goto err; 87 | } 88 | - if (arg2 >= thread_nodes) { 89 | + if (arg2 >= task_nodes) { 90 | numa_warn(W_nodeparse, "node argument %d out of range\n", arg2); 91 | goto err; 92 | } 93 | @@ -1763,7 +1763,7 @@ numa_parse_cpustring(char *s) 94 | { 95 | int invert = 0, relative=0; 96 | int conf_cpus = numa_num_configured_cpus(); 97 | - int thread_cpus = numa_num_thread_cpus(); 98 | + int task_cpus = numa_num_task_cpus(); 99 | char *end; 100 | struct bitmask *mask; 101 | 102 | @@ -1785,7 +1785,7 @@ numa_parse_cpustring(char *s) 103 | 104 | if (!strcmp(s,"all")) { 105 | int i; 106 | - for (i = 0; i < thread_cpus; i++) 107 | + for (i = 0; i < task_cpus; i++) 108 | numa_bitmask_setbit(mask, i); 109 | s+=4; 110 | break; 111 | @@ -1795,7 +1795,7 @@ numa_parse_cpustring(char *s) 112 | numa_warn(W_cpuparse, "unparseable cpu description `%s'\n", s); 113 | goto err; 114 | } 115 | - if (arg >= thread_cpus) { 116 | + if (arg >= task_cpus) { 117 | numa_warn(W_cpuparse, "cpu argument %s is out of range\n", s); 118 | goto err; 119 | } 120 | @@ -1811,7 +1811,7 @@ numa_parse_cpustring(char *s) 121 | numa_warn(W_cpuparse, "missing cpu argument %s\n", s); 122 | goto err; 123 | } 124 | - if (arg2 >= thread_cpus) { 125 | + if (arg2 >= task_cpus) { 126 | numa_warn(W_cpuparse, "cpu argument %s out of range\n", s); 127 | goto err; 128 | } 129 | Index: numactl-2.0.3/numa.3 130 | =================================================================== 131 | --- numactl-2.0.3.orig/numa.3 2009-06-10 08:30:03.000000000 -0400 132 | +++ numactl-2.0.3/numa.3 2010-04-02 10:27:40.000000000 -0400 133 | @@ -44,9 +44,9 @@ numa \- NUMA policy library 134 | .br 135 | .BI "struct bitmask *numa_all_cpus_ptr;" 136 | .sp 137 | -.BI "int numa_num_thread_cpus();" 138 | +.BI "int numa_num_task_cpus();" 139 | .br 140 | -.BI "int numa_num_thread_nodes();" 141 | +.BI "int numa_num_task_nodes();" 142 | .sp 143 | .BI "int 
numa_parse_bitmap(char *" line " , struct bitmask *" mask "); 144 | .br 145 | @@ -177,16 +177,16 @@ page interleaving (i.e., allocate in a r 146 | or a subset, of the nodes on the system), 147 | preferred node allocation (i.e., preferably allocate on a particular node), 148 | local allocation (i.e., allocate on the node on which 149 | -the thread is currently executing), 150 | +the task is currently executing), 151 | or allocation only on specific nodes (i.e., allocate on 152 | some subset of the available nodes). 153 | -It is also possible to bind threads to specific nodes. 154 | +It is also possible to bind tasks to specific nodes. 155 | 156 | -Numa memory allocation policy may be specified as a per-thread attribute, 157 | -that is inherited by children threads and processes, or as an attribute 158 | +Numa memory allocation policy may be specified as a per-task attribute, 159 | +that is inherited by children tasks and processes, or as an attribute 160 | of a range of process virtual address space. 161 | Numa memory policies specified for a range of virtual address space are 162 | -shared by all threads in the process. 163 | +shared by all tasks in the process. 164 | Further more, memory policies specified for a range of a shared memory 165 | attached using 166 | .I shmat(2) 167 | @@ -195,8 +195,8 @@ or 168 | from shmfs/hugetlbfs are shared by all processes that attach to that region. 169 | Memory policies for shared disk backed file mappings are currently ignored. 170 | 171 | -The default memory allocation policy for threads and all memory range 172 | -is local allocation. 173 | +The default memory allocation policy for tasks is local allocation. 174 | +The default policy for all memory ranges is the task's memory allocation policy. 175 | This assumes that no ancestor has installed a non-default policy. 176 | 177 | For setting a specific policy globally for all memory allocations 178 | @@ -282,7 +282,7 @@ numbers in /sys/devices/system/cpu. 
If t 179 | 180 | .BR numa_all_nodes_ptr 181 | points to a bitmask that is allocated by the library with bits 182 | -representing all nodes on which the calling thread may allocate memory. 183 | +representing all nodes on which the calling task may allocate memory. 184 | This set may be up to all nodes on the system, or up to the nodes in 185 | the current cpuset. 186 | The bitmask is allocated by a call to 187 | @@ -302,7 +302,7 @@ The user should not alter this bitmask. 188 | 189 | .BR numa_all_cpus_ptr 190 | points to a bitmask that is allocated by the library with bits 191 | -representing all cpus on which the calling thread may execute. 192 | +representing all cpus on which the calling task may execute. 193 | This set may be up to all cpus on the system, or up to the cpus in 194 | the current cpuset. 195 | The bitmask is allocated by a call to 196 | @@ -312,14 +312,14 @@ using size 197 | The set of cpus to record is derived from /proc/self/status, field 198 | "Cpus_allowed". The user should not alter this bitmask. 199 | 200 | -.BR numa_num_thread_cpus() 201 | -returns the number of cpus that the calling thread is allowed 202 | +.BR numa_num_task_cpus() 203 | +returns the number of cpus that the calling task is allowed 204 | to use. This count is derived from the map /proc/self/status, field 205 | "Cpus_allowed". Also see the bitmask 206 | .BR numa_all_cpus_ptr. 207 | 208 | -.BR numa_num_thread_nodes() 209 | -returns the number of nodes on which the calling thread is 210 | +.BR numa_num_task_nodes() 211 | +returns the number of nodes on which the calling task is 212 | allowed to allocate memory. This count is derived from the map 213 | /proc/self/status, field "Mems_allowed". 214 | Also see the bitmask 215 | @@ -346,9 +346,9 @@ The bit mask is allocated by 216 | The string is a comma-separated list of node numbers or node ranges. 217 | A leading ! 
can be used to indicate "not" this list (in other words, all 218 | nodes except this list), and a leading + can be used to indicate that the 219 | -node numbers in the list are relative to the thread's cpuset. The string can 220 | +node numbers in the list are relative to the task's cpuset. The string can 221 | be "all" to specify all ( 222 | -.BR numa_num_thread_nodes() 223 | +.BR numa_num_task_nodes() 224 | ) nodes. Node numbers are limited by the number in the system. See 225 | .BR numa_max_node() 226 | and 227 | @@ -367,12 +367,12 @@ The bit mask is allocated by 228 | The string is a comma-separated list of cpu numbers or cpu ranges. 229 | A leading ! can be used to indicate "not" this list (in other words, all 230 | cpus except this list), and a leading + can be used to indicate that the cpu 231 | -numbers in the list are relative to the thread's cpuset. The string can be 232 | +numbers in the list are relative to the task's cpuset. The string can be 233 | "all" to specify all ( 234 | -.BR numa_num_thread_cpus() 235 | +.BR numa_num_task_cpus() 236 | ) cpus. 237 | Cpu numbers are limited by the number in the system. See 238 | -.BR numa_num_thread_cpus() 239 | +.BR numa_num_task_cpus() 240 | and 241 | .BR numa_num_configured_cpus(). 242 | .br 243 | @@ -396,7 +396,7 @@ instead of 244 | This is useful on 32-bit architectures with large nodes. 245 | 246 | .BR numa_preferred () 247 | -returns the preferred node of the current thread. 248 | +returns the preferred node of the current task. 249 | This is the node on which the kernel preferably 250 | allocates memory, unless some other policy overrides this. 251 | .\" TODO: results are misleading for MPOL_PREFERRED and may 252 | @@ -406,7 +406,7 @@ allocates memory, unless some other poli 253 | .\" node. Need to tighten this up with the syscall results. 254 | 255 | .BR numa_set_preferred () 256 | -sets the preferred node for the current thread to 257 | +sets the preferred node for the current task to 258 | .IR node . 
259 | The system will attempt to allocate memory from the preferred node, 260 | but will fall back to other nodes if no memory is available on the 261 | @@ -418,12 +418,12 @@ calling 262 | .BR numa_set_localalloc (). 263 | 264 | .BR numa_get_interleave_mask () 265 | -returns the current interleave mask if the thread's memory allocation policy 266 | +returns the current interleave mask if the task's memory allocation policy 267 | is page interleaved. 268 | Otherwise, this function returns an empty mask. 269 | 270 | .BR numa_set_interleave_mask () 271 | -sets the memory interleave mask for the current thread to 272 | +sets the memory interleave mask for the current task to 273 | .IR nodemask . 274 | All new memory allocations 275 | are page interleaved over all nodes in the interleave mask. Interleaving 276 | @@ -434,7 +434,7 @@ page into the current address space. It 277 | will fall back to other nodes if no memory is available on the interleave 278 | target. 279 | .\" NOTE: the following is not really the case. this function sets the 280 | -.\" thread policy for all future allocations, including stack, bss, ... 281 | +.\" task policy for all future allocations, including stack, bss, ... 282 | .\" The functions specified in this sentence actually allocate a new memory 283 | .\" range [via mmap()]. This is quite a different thing. Suggest we drop 284 | .\" this. 285 | @@ -477,7 +477,7 @@ flag is true then the operation will cau 286 | pages in the mapping that do not follow the policy. 287 | 288 | .BR numa_bind () 289 | -binds the current thread and its children to the nodes 290 | +binds the current task and its children to the nodes 291 | specified in 292 | .IR nodemask . 293 | They will only run on the CPUs of the specified nodes and only be able to allocate 294 | @@ -488,7 +488,7 @@ This function is equivalent to calling 295 | .I numa_run_on_node_mask(nodemask) 296 | followed by 297 | .IR numa_set_membind(nodemask) . 
298 | -If threads should be bound to individual CPUs inside nodes 299 | +If tasks should be bound to individual CPUs inside nodes 300 | consider using 301 | .I numa_node_to_cpus 302 | and the 303 | @@ -496,15 +496,15 @@ and the 304 | syscall. 305 | 306 | .BR numa_set_localalloc () 307 | -sets the memory allocation policy for the calling thread to 308 | +sets the memory allocation policy for the calling task to 309 | local allocation. 310 | In this mode, the preferred node for memory allocation is 311 | -effectively the node where the thread is executing at the 312 | +effectively the node where the task is executing at the 313 | time of a page allocation. 314 | 315 | .BR numa_set_membind () 316 | sets the memory allocation mask. 317 | -The thread will only allocate memory from the nodes set in 318 | +The task will only allocate memory from the nodes set in 319 | .IR nodemask . 320 | Passing an empty 321 | .I nodemask 322 | @@ -610,7 +610,7 @@ The 323 | argument will be rounded up to a multiple of the system page size. 324 | 325 | .BR numa_run_on_node () 326 | -runs the current thread and its children 327 | +runs the current task and its children 328 | on a specific node. They will not migrate to CPUs of 329 | other nodes until the node affinity is reset with a new call to 330 | .BR numa_run_on_node_mask (). 331 | @@ -621,7 +621,7 @@ On success, 0 is returned; on error \-1 332 | is set to indicate the error. 333 | 334 | .BR numa_run_on_node_mask () 335 | -runs the current thread and its children only on nodes specified in 336 | +runs the current task and its children only on nodes specified in 337 | .IR nodemask . 338 | They will not migrate to CPUs of 339 | other nodes until the node affinity is reset with a new call to 340 | @@ -636,7 +636,7 @@ On success, 0 is returned; on error \-1 341 | is set to indicate the error. 342 | 343 | .BR numa_get_run_node_mask () 344 | -returns the mask of nodes that the current thread is allowed to run on. 
345 | +returns the mask of nodes that the current task is allowed to run on. 346 | 347 | .BR numa_tonode_memory () 348 | put memory on a specific node. The constraints described for 349 | @@ -698,7 +698,7 @@ alternative to repeated calls to the get 350 | See getpagesize(2). 351 | 352 | .BR numa_sched_getaffinity() 353 | -retrieves a bitmask of the cpus on which a thread may run. The thread is 354 | +retrieves a bitmask of the cpus on which a task may run. The task is 355 | specified by 356 | .I pid. 357 | Returns the return value of the sched_getaffinity 358 | @@ -710,9 +710,9 @@ Test the bits in the mask by calling 359 | .BR numa_bitmask_isbitset(). 360 | 361 | .BR numa_sched_setaffinity() 362 | -sets a thread's allowed cpu's to those cpu's specified in 363 | +sets a task's allowed cpu's to those cpu's specified in 364 | .I mask. 365 | -The thread is specified by 366 | +The task is specified by 367 | .I pid. 368 | Returns the return value of the sched_setaffinity system call. 369 | See sched_setaffinity(2). You may allocate the bitmask with 370 | @@ -866,7 +866,7 @@ executing or current process. 371 | It simply uses the move_pages system call. 372 | .br 373 | .I pid 374 | -- ID of thread. If not valid, use the current thread. 375 | +- ID of task. If not valid, use the current task. 376 | .br 377 | .I count 378 | - Number of pages. 379 | @@ -887,7 +887,7 @@ See move_pages(2). 380 | 381 | .BR numa_migrate_pages() 382 | simply uses the migrate_pages system call to cause the pages of the calling 383 | -thread, or a specified thread, to be migated from one set of nodes to another. 384 | +task, or a specified task, to be migated from one set of nodes to another. 385 | See migrate_pages(2). 386 | The bit masks representing the nodes should be allocated with 387 | .BR numa_allocate_nodemask() 388 | @@ -897,7 +897,7 @@ using an 389 | .I n 390 | value returned from 391 | .BR numa_num_possible_nodes(). 
392 | -A thread's current node set can be gotten by calling 393 | +A task's current node set can be gotten by calling 394 | .BR numa_get_membind(). 395 | Bits in the 396 | .I tonodes 397 | @@ -971,6 +971,17 @@ and 398 | .I numa_exit_on_error 399 | are process global. The other calls are thread safe. 400 | 401 | +.SH NOTES 402 | +Linux scheduling affinity and NUMA memory policies apply to Linux kernel tasks. 403 | +For a thread model that maps each user space thread directly onto a kernel task, 404 | +such as the Pthreads PTHREAD_SCOPE_SYSTEM model, 405 | +the terms thread and task are synonymous. 406 | +For a thread model that maps multiple user space threads onto one or more kernel tasks, 407 | +such as the Pthreads PTHREAD_SCOPE_PROCESS model, 408 | +the terms are not necessarily synonymous. 409 | +To make the function descriptions above true for either model, 410 | +this document uses the term "task" throughout. 411 | + 412 | .SH COPYRIGHT 413 | Copyright 2002, 2004, 2007, 2008 Andi Kleen, SuSE Labs. 414 | .I libnuma 415 | @@ -984,7 +995,8 @@ is under the GNU Lesser General Public L 416 | .BR mmap (2), 417 | .BR shmat (2), 418 | .BR numactl (8), 419 | -.BR sched_getaffinity (2) 420 | -.BR sched_setaffinity (2) 421 | -.BR move_pages (2) 422 | -.BR migrate_pages (2) 423 | +.BR sched_getaffinity (2), 424 | +.BR sched_setaffinity (2), 425 | +.BR move_pages (2), 426 | +.BR migrate_pages (2), 427 | +.BR pthread_attr_setscope (3) 428 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for linux "page fault test" 2 | # 3 | # 4 | 5 | # until this shows up in distro headers 6 | # pft will check whether or not kernel supports it. 7 | RUSAGE_THREAD ?= -DUSE_RUSAGE_THREAD 8 | 9 | # Requires kernel patch to work.
10 | NOCLEAR ?= -UUSE_NOCLEAR 11 | 12 | #-------------------------------------------- 13 | SHELL = /bin/sh 14 | 15 | MACH = 16 | 17 | CMODE = -std=gnu99 18 | COPT = $(CMODE) -pthread -O3 #-non_shared 19 | DEFS = -D_GNU_SOURCE $(RUSAGE_THREAD) $(NOCLEAR) 20 | INCLS = #-I 21 | CFLAGS = $(COPT) $(DEFS) $(INCLS) $(ECFLAGS) 22 | 23 | LDOPTS = #-dnon_shared 24 | # comment out '-lnuma' for platforms w/o libnuma -- laptops? 25 | LDLIBS = -lpthread -lrt -lnuma 26 | LDFLAGS = $(CMODE) $(LDOPTS) $(ELDFLAGS) 27 | 28 | HDRS = 29 | 30 | OBJS = pft.o 31 | 32 | EXTRAHDRS = 33 | 34 | # Include 'numa_stubs.o' for platforms w/o libnuma -- laptops? 35 | EXTRAOBJS = /usr/include/numa.h /usr/include/numaif.h 36 | 37 | PROGS = pft 38 | 39 | PROJ = pft 40 | 41 | #--------------------------------- 42 | 43 | all: $(PROGS) 44 | 45 | pft: $(OBJS) $(EXTRAOBJS) 46 | $(CC) -o $@ $(LDFLAGS) $(OBJS) $(EXTRAOBJS) $(LDLIBS) 47 | 48 | $(OBJS): $(HDRS) 49 | 50 | # extra dependencies to generate errors if headers are missing 51 | pft.o: 52 | 53 | install: 54 | @echo "install not implemented" 55 | 56 | clean: 57 | -rm -f *.o core.[0-9]* Log* 58 | 59 | clobber: clean 60 | -rm -f $(PROGS) cscope.* 61 | 62 | # ------------------------------------------------ 63 | # N.B., renames current directory to new version name! 64 | # [and, yes, this is really ugly...] 65 | VERSION=$$(cat $$_WD/version.h|grep _VERSION|sed 's/^.* "\([0-9.a-z+-]*\)".*$$/\1/') 66 | tarball: clobber 67 | @chmod --recursive u+r .pc; \ 68 | _WD=`pwd`; _WD=`basename $$_WD`; cd ..;\ 69 | _version=$(VERSION); _tarball=$(PROJ)-$${_version}.tar.gz; \ 70 | _newWD=`echo $$_WD | sed s:-.*:-$$_version:`; \ 71 | if [ "$$_WD" != "$$_newWD" ] ; then \ 72 | echo "Renaming '.' [$$_WD/] to $$_newWD/"; \ 73 | mv $$_WD $$_newWD; \ 74 | fi ; \ 75 | tar czf - $$_newWD >$$_tarball; \ 76 | if [ $$? 
-eq 0 ]; then \ 77 | echo "tarball at ../$$_tarball"; \ 78 | else \ 79 | echo "Error making tarball"; \ 80 | fi 81 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | The "overall operation" section is way out of date, sorry. 2 | Especially since restoration of multi-process version. 3 | Update is on the TODO list. Version history below is more or 4 | less up to date. 5 | 6 | "pft" : Page Fault Test 7 | 8 | Originally by Christoph Lameter. 9 | 10 | Modified by Lee Schermerhorn [v0.2+] to measure relative overhead 11 | of memory policy changes: reference counting, ... 12 | 13 | See "usage" in program source for command line options. 14 | 15 | Overall operation: 16 | 17 | 1) parse options, calculate memory sizes, create "communications" area 18 | in shared memory. 19 | 20 | 2) fork a child to run test "launch()" function. Originally, we did this 21 | so we could use RUSAGE_CHILDREN to obtain accumulated statistics -- 22 | faults, ... However, this included cpu time and faults from code other 23 | than the test loops that we're interested in. We've changed how we 24 | capture the wall clock time and the rusage, but we still launch() the 25 | test workers [threads or processes] from a child process. 26 | Wait for the child to exit. 27 | 28 | 3) launch() function [in child]: 29 | a) creates test region of specified size and type [anon vs shm]; 30 | optionally "mbind()s" test region to install a vma/shared mempolicy. 31 | However, launch() does NOT touch the region, so as not to fault in 32 | any pages. 33 | Note: if supported by kernel [requires patch] and enabled at build 34 | time [e.g., NOCLEAR=-DUSE_NOCLEAR on make command line], 35 | '-Z' option will cause launch() to mark region as "noclear", 36 | eliminating the page zeroing overhead from the test. 
37 | b) sets launch() process' scheduler policy to SCHED_FIFO and its 38 | priority to 1 greater than will be used for worker threads/processes. 39 | c) creates/starts specified number of threads to run the "test()" 40 | function and "waits" for all threads to become "READY". Threads 41 | run SCHED_FIFO at a priority below the launch() thread. The objective 42 | is to allow a worker to run on each cpu. The launch() process/thread 43 | runs as "worker 0". 44 | d) gives all threads the "go ahead"; then runs the test loop as worker 0. 45 | e) When the test loop completes, loops waiting to reap all 46 | workers. As each worker terminates, selects the start time [and rusage] 47 | from the worker that started earliest, and the end time [and rusage] 48 | from the worker that finished last. 49 | Note: for kernels that support RUSAGE_THREAD, pft_mpol uses that to 50 | fetch each worker's rusage and sums the faults and cpu times 51 | for all workers. 52 | f) returns/exits 53 | 54 | 4) test threads: 55 | a) wait for "go ahead" from launch(). 56 | b) snap thread start time and start rusage into this thread's thread_info. 57 | c) The measured test loop: either bzero() the thread's respective test 58 | region or touch the specified number of cachelines per page therein. 59 | d) snap the thread's end time and end rusage into this thread's thread_info. 60 | e) sleep for specified delay -- default 2 sec. 61 | f) indicates 'DONE' in the per thread info area. 62 | g) exits. 63 | 64 | 5) main program returns from wait() when child/launch() exits, and 65 | computes/emits results. 66 | 67 | ---------------------------- 68 | 69 | Helper scripts: 70 | 71 | See Scripts/README 72 | 73 | ------------------------------- 74 | 75 | Version history: 76 | 77 | 0.01 - first version of pft_mpol for testing mempolicy fault/allocation 78 | overhead. 
79 | 80 | 0.02 - added support for agr [xmgrace] "tag" 81 | 82 | 0.02a - enhancement to helper scripts to support multiple runs per thread 83 | count and added usage string to pft_mpol. General "cleanup". 84 | 85 | 0.03 - use mmap(MAP_ANONYMOUS) for anon test memory and parent/child comm area. 86 | Added pft_mmap(), valloc_{private|shared}(), ... for this purpose. 87 | 88 | Add support for "affinitizing" pft to cpus in hopes of obtaining more 89 | repeatable results. 90 | 91 | Factor out test memory allocation and thread creation from pft results: 92 | snapshot launch() rusage just before giving threads the "go ahead", 93 | subtract this usage from final RUSAGE_CHILDREN. 94 | 95 | Added helper functions to compute elapsed and cpu times as doubles to 96 | "improve readability" of final results reporting. 97 | 98 | 0.04 - In a further attempt to obtain repeatable and accurate results, changed 99 | to capture the end rusage in each of the test threads themselves--just 100 | after the test loops. 101 | 102 | Snap the start time and start rusage of each thread into thread_info 103 | after getting the "go ahead" from the parent, before the page fault 104 | test loop. Snap the end time and rusage after the test loop. 105 | In the parent, as threads join, select the start time and start rusage 106 | from the thread with the earliest start time, and the end time and end 107 | rusage from the thread with the latest end wall clock time. 108 | 109 | Compute difference between selected end and start rusage for determining 110 | the cpu time used and the number of faults in the test loop. 111 | 112 | 0.05 - Add option to set scheduler policy of test threads, including the 113 | launch thread, to SCHED_FIFO. Launch thread will run at one RT 114 | (SCHED_FIFO) priority higher to maintain control while starting threads, 115 | if nr_threads > nr_online_cpus. 116 | 117 | N.B., This is fragile. And, SCHED_FIFO seems to introduce a wall clock time 118 | delay into the tests. 
Under investigation. 119 | 120 | Run "thread 0" test in launch() thread, after giving other threads 121 | the go-ahead. 122 | 123 | 0.06 - Add '-L' [SHM_LOCK] and '-l' [mlock] options to test fault rate for 124 | noreclaim/mlock patches. ['-L' also in 0.05?] 125 | 126 | 0.07 - slight mods to emitted TAG and wrapper scripts to ease parsing for 127 | new pft_plot script. 128 | 129 | 0.08 - start adding back multi-process version to avoid mmap_sem on high cpu 130 | counts. Use RUSAGE_THREAD if available. 131 | 132 | 0.09 - added support for "cpus_allowed" constraint on where tests run. pft will 133 | read /proc/self/status, if possible, and extract Cpus_allowed. If the 134 | numbers of cpus allowed is < the number on-line, pft will use the allowed 135 | cpus. Otherwise, it will use cpus 0..nr_cpus_online-1. Allows taskset, 136 | numactl or cpuset constraint on pft cpu usage. 137 | 138 | 0.10 - added support for '-Z' flag to request kernel NOT to clear pages on 139 | allocations, as clear_page() tends to dominate the profiles and time 140 | to fault in a new page. This will, one hopes, give us better visibility 141 | into the behavior of the allocation path, separate from clear_page() 142 | which should be fairly constant release-to-release for a given 143 | platform. This feature requires a kernel patch. See ./Kernel/* 144 | Uses a mbind() flag--MPOL_MF_NOCLEAR--to request no clearing of a 145 | range of memory. Actually just sets "noclear" for the vma that 146 | intersects the mbind() 'start' address, if any. 147 | 148 | 0.11 - use 'cpus_allowed' handling from aim/multitask. Newer version. 149 | + option to use /dev/zero mapping instead of MAP_ANONYMOUS. 
150 | 151 | add '-M' [multimap] support -- mmap() separate anon regions for each test 152 | to eliminate anon_vma sharing [WORK IN PROGRESS -- i.e, not quite working] 153 | 154 | 0.12 - rework cpus_allowed to use more libnuma support instead of parsing 155 | /proc//status 156 | N.B., uses 'numa_num_task_cpus()' which is broken in libnuma-2.0.3. 157 | Requires 2.0.4 or patched libnuma. See Libnuma/* 158 | tweak various scripts: pft_plot.py [no legend option], pft_per_node, 159 | pft-task_thread => pft_task_thread. 160 | -------------------------------------------------------------------------------- /Scripts/README: -------------------------------------------------------------------------------- 1 | Ad hoc pft automation scripts 2 | 3 | runpft - bash script to run page fault test with specified options 4 | a specified number of times [default 1] varying the task/thread 5 | count from 1 to the number of cpus allowed in /proc//status 6 | "Cpus_allowed:" mask. 7 | 8 | The following scripts use runpft to run test from 1 to number of 9 | allowed cpus: 10 | 11 | pft-multirun - old script; invoke runpft with varying 12 | options: anon+{default|w/ mpol}|shmem+{default|w/ mempol} 13 | Emits "old style" annotations. Not for use with 14 | pft_plot.py 15 | pft_task_thread - run task vs thread scalability test. 16 | Emits pft_plot.py style markup. 17 | 18 | pft_per_node - invoke runpft on each node w/ cpus 19 | restricted to that node using: 20 | "numactl --cpunodebind- runpft ...". 21 | Test per node scalability. Emits pft_plot.py 22 | style markup. 23 | 24 | The following scripts have their own internal version of runpft 25 | to vary task count while gathering stats, profile. They ignore 26 | Cpus_allowed and attempt to run tests on all cpus in the system. 27 | These scripts emit pft_plot.py 'LEGEND' markup, but require 28 | manual editing to add PLOTID and [SUB]TITLEs. 29 | 30 | pft_lockstats - requires debug kernel that supports 31 | lockstats. 
32 | pft_profile - uses built-in kernel profiler. Clears 33 | and stores profile for each load point. 34 | Requires the kernel to be booted with "profile=N". 35 | pft_vmstats - collects vmstat output during each load 36 | point. 37 | 38 | pft_plot.py - python/matplotlib script to process output of the 39 | scripts above and produce a /.png' plot. See help text. 40 | Always a "work in progress". 41 | 42 | rmshm - hack to remove sysv shm segments should pft leave 43 | some behind. Quite indiscriminate. 44 | 45 | fmt_vmstats - used by pft_vmstats to pretty up the vmstat output 46 | for systems whose numbers exceed vmstat(1)'s field widths. 47 | Unfortunately, this output doesn't play nice with vmstat_plot.py 48 | [separate tool]. 49 | -------------------------------------------------------------------------------- /Scripts/fmt_vmstats: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | ## fmt_vmstats -- reformat vmstats output for large systems to align 19 | ## numeric data with headers. 20 | 21 | ## Usage: fmt_vmstats 22 | # e.g., fmt_vmstats -a -S M 10 23 | # doesn't work for all possible vmstat args. 
24 | 25 | # Note: this needs at least 118 columns of screen to display all fields on one line 26 | 27 | # At some point, we may need to increase the field widths--again: 28 | 29 | # 1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678 30 | hdr1="---procs---|----------------memory---------------|----swap-----|--------io-------|------system-----|-------cpu--------\n" 31 | hdr2="_run_ _blk_|swpd ___free___ ___buff___ ___cache__|__si__ __so__|___bi___ ___bo___|___in___ ___cs___|usr sys idl wai st\n" 32 | hdr2a="_run_ _blk_|swpd ___free___ ___inact__ __active__|__si__ __so__|___bi___ ___bo___|___in___ ___cs___|usr sys idl wai st\n" 33 | fmt="%5d %5d|%4d %10d %10d %10d|%6d %6d|%8d %8d|%8d %8d|%3d %3d %3d %3d %2d %s\n" 34 | 35 | 36 | declare -i irun= 37 | declare VMSTAT=/usr/bin/vmstat 38 | 39 | 40 | if [[ $# -eq 0 ]]; then 41 | echo -e "Usage: fmt_vmstats \n\nRun vmstat with and format the output.\n" 42 | exit 1 43 | fi 44 | 45 | # TODO: add optional timestamps? 46 | $VMSTAT "$@" | 47 | while read run blk swpd free buff cache si so bi bo in cs us sy id wa st rest 48 | do 49 | if [[ "$run" = "procs" ]]; then 50 | printf -- "$hdr1" 51 | continue 52 | fi 53 | if [[ "$run" = "r" ]]; then 54 | if [[ "$buff" = "buff" ]]; then 55 | printf "$hdr2" 56 | elif [[ "$buff" = "inact" ]]; then 57 | printf "$hdr2a" 58 | else 59 | echo "Sorry. Don't recognize vmstat output format." 
>&2 60 | exit 1 61 | fi 62 | continue 63 | fi 64 | irun=$run 65 | if [[ "$irun" != "$run" ]]; then 66 | # echo non-numeric lines 67 | echo "$run $blk $swpd $free $buff $cache $si $so $bi $bo $in $cs $us $sy $id $wa $st $rest" 68 | continue 69 | fi 70 | printf "$fmt" $run $blk $swpd $free $buff $cache $si $so $bi $bo $in $cs $us $sy $id $wa $st "$rest" 71 | 72 | done 73 | -------------------------------------------------------------------------------- /Scripts/pft-multirun: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | ## pft-multirunX -- invoke runpft with varying args 19 | 20 | # for plot annotations 21 | PLATFORM=$(dmidecode -s system-product-name) 22 | if [[ -n "$PLATFORM" ]]; then 23 | PLATFORM=$( echo "$PLATFORM" | awk '{print $1"-"$2$3;}') 24 | else 25 | PLATFORM=Unknown 26 | fi 27 | 28 | # better way to determine nr_cpus? 29 | nr_cpus=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 30 | nr_nodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) 31 | 32 | # TESTING=true 33 | TESTING=false 34 | export NOEXEC=$TESTING VERBOSE=$TESTING 35 | 36 | TIMESTAMP="$(date +%y%m%d-%H%M%S)" 37 | 38 | # "runpft" will run pft_mpol for nr_threads = 1 .. 
nr_cpus, 39 | # multiple times for each thread count for averaging, ... 40 | RUNPFT="/usr/local/bin/runpft" 41 | 42 | # use os revision for pft "tag" 43 | osrev=$(uname -r) 44 | #osrev=${osrev#2.6.} # drop the 2.6. 45 | 46 | # Memory Size -- if not set, calc from GB_PER_TASK * nr_tasks) 47 | #MEMSIZE="-m 4g" 48 | MEMSIZE="-m 0" # required to use GB_PER_TASK 49 | GB_PER_TASK="-g 8" 50 | 51 | # Number of Runs Per Thread count: 52 | NRPT="-N 4" 53 | 54 | # Affinitize -- bind to cpus 55 | BIND="-af" 56 | 57 | # Nodebind: affinitize job to specified nodes 58 | #NODEBIND= 59 | NODEBIND="numactl --cpunodebind 1" 60 | 61 | outprefix=pft_allnodes-$PLATFORM-$osrev-$TIMESTAMP 62 | 63 | STOP_FILE=./pft_stop_file 64 | 65 | { 66 | # pft_mpol build version/timestamp: 67 | pft_mpol -V 68 | echo "PLOTID pft_multirun-plotid" # edit as needed 69 | echo "TITLE Single Node Scalability - ${GB_PER_TASK##* }/task" 70 | echo "SUBTITLE $PLATFORM $osrev" 71 | 72 | 73 | echo "LEGEND no additional patches " 74 | rm -f $STOP_FILE 75 | # anon mem, sys default mpol, touch to fault 76 | $NODEBIND $RUNPFT $NRPT $MEMSIZE $GB_PER_TASK $BIND "$osrev" 77 | ## anon mem, explicit local mpol, touch to fault 78 | #$NODEBIND $RUNPFT $NRPT $MEMSIZE $GB_PER_TASK $BIND -p "$osrev" 79 | ## shmem, sys default mpol, touch to fault 80 | #$NODEBIND $RUNPFT $NRPT $MEMSIZE $GB_PER_TASK $BIND -S "$osrev" 81 | ## shmem, explicit local mpol, touch to fault 82 | #$NODEBIND $RUNPFT $NRPT $MEMSIZE $GB_PER_TASK $BIND -S -p "$osrev" 83 | } > ${outprefix}.pft 2>&1 84 | -------------------------------------------------------------------------------- /Scripts/pft_lockstats: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 
4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | # for plot annotations 19 | PLATFORM=$(dmidecode -s system-product-name ) 20 | if [[ -n "$PLATFORM" ]]; then 21 | PLATFORM=$( echo "$PLATFORM" | awk '{print $1"-"$2$3;}') 22 | else 23 | PLATFORM=Unknown 24 | fi 25 | 26 | # better way to determine nr_cpus? 27 | nr_cpus=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 28 | nr_nodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) 29 | 30 | # os revision for pft "tag" 31 | osrev=$(uname -r) 32 | osrev=${osrev#2.6.} # drop the 2.6. 33 | 34 | # Memory Size (if not set, calc from GB_PER_TASK * nr_tasks) 35 | #MEMSIZE="-m 4g" 36 | MEMSIZE= 37 | GB_PER_TASK=8 38 | 39 | # Affinitize -- bind to cpus; use SCHED_FIFO 40 | BIND="-af" 41 | 42 | # Test type: -n for tasks, nothing for threads 43 | TEST_TYPE= #-n 44 | 45 | TIMESTAMP="$(date +%y%m%d-%H%M%S)" 46 | 47 | reset_lockstats() 48 | { 49 | echo 0 >/proc/lock_stat 50 | } 51 | 52 | enable_lockstats() 53 | { 54 | echo 1 >/sys/module/lockdep/parameters/lock_stat 55 | } 56 | 57 | disable_lockstats() 58 | { 59 | echo 0 >/sys/module/lockdep/parameters/lock_stat 60 | } 61 | 62 | 63 | outprefix=pft-$PLATFORM-$osrev-$TIMESTAMP 64 | outfile=${outprefix}.pgfaults 65 | 66 | STOP_FILE=./pft_stop_file 67 | 68 | # quick and dirty internal version. 69 | # run pft varying task count from 1 .. 
nr_cpus 70 | runpft() 71 | { 72 | local memtype=$1 73 | local tag=MAP_ANON 74 | [[ "$memtype" != "-Z" ]] || tag=dev_zero 75 | local memsize 76 | 77 | rm -f $STOP_FILE 78 | reset_lockstats 79 | enable_lockstats 80 | title=-T 81 | echo "LEGEND $osrev $tag" >> ${outprefix}.pgfaults 82 | echo "$osrev $tag - 1..$nr_cpus tasks" >>${outprefix}.lockstats 83 | for nr_tasks in $(seq 1 $nr_cpus); 84 | do 85 | memsize=$MEMSIZE 86 | [[ -n "$memsize" ]] || memsize="-m $(( GB_PER_TASK * nr_tasks ))g" 87 | pft $BIND $memsize $memtype -n $nr_tasks $title 88 | title= 89 | if [[ -f $STOP_FILE ]]; then 90 | echo "Saw 'stop file'" >&2 91 | break 92 | fi 93 | done >> ${outprefix}.pgfaults 94 | echo >> ${outprefix}.pgfaults 95 | disable_lockstats 96 | cat /proc/lock_stat >>${outprefix}.lockstats 97 | } 98 | 99 | # ===================================================================== 100 | #main() 101 | 102 | 103 | # page faults using /dev/zero 104 | runpft -Z 105 | 106 | # page faults using MAP_ANON 107 | runpft 108 | 109 | -------------------------------------------------------------------------------- /Scripts/pft_per_node: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 
17 | 18 | ## pft_per_node: cpunodebind pft to each node in the system and 19 | ## invoke runpft to run for 1..nr_cpus [allowed] 20 | 21 | # for plot annotations 22 | PLATFORM=$(dmidecode -s system-product-name ) 23 | if [[ -n "$PLATFORM" ]]; then 24 | PLATFORM=$( echo "$PLATFORM" | awk '{print $1"-"$2$3;}') 25 | else 26 | PLATFORM=Unknown 27 | fi 28 | 29 | # better way to determine nr_cpus? 30 | nr_cpus=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 31 | nr_nodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) 32 | 33 | # os revision for pft "tag" 34 | osrev=$(uname -r) 35 | #osrev=${osrev#2.6.} # drop the 2.6. 36 | 37 | # Memory Size -- if not set, calc from GB_PER_TASK * nr_tasks) 38 | #MEMSIZE="-m 4g" 39 | MEMSIZE="-m 0" 40 | GB_PER_TASK="-g 2" 41 | 42 | # Number of Runs Per Thread count: 43 | NRPT="-N 4" 44 | 45 | # Affinitize -- bind to cpus; use SCHED_FIFO 46 | BIND="-af" 47 | 48 | # Test type: -n for tasks, nothing for threads 49 | TEST_TYPE= #-n 50 | 51 | case $TEST_TYPE in 52 | -n) task_thread="task" 53 | Task_thread="Task" 54 | ;; 55 | *) task_thread="thread" 56 | Task_thread="Thread" 57 | ;; 58 | esac 59 | 60 | TIMESTAMP="$(date +%y%m%d-%H%M%S)" 61 | 62 | RUNPFT="/usr/local/bin/runpft" 63 | 64 | STOP_FILE=./pft_stop_file 65 | 66 | outprefix=pft_per_node-$PLATFORM-$osrev-$task_thread-$TIMESTAMP 67 | 68 | # ===================================================================== 69 | 70 | main() 71 | { 72 | 73 | echo "PLOTID dl785c_per_node_${task_thread}_pft" 74 | echo "TITLE Per Node $Task_thread Scalability" 75 | echo "SUBTITLE $PLATFORM $osrev" 76 | 77 | rm -f $STOP_FILE 78 | for nid in {0..7}; do 79 | if [[ -f $STOP_FILE ]]; then 80 | echo "Saw 'stop file before node $nid'" >&2 81 | break 82 | fi 83 | # echo "LEGEND DL785 node $nid" 84 | echo "LEGEND $osrev node $nid $task_thread" 85 | # anon mem, sys default mpol, touch to fault 86 | # pass dummy 'tag' to generate pft header 87 | _cmd="numactl --cpunodebind=$nid $RUNPFT $TEST_TYPE $NRPT $MEMSIZE 
$GB_PER_TASK $BIND tag" 88 | echo "$_cmd" >&2 89 | eval "$_cmd" 90 | done 91 | 92 | } 93 | 94 | # ===================================================================== 95 | main "$@" >${outprefix}.pft 96 | -------------------------------------------------------------------------------- /Scripts/pft_plot.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | # 18 | # Work in Progress: convert 'ptf to agr data' to "pft_plot" using matplotlib 19 | # TODO: add 'tabulate' option to emit tabular [averaged] data 20 | #----------------------------------------------------------------------------- 21 | """ 22 | pft_plot.py => plot pft [page fault test] data from multiple pft runs using 23 | matplotlib. 24 | 25 | runpft script runs from 1 to nr_cpus-1 threads. Possibly multiple runs per 26 | thread count. We average these. 27 | TODO: compute std dev'n for error bars? 28 | 29 | input file[s] may contain multiple runpft runs--e.g., with different 30 | parameters--e.g., anon vs shmem, mlock vs touch, ... -- possibly from 31 | multiple kernels--e.g., with and without patches applied. 
32 | 33 | Plot faults/cpu-sec and faults/wall-sec for all data sets with same set of 34 | pft parameters--e.g., on multiple kernels--on same graph for comparison. 35 | """ 36 | #----------------------------------------------------------------------------- 37 | 38 | s_opts = ':hdl:p:v' 39 | l_opts = ['help', 'debug', 'legend_loc=', 'plot=', 'verbose' ] 40 | 41 | USAGE=\ 42 | """ 43 | Usage: pft_plot [-hv] [-l ] [-p ] ... 44 | Where: 45 | \t-h = Help! show usage. 46 | \t-v = Verbose - show what's happening\n 47 | \t-l - specify legend location; default = 'best' 48 | \t text locs: 'best' or [{upper|lower|center}-]{right|left} 49 | \t or: [{upper|lower}-]center -- e.g., 'upper-left' 50 | \t or: abbreviations for above: l, r, c, ul, lc, ... 51 | \t coordinates: , as floating point values 0.0..1.0 52 | \t 'none' to omit legend--e.g., when it obscures the plot. 53 | \t-p {[fpcs],[fpws]|[wall],[cpu]} - plot selected pft results [see 54 | \t header, below]. default: faults per both cpu & wall secs. 55 | \t Only one may be selected, except for wallclock and cpu times. 56 | \t ... is a list of zero or more files to convert. 57 | \t Default is standard input.\n 58 | Plot pft [page fault test -- mempolicy version] results. 59 | Input Format -- one or more of:\n 60 | PLOTID - arbitrary id to group data sets 61 | TITLE -- one per , last one wins. 62 | SUBTITLE -- one per , last one wins. 63 | LEGEND -- one per data set. identifies data set and curve 64 | on the plot. legends must be different or later occurrences in 65 | the input stream will overwrite earlier data set[s] with same 66 | legend. NOTE: a 'LEGEND' tag is required to indicate the 67 | start of results for an PFT run. pft_plot won't start looking 68 | for the data header until it sees a 'LEGEND'. 
69 | other -- ignored 70 | Gb Thr CLine User System Wall flt/cpu/s fault/wsec 71 | N TT N N.NNs NN.NNs NN.NNs CCCCC.ccc WWWWWW.www 72 | 73 | Faults/{cpu|wallclock}-seconds or cpu and wall clock time--which ever is 74 | selected--will be averaged for successive lines with same thread count [TT]. 75 | The faults/{cpu|wall}-sec or cpu+wall clock time for all LEGENDs with the 76 | same PLOTID will be plotted on the same graph. This will allow comparison 77 | of fault rate [inversely proportional to allocation overhead], e.g., for 78 | patched and unpatched kernels. 79 | """ 80 | 81 | #----------------------------------------------------------------------------- 82 | 83 | import getopt, os, sys, inspect 84 | from math import log10 85 | 86 | # select the Anti-Grain Geometry engine backend for 'png' output 87 | # Must do this before importing pyplot, ... 88 | import matplotlib 89 | matplotlib.use('Agg') 90 | import matplotlib.pyplot as plt 91 | 92 | #----------------------------------------------------------------------------- 93 | # command line options: 94 | 95 | verbose = False 96 | debug = False 97 | 98 | #============================================================================= 99 | 100 | def vprint(msg): 101 | global verbose 102 | if bool(verbose): 103 | print >>sys.stderr, msg 104 | sys.stderr.flush() 105 | 106 | #----------------------------------------------------------------------------- 107 | def usage(ret, msg): 108 | """ 109 | emit @msg, if specified, then 110 | emit program usage and exit with @ret return code 111 | """ 112 | if msg: 113 | print >>sys.stderr, msg 114 | print >>sys.stderr, USAGE 115 | sys.stderr.flush() 116 | sys.exit(ret) 117 | 118 | #----------------------------------------------------------------------------- 119 | def warning(msg): 120 | print >>sys.stderr,'WARNING: ' + msg 121 | sys.stderr.flush() 122 | 123 | #----------------------------------------------------------------------------- 124 | def die(ret, msg): 125 | """ 126 | emit 
FATAL @msg and exit with @ret return code 127 | """ 128 | print >>sys.stderr,'FATAL: ' + msg 129 | sys.stderr.flush() 130 | sys.exit(ret) 131 | 132 | #----------------------------------------------------------------------------- 133 | # usage: not_implemented_yet(inspect.currentframe()) 134 | def not_implemented_yet(frame): 135 | vprint((inspect.getframeinfo(frame)[2]) + " - not implemented yet") 136 | 137 | #============================================================================= 138 | 139 | # per plot data sets for PFT: 140 | faults_per_cpu_sec = [] 141 | faults_per_wall_sec = [] 142 | user_cpu_sec = [] 143 | system_cpu_sec = [] 144 | real_time_sec = [] 145 | 146 | # default 147 | plot_items = ["fpcs", "fpws"] 148 | 149 | the_plots = {} # mapping of test/plot id [pft test parameters] to data_set[s] 150 | # each item in list is a dict: {'': []} 151 | threads = [] # x-axis values: 1, 2, ... 152 | 153 | # plot annotations: 154 | plot_annotations = ("plot_title", "plot_subtitle", "y_label") 155 | plot_title = None 156 | plot_subtitle = None 157 | 158 | # for generating multiple plots from same data set/same plotid 159 | # will add non-default 'plot items' to file name. 
160 | plot_file_suffix="" 161 | 162 | y_axis_labels = { 163 | 'fpcs': "faults/second", 164 | 'fpws': "faults/second", 165 | 'cpu' : "seconds", 166 | 'wall' : "seconds" 167 | } 168 | 169 | # legend location: 170 | # {best|[{upper|lower|center} ]{right|left}|{upper|lower} ]center} 171 | # or (float(x),float(y)) 172 | legend_loc = 'best' 173 | 174 | nr_runs = None 175 | 176 | #============================================================================= 177 | # plot the results: 178 | 179 | line_styles = [ '-', '--', '-.', ':' ] 180 | line_colors = [ 'k', 'r', 'g', 'b' ] 181 | 182 | fwidth = 9.0 183 | fheight = 6.0 184 | 185 | roundup = lambda v, s: s * round(( v + (s-1)) / s) 186 | 187 | #----------------------------------------------------------------------------- 188 | def dump_curves(curves): 189 | if debug: 190 | vprint("curves:\n " + str(curves)) 191 | 192 | #----------------------------------------------------------------------------- 193 | def emit_a_plot(plotid, a_plot): 194 | global plot_file_suffix 195 | 196 | # top level Figure, containing all [sub]plots 197 | fig = plt.figure(figsize=(fwidth,fheight)) 198 | 199 | # single subplot to contain the curves for the pft data 200 | ax = fig.add_subplot(111, autoscale_on=False) 201 | 202 | # Add figure text for "title" 203 | title = fig.text(.5, .95, 204 | 'Linux Page Fault Test: ' + a_plot['plot_title'], 205 | horizontalalignment='center') 206 | title.set_fontsize('large') 207 | 208 | # Use subplot title as subtitle 209 | axtitle = ax.set_title(a_plot['plot_subtitle']) 210 | axtitle.set_fontsize('medium') 211 | 212 | # no grid [default?] 
213 | ax.grid(False) 214 | 215 | # axes labels 216 | ax.set_xlabel("Number of tasks or threads") 217 | # y axis label specified by the plot 218 | ax.set_ylabel(a_plot['y_label']) 219 | 220 | # plot the curves in a_plot 221 | 222 | curves = [] 223 | legends = [] 224 | iline = 0 225 | x_max = 0 226 | y_max = 0 227 | 228 | # to get all of the curves on a single plot, I *think* I need to pass them 229 | # via a single call to plot(). So, construct the argument list and pass 230 | # [below] using the '*args' syntax. 231 | for legend, data in a_plot.iteritems(): 232 | if legend not in plot_annotations: 233 | # must be a curve 234 | x_data, y_data = data 235 | x_max = max(x_max, max(x_data)) # remember for axes ranges 236 | y_max = max(y_max, max(y_data)) 237 | 238 | vprint("appending legend " + (legend)) 239 | legends.append(legend) 240 | curves.append(x_data) 241 | curves.append(y_data) 242 | curves.append(line_colors[iline % len(line_styles)] + 243 | line_styles[iline / len(line_styles)]) 244 | iline += 1 245 | 246 | dump_curves(curves) 247 | 248 | # specify axes ranges for subplot 249 | y_scale = 10 ** int(log10(y_max)) 250 | y_limit = roundup(y_max, y_scale) 251 | ax.set_ylim(0, y_limit) 252 | 253 | #x_scale = 10 ** int(log10(x_max)) 254 | #x_limit = roundup(x_max, x_scale) 255 | # don't round up x-limit 256 | x_limit = x_max 257 | ax.set_xlim(0, x_limit) 258 | 259 | the_plot = ax.plot(*curves) 260 | 261 | if legend_loc != 'none': 262 | # plot the curves on the subplot; 263 | # returns list of curves/lines for use with legend(). 
264 | leg = ax.legend(the_plot, legends, loc=legend_loc, shadow=True) 265 | 266 | # decorate the legend box 267 | # set legend background color to light gray: 268 | frame = leg.get_frame() 269 | frame.set_facecolor('0.80') 270 | 271 | # adjust legend text font size: 272 | for t in leg.get_texts(): 273 | t.set_fontsize('small') 274 | 275 | # set the legend line widths: 276 | for l in leg.get_lines(): 277 | l.set_linewidth(1.5) 278 | else: 279 | plot_file_suffix += "-no_legend" 280 | 281 | #TODO: command line option: view w/ or w/o save? 282 | #plt.show() 283 | 284 | filename = "ptf-" + plotid + plot_file_suffix + ".png" 285 | vprint("saving filename " + filename) 286 | plt.savefig(filename) 287 | 288 | #----------------------------------------------------------------------------- 289 | def emit_the_plots(): 290 | """ 291 | generate plot for each item in the_plots via matplotlib 292 | """ 293 | vprint("final threads[] = " + str(threads)) 294 | 295 | for title, a_plot in the_plots.iteritems(): 296 | emit_a_plot(title, a_plot) 297 | 298 | #----------------------------------------------------------------------------- 299 | def emit_warnings(): 300 | """ 301 | emit warnings regarding mismatched data sets, ... now that we've 302 | read and processed all of the input 303 | """ 304 | 305 | # TODO: no longer used. 
remove if no other uses arise 306 | return 307 | 308 | #============================================================================= 309 | # debug: dump plot, data tables 310 | 311 | def dump_a_plot(title, a_plot): 312 | vprint("dump_a_plot: " + title) 313 | 314 | for legend, data in a_plot.iteritems(): 315 | if legend in plot_annotations: 316 | vprint(legend + ", " + data) 317 | continue 318 | x_data, y_data = data 319 | vprint("legend: " + legend) 320 | vprint("x_data:\n" + str(x_data)) 321 | vprint("y_data:\n" + str(y_data)) 322 | 323 | return 324 | 325 | #----------------------------------------------------------------------------- 326 | def debug_dump_data(pft_files): 327 | global debug, verbose 328 | 329 | if not debug: 330 | return 331 | 332 | saved_verbose = verbose; verbose = True 333 | vprint("number of files processed: %d" % len(pft_files)) 334 | vprint("number of plots processed: %d" % len(the_plots)) 335 | 336 | for title, a_plot in the_plots.iteritems(): 337 | dump_a_plot(title, a_plot) 338 | 339 | verbose = saved_verbose 340 | 341 | #----------------------------------------------------------------------------- 342 | plotid = None 343 | legend = None 344 | def start_new_plot(legend): 345 | 346 | global plotid, plot_title, plot_subtitle, threads 347 | global faults_per_cpu_sec, faults_per_wall_sec, user_cpu_sec, system_cpu_sec, real_time_sec 348 | 349 | if not plotid: 350 | usage(4, "pft_plot: no PLOTID specified before LEGEND") 351 | 352 | if plotid not in the_plots: 353 | the_plots[plotid] = {} # new plot 354 | 355 | a_plot = the_plots[plotid] 356 | 357 | if not plot_title or not plot_subtitle: 358 | warning("pft_plot: no PLOT_[SUB]TITLE specified before LEGEND") 359 | 360 | # last one, if any, wins 361 | a_plot['plot_title'] = plot_title 362 | a_plot['plot_subtitle'] = plot_subtitle 363 | 364 | # new data sets for this (plot); read_pft_data() will populate 365 | threads = [] 366 | faults_per_cpu_sec = [] 367 | faults_per_wall_sec = [] 368 | 
user_cpu_sec = [] 369 | system_cpu_sec = [] 370 | real_time_sec = [] 371 | 372 | # add curves for selected plot items. 373 | # For pft plots, there will always be two for each data set: 374 | for item in plot_items: 375 | vprint("parsing item: " + item) 376 | a_plot["y_label"] = y_axis_labels[item] # last one wins, but they're the same 377 | if item == "fpcs": 378 | a_plot['faults/cpu-sec : ' + legend] = [threads, faults_per_cpu_sec] 379 | elif item == "fpws": 380 | a_plot['faults/wall-sec : ' + legend] = [threads, faults_per_wall_sec] 381 | elif item == "cpu": 382 | a_plot['user-cpu-sec : ' + legend] = [threads, user_cpu_sec] 383 | a_plot['system-cpu-sec : ' + legend] = [threads, system_cpu_sec] 384 | elif item == "wall": 385 | a_plot['real-time-sec : ' + legend] = [threads, real_time_sec] 386 | else: 387 | die(6, "Bogus plot_item '" + item + "'") 388 | 389 | #============================================================================= 390 | # data reduction and preparation: 391 | 392 | #----------------------------------------------------------------------------- 393 | def prep_line(line): 394 | """ trim trailing new line and leading/trailing whitespace from line """ 395 | return line.strip() 396 | 397 | #----------------------------------------------------------------------------- 398 | def read_pft_data(pft_file): 399 | """ 400 | continue reading pft_file, started in caller [read_pft_file()] and save 401 | plot data containing averages of multiple runs at each thread count. 402 | """ 403 | global threads, nr_runs 404 | 405 | nrpt = 0 # nr of reports per thread count 406 | fpcs = 0.0 # faults per cpu sec 407 | fpws = 0.0 # faults per wall clock sec 408 | usrcpu = 0.0 # user cpu seconds 409 | syscpu = 0.0 # system cpu seconds 410 | realtime = 0.0 # real [wall clock] time. 
411 | prev_threads = 0 # for detecting new thread count 412 | 413 | # N.B., need a blank line to terminate the loop between 414 | # multiple sections of a single file 415 | for line in pft_file: 416 | line = prep_line(line) 417 | words = line.split(None) # [gb threads cl us ss ws fpcs fpws] 418 | if len(words) < 8: break 419 | 420 | # don't assume sequence 1,2,3...,max 421 | nr_threads = int(words[1]) 422 | if nr_threads <= 0: 423 | die(5, "Bogus input: nr_threads [%d] <= 0\n" % nr_threads) 424 | 425 | if prev_threads == 0: 426 | prev_threads = nr_threads 427 | elif nr_threads != prev_threads: 428 | # append averages for previous thread count to data set 429 | if len(threads) == 0 or prev_threads > threads[-1]: 430 | threads.append(prev_threads) # x-axis thread counts 431 | 432 | faults_per_cpu_sec.append(float(fpcs) / float(nrpt)) 433 | faults_per_wall_sec.append(float(fpws) / float(nrpt)) 434 | user_cpu_sec.append(float(usrcpu) / float (nrpt)) 435 | system_cpu_sec.append(float(syscpu) / float (nrpt)) 436 | real_time_sec.append(float(realtime) / float(nrpt)) 437 | 438 | if nr_runs == None: 439 | nr_runs = nrpt 440 | elif nr_runs != nrpt: 441 | nr_runs_mismatch = True 442 | nrpt = 0 443 | fpcs = 0.0 444 | fpws = 0.0 445 | usrcpu = 0.0 446 | syscpu = 0.0 447 | realtime = 0.0 448 | prev_threads = nr_threads 449 | 450 | # accumulate runs w/ same thread count for averaging 451 | fpcs += float(words[6]) 452 | fpws += float(words[7]) 453 | usrcpu += float(words[3].strip("s")) 454 | syscpu += float(words[4].strip("s")) 455 | realtime += float(words[5].strip("s")) 456 | nrpt += 1 457 | 458 | # emit last thread count data, if any 459 | if nrpt != 0: 460 | if len(threads) == 0 or prev_threads > threads[-1]: 461 | threads.append(prev_threads) # x-axis thread counts 462 | 463 | faults_per_cpu_sec.append(float(fpcs) / float(nrpt)) 464 | faults_per_wall_sec.append(float(fpws) / float(nrpt)) 465 | user_cpu_sec.append(float(usrcpu) / float (nrpt)) 466 | 
system_cpu_sec.append(float(syscpu) / float (nrpt)) 467 | real_time_sec.append(float(realtime) / float(nrpt)) 468 | 469 | 470 | #----------------------------------------------------------------------------- 471 | def read_annotation(line): 472 | """ 473 | read plot annotations: plotid, title, subtitle, ... 474 | ignore unrecognized lines 475 | """ 476 | global plotid, plot_title, plot_subtitle 477 | 478 | if line.startswith('PLOTID'): 479 | plotid = line[1+len('PLOTID'):].strip(None) 480 | elif line.startswith('TITLE'): 481 | plot_title = line[1+len('TITLE'):].strip(None) 482 | elif line.startswith('SUBTITLE'): 483 | plot_subtitle = line[1+len('SUBTITLE'):].strip(None) 484 | # else ignore 485 | 486 | #----------------------------------------------------------------------------- 487 | def expect_what(expected, line, fail): 488 | if not line.startswith(expected): 489 | if not fail: 490 | return False 491 | else: 492 | die(2, "Bogus input! expected '%s', found '%s'" 493 | % (expected, line)) 494 | return True 495 | 496 | #----------------------------------------------------------------------------- 497 | def read_pft_file(pft_file_name): 498 | """ 499 | read raw pft data [@pft_file_name] and prepare for plotting 500 | """ 501 | vprint("processing " + pft_file_name) 502 | 503 | if pft_file_name == '-': pft_file = stdin 504 | else: pft_file = open(pft_file_name) 505 | 506 | # if we don't see a 'LEGEND', we'll never look for the header, ... 507 | # while looking for legend or header, check input for annotations. 
508 | state = 'expect-legend' 509 | for line in pft_file: 510 | line = prep_line(line) 511 | if len(line) == 0: 512 | continue 513 | words = line.split(None) 514 | what = words[0] 515 | 516 | if state == 'expect-legend': 517 | if not expect_what('LEGEND', what, False): 518 | read_annotation(line) 519 | continue 520 | legend = line[1+len('LEGEND'):].strip(None) 521 | if legend in ("y_label", "plot_title", "plot_subtitle"): 522 | # very unlikely, I hope 523 | legend += "-pft_plot" # avoid "keyword" legends 524 | state = 'expect-header' 525 | elif state == 'expect-header': 526 | if not expect_what('Gb', what, False): 527 | read_annotation(line) 528 | continue 529 | start_new_plot(legend) 530 | read_pft_data(pft_file) 531 | state = 'expect-legend' 532 | else: 533 | die(3, "Bogus state '" + state + "' - program error") 534 | 535 | pft_file.close() 536 | 537 | #============================================================================= 538 | #----------------------------------------------------------------------------- 539 | def check_pft_files(pft_files): 540 | """ 541 | ensure that all specified pft_files exist and are readable 542 | """ 543 | errs = 0 544 | 545 | for pf in pft_files: 546 | if pf == '-': 547 | continue 548 | try: 549 | if not os.access(pf, os.R_OK): 550 | vprint("Can't read pft file " + pf) 551 | errs += 1 552 | except os.error, msg: 553 | vprint("Can't find/access pft file " + pf + " - " + msg) 554 | errs += 1 555 | 556 | if errs: 557 | die(2, "%d pft file error[s]" % (int(errs))) 558 | 559 | #----------------------------------------------------------------------------- 560 | def parse_legend_loc(args): 561 | """ 562 | handle the -l/--legend_loc= argument 563 | """ 564 | global legend_loc 565 | vprint("parse_legend_loc\n"); 566 | 567 | # Abbreviations/shortcuts for legend location: 568 | llabbrev = { 569 | 'b' : 'best', 'r' : 'right', 'l' : 'left', 570 | 'c' : 'center', 'cl' : 'center left', 'cr' : 'center right', 571 | 'ul' : 'upper left', 'uc' : 
'upper center', 'ur' : 'upper right', 572 | 'll' : 'lower left', 'lc' : 'lower center', 'lr' : 'lower right', 573 | 'n' : 'none', 574 | } 575 | 576 | # search for two word locations with ' ' separator, but allow 577 | # '-' separator in the arg list 578 | args = args.replace('-', ' ') 579 | 580 | if ',' in args: 581 | # comma indicates x,y coordinates 582 | legend_loc = tuple(map(float, args.split(','))) 583 | elif ' ' in args: 584 | # two [or more] word long location names with '-' separator 585 | # e.g., 'upper-left' 586 | legend_loc = args 587 | elif args in llabbrev: 588 | # shortcut 589 | legend_loc = llabbrev[args] 590 | else: 591 | # assume one word long location name -- e.g., best 592 | legend_loc = args 593 | #vprint("legend_loc = " + str(legend_loc)) 594 | 595 | #----------------------------------------------------------------------------- 596 | def add_plot_item(this_one): 597 | global plot_items, y_axis_label, plot_file_suffix 598 | 599 | if this_one == 'real': 600 | this_one = 'wall' 601 | 602 | if this_one not in plot_items: 603 | plot_items.append(this_one) 604 | y_axis_label = y_axis_labels[this_one] 605 | sep = '-' 606 | if plot_file_suffix: 607 | sep = '+' 608 | plot_file_suffix = plot_file_suffix + sep + this_one 609 | #vprint("plot_file_suffix now: " + plot_file_suffix) 610 | 611 | 612 | #----------------------------------------------------------------------------- 613 | def parse_plot_list(args): 614 | global plot_items 615 | 616 | vprint("parse_plot_list: " + args) 617 | 618 | plot_items = [] # forget the default 619 | largs = args.split(',') 620 | plot_faults = False 621 | plot_seconds = False 622 | 623 | for arg in largs: 624 | if arg in ("fpcs", "fpws"): 625 | if plot_seconds: 626 | print >>sys.stderr, "Can't plot faults and seconds on same plot" 627 | sys.stderr.flush() 628 | continue 629 | add_plot_item(arg) 630 | plot_faults = True 631 | elif arg in ("cpu", "wall", "real"): 632 | if plot_faults: 633 | print >>sys.stderr, "Can't plot 
seconds and faults on same plot" 634 | sys.stderr.flush() 635 | continue 636 | add_plot_item(arg) 637 | plot_seconds = True 638 | else: 639 | usage(2, "Bogus plot selection: " + arg + "\n") 640 | 641 | vprint("plot items = " + str(plot_items) ) 642 | 643 | 644 | #----------------------------------------------------------------------------- 645 | def parse_args(in_args): 646 | global verbose, debug 647 | sys.stderr.flush() 648 | try: 649 | opts, args = getopt.getopt(in_args, s_opts, l_opts) 650 | except getopt.GetoptError, msg: 651 | print >>sys.stderr, msg 652 | sys.stderr.flush() 653 | die(1, USAGE) 654 | 655 | sys.stderr.flush() 656 | for o, a in opts: 657 | sys.stderr.flush() 658 | if o in ("-h", "--help"): 659 | usage(0, '') 660 | elif o in ("-d", "--debug"): 661 | debug = True 662 | elif o in ("-v", "--verbose"): 663 | verbose = True 664 | elif o in ("-l", "--legend_loc"): 665 | parse_legend_loc(a) 666 | elif o in ("-p", "--plot"): 667 | parse_plot_list(a) 668 | else: 669 | usage(1, "unrecognized option " + o) 670 | 671 | if len(args) < 1: 672 | args.append('-') 673 | 674 | return args 675 | 676 | #============================================================================= 677 | def main(): 678 | print >>sys.stderr, "pft_plot - WIP... " 679 | sys.stderr.flush() 680 | 681 | pft_files = parse_args(sys.argv[1:]) 682 | 683 | check_pft_files(pft_files) 684 | 685 | for pf in pft_files: 686 | read_pft_file(pf) 687 | 688 | debug_dump_data(pft_files) 689 | 690 | if len(the_plots) > 0: 691 | emit_warnings() 692 | emit_the_plots() 693 | 694 | #============================================================================= 695 | if __name__ == "__main__": 696 | main() 697 | 698 | -------------------------------------------------------------------------------- /Scripts/pft_profile: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 
4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | ## run pft with vmstats + readprofile 19 | 20 | # for plot annotations 21 | PLATFORM=$(dmidecode -s system-product-name ) 22 | if [[ -n "$PLATFORM" ]]; then 23 | PLATFORM=$( echo "$PLATFORM" | awk '{print $1"-"$2$3;}') 24 | else 25 | PLATFORM=Unknown 26 | fi 27 | 28 | # better way to determine nr_cpus? 29 | nr_cpus=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 30 | nr_nodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) 31 | 32 | # os revision for pft "tag" 33 | osrev=$(uname -r) 34 | osrev=${osrev#2.6.} # drop the 2.6. 35 | 36 | # Memory Size for test region 37 | #MEMSIZE="-m 4g" # fixed total size 38 | MEMSIZE= # undefined: use GB_PER_TASK 39 | GB_PER_TASK=8 # fixed size per task 40 | 41 | # Affinitize -- bind to cpus; use SCHED_FIFO 42 | BIND="-af" 43 | 44 | # Test type: -n for tasks, nothing for threads 45 | TEST_TYPE= #-n 46 | 47 | TIMESTAMP="$(date +%y%m%d-%H%M%S)" 48 | 49 | reset_lockstats() 50 | { 51 | echo 0 >/proc/lock_stat 52 | } 53 | 54 | enable_lockstats() 55 | { 56 | echo 1 >/sys/module/lockdep/parameters/lock_stat 57 | } 58 | 59 | disable_lockstats() 60 | { 61 | echo 0 >/sys/module/lockdep/parameters/lock_stat 62 | } 63 | 64 | 65 | outprefix=pft-$PLATFORM-$osrev-$TIMESTAMP 66 | 67 | STOP_FILE=./pft_stop_file 68 | 69 | # quick and dirty internal version. 
70 | # run pft varying task count from 1 .. nr_cpus 71 | runpft() 72 | { 73 | local memtype=$1 74 | local tag=MAP_ANON 75 | local memsize= 76 | [[ "$memtype" != "-Z" ]] || tag=dev_zero 77 | local cmd= 78 | 79 | rm -f $STOP_FILE 80 | title=-T 81 | echo "LEGEND $osrev $tag" 82 | for nr_tasks in $nr_cpus # ... $(seq 1 $nr_cpus) 83 | do 84 | readprofile -r 85 | 86 | memsize=$MEMSIZE 87 | [[ -n "$memsize" ]] || memsize="-m $(( GB_PER_TASK * nr_tasks ))g" 88 | cmd="pft $BIND $memsize $memtype -n $nr_tasks $title" 89 | echo "Command: $cmd " >&2 90 | eval "$cmd" 91 | 92 | readprofile -v 93 | 94 | title= 95 | if [[ -f $STOP_FILE ]]; then 96 | echo "Saw 'stop file'" >&2 97 | break 98 | fi 99 | done 100 | echo 101 | } 102 | 103 | # ===================================================================== 104 | #main() 105 | 106 | fmt_vmstats 10 >$outprefix.vmstats & 107 | vmstat_pid="$!" 108 | 109 | { 110 | # page faults using /dev/zero 111 | runpft -Z 112 | 113 | # page faults using MAP_ANON 114 | [[ -f $STOP_FILE ]] || runpft 115 | 116 | } >$outprefix.profile 117 | 118 | kill -s SIGQUIT $vmstat_pid 119 | -------------------------------------------------------------------------------- /Scripts/pft_task_thread: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 
14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | ## pft-task_thread -- invoke runpft to run processes [-n] 19 | ## and threads. 20 | ## TODO: rewrite to read config file [python?] 21 | 22 | # for plot annotations 23 | PLATFORM=$(dmidecode -s system-product-name) 24 | if [[ -n "$PLATFORM" ]]; then 25 | PLATFORM=$( echo "$PLATFORM" | awk '{print $1"-"$2$3;}') 26 | else 27 | PLATFORM=Unknown 28 | fi 29 | 30 | # better way to determine nr_cpus? 31 | nr_cpus=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 32 | nr_nodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) 33 | 34 | # TESTING=true 35 | TESTING=false 36 | export NOEXEC=$TESTING VERBOSE=$TESTING 37 | 38 | RUNPFT="/usr/local/bin/runpft" 39 | 40 | # os revision for pft "tag" 41 | osrev=$(uname -r) 42 | #osrev=${osrev#2.6.} # drop the 2.6. 43 | 44 | # Memory Size -- if not set, calc from GB_PER_TASK * nr_tasks) 45 | #MEMSIZE="-m 4g" 46 | MEMSIZE="-m 0" 47 | GB_PER_TASK="-g 2" 48 | 49 | # Number of Runs Per Thread count: 50 | #NRPT="-N 10" 51 | NRPT="-N 4" 52 | 53 | # Affinitize -- bind to cpus; use SCHED_FIFO 54 | BIND="-af" 55 | 56 | # Test type: -n for tasks, nothing for threads 57 | # not used herein 58 | #TEST_TYPE= #-n 59 | 60 | TIMESTAMP="$(date +%y%m%d-%H%M%S)" 61 | 62 | outprefix=pft_task_thread-$PLATFORM-$osrev-$TIMESTAMP 63 | 64 | 65 | # pft_mpol identifies shm [vs anon] and 66 | # sys default vs vma policy 67 | 68 | { 69 | # pft_mpol build version/timestamp: 70 | pft_mpol -V 71 | echo "PLOTID dl785c_task_and_thread_pft" 72 | echo "TITLE Task vs Thread Scalability" 73 | echo "SUBTITLE $PLATFORM $osrev" 74 | 75 | rm -f $STOP_FILE 76 | # anon mem, sys default mpol, touch to fault 77 | echo "LEGEND $osrev task" 78 | $RUNPFT -n $NRPT $MEMSIZE $GB_PER_TASK $BIND "$osrev-task" 79 | 80 | [[ ! 
-f $STOP_FILE ]] || exit 81 | echo "LEGEND $osrev thread" 82 | $RUNPFT $NRPT $MEMSIZE $GB_PER_TASK $BIND "$osrev-thread" 83 | 84 | } >${outprefix}.pft 2>&1 85 | -------------------------------------------------------------------------------- /Scripts/pft_vmstats: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | # for plot annotations 19 | PLATFORM=$(dmidecode -s system-product-name ) 20 | if [[ -n "$PLATFORM" ]]; then 21 | PLATFORM=$( echo "$PLATFORM" | awk '{print $1"-"$2$3;}') 22 | else 23 | PLATFORM=Unknown 24 | fi 25 | 26 | # better way to determine nr_cpus? 27 | nr_cpus=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 28 | nr_nodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) 29 | 30 | # os revision for pft "tag" 31 | osrev=$(uname -r) 32 | osrev=${osrev#2.6.} # drop the 2.6. 
33 | 34 | # Memory Size 35 | #MEMSIZE="-m 4g" 36 | MEMSIZE= 37 | GB_PER_TASK=8 38 | 39 | # Affinitize -- bind to cpus; use SCHED_FIFO 40 | BIND="-af" 41 | 42 | # Test type: -n for tasks, nothing for threads 43 | TEST_TYPE= #-n 44 | 45 | TIMESTAMP="$(date +%y%m%d-%H%M%S)" 46 | 47 | reset_lockstats() 48 | { 49 | echo 0 >/proc/lock_stat 50 | } 51 | 52 | enable_lockstats() 53 | { 54 | echo 1 >/sys/module/lockdep/parameters/lock_stat 55 | } 56 | 57 | disable_lockstats() 58 | { 59 | echo 0 >/sys/module/lockdep/parameters/lock_stat 60 | } 61 | 62 | 63 | outprefix=pft-$PLATFORM-$osrev-$TIMESTAMP 64 | 65 | STOP_FILE=./pft_stop_file 66 | 67 | # quick and dirty internal version. 68 | # run pft varying task count from 1 .. nr_cpus 69 | runpft() 70 | { 71 | local memtype=$1 72 | local tag=MAP_ANON 73 | local memsize= 74 | [[ "$memtype" != "-Z" ]] || tag=dev_zero 75 | local cmd= 76 | 77 | rm -f $STOP_FILE 78 | title=-T 79 | echo "LEGEND $osrev $tag" 80 | for nr_tasks in $(seq 1 $nr_cpus); 81 | do 82 | memsize=$MEMSIZE 83 | [[ -n "$memsize" ]] || memsize="-m $(( GB_PER_TASK * nr_tasks ))g" 84 | cmd="pft $BIND $memsize $memtype -n $nr_tasks $title" 85 | echo "Command: $cmd " >&2 86 | eval "$cmd" 87 | title= 88 | if [[ -f $STOP_FILE ]]; then 89 | echo "Saw 'stop file'" >&2 90 | break 91 | fi 92 | done 93 | echo 94 | } 95 | 96 | # ===================================================================== 97 | #main() 98 | 99 | fmt_vmstats 10 >$outprefix.vmstats & 100 | vmstat_pid="$!" 
101 | 102 | { 103 | # page faults using /dev/zero 104 | runpft -Z 105 | 106 | # page faults using MAP_ANON 107 | [[ -f $STOP_FILE ]] || runpft 108 | 109 | } >$outprefix.pgfaults 110 | 111 | kill -s SIGQUIT $vmstat_pid 112 | -------------------------------------------------------------------------------- /Scripts/rmshm: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 
4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | ipcs -m | \ 19 | egrep -v 'Shar|shmid|^$' | \ 20 | while read xxx id yyy 21 | do 22 | echo "removing shmid $id" 23 | ipcrm -m $id 24 | done 25 | 26 | ipcs -m 27 | -------------------------------------------------------------------------------- /Scripts/runpft: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright (c) 2006, 2014 SGI. All rights reserved. 4 | # 5 | # This program is free software; you can redistribute it and/or modify 6 | # it under the terms of the GNU General Public License as published by 7 | # the Free Software Foundation; either version 3 of the License, or 8 | # (at your option) any later version. 9 | # 10 | # This program is distributed in the hope that it will be useful, 11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | # GNU General Public License for more details. 14 | # 15 | # You should have received a copy of the GNU General Public License 16 | # along with this program. If not, see . 17 | 18 | # runpft: run "page fault test" [mpol version] with varying number 19 | # of threads, from 1 to nr_cpus [>= 0.04a], to measure scalability. 
20 | 21 | : ${PFT_MEMSIZE:=16g} 22 | : ${NOEXEC:=false} 23 | : ${VERBOSE:=false} 24 | 25 | : ${PFT:=pft} # was pft_mpol 26 | 27 | DROP_CACHES=/proc/sys/vm/drop_caches 28 | 29 | # ========================================================= 30 | # adjust_tag - replace sequences of spaces with single '-' 31 | adjust_tag() 32 | { 33 | echo "$@" | sed 's/  */-/g' 34 | } 35 | 36 | # ============================================================================= 37 | # vprint -- print args if verbosity enabled 38 | vprint() 39 | { 40 | $VERBOSE || return 41 | echo -e "$@" >&2 42 | } 43 | 44 | # ============================================================================= 45 | # _do -- conditionally show and/or execute command 46 | _do() 47 | { 48 | declare _do_cmd="$@" 49 | vprint "$_do_cmd" 50 | $NOEXEC || eval "$_do_cmd" 51 | } 52 | 53 | # ========================================================= 54 | # get_nr_cpus() -- return smaller of number of cpus in 55 | # /proc/cpuinfo and /proc/$$/status[Cpus_allowed] 56 | _each_chunk_of() 57 | { 58 | echo "$1" | tr ',' ' ' 59 | } 60 | get_nr_cpus() 61 | { 62 | # actual present [on-line ?] cpus in /proc/cpuinfo: 63 | local nr_cpus_info=$(cat /proc/cpuinfo | grep '^processor' | wc -l) 64 | 65 | # cpus_allowed -- hex string: [xxxxxxxx,]*xxxxxxxx 66 | local castr=$(fgrep 'Cpus_allowed:' /proc/$$/status | awk '{print $2}') 67 | local nr_cpus=0 68 | 69 | # handle possible multiple 32-bit chunks. We're just counting. 70 | for mask in $(_each_chunk_of $castr); do 71 | val=$( printf "%ld" 0x$mask) 72 | #TODO: will this work if 'sign bit' of val is set? 
73 | while [[ $val -ne 0 ]]; do 74 | nr_cpus=$((nr_cpus + (val & 1))) 75 | val=$((val >> 1)) 76 | done 77 | done 78 | [[ $nr_cpus -le $nr_cpus_info ]] || nr_cpus=$nr_cpus_info 79 | echo "$nr_cpus" 80 | } 81 | 82 | 83 | 84 | # ========================================================= 85 | #main() 86 | 87 | nr_cpus=$(get_nr_cpus) 88 | 89 | max_threads=$nr_cpus # was $(( nr_cpus - 1 )) 90 | bind= 91 | fifo= 92 | mempol= 93 | shm= 94 | verbose= 95 | mlock= 96 | NRPT=1 # number of runs per thread count for averaging 97 | test_type=-t # -t for threads, -n for tasks 98 | memsize=$PFT_MEMSIZE 99 | gb_per_task=1 # just in case... 100 | drop_pgcache=false 101 | i_am_root=false 102 | [[ $(id -u) -ne 0 ]] || i_am_root=true 103 | 104 | # process command line options: 105 | # Note: all options except '-D' and '-N' are passed to pft_* 106 | OPTIONS="afg:lm:pvDN:nSL" 107 | while getopts $OPTIONS opt 108 | do 109 | case $opt in 110 | # N = number of runs per thread count for averaging 111 | # see pft_to_agr_data 112 | N) NRPT=$OPTARG 113 | ;; 114 | 115 | a) bind=-a 116 | ;; 117 | f) fifo=-f 118 | ;; 119 | g) gb_per_task=$OPTARG 120 | ;; 121 | l) mlock=-l 122 | ;; 123 | m) memsize=$OPTARG 124 | ;; 125 | n) test_type=-n 126 | ;; 127 | p) mempol=-p 128 | ;; 129 | D) if $i_am_root; then 130 | drop_pgcache=true; 131 | else 132 | echo "Not privileged: -D [drop cache] ignored" >&2 133 | fi 134 | ;; 135 | S) shm="${shm:-'-S'}" # don't overwrite '-L' 136 | ;; 137 | L) shm=-L; # implies '-S' 138 | ;; 139 | v) verbose="$verbose -v" 140 | ;; 141 | esac 142 | done 143 | 144 | ## shift past any options to arguments: 145 | shift `expr $OPTIND - 1` 146 | 147 | tag="$(adjust_tag $@)" 148 | #echo "tag=$tag" 149 | 150 | declare -i nt=1 i=0 151 | declare title= 152 | [[ -z "$tag" ]] || title=-T 153 | 154 | cmd_pfx="$PFT $bind $fifo $mlock $mempol $shm $verbose" 155 | 156 | STOP_FILE=./pft_stop_file 157 | rm -f $STOP_FILE 158 | 159 | while [[ $nt -le $max_threads ]] 160 | do 161 | declare cmd="$cmd_pfx 
$test_type $nt" 162 | declare memopt="-m $memsize" 163 | [[ "X$memsize" -ne "X" ]] || memopt="-m $(( gb_per_task * nt ))"g 164 | 165 | cmd="$cmd $memopt" 166 | 167 | i=0 168 | while [[ $i -lt $NRPT ]] 169 | do 170 | if [[ -f $STOP_FILE ]]; then 171 | echo "Saw 'stop file' -- nr threads run $nt + $i/$NRPT" >&2 172 | break 173 | fi 174 | if $drop_pgcache; then 175 | _do "echo 1 >$DROP_CACHES" 176 | fi 177 | _do "$cmd $title $tag" 178 | i=$(( i + 1 )) 179 | 180 | # pass non-empty title, tag first time only 181 | title= 182 | tag= 183 | done 184 | 185 | nt=$(( nt + 1 )) 186 | done 187 | 188 | # blank line in case piping multiple runs to a single file. 189 | # needed by pft_to_agr_data script 190 | echo ""; 191 | 192 | -------------------------------------------------------------------------------- /TODO: -------------------------------------------------------------------------------- 1 | !!! update the README -- the only documentation we have. 2 | 3 | Signal handler for '-P' to catch/ignore SIG_INT 4 | 5 | -M -- mmap() so as to avoid sharing of anon vmas. [not so easy]. 6 | work in progress as of 27jan2010 7 | maybe not needed with RvR's private anon_vma patches. TEST! 8 | 9 | Option [-A?] to bind tests to all cpus in a node before moving on to next node. 10 | Currently, cpuids are interleaved over nodes on [HP] x86_64 platforms where 11 | pft has been tested. 12 | 13 | -------------------------------------------------------------------------------- /pft.c: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright (c) 2006, 2014 SGI. All rights reserved. 3 | * 4 | * This program is free software; you can redistribute it and/or modify 5 | * it under the terms of the GNU General Public License as published by 6 | * the Free Software Foundation; either version 3 of the License, or 7 | * (at your option) any later version. 
8 | * 9 | * This program is distributed in the hope that it will be useful, 10 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 11 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 12 | * GNU General Public License for more details. 13 | * 14 | * You should have received a copy of the GNU General Public License 15 | * # along with this program. If not, see . 16 | */ 17 | 18 | /* 19 | * Originally: Christoph Lameter's "Page Fault Test" tool. 20 | * 21 | * Posted to LKML: http://lkml.org/lkml/2006/8/29/294 22 | * 23 | * Modified by Lee Schermerhorn for mem policy testing 24 | * Change to allocate single large region before creating worker 25 | * threads/tasks. 26 | * Then, carve up the region, giving each worker a piece to fault in. 27 | * This will cause the workers to contend for the cache line[s] 28 | * holding the in-kernel memory policy structure, the zone locks 29 | * and page lists, ... 30 | * In multi-thread mode, the workers will also contend for the 31 | * single test task's mmap semaphore. 32 | * 33 | * See usage below. 
 */

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/shm.h>
#include <sys/time.h>
#include <sys/wait.h>

#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
#include <numa.h>
#include <numaif.h>
#include <pthread.h>
#include <sched.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <time.h>
#include <unistd.h>

#include "version.h"

#if defined(USE_RUSAGE_THREAD) && !defined(RUSAGE_THREAD)
#define RUSAGE_THREAD 1
#endif

#ifdef USE_NOCLEAR
/*
 * N.B., make sure this matches the value used in the 'noclear' kernel patch
 */
#define MPOL_MF_NOCLEAR (MPOL_MF_MOVE_ALL << 1)
#endif

struct test_rusage {
	struct timespec wall_start;
	struct timespec wall_end;

	struct rusage ruse_start;
	struct rusage ruse_end;
};

enum test_state {
	TEST_CREATED = 0,
	TEST_READY,
	TEST_DONE,
	TEST_REAPED,
};

static struct test_info {
	pthread_t ptid;
	pid_t pid;			/* not worth a union */
	volatile enum test_state state;

	int idx;
	int cpu;			/* if bound */

	char *mem;			/* this test's memory segment */

	struct test_rusage rusage;
} **test_info;

static pthread_attr_t thread_attributes;
static pthread_key_t tip_key;

static int pft_sched_policy = SCHED_FIFO;	/* if we use it */

static struct rusage ruse_start;	/* earliest worker's start rusage */

/*
 * parent/child communication - shared anon segment
 *
 * We extend this struct by nr_tests - 1 struct test_info, where
 * nr_tests = nr_proc + nr_thread, one of which will be 0.
 *
 * Why do we need both this array of test_info in the comm area
 * and the array of pointers to allocated test_infos?
 * Well, we don't really -- for this test.
 * We need this array in the comm area for the multi-task tests,
 * to communicate the results back to the parent/launch task.
 * The array of pointers to test_infos allocated locally to the
 * thread/task is a holdover from another multi-threaded tool
 * whose thread setup infrastructure I cloned for this test.
 * The measured loop of the test doesn't actually touch the test_info,
 * so we could use the array in the comm area directly.
 * However, I chose to keep the per test local test_info and then
 * push the results back to the comm area at the end, in case we
 * ever DO want to access the test_info in the test loop.  That
 * and I was too lazy to rip it out.  It does complicate things, tho'.
 */
struct pft_comm {
	volatile long go;

	struct timespec wall_end;
	struct timespec wall_start;

	struct rusage rusage;

	int shmid;
	int abort;

	struct test_info test_info[1];
} *comm;

#define CACHELINE_SIZE_DEFAULT (128)

/*
 * fairly arbitrary limits
 */
#define MAX_TESTS 128

#define ROUND_UP(x,y) (((x) + (y) - 1) & ~((y)-1))

static char* OPTSTR = "ac:fhlm:n:ps:t:vzCFLNPSTVZ";
static char *usage = "\n\
Usage:  %s [-afhpvzCFLMNPSTVZ] [-c <cachelines>] [-m <size>[KMGP]]\n\
        [-s <sleepsec>] [{-n <nr_proc>|-t <nr_thread>}] [<tag>]\n\
Where:\n\
    -h = show help/usage.\n\n\
    -m <size> = total size of test region.  Optional scale factor:\n\
         K = kilo, M = mega, G = giga, P = pages\n\
    -p = use vma/shared memPolicy; else use sys default policy.\n\
    -z = bzero the per-thread memory area; else touch cachelines.\n\
    -c <cachelines> = number of cachelines to touch if !-z.\n\
         configured cacheline size:  %d bytes.\n\
    -l = mlock() the region instead of bzero or touch.\n"
#if 0
"\
    -M = mmap() separate regions for each test task/thread to avoid\n\
         anon_vma sharing.\n"
#endif
"    -S = use SysV shared memory test area; else anonymous test memory.\n\
    -L = SHM_LOCK the SysV shared memory test area.  Implies -S\n\n\
    -n <nr_proc> = number of test processes.\n\
    -t <nr_thread> = number of threads.\n\
         '-t' and '-n' are mutually exclusive options\n\
         <nr_proc>|<nr_thread> should <= nr_cpus\n\
         Each process/thread will touch <size>/<nr_tests> memory.\n"
#if 0
"\
    -F = force use of <nr_proc>|nr_thread > nr_cpus_allowed\n"
#endif
"    -a = affinitize test processes/threads to cpus.\n\
    -f = use SCHED_FIFO policy for tests.\n\
    -Z = mmap() /dev/zero for anonymous regions\n"
#ifdef USE_NOCLEAR
"    -C = request kernel not to Clear pages to eliminate this\n\
         overhead.  Requires special kernel patch.\n"
#endif
"\n    -N = dump numa_maps at end of test\n\
    -P = pause after test to examine maps\n\
    -T = emit title/header line and 'tag' for plots.\n\
    -s <sleepsec> = sleep delay at end of tests.\n\
    -v = enable verbosity.\n\
    -V = just emit version/build-stamp and exit.\n\n\
    <tag> = annotation, e.g., for plots.\n\n\
";

size_t bytes;			/* total size of test memory */

unsigned long nr_proc = 0;	// TODO: make this default?
unsigned long nr_thread = 0;
unsigned long nr_tests;

long cachelines = 1;

int do_cpubind = 0;
int do_mempol = 0;
int do_numa_maps = 0;
int do_pause = 0;
int do_shm = 0;
int do_shmlock = 0;
int do_title = 0;
int verbose = 0;
int do_bzero = 0;
int use_sched_fifo = 0;
int force = 0;
long sleepsec = 0;
int do_mlock = 0;
int no_clear = 0;		/* define even when !defined(USE_NOCLEAR) */
size_t multimap = 0L;		/* mmap() per test area */

int mmap_fd;			/* for /dev/zero mappings */
int mmap_flags = MAP_ANONYMOUS;	/* default */

int launch_cpu;

long faults;
long pages;
long pages_per_test;
size_t bytes_per_test;

char *test_memory;
char **test_memories;

pid_t lpid;			/* launch() pid */

int rusage_who;

void
perrorx(char *mesg)
{
	perror(mesg);
	if (comm)
		comm->abort++;
	exit(1);
}

void
vprint(int level, char *format, ...)
{
	va_list ap;

	/*
	 * return before va_start():  calling va_end() on an
	 * uninitialized va_list is undefined behavior.
	 */
	if (level > verbose)
		return;

	va_start(ap, format);

	(void)vfprintf(stderr, format, ap);
	fflush(stderr);

	va_end(ap);
}

/*
 * ============================================================================
 * Support for determining allowed cpus, for distributing children over allowed
 * cpus.
 */

typedef enum{false = 0, true} bool;
const unsigned int BITSPERINT = 8 * sizeof(int);

static int max_cpu_allowed = -1;
static int nr_cpus_allowed = -1;
static struct bitmask *cpus_allowed_mask;	/* bit map */
static unsigned int *cpus_allowed;		/* dense array */

/*
 * ----------------------------------------------------------------------------
 */

static bool
cpu_allowed(int cpuid)
{
	return numa_bitmask_isbitset(cpus_allowed_mask, cpuid);
}

/*
 * ----------------------------------------------------------------------------
 */

/*
 * cpus_allowed_init():  fetch task's Cpus_allowed mask using libnuma API
 */
#define CA_STRLEN 4096	/* large enough for most ? */
void
cpus_allowed_init(void)
{
	unsigned int *prev_cpus_allowed;
	int ret, max_cpus_allowed;
	int i, cpuid, prev_nr_cpus_allowed;

	/*
	 * libnuma wrapper returns bitmask
	 */
	cpus_allowed_mask = numa_allocate_cpumask();
	ret = numa_sched_getaffinity(lpid, cpus_allowed_mask);
	if (ret == -1)
		perrorx("Can't fetch sched affinity");

	// broken in numactl 2.0.3:  wrong symbol name [numa_num_tread_cpus()]
	// in sources and numa(3) man page.  Requires patched libnuma and numa(3).
	max_cpus_allowed = numa_num_task_cpus();

	prev_cpus_allowed = cpus_allowed;	/* for re-init */
	prev_nr_cpus_allowed = nr_cpus_allowed;

	/*
	 * Populate cpus_allowed[] dense array.
	 */
	nr_cpus_allowed = 0;
	cpus_allowed = calloc(max_cpus_allowed, sizeof(*cpus_allowed));
	if (!cpus_allowed)
		perrorx("Can't allocate cpus_allowed array");

	/*
	 * Scan the whole affinity mask:  allowed cpu ids need not be
	 * contiguous, nor start at 0.  Never store more than the
	 * max_cpus_allowed entries allocated above.
	 */
	for (i = 0; i < (int)cpus_allowed_mask->size &&
			nr_cpus_allowed < max_cpus_allowed; ++i) {
		if (cpu_allowed(i)) {
			max_cpu_allowed = i;
			cpus_allowed[nr_cpus_allowed++] = i;
		}
	}

	/*
	 * TODO:  on re-init, notify application of change in cpus_allowed:
	 * E.g., redistribute tasks over new set of allowed cpus.
	 */
}

/*
 * ----------------------------------------------------------------------------
 */

/*
 * cpus_init() -- fetch/parse allowed cpus
 */
static void
cpus_init()
{
	cpus_allowed_init();
	launch_cpu = cpus_allowed[0];
}

/*
 * ============================================================================
 */

/*
 * cacheline_size_init() - fetch cacheline size from kernel, if supported
 */
size_t cacheline_size = 0;
void
cacheline_size_init(void)
{
	if(!cacheline_size) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
		long cls = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
		if (cls > 0)
			cacheline_size = (size_t)cls;
		else
#endif
			cacheline_size = CACHELINE_SIZE_DEFAULT;
	}
}

/*
 * Does the kernel support RUSAGE_THREAD?
 */
void
check_rusage_thread(void)
{
#ifdef RUSAGE_THREAD
	struct rusage rusage;

	if(!getrusage(RUSAGE_THREAD, &rusage)) {
		rusage_who = RUSAGE_THREAD;
		vprint(1, "Using RUSAGE_THREAD\n");
	}
#endif
}

/*
 * pagesize_init() - fetch pagesize when needed
 */
ssize_t pagesize = -1;
#define PAGE_ALIGN(addr) ROUND_UP((addr), pagesize)
static void
pagesize_init(void)
{
	if (pagesize == -1) {
		pagesize = sysconf(_SC_PAGESIZE);
		if (pagesize <= 0) {
			perror("sysconf(_SC_PAGESIZE) failed");
			exit(1);
		}
	}
}

/*
 * use_dev_zero() -- open /dev/zero for use by pft_mmap()
 */
static void
use_dev_zero(void)
{
	int dzfd = open("/dev/zero", O_RDWR);

	if (dzfd < 0)
		perrorx("open of /dev/zero failed ");

	mmap_fd = dzfd;
	mmap_flags = 0;		/* zap MAP_ANONYMOUS */
}

/*
 * pft_mmap() - "allocate" page aligned memory using mmap(ANON)
 * flags:  MAP_PRIVATE or MAP_SHARED
 */
static void *
pft_mmap(size_t size, int flags, void *where)
{
	char *addr;

	pagesize_init();
	if (!size)
		size = pagesize;

	if (where)
		flags |= MAP_FIXED;

	/*
	 * mmap(2) page aligns size/len
	 */
	addr = (char *)mmap(where, size,
				PROT_READ|PROT_WRITE,
				flags|mmap_flags,
				mmap_fd, 0);
	if (addr == MAP_FAILED)
		addr = NULL;	/* like valloc(3) */
	return addr;
}

/*
 * valloc_private() - "allocate" page-aligned, private anon memory.
 */
static void *
valloc_private(size_t size)
{
	return pft_mmap(size, MAP_PRIVATE, NULL);
}

/*
 * valloc_shared() - "allocate" page-aligned, shared anon memory.
 */
static void *
valloc_shared(size_t size)
{
	return pft_mmap(size, MAP_SHARED, NULL);
}

/*
 * pft_free() -- free memory "allocated" via pft_mmap() or valloc_*()
 * need 'size' for munmap
 */
static void
pft_free(void *mem, size_t size)
{
	munmap(mem, size);
}

/*
 * bind_to_cpu() - bind calling thread [main program] to specified
 * cpu before creating per cpu thread.
 * Thread id 'tid' for debug printing only
 */
static int
bind_to_cpu(int cpu, int tid)
{
	cpu_set_t cpu_mask;

	if (!do_cpubind)
		return 1;

	if (nr_cpus_allowed == 1)
		return 1;	/* why bother? */

	CPU_ZERO(&cpu_mask);
	CPU_SET(cpu, &cpu_mask);

	if (sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask) == -1) {
		perror("sched_setaffinity");
		return 0;	/* assume no such cpu? */
	}
	vprint(2, "worker %d bound to cpu %d\n", tid, cpu);
	return 1;
}

/*
 * borrowed from memtoy
 * size_kmgp() -- convert ascii arg to numeric and scale as requested
 */
#define BOGUS_SIZE ((size_t)-1)	/* memtoy */
#define KILO_SHIFT 10		/* shift count to multiply by 1K */

static size_t
size_kmgp(char *arg)
{
	size_t argval;
	char *next;

	argval = strtoul(arg, &next, 0);
	if (*next == '\0')
		return argval;

	switch (tolower((unsigned char)*next)) {
	case 'p':	/* pages */
		pagesize_init();	/* pagesize is -1 until fetched */
		argval *= pagesize;
		break;

	case 'k':
		argval <<= KILO_SHIFT;
		break;

	case 'm':
		argval <<= KILO_SHIFT * 2;
		break;

	case 'g':
		argval <<= KILO_SHIFT * 3;
		break;

	default:
		return BOGUS_SIZE;	/* bogus chars after number */
	}

	return argval;
}

/*
 * choose an arbitrary priority for pft_sched_policy tests.
 * use mid-point of pft_sched_policy priority range.
 */
static int get_run_priority(void)
{
	int pri_min, pri_max, sched_pri;

	pri_min = sched_get_priority_min(pft_sched_policy);
	if (pri_min < 0) {
		perror("sched_get_priority_min");
		return 0;
	}
	pri_max = sched_get_priority_max(pft_sched_policy);
	if (pri_max < 0) {
		perror("sched_get_priority_max");
		return 0;
	}

	sched_pri = (pri_min + pri_max) / 2;
	vprint(2, "%s returning sched_pri = %d\n", __FUNCTION__, sched_pri);
	return sched_pri;
}

/*
 * Set scheduler for task to SCHED_FIFO.  For launch() task/thread
 * [test == 0], augment the priority by 1 to retain control while
 * starting workers on each cpu.
 *
 * No return value:  a sched_setscheduler() failure is reported via
 * perror() and the test continues with the current scheduler.
 */
static void set_task_scheduler(int test)
{
	struct sched_param sched_params;

	if (!use_sched_fifo)
		return;

	memset(&sched_params, 0, sizeof(sched_params));

	sched_params.sched_priority = get_run_priority() + !test;

	vprint(2, "setting test %d scheduler to %d @ %d\n", test,
		pft_sched_policy, sched_params.sched_priority);
	if (sched_setscheduler(0, pft_sched_policy, &sched_params)) {
		perror("sched_setscheduler");
	}

}

/*
 * Set scheduler to SCHED_FIFO and priority high to minimize
 * variability from other processes during test loop.
 * Returns:
 *	!0 on success
 *	0 on failure.
 *
 */
static int create_thread_attributes(void)
{
	pthread_attr_t *attr = &thread_attributes;
	struct sched_param sched_params;

	if (pthread_attr_init(attr)) {
		perror("pthread_attr_init");
		return 0;
	}

	if (!use_sched_fifo)
		return 1;

	if (pthread_attr_setschedpolicy(attr, pft_sched_policy)) {
		perror("pthread_attr_setschedpolicy");
		return 0;
	}

	sched_params.sched_priority = get_run_priority();
	vprint(2, "setting thread scheduler to %d @ %d\n", pft_sched_policy,
		sched_params.sched_priority);
	if (pthread_attr_setschedparam(attr, &sched_params)) {
		perror("pthread_attr_setschedparam");
		return 0;
	}

	return 1;
}

/*
 * show_tip_node() -- fetch numa node id of thread info struct
 */
static void show_tip_node(struct test_info *tip)
{
#ifdef MPOL_F_NODE
	int rc, node;

	rc = get_mempolicy(&node, NULL, 0, tip, MPOL_F_NODE|MPOL_F_ADDR);
	if (rc)
		return;

	vprint(2, "test info struct for test %d [cpu %d] on node %d\n",
		tip->idx, tip->cpu, node);
#endif
}

/*
 * create_test_info() - allocate private, page-aligned per process/thread
 * info for test 'tidx'.  For NUMA platforms, the test info struct should
 * be allocated locally to the cpu where the thread is running at the time.
 * At end of test, the worker thread/task will dump its test info into the
 * shared communication area.
 * If the '-a' [affinitize] option was specified, the thread or process
 * should already be bound to its run time cpu.
 */
struct test_info *
create_test_info(int tidx)
{
	struct test_info *tip;

	if (tidx) {
		tip = test_info[tidx] = valloc_private(sizeof(*tip));
		if (!tip)
			perrorx("valloc_private(test_info)");
	} else
		tip = comm->test_info;	/* use comm area directly for test 0 */

	bzero(tip, sizeof(*tip));
	tip->idx = tidx;
	return tip;
}

char *
alloc_shm(size_t shmlen)
{
	char *p, *locked = "";

	vprint(3, "Try to allocate TOTAL shm segment of %zu bytes\n", shmlen);

	if ((comm->shmid = shmget(IPC_PRIVATE, shmlen, SHM_R|SHM_W)) == -1)
		perrorx("shmget failed");
	p = (char*)shmat(comm->shmid, (void*)0, SHM_R|SHM_W);
	if (p == (char *)-1)	/* shmat(2) returns (void *)-1 on error, not NULL */
		perrorx("shmat failed");

	if (do_shmlock) {
		if (shmctl(comm->shmid, SHM_LOCK, NULL) == -1)
			perrorx("shmctl(SHM_LOCK) failed");
		locked = "/SHM_LOCKED";
	}
	vprint(3, "shm created%s, attached @ adr: 0x%lx\n", locked, (long)p);
	return p;
}

/*
 * do_mbind() -- apply vma policy to test memory region
 * Use "explicit local" policy -- MPOL_PREFERRED w/ NULL nodemask
 */
void
do_mbind(char *start, size_t length)
{
	if (!do_mempol)
		return;

	if (mbind(start, length, MPOL_PREFERRED, (void *)0, 0, 0) < 0)
		perrorx("mbind failed");
}

#ifdef USE_NOCLEAR
void
do_noclear(char *start, size_t length)
{
	if (!no_clear)
		return;
	/*
	 * length, policy, nodemask/maxnodes all ignored.
	 * this is just a "backdoor" to set "no clear" on
	 * the vma, if supported
	 */
	if (mbind(start, length, MPOL_PREFERRED, (void *)0, 0, MPOL_MF_NOCLEAR) < 0)
		perrorx("mbind 'NOCLEAR' failed/not supported");
	vprint(1, "enabled 'NOCLEAR' on test memory\n");
}
#else
#define do_noclear(P, L)	/* no-op, but should never be invoked */
#endif

/*
 * alloc_test_memory:  allocate the test memory region and divide up between
 * threads.
 */
void
alloc_test_memory(void)
{
	char *p = NULL;
	int j;

	if (do_shm) {
		if ((p = alloc_shm(bytes))) {
			do_mbind(p, bytes);
			do_noclear(p, bytes);
		}
	} else {
		/*
		 * mmap()'ed test area[s].
		 */
		if (!multimap) {
			/*
			 * one large test area => single anon_vma
			 */
			if ((p = valloc_private(bytes))) {
				do_mbind(p, bytes);
				do_noclear(p, bytes);
			}
		} else {
#if 0
// Not Ready for Prime Time -- maybe never
			/*
			 * multimap:  per test mmap area => separate anon_vmas
			 */
			void *where;
			size_t abytes = bytes + (nr_tests + 1) * pagesize;
			size_t tbytes = bytes_per_test + pagesize;

			/*
			 * reserve VA range with room for "holes"
			 */
			where = valloc_private(abytes);
			if(!where)
				perrorx("valloc_private() of multimap region failed");
			vprint(3, "multimap va range:  0x%lx - 0x%lx\n", where, where+abytes);
			if (munmap(where, pagesize))
				perrorx("munmap() of 1st test region page failed");
			where += pagesize;

			for (j = 0; j < nr_tests; ++j) {
				/*
				 * unmap per test region + a 1 page hole
				 */
				if (munmap(where, tbytes))
					perrorx("munmap() of per test region failed");

				/*
				 * map per test region below the hole
				 */
				if (p = pft_mmap(bytes_per_test, MAP_PRIVATE, where)) {
					vprint(3, "test %d memory @ 0x%lx - 0x%lx\n",
						j, p, p+bytes_per_test);
					do_mbind(p, bytes_per_test);
					do_noclear(p, bytes_per_test);
					test_memories[j] = p;

					where += tbytes;	/* advance past the hole */
				} else
					goto err;
			}
#endif
		}
	}

	if (p == 0) {
err:
		fprintf(stderr, "allocation of %zu bytes failed.\n", bytes);
		exit(1);
	}

	if (!multimap) {
		test_memory = p;
		vprint(3, "test memory @ 0x%lx\n", (unsigned long)test_memory);
	}
}

/*
 * calc_elapsed_time() -- elapsed "wall clock" time
 */
static double
calc_elapsed_time(struct timespec *ws, struct timespec *we)
{
	struct timespec wall;

	wall.tv_sec = we->tv_sec - ws->tv_sec;
	wall.tv_nsec = we->tv_nsec - ws->tv_nsec;

	if (wall.tv_nsec < 0) {
		wall.tv_sec--;
		wall.tv_nsec += 1000000000;
	}

	if (wall.tv_nsec >= 1000000000) {
		wall.tv_sec++;
		wall.tv_nsec -= 1000000000;
	}

	return ((double) wall.tv_sec + (double) wall.tv_nsec / 1000000000.0);
}

/*
 * calc_cpu_time() -- user and/or system time for all workers
 */
static double
calc_cpu_time(struct timeval *tvp)
{

	return ((double) tvp->tv_sec + (double) tvp->tv_usec / 1000000.0);
}

/*
 * test_to_cpu() - distribute threads/processes, round robin, over cpus
 *
 * cpu_offset:  used to prevent other worker threads from binding to
 * the launch cpu as that causes startup problems.
 */
static int cpu_offset = 0;

static int
test_to_cpu(int t)
{
	return (cpus_allowed[(t + cpu_offset) % nr_cpus_allowed]);
}

//TODO :  temp for debug
void
show_rusage(char *tag, struct rusage *rusage)
{
	/* tv_sec, tv_usec and the fault counts are all long-sized */
	fprintf(stderr, "%s %8ld.%06ld %8ld.%06ld %8ld %8ld\n", tag,
		(long)rusage->ru_utime.tv_sec, (long)rusage->ru_utime.tv_usec,
		(long)rusage->ru_stime.tv_sec, (long)rusage->ru_stime.tv_usec,
		rusage->ru_minflt, rusage->ru_majflt);
}

/*
 * actual measured test loop
 */
void
pft_loop(struct test_info *tip)
{
	char *pe, *p = tip->mem;
	int cl;

	/*
	 * Start Measurement Interval and snap initial rusage.
	 * Note preemption window between clock_gettime() and getrusage()
	 * that can skew results if we get preempted there.
	 * Fortunately, we only use the wall clock time to
	 * select the earliest/latest workers' rusage when
	 * running multi-thread test on kernel that doesn't
	 * support RUSAGE_THREAD.
	 */
	clock_gettime(CLOCK_REALTIME, &tip->rusage.wall_start);
	getrusage(rusage_who, &tip->rusage.ruse_start);

	if (do_mlock) {
		/* a failed mlock() faults nothing in; don't report it as a result */
		if (mlock(p, bytes_per_test))
			perrorx("mlock of test memory failed");
		vprint(2, " mlocked\n");
	} else if (do_bzero) {
		bzero(p, bytes_per_test);
		vprint(2, " zeroed\n");
	} else {
		/*
		 * Touch 'cachelines' every pagesize bytes.
		 * Use 'write' access to force anon page allocation.
		 * TODO:  if we decide to add page cache [mapped file]
		 * tests, may want to select read or write access
		 * to test page cache minor read faults vs COW
		 */
		for(pe = p + bytes_per_test; p < pe; p += pagesize)
			for(cl = 0; cl < cachelines; cl++)
				p[cl * cacheline_size] = 'r';
	}

	/*
	 * End Thread Measurement Interval and snap ending rusage.
	 * Note preemption window.
	 */
	getrusage(rusage_who, &tip->rusage.ruse_end);
	clock_gettime(CLOCK_REALTIME, &tip->rusage.wall_end);
}

void
check_wall_time(int id, struct timespec *tsp, char *what)
{
	if (tsp->tv_sec)
		return;
	vprint(0, "!!! Test %d - %s time is zero\n", id, what);
	verbose = 3;
}

/*
 * per test "main-line" function
 */
void*
test_main(void *arg)
{
	struct test_info *tip;
	struct timespec sleepfor = { 0, 2500L };	/* 0.0000025 sec */
	long id;

	tip = (struct test_info *)arg;
	id = tip->idx;

	tip->state = TEST_READY;
	/*
	 * push local test_info to comm area so that launch()
	 * sees state and ptid/pid.
	 */
	comm->test_info[id] = *tip;

	while(!comm->go) {
//TODO:  may need one of these if nr_tests > nr_cpus ...
#if 0
#if 0
		if (tip->cpu == launch_cpu)
			nanosleep(&sleepfor, NULL);	/* relax... */
#else
		sched_yield();
#endif
#endif
	}

	vprint (2, "test %ld running\n", id);

	pft_loop(tip);

	if (sleepsec) {
		vprint (2, "test %ld sleeping\n", id);
		sleep(sleepsec);
	}

	check_wall_time(tip->idx, &tip->rusage.wall_start, "test_main wall_start");
	check_wall_time(tip->idx, &tip->rusage.wall_end, "test_main wall_end");

	vprint (2, "test %ld done\n", id);
	tip->state = TEST_DONE;

	/*
	 * push results back into comm area
	 */
	comm->test_info[id] = *tip;

	if (nr_thread)
		pthread_exit(0);
	else {
		comm = NULL;	/* don't cleanup */
		exit(0);
	}
	return NULL;	/* not reached */
}


/*
 * start workers -- start nr_tests-1 threads or tasks for test, distributed
 * across cpus.  launch() thread/task will run a test as well, for a total
 * of nr_tests.
 * Allocate test info structs local to test's cpu--i.e., after binding.
 * Return !0 [nr tests created] on success; 0 on failure.
 */
static int
start_workers(void)
{
	cpu_set_t main_mask;
	struct test_info **tipp;
	int j;
	int tests_created = 0;
	int ret = 0;

	vprint(2, "Starting %lu test %s\n", nr_tests, nr_thread ? "threads" : "tasks");

	test_info = valloc_private(nr_tests * sizeof(*test_info));
	if (!test_info)
		perrorx("allocation of test_info failed");

	if (nr_thread) {
		if (!create_thread_attributes()) {
			fprintf(stderr, "Failed to create thread attributes\n");
			return 0;
		}

		/*
		 * for signal handlers, if needed
		 */
		if (pthread_key_create(&tip_key, NULL)) {
			perror("pthread_key_create");
			fprintf(stderr, "Failed to create TSD key\n");
			return 0;
		}
	}

	/*
	 * try to start requested number of test threads/tasks
	 */
	for (tipp = test_info, j = 0; j < nr_tests; ++j, ++tipp) {
		struct test_info *tip;
		int cpu = test_to_cpu(j);

		/*
		 * don't allow other worker threads to bind to launch_cpu.
		 * This can only occur when we run more workers than we
		 * have allowed cpus.  Bump the cpu_offset and try
		 * again.  If we only have one cpu, this won't prevent
		 * us from piling onto the launch cpu, but them's the breaks.
		 */
		if (j && cpu == launch_cpu) {
			cpu_offset++;
			cpu = test_to_cpu(j);
		}

		/*
		 * bind launch thread to 'cpu' so that per test info,
		 * stacks, ... get allocated locally to test's cpu.
		 */
		if (!bind_to_cpu(cpu, j)) {
			fprintf(stderr, "Unable to bind test %d to cpu %d - "
				"aborting.\n", j, cpu);
			exit(1);
		}

		*tipp = tip = create_test_info(j);	/* create locally */

		tip->cpu = cpu;
		show_tip_node(tip);	/* debug */

		if (multimap)
			tip->mem = test_memories[j];
		else
			tip->mem = j * bytes_per_test + test_memory;
		vprint(3, "thread %d test mem @ 0x%lx - 0x%lx\n", j,
			(unsigned long)tip->mem,
			(unsigned long)(tip->mem + bytes_per_test));

		if (!j) {
			/*
			 * launch() thread/task will run test loop as test 0.
			 * save its thread's cpu affinity for restoration below
			 */
			sched_getaffinity(lpid, sizeof(main_mask), &main_mask);
			continue;
		}

		if (nr_thread) {
			/*
			 * Create the test threads:  1..nr_thread-1
			 */
			if (pthread_create(&tip->ptid, &thread_attributes,
					test_main, tip)) {
				perrorx("pthread_create");
			}
		} else {
			/*
			 * Create the test tasks:  1..nr_proc-1
			 */
			switch (tip->pid = fork()) {
			case 0:
				set_task_scheduler(j);
				test_main(tip);
				break;
			case -1:
				perrorx("fork() of test process failed");
			}
		}
		++tests_created;

	}	/* for each requested thread */

	ret = tests_created;	/* success */

	/*
	 * restore "launch task's" cpu affinity
	 */
	sched_setaffinity(lpid, sizeof(main_mask), &main_mask);
	return ret;
}

/*
 * update_walltime_and_rusage() -- select later of current comm area wall_end
 * and the specified thread's wall_end.  Update the rusage from the latest
 * test.
 *
 * Details:
 * If running multi-thread tests and RUSAGE_THREAD is not supported by the
 * system, select the start rusage of the earliest thread to start and the end
 * rusage of the latest thread to complete.  Otherwise, we have the start and
 * end usage of the tests in the comm area.  We'll accumulate them after all
 * tests complete.
 */
static void
update_walltime_and_rusage(struct test_info *tip)
{
	struct timespec *cws = &comm->wall_start;
	struct timespec *cwe = &comm->wall_end;
	struct timespec *tws = &tip->rusage.wall_start;
	struct timespec *twe = &tip->rusage.wall_end;

	check_wall_time(tip->idx, cws, "comm->wall_start");
	check_wall_time(tip->idx, cwe, "comm->wall_end");
	check_wall_time(tip->idx, tws, "test wall_start");
	check_wall_time(tip->idx, twe, "test wall_end");

	vprint(3, "test %d\n\tstart time %ld,%ld\n", tip->idx,
		(long)tws->tv_sec, (long)tws->tv_nsec);
	if (tws->tv_sec < cws->tv_sec ||
	    (tws->tv_sec == cws->tv_sec && tws->tv_nsec < cws->tv_nsec)) {
		vprint(2, "selecting start time and rusage for thread %d\n",
			tip->idx);
		*cws = *tws;
		if (rusage_who == RUSAGE_SELF && nr_thread)
			ruse_start = tip->rusage.ruse_start;
	}

	vprint(3, "\tend time %ld,%ld\n", (long)twe->tv_sec, (long)twe->tv_nsec);
	if (cwe->tv_sec < twe->tv_sec ||
	    (cwe->tv_sec == twe->tv_sec && cwe->tv_nsec < twe->tv_nsec)) {
		vprint(2, "selecting end time and rusage for thread %d\n",
			tip->idx);
		*cwe = *twe;
		if (rusage_who == RUSAGE_SELF && nr_thread)
			comm->rusage = tip->rusage.ruse_end;
	}

	vprint(3, "\telapsed time:  %8.3f\n", calc_elapsed_time(tws, twe));
}

void
timeval_subtract(struct timeval *tve, struct timeval *tvs)
{
	tve->tv_sec -= tvs->tv_sec;
	tve->tv_usec -= tvs->tv_usec;
	if (tve->tv_usec < 0) {
		tve->tv_sec--;
		tve->tv_usec += 1000000;
	}
}

void
timeval_add(struct timeval *tvsum, struct timeval *tvadd)
{
	tvsum->tv_sec += tvadd->tv_sec;
	tvsum->tv_usec += tvadd->tv_usec;
	if (tvsum->tv_usec >= 1000000) {	/* exactly 1000000 usec must carry too */
		tvsum->tv_usec -= 1000000;
		tvsum->tv_sec++;
	}
}

/*
 * Each test should incur at least 'pages_per_test' minor faults.
 * Did it?
 */
void
check_test_rusage(int id, struct rusage *sru, struct rusage *eru)
{
	long minflts, majflts;

	minflts = eru->ru_minflt - sru->ru_minflt;
	if (minflts < pages_per_test) {
		vprint(0, "!!! test %d:  expected %ld faults -- measured %ld\n"
			"    user time %8.4g, system time:  %8.4g\n",
			id, pages_per_test, minflts,
			calc_cpu_time(&eru->ru_utime), calc_cpu_time(&eru->ru_stime));
	}

	majflts = eru->ru_majflt - sru->ru_majflt;
	if (majflts)
		vprint(0, "!!! test %d:  unexpected major faults:  %ld\n",
			id, majflts);
}

/*
 * calc_test_rusage() -
 * For multithread test, and not using RUSAGE_THREAD, subtract earliest thread's
 * start rusage from last thread's end rusage in comm area.
 * Otherwise, accumulate each thread's or task's rusage in comm area.
1213 | */ 1214 | void 1215 | calc_test_rusage(void) 1216 | { 1217 | if (rusage_who == RUSAGE_SELF && nr_thread) { 1218 | /* 1219 | * multi-thread test; no RUSAGE_THREAD 1220 | */ 1221 | struct rusage *sru = &ruse_start; /* earliest starting thread */ 1222 | struct rusage *eru = &comm->rusage; /* latest ending thread */ 1223 | 1224 | timeval_subtract(&eru->ru_utime, &sru->ru_utime); 1225 | timeval_subtract(&eru->ru_stime, &sru->ru_stime); 1226 | 1227 | eru->ru_minflt -= sru->ru_minflt; 1228 | eru->ru_majflt -= sru->ru_majflt; 1229 | 1230 | } else { 1231 | /* 1232 | * multi-task test, or multi-thread test using RUSAGE_THREAD; 1233 | * accumulate all tests' rusage during test loop 1234 | */ 1235 | struct rusage *sumru = &comm->rusage; /* the accumulator */ 1236 | int j; 1237 | bzero(sumru, sizeof(struct rusage));/* the accumulator */ 1238 | for (j = 0; j < nr_tests; j++) { 1239 | struct test_info *tip = comm->test_info + j; 1240 | struct rusage *sru = &tip->rusage.ruse_start; 1241 | struct rusage *eru = &tip->rusage.ruse_end; 1242 | 1243 | /* 1244 | * cpu usage in pft_loop() for test j 1245 | */ 1246 | timeval_subtract(&eru->ru_utime, &sru->ru_utime); 1247 | timeval_subtract(&eru->ru_stime, &sru->ru_stime); 1248 | 1249 | timeval_add(&sumru->ru_utime, &eru->ru_utime); 1250 | timeval_add(&sumru->ru_stime, &eru->ru_stime); 1251 | 1252 | /* 1253 | * faults in pft_loop() for test j 1254 | */ 1255 | check_test_rusage(j, sru, eru); 1256 | sumru->ru_minflt += eru->ru_minflt - sru->ru_minflt; 1257 | sumru->ru_majflt += eru->ru_majflt - sru->ru_majflt; 1258 | } 1259 | } 1260 | } 1261 | 1262 | /* 1263 | * launch() - Allocate test region, create/start threads, time threads. 
1264 | */ 1265 | void 1266 | launch() 1267 | { 1268 | struct timespec sleepfor = { 0, 500000L }; /* 0.5 msec */ 1269 | struct test_info *tip0 = comm->test_info; 1270 | int i, j, n; 1271 | 1272 | lpid = getpid(); 1273 | bind_to_cpu(launch_cpu, -2); /* assume 'launch_cpu' exists */ 1274 | 1275 | /* 1276 | * Pre-allocate test memory, outside of measurement interval 1277 | * but don't touch! 1278 | */ 1279 | alloc_test_memory(); 1280 | 1281 | comm->go = 0; /* threads will wait for go ahead */ 1282 | 1283 | set_task_scheduler(1); /* before starting threads */ 1284 | 1285 | if (start_workers() < (nr_tests -1)) { 1286 | fprintf(stderr, "Unable to create %d worker %s - aborting\n", 1287 | nr_tests, nr_thread ? "threads" : "tasks"); 1288 | exit(1); 1289 | } 1290 | 1291 | vprint(2, "%s: waiting for workers to get ready\n", __FUNCTION__); 1292 | nanosleep(&sleepfor, NULL); 1293 | do { 1294 | n = 0; 1295 | for (j = 1; j < nr_tests; j++) 1296 | if (comm->test_info[j].state == TEST_READY) 1297 | n++; 1298 | } while(n < (nr_tests - 1)); 1299 | 1300 | /* 1301 | * Initialize wall clock end time to known "early" time 1302 | */ 1303 | clock_gettime(CLOCK_REALTIME, &comm->wall_end); 1304 | 1305 | comm->go = 1; /* give all tests the go_ahead */ 1306 | pft_loop(tip0); /* run thread/process 0 test */ 1307 | tip0->state = TEST_DONE; 1308 | 1309 | /* 1310 | * Initialize clock start to known "late" time 1311 | */ 1312 | clock_gettime(CLOCK_REALTIME, &comm->wall_start); 1313 | update_walltime_and_rusage(tip0); /* for test 0 */ 1314 | 1315 | /* 1316 | * launch() thread will sleep in pthread_join() or waitpid() 1317 | * waiting for DONE tests to exit. 
1318 | */ 1319 | vprint(2, "%s: waiting for tests to finish.\n", __FUNCTION__); 1320 | do { 1321 | n = 0; 1322 | nanosleep(&sleepfor, NULL); 1323 | 1324 | for (j = 1; j < nr_tests; j++) { 1325 | struct test_info *tip = &comm->test_info[j]; 1326 | if (tip->state == TEST_REAPED) { 1327 | n++; 1328 | continue; 1329 | } 1330 | if (tip->state != TEST_DONE) 1331 | continue; /* don't reap unfinished tests */ 1332 | if (nr_thread) { 1333 | pthread_join(tip->ptid, NULL); 1334 | vprint(2, "%s: joined thread %d\n", __FUNCTION__, tip->idx); 1335 | } else { 1336 | waitpid(tip->pid, NULL, 0); 1337 | vprint(2, "%s: process %d exited\n", __FUNCTION__, tip->idx); 1338 | } 1339 | update_walltime_and_rusage(tip); 1340 | tip->state = TEST_REAPED; 1341 | n++; 1342 | } 1343 | } while(n < (nr_tests - 1)); 1344 | 1345 | calc_test_rusage(); 1346 | 1347 | if (do_pause) { 1348 | vprint(0, "pausing...\n"); 1349 | pause(); 1350 | } 1351 | 1352 | if (do_numa_maps) { 1353 | char cmdbuf[80]; 1354 | int ret = snprintf(cmdbuf, 80, 1355 | "cat /proc/%d/numa_maps | grep '^%lx'", 1356 | lpid, test_memory); 1357 | if (ret < 0) { 1358 | perror("snprintf - dumping numa_maps"); 1359 | } else { 1360 | ret = system(cmdbuf); 1361 | if (ret == -1) 1362 | perror("system(3) - dumping numa_maps"); 1363 | } 1364 | } 1365 | 1366 | exit(0); 1367 | } 1368 | 1369 | /* 1370 | * cleanup() - at exit cleanup routine 1371 | */ 1372 | static void 1373 | cleanup() 1374 | { 1375 | vprint(3, "cleanup() entered\n"); 1376 | 1377 | if (comm && comm->shmid >= 0) { 1378 | if (shmctl(comm->shmid,IPC_RMID,0) == -1) 1379 | perror("removal of test shmem failed"); 1380 | else vprint(3, "test shmem removed\n"); 1381 | comm->shmid = -1; 1382 | } 1383 | } 1384 | 1385 | int 1386 | main(int argc, char *argv[]) 1387 | { 1388 | extern int optind, opterr; 1389 | extern char *optarg; 1390 | 1391 | int i, j, c, stat, er=0; 1392 | pid_t ppid; /* parent's pid */ 1393 | char *tag = NULL; 1394 | size_t comm_size; 1395 | 1396 | long gbyte; 1397 | 
double faults_per_sec, faults_per_sec_per_cpu; 1398 | double elapsed_time, user_time, sys_time; 1399 | 1400 | ppid = getpid(); 1401 | setpgid(0, ppid); 1402 | 1403 | pagesize_init(); 1404 | bytes = pagesize; 1405 | 1406 | opterr=1; 1407 | while ((c = getopt(argc, argv, OPTSTR)) != EOF) 1408 | switch (c) { 1409 | case 'a': /* affinitize threads to cpus */ 1410 | do_cpubind++; 1411 | break; 1412 | case 'c': /* number of cachelines to touch per page */ 1413 | cachelines = atol(optarg); 1414 | break; 1415 | 1416 | case 'f': 1417 | use_sched_fifo++; 1418 | break; 1419 | 1420 | case 'l': 1421 | do_mlock++; 1422 | if (do_shmlock) { 1423 | do_shmlock = 0; 1424 | vprint(0, 1425 | "Option -l [mlock] overriding '-L' [SHM_LOCK]\n"); 1426 | } 1427 | break; 1428 | 1429 | case 'm': /* memory [size] */ 1430 | bytes = size_kmgp(optarg); 1431 | if (bytes == BOGUS_SIZE) { 1432 | vprint(0, "Bogus size: -m %s\n", optarg); 1433 | er++; 1434 | } 1435 | break; 1436 | 1437 | case 'n': /* number of processes */ 1438 | nr_proc = atol(optarg); 1439 | if (nr_proc > MAX_TESTS) { 1440 | nr_proc = MAX_TESTS; 1441 | vprint(0, "nr_proc clipped at %d\n", nr_proc); 1442 | } 1443 | if (nr_thread) { 1444 | vprint(0, "nr_proc overriding nr_thread -- switching to process mode\n"); 1445 | nr_thread = 0; 1446 | } 1447 | break; 1448 | 1449 | case 'p': /* use vma/shared policy */ 1450 | do_mempol++; 1451 | break; 1452 | 1453 | case 's': /* sleep/delay for test threads */ 1454 | sleepsec = atol(optarg); 1455 | break; 1456 | 1457 | case 't': /* number of threads */ 1458 | nr_thread = atol(optarg); 1459 | if (nr_thread > MAX_TESTS) { 1460 | nr_thread = MAX_TESTS; 1461 | vprint(0, "nr_thread clipped at %d\n", nr_thread); 1462 | } 1463 | if (nr_proc) { 1464 | vprint(0, "nr_thread overriding nr_proc -- switching to thread mode\n"); 1465 | nr_proc = 0; 1466 | } 1467 | break; 1468 | 1469 | case 'v': /* multiple times for debug verbosity */ 1470 | verbose++; 1471 | break; 1472 | 1473 | case 'z': /* bzero() test 
region instead of "touch" */ 1474 | do_bzero++; 1475 | break; 1476 | 1477 | #if 0 1478 | case 'F': /* this doesn't really work. Rip it out. */ 1479 | force++; 1480 | break; 1481 | #endif 1482 | 1483 | case 'L': /* SHM_LOCK the shared memory test area */ 1484 | do_shm++; /* assume '-S' */ 1485 | do_shmlock++; 1486 | multimap = 0; 1487 | if (do_mlock) { 1488 | do_mlock = 0; 1489 | vprint(0, 1490 | "Option -L [SHM_LOCK] overriding '-l' [mlock]\n"); 1491 | } 1492 | break; 1493 | 1494 | #if 0 1495 | case 'M': /* multimap - mmap() separate regions */ 1496 | if (do_shm) { 1497 | vprint(0, "Ignoring '-M' for shmem\n"); 1498 | } 1499 | multimap = pagesize; 1500 | break; 1501 | #endif 1502 | 1503 | case 'N': /* dump numa_maps after test */ 1504 | do_numa_maps++; 1505 | break; 1506 | 1507 | case 'P': /* pause after test */ 1508 | do_pause++; 1509 | break; 1510 | 1511 | case 'S': /* use SysV shared memory for test region */ 1512 | do_shm++; 1513 | multimap = 0; 1514 | break; 1515 | 1516 | case 'T': /* emit title and tag */ 1517 | do_title++; 1518 | break; 1519 | 1520 | case 'V': /* show version and "build stamp" */ 1521 | vprint(0, "pft_mpol " PFT_MPOL_VERSION " built " 1522 | __DATE__ " @ " __TIME__ "\n"); 1523 | exit(0); 1524 | /* NOTREACHED */ 1525 | 1526 | #ifdef USE_NOCLEAR 1527 | case 'C': /* eliminate kernel page clearing, if supported */ 1528 | no_clear++; 1529 | break; 1530 | #endif 1531 | 1532 | case 'Z': /* use /dev/zero for private mappings */ 1533 | use_dev_zero(); 1534 | break; 1535 | 1536 | case 'h': /* help */ 1537 | er++; 1538 | break; 1539 | case '?': 1540 | er = 1; 1541 | break; 1542 | } 1543 | if (er) { 1544 | printf(usage, argv[0], cacheline_size); 1545 | exit(1); 1546 | } 1547 | 1548 | if (optind < argc) { 1549 | tag = argv[optind++]; 1550 | // TODO: warn about ignoring extraneous args? 1551 | } 1552 | 1553 | cpus_init(); 1554 | cacheline_size_init(); 1555 | 1556 | check_rusage_thread(); /* can we use it? 
*/ 1557 | 1558 | if (!(nr_proc | nr_thread)) { 1559 | vprint(0, "Using 1 thread\n"); 1560 | nr_thread = 1; 1561 | } 1562 | nr_tests = nr_proc + nr_thread; 1563 | 1564 | bind_to_cpu(launch_cpu, -1); /* park mainline on 'launch_cpu' */ 1565 | 1566 | if (!force && nr_tests > nr_cpus_allowed) { 1567 | vprint(0, "clipping nr_tests to nr_cpus_allowed [%d]\n", 1568 | nr_cpus_allowed); 1569 | nr_tests = nr_cpus_allowed; 1570 | } 1571 | 1572 | if (multimap) { 1573 | test_memories = calloc(nr_tests, sizeof(*test_memories)); 1574 | if (!test_memories) 1575 | perrorx("can't alloc test_memories"); 1576 | } 1577 | 1578 | /* 1579 | * adjust [round up] sizes 1580 | */ 1581 | pages = PAGE_ALIGN(bytes) / pagesize; 1582 | pages_per_test = ROUND_UP(pages, nr_tests) / nr_tests; 1583 | pages = pages_per_test * nr_tests; 1584 | 1585 | bytes_per_test = pages_per_test * pagesize; 1586 | bytes = pages * pagesize; 1587 | gbyte = (bytes + pagesize * nr_tests) / (1024*1024*1024); 1588 | 1589 | if (pages_per_test < 1) { 1590 | vprint(0, "Memory size too small. 
" 1591 | "Need at least 1 page per test\n"); 1592 | exit(2); 1593 | } 1594 | 1595 | vprint(2, "Calculated pages=%ld,pages/test = %ld,pagesize=%ld\n", 1596 | pages, pages_per_test, pagesize); 1597 | 1598 | /* 1599 | * Register cleanup handler 1600 | */ 1601 | if (atexit(cleanup) != 0) 1602 | perrorx("atexit(cleanup) registration failed"); 1603 | 1604 | /* 1605 | * parent/child comm area 1606 | */ 1607 | comm_size = sizeof(struct pft_comm) + (nr_tests - 1) * sizeof(struct test_info); 1608 | comm = (struct pft_comm *)valloc_shared(comm_size); 1609 | bzero(comm, comm_size); 1610 | comm->shmid = -1; 1611 | 1612 | /* 1613 | * run multi-{process|thread} test in a child process 1614 | */ 1615 | if ((lpid = fork()) == 0) 1616 | launch(); 1617 | while (wait(&stat) > 0); 1618 | 1619 | if (comm->abort) 1620 | exit(1); 1621 | 1622 | elapsed_time = calc_elapsed_time(&comm->wall_start, &comm->wall_end); 1623 | vprint(2, "elapsed time = %8.2f\n", elapsed_time); 1624 | user_time = calc_cpu_time(&comm->rusage.ru_utime); 1625 | vprint(2, "user time = %8.2f\n", user_time); 1626 | sys_time = calc_cpu_time(&comm->rusage.ru_stime); 1627 | vprint(2, "system time = %8.2f\n", sys_time); 1628 | 1629 | /* 1630 | * Warn if number of minor faults differs "significantly" from 1631 | * expected value == number of pages in test memory 1632 | */ 1633 | if (labs(pages - comm->rusage.ru_minflt) > pages/10) { 1634 | vprint(0, "expected faults differ from actual faults by > 10%%\n"); 1635 | verbose = 1; /* to emit calculated, actual below */ 1636 | } 1637 | 1638 | vprint(1, "Calculated faults=%ld." 1639 | " Real minor faults=%ld," 1640 | " major faults=%ld\n", 1641 | pages, comm->rusage.ru_minflt, comm->rusage.ru_majflt); 1642 | 1643 | faults_per_sec = (double) pages / elapsed_time; 1644 | faults_per_sec_per_cpu = (double) pages / (user_time + sys_time); 1645 | 1646 | if (do_title) { 1647 | if (tag) { 1648 | /* for plot post-processing */ 1649 | printf("TAG pft:%s-%s%s:%s\n", 1650 | do_shm ? 
"shmem" : "anon", 1651 | do_mempol ? "vma-policy" : "sys-default", 1652 | do_mlock ? "-mlocked" : 1653 | do_shmlock ? "-shm_locked" : "", 1654 | tag); 1655 | } 1656 | printf(" Gb Thr CLine User System Wall" 1657 | " flt/cpu/s fault/wsec\n"); 1658 | } 1659 | 1660 | printf(" %3ld %4ld %3ld %8.2fs %8.2fs %8.2fs %10.3f %10.3f\n", 1661 | gbyte, nr_tests, cachelines, 1662 | user_time, sys_time, elapsed_time, 1663 | faults_per_sec_per_cpu, faults_per_sec); 1664 | 1665 | exit(0); 1666 | } 1667 | -------------------------------------------------------------------------------- /version.h: -------------------------------------------------------------------------------- 1 | /* 2 | * version.h - version stamp for executable, tarball 3 | */ 4 | /* 5 | * Copyright (c) 2005,2006,2007,2009 Hewlett-Packard, Inc 6 | * 7 | * This program is free software; you can redistribute it and/or modify 8 | * it under the terms of the GNU General Public License as published by 9 | * the Free Software Foundation; either version 3 of the License, or 10 | * (at your option) any later version. 11 | * 12 | * This program is distributed in the hope that it will be useful, 13 | * but WITHOUT ANY WARRANTY; without even the implied warranty of 14 | * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 15 | * GNU General Public License for more details. 16 | * 17 | * You should have received a copy of the GNU General Public License 18 | * along with this program. If not, see . 19 | */ 20 | 21 | /* 22 | * N.B. - version string may contain only [0-9.a-z+-] 23 | * else 'make tarball' will produce unintended tarball names. 24 | */ 25 | #define PFT_MPOL_VERSION "0.12x" 26 | --------------------------------------------------------------------------------