├── InstallTestDrv.reg ├── README.md ├── benchmarks ├── README.md ├── build.sh ├── loop_cmpsb.s ├── loop_cmpsw.s ├── loop_lodsb.s ├── loop_lodsw.s ├── loop_movsb.s ├── loop_movsw.s ├── loop_scasb.s ├── loop_scasw.s ├── loop_stosb.s ├── loop_stosw.s ├── rep_cmpsb.s ├── rep_cmpsw.s ├── rep_lodsb.s ├── rep_lodsw.s ├── rep_movsb.s ├── rep_movsw.s ├── rep_scasb.s ├── rep_scasw.s ├── rep_stosb.s └── rep_stosw.s ├── builddrv.bat ├── drv ├── HPCTestDrv.c └── sources ├── output ├── hpcoutput-poll.csv └── hpcoutput-sampl.csv ├── runtest.bat ├── testcode └── test.c └── tutorial ├── images ├── image1.png ├── image2.png ├── image3.png ├── image4.png ├── image5.png └── image6.png ├── test.c └── tutorial.md /InstallTestDrv.reg: -------------------------------------------------------------------------------- 1 | Windows Registry Editor Version 5.00 2 | 3 | [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HPCTestDrv] 4 | "Type"=dword:00000001 5 | "Start"=dword:00000003 6 | "ErrorControl"=dword:00000001 7 | "Group"="Base" 8 | "ImagePath"="\\SystemRoot\\System32\\Drivers\\HPCTestDrv.sys" 9 | "Description"="HPC - Sample Test Driver" 10 | "DisplayName"="HPCTestDrv" -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | HPCTool 2 | ============================== 3 | Features: 4 | ------------------------------ 5 | - Monitors hardware performance counters 6 | - Features per-process filtering 7 | - Features two modes: 8 | 1. **POLLING** mode reads performance counter values at a desired state. We instrument source code using software trap (int 2e) to indicate the desired state. 9 | 2. **SAMPLING** mode reads performance counter values at a performance monitoring interrupt (PMI), which is triggered by setting a threshold value on an event. In our code, we set a threshold value on instruction retired event. 
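To make the SAMPLING mode threshold concrete: a counter that should raise a PMI after N events is preloaded with the two's complement of N, so that it overflows after N increments. The sketch below only does that arithmetic and touches no MSRs. It assumes a 48-bit counter, which is consistent with the `0x0000FFFF` high word the driver writes alongside `pmiThreshold`; the helper names `preload_for` and `events_until_overflow` are hypothetical and do not appear in the driver.

```c
#include <stdint.h>

/* Assumed counter width in bits (hypothetical constant for illustration). */
#define COUNTER_BITS 48
#define COUNTER_MASK ((UINT64_C(1) << COUNTER_BITS) - 1)

/* Value to preload so that the counter overflows after `events` increments. */
static uint64_t preload_for(uint64_t events) {
    return (UINT64_C(0) - events) & COUNTER_MASK;
}

/* Number of increments left before a counter holding `value` wraps to zero. */
static uint64_t events_until_overflow(uint64_t value) {
    return (UINT64_C(0) - value) & COUNTER_MASK;
}

/* Split a preload value into the low and high words of an MSR write. */
static uint32_t low_word(uint64_t value)  { return (uint32_t)(value & 0xFFFFFFFFu); }
static uint32_t high_word(uint64_t value) { return (uint32_t)(value >> 32); }
```

For the default threshold of -50000, `preload_for(50000)` gives `0xFFFFFFFF3CB0`, which splits into a low word of `0xFFFF3CB0` and a high word of `0xFFFF`.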
10 | 11 | 12 | Requirements: 13 | ------------------------------ 14 | 1. The HPCTool has been tested on 32-bit Windows 7, both bare-metal and virtualized. 15 | - To use the HPCTool in a virtualized environment, verify that the virtual performance counters are enabled. 16 | For example, in VMware Fusion the following settings need to be changed: 17 | - Enable virtual CPU performance counters (e.g., on VMware -> settings: VMware->Hardware) 18 | - Add the lines below to the .vmx file (on VMware): 19 | - vpmc.enable = "TRUE" 20 | - vpmc.freezeMode="guest" 21 | 22 | 2. To build the kernel driver, install Windows Driver Kit version 7, available at [Microsoft](https://www.microsoft.com/en-us/download/confirmation.aspx?id=11800). You will need at least the "Full Development Environment" edition. 23 | 24 | Building the HPCTool: 25 | -------------------------------- 26 | 1. Open [drv/HPCTestDrv.c](./drv/HPCTestDrv.c) and make the following changes: 27 | 28 | a. Set the mode of operation to "POLLING_MODE" or "SAMPLING_MODE". 29 | 30 | ```c 31 | #define SAMPLING_MODE 32 | ``` 33 | b. Set the threshold value for generating a PMI. 34 | - In SAMPLING_MODE, set pmiThreshold as desired (a negative value; the default is -50000). 35 | - In POLLING_MODE, set pmiThreshold = 0. 36 | 37 | ```c 38 | #ifdef SAMPLING_MODE 39 | INT32 pmiThreshold = -50000; 40 | #else //polling mode 41 | INT32 pmiThreshold = 0; 42 | #endif 43 | ``` 44 | 45 | c. Change TEST_APP to the process/application that is to be monitored. 46 | 47 | ```c 48 | #define TEST_APP "test.exe" 49 | ``` 50 | 51 | d. Modify LOG_FILE to reflect your environment, e.g., "C:\\Users\\Sanjeev\\Desktop\\hpcoutput.csv" 52 | 53 | ```c 54 | #define LOG_FILE L"\\DosDevices\\C:\\Users\\Sanjeev\\Desktop\\hpcoutput.csv" 55 | ``` 56 | 57 | e. Set the event types of the performance counters EVENT0, EVENT1, EVENT2, EVENT3 to measure the events of interest. 
58 | - By default only user space events are monitored for the following events: 59 | - EVENT0: Number of branches retired, 60 | - EVENT1: Number of mis-predicted branches retired, 61 | - EVENT2: Number of last level cache references, 62 | - EVENT3: Number of last level cache misses. 63 | 64 | ```bash 65 | #define EVENT0 0x004100C4 //Branch instruction retired 66 | #define EVENT1 0x004100C5 //Mispredicted branch instructions 67 | #define EVENT2 0x00414F2E //LLC cache reference 68 | #define EVENT3 0x0041412E //LLC misses 69 | ``` 70 | 71 | E.g., to monitor Branch Instruction retired (Event Num. = C4H, Umask Value = 00H), 72 | in user space: we set EVENT0 = 0x4100C4 73 | in kernel space: we set EVENT0 = 0x4200C4 74 | 75 | Refer to our [tutorial](./tutorial/tutorial.md) for further details on configuring each event. 76 | 77 | Refer to [Intel Manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf), Chapter 19 for more information event num and umask value for performance counter events. 78 | 2. Open **x86 Checked Build Environment** command prompt with **Administrator** privilege (right click -> Run as Administrator) 79 | 3. Change the current directory to the path containing the kernel driver source: 80 | 81 | ```bash 82 | cd PATH-TO-SRC-DIR 83 | builddrv.bat 84 | ``` 85 | If the build is successful, HPCTestDrv.sys will be copied to Windows driver directory (e.g., C:\Windows\System32\drivers) 86 | 87 | 88 | 89 | Preparing the instrumented binary: 90 | -------------------------------- 91 | 1. The binary that will be measured needs to be instrumented in the source code as shown in the test example in testcode\test.c 92 | For C/C++ code, insert the following instrumentation trigger to generate the software trap before and after the section of code to be profiled: 93 | 94 | ```bash 95 | __asm __volatile{ 96 | mov eax, 19h 97 | int 0x2e 98 | } 99 | ``` 100 | 2. 
Compile the source code with a C compiler and run the HPCTool on the compiled binary. 101 | 102 | 103 | Running the HPCTool: 104 | -------------------------------- 105 | 1. Disable driver signature enforcement 106 | In the privileged command prompt opened in step 2 of the build instructions, type: 107 | 108 | ```bash 109 | bcdedit.exe -set loadoptions DDISABLE_INTEGRITY_CHECKS 110 | bcdedit.exe -set TESTSIGNING ON 111 | ``` 112 | 2. Install the kernel driver - you need to do this just once 113 | 114 | Double-click **InstallTestDrv.reg** and confirm the prompt to add the kernel driver's registry entries. **Restart** the system. 115 | 116 | 3. Create an empty file in the location specified in **LOG_FILE** 117 | 118 | 4. Open a command prompt with Administrator privileges and run the test program using the sample **runtest.bat** script. It starts the HPC driver, executes the test program, and finally stops the driver. 119 | 120 | ```bash 121 | runtest.bat 122 | ``` 123 | 124 | **Output:** HPC output is logged in the file specified as **LOG_FILE** at compile time. 125 | 126 | 127 | Output: 128 | -------------------------------- 129 | The output consists of a collection of samples. 130 | - Each sample contains the measurement of 7 events -- 3 fixed and 4 programmable/configurable events. 131 | - 3 fixed events: No. of instructions retired, logical cycles, reference cycles 132 | - 4 programmable events: No. of branches retired, mis-predicted branches retired, LLC cache references, LLC misses. 133 | - The four programmable events can be changed to address various profiling goals. However, changing the events measured requires re-compiling the kernel driver. 134 | - The data collected using the performance counters is written to a file in comma-separated value (CSV) format. 
The order of the fields is as follows: 135 | 136 | ```bash 137 | #instructions retired, #logical-cycles, #reference-cycles, #event0, #event1, #event2, #event3 138 | ``` 139 | - In the sampling mode, a data point is generated each time the number of retired instructions set by **pmiThreshold** has elapsed. 140 | - In the polling mode, only one data point is collected, after the second instrumentation trigger is invoked. 141 | 142 | Cite as: 143 | -------------------------------- 144 | **If you use this tool, please cite as:** 145 | 146 | *Das, S., Werner, J., Antonakakis, M., Polychronakis, M. and Monrose, F., 2019, May. SoK: The Challenges, Pitfalls, and Perils of Using Hardware Performance Counters for Security. To appear in Proceedings of the 40th IEEE Symposium on Security and Privacy (S&P).* -------------------------------------------------------------------------------- /benchmarks/README.md: -------------------------------------------------------------------------------- 1 | 2 | This package consists of 20 variants of the original **rep stosb** program that was used by [Weaver & McKee](https://github.com/deater/deterministic/blob/master/static/sample_code/). These benchmarks cover all 10 string operations: the byte-sized instructions (lodsb, stosb, movsb, scasb, cmpsb) and the word-sized (two-byte) instructions (lodsw, stosw, movsw, scasw, cmpsw). These instructions perform load, store, copy, scan, and comparison operations. Each variant executes one string operation 1 million times, either in an explicit `loop` (loop_*.s) or under a `rep` prefix (rep_*.s). 3 | 4 | ## Requirements: 5 | - Runs on Linux (x86_64). 6 | - GNU assembler (as) and linker (ld) 7 | 8 | ## How to build: 9 | - To build an individual program, e.g., **rep_stosb.s**: 10 | 11 | ```bash 12 | as -o rep_stosb.o rep_stosb.s 13 | ld -o rep_stosb rep_stosb.o 14 | ``` 15 | - Run **build.sh** to compile all the programs. 
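As a rough sanity check when reading counter output for these benchmarks, the expected number of user-space retired instructions can be estimated by counting instructions in the listings. The sketch below assumes that a `rep`-prefixed string operation retires as a single instruction regardless of its repeat count; that is an assumption about the hardware, not something this package states, and real measurements may deviate (for example because of interrupts or page faults). The function names are hypothetical.

```c
#include <stdint.h>

/* Instructions in the common exit sequence: xor, mov, nop, syscall. */
#define EXIT_INSNS 4

/* loop_*.s: setup instructions, then (string op + loop) per iteration, then exit. */
static uint64_t expected_loop_variant(uint64_t setup_insns, uint64_t iterations) {
    return setup_insns + 2 * iterations + EXIT_INSNS;
}

/* rep_*.s: setup instructions, one rep-prefixed instruction (assumed to retire
 * once), then exit. */
static uint64_t expected_rep_variant(uint64_t setup_insns) {
    return setup_insns + 1 + EXIT_INSNS;
}
```

Programs with a single pointer register (lods, stos, scas) have 3 setup instructions, and those with two (movs, cmps) have 4. Under these assumptions, loop_stosb would retire 2,000,007 instructions and rep_stosb only 8.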
-------------------------------------------------------------------------------- /benchmarks/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | declare -a arr=("rep_lodsb" "rep_stosb" "rep_cmpsb" "rep_movsb" "rep_scasb" "loop_lodsb" "loop_stosb" "loop_cmpsb" "loop_movsb" "loop_scasb" "rep_lodsw" "rep_stosw" "rep_cmpsw" "rep_movsw" "rep_scasw" "loop_lodsw" "loop_stosw" "loop_cmpsw" "loop_movsw" "loop_scasw") 4 | 5 | for i in "${arr[@]}" 6 | do 7 | # assemble and link each benchmark 8 | as -o "$i.o" "$i.s" 9 | ld -o "$i" "$i.o" 10 | done 11 | -------------------------------------------------------------------------------- /benchmarks/loop_cmpsb.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for cmpsb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | mov_loop: 14 | cmpsb 15 | loop mov_loop # executed 1 million times 16 | #================================ 17 | # Exit 18 | #================================ 19 | exit: 20 | xor %rdi,%rdi # return 0 21 | mov $60,%rax # SYSCALL_EXIT 22 | nop 23 | syscall # exit 24 | 25 | .bss 26 | .lcomm buffer,1000000 27 | .lcomm buffer1,1000000 28 | -------------------------------------------------------------------------------- /benchmarks/loop_cmpsw.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for cmpsw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | mov_loop: 14 | cmpsw 15 | loop mov_loop # executed 1 million times 16 | #================================ 17 | # Exit 18 | 
#================================ 19 | exit: 20 | xor %rdi,%rdi # return 0 21 | mov $60,%rax # SYSCALL_EXIT 22 | nop 23 | syscall # exit 24 | 25 | .bss 26 | .lcomm buffer,2000000 27 | .lcomm buffer1,2000000 28 | -------------------------------------------------------------------------------- /benchmarks/loop_lodsb.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for lodsb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rsi # source string 12 | mov_loop: 13 | lodsb 14 | loop mov_loop # executed 1 million times 15 | #================================ 16 | # Exit 17 | #================================ 18 | exit: 19 | xor %rdi,%rdi # return 0 20 | mov $60,%rax # SYSCALL_EXIT 21 | nop 22 | syscall # exit 23 | 24 | .bss 25 | .lcomm buffer,1000000 26 | -------------------------------------------------------------------------------- /benchmarks/loop_lodsw.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for lodsw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rsi # source string 12 | mov_loop: 13 | lodsw 14 | loop mov_loop # executed 1 million times 15 | #================================ 16 | # Exit 17 | #================================ 18 | exit: 19 | xor %rdi,%rdi # return 0 20 | mov $60,%rax # SYSCALL_EXIT 21 | nop 22 | syscall # exit 23 | 24 | .bss 25 | .lcomm buffer,2000000 26 | -------------------------------------------------------------------------------- /benchmarks/loop_movsb.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop 
prefix affects retired instructions and branches for movsb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | moving_loop: 14 | movsb 15 | loop moving_loop # executed 1 million times 16 | #================================ 17 | # Exit 18 | #================================ 19 | exit: 20 | xor %rdi,%rdi # return 0 21 | mov $60,%rax # SYSCALL_EXIT 22 | nop 23 | syscall # exit 24 | 25 | .bss 26 | .lcomm buffer,1000000 27 | .lcomm buffer1,1000000 28 | -------------------------------------------------------------------------------- /benchmarks/loop_movsw.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for movsw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | moving_loop: 14 | movsw 15 | loop moving_loop # executed 1 million times 16 | #================================ 17 | # Exit 18 | #================================ 19 | exit: 20 | xor %rdi,%rdi # return 0 21 | mov $60,%rax # SYSCALL_EXIT 22 | nop 23 | syscall # exit 24 | 25 | .bss 26 | .lcomm buffer,2000000 27 | .lcomm buffer1,2000000 28 | -------------------------------------------------------------------------------- /benchmarks/loop_scasb.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for scasb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi 12 | mov_loop: 13 | scasb 14 | loop mov_loop # executed 1 million times 15 | #================================ 16 | # Exit 17 | 
#================================ 18 | exit: 19 | xor %rdi,%rdi # return 0 20 | mov $60,%rax # SYSCALL_EXIT 21 | nop 22 | syscall # exit 23 | 24 | .bss 25 | .lcomm buffer,1000000 26 | -------------------------------------------------------------------------------- /benchmarks/loop_scasw.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for scasw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi 12 | mov_loop: 13 | scasw 14 | loop mov_loop # executed 1 million times 15 | #================================ 16 | # Exit 17 | #================================ 18 | exit: 19 | xor %rdi,%rdi # return 0 20 | mov $60,%rax # SYSCALL_EXIT 21 | nop 22 | syscall # exit 23 | 24 | .bss 25 | .lcomm buffer,2000000 26 | -------------------------------------------------------------------------------- /benchmarks/loop_stosb.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired instructions and branches for stosb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi # destination string 12 | moving_loop: 13 | stosb 14 | loop moving_loop # executed 1 million times 15 | #================================ 16 | # Exit 17 | #================================ 18 | exit: 19 | xor %rdi,%rdi # return 0 20 | mov $60,%rax # SYSCALL_EXIT 21 | nop 22 | syscall # exit 23 | 24 | .bss 25 | .lcomm buffer,1000000 26 | -------------------------------------------------------------------------------- /benchmarks/loop_stosw.s: -------------------------------------------------------------------------------- 1 | # Million loop 2 | # 3 | # See how the loop prefix affects retired 
instructions and branches for stosw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi # destination string 12 | moving_loop: 13 | stosw 14 | loop moving_loop # executed 1 million times 15 | #================================ 16 | # Exit 17 | #================================ 18 | exit: 19 | xor %rdi,%rdi # return 0 20 | mov $60,%rax # SYSCALL_EXIT 21 | nop 22 | syscall # exit 23 | 24 | .bss 25 | .lcomm buffer,2000000 26 | -------------------------------------------------------------------------------- /benchmarks/rep_cmpsb.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for cmpsb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | rep cmpsb # executed 1 million times 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,1000000 25 | .lcomm buffer1,1000000 26 | -------------------------------------------------------------------------------- /benchmarks/rep_cmpsw.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for cmpsw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | rep cmpsw # executed 1 million times 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | 
syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,2000000 25 | .lcomm buffer1,2000000 26 | -------------------------------------------------------------------------------- /benchmarks/rep_lodsb.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for lodsb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rsi # source string 12 | rep lodsb # executed 1 million times 13 | 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,1000000 25 | -------------------------------------------------------------------------------- /benchmarks/rep_lodsw.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for lodsw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rsi # source string 12 | rep lodsw # executed 1 million times 13 | 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,2000000 25 | -------------------------------------------------------------------------------- /benchmarks/rep_movsb.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for movsb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load 
counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | rep movsb # executed 1 million times 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,1000000 25 | .lcomm buffer1,1000000 26 | -------------------------------------------------------------------------------- /benchmarks/rep_movsw.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for movsw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer1,%rsi 12 | mov $buffer,%rdi 13 | rep movsw # executed 1 million times 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,2000000 25 | .lcomm buffer1,2000000 26 | -------------------------------------------------------------------------------- /benchmarks/rep_scasb.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for scasb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi 12 | rep scasb # executed 1 million times 13 | 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,1000000 25 | -------------------------------------------------------------------------------- /benchmarks/rep_scasw.s: 
-------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for scasw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi 12 | rep scasw # executed 1 million times 13 | 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,2000000 25 | -------------------------------------------------------------------------------- /benchmarks/rep_stosb.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for stosb 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi # destination string 12 | rep stosb # executed 1 million times 13 | 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 | xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,1000000 25 | -------------------------------------------------------------------------------- /benchmarks/rep_stosw.s: -------------------------------------------------------------------------------- 1 | # Million reps 2 | # 3 | # See how the rep prefix affects retired instructions and branches for stosw 4 | 5 | # For x86_64 6 | 7 | .globl _start 8 | _start: 9 | xor %rcx,%rcx # rcx = 0 10 | mov $1000000,%rcx # load counter using rcx 11 | mov $buffer,%rdi # destination string 12 | rep stosw # executed 1 million times 13 | 14 | #================================ 15 | # Exit 16 | #================================ 17 | exit: 18 
| xor %rdi,%rdi # return 0 19 | mov $60,%rax # SYSCALL_EXIT 20 | nop 21 | syscall # exit 22 | 23 | .bss 24 | .lcomm buffer,2000000 25 | -------------------------------------------------------------------------------- /builddrv.bat: -------------------------------------------------------------------------------- 1 | @echo on 2 | 3 | :: build the driver and move it into win driver dir 4 | cd drv 5 | build 6 | copy objchk_win7_x86\i386\HPCTestDrv.sys C:\Windows\System32\drivers 7 | cd ..\ -------------------------------------------------------------------------------- /drv/HPCTestDrv.c: -------------------------------------------------------------------------------- 1 | 2 | /* 3 | * Copyright University of North Carolina, 2018 4 | * Author: Sanjeev Das (sdas@cs.unc.edu) 5 | * 6 | * The context switch hooking part of the code is obtained from 7 | * https://github.com/SouhailHammou/Drivers/blob/master/SwapContextHook/swapcontext_hook.c 8 | * 9 | */ 10 | 11 | #include 12 | #include 13 | #include 14 | 15 | 16 | /***************Configurable parameters***********************/ 17 | 18 | //a) Choose mode either as SAMPLING_MODE or POLLING_MODE 19 | #define SAMPLING_MODE // SAMPLING_MODE or POLLING_MODE 20 | 21 | #ifdef SAMPLING_MODE 22 | //b) set threshold value for PMI 23 | INT32 pmiThreshold = -50000; 24 | #else 25 | //polling mode 26 | //b) set threshold as 0 27 | INT32 pmiThreshold = 0; 28 | #endif 29 | 30 | //c) Test process/application that has to be monitored 31 | #define TEST_APP "test.exe" 32 | 33 | //d) change output file path 34 | #define LOG_FILE L"\\DosDevices\\C:\\Users\\Sanjeev\\Desktop\\hpcoutput.csv" 35 | 36 | //Configure HPC events to monitor userspace events 37 | #define EVENT0 0x004100C4 //Branch instruction retired 38 | #define EVENT1 0x004100C5 //Mispredicted branch instructions 39 | #define EVENT2 0x00414F2E //LLC cache reference 40 | #define EVENT3 0x0041412E //LLC misses 41 | 42 | //maximum number of PMI that can be recorded, depends on how much 
memory can be used by Win kernel driver 43 | #define MAXVAL 1000000 44 | 45 | //buffer size required for writing/reading into the file 46 | #define BUFFER_SIZE 500 47 | 48 | /************************************************************/ 49 | 50 | 51 | /* Windows OS Function Prototypes for KMDF */ 52 | NTSTATUS MyDriverUnsupportedFunction(PDEVICE_OBJECT DeviceObject, PIRP Irp); 53 | DRIVER_UNLOAD MyDriverUnload; 54 | VOID MyDriverUnload(PDRIVER_OBJECT DriverObject); 55 | NTSTATUS DriverEntry(PDRIVER_OBJECT pDriverObject, PUNICODE_STRING pRegistryPath); 56 | NTKERNELAPI void KiDispatchInterrupt(void); 57 | 58 | 59 | typedef unsigned short WORD; 60 | typedef unsigned char BYTE; 61 | typedef unsigned long ULONG; 62 | 63 | #define SIOCTL_TYPE 40000 64 | #define IOCTL_HELLO\ 65 | CTL_CODE( SIOCTL_TYPE, 0x800, METHOD_BUFFERED, FILE_READ_DATA|FILE_WRITE_DATA) 66 | 67 | /* Compile directives. */ 68 | #pragma alloc_text(INIT, DriverEntry) 69 | #pragma alloc_text(PAGE, MyDriverUnload) 70 | #pragma alloc_text(PAGE, MyDriverUnsupportedFunction) 71 | 72 | #pragma pack(1) 73 | typedef struct _DESC { 74 | UINT16 offset00; 75 | UINT16 segsel; 76 | CHAR unused:5; 77 | CHAR zeros:3; 78 | CHAR type:5; 79 | CHAR DPL:2; 80 | CHAR P:1; 81 | UINT16 offset16; 82 | } DESC, *PDESC; 83 | #pragma pack() 84 | 85 | #pragma pack(1) 86 | typedef struct _IDTR { 87 | UINT16 bytes; 88 | UINT32 addr; 89 | } IDTR; 90 | #pragma pack() 91 | 92 | PIO_STACK_LOCATION irpSp; 93 | 94 | /* Global variable for storing old ISR address. */ 95 | UINT32 oldISRAddressPmi = NULL; 96 | UINT32 oldISRAddressTrap = NULL; 97 | 98 | 99 | 100 | //some flags used for logic 101 | int IsHpcStoredAtContextSwitch = 0; 102 | int IsTestAppRunning = 0; 103 | int IsCurrentProcessTestApp = 0; 104 | 105 | int trapCount = 0; // counts number of "int 2e" in source code 106 | int perfCounterId = 0; // identifies the counter of the 7 HPCs. 107 | int hpcCount = 0; // no. 
of times record were taken 108 | 109 | //64 bit is required for recording counter values: ecx.eax 110 | UINT64 hpcData[7][MAXVAL+1]; 111 | 112 | //Used to store/restore values at context switch 113 | UINT32 counter0LowVal = 0, counter0HighVal = 0, counter1LowVal = 0, counter1HighVal = 0, \ 114 | counter2LowVal = 0, counter2HighVal = 0, counter3LowVal = 0, counter3HighVal = 0, \ 115 | counter4LowVal = 0, counter4HighVal = 0, counter5LowVal = 0, counter5HighVal = 0, \ 116 | counter6LowVal = 0, counter6HighVal = 0; 117 | 118 | 119 | void InitializeCounters(); 120 | void WriteMSR(int lowVal, int highVal, int addr); 121 | INT64 ReadMSR(int addr); 122 | void RecordHPC(int addr); 123 | void RecordFinalSample(int lowVal, int highVal); 124 | 125 | /* 126 | * log HPC counter values in an output file; 127 | */ 128 | int LogHPCData(){ 129 | UNICODE_STRING uniName; 130 | OBJECT_ATTRIBUTES objAttr; 131 | HANDLE handle; 132 | NTSTATUS ntStatus; 133 | IO_STATUS_BLOCK ioStatusBlock; 134 | CHAR buffer[BUFFER_SIZE]; 135 | size_t cb; 136 | int i = 0; 137 | 138 | RtlInitUnicodeString(&uniName, LOG_FILE); 139 | InitializeObjectAttributes(&objAttr, &uniName,OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE, NULL, NULL); 140 | 141 | // Do not try to perform any file operations at higher IRQL levels. 
142 | if(KeGetCurrentIrql() != PASSIVE_LEVEL) 143 | return STATUS_INVALID_DEVICE_STATE; 144 | 145 | //creates an output file 146 | ntStatus = ZwCreateFile(&handle,GENERIC_WRITE,&objAttr, &ioStatusBlock, NULL,FILE_ATTRIBUTE_NORMAL, 0,FILE_OVERWRITE_IF,FILE_SYNCHRONOUS_IO_NONALERT, NULL, 0); 147 | 148 | //write recorded HPC values into an output file 149 | if(NT_SUCCESS(ntStatus)){ 150 | ntStatus = RtlStringCbPrintfA(buffer, sizeof(buffer),"ins,l_cycle,ref_cycle,event1,event2,event3,event4\r\n"); 151 | if(NT_SUCCESS(ntStatus)) { 152 | ntStatus = RtlStringCbLengthA(buffer, sizeof(buffer), &cb); 153 | if(NT_SUCCESS(ntStatus)) { 154 | ntStatus = ZwWriteFile(handle, NULL, NULL, NULL, &ioStatusBlock, buffer, cb, NULL, NULL); 155 | } 156 | } 157 | for(i=0; ioffset16; 214 | isrAddr = isrAddr << 16; 215 | isrAddr += descAddr->offset00; 216 | DbgPrint("Address of the ISR is: %x.\r\n", isrAddr); 217 | 218 | /* store old ISR address in global variable, so we can use it later */ 219 | if (service == 0xfe) 220 | oldISRAddressPmi = isrAddr; 221 | else 222 | oldISRAddressTrap = isrAddr; 223 | 224 | return isrAddr; 225 | } 226 | 227 | /* 228 | * Records HPC data 229 | */ 230 | void RecordHPCSample(INT64 combinedVal) { 231 | 232 | if(perfCounterId == 0){ 233 | #ifdef SAMPLING_MODE 234 | hpcData[perfCounterId][hpcCount] = abs((INT32)pmiThreshold-combinedVal); 235 | #else 236 | hpcData[perfCounterId][hpcCount] = combinedVal; 237 | #endif 238 | }else{ 239 | hpcData[perfCounterId][hpcCount] = combinedVal; 240 | } 241 | perfCounterId++; 242 | } 243 | 244 | /* 245 | * Hook function for software interrupt only 246 | */ 247 | __declspec(naked) HookTrap() { 248 | __asm { 249 | //save the context of hardware interrupt 250 | pushfd 251 | pushad 252 | push fs 253 | push ds 254 | push es 255 | } 256 | 257 | trapCount++; 258 | if (trapCount == 2){ 259 | perfCounterId = 0; 260 | RecordHPC(0x309); 261 | RecordHPC(0x30A); 262 | RecordHPC(0x30B); 263 | RecordHPC(0xC1); 264 | RecordHPC(0xC2); 265 | 
RecordHPC(0xC3); 266 | RecordHPC(0xC4); 267 | hpcCount++; 268 | } 269 | 270 | //zero out counters 271 | WriteMSR(0x00000000, 0x00000000, 0x309); 272 | WriteMSR(0x00000000, 0x00000000, 0x30A); 273 | WriteMSR(0x00000000, 0x00000000, 0x30B); 274 | WriteMSR(0x00000000, 0x00000000, 0xC1); 275 | WriteMSR(0x00000000, 0x00000000, 0xC2); 276 | WriteMSR(0x00000000, 0x00000000, 0xC3); 277 | WriteMSR(0x00000000, 0x00000000, 0xC4); 278 | 279 | __asm{ 280 | //Retrieve the context of hardware interrupt 281 | pop es 282 | pop ds 283 | pop fs 284 | popad 285 | popfd 286 | jmp oldISRAddressTrap 287 | } 288 | } 289 | 290 | /* 291 | * Record HPC values at PMI 292 | */ 293 | __declspec(naked) HookPMI() { 294 | __asm { 295 | //save the context of hardware interrupt 296 | pushfd 297 | pushad 298 | push fs 299 | push ds 300 | push es 301 | } 302 | 303 | if(IsCurrentProcessTestApp==1 && hpcCount<=MAXVAL){ 304 | perfCounterId = 0; 305 | RecordHPC(0x309); 306 | RecordHPC(0x30A); 307 | RecordHPC(0x30B); 308 | RecordHPC(0xC1); 309 | RecordHPC(0xC2); 310 | RecordHPC(0xC3); 311 | RecordHPC(0xC4); 312 | hpcCount++; 313 | } 314 | 315 | //set threshold for fixed_ctr0 316 | WriteMSR(pmiThreshold, 0x0000FFFF, 0x309); 317 | 318 | //Zero out remaining counters 319 | WriteMSR(0x00000000, 0x00000000, 0x30A); 320 | WriteMSR(0x00000000, 0x00000000, 0x30B); 321 | WriteMSR(0x00000000, 0x00000000, 0xC1); 322 | WriteMSR(0x00000000, 0x00000000, 0xC2); 323 | WriteMSR(0x00000000, 0x00000000, 0xC3); 324 | WriteMSR(0x00000000, 0x00000000, 0xC4); 325 | 326 | //Clear the overflow flag via IA32_PERF_GLOBAL_OVF_CTRL MSR 327 | WriteMSR(0x00000000, 0x00000001, 0x390); 328 | 329 | __asm{ 330 | //Retrieve the context of hardware interrupt 331 | pop es 332 | pop ds 333 | pop fs 334 | popad 335 | popfd 336 | 337 | //jump to original OS handler 338 | jmp oldISRAddressPmi 339 | } 340 | } 341 | 342 | /* 343 | * Hook the interrupt descriptor by overwriting its ISR pointer. 
344 | */ 345 | void HookISR(UINT16 service, UINT32 hookaddr) { 346 | UINT32 isrAddr; 347 | UINT16 hookAddrLow; 348 | UINT16 hookAddrHigh; 349 | PDESC descAddr; 350 | 351 | /* check if the ISR was already hooked */ 352 | isrAddr = GetISRAddress(service); 353 | if(isrAddr == hookaddr) { 354 | DbgPrint("The service %x already hooked.\r\n", service); 355 | } else { 356 | DbgPrint("Hooking interrupt %x: ISR %x --> %x.\r\n", service, isrAddr, hookaddr); 357 | descAddr = GetDescriptorAddress(service); 358 | DbgPrint("Hook Address: %x\r\n", hookaddr); 359 | hookAddrLow = (UINT16)hookaddr; 360 | hookaddr = hookaddr >> 16; 361 | hookAddrHigh = (UINT16)hookaddr; 362 | DbgPrint("descAddr: %x\r\n", descAddr->offset00); 363 | DbgPrint("descAddr: %x\r\n", descAddr->offset16); 364 | 365 | __asm { cli } 366 | descAddr->offset00 = hookAddrLow; 367 | descAddr->offset16 = hookAddrHigh; 368 | __asm { sti } 369 | } 370 | } 371 | 372 | /* 373 | Find a relevant process and re/store performance counter values 374 | */ 375 | void SaveRestoreCounters(){ 376 | PUCHAR pKTHREADCurr, pKTHREADNext; 377 | PUCHAR ProcessCurr, ProcessNext; 378 | PUCHAR ImageFileNameCurr, ImageFileNameNext; 379 | 380 | //check the exiting process and store HPC values if it is our test process 381 | //edi: points to the exiting thread 382 | //esi: points to the incoming thread 383 | 384 | __asm{ 385 | // obtain exiting process 386 | mov pKTHREADCurr, edi 387 | mov ecx, CR3; 388 | } 389 | ProcessCurr = *(PUCHAR*)(pKTHREADCurr + 0x50); 390 | ImageFileNameCurr = (PUCHAR)ExAllocatePoolWithTag(NonPagedPool,16,'Hbf1'); 391 | if(ImageFileNameCurr != NULL){ 392 | /*Copy the image name to an allocated space*/ 393 | strncpy((char*)ImageFileNameCurr,(char*)(ProcessCurr+0x16c), 16); 394 | 395 | //If the exiting process is a test process, we store the performance counter values 396 | if(!strcmp((char*)ImageFileNameCurr, TEST_APP)){ 397 | 398 | __asm{ 399 | //Store HPC values into memory 400 | mov ecx, 0x309 401 | rdmsr 402 | mov 
counter0LowVal, eax 403 | mov counter0HighVal, edx 404 | 405 | mov ecx, 0x30A 406 | rdmsr 407 | mov counter1LowVal, eax 408 | mov counter1HighVal, edx 409 | 410 | mov ecx, 0x30B 411 | rdmsr 412 | mov counter2LowVal, eax 413 | mov counter2HighVal, edx 414 | 415 | mov ecx, 0xC1 416 | rdmsr 417 | mov counter3LowVal, eax 418 | mov counter3HighVal, edx 419 | 420 | mov ecx, 0xC2 421 | rdmsr 422 | mov counter4LowVal, eax 423 | mov counter4HighVal, edx 424 | 425 | mov ecx, 0xC3 426 | rdmsr 427 | mov counter5LowVal, eax 428 | mov counter5HighVal, edx 429 | 430 | mov ecx, 0xC4 431 | rdmsr 432 | mov counter6LowVal, eax 433 | mov counter6HighVal, edx 434 | 435 | mov ecx, 1 436 | mov IsHpcStoredAtContextSwitch, ecx 437 | 438 | } 439 | } 440 | }else{ 441 | goto TestRegistrycleanup; 442 | } 443 | 444 | //check the incoming process and restore HPC values if the process is our test process 445 | __asm{ 446 | mov pKTHREADNext, esi 447 | } 448 | ProcessNext = *(PUCHAR*)(pKTHREADNext + 0x50); 449 | ImageFileNameNext = (PUCHAR)ExAllocatePoolWithTag(NonPagedPool,16, 'Hbf2'); 450 | if(ImageFileNameNext != NULL){ 451 | strncpy((char*)ImageFileNameNext,(char*)(ProcessNext+0x16c), 16); 452 | 453 | if(!strcmp((char*)ImageFileNameNext,TEST_APP)){ 454 | if(IsTestAppRunning==0){ 455 | //If the test process is going to run for the first time, start HPC monitoring 456 | IsTestAppRunning=1; 457 | InitializeCounters(); 458 | } 459 | IsCurrentProcessTestApp = 1; //indicates that the current process is a test process 460 | 461 | if (IsHpcStoredAtContextSwitch==1){ 462 | IsHpcStoredAtContextSwitch=0; 463 | WriteMSR(counter0LowVal, counter0HighVal, 0x309); 464 | WriteMSR(counter1LowVal, counter1HighVal, 0x30A); 465 | WriteMSR(counter2LowVal, counter2HighVal, 0x30B); 466 | WriteMSR(counter3LowVal, counter3HighVal, 0xC1); 467 | WriteMSR(counter4LowVal, counter4HighVal, 0xC2); 468 | WriteMSR(counter5LowVal, counter5HighVal, 0xC3); 469 | WriteMSR(counter6LowVal, counter6HighVal, 0xC4); 470 | } 471 | 472 | 
} 473 | else 474 | IsCurrentProcessTestApp = 0; 475 | } 476 | else{ 477 | goto TestRegistrycleanup; 478 | } 479 | 480 | //free runtime memory 481 | TestRegistrycleanup: 482 | if(ImageFileNameCurr != NULL) 483 | ExFreePoolWithTag(ImageFileNameCurr, 'Hbf1'); 484 | if(ImageFileNameNext != NULL) 485 | ExFreePoolWithTag(ImageFileNameNext, 'Hbf2'); 486 | 487 | } 488 | 489 | /* 490 | Re/store performance counter values at the context switches 491 | */ 492 | //__fastcall SwapContext(PKTHREAD CurrentThread,PKTHREAD NextThread) 493 | __declspec(naked) void HooKCS(){ 494 | SaveRestoreCounters(); 495 | /*before jumping back execute the overwritten functions*/ 496 | //807e3900 cmp byte ptr [esi+39h],0 497 | //7404 je nt!SwapContext+0xa (828bdaea) 498 | __asm{ 499 | cmp byte ptr[esi+39h],0 500 | //je address , replaced in runtime 501 | _emit 0x0F 502 | _emit 0x84 503 | 504 | _emit 0xAA 505 | _emit 0xAA 506 | _emit 0xAA 507 | _emit 0xAA 508 | //jmp just after the patched bytes 509 | _emit 0xE9 510 | 511 | _emit 0xBB 512 | _emit 0xBB 513 | _emit 0xBB 514 | _emit 0xBB 515 | } 516 | } 517 | 518 | /* 519 | Read the leftover counter values of a process that were stored during context switch for the last PMI window 520 | */ 521 | void ReadFinalSample(){ 522 | __asm { 523 | 524 | //checkmaxval1: 525 | //don't exceed the maximum times that we can monitor; only required for offline analysis; limitation comes from how much data can a kernel driver hold 526 | mov ecx, hpcCount 527 | cmp ecx, MAXVAL 528 | jae terminateRead 529 | 530 | //restore values only when store has been done at context switch 531 | mov ecx, IsHpcStoredAtContextSwitch 532 | cmp ecx, 1 533 | jne terminateRead 534 | 535 | //readleftover: 536 | mov ecx, 0 537 | mov IsHpcStoredAtContextSwitch, ecx 538 | 539 | mov DWORD PTR [perfCounterId], 0 540 | 541 | mov eax, counter0LowVal 542 | mov edx, counter0HighVal 543 | push edx 544 | push eax 545 | call RecordFinalSample 546 | 547 | mov eax, counter1LowVal 548 | mov edx, counter1HighVal 549
| push edx 550 | push eax 551 | call RecordFinalSample 552 | 553 | mov eax, counter2LowVal 554 | mov edx, counter2HighVal 555 | push edx 556 | push eax 557 | call RecordFinalSample 558 | 559 | mov eax, counter3LowVal 560 | mov edx, counter3HighVal 561 | push edx 562 | push eax 563 | call RecordFinalSample 564 | 565 | mov eax, counter4LowVal 566 | mov edx, counter4HighVal 567 | push edx 568 | push eax 569 | call RecordFinalSample 570 | 571 | mov eax, counter5LowVal 572 | mov edx, counter5HighVal 573 | push edx 574 | push eax 575 | call RecordFinalSample 576 | 577 | mov eax, counter6LowVal 578 | mov edx, counter6HighVal 579 | push edx 580 | push eax 581 | call RecordFinalSample 582 | 583 | inc DWORD PTR [hpcCount] //counts the no. of times reading were taken 584 | 585 | terminateRead: 586 | 587 | } 588 | } 589 | 590 | /* 591 | * Write into MSR registers 592 | */ 593 | void WriteMSR(int lowVal, int highVal, int addr){ 594 | __asm{ 595 | mov eax, lowVal 596 | mov edx, highVal 597 | mov ecx, addr 598 | wrmsr 599 | } 600 | } 601 | 602 | /* 603 | * Record HPC value, and store the HPC count into array 604 | */ 605 | void RecordHPC(int addr){ 606 | INT64 combinedVal = 0; 607 | combinedVal = ReadMSR(addr); 608 | RecordHPCSample(combinedVal); 609 | } 610 | 611 | /* 612 | * Extract 48-bit counter value 613 | */ 614 | INT64 Extract48BitVal(int lowVal, int highVal){ 615 | INT64 combinedHPCVal = 0; 616 | combinedHPCVal = (0x0000ffff & highVal); 617 | combinedHPCVal <<= 32; 618 | combinedHPCVal = combinedHPCVal + lowVal; 619 | return combinedHPCVal; 620 | } 621 | 622 | /* 623 | * Store HPC values for final sample into array 624 | */ 625 | void RecordFinalSample(int lowVal, int highVal){ 626 | INT64 combinedVal = 0; 627 | combinedVal = Extract48BitVal(lowVal, highVal); 628 | RecordHPCSample(combinedVal); 629 | } 630 | 631 | /* 632 | * Read MSR registers 633 | */ 634 | INT64 ReadMSR(int addr){ 635 | int lowVal = 0, highVal = 0; 636 | INT64 combinedVal = 0; 637 | 638 | __asm{ 639 | 
mov ecx, addr 640 | rdmsr 641 | mov lowVal, eax 642 | mov highVal, edx 643 | } 644 | combinedVal = Extract48BitVal(lowVal, highVal); 645 | return combinedVal; 646 | } 647 | 648 | /* 649 | * initializatizing HPCs 650 | */ 651 | void InitializeCounters(){ 652 | #ifdef SAMPLING_MODE 653 | WriteMSR(0x0000022A, 0x00000000, 0x38D); 654 | WriteMSR(pmiThreshold, 0x0000FFFF, 0x309); 655 | #else // polling_mode 656 | WriteMSR(0x00000222, 0x00000000, 0x38D); 657 | WriteMSR(0x00000000, 0x00000000, 0x309); 658 | #endif 659 | 660 | //Configure programmable counters for different events 661 | WriteMSR(EVENT0, 0x00000000, 0x186); 662 | WriteMSR(EVENT1, 0x00000000, 0x187); 663 | WriteMSR(EVENT2, 0x00000000, 0x188); 664 | WriteMSR(EVENT3, 0x00000000, 0x189); 665 | 666 | //Zero out remaining counters 667 | WriteMSR(0x00000000, 0x00000000, 0x30A); 668 | WriteMSR(0x00000000, 0x00000000, 0x30B); 669 | 670 | WriteMSR(0x00000000, 0x00000000, 0xC1); 671 | WriteMSR(0x00000000, 0x00000000, 0xC2); 672 | WriteMSR(0x00000000, 0x00000000, 0xC3); 673 | WriteMSR(0x00000000, 0x00000000, 0xC4); 674 | 675 | WriteMSR(0x0000000F, 0x00000007, 0x38F); //Enable counter globally - IA32_PERF_GLOBAL_CTRL MSR 676 | 677 | } 678 | 679 | /* 680 | * DriverEntry: entry point for drivers. 
681 | */ 682 | NTSTATUS DriverEntry(PDRIVER_OBJECT pDriverObject, PUNICODE_STRING pRegistryPath) { 683 | NTSTATUS NtStatus = STATUS_SUCCESS; 684 | unsigned int uiIndex = 0; 685 | PDEVICE_OBJECT pDeviceObject = NULL; 686 | UNICODE_STRING usDriverName, usDosDeviceName; 687 | 688 | //----------- 689 | char detourBytes[] = {0xe9,0xaa,0xbb,0xcc,0xdd,0x90}; //jmp loc_ddccbbaa; nop 690 | unsigned int savedCR0; 691 | KIRQL Irql; 692 | int i; 693 | 694 | /*KiDispatchInterrupt is exported*/ 695 | //obtain address of SwapContext using KiDispatchInterrupt 696 | 697 | PUCHAR p = (PUCHAR)KiDispatchInterrupt; 698 | unsigned int relative = *(unsigned int*)(p + 0xDE); 699 | PUCHAR SwapContext = (PUCHAR)((unsigned int)(p + 0xDD) + relative + 5); //pointer -> SwapContext 700 | PUCHAR det = (PUCHAR)HooKCS; //detour to -> HookCS 701 | //----------- 702 | 703 | DbgPrint("DriverEntry Called \r\n"); 704 | RtlInitUnicodeString(&usDriverName, L"\\Device\\MyDriver"); 705 | RtlInitUnicodeString(&usDosDeviceName, L"\\DosDevices\\MyDriver"); 706 | 707 | NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject); 708 | 709 | if(NtStatus == STATUS_SUCCESS) { 710 | /* MajorFunction: is a list of function pointers for entry points into the driver. */ 711 | for(uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++) 712 | pDriverObject->MajorFunction[uiIndex] = MyDriverUnsupportedFunction; 713 | 714 | /* DriverUnload is required to be able to dynamically unload the driver. */ 715 | pDriverObject->DriverUnload = MyDriverUnload; 716 | pDeviceObject->Flags |= 0; 717 | pDeviceObject->Flags &= (~DO_DEVICE_INITIALIZING); 718 | 719 | /* Create a Symbolic Link to the device. 
MyDriver -> \Device\MyDriver */ 720 | IoCreateSymbolicLink(&usDosDeviceName, &usDriverName); 721 | } 722 | 723 | //------------Hook context switching------------- 724 | DbgPrint("KiSwapContext at : %p\n",SwapContext); 725 | 726 | /*Implement the inline hook*/ 727 | relative = (unsigned int)HooKCS - (unsigned int)SwapContext - 5; //offset of HooKCS relative to SwapContext 728 | 729 | *(unsigned int*)&detourBytes[1] = relative; //set loc_ddccbbaf = offset -> HooKCS 730 | 731 | /*Disable write protection*/ 732 | __asm{ 733 | push eax 734 | mov eax,CR0 735 | mov savedCR0,eax 736 | and eax,0xFFFEFFFF 737 | mov CR0,eax 738 | pop eax 739 | } 740 | 741 | /*Set the detour function jump addresses in runtime*/ 742 | for(i=0;;i++){ 743 | if(det[i] == 0xAA && det[i+1] == 0xAA && det[i+2] == 0xAA && det[i+3] == 0xAA) 744 | break; 745 | } 746 | /*set the relative address for the conditional jump ()*/ 747 | //je nt!SwapContext+0xa 748 | *(unsigned int*)&det[i] = (unsigned int)((SwapContext+0xa) - (det+i-2) - 6); 749 | 750 | 751 | /*set the relative address for the jump back to SwapContext*/ 752 | for(;;i++){ 753 | if(det[i] == 0xBB && det[i+1] == 0xBB && det[i+2] == 0xBB && det[i+3] == 0xBB) 754 | break; 755 | } 756 | //jmp SwapContext+6 757 | *(unsigned int*)&det[i] = (unsigned int)((SwapContext + 6) - (det+i-1) - 5); 758 | 759 | /*Raise IRQL to patch safely*/ 760 | Irql = KeRaiseIrqlToDpcLevel(); 761 | 762 | /*implement the patch*/ 763 | for(i=0;i<6;i++){ 764 | SwapContext[i] = detourBytes[i]; //jmp loc_ddccbbaf; nop 765 | } 766 | KeLowerIrql(Irql); 767 | 768 | /*restore the write protection*/ 769 | __asm{ 770 | push eax 771 | mov eax,savedCR0 772 | mov CR0,eax 773 | pop eax 774 | } 775 | //------------------------------------ 776 | 777 | #ifdef SAMPLING_MODE 778 | 779 | // We hook the default PMI handling by OS using interrupt descriptor table. "0xFE" is vector for PMI. 
780 | HookISR(0xfe, (UINT32)HookPMI); 781 | #else 782 | //Hook the software interrupt 783 | HookISR(0x2e, (UINT32)HookTrap); //Also tested with "0x03" interrupt 784 | #endif 785 | 786 | return NtStatus; 787 | } 788 | 789 | /* 790 | * MyDriverUnload: called when the driver is unloaded. 791 | */ 792 | VOID MyDriverUnload(PDRIVER_OBJECT DriverObject) { 793 | UNICODE_STRING usDosDeviceName; 794 | NTSTATUS NtStatus = STATUS_SUCCESS; 795 | int i=0; 796 | 797 | //---------For Hooking Context Switch--- 798 | unsigned int savedCR0; 799 | KIRQL Irql; 800 | PUCHAR p = (PUCHAR)KiDispatchInterrupt; 801 | unsigned int relative = *(unsigned int*)(p + 0xDE); 802 | PUCHAR KiSwapContext = (PUCHAR)((unsigned int)(p+0xDD) + relative + 5); 803 | //807e3900 cmp byte ptr [esi+39h],0 804 | //7404 je nt!SwapContext+0xa (828bdaea) 805 | char savedOps[] = {0x80,0x7e,0x39,0x00,0x74,0x04}; //cmp byte ptr [esi+0x39], 0; je loc_0000000a 806 | //------------------------------ 807 | 808 | //log the leftover counter values of a process that were stored during context switch for the last PMI window 809 | #ifdef SAMPLING_MODE 810 | ReadFinalSample(); 811 | #endif 812 | 813 | 814 | #ifdef SAMPLING_MODE 815 | if(oldISRAddressPmi != NULL) { 816 | HookISR(0xfe, (UINT32)oldISRAddressPmi); 817 | } 818 | #else 819 | if(oldISRAddressTrap != NULL) { 820 | HookISR(0x2e, (UINT32)oldISRAddressTrap); //also tested with other interrupts such as "0x03" 821 | } 822 | #endif 823 | //------------------------------ 824 | 825 | /*Restore write protection*/ 826 | __asm{ 827 | push eax 828 | mov eax,CR0 829 | mov savedCR0,eax 830 | and eax,0xFFFEFFFF 831 | mov CR0,eax 832 | pop eax 833 | } 834 | Irql = KeRaiseIrqlToDpcLevel(); 835 | for(i=0;i<6;i++){ 836 | KiSwapContext[i] = savedOps[i]; 837 | } 838 | KeLowerIrql(Irql); 839 | __asm{ 840 | push eax 841 | mov eax,savedCR0 842 | mov CR0,eax 843 | pop eax 844 | } 845 | //------------Un-hooking context switch ends------------- 846 | 847 | /* delete the driver */ 848 | 
RtlInitUnicodeString(&usDosDeviceName, L"\\DosDevices\\MyDriver"); 849 | IoDeleteSymbolicLink(&usDosDeviceName); 850 | IoDeleteDevice(DriverObject->DeviceObject); 851 | 852 | //logs HPC data into output file 853 | LogHPCData(); 854 | 855 | } 856 | 857 | /* 858 | * MyDriverUnsupportedFunction: called when a major function is issued that isn't supported. 859 | */ 860 | NTSTATUS MyDriverUnsupportedFunction(PDEVICE_OBJECT DeviceObject, PIRP Irp) { 861 | NTSTATUS NtStatus = STATUS_NOT_SUPPORTED; 862 | DbgPrint("MyDriverUnsupportedFunction Called \r\n"); 863 | return NtStatus; 864 | } 865 | -------------------------------------------------------------------------------- /drv/sources: -------------------------------------------------------------------------------- 1 | TARGETNAME=HPCTestDrv 2 | TARGETTYPE=DRIVER 3 | SOURCES=HPCTestDrv.c 4 | -------------------------------------------------------------------------------- /output/hpcoutput-poll.csv: -------------------------------------------------------------------------------- 1 | ins,l_cycle,ref_cycle,event1,event2,event3,event4 2 | 7890,0,65550,1777,145,2302,1917 3 | -------------------------------------------------------------------------------- /output/hpcoutput-sampl.csv: -------------------------------------------------------------------------------- 1 | ins,l_cycle,ref_cycle,event1,event2,event3,event4 2 | 50039,0,190500,9803,477,4240,1512 3 | 50086,0,86700,12871,371,2215,593 4 | 50024,0,45450,12711,188,321,131 5 | 50018,0,197400,10824,504,6119,1171 6 | 50007,0,178650,11922,1163,4775,1592 7 | 50087,0,111900,11475,758,2662,448 8 | 50107,0,77850,12425,822,906,299 9 | 50160,0,66300,12266,577,611,228 10 | 50014,0,68100,12405,663,520,139 11 | 50078,0,116400,11556,585,3566,753 12 | 50070,0,79350,12471,647,1809,284 13 | 50036,0,67050,12050,554,1119,112 14 | 50044,0,119850,11278,574,2564,950 15 | 50091,0,48600,11891,252,380,124 16 | 50035,0,66900,11773,320,955,190 17 | 50129,0,67200,11960,458,633,309 18 | 
50029,0,125550,11706,384,2670,640 19 | 50072,0,94500,10671,403,1706,343 20 | 50152,0,42150,6533,134,1058,276 21 | 50100,0,52500,8532,261,989,302 22 | 50186,0,43200,12268,260,420,114 23 | 50017,0,92850,11283,480,2449,659 24 | 50075,0,113250,13278,644,2637,396 25 | 50111,0,63000,12488,192,1923,47 26 | 50061,0,94050,12697,472,3985,209 27 | 50068,0,57750,12504,172,1330,1 28 | 50116,0,54600,12419,95,1446,20 29 | 50042,0,83550,11677,332,2981,425 30 | 50099,0,94200,12005,308,2513,399 31 | 50108,0,56850,12471,126,1468,25 32 | 50065,0,59100,12539,93,1777,11 33 | 50036,0,78600,11308,247,3266,301 34 | 50073,0,81450,8899,165,1943,455 35 | 50111,0,63000,13606,255,1756,31 36 | 50168,0,53850,12669,111,1180,20 37 | 50154,0,60150,11254,159,2080,149 38 | 50070,0,80100,9521,419,1996,490 39 | 50143,0,34050,7721,105,275,138 40 | 50111,0,49050,10634,332,370,16 41 | 50043,0,43350,11624,200,68,10 42 | 50005,0,55800,10845,263,477,110 43 | 50076,0,166050,13541,549,5572,4181 44 | 50072,0,48150,12499,157,1336,158 45 | 50205,0,43350,12352,92,1326,47 46 | 50091,0,54900,13225,216,1730,457 47 | 50105,0,45750,12338,115,1054,44 48 | 50186,0,43200,12325,83,1066,23 49 | 50127,0,112200,12556,367,3583,2034 50 | 5690,0,19650,1174,98,648,380 51 | -------------------------------------------------------------------------------- /runtest.bat: -------------------------------------------------------------------------------- 1 | @echo on 2 | 3 | sc start HPCTestDrv 4 | timeout 1 5 | 6 | testcode\test.exe 7 | timeout 5 8 | 9 | sc stop HPCTestDrv 10 | timeout 1 -------------------------------------------------------------------------------- /testcode/test.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | 4 | 5 | void main() 6 | { 7 | //Instrument to generate trap 8 | __asm __volatile { 9 | mov eax, 19h 10 | int 0x2e 11 | } 12 | 13 | printf("Hello World!\n"); 14 | printf("Good bye cruel world.\n"); 15 | 16 | //Instrument to generate trap 17 | __asm 
__volatile{ 18 | mov eax, 19h; 19 | int 0x2e 20 | } 21 | } 22 | 23 | -------------------------------------------------------------------------------- /tutorial/images/image1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNCSecLab/hpc/9e9e54dd38ddd400bbf8c55b727e4caea64c467d/tutorial/images/image1.png -------------------------------------------------------------------------------- /tutorial/images/image2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNCSecLab/hpc/9e9e54dd38ddd400bbf8c55b727e4caea64c467d/tutorial/images/image2.png -------------------------------------------------------------------------------- /tutorial/images/image3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNCSecLab/hpc/9e9e54dd38ddd400bbf8c55b727e4caea64c467d/tutorial/images/image3.png -------------------------------------------------------------------------------- /tutorial/images/image4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNCSecLab/hpc/9e9e54dd38ddd400bbf8c55b727e4caea64c467d/tutorial/images/image4.png -------------------------------------------------------------------------------- /tutorial/images/image5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNCSecLab/hpc/9e9e54dd38ddd400bbf8c55b727e4caea64c467d/tutorial/images/image5.png -------------------------------------------------------------------------------- /tutorial/images/image6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNCSecLab/hpc/9e9e54dd38ddd400bbf8c55b727e4caea64c467d/tutorial/images/image6.png -------------------------------------------------------------------------------- 
/tutorial/test.c: -------------------------------------------------------------------------------- 1 | //--------Matrix multiplication-------- 2 | #include 3 | #include 4 | 5 | int main() 6 | { 7 | int first[10][10], second[10][10], multiply[10][10]; 8 | int m, n, p, q, c, d, k, sum; 9 | sum = 0, m = n = p = q = 10; 10 | 11 | //Instrument to generate software interrupt 12 | __asm __volatile{ 13 | mov eax, 19h 14 | int 0x2e 15 | } 16 | 17 | for (c = 0; c < m; c++) 18 | for (d = 0; d < n; d++) { 19 | first[c][d]=2; 20 | second[d][c]=2; 21 | } 22 | for (c = 0; c < m; c++) { 23 | for (d = 0; d < q; d++) { 24 | for (k = 0; k < p; k++) { 25 | sum = sum + first[c][k]*second[k][d]; 26 | } 27 | multiply[c][d] = sum; 28 | sum = 0; 29 | } 30 | } 31 | printf("Product of entered matrices:-\n"); 32 | for (c = 0; c < m; c++) { 33 | for (d = 0; d < q; d++) 34 | printf("%d\t", multiply[d][c]); 35 | printf("\n"); 36 | } 37 | 38 | //Instrument to generate software interrupt 39 | __asm __volatile{ 40 | mov eax, 19h 41 | int 0x2e 42 | } 43 | 44 | return 0; 45 | } -------------------------------------------------------------------------------- /tutorial/tutorial.md: -------------------------------------------------------------------------------- 1 | Tutorial on Hardware Performance Counters 2 | =========== 3 | 4 | # Introduction 5 | In this tutorial you will learn how to use hardware performance counters. 6 | 7 | # Motivation 8 | Let's suppose you wrote a program and it correctly achieved the specific goal (i.e., it computes the expected value). However, you noticed that its execution is far slower than one would expect. You’ve spent several agonizing hours looking over the source code, but you are unable to determine the root cause. To that end, you’ve decided you need to look more closely at the program’s runtime performance to see what the bottlenecks might be. 

```bash
//--------Matrix multiplication------//
#include <stdio.h>

int main()
{
    int first[10][10], second[10][10], multiply[10][10];
    int m, n, p, q, c, d, k, sum;
    sum = 0, m = n = p = q = 10;

    for (c = 0; c < m; c++)
        for (d = 0; d < n; d++) {
            first[c][d]=2;
            second[d][c]=2;
        }
    for (c = 0; c < m; c++) {
        for (d = 0; d < q; d++) {
            for (k = 0; k < p; k++) {
                sum = sum + first[c][k]*second[k][d];
            }
            multiply[c][d] = sum;
            sum = 0;
        }
    }
    printf("Product of entered matrices:-\n");
    for (c = 0; c < m; c++) {
        for (d = 0; d < q; d++)
            printf("%d\t", multiply[d][c]);
        printf("\n");
    }

    return 0;
}
```

Luckily, you have taken some computer science courses before, and you remember that a basic metric for evaluating the performance of a program is instructions per cycle (IPC), i.e., how many instructions are executed per CPU clock cycle. Thus, the task at hand is to obtain that measurement for your program.

To obtain the number of instructions and total cycles taken during execution, you can run the program on a CPU simulator or use a source/binary instrumentation technique. However, these tools do not provide an actual IPC measurement, as they are simulated environments.

*So, how do you measure the performance of a program on real hardware?*

Approach: Hardware performance counters (HPCs) offer a solution.


# 1. Introduction to Hardware Performance Counter (HPC)

Modern processors (such as those from Intel and AMD) provide HPCs, which measure events related to instructions, memory, and execution behavior in the CPU pipeline. During program execution you can measure events such as instructions, cycles, cache (L1/L2) accesses, translation lookaside buffer accesses, and main memory accesses. On Intel processors, this functional unit is named the Performance Monitoring Unit (PMU).

In this tutorial, we will use code snippets from our HPC driver, available at **[../drv/HPCTestDrv.c](../drv/HPCTestDrv.c)**.

**Our objective: Obtain the number of retired instructions and CPU cycles using HPCs to compute IPC**

Intel processors provide 3 fixed counters, each 48 bits wide, that measure the number of instructions, cycles per logical core, and cycles per core, respectively. The measurements from these counters allow us to compute the IPC.

**Note:** Modern processors divide each physical core into two or more logical cores to maximize the utilization of the CPU pipeline, similar to multithreading in software (Intel calls this **Hyperthreading** technology).

## I. How to obtain HPC data?

The basic idea is to configure the HPCs before your program starts execution. Once your program terminates, you read the counter data.

### A. Configure HPC
HPCs can be configured only in kernel mode, so you can write a custom kernel driver to configure them. HPCs are configured through model-specific registers (MSRs), which are written and read with the instructions **wrmsr** and **rdmsr**, respectively. We can also use a user-mode wrapper to load and unload the kernel driver.

In this tutorial, we will refer to the kernel functions WriteMSR and ReadMSR, which are defined below:

```bash
/*
 * Write into MSR registers
 */
void WriteMSR(int lowVal, int highVal, int addr){
    __asm{
        mov eax, lowVal
        mov edx, highVal
        mov ecx, addr
        wrmsr
    }
}

WriteMSR(0x0000022A, 0x00000000, 0x38D);
```

```bash
/*
 * Read MSR registers
 */
INT64 ReadMSR(int addr){
    int lowVal = 0, highVal = 0;
    INT64 combinedVal = 0;

    __asm{
        mov ecx, addr
        rdmsr
        mov lowVal, eax
        mov highVal, edx
    }
    combinedVal = Extract48BitVal(lowVal, highVal);
    return combinedVal;
}

INT64 combinedVal = ReadMSR(addr);
```
The **rdmsr** and **wrmsr** assembly instructions use the value in the ecx register to select which MSR to access. Since MSRs are 64 bits wide, two 32-bit registers are needed to transfer a value: eax holds the lower 32 bits and edx holds the upper 32 bits.

HPCs can be configured in three steps (as per Chapter 18 of the Intel SDM):

#### 1. Configure fixed counters via IA32_FIXED_CTR_CTRL MSR (0x38D)
- Set bits 1, 5, and 9 to ‘1’ to enable the 3 fixed counters (CTR0, CTR1, CTR2) to count events in user mode only; set the other bits to ‘0’.
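
As a rough sketch of how this control value is put together (the helper and macro names are hypothetical and are not part of the driver): each fixed counter owns a 4-bit field in IA32_FIXED_CTR_CTRL, in which the second bit enables user-mode counting and the fourth bit requests a PMI on overflow. Setting only the user-mode bits yields 0x222, the value the driver writes in POLLING mode. The value 0x22A, which the driver writes in SAMPLING mode, additionally sets bit 3 so that CTR0 raises a PMI when it overflows.

```c
#include <assert.h>
#include <stdint.h>

/* Flags inside one 4-bit per-counter field of IA32_FIXED_CTR_CTRL. */
#define FIXED_EN_USR 0x2u /* count while in user mode */
#define FIXED_PMI    0x8u /* raise a PMI when the counter overflows */

/* Hypothetical helper: shift a 4-bit field into place for fixed counter ctr (0..2). */
static uint32_t fixed_ctr_field(unsigned ctr, uint32_t flags)
{
    assert(ctr < 3 && flags <= 0xFu);
    return flags << (4u * ctr);
}

/* User-mode counting on all three fixed counters: bits 1, 5 and 9. */
static uint32_t fixed_ctrl_usr_only(void)
{
    return fixed_ctr_field(0, FIXED_EN_USR)
         | fixed_ctr_field(1, FIXED_EN_USR)
         | fixed_ctr_field(2, FIXED_EN_USR);
}

/* Same as above, plus a PMI on overflow of fixed counter 0 (bit 3). */
static uint32_t fixed_ctrl_usr_with_pmi(void)
{
    return fixed_ctrl_usr_only() | fixed_ctr_field(0, FIXED_PMI);
}
```

Building the value from named fields makes it easier to see which bit differs between the two modes than comparing the raw hexadecimal constants.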



```bash
WriteMSR(0x0000022A, 0x00000000, 0x38D);
```

#### 2. Enable all fixed counters via IA32_PERF_GLOBAL_CTRL MSR (0x38F)
- Set the bits corresponding to the fixed counters to ‘1’ and the other bits to ‘0’.
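
The enable mask can be sketched in the same way. In IA32_PERF_GLOBAL_CTRL, bits 0 to 3 enable the programmable counters and bits 32 to 34 enable the three fixed counters, so the value spans both halves of the 64-bit register. The helper names below are hypothetical; the driver simply passes the two 32-bit halves to WriteMSR as constants, and its low half of 0xF also enables the four programmable counters introduced later in this tutorial.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the 64-bit IA32_PERF_GLOBAL_CTRL value.
 * Bits 0..3 enable programmable counters PMC0..PMC3,
 * bits 32..34 enable fixed counters CTR0..CTR2. */
static uint64_t global_ctrl(unsigned num_pmc, unsigned num_fixed)
{
    uint64_t pmc_mask;
    uint64_t fixed_mask;

    assert(num_pmc <= 4 && num_fixed <= 3);
    pmc_mask   = (1ull << num_pmc) - 1;
    fixed_mask = ((1ull << num_fixed) - 1) << 32;
    return pmc_mask | fixed_mask;
}

/* wrmsr takes the value as edx:eax, so split it into two 32-bit halves. */
static uint32_t low32(uint64_t v)  { return (uint32_t)(v & 0xFFFFFFFFu); }
static uint32_t high32(uint64_t v) { return (uint32_t)(v >> 32); }
```

With four programmable and three fixed counters, the low half is 0x0000000F and the high half is 0x00000007, which are the arguments the driver passes for MSR 0x38F.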



```bash
WriteMSR(0x0000000F, 0x00000007, 0x38F);
```

#### 3. Zero out the counters to clear any previous values
- You need to zero out the counters during configuration to collect accurate measurements. To zero out a counter, write the value 0 to its address.

```bash
WriteMSR(0x00000000, 0x00000000, 0x309);
```

### B. Read HPC data
- Once the program finishes execution, you need to obtain the counter data.

- To read the performance counter values, we use the kernel function ReadMSR, which reads the MSR at the specified address.
    - Fixed counter addresses (addr): 0x309, 0x30A, 0x30B
    - Execute the **rdmsr** instruction.
    - The 48-bit counter value is returned in the edx:eax registers (upper 16 bits in edx, lower 32 bits in eax).

```bash
INT64 combinedVal = ReadMSR(addr);
```

Finally, you can calculate the instructions per cycle (IPC) of your program from the HPC data as follows:

```bash
1. Execute wrapper program
    - Configures the HPC counters at the driver startup
2. Run your program (e.g., matrix multiplication)
3. Stop the wrapper program
    - Stops the HPC driver and outputs the HPC data
```

## II. Tutorial 1 :computer:
- The test program (i.e., matrix multiplication) is shown in [./test.c](./test.c); it is instrumented with “int 2e”. Compile it using a C compiler.
- Configure and build the HPC driver in **POLLING** mode by following the steps in [README.md](../README.md).
- Estimate the IPC of your test program using the instruction and cycle counts.

### Findings :bulb:
- For my program, IPC = 0.781, with no. of instructions = 7348872, logical cycles = 9402846, core cycles = 8634600.
- What is the IPC measurement in your case? Why is the IPC low for my program? Modern processors can generally achieve an IPC greater than 1.
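
The arithmetic behind the first finding can be sketched as follows. This is a minimal user-space illustration rather than driver code; `compute_ipc` is a hypothetical helper, and the inputs are the instruction and logical-cycle counts quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: instructions per cycle from two raw counter readings. */
static double compute_ipc(uint64_t instructions, uint64_t cycles)
{
    assert(cycles != 0); /* a zero cycle count means nothing was measured */
    return (double)instructions / (double)cycles;
}
```

Calling `compute_ipc(7348872, 9402846)` gives roughly 0.78, matching the reported IPC. Passing the core-cycle count (8634600) instead gives roughly 0.85, so it matters which cycle counter is used as the denominator.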

### Discussion :thought_balloon:
- How can we get more information about our program to find the reason for the low performance? Can we monitor other parameters too?
- Why are the logical cycle and core cycle values different, even though our VM has only one logical core?
- Does each run have the same counter values?
- Why does each run produce different counter values for the same test program?
- Is the counter data related only to our test program?
- What are the sources of noise?
- How can we obtain counter data per process?


# 2. Programmable Monitoring Counters (PMCs)
In the previous section, we learned how to monitor instructions and CPU cycles using HPCs. The instructions-per-cycle metric gives an idea of how our program performs on the hardware during execution. Still, we want to learn more about its execution behavior. For example, we need to know whether the program exhibits an abnormal number of memory accesses, cache misses, or branch mispredictions. These are costly operations, and understanding their behavior can yield deeper insight into the root cause of the program's poor performance.

**Our objective: Understand how to measure cache, memory, and branch related events during execution.**

*Approach:* The Performance Monitoring Unit (PMU) in Intel CPUs allows monitoring of events related to the CPU microarchitecture, such as branches, caches, memory, and translation lookaside buffers. There are hundreds of such events. In addition to the 3 fixed counters that we used in the previous part, the PMU provides 4 programmable counters (PMCs) to monitor 4 additional events. The events to be monitored are selected using MSRs. A comprehensive list of the performance monitoring events is given in [Chapter 19 of Intel's Software Developer's Manual (SDM)](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf).
Next, we will use PMCs to monitor these events. 185 | 186 | ## I. How to obtain PMC data? 187 | 188 | Similar to the fixed performance counters, we configure PMCs using MSRs. 189 | 190 | ### A. Configure PMC 191 | - Follow the steps below to configure PMCs: 192 | 193 | #### 1. Configure PMCs via IA32_PERFEVTSELx MSRs 194 | - The event select code and its unit mask can be obtained from Chapter 19 of the Intel SDM 195 | - Set bit 16 to ‘1’ to count only in usermode 196 | - Set bit 22 to ‘1’ to enable the counter 197 | - Set the remaining bits to ‘0’ 198 | - Load the address of the IA32_PERFEVTSELx MSR into ecx 199 | - Addresses are: 0x186, 0x187, 0x188, 0x189 (see Chapter 35 of Intel SDM) 200 | 201 |


202 | 203 | Here we aim to monitor the following events -- branch mispredictions, L3 cache misses, number of load operations, and number of store operations. 204 | 205 | ```bash 206 | //Configure HPC events to monitor userspace events 207 | #define EVENT0 0x004100C5 //Number of mispredicted branch instructions 208 | #define EVENT1 0x004101C2 //Number of L3 cache misses 209 | #define EVENT2 0x004181D0 //Number of load operations 210 | #define EVENT3 0x004182D0 //Number of store operations 211 | 212 | WriteMSR(EVENT0, 0x00000000, 0x186); 213 | WriteMSR(EVENT1, 0x00000000, 0x187); 214 | WriteMSR(EVENT2, 0x00000000, 0x188); 215 | WriteMSR(EVENT3, 0x00000000, 0x189); 216 | ``` 217 | 218 | #### 2. Zero out the counter values 219 | - Addresses of PMCs: 0xC1, 0xC2, 0xC3, 0xC4 220 | 221 | ```bash 222 | WriteMSR(0x00000000, 0x00000000, 0xC1); 223 | WriteMSR(0x00000000, 0x00000000, 0xC2); 224 | WriteMSR(0x00000000, 0x00000000, 0xC3); 225 | WriteMSR(0x00000000, 0x00000000, 0xC4); 226 | ``` 227 | 228 | #### 3. Enable all PMCs via IA32_PERF_GLOBAL_CTRL MSR (0x38F) 229 | - Set bits corresponding to PMCs to ‘1’, otherwise ‘0’. 230 | - Load the address of the MSR (0x38F) into the ecx register. 231 | 232 |


233 | 234 | ```bash 235 | WriteMSR(0x0000000F, 0x00000007, 0x38F); 236 | ``` 237 | 238 | ### B. Read PMC data 239 | - After the program finishes executing, you need to obtain the PMC data. 240 | - Follow the steps below: 241 | - Use the ecx register to set the address of the corresponding programmable counter 242 | - PMC addresses (addr) -- 0xC1, 0xC2, 0xC3, 0xC4 243 | - Execute the **rdmsr** instruction. 244 | - The 48-bit counter data is returned in the **edx:eax** registers 245 | 246 | ```bash 247 | INT64 combinedVal = ReadMSR(addr); 248 | ``` 249 | 250 | You can run your program as stated below: 251 | ```bash 252 | 1. Execute wrapper program 253 | - Configures the HPC counters at the driver initialization 254 | 2. Run your program (e.g., matrix multiplication) 255 | 3. Stop the wrapper program 256 | - Stops the HPC driver and outputs the HPC data 257 | ``` 258 | 259 | ## II. Tutorial 2 :computer: 260 | In this tutorial, we aim to obtain PMC data, in addition to the fixed counter data. As discussed above, we will use PMCs to monitor 4 events -- branch mispredictions, L3 cache misses, no. of loads and no. of stores. 261 | - Configure and build your HPC driver in **POLLING** mode (follow the steps in [README.md](../README.md)). 262 | - Run the test program, i.e., matrix multiplication. 263 | - Observe the events. 264 | 265 | ### Findings :bulb: 266 | - For my program, the counter values are as follows: 267 | - Inst. = 9233128, logical cycles = 10451837, core cycles = 9527850 268 | - Branch miss = 50525, L3 cache miss = 167232, 269 | - Loads = 2736803, Stores = 1437746 270 | 271 | - Now that we have obtained the PMC data, let's discuss some of the parameters from our result. 272 | - IPC (inst./logical cycles) = **0.883** 273 | - Branch miss rate (Branch miss/Inst.) = 0.547% 274 | - L3 cache miss rate (L3 cache miss/Inst.) = **1.811%** 275 | - Loads/Inst. = **29.641%** 276 | - Stores/Inst. = 15.571% 277 | - Loads/Stores = **1.9035** 278 | 279 | So, what do the highlighted numbers reveal?
280 | - IPC: overall execution seems slow (IPC is below 1). 281 | - The load and store counts show that memory operations make up a large share of the instructions (roughly 45%). 282 | - Nearly twice as many load operations as store operations were executed. 283 | 284 | ### Assignment :notebook: 285 | We have several other events related to memory load operations: 286 | - mem_load_retired.l1_hit, mem_load_retired.l2_hit, mem_load_retired.l3_hit 287 | - mem_load_retired.l1_miss, mem_load_retired.l2_miss, mem_load_retired.l3_miss 288 | - mem_load_retired.fb_hit 289 | 290 | **Which event will you choose for evaluating the performance? :pensive:** 291 | 292 | ### Discussion :thought_balloon: 293 | We found that our program executes slowly and performs many memory operations. 294 | 1. Is the L3 cache miss value normal? 295 | 2. Which part of the program has low performance? The aggregate counts do not tell us. 296 | 297 | # 3. Performance Monitoring Interrupt (PMI) 298 | So far we have learned how to monitor and measure events using fixed and programmable counters. Using these counters, we can gain information about the overall execution behavior of the program on the CPU. However, as discussed above, the overall program behavior does not tell us precisely which part of the program has low performance. To gain this knowledge, we must be able to sample the events for each window (of a certain length) during the program execution. In other words, we must be able to record the HPC data for each window during execution. 299 | 300 | **How to monitor events at a fine-grained level using HPCs?** 301 | - The PMU has a solution -- the Performance Monitoring Interrupt (PMI). The PMU allows fixed and programmable counters to generate a PMI when a preset threshold value is reached. So, we can generate a PMI for each window, whose size is determined by the threshold preset on that counter. For example, we can monitor events for every 512,000 retired instructions during the execution.
This will help us understand which window has slow performance, i.e., a higher number of branch misses or memory operations. 302 | 303 | ## I. How to obtain HPC data using PMI? 304 | - First, we need to set a threshold limit on a counter to generate a PMI. At each PMI, we need to reset the threshold value to continue generating PMIs. 305 | 306 | ### A. PMI Configuration 307 | - Follow the steps below to configure PMI: 308 | 309 | #### 1. Configure counter for PMI 310 | - Enable PMI on CTR0 via IA32_FIXED_CTR_CTRL MSR (0x38D) 311 | - Assign ‘1’ to bit 3 to enable PMI on fixed counter CTR0 312 | - Assign ‘1’ to bits 1, 5 & 9 to make the 3 fixed counters (CTR0, CTR1, CTR2) count events in usermode only; set all other bits to ‘0’ 313 | 314 |


315 | 316 | ```bash 317 | WriteMSR(0x0000022A, 0x00000000, 0x38D); 318 | ``` 319 | #### 2. Configure programmable counters as stated in Section [2-I.A](#a-configure-pmc) 320 | 321 | #### 3. Set a threshold value for a counter 322 | - Select fixed counter 0 (CTR0) to generate a PMI for every 512,000 instructions retired 323 | - Set threshold value = -512000 on CTR0 as: 324 | 325 | ```bash 326 | INT32 pmiThreshold = -512000; 327 | 328 | WriteMSR(pmiThreshold, 0x0000FFFF, 0x309); 329 | ``` 330 | #### 4. Zero out the counters 331 | 332 | #### 5. Enable the counters via IA32_PERF_GLOBAL_CTRL MSR (0x38F) 333 | - Set bits corresponding to the counters to ‘1’, otherwise ‘0’. 334 | 335 |


336 | 337 | ```bash 338 | WriteMSR(0x0000000F, 0x00000007, 0x38F); 339 | ``` 340 | 341 | ### B. Read and reset MSRs at PMI 342 | - Follow the steps below to collect data and reset MSRs: 343 | 344 | #### 1. Read counter data 345 | - Read the values of the fixed and programmable counters: 346 | - Fixed counters MSRs -- 0x309, 0x30A, 0x30B 347 | - Programmable counter MSRs -- 0xC1, 0xC2, 0xC3, 0xC4 348 | 349 | ```bash 350 | INT64 combinedVal = ReadMSR(addr); 351 | ``` 352 | 353 | #### 2. Reset the threshold value and zero out counters as done in Section [3-I.A.3](#3-set-a-threshold-value-for-a-counter) and [3-I.A.4](#4-zero-out-the-counters) 354 | - Clear the overflow flag via IA32_PERF_GLOBAL_OVF_CTRL MSR (0x390) 355 | - Clear overflow on CTR0 to continue monitoring -- set bit 32 to ‘1’ 356 | 357 |


358 | 359 | ```bash 360 | WriteMSR(0x00000000, 0x00000001, 0x390); 361 | ``` 362 | 363 | ## II. Tutorial 3 :computer: 364 | - In this tutorial, we aim to sample PMC data, in addition to the fixed counter data, at each PMI. As discussed above, we will use PMCs to monitor 4 events -- branch mispredictions, L3 cache misses, no. of loads and no. of stores. 365 | - Configure and build your HPC driver in **SAMPLING** mode by following the instructions in [README.md](../README.md). 366 | - Run the test program (i.e., matrix multiplication). 367 | - Observe the events. 368 | - Plot the graph. 369 | 370 | ### Findings :bulb: 371 | - I plotted the graph for the parameters we observed in Tutorial 3. My test program produced the following graph. 372 | 373 |

[Figure: graphs of the counter values sampled in Tutorial 3]

374 | 375 | Can we understand the performance behavior from these graphs? 376 | - From the graphs, the exact window where performance is low is not evident, because this example (matrix multiplication) does not perform terribly. However, using this example, we learnt how to use the HPCs. :smile: 377 | 378 | ### Assignment :notebook: 379 | - Can we observe the behavior for other events? 380 | 381 | ### Discussion :thought_balloon: 382 | - How can we obtain counter data per-process? 383 | - We monitor context switches to save the HPC values for a process. 384 | 385 | ## III. Frequently Asked Questions :question: 386 | 387 | 1. Can we generate PMI on multiple counters? 388 | - Yes, we can generate a PMI for more than one counter by following configuration steps similar to those in PMI Configuration. 389 | 390 | 2. How to check which counter triggered the PMI? Is it necessary to check? 391 | - We can read the IA32_PERF_GLOBAL_STATUS MSR to check which counter triggered the PMI. When a counter reaches its preset threshold value, it overflows and the corresponding bit in this MSR is set to ‘1’. At each PMI, we can check this MSR so that we can perform different operations for different counter overflows. 392 | 393 | **Check overflow status at each PMI** 394 | - via IA32_PERF_GLOBAL_STATUS MSR (0x38E) 395 | - when an overflow occurs, the corresponding counter bit is 1 396 | 397 |


398 | 399 | ```bash 400 | INT64 combinedVal = ReadMSR(addr); 401 | ``` 402 | 403 | ## Recommended Readings: :books: 404 | - Intel Software Developer's Manual Chapter 18: Performance Monitoring (https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf). 405 | - Intel Software Developer's Manual Chapter 19: Performance-Monitoring Events (https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf). 406 | - Intel Software Developer's Manual Chapter 35: Model-Specific Registers (MSRs) (https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf). 407 | - Tutorial - Linux kernel profiling with perf [https://perf.wiki.kernel.org/index.php/Tutorial](https://perf.wiki.kernel.org/index.php/Tutorial) 408 | - Computer Architecture: A Quantitative Approach. David Patterson, John L. Hennessy 409 | 410 | Cite as: :star: 411 | -------------------------------- 412 | **If you use this tool, please cite as:** 413 | 414 | *Das, S., Werner, J., Antonakakis, M., Polychronakis, M. and Monrose, F., 2019, May. SoK: The Challenges, Pitfalls, and Perils of Using Hardware Performance Counters for Security. To appear in Proceedings of the 40th IEEE Symposium on Security and Privacy (S&P).* 415 | --------------------------------------------------------------------------------