├── LICENSE ├── Makefile ├── README.md ├── arm └── linux │ ├── 100.main │ └── simple_main.S │ ├── 110.system_call │ ├── syscall_exit.S │ ├── syscall_read.S │ └── syscall_write.S │ ├── 120.libc_call │ ├── call_putchar.S │ ├── call_puts.S │ └── call_variable_args.S │ ├── 300.operation │ ├── add.S │ ├── load.S │ ├── load_size.S │ └── store.S │ ├── 310.floating │ └── fadd.S │ ├── 400.control_flow │ ├── call_ret.S │ ├── for.S │ └── if.S │ ├── 700.sync │ ├── atomic_ll_sc.S │ ├── atomic_op.S │ ├── memory_ordering.S │ └── memory_ordering_rel_ack.S │ ├── 710.sync_parts │ ├── counter_cas.S │ ├── counter_fetch_and_add.S │ ├── counter_llsc.S │ └── counter_swap.S │ ├── 750.multi │ ├── pthread.S │ ├── pthread2.S │ └── sync_threads.S │ ├── E00.perf_expt │ ├── README.md │ ├── branch_miss_few.S │ ├── branch_miss_many.S │ ├── cache_miss_few.S │ ├── cache_miss_many.S │ ├── latency_add.S │ ├── latency_fadd.S │ ├── latency_load.S │ ├── latency_mul.S │ ├── ooo_perform.S │ ├── ooo_wait.S │ ├── superscalar_add.S │ ├── tlb_miss_few.S │ └── tlb_miss_many.S │ ├── E05.sys_expt │ ├── README.md │ ├── excep_load.S │ └── syscall_write.S │ ├── E10.multi_expt │ ├── README.md │ ├── cacheline_different.S │ ├── cacheline_same.S │ ├── counter_atomic.S │ ├── counter_bad.S │ ├── ordering_force.S │ ├── ordering_unexpected.S │ ├── weak_ordering_force.S │ ├── weak_ordering_force_relacq.S │ └── weak_ordering_unexpected.S │ ├── Makefile │ └── README.md ├── riscv └── linux │ ├── 100.main │ └── simple_main.S │ ├── 110.system_call │ ├── syscall_exit.S │ ├── syscall_read.S │ └── syscall_write.S │ ├── 120.libc_call │ ├── call_putchar.S │ ├── call_puts.S │ └── call_variable_args.S │ ├── 300.operation │ ├── add.S │ ├── load.S │ ├── load_size.S │ └── store.S │ ├── 310.floating │ └── fadd.S │ ├── 400.control_flow │ ├── call_ret.S │ ├── for.S │ └── if.S │ ├── 700.sync │ ├── atomic_ll_sc.S │ ├── atomic_op.S │ └── memory_ordering.S │ ├── 710.sync_parts │ ├── counter_fetch_and_add.S │ ├── counter_llsc.S │ └── 
counter_swap.S │ ├── 750.multi │ ├── pthread.S │ ├── pthread2.S │ └── sync_threads.S │ ├── Makefile │ └── README.md └── x86 └── linux ├── 100.main └── simple_main.S ├── 110.system_call ├── syscall_exit.S ├── syscall_read.S └── syscall_write.S ├── 120.libc_call ├── call_putchar.S ├── call_puts.S └── call_variable_args.S ├── 200.time ├── rdtsc.S ├── rdtsc2.S ├── rdtsc3.S ├── read_rtc.S ├── read_rtc2.S └── syscall_time.S ├── 300.operation ├── add.S ├── load.S ├── load_size.S └── store.S ├── 310.floating └── fadd.S ├── 400.control_flow ├── call_ret.S ├── for.S └── if.S ├── 700.sync ├── atomic_op.S └── memory_ordering.S ├── 710.sync_parts ├── counter_cas.S ├── counter_fetch_and_add.S └── counter_swap.S ├── 750.multi ├── clone.S ├── pthread.S ├── pthread2.S └── sync_threads.S ├── E00.perf_expt ├── README.md ├── branch_miss_few.S ├── branch_miss_many.S ├── cache_miss_few.S ├── cache_miss_many.S ├── latency_add.S ├── latency_fadd.S ├── latency_load.S ├── latency_mul.S ├── ooo_perform.S ├── ooo_wait.S ├── superscalar_add.S ├── tlb_miss_few.S └── tlb_miss_many.S ├── E05.sys_expt ├── README.md ├── excep_div.S ├── excep_load.S ├── io_read_pci.S ├── io_read_rtc.S ├── latency_io_read.S └── syscall_write.S ├── E10.multi_expt ├── README.md ├── cacheline_different.S ├── cacheline_same.S ├── counter_atomic.S ├── counter_bad.S ├── ordering_force.S └── ordering_unexpected.S ├── Makefile └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2021, Takenobu Tani 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. 
Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
30 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | 2 | # Makefile for common 3 | 4 | 5 | TARGETS := $(basename $(wildcard *.S)) 6 | 7 | 8 | .PHONY: default 9 | default: 10 | @echo "Use with arguments" 11 | 12 | all : $(TARGETS) 13 | 14 | %: %.S 15 | $(CC) $(ASFLAGS) $(LDFLAGS) $< -o $@ 16 | 17 | %.s : %.c 18 | $(CC) $(CFLAGS) -S $< 19 | 20 | %.disasm : % 21 | $(OBJDUMP) $(OBJDUMPFLAGS) -D $< 22 | 23 | %.o.disasm : %.o 24 | $(OBJDUMP) $(OBJDUMPFLAGS) -d $< 25 | 26 | 27 | %.gdb : % 28 | $(GDB) -q -ex "layout regs" -ex "b main" $(GDBFLAGS) $< 29 | 30 | 31 | clean: 32 | rm -f *.o 33 | rm -f a.out 34 | 35 | cleanexe: 36 | rm -f $(TARGETS) 37 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | CPU assembly examples 3 | ===================== 4 | 5 | * This repo contains tiny assembly examples for various CPUs (x86, Arm, and RISC-V). 6 | * The examples cover system calls, library calls, load/store, if/for/call, barriers, atomics, and threads. 7 | * This repo focuses on CPU hardware, not assembly notation or ABI conventions. 
8 | 9 | 10 | ## Contents 11 | 12 | * [x86(x86_64)/linux](x86/linux) 13 | * [Arm(Armv8 aarch64)/linux](arm/linux) 14 | * [RISC-V(RV64G)/linux](riscv/linux) 15 | 16 | 17 | ## An example 18 | 19 | 100.main/simple_main.S: 20 | 21 | ```asm 22 | .global main 23 | 24 | main: 25 | ret 26 | ``` 27 | 28 | 29 | ## Assemble and execute 30 | 31 | ``` 32 | $ cd 33 | $ make -f ../Makefile # assemble 34 | $ ./ # execute 35 | ``` 36 | -------------------------------------------------------------------------------- /arm/linux/100.main/simple_main.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Simple main example for Arm(Armv8 aarch64)/Linux 3 | */ 4 | 5 | .global main 6 | 7 | main: 8 | ret 9 | -------------------------------------------------------------------------------- /arm/linux/110.system_call/syscall_exit.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * exit(2) system-call 5 | * * see: man 2 exit 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* exit(2) system-call */ 14 | mov x8, 93 /* system-call number: exit() */ 15 | mov x0, 0 /* status: success */ 16 | svc 0 /* system call */ 17 | -------------------------------------------------------------------------------- /arm/linux/110.system_call/syscall_read.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * read(2) system-call 5 | * * see: man 2 read 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* read(2) system-call */ 14 | mov x8, 63 /* system-call number: read() */ 15 | mov 
x0, 0 /* fd: stdin */ 16 | adr x1, buf /* buf: */ 17 | mov x2, 3 /* count: */ 18 | svc 0 /* system call */ 19 | 20 | /* write(2) system-call */ 21 | mov x8, 64 /* system-call number: write() */ 22 | mov x0, 1 /* fd: stdout */ 23 | adr x1, buf /* buf: */ 24 | mov x2, 3 /* count: */ 25 | svc 0 /* system call */ 26 | 27 | /* return from main */ 28 | ret 29 | 30 | 31 | /* read-write zero initialized data */ 32 | .bss 33 | buf: 34 | .space 10 35 | -------------------------------------------------------------------------------- /arm/linux/110.system_call/syscall_write.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * write(2) system-call 5 | * * see: man 2 write 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* write(2) system-call */ 14 | mov x8, 64 /* system-call number: write() */ 15 | mov x0, 1 /* fd: stdout */ 16 | adr x1, msg /* buf: */ 17 | mov x2, 13 /* count: */ 18 | svc 0 /* system call */ 19 | 20 | 21 | /* return from main */ 22 | ret 23 | 24 | 25 | /* read-only data */ 26 | .section .rodata 27 | msg: 28 | .string "Hello world!\n" 29 | -------------------------------------------------------------------------------- /arm/linux/120.libc_call/call_putchar.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * putchar(3) standard-library 5 | * * see: man 3 putchar 6 | */ 7 | 8 | .global main 9 | 10 | main: 11 | /* save fp,lr registers */ 12 | stp x29, x30, [sp, -16]! 
13 | 14 | 15 | /* putchar(3) library-call */ 16 | mov x0, 0x41 /* 'A' */ 17 | bl putchar 18 | 19 | mov x0, 0x42 /* 'B' */ 20 | bl putchar 21 | 22 | mov x0, 'C' 23 | bl putchar 24 | 25 | mov x0, '\n' 26 | bl putchar 27 | 28 | 29 | /* restore fp,lr registers and return from main*/ 30 | ldp x29, x30, [sp], 16 31 | ret 32 | -------------------------------------------------------------------------------- /arm/linux/120.libc_call/call_puts.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * puts(3) standard-library 5 | * * see: man 3 puts 6 | */ 7 | 8 | .global main 9 | 10 | main: 11 | /* save fp,lr registers */ 12 | stp x29, x30, [sp, -16]! 13 | 14 | 15 | /* puts(3) library-call */ 16 | adr x0, msg /* 1st argument */ 17 | bl puts 18 | 19 | 20 | /* restore fp,lr registers and return from main*/ 21 | ldp x29, x30, [sp], 16 22 | ret 23 | 24 | 25 | /* read-only data */ 26 | .section .rodata 27 | msg: 28 | .string "Hello world!" 29 | -------------------------------------------------------------------------------- /arm/linux/120.libc_call/call_variable_args.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * printf(3) standard-library 5 | * * variable argument call 6 | * * see: man 3 printf 7 | */ 8 | 9 | .global main 10 | 11 | main: 12 | /* save fp,lr registers */ 13 | stp x29, x30, [sp, -16]! 
14 | 15 | mov x9, 100 /* test data */ 16 | 17 | 18 | /* printf(3) library-call */ 19 | adr x0, msg /* 1st argument */ 20 | mov x1, x9 /* 2nd argument */ 21 | bl printf 22 | 23 | 24 | /* restore fp,lr registers and return from main*/ 25 | ldp x29, x30, [sp], 16 26 | ret 27 | 28 | 29 | /* read-only data */ 30 | .section .rodata 31 | msg: 32 | .string "x9 = %d\n" 33 | -------------------------------------------------------------------------------- /arm/linux/300.operation/add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * add operation 5 | */ 6 | 7 | .global main 8 | 9 | main: 10 | /* save fp,lr registers */ 11 | stp x29, x30, [sp, -16]! 12 | 13 | 14 | /* Add operation */ 15 | mov x11, 1 16 | mov x12, 2 17 | add x10, x11, x12 /* x10 <- x11 + x12 */ 18 | 19 | 20 | /* printf for result-checking */ 21 | adr x0, fmt /* 1st argument for printf*/ 22 | mov x1, x10 /* 2nd argument for printf*/ 23 | bl printf 24 | 25 | 26 | /* restore fp,lr registers and return from main*/ 27 | ldp x29, x30, [sp], 16 28 | ret 29 | 30 | 31 | /* read-only data */ 32 | .section .rodata 33 | fmt: 34 | .string "x11 + x12 = %x\n" 35 | -------------------------------------------------------------------------------- /arm/linux/300.operation/load.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * load operation 5 | */ 6 | 7 | .global main 8 | 9 | main: 10 | /* save fp,lr registers */ 11 | stp x29, x30, [sp, -16]! 
12 | 13 | 14 | /* Load operation */ 15 | adr x10, byte_array 16 | ldr x11, [x10] /* load (64 bits) */ 17 | 18 | 19 | /* printf for result-checking */ 20 | adr x0, fmt /* 1st argument for printf*/ 21 | mov x1, x11 /* 2nd argument for printf*/ 22 | bl printf 23 | 24 | 25 | /* restore fp,lr registers and return from main*/ 26 | ldp x29, x30, [sp], 16 27 | ret 28 | 29 | 30 | /* read-only data */ 31 | .section .rodata 32 | .balign 8 33 | fmt: 34 | .string "byte_array = %llx\n" /* %llx for long long hex */ 35 | 36 | byte_array: 37 | .quad 0xfedcba9876543210 38 | -------------------------------------------------------------------------------- /arm/linux/300.operation/load_size.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * load operations for each size with zero extension 5 | * 6 | * * 64-bit load with ldr instruction and `x`-register 7 | * * 32-bit load with ldr instruction and `w`-register 8 | * * 16-bit load with ldrh instruction and `w`-register 9 | * * 8-bit load with ldrb instruction and `w`-register 10 | */ 11 | 12 | .global main 13 | 14 | main: 15 | /* save fp,lr registers */ 16 | stp x29, x30, [sp, -16]! 
17 | 18 | 19 | /* Load operation */ 20 | adr x10, byte_array 21 | ldr x11, [x10] /* load (64 bits) */ 22 | ldr w12, [x10] /* load (32 bits) */ 23 | ldrh w13, [x10] /* load (16 bits) */ 24 | ldrb w14, [x10] /* load (8 bits) */ 25 | 26 | 27 | /* printf for result-checking */ 28 | adr x0, fmt /* 1st argument for printf */ 29 | mov x1, x11 /* 2nd argument */ 30 | mov x2, x12 /* 3rd argument */ 31 | mov x3, x13 /* 4th argument */ 32 | mov x4, x14 /* 5th argument */ 33 | bl printf 34 | 35 | 36 | /* restore fp,lr registers and return from main*/ 37 | ldp x29, x30, [sp], 16 38 | ret 39 | 40 | 41 | /* read-only data */ 42 | .section .rodata 43 | .balign 8 44 | fmt: 45 | .string "x11, x12, x13, x14 = \n%016llx\n%016llx\n%016llx\n%016llx\n" 46 | 47 | byte_array: 48 | .quad 0xfedcba9876543210 49 | -------------------------------------------------------------------------------- /arm/linux/300.operation/store.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * store operation 5 | */ 6 | 7 | .global main 8 | 9 | main: 10 | /* save fp,lr registers */ 11 | stp x29, x30, [sp, -16]! 
12 | 13 | 14 | /* Store operation */ 15 | adr x10, memory_buf 16 | mov x11, 0x3210 /* set 64-bit immediate */ 17 | movk x11, 0x7654, lsl 16 18 | movk x11, 0xba98, lsl 32 19 | movk x11, 0xfedc, lsl 48 20 | str x11, [x10] /* store (64 bits) */ 21 | 22 | 23 | /* Load operation for result-checking*/ 24 | ldr x12, [x10] /* load (64 bits) */ 25 | 26 | /* printf for result-checking */ 27 | adr x0, fmt /* 1st argument for printf*/ 28 | mov x1, x12 /* 2nd argument for printf*/ 29 | bl printf 30 | 31 | 32 | /* restore fp,lr registers and return from main*/ 33 | ldp x29, x30, [sp], 16 34 | ret 35 | 36 | 37 | /* read-only data */ 38 | .section .rodata 39 | fmt: 40 | .string "memory_buf = %016llx\n" /* %llx for long long hex */ 41 | 42 | 43 | /* read-write initialized data */ 44 | .data 45 | .balign 8 46 | memory_buf: 47 | .quad 0 48 | -------------------------------------------------------------------------------- /arm/linux/310.floating/fadd.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Floating-point example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Double-precision (64-bit) 5 | * * Floating-point add operation 6 | */ 7 | 8 | .global main 9 | 10 | main: 11 | /* save fp,lr registers */ 12 | stp x29, x30, [sp, -16]! 
13 | 14 | 15 | /* Floating-point add operation */ 16 | adr x10, value1 17 | ldr d1 , [x10] /* d1 <- 1.0 */ 18 | adr x10, value2 19 | ldr d2 , [x10] /* d2 <- 2.0 */ 20 | 21 | fadd d0, d1, d2 /* d0 <- d1 + d2 */ 22 | 23 | 24 | /* printf for result-checking */ 25 | adr x0, fmt /* 1st argument for printf */ 26 | // mov d0, d0 /* 2nd argument for printf */ 27 | bl printf 28 | 29 | 30 | /* restore fp,lr registers and return from main*/ 31 | ldp x29, x30, [sp], 16 32 | ret 33 | 34 | 35 | /* read-only data */ 36 | .section .rodata 37 | value1: 38 | .double 1.0 39 | value2: 40 | .double 2.0 41 | 42 | fmt: 43 | .string "d1 + d2 = %f\n" 44 | -------------------------------------------------------------------------------- /arm/linux/400.control_flow/call_ret.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `call` and `ret` structure example 5 | * 6 | * * C-like pseudo-code: 7 | * void main() { 8 | * func_inc(100); 9 | * } 10 | * 11 | * int func_inc(int x0) { 12 | * return (x0 + 1); 13 | * } 14 | */ 15 | 16 | .global main 17 | 18 | main: 19 | /* save fp,lr registers */ 20 | stp x29, x30, [sp, -16]! 
21 | 22 | 23 | /* call */ 24 | mov x0, 100 /* passing 1st argument with x0 */ 25 | bl func_inc /* call */ 26 | mov x11, x0 /* preserve return value (x0) */ 27 | 28 | 29 | /* printf for result-checking */ 30 | adr x0, fmt /* 1st argument for printf */ 31 | mov x1, x11 /* 2nd argument */ 32 | bl printf 33 | 34 | /* restore fp,lr registers and return from main*/ 35 | ldp x29, x30, [sp], 16 36 | ret 37 | 38 | 39 | /* function declaration */ 40 | func_inc: 41 | add x0, x0, 1 /* function body */ 42 | ret /* return to caller */ 43 | 44 | 45 | /* read-only data */ 46 | .section .rodata 47 | fmt: 48 | .string "x0 = %d\n" 49 | -------------------------------------------------------------------------------- /arm/linux/400.control_flow/for.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `for`-structure example 5 | * 6 | * * C-like pseudo-code: 7 | * x12 = 0; 8 | * for (x11 = 1; x11 <= 10; x11++) { 9 | * x12 = x12 + x11; 10 | * } 11 | */ 12 | 13 | .global main 14 | 15 | main: 16 | /* save fp,lr registers */ 17 | stp x29, x30, [sp, -16]! 
18 | 19 | 20 | /* accumulator for test */ 21 | mov x12, 0 /* x12 = 0 */ 22 | 23 | 24 | /* `for`-structure */ 25 | mov x11, 1 /* x11 = 1 */ 26 | 27 | L_loop: 28 | add x12, x12, x11 /* loop-body (x12 = x12 + x11) */ 29 | 30 | add x11, x11, 1 /* x11++ */ 31 | cmp x11, 10 /* x11 <= 10 */ 32 | ble L_loop 33 | 34 | 35 | /* printf for result-checking */ 36 | adr x0, fmt /* 1st argument for printf */ 37 | mov x1, x12 /* 2nd argument */ 38 | bl printf 39 | 40 | 41 | /* restore fp,lr registers and return from main*/ 42 | ldp x29, x30, [sp], 16 43 | ret 44 | 45 | 46 | /* read-only data */ 47 | .section .rodata 48 | fmt: 49 | .string "x12 = %d\n" 50 | -------------------------------------------------------------------------------- /arm/linux/400.control_flow/if.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `if`-structure example 5 | * 6 | * * C-like pseudo-code: 7 | * if (x11 >= 0) { 8 | * x12 = 1 9 | * } else { 10 | * x12 = -1 11 | * } 12 | */ 13 | 14 | .global main 15 | 16 | main: 17 | /* save fp,lr registers */ 18 | stp x29, x30, [sp, -16]! 
19 | 20 | 21 | /* test data */ 22 | mov x11, 10 /* Change here */ 23 | 24 | 25 | /* if (x11 >= 0) */ 26 | cmp x11, 0 27 | b.lt L_else 28 | 29 | /* then */ 30 | mov x12, 1 31 | b L_endif 32 | 33 | /* else */ 34 | L_else: 35 | mov x12, -1 36 | 37 | /* endif */ 38 | L_endif: 39 | 40 | 41 | /* printf for result-checking */ 42 | adr x0, fmt /* 1st argument for printf */ 43 | mov x1, x11 /* 2nd argument */ 44 | mov x2, x12 /* 3rd argument */ 45 | bl printf 46 | 47 | 48 | /* restore fp,lr registers and return from main*/ 49 | ldp x29, x30, [sp], 16 50 | ret 51 | 52 | 53 | /* read-only data */ 54 | .section .rodata 55 | fmt: 56 | .string "x11, x12 = %d, %d\n" 57 | -------------------------------------------------------------------------------- /arm/linux/700.sync/atomic_ll_sc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Atomic operation example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * atomic operations with load-linked / store-conditional. 5 | * operations with weak consistency (without release/acquire). 6 | * 7 | * * see: Arm Architecture Reference Manual Armv8, for Armv8-A 8 | * architecture profile, 9 | * Chapter B2 The AArch64 Application Level Memory Model 10 | * Appendix K11 Barrier Litmus Tests 11 | */ 12 | 13 | .global main 14 | 15 | main: 16 | /* load-linked - store-conditional */ 17 | adr x10, flag 18 | ldxr x11, [x10] /* load-linked */ 19 | 20 | stxr w12, x11, [x10] /* store-conditional */ 21 | 22 | 23 | /* return from main */ 24 | ret 25 | 26 | 27 | /* read-write initialized data */ 28 | .data 29 | .balign 8 30 | flag: 31 | .quad 0 32 | -------------------------------------------------------------------------------- /arm/linux/700.sync/atomic_op.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Atomic operation example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * atomic operations with weak consistency (without release/acquire). 
5 | * * swap (swp) 6 | * * compare and swap (cas) 7 | * * fetch and add (ldadd) 8 | * 9 | * * see: Arm Architecture Reference Manual Armv8, for Armv8-A 10 | * architecture profile, 11 | * Chapter B2 The AArch64 Application Level Memory Model 12 | * Appendix K11 Barrier Litmus Tests 13 | */ 14 | 15 | .global main 16 | 17 | main: 18 | /* atomic operations */ 19 | adr x10, flag 20 | swp x11, x12, [x10] /* swap */ 21 | 22 | cas x11, x12, [x10] /* compare and swap */ 23 | 24 | ldadd x11, x12, [x10] /* fetch and add */ 25 | 26 | 27 | /* return from main */ 28 | ret 29 | 30 | 31 | /* read-write initialized data */ 32 | .data 33 | .balign 8 34 | flag: 35 | .quad 0 36 | -------------------------------------------------------------------------------- /arm/linux/700.sync/memory_ordering.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Memory ordering example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * memory ordering operations. 5 | * * data memory barrier (dmb) 6 | * * data synchronization barrier (dsb) 7 | * * instruction synchronization barrier (isb) 8 | * 9 | * * see: Arm Architecture Reference Manual Armv8, for Armv8-A 10 | * architecture profile, 11 | * Chapter B2 The AArch64 Application Level Memory Model 12 | * Appendix K11 Barrier Litmus Tests 13 | */ 14 | 15 | .global main 16 | 17 | main: 18 | /* data memory barrier */ 19 | dmb ld /* load barrier */ 20 | 21 | dmb st /* store barrier */ 22 | 23 | dmb sy /* load/store barrier */ 24 | 25 | 26 | /* data synchronization barrier */ 27 | dsb sy 28 | 29 | 30 | /* instruction synchronization barrier */ 31 | isb 32 | 33 | 34 | /* return from main */ 35 | ret 36 | -------------------------------------------------------------------------------- /arm/linux/700.sync/memory_ordering_rel_ack.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Memory ordering example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * memory ordering operations with 
release/acquire consistency. 5 | * 6 | * * see: Arm Architecture Reference Manual Armv8, for Armv8-A 7 | * architecture profile, 8 | * Chapter B2 The AArch64 Application Level Memory Model 9 | * Appendix K11 Barrier Litmus Tests 10 | */ 11 | 12 | .global main 13 | 14 | main: 15 | /* release - acquire */ 16 | adr x10, flag 17 | stlr x0, [x10] /* release */ 18 | 19 | 20 | ldar x0, [x10] /* acquire */ 21 | 22 | 23 | /* return from main */ 24 | ret 25 | 26 | 27 | /* read-write initialized data */ 28 | .data 29 | .balign 8 30 | flag: 31 | .quad 0 32 | -------------------------------------------------------------------------------- /arm/linux/710.sync_parts/counter_cas.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * cas instruction 5 | * 6 | * * C-like pseudo-code: 7 | * L_loop: 8 | * x1 = *counter; // read counter 9 | * x2 = x1; // save current counter 10 | * x5 = x1 + 1; // increment 11 | * 12 | * // cas instruction 13 | * tmp = *counter; // read counter 14 | * if (x1 == tmp) *counter = x5; // swap if unchanged 15 | * x1 = tmp; 16 | * 17 | * // check 18 | * if (x1 != x2) goto L_loop; // loop if fail 19 | */ 20 | 21 | .global main 22 | 23 | main: 24 | /* save fp,lr registers for printf */ 25 | stp x29, x30, [sp, -16]! 
26 | 27 | 28 | 29 | /* set the address of the counter */ 30 | adr x10, counter 31 | 32 | 33 | /* increment the counter with cas loop */ 34 | L_loop: 35 | ldr x1, [x10] /* x1 = *counter */ 36 | mov x2, x1 /* save current counter */ 37 | add x5, x1, 1 /* increment */ 38 | 39 | cas x1, x5, [x10] /* cas instruction */ 40 | 41 | cmp x1, x2 /* check the result of cas */ 42 | bne L_loop 43 | 44 | 45 | 46 | /* printf for result-checking */ 47 | adr x0, fmt /* 1st argument for printf */ 48 | ldr x1, [x10] /* 2nd argument */ 49 | bl printf 50 | 51 | 52 | /* restore fp,lr registers and return from main */ 53 | ldp x29, x30, [sp], 16 54 | ret 55 | 56 | 57 | /* read-only data */ 58 | .section .rodata 59 | fmt: 60 | .string "counter = %d\n" 61 | 62 | 63 | /* read-write initialized data */ 64 | .data 65 | .balign 8 66 | counter: 67 | .quad 10 68 | -------------------------------------------------------------------------------- /arm/linux/710.sync_parts/counter_fetch_and_add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * fetch-and-add instruction 5 | * 6 | * * C-like pseudo-code: 7 | * x1 = 1; 8 | * *counter = *counter + x1; 9 | */ 10 | 11 | .global main 12 | 13 | main: 14 | /* save fp,lr registers for printf */ 15 | stp x29, x30, [sp, -16]! 
16 | 17 | 18 | 19 | /* set the address of the counter */ 20 | adr x10, counter 21 | 22 | /* increment the counter with fetch-and-add */ 23 | mov x1, 1 /* x1 = 1 */ 24 | stadd x1, [x10] /* *counter = *counter + x1 */ 25 | 26 | 27 | 28 | /* printf for result-checking */ 29 | adr x0, fmt /* 1st argument for printf */ 30 | ldr x1, [x10] /* 2nd argument */ 31 | bl printf 32 | 33 | 34 | /* restore fp,lr registers and return from main */ 35 | ldp x29, x30, [sp], 16 36 | ret 37 | 38 | 39 | /* read-only data */ 40 | .section .rodata 41 | fmt: 42 | .string "counter = %d\n" 43 | 44 | 45 | /* read-write initialized data */ 46 | .data 47 | .balign 8 48 | counter: 49 | .quad 10 50 | -------------------------------------------------------------------------------- /arm/linux/710.sync_parts/counter_llsc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * ll/sc instruction 5 | * 6 | * * C-like pseudo-code: 7 | * L_loop: 8 | * x1 = *counter; 9 | * x1 = x1 + 1; 10 | * if (not-conflict) *counter = x1; 11 | * if (fail) goto L_loop; 12 | */ 13 | 14 | .global main 15 | 16 | main: 17 | /* save fp,lr registers for printf */ 18 | stp x29, x30, [sp, -16]! 
19 | 20 | 21 | 22 | /* set the address of the counter */ 23 | adr x10, counter 24 | 25 | 26 | /* increment the counter with ll/sc loop */ 27 | L_loop: 28 | ldxr x1, [x10] /* x1 = *counter */ 29 | add x1, x1, 1 /* x1 = x1 + 1 */ 30 | stxr w2, x1, [x10] /* if (not-conflict) *counter = x1 */ 31 | cbnz w2, L_loop /* if (fail) goto L_loop */ 32 | 33 | 34 | 35 | /* printf for result-checking */ 36 | adr x0, fmt /* 1st argument for printf */ 37 | ldr x1, [x10] /* 2nd argument */ 38 | bl printf 39 | 40 | 41 | /* restore fp,lr registers and return from main */ 42 | ldp x29, x30, [sp], 16 43 | ret 44 | 45 | 46 | /* read-only data */ 47 | .section .rodata 48 | fmt: 49 | .string "counter = %d\n" 50 | 51 | 52 | /* read-write initialized data */ 53 | .data 54 | .balign 8 55 | counter: 56 | .quad 10 57 | -------------------------------------------------------------------------------- /arm/linux/710.sync_parts/counter_swap.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * swap instruction 5 | * 6 | * * C-like pseudo-code: 7 | * // acquire the lock 8 | * x2 = 1; 9 | * L_loop: 10 | * x0 = *lock; 11 | * *lock = x2; 12 | * if (x0 != 0) goto L_loop; 13 | * 14 | * // increment the counter 15 | * x1 = *counter; 16 | * x1 = x1 + 1; 17 | * *counter = x1; 18 | * 19 | * // release the lock 20 | * x1 = 0; 21 | * *lock = x1; 22 | 23 | */ 24 | 25 | .global main 26 | 27 | main: 28 | /* save fp,lr registers for printf */ 29 | stp x29, x30, [sp, -16]! 
30 | 31 | 32 | 33 | /* set addresses of the lock and the counter */ 34 | adr x11, lock 35 | adr x10, counter 36 | 37 | 38 | /* acquire the lock */ 39 | mov x2, 1 /* x2 = 1 */ 40 | L_loop: 41 | swp x2, x0, [x11] /* x0 = *lock; *lock = x2 */ 42 | cbnz x0, L_loop /* if (x0 != 0) goto L_loop */ 43 | 44 | /* increment the counter */ 45 | ldr x1, [x10] /* x1 = *counter */ 46 | add x1, x1, 1 /* x1 = x1 + 1 */ 47 | str x1, [x10] /* *counter = x1 */ 48 | 49 | /* release the lock */ 50 | mov x1, 0 /* x1 = 0 */ 51 | str x1, [x11] /* *lock = x1 */ 52 | 53 | 54 | 55 | /* printf for result-checking */ 56 | adr x0, fmt /* 1st argument for printf */ 57 | ldr x1, [x10] /* 2nd argument */ 58 | bl printf 59 | 60 | 61 | /* restore fp,lr registers and return from main */ 62 | ldp x29, x30, [sp], 16 63 | ret 64 | 65 | 66 | /* read-only data */ 67 | .section .rodata 68 | fmt: 69 | .string "counter = %d\n" 70 | 71 | 72 | /* read-write initialized data */ 73 | .data 74 | .balign 8 75 | lock: 76 | .quad 0 77 | counter: 78 | .quad 10 79 | -------------------------------------------------------------------------------- /arm/linux/750.multi/pthread.S: -------------------------------------------------------------------------------- 1 | /* 2 | * pthread (generating a thread) example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child: a child thread (generated by pthread_create) 7 | * 8 | * * see: 9 | * * man pthread_create 10 | * * man pthread_join 11 | */ 12 | 13 | .globl main 14 | 15 | main: 16 | /* puts for trace-log */ 17 | adr x0, fmtMain_start 18 | bl puts 19 | 20 | 21 | /* create a thread with pthread_create() */ 22 | adr x0, tid /* pthread_t: &tid */ 23 | mov x1, 0 /* pthread_attr_t: NULL */ 24 | adr x2, child /* start_routine: &child */ 25 | mov x3, 0 /* arg: NULL */ 26 | bl pthread_create 27 | 28 | 29 | /* wait for the thread with pthread_join() */ 30 | adr x10, tid 31 | ldr x0, [x10] /* pthread_t: tid */ 32 | mov x1, 0 /* retval: NULL */ 33 | bl 
pthread_join 34 | 35 | 36 | /* puts for trace-log */ 37 | adr x0, fmtMain_finish 38 | bl puts 39 | 40 | /* exit from main */ 41 | mov x0, 0 /* status: EXIT_SUCCESS */ 42 | bl exit 43 | 44 | 45 | 46 | 47 | /* a child thread */ 48 | child: 49 | stp x29, x30, [sp, -16]! 50 | 51 | /* puts for trace-log */ 52 | adr x0, fmtChild 53 | bl puts 54 | 55 | 56 | /* finish this thread */ 57 | mov x0, 0 /* return value (not used) */ 58 | ldp x29, x30, [sp], 16 59 | ret 60 | 61 | 62 | 63 | /* read-only data */ 64 | .section .rodata 65 | fmtMain_start: 66 | .string "main(): start" 67 | fmtMain_finish: 68 | .string "main(): finish" 69 | fmtChild: 70 | .string "child(): start" 71 | 72 | 73 | 74 | /* read-write data */ 75 | .data 76 | .balign 8 77 | tid: 78 | .quad 0 79 | -------------------------------------------------------------------------------- /arm/linux/750.multi/pthread2.S: -------------------------------------------------------------------------------- 1 | /* 2 | * pthread (generating two threads) example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child1: 1st child thread (generated by pthread_create) 7 | * * child2: 2nd child thread (generated by pthread_create) 8 | * 9 | * * see: 10 | * * man pthread_create 11 | * * man pthread_join 12 | */ 13 | 14 | .globl main 15 | 16 | main: 17 | /* puts for trace-log */ 18 | adr x0, fmtMain_start 19 | bl puts 20 | 21 | 22 | /* pthread_create() */ 23 | 24 | /* for 1st child thread */ 25 | adr x0, tid1 /* pthread_t: &tid1 */ 26 | mov x1, 0 /* pthread_attr_t: NULL */ 27 | adr x2, child1 /* start_routine: &child1 */ 28 | mov x3, 0 /* arg: NULL */ 29 | bl pthread_create 30 | 31 | /* for 2nd child thread */ 32 | adr x0, tid2 /* pthread_t: &tid2 */ 33 | mov x1, 0 /* pthread_attr_t: NULL */ 34 | adr x2, child2 /* start_routine: &child2 */ 35 | mov x3, 0 /* arg: NULL */ 36 | bl pthread_create 37 | 38 | 39 | /* pthread_join() */ 40 | 41 | /* for 1st child thread */ 42 | adr x10, tid1 43 | ldr x0, 
[x10] /* pthread_t: tid1 */ 44 | mov x1, 0 /* retval: NULL */ 45 | bl pthread_join 46 | 47 | /* for 2nd child thread */ 48 | adr x10, tid2 49 | ldr x0, [x10] /* pthread_t: tid2 */ 50 | mov x1, 0 /* retval: NULL */ 51 | bl pthread_join 52 | 53 | 54 | /* puts for trace-log */ 55 | adr x0, fmtMain_finish 56 | bl puts 57 | 58 | /* exit from main */ 59 | mov x0, 0 /* status: EXIT_SUCCESS */ 60 | bl exit 61 | 62 | 63 | 64 | 65 | /* 1st child thread */ 66 | child1: 67 | stp x29, x30, [sp, -16]! 68 | 69 | /* puts for trace-log */ 70 | adr x0, fmtChild1 71 | bl puts 72 | 73 | /* finish this thread */ 74 | mov x0, 0 /* return value (not used) */ 75 | ldp x29, x30, [sp], 16 76 | ret 77 | 78 | 79 | 80 | /* 2nd child thread */ 81 | child2: 82 | stp x29, x30, [sp, -16]! 83 | 84 | /* puts for trace-log */ 85 | adr x0, fmtChild2 86 | bl puts 87 | 88 | /* finish this thread */ 89 | mov x0, 0 /* return value (not used) */ 90 | ldp x29, x30, [sp], 16 91 | ret 92 | 93 | 94 | 95 | /* read-only data */ 96 | .section .rodata 97 | fmtMain_start: 98 | .string "main(): start" 99 | fmtMain_finish: 100 | .string "main(): finish" 101 | 102 | fmtChild1: 103 | .string "child1(): start" 104 | fmtChild2: 105 | .string "child2(): start" 106 | 107 | 108 | 109 | /* read-write data */ 110 | .data 111 | .balign 8 112 | tid1: 113 | .quad 0 114 | tid2: 115 | .quad 0 116 | -------------------------------------------------------------------------------- /arm/linux/750.multi/sync_threads.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Sync threads example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child1: 1st child thread (sets the sync_flag for the 2nd thread) 7 | * * child2: 2nd child thread (waits for the 1st thread via the sync_flag) 8 | * 9 | * * see: 10 | * * man pthread_create 11 | */ 12 | 13 | .globl main 14 | 15 | main: 16 | /* puts for trace-log */ 17 | adr x0, fmtMain_start 18 | bl puts 19 | 20 | 21 | /* pthread_create() 
*/ 22 | 23 | /* for 1st child thread */ 24 | adr x0, tid1 /* pthread_t: &tid1 */ 25 | mov x1, 0 /* pthread_attr_t: NULL */ 26 | adr x2, child1 /* start_routine: &child1 */ 27 | mov x3, 0 /* arg: NULL */ 28 | bl pthread_create 29 | 30 | /* for 2nd child thread */ 31 | adr x0, tid2 /* pthread_t: &tid2 */ 32 | mov x1, 0 /* pthread_attr_t: NULL */ 33 | adr x2, child2 /* start_routine: &child2 */ 34 | mov x3, 0 /* arg: NULL */ 35 | bl pthread_create 36 | 37 | 38 | /* pthread_join() */ 39 | 40 | /* for 1st child thread */ 41 | adr x10, tid1 42 | ldr x0, [x10] /* pthread_t: tid1 */ 43 | mov x1, 0 /* retval: NULL */ 44 | bl pthread_join 45 | 46 | /* for 2nd child thread */ 47 | adr x10, tid2 48 | ldr x0, [x10] /* pthread_t: tid2 */ 49 | mov x1, 0 /* retval: NULL */ 50 | bl pthread_join 51 | 52 | 53 | /* puts for trace-log */ 54 | adr x0, fmtMain_finish 55 | bl puts 56 | 57 | /* exit from main */ 58 | mov x0, 0 /* status: EXIT_SUCCESS */ 59 | bl exit 60 | 61 | 62 | 63 | 64 | /* 1st child thread */ 65 | child1: 66 | stp x29, x30, [sp, -16]! 67 | 68 | /* puts for trace-log */ 69 | adr x0, fmtChild1_start 70 | bl puts 71 | 72 | /* wait for your key input */ 73 | adr x0, fmtChild1_prompt 74 | bl puts 75 | bl getchar 76 | 77 | 78 | /* set the sync_flag to 1 */ 79 | mov x0, 1 80 | adr x10, sync_flag 81 | str x0, [x10] 82 | 83 | 84 | /* puts for trace-log */ 85 | adr x0, fmtChild1_finish 86 | bl puts 87 | 88 | /* finish this thread */ 89 | mov x0, 0 /* return value (not used) */ 90 | ldp x29, x30, [sp], 16 91 | ret 92 | 93 | 94 | 95 | /* 2nd child thread */ 96 | child2: 97 | stp x29, x30, [sp, -16]! 
98 | 99 | /* puts for trace-log */ 100 | adr x0, fmtChild2_start 101 | bl puts 102 | 103 | 104 | /* wait for the 1st thread via the sync_flag */ 105 | L_loop: 106 | adr x10, sync_flag 107 | ldr x0, [x10] 108 | cbz x0, L_loop 109 | 110 | 111 | /* puts for trace-log */ 112 | adr x0, fmtChild2_finish 113 | bl puts 114 | 115 | /* finish this thread */ 116 | mov x0, 0 /* return value (not used) */ 117 | ldp x29, x30, [sp], 16 118 | ret 119 | 120 | 121 | 122 | /* read-only data */ 123 | .section .rodata 124 | fmtMain_start: 125 | .string "main(): start" 126 | fmtMain_finish: 127 | .string "main(): finish" 128 | 129 | fmtChild1_start: 130 | .string "child1(): start" 131 | fmtChild1_prompt: 132 | .string "child1(): enter key:" 133 | fmtChild1_finish: 134 | .string "child1(): finish" 135 | 136 | fmtChild2_start: 137 | .string "child2(): start" 138 | fmtChild2_finish: 139 | .string "child2(): finish" 140 | 141 | 142 | 143 | /* read-write data */ 144 | .data 145 | .balign 8 146 | tid1: 147 | .quad 0 148 | tid2: 149 | .quad 0 150 | 151 | sync_flag: 152 | .quad 0 153 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/README.md: -------------------------------------------------------------------------------- 1 | 2 | Performance experiments 3 | ======================= 4 | 5 | ## Instruction latencies 6 | 7 | ### latency_add.S 8 | 9 | ```sh 10 | $ make -f ../Makefile latency_add 11 | $ perf stat -e "cycles,instructions" ./latency_add 12 | 13 | 1,003,740,164 cycles 14 | 1,033,320,563 instructions # 1.03 insn per cycle 15 | ``` 16 | 17 | ### latency_mul.S 18 | 19 | ```sh 20 | $ make -f ../Makefile latency_mul 21 | $ perf stat -e "cycles,instructions" ./latency_mul 22 | 23 | 5,013,165,523 cycles 24 | 1,042,066,873 instructions # 0.21 insn per cycle 25 | ``` 26 | 27 | ### latency_load.S 28 | 29 | ```sh 30 | $ make -f ../Makefile latency_load 31 | $ perf stat -e "cycles,instructions" ./latency_load 32 | 33 | 4,010,823,103 cycles 34 | 
1,039,938,754 instructions # 0.26 insn per cycle 35 | ``` 36 | 37 | 38 | ## Branch-prediction misses 39 | 40 | ### branch_miss_few.S 41 | 42 | ```sh 43 | $ make -f ../Makefile branch_miss_few 44 | $ perf stat -e "cycles,instructions,branch-misses" ./branch_miss_few 45 | 46 | 9,382 branch-misses 47 | ``` 48 | 49 | ### branch_miss_many.S 50 | 51 | ```sh 52 | $ make -f ../Makefile branch_miss_many 53 | $ perf stat -e "cycles,instructions,branch-misses" ./branch_miss_many 54 | 55 | 5,008,736 branch-misses 56 | ``` 57 | 58 | diff files: 59 | 60 | ```sh 61 | $ diff branch_miss_few.S branch_miss_many.S 62 | < cmp x2, x2 /* zero(eq) flag is always true */ 63 | --- 64 | > cmp x2, 0 /* zero(eq) flag changes randomly */ 65 | ``` 66 | 67 | 68 | ## Cache misses (conflict-miss) 69 | 70 | ### cache_miss_few.S 71 | 72 | ```sh 73 | $ make -f ../Makefile cache_miss_few 74 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 75 | ./cache_miss_few 76 | 77 | 10,537,602 L1-dcache-loads 78 | 3,774 L1-dcache-load-misses # 0.04% of all L1-dcache hits 79 | ``` 80 | 81 | ### cache_miss_many.S 82 | 83 | ```sh 84 | $ make -f ../Makefile cache_miss_many 85 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 86 | ./cache_miss_many 87 | 88 | 10,623,085 L1-dcache-loads 89 | 9,846,872 L1-dcache-load-misses # 92.69% of all L1-dcache hits 90 | ``` 91 | 92 | diff files: 93 | 94 | ```sh 95 | $ diff cache_miss_few.S cache_miss_many.S 96 | < mov x13, 64 /* stride is 64byte (cache-line size) */ 97 | --- 98 | > mov x13, 4096 /* stride is 4Kbyte (cache-macro? 
size) */ 99 | ``` 100 | 101 | 102 | ## TLB misses (capacity-miss) 103 | 104 | ### tlb_miss_few.S 105 | 106 | ```sh 107 | $ make -f ../Makefile tlb_miss_few 108 | $ perf stat -e "cycles,instructions,L1-dcache-loads,dTLB-load-misses" \ 109 | ./tlb_miss_few 110 | 111 | 538,288,655 cycles 112 | 802,071,287 instructions 113 | 100,953,436 L1-dcache-loads 114 | 11,671 dTLB-load-misses 115 | ``` 116 | 117 | ### tlb_miss_many.S 118 | 119 | ```sh 120 | $ make -f ../Makefile tlb_miss_many 121 | $ perf stat -e "cycles,instructions,L1-dcache-loads,dTLB-load-misses" \ 122 | ./tlb_miss_many 123 | 124 | 4,052,938,587 cycles 125 | 825,978,648 instructions 126 | 109,003,552 L1-dcache-loads 127 | 94,691,845 dTLB-load-misses 128 | ``` 129 | 130 | diff files: 131 | 132 | ```sh 133 | $ diff tlb_miss_few.S tlb_miss_many.S 134 | < mov x14, 0x0f /* wrap for 16 times */ 135 | --- 136 | > mov x14, 0x1fff /* wrap for 8192 times */ 137 | ``` 138 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/branch_miss_few.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Branch prediction example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Branch conditions are always true. 5 | * * FEW branch mispredictions occur. 6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * for (i=0; i<10000000; i++) { 10 | * 11 | * x = xorshift(); // generate a random number 12 | * 13 | * if (x == x) goto L_br; // branch-prediction test 14 | * nop; 15 | * L_br: 16 | * 17 | * } 18 | * 19 | */ 20 | 21 | .global main 22 | 23 | main: 24 | /* save fp,lr registers */ 25 | stp x29, x30, [sp, -16]! 
26 | 27 | 28 | /* loop conditions */ 29 | mov x11, 0 /* loop variable */ 30 | ldr x12, =10000000 /* loop max-number */ 31 | 32 | adr x10, random 33 | ldr x0, [x10] /* initial number of xorshift(xor64) */ 34 | 35 | L_loop: 36 | /* generate a random number with the xorshift algorithm */ 37 | lsl x1, x0, 13 /* x = x ^ (x << 13) */ 38 | eor x0, x0, x1 39 | 40 | asr x1, x0, 7 /* x = x ^ (x >> 7) */ 41 | eor x0, x0, x1 42 | 43 | lsl x1, x0, 17 /* x = x ^ (x << 17) */ 44 | eor x0, x0, x1 45 | 46 | 47 | /* branch-prediction test */ 48 | and x2, x0, 1 49 | cmp x2, x2 /* zero(eq) flag is always true */ 50 | beq L_br /* branch-prediction test */ 51 | nop 52 | L_br: 53 | 54 | /* increment the loop-variable and loop-back */ 55 | add x11, x11, 1 56 | cmp x11, x12 57 | blt L_loop 58 | 59 | 60 | /* print the last loop-variable */ 61 | adr x0, fmt /* 1st argument for printf */ 62 | mov x1, x11 /* 2nd argument */ 63 | bl printf 64 | 65 | 66 | 67 | /* restore fp,lr registers and return from main*/ 68 | ldp x29, x30, [sp], 16 69 | ret 70 | 71 | 72 | /* read-only data */ 73 | .section .rodata 74 | random: 75 | .quad 88172645463325252 /* xorshift(xor64)'s initial value */ 76 | 77 | fmt: 78 | .string "loop-variable = %d\n" 79 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/branch_miss_many.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Branch prediction example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Branch conditions change randomly. 5 | * * MANY branch mispredictions occur. 6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * for (i=0; i<10000000; i++) { 10 | * 11 | * x = xorshift(); // generate a random number 12 | * 13 | * if ((x & 1) == 0) goto L_br; // branch-prediction test 14 | * nop; 15 | * L_br: 16 | * 17 | * } 18 | * 19 | */ 20 | 21 | .global main 22 | 23 | main: 24 | /* save fp,lr registers */ 25 | stp x29, x30, [sp, -16]! 
26 | 27 | 28 | /* loop conditions */ 29 | mov x11, 0 /* loop variable */ 30 | ldr x12, =10000000 /* loop max-number */ 31 | 32 | adr x10, random 33 | ldr x0, [x10] /* initial number of xorshift(xor64) */ 34 | 35 | L_loop: 36 | /* generate a random number with the xorshift algorithm */ 37 | lsl x1, x0, 13 /* x = x ^ (x << 13) */ 38 | eor x0, x0, x1 39 | 40 | asr x1, x0, 7 /* x = x ^ (x >> 7) */ 41 | eor x0, x0, x1 42 | 43 | lsl x1, x0, 17 /* x = x ^ (x << 17) */ 44 | eor x0, x0, x1 45 | 46 | 47 | /* branch-prediction test */ 48 | and x2, x0, 1 49 | cmp x2, 0 /* zero(eq) flag changes randomly */ 50 | beq L_br /* branch-prediction test */ 51 | nop 52 | L_br: 53 | 54 | /* increment the loop-variable and loop-back */ 55 | add x11, x11, 1 56 | cmp x11, x12 57 | blt L_loop 58 | 59 | 60 | /* print the last loop-variable */ 61 | adr x0, fmt /* 1st argument for printf */ 62 | mov x1, x11 /* 2nd argument */ 63 | bl printf 64 | 65 | 66 | 67 | /* restore fp,lr registers and return from main*/ 68 | ldp x29, x30, [sp], 16 69 | ret 70 | 71 | 72 | /* read-only data */ 73 | .section .rodata 74 | random: 75 | .quad 88172645463325252 /* xorshift(xor64)'s initial value */ 76 | 77 | fmt: 78 | .string "loop-variable = %d\n" 79 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/cache_miss_few.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Cache-miss example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Cache accesses with 64-byte (line-size) stride. 5 | * * FEW cache-miss occur. 6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 64; 10 | * wrap = 0x1f; // 32times - 1 11 | * 12 | * for (i=0; i<10000000; i++) { 13 | * load(address); 14 | * address = address + ((i & wrap) * stride) 15 | * } 16 | * 17 | */ 18 | 19 | .global main 20 | 21 | main: 22 | /* save fp,lr registers */ 23 | stp x29, x30, [sp, -16]! 
24 | 25 | 26 | /* access parameters */ 27 | mov x13, 64 /* stride is 64byte (cache-line size) */ 28 | mov x14, 0x1f /* wrap for 32 times */ 29 | 30 | 31 | /* loop conditions */ 32 | mov x11, 0 /* loop variable */ 33 | ldr x12, =10000000 /* loop max-number */ 34 | 35 | L_loop: 36 | /* calc address */ 37 | and x0, x11, x14 /* i & wrap */ 38 | mul x0, x0, x13 /* (i & wrap) x stride */ 39 | 40 | adr x10, work_area 41 | add x10, x10, x0 42 | 43 | /* cache access (simple load) */ 44 | ldr x1, [x10] /* x1 = load(x10) */ 45 | 46 | /* increment the loop-variable and loop-back */ 47 | add x11, x11, 1 48 | cmp x11, x12 49 | blt L_loop 50 | 51 | 52 | /* print the last loop-variable */ 53 | adr x0, fmt /* 1st argument for printf */ 54 | mov x1, x11 /* 2nd argument */ 55 | bl printf 56 | 57 | 58 | 59 | /* restore fp,lr registers and return from main*/ 60 | ldp x29, x30, [sp], 16 61 | ret 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmt: 67 | .string "loop-variable = %d\n" 68 | 69 | 70 | .balign 8 71 | work_area: 72 | .skip 0x40000 73 | 74 | dummy_tail: 75 | .quad 0 76 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/cache_miss_many.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Cache-miss example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Cache accesses with 4K-byte (cache-macro?) stride for way-conflict. 5 | * * MANY cache misses occur. 6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 4096; 10 | * wrap = 0x1f; // 32times - 1 11 | * 12 | * for (i=0; i<10000000; i++) { 13 | * load(address); 14 | * address = address + ((i & wrap) * stride) 15 | * } 16 | * 17 | */ 18 | 19 | .global main 20 | 21 | main: 22 | /* save fp,lr registers */ 23 | stp x29, x30, [sp, -16]! 24 | 25 | 26 | /* access parameters */ 27 | mov x13, 4096 /* stride is 4Kbyte (cache-macro? 
size) */ 28 | mov x14, 0x1f /* wrap for 32 times */ 29 | 30 | 31 | /* loop conditions */ 32 | mov x11, 0 /* loop variable */ 33 | ldr x12, =10000000 /* loop max-number */ 34 | 35 | L_loop: 36 | /* calc address */ 37 | and x0, x11, x14 /* i & wrap */ 38 | mul x0, x0, x13 /* (i & wrap) x stride */ 39 | 40 | adr x10, work_area 41 | add x10, x10, x0 42 | 43 | /* cache access (simple load) */ 44 | ldr x1, [x10] /* x1 = load(x10) */ 45 | 46 | /* increment the loop-variable and loop-back */ 47 | add x11, x11, 1 48 | cmp x11, x12 49 | blt L_loop 50 | 51 | 52 | /* print the last loop-variable */ 53 | adr x0, fmt /* 1st argument for printf */ 54 | mov x1, x11 /* 2nd argument */ 55 | bl printf 56 | 57 | 58 | 59 | /* restore fp,lr registers and return from main*/ 60 | ldp x29, x30, [sp], 16 61 | ret 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmt: 67 | .string "loop-variable = %d\n" 68 | 69 | 70 | .balign 8 71 | work_area: 72 | .skip 0x40000 73 | 74 | dummy_tail: 75 | .quad 0 76 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/latency_fadd.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Latency example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `fadd` instruction 5 | * 6 | * * C-like pseudo-code: 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * fadd instruction 10 | * : // 100 times 11 | * } 12 | * 13 | */ 14 | 15 | .global main 16 | 17 | main: 18 | /* save fp,lr registers */ 19 | stp x29, x30, [sp, -16]! 
20 | 21 | /* loop conditions */ 22 | mov x11, 0 /* loop variable */ 23 | ldr x12, =10000000 /* loop max-number */ 24 | 25 | adr x10, value1 26 | ldr d1 , [x10] /* 1st operand for add instructions */ 27 | ldr d2 , [x10] /* 2nd operand for add instructions */ 28 | 29 | 30 | /* loop body */ 31 | L_loop: 32 | fadd d1, d1, d2 /* d1 <- d1 + d2 */ 33 | fadd d1, d1, d2 34 | fadd d1, d1, d2 35 | fadd d1, d1, d2 36 | fadd d1, d1, d2 37 | fadd d1, d1, d2 38 | fadd d1, d1, d2 39 | fadd d1, d1, d2 40 | fadd d1, d1, d2 41 | fadd d1, d1, d2 42 | 43 | fadd d1, d1, d2 44 | fadd d1, d1, d2 45 | fadd d1, d1, d2 46 | fadd d1, d1, d2 47 | fadd d1, d1, d2 48 | fadd d1, d1, d2 49 | fadd d1, d1, d2 50 | fadd d1, d1, d2 51 | fadd d1, d1, d2 52 | fadd d1, d1, d2 53 | 54 | fadd d1, d1, d2 55 | fadd d1, d1, d2 56 | fadd d1, d1, d2 57 | fadd d1, d1, d2 58 | fadd d1, d1, d2 59 | fadd d1, d1, d2 60 | fadd d1, d1, d2 61 | fadd d1, d1, d2 62 | fadd d1, d1, d2 63 | fadd d1, d1, d2 64 | 65 | fadd d1, d1, d2 66 | fadd d1, d1, d2 67 | fadd d1, d1, d2 68 | fadd d1, d1, d2 69 | fadd d1, d1, d2 70 | fadd d1, d1, d2 71 | fadd d1, d1, d2 72 | fadd d1, d1, d2 73 | fadd d1, d1, d2 74 | fadd d1, d1, d2 75 | 76 | fadd d1, d1, d2 77 | fadd d1, d1, d2 78 | fadd d1, d1, d2 79 | fadd d1, d1, d2 80 | fadd d1, d1, d2 81 | fadd d1, d1, d2 82 | fadd d1, d1, d2 83 | fadd d1, d1, d2 84 | fadd d1, d1, d2 85 | fadd d1, d1, d2 86 | 87 | fadd d1, d1, d2 88 | fadd d1, d1, d2 89 | fadd d1, d1, d2 90 | fadd d1, d1, d2 91 | fadd d1, d1, d2 92 | fadd d1, d1, d2 93 | fadd d1, d1, d2 94 | fadd d1, d1, d2 95 | fadd d1, d1, d2 96 | fadd d1, d1, d2 97 | 98 | fadd d1, d1, d2 99 | fadd d1, d1, d2 100 | fadd d1, d1, d2 101 | fadd d1, d1, d2 102 | fadd d1, d1, d2 103 | fadd d1, d1, d2 104 | fadd d1, d1, d2 105 | fadd d1, d1, d2 106 | fadd d1, d1, d2 107 | fadd d1, d1, d2 108 | 109 | fadd d1, d1, d2 110 | fadd d1, d1, d2 111 | fadd d1, d1, d2 112 | fadd d1, d1, d2 113 | fadd d1, d1, d2 114 | fadd d1, d1, d2 115 | fadd d1, d1, d2 116 | 
fadd d1, d1, d2 117 | fadd d1, d1, d2 118 | fadd d1, d1, d2 119 | 120 | fadd d1, d1, d2 121 | fadd d1, d1, d2 122 | fadd d1, d1, d2 123 | fadd d1, d1, d2 124 | fadd d1, d1, d2 125 | fadd d1, d1, d2 126 | fadd d1, d1, d2 127 | fadd d1, d1, d2 128 | fadd d1, d1, d2 129 | fadd d1, d1, d2 130 | 131 | fadd d1, d1, d2 132 | fadd d1, d1, d2 133 | fadd d1, d1, d2 134 | fadd d1, d1, d2 135 | fadd d1, d1, d2 136 | fadd d1, d1, d2 137 | fadd d1, d1, d2 138 | fadd d1, d1, d2 139 | fadd d1, d1, d2 140 | fadd d1, d1, d2 141 | 142 | /* increment the loop-variable and loop-back */ 143 | add x11, x11, 1 144 | cmp x11, x12 145 | blt L_loop 146 | 147 | 148 | /* print the last loop-variable */ 149 | adr x0, fmt /* 1st argument for printf */ 150 | mov x1, x11 /* 2nd argument */ 151 | bl printf 152 | 153 | 154 | /* restore fp,lr registers and return from main*/ 155 | ldp x29, x30, [sp], 16 156 | ret 157 | 158 | 159 | /* read-only data */ 160 | .section .rodata 161 | value1: 162 | .double 1.0 163 | 164 | fmt: 165 | .string "loop-variable = %d\n" 166 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/ooo_perform.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Out-of-order execution example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `load` instruction 5 | * 6 | * * Reordering with a pending instruction 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * x14 = load(x14); 10 | * x14 = x14 + x10; // pending on the load instruction 11 | * x15 = x15 + x10; // out-of-order execution (independent of the load) 12 | * x15 = x15 + x10; 13 | * x15 = x15 + x10; 14 | * : // 100 times 15 | * } 16 | * 17 | */ 18 | 19 | .global main 20 | 21 | main: 22 | /* save fp,lr registers */ 23 | stp x29, x30, [sp, -16]! 
24 | 25 | /* loop conditions */ 26 | mov x11, 0 /* loop variable */ 27 | ldr x12, =10000000 /* loop max-number */ 28 | 29 | adr x14, buf /* address for load instructions */ 30 | str x14, [x14] /* pre-store the address for load */ 31 | 32 | mov x10, 0 33 | 34 | 35 | /* loop body */ 36 | L_loop: 37 | ldr x14, [x14] /* x14 = load(x14) */ 38 | add x14, x14, x10 /* true data dependency with x14 */ 39 | add x15, x15, x10 /* NO true data dependency with x14 */ 40 | add x15, x15, x10 /* NO true data dependency with x14 */ 41 | add x15, x15, x10 /* NO true data dependency with x14 */ 42 | 43 | ldr x14, [x14] 44 | add x14, x14, x10 45 | add x15, x15, x10 46 | add x15, x15, x10 47 | add x15, x15, x10 48 | 49 | ldr x14, [x14] 50 | add x14, x14, x10 51 | add x15, x15, x10 52 | add x15, x15, x10 53 | add x15, x15, x10 54 | 55 | ldr x14, [x14] 56 | add x14, x14, x10 57 | add x15, x15, x10 58 | add x15, x15, x10 59 | add x15, x15, x10 60 | 61 | ldr x14, [x14] 62 | add x14, x14, x10 63 | add x15, x15, x10 64 | add x15, x15, x10 65 | add x15, x15, x10 66 | 67 | ldr x14, [x14] 68 | add x14, x14, x10 69 | add x15, x15, x10 70 | add x15, x15, x10 71 | add x15, x15, x10 72 | 73 | ldr x14, [x14] 74 | add x14, x14, x10 75 | add x15, x15, x10 76 | add x15, x15, x10 77 | add x15, x15, x10 78 | 79 | ldr x14, [x14] 80 | add x14, x14, x10 81 | add x15, x15, x10 82 | add x15, x15, x10 83 | add x15, x15, x10 84 | 85 | ldr x14, [x14] 86 | add x14, x14, x10 87 | add x15, x15, x10 88 | add x15, x15, x10 89 | add x15, x15, x10 90 | 91 | ldr x14, [x14] 92 | add x14, x14, x10 93 | add x15, x15, x10 94 | add x15, x15, x10 95 | add x15, x15, x10 96 | 97 | /* increment the loop-variable and loop-back */ 98 | add x11, x11, 1 99 | cmp x11, x12 100 | blt L_loop 101 | 102 | 103 | /* print the last loop-variable */ 104 | adr x0, fmt /* 1st argument for printf */ 105 | mov x1, x11 /* 2nd argument */ 106 | bl printf 107 | 108 | 109 | /* restore fp,lr registers and return from main*/ 110 | ldp x29, x30, [sp], 16 
111 | ret 112 | 113 | 114 | /* read-only data */ 115 | .section .rodata 116 | fmt: 117 | .string "loop-variable = %d\n" 118 | 119 | 120 | /* read-write data */ 121 | .data 122 | buf: 123 | .quad 0 124 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/ooo_wait.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Out-of-order execution example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `load` instruction 5 | * 6 | * * NO-Reordering with a pending instruction 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * x14 = load(x14); 10 | * x14 = x14 + x10; // pending on the load instruction 11 | * x14 = x14 + x10; // NO out-of-order execution 12 | * x14 = x14 + x10; 13 | * x14 = x14 + x10; 14 | * : // 100 times 15 | * } 16 | * 17 | */ 18 | 19 | .global main 20 | 21 | main: 22 | /* save fp,lr registers */ 23 | stp x29, x30, [sp, -16]! 24 | 25 | /* loop conditions */ 26 | mov x11, 0 /* loop variable */ 27 | ldr x12, =10000000 /* loop max-number */ 28 | 29 | adr x14, buf /* address for load instructions */ 30 | str x14, [x14] /* pre-store the address for load */ 31 | 32 | mov x10, 0 33 | 34 | 35 | /* loop body */ 36 | L_loop: 37 | ldr x14, [x14] /* x14 = load(x14) */ 38 | add x14, x14, x10 /* true data dependency with x14 */ 39 | add x14, x14, x10 /* true data dependency with x14 */ 40 | add x14, x14, x10 /* true data dependency with x14 */ 41 | add x14, x14, x10 /* true data dependency with x14 */ 42 | 43 | ldr x14, [x14] 44 | add x14, x14, x10 45 | add x14, x14, x10 46 | add x14, x14, x10 47 | add x14, x14, x10 48 | 49 | ldr x14, [x14] 50 | add x14, x14, x10 51 | add x14, x14, x10 52 | add x14, x14, x10 53 | add x14, x14, x10 54 | 55 | ldr x14, [x14] 56 | add x14, x14, x10 57 | add x14, x14, x10 58 | add x14, x14, x10 59 | add x14, x14, x10 60 | 61 | ldr x14, [x14] 62 | add x14, x14, x10 63 | add x14, x14, x10 64 | add x14, x14, x10 65 | add x14, x14, x10 66 | 67 | ldr x14, [x14] 
68 | add x14, x14, x10 69 | add x14, x14, x10 70 | add x14, x14, x10 71 | add x14, x14, x10 72 | 73 | ldr x14, [x14] 74 | add x14, x14, x10 75 | add x14, x14, x10 76 | add x14, x14, x10 77 | add x14, x14, x10 78 | 79 | ldr x14, [x14] 80 | add x14, x14, x10 81 | add x14, x14, x10 82 | add x14, x14, x10 83 | add x14, x14, x10 84 | 85 | ldr x14, [x14] 86 | add x14, x14, x10 87 | add x14, x14, x10 88 | add x14, x14, x10 89 | add x14, x14, x10 90 | 91 | ldr x14, [x14] 92 | add x14, x14, x10 93 | add x14, x14, x10 94 | add x14, x14, x10 95 | add x14, x14, x10 96 | 97 | /* increment the loop-variable and loop-back */ 98 | add x11, x11, 1 99 | cmp x11, x12 100 | blt L_loop 101 | 102 | 103 | /* print the last loop-variable */ 104 | adr x0, fmt /* 1st argument for printf */ 105 | mov x1, x11 /* 2nd argument */ 106 | bl printf 107 | 108 | 109 | /* restore fp,lr registers and return from main*/ 110 | ldp x29, x30, [sp], 16 111 | ret 112 | 113 | 114 | /* read-only data */ 115 | .section .rodata 116 | fmt: 117 | .string "loop-variable = %d\n" 118 | 119 | 120 | /* read-write data */ 121 | .data 122 | buf: 123 | .quad 0 124 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/superscalar_add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Superscalar-issue example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * `add` instructions without true-data-dependencies 5 | * 6 | * * C-like pseudo-code: 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * add instruction 10 | * : // 100 times 11 | * } 12 | * 13 | */ 14 | 15 | .global main 16 | 17 | main: 18 | /* save fp,lr registers */ 19 | stp x29, x30, [sp, -16]! 
20 | 21 | /* loop conditions */ 22 | mov x11, 0 /* loop variable */ 23 | ldr x12, =10000000 /* loop max-number */ 24 | 25 | mov x14, 0 /* 2nd operand for add instructions */ 26 | 27 | 28 | /* loop body */ 29 | L_loop: 30 | add x1, x1, x14 /* x1 = x1 + x14 */ 31 | add x2, x2, x14 32 | add x3, x3, x14 33 | add x4, x4, x14 34 | add x5, x5, x14 35 | add x6, x6, x14 36 | add x7, x7, x14 37 | add x8, x8, x14 38 | add x9, x9, x14 39 | add x10, x10, x14 40 | 41 | add x1, x1, x14 42 | add x2, x2, x14 43 | add x3, x3, x14 44 | add x4, x4, x14 45 | add x5, x5, x14 46 | add x6, x6, x14 47 | add x7, x7, x14 48 | add x8, x8, x14 49 | add x9, x9, x14 50 | add x10, x10, x14 51 | 52 | add x1, x1, x14 53 | add x2, x2, x14 54 | add x3, x3, x14 55 | add x4, x4, x14 56 | add x5, x5, x14 57 | add x6, x6, x14 58 | add x7, x7, x14 59 | add x8, x8, x14 60 | add x9, x9, x14 61 | add x10, x10, x14 62 | 63 | add x1, x1, x14 64 | add x2, x2, x14 65 | add x3, x3, x14 66 | add x4, x4, x14 67 | add x5, x5, x14 68 | add x6, x6, x14 69 | add x7, x7, x14 70 | add x8, x8, x14 71 | add x9, x9, x14 72 | add x10, x10, x14 73 | 74 | add x1, x1, x14 75 | add x2, x2, x14 76 | add x3, x3, x14 77 | add x4, x4, x14 78 | add x5, x5, x14 79 | add x6, x6, x14 80 | add x7, x7, x14 81 | add x8, x8, x14 82 | add x9, x9, x14 83 | add x10, x10, x14 84 | 85 | add x1, x1, x14 86 | add x2, x2, x14 87 | add x3, x3, x14 88 | add x4, x4, x14 89 | add x5, x5, x14 90 | add x6, x6, x14 91 | add x7, x7, x14 92 | add x8, x8, x14 93 | add x9, x9, x14 94 | add x10, x10, x14 95 | 96 | add x1, x1, x14 97 | add x2, x2, x14 98 | add x3, x3, x14 99 | add x4, x4, x14 100 | add x5, x5, x14 101 | add x6, x6, x14 102 | add x7, x7, x14 103 | add x8, x8, x14 104 | add x9, x9, x14 105 | add x10, x10, x14 106 | 107 | add x1, x1, x14 108 | add x2, x2, x14 109 | add x3, x3, x14 110 | add x4, x4, x14 111 | add x5, x5, x14 112 | add x6, x6, x14 113 | add x7, x7, x14 114 | add x8, x8, x14 115 | add x9, x9, x14 116 | add x10, x10, x14 117 | 118 | 
add x1, x1, x14 119 | add x2, x2, x14 120 | add x3, x3, x14 121 | add x4, x4, x14 122 | add x5, x5, x14 123 | add x6, x6, x14 124 | add x7, x7, x14 125 | add x8, x8, x14 126 | add x9, x9, x14 127 | add x10, x10, x14 128 | 129 | add x1, x1, x14 130 | add x2, x2, x14 131 | add x3, x3, x14 132 | add x4, x4, x14 133 | add x5, x5, x14 134 | add x6, x6, x14 135 | add x7, x7, x14 136 | add x8, x8, x14 137 | add x9, x9, x14 138 | add x10, x10, x14 139 | 140 | 141 | /* increment the loop-variable and loop-back */ 142 | add x11, x11, 1 143 | cmp x11, x12 144 | blt L_loop 145 | 146 | 147 | /* print the last loop-variable */ 148 | adr x0, fmt /* 1st argument for printf */ 149 | mov x1, x11 /* 2nd argument */ 150 | bl printf 151 | 152 | 153 | /* restore fp,lr registers and return from main*/ 154 | ldp x29, x30, [sp], 16 155 | ret 156 | 157 | 158 | /* read-only data */ 159 | .section .rodata 160 | fmt: 161 | .string "loop-variable = %d\n" 162 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/tlb_miss_few.S: -------------------------------------------------------------------------------- 1 | /* 2 | * TLB-miss example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Memory accesses within 64Kbyte (16 pages) 5 | * * FEW tlb-miss occur. 6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 4096; // 4096 byte (page size) 10 | * wrap = 0x0f; // 16 times (pages) - 1 11 | * 12 | * for (i=0; i<100000000; i++) { 13 | * load(address); 14 | * address = address + ((i & wrap) * stride) 15 | * } 16 | * 17 | */ 18 | 19 | .global main 20 | 21 | main: 22 | /* save fp,lr registers */ 23 | stp x29, x30, [sp, -16]! 
24 | 25 | 26 | /* access parameters */ 27 | mov x13, 4096 /* stride is 4Kbyte (page size) */ 28 | mov x14, 0x0f /* wrap for 16 times */ 29 | 30 | 31 | /* loop conditions */ 32 | mov x11, 0 /* loop variable */ 33 | ldr x12, =100000000 /* loop max-number */ 34 | 35 | L_loop: 36 | /* calc address */ 37 | and x0, x11, x14 /* i & wrap */ 38 | mul x0, x0, x13 /* (i & wrap) x stride */ 39 | 40 | adr x10, work_area 41 | add x10, x10, x0 42 | 43 | /* memory access (simple load) */ 44 | ldr x1, [x10] /* x1 = load(x10) */ 45 | 46 | /* increment the loop-variable and loop-back */ 47 | add x11, x11, 1 48 | cmp x11, x12 49 | blt L_loop 50 | 51 | 52 | /* print the last loop-variable */ 53 | adr x0, fmt /* 1st argument for printf */ 54 | mov x1, x11 /* 2nd argument */ 55 | bl printf 56 | 57 | 58 | 59 | /* restore fp,lr registers and return from main*/ 60 | ldp x29, x30, [sp], 16 61 | ret 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmt: 67 | .string "loop-variable = %d\n" 68 | 69 | 70 | .balign 8 71 | work_area: 72 | .skip 0x8000000 73 | 74 | dummy_tail: 75 | .quad 0 76 | -------------------------------------------------------------------------------- /arm/linux/E00.perf_expt/tlb_miss_many.S: -------------------------------------------------------------------------------- 1 | /* 2 | * TLB-miss example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Memory accesses within 32Mbyte (8192 pages) 5 | * * MANY tlb-miss occur. 6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 4096; // 4096 byte (page size) 10 | * wrap = 0x1fff; // 8192 times (pages) - 1 11 | * 12 | * for (i=0; i<100000000; i++) { 13 | * load(address); 14 | * address = address + ((i & wrap) * stride) 15 | * } 16 | * 17 | */ 18 | 19 | .global main 20 | 21 | main: 22 | /* save fp,lr registers */ 23 | stp x29, x30, [sp, -16]! 
24 | 25 | 26 | /* access parameters */ 27 | mov x13, 4096 /* stride is 4Kbyte (page size) */ 28 | mov x14, 0x1fff /* wrap for 8192 times */ 29 | 30 | 31 | /* loop conditions */ 32 | mov x11, 0 /* loop variable */ 33 | ldr x12, =100000000 /* loop max-number */ 34 | 35 | L_loop: 36 | /* calc address */ 37 | and x0, x11, x14 /* i & wrap */ 38 | mul x0, x0, x13 /* (i & wrap) x stride */ 39 | 40 | adr x10, work_area 41 | add x10, x10, x0 42 | 43 | /* memory access (simple load) */ 44 | ldr x1, [x10] /* x1 = load(x10) */ 45 | 46 | /* increment the loop-variable and loop-back */ 47 | add x11, x11, 1 48 | cmp x11, x12 49 | blt L_loop 50 | 51 | 52 | /* print the last loop-variable */ 53 | adr x0, fmt /* 1st argument for printf */ 54 | mov x1, x11 /* 2nd argument */ 55 | bl printf 56 | 57 | 58 | 59 | /* restore fp,lr registers and return from main*/ 60 | ldp x29, x30, [sp], 16 61 | ret 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmt: 67 | .string "loop-variable = %d\n" 68 | 69 | 70 | .balign 8 71 | work_area: 72 | .skip 0x8000000 73 | 74 | dummy_tail: 75 | .quad 0 76 | -------------------------------------------------------------------------------- /arm/linux/E05.sys_expt/README.md: -------------------------------------------------------------------------------- 1 | 2 | System experiments 3 | ================== 4 | 5 | ## System call 6 | 7 | ### syscall_write.S 8 | 9 | ```sh 10 | $ make -f ../Makefile syscall_write 11 | 12 | $ ./syscall_write 13 | hello, world 14 | ``` 15 | 16 | 17 | ## Exception 18 | 19 | ### excep_load.S 20 | 21 | ```sh 22 | $ make -f ../Makefile excep_load 23 | 24 | $ ./excep_load 25 | Segmentation fault (core dumped) 26 | ``` 27 | -------------------------------------------------------------------------------- /arm/linux/E05.sys_expt/excep_load.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Exception example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * Page fault exception 5 | * 6 | */ 7 |
8 | .global main 9 | 10 | main: 11 | mov x10, 0 12 | ldr x1, [x10] /* x1 = load(x10) */ 13 | 14 | ret 15 | -------------------------------------------------------------------------------- /arm/linux/E05.sys_expt/syscall_write.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for Arm(Armv8 aarch64)/Linux 3 | * 4 | * * write(2) system-call 5 | * * see: man 2 write 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* write(2) system-call */ 14 | mov x8, 64 /* system-call number: write() */ 15 | mov x0, 1 /* fd: stdout */ 16 | adr x1, msg /* buf: */ 17 | mov x2, 13 /* count: */ 18 | svc 0 /* system call */ 19 | 20 | 21 | /* return from main */ 22 | ret 23 | 24 | 25 | /* read-only data */ 26 | .section .rodata 27 | msg: 28 | .string "hello, world\n" 29 | -------------------------------------------------------------------------------- /arm/linux/E10.multi_expt/README.md: -------------------------------------------------------------------------------- 1 | 2 | Experiments with multiprocessors 3 | ================================ 4 | 5 | ## Cache-line(s) with line-invalidation 6 | 7 | ### cacheline_different.S 8 | 9 | ```sh 10 | $ make -f ../Makefile cacheline_different 11 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 12 | ./cacheline_different 13 | 14 | 20,984,095 L1-dcache-loads 15 | 11,795 L1-dcache-load-misses # 0.06% of all L1-dcache hits 16 | ``` 17 | 18 | 19 | ### cacheline_same.S 20 | 21 | ```sh 22 | $ make -f ../Makefile cacheline_same 23 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 24 | ./cacheline_same 25 | 26 | 21,080,723 L1-dcache-loads 27 | 4,955,727 L1-dcache-load-misses # 23.51% of all L1-dcache hits 28 | ``` 29 | 30 | diff files: 31 | 32 | ```sh 33 | $ diff cacheline_different.S 
cacheline_same.S 34 | < .space 64 /* padding for different cache-lines */ 35 | ``` 36 | 37 | 38 | 39 | ## Memory ordering 40 | 41 | ### ordering_unexpected.S 42 | 43 | ```sh 44 | $ make -f ../Makefile ordering_unexpected 45 | $ ./ordering_unexpected 46 | child1(): UNEXPECTED!: x14 == 0 && x15 == 0; loop-variable = 543 47 | child2(): UNEXPECTED!: x14 == 0 && x15 == 0; loop-variable = 543 48 | ``` 49 | 50 | 51 | ### ordering_force.S 52 | 53 | ```sh 54 | $ make -f ../Makefile ordering_force 55 | $ ./ordering_force 56 | child1(): finish: loop-variable = 5000000 57 | child2(): finish: loop-variable = 5000000 58 | ``` 59 | 60 | diff files: 61 | 62 | ```sh 63 | $ diff ordering_unexpected.S ordering_force.S 64 | > dmb sy /* FORCE ORDERING */ 65 | ``` 66 | 67 | 68 | 69 | ## Memory ordering (weak-ordering) 70 | 71 | ### weak_ordering_unexpected.S 72 | 73 | ```sh 74 | $ make -f ../Makefile weak_ordering_unexpected 75 | $ ./weak_ordering_unexpected 76 | child2(): UNEXPECTED!: req = 0x1, info = 0xbad, loop-variable = 82852 77 | ``` 78 | 79 | 80 | ### weak_ordering_force.S 81 | 82 | ```sh 83 | $ make -f ../Makefile weak_ordering_force 84 | $ ./weak_ordering_force 85 | child1(): finish: loop-variable = 5000000 86 | child2(): finish: loop-variable = 5000000 87 | ``` 88 | 89 | diff files: 90 | 91 | ```sh 92 | $ diff weak_ordering_unexpected.S weak_ordering_force.S 93 | 104a107 94 | > dmb st /* FORCE ORDERING */ 95 | 190a194 96 | > dmb ld /* FORCE ORDERING */ 97 | ``` 98 | 99 | 100 | ### weak_ordering_force_relacq.S 101 | 102 | ```sh 103 | $ make -f ../Makefile weak_ordering_force_relacq 104 | $ ./weak_ordering_force_relacq 105 | child1(): finish: loop-variable = 5000000 106 | child2(): finish: loop-variable = 5000000 107 | ``` 108 | 109 | diff files: 110 | 111 | ```sh 112 | $ diff weak_ordering_unexpected.S weak_ordering_force_relacq.S 113 | 11c111 114 | < str x1, [x13] /* req = 1 */ 115 | --- 116 | > stlr x1, [x13] /* req = 1 with RELEASE ORDERING */ 117 | 193c193 118 | < ldr x9, [x13]
/* load the req */ 119 | --- 120 | > ldar x9, [x13] /* load the req with ACQUIRE ORDERING */ 121 | ``` 122 | 123 | 124 | 125 | ## Shared-counter with atomicity 126 | 127 | ### counter_atomic.S 128 | 129 | ```sh 130 | $ make -f ../Makefile counter_atomic 131 | 132 | $ ./counter_atomic 133 | main(): start 134 | child1(): start 135 | child2(): start 136 | child2(): finish: loop-variable = 5000000 137 | child1(): finish: loop-variable = 5000000 138 | main(): finish: counter = 10000000 139 | ``` 140 | 141 | ### counter_bad.S 142 | 143 | ```sh 144 | $ make -f ../Makefile counter_bad 145 | 146 | $ ./counter_bad 147 | main(): start 148 | child1(): start 149 | child2(): start 150 | child1(): finish: loop-variable = 5000000 151 | child2(): finish: loop-variable = 5000000 152 | main(): finish: counter = 5044589 153 | ``` 154 | 155 | diff files: 156 | 157 | ```sh 158 | $ diff counter_atomic.S counter_bad.S 159 | 84c84 160 | < ldxr x14, [x10] 161 | --- 162 | > ldr x14, [x10] 163 | 86c86 164 | < stxr w2, x14, [x10] 165 | --- 166 | > str x14, [x10] 167 | 88d87 168 | < cbnz w2, L1_loop 169 | ``` 170 | -------------------------------------------------------------------------------- /arm/linux/Makefile: -------------------------------------------------------------------------------- 1 | 2 | # Makefile for Armv8/aarch64 3 | 4 | CPU = $(shell uname -m) 5 | ifeq "$(CPU)" "x86_64" 6 | CC = aarch64-linux-gnu-gcc 7 | OBJDUMP = aarch64-linux-gnu-objdump 8 | GDB = gdb-multiarch 9 | else 10 | CC = gcc 11 | OBJDUMP = objdump 12 | GDB = gdb 13 | endif 14 | 15 | 16 | CFLAGS += -g 17 | ASFLAGS += -g -march=armv8.1-a # armv8.1 for +lse 18 | LDFLAGS += -static -pthread 19 | 20 | 21 | include ../../../Makefile 22 | -------------------------------------------------------------------------------- /arm/linux/README.md: -------------------------------------------------------------------------------- 1 | 2 | Arm (Armv8 aarch64) assembly examples on linux 3 | 
============================================== 4 | 5 | ## Examples 6 | * [simple main](./100.main) (./100.main) 7 | * [system call](./110.system_call) (./110.system_call) 8 | * [library call](./120.libc_call) (./120.libc_call) 9 | * [basic operations](./300.operation) (./300.operation), [floating](./310.floating) (./310.floating) 10 | * [control flow](./400.control_flow) (./400.control_flow) 11 | * [sync](./700.sync) (./700.sync), [sync_parts](./710.sync_parts) (./710.sync_parts) 12 | * [multi-threads](./750.multi) (./750.multi) 13 | 14 | 15 | ## Experiments 16 | * Performance: [data-dependency, branch, cache, virtual-memory](./E00.perf_expt) (./E00.perf_expt) 17 | * System: [syscall/exception](./E05.sys_expt) (./E05.sys_expt) 18 | * Multiprocessors: [coherence, memory-ordering, atomic](./E10.multi_expt) (./E10.multi_expt) 19 | 20 | 21 | ## How to try 22 | 23 | * Assemble (generate binary) 24 | 25 | ``` 26 | $ make -f ../Makefile 27 | ``` 28 | 29 | * Execute 30 | 31 | ``` 32 | $ ./ 33 | or 34 | $ qemu-aarch64 ./ 35 | ``` 36 | 37 | * Disassemble (for full object) 38 | 39 | ``` 40 | $ make -f ../Makefile .disasm | less 41 | (search "main>" in `less` command) 42 | ``` 43 | 44 | * Disassemble (for .S only) 45 | 46 | ``` 47 | $ make -f ../Makefile .o.disasm 48 | ``` 49 | 50 | * Step execution with gdb 51 | 52 | ``` 53 | $ make -f ../Makefile 54 | $ gdb ./ 55 | (gdb) layout asm 56 | (gdb) layout regs 57 | (gdb) break main 58 | (gdb) run 59 | (gdb) stepi 60 | : 61 | (gdb) quit 62 | ``` 63 | 64 | * Step execution with qemu and gdb 65 | 66 | * on first terminal for qemu: 67 | 68 | ``` 69 | $ qemu-aarch64 -g 1234 ./ 70 | ``` 71 | 72 | * on second terminal for gdb: 73 | 74 | ``` 75 | $ gdb-multiarch -ex "target remote localhost:1234" ./ 76 | (gdb) layout asm 77 | (gdb) layout regs 78 | (gdb) break main 79 | (gdb) continue 80 | (gdb) stepi 81 | : 82 | (gdb) quit 83 | ``` 84 | 85 | 86 | ## References 87 | 88 | * Arm (Armv8) 89 | * [Arm Architecture Reference Manual Armv8, for 
Armv8-A architecture profile](https://developer.arm.com/documentation/ddi0487/latest/) 90 | 91 | * Linux 92 | * [Linux kernel; arch/arm64](https://github.com/torvalds/linux/tree/master/arch/arm64) 93 | 94 | * glibc 95 | * [glibc](https://www.gnu.org/software/libc/libc.html) 96 | * sysdeps/unix/sysv/linux/aarch64 97 | 98 | * ABI for the Arm 64-bit Architecture 99 | * [Application Binary Interface (ABI)](https://developer.arm.com/architectures/system-architectures/software-standards/abi) 100 | 101 | * System call ABI 102 | * [syscall(2) - Linux manual page](https://man7.org/linux/man-pages/man2/syscall.2.html) 103 | 104 | * GCC 105 | * [GCC online documentation](https://gcc.gnu.org/onlinedocs/) 106 | 107 | * GNU assembler and linker 108 | * [Documentation for binutils](https://sourceware.org/binutils/docs/) 109 | 110 | * GDB 111 | * [GDB Documentation](https://www.gnu.org/software/gdb/documentation/) 112 | 113 | * QEMU 114 | * [QEMU User space emulator](https://qemu-project.gitlab.io/qemu/user/main.html) 115 | 116 | 117 | ## Further information 118 | 119 | ### Calling convention 120 | 121 | * System call 122 | * x8, x0, x1, x2, x3, x4, x5 -> x0, x1 123 | 124 | * Function call 125 | * x0, x1, x2, x3, x4, x5, x6, x7 -> x0, x1 126 | 127 | * see: 128 | * [Application Binary Interface (ABI)](https://developer.arm.com/architectures/system-architectures/software-standards/abi) 129 | * glibc's sysdeps/unix/sysv/linux/aarch64/syscall.S 130 | 131 | ### Tool installation on ubuntu (x86) 132 | 133 | * Cross assembler (GNU Compiler Toolchain) 134 | 135 | ``` 136 | apt install g++-aarch64-linux-gnu 137 | ``` 138 | 139 | * QEMU User space emulator 140 | 141 | ``` 142 | apt install qemu-user 143 | ``` 144 | 145 | * Cross debugger 146 | 147 | ``` 148 | apt install gdb-multiarch 149 | ``` 150 | -------------------------------------------------------------------------------- /riscv/linux/100.main/simple_main.S:
-------------------------------------------------------------------------------- 1 | /* 2 | * Simple main example for RISC-V(RV64)/Linux 3 | */ 4 | 5 | .global main 6 | 7 | main: 8 | ret 9 | -------------------------------------------------------------------------------- /riscv/linux/110.system_call/syscall_exit.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for RISC-V(RV64)/Linux 3 | * 4 | * * exit(2) system-call 5 | * * see: man 2 exit 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* exit(2) system-call */ 14 | li a7, 93 /* system-call number: exit() */ 15 | li a0, 0 /* status: success */ 16 | ecall /* system call */ 17 | -------------------------------------------------------------------------------- /riscv/linux/110.system_call/syscall_read.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for RISC-V(RV64)/Linux 3 | * 4 | * * read(2) system-call 5 | * * see: man 2 read 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* read(2) system-call */ 14 | li a7, 63 /* system-call number: read() */ 15 | li a0, 0 /* fd: stdin */ 16 | la a1, buf /* buf: */ 17 | li a2, 3 /* count: */ 18 | ecall /* system call */ 19 | 20 | 21 | /* write(2) system-call */ 22 | li a7, 64 /* system-call number: write() */ 23 | li a0, 1 /* fd: stdout */ 24 | la a1, buf /* buf: */ 25 | li a2, 3 /* count: */ 26 | ecall /* system call */ 27 | 28 | 29 | /* return from main */ 30 | ret 31 | 32 | 33 | /* read-write zero initialized data */ 34 | .bss 35 | buf: 36 | .space 10 37 | -------------------------------------------------------------------------------- 
/riscv/linux/110.system_call/syscall_write.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for RISC-V(RV64)/Linux 3 | * 4 | * * write(2) system-call 5 | * * see: man 2 write 6 | * * see: linux-kernel's include/uapi/asm-generic/unistd.h 7 | * * see: man syscall (https://man7.org/linux/man-pages/man2/syscall.2.html) 8 | */ 9 | 10 | .global main 11 | 12 | main: 13 | /* write(2) system-call */ 14 | li a7, 64 /* system-call number: write() */ 15 | li a0, 1 /* fd: stdout */ 16 | la a1, msg /* buf: */ 17 | li a2, 13 /* count: */ 18 | ecall /* system call */ 19 | 20 | 21 | /* return from main */ 22 | ret 23 | 24 | 25 | /* read-only data */ 26 | .section .rodata 27 | msg: 28 | .string "Hello world!\n" 29 | -------------------------------------------------------------------------------- /riscv/linux/120.libc_call/call_putchar.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for RISC-V(RV64)/Linux 3 | * 4 | * * putchar(3) standard-library 5 | * * see: man 3 putchar 6 | */ 7 | 8 | .global main 9 | 10 | main: 11 | /* save ra(return address) */ 12 | addi sp,sp,-8 13 | sd ra,(sp) 14 | 15 | 16 | /* putchar(3) library-call */ 17 | li a0, 0x41 /* 'A' */ 18 | jal putchar 19 | 20 | li a0, 0x42 /* 'B' */ 21 | jal putchar 22 | 23 | li a0, 'C' 24 | jal putchar 25 | 26 | li a0, '\n' 27 | jal putchar 28 | 29 | 30 | /* restore ra(return address) */ 31 | ld ra,(sp) 32 | addi sp,sp,8 33 | 34 | /* return from main */ 35 | ret 36 | -------------------------------------------------------------------------------- /riscv/linux/120.libc_call/call_puts.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for RISC-V(RV64)/Linux 3 | * 4 | * * puts(3) standard-library 5 | * * see: man 3 puts 6 | */ 7 | 8 | .global main 9 | 10 | main: 11 | /* save ra(return address) */ 12 | addi 
sp,sp,-8 13 | sd ra,(sp) 14 | 15 | 16 | /* puts(3) library-call */ 17 | la a0, msg /* 1st argument */ 18 | jal puts 19 | 20 | 21 | /* restore ra(return address) */ 22 | ld ra,(sp) 23 | addi sp,sp,8 24 | 25 | /* return from main */ 26 | ret 27 | 28 | 29 | /* read-only data */ 30 | .section .rodata 31 | msg: 32 | .string "Hello world!" 33 | -------------------------------------------------------------------------------- /riscv/linux/120.libc_call/call_variable_args.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for RISC-V(RV64)/Linux 3 | * 4 | * * printf(3) standard-library 5 | * * variable argument call 6 | * * see: man 3 printf 7 | */ 8 | 9 | .global main 10 | 11 | main: 12 | /* save ra(return address) */ 13 | addi sp,sp,-8 14 | sd ra,(sp) 15 | 16 | li x6, 100 /* test data */ 17 | 18 | 19 | /* printf(3) library-call */ 20 | la a0, msg /* 1st argument */ 21 | mv a1, x6 /* 2nd argument */ 22 | jal printf 23 | 24 | 25 | /* restore ra(return address) */ 26 | ld ra,(sp) 27 | addi sp,sp,8 28 | 29 | /* return from main */ 30 | ret 31 | 32 | 33 | /* read-only data */ 34 | .section .rodata 35 | msg: 36 | .string "x6 = %d\n" 37 | -------------------------------------------------------------------------------- /riscv/linux/300.operation/add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for RISC-V(RV64)/Linux 3 | * 4 | * * add operation 5 | */ 6 | 7 | .global main 8 | 9 | main: 10 | /* save ra(return address) */ 11 | addi sp,sp,-8 12 | sd ra,(sp) 13 | 14 | 15 | /* Add operation */ 16 | li t1, 1 17 | li t2, 2 18 | add t0, t1, t2 /* t0 <- t1 + t2 */ 19 | 20 | 21 | /* printf for result-checking */ 22 | la a0, fmt /* 1st argument for printf*/ 23 | mv a1, t0 /* 2nd argument for printf*/ 24 | jal printf 25 | 26 | 27 | /* restore ra(return address) */ 28 | ld ra,(sp) 29 | addi sp,sp,8 30 | 31 | /* return from main */ 32 | ret 33 | 34 | 
35 | /* read-only data */ 36 | .section .rodata 37 | fmt: 38 | .string "t1 + t2 = %x\n" 39 | -------------------------------------------------------------------------------- /riscv/linux/300.operation/load.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for RISC-V(RV64)/Linux 3 | * 4 | * * load operation 5 | */ 6 | 7 | .global main 8 | 9 | main: 10 | /* save ra(return address) */ 11 | addi sp,sp,-8 12 | sd ra,(sp) 13 | 14 | 15 | /* Load operation */ 16 | la t0, byte_array 17 | ld t1, (t0) /* load (64 bits) */ 18 | 19 | 20 | /* printf for result-checking */ 21 | la a0, fmt /* 1st argument for printf*/ 22 | mv a1, t1 /* 2nd argument for printf*/ 23 | jal printf 24 | 25 | 26 | /* restore ra(return address) */ 27 | ld ra,(sp) 28 | addi sp,sp,8 29 | 30 | /* return from main */ 31 | ret 32 | 33 | 34 | /* read-only data */ 35 | .section .rodata 36 | .balign 8 37 | fmt: 38 | .string "byte_array = %llx\n" /* %llx for long long hex */ 39 | 40 | byte_array: 41 | .quad 0xfedcba9876543210 42 | -------------------------------------------------------------------------------- /riscv/linux/300.operation/load_size.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for RISC-V(RV64)/Linux 3 | * 4 | * * load operations for each size with sign extension 5 | * 6 | * * 64-bit load with ld instruction 7 | * * 32-bit load with lw instruction 8 | * * 16-bit load with lh instruction 9 | * * 8-bit load with lb instruction 10 | */ 11 | 12 | .global main 13 | 14 | main: 15 | /* save ra(return address) */ 16 | addi sp,sp,-8 17 | sd ra,(sp) 18 | 19 | 20 | /* Load operation */ 21 | la t0, byte_array 22 | ld t1, (t0) /* load (64 bits) */ 23 | lw t2, (t0) /* load (32 bits) */ 24 | lh t3, (t0) /* load (16 bits) */ 25 | lb t4, (t0) /* load (8 bits) */ 26 | 27 | 28 | /* printf for result-checking */ 29 | la a0, fmt /* 1st argument for printf */ 30 | mv a1, t1 /* 2nd argument
*/ 31 | mv a2, t2 /* 3rd argument */ 32 | mv a3, t3 /* 4th argument */ 33 | mv a4, t4 /* 5th argument */ 34 | jal printf 35 | 36 | 37 | /* restore ra(return address) */ 38 | ld ra,(sp) 39 | addi sp,sp,8 40 | 41 | /* return from main */ 42 | ret 43 | 44 | 45 | /* read-only data */ 46 | .section .rodata 47 | .balign 8 48 | fmt: 49 | .string "t1, t2, t3, t4 = \n%016llx\n%016llx\n%016llx\n%016llx\n" 50 | 51 | byte_array: 52 | .quad 0xfedcba9876543210 53 | -------------------------------------------------------------------------------- /riscv/linux/300.operation/store.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for RISC-V(RV64)/Linux 3 | * 4 | * * store operation 5 | */ 6 | 7 | .global main 8 | 9 | main: 10 | /* save ra(return address) */ 11 | addi sp,sp,-8 12 | sd ra,(sp) 13 | 14 | 15 | /* Store operation */ 16 | la t0, memory_buf 17 | li t1, 0xfedcba9876543210 18 | sd t1, (t0) /* store (64 bits) */ 19 | 20 | 21 | /* Load operation for result-checking*/ 22 | ld t2, (t0) /* load (64 bits) */ 23 | 24 | /* printf for result-checking */ 25 | la a0, fmt /* 1st argument for printf*/ 26 | mv a1, t2 /* 2nd argument for printf*/ 27 | jal printf 28 | 29 | 30 | /* restore ra(return address) */ 31 | ld ra,(sp) 32 | addi sp,sp,8 33 | 34 | /* return from main */ 35 | ret 36 | 37 | 38 | /* read-only data */ 39 | .section .rodata 40 | fmt: 41 | .string "memory_buf = %016llx\n" /* %llx for long long hex */ 42 | 43 | 44 | /* read-write initialized data */ 45 | .data 46 | .balign 8 47 | memory_buf: 48 | .quad 0 49 | -------------------------------------------------------------------------------- /riscv/linux/310.floating/fadd.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Floating-point example for RISC-V(RV64)/Linux 3 | * 4 | * * Double-precision (64-bit) 5 | * * Floating-point add operation 6 | */ 7 | 8 | .global main 9 | 10 | main: 11 | /* save ra(return address) */ 
12 | addi sp,sp,-8 13 | sd ra,(sp) 14 | 15 | 16 | /* Floating-point add operation */ 17 | la t0, value1 18 | fld ft1, (t0) /* ft1 <- 1.0 */ 19 | la t0, value2 20 | fld ft2, (t0) /* ft2 <- 2.0 */ 21 | 22 | fadd.d ft0, ft1, ft2 /* ft0 <- ft1 + ft2 */ 23 | 24 | 25 | /* printf for result-checking */ 26 | la a0, fmt /* 1st argument for printf*/ 27 | fmv.x.d a1, ft0 /* 2nd argument for printf*/ 28 | jal printf 29 | 30 | 31 | /* restore ra(return address) */ 32 | ld ra,(sp) 33 | addi sp,sp,8 34 | 35 | /* return from main */ 36 | ret 37 | 38 | 39 | /* read-only data */ 40 | .section .rodata 41 | value1: 42 | .double 1.0 43 | value2: 44 | .double 2.0 45 | 46 | fmt: 47 | .string "ft1 + ft2 = %f\n" 48 | -------------------------------------------------------------------------------- /riscv/linux/400.control_flow/call_ret.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for RISC-V(RV64)/Linux 3 | * 4 | * * `call` and `ret` structure example 5 | * 6 | * * C-like pseudo-code: 7 | * void main() { 8 | * func_inc(100); 9 | * } 10 | * 11 | * int func_inc(int a0) { 12 | * return (a0 + 1); 13 | * } 14 | */ 15 | 16 | .global main 17 | 18 | main: 19 | /* save ra(return address) */ 20 | addi sp,sp,-8 21 | sd ra,(sp) 22 | 23 | 24 | /* call */ 25 | li a0, 100 /* passing 1st argument with a0 */ 26 | jal func_inc /* call */ 27 | mv t0, a0 /* preserve return value (a0) */ 28 | 29 | 30 | /* printf for result-checking */ 31 | la a0, fmt /* 1st argument for printf */ 32 | mv a1, t0 /* 2nd argument */ 33 | jal printf 34 | 35 | /* restore ra(return address) */ 36 | ld ra,(sp) 37 | addi sp,sp,8 38 | 39 | /* return from main */ 40 | ret 41 | 42 | 43 | /* function definition */ 44 | func_inc: 45 | add a0, a0, 1 /* function body */ 46 | ret /* return to caller */ 47 | 48 | 49 | /* read-only data */ 50 | .section .rodata 51 | fmt: 52 | .string "a0 = %d\n" 53 |
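The C-like pseudo-code in call_ret.S can be turned into compilable C as a rough cross-check of the assembly. This is only a sketch and is not part of the repository; the parameter name `a0` simply mirrors the RISC-V argument register. It returns `a0 + 1` because that is the value `add a0, a0, 1` leaves in the return register (a post-increment expression would hand back the old value instead).

```c
/* Hypothetical C rendering of func_inc from call_ret.S (illustration only). */
int func_inc(int a0)
{
    /* Same effect as: add a0, a0, 1 ; ret */
    return a0 + 1;
}
```

Calling `func_inc(100)` here yields 101, which is the value the assembly version passes to `printf` as its second argument.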
-------------------------------------------------------------------------------- /riscv/linux/400.control_flow/for.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for RISC-V(RV64)/Linux 3 | * 4 | * * `for`-structure example 5 | * 6 | * * C-like pseudo-code: 7 | * t2 = 0; 8 | * for (t1 = 1; t1 <= 10; t1++) { 9 | * t2 = t2 + t1; 10 | * } 11 | */ 12 | 13 | .global main 14 | 15 | main: 16 | /* save ra(return address) */ 17 | addi sp,sp,-8 18 | sd ra,(sp) 19 | 20 | 21 | /* accumulator for test */ 22 | li t2, 0 /* t2 = 0 */ 23 | 24 | 25 | /* `for`-structure */ 26 | li t1, 1 /* t1 = 1 */ 27 | 28 | L_loop: 29 | add t2, t2, t1 /* loop-body (t2 = t2 + t1) */ 30 | 31 | add t1, t1, 1 /* t1++ */ 32 | li t0, 10 33 | ble t1, t0, L_loop /* t1 <= 10 */ 34 | 35 | 36 | /* printf for result-checking */ 37 | la a0, fmt /* 1st argument for printf */ 38 | mv a1, t2 /* 2nd argument */ 39 | jal printf 40 | 41 | 42 | /* restore ra(return address) */ 43 | ld ra,(sp) 44 | addi sp,sp,8 45 | 46 | /* return from main */ 47 | ret 48 | 49 | 50 | /* read-only data */ 51 | .section .rodata 52 | fmt: 53 | .string "t2 = %d\n" 54 | -------------------------------------------------------------------------------- /riscv/linux/400.control_flow/if.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for RISC-V(RV64)/Linux 3 | * 4 | * * `if`-structure example 5 | * 6 | * * C-like pseudo-code: 7 | * if (t1 >= 0) { 8 | * t2 = 1 9 | * } else { 10 | * t2 = -1 11 | * } 12 | */ 13 | 14 | .global main 15 | 16 | main: 17 | /* save ra(return address) */ 18 | addi sp,sp,-8 19 | sd ra,(sp) 20 | 21 | 22 | /* test data */ 23 | li t1, 10 /* Change here */ 24 | 25 | 26 | /* if (t1 >= 0) */ 27 | bltz t1, L_else 28 | 29 | /* then */ 30 | li t2, 1 31 | j L_endif 32 | 33 | /* else */ 34 | L_else: 35 | li t2, -1 36 | 37 | /* endif */ 38 | L_endif: 39 | 40 | 41 | /* printf for result-checking */ 
42 | la a0, fmt /* 1st argument for printf */ 43 | mv a1, t1 /* 2nd argument */ 44 | mv a2, t2 /* 3rd argument */ 45 | jal printf 46 | 47 | 48 | /* restore ra(return address) */ 49 | ld ra,(sp) 50 | addi sp,sp,8 51 | 52 | /* return from main */ 53 | ret 54 | 55 | 56 | /* read-only data */ 57 | .section .rodata 58 | fmt: 59 | .string "t1, t2 = %d, %d\n" 60 | -------------------------------------------------------------------------------- /riscv/linux/700.sync/atomic_ll_sc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Atomic operation example for RISC-V(RV64)/Linux 3 | * 4 | * * atomic operations with load-linked / store-conditional. 5 | * operations with weak consistency (without release/acquire). 6 | * 7 | * * see: The RISC-V Instruction Set Manual Volume I: Unprivileged ISA 8 | * Appendix A RVWMO Explanatory Material 9 | */ 10 | 11 | .global main 12 | 13 | main: 14 | /* load-linked (load-reserved) - store-conditional */ 15 | la t0, flag 16 | lr.d t1, (t0) /* load-linked (load-reserved) */ 17 | 18 | sc.d t2, t1, (t0) /* store-conditional */ 19 | 20 | 21 | /* return from main */ 22 | ret 23 | 24 | 25 | /* read-write initialized data */ 26 | .data 27 | .balign 8 28 | flag: 29 | .quad 0 30 | -------------------------------------------------------------------------------- /riscv/linux/700.sync/atomic_op.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Atomic operation example for RISC-V(RV64)/Linux 3 | * 4 | * * atomic operations with weak consistency (without release/acquire). 
5 | * * swap (amoswap) 6 | * * fetch and add (amoadd) 7 | * 8 | * * see: The RISC-V Instruction Set Manual Volume I: Unprivileged ISA 9 | * Appendix A RVWMO Explanatory Material 10 | */ 11 | 12 | .global main 13 | 14 | main: 15 | /* atomic operations */ 16 | la t0, flag 17 | amoswap.d t1, t2, (t0) /* swap */ 18 | 19 | amoadd.d t1, t2, (t0) /* fetch and add */ 20 | 21 | 22 | /* return from main */ 23 | ret 24 | 25 | 26 | /* read-write initialized data */ 27 | .data 28 | .balign 8 29 | flag: 30 | .quad 0 31 | -------------------------------------------------------------------------------- /riscv/linux/700.sync/memory_ordering.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Memory ordering example for RISC-V(RV64)/Linux 3 | * 4 | * * memory ordering operations. 5 | * * load fence 6 | * * store fence 7 | * * load/store fence 8 | * 9 | * * see: The RISC-V Instruction Set Manual Volume I: Unprivileged ISA 10 | * Appendix A RVWMO Explanatory Material 11 | */ 12 | 13 | .global main 14 | 15 | main: 16 | /* memory fence */ 17 | fence r,r /* load - load fence */ 18 | 19 | fence w,w /* store - store fence */ 20 | 21 | fence rw,rw /* load/store - load/store fence */ 22 | 23 | 24 | /* return from main */ 25 | ret 26 | -------------------------------------------------------------------------------- /riscv/linux/710.sync_parts/counter_fetch_and_add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for RISC-V(RV64)/Linux 3 | * 4 | * * fetch-and-add instruction 5 | * 6 | * * C-like pseudo-code: 7 | * t1 = 1; 8 | * *counter = *counter + t1; 9 | */ 10 | 11 | .global main 12 | 13 | main: 14 | /* save ra(return address) */ 15 | addi sp,sp,-8 16 | sd ra,(sp) 17 | 18 | 19 | 20 | /* set the address of the counter */ 21 | la t0, counter 22 | 23 | /* increment the counter with fetch-and-add */ 24 | li t1, 1 /* t1 = 1 */ 25 | amoadd.d x0, t1, (t0) /* *counter = *counter +
t1 */ 26 | 27 | 28 | 29 | /* printf for result-checking */ 30 | la a0, fmt /* 1st argument for printf*/ 31 | ld a1, (t0) /* 2nd argument for printf*/ 32 | jal printf 33 | 34 | 35 | /* restore ra(return address) */ 36 | ld ra,(sp) 37 | addi sp,sp,8 38 | 39 | /* return from main */ 40 | ret 41 | 42 | 43 | /* read-only data */ 44 | .section .rodata 45 | fmt: 46 | .string "counter = %d\n" 47 | 48 | 49 | /* read-write initialized data */ 50 | .data 51 | .balign 8 52 | counter: 53 | .quad 10 54 | -------------------------------------------------------------------------------- /riscv/linux/710.sync_parts/counter_llsc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for RISC-V(RV64)/Linux 3 | * 4 | * * ll/sc instruction 5 | * 6 | * * C-like pseudo-code: 7 | * L_loop: 8 | * t1 = *counter; 9 | * t1 = t1 + 1; 10 | * if (not-conflict) *counter = t1; 11 | * if (fail) goto L_loop; 12 | */ 13 | 14 | .global main 15 | 16 | main: 17 | /* save ra(return address) */ 18 | addi sp,sp,-8 19 | sd ra,(sp) 20 | 21 | 22 | 23 | /* set the address of the counter */ 24 | la t0, counter 25 | 26 | /* increment the counter with ll/sc loop */ 27 | L_loop: 28 | lr.d t1, (t0) /* t1 = *counter */ 29 | addi t1, t1, 1 /* t1 = t1 + 1 */ 30 | sc.d t2, t1, (t0) /* if (not-conflict) *counter = t1 */ 31 | bnez t2, L_loop /* if (fail) goto L_loop */ 32 | 33 | 34 | 35 | /* printf for result-checking */ 36 | la a0, fmt /* 1st argument for printf*/ 37 | ld a1, (t0) /* 2nd argument for printf*/ 38 | jal printf 39 | 40 | 41 | /* restore ra(return address) */ 42 | ld ra,(sp) 43 | addi sp,sp,8 44 | 45 | /* return from main */ 46 | ret 47 | 48 | 49 | /* read-only data */ 50 | .section .rodata 51 | fmt: 52 | .string "counter = %d\n" 53 | 54 | 55 | /* read-write initialized data */ 56 | .data 57 | .balign 8 58 | counter: 59 | .quad 10 60 | -------------------------------------------------------------------------------- 
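The two counter idioms above (a single fetch-and-add, and a load-reserved/store-conditional retry loop) can be sketched in portable C11 for comparison. This is an illustration under assumptions, not code from the repository: the function names are made up, and C has no direct LL/SC primitive, so a weak compare-exchange loop stands in for `lr.d`/`sc.d`. It is a different mechanism with the same retry-until-success shape. Relaxed ordering is used because the assembly uses `amoadd.d`, `lr.d` and `sc.d` without `.aq`/`.rl` suffixes.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; names are not from the repository. */

/* Rough counterpart of counter_fetch_and_add.S: one atomic read-modify-write. */
int64_t counter_fetch_and_add(_Atomic int64_t *counter)
{
    /* atomic_fetch_add returns the OLD value, so add 1 to report the new one. */
    return atomic_fetch_add_explicit(counter, 1, memory_order_relaxed) + 1;
}

/* Rough counterpart of counter_llsc.S: retry until the update succeeds. */
int64_t counter_retry_loop(_Atomic int64_t *counter)
{
    int64_t old = atomic_load_explicit(counter, memory_order_relaxed);

    /* On failure, `old` is reloaded with the current value and we try again. */
    while (!atomic_compare_exchange_weak_explicit(counter, &old, old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* retry, comparable to `bnez t2, L_loop` */
    }
    return old + 1;
}
```

The weak compare-exchange is allowed to fail spuriously, which is why it sits in a loop, much as a store-conditional can fail without a real conflict.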
/riscv/linux/710.sync_parts/counter_swap.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for RISC-V(RV64)/Linux 3 | * 4 | * * swap instruction 5 | * 6 | * * C-like pseudo-code: 7 | * // acquire the lock 8 | * t2 = 1; 9 | * L_loop: 10 | * t3 = *lock; 11 | * *lock = t2; 12 | * if (t3 != 0) goto L_loop; 13 | * 14 | * // increment the counter 15 | * t2 = *counter; 16 | * t2 = t2 + 1; 17 | * *counter = t2; 18 | * 19 | * // release the lock 20 | * *lock = 0; 21 | */ 22 | 23 | .global main 24 | 25 | main: 26 | /* save ra(return address) */ 27 | addi sp,sp,-8 28 | sd ra,(sp) 29 | 30 | 31 | 32 | /* set addresses of the lock and the counter */ 33 | la t1, lock 34 | la t0, counter 35 | 36 | 37 | /* acquire the lock */ 38 | li t2, 1 39 | L_loop: 40 | amoswap.d t3, t2, (t1) /* swap */ 41 | bne t3, x0, L_loop /* if (t3 != 0) goto L_loop */ 42 | 43 | /* increment the counter */ 44 | ld t2, (t0) /* t2 = *counter */ 45 | addi t2, t2, 1 /* t2 = t2 + 1 */ 46 | sd t2, (t0) /* *counter = t2 */ 47 | 48 | /* release the lock */ 49 | sd x0, (t1) /* *lock = 0 */ 50 | 51 | 52 | 53 | /* printf for result-checking */ 54 | la a0, fmt /* 1st argument for printf*/ 55 | ld a1, (t0) /* 2nd argument for printf*/ 56 | jal printf 57 | 58 | 59 | /* restore ra(return address) */ 60 | ld ra,(sp) 61 | addi sp,sp,8 62 | 63 | /* return from main */ 64 | ret 65 | 66 | 67 | /* read-only data */ 68 | .section .rodata 69 | fmt: 70 | .string "counter = %d\n" 71 | 72 | 73 | /* read-write initialized data */ 74 | .data 75 | .balign 8 76 | lock: 77 | .quad 0 78 | counter: 79 | .quad 10 80 | -------------------------------------------------------------------------------- /riscv/linux/750.multi/pthread.S: -------------------------------------------------------------------------------- 1 | /* 2 | * pthread (generating a thread) example for RISC-V(RV64)/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child: a child thread
(generated by pthread_create) 7 | * 8 | * * see: 9 | * * man pthread_create 10 | * * man pthread_join 11 | */ 12 | 13 | .globl main 14 | 15 | main: 16 | /* puts for trace-log */ 17 | la a0, fmtMain_start 18 | jal puts 19 | 20 | 21 | /* create a thread with pthread_create() */ 22 | la a0, tid /* pthread_t: &tid */ 23 | li a1, 0 /* pthread_attr_t: NULL */ 24 | la a2, child /* start_routine: &child */ 25 | li a3, 0 /* arg: NULL */ 26 | jal pthread_create 27 | 28 | 29 | /* wait for the thread with pthread_join() */ 30 | la t0, tid 31 | ld a0, (t0) /* pthread_t: tid */ 32 | li a1, 0 /* retval: NULL */ 33 | jal pthread_join 34 | 35 | 36 | /* puts for trace-log */ 37 | la a0, fmtMain_finish 38 | jal puts 39 | 40 | /* exit from main */ 41 | li a0, 0 /* status: EXIT_SUCCESS */ 42 | jal exit 43 | 44 | 45 | 46 | 47 | /* a child thread */ 48 | child: 49 | addi sp,sp,-8 50 | sd ra,(sp) 51 | 52 | /* puts for trace-log */ 53 | la a0, fmtChild 54 | jal puts 55 | 56 | /* finish this thread */ 57 | li a0, 0 /* return value (not used) */ 58 | ld ra,(sp) 59 | addi sp,sp,8 60 | ret 61 | 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmtMain_start: 67 | .string "main(): start" 68 | fmtMain_finish: 69 | .string "main(): finish" 70 | fmtChild: 71 | .string "child(): start" 72 | 73 | 74 | 75 | /* read-write data */ 76 | .data 77 | .balign 8 78 | tid: 79 | .quad 0 80 | -------------------------------------------------------------------------------- /riscv/linux/750.multi/pthread2.S: -------------------------------------------------------------------------------- 1 | /* 2 | * pthread (generating two threads) example for RISC-V(RV64)/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child1: 1st child thread (generated by pthread_create) 7 | * * child2: 2nd child thread (generated by pthread_create) 8 | * 9 | * * see: 10 | * * man pthread_create 11 | * * man pthread_join 12 | */ 13 | 14 | .globl main 15 | 16 | main: 17 | /* puts for trace-log */ 18 | la a0,
fmtMain_start 19 | jal puts 20 | 21 | 22 | /* pthread_create() */ 23 | 24 | /* for 1st child thread */ 25 | la a0, tid1 /* pthread_t: &tid1 */ 26 | li a1, 0 /* pthread_attr_t: NULL */ 27 | la a2, child1 /* start_routine: &child1 */ 28 | li a3, 0 /* arg: NULL */ 29 | jal pthread_create 30 | 31 | /* for 2nd child thread */ 32 | la a0, tid2 /* pthread_t: &tid2 */ 33 | li a1, 0 /* pthread_attr_t: NULL */ 34 | la a2, child2 /* start_routine: &child2 */ 35 | li a3, 0 /* arg: NULL */ 36 | jal pthread_create 37 | 38 | 39 | /* pthread_join() */ 40 | 41 | /* for 1st child thread */ 42 | la t0, tid1 43 | ld a0, (t0) /* pthread_t: tid1 */ 44 | li a1, 0 /* retval: NULL */ 45 | jal pthread_join 46 | 47 | /* for 2nd child thread */ 48 | la t0, tid2 49 | ld a0, (t0) /* pthread_t: tid2 */ 50 | li a1, 0 /* retval: NULL */ 51 | jal pthread_join 52 | 53 | 54 | /* puts for trace-log */ 55 | la a0, fmtMain_finish 56 | jal puts 57 | 58 | /* exit from main */ 59 | li a0, 0 /* status: EXIT_SUCCESS */ 60 | jal exit 61 | 62 | 63 | 64 | 65 | /* 1st child thread */ 66 | child1: 67 | addi sp,sp,-8 68 | sd ra,(sp) 69 | 70 | /* puts for trace-log */ 71 | la a0, fmtChild1 72 | jal puts 73 | 74 | /* finish this thread */ 75 | li a0, 0 /* return value (not used) */ 76 | ld ra,(sp) 77 | addi sp,sp,8 78 | ret 79 | 80 | 81 | 82 | /* 2nd child thread */ 83 | child2: 84 | addi sp,sp,-8 85 | sd ra,(sp) 86 | 87 | /* puts for trace-log */ 88 | la a0, fmtChild2 89 | jal puts 90 | 91 | /* finish this thread */ 92 | li a0, 0 /* return value (not used) */ 93 | ld ra,(sp) 94 | addi sp,sp,8 95 | ret 96 | 97 | 98 | 99 | /* read-only data */ 100 | .section .rodata 101 | fmtMain_start: 102 | .string "main(): start" 103 | fmtMain_finish: 104 | .string "main(): finish" 105 | 106 | fmtChild1: 107 | .string "child1(): start" 108 | fmtChild2: 109 | .string "child2(): start" 110 | 111 | 112 | 113 | /* read-write data */ 114 | .data 115 | .balign 8 116 | tid1: 117 | .quad 0 118 | tid2: 119 | .quad 0 120 |
-------------------------------------------------------------------------------- /riscv/linux/750.multi/sync_threads.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Sync threads example for RISC-V(RV64)/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child1: 1st child thread (set the sync_flag for 2nd thread) 7 | * * child2: 2nd child thread (wait 1st thread via the sync_flag) 8 | * 9 | * * see: 10 | * * man pthread_create 11 | */ 12 | 13 | .globl main 14 | 15 | main: 16 | /* puts for trace-log */ 17 | la a0, fmtMain_start 18 | jal puts 19 | 20 | 21 | /* pthread_create() */ 22 | 23 | /* for 1st child thread */ 24 | la a0, tid1 /* pthread_t: &tid1 */ 25 | li a1, 0 /* pthread_attr_t: NULL */ 26 | la a2, child1 /* start_routine: &child1 */ 27 | li a3, 0 /* arg: NULL */ 28 | jal pthread_create 29 | 30 | /* for 2nd child thread */ 31 | la a0, tid2 /* pthread_t: &tid2 */ 32 | li a1, 0 /* pthread_attr_t: NULL */ 33 | la a2, child2 /* start_routine: &child2 */ 34 | li a3, 0 /* arg: NULL */ 35 | jal pthread_create 36 | 37 | 38 | /* pthread_join() */ 39 | 40 | /* for 1st child thread */ 41 | la t0, tid1 42 | ld a0, (t0) /* pthread_t: tid1 */ 43 | li a1, 0 /* retval: NULL */ 44 | jal pthread_join 45 | 46 | /* for 2nd child thread */ 47 | la t0, tid2 48 | ld a0, (t0) /* pthread_t: tid2 */ 49 | li a1, 0 /* retval: NULL */ 50 | jal pthread_join 51 | 52 | 53 | /* puts for trace-log */ 54 | la a0, fmtMain_finish 55 | jal puts 56 | 57 | /* exit from main */ 58 | li a0, 0 /* status: EXIT_SUCCESS */ 59 | jal exit 60 | 61 | 62 | 63 | 64 | /* 1st child thread */ 65 | child1: 66 | addi sp,sp,-8 67 | sd ra,(sp) 68 | 69 | /* puts for trace-log */ 70 | la a0, fmtChild1_start 71 | jal puts 72 | 73 | /* wait your key-in */ 74 | la a0, fmtChild1_prompt 75 | jal puts 76 | jal getchar 77 | 78 | 79 | /* set the sync-flag to 1 */ 80 | li t0, 1 81 | la t1, sync_flag 82 | sd t0, (t1) 83 | 84 | 85 | /* puts for trace-log */ 86 | la a0, 
fmtChild1_finish 87 | jal puts 88 | 89 | /* finish this thread */ 90 | li a0, 0 /* return value (not used) */ 91 | ld ra,(sp) 92 | addi sp,sp,8 93 | ret 94 | 95 | 96 | 97 | /* 2nd child thread */ 98 | child2: 99 | addi sp,sp,-8 100 | sd ra,(sp) 101 | 102 | /* puts for trace-log */ 103 | la a0, fmtChild2_start 104 | jal puts 105 | 106 | 107 | /* wait 1st thread via the sync_flag */ 108 | L_loop: 109 | la t2, sync_flag 110 | ld t0, (t2) 111 | li t1, 1 112 | bne t0, t1, L_loop 113 | 114 | 115 | /* puts for trace-log */ 116 | la a0, fmtChild2_finish 117 | jal puts 118 | 119 | /* finish this thread */ 120 | li a0, 0 /* return value (not used) */ 121 | ld ra,(sp) 122 | addi sp,sp,8 123 | ret 124 | 125 | 126 | 127 | /* read-only data */ 128 | .section .rodata 129 | fmtMain_start: 130 | .string "main(): start" 131 | fmtMain_finish: 132 | .string "main(): finish" 133 | 134 | fmtChild1_start: 135 | .string "child1(): start" 136 | fmtChild1_prompt: 137 | .string "child1(): enter key:" 138 | fmtChild1_finish: 139 | .string "child1(): finish" 140 | 141 | fmtChild2_start: 142 | .string "child2(): start" 143 | fmtChild2_finish: 144 | .string "child2(): finish" 145 | 146 | 147 | 148 | /* read-write data */ 149 | .data 150 | .balign 8 151 | tid1: 152 | .quad 0 153 | tid2: 154 | .quad 0 155 | 156 | sync_flag: 157 | .quad 0 158 | -------------------------------------------------------------------------------- /riscv/linux/Makefile: -------------------------------------------------------------------------------- 1 | 2 | # Makefile for RISC-V/RV64G 3 | 4 | CC = riscv64-unknown-linux-gnu-gcc 5 | OBJDUMP = riscv64-unknown-linux-gnu-objdump 6 | GDB = riscv64-unknown-linux-gnu-gdb 7 | 8 | CFLAGS += -g -march=rv64g # without rv64c (compact instruction) 9 | ASFLAGS += -g 10 | LDFLAGS += -static -pthread 11 | 12 | 13 | include ../../../Makefile 14 | -------------------------------------------------------------------------------- /riscv/linux/README.md:
-------------------------------------------------------------------------------- 1 | 2 | RISC-V (RV64G) assembly examples on linux 3 | ========================================= 4 | 5 | ## Examples 6 | * [simple main](./100.main) (./100.main) 7 | * [system call](./110.system_call) (./110.system_call) 8 | * [library call](./120.libc_call) (./120.libc_call) 9 | * [basic operations](./300.operation) (./300.operation) 10 | * [control flow](./400.control_flow) (./400.control_flow) 11 | * [sync](./700.sync) (./700.sync), [sync_parts](./710.sync_parts) (./710.sync_parts) 12 | * [multi-threads](./750.multi) (./750.multi) 13 | 14 | 15 | ## How to try 16 | 17 | * Assemble (generate binary) 18 | 19 | ``` 20 | $ make -f ../Makefile 21 | ``` 22 | 23 | * Execute 24 | 25 | ``` 26 | $ ./ 27 | or 28 | $ qemu-riscv64 ./ 29 | ``` 30 | 31 | * Disassemble (for full object) 32 | 33 | ``` 34 | $ make -f ../Makefile .disasm | less 35 | (search "main>" in `less` command) 36 | ``` 37 | 38 | * Disassemble (for .S only) 39 | 40 | ``` 41 | $ make -f ../Makefile .o.disasm 42 | ``` 43 | 44 | * Step execution with gdb 45 | 46 | ``` 47 | $ make -f ../Makefile 48 | $ gdb ./ 49 | (gdb) layout asm 50 | (gdb) layout regs 51 | (gdb) break main 52 | (gdb) run 53 | (gdb) stepi 54 | : 55 | (gdb) quit 56 | ``` 57 | * Step execution with qemu and gdb 58 | 59 | * on first terminal for qemu: 60 | 61 | ``` 62 | $ qemu-riscv64 -g 1234 ./ 63 | ``` 64 | 65 | * on second terminal for gdb: 66 | 67 | ``` 68 | $ riscv64-unknown-linux-gnu-gdb ./ 69 | (gdb) layout asm 70 | (gdb) layout regs 71 | (gdb) target remote localhost:1234 72 | (gdb) break main 73 | (gdb) continue 74 | (gdb) stepi 75 | : 76 | (gdb) quit 77 | ``` 78 | 79 | 80 | ## References 81 | 82 | * RISC-V 83 | * [The RISC-V Instruction Set Manual Volume I: Unprivileged ISA](https://riscv.org/technical/specifications/) 84 | * [RISC-V Assembly Programmer's Manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md) 85 | 86 | * Linux 87 | * [Linux
kernel; arch/riscv](https://github.com/torvalds/linux/tree/master/arch/riscv) 88 | 89 | * glibc 90 | * [glibc](https://www.gnu.org/software/libc/libc.html) 91 | * sysdeps/unix/sysv/linux/riscv 92 | 93 | * RISC-V ABI 94 | * [RISC-V ELF psABI specification](https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md) 95 | 96 | * System call ABI 97 | * [syscall(2) - Linux manual page](https://man7.org/linux/man-pages/man2/syscall.2.html) 98 | 99 | * GCC 100 | * [GCC online documentation](https://gcc.gnu.org/onlinedocs/) 101 | 102 | * GNU assembler and linker 103 | * [Documentation for binutils](https://sourceware.org/binutils/docs/) 104 | 105 | 106 | ## Further information 107 | 108 | ### Calling convention 109 | 110 | * System call 111 | * a7, a0, a1, a2, a3, a4, a5 -> a0, a1 112 | 113 | * Function call 114 | * a0, a1, a2, a3, a4, a5, a6, a7 -> a0, a1 115 | 116 | * see: 117 | * [RISC-V ELF psABI Document](https://github.com/riscv-non-isa/riscv-elf-psabi-doc) 118 | * glibc's sysdeps/unix/sysv/linux/riscv/syscall.c 119 | 120 | ### Tool installation on ubuntu (x86) 121 | 122 | * Cross assembler (RISC-V GNU Compiler Toolchain) 123 | 124 | Build with https://github.com/riscv/riscv-gnu-toolchain 125 | 126 | * QEMU User space emulator 127 | 128 | ``` 129 | apt install qemu-user 130 | ``` 131 | -------------------------------------------------------------------------------- /x86/linux/100.main/simple_main.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Simple main example for x86/Linux 3 | */ 4 | 5 | .intel_syntax noprefix 6 | .global main 7 | 8 | main: 9 | ret 10 | -------------------------------------------------------------------------------- /x86/linux/110.system_call/syscall_exit.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for x86/Linux 3 | * 4 | * * exit(2) system-call 5 | * * see: man 2 exit 6 | * * see: linux-kernel's
arch/x86/entry/syscalls/syscall_64.tbl 7 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 8 | */ 9 | 10 | .intel_syntax noprefix 11 | .global main 12 | 13 | main: 14 | sub rsp, 8 /* 16-byte alignment */ 15 | 16 | 17 | /* exit(2) system-call */ 18 | mov eax, 60 /* system-call number: exit() */ 19 | mov edi, 0 /* status: success */ 20 | syscall 21 | -------------------------------------------------------------------------------- /x86/linux/110.system_call/syscall_read.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for x86/Linux 3 | * 4 | * * read(2) system-call 5 | * * see: man 2 read 6 | * * see: linux-kernel's arch/x86/entry/syscalls/syscall_64.tbl 7 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 8 | */ 9 | 10 | .intel_syntax noprefix 11 | .global main 12 | 13 | main: 14 | sub rsp, 8 /* 16-byte alignment */ 15 | 16 | 17 | /* read(2) system-call */ 18 | mov eax, 0 /* system-call number: read() */ 19 | mov edi, 0 /* fd: stdin */ 20 | lea rsi, [rip + buf] /* buf: */ 21 | mov edx, 3 /* count: */ 22 | syscall 23 | 24 | /* write(2) system-call */ 25 | mov eax, 1 /* system-call number: write() */ 26 | mov edi, 1 /* fd: stdout */ 27 | lea rsi, [rip + buf] /* buf: */ 28 | mov edx, 3 /* count: */ 29 | syscall 30 | 31 | 32 | /* return from main */ 33 | add rsp, 8 /* 16-byte alignment */ 34 | ret 35 | 36 | 37 | /* read-write zero initialized data */ 38 | .bss 39 | buf: 40 | .space 10 41 | -------------------------------------------------------------------------------- /x86/linux/110.system_call/syscall_write.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for x86/Linux 3 | * 4 | * * write(2) system-call 5 | * * see: man 2 write 6 | * * see: linux-kernel's arch/x86/entry/syscalls/syscall_64.tbl 7 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 8 | */ 9 | 10 | .intel_syntax noprefix 11 | 
.global main 12 | 13 | main: 14 | sub rsp, 8 /* 16-byte alignment */ 15 | 16 | 17 | /* write(2) system-call */ 18 | mov eax, 1 /* system-call number: write() */ 19 | mov edi, 1 /* fd: stdout */ 20 | lea rsi, [rip + msg] /* buf: */ 21 | mov edx, 13 /* count: */ 22 | syscall 23 | 24 | 25 | /* return from main */ 26 | add rsp, 8 /* 16-byte alignment */ 27 | ret 28 | 29 | 30 | /* read-only data */ 31 | .section .rodata 32 | msg: 33 | .string "Hello world!\n" 34 | -------------------------------------------------------------------------------- /x86/linux/120.libc_call/call_putchar.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for x86/Linux 3 | * 4 | * * putchar(3) standard-library 5 | * * see: man 3 putchar 6 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 7 | */ 8 | 9 | .intel_syntax noprefix 10 | .global main 11 | 12 | main: 13 | sub rsp, 8 /* 16-byte alignment */ 14 | 15 | 16 | /* putchar(3) library-call */ 17 | mov edi, 0x41 /* 'A' */ 18 | call putchar 19 | 20 | mov edi, 0x42 /* 'B' */ 21 | call putchar 22 | 23 | mov edi, 'C' 24 | call putchar 25 | 26 | mov edi, '\n' 27 | call putchar 28 | 29 | 30 | /* return from main */ 31 | add rsp, 8 /* 16-byte alignment */ 32 | ret 33 | -------------------------------------------------------------------------------- /x86/linux/120.libc_call/call_puts.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for x86/Linux 3 | * 4 | * * puts(3) standard-library 5 | * * see: man 3 puts 6 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 7 | */ 8 | 9 | .intel_syntax noprefix 10 | .global main 11 | 12 | main: 13 | sub rsp, 8 /* 16-byte alignment */ 14 | 15 | 16 | /* puts(3) library-call */ 17 | lea rdi, [rip + msg] /* 1st argument */ 18 | call puts 19 | 20 | 21 | /* return from main */ 22 | add rsp, 8 /* 16-byte alignment */ 23 | ret 24 | 25 | 26 | 
/* read-only data */ 27 | .section .rodata 28 | msg: 29 | .string "Hello world!" 30 | -------------------------------------------------------------------------------- /x86/linux/120.libc_call/call_variable_args.S: -------------------------------------------------------------------------------- 1 | /* 2 | * C standard library call example for x86/Linux 3 | * 4 | * * printf(3) standard-library 5 | * * variable argument call 6 | * * see: man 3 printf 7 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 8 | */ 9 | 10 | .intel_syntax noprefix 11 | .global main 12 | 13 | main: 14 | sub rsp, 8 /* 16-byte alignment */ 15 | 16 | mov rcx, 100 /* test data */ 17 | 18 | 19 | /* printf(3) library-call */ 20 | lea rdi, [rip + msg] /* 1st argument */ 21 | mov rsi, rcx /* 2nd argument */ 22 | mov eax, 0 /* the number of vector registers */ 23 | call printf 24 | 25 | 26 | /* return from main */ 27 | add rsp, 8 /* 16-byte alignment */ 28 | ret 29 | 30 | 31 | /* read-only data */ 32 | .section .rodata 33 | msg: 34 | .string "rcx = %d\n" 35 | -------------------------------------------------------------------------------- /x86/linux/200.time/rdtsc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Getting time-information (counter) for x86/Linux 3 | * 4 | * * Read time-stamp counter with rdtsc instruction 5 | */ 6 | 7 | .intel_syntax noprefix 8 | .global main 9 | 10 | main: 11 | sub rsp, 8 /* 16-byte alignment */ 12 | 13 | 14 | /* read time-stamp counter (TSC) */ 15 | rdtsc /* read TSC into edx:eax */ 16 | 17 | 18 | /* printf for result-checking */ 19 | lea rdi, [rip + fmt] /* 1st argument for printf */ 20 | mov esi, edx /* 2nd argument for printf */ 21 | mov edx, eax /* 3rd argument for printf */ 22 | mov eax, 0 /* the number of vector registers */ 23 | call printf 24 | 25 | 26 | /* return from main */ 27 | add rsp, 8 /* 16-byte alignment */ 28 | ret 29 | 30 | 31 | /* read-only data */ 32 | .section .rodata 33 | fmt: 34 |
.string "edx = %08x; eax = %08x\n" 35 | -------------------------------------------------------------------------------- /x86/linux/200.time/rdtsc2.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Getting time-information (counter) for x86/Linux 3 | * 4 | * * Read time-stamp counter twice 5 | */ 6 | 7 | .intel_syntax noprefix 8 | .global main 9 | 10 | main: 11 | sub rsp, 8 /* 16-byte alignment */ 12 | 13 | 14 | /* 1st rdtsc */ 15 | rdtsc /* read TSC into edx:eax */ 16 | 17 | /* 1st printf */ 18 | lea rdi, [rip + fmt] /* 1st argument for printf */ 19 | mov esi, edx /* 2nd argument for printf */ 20 | mov edx, eax /* 3rd argument for printf */ 21 | mov eax, 0 /* the number of vector registers */ 22 | call printf 23 | 24 | 25 | /* 2nd rdtsc */ 26 | rdtsc /* read TSC into edx:eax */ 27 | 28 | /* 2nd printf */ 29 | lea rdi, [rip + fmt] /* 1st argument for printf */ 30 | mov esi, edx /* 2nd argument for printf */ 31 | mov edx, eax /* 3rd argument for printf */ 32 | mov eax, 0 /* the number of vector registers */ 33 | call printf 34 | 35 | 36 | /* return from main */ 37 | add rsp, 8 /* 16-byte alignment */ 38 | ret 39 | 40 | 41 | /* read-only data */ 42 | .section .rodata 43 | fmt: 44 | .string "edx = %08x; eax = %08x\n" 45 | -------------------------------------------------------------------------------- /x86/linux/200.time/rdtsc3.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Getting time-information (counter) for x86/Linux 3 | * 4 | * * Read time-stamp counter twice, more quickly 5 | */ 6 | 7 | .intel_syntax noprefix 8 | .global main 9 | 10 | main: 11 | sub rsp, 8 /* 16-byte alignment */ 12 | 13 | 14 | /* 1st rdtsc */ 15 | rdtsc /* read TSC into edx:eax */ 16 | mov r12, rdx /* preserve edx */ 17 | mov r13, rax /* preserve eax */ 18 | 19 | /* 2nd rdtsc */ 20 | rdtsc /* read TSC into edx:eax */ 21 | mov r14, rdx /* preserve edx */ 22 | mov r15, rax /* preserve eax */ 23 | 24 |
25 | /* printf for 1st rdtsc */ 26 | lea rdi, [rip + fmt] /* 1st argument for printf */ 27 | mov rsi, r12 /* 2nd argument for printf */ 28 | mov rdx, r13 /* 3rd argument for printf */ 29 | mov eax, 0 /* the number of vector registers */ 30 | call printf 31 | 32 | /* printf for 2nd rdtsc */ 33 | lea rdi, [rip + fmt] /* 1st argument for printf */ 34 | mov rsi, r14 /* 2nd argument for printf */ 35 | mov rdx, r15 /* 3rd argument for printf */ 36 | mov eax, 0 /* the number of vector registers */ 37 | call printf 38 | 39 | 40 | /* return from main */ 41 | add rsp, 8 /* 16-byte alignment */ 42 | ret 43 | 44 | 45 | /* read-only data */ 46 | .section .rodata 47 | fmt: 48 | .string "edx = %08x; eax = %08x\n" 49 | -------------------------------------------------------------------------------- /x86/linux/200.time/read_rtc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Getting time-information (sec) for x86/Linux 3 | * 4 | * * Cannot execute this under kernel lockdown!
5 | * 6 | * * Read real-time clock(RTC) with io-ports 0x70, 0x71 7 | * * RTC values are encoded in BCD(binary-coded decimal) 8 | * * Execute with sudo for ioperm 9 | * * see: linux-kernel's arch/x86/kernel/rtc.c 10 | */ 11 | 12 | .intel_syntax noprefix 13 | .global main 14 | 15 | main: 16 | sub rsp, 8 /* 16-byte alignment */ 17 | 18 | 19 | /* set io-port permissions with ioperm(2) */ 20 | mov edi, 0x70 21 | mov esi, 2 22 | mov edx, 1 23 | call ioperm /* see: man 2 ioperm */ 24 | 25 | 26 | /* read real-time clock(RTC) */ 27 | mov al, 0 /* RTC_SECONDS */ 28 | out 0x70, al /* write rtc */ 29 | in al, 0x71 /* read rtc */ 30 | 31 | 32 | /* printf for result-checking */ 33 | lea rdi, [rip + fmt] /* 1st argument for printf */ 34 | mov rsi, rax /* 2nd argument for printf */ 35 | mov eax, 0 /* the number of vector registers */ 36 | call printf 37 | 38 | 39 | /* return from main */ 40 | add rsp, 8 /* 16-byte alignment */ 41 | ret 42 | 43 | 44 | /* read-only data */ 45 | .section .rodata 46 | fmt: 47 | .string "rtc seconds = %02x\n" 48 | -------------------------------------------------------------------------------- /x86/linux/200.time/read_rtc2.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Getting time-information (year, month, day, hour, min, sec) for x86/Linux 3 | * 4 | * * Cannot execute this under kernel lockdown!
5 | * 6 | * * Read real-time clock(RTC) with io-ports 0x70, 0x71 7 | * * RTC values are encoded in BCD(binary-coded decimal) 8 | * * Execute with sudo for ioperm 9 | * * see: linux-kernel's arch/x86/kernel/rtc.c 10 | */ 11 | 12 | .intel_syntax noprefix 13 | .global main 14 | 15 | main: 16 | sub rsp, 8 /* 16-byte alignment */ 17 | 18 | 19 | /* set io-port permissions with ioperm(2) */ 20 | mov edi, 0x70 21 | mov esi, 2 22 | mov edx, 1 23 | call ioperm /* see: man 2 ioperm */ 24 | 25 | 26 | /* read real-time clock(RTC) for date */ 27 | mov al, 9 /* RTC_YEAR */ 28 | out 0x70, al /* write rtc */ 29 | in al, 0x71 /* read rtc */ 30 | mov rsi, rax /* 2nd arg for printf */ 31 | 32 | mov al, 8 /* RTC_MONTH */ 33 | out 0x70, al /* write rtc */ 34 | in al, 0x71 /* read rtc */ 35 | mov rdx, rax /* 3rd arg for printf */ 36 | 37 | mov al, 7 /* RTC_DAY_OF_MONTH */ 38 | out 0x70, al /* write rtc */ 39 | in al, 0x71 /* read rtc */ 40 | mov rcx, rax /* 4th arg for printf */ 41 | 42 | /* printf for result-checking */ 43 | lea rdi, [rip + fmt_date] /* 1st argument for printf */ 44 | mov eax, 0 /* the number of vector registers */ 45 | call printf 46 | 47 | 48 | /* read real-time clock(RTC) for time */ 49 | mov al, 4 /* RTC_HOURS */ 50 | out 0x70, al /* write rtc */ 51 | in al, 0x71 /* read rtc */ 52 | mov rsi, rax /* 2nd arg for printf */ 53 | 54 | mov al, 2 /* RTC_MINUTES */ 55 | out 0x70, al /* write rtc */ 56 | in al, 0x71 /* read rtc */ 57 | mov rdx, rax /* 3rd arg for printf */ 58 | 59 | mov al, 0 /* RTC_SECONDS */ 60 | out 0x70, al /* write rtc */ 61 | in al, 0x71 /* read rtc */ 62 | mov rcx, rax /* 4th arg for printf */ 63 | 64 | /* printf for result-checking */ 65 | lea rdi, [rip + fmt_time] /* 1st argument for printf */ 66 | mov eax, 0 /* the number of vector registers */ 67 | call printf 68 | 69 | 70 | /* return from main */ 71 | add rsp, 8 /* 16-byte alignment */ 72 | ret 73 | 74 | 75 | /* read-only data */ 76 | .section .rodata 77 | fmt_date: 78 | .string "year-month-day = %02x - %02x -
%02x\n" 79 | 80 | fmt_time: 81 | .string "hour:min:sec = %02x : %02x : %02x\n" 82 | -------------------------------------------------------------------------------- /x86/linux/200.time/syscall_time.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Getting time-information (seconds since the Epoch) for x86/Linux 3 | * 4 | * * time(2) system-call 5 | * * see: man 2 time 6 | * * see: linux-kernel's arch/x86/entry/syscalls/syscall_64.tbl 7 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 8 | */ 9 | 10 | .intel_syntax noprefix 11 | .global main 12 | 13 | main: 14 | sub rsp, 8 /* 16-byte alignment */ 15 | 16 | 17 | /* time(2) system-call */ 18 | mov eax, 201 /* system-call number: time() */ 19 | mov edi, 0 /* tloc: null */ 20 | syscall /* get seconds into rax */ 21 | 22 | 23 | /* printf for result-checking */ 24 | lea rdi, [rip + fmt] /* 1st argument for printf */ 25 | mov rsi, rax /* 2nd argument for printf */ 26 | mov eax, 0 /* the number of vector registers */ 27 | call printf 28 | 29 | 30 | /* return from main */ 31 | add rsp, 8 /* 16-byte alignment */ 32 | ret 33 | 34 | 35 | /* read-only data */ 36 | .section .rodata 37 | fmt: 38 | .string "seconds since the Epoch = %d\n" 39 | -------------------------------------------------------------------------------- /x86/linux/300.operation/add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for x86/Linux 3 | * 4 | * * add operation 5 | */ 6 | 7 | .intel_syntax noprefix 8 | .global main 9 | 10 | main: 11 | sub rsp, 8 /* 16-byte alignment */ 12 | 13 | 14 | /* Add operation */ 15 | mov eax, 1 16 | mov ebx, 2 17 | add rax, rbx /* rax <- rax + rbx */ 18 | 19 | 20 | /* printf for result-checking */ 21 | lea rdi, [rip + fmt] /* 1st argument for printf */ 22 | mov rsi, rax /* 2nd argument for printf */ 23 | mov eax, 0 /* the number of vector registers */ 24 | call printf 25 | 26 | 27 | /* return from main
*/ 28 | add rsp, 8 /* 16-byte alignment */ 29 | ret 30 | 31 | 32 | /* read-only data */ 33 | .section .rodata 34 | fmt: 35 | .string "rax + rbx = %x\n" 36 | -------------------------------------------------------------------------------- /x86/linux/300.operation/load.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for x86/Linux 3 | * 4 | * * load operation 5 | */ 6 | 7 | .intel_syntax noprefix 8 | .global main 9 | 10 | main: 11 | sub rsp, 8 /* 16-byte alignment */ 12 | 13 | 14 | /* Load operation */ 15 | mov rbx, [rip + byte_array] /* load (64 bits) */ 16 | 17 | 18 | /* printf for result-checking */ 19 | lea rdi, [rip + fmt] /* 1st argument for printf */ 20 | mov rsi, rbx /* 2nd argument for printf */ 21 | mov eax, 0 /* the number of vector registers */ 22 | call printf 23 | 24 | 25 | /* return from main */ 26 | add rsp, 8 /* 16-byte alignment */ 27 | ret 28 | 29 | 30 | /* read-only data */ 31 | .section .rodata 32 | .balign 8 33 | fmt: 34 | .string "byte_array = %llx\n" /* %llx for long long hex */ 35 | 36 | byte_array: 37 | .quad 0xfedcba9876543210 38 | -------------------------------------------------------------------------------- /x86/linux/300.operation/load_size.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for x86/Linux 3 | * 4 | * * load operations for each size with zero extension 5 | * 6 | * * 64-bit load with mov instruction 7 | * * 32-bit load with mov instruction and `d`word-register 8 | * * 16-bit load with movzx instruction and `word ptr` 9 | * * 8-bit load with movzx instruction and `byte ptr` 10 | */ 11 | 12 | .intel_syntax noprefix 13 | .global main 14 | 15 | main: 16 | sub rsp, 8 /* 16-byte alignment */ 17 | 18 | 19 | /* Load operation */ 20 | mov r11, [rip + byte_array] /* load (64 bits) */ 21 | mov r12d, [rip + byte_array] /* load (32 bits) */ 22 | movzx r13 , word ptr [rip + byte_array] /* load (16 bits) */ 23
| movzx r14 , byte ptr [rip + byte_array] /* load (8 bits) */ 24 | 25 | 26 | /* printf for result-checking */ 27 | lea rdi, [rip + fmt] /* 1st argument for printf */ 28 | mov rsi, r11 /* 2nd argument */ 29 | mov rdx, r12 /* 3rd argument */ 30 | mov rcx, r13 /* 4th argument */ 31 | mov r8 , r14 /* 5th argument */ 32 | mov eax, 0 /* the number of vector registers */ 33 | call printf 34 | 35 | 36 | /* return from main */ 37 | add rsp, 8 /* 16-byte alignment */ 38 | ret 39 | 40 | 41 | /* read-only data */ 42 | .section .rodata 43 | .balign 8 44 | fmt: 45 | .string "r11, r12, r13, r14 = \n%016llx\n%016llx\n%016llx\n%016llx\n" 46 | 47 | byte_array: 48 | .quad 0xfedcba9876543210 49 | -------------------------------------------------------------------------------- /x86/linux/300.operation/store.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Operation example for x86/Linux 3 | * 4 | * * store operation 5 | */ 6 | 7 | .intel_syntax noprefix 8 | .global main 9 | 10 | main: 11 | sub rsp, 8 /* 16-byte alignment */ 12 | 13 | 14 | /* Store operation */ 15 | mov rax, 0xfedcba9876543210 16 | mov [rip + memory_buf], rax /* store (64 bits) */ 17 | 18 | 19 | /* Load operation for result-checking */ 20 | mov rbx, [rip + memory_buf] /* load (64 bits) */ 21 | 22 | /* printf for result-checking */ 23 | lea rdi, [rip + fmt] /* 1st argument for printf */ 24 | mov rsi, rbx /* 2nd argument for printf */ 25 | mov eax, 0 /* the number of vector registers */ 26 | call printf 27 | 28 | 29 | /* return from main */ 30 | add rsp, 8 /* 16-byte alignment */ 31 | ret 32 | 33 | 34 | /* read-only data */ 35 | .section .rodata 36 | fmt: 37 | .string "memory_buf = %016llx\n" /* %llx for long long hex */ 38 | 39 | 40 | /* read-write initialized data */ 41 | .data 42 | .balign 8 43 | memory_buf: 44 | .quad 0 45 | -------------------------------------------------------------------------------- /x86/linux/310.floating/fadd.S:
-------------------------------------------------------------------------------- 1 | /* 2 | * Floating-point example for x86/Linux 3 | * 4 | * * Double-precision (64-bit) 5 | * * Floating-point add operation 6 | */ 7 | 8 | .intel_syntax noprefix 9 | .global main 10 | 11 | main: 12 | sub rsp, 8 /* 16-byte alignment */ 13 | 14 | 15 | /* Floating-point add operation */ 16 | movsd xmm0, [rip + value1] /* xmm0 <- 1.0 */ 17 | movsd xmm1, [rip + value2] /* xmm1 <- 2.0 */ 18 | addsd xmm0, xmm1 /* xmm0 <- xmm0 + xmm1 */ 19 | 20 | 21 | /* printf for result-checking */ 22 | lea rdi, [rip + fmt] /* 1st argument for printf */ 23 | // mov xmm0, xmm0 /* 2nd argument for printf */ 24 | mov eax, 1 /* the number of vector registers */ 25 | call printf 26 | 27 | 28 | /* return from main */ 29 | add rsp, 8 /* 16-byte alignment */ 30 | ret 31 | 32 | 33 | /* read-only data */ 34 | .section .rodata 35 | value1: 36 | .double 1.0 37 | value2: 38 | .double 2.0 39 | 40 | fmt: 41 | .string "xmm0 + xmm1 = %f\n" 42 | -------------------------------------------------------------------------------- /x86/linux/400.control_flow/call_ret.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for x86/Linux 3 | * 4 | * * `call` and `ret` structure example 5 | * 6 | * * C-like pseudo-code: 7 | * void main() { 8 | * func_inc(100); 9 | * } 10 | * 11 | * int func_inc(int rax) { 12 | * return (rax + 1); 13 | * } 14 | */ 15 | 16 | .intel_syntax noprefix 17 | .global main 18 | 19 | main: 20 | sub rsp, 8 /* 16-byte alignment */ 21 | 22 | 23 | /* call */ 24 | mov eax, 100 /* passing 1st argument with rax */ 25 | call func_inc /* return value into rax */ 26 | 27 | 28 | /* printf for result-checking */ 29 | lea rdi, [rip + fmt] /* 1st argument for printf */ 30 | mov rsi, rax /* 2nd argument */ 31 | mov eax, 0 /* the number of vector registers */ 32 | call printf 33 | 34 | /* return from main */ 35 | add rsp, 8 /* 16-byte alignment */ 36 | ret 37 | 38 |
39 | /* function definition */ 40 | func_inc: 41 | add eax, 1 /* function body */ 42 | ret /* return to caller */ 43 | 44 | 45 | /* read-only data */ 46 | .section .rodata 47 | fmt: 48 | .string "rax = %d\n" 49 | -------------------------------------------------------------------------------- /x86/linux/400.control_flow/for.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for x86/Linux 3 | * 4 | * * `for`-structure example 5 | * 6 | * * C-like pseudo-code: 7 | * rbx = 0; 8 | * for (rax = 1; rax <= 10; rax++) { 9 | * rbx = rbx + rax; 10 | * } 11 | */ 12 | 13 | .intel_syntax noprefix 14 | .global main 15 | 16 | main: 17 | sub rsp, 8 /* 16-byte alignment */ 18 | 19 | 20 | /* accumulator for test */ 21 | mov ebx, 0 /* rbx = 0 */ 22 | 23 | 24 | /* `for`-structure */ 25 | mov eax, 1 /* rax = 1 */ 26 | 27 | L_loop: 28 | add rbx, rax /* loop-body (rbx = rbx + rax) */ 29 | 30 | add eax, 1 /* rax++ */ 31 | cmp eax, 10 /* rax <= 10 */ 32 | jle L_loop 33 | 34 | 35 | /* printf for result-checking */ 36 | lea rdi, [rip + fmt] /* 1st argument for printf */ 37 | mov rsi, rbx /* 2nd argument */ 38 | mov eax, 0 /* the number of vector registers */ 39 | call printf 40 | 41 | 42 | /* return from main */ 43 | add rsp, 8 /* 16-byte alignment */ 44 | ret 45 | 46 | 47 | /* read-only data */ 48 | .section .rodata 49 | fmt: 50 | .string "rbx = %d\n" 51 | -------------------------------------------------------------------------------- /x86/linux/400.control_flow/if.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Control-flow example for x86/Linux 3 | * 4 | * * `if`-structure example 5 | * 6 | * * C-like pseudo-code: 7 | * if (rax >= 0) { 8 | * rbx = 1 9 | * } else { 10 | * rbx = -1 11 | * } 12 | */ 13 | 14 | .intel_syntax noprefix 15 | .global main 16 | 17 | main: 18 | sub rsp, 8 /* 16-byte alignment */ 19 | 20 | 21 | /* test data */ 22 | mov eax, 10 /* Change here */ 23 | 24
| 25 | /* if (rax >= 0) */ 26 | cmp eax, 0 27 | js L_else 28 | 29 | /* then */ 30 | mov ebx, 1 31 | jmp L_endif 32 | 33 | /* else */ 34 | L_else: 35 | mov rbx, -1 36 | 37 | /* endif */ 38 | L_endif: 39 | 40 | 41 | /* printf for result-checking */ 42 | lea rdi, [rip + fmt] /* 1st argument for printf */ 43 | mov rsi, rax /* 2nd argument */ 44 | mov rdx, rbx /* 3rd argument */ 45 | mov eax, 0 /* the number of vector registers */ 46 | call printf 47 | 48 | 49 | /* return from main */ 50 | add rsp, 8 /* 16-byte alignment */ 51 | ret 52 | 53 | 54 | /* read-only data */ 55 | .section .rodata 56 | fmt: 57 | .string "rax, rbx = %d, %d\n" 58 | -------------------------------------------------------------------------------- /x86/linux/700.sync/atomic_op.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Atomic operation example for x86/Linux 3 | * 4 | * * atomic operations. 5 | * * swap (xchg) 6 | * * compare and swap (cmpxchg) 7 | * * fetch and add (xadd) 8 | * 9 | * * see: Intel 64 and IA-32 Architectures Software Developer’s Manual 10 | * Volume 3: System Programming Guide: 11 | * 8.2 MEMORY ORDERING 12 | */ 13 | 14 | .intel_syntax noprefix 15 | .global main 16 | 17 | main: 18 | sub rsp, 8 /* 16-byte alignment */ 19 | 20 | 21 | /* atomic operations */ 22 | xchg [rip + flag], rax /* swap */ 23 | 24 | lock cmpxchg [rip + flag], rbx /* compare and swap */ 25 | 26 | lock xadd [rip + flag], rax /* fetch and add */ 27 | 28 | 29 | /* return from main */ 30 | add rsp, 8 /* 16-byte alignment */ 31 | ret 32 | 33 | 34 | /* read-write initialized data */ 35 | .data 36 | .balign 8 37 | flag: 38 | .quad 0 39 | -------------------------------------------------------------------------------- /x86/linux/700.sync/memory_ordering.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Memory ordering example for x86/Linux 3 | * 4 | * * memory ordering operations.
5 | * * load fence (lfence) 6 | * * store fence (sfence) 7 | * * memory fence (mfence) 8 | * 9 | * * see: Intel 64 and IA-32 Architectures Software Developer’s Manual 10 | * Volume 3: System Programming Guide: 11 | * 8.2 MEMORY ORDERING 12 | * * see: Intel 64 and IA-32 Architectures Software Developer's Manual 13 | * Volume 2: Instruction Set Reference, A-Z 14 | * lfence, sfence, mfence 15 | */ 16 | 17 | .intel_syntax noprefix 18 | .global main 19 | 20 | main: 21 | /* memory ordering */ 22 | lfence /* load fence (load barrier) */ 23 | 24 | sfence /* store fence (store barrier) */ 25 | 26 | mfence /* memory fence (load/store barrier) */ 27 | 28 | 29 | /* return from main */ 30 | ret 31 | -------------------------------------------------------------------------------- /x86/linux/710.sync_parts/counter_cas.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for x86/Linux 3 | * 4 | * * cmpxchg(compare-and-swap) instruction 5 | * 6 | * * C-like pseudo-code: 7 | * L_loop: 8 | * rax = *counter; // read counter 9 | * rbx = rax; 10 | * rbx = rbx + 1; // increment 11 | * 12 | * // compare-and-swap instruction 13 | * tmp = *counter; // read counter 14 | * if (rax == tmp) *counter = rbx; // swap if unchanged 15 | * rax = tmp; 16 | * 17 | * // check 18 | * if (zero_flag != 1) goto L_loop; // loop if fail 19 | * 20 | */ 21 | 22 | .intel_syntax noprefix 23 | .global main 24 | 25 | main: 26 | sub rsp, 8 /* 16-byte alignment */ 27 | 28 | 29 | /* increment the counter with compare-and-swap loop */ 30 | L_loop: 31 | mov rax, [rip + counter] /* rax = *counter */ 32 | mov rbx, rax /* rbx = rax */ 33 | add rbx, 1 /* rbx = rbx + 1 */ 34 | 35 | lock cmpxchg [rip + counter], rbx /* compare-and-swap */ 36 | 37 | jne L_loop /* loop if fail */ 38 | 39 | 40 | /* printf for result-checking */ 41 | lea rdi, [rip + fmt] /* 1st argument for printf */ 42 | mov rsi, [rip + counter] /* 2nd argument for printf */ 43 | mov rax, 0 /*
the number of vector registers */ 44 | call printf 45 | 46 | 47 | /* return from main */ 48 | add rsp, 8 /* 16-byte alignment */ 49 | ret 50 | 51 | 52 | /* read-only data */ 53 | .section .rodata 54 | fmt: 55 | .string "counter = %d\n" 56 | 57 | 58 | /* read-write initialized data */ 59 | .data 60 | .balign 8 61 | counter: 62 | .quad 10 63 | -------------------------------------------------------------------------------- /x86/linux/710.sync_parts/counter_fetch_and_add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for x86/Linux 3 | * 4 | * * fetch-and-add instruction 5 | * 6 | * * C-like pseudo-code: 7 | * rax = 1; 8 | * *counter = *counter + rax; 9 | * 10 | */ 11 | 12 | .intel_syntax noprefix 13 | .global main 14 | 15 | main: 16 | sub rsp, 8 /* 16-byte alignment */ 17 | 18 | 19 | /* increment the counter with fetch-and-add */ 20 | mov rax, 0x1 /* rax = 1 */ 21 | lock xadd [rip + counter], rax /* *counter = *counter + rax */ 22 | 23 | 24 | /* printf for result-checking */ 25 | lea rdi, [rip + fmt] /* 1st argument for printf */ 26 | mov rsi, [rip + counter] /* 2nd argument for printf */ 27 | mov rax, 0 /* the number of vector registers */ 28 | call printf 29 | 30 | 31 | /* return from main */ 32 | add rsp, 8 /* 16-byte alignment */ 33 | ret 34 | 35 | 36 | /* read-only data */ 37 | .section .rodata 38 | fmt: 39 | .string "counter = %d\n" 40 | 41 | 42 | /* read-write initialized data */ 43 | .data 44 | .balign 8 45 | counter: 46 | .quad 10 47 | -------------------------------------------------------------------------------- /x86/linux/710.sync_parts/counter_swap.S: -------------------------------------------------------------------------------- 1 | /* 2 | * An atomic counter example for x86/Linux 3 | * 4 | * * xchg(swap) instruction 5 | * 6 | * * C-like pseudo-code: 7 | * // acquire the lock 8 | * L_loop: 9 | * rax = 1; 10 | * tmp = *lock; // xchg(swap) 11 | * *lock = rax; // xchg(swap) 12 | * rax =
tmp; // xchg(swap) 13 | * if (rax != 0) goto L_loop; 14 | * 15 | * // increment the counter 16 | * rbx = 1; 17 | * *counter = *counter + rbx; 18 | * 19 | * // release the lock 20 | * rax = 0; 21 | * *lock = rax; 22 | * 23 | */ 24 | 25 | .intel_syntax noprefix 26 | .global main 27 | 28 | main: 29 | sub rsp, 8 /* 16-byte alignment */ 30 | 31 | 32 | /* acquire the lock */ 33 | L_loop: 34 | mov rax, 1 /* rax = 1 */ 35 | xchg [rip + lock], rax /* swap lock and rax */ 36 | cmp rax, 0 37 | jne L_loop /* if (rax != 0) goto L_loop */ 38 | 39 | /* increment the counter */ 40 | mov rbx, 1 /* rbx = 1 */ 41 | add [rip + counter], rbx /* *counter = *counter + rbx */ 42 | 43 | /* release the lock */ 44 | mov rax, 0 /* rax = 0 */ 45 | mov [rip + lock], rax /* *lock = rax */ 46 | 47 | 48 | 49 | /* printf for result-checking */ 50 | lea rdi, [rip + fmt] /* 1st argument for printf */ 51 | mov rsi, [rip + counter] /* 2nd argument for printf */ 52 | mov rax, 0 /* the number of vector registers */ 53 | call printf 54 | 55 | 56 | /* return from main */ 57 | add rsp, 8 /* 16-byte alignment */ 58 | ret 59 | 60 | 61 | /* read-only data */ 62 | .section .rodata 63 | fmt: 64 | .string "counter = %d\n" 65 | 66 | 67 | /* read-write initialized data */ 68 | .data 69 | .balign 8 70 | lock: 71 | .quad 0 72 | counter: 73 | .quad 10 74 | -------------------------------------------------------------------------------- /x86/linux/750.multi/clone.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Clone (generating a thread) example for x86/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child: a child thread (cloned from the main thread) 7 | * 8 | * * NOTE: 9 | * * This is a simple example for clone with CLONE_VM. 10 | * * Be careful when handling the stack and the clone arguments.
11 | * 12 | * * see: 13 | * * man 2 clone 14 | * * linux-kernel's arch/x86/entry/syscalls/syscall_64.tbl 15 | * * glibc's sysdeps/unix/sysv/linux/x86_64/clone.S 16 | * * x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 17 | */ 18 | 19 | .intel_syntax noprefix 20 | .globl main 21 | 22 | main: 23 | sub rsp, 8 /* 16-byte alignment */ 24 | 25 | 26 | /* clone() */ 27 | lea rdi, [rip + child] /* fn: &child */ 28 | lea rsi, [rip + stackChildTop] /* child_stack: &stackChildTop */ 29 | mov edx, 0x111 /* flags: CLONE_VM | SIGCHLD */ 30 | mov ecx, 0 /* arg: NULL */ 31 | mov eax, 0 /* the number of vector reg */ 32 | call clone 33 | 34 | 35 | /* waitpid() */ 36 | mov edi, eax /* pid: (from clone) */ 37 | mov esi, 0 /* wstatus: NULL */ 38 | mov edx, 0 /* options: 0 */ 39 | call waitpid 40 | 41 | 42 | /* puts for trace-log */ 43 | lea rdi, [rip + fmtMain] 44 | call puts 45 | 46 | /* exit from main */ 47 | mov edi, 0 /* status: EXIT_SUCCESS */ 48 | call exit 49 | 50 | 51 | 52 | /* a child thread */ 53 | child: 54 | sub rsp, 8 /* 16-byte alignment */ 55 | 56 | 57 | /* puts for trace-log */ 58 | lea rdi, [rip + fmtChild] 59 | call puts 60 | 61 | /* sleep() */ 62 | mov edi, 1 /* seconds: 1sec */ 63 | call sleep 64 | 65 | /* return to main-thread */ 66 | mov eax, 0 67 | add rsp, 8 /* 16-byte alignment */ 68 | ret 69 | 70 | 71 | 72 | /* read-only data */ 73 | .section .rodata 74 | fmtMain: 75 | .string "main()" 76 | fmtChild: 77 | .string "child()" 78 | 79 | 80 | 81 | /* read-write data */ 82 | .data 83 | .balign 8 84 | stackChild: 85 | .space 1024*1024 86 | stackChildTop: 87 | -------------------------------------------------------------------------------- /x86/linux/750.multi/pthread.S: -------------------------------------------------------------------------------- 1 | /* 2 | * pthread (generating a thread) example for x86/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child: a child thread (generated by pthread_create) 7 | * 8 | * * see: 9 | * * man 
pthread_create 10 | * * man pthread_join 11 | * * x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 12 | */ 13 | 14 | .intel_syntax noprefix 15 | .globl main 16 | 17 | main: 18 | sub rsp, 8 /* 16-byte alignment */ 19 | 20 | /* puts for trace-log */ 21 | lea rdi, [rip + fmtMain_start] 22 | call puts 23 | 24 | 25 | /* create a thread with pthread_create() */ 26 | lea rdi, [rip + tid] /* pthread_t: &tid */ 27 | mov esi, 0 /* pthread_attr_t: NULL */ 28 | lea rdx, [rip + child] /* start_routine: &child */ 29 | mov ecx, 0 /* arg: NULL */ 30 | call pthread_create 31 | 32 | /* wait for the thread with pthread_join() */ 33 | mov rdi, [rip + tid] /* pthread_t: tid */ 34 | mov esi, 0 /* retval: NULL */ 35 | call pthread_join 36 | 37 | 38 | /* puts for trace-log */ 39 | lea rdi, [rip + fmtMain_finish] 40 | call puts 41 | 42 | /* exit from main */ 43 | mov edi, 0 /* status: EXIT_SUCCESS */ 44 | call exit 45 | 46 | 47 | 48 | /* a child thread */ 49 | child: 50 | sub rsp, 8 /* 16-byte alignment */ 51 | 52 | /* puts for trace-log */ 53 | lea rdi, [rip + fmtChild] 54 | call puts 55 | 56 | /* finish this thread */ 57 | mov eax, 0 /* return value (not used) */ 58 | add rsp, 8 59 | ret 60 | 61 | 62 | 63 | /* read-only data */ 64 | .section .rodata 65 | fmtMain_start: 66 | .string "main(): start" 67 | fmtMain_finish: 68 | .string "main(): finish" 69 | fmtChild: 70 | .string "child(): start" 71 | 72 | 73 | 74 | /* read-write data */ 75 | .data 76 | .balign 8 77 | tid: 78 | .quad 0 79 | -------------------------------------------------------------------------------- /x86/linux/750.multi/pthread2.S: -------------------------------------------------------------------------------- 1 | /* 2 | * pthread (generating two threads) example for x86/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child1: 1st child thread (generated by pthread_create) 7 | * * child2: 2nd child thread (generated by pthread_create) 8 | * 9 | * * see: 10 | * * man pthread_create 11 | * * man pthread_join
12 | * * x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 13 | */ 14 | 15 | .intel_syntax noprefix 16 | .globl main 17 | 18 | main: 19 | sub rsp, 8 /* 16-byte alignment */ 20 | 21 | /* puts for trace-log */ 22 | lea rdi, [rip + fmtMain_start] 23 | call puts 24 | 25 | 26 | /* pthread_create() */ 27 | 28 | /* for 1st child thread */ 29 | lea rdi, [rip + tid1] /* pthread_t: &tid1 */ 30 | mov esi, 0 /* pthread_attr_t: NULL */ 31 | lea rdx, [rip + child1] /* start_routine: &child1 */ 32 | mov ecx, 0 /* arg: NULL */ 33 | call pthread_create 34 | 35 | /* for 2nd child thread */ 36 | lea rdi, [rip + tid2] /* pthread_t: &tid2 */ 37 | mov esi, 0 /* pthread_attr_t: NULL */ 38 | lea rdx, [rip + child2] /* start_routine: &child2 */ 39 | mov ecx, 0 /* arg: NULL */ 40 | call pthread_create 41 | 42 | 43 | /* pthread_join() */ 44 | 45 | /* for 1st child thread */ 46 | mov rdi, [rip + tid1] /* pthread_t: tid1 */ 47 | mov esi, 0 /* retval: NULL */ 48 | call pthread_join 49 | 50 | /* for 2nd child thread */ 51 | mov rdi, [rip + tid2] /* pthread_t: tid2 */ 52 | mov esi, 0 /* retval: NULL */ 53 | call pthread_join 54 | 55 | 56 | /* puts for trace-log */ 57 | lea rdi, [rip + fmtMain_finish] 58 | call puts 59 | 60 | /* exit from main */ 61 | mov edi, 0 /* status: EXIT_SUCCESS */ 62 | call exit 63 | 64 | 65 | 66 | /* 1st child thread */ 67 | child1: 68 | sub rsp, 8 /* 16-byte alignment */ 69 | 70 | /* puts for trace-log */ 71 | lea rdi, [rip + fmtChild1] 72 | call puts 73 | 74 | /* finish this thread */ 75 | mov eax, 0 /* return value (not used) */ 76 | add rsp, 8 77 | ret 78 | 79 | 80 | 81 | /* 2nd child thread */ 82 | child2: 83 | sub rsp, 8 /* 16-byte alignment */ 84 | 85 | /* puts for trace-log */ 86 | lea rdi, [rip + fmtChild2] 87 | call puts 88 | 89 | /* finish this thread */ 90 | mov eax, 0 /* return value (not used) */ 91 | add rsp, 8 92 | ret 93 | 94 | 95 | 96 | /* read-only data */ 97 | .section .rodata 98 | fmtMain_start: 99 | .string "main(): start" 100 | fmtMain_finish: 101
| .string "main(): finish" 102 | 103 | fmtChild1: 104 | .string "child1(): start" 105 | fmtChild2: 106 | .string "child2(): start" 107 | 108 | 109 | 110 | /* read-write data */ 111 | .data 112 | .balign 8 113 | tid1: 114 | .quad 0 115 | tid2: 116 | .quad 0 117 | -------------------------------------------------------------------------------- /x86/linux/750.multi/sync_threads.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Sync threads example for x86/Linux 3 | * 4 | * * Threads 5 | * * main: a main thread 6 | * * child1: 1st child thread (sets the sync_flag for the 2nd thread) 7 | * * child2: 2nd child thread (waits for the 1st thread via the sync_flag) 8 | * 9 | * * see: 10 | * * man pthread_create 11 | * * x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 12 | */ 13 | 14 | .intel_syntax noprefix 15 | .globl main 16 | 17 | main: 18 | sub rsp, 8 /* 16-byte alignment */ 19 | 20 | /* puts for trace-log */ 21 | lea rdi, [rip + fmtMain_start] 22 | call puts 23 | 24 | 25 | /* pthread_create() */ 26 | 27 | /* for 1st child thread */ 28 | lea rdi, [rip + tid1] /* pthread_t: &tid1 */ 29 | mov esi, 0 /* pthread_attr_t: NULL */ 30 | lea rdx, [rip + child1] /* start_routine: &child1 */ 31 | mov ecx, 0 /* arg: NULL */ 32 | call pthread_create 33 | 34 | /* for 2nd child thread */ 35 | lea rdi, [rip + tid2] /* pthread_t: &tid2 */ 36 | mov esi, 0 /* pthread_attr_t: NULL */ 37 | lea rdx, [rip + child2] /* start_routine: &child2 */ 38 | mov ecx, 0 /* arg: NULL */ 39 | call pthread_create 40 | 41 | 42 | /* pthread_join() */ 43 | 44 | /* for 1st child thread */ 45 | mov rdi, [rip + tid1] /* pthread_t: tid1 */ 46 | mov esi, 0 /* retval: NULL */ 47 | call pthread_join 48 | 49 | /* for 2nd child thread */ 50 | mov rdi, [rip + tid2] /* pthread_t: tid2 */ 51 | mov esi, 0 /* retval: NULL */ 52 | call pthread_join 53 | 54 | 55 | /* puts for trace-log */ 56 | lea rdi, [rip + fmtMain_finish] 57 | call puts 58 | 59 | /* exit from main */ 60 | mov edi, 0
/* status: EXIT_SUCCESS */ 61 | call exit 62 | 63 | 64 | 65 | /* 1st child thread */ 66 | child1: 67 | sub rsp, 8 /* 16-byte alignment */ 68 | 69 | /* puts for trace-log */ 70 | lea rdi, [rip + fmtChild1_start] 71 | call puts 72 | 73 | /* wait for your key-in */ 74 | lea rdi, [rip + fmtChild1_prompt] 75 | call puts 76 | call getchar 77 | 78 | 79 | /* set the sync-flag to 1 */ 80 | mov eax, 1 81 | mov [rip + sync_flag], rax 82 | 83 | 84 | /* puts for trace-log */ 85 | lea rdi, [rip + fmtChild1_finish] 86 | call puts 87 | 88 | /* finish this thread */ 89 | mov eax, 0 /* return value (not used) */ 90 | add rsp, 8 91 | ret 92 | 93 | 94 | 95 | /* 2nd child thread */ 96 | child2: 97 | sub rsp, 8 /* 16-byte alignment */ 98 | 99 | /* puts for trace-log */ 100 | lea rdi, [rip + fmtChild2_start] 101 | call puts 102 | 103 | 104 | /* wait for the 1st thread via the sync_flag */ 105 | L_loop: 106 | mov rax, [rip + sync_flag] 107 | cmp rax, 0 108 | je L_loop 109 | 110 | 111 | /* puts for trace-log */ 112 | lea rdi, [rip + fmtChild2_finish] 113 | call puts 114 | 115 | /* finish this thread */ 116 | mov eax, 0 /* return value (not used) */ 117 | add rsp, 8 118 | ret 119 | 120 | 121 | 122 | /* read-only data */ 123 | .section .rodata 124 | fmtMain_start: 125 | .string "main(): start" 126 | fmtMain_finish: 127 | .string "main(): finish" 128 | 129 | fmtChild1_start: 130 | .string "child1(): start" 131 | fmtChild1_prompt: 132 | .string "child1(): enter key:" 133 | fmtChild1_finish: 134 | .string "child1(): finish" 135 | 136 | fmtChild2_start: 137 | .string "child2(): start" 138 | fmtChild2_finish: 139 | .string "child2(): finish" 140 | 141 | 142 | 143 | /* read-write data */ 144 | .data 145 | .balign 8 146 | tid1: 147 | .quad 0 148 | tid2: 149 | .quad 0 150 | 151 | sync_flag: 152 | .quad 0 153 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/README.md: -------------------------------------------------------------------------------- 1 | 2 |
Performance experiments 3 | ======================= 4 | 5 | ## Instruction latencies 6 | 7 | ### latency_add.S 8 | 9 | ```sh 10 | $ make -f ../Makefile latency_add 11 | $ perf stat -e "cycles,instructions" ./latency_add 12 | 13 | 1,001,378,100 cycles 14 | 1,031,040,746 instructions # 1.03 insn per cycle 15 | ``` 16 | 17 | ### latency_mul.S 18 | 19 | ```sh 20 | $ make -f ../Makefile latency_mul 21 | $ perf stat -e "cycles,instructions" ./latency_mul 22 | 23 | 3,002,220,151 cycles 24 | 1,031,501,374 instructions # 0.34 insn per cycle 25 | ``` 26 | 27 | ### latency_load.S 28 | 29 | ```sh 30 | $ make -f ../Makefile latency_load 31 | $ perf stat -e "cycles,instructions" ./latency_load 32 | 33 | 4,002,179,368 cycles 34 | 1,031,642,273 instructions # 0.26 insn per cycle 35 | ``` 36 | 37 | 38 | ## Branch-prediction misses 39 | 40 | ### branch_miss_few.S 41 | 42 | ```sh 43 | $ make -f ../Makefile branch_miss_few 44 | $ perf stat -e "cycles,instructions,branches,branch-misses" ./branch_miss_few 45 | 46 | 20,155,446 branches 47 | 6,666 branch-misses # 0.03% of all branches 48 | ``` 49 | 50 | ### branch_miss_many.S 51 | 52 | ```sh 53 | $ make -f ../Makefile branch_miss_many 54 | $ perf stat -e "cycles,instructions,branches,branch-misses" ./branch_miss_many 55 | 56 | 20,166,111 branches 57 | 5,007,361 branch-misses # 24.83% of all branches 58 | ``` 59 | 60 | diff files: 61 | 62 | ```sh 63 | $ diff branch_miss_few.S branch_miss_many.S 64 | < cmp r10, r10 /* zero(eq) flag is always true */ 65 | --- 66 | > and r10, 1 /* zero(eq) flag changes randomly */ 67 | 68 | ``` 69 | 70 | 71 | ## Cache misses (conflict-miss) 72 | 73 | ### cache_miss_few.S 74 | 75 | ```sh 76 | $ make -f ../Makefile cache_miss_few 77 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 78 | ./cache_miss_few 79 | 80 | 10,214,893 L1-dcache-loads 81 | 14,137 L1-dcache-load-misses # 0.14% of all L1-dcache accesses 82 | ``` 83 | 84 | ### cache_miss_many.S 85 | 86 | ```sh 87 | $ make -f 
../Makefile cache_miss_many 88 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 89 | ./cache_miss_many 90 | 91 | 10,217,855 L1-dcache-loads 92 | 9,569,636 L1-dcache-load-misses # 93.66% of all L1-dcache accesses 93 | ``` 94 | 95 | diff files: 96 | 97 | ```sh 98 | $ diff cache_miss_few.S cache_miss_many.S 99 | < mov r12, 64 /* stride is 64byte (cache-line size) */ 100 | --- 101 | > mov r12, 2048 /* stride is 2Kbyte (cache-macro? size) */ 102 | ``` 103 | 104 | 105 | ## TLB misses (capacity-miss) 106 | 107 | ### tlb_miss_few.S 108 | 109 | ```sh 110 | $ make -f ../Makefile tlb_miss_few 111 | $ perf stat -e "cycles,instructions,L1-dcache-loads,dTLB-loads,dTLB-load-misses" \ 112 | ./tlb_miss_few 113 | 114 | 609,318,688 cycles 115 | 901,212,631 instructions 116 | 100,334,179 L1-dcache-loads 117 | 100,334,179 dTLB-loads 118 | 844 dTLB-load-misses # 0.00% of all dTLB cache accesses 119 | ``` 120 | 121 | ### tlb_miss_many.S 122 | 123 | ```sh 124 | $ make -f ../Makefile tlb_miss_many 125 | $ perf stat -e "cycles,instructions,L1-dcache-loads,dTLB-loads,dTLB-load-misses" \ 126 | ./tlb_miss_many 127 | 128 | 2,332,532,859 cycles 129 | 910,194,864 instructions 130 | 102,914,279 L1-dcache-loads 131 | 102,914,279 dTLB-loads 132 | 100,014,282 dTLB-load-misses # 97.18% of all dTLB cache accesses 133 | ``` 134 | 135 | diff files: 136 | 137 | ```sh 138 | $ diff tlb_miss_few.S tlb_miss_many.S 139 | < mov r13, 0xff /* wrap for 256 times */ 140 | --- 141 | > mov r13, 0x1fff /* wrap for 8192 times */ 142 | ``` 143 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/branch_miss_few.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Branch prediction example for x86/Linux 3 | * 4 | * * Branch conditions are always true. 5 | * * FEW branch mispredictions occur. 
6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * for (i=0; i<10000000; i++) { 10 | * 11 | * x = xorshift(); // generate a random number 12 | * 13 | * if (x == x) goto L_br; // branch-prediction test 14 | * nop; 15 | * L_br: 16 | * 17 | * } 18 | * 19 | */ 20 | 21 | .intel_syntax noprefix 22 | .global main 23 | 24 | main: 25 | sub rsp, 8 /* 16-byte alignment */ 26 | 27 | 28 | /* loop conditions */ 29 | mov ecx, 0 /* loop variable */ 30 | mov rbx, 10000000 /* loop max-number */ 31 | 32 | mov rax, [rip + random] /* initial number of xorshift(xor64) */ 33 | 34 | L_loop: 35 | /* generate a random number with the xorshift algorithm */ 36 | mov rdx, rax /* x = x ^ (x << 13) */ 37 | sal rax, 13 38 | xor rax, rdx 39 | 40 | mov rdx, rax /* x = x ^ (x >> 7) */ 41 | sar rax, 7 42 | xor rax, rdx 43 | 44 | mov rdx, rax /* x = x ^ (x << 17) */ 45 | sal rax, 17 46 | xor rax, rdx 47 | 48 | 49 | /* branch-prediction test */ 50 | mov r10, rax 51 | cmp r10, r10 /* zero(eq) flag is always true */ 52 | jz L_br /* branch-prediction test */ 53 | nop 54 | L_br: 55 | 56 | /* increment the loop-variable and loop-back */ 57 | add rcx, 1 58 | cmp rcx, rbx 59 | jl L_loop 60 | 61 | 62 | /* print the last loop-variable */ 63 | lea rdi, [rip + fmt] /* 1st argument for printf */ 64 | mov rsi, rcx /* 2nd argument */ 65 | mov eax, 0 /* the number of vector registers */ 66 | call printf 67 | 68 | 69 | /* return from main */ 70 | add rsp, 8 /* 16-byte alignment */ 71 | ret 72 | 73 | 74 | /* read-only data */ 75 | .section .rodata 76 | random: 77 | .quad 88172645463325252 /* xorshift(xor64)'s initial value */ 78 | 79 | fmt: 80 | .string "loop-variable = %d\n" 81 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/branch_miss_many.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Branch prediction example for x86/Linux 3 | * 4 | * * Branch conditions change randomly. 5 | * * MANY branch mispredictions occur.
6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * for (i=0; i<10000000; i++) { 10 | * 11 | * x = xorshift(); // generate a random number 12 | * 13 | * if ((x & 1) == 0) goto L_br; // branch-prediction test 14 | * nop; 15 | * L_br: 16 | * 17 | * } 18 | * 19 | */ 20 | 21 | .intel_syntax noprefix 22 | .global main 23 | 24 | main: 25 | sub rsp, 8 /* 16-byte alignment */ 26 | 27 | 28 | /* loop conditions */ 29 | mov ecx, 0 /* loop variable */ 30 | mov rbx, 10000000 /* loop max-number */ 31 | 32 | mov rax, [rip + random] /* initial number of xorshift(xor64) */ 33 | 34 | L_loop: 35 | /* generate a random number with the xorshift algorithm */ 36 | mov rdx, rax /* x = x ^ (x << 13) */ 37 | sal rax, 13 38 | xor rax, rdx 39 | 40 | mov rdx, rax /* x = x ^ (x >> 7) */ 41 | sar rax, 7 42 | xor rax, rdx 43 | 44 | mov rdx, rax /* x = x ^ (x << 17) */ 45 | sal rax, 17 46 | xor rax, rdx 47 | 48 | 49 | /* branch-prediction test */ 50 | mov r10, rax 51 | and r10, 1 /* zero(eq) flag changes randomly */ 52 | jz L_br /* branch-prediction test */ 53 | nop 54 | L_br: 55 | 56 | /* increment the loop-variable and loop-back */ 57 | add rcx, 1 58 | cmp rcx, rbx 59 | jl L_loop 60 | 61 | 62 | /* print the last loop-variable */ 63 | lea rdi, [rip + fmt] /* 1st argument for printf */ 64 | mov rsi, rcx /* 2nd argument */ 65 | mov eax, 0 /* the number of vector registers */ 66 | call printf 67 | 68 | 69 | /* return from main */ 70 | add rsp, 8 /* 16-byte alignment */ 71 | ret 72 | 73 | 74 | /* read-only data */ 75 | .section .rodata 76 | random: 77 | .quad 88172645463325252 /* xorshift(xor64)'s initial value */ 78 | 79 | fmt: 80 | .string "loop-variable = %d\n" 81 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/cache_miss_few.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Cache-miss example for x86/Linux 3 | * 4 | * * Cache accesses with 64-byte (line-size) stride. 5 | * * FEW cache misses occur.
6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 64; 10 | * wrap = 0x1f; // 32times - 1 11 | * 12 | * for (i=0; i<10000000; i++) { 13 | * address = work_area + ((i & wrap) * stride); 14 | * load(address); 15 | * } 16 | * 17 | */ 18 | 19 | .intel_syntax noprefix 20 | .global main 21 | 22 | main: 23 | sub rsp, 8 /* 16-byte alignment */ 24 | 25 | 26 | /* access parameters */ 27 | mov r12, 64 /* stride is 64byte (cache-line size) */ 28 | mov r13, 0x1f /* wrap for 32 times */ 29 | 30 | 31 | /* loop conditions */ 32 | mov ecx, 0 /* loop variable */ 33 | mov rbx, 10000000 /* loop max-number */ 34 | 35 | L_loop: 36 | /* calc address */ 37 | mov rax, rcx /* i */ 38 | and rax, r13 /* i & wrap */ 39 | imul rax, r12 /* (i & wrap) x stride */ 40 | 41 | lea r10, [rip + work_area] 42 | add rax, r10 43 | 44 | /* cache access (simple load) */ 45 | mov rdx, [rax] /* rdx = load(rax) */ 46 | 47 | /* increment the loop-variable and loop-back */ 48 | add rcx, 1 49 | cmp rcx, rbx 50 | jl L_loop 51 | 52 | 53 | /* print the last loop-variable */ 54 | lea rdi, [rip + fmt] /* 1st argument for printf */ 55 | mov rsi, rcx /* 2nd argument */ 56 | mov eax, 0 /* the number of vector registers */ 57 | call printf 58 | 59 | 60 | /* return from main */ 61 | add rsp, 8 /* 16-byte alignment */ 62 | ret 63 | 64 | 65 | /* read-only data */ 66 | .section .rodata 67 | fmt: 68 | .string "loop-variable = %d\n" 69 | 70 | 71 | .balign 8 72 | work_area: 73 | .skip 0x40000 74 | 75 | dummy_tail: 76 | .quad 0 77 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/cache_miss_many.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Cache-miss example for x86/Linux 3 | * 4 | * * Cache accesses with 2K-byte (cache-macro?) stride for way-conflict. 5 | * * MANY cache misses occur.
6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 2048; 10 | * wrap = 0x1f; // 32times - 1 11 | * 12 | * for (i=0; i<10000000; i++) { 13 | * address = work_area + ((i & wrap) * stride); 14 | * load(address); 15 | * } 16 | * 17 | */ 18 | 19 | .intel_syntax noprefix 20 | .global main 21 | 22 | main: 23 | sub rsp, 8 /* 16-byte alignment */ 24 | 25 | 26 | /* access parameters */ 27 | mov r12, 2048 /* stride is 2Kbyte (cache-macro? size) */ 28 | mov r13, 0x1f /* wrap for 32 times */ 29 | 30 | 31 | /* loop conditions */ 32 | mov ecx, 0 /* loop variable */ 33 | mov rbx, 10000000 /* loop max-number */ 34 | 35 | L_loop: 36 | /* calc address */ 37 | mov rax, rcx /* i */ 38 | and rax, r13 /* i & wrap */ 39 | imul rax, r12 /* (i & wrap) x stride */ 40 | 41 | lea r10, [rip + work_area] 42 | add rax, r10 43 | 44 | /* cache access (simple load) */ 45 | mov rdx, [rax] /* rdx = load(rax) */ 46 | 47 | /* increment the loop-variable and loop-back */ 48 | add rcx, 1 49 | cmp rcx, rbx 50 | jl L_loop 51 | 52 | 53 | /* print the last loop-variable */ 54 | lea rdi, [rip + fmt] /* 1st argument for printf */ 55 | mov rsi, rcx /* 2nd argument */ 56 | mov eax, 0 /* the number of vector registers */ 57 | call printf 58 | 59 | 60 | /* return from main */ 61 | add rsp, 8 /* 16-byte alignment */ 62 | ret 63 | 64 | 65 | /* read-only data */ 66 | .section .rodata 67 | fmt: 68 | .string "loop-variable = %d\n" 69 | 70 | 71 | .balign 8 72 | work_area: 73 | .skip 0x40000 74 | 75 | dummy_tail: 76 | .quad 0 77 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/latency_add.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Latency example for x86/Linux 3 | * 4 | * * `add` instruction 5 | * 6 | * * C-like pseudo-code: 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * add instruction 10 | * : // 100 times 11 | * } 12 | * 13 | */ 14 | 15 | .intel_syntax noprefix 16 | .global main 17 | 18 | main: 19 | sub rsp,
8 /* 16-byte alignment */ 20 | 21 | 22 | /* loop conditions */ 23 | mov ecx, 0 /* loop variable */ 24 | mov rbx, 10000000 /* loop max-number */ 25 | 26 | mov rax, 1 /* 1st operand for add instructions */ 27 | mov rdx, 1 /* 2nd operand for add instructions */ 28 | 29 | 30 | /* loop body */ 31 | L_loop: 32 | add rax, rdx /* rax = rax + rdx */ 33 | add rax, rdx 34 | add rax, rdx 35 | add rax, rdx 36 | add rax, rdx 37 | add rax, rdx 38 | add rax, rdx 39 | add rax, rdx 40 | add rax, rdx 41 | add rax, rdx 42 | 43 | add rax, rdx 44 | add rax, rdx 45 | add rax, rdx 46 | add rax, rdx 47 | add rax, rdx 48 | add rax, rdx 49 | add rax, rdx 50 | add rax, rdx 51 | add rax, rdx 52 | add rax, rdx 53 | 54 | add rax, rdx 55 | add rax, rdx 56 | add rax, rdx 57 | add rax, rdx 58 | add rax, rdx 59 | add rax, rdx 60 | add rax, rdx 61 | add rax, rdx 62 | add rax, rdx 63 | add rax, rdx 64 | 65 | add rax, rdx 66 | add rax, rdx 67 | add rax, rdx 68 | add rax, rdx 69 | add rax, rdx 70 | add rax, rdx 71 | add rax, rdx 72 | add rax, rdx 73 | add rax, rdx 74 | add rax, rdx 75 | 76 | add rax, rdx 77 | add rax, rdx 78 | add rax, rdx 79 | add rax, rdx 80 | add rax, rdx 81 | add rax, rdx 82 | add rax, rdx 83 | add rax, rdx 84 | add rax, rdx 85 | add rax, rdx 86 | 87 | add rax, rdx 88 | add rax, rdx 89 | add rax, rdx 90 | add rax, rdx 91 | add rax, rdx 92 | add rax, rdx 93 | add rax, rdx 94 | add rax, rdx 95 | add rax, rdx 96 | add rax, rdx 97 | 98 | add rax, rdx 99 | add rax, rdx 100 | add rax, rdx 101 | add rax, rdx 102 | add rax, rdx 103 | add rax, rdx 104 | add rax, rdx 105 | add rax, rdx 106 | add rax, rdx 107 | add rax, rdx 108 | 109 | add rax, rdx 110 | add rax, rdx 111 | add rax, rdx 112 | add rax, rdx 113 | add rax, rdx 114 | add rax, rdx 115 | add rax, rdx 116 | add rax, rdx 117 | add rax, rdx 118 | add rax, rdx 119 | 120 | add rax, rdx 121 | add rax, rdx 122 | add rax, rdx 123 | add rax, rdx 124 | add rax, rdx 125 | add rax, rdx 126 | add rax, rdx 127 | add rax, rdx 128 | add rax, rdx 129 
| add rax, rdx 130 | 131 | add rax, rdx 132 | add rax, rdx 133 | add rax, rdx 134 | add rax, rdx 135 | add rax, rdx 136 | add rax, rdx 137 | add rax, rdx 138 | add rax, rdx 139 | add rax, rdx 140 | add rax, rdx 141 | 142 | /* increment the loop-variable and loop-back */ 143 | add rcx, 1 144 | cmp rcx, rbx 145 | jl L_loop 146 | 147 | 148 | /* print the last loop-variable */ 149 | lea rdi, [rip + fmt] /* 1st argument for printf */ 150 | mov rsi, rcx /* 2nd argument */ 151 | mov eax, 0 /* the number of vector registers */ 152 | call printf 153 | 154 | 155 | /* return from main */ 156 | add rsp, 8 /* 16-byte alignment */ 157 | ret 158 | 159 | 160 | /* read-only data */ 161 | .section .rodata 162 | fmt: 163 | .string "loop-variable = %d\n" 164 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/latency_mul.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Latency example for x86/Linux 3 | * 4 | * * `mul` instruction 5 | * 6 | * * C-like pseudo-code: 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * mul instruction 10 | * : // 100 times 11 | * } 12 | * 13 | */ 14 | 15 | .intel_syntax noprefix 16 | .global main 17 | 18 | main: 19 | sub rsp, 8 /* 16-byte alignment */ 20 | 21 | 22 | /* loop conditions */ 23 | mov ecx, 0 /* loop variable */ 24 | mov rbx, 10000000 /* loop max-number */ 25 | 26 | mov rax, 1 /* 1st operand for mul instructions */ 27 | mov r10, 1 /* 2nd operand for mul instructions */ 28 | 29 | 30 | /* loop body */ 31 | L_loop: 32 | mul r10 /* rdx:rax = rax * r10 */ 33 | mul r10 34 | mul r10 35 | mul r10 36 | mul r10 37 | mul r10 38 | mul r10 39 | mul r10 40 | mul r10 41 | mul r10 42 | 43 | mul r10 44 | mul r10 45 | mul r10 46 | mul r10 47 | mul r10 48 | mul r10 49 | mul r10 50 | mul r10 51 | mul r10 52 | mul r10 53 | 54 | mul r10 55 | mul r10 56 | mul r10 57 | mul r10 58 | mul r10 59 | mul r10 60 | mul r10 61 | mul r10 62 | mul r10 63 | mul r10 64 | 65 | mul r10 
66 | mul r10 67 | mul r10 68 | mul r10 69 | mul r10 70 | mul r10 71 | mul r10 72 | mul r10 73 | mul r10 74 | mul r10 75 | 76 | mul r10 77 | mul r10 78 | mul r10 79 | mul r10 80 | mul r10 81 | mul r10 82 | mul r10 83 | mul r10 84 | mul r10 85 | mul r10 86 | 87 | mul r10 88 | mul r10 89 | mul r10 90 | mul r10 91 | mul r10 92 | mul r10 93 | mul r10 94 | mul r10 95 | mul r10 96 | mul r10 97 | 98 | mul r10 99 | mul r10 100 | mul r10 101 | mul r10 102 | mul r10 103 | mul r10 104 | mul r10 105 | mul r10 106 | mul r10 107 | mul r10 108 | 109 | mul r10 110 | mul r10 111 | mul r10 112 | mul r10 113 | mul r10 114 | mul r10 115 | mul r10 116 | mul r10 117 | mul r10 118 | mul r10 119 | 120 | mul r10 121 | mul r10 122 | mul r10 123 | mul r10 124 | mul r10 125 | mul r10 126 | mul r10 127 | mul r10 128 | mul r10 129 | mul r10 130 | 131 | mul r10 132 | mul r10 133 | mul r10 134 | mul r10 135 | mul r10 136 | mul r10 137 | mul r10 138 | mul r10 139 | mul r10 140 | mul r10 141 | 142 | /* increment the loop-variable and loop-back */ 143 | add rcx, 1 144 | cmp rcx, rbx 145 | jl L_loop 146 | 147 | 148 | /* print the last loop-variable */ 149 | lea rdi, [rip + fmt] /* 1st argument for printf */ 150 | mov rsi, rcx /* 2nd argument */ 151 | mov eax, 0 /* the number of vector registers */ 152 | call printf 153 | 154 | 155 | /* return from main */ 156 | add rsp, 8 /* 16-byte alignment */ 157 | ret 158 | 159 | 160 | /* read-only data */ 161 | .section .rodata 162 | fmt: 163 | .string "loop-variable = %d\n" 164 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/ooo_perform.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Out-of-order execution example for x86/Linux 3 | * 4 | * * Reordering with a pending instruction 5 | * 6 | * * C-like pseudo-code: 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * rdx = load(rdx); 10 | * rdx = rdx + r10; // pending on the load instruction 11 | * r12 = r12 + r10; 
// OUT-OF-ORDER execution 12 | * r12 = r12 + r10; 13 | * r12 = r12 + r10; 14 | * 15 | * : // 10 times 16 | * } 17 | * 18 | */ 19 | 20 | .intel_syntax noprefix 21 | .global main 22 | 23 | main: 24 | sub rsp, 8 /* 16-byte alignment */ 25 | 26 | 27 | /* loop conditions */ 28 | mov ecx, 0 /* loop variable */ 29 | mov rbx, 10000000 /* loop max-number */ 30 | 31 | lea rdx, [rip + buf] /* address for load instructions */ 32 | mov [rdx], rdx /* pre-store the address for load */ 33 | 34 | mov r10, 0 35 | 36 | 37 | /* loop body */ 38 | L_loop: 39 | mov rdx, [rdx] /* rdx = load(rdx) */ 40 | add rdx, r10 /* true data dependency with rdx */ 41 | add r12, r10 /* NO true data dependency with rdx */ 42 | add r12, r10 /* NO true data dependency with rdx */ 43 | add r12, r10 /* NO true data dependency with rdx */ 44 | 45 | mov rdx, [rdx] 46 | add rdx, r10 47 | add r12, r10 48 | add r12, r10 49 | add r12, r10 50 | 51 | mov rdx, [rdx] 52 | add rdx, r10 53 | add r12, r10 54 | add r12, r10 55 | add r12, r10 56 | 57 | mov rdx, [rdx] 58 | add rdx, r10 59 | add r12, r10 60 | add r12, r10 61 | add r12, r10 62 | 63 | mov rdx, [rdx] 64 | add rdx, r10 65 | add r12, r10 66 | add r12, r10 67 | add r12, r10 68 | 69 | mov rdx, [rdx] 70 | add rdx, r10 71 | add r12, r10 72 | add r12, r10 73 | add r12, r10 74 | 75 | mov rdx, [rdx] 76 | add rdx, r10 77 | add r12, r10 78 | add r12, r10 79 | add r12, r10 80 | 81 | mov rdx, [rdx] 82 | add rdx, r10 83 | add r12, r10 84 | add r12, r10 85 | add r12, r10 86 | 87 | mov rdx, [rdx] 88 | add rdx, r10 89 | add r12, r10 90 | add r12, r10 91 | add r12, r10 92 | 93 | mov rdx, [rdx] 94 | add rdx, r10 95 | add r12, r10 96 | add r12, r10 97 | add r12, r10 98 | 99 | /* increment the loop-variable and loop-back */ 100 | add rcx, 1 101 | cmp rcx, rbx 102 | jl L_loop 103 | 104 | 105 | /* print the last loop-variable */ 106 | lea rdi, [rip + fmt] /* 1st argument for printf */ 107 | mov rsi, rcx /* 2nd argument */ 108 | mov eax, 0 /* the number of vector registers */ 109 | call 
printf 110 | 111 | 112 | /* return from main */ 113 | add rsp, 8 /* 16-byte alignment */ 114 | ret 115 | 116 | 117 | /* read-only data */ 118 | .section .rodata 119 | fmt: 120 | .string "loop-variable = %d\n" 121 | 122 | 123 | /* read-write data */ 124 | .data 125 | buf: 126 | .quad 0 127 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/ooo_wait.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Out-of-order execution example for x86/Linux 3 | * 4 | * * NO-Reordering with a pending instruction 5 | * 6 | * * C-like pseudo-code: 7 | * 8 | * for (i=0; i<10000000; i++) { 9 | * rdx = load(rdx); 10 | * rdx = rdx + r10; // pending on the load instruction 11 | * rdx = rdx + r10; // NO out-of-order execution 12 | * rdx = rdx + r10; 13 | * rdx = rdx + r10; 14 | * 15 | * : // 10 times 16 | * } 17 | * 18 | */ 19 | 20 | .intel_syntax noprefix 21 | .global main 22 | 23 | main: 24 | sub rsp, 8 /* 16-byte alignment */ 25 | 26 | 27 | /* loop conditions */ 28 | mov ecx, 0 /* loop variable */ 29 | mov rbx, 10000000 /* loop max-number */ 30 | 31 | lea rdx, [rip + buf] /* address for load instructions */ 32 | mov [rdx], rdx /* pre-store the address for load */ 33 | 34 | mov r10, 0 35 | 36 | 37 | /* loop body */ 38 | L_loop: 39 | mov rdx, [rdx] /* rdx = load(rdx) */ 40 | add rdx, r10 /* true data dependency with rdx */ 41 | add rdx, r10 /* true data dependency with rdx */ 42 | add rdx, r10 /* true data dependency with rdx */ 43 | add rdx, r10 /* true data dependency with rdx */ 44 | 45 | mov rdx, [rdx] 46 | add rdx, r10 47 | add rdx, r10 48 | add rdx, r10 49 | add rdx, r10 50 | 51 | mov rdx, [rdx] 52 | add rdx, r10 53 | add rdx, r10 54 | add rdx, r10 55 | add rdx, r10 56 | 57 | mov rdx, [rdx] 58 | add rdx, r10 59 | add rdx, r10 60 | add rdx, r10 61 | add rdx, r10 62 | 63 | mov rdx, [rdx] 64 | add rdx, r10 65 | add rdx, r10 66 | add rdx, r10 67 | add rdx, r10 68 | 69 | mov rdx, [rdx] 
70 | add rdx, r10 71 | add rdx, r10 72 | add rdx, r10 73 | add rdx, r10 74 | 75 | mov rdx, [rdx] 76 | add rdx, r10 77 | add rdx, r10 78 | add rdx, r10 79 | add rdx, r10 80 | 81 | mov rdx, [rdx] 82 | add rdx, r10 83 | add rdx, r10 84 | add rdx, r10 85 | add rdx, r10 86 | 87 | mov rdx, [rdx] 88 | add rdx, r10 89 | add rdx, r10 90 | add rdx, r10 91 | add rdx, r10 92 | 93 | mov rdx, [rdx] 94 | add rdx, r10 95 | add rdx, r10 96 | add rdx, r10 97 | add rdx, r10 98 | 99 | /* increment the loop-variable and loop-back */ 100 | add rcx, 1 101 | cmp rcx, rbx 102 | jl L_loop 103 | 104 | 105 | /* print the last loop-variable */ 106 | lea rdi, [rip + fmt] /* 1st argument for printf */ 107 | mov rsi, rcx /* 2nd argument */ 108 | mov eax, 0 /* the number of vector registers */ 109 | call printf 110 | 111 | 112 | /* return from main */ 113 | add rsp, 8 /* 16-byte alignment */ 114 | ret 115 | 116 | 117 | /* read-only data */ 118 | .section .rodata 119 | fmt: 120 | .string "loop-variable = %d\n" 121 | 122 | 123 | /* read-write data */ 124 | .data 125 | buf: 126 | .quad 0 127 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/tlb_miss_few.S: -------------------------------------------------------------------------------- 1 | /* 2 | * TLB-miss example for x86/Linux 3 | * 4 | * * Memory accesses within 1Mbyte (256 pages) 5 | * * FEW TLB misses occur. 
6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 4096; // 4096 byte (page size) 10 | * wrap = 0xff; // 256 times (pages) - 1 11 | * 12 | * for (i=0; i<100000000; i++) { 13 | * address = work_area + ((i & wrap) * stride); 14 | * load(address); 15 | * } 16 | * 17 | */ 18 | 19 | .intel_syntax noprefix 20 | .global main 21 | 22 | main: 23 | sub rsp, 8 /* 16-byte alignment */ 24 | 25 | 26 | /* access parameters */ 27 | mov r12, 4096 /* stride is 4Kbyte (page size) */ 28 | mov r13, 0xff /* wrap for 256 times */ 29 | 30 | /* loop conditions */ 31 | mov ecx, 0 /* loop variable */ 32 | mov rbx, 100000000 /* loop max-number */ 33 | 34 | L_loop: 35 | /* calc address */ 36 | mov rax, rcx /* i */ 37 | and rax, r13 /* i & wrap */ 38 | imul rax, r12 /* (i & wrap) x stride */ 39 | 40 | lea r10, [rip + work_area] /* base address of work_area */ 41 | add rax, r10 /* work_area + offset */ 42 | 43 | /* memory access (simple load) */ 44 | mov rdx, [rax] /* rdx = load(rax) */ 45 | 46 | /* increment the loop-variable and loop-back */ 47 | add rcx, 1 48 | cmp rcx, rbx 49 | jl L_loop 50 | 51 | 52 | /* print the last loop-variable */ 53 | lea rdi, [rip + fmt] /* 1st argument for printf */ 54 | mov rsi, rcx /* 2nd argument */ 55 | mov eax, 0 /* the number of vector registers */ 56 | call printf 57 | 58 | 59 | /* return from main */ 60 | add rsp, 8 /* 16-byte alignment */ 61 | ret 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmt: 67 | .string "loop-variable = %d\n" 68 | 69 | 70 | .balign 8 71 | work_area: 72 | .skip 0x8000000 73 | 74 | dummy_tail: 75 | .quad 0 76 | -------------------------------------------------------------------------------- /x86/linux/E00.perf_expt/tlb_miss_many.S: -------------------------------------------------------------------------------- 1 | /* 2 | * TLB-miss example for x86/Linux 3 | * 4 | * * Memory accesses within 32Mbyte (8192 pages) 5 | * * MANY TLB misses occur. 
6 | * 7 | * * C-like pseudo-code: 8 | * 9 | * stride = 4096; // 4096 byte (page size) 10 | * wrap = 0x1fff; // 8192 times (pages) - 1 11 | * 12 | * for (i=0; i<100000000; i++) { 13 | * address = work_area + ((i & wrap) * stride); 14 | * load(address); 15 | * } 16 | * 17 | */ 18 | 19 | .intel_syntax noprefix 20 | .global main 21 | 22 | main: 23 | sub rsp, 8 /* 16-byte alignment */ 24 | 25 | 26 | /* access parameters */ 27 | mov r12, 4096 /* stride is 4Kbyte (page size) */ 28 | mov r13, 0x1fff /* wrap for 8192 times */ 29 | 30 | /* loop conditions */ 31 | mov ecx, 0 /* loop variable */ 32 | mov rbx, 100000000 /* loop max-number */ 33 | 34 | L_loop: 35 | /* calc address */ 36 | mov rax, rcx /* i */ 37 | and rax, r13 /* i & wrap */ 38 | imul rax, r12 /* (i & wrap) x stride */ 39 | 40 | lea r10, [rip + work_area] /* base address of work_area */ 41 | add rax, r10 /* work_area + offset */ 42 | 43 | /* memory access (simple load) */ 44 | mov rdx, [rax] /* rdx = load(rax) */ 45 | 46 | /* increment the loop-variable and loop-back */ 47 | add rcx, 1 48 | cmp rcx, rbx 49 | jl L_loop 50 | 51 | 52 | /* print the last loop-variable */ 53 | lea rdi, [rip + fmt] /* 1st argument for printf */ 54 | mov rsi, rcx /* 2nd argument */ 55 | mov eax, 0 /* the number of vector registers */ 56 | call printf 57 | 58 | 59 | /* return from main */ 60 | add rsp, 8 /* 16-byte alignment */ 61 | ret 62 | 63 | 64 | /* read-only data */ 65 | .section .rodata 66 | fmt: 67 | .string "loop-variable = %d\n" 68 | 69 | 70 | .balign 8 71 | work_area: 72 | .skip 0x8000000 73 | 74 | dummy_tail: 75 | .quad 0 76 | -------------------------------------------------------------------------------- /x86/linux/E05.sys_expt/README.md: -------------------------------------------------------------------------------- 1 | 2 | System experiments 3 | ================== 4 | 5 | ## I/O access 6 | 7 | ### io_read_rtc.S 8 | 9 | ```sh 10 | $ make -f ../Makefile io_read_rtc 11 | 12 | $ sudo ./io_read_rtc 13 | rtc seconds = 51 14 | 15 | $ sudo ./io_read_rtc 16 | rtc seconds = 17 17 | ``` 18 | 
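The value printed by `io_read_rtc` is binary-coded decimal (BCD): each hexadecimal digit holds one decimal digit, which is why the program prints it with `%02x`. A minimal sketch of converting such a value into an ordinary binary integer is shown below. It is not a file from this repository; the name `bcd_to_bin.S` and the sample value `0x51` are assumptions for illustration, and the RTC is assumed to be in BCD mode as the source comment states.

```asm
/*
 * bcd_to_bin.S (hypothetical example, not part of the repository)
 *
 *   * Convert a two-digit BCD value to binary: tens * 10 + ones
 */

        .intel_syntax noprefix
        .global main

main:
        sub     rsp, 8                  /* 16-byte alignment */

        mov     eax, 0x51               /* sample BCD value (assumed) */

        mov     esi, eax
        shr     esi, 4                  /* tens digit = upper nibble */
        and     eax, 0x0f               /* ones digit = lower nibble */
        imul    esi, esi, 10            /* tens * 10 */
        add     esi, eax                /* 2nd argument = tens * 10 + ones */

        lea     rdi, [rip + fmt]        /* 1st argument for printf */
        mov     eax, 0                  /* the number of vector registers */
        call    printf

        mov     eax, 0                  /* return value of main */
        add     rsp, 8                  /* 16-byte alignment */
        ret

        /* read-only data */
        .section .rodata
fmt:
        .string "seconds (decimal) = %d\n"
```

Assuming the same flags as the x86 Makefile, it can be built with `gcc -g -no-pie -o bcd_to_bin bcd_to_bin.S` and needs no `sudo`, because it does not touch any I/O port.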
19 | ### io_read_pci.S 20 | 21 | ```sh 22 | $ make -f ../Makefile io_read_pci 23 | 24 | $ sudo ./io_read_pci 25 | Vendor ID = 8086 26 | ``` 27 | 28 | ### latency_io_read.S 29 | 30 | ```sh 31 | $ make -f ../Makefile latency_io_read 32 | 33 | $ perf stat -e "cycles,instructions" ./latency_io_read 34 | 35 | 87,249,692,746 cycles 36 | 127,360,444 instructions 37 | ``` 38 | 39 | 40 | ## System call 41 | 42 | ### syscall_write.S 43 | 44 | ```sh 45 | $ make -f ../Makefile syscall_write 46 | 47 | $ ./syscall_write 48 | hello, world 49 | ``` 50 | 51 | 52 | ## Exception 53 | 54 | ### excep_load.S 55 | 56 | ```sh 57 | $ make -f ../Makefile excep_load 58 | 59 | $ ./excep_load 60 | Segmentation fault (core dumped) 61 | ``` 62 | 63 | ### excep_div.S 64 | 65 | ```sh 66 | $ make -f ../Makefile excep_div 67 | 68 | $ ./excep_div 69 | Floating point exception (core dumped) 70 | ``` 71 | -------------------------------------------------------------------------------- /x86/linux/E05.sys_expt/excep_div.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Exception example for x86/Linux 3 | * 4 | * * Divide error exception 5 | * 6 | */ 7 | 8 | .intel_syntax noprefix 9 | .global main 10 | 11 | main: 12 | mov rbx, 0 /* divisor = 0 */ 13 | div rbx /* rdx:rax / rbx */ 14 | 15 | ret 16 | -------------------------------------------------------------------------------- /x86/linux/E05.sys_expt/excep_load.S: -------------------------------------------------------------------------------- 1 | /* 2 | * Exception example for x86/Linux 3 | * 4 | * * Page fault exception 5 | * 6 | */ 7 | 8 | .intel_syntax noprefix 9 | .global main 10 | 11 | main: 12 | mov rdx, 0 /* address 0 */ 13 | mov rax, [rdx] /* rax = load(rdx) */ 14 | 15 | ret 16 | -------------------------------------------------------------------------------- /x86/linux/E05.sys_expt/io_read_pci.S: -------------------------------------------------------------------------------- 1 | /* 2 | * I/O access example for x86/Linux 3 | * 4 | 
* Cannot execute this under kernel lockdown! 5 | * 6 | * * Read PCI configuration with io-port 0x0cf8, 0x0cfc 7 | */ 8 | 9 | .intel_syntax noprefix 10 | .global main 11 | 12 | main: 13 | sub rsp, 8 /* 16-byte alignment */ 14 | 15 | 16 | /* set io-port permissions with ioperm(2) */ 17 | mov edi, 0x0cf8 /* from: first port */ 18 | mov esi, 8 /* num: number of ports */ 19 | mov edx, 1 /* turn_on: enable access */ 20 | call ioperm /* see: man ioperm(2) */ 21 | 22 | 23 | /* set the PCI-address */ 24 | mov eax, 0x80000000 25 | mov dx, 0x0cf8 /* PCI register (CONFIG_ADDRESS) */ 26 | out dx, eax /* I/O-write */ 27 | 28 | /* set the address for I/O-read */ 29 | mov eax, 0 30 | mov dx, 0x0cfc /* PCI register (CONFIG_DATA) */ 31 | in ax, dx /* I/O-read */ 32 | 33 | 34 | 35 | /* printf for result-checking */ 36 | lea rdi, [rip + fmt] /* 1st argument for printf */ 37 | mov rsi, rax /* 2nd argument for printf */ 38 | mov eax, 0 /* the number of vector registers */ 39 | call printf 40 | 41 | 42 | /* return from main */ 43 | add rsp, 8 /* 16-byte alignment */ 44 | ret 45 | 46 | 47 | /* read-only data */ 48 | .section .rodata 49 | fmt: 50 | .string "Vendor ID = %x\n" 51 | -------------------------------------------------------------------------------- /x86/linux/E05.sys_expt/io_read_rtc.S: -------------------------------------------------------------------------------- 1 | /* 2 | * I/O access example for x86/Linux 3 | * 4 | * * Cannot execute this under kernel lockdown! 
5 | * 6 | * * Read real-time clock(RTC) with io-ports 0x70, 0x71 7 | * * RTC is coded with BCD(binary-coded decimal) 8 | * * Execute with sudo for ioperm 9 | * * see: linux-kernel's arch/x86/kernel/rtc.c 10 | */ 11 | 12 | .intel_syntax noprefix 13 | .global main 14 | 15 | main: 16 | sub rsp, 8 /* 16-byte alignment */ 17 | 18 | 19 | /* set io-port permissions with ioperm(2) */ 20 | mov edi, 0x70 /* from: first port */ 21 | mov esi, 2 /* num: number of ports */ 22 | mov edx, 1 /* turn_on: enable access */ 23 | call ioperm /* see: man ioperm(2) */ 24 | 25 | 26 | /* read real-time clock(RTC) */ 27 | mov al, 0 /* RTC_SECONDS */ 28 | out 0x70, al /* select the rtc register */ 29 | in al, 0x71 /* read the selected register */ 30 | 31 | 32 | /* printf for result-checking */ 33 | lea rdi, [rip + fmt] /* 1st argument for printf */ 34 | mov rsi, rax /* 2nd argument for printf */ 35 | mov eax, 0 /* the number of vector registers */ 36 | call printf 37 | 38 | 39 | /* return from main */ 40 | add rsp, 8 /* 16-byte alignment */ 41 | ret 42 | 43 | 44 | /* read-only data */ 45 | .section .rodata 46 | fmt: 47 | .string "rtc seconds = %02x\n" 48 | -------------------------------------------------------------------------------- /x86/linux/E05.sys_expt/syscall_write.S: -------------------------------------------------------------------------------- 1 | /* 2 | * System-call example for x86/Linux 3 | * 4 | * * write(2) system-call 5 | * * see: man 2 write 6 | * * see: linux-kernel's arch/x86/entry/syscalls/syscall_64.tbl 7 | * * see: x86-64 ABI (https://gitlab.com/x86-psABIs/x86-64-ABI) 8 | */ 9 | 10 | .intel_syntax noprefix 11 | .global main 12 | 13 | main: 14 | sub rsp, 8 /* 16-byte alignment */ 15 | 16 | 17 | /* write(2) system-call */ 18 | mov eax, 1 /* system-call number: write() */ 19 | mov edi, 1 /* fd: stdout */ 20 | lea rsi, [rip + msg] /* buf: */ 21 | mov edx, 13 /* count: */ 22 | syscall 23 | 24 | 25 | /* return from main */ 26 | add rsp, 8 /* 16-byte alignment */ 27 | ret 28 | 29 | 30 | /* read-only data */ 31 | .section .rodata 32 | msg: 33 | .string "hello, world\n" 34 | 
-------------------------------------------------------------------------------- /x86/linux/E10.multi_expt/README.md: -------------------------------------------------------------------------------- 1 | 2 | Experiments with multiprocessors 3 | ================================ 4 | 5 | ## Cache-line(s) with line-invalidation 6 | 7 | ### cacheline_different.S 8 | 9 | ```sh 10 | $ make -f ../Makefile cacheline_different 11 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 12 | ./cacheline_different 13 | 14 | 10,292,444 L1-dcache-loads 15 | 24,106 L1-dcache-load-misses # 0.23% of all L1-dcache accesses 16 | ``` 17 | 18 | 19 | ### cacheline_same.S 20 | 21 | ```sh 22 | $ make -f ../Makefile cacheline_same 23 | $ perf stat -e "cycles,instructions,L1-dcache-loads,L1-dcache-load-misses" \ 24 | ./cacheline_same 25 | 26 | 10,331,779 L1-dcache-loads 27 | 2,644,805 L1-dcache-load-misses # 25.60% of all L1-dcache accesses 28 | ``` 29 | 30 | diff files: 31 | 32 | ```sh 33 | $ diff cacheline_different.S cacheline_same.S 34 | < .space 64 /* padding for different cache-lines */ 35 | ``` 36 | 37 | 38 | 39 | ## Memory ordering 40 | 41 | ### ordering_unexpected.S 42 | 43 | ```sh 44 | $ make -f ../Makefile ordering_unexpected 45 | $ ./ordering_unexpected 46 | child1(): UNEXPECTED!: r14 == 0 && r15 == 0; loop-variable = 342 47 | child2(): UNEXPECTED!: r14 == 0 && r15 == 0; loop-variable = 342 48 | ``` 49 | 50 | 51 | ### ordering_force.S 52 | 53 | ```sh 54 | $ make -f ../Makefile ordering_force 55 | $ ./ordering_force 56 | child1(): finish: loop-variable = 5000000 57 | child2(): finish: loop-variable = 5000000 58 | ``` 59 | 60 | diff files: 61 | 62 | ```sh 63 | $ diff ordering_unexpected.S ordering_force.S 64 | > mfence /* force ordering */ 65 | ``` 66 | 67 | 68 | 69 | ## Shared-counter with atomicity 70 | 71 | ### counter_atomic.S 72 | 73 | ```sh 74 | $ make -f ../Makefile counter_atomic 75 | 76 | $ ./counter_atomic 77 | main(): start 78 | child1(): start 79 | 
child2(): start 80 | child2(): finish: loop-variable = 5000000 81 | child1(): finish: loop-variable = 5000000 82 | main(): finish: counter = 10000000 83 | ``` 84 | 85 | ### counter_bad.S 86 | 87 | ```sh 88 | $ make -f ../Makefile counter_bad 89 | 90 | $ ./counter_bad 91 | main(): start 92 | child1(): start 93 | child2(): start 94 | child2(): finish: loop-variable = 5000000 95 | child1(): finish: loop-variable = 5000000 96 | main(): finish: counter = 7230454 97 | ``` 98 | 99 | diff files: 100 | 101 | ```sh 102 | $ diff counter_atomic.S counter_bad.S 103 | < mov rax, 1 104 | < lock xadd [rip + counter], rax 105 | --- 106 | > mov rax, [rip + counter] 107 | > add rax, 1 108 | > mov [rip + counter], rax 109 | > sfence /* to read from cache (not store-buf) 110 | ``` 111 | -------------------------------------------------------------------------------- /x86/linux/Makefile: -------------------------------------------------------------------------------- 1 | 2 | # Makefile for x86_64 3 | 4 | CC = gcc 5 | OBJDUMP = objdump 6 | GDB = gdb 7 | 8 | CFLAGS += -g -masm=intel 9 | ASFLAGS += -g 10 | LDFLAGS += -no-pie -pthread 11 | OBJDUMPFLAGS += -M intel 12 | GDBFLAGS += -ex "set disassembly-flavor intel" 13 | 14 | 15 | include ../../../Makefile 16 | -------------------------------------------------------------------------------- /x86/linux/README.md: -------------------------------------------------------------------------------- 1 | 2 | x86 assembly examples on linux 3 | ============================== 4 | 5 | ## Examples 6 | * [simple main](./100.main) (./100.main) 7 | * [system call](./110.system_call) (./110.system_call) 8 | * [library call](./120.libc_call) (./120.libc_call) 9 | * [getting time](./200.time) (./200.time) 10 | * [basic operations](./300.operation) (./300.operation), [floating](./310.floating) (./310.floating) 11 | * [control flow](./400.control_flow) (./400.control_flow) 12 | * [sync](./700.sync) (./700.sync), [sync_parts](./710.sync_parts) (./710.sync_parts) 
13 | * [multi-threads](./750.multi) (./750.multi) 14 | 15 | 16 | ## Experiments 17 | * Performance: [data-dependency, branch, cache, virtual-memory](./E00.perf_expt) (./E00.perf_expt) 18 | * System: [i/o, syscall/exception](./E05.sys_expt) (./E05.sys_expt) 19 | * Multiprocessors: [coherence, memory-ordering, atomic](./E10.multi_expt) (./E10.multi_expt) 20 | 21 | 22 | ## How to try 23 | 24 | * Assemble (generate binary) 25 | 26 | ``` 27 | $ make -f ../Makefile 28 | ``` 29 | 30 | * Execute 31 | 32 | ``` 33 | $ ./ 34 | ``` 35 | 36 | * Disassemble (for full object) 37 | 38 | ``` 39 | $ make -f ../Makefile .disasm | less 40 | (search "main>" in `less` command) 41 | ``` 42 | 43 | * Disassemble (for .S only) 44 | 45 | ``` 46 | $ make -f ../Makefile .o.disasm 47 | ``` 48 | 49 | * Step execution with gdb 50 | 51 | ``` 52 | $ make -f ../Makefile 53 | $ gdb ./ 54 | (gdb) set disassembly-flavor intel 55 | (gdb) layout asm 56 | (gdb) layout regs 57 | (gdb) break main 58 | (gdb) run 59 | (gdb) stepi 60 | : 61 | (gdb) quit 62 | ``` 63 | 64 | 65 | ## References 66 | 67 | * x86 68 | * [Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 2A, 2B, 2C, and 2D: Instruction Set Reference, A-Z](https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-2a-2b-2c-and-2d-instruction-set-reference-a-z.html) 69 | 70 | * Linux 71 | * [Linux kernel; arch/x86](https://github.com/torvalds/linux/tree/master/arch/x86) 72 | 73 | * glibc 74 | * [glibc](https://www.gnu.org/software/libc/libc.html) 75 | * sysdeps/unix/sysv/linux/x86_64 76 | 77 | * x86-64 ABI 78 | * [x86-64 psABI](https://gitlab.com/x86-psABIs/x86-64-ABI) 79 | 80 | * GCC 81 | * [GCC online documentation](https://gcc.gnu.org/onlinedocs/) 82 | 83 | * GNU assembler and linker 84 | * [Documentation for binutils](https://sourceware.org/binutils/docs/) 85 | 86 | * GDB 87 | * [GDB Documentation](https://www.gnu.org/software/gdb/documentation/) 88 | 89 | 90 | ## 
Further information 91 | 92 | ### Calling convention 93 | 94 | * System call 95 | * rax, rdi, rsi, rdx, r10, r8, r9 -> rax, rdx 96 | 97 | * Function call 98 | * rdi, rsi, rdx, rcx, r8, r9, (rsp) -> rax, rdx 99 | 100 | * see: 101 | * glibc's sysdeps/unix/sysv/linux/x86_64/syscall.S 102 | * [x86-64 psABI](https://gitlab.com/x86-psABIs/x86-64-ABI) 103 | --------------------------------------------------------------------------------
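The two register conventions listed under "Calling convention" can be seen side by side in one short program. This is a sketch in the style of the repository's examples, not one of its files; the name `convention.S` and the labels `msg_sys` and `msg_fn` are made up here. Note that the system-call convention uses r10 where the function-call convention uses rcx, because the `syscall` instruction itself overwrites rcx and r11.

```asm
/*
 * convention.S (hypothetical example, not part of the repository)
 *
 *   * System call:   rax = number, arguments in rdi, rsi, rdx, ...
 *   * Function call: arguments in rdi, rsi, rdx, ...
 */

        .intel_syntax noprefix
        .global main

main:
        sub     rsp, 8                  /* 16-byte alignment */

        /* system call: write(1, msg_sys, 11) */
        mov     eax, 1                  /* system-call number: write() */
        mov     edi, 1                  /* 1st argument: fd (stdout) */
        lea     rsi, [rip + msg_sys]    /* 2nd argument: buf */
        mov     edx, 11                 /* 3rd argument: count */
        syscall                         /* result in rax; rcx, r11 clobbered */

        /* function call: puts(msg_fn) */
        lea     rdi, [rip + msg_fn]     /* 1st argument for puts */
        call    puts                    /* result in rax */

        mov     eax, 0                  /* return value of main */
        add     rsp, 8                  /* 16-byte alignment */
        ret

        /* read-only data */
        .section .rodata
msg_sys:
        .string "by syscall\n"
msg_fn:
        .string "by function"
```

Assuming the same flags as the x86 Makefile, `gcc -g -no-pie -o convention convention.S` should build it. The only visible difference between the two calls is where the operation is named: a number in rax for the kernel, a symbol after `call` for libc.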