├── Express_Data_Path.pdf ├── README.md ├── XDP_Inside_and_Out.pdf ├── bpf-internals-1.md ├── bpf-internals-2.md ├── bpf_helpers.rst ├── bpf_llvm_2015aug19.pdf ├── bpf_netdev_conference_2016Feb12.pdf ├── bpf_netdev_conference_2016Feb12_report.pdf ├── bpf_netdev_conference_2016Oct07.pdf ├── bpf_netdev_conference_2016Oct07_tcws.pdf ├── bpf_netvirt_2015aug21.pdf ├── bpf_network_examples_2015aug20.pdf ├── bpftrace_public_template_jun2019.odp ├── bpftrace_public_template_jun2019.pdf ├── eBPF.md ├── ebpf_excerpt_20Aug2015.pdf ├── ebpf_http_filter.pdf ├── meetups └── 2015-09-21 │ └── iovisor-bcc-intro.pdf ├── netconf_2016feb.pdf ├── openstack ├── 2015-10-29 │ └── iovisor-mesos-demo.pdf └── 2016-04-25 │ └── OpenStackSummitAustin2016_iovisor_v1.0.pdf ├── p4 ├── 2015-11-18 │ └── iovisor-p4-workshop-nov-2015.pdf └── p4toEbpf-bcc.pdf ├── p4AbstractSwitch.pdf ├── tsc-meeting-minutes ├── 2015-09-02 │ └── eBPF_to_IOV_Module.pptx └── 2015-09-16 │ ├── iomodules-slides.pdf │ └── iovisor-odl-gbp-module.pdf └── university ├── eBPF_IOVisor_academic_research.pdf └── sigcomm-ccr-InKev-2016.pdf /Express_Data_Path.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/Express_Data_Path.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # bpf-docs 2 | Presentations and docs 3 | -------------------------------------------------------------------------------- /XDP_Inside_and_Out.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/XDP_Inside_and_Out.pdf -------------------------------------------------------------------------------- /bpf-internals-1.md: 
-------------------------------------------------------------------------------- 1 | # BPF Internals - I 2 | 3 | *by Suchakra Sharma* 4 | 5 | A recent [post by Brendan Gregg](http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html) inspired me 6 | to write my own blog post about my findings on how Berkeley Packet Filter (BPF) evolved, its interesting history 7 | and the immense powers it holds - the way Brendan calls it, 'brutal'. I came across this while studying interpreters 8 | and small process virtual machines like the proposed KTap VM. I was looking at some known papers on 9 | [register vs stack based VMs](https://www.usenix.org/legacy/events/vee05/full_papers/p153-yunhe.pdf), their performance and the various code dispatch mechanisms used in these small VMs. The review of the state of the art soon moved to native code compilation, and a [discussion on LWN](http://lwn.net/Articles/598545/) caught my eye. The benefits of JIT were too good to be overlooked, and BPF's application in things like filtering, tracing and seccomp (used in Chrome as well) made me interested. I knew that the kernel devs were on to something here. This is when I started digging through the BPF background. 10 | 11 | ## Background 12 | 13 | Network packet analysis requires an interesting bunch of tech, right from the time a packet reaches the embedded 14 | controller on the network hardware in your PC (hardware/data link layer) to the point it does something useful in 15 | your system, such as displaying something in your browser (application layer). For the connected systems evolving these 16 | days, the amount of data transferred is huge, and the support infrastructure for network analysis needed a way to 17 | filter things out pretty fast. The initial concept of packet filtering developed with such needs in mind, and many 18 | strategies were discussed around filters such as the CMU/Stanford Packet Filter (CSPF), Sun's NIT filter and so on. 
19 | For example, some earlier filtering approaches used a tree-based model (in CSPF) to represent filters and filter 20 | packets out using predicate-tree walking. This earlier approach was also inherited by the Linux kernel's old filter in the 21 | net subsystem. 22 | 23 | Consider an engineer's need for a simple (and admittedly unrealistic) filter on network packets with the predicates 24 | P1, P2, P3 and P4: 25 | 26 | [![equation](https://suchakra.files.wordpress.com/2015/05/equation.png?w=660)](https://suchakra.files.wordpress.com/2015/05/equation.png) 27 | 28 | A filtering approach like that of CSPF would have represented this filter in an expression tree structure as follows: 29 | 30 | [![tree](https://suchakra.files.wordpress.com/2015/05/tree2.png?w=127)](https://suchakra.files.wordpress.com/2015/05/tree2.png) 31 | 32 | It is then trivial to walk the tree, evaluating each expression and performing operations on each of them. But this 33 | means there can be extra costs associated with evaluating predicates which do not necessarily have to be 34 | evaluated. For example, what if the packet is neither an ARP packet nor an IP packet? Knowing that the P1 35 | and P2 predicates are untrue, we need not evaluate the other two predicates and perform two more boolean operations 36 | on them to determine the outcome. 37 | 38 | In 1992-93, McCanne et al. proposed the [BSD Packet Filter](http://www.tcpdump.org/papers/bpf-usenix93.pdf) with a 39 | new CFG-bytecode based filter design. This was an in-kernel approach where a tiny interpreter would evaluate 40 | expressions represented as BPF bytecodes. Instead of simple expression trees, they proposed a CFG-based filter design. 
41 | One control flow graph representation of the same filter can be: 42 | 43 | [![cfg](https://suchakra.files.wordpress.com/2015/05/cfg.png?w=167)](https://suchakra.files.wordpress.com/2015/05/cfg.png) 44 | 45 | The evaluation starts from P1; the right edge is for FALSE and the left for TRUE, with each predicate being 46 | evaluated in this fashion until the evaluation reaches the final result of TRUE or FALSE. The CFG has an inherent property of 47 | 'remembering': i.e., if P1 and P2 are false, the fact that the path reaches a final FALSE is remembered, and P3 and P4 48 | need not be evaluated. This was then easy to represent in bytecode form, where a minimal BPF VM can be designed to 49 | evaluate these predicates with jumps to TRUE or FALSE targets. 50 | 51 | ### The BPF Machine 52 | 53 | A pseudo-instruction representation of the same filter described above, for earlier versions of BPF in the Linux kernel, can be shown as, 54 | 55 | ```ASM 56 | l0: ldh [12] 57 | l1: jeq #0x800, l3, l2 58 | l2: jeq #0x805, l3, l8 59 | l3: ld [26] 60 | l4: jeq #SRC, l5, l8 61 | l5: ld len 62 | l6: jlt 0x400, l7, l8 63 | l7: ret #0xffff 64 | l8: ret #0 65 | ``` 66 | 67 | To know how to read these BPF instructions, look at the 68 | [filter documentation](https://www.kernel.org/doc/Documentation/networking/filter.txt) in the kernel source and see 69 | what each line does. Each of these instructions is actually just a bytecode which the BPF machine interprets. Like 70 | all real machines, this requires a definition of what the VM internals look like. In the version of the 71 | BPF-based in-kernel filtering technique the Linux kernel adopted, there were initially just 2 important registers, A and 72 | X, with another 16-register 'scratch space' M[0-15]. 
The instruction format and some sample instructions for this 73 | earlier version of BPF are shown below: 74 | 75 | ```C 76 | /* Instruction format: { OP, JT, JF, K } 77 | * OP: opcode, 16 bit 78 | * JT: Jump target for TRUE 79 | * JF: Jump target for FALSE 80 | * K: 32 bit constant 81 | */ 82 | 83 | /* Sample instructions */ 84 | { 0x28, 0, 0, 0x0000000c }, /* 0x28 is opcode for ldh */ 85 | { 0x15, 1, 0, 0x00000800 }, /* if A == 0x800, skip the next instr (jt = 1) */ 86 | { 0x15, 0, 5, 0x00000805 }, /* jump to FALSE (offset 5) if A != 0x805 */ 87 | .. 88 | ``` 89 | 90 | There were some **radical changes made to the BPF infrastructure recently** - extensions to its instruction set and 91 | registers, the addition of things like BPF maps, etc. We shall discuss those changes in detail, probably in the 92 | next post in this series. For now we'll just see the good ol' way of how BPF worked. 93 | 94 | ### Interpreter 95 | 96 | Each of the instructions seen above is represented as an array of these 4 values, and each program is an array of 97 | such instructions. The BPF interpreter looks at each opcode and performs the operations on the registers or data 98 | accordingly, after the program has gone through a verifier for a sanity check to make sure the filter code is secure and would not 99 | cause harm. The program, which consists of these instructions, then passes through a dispatch routine. As an example, 100 | here is a small snippet from the BPF instruction dispatch for the 'add' instruction, before it was restructured from 101 | Linux kernel v3.15 onwards, 102 | 103 | ```C 104 | 127 u32 A = 0; /* Accumulator */ 105 | 128 u32 X = 0; /* Index Register */ 106 | 129 u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */ 107 | 130 u32 tmp; 108 | 131 int k; 109 | 132 110 | 133 /* 111 | 134 * Process array of filter instructions. 
112 | 135 */ 113 | 136 for (;; fentry++) { 114 | 137 #if defined(CONFIG_X86_32) 115 | 138 #define K (fentry->k) 116 | 139 #else 117 | 140 const u32 K = fentry->k; 118 | 141 #endif 119 | 142 120 | 143 switch (fentry->code) { 121 | 144 case BPF_S_ALU_ADD_X: 122 | 145 A += X; 123 | 146 continue; 124 | 147 case BPF_S_ALU_ADD_K: 125 | 148 A += K; 126 | 149 continue; 127 | 150 .. 128 | ``` 129 | 130 | The above snippet is taken from net/core/filter.c in Linux kernel v3.14. Here, `fentry` points to the current `sock_filter` 131 | instruction, and the filter is applied to the `sk_buff` data element. The dispatch loop (136) runs till all the 132 | instructions are exhausted. The dispatch is basically a huge switch-case, with each opcode being tested (143) 133 | and the necessary action being taken. For example, here an 'add' operation on registers would compute A + X and store it in A. 134 | Yes, this is simple, isn't it? Let us take it a level above. 135 | 136 | ### JIT Compilation 137 | 138 | This is nothing new. JIT compilation of bytecodes has been around for a long time. I think it is one of those eventual steps taken once an interpreted language decides to optimize bytecode execution speed. Interpreter dispatch can be a bit costly once the size of the filter/code and the execution time increase. With high-frequency packet filtering, we need to save as much time as possible, and a good way is to convert the bytecode to native machine code by Just-In-Time compiling it and then executing the native code from the code cache. For BPF, JIT was first discussed in the [BPF+ research paper](http://dl.acm.org/citation.cfm?id=316214) by Begel et al. in 1999. Along with other optimizations (redundant predicate elimination, peephole optimizations etc.), a JIT assembler for BPF bytecodes was also discussed. They showed improvements from 3.5x to 9x in certain cases. I quickly started checking whether the Linux kernel had done something similar. 
And behold, here is what the JIT looks like for the 'add' instruction we discussed before (Linux kernel v3.14), 139 | 140 | ```C 141 | 288 switch (filter[i].code) { 142 | 289 case BPF_S_ALU_ADD_X: /* A += X; */ 143 | 290 seen |= SEEN_XREG; 144 | 291 EMIT2(0x01, 0xd8); /* add %ebx,%eax */ 145 | 292 break; 146 | 293 case BPF_S_ALU_ADD_K: /* A += K; */ 147 | 294 if (!K) 148 | 295 break; 149 | 296 if (is_imm8(K)) 150 | 297 EMIT3(0x83, 0xc0, K); /* add imm8,%eax */ 151 | 298 else 152 | 299 EMIT1_off32(0x05, K); /* add imm32,%eax */ 153 | 300 break; 154 | ``` 155 | 156 | As seen above in arch/x86/net/bpf_jit_comp.c for v3.14, instead of performing operations during the code dispatch 157 | directly, the JIT compiler [emits](http://lxr.free-electrons.com/source/arch/x86/net/bpf_jit_comp.c?v=3.14#L40) 158 | the native code to a memory area and keeps it ready for execution. The JITed filter image is built like a function 159 | call, so we add some prologue and epilogue to it as well, 160 | 161 | ```C 162 | /* JIT image prologue */ 163 | 221 EMIT4(0x55, 0x48, 0x89, 0xe5); /* push %rbp; mov %rsp,%rbp */ 164 | 222 EMIT4(0x48, 0x83, 0xec, 96); /* subq $96,%rsp */ 165 | ``` 166 | 167 | There are rules to BPF (such as no loops, etc.) which the verifier checks before the image is built, as we are 168 | now in the dangerous waters of executing external machine code inside the Linux kernel. In those days, all this 169 | would have been done by [bpf_jit_compile](http://lxr.free-electrons.com/source/arch/x86/net/bpf_jit_comp.c?v=3.14#L181) 170 | which, upon completion, would point the filter function to the filter image, 171 | 172 | ```C 173 | 774 fp->bpf_func = (void *)image; 174 | ``` 175 | Smooooooth... Upon execution of the filter function, instead of interpreting, the filter will now start executing 176 | the native code. 
Even though things have changed a bit recently, this has indeed been a fun way to learn how 177 | interpreters and JIT compilers work in general and the kind of optimizations that can be done. In the next part of 178 | this post series, I will look into what changes have been done recently - the restructuring and extension efforts to 179 | BPF, its evolution to eBPF along with BPF maps, and the very recent and ongoing efforts in 180 | [hist-triggers](https://lwn.net/Articles/639992/). I will discuss my experimental userspace eBPF library, 181 | its use for LTTng's UST event filtering, and its comparison to LTTng's bytecode interpreter. 182 | Brendan's [blog-post](http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html) is highly recommended, 183 | and so are the links to 'More Reading' in that post. Thanks to Alexei Starovoitov, Eric Dumazet and all the other 184 | kernel contributors to BPF that I may have missed. They are doing awesome work and are the direct source of my 185 | learnings as well. Looking at the versatility of eBPF, its adoption in newer tools like 186 | [shark](http://www.sharkly.io/), and Brendan's views and 187 | [first experiments](https://github.com/brendangregg/BPF-tools), this may indeed be the next big thing in tracing. 188 | -------------------------------------------------------------------------------- /bpf-internals-2.md: -------------------------------------------------------------------------------- 1 | # BPF Internals - II 2 | *by Suchakra Sharma* 3 | 4 | Continuing from where I left off [before](https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/), in this post we will see some of the major changes in BPF that have happened recently - how it is evolving into a very stable and accepted in-kernel VM, and can probably be the next big thing - not just in filtering but beyond. 
From what I observe, the most attractive feature of BPF is the access it gives developers to execute dynamically compiled code within the kernel - in a limited context, but still securely. This in itself is a valuable asset. 5 | 6 | As we have seen already, the use of BPF is not just limited to filtering network packets; it extends to seccomp, tracing, etc. The eventual step for BPF in such a scenario was to evolve and grow out of its use in the network filtering world. To improve the architecture and bytecode, lots of additions have been proposed. I started a bit late, when I saw Alexei's patches for kernel version 3.17-rcX. Perhaps [this](https://lkml.org/lkml/2013/9/30/627) was the relevant mail by Alexei that got me interested in the upcoming changes. So, here is a summary of the major changes that have occurred. We will be seeing each of them in sufficient detail. 7 | 8 | ### Architecture 9 | 10 | The classic BPF we discussed in the last post had two 32-bit registers - A and X. All arithmetic operations were supported and performed using these two registers. The newer BPF, called extended BPF or eBPF, has ten 64-bit registers and supports arbitrary loads/stores. It also contains new instructions like `BPF_CALL` which can be used to call new kernel-side helper functions. We will look into this in detail a bit later as well. The new eBPF follows calling conventions which are more like those of modern machines (x86_64). Here is the mapping of the new eBPF registers to x86 registers: 11 | 12 | ``` 13 | R0  - rax   return value from function 14 | R1  - rdi   1st argument 15 | R2  - rsi   2nd argument 16 | R3  - rdx   3rd argument 17 | R4  - rcx   4th argument 18 | R5  - r8    5th argument 19 | R6  - rbx   callee saved 20 | R7  - r13   callee saved 21 | R8  - r14   callee saved 22 | R9  - r15   callee saved 23 | R10 - rbp   frame pointer 24 | ``` 25 | 26 | The closeness to the machine ABI also ensures that unnecessary register spilling/copying can be avoided. 
The R0 register stores the return value from the eBPF program, and the eBPF program context can be loaded through register R1. Earlier, there used to be just two jump targets, i.e. jump to either the TRUE or the FALSE target. Now, there can be arbitrary jump targets - true or fall-through. Another aspect of the eBPF instruction set is its ease of use with the in-kernel JIT compiler. eBPF registers and most instructions now map one-to-one to machine registers and instructions. This makes emitting these eBPF instructions from any external compiler (in userspace) not such a daunting task. Of course, prior to any execution, the generated bytecode is passed through a verifier in the kernel to check its sanity. The verifier in itself is a very interesting and important piece of code, and probably a story for another day. 27 | 28 | ### Building BPF Programs 29 | 30 | From a user's perspective, the new eBPF bytecode can now be another headache to generate. But fear not, an LLVM-based backend now supports generating instructions for the BPF pseudo-machine type directly. It is being 'graduated' from just being an experimental backend and can hit the shelves anytime soon. In the meantime, you can always use [this script](https://gist.github.com/tuxology/357d8826e97eb72c9277) to set up the BPF-supported LLVM yourself. But then what next? A BPF program (not necessarily just a filter anymore) can be written in two parts - a kernel part (the BPF bytecode which will get loaded in the kernel) and a userspace part (which may, if needed, gather data from the kernel part). Currently you can specify an eBPF program in a restricted C-like language. For example, here is a program in the restricted C which returns true if the first argument of the input program context is 42. 
Nothing fancy: 31 | 32 | ```C 33 | #include 34 | 35 | int answer(struct bpf_context *ctx) 36 | { 37 | int life; 38 | life = ctx->arg1; 39 | 40 | if (life == 42) { 41 | return 1; 42 | } 43 | return 0; 44 | } 45 | ``` 46 | 47 | This C-like syntax generates a BPF binary which can then be loaded in the kernel. Here is what it looks like in the BPF 'assembly' representation as generated by the LLVM backend (supplied with LLVM 3.4): 48 | 49 | ```ASM 50 | .text 51 | .globl answer 52 | .align 8 53 | answer: # @answer 54 | # BB#0: 55 | ldw r1, 0(r1) 56 | mov r0, 1 57 | mov r2, 42 58 | jeq r1, r2 goto .LBB0_2 59 | # BB#1: 60 | mov r0, 0 61 | .LBB0_2: 62 | andi r0, 1 63 | ret 64 | ``` 65 | 66 | If you are adventurous enough, you can also probably write complete and valid [BPF programs in assembly](http://lxr.free-electrons.com/source/samples/bpf/test_verifier.c#L36) in a single go - right from your userspace program. I do not know if this is of any use these days. I did this some time back for a moderately elaborate trace filtering program though. It is also not that effective, because I think at this point in human history, LLVM can generate assembly better and more efficiently than a human. 67 | 68 | What we discussed just now is probably not a relevant program anymore. An [example by Alexei](http://lxr.free-electrons.com/source/samples/bpf/tracex1_kern.c) is what is more relevant these days. With the integration of Kprobes with BPF, a BPF program can be run at any valid dynamically instrumentable function in the kernel. So now, we can just use pt_regs as the context and get individual register values each time the probe is hit. As of now, some helper functions are available in BPF as well, which can get the current timestamp. You can have a very cheap tracing tool right there :) 69 | 70 | ### BPF Maps 71 | 72 | I think one of the most interesting features in this new eBPF is the BPF maps. 
A BPF map looks like an abstract data type - initially a hash table, but from kernel 3.19 onwards, support for array maps seems to have been added as well. These bpf_maps can be used to store data generated from an eBPF program being executed. You can see the implementation details in [arraymap.c](http://lxr.free-electrons.com/source/kernel/bpf/arraymap.c) or [hashtab.c](http://lxr.free-electrons.com/source/kernel/bpf/hashtab.c). Let's pause for a while and see some more magic added in eBPF - especially the BPF syscall, which forms the primary interface for the user to interact with and use eBPF. The reason we want to know more about this syscall is to learn how to work with these cool BPF maps. 73 | 74 | #### BPF Syscall 75 | 76 | Another nice thing about eBPF is the new syscall that has been added to make life easier while dealing with BPF programs. In an [article](https://lwn.net/Articles/603983/) on LWN last year, Jonathan Corbet discussed the use of the BPF syscall. For example, to load a BPF program you could call 77 | 78 | ```C 79 | syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)); 80 | ``` 81 | 82 | with, of course, the corresponding bpf_attr structure being filled in beforehand: 83 | 84 | ```C 85 | union bpf_attr attr = { 86 | .prog_type = prog_type, /* kprobe filter? socket filter? */ 87 | .insns = ptr_to_u64((void *) insns), /* complete bpf instructions */ 88 | .insn_cnt = prog_len / sizeof(struct bpf_insn), /* how many? 
*/ 89 | .license = ptr_to_u64((void *) license), /* GPL maybe */ 90 | .log_buf = ptr_to_u64(bpf_log_buf), /* log buffer */ 91 | .log_size = LOG_BUF_SIZE, 92 | .log_level = 1, 93 | }; 94 | ``` 95 | 96 | Yes, this may seem cumbersome to some, so for now there are some [wrapper functions](http://lxr.free-electrons.com/source/samples/bpf/bpf_load.c#L33) in [bpf_load.c](http://lxr.free-electrons.com/source/samples/bpf/bpf_load.c) and [libbpf.c](http://lxr.free-electrons.com/source/samples/bpf/libbpf.c) released to help folks out, where you need not give too many details about your compiled BPF program. Much of what happens in the BPF syscall is determined by the arguments supported [here](http://lxr.free-electrons.com/source/kernel/bpf/syscall.c#L551). To elaborate more, let's see how to load the BPF program we wrote before. Assuming that we have the [sample program](http://lxr.free-electrons.com/source/samples/bpf/tracex1_kern.c) in its BPF bytecode form generated, and now we want to load it up, we take the help of the wrapper function [load_bpf_file()](http://lxr.free-electrons.com/source/samples/bpf/bpf_load.c#L190) which parses the BPF ELF file and extracts the BPF bytecode from the relevant section. It also iterates over all ELF sections to get license info, map info, etc. Eventually, as per the type of BPF program - kprobe/kretprobe or socket program - and the info and bytecode just gathered from the ELF parsing, the bpf_attr attribute structure is filled and the actual syscall is made. 97 | 98 | #### Creating and accessing BPF maps 99 | 100 | Coming back to the maps: apart from this simple syscall to load the BPF program, there are many more actions that can be taken based on just the arguments. Have a look at [bpf/syscall.c](http://lxr.free-electrons.com/source/kernel/bpf/syscall.c). From the userspace side, the new BPF syscall comes to the rescue and allows most of these operations on bpf_maps to be performed! 
From the kernel side, however, with some special helper functions and the use of the BPF_CALL instruction, the values in these maps can be updated/deleted/accessed, etc. These [helpers](http://lxr.free-electrons.com/source/kernel/bpf/helpers.c) in turn call the actual function according to the type of map - hash map or array. For example, here is a BPF program that just creates an array map and does nothing else, 101 | 102 | ```C 103 | #include 104 | #include "bpf_helpers.h" 105 | #include 106 | 107 | struct bpf_map_def SEC("maps") sample_map = { 108 | .type = BPF_MAP_TYPE_ARRAY, 109 | .key_size = sizeof(u32), 110 | .value_size = sizeof(unsigned int), 111 | .max_entries = 1000, 112 | }; 113 | 114 | char _license[] SEC("license") = "GPL"; 115 | u32 _version SEC("version") = LINUX_VERSION_CODE; 116 | ``` 117 | 118 | When loaded in the kernel, the array map is created. From the userspace, we can then initialize the map with some values with a function that looks like this (here `map_fd[]` and `value1` come from the surrounding sample code), 119 | 120 | ```C 121 | static void init_array() 122 | { 123 | int key; 124 | for (key = 0; key < 1000; key++) { 125 | bpf_update_elem(map_fd[0], &key, &value1, BPF_ANY); 126 | } 127 | } 128 | ``` 129 | 130 | where the `bpf_update_elem()` wrapper in turn calls the BPF syscall with the proper arguments and attributes as, 131 | 132 | ```C 133 | syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); 134 | ``` 135 | 136 | This in turn calls [`map_update_elem()`](http://lxr.free-electrons.com/source/kernel/bpf/syscall.c#L205) which securely copies the key and value using `copy_from_user()` and then calls the [specialized function](http://lxr.free-electrons.com/source/kernel/bpf/arraymap.c#L94) for updating the value of the array map at the specified index. Similar things happen for reading/deleting/creating hash or array maps from userspace. 
137 | 138 | So probably things will start falling into place now, from the [earlier post](http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html) by Brendan Gregg where he was updating a map from the BPF program (using the `BPF_CALL` instruction which calls the internal [kernel helpers](http://lxr.free-electrons.com/source/kernel/bpf/helpers.c)) and then concurrently accessing it from userspace to generate a beautiful histogram (through the syscall I just mentioned above). BPF maps are indeed a very powerful addition to the system. You can also check out more detailed and complete [examples](https://github.com/iovisor/bcc/tree/master/examples) now that you know what is going on. To summarize, this is how an example BPF program written in restricted C for the kernel part (`foo_kern.c`) and normal C for the userspace part (`foo_user.c`) would run these days: 139 | 140 | [![img](https://suchakra.files.wordpress.com/2015/08/ebpf-session.png)](https://suchakra.files.wordpress.com/2015/08/ebpf-session.png) 141 | 142 | In the next BPF post, I will discuss the eBPF verifier in detail. This is the most crucial part of BPF and deserves detailed attention, I think. There is also something cool happening these days on the PLUMgrid side - the [BPF Compiler Collection](https://github.com/iovisor/bcc). There was a very interesting demo using such tools and the power of eBPF at the recent Red Hat Summit. I got BCC working and tried out some examples with probes - where I could easily compile and load BPF programs from my Python scripts! How cool is that :) Also, I have been digging through LTTng's interpreter lately, so another post detailing how the BPF and LTTng interpreters work would be nice. That's all for now. Run BPF. 143 | -------------------------------------------------------------------------------- /bpf_helpers.rst: -------------------------------------------------------------------------------- 1 | .. 
Copyright (C) All BPF authors and contributors from 2014 to present. 2 | .. See git log include/uapi/linux/bpf.h in kernel tree for details. 3 | .. 4 | .. %%%LICENSE_START(VERBATIM) 5 | .. Permission is granted to make and distribute verbatim copies of this 6 | .. manual provided the copyright notice and this permission notice are 7 | .. preserved on all copies. 8 | .. 9 | .. Permission is granted to copy and distribute modified versions of this 10 | .. manual under the conditions for verbatim copying, provided that the 11 | .. entire resulting derived work is distributed under the terms of a 12 | .. permission notice identical to this one. 13 | .. 14 | .. Since the Linux kernel and libraries are constantly changing, this 15 | .. manual page may be incorrect or out-of-date. The author(s) assume no 16 | .. responsibility for errors or omissions, or for damages resulting from 17 | .. the use of the information contained herein. The author(s) may not 18 | .. have taken the same level of care in the production of this manual, 19 | .. which is licensed free of charge, as they might when working 20 | .. professionally. 21 | .. 22 | .. Formatted or processed versions of this manual, if unaccompanied by 23 | .. the source, must acknowledge the copyright and authors of this work. 24 | .. %%%LICENSE_END 25 | .. 26 | .. Please do not edit this file. It was generated from the documentation 27 | .. located in file include/uapi/linux/bpf.h of the Linux kernel sources 28 | .. (helpers description), and from scripts/bpf_helpers_doc.py in the same 29 | .. repository (header and footer). 
30 | 31 | =========== 32 | BPF-HELPERS 33 | =========== 34 | ------------------------------------------------------------------------------- 35 | list of eBPF helper functions 36 | ------------------------------------------------------------------------------- 37 | 38 | :Manual section: 7 39 | 40 | DESCRIPTION 41 | =========== 42 | 43 | The extended Berkeley Packet Filter (eBPF) subsystem consists of programs 44 | written in a pseudo-assembly language, then attached to one of the several 45 | kernel hooks and run in reaction to specific events. This framework differs 46 | from the older, "classic" BPF (or "cBPF") in several aspects, one of them being 47 | the ability to call special functions (or "helpers") from within a program. 48 | These functions are restricted to a white-list of helpers defined in the 49 | kernel. 50 | 51 | These helpers are used by eBPF programs to interact with the system, or with 52 | the context in which they work. For instance, they can be used to print 53 | debugging messages, to get the time since the system was booted, to interact 54 | with eBPF maps, or to manipulate network packets. Since there are several eBPF 55 | program types, and they do not run in the same context, each program type 56 | can only call a subset of those helpers. 57 | 58 | Due to eBPF conventions, a helper cannot have more than five arguments. 59 | 60 | Internally, eBPF programs call directly into the compiled helper functions 61 | without requiring any foreign-function interface. As a result, calling helpers 62 | introduces no overhead, thus offering excellent performance. 63 | 64 | This document is an attempt to list and document the helpers available to eBPF 65 | developers. They are sorted in chronological order (the oldest helpers in the 66 | kernel at the top). 
67 | 68 | HELPERS 69 | ======= 70 | 71 | **void \*bpf_map_lookup_elem(struct bpf_map \***\ *map*\ **, const void \***\ *key*\ **)** 72 | Description 73 | Perform a lookup in *map* for an entry associated to *key*. 74 | Return 75 | Map value associated to *key*, or **NULL** if no entry was 76 | found. 77 | 78 | **int bpf_map_update_elem(struct bpf_map \***\ *map*\ **, const void \***\ *key*\ **, const void \***\ *value*\ **, u64** *flags*\ **)** 79 | Description 80 | Add or update the value of the entry associated to *key* in 81 | *map* with *value*. *flags* is one of: 82 | 83 | **BPF_NOEXIST** 84 | The entry for *key* must not exist in the map. 85 | **BPF_EXIST** 86 | The entry for *key* must already exist in the map. 87 | **BPF_ANY** 88 | No condition on the existence of the entry for *key*. 89 | 90 | Flag value **BPF_NOEXIST** cannot be used for maps of types 91 | **BPF_MAP_TYPE_ARRAY** or **BPF_MAP_TYPE_PERCPU_ARRAY** (all 92 | elements always exist), the helper would return an error. 93 | Return 94 | 0 on success, or a negative error in case of failure. 95 | 96 | **int bpf_map_delete_elem(struct bpf_map \***\ *map*\ **, const void \***\ *key*\ **)** 97 | Description 98 | Delete entry with *key* from *map*. 99 | Return 100 | 0 on success, or a negative error in case of failure. 101 | 102 | **int bpf_probe_read(void \***\ *dst*\ **, u32** *size*\ **, const void \***\ *src*\ **)** 103 | Description 104 | For tracing programs, safely attempt to read *size* bytes from 105 | address *src* and store the data in *dst*. 106 | Return 107 | 0 on success, or a negative error in case of failure. 108 | 109 | **u64 bpf_ktime_get_ns(void)** 110 | Description 111 | Return the time elapsed since system boot, in nanoseconds. 112 | Return 113 | Current *ktime*. 114 | 115 | **int bpf_trace_printk(const char \***\ *fmt*\ **, u32** *fmt_size*\ **, ...)** 116 | Description 117 | This helper is a "printk()-like" facility for debugging. 
It 118 | prints a message defined by format *fmt* (of size *fmt_size*) 119 | to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if 120 | available. It can take up to three additional **u64** 121 | arguments (as with eBPF helpers in general, the total number of arguments is 122 | limited to five). 123 | 124 | Each time the helper is called, it appends a line to the trace. 125 | The format of the trace is customizable, and the exact output 126 | one will get depends on the options set in 127 | *\/sys/kernel/debug/tracing/trace_options* (see also the 128 | *README* file under the same directory). However, it usually 129 | defaults to something like: 130 | 131 | :: 132 | 133 | telnet-470 [001] .N.. 419421.045894: 0x00000001: <formatted msg> 134 | 135 | In the above: 136 | 137 | * ``telnet`` is the name of the current task. 138 | * ``470`` is the PID of the current task. 139 | * ``001`` is the CPU number on which the task is 140 | running. 141 | * In ``.N..``, each character refers to a set of 142 | options (whether irqs are enabled, scheduling 143 | options, whether hard/softirqs are running, level of 144 | preempt_disabled respectively). **N** means that 145 | **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED** 146 | are set. 147 | * ``419421.045894`` is a timestamp. 148 | * ``0x00000001`` is a fake value used by BPF for the 149 | instruction pointer register. 150 | * ``<formatted msg>`` is the message formatted with 151 | *fmt*. 152 | 153 | The conversion specifiers supported by *fmt* are similar to, but 154 | more limited than, those of printk(). They are **%d**, **%i**, 155 | **%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**, 156 | **%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size 157 | of field, padding with zeroes, etc.) is available, and the 158 | helper will return **-EINVAL** (but print nothing) if it 159 | encounters an unknown specifier. 160 | 161 | Also, note that **bpf_trace_printk**\ () is slow, and should 162 | only be used for debugging purposes.
For this reason, a notice 163 | block (spanning several lines) stating that the helper should 164 | not be used "for production use" is printed to kernel logs 165 | the first time this helper is used (or more precisely, when 166 | **trace_printk**\ () buffers are allocated). For passing values 167 | to user space, perf events should be preferred. 168 | Return 169 | The number of bytes written to the buffer, or a negative error 170 | in case of failure. 171 | 172 | **u32 bpf_get_prandom_u32(void)** 173 | Description 174 | Get a pseudo-random number. 175 | 176 | From a security point of view, this helper uses its own 177 | pseudo-random internal state, and cannot be used to infer the 178 | seed of other random functions in the kernel. However, it is 179 | essential to note that the generator used by the helper is not 180 | cryptographically secure. 181 | Return 182 | A random 32-bit unsigned value. 183 | 184 | **u32 bpf_get_smp_processor_id(void)** 185 | Description 186 | Get the SMP (symmetric multiprocessing) processor id. Note that 187 | all programs run with preemption disabled, which means that the 188 | SMP processor id is stable throughout the execution of the 189 | program. 190 | Return 191 | The SMP id of the processor running the program. 192 | 193 | **int bpf_skb_store_bytes(struct sk_buff \***\ *skb*\ **, u32** *offset*\ **, const void \***\ *from*\ **, u32** *len*\ **, u64** *flags*\ **)** 194 | Description 195 | Store *len* bytes from address *from* into the packet 196 | associated to *skb*, at *offset*. *flags* are a combination of 197 | **BPF_F_RECOMPUTE_CSUM** (automatically recompute the 198 | checksum for the packet after storing the bytes) and 199 | **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\ 200 | **->swhash** and *skb*\ **->l4hash** to 0). 201 | 202 | A call to this helper may change the underlying 203 | packet buffer.
Therefore, at load time, all checks on pointers 204 | previously done by the verifier are invalidated and must be 205 | performed again, if the helper is used in combination with 206 | direct packet access. 207 | Return 208 | 0 on success, or a negative error in case of failure. 209 | 210 | **int bpf_l3_csum_replace(struct sk_buff \***\ *skb*\ **, u32** *offset*\ **, u64** *from*\ **, u64** *to*\ **, u64** *size*\ **)** 211 | Description 212 | Recompute the layer 3 (e.g. IP) checksum for the packet 213 | associated to *skb*. Computation is incremental, so the helper 214 | must know the former value of the header field that was 215 | modified (*from*), the new value of this field (*to*), and the 216 | number of bytes (2 or 4) for this field, stored in *size*. 217 | Alternatively, it is possible to store the difference between 218 | the previous and the new values of the header field in *to*, by 219 | setting *from* and *size* to 0. For both methods, *offset* 220 | indicates the location of the IP checksum within the packet. 221 | 222 | This helper works in combination with **bpf_csum_diff**\ (), 223 | which does not update the checksum in-place, but offers more 224 | flexibility and can handle sizes larger than 2 or 4 for the 225 | checksum to update. 226 | 227 | A call to this helper may change the underlying 228 | packet buffer.
Computation is incremental, so the 239 | helper must know the former value of the header field that was 240 | modified (*from*), the new value of this field (*to*), and the 241 | number of bytes (2 or 4) for this field, stored in the lowest 242 | four bits of *flags*. Alternatively, it is possible to store 243 | the difference between the previous and the new values of the 244 | header field in *to*, by setting *from* and the four lowest 245 | bits of *flags* to 0. For both methods, *offset* indicates the 246 | location of the checksum within the packet. In addition to 247 | the size of the field, actual flags can be added to *flags* 248 | (with a bitwise OR). With **BPF_F_MARK_MANGLED_0**, a null checksum is left 249 | untouched (unless **BPF_F_MARK_ENFORCE** is added as well), and 250 | for updates resulting in a null checksum the value is set to 251 | **CSUM_MANGLED_0** instead. Flag **BPF_F_PSEUDO_HDR** indicates 252 | the checksum is to be computed against a pseudo-header. 253 | 254 | This helper works in combination with **bpf_csum_diff**\ (), 255 | which does not update the checksum in-place, but offers more 256 | flexibility and can handle sizes larger than 2 or 4 for the 257 | checksum to update. 258 | 259 | A call to this helper may change the underlying 260 | packet buffer. Therefore, at load time, all checks on pointers 261 | previously done by the verifier are invalidated and must be 262 | performed again, if the helper is used in combination with 263 | direct packet access. 264 | Return 265 | 0 on success, or a negative error in case of failure. 266 | 267 | **int bpf_tail_call(void \***\ *ctx*\ **, struct bpf_map \***\ *prog_array_map*\ **, u32** *index*\ **)** 268 | Description 269 | This special helper is used to trigger a "tail call", or in 270 | other words, to jump into another eBPF program. The same stack 271 | frame is used (but values on stack and in registers for the 272 | caller are not accessible to the callee).
This mechanism allows 273 | for program chaining, either for raising the maximum number of 274 | available eBPF instructions, or for executing given programs in 275 | conditional blocks. For security reasons, there is an upper 276 | limit to the number of successive tail calls that can be 277 | performed. 278 | 279 | Upon call of this helper, the program attempts to jump into a 280 | program referenced at index *index* in *prog_array_map*, a 281 | special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes 282 | *ctx*, a pointer to the context. 283 | 284 | If the call succeeds, the kernel immediately runs the first 285 | instruction of the new program. This is not a function call, 286 | and it never returns to the previous program. If the call 287 | fails, then the helper has no effect, and the caller continues 288 | to run its subsequent instructions. A call can fail if the 289 | destination program for the jump does not exist (i.e. *index* 290 | is greater than or equal to the number of entries in *prog_array_map*), or 291 | if the maximum number of tail calls has been reached for this 292 | chain of programs. This limit is defined in the kernel by the 293 | macro **MAX_TAIL_CALL_CNT** (not accessible to user space), 294 | which is currently set to 32. 295 | Return 296 | 0 on success, or a negative error in case of failure. 297 | 298 | **int bpf_clone_redirect(struct sk_buff \***\ *skb*\ **, u32** *ifindex*\ **, u64** *flags*\ **)** 299 | Description 300 | Clone and redirect the packet associated to *skb* to another 301 | net device of index *ifindex*. Both ingress and egress 302 | interfaces can be used for redirection. The **BPF_F_INGRESS** 303 | value in *flags* is used to make the distinction (ingress path 304 | is selected if the flag is present, egress path otherwise). 305 | This is the only flag supported for now.
306 | 307 | In comparison with the **bpf_redirect**\ () helper, 308 | **bpf_clone_redirect**\ () has the associated cost of 309 | duplicating the packet buffer, but it is executed directly from within 310 | the eBPF program. Conversely, **bpf_redirect**\ () is more 311 | efficient, but it is handled through an action code where the 312 | redirection happens only after the eBPF program has returned. 313 | 314 | A call to this helper may change the underlying 315 | packet buffer. Therefore, at load time, all checks on pointers 316 | previously done by the verifier are invalidated and must be 317 | performed again, if the helper is used in combination with 318 | direct packet access. 319 | Return 320 | 0 on success, or a negative error in case of failure. 321 | 322 | **u64 bpf_get_current_pid_tgid(void)** 323 | Return 324 | A 64-bit integer containing the current tgid and pid, and 325 | created as such: 326 | *current_task*\ **->tgid << 32 \|** 327 | *current_task*\ **->pid**. 328 | 329 | **u64 bpf_get_current_uid_gid(void)** 330 | Return 331 | A 64-bit integer containing the current GID and UID, and 332 | created as such: *current_gid* **<< 32 \|** *current_uid*. 333 | 334 | **int bpf_get_current_comm(char \***\ *buf*\ **, u32** *size_of_buf*\ **)** 335 | Description 336 | Copy the **comm** attribute of the current task into *buf* of 337 | *size_of_buf*. The **comm** attribute contains the name of 338 | the executable (excluding the path) for the current task. The 339 | *size_of_buf* must be strictly positive. On success, the 340 | helper makes sure that the *buf* is NUL-terminated. On failure, 341 | it is filled with zeroes. 342 | Return 343 | 0 on success, or a negative error in case of failure. 344 | 345 | **u32 bpf_get_cgroup_classid(struct sk_buff \***\ *skb*\ **)** 346 | Description 347 | Retrieve the classid for the current task, i.e. for the net_cls 348 | cgroup to which *skb* belongs.
349 | 350 | This helper can be used on the TC egress path, but not on ingress. 351 | 352 | The net_cls cgroup provides an interface to tag network packets 353 | based on a user-provided identifier for all traffic coming from 354 | the tasks belonging to the related cgroup. See also the related 355 | kernel documentation, available from the Linux sources in file 356 | *Documentation/cgroup-v1/net_cls.txt*. 357 | 358 | The Linux kernel has two versions for cgroups: there are 359 | cgroups v1 and cgroups v2. Both are available to users, who can 360 | use a mixture of them, but note that the net_cls cgroup is for 361 | cgroup v1 only. This makes it incompatible with BPF programs 362 | run on cgroups, which is a cgroup-v2-only feature (a socket can 363 | only hold data for one version of cgroups at a time). 364 | 365 | This helper is only available if the kernel was compiled with 366 | the **CONFIG_CGROUP_NET_CLASSID** configuration option set to 367 | "**y**" or to "**m**". 368 | Return 369 | The classid, or 0 for the default unconfigured classid. 370 | 371 | **int bpf_skb_vlan_push(struct sk_buff \***\ *skb*\ **, __be16** *vlan_proto*\ **, u16** *vlan_tci*\ **)** 372 | Description 373 | Push a *vlan_tci* (VLAN tag control information) of protocol 374 | *vlan_proto* to the packet associated to *skb*, then update 375 | the checksum. Note that if *vlan_proto* is other than 376 | **ETH_P_8021Q** or **ETH_P_8021AD**, it is considered to 377 | be **ETH_P_8021Q**. 378 | 379 | A call to this helper may change the underlying 380 | packet buffer. Therefore, at load time, all checks on pointers 381 | previously done by the verifier are invalidated and must be 382 | performed again, if the helper is used in combination with 383 | direct packet access. 384 | Return 385 | 0 on success, or a negative error in case of failure.
386 | 387 | **int bpf_skb_vlan_pop(struct sk_buff \***\ *skb*\ **)** 388 | Description 389 | Pop a VLAN header from the packet associated to *skb*. 390 | 391 | A call to this helper may change the underlying 392 | packet buffer. Therefore, at load time, all checks on pointers 393 | previously done by the verifier are invalidated and must be 394 | performed again, if the helper is used in combination with 395 | direct packet access. 396 | Return 397 | 0 on success, or a negative error in case of failure. 398 | 399 | **int bpf_skb_get_tunnel_key(struct sk_buff \***\ *skb*\ **, struct bpf_tunnel_key \***\ *key*\ **, u32** *size*\ **, u64** *flags*\ **)** 400 | Description 401 | Get tunnel metadata. This helper takes a pointer *key* to an 402 | empty **struct bpf_tunnel_key** of *size*, that will be 403 | filled with tunnel metadata for the packet associated to *skb*. 404 | The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which 405 | indicates that the tunnel is based on the IPv6 protocol instead of 406 | IPv4. 407 | 408 | The **struct bpf_tunnel_key** is an object that generalizes the 409 | principal parameters used by various tunneling protocols into a 410 | single struct. This way, it can be used to easily make a 411 | decision based on the contents of the encapsulation header, 412 | "summarized" in this struct. In particular, it holds the IP 413 | address of the remote end (IPv4 or IPv6, depending on the case) 414 | in *key*\ **->remote_ipv4** or *key*\ **->remote_ipv6**. Also, 415 | this struct exposes the *key*\ **->tunnel_id**, which is 416 | generally mapped to a VNI (Virtual Network Identifier), making 417 | it programmable together with the **bpf_skb_set_tunnel_key**\ 418 | () helper.
419 | 420 | Let's imagine that the following code is part of a program 421 | attached to the TC ingress interface, on one end of a GRE 422 | tunnel, and is supposed to filter out all messages coming from 423 | remote ends with IPv4 address other than 10.0.0.1: 424 | 425 | :: 426 | 427 | int ret; 428 | struct bpf_tunnel_key key = {}; 429 | 430 | ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0); 431 | if (ret < 0) 432 | return TC_ACT_SHOT; // drop packet 433 | 434 | if (key.remote_ipv4 != 0x0a000001) 435 | return TC_ACT_SHOT; // drop packet 436 | 437 | return TC_ACT_OK; // accept packet 438 | 439 | This interface can also be used with all encapsulation devices 440 | that can operate in "collect metadata" mode: instead of having 441 | one network device per specific configuration, the "collect 442 | metadata" mode only requires a single device where the 443 | configuration can be extracted from this helper. 444 | 445 | This can be used together with various tunnels such as VXLan, 446 | Geneve, GRE or IP in IP (IPIP). 447 | Return 448 | 0 on success, or a negative error in case of failure. 449 | 450 | **int bpf_skb_set_tunnel_key(struct sk_buff \***\ *skb*\ **, struct bpf_tunnel_key \***\ *key*\ **, u32** *size*\ **, u64** *flags*\ **)** 451 | Description 452 | Populate tunnel metadata for packet associated to *skb.* The 453 | tunnel metadata is set to the contents of *key*, of *size*. The 454 | *flags* can be set to a combination of the following values: 455 | 456 | **BPF_F_TUNINFO_IPV6** 457 | Indicate that the tunnel is based on IPv6 protocol 458 | instead of IPv4. 459 | **BPF_F_ZERO_CSUM_TX** 460 | For IPv4 packets, add a flag to tunnel metadata 461 | indicating that checksum computation should be skipped 462 | and checksum set to zeroes. 463 | **BPF_F_DONT_FRAGMENT** 464 | Add a flag to tunnel metadata indicating that the 465 | packet should not be fragmented. 
466 | **BPF_F_SEQ_NUMBER** 467 | Add a flag to tunnel metadata indicating that a 468 | sequence number should be added to the tunnel header before 469 | sending the packet. This flag was added for GRE 470 | encapsulation, but might be used with other protocols 471 | as well in the future. 472 | 473 | Here is a typical usage on the transmit path: 474 | 475 | :: 476 | 477 | struct bpf_tunnel_key key; 478 | populate key ... 479 | bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0); 480 | bpf_clone_redirect(skb, vxlan_dev_ifindex, 0); 481 | 482 | See also the description of the **bpf_skb_get_tunnel_key**\ () 483 | helper for additional information. 484 | Return 485 | 0 on success, or a negative error in case of failure. 486 | 487 | **u64 bpf_perf_event_read(struct bpf_map \***\ *map*\ **, u64** *flags*\ **)** 488 | Description 489 | Read the value of a perf event counter. This helper relies on a 490 | *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of 491 | the perf event counter is selected when *map* is updated with 492 | perf event file descriptors. The *map* is an array whose size 493 | is the number of available CPUs, and each cell contains a value 494 | relative to one CPU. The value to retrieve is indicated by 495 | *flags*, that contains the index of the CPU to look up, masked 496 | with **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to 497 | **BPF_F_CURRENT_CPU** to indicate that the value for the 498 | current CPU should be retrieved. 499 | 500 | Note that before Linux 4.13, only hardware perf events can be 501 | retrieved. 502 | 503 | Also, be aware that the newer helper 504 | **bpf_perf_event_read_value**\ () is recommended over 505 | **bpf_perf_event_read**\ () in general. The latter has some ABI 506 | quirks where error and counter value are used as a return code 507 | (which is wrong to do since ranges may overlap).
This issue is 508 | fixed with **bpf_perf_event_read_value**\ (), which at the same 509 | time provides more features over the **bpf_perf_event_read**\ 510 | () interface. Please refer to the description of 511 | **bpf_perf_event_read_value**\ () for details. 512 | Return 513 | The value of the perf event counter read from the map, or a 514 | negative error code in case of failure. 515 | 516 | **int bpf_redirect(u32** *ifindex*\ **, u64** *flags*\ **)** 517 | Description 518 | Redirect the packet to another net device of index *ifindex*. 519 | This helper is somewhat similar to **bpf_clone_redirect**\ 520 | (), except that the packet is not cloned, which provides 521 | increased performance. 522 | 523 | Except for XDP, both ingress and egress interfaces can be used 524 | for redirection. The **BPF_F_INGRESS** value in *flags* is used 525 | to make the distinction (ingress path is selected if the flag 526 | is present, egress path otherwise). Currently, XDP only 527 | supports redirection to the egress interface, and accepts no 528 | flag at all. 529 | 530 | The same effect can be attained with the more generic 531 | **bpf_redirect_map**\ (), which requires specific maps to be 532 | used but offers better performance. 533 | Return 534 | For XDP, the helper returns **XDP_REDIRECT** on success or 535 | **XDP_ABORTED** on error. For other program types, the values 536 | are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on 537 | error. 538 | 539 | **u32 bpf_get_route_realm(struct sk_buff \***\ *skb*\ **)** 540 | Description 541 | Retrieve the realm of the route, that is to say the 542 | **tclassid** field of the destination for the *skb*. The 543 | identifier retrieved is a user-provided tag, similar to the 544 | one used with the net_cls cgroup (see the description of the 545 | **bpf_get_cgroup_classid**\ () helper), but here this tag is 546 | held by a route (a destination entry), not by a task.
547 | 548 | Retrieving this identifier works with the clsact TC egress hook 549 | (see also **tc-bpf(8)**), or alternatively on conventional 550 | classful egress qdiscs, but not on the TC ingress path. In the case of 551 | the clsact TC egress hook, this has the advantage that, internally, 552 | the destination entry has not been dropped yet in the transmit 553 | path. Therefore, the destination entry does not need to be 554 | artificially held via **netif_keep_dst**\ () for a classful 555 | qdisc until the *skb* is freed. 556 | 557 | This helper is available only if the kernel was compiled with 558 | the **CONFIG_IP_ROUTE_CLASSID** configuration option. 559 | Return 560 | The realm of the route for the packet associated to *skb*, or 0 561 | if none was found. 562 | 563 | **int bpf_perf_event_output(struct pt_regs \***\ *ctx*\ **, struct bpf_map \***\ *map*\ **, u64** *flags*\ **, void \***\ *data*\ **, u64** *size*\ **)** 564 | Description 565 | Write raw *data* blob into a special BPF perf event held by 566 | *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf 567 | event must have the following attributes: **PERF_SAMPLE_RAW** 568 | as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and 569 | **PERF_COUNT_SW_BPF_OUTPUT** as **config**. 570 | 571 | The *flags* are used to indicate the index in *map* for which 572 | the value must be put, masked with **BPF_F_INDEX_MASK**. 573 | Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU** 574 | to indicate that the index of the current CPU core should be 575 | used. 576 | 577 | The value to write, of *size*, is passed through eBPF stack and 578 | pointed by *data*. 579 | 580 | The context of the program, *ctx*, also needs to be passed to the 581 | helper. 582 | 583 | In user space, a program willing to read the values needs to 584 | call **perf_event_open**\ () on the perf event (either for 585 | one or for all CPUs) and to store the file descriptor into the 586 | *map*.
This must be done before the eBPF program can send data 587 | into it. An example is available in the file 588 | *samples/bpf/trace_output_user.c* in the Linux kernel source 589 | tree (the eBPF program counterpart is in 590 | *samples/bpf/trace_output_kern.c*). 591 | 592 | **bpf_perf_event_output**\ () achieves better performance 593 | than **bpf_trace_printk**\ () for sharing data with user 594 | space, and is much better suited to streaming data from eBPF 595 | programs. 596 | 597 | Note that this helper is not restricted to tracing use cases 598 | and can be used with programs attached to TC or XDP as well, 599 | where it allows for passing data to user space listeners. Data 600 | can be: 601 | 602 | * Only custom structs, 603 | * Only the packet payload, or 604 | * A combination of both. 605 | Return 606 | 0 on success, or a negative error in case of failure. 607 | 608 | **int bpf_skb_load_bytes(const struct sk_buff \***\ *skb*\ **, u32** *offset*\ **, void \***\ *to*\ **, u32** *len*\ **)** 609 | Description 610 | This helper was provided as an easy way to load data from a 611 | packet. It can be used to load *len* bytes from *offset* from 612 | the packet associated to *skb*, into the buffer pointed by 613 | *to*. 614 | 615 | Since Linux 4.7, usage of this helper has mostly been replaced 616 | by "direct packet access", enabling packet data to be 617 | manipulated with *skb*\ **->data** and *skb*\ **->data_end** 618 | pointing respectively to the first byte of packet data and to 619 | the byte after the last byte of packet data. However, it 620 | remains useful if one wishes to read large quantities of data 621 | at once from a packet into the eBPF stack. 622 | Return 623 | 0 on success, or a negative error in case of failure. 624 | 625 | **int bpf_get_stackid(struct pt_regs \***\ *ctx*\ **, struct bpf_map \***\ *map*\ **, u64** *flags*\ **)** 626 | Description 627 | Walk a user or a kernel stack and return its id.
To achieve 628 | this, the helper needs *ctx*, which is a pointer to the context 629 | on which the tracing program is executed, and a pointer to a 630 | *map* of type **BPF_MAP_TYPE_STACK_TRACE**. 631 | 632 | The last argument, *flags*, holds the number of stack frames to 633 | skip (from 0 to 255), masked with 634 | **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set 635 | a combination of the following flags: 636 | 637 | **BPF_F_USER_STACK** 638 | Collect a user space stack instead of a kernel stack. 639 | **BPF_F_FAST_STACK_CMP** 640 | Compare stacks by hash only. 641 | **BPF_F_REUSE_STACKID** 642 | If two different stacks hash into the same *stackid*, 643 | discard the old one. 644 | 645 | The stack id retrieved is a 32-bit integer handle which 646 | can be further combined with other data (including other stack 647 | ids) and used as a key into maps. This can be useful for 648 | generating a variety of graphs (such as flame graphs or off-cpu 649 | graphs). 650 | 651 | For walking a stack, this helper is an improvement over 652 | **bpf_probe_read**\ (), which can be used with unrolled loops 653 | but is not efficient and consumes a lot of eBPF instructions. 654 | Instead, **bpf_get_stackid**\ () can collect up to 655 | **PERF_MAX_STACK_DEPTH** kernel and user frames. Note that 656 | this limit can be controlled with the **sysctl** program, and 657 | that it should be manually increased in order to profile long 658 | user stacks (such as stacks for Java programs). To do so, use: 659 | 660 | :: 661 | 662 | # sysctl kernel.perf_event_max_stack=<new value> 663 | 664 | Return 665 | The positive or null stack id on success, or a negative error 666 | in case of failure.
667 | 668 | **s64 bpf_csum_diff(__be32 \***\ *from*\ **, u32** *from_size*\ **, __be32 \***\ *to*\ **, u32** *to_size*\ **, __wsum** *seed*\ **)** 669 | Description 670 | Compute a checksum difference, from the raw buffer pointed by 671 | *from*, of length *from_size* (that must be a multiple of 4), 672 | towards the raw buffer pointed by *to*, of size *to_size* 673 | (same remark). An optional *seed* can be added to the value 674 | (this can be cascaded, the seed may come from a previous call 675 | to the helper). 676 | 677 | This is flexible enough to be used in several ways: 678 | 679 | * With *from_size* == 0, *to_size* > 0 and *seed* set to 680 | checksum, it can be used when pushing new data. 681 | * With *from_size* > 0, *to_size* == 0 and *seed* set to 682 | checksum, it can be used when removing data from a packet. 683 | * With *from_size* > 0, *to_size* > 0 and *seed* set to 0, it 684 | can be used to compute a diff. Note that *from_size* and 685 | *to_size* do not need to be equal. 686 | 687 | This helper can be used in combination with 688 | **bpf_l3_csum_replace**\ () and **bpf_l4_csum_replace**\ (), to 689 | which one can feed in the difference computed with 690 | **bpf_csum_diff**\ (). 691 | Return 692 | The checksum result, or a negative error code in case of 693 | failure. 694 | 695 | **int bpf_skb_get_tunnel_opt(struct sk_buff \***\ *skb*\ **, u8 \***\ *opt*\ **, u32** *size*\ **)** 696 | Description 697 | Retrieve tunnel options metadata for the packet associated to 698 | *skb*, and store the raw tunnel option data to the buffer *opt* 699 | of *size*. 700 | 701 | This helper can be used with encapsulation devices that can 702 | operate in "collect metadata" mode (please refer to the related 703 | note in the description of **bpf_skb_get_tunnel_key**\ () for 704 | more details). 
A particular example where this can be used is 705 | in combination with the Geneve encapsulation protocol, where it 706 | allows for pushing (with the **bpf_skb_set_tunnel_opt**\ () helper) 707 | and retrieving arbitrary TLVs (Type-Length-Value headers) from 708 | the eBPF program. This allows for full customization of these 709 | headers. 710 | Return 711 | The size of the option data retrieved. 712 | 713 | **int bpf_skb_set_tunnel_opt(struct sk_buff \***\ *skb*\ **, u8 \***\ *opt*\ **, u32** *size*\ **)** 714 | Description 715 | Set tunnel options metadata for the packet associated to *skb* 716 | to the option data contained in the raw buffer *opt* of *size*. 717 | 718 | See also the description of the **bpf_skb_get_tunnel_opt**\ () 719 | helper for additional information. 720 | Return 721 | 0 on success, or a negative error in case of failure. 722 | 723 | **int bpf_skb_change_proto(struct sk_buff \***\ *skb*\ **, __be16** *proto*\ **, u64** *flags*\ **)** 724 | Description 725 | Change the protocol of the *skb* to *proto*. Currently 726 | supported are transitions from IPv4 to IPv6, and from IPv6 to 727 | IPv4. The helper takes care of the groundwork for the 728 | transition, including resizing the socket buffer. The eBPF 729 | program is expected to fill the new headers, if any, via 730 | **bpf_skb_store_bytes**\ () and to recompute the checksums with 731 | **bpf_l3_csum_replace**\ () and **bpf_l4_csum_replace**\ 732 | (). The main case for this helper is to perform NAT64 733 | operations out of an eBPF program. 734 | 735 | Internally, the GSO type is marked as dodgy so that headers are 736 | checked and segments are recalculated by the GSO/GRO engine. 737 | The size for GSO target is adapted as well. 738 | 739 | All values for *flags* are reserved for future usage, and must 740 | be left at zero. 741 | 742 | A call to this helper may change the underlying 743 | packet buffer.
Therefore, at load time, all checks on pointers 744 | previously done by the verifier are invalidated and must be 745 | performed again, if the helper is used in combination with 746 | direct packet access. 747 | Return 748 | 0 on success, or a negative error in case of failure. 749 | 750 | **int bpf_skb_change_type(struct sk_buff \***\ *skb*\ **, u32** *type*\ **)** 751 | Description 752 | Change the packet type for the packet associated to *skb*. This 753 | comes down to setting *skb*\ **->pkt_type** to *type*, except 754 | the eBPF program does not have write access to *skb*\ 755 | **->pkt_type** other than through this helper. Using a helper here allows 756 | for graceful handling of errors. 757 | 758 | The major use case is to change incoming *skb*s to 759 | **PACKET_HOST** in a programmatic way instead of having to 760 | recirculate via **bpf_redirect**\ (..., **BPF_F_INGRESS**), for 761 | example. 762 | 763 | Note that *type* only allows certain values. At this time, they 764 | are: 765 | 766 | **PACKET_HOST** 767 | Packet is for us. 768 | **PACKET_BROADCAST** 769 | Send packet to all. 770 | **PACKET_MULTICAST** 771 | Send packet to group. 772 | **PACKET_OTHERHOST** 773 | Send packet to someone else. 774 | Return 775 | 0 on success, or a negative error in case of failure. 776 | 777 | **int bpf_skb_under_cgroup(struct sk_buff \***\ *skb*\ **, struct bpf_map \***\ *map*\ **, u32** *index*\ **)** 778 | Description 779 | Check whether *skb* is a descendant of the cgroup2 held by 780 | *map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*. 781 | Return 782 | The return value depends on the result of the test, and can be: 783 | 784 | * 0, if the *skb* failed the cgroup2 descendant test. 785 | * 1, if the *skb* succeeded the cgroup2 descendant test. 786 | * A negative error code, if an error occurred. 787 | 788 | **u32 bpf_get_hash_recalc(struct sk_buff \***\ *skb*\ **)** 789 | Description 790 | Retrieve the hash of the packet, *skb*\ **->hash**.
If it is 791 | not set, in particular if the hash was cleared due to mangling, 792 | recompute this hash. Later accesses to the hash can be done 793 | directly with *skb*\ **->hash**. 794 | 795 | Calling **bpf_set_hash_invalid**\ (), changing a packet 796 | protocol with **bpf_skb_change_proto**\ (), or calling 797 | **bpf_skb_store_bytes**\ () with the 798 | **BPF_F_INVALIDATE_HASH** flag are actions susceptible to clear 799 | the hash and to trigger a new computation for the next call to 800 | **bpf_get_hash_recalc**\ (). 801 | Return 802 | The 32-bit hash. 803 | 804 | **u64 bpf_get_current_task(void)** 805 | Return 806 | A pointer to the current task struct. 807 | 808 | **int bpf_probe_write_user(void \***\ *dst*\ **, const void \***\ *src*\ **, u32** *len*\ **)** 809 | Description 810 | Attempt in a safe way to write *len* bytes from the buffer 811 | *src* to *dst* in memory. It only works for threads that are in 812 | user context, and *dst* must be a valid user space address. 813 | 814 | This helper should not be used to implement any kind of 815 | security mechanism because of TOC-TOU attacks, but rather to 816 | debug, divert, and manipulate execution of semi-cooperative 817 | processes. 818 | 819 | Keep in mind that this feature is meant for experiments, and it 820 | has a risk of crashing the system and running programs. 821 | Therefore, when an eBPF program using this helper is attached, 822 | a warning including PID and process name is printed to kernel 823 | logs. 824 | Return 825 | 0 on success, or a negative error in case of failure. 826 | 827 | **int bpf_current_task_under_cgroup(struct bpf_map \***\ *map*\ **, u32** *index*\ **)** 828 | Description 829 | Check whether the probe is being run in the context of a given 830 | subset of the cgroup2 hierarchy. The cgroup2 to test is held by 831 | *map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
832 | Return 833 | The return value depends on the result of the test, and can be: 834 | 835 | * 0, if the current task belongs to the cgroup2. 836 | * 1, if the current task does not belong to the cgroup2. 837 | * A negative error code, if an error occurred. 838 | 839 | **int bpf_skb_change_tail(struct sk_buff \***\ *skb*\ **, u32** *len*\ **, u64** *flags*\ **)** 840 | Description 841 | Resize (trim or grow) the packet associated to *skb* to the 842 | new *len*. The *flags* are reserved for future usage, and must 843 | be left at zero. 844 | 845 | The basic idea is that the helper performs the needed work to 846 | change the size of the packet, then the eBPF program rewrites 847 | the rest via helpers like **bpf_skb_store_bytes**\ (), 848 | **bpf_l3_csum_replace**\ (), **bpf_l4_csum_replace**\ () 849 | and others. This helper is a slow path utility intended for 850 | replies with control messages. Because it is targeted for the 851 | slow path, the helper itself can afford to be slow: it 852 | implicitly linearizes, unclones and drops offloads from the 853 | *skb*. 854 | 855 | A call to this helper is susceptible to change the underlying 856 | packet buffer. Therefore, at load time, all checks on pointers 857 | previously done by the verifier are invalidated and must be 858 | performed again, if the helper is used in combination with 859 | direct packet access. 860 | Return 861 | 0 on success, or a negative error in case of failure. 862 | 863 | **int bpf_skb_pull_data(struct sk_buff \***\ *skb*\ **, u32** *len*\ **)** 864 | Description 865 | Pull in non-linear data in case the *skb* is non-linear and not 866 | all of *len* are part of the linear section. Make *len* bytes 867 | from *skb* readable and writable. If a zero value is passed for 868 | *len*, then the whole length of the *skb* is pulled. 869 | 870 | This helper is only needed for reading and writing with direct 871 | packet access.
872 | 873 | For direct packet access, testing that offsets to access 874 | are within packet boundaries (test on *skb*\ **->data_end**) is 875 | susceptible to fail if offsets are invalid, or if the requested 876 | data is in non-linear parts of the *skb*. On failure the 877 | program can just bail out, or in the case of a non-linear 878 | buffer, use a helper to make the data available. The 879 | **bpf_skb_load_bytes**\ () helper is a first solution to access 880 | the data. Another one consists in using **bpf_skb_pull_data**\ () 881 | to pull in the non-linear parts once, then retesting and 882 | eventually accessing the data. 883 | 884 | At the same time, this also makes sure the *skb* is uncloned, 885 | which is a necessary condition for direct write. As this needs 886 | to be an invariant for the write part only, the verifier 887 | detects writes and adds a prologue that is calling 888 | **bpf_skb_pull_data**\ () to effectively unclone the *skb* from 889 | the very beginning in case it is indeed cloned. 890 | 891 | A call to this helper is susceptible to change the underlying 892 | packet buffer. Therefore, at load time, all checks on pointers 893 | previously done by the verifier are invalidated and must be 894 | performed again, if the helper is used in combination with 895 | direct packet access. 896 | Return 897 | 0 on success, or a negative error in case of failure. 898 | 899 | **s64 bpf_csum_update(struct sk_buff \***\ *skb*\ **, __wsum** *csum*\ **)** 900 | Description 901 | Add the checksum *csum* into *skb*\ **->csum** in case the 902 | driver has supplied a checksum for the entire packet into that 903 | field. Return an error otherwise. This helper is intended to be 904 | used in combination with **bpf_csum_diff**\ (), in particular 905 | when the checksum needs to be updated after data has been 906 | written into the packet through direct packet access. 907 | Return 908 | The checksum on success, or a negative error code in case of 909 | failure.
910 | 911 | **void bpf_set_hash_invalid(struct sk_buff \***\ *skb*\ **)** 912 | Description 913 | Invalidate the current *skb*\ **->hash**. It can be used after 914 | mangling on headers through direct packet access, in order to 915 | indicate that the hash is outdated and to trigger a 916 | recalculation the next time the kernel tries to access this 917 | hash or when the **bpf_get_hash_recalc**\ () helper is called. 918 | 919 | 920 | **int bpf_get_numa_node_id(void)** 921 | Description 922 | Return the id of the current NUMA node. The primary use case 923 | for this helper is the selection of sockets for the local NUMA 924 | node, when the program is attached to sockets using the 925 | **SO_ATTACH_REUSEPORT_EBPF** option (see also **socket(7)**), 926 | but the helper is also available to other eBPF program types, 927 | similarly to **bpf_get_smp_processor_id**\ (). 928 | Return 929 | The id of the current NUMA node. 930 | 931 | **int bpf_skb_change_head(struct sk_buff \***\ *skb*\ **, u32** *len*\ **, u64** *flags*\ **)** 932 | Description 933 | Grow the headroom of the packet associated to *skb* and adjust the 934 | offset of the MAC header accordingly, adding *len* bytes of 935 | space. It automatically extends and reallocates memory as 936 | required. 937 | 938 | This helper can be used on a layer 3 *skb* to push a MAC header 939 | for redirection into a layer 2 device. 940 | 941 | All values for *flags* are reserved for future usage, and must 942 | be left at zero. 943 | 944 | A call to this helper is susceptible to change the underlying 945 | packet buffer. Therefore, at load time, all checks on pointers 946 | previously done by the verifier are invalidated and must be 947 | performed again, if the helper is used in combination with 948 | direct packet access. 949 | Return 950 | 0 on success, or a negative error in case of failure.
951 | 952 | **int bpf_xdp_adjust_head(struct xdp_buff \***\ *xdp_md*\ **, int** *delta*\ **)** 953 | Description 954 | Adjust (move) *xdp_md*\ **->data** by *delta* bytes. Note that 955 | it is possible to use a negative value for *delta*. This helper 956 | can be used to prepare the packet for pushing or popping 957 | headers. 958 | 959 | A call to this helper is susceptible to change the underlying 960 | packet buffer. Therefore, at load time, all checks on pointers 961 | previously done by the verifier are invalidated and must be 962 | performed again, if the helper is used in combination with 963 | direct packet access. 964 | Return 965 | 0 on success, or a negative error in case of failure. 966 | 967 | **int bpf_probe_read_str(void \***\ *dst*\ **, int** *size*\ **, const void \***\ *unsafe_ptr*\ **)** 968 | Description 969 | Copy a NUL-terminated string from an unsafe address 970 | *unsafe_ptr* to *dst*. The *size* should include the 971 | terminating NUL byte. In case the string length is smaller than 972 | *size*, the target is not padded with further NUL bytes. If the 973 | string length is larger than *size*, just *size*-1 bytes are 974 | copied and the last byte is set to NUL. 975 | 976 | On success, the length of the copied string is returned. This 977 | makes this helper useful in tracing programs for reading 978 | strings, and more importantly to get their length at runtime. See 979 | the following snippet: 980 | 981 | :: 982 | 983 | SEC("kprobe/sys_open") 984 | void bpf_sys_open(struct pt_regs *ctx) 985 | { 986 | char buf[PATHLEN]; // PATHLEN is defined to 256 987 | int res = bpf_probe_read_str(buf, sizeof(buf), 988 | ctx->di); 989 | 990 | // Consume buf, for example push it to 991 | // userspace via bpf_perf_event_output(); we 992 | // can use res (the string length) as event 993 | // size, after checking its boundaries.
994 | } 995 | 996 | In comparison, using the **bpf_probe_read**\ () helper here instead 997 | to read the string would require estimating the length at 998 | compile time, and would often result in copying more memory 999 | than necessary. 1000 | 1001 | Another use case is when parsing individual process 1002 | arguments or individual environment variables, navigating 1003 | *current*\ **->mm->arg_start** and *current*\ 1004 | **->mm->env_start**: using this helper and the return value, 1005 | one can quickly iterate at the right offset of the memory area. 1006 | Return 1007 | On success, the strictly positive length of the string, 1008 | including the trailing NUL character. On error, a negative 1009 | value. 1010 | 1011 | **u64 bpf_get_socket_cookie(struct sk_buff \***\ *skb*\ **)** 1012 | Description 1013 | If the **struct sk_buff** pointed by *skb* has a known socket, 1014 | retrieve the cookie (generated by the kernel) of this socket. 1015 | If no cookie has been set yet, generate a new cookie. Once 1016 | generated, the socket cookie remains stable for the life of the 1017 | socket. This helper can be useful for monitoring per socket 1018 | networking traffic statistics as it provides a unique socket 1019 | identifier per namespace. 1020 | Return 1021 | An 8-byte long non-decreasing number on success, or 0 if the 1022 | socket field is missing inside *skb*. 1023 | 1024 | **u32 bpf_get_socket_uid(struct sk_buff \***\ *skb*\ **)** 1025 | Return 1026 | The owner UID of the socket associated to *skb*. If the socket 1027 | is **NULL**, or if it is not a full socket (i.e. if it is a 1028 | time-wait or a request socket instead), **overflowuid** value 1029 | is returned (note that **overflowuid** might also be the actual 1030 | UID value for the socket). 1031 | 1032 | **u32 bpf_set_hash(struct sk_buff \***\ *skb*\ **, u32** *hash*\ **)** 1033 | Description 1034 | Set the full hash for *skb* (set the field *skb*\ **->hash**) 1035 | to value *hash*.
1036 | Return 1037 | 0 1038 | 1039 | **int bpf_setsockopt(struct bpf_sock_ops \***\ *bpf_socket*\ **, int** *level*\ **, int** *optname*\ **, char \***\ *optval*\ **, int** *optlen*\ **)** 1040 | Description 1041 | Emulate a call to **setsockopt()** on the socket associated to 1042 | *bpf_socket*, which must be a full socket. The *level* at 1043 | which the option resides and the name *optname* of the option 1044 | must be specified; see **setsockopt(2)** for more information. 1045 | The option value of length *optlen* is pointed to by *optval*. 1046 | 1047 | This helper actually implements a subset of **setsockopt()**. 1048 | It supports the following *level*\ s: 1049 | 1050 | * **SOL_SOCKET**, which supports the following *optname*\ s: 1051 | **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**, 1052 | **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**. 1053 | * **IPPROTO_TCP**, which supports the following *optname*\ s: 1054 | **TCP_CONGESTION**, **TCP_BPF_IW**, 1055 | **TCP_BPF_SNDCWND_CLAMP**. 1056 | * **IPPROTO_IP**, which supports *optname* **IP_TOS**. 1057 | * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**. 1058 | Return 1059 | 0 on success, or a negative error in case of failure. 1060 | 1061 | **int bpf_skb_adjust_room(struct sk_buff \***\ *skb*\ **, u32** *len_diff*\ **, u32** *mode*\ **, u64** *flags*\ **)** 1062 | Description 1063 | Grow or shrink the room for data in the packet associated to 1064 | *skb* by *len_diff*, and according to the selected *mode*. 1065 | 1066 | There is a single supported mode at this time: 1067 | 1068 | * **BPF_ADJ_ROOM_NET**: Adjust room at the network layer 1069 | (room space is added or removed below the layer 3 header). 1070 | 1071 | All values for *flags* are reserved for future usage, and must 1072 | be left at zero. 1073 | 1074 | A call to this helper is susceptible to change the underlying 1075 | packet buffer.
Therefore, at load time, all checks on pointers 1076 | previously done by the verifier are invalidated and must be 1077 | performed again, if the helper is used in combination with 1078 | direct packet access. 1079 | Return 1080 | 0 on success, or a negative error in case of failure. 1081 | 1082 | **int bpf_redirect_map(struct bpf_map \***\ *map*\ **, u32** *key*\ **, u64** *flags*\ **)** 1083 | Description 1084 | Redirect the packet to the endpoint referenced by *map* at 1085 | index *key*. Depending on its type, this *map* can contain 1086 | references to net devices (for forwarding packets through other 1087 | ports), or to CPUs (for redirecting XDP frames to another CPU; 1088 | but this is only implemented for native XDP (with driver 1089 | support) as of this writing). 1090 | 1091 | All values for *flags* are reserved for future usage, and must 1092 | be left at zero. 1093 | 1094 | When used to redirect packets to net devices, this helper 1095 | provides a significant performance increase over **bpf_redirect**\ (). 1096 | This is due to various implementation details of the underlying 1097 | mechanisms, one of which is the fact that **bpf_redirect_map**\ 1098 | () tries to send packets to the device in bulk. 1099 | Return 1100 | **XDP_REDIRECT** on success, or **XDP_ABORTED** on error. 1101 | 1102 | **int bpf_sk_redirect_map(struct bpf_map \***\ *map*\ **, u32** *key*\ **, u64** *flags*\ **)** 1103 | Description 1104 | Redirect the packet to the socket referenced by *map* (of type 1105 | **BPF_MAP_TYPE_SOCKMAP**) at index *key*. Both ingress and 1106 | egress interfaces can be used for redirection. The 1107 | **BPF_F_INGRESS** value in *flags* is used to make the 1108 | distinction (ingress path is selected if the flag is present, 1109 | egress path otherwise). This is the only flag supported for now. 1110 | Return 1111 | **SK_PASS** on success, or **SK_DROP** on error.
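As a usage illustration for **bpf_redirect_map**\ (), here is a minimal XDP sketch written in the style of the kernel's *samples/bpf/* programs. The map name *tx_port*, the section names and the single-entry layout are illustrative choices, not part of the API; user space is expected to store the egress ifindex at index 0 before attaching the program.

```c
/* Illustrative sketch only: depends on kernel headers and on the
 * bpf_helpers.h wrappers shipped with the kernel sources. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Single-slot devmap; slot 0 holds the egress ifindex, filled from
 * user space with bpf_map_update_elem(). */
struct bpf_map_def SEC("maps") tx_port = {
    .type = BPF_MAP_TYPE_DEVMAP,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 1,
};

SEC("xdp_redirect")
int xdp_redirect_prog(struct xdp_md *ctx)
{
    /* flags must be 0 as of this writing; returns XDP_REDIRECT on
     * success, XDP_ABORTED on error. */
    return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";
```

A user space loader then attaches the resulting object to the ingress device, as done by the redirect samples under *samples/bpf/*.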
1112 | 1113 | **int bpf_sock_map_update(struct bpf_sock_ops \***\ *skops*\ **, struct bpf_map \***\ *map*\ **, void \***\ *key*\ **, u64** *flags*\ **)** 1114 | Description 1115 | Add an entry to, or update a *map* referencing sockets. The 1116 | *skops* is used as a new value for the entry associated to 1117 | *key*. *flags* is one of: 1118 | 1119 | **BPF_NOEXIST** 1120 | The entry for *key* must not exist in the map. 1121 | **BPF_EXIST** 1122 | The entry for *key* must already exist in the map. 1123 | **BPF_ANY** 1124 | No condition on the existence of the entry for *key*. 1125 | 1126 | If the *map* has eBPF programs (parser and verdict), those will 1127 | be inherited by the socket being added. If the socket is 1128 | already attached to eBPF programs, this results in an error. 1129 | Return 1130 | 0 on success, or a negative error in case of failure. 1131 | 1132 | **int bpf_xdp_adjust_meta(struct xdp_buff \***\ *xdp_md*\ **, int** *delta*\ **)** 1133 | Description 1134 | Adjust the address pointed by *xdp_md*\ **->data_meta** by 1135 | *delta* (which can be positive or negative). Note that this 1136 | operation modifies the address stored in *xdp_md*\ **->data**, 1137 | so the latter must be loaded only after the helper has been 1138 | called. 1139 | 1140 | The use of *xdp_md*\ **->data_meta** is optional and programs 1141 | are not required to use it. The rationale is that when the 1142 | packet is processed with XDP (e.g. as DoS filter), it is 1143 | possible to push further meta data along with it before passing 1144 | to the stack, and to give the guarantee that an ingress eBPF 1145 | program attached as a TC classifier on the same device can pick 1146 | this up for further post-processing. Since TC works with socket 1147 | buffers, it remains possible to set from XDP the **mark** or 1148 | **priority** pointers, or other pointers for the socket buffer. 
1149 | Having this scratch space generic and programmable allows for 1150 | more flexibility as the user is free to store whatever meta 1151 | data they need. 1152 | 1153 | A call to this helper is susceptible to change the underlying 1154 | packet buffer. Therefore, at load time, all checks on pointers 1155 | previously done by the verifier are invalidated and must be 1156 | performed again, if the helper is used in combination with 1157 | direct packet access. 1158 | Return 1159 | 0 on success, or a negative error in case of failure. 1160 | 1161 | **int bpf_perf_event_read_value(struct bpf_map \***\ *map*\ **, u64** *flags*\ **, struct bpf_perf_event_value \***\ *buf*\ **, u32** *buf_size*\ **)** 1162 | Description 1163 | Read the value of a perf event counter, and store it into *buf* 1164 | of size *buf_size*. This helper relies on a *map* of type 1165 | **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf event 1166 | counter is selected when *map* is updated with perf event file 1167 | descriptors. The *map* is an array whose size is the number of 1168 | available CPUs, and each cell contains a value relative to one 1169 | CPU. The value to retrieve is indicated by *flags*, which 1170 | contains the index of the CPU to look up, masked with 1171 | **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to 1172 | **BPF_F_CURRENT_CPU** to indicate that the value for the 1173 | current CPU should be retrieved. 1174 | 1175 | This helper behaves in a way close to the 1176 | **bpf_perf_event_read**\ () helper, save that instead of 1177 | just returning the value observed, it fills the *buf* 1178 | structure. This allows for additional data to be retrieved: in 1179 | particular, the enabled and running times (in *buf*\ 1180 | **->enabled** and *buf*\ **->running**, respectively) are 1181 | copied. In general, **bpf_perf_event_read_value**\ () is 1182 | recommended over **bpf_perf_event_read**\ (), which has some 1183 | ABI issues and provides less functionality.
1184 | 1185 | These values are interesting, because hardware PMU (Performance 1186 | Monitoring Unit) counters are limited resources. When there are 1187 | more PMU-based perf events opened than available counters, the 1188 | kernel will multiplex these events so that each event gets a certain 1189 | percentage (but not all) of the PMU time. When 1190 | multiplexing happens, the number of samples or the counter value 1191 | will not reflect what it would have been with no multiplexing, 1192 | which makes comparison between different runs difficult. 1193 | Typically, the counter value should be normalized before 1194 | comparing it to other experiments. The usual normalization is done 1195 | as follows. 1196 | 1197 | :: 1198 | 1199 | normalized_counter = counter * t_enabled / t_running 1200 | 1201 | where *t_enabled* is the time enabled for the event and *t_running* is 1202 | the time running for the event since the last normalization. The 1203 | enabled and running times are accumulated since the perf event 1204 | open. To achieve the scaling factor between two invocations of an 1205 | eBPF program, users can use the CPU id as the key (which is 1206 | typical for the perf array usage model) to remember the previous 1207 | value and do the calculation inside the eBPF program. 1208 | Return 1209 | 0 on success, or a negative error in case of failure. 1210 | 1211 | **int bpf_perf_prog_read_value(struct bpf_perf_event_data \***\ *ctx*\ **, struct bpf_perf_event_value \***\ *buf*\ **, u32** *buf_size*\ **)** 1212 | Description 1213 | For an eBPF program attached to a perf event, retrieve the 1214 | value of the event counter associated to *ctx* and store it in 1215 | the structure pointed by *buf* and of size *buf_size*. Enabled 1216 | and running times are also stored in the structure (see 1217 | description of helper **bpf_perf_event_read_value**\ () for 1218 | more details). 1219 | Return 1220 | 0 on success, or a negative error in case of failure.
1221 | 1222 | **int bpf_getsockopt(struct bpf_sock_ops \***\ *bpf_socket*\ **, int** *level*\ **, int** *optname*\ **, char \***\ *optval*\ **, int** *optlen*\ **)** 1223 | Description 1224 | Emulate a call to **getsockopt()** on the socket associated to 1225 | *bpf_socket*, which must be a full socket. The *level* at 1226 | which the option resides and the name *optname* of the option 1227 | must be specified; see **getsockopt(2)** for more information. 1228 | The retrieved value is stored in the structure pointed by 1229 | *optval* and of length *optlen*. 1230 | 1231 | This helper actually implements a subset of **getsockopt()**. 1232 | It supports the following *level*\ s: 1233 | 1234 | * **IPPROTO_TCP**, which supports *optname* 1235 | **TCP_CONGESTION**. 1236 | * **IPPROTO_IP**, which supports *optname* **IP_TOS**. 1237 | * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**. 1238 | Return 1239 | 0 on success, or a negative error in case of failure. 1240 | 1241 | **int bpf_override_return(struct pt_regs \***\ *regs*\ **, u64** *rc*\ **)** 1242 | Description 1243 | Used for error injection, this helper uses kprobes to override 1244 | the return value of the probed function, and to set it to *rc*. 1245 | The first argument is the context *regs* on which the kprobe 1246 | works. 1247 | 1248 | This helper works by setting the PC (program counter) 1249 | to an override function which is run in place of the original 1250 | probed function. This means the probed function is not run at 1251 | all. The replacement function just returns with the required 1252 | value. 1253 | 1254 | This helper has security implications, and thus is subject to 1255 | restrictions. It is only available if the kernel was compiled 1256 | with the **CONFIG_BPF_KPROBE_OVERRIDE** configuration 1257 | option, and in this case it only works on functions tagged with 1258 | **ALLOW_ERROR_INJECTION** in the kernel code.
1259 | 1260 | Also, the helper is only available for the architectures having 1261 | the **CONFIG_FUNCTION_ERROR_INJECTION** option. As of this writing, 1262 | the x86 architecture is the only one to support this feature. 1263 | Return 1264 | 0 1265 | 1266 | **int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops \***\ *bpf_sock*\ **, int** *argval*\ **)** 1267 | Description 1268 | Attempt to set the value of the **bpf_sock_ops_cb_flags** field 1269 | for the full TCP socket associated to *bpf_sock* to 1270 | *argval*. 1271 | 1272 | The primary use of this field is to determine if there should 1273 | be calls to eBPF programs of type 1274 | **BPF_PROG_TYPE_SOCK_OPS** at various points in the TCP 1275 | code. A program of the same type can change its value, per 1276 | connection and as necessary, when the connection is 1277 | established. This field is directly accessible for reading, but 1278 | this helper must be used for updates in order to return an 1279 | error if an eBPF program tries to set a callback that is not 1280 | supported in the current kernel. 1281 | 1282 | The supported callback values that *argval* can combine are: 1283 | 1284 | * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out) 1285 | * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission) 1286 | * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change) 1287 | 1288 | Here are some examples of where one could call such an eBPF 1289 | program: 1290 | 1291 | * When RTO fires. 1292 | * When a packet is retransmitted. 1293 | * When the connection terminates. 1294 | * When a packet is sent. 1295 | * When a packet is received. 1296 | Return 1297 | Code **-EINVAL** if the socket is not a full TCP socket; 1298 | otherwise, a positive number containing the bits that could not 1299 | be set is returned (which comes down to 0 if all bits were set 1300 | as required).
1301 | 1302 | **int bpf_msg_redirect_map(struct sk_msg_buff \***\ *msg*\ **, struct bpf_map \***\ *map*\ **, u32** *key*\ **, u64** *flags*\ **)** 1303 | Description 1304 | This helper is used in programs implementing policies at the 1305 | socket level. If the message *msg* is allowed to pass (i.e. if 1306 | the verdict eBPF program returns **SK_PASS**), redirect it to 1307 | the socket referenced by *map* (of type 1308 | **BPF_MAP_TYPE_SOCKMAP**) at index *key*. Both ingress and 1309 | egress interfaces can be used for redirection. The 1310 | **BPF_F_INGRESS** value in *flags* is used to make the 1311 | distinction (ingress path is selected if the flag is present, 1312 | egress path otherwise). This is the only flag supported for now. 1313 | Return 1314 | **SK_PASS** on success, or **SK_DROP** on error. 1315 | 1316 | **int bpf_msg_apply_bytes(struct sk_msg_buff \***\ *msg*\ **, u32** *bytes*\ **)** 1317 | Description 1318 | For socket policies, apply the verdict of the eBPF program to 1319 | the next *bytes* (number of bytes) of message *msg*. 1320 | 1321 | For example, this helper can be used in the following cases: 1322 | 1323 | * A single **sendmsg**\ () or **sendfile**\ () system call 1324 | contains multiple logical messages that the eBPF program is 1325 | supposed to read and for which it should apply a verdict. 1326 | * An eBPF program only cares to read the first *bytes* of a 1327 | *msg*. If the message has a large payload, then setting up 1328 | and calling the eBPF program repeatedly for all bytes, even 1329 | though the verdict is already known, would create unnecessary 1330 | overhead. 1331 | 1332 | When called from within an eBPF program, the helper sets a 1333 | counter internal to the BPF infrastructure, that is used to 1334 | apply the last verdict to the next *bytes*. 
If *bytes* is 1334 | smaller than the current data being processed from a 1335 | **sendmsg**\ () or **sendfile**\ () system call, the first 1336 | *bytes* will be sent and the eBPF program will be re-run with 1337 | the pointer for start of data pointing to byte number *bytes* 1338 | **+ 1**. If *bytes* is larger than the current data being 1339 | processed, then the eBPF verdict will be applied to multiple 1340 | **sendmsg**\ () or **sendfile**\ () calls until *bytes* are 1341 | consumed. 1342 | 1343 | Note that if a socket closes with the internal counter holding 1344 | a non-zero value, this is not a problem because data is not 1345 | being buffered for *bytes* and is sent as it is received. 1346 | Return 1347 | 0 1348 | 1349 | **int bpf_msg_cork_bytes(struct sk_msg_buff \***\ *msg*\ **, u32** *bytes*\ **)** 1350 | Description 1351 | For socket policies, prevent the execution of the verdict eBPF 1352 | program for message *msg* until *bytes* (number of bytes) have been 1353 | accumulated. 1354 | 1355 | This can be used when one needs a specific number of bytes 1356 | before a verdict can be assigned, even if the data spans 1357 | multiple **sendmsg**\ () or **sendfile**\ () calls. The extreme 1358 | case would be a user calling **sendmsg**\ () repeatedly with 1359 | 1-byte long message segments. Obviously, this is bad for 1360 | performance, but it is still valid. If the eBPF program needs 1361 | *bytes* bytes to validate a header, this helper can be used to 1362 | prevent the eBPF program from being called again until *bytes* have 1363 | been accumulated. 1364 | Return 1365 | 0 1366 | 1367 | **int bpf_msg_pull_data(struct sk_msg_buff \***\ *msg*\ **, u32** *start*\ **, u32** *end*\ **, u64** *flags*\ **)** 1368 | Description 1369 | For socket policies, pull in non-linear data from user space 1370 | for *msg* and set pointers *msg*\ **->data** and *msg*\ 1371 | **->data_end** to *start* and *end* byte offsets into *msg*, 1372 | respectively.
1374 | 1375 | If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a 1376 | *msg*, it can only parse data that the (**data**, **data_end**) 1377 | pointers have already consumed. For **sendmsg**\ () hooks this 1378 | is likely the first scatterlist element. But for calls relying 1379 | on the **sendpage** handler (e.g. **sendfile**\ ()) this will 1380 | be the range (**0**, **0**) because the data is shared with 1381 | user space and by default the objective is to avoid allowing 1382 | user space to modify data while (or after) the eBPF verdict is 1383 | being decided. This helper can be used to pull in data and to 1384 | set the start and end pointer to given values. Data will be 1385 | copied if necessary (i.e. if data was not linear and if start 1386 | and end pointers do not point to the same chunk). 1387 | 1388 | A call to this helper is susceptible to change the underlying 1389 | packet buffer. Therefore, at load time, all checks on pointers 1390 | previously done by the verifier are invalidated and must be 1391 | performed again, if the helper is used in combination with 1392 | direct packet access. 1393 | 1394 | All values for *flags* are reserved for future usage, and must 1395 | be left at zero. 1396 | Return 1397 | 0 on success, or a negative error in case of failure. 1398 | 1399 | **int bpf_bind(struct bpf_sock_addr \***\ *ctx*\ **, struct sockaddr \***\ *addr*\ **, int** *addr_len*\ **)** 1400 | Description 1401 | Bind the socket associated to *ctx* to the address pointed by 1402 | *addr*, of length *addr_len*. This allows for making outgoing 1403 | connections from the desired IP address, which can be useful for 1404 | example when all processes inside a cgroup should use a 1405 | single IP address on a host that has multiple IP addresses configured. 1406 | 1407 | This helper works for IPv4 and IPv6, TCP and UDP sockets. The 1408 | domain (*addr*\ **->sa_family**) must be **AF_INET** (or 1409 | **AF_INET6**).
Looking for a free port to bind to can be 1410 | expensive, therefore binding to a port is not permitted by the 1411 | helper: *addr*\ **->sin_port** (or **sin6_port**, respectively) 1412 | must be set to zero. 1413 | Return 1414 | 0 on success, or a negative error in case of failure. 1415 | 1416 | **int bpf_xdp_adjust_tail(struct xdp_buff \***\ *xdp_md*\ **, int** *delta*\ **)** 1417 | Description 1418 | Adjust (move) *xdp_md*\ **->data_end** by *delta* bytes. It is 1419 | only possible to shrink the packet as of this writing, 1420 | therefore *delta* must be a negative integer. 1421 | 1422 | A call to this helper is susceptible to change the underlying 1423 | packet buffer. Therefore, at load time, all checks on pointers 1424 | previously done by the verifier are invalidated and must be 1425 | performed again, if the helper is used in combination with 1426 | direct packet access. 1427 | Return 1428 | 0 on success, or a negative error in case of failure. 1429 | 1430 | **int bpf_skb_get_xfrm_state(struct sk_buff \***\ *skb*\ **, u32** *index*\ **, struct bpf_xfrm_state \***\ *xfrm_state*\ **, u32** *size*\ **, u64** *flags*\ **)** 1431 | Description 1432 | Retrieve the XFRM state (IP transform framework, see also 1433 | **ip-xfrm(8)**) at *index* in XFRM "security path" for *skb*. 1434 | 1435 | The retrieved value is stored in the **struct bpf_xfrm_state** 1436 | pointed by *xfrm_state* and of length *size*. 1437 | 1438 | All values for *flags* are reserved for future usage, and must 1439 | be left at zero. 1440 | 1441 | This helper is available only if the kernel was compiled with 1442 | the **CONFIG_XFRM** configuration option. 1443 | Return 1444 | 0 on success, or a negative error in case of failure. 1445 | 1446 | **int bpf_get_stack(struct pt_regs \***\ *regs*\ **, void \***\ *buf*\ **, u32** *size*\ **, u64** *flags*\ **)** 1447 | Description 1448 | Return a user or a kernel stack in a buffer provided by the bpf program.
1449 | To achieve this, the helper needs *regs*, which is a pointer 1450 | to the context on which the tracing program is executed. 1451 | To store the stack trace, the BPF program provides *buf*, with 1452 | a non-negative *size*. 1453 | 1454 | The last argument, *flags*, holds the number of stack frames to 1455 | skip (from 0 to 255), masked with 1456 | **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set 1457 | the following flags: 1458 | 1459 | **BPF_F_USER_STACK** 1460 | Collect a user space stack instead of a kernel stack. 1461 | **BPF_F_USER_BUILD_ID** 1462 | Collect buildid+offset instead of ips for user stack, 1463 | only valid if **BPF_F_USER_STACK** is also specified. 1464 | 1465 | **bpf_get_stack**\ () can collect up to 1466 | **PERF_MAX_STACK_DEPTH** kernel and user frames, subject 1467 | to a sufficiently large buffer size. Note that 1468 | this limit can be controlled with the **sysctl** program, and 1469 | that it should be manually increased in order to profile long 1470 | user stacks (such as stacks for Java programs). To do so, use: 1471 | 1472 | :: 1473 | 1474 | # sysctl kernel.perf_event_max_stack= 1475 | 1476 | Return 1477 | a non-negative value equal to or less than *size* on success, or 1478 | a negative error in case of failure. 1479 | 1480 | 1481 | EXAMPLES 1482 | ======== 1483 | 1484 | Example usage for most of the eBPF helpers listed in this manual page is 1485 | available within the Linux kernel sources, at the following locations: 1486 | 1487 | * *samples/bpf/* 1488 | * *tools/testing/selftests/bpf/* 1489 | 1490 | LICENSE 1491 | ======= 1492 | 1493 | eBPF programs can have an associated license, passed along with the bytecode 1494 | instructions to the kernel when the programs are loaded. The format for that 1495 | string is identical to the one in use for kernel modules (dual licenses, such 1496 | as "Dual BSD/GPL", may be used).
Some helper functions are only accessible to 1497 | programs that are compatible with the GNU General Public License (GPL). 1498 | 1499 | In order to use such helpers, the eBPF program must be loaded with the correct 1500 | license string passed (via **attr**) to the **bpf**\ () system call, and this 1501 | generally translates into the C source code of the program containing a line 1502 | similar to the following: 1503 | 1504 | :: 1505 | 1506 | char ____license[] __attribute__((section("license"), used)) = "GPL"; 1507 | 1508 | IMPLEMENTATION 1509 | ============== 1510 | 1511 | This manual page is an effort to document the existing eBPF helper functions. 1512 | But as of this writing, the BPF sub-system is under heavy development. New eBPF 1513 | program or map types are added, along with new helper functions. Some helpers 1514 | are occasionally made available for additional program types. So in spite of 1515 | the efforts of the community, this page might not be up-to-date. If you want to 1516 | check by yourself what helper functions exist in your kernel, or what types of 1517 | programs they can support, here are some files in the kernel tree that you 1518 | may be interested in: 1519 | 1520 | * *include/uapi/linux/bpf.h* is the main BPF header. It contains the full list 1521 | of all helper functions, as well as many other BPF definitions, including most 1522 | of the flags, structs, and constants used by the helpers. 1523 | * *net/core/filter.c* contains the definitions of most network-related helper 1524 | functions, and the list of program types from which they can be used. 1525 | * *kernel/trace/bpf_trace.c* is the equivalent for most helpers related to 1526 | tracing programs. 1527 | * *kernel/bpf/verifier.c* contains the functions used to check that valid types 1528 | of eBPF maps are used with a given helper function. 1529 | * *kernel/bpf/* is a directory containing other files in which additional helpers are 1530 | defined (for cgroups, sockmaps, etc.).
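The helper list in *include/uapi/linux/bpf.h* lends itself to mechanical extraction: helpers are enumerated there as **FN(**\ *name*\ **)** entries of the **__BPF_FUNC_MAPPER** macro, each corresponding to a **bpf_**\ *name*\ () function described in this page. The sketch below (Python, not part of this manual page; it runs on an inline excerpt in the style of the real header rather than on the header file itself) shows the idea:

```python
import re

# Excerpt written in the style of the __BPF_FUNC_MAPPER macro from
# include/uapi/linux/bpf.h (the real header has many more entries).
BPF_H_EXCERPT = """
#define __BPF_FUNC_MAPPER(FN)   \\
    FN(unspec),                 \\
    FN(map_lookup_elem),        \\
    FN(map_update_elem),        \\
    FN(map_delete_elem),        \\
    FN(probe_read),
"""

def helper_names(header_text):
    """Collect helper names from FN(...) entries; each FN(name) entry
    corresponds to a bpf_<name>() helper function."""
    return ["bpf_" + m for m in re.findall(r"FN\((\w+)\)", header_text)]

print(helper_names(BPF_H_EXCERPT))
```

On a real system, feeding the contents of your kernel's *include/uapi/linux/bpf.h* to this function lists every helper your headers know about.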
1531 | 1532 | Compatibility between helper functions and program types can generally be found 1533 | in the files where the helper functions are defined. Look for the **struct 1534 | bpf_func_proto** objects and for the functions returning them: these functions 1535 | contain the list of helpers that a given program type can call. Note that the 1536 | **default:** label of the **switch ... case** used to filter helpers can call 1537 | other functions, themselves allowing access to additional helpers. The 1538 | GPL license requirement is also encoded in those **struct bpf_func_proto** objects. 1539 | 1540 | Compatibility between helper functions and map types can be found in the 1541 | **check_map_func_compatibility**\ () function in file *kernel/bpf/verifier.c*. 1542 | 1543 | Helper functions that invalidate the checks on the **data** and **data_end** 1544 | pointers for network processing are listed in the function 1545 | **bpf_helper_changes_pkt_data**\ () in file *net/core/filter.c*. 1546 | 1547 | SEE ALSO 1548 | ======== 1549 | 1550 | **bpf**\ (2), 1551 | **cgroups**\ (7), 1552 | **ip**\ (8), 1553 | **perf_event_open**\ (2), 1554 | **sendmsg**\ (2), 1555 | **socket**\ (7), 1556 | **tc-bpf**\ (8) 1557 | -------------------------------------------------------------------------------- /bpf_llvm_2015aug19.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_llvm_2015aug19.pdf -------------------------------------------------------------------------------- /bpf_netdev_conference_2016Feb12.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_netdev_conference_2016Feb12.pdf -------------------------------------------------------------------------------- /bpf_netdev_conference_2016Feb12_report.pdf:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_netdev_conference_2016Feb12_report.pdf -------------------------------------------------------------------------------- /bpf_netdev_conference_2016Oct07.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_netdev_conference_2016Oct07.pdf -------------------------------------------------------------------------------- /bpf_netdev_conference_2016Oct07_tcws.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_netdev_conference_2016Oct07_tcws.pdf -------------------------------------------------------------------------------- /bpf_netvirt_2015aug21.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_netvirt_2015aug21.pdf -------------------------------------------------------------------------------- /bpf_network_examples_2015aug20.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpf_network_examples_2015aug20.pdf -------------------------------------------------------------------------------- /bpftrace_public_template_jun2019.odp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpftrace_public_template_jun2019.odp -------------------------------------------------------------------------------- /bpftrace_public_template_jun2019.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/bpftrace_public_template_jun2019.pdf -------------------------------------------------------------------------------- /eBPF.md: -------------------------------------------------------------------------------- 1 | # Unofficial eBPF spec 2 | 3 | The [official documentation for the eBPF instruction set][1] is in the 4 | Linux repository. However, while it is concise, it isn't always easy to 5 | use as a reference. This document lists each valid eBPF opcode. 6 | 7 | [1]: https://www.kernel.org/doc/Documentation/networking/filter.txt 8 | 9 | ## Instruction encoding 10 | 11 | An eBPF program is a sequence of 64-bit instructions. This document assumes each 12 | instruction is encoded in host byte order, but the byte order is not relevant 13 | to this spec. 14 | 15 | All eBPF instructions have the same basic encoding: 16 | 17 | msb lsb 18 | +------------------------+----------------+----+----+--------+ 19 | |immediate |offset |src |dst |opcode | 20 | +------------------------+----------------+----+----+--------+ 21 | 22 | From least significant to most significant bit: 23 | 24 | - 8 bit opcode 25 | - 4 bit destination register (dst) 26 | - 4 bit source register (src) 27 | - 16 bit offset 28 | - 32 bit immediate (imm) 29 | 30 | Most instructions do not use all of these fields. Unused fields should be 31 | zeroed. 32 | 33 | The low 3 bits of the opcode field are the "instruction class". 34 | This groups together related opcodes. 35 | 36 | LD/LDX/ST/STX opcode structure: 37 | 38 | msb lsb 39 | +---+--+---+ 40 | |mde|sz|cls| 41 | +---+--+---+ 42 | 43 | The `sz` field specifies the size of the memory location. The `mde` field is 44 | the memory access mode. uBPF only supports the generic "MEM" access mode.
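The field layout above can be exercised with a short decoder. The following sketch (not part of the original spec; plain Python, assuming the host-byte-order encoding described above) splits a 64-bit instruction word into its five fields plus the instruction class:

```python
def decode(insn):
    """Split a 64-bit eBPF instruction word into its five fields.

    Layout, from least to most significant bit:
    opcode:8, dst:4, src:4, offset:16 (signed), imm:32 (signed).
    """
    opcode = insn & 0xFF
    dst = (insn >> 8) & 0xF
    src = (insn >> 12) & 0xF
    offset = (insn >> 16) & 0xFFFF
    if offset & 0x8000:              # sign-extend the 16-bit offset
        offset -= 0x10000
    imm = (insn >> 32) & 0xFFFFFFFF
    if imm & 0x80000000:             # sign-extend the 32-bit immediate
        imm -= 0x100000000
    cls = opcode & 0x07              # low 3 bits: instruction class
    return {"opcode": opcode, "dst": dst, "src": src,
            "offset": offset, "imm": imm, "class": cls}

# Example: opcode 0xb7 is "mov dst, imm" (see the ALU tables below);
# this word encodes "mov r1, 42".
fields = decode(0x0000002A000001B7)
assert fields["opcode"] == 0xB7 and fields["dst"] == 1 and fields["imm"] == 42
```

Going the other direction (packing fields into a word) is the same shifts in reverse, which is essentially what an eBPF assembler does.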
45 | 46 | ALU/ALU64/JMP opcode structure: 47 | 48 | msb lsb 49 | +----+-+---+ 50 | |op |s|cls| 51 | +----+-+---+ 52 | 53 | If the `s` bit is zero, then the source operand is `imm`. If `s` is one, then 54 | the source operand is `src`. The `op` field specifies which ALU or branch 55 | operation is to be performed. 56 | 57 | ## ALU Instructions 58 | 59 | ### 64-bit 60 | 61 | Opcode | Mnemonic | Pseudocode 62 | -------|---------------|----------------------- 63 | 0x07 | add dst, imm | dst += imm 64 | 0x0f | add dst, src | dst += src 65 | 0x17 | sub dst, imm | dst -= imm 66 | 0x1f | sub dst, src | dst -= src 67 | 0x27 | mul dst, imm | dst *= imm 68 | 0x2f | mul dst, src | dst *= src 69 | 0x37 | div dst, imm | dst /= imm 70 | 0x3f | div dst, src | dst /= src 71 | 0x47 | or dst, imm | dst \|= imm 72 | 0x4f | or dst, src | dst \|= src 73 | 0x57 | and dst, imm | dst &= imm 74 | 0x5f | and dst, src | dst &= src 75 | 0x67 | lsh dst, imm | dst <<= imm 76 | 0x6f | lsh dst, src | dst <<= src 77 | 0x77 | rsh dst, imm | dst >>= imm (logical) 78 | 0x7f | rsh dst, src | dst >>= src (logical) 79 | 0x87 | neg dst | dst = -dst 80 | 0x97 | mod dst, imm | dst %= imm 81 | 0x9f | mod dst, src | dst %= src 82 | 0xa7 | xor dst, imm | dst ^= imm 83 | 0xaf | xor dst, src | dst ^= src 84 | 0xb7 | mov dst, imm | dst = imm 85 | 0xbf | mov dst, src | dst = src 86 | 0xc7 | arsh dst, imm | dst >>= imm (arithmetic) 87 | 0xcf | arsh dst, src | dst >>= src (arithmetic) 88 | 89 | ### 32-bit 90 | 91 | These instructions use only the lower 32 bits of their operands and zero the 92 | upper 32 bits of the destination register. 
93 | 94 | Opcode | Mnemonic | Pseudocode 95 | -------|-----------------|------------------------------ 96 | 0x04 | add32 dst, imm | dst += imm 97 | 0x0c | add32 dst, src | dst += src 98 | 0x14 | sub32 dst, imm | dst -= imm 99 | 0x1c | sub32 dst, src | dst -= src 100 | 0x24 | mul32 dst, imm | dst *= imm 101 | 0x2c | mul32 dst, src | dst *= src 102 | 0x34 | div32 dst, imm | dst /= imm 103 | 0x3c | div32 dst, src | dst /= src 104 | 0x44 | or32 dst, imm | dst \|= imm 105 | 0x4c | or32 dst, src | dst \|= src 106 | 0x54 | and32 dst, imm | dst &= imm 107 | 0x5c | and32 dst, src | dst &= src 108 | 0x64 | lsh32 dst, imm | dst <<= imm 109 | 0x6c | lsh32 dst, src | dst <<= src 110 | 0x74 | rsh32 dst, imm | dst >>= imm (logical) 111 | 0x7c | rsh32 dst, src | dst >>= src (logical) 112 | 0x84 | neg32 dst | dst = -dst 113 | 0x94 | mod32 dst, imm | dst %= imm 114 | 0x9c | mod32 dst, src | dst %= src 115 | 0xa4 | xor32 dst, imm | dst ^= imm 116 | 0xac | xor32 dst, src | dst ^= src 117 | 0xb4 | mov32 dst, imm | dst = imm 118 | 0xbc | mov32 dst, src | dst = src 119 | 0xc4 | arsh32 dst, imm | dst >>= imm (arithmetic) 120 | 0xcc | arsh32 dst, src | dst >>= src (arithmetic) 121 | 122 | ### Byteswap instructions 123 | 124 | Opcode | Mnemonic | Pseudocode 125 | -----------------|----------|------------------- 126 | 0xd4 (imm == 16) | le16 dst | dst = htole16(dst) 127 | 0xd4 (imm == 32) | le32 dst | dst = htole32(dst) 128 | 0xd4 (imm == 64) | le64 dst | dst = htole64(dst) 129 | 0xdc (imm == 16) | be16 dst | dst = htobe16(dst) 130 | 0xdc (imm == 32) | be32 dst | dst = htobe32(dst) 131 | 0xdc (imm == 64) | be64 dst | dst = htobe64(dst) 132 | 133 | ## Memory Instructions 134 | 135 | Opcode | Mnemonic | Pseudocode 136 | -------|-----------------------|-------------------------------- 137 | 0x18 | lddw dst, imm | dst = imm (wide instruction: occupies two 64-bit slots; the second slot carries the upper 32 bits of the 64-bit immediate) 138 | 0x20 | ldabsw src, dst, imm | See kernel documentation 139 | 0x28 | ldabsh src, dst, imm | ... 140 | 0x30 | ldabsb src, dst, imm | ...
141 | 0x38 | ldabsdw src, dst, imm | ... 142 | 0x40 | ldindw src, dst, imm | ... 143 | 0x48 | ldindh src, dst, imm | ... 144 | 0x50 | ldindb src, dst, imm | ... 145 | 0x58 | ldinddw src, dst, imm | ... 146 | 0x61 | ldxw dst, [src+off] | dst = *(uint32_t *) (src + off) 147 | 0x69 | ldxh dst, [src+off] | dst = *(uint16_t *) (src + off) 148 | 0x71 | ldxb dst, [src+off] | dst = *(uint8_t *) (src + off) 149 | 0x79 | ldxdw dst, [src+off] | dst = *(uint64_t *) (src + off) 150 | 0x62 | stw [dst+off], imm | *(uint32_t *) (dst + off) = imm 151 | 0x6a | sth [dst+off], imm | *(uint16_t *) (dst + off) = imm 152 | 0x72 | stb [dst+off], imm | *(uint8_t *) (dst + off) = imm 153 | 0x7a | stdw [dst+off], imm | *(uint64_t *) (dst + off) = imm 154 | 0x63 | stxw [dst+off], src | *(uint32_t *) (dst + off) = src 155 | 0x6b | stxh [dst+off], src | *(uint16_t *) (dst + off) = src 156 | 0x73 | stxb [dst+off], src | *(uint8_t *) (dst + off) = src 157 | 0x7b | stxdw [dst+off], src | *(uint64_t *) (dst + off) = src 158 | 159 | ## Branch Instructions 160 | 161 | Opcode | Mnemonic | Pseudocode 162 | -------|---------------------|------------------------ 163 | 0x05 | ja +off | PC += off 164 | 0x15 | jeq dst, imm, +off | PC += off if dst == imm 165 | 0x1d | jeq dst, src, +off | PC += off if dst == src 166 | 0x25 | jgt dst, imm, +off | PC += off if dst > imm 167 | 0x2d | jgt dst, src, +off | PC += off if dst > src 168 | 0x35 | jge dst, imm, +off | PC += off if dst >= imm 169 | 0x3d | jge dst, src, +off | PC += off if dst >= src 170 | 0xa5 | jlt dst, imm, +off | PC += off if dst < imm 171 | 0xad | jlt dst, src, +off | PC += off if dst < src 172 | 0xb5 | jle dst, imm, +off | PC += off if dst <= imm 173 | 0xbd | jle dst, src, +off | PC += off if dst <= src 174 | 0x45 | jset dst, imm, +off | PC += off if dst & imm 175 | 0x4d | jset dst, src, +off | PC += off if dst & src 176 | 0x55 | jne dst, imm, +off | PC += off if dst != imm 177 | 0x5d | jne dst, src, +off | PC += off if dst != src 178 | 0x65 | jsgt 
dst, imm, +off | PC += off if dst > imm (signed) 179 | 0x6d | jsgt dst, src, +off | PC += off if dst > src (signed) 180 | 0x75 | jsge dst, imm, +off | PC += off if dst >= imm (signed) 181 | 0x7d | jsge dst, src, +off | PC += off if dst >= src (signed) 182 | 0xc5 | jslt dst, imm, +off | PC += off if dst < imm (signed) 183 | 0xcd | jslt dst, src, +off | PC += off if dst < src (signed) 184 | 0xd5 | jsle dst, imm, +off | PC += off if dst <= imm (signed) 185 | 0xdd | jsle dst, src, +off | PC += off if dst <= src (signed) 186 | 0x85 | call imm | Function call 187 | 0x95 | exit | return r0 188 | -------------------------------------------------------------------------------- /ebpf_excerpt_20Aug2015.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/ebpf_excerpt_20Aug2015.pdf -------------------------------------------------------------------------------- /ebpf_http_filter.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/ebpf_http_filter.pdf -------------------------------------------------------------------------------- /meetups/2015-09-21/iovisor-bcc-intro.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/meetups/2015-09-21/iovisor-bcc-intro.pdf -------------------------------------------------------------------------------- /netconf_2016feb.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/netconf_2016feb.pdf -------------------------------------------------------------------------------- /openstack/2015-10-29/iovisor-mesos-demo.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/openstack/2015-10-29/iovisor-mesos-demo.pdf -------------------------------------------------------------------------------- /openstack/2016-04-25/OpenStackSummitAustin2016_iovisor_v1.0.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/openstack/2016-04-25/OpenStackSummitAustin2016_iovisor_v1.0.pdf -------------------------------------------------------------------------------- /p4/2015-11-18/iovisor-p4-workshop-nov-2015.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/p4/2015-11-18/iovisor-p4-workshop-nov-2015.pdf -------------------------------------------------------------------------------- /p4/p4toEbpf-bcc.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/p4/p4toEbpf-bcc.pdf -------------------------------------------------------------------------------- /p4AbstractSwitch.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/p4AbstractSwitch.pdf -------------------------------------------------------------------------------- /tsc-meeting-minutes/2015-09-02/eBPF_to_IOV_Module.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/tsc-meeting-minutes/2015-09-02/eBPF_to_IOV_Module.pptx -------------------------------------------------------------------------------- 
/tsc-meeting-minutes/2015-09-16/iomodules-slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/tsc-meeting-minutes/2015-09-16/iomodules-slides.pdf -------------------------------------------------------------------------------- /tsc-meeting-minutes/2015-09-16/iovisor-odl-gbp-module.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/tsc-meeting-minutes/2015-09-16/iovisor-odl-gbp-module.pdf -------------------------------------------------------------------------------- /university/eBPF_IOVisor_academic_research.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/university/eBPF_IOVisor_academic_research.pdf -------------------------------------------------------------------------------- /university/sigcomm-ccr-InKev-2016.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iovisor/bpf-docs/0b9f8ab13f1d2e946325c179f961563ea6e23e65/university/sigcomm-ccr-InKev-2016.pdf --------------------------------------------------------------------------------