├── .gitignore ├── .gitmodules ├── README.org ├── bin ├── functions.sh ├── set_irq_affinity_with_rss_conf.sh ├── smp_affinity_rss.conf ├── tc_mq_htb_setup_example.sh └── xps_setup.sh ├── headers ├── bpf_endian.h ├── bpf_helpers.h ├── bpf_util.h ├── jhash.h └── perf-sys.h └── src ├── Makefile ├── common_kern_user.h ├── common_user.c ├── common_user.h ├── howto_debug.org ├── shared_maps.h ├── tc_classify_kern.c ├── tc_classify_user.c ├── tc_queue_mapping_kern.c ├── xdp_iphash_to_cpu_cmdline.c ├── xdp_iphash_to_cpu_kern.c ├── xdp_iphash_to_cpu_user.c ├── xdp_pass_kern.c └── xdp_pass_user.c /.gitignore: -------------------------------------------------------------------------------- 1 | *.ll 2 | *~ 3 | *.o 4 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "libbpf"] 2 | path = libbpf 3 | url = https://github.com/libbpf/libbpf/ 4 | -------------------------------------------------------------------------------- /README.org: -------------------------------------------------------------------------------- 1 | # -*- fill-column: 76; -*- 2 | #+Title: Project XDP cooperating with TC 3 | #+OPTIONS: ^:nil 4 | 5 | This project demonstrates how XDP cpumap redirect can be used together 6 | with Linux TC (Traffic Control) for solving the Qdisc locking problem. 7 | 8 | The focus is on the use-case where global rate limiting is /not the goal/, but 9 | instead the *goal is to rate limit customers*, *services* or *containers*, to 10 | something significantly lower than NIC link speed. 11 | 12 | The basic components (in TC MQ-setup [[file:bin/tc_mq_htb_setup_example.sh][example script]]) are: 13 | - Setup *MQ qdisc* which has multiple transmit queues (*TXQ*). 14 | - For each MQ *TXQ* assign an *independent HTB qdisc*.
15 | - Use XDP *cpumap* to redirect traffic to the CPU with the *associated HTB qdisc* 16 | - Use *TC BPF-prog* to assign *TXQ* (via =skb->queue_mapping=) and TC *major:minor* number. 17 | - Configure *CPU* assignment to *RX-queues* (see [[file:bin/set_irq_affinity_with_rss_conf.sh][script]]) 18 | 19 | * Contents overview :TOC: 20 | - [[#disable-xps][Disable XPS]] 21 | - [[#scaling-with-xdp-cpumap-redirect][Scaling with XDP cpumap redirect]] 22 | - [[#assign-cpus-to-rx-queues][Assign CPUs to RX-queues]] 23 | - [[#rx-and-tx-queue-scaling][RX and TX queue scaling]] 24 | - [[#config-number-of-rx-vs-tx-queues][Config number of RX vs TX-queues]] 25 | - [[#dependencies-and-alternatives][Dependencies and alternatives]] 26 | 27 | * Disable XPS 28 | 29 | For this project to work, disable XPS (Transmit Packet Steering). A script for 30 | configuring and disabling XPS is provided here: [[file:bin/xps_setup.sh]]. 31 | 32 | Script command line to disable XPS: 33 | #+begin_src sh 34 | sudo ./bin/xps_setup.sh --dev DEVICE --default --disable 35 | #+end_src 36 | 37 | The reason is that XPS (Transmit Packet Steering) takes precedence over setting 38 | =skb->queue_mapping= used by TC BPF-prog. XPS is configured per DEVICE via 39 | =/sys/class/net/DEVICE/queues/tx-*/xps_cpus= via a CPU hex mask. To disable, set 40 | mask=00. For more details see [[file:src/howto_debug.org]]. 41 | 42 | * Scaling with XDP cpumap redirect 43 | 44 | We recommend reading this [[https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap][blogpost]] for details on how the XDP "[[https://github.com/torvalds/linux/blob/master/kernel/bpf/cpumap.c][cpumap]]" 45 | redirect feature works. Basically XDP is a layer before the normal Linux 46 | kernel network stack (netstack). 47 | 48 | The XDP *cpumap* feature is a scalability and isolation mechanism that 49 | allows separating this early XDP layer from the rest of the netstack, and 50 | assigning dedicated CPUs for this stage.
An XDP program will essentially 51 | decide on what CPU the netstack starts processing a given packet. 52 | 53 | ** Assign CPUs to RX-queues 54 | 55 | Configuring what CPU receives RX-packets for a specific NIC RX-queue involves 56 | changing the contents of the =/proc/irq/= "smp_affinity" file for the specific 57 | IRQ number e.g.: =/proc/irq/N/smp_affinity_list= 58 | 59 | Looking up what IRQs a given NIC driver has assigned to an interface name is a 60 | little tedious and can vary across NIC drivers (e.g. Mellanox naming in 61 | =/proc/interrupts= is non-standard). The most standardized method is looking in 62 | =/sys/class/net/$IFACE/device/msi_irqs=, but remember to filter IRQs not 63 | related to RX-queues, as some IRQs can be used by the NIC for other things. 64 | 65 | This project contains a script to ease configuring this: 66 | [[file:bin/set_irq_affinity_with_rss_conf.sh]]. 67 | 68 | By default the script uses the config file =/etc/smp_affinity_rss.conf= and an 69 | example config is available here: [[file:bin/smp_affinity_rss.conf]]. 70 | 71 | ** RX and TX queue scaling 72 | 73 | For this project it is recommended to assign dedicated CPUs to RX 74 | processing, which will run the XDP [[file:src/xdp_iphash_to_cpu_kern.c][program]]. This XDP-prog requires 75 | significantly fewer CPU-cycles per packet than netstack and TX-qdisc 76 | handling. Thus, the number of CPU cores needed for RX-processing is 77 | significantly lower than the number of CPU cores needed for netstack + 78 | TX-processing. 79 | 80 | It is most natural for the netstack + TX-qdisc processing CPU cores to be 81 | "assigned" to the lower CPU ids, as most of the scripts and BPF-progs in 82 | this project assume CPU core ids are mapped directly to the 83 | =queue_mapping= and MQ-leaf number (actually =smp_processor_id= plus one, as 84 | qdiscs have a 1-indexed =queue_mapping=).
This is not a requirement, just a 85 | convention, as it depends on the software configuration for how the XDP maps 86 | assign CPUs and which MQ-leaf qdiscs are configured. 87 | 88 | ** Config number of RX vs TX-queues 89 | 90 | This project basically scales fewer RX-queues to a larger number of TX-queues. 91 | This allows running a heavier Traffic Control shaping algorithm per netstack 92 | TX-queue, without any locking between the TX-queues, via the MQ-qdisc as 93 | root-qdisc (and CPU-redirects). 94 | 95 | Configuring fewer RX-queues than TX-queues is often not possible on modern NIC 96 | hardware, as they often use what is called =combined= queues, which bind "RxTx" 97 | queues together. (See config via =ethtool --show-channels=). 98 | 99 | In our (less-RX-than-TX-CPUs) setup, this forces us to configure multiple 100 | RX-queues to be handled by a single "RX" CPU. This is not good for cross-CPU 101 | scaling, because packets will be spread across these multiple RX-queues, and the 102 | XDP (NAPI) processing can only generate packet-bulks per RX-queue, which 103 | decreases bulking opportunities into cpumap. (See why bulking improves cross-CPU 104 | scaling in [[https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#appendix][blogpost]]). 105 | 106 | The *solution* is to adjust the NIC hardware RSS (Receive Side Scaling) or 107 | "RX-flow-hash" indirection table. (See config via =ethtool --show-rxfh-indir=). 108 | The trick is to adjust the RSS indirection table to only use the first N RX-queues 109 | via the command: =ethtool --set-rxfh-indir $IFACE equal N=. 110 | 111 | This feature is also supported by the mentioned [[file:bin/set_irq_affinity_with_rss_conf.sh][config script]] via [[file:bin/smp_affinity_rss.conf][config]] variable 112 | =RSS_INDIR_EQUAL_QUEUES=.
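As a rough illustration (not part of this project's scripts), the effect of =equal N= on the indirection table can be modelled in plain shell. The sketch assumes the table is filled round-robin, so that entry i points at RX-queue (i modulo N). The function name =rss_equal_table= and the table size of 8 are made up for the example; the real table size depends on the NIC.

```shell
# Sketch only: model the RSS indirection table produced by
#   ethtool --set-rxfh-indir $IFACE equal N
# Assumption: entries are filled round-robin (entry i -> RX-queue i % N).
# rss_equal_table is a hypothetical helper, not a function from this repo.
rss_equal_table() {
    table_size=$1   # number of indirection table entries (hardware dependent)
    queues=$2       # N: only the first N RX-queues receive traffic
    i=0
    out=""
    while [ "$i" -lt "$table_size" ]; do
        out="$out$(( i % queues )) "
        i=$(( i + 1 ))
    done
    # Strip the trailing space before printing
    echo "${out% }"
}

# An 8-entry table limited to the first 3 RX-queues
rss_equal_table 8 3   # prints: 0 1 2 0 1 2 0 1
```

With N=3 only RX-queues 0 to 2 appear in the table, so the remaining RX-queues stay idle while their combined TX-queues remain usable. On real hardware the actual table can be compared via =ethtool --show-rxfh-indir $IFACE=.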
113 | 114 | * Dependencies and alternatives 115 | 116 | Notice that the TC BPF-progs ([[file:src/tc_classify_kern.c]] and 117 | [[file:src/tc_queue_mapping_kern.c]]) depend on a kernel feature that has been available 118 | since kernel v5.1, via [[https://github.com/torvalds/linux/commit/74e31ca850c1][kernel commit 74e31ca850c1]]. The alternative is to 119 | configure XPS for queue_mapping or use tc-skbedit(8) together with a TC-filter 120 | setup. 121 | 122 | The BPF-prog [[file:src/tc_classify_kern.c]] also sets up the HTB-class id (via 123 | =skb->priority=), which has been supported for a long time, but due to the above 124 | dependency (on =skb->queue_mapping=) it cannot be loaded on kernels before v5.1. Alternatively, it is 125 | possible to use the iptables CLASSIFY target module to change the HTB-class id. 126 | -------------------------------------------------------------------------------- /bin/functions.sh: -------------------------------------------------------------------------------- 1 | # 2 | # Common functions used by scripts in this directory 3 | # - Depending on bash 3 (or higher) syntax 4 | # 5 | # Author: Jesper Dangaaard Brouer 6 | # License: GPLv2 7 | 8 | ## -- sudo trick -- 9 | function root_check_run_with_sudo() { 10 | # Trick so the program can be run as a normal user; it will just use "sudo" 11 | # call as root_check_run_with_sudo "$@" 12 | if [ "$EUID" -ne 0 ]; then 13 | if [ -x "$0" ]; then # Directly executable use sudo 14 | echo "# (Not root, running with sudo)" >&2 15 | sudo "$0" "$@" 16 | exit $?
17 | fi 18 | echo "cannot perform sudo run of $0" 19 | exit 1 20 | fi 21 | } 22 | 23 | ## -- General shell logging cmds -- 24 | function err() { 25 | local exitcode=$1 26 | shift 27 | echo -e "ERROR: $@" >&2 28 | exit $exitcode 29 | } 30 | 31 | function warn() { 32 | echo -e "WARN : $@" >&2 33 | } 34 | 35 | function info() { 36 | if [[ -n "$VERBOSE" ]]; then 37 | echo "# $@" 38 | fi 39 | } 40 | 41 | ## -- Wrapper calls for TC -- 42 | function _call_tc() { 43 | local allow_fail="$1" 44 | shift 45 | if [[ -n "$VERBOSE" ]]; then 46 | echo "tc $@" 47 | fi 48 | if [[ -n "$DRYRUN" ]]; then 49 | return 50 | fi 51 | $TC "$@" 52 | local status=$? 53 | if (( $status != 0 )); then 54 | if [[ "$allow_fail" == "" ]]; then 55 | err 3 "Exec error($status) occurred cmd: \"$TC $@\"" 56 | fi 57 | fi 58 | } 59 | function call_tc() { 60 | _call_tc "" "$@" 61 | } 62 | function call_tc_allow_fail() { 63 | _call_tc "allow_fail" "$@" 64 | } 65 | -------------------------------------------------------------------------------- /bin/set_irq_affinity_with_rss_conf.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # -*- mode: shell-script; sh-shell: bash; sh-basic-offset: 8; sh-indentation:8; -*- 3 | # 4 | # Rewrite of script[1] 'set_irq_affinity' with extention for being called as 5 | # Debian network if.up script (/etc/network/if-up.d/). 6 | # 7 | # [1] https://github.com/netoptimizer/network-testing/blob/master/bin/set_irq_affinity 8 | # 9 | # Old script was based on reading /proc/interrupts. This approach is not 10 | # possible on Mellanox NICs, because netdev interface name is not included. 11 | # 12 | # All Linux NICs (net_device's) have files under /sys/class/net/$IFACE and 13 | # physical devices have a 'device' symlink: /sys/class/net/$IFACE/device 14 | # 15 | # For physical net_device's the directory /sys/class/net/$IFACE/device/msi_irqs/ 16 | # contains filename-numbers for each IRQ number. 
17 | # Then script knows which /proc/irq/${IRQ}/smp_affinity_list to adjust. 18 | 19 | export OUT=/tmp/ifup-set-irq-affinity-DEBUG 20 | DEBUG=$VERBOSITY 21 | DEBUG=1 #Force debugging on 22 | 23 | export CFG_FILE=/etc/smp_affinity_rss.conf 24 | 25 | function usage() { 26 | echo 27 | echo "Script for binding NIC interface IRQs to specific CPUs" 28 | echo 29 | echo " Usage: $0 " 30 | echo " -i : Cmdline set iface (default is env \$IFACE or shell arg1)" 31 | echo " -c : Cmdline override CPU_LIST from config file" 32 | echo " -n : Cmdline override RSS_INDIR_EQUAL_QUEUES from config file" 33 | echo " -f : Redefine config file to use (default $CFG_FILE)" 34 | echo 35 | } 36 | 37 | export TIME_FMT="%Y%m%dT%H%M%S" 38 | 39 | function info() { 40 | if [ -n "$DEBUG" -a "$DEBUG" -ne 0 ]; then 41 | TS=$(date +$TIME_FMT) 42 | echo "$TS iface:$IFACE -- $@" >> $OUT 43 | # echo "$TS iface:$IFACE -- $@" >&2 44 | fi 45 | } 46 | 47 | function warn() { 48 | TS=$(date +$TIME_FMT) 49 | echo "$TS iface:$IFACE -- WARN : $@" >> $OUT 50 | echo "$TS iface:$IFACE -- WARN : $@" >&2 51 | } 52 | 53 | function err() { 54 | TS=$(date +$TIME_FMT) 55 | echo "$TS iface:$IFACE -- ERROR : $@" >> $OUT 56 | echo "$TS iface:$IFACE -- ERROR : $@" >&2 57 | # Don't exit script, as it can cause ifup to not bringup interface 58 | } 59 | 60 | 61 | function get_iface_irqs() 62 | { 63 | local _IFACE=$1 64 | 65 | if [[ ! 
-d /sys/class/net/$_IFACE/device ]]; then 66 | exit 0 67 | fi 68 | 69 | local msi_irqs=$(ls -x /sys/class/net/$_IFACE/device/msi_irqs) 70 | if [[ -z "$msi_irqs" ]]; then 71 | exit 0 72 | fi 73 | 74 | # Walk IRQs for cleaning 75 | irqs="" 76 | for i in $msi_irqs ; do 77 | # Skip certain types of NIC IRQs 78 | if grep -E -q -e "$i:.*(async|fdir)" /proc/interrupts ; then 79 | # echo "SKIP : IRQ $i" >&2 80 | continue 81 | else 82 | : # echo "XXX msi $i ($irqs)" >&2 83 | fi 84 | irqs+="$i " 85 | done 86 | 87 | echo $irqs 88 | } 89 | 90 | function set_cpulist_iface() 91 | { 92 | local _IFACE=$1 93 | local _CPU_LIST=$2 94 | irq_list=$(get_iface_irqs $_IFACE) 95 | 96 | for IRQ in $irq_list ; do 97 | info "NIC IRQ:$IRQ will be processed by CPUs: $_CPU_LIST" 98 | smp_file="/proc/irq/${IRQ}/smp_affinity_list" 99 | echo $_CPU_LIST > $smp_file 100 | local status=$? 101 | if [[ $status -ne 0 ]];then 102 | err "cannot conf IRQ:$IRQ ($smp_file) CPUs:$_CPU_LIST" 103 | fi 104 | # grep -H . $smp_file 105 | done 106 | } 107 | 108 | function set_rss_indir_queues() 109 | { 110 | local _IFACE=$1 111 | local _QUEUES=$2 112 | 113 | if [[ -n "$_QUEUES" ]]; then 114 | info "Change RSS table to use first $_QUEUES queues" 115 | ethtool --set-rxfh-indir $_IFACE equal $_QUEUES 116 | local status=$? 117 | if [[ $status -ne 0 ]];then 118 | err "cannot conf RSS indirection table with $_QUEUES" 119 | fi 120 | fi 121 | } 122 | 123 | function disable_vlan_offload() 124 | { 125 | local _IFACE=$1 126 | 127 | if [[ -n "$DISABLE_VLAN_OFFLOAD_RX" ]]; then 128 | info "Disable hardware VLAN offload for RX" 129 | ethtool -K $_IFACE rxvlan off 130 | local status=$?
131 | if [[ $status -ne 0 ]];then 132 | err "cannot disable RX VLAN offload" 133 | fi 134 | fi 135 | } 136 | 137 | info "Start set_irq_affinity" 138 | 139 | ## --- Parse command line arguments / parameters --- 140 | while getopts "i:f:c:n:vh" option; do 141 | case $option in 142 | i) # interface IFACE can also come from ifup env or arg1 143 | export IFACE=$OPTARG 144 | info "NIC Interface device set to: IFACE=$IFACE" 145 | ;; 146 | f) 147 | export CFG_FILE=$OPTARG 148 | info "Redefine config file to: CFG_FILE=$CFG_FILE" 149 | ;; 150 | c) 151 | export CPU_LIST2=$OPTARG 152 | info "Defining CPU_LIST via command line: CPU_LIST=$CPU_LIST2" 153 | ;; 154 | n) 155 | export RSS_INDIR_EQUAL_QUEUES2=$OPTARG 156 | info "Defining RSS_INDIR_EQUAL_QUEUES via command line" 157 | ;; 158 | 159 | h|?|*) 160 | usage; 161 | warn "Unknown parameters!!!" 162 | exit 0 163 | esac 164 | done 165 | shift $(( $OPTIND - 1 )) 166 | 167 | ## --- Load config file --- 168 | if [[ ! -e "$CFG_FILE" ]]; then 169 | err "Cannot read config file: $CFG_FILE" 170 | # 171 | # Allow to continue of CPU_LIST were defined on cmdline 172 | if [[ -z "$CPU_LIST2" ]]; then 173 | exit 0 174 | fi 175 | else 176 | source $CFG_FILE 177 | fi 178 | 179 | # Let cmdline CPU_LIST dominate over config file 180 | if [[ -n "$CPU_LIST2" ]]; then 181 | export THE_CPU_LIST=$CPU_LIST2 182 | else 183 | export THE_CPU_LIST=$CPU_LIST 184 | fi 185 | 186 | if [[ -n "$RSS_INDIR_EQUAL_QUEUES2" ]]; then 187 | export RSS_INDIR_EQUAL_QUEUES=$RSS_INDIR_EQUAL_QUEUES2 188 | fi 189 | 190 | ## --- The $IFACE variable must be resolved to continue --- 191 | if [[ -z "$IFACE" ]]; then 192 | if [ -n "$1" ]; then 193 | IFACE=$1 194 | info "Setup NIC interface $IFACE (as arg1)" 195 | else 196 | usage 197 | echo " Supports: To be called by the ifup scripts" 198 | echo " - Then, expects environment variable \$IFACE is set" 199 | err "Cannot resolve \$IFACE" 200 | exit 0 201 | fi 202 | fi 203 | 204 | if [[ ! 
-d /sys/class/net/$IFACE/ ]]; then 205 | warn "Invalid interface $IFACE" 206 | exit 0 207 | fi 208 | 209 | if [[ ! -d /sys/class/net/$IFACE/device ]]; then 210 | warn "Non-physical interface $IFACE - Skip IRQ adjustments" 211 | exit 0 212 | fi 213 | 214 | # --- Do IRQ smp_affinity adjustments --- 215 | set_cpulist_iface $IFACE $THE_CPU_LIST 216 | 217 | # --- Reduce/Setup RSS : RX flow hash indirection table --- 218 | set_rss_indir_queues $IFACE $RSS_INDIR_EQUAL_QUEUES 219 | 220 | # --- XDP cannot handle hardware offloaded VLAN info --- 221 | disable_vlan_offload $IFACE 222 | 223 | exit 0 224 | -------------------------------------------------------------------------------- /bin/smp_affinity_rss.conf: -------------------------------------------------------------------------------- 1 | # Config file for script: set_irq_affinity_with_rss_conf.sh 2 | # 3 | # The format for CPU_LIST variable is the /proc/irq/${IRQ}/smp_affinity_list 4 | # format the kernel supports. 5 | CPU_LIST="4-5" 6 | 7 | # RSS indir table size used by ethtool --set-rxfh-indir 'equal' parameter. 8 | # 9 | # From man ethtool(8): 10 | # --set-rxfh-indir equal N 11 | # 12 | # Sets the receive flow hash indirection table to spread flows evenly between 13 | # the first N receive queues. 14 | # 15 | # In effect this limit how many RX-queues are active. 16 | # 17 | # For the use-case of xdp-cpumap-tc, that scale the Linux network-stack by 18 | # load-balancing across multiple TX-queues, it is desired to have less RX-queues 19 | # than TX-queues. 20 | # 21 | # First of all XDP RX-work takes less time, but second objective is to increase 22 | # chances of bulk RX processing. Each RX-queue can bulk up-to 64 packets, but 23 | # when too many RX-queues are configured packets will be distributed too thin 24 | # across RX-queues. 25 | # 26 | # Most NICs have "combined" RX+TX queues (ethtool --show-channels). Thus, 27 | # reducing RX-queues also result in reduced TX-queues. 
Adjusting the RSS indirection 28 | # table to only use the first N RX-queues allows for more TX-queues to 29 | # load-balance across. 30 | # 31 | # Practical experience shows that an uneven number gives better hardware RSS 32 | # distribution across RX-queues. 33 | # 34 | RSS_INDIR_EQUAL_QUEUES=3 35 | 36 | # Disable NIC hardware VLAN offloading. 37 | # 38 | # The XDP RX-hook cannot (currently) see any offloaded VLAN tags in the RX 39 | # descriptor. Thus, disable this explicitly. Otherwise the VLAN tag can be 40 | # lost when CPUMAP redirecting the xdp_frame. 41 | # 42 | DISABLE_VLAN_OFFLOAD_RX=yes 43 | -------------------------------------------------------------------------------- /bin/tc_mq_htb_setup_example.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Example script for how to solve qdisc locking issue when shaping traffic. Can 3 | # be used in cases where global rate limiting is not the goal, but instead the 4 | # goal is to rate limit customers or services (to something significantly lower 5 | # than NIC link speed). 6 | # 7 | # Basic solution: 8 | # - Use MQ which has multiple transmit queues (TXQ). 9 | # - For each MQ TXQ assign a HTB qdisc 10 | # 11 | basedir=$(dirname "$0") 12 | source "${basedir}/functions.sh" 13 | export TC=tc 14 | 15 | VERBOSE=y 16 | 17 | root_check_run_with_sudo "$@" 18 | 19 | # Global setup variables 20 | # 21 | # Each of the HTB root-class(es) get these RATE+CEIL upper bandwidth bounds. 22 | ROOT_RATE=2500Mbit 23 | ROOT_CEIL=3000Mbit 24 | # 25 | # The default HTB class 26 | DEF_RATE=100Mbit 27 | DEF_CEIL=150Mbit 28 | 29 | DEV=$1 30 | if [[ -z "$DEV" ]]; then 31 | err 1 "Must specify DEV as argument" 32 | fi 33 | 34 | info "Applying TC setup on device: $DEV" 35 | 36 | # Can see how many TXQs a device has via directories: 37 | # /sys/class/net//queues 38 | 39 | # Try to detect if HW can be used (as the TC error message is useless) 40 | if [[ !
-e /sys/class/net/$DEV/queues/tx-1 ]]; then 41 | err 2 "The device ($DEV) must have multiple TX hardware queues.\n" \ 42 | "The MQ qdisc only works with multi-queue capable hardware" 43 | fi 44 | 45 | info "Clear existing setup" 46 | call_tc_allow_fail qdisc del dev $DEV root 47 | 48 | info " New MQ, with larger handle (MAJOR:) to allow HTB qdisc to use major 1:" 49 | call_tc qdisc replace dev $DEV root handle 7FFF: mq 50 | 51 | function sorted_tx_queues() { 52 | # Returns numerically sorted TX queues 53 | local queues=$(ls -d /sys/class/net/$DEV/queues/tx-* | sort --field-separator='-' -k2n) 54 | echo $queues 55 | } 56 | 57 | export TX_QUEUES=$(sorted_tx_queues) 58 | 59 | info "Foreach TXQ - create HTB leaf(s) under MQ 0x7FFF:TXQ" 60 | i=0 61 | for dir in $TX_QUEUES; do 62 | ((i++)) || true 63 | # TC-handle major:minor numbers are in hex 64 | hex=$(printf "%x" $i) 65 | # Qdisc HTB $i: under parent 7FFF:$i 66 | call_tc qdisc add dev $DEV parent 7FFF:$hex handle $hex: htb default 2 67 | # tc qdisc add dev $DEV parent 7FFF:1 handle 1: htb default 2 68 | # tc qdisc add dev $DEV parent 7FFF:2 handle 2: htb default 2 69 | # tc qdisc add dev $DEV parent 7FFF:3 handle 3: htb default 2 70 | # tc qdisc add dev $DEV parent 7FFF:4 handle 4: htb default 2 71 | done 72 | 73 | # Create root-CLASS(es) under each HTB-qdisc 74 | info "Create HTB root-class(es) n:1 (rate $ROOT_RATE ceil $ROOT_CEIL)" 75 | info " - Also create HTB default class n:2" 76 | i=0 77 | for dir in $TX_QUEUES; do 78 | ((i++)) || true 79 | 80 | # TC-handle major:minor numbers are in hex 81 | hex=$(printf "%x" $i) 82 | 83 | # The root-class sets the upper bandwidth usage 84 | call_tc class add dev $DEV parent $hex: classid $hex:1 \ 85 | htb rate $ROOT_RATE ceil $ROOT_CEIL 86 | 87 | # Create HTB default class $hex:2 88 | # call_tc class add dev $DEV parent $hex:1 classid $hex:2 \ 89 | # htb rate $DEF_RATE ceil $DEF_CEIL 90 | # - set default rate different, to measure which major-class we hit 91 | call_tc class add dev
$DEV parent $hex:1 classid $hex:2 \ 92 | htb rate ${i}00Mbit ceil ${i}20Mbit 93 | 94 | # Also change the qdisc on default HTB class $hex:2 ? 95 | # tc qdisc add dev $DEV parent $hex:2 sfq 96 | call_tc qdisc add dev $DEV parent $hex:2 fq_codel 97 | 98 | [[ -n "$VERBOSE" ]] && echo "" 99 | done 100 | 101 | info "Now create services/customers bandwidth limits" 102 | # Simple example: 103 | call_tc class add dev $DEV parent 2:1 classid 2:2a htb rate 2Mbit ceil 3Mbit 104 | call_tc qdisc add dev $DEV parent 2:2a sfq 105 | 106 | set -v 107 | # 108 | # ***NOTICE*** YOU ARE NOT DONE 109 | # 110 | # Getting services/customers correctly categorised is the next challenge that is 111 | # currently left as an exercise... 112 | # 113 | # For solving the TX-queue locking congestion, the traffic needs to be 114 | # redirected to the appropriate CPUs. This can either be done with RSS (Receive 115 | # Side Scaling) and RPS (Receive Packet Steering), or with XDP cpumap redirect. 116 | # 117 | 118 | # Hint: this script is part of my testing of CPUMAP XDP-redirect 119 | -------------------------------------------------------------------------------- /bin/xps_setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | 4 | function usage() { 5 | echo "Change setting of XPS txq to CPU mapping via files" 6 | echo " /sys/class/net/DEV/queues/tx-*/xps_cpus " 7 | echo "" 8 | echo "Usage: $0 [-h] --dev ethX --txq N --cpu N" 9 | echo " -d | --dev : (\$DEV) Interface/device (required)" 10 | echo " --default : (\$DEFAULT) Setup 1:1 mapping TXQ-to-CPU" 11 | echo " --disable : (\$DISABLE) Disable XPS via mask 0x00" 12 | echo " --list : (\$LIST) List current setting" 13 | echo " --txq N : (\$TXQ) Select TXQ" 14 | echo " --cpu N : (\$CPU) Select CPU that use TXQ" 15 | echo " -v | --verbose : (\$VERBOSE) verbose" 16 | echo "" 17 | } 18 | 19 | ## -- General shell logging cmds -- 20 | function err() { 21 | local exitcode=$1 22 | shift 23 | echo -e 
"ERROR: $@" >&2 24 | exit $exitcode 25 | } 26 | 27 | function info() { 28 | if [[ -n "$VERBOSE" ]]; then 29 | echo "# $@" 30 | fi 31 | } 32 | 33 | # Convert a mask to a list of CPUs this cover 34 | function mask_to_cpus() { 35 | local mask=$1 36 | local cpu=0 37 | 38 | printf "CPUs in MASK=0x%02X =>" $mask 39 | if [[ $mask == 0 ]]; then 40 | echo " disabled" 41 | fi 42 | while [ $mask -gt 0 ]; do 43 | if [[ $((mask & 1)) -eq 1 ]]; then 44 | echo -n " cpu:$cpu" 45 | fi 46 | let cpu++ 47 | let mask=$((mask >> 1)) 48 | done 49 | } 50 | 51 | function sorted_txq_xps_cpus() { 52 | local queues=$(ls /sys/class/net/$DEV/queues/tx-*/xps_cpus | sort --field-separator='-' -k2n) 53 | echo $queues 54 | } 55 | 56 | function list_xps_setup() { 57 | local txq=0 58 | local mqleaf=0 59 | for xps_cpus in $(sorted_txq_xps_cpus); do 60 | let mqleaf++ 61 | mask=$(cat $xps_cpus) 62 | value=$((0x$mask)) 63 | #echo MASK:0x$mask 64 | txt=$(mask_to_cpus $value) 65 | echo "NIC=$DEV TXQ:$txq (MQ-leaf :$mqleaf) use $txt" 66 | let txq++ 67 | done 68 | } 69 | 70 | function cpu_to_mask() { 71 | local cpu=$1 72 | printf "%X" $((1 << $cpu)) 73 | } 74 | 75 | # Setup TXQ to only use a single specific CPU 76 | function xps_txq_to_cpu() { 77 | local txq=$1 78 | local cpu=$2 79 | local mask=0 80 | if [[ "$DISABLE" != "yes" ]]; then 81 | mask=$(cpu_to_mask $cpu) 82 | fi 83 | local txq_file=/sys/class/net/$DEV/queues/tx-$txq/xps_cpus 84 | 85 | if [[ -e "$txq_file" ]]; then 86 | echo $mask > $txq_file 87 | fi 88 | } 89 | 90 | function xps_setup_1to1_mapping() { 91 | local cpu=0 92 | local txq=0 93 | for xps_cpus in $(sorted_txq_xps_cpus); do 94 | 95 | if [[ "$DISABLE" != "yes" ]]; then 96 | # Map the TXQ to CPU number 1-to-1 97 | mask=$(cpu_to_mask $cpu) 98 | else 99 | # Disable XPS on TXQ 100 | mask=0 101 | fi 102 | 103 | echo $mask > $xps_cpus 104 | info "NIC=$DEV TXQ:$txq use CPU $cpu (MQ-leaf :$mqleaf)" 105 | let cpu++ 106 | let txq++ 107 | done 108 | } 109 | 110 | # Using external program "getopt" to 
get --long-options 111 | OPTIONS=$(getopt -o hvld: \ 112 | --long help,verbose,list,default,disable,dev:,txq:,cpu: -- "$@") 113 | if (( $? != 0 )); then 114 | usage 115 | err 2 "Error calling getopt" 116 | fi 117 | eval set -- "$OPTIONS" 118 | 119 | ## --- Parse command line arguments / parameters --- 120 | while true; do 121 | case "$1" in 122 | -d | --dev ) # device 123 | export DEV=$2 124 | info "Device set to: DEV=$DEV" >&2 125 | shift 2 126 | ;; 127 | -v | --verbose) 128 | export VERBOSE=yes 129 | # info "Verbose mode: VERBOSE=$VERBOSE" >&2 130 | shift 131 | ;; 132 | -l | --list ) 133 | info "Listing --list" >&2 134 | export LIST=yes 135 | shift 1 136 | ;; 137 | --default ) 138 | info "Setup default 1-to-1 mapping TXQ-to-CPUs" >&2 139 | export DEFAULT=yes 140 | shift 1 141 | ;; 142 | --disable ) 143 | info "Disable XPS via mask 0x00" >&2 144 | export DISABLE=yes 145 | shift 1 146 | ;; 147 | --txq ) 148 | export TXQ=$2 149 | info "Selected: TXQ=$TXQ" >&2 150 | shift 2 151 | ;; 152 | --cpu ) 153 | export CPU=$2 154 | info "Selected: CPU=$CPU" >&2 155 | shift 2 156 | ;; 157 | -- ) 158 | shift 159 | break 160 | ;; 161 | -h | --help ) 162 | usage; 163 | exit 0 164 | ;; 165 | * ) 166 | shift 167 | break 168 | ;; 169 | esac 170 | done 171 | 172 | if [ -z "$DEV" ]; then 173 | usage 174 | err 2 "Please specify device" 175 | fi 176 | 177 | if [[ -n "$TXQ" ]]; then 178 | if [[ -z "$CPU" && -z "$DISABLE" ]]; then 179 | err 4 "CPU also needed when giving TXQ:$TXQ (or --disable)" 180 | fi 181 | xps_txq_to_cpu $TXQ $CPU 182 | fi 183 | 184 | if [[ -n "$DEFAULT" ]]; then 185 | xps_setup_1to1_mapping 186 | fi 187 | 188 | if [[ "$DISABLE" == "yes" ]]; then 189 | if [[ -z "$DEFAULT" && -z "$TXQ" ]]; then 190 | err 5 "Use --disable together with --default or --txq" 191 | fi 192 | fi 193 | 194 | if [[ -n "$LIST" ]]; then 195 | list_xps_setup 196 | fi 197 | -------------------------------------------------------------------------------- /headers/bpf_endian.h:
-------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | /* Copied from $(LINUX)/tools/testing/selftests/bpf/bpf_endian.h */ 3 | #ifndef __BPF_ENDIAN__ 4 | #define __BPF_ENDIAN__ 5 | 6 | #include 7 | 8 | /* LLVM's BPF target selects the endianness of the CPU 9 | * it compiles on, or the user specifies (bpfel/bpfeb), 10 | * respectively. The used __BYTE_ORDER__ is defined by 11 | * the compiler, we cannot rely on __BYTE_ORDER from 12 | * libc headers, since it doesn't reflect the actual 13 | * requested byte order. 14 | * 15 | * Note, LLVM's BPF target has different __builtin_bswapX() 16 | * semantics. It does map to BPF_ALU | BPF_END | BPF_TO_BE 17 | * in bpfel and bpfeb case, which means below, that we map 18 | * to cpu_to_be16(). We could use it unconditionally in BPF 19 | * case, but better not rely on it, so that this header here 20 | * can be used from application and BPF program side, which 21 | * use different targets. 22 | */ 23 | #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ 24 | # define __bpf_ntohs(x)__builtin_bswap16(x) 25 | # define __bpf_htons(x)__builtin_bswap16(x) 26 | # define __bpf_constant_ntohs(x)___constant_swab16(x) 27 | # define __bpf_constant_htons(x)___constant_swab16(x) 28 | # define __bpf_ntohl(x)__builtin_bswap32(x) 29 | # define __bpf_htonl(x)__builtin_bswap32(x) 30 | # define __bpf_constant_ntohl(x)___constant_swab32(x) 31 | # define __bpf_constant_htonl(x)___constant_swab32(x) 32 | #elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ 33 | # define __bpf_ntohs(x)(x) 34 | # define __bpf_htons(x)(x) 35 | # define __bpf_constant_ntohs(x)(x) 36 | # define __bpf_constant_htons(x)(x) 37 | # define __bpf_ntohl(x)(x) 38 | # define __bpf_htonl(x)(x) 39 | # define __bpf_constant_ntohl(x)(x) 40 | # define __bpf_constant_htonl(x)(x) 41 | #else 42 | # error "Fix your compiler's __BYTE_ORDER__?!" 
43 | #endif 44 | 45 | #define bpf_htons(x)\ 46 | (__builtin_constant_p(x) ?\ 47 | __bpf_constant_htons(x) : __bpf_htons(x)) 48 | #define bpf_ntohs(x)\ 49 | (__builtin_constant_p(x) ?\ 50 | __bpf_constant_ntohs(x) : __bpf_ntohs(x)) 51 | #define bpf_htonl(x)\ 52 | (__builtin_constant_p(x) ?\ 53 | __bpf_constant_htonl(x) : __bpf_htonl(x)) 54 | #define bpf_ntohl(x)\ 55 | (__builtin_constant_p(x) ?\ 56 | __bpf_constant_ntohl(x) : __bpf_ntohl(x)) 57 | 58 | #endif /* __BPF_ENDIAN__ */ 59 | -------------------------------------------------------------------------------- /headers/bpf_helpers.h: -------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | /* Copied from $(LINUX)/tools/testing/selftests/bpf/bpf_helpers.h */ 3 | #ifndef __BPF_HELPERS_H 4 | #define __BPF_HELPERS_H 5 | 6 | /* helper macro to place programs, maps, license in 7 | * different sections in elf_bpf file. Section names 8 | * are interpreted by elf_bpf loader 9 | */ 10 | #define SEC(NAME) __attribute__((section(NAME), used)) 11 | 12 | /* helper functions called from eBPF programs written in C */ 13 | static void *(*bpf_map_lookup_elem)(void *map, void *key) = 14 | (void *) BPF_FUNC_map_lookup_elem; 15 | static int (*bpf_map_update_elem)(void *map, void *key, void *value, 16 | unsigned long long flags) = 17 | (void *) BPF_FUNC_map_update_elem; 18 | static int (*bpf_map_delete_elem)(void *map, void *key) = 19 | (void *) BPF_FUNC_map_delete_elem; 20 | static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) = 21 | (void *) BPF_FUNC_probe_read; 22 | static unsigned long long (*bpf_ktime_get_ns)(void) = 23 | (void *) BPF_FUNC_ktime_get_ns; 24 | static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) 
= 25 | (void *) BPF_FUNC_trace_printk; 26 | static void (*bpf_tail_call)(void *ctx, void *map, int index) = 27 | (void *) BPF_FUNC_tail_call; 28 | static unsigned long long (*bpf_get_smp_processor_id)(void) = 29 | (void *) BPF_FUNC_get_smp_processor_id; 30 | static unsigned long long (*bpf_get_current_pid_tgid)(void) = 31 | (void *) BPF_FUNC_get_current_pid_tgid; 32 | static unsigned long long (*bpf_get_current_uid_gid)(void) = 33 | (void *) BPF_FUNC_get_current_uid_gid; 34 | static int (*bpf_get_current_comm)(void *buf, int buf_size) = 35 | (void *) BPF_FUNC_get_current_comm; 36 | static unsigned long long (*bpf_perf_event_read)(void *map, 37 | unsigned long long flags) = 38 | (void *) BPF_FUNC_perf_event_read; 39 | static int (*bpf_clone_redirect)(void *ctx, int ifindex, int flags) = 40 | (void *) BPF_FUNC_clone_redirect; 41 | static int (*bpf_redirect)(int ifindex, int flags) = 42 | (void *) BPF_FUNC_redirect; 43 | static int (*bpf_redirect_map)(void *map, int key, int flags) = 44 | (void *) BPF_FUNC_redirect_map; 45 | static int (*bpf_perf_event_output)(void *ctx, void *map, 46 | unsigned long long flags, void *data, 47 | int size) = 48 | (void *) BPF_FUNC_perf_event_output; 49 | static int (*bpf_get_stackid)(void *ctx, void *map, int flags) = 50 | (void *) BPF_FUNC_get_stackid; 51 | static int (*bpf_probe_write_user)(void *dst, void *src, int size) = 52 | (void *) BPF_FUNC_probe_write_user; 53 | static int (*bpf_current_task_under_cgroup)(void *map, int index) = 54 | (void *) BPF_FUNC_current_task_under_cgroup; 55 | static int (*bpf_skb_get_tunnel_key)(void *ctx, void *key, int size, int flags) = 56 | (void *) BPF_FUNC_skb_get_tunnel_key; 57 | static int (*bpf_skb_set_tunnel_key)(void *ctx, void *key, int size, int flags) = 58 | (void *) BPF_FUNC_skb_set_tunnel_key; 59 | static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int size) = 60 | (void *) BPF_FUNC_skb_get_tunnel_opt; 61 | static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) = 62 | 
(void *) BPF_FUNC_skb_set_tunnel_opt; 63 | static unsigned long long (*bpf_get_prandom_u32)(void) = 64 | (void *) BPF_FUNC_get_prandom_u32; 65 | static int (*bpf_xdp_adjust_head)(void *ctx, int offset) = 66 | (void *) BPF_FUNC_xdp_adjust_head; 67 | 68 | /* llvm builtin functions that eBPF C program may use to 69 | * emit BPF_LD_ABS and BPF_LD_IND instructions 70 | */ 71 | struct sk_buff; 72 | unsigned long long load_byte(void *skb, 73 | unsigned long long off) asm("llvm.bpf.load.byte"); 74 | unsigned long long load_half(void *skb, 75 | unsigned long long off) asm("llvm.bpf.load.half"); 76 | unsigned long long load_word(void *skb, 77 | unsigned long long off) asm("llvm.bpf.load.word"); 78 | 79 | /* a helper structure used by eBPF C program 80 | * to describe map attributes to elf_bpf loader 81 | */ 82 | struct bpf_map_def { 83 | unsigned int type; 84 | unsigned int key_size; 85 | unsigned int value_size; 86 | unsigned int max_entries; 87 | unsigned int map_flags; 88 | unsigned int inner_map_idx; 89 | }; 90 | 91 | static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) = 92 | (void *) BPF_FUNC_skb_load_bytes; 93 | static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) = 94 | (void *) BPF_FUNC_skb_store_bytes; 95 | static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) = 96 | (void *) BPF_FUNC_l3_csum_replace; 97 | static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flags) = 98 | (void *) BPF_FUNC_l4_csum_replace; 99 | static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) = 100 | (void *) BPF_FUNC_skb_under_cgroup; 101 | static int (*bpf_skb_change_head)(void *, int len, int flags) = 102 | (void *) BPF_FUNC_skb_change_head; 103 | 104 | #if defined(__x86_64__) 105 | 106 | #define PT_REGS_PARM1(x) ((x)->di) 107 | #define PT_REGS_PARM2(x) ((x)->si) 108 | #define PT_REGS_PARM3(x) ((x)->dx) 109 | #define PT_REGS_PARM4(x) ((x)->cx) 110 | #define PT_REGS_PARM5(x) 
((x)->r8) 111 | #define PT_REGS_RET(x) ((x)->sp) 112 | #define PT_REGS_FP(x) ((x)->bp) 113 | #define PT_REGS_RC(x) ((x)->ax) 114 | #define PT_REGS_SP(x) ((x)->sp) 115 | #define PT_REGS_IP(x) ((x)->ip) 116 | 117 | #elif defined(__s390x__) 118 | 119 | #define PT_REGS_PARM1(x) ((x)->gprs[2]) 120 | #define PT_REGS_PARM2(x) ((x)->gprs[3]) 121 | #define PT_REGS_PARM3(x) ((x)->gprs[4]) 122 | #define PT_REGS_PARM4(x) ((x)->gprs[5]) 123 | #define PT_REGS_PARM5(x) ((x)->gprs[6]) 124 | #define PT_REGS_RET(x) ((x)->gprs[14]) 125 | #define PT_REGS_FP(x) ((x)->gprs[11]) /* Works only with CONFIG_FRAME_POINTER */ 126 | #define PT_REGS_RC(x) ((x)->gprs[2]) 127 | #define PT_REGS_SP(x) ((x)->gprs[15]) 128 | #define PT_REGS_IP(x) ((x)->psw.addr) 129 | 130 | #elif defined(__aarch64__) 131 | 132 | #define PT_REGS_PARM1(x) ((x)->regs[0]) 133 | #define PT_REGS_PARM2(x) ((x)->regs[1]) 134 | #define PT_REGS_PARM3(x) ((x)->regs[2]) 135 | #define PT_REGS_PARM4(x) ((x)->regs[3]) 136 | #define PT_REGS_PARM5(x) ((x)->regs[4]) 137 | #define PT_REGS_RET(x) ((x)->regs[30]) 138 | #define PT_REGS_FP(x) ((x)->regs[29]) /* Works only with CONFIG_FRAME_POINTER */ 139 | #define PT_REGS_RC(x) ((x)->regs[0]) 140 | #define PT_REGS_SP(x) ((x)->sp) 141 | #define PT_REGS_IP(x) ((x)->pc) 142 | 143 | #elif defined(__powerpc__) 144 | 145 | #define PT_REGS_PARM1(x) ((x)->gpr[3]) 146 | #define PT_REGS_PARM2(x) ((x)->gpr[4]) 147 | #define PT_REGS_PARM3(x) ((x)->gpr[5]) 148 | #define PT_REGS_PARM4(x) ((x)->gpr[6]) 149 | #define PT_REGS_PARM5(x) ((x)->gpr[7]) 150 | #define PT_REGS_RC(x) ((x)->gpr[3]) 151 | #define PT_REGS_SP(x) ((x)->sp) 152 | #define PT_REGS_IP(x) ((x)->nip) 153 | 154 | #elif defined(__sparc__) 155 | 156 | #define PT_REGS_PARM1(x) ((x)->u_regs[UREG_I0]) 157 | #define PT_REGS_PARM2(x) ((x)->u_regs[UREG_I1]) 158 | #define PT_REGS_PARM3(x) ((x)->u_regs[UREG_I2]) 159 | #define PT_REGS_PARM4(x) ((x)->u_regs[UREG_I3]) 160 | #define PT_REGS_PARM5(x) ((x)->u_regs[UREG_I4]) 161 | #define PT_REGS_RET(x) 
((x)->u_regs[UREG_I7]) 162 | #define PT_REGS_RC(x) ((x)->u_regs[UREG_I0]) 163 | #define PT_REGS_SP(x) ((x)->u_regs[UREG_FP]) 164 | #if defined(__arch64__) 165 | #define PT_REGS_IP(x) ((x)->tpc) 166 | #else 167 | #define PT_REGS_IP(x) ((x)->pc) 168 | #endif 169 | 170 | #endif 171 | 172 | #ifdef __powerpc__ 173 | #define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ (ip) = (ctx)->link; }) 174 | #define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP 175 | #elif defined(__sparc__) 176 | #define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ (ip) = PT_REGS_RET(ctx); }) 177 | #define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP 178 | #else 179 | #define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ \ 180 | bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); }) 181 | #define BPF_KRETPROBE_READ_RET_IP(ip, ctx) ({ \ 182 | bpf_probe_read(&(ip), sizeof(ip), \ 183 | (void *)(PT_REGS_FP(ctx) + sizeof(ip))); }) 184 | #endif 185 | 186 | #endif 187 | -------------------------------------------------------------------------------- /headers/bpf_util.h: -------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | #ifndef __BPF_UTIL__ 3 | #define __BPF_UTIL__ 4 | 5 | #include 6 | #include 7 | #include 8 | #include 9 | 10 | static inline unsigned int bpf_num_possible_cpus(void) 11 | { 12 | static const char *fcpu = "/sys/devices/system/cpu/possible"; 13 | unsigned int start, end, possible_cpus = 0; 14 | char buff[128]; 15 | FILE *fp; 16 | int n; 17 | 18 | fp = fopen(fcpu, "r"); 19 | if (!fp) { 20 | printf("Failed to open %s: '%s'!\n", fcpu, strerror(errno)); 21 | exit(1); 22 | } 23 | 24 | while (fgets(buff, sizeof(buff), fp)) { 25 | n = sscanf(buff, "%u-%u", &start, &end); 26 | if (n == 0) { 27 | printf("Failed to retrieve # possible CPUs!\n"); 28 | exit(1); 29 | } else if (n == 1) { 30 | end = start; 31 | } 32 | possible_cpus = start == 0 ? 
end + 1 : 0; 33 | break; 34 | } 35 | fclose(fp); 36 | 37 | return possible_cpus; 38 | } 39 | 40 | #define __bpf_percpu_val_align __attribute__((__aligned__(8))) 41 | 42 | #define BPF_DECLARE_PERCPU(type, name) \ 43 | struct { type v; /* padding */ } __bpf_percpu_val_align \ 44 | name[bpf_num_possible_cpus()] 45 | #define bpf_percpu(name, cpu) name[(cpu)].v 46 | 47 | #ifndef ARRAY_SIZE 48 | # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) 49 | #endif 50 | 51 | #ifndef sizeof_field 52 | #define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) 53 | #endif 54 | 55 | #ifndef offsetofend 56 | #define offsetofend(TYPE, MEMBER) \ 57 | (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER)) 58 | #endif 59 | 60 | #endif /* __BPF_UTIL__ */ 61 | -------------------------------------------------------------------------------- /headers/jhash.h: -------------------------------------------------------------------------------- 1 | #ifndef _LINUX_JHASH_H 2 | #define _LINUX_JHASH_H 3 | 4 | /* Copied from $(LINUX)/include/linux/jhash.h (kernel 4.18) */ 5 | 6 | /* jhash.h: Jenkins hash support. 7 | * 8 | * Copyright (C) 2006. Bob Jenkins (bob_jenkins@burtleburtle.net) 9 | * 10 | * http://burtleburtle.net/bob/hash/ 11 | * 12 | * These are the credits from Bob's sources: 13 | * 14 | * lookup3.c, by Bob Jenkins, May 2006, Public Domain. 15 | * 16 | * These are functions for producing 32-bit hashes for hash table lookup. 17 | * hashword(), hashlittle(), hashlittle2(), hashbig(), mix(), and final() 18 | * are externally useful functions. Routines to test the hash are included 19 | * if SELF_TEST is defined. You can use this free for any purpose. It's in 20 | * the public domain. It has no warranty. 
21 | * 22 | * Copyright (C) 2009-2010 Jozsef Kadlecsik (kadlec@blackhole.kfki.hu) 23 | */ 24 | 25 | static inline __u32 rol32(__u32 word, unsigned int shift) 26 | { 27 | return (word << shift) | (word >> ((-shift) & 31)); 28 | } 29 | 30 | /* copy paste of jhash from kernel sources (include/linux/jhash.h) to make sure 31 | * LLVM can compile it into a valid sequence of BPF instructions 32 | */ 33 | #define __jhash_mix(a, b, c) \ 34 | { \ 35 | a -= c; a ^= rol32(c, 4); c += b; \ 36 | b -= a; b ^= rol32(a, 6); a += c; \ 37 | c -= b; c ^= rol32(b, 8); b += a; \ 38 | a -= c; a ^= rol32(c, 16); c += b; \ 39 | b -= a; b ^= rol32(a, 19); a += c; \ 40 | c -= b; c ^= rol32(b, 4); b += a; \ 41 | } 42 | 43 | #define __jhash_final(a, b, c) \ 44 | { \ 45 | c ^= b; c -= rol32(b, 14); \ 46 | a ^= c; a -= rol32(c, 11); \ 47 | b ^= a; b -= rol32(a, 25); \ 48 | c ^= b; c -= rol32(b, 16); \ 49 | a ^= c; a -= rol32(c, 4); \ 50 | b ^= a; b -= rol32(a, 14); \ 51 | c ^= b; c -= rol32(b, 24); \ 52 | } 53 | 54 | #define JHASH_INITVAL 0xdeadbeef 55 | 56 | typedef unsigned int u32; 57 | 58 | /* jhash - hash an arbitrary key 59 | * @k: sequence of bytes as key 60 | * @length: the length of the key 61 | * @initval: the previous hash, or an arbitrary value 62 | * 63 | * The generic version, hashes an arbitrary sequence of bytes. 64 | * No alignment or length assumptions are made about the input key. 65 | * 66 | * Returns the hash value of the key. The result depends on endianness. 
67 | */ 68 | static inline u32 jhash(const void *key, u32 length, u32 initval) 69 | { 70 | u32 a, b, c; 71 | const unsigned char *k = key; 72 | 73 | /* Set up the internal state */ 74 | a = b = c = JHASH_INITVAL + length + initval; 75 | 76 | /* All but the last block: affect some 32 bits of (a,b,c) */ 77 | while (length > 12) { 78 | a += *(u32 *)(k); 79 | b += *(u32 *)(k + 4); 80 | c += *(u32 *)(k + 8); 81 | __jhash_mix(a, b, c); 82 | length -= 12; 83 | k += 12; 84 | } 85 | /* Last block: affect all 32 bits of (c) */ 86 | switch (length) { 87 | case 12: c += (u32)k[11]<<24; /* fall through */ 88 | case 11: c += (u32)k[10]<<16; /* fall through */ 89 | case 10: c += (u32)k[9]<<8; /* fall through */ 90 | case 9: c += k[8]; /* fall through */ 91 | case 8: b += (u32)k[7]<<24; /* fall through */ 92 | case 7: b += (u32)k[6]<<16; /* fall through */ 93 | case 6: b += (u32)k[5]<<8; /* fall through */ 94 | case 5: b += k[4]; /* fall through */ 95 | case 4: a += (u32)k[3]<<24; /* fall through */ 96 | case 3: a += (u32)k[2]<<16; /* fall through */ 97 | case 2: a += (u32)k[1]<<8; /* fall through */ 98 | case 1: a += k[0]; 99 | __jhash_final(a, b, c); 100 | case 0: /* Nothing left to add */ 101 | break; 102 | } 103 | 104 | return c; 105 | } 106 | 107 | /* jhash2 - hash an array of u32's 108 | * @k: the key which must be an array of u32's 109 | * @length: the number of u32's in the key 110 | * @initval: the previous hash, or an arbitrary value 111 | * 112 | * Returns the hash value of the key. 
113 | */ 114 | static inline u32 jhash2(const u32 *k, u32 length, u32 initval) 115 | { 116 | u32 a, b, c; 117 | 118 | /* Set up the internal state */ 119 | a = b = c = JHASH_INITVAL + (length<<2) + initval; 120 | 121 | /* Handle most of the key */ 122 | while (length > 3) { 123 | a += k[0]; 124 | b += k[1]; 125 | c += k[2]; 126 | __jhash_mix(a, b, c); 127 | length -= 3; 128 | k += 3; 129 | } 130 | 131 | /* Handle the last 3 u32's */ 132 | switch (length) { 133 | case 3: c += k[2]; /* fall through */ 134 | case 2: b += k[1]; /* fall through */ 135 | case 1: a += k[0]; 136 | __jhash_final(a, b, c); 137 | case 0: /* Nothing left to add */ 138 | break; 139 | } 140 | 141 | return c; 142 | } 143 | 144 | 145 | /* __jhash_nwords - hash exactly 3, 2 or 1 word(s) */ 146 | static inline u32 __jhash_nwords(u32 a, u32 b, u32 c, u32 initval) 147 | { 148 | a += initval; 149 | b += initval; 150 | c += initval; 151 | 152 | __jhash_final(a, b, c); 153 | 154 | return c; 155 | } 156 | 157 | static inline u32 jhash_3words(u32 a, u32 b, u32 c, u32 initval) 158 | { 159 | return __jhash_nwords(a, b, c, initval + JHASH_INITVAL + (3 << 2)); 160 | } 161 | 162 | static inline u32 jhash_2words(u32 a, u32 b, u32 initval) 163 | { 164 | return __jhash_nwords(a, b, 0, initval + JHASH_INITVAL + (2 << 2)); 165 | } 166 | 167 | static inline u32 jhash_1word(u32 a, u32 initval) 168 | { 169 | return __jhash_nwords(a, 0, 0, initval + JHASH_INITVAL + (1 << 2)); 170 | } 171 | 172 | #endif /* _LINUX_JHASH_H */ 173 | -------------------------------------------------------------------------------- /headers/perf-sys.h: -------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | /* Copied from $(LINUX)/tools/perf/perf-sys.h (kernel 4.18) */ 3 | #ifndef _PERF_SYS_H 4 | #define _PERF_SYS_H 5 | 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | /* 12 | * remove the following headers to allow for userspace program compilation 13 
| * #include 14 | * #include 15 | */ 16 | #ifdef __powerpc__ 17 | #define CPUINFO_PROC {"cpu"} 18 | #endif 19 | 20 | #ifdef __s390__ 21 | #define CPUINFO_PROC {"vendor_id"} 22 | #endif 23 | 24 | #ifdef __sh__ 25 | #define CPUINFO_PROC {"cpu type"} 26 | #endif 27 | 28 | #ifdef __hppa__ 29 | #define CPUINFO_PROC {"cpu"} 30 | #endif 31 | 32 | #ifdef __sparc__ 33 | #define CPUINFO_PROC {"cpu"} 34 | #endif 35 | 36 | #ifdef __alpha__ 37 | #define CPUINFO_PROC {"cpu model"} 38 | #endif 39 | 40 | #ifdef __arm__ 41 | #define CPUINFO_PROC {"model name", "Processor"} 42 | #endif 43 | 44 | #ifdef __mips__ 45 | #define CPUINFO_PROC {"cpu model"} 46 | #endif 47 | 48 | #ifdef __arc__ 49 | #define CPUINFO_PROC {"Processor"} 50 | #endif 51 | 52 | #ifdef __xtensa__ 53 | #define CPUINFO_PROC {"core ID"} 54 | #endif 55 | 56 | #ifndef CPUINFO_PROC 57 | #define CPUINFO_PROC { "model name", } 58 | #endif 59 | 60 | static inline int 61 | sys_perf_event_open(struct perf_event_attr *attr, 62 | pid_t pid, int cpu, int group_fd, 63 | unsigned long flags) 64 | { 65 | int fd; 66 | 67 | fd = syscall(__NR_perf_event_open, attr, pid, cpu, 68 | group_fd, flags); 69 | 70 | #ifdef HAVE_ATTR_TEST 71 | if (unlikely(test_attr__enabled)) 72 | test_attr__open(attr, pid, cpu, fd, group_fd, flags); 73 | #endif 74 | return fd; 75 | } 76 | 77 | #endif /* _PERF_SYS_H */ 78 | -------------------------------------------------------------------------------- /src/Makefile: -------------------------------------------------------------------------------- 1 | # SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) 2 | 3 | TARGET := xdp_pass 4 | TARGET += xdp_iphash_to_cpu 5 | TARGET += tc_classify 6 | 7 | CMDLINE_TOOLS := xdp_iphash_to_cpu_cmdline 8 | 9 | TC_TARGETS += tc_queue_mapping 10 | 11 | LLC ?= llc 12 | CLANG ?= clang 13 | CC := gcc 14 | 15 | LIBBPF_DIR = ../libbpf/src 16 | LIBBPF_INSTALL := libbpf-install 17 | LIBBPF_INSTDIR=../../$(LIBBPF_INSTALL) 18 | LIBBPF_INC ?= $(LIBBPF_DIR)/$(LIBBPF_INSTDIR)/usr/include/ 
19 | 20 | XDP_C = ${TARGET:=_kern.c} ${TC_TARGETS:=_kern.c} 21 | XDP_OBJ = ${XDP_C:.c=.o} 22 | USER_C = ${TARGET:=_user.c} 23 | USER_OBJ = ${USER_C:.c=.o} 24 | OBJECT_LIBBPF = $(LIBBPF_DIR)/libbpf.a $(LIBBPF_INSTDIR) 25 | 26 | CFLAGS += -I$(LIBBPF_INC) 27 | CFLAGS += -I../headers/ 28 | LDFLAGS ?= -L$(LIBBPF_DIR) 29 | 30 | COMMON_USER_OBJ := common_user.o 31 | 32 | LIBS = -l:libbpf.a -lelf -lz 33 | 34 | all: llvm-check $(TARGET) $(XDP_OBJ) $(CMDLINE_TOOLS) 35 | 36 | .PHONY: clean $(CLANG) $(LLC) 37 | 38 | clean: 39 | cd $(LIBBPF_DIR) && $(MAKE) clean 40 | cd $(LIBBPF_DIR) && rm -r $(LIBBPF_INSTDIR) 41 | rm -f $(TARGET) 42 | rm -f $(XDP_OBJ) 43 | rm -f $(USER_OBJ) 44 | rm -f $(COMMON_USER_OBJ) 45 | rm -f *.ll 46 | rm -f *~ 47 | 48 | llvm-check: $(CLANG) $(LLC) 49 | @for TOOL in $^ ; do \ 50 | if [ ! $$(command -v $${TOOL} 2>/dev/null) ]; then \ 51 | echo "*** ERROR: Cannot find tool $${TOOL}" ;\ 52 | exit 1; \ 53 | else true; fi; \ 54 | done 55 | 56 | $(OBJECT_LIBBPF): 57 | @if [ ! -d $(LIBBPF_DIR) ]; then \ 58 | echo "Error: Need libbpf submodule"; \ 59 | echo "May need to run git submodule update --init"; \ 60 | exit 1; \ 61 | else \ 62 | cd $(LIBBPF_DIR) && $(MAKE) all; \ 63 | DESTDIR=$(LIBBPF_INSTDIR) $(MAKE) install_headers; \ 64 | fi 65 | 66 | $(COMMON_USER_OBJ): common_user.c common_user.h 67 | $(CC) -c -o $@ $< $(CFLAGS) 68 | 69 | # Define dependencies to other files 70 | #xdp_iphash_to_cpu_kern.o: xdp_iphash_to_cpu_common.h 71 | #xdp_iphash_to_cpu_cmdline: xdp_iphash_to_cpu_common.h 72 | #xdp_iphash_to_cpu: xdp_iphash_to_cpu_common.h 73 | 74 | $(TARGET): %: %_user.c $(OBJECT_LIBBPF) Makefile $(COMMON_USER_OBJ) 75 | $(CC) -Wall $(CFLAGS) $(LDFLAGS) -o $@ $(COMMON_USER_OBJ) \ 76 | $< $(LIBS) 77 | 78 | $(CMDLINE_TOOLS): %: %.c $(OBJECT_LIBBPF) Makefile $(COMMON_USER_OBJ) 79 | $(CC) -Wall $(CFLAGS) $(LDFLAGS) -o $@ $(COMMON_USER_OBJ) \ 80 | $< $(LIBS) 81 | 82 | $(XDP_OBJ): %.o: %.c common_kern_user.h shared_maps.h 83 | $(CLANG) -S \ 84 | -target bpf \ 85 | -D 
__BPF_TRACING__ \ 86 | $(CFLAGS) \ 87 | -Wall \ 88 | -Wno-unused-value \ 89 | -Wno-pointer-sign \ 90 | -Wno-compare-distinct-pointer-types \ 91 | -Werror \ 92 | -O2 -emit-llvm -c -g -o ${@:.o=.ll} $< 93 | $(LLC) -march=bpf -filetype=obj -o $@ ${@:.o=.ll} 94 | -------------------------------------------------------------------------------- /src/common_kern_user.h: -------------------------------------------------------------------------------- 1 | /* This common_kern_user.h is used by BPF-progs (both XDP and TC) and 2 | * userspace programs, for sharing common structs and defines. 3 | */ 4 | #ifndef __COMMON_KERN_USER_H 5 | #define __COMMON_KERN_USER_H 6 | 7 | #include 8 | 9 | /* Interface (ifindex) direction type */ 10 | #define INTERFACE_NONE 0 /* Not configured */ 11 | #define INTERFACE_WAN (1 << 0) 12 | #define INTERFACE_LAN (1 << 1) 13 | 14 | #define MAX_CPUS 64 15 | 16 | /* This ifindex limit is an artificial limit that can easily be bumped. 17 | * It exists so that a faster BPF_MAP_TYPE_ARRAY can be used 18 | * in fast-path lookups. 19 | */ 20 | #define MAX_IFINDEX 256 21 | 22 | /* Data structure used for map_txq_config */ 23 | struct txq_config { 24 | /* lookup key: __u32 cpu; */ 25 | __u16 queue_mapping; 26 | __u16 htb_major; 27 | }; 28 | 29 | #define IP_HASH_ENTRIES_MAX 32767 30 | /* Data structure used for map_ip_hash */ 31 | struct ip_hash_info { 32 | /* lookup key: __u32 IPv4-address */ 33 | __u32 cpu; 34 | __u32 tc_handle; /* TC handle MAJOR:MINOR combined in __u32 */ 35 | }; 36 | 37 | /* Key type used for map_ip_hash trie */ 38 | struct ip_hash_key { 39 | __u32 prefixlen; /* Length of the prefix to match */ 40 | struct in6_addr address; /* An IPv6 address. IPv4 uses the last 32 bits. 
*/ 41 | }; 42 | 43 | #endif /* __COMMON_KERN_USER_H */ 44 | -------------------------------------------------------------------------------- /src/common_user.c: -------------------------------------------------------------------------------- 1 | #include /* __u32 */ 2 | #include /* fprintf */ 3 | #include /* strerror */ 4 | #include /* access */ 5 | #include /* bool */ 6 | #include /* dirname */ 7 | #include /* inet_pton */ 8 | #include /* statfs */ 9 | #include /* stat(2) + S_IRWXU */ 10 | #include /* mount(2) */ 11 | #include 12 | 13 | #include /* TC_H_MAJ + TC_H_MIN */ 14 | 15 | #include "common_user.h" 16 | #include "common_kern_user.h" 17 | 18 | #include "bpf_util.h" 19 | 20 | #include /* LIBBPF_API: bpf_map_update_elem */ 21 | //#include /* System kernel-headers, BPF_ANY, but inc by bpf/bpf.h */ 22 | 23 | int verbose = 1; /* extern in common_user.h */ 24 | 25 | const char *mapfile_txq_config = BASEDIR_MAPS "/map_txq_config"; 26 | const char *mapfile_ip_hash = BASEDIR_MAPS "/map_ip_hash"; 27 | const char *mapfile_ifindex_type = BASEDIR_MAPS "/map_ifindex_type"; 28 | const char *mapfile_cpu_map = BASEDIR_MAPS "/cpu_map"; 29 | 30 | /* Check consistency between map_txq_config and ip_hash_info that is 31 | * going to be inserted into ip_hash 32 | */ 33 | bool map_txq_config_check_ip_info(int map_fd, struct ip_hash_info *ip_info) { 34 | struct txq_config txq_cfg; 35 | __u16 ip_htb_major; 36 | __u32 cpu; 37 | int err; 38 | 39 | if (map_fd < 0) { 40 | fprintf(stderr, "ERR: (bad map_fd:%d) " 41 | "cannot proceed without access to txq_config map\n", 42 | map_fd); 43 | return false; 44 | } 45 | 46 | cpu = ip_info->cpu; 47 | err = bpf_map_lookup_elem(map_fd, &cpu, &txq_cfg); 48 | if (err) { 49 | fprintf(stderr, 50 | "ERR: %s() lookup cpu-key:%d err(%d):%s\n", 51 | __func__, cpu, errno, strerror(errno)); 52 | return false; 53 | } 54 | 55 | if (txq_cfg.queue_mapping == 0) { 56 | fprintf(stderr, "WARN: " 57 | "Looks like map_txq_config --base-setup is missing\n"); 58 | 
fprintf(stderr, "WARN: " 59 | "Fixing, doing map_txq_config --base-setup\n"); 60 | if (!map_txq_config_base_setup(map_fd)) 61 | return false; 62 | return true; // FIXME, redo check 63 | } 64 | 65 | ip_htb_major = TC_H_MAJ(ip_info->tc_handle) >> 16; 66 | if (txq_cfg.htb_major != ip_htb_major) { 67 | if (verbose) 68 | fprintf(stderr, 69 | "WARN: Bad config mismatch " 70 | "ip handle:0x%X (major:0x%X) " 71 | "not matching TXQ-config:0x%X\n", 72 | ip_info->tc_handle, ip_htb_major, 73 | txq_cfg.htb_major); 74 | return false; 75 | } 76 | return true; 77 | } 78 | 79 | struct ip_hash_key ip_string_to_key(char *ip_string) { 80 | struct ip_hash_key key; 81 | int res; 82 | char addr[INET6_ADDRSTRLEN]; /* Temporary buffer if parsing IP */ 83 | 84 | key.address.__in6_u.__u6_addr32[0] = 0xFFFFFFFF; 85 | key.address.__in6_u.__u6_addr32[1] = 0xFFFFFFFF; 86 | key.address.__in6_u.__u6_addr32[2] = 0xFFFFFFFF; 87 | key.address.__in6_u.__u6_addr32[3] = 0xFFFFFFFF; 88 | key.prefixlen = 128; 89 | 90 | /* Does the IP string contain a prefix? 
*/ 91 | char * slash_loc = strchr(ip_string, '/'); 92 | if (slash_loc != NULL) { 93 | char cidr[4]; 94 | memset(&addr, 0, sizeof(addr)); 95 | memset(&cidr, 0, sizeof(cidr)); 96 | strncpy(addr, ip_string, slash_loc - ip_string); 97 | strncpy(cidr, slash_loc+1, sizeof(cidr) - 1); /* keep the NUL terminator */ 98 | key.prefixlen = atoi(cidr); 99 | ip_string = (char *)&addr; 100 | } 101 | struct addrinfo hints = {}, *result; 102 | memset (&hints, 0, sizeof (hints)); 103 | hints.ai_family = AF_UNSPEC; 104 | res = getaddrinfo(ip_string, NULL, &hints, &result); 105 | if (res != 0) { /* getaddrinfo returns a nonzero EAI_* code on failure */ 106 | printf("Code: %d\n", res); 107 | fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(res)); 108 | key.prefixlen = 255; /* Indicates fail */ 109 | return key; 110 | } 111 | 112 | switch (result->ai_family) { 113 | case AF_INET: 114 | key.address.__in6_u.__u6_addr32[3] = ((struct sockaddr_in *) result->ai_addr)->sin_addr.s_addr; 115 | if (key.prefixlen != 128) { 116 | key.prefixlen = key.prefixlen + 96; 117 | } 118 | break; 119 | case AF_INET6: 120 | printf("IPv6\n"); 121 | key.address = ((struct sockaddr_in6 *) result->ai_addr)->sin6_addr; 122 | break; 123 | } 124 | 125 | 126 | freeaddrinfo(result); 127 | return key; 128 | } 129 | 130 | void print_key_binary(struct ip_hash_key *key) { 131 | if (key->address.__in6_u.__u6_addr32[0] == 0 && key->address.__in6_u.__u6_addr32[1] == 0 && key->address.__in6_u.__u6_addr32[2] == 0) { 132 | /* It's IPv4 */ 133 | printf("IPv4: 0x%X/%d", key->address.__in6_u.__u6_addr32[3], key->prefixlen); 134 | } else { 135 | /* It's an IPv6 address */ 136 | printf("IPv6: 0x%X/0x%X/0x%X/0x%X/%d", key->address.__in6_u.__u6_addr32[0], 137 | key->address.__in6_u.__u6_addr32[1], key->address.__in6_u.__u6_addr32[2], 138 | key->address.__in6_u.__u6_addr32[3], key->prefixlen); 139 | } 140 | } 141 | 142 | int iphash_modify(int fd, char *ip_string, unsigned int action, 143 | __u32 cpu_idx, __u32 tc_handle, int txq_map_fd) 144 | { 145 | //printf ("In iphash_modify %u\n",cpu_idx); 146 | struct ip_hash_key key; 147 | int res; 148 | unsigned int nr_cpus = 
bpf_num_possible_cpus(); 149 | struct ip_hash_info ip_info; 150 | 151 | if (cpu_idx >= nr_cpus) /* cpu_idx is unsigned, so one range check suffices */ 152 | return EXIT_FAIL_CPU; 153 | 154 | /* Value for the map */ 155 | ip_info.cpu = cpu_idx; 156 | ip_info.tc_handle = tc_handle; 157 | 158 | /* Convert IP-string into network byte-order value */ 159 | key = ip_string_to_key(ip_string); 160 | if (key.prefixlen == 255) { 161 | return EXIT_FAIL_IP; 162 | } 163 | print_key_binary(&key); 164 | if (action == ACTION_ADD) { 165 | //res = bpf_map_update_elem(fd, &key, &ip_info, BPF_NOEXIST); 166 | if (!map_txq_config_check_ip_info(txq_map_fd, &ip_info)) 167 | fprintf(stderr, "Misconf: But allowing to continue\n"); 168 | res = bpf_map_update_elem(fd, &key, &ip_info, BPF_ANY); 169 | } else if (action == ACTION_DEL) { 170 | res = bpf_map_delete_elem(fd, &key); 171 | } else { 172 | fprintf(stderr, "ERR: %s() invalid action 0x%x\n", 173 | __func__, action); 174 | return EXIT_FAIL_OPTION; 175 | } 176 | 177 | if (res != 0) { /* 0 == success */ 178 | fprintf(stderr, 179 | "%s() IP:%s errno(%d/%s)", 180 | __func__, ip_string, errno, strerror(errno)); 181 | 182 | if (errno == EEXIST) { 183 | fprintf(stderr, ": Already in Iphash\n"); 184 | return EXIT_OK; 185 | } 186 | fprintf(stderr, "\n"); 187 | return EXIT_FAIL_MAP_KEY; 188 | } 189 | if (verbose) 190 | fprintf(stderr, 191 | "%s() IP:%s TC-handle:0x%X\n", 192 | __func__, ip_string, tc_handle); 193 | return EXIT_OK; 194 | } 195 | 196 | bool locate_kern_object(char *execname, char *filename, size_t size) 197 | { 198 | char *basec, *bname; 199 | 200 | snprintf(filename, size, "%s_kern.o", execname); 201 | 202 | if (access(filename, F_OK) != -1 ) 203 | return true; 204 | 205 | /* Cannot find the _kern.o ELF object file directly. 206 | * Let's start searching for it in different paths. 
207 | */ 208 | basec = strdup(execname); 209 | if (basec == NULL) 210 | return false; 211 | bname = basename(basec); 212 | 213 | /* Maybe enough to add a "./" */ 214 | snprintf(filename, size, "./%s_kern.o", bname); 215 | if (access( filename, F_OK ) != -1 ) { 216 | free(basec); 217 | return true; 218 | } 219 | 220 | /* Maybe /usr/local/lib/ */ 221 | snprintf(filename, size, "/usr/local/lib/%s_kern.o", bname); 222 | if (access( filename, F_OK ) != -1 ) { 223 | free(basec); 224 | return true; 225 | } 226 | 227 | /* Maybe /usr/local/bin/ */ 228 | snprintf(filename, size, "/usr/local/bin/%s_kern.o", bname); 229 | if (access(filename, F_OK) != -1 ) { 230 | free(basec); 231 | return true; 232 | } 233 | 234 | free(basec); 235 | return false; 236 | } 237 | 238 | #ifndef BPF_FS_MAGIC 239 | # define BPF_FS_MAGIC 0xcafe4a11 240 | #endif 241 | 242 | #define FILEMODE (S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH) 243 | 244 | /* Verify BPF-filesystem is mounted on given file path */ 245 | int __bpf_fs_check_path(const char *path) 246 | { 247 | struct statfs st_fs; 248 | char *dname, *dir; 249 | int err = 0; 250 | 251 | if (path == NULL) 252 | return -EINVAL; 253 | 254 | dname = strdup(path); 255 | if (dname == NULL) 256 | return -ENOMEM; 257 | 258 | dir = dirname(dname); 259 | if (statfs(dir, &st_fs)) { 260 | fprintf(stderr, "ERR: failed to statfs %s: (%d)%s\n", 261 | dir, errno, strerror(errno)); 262 | err = -errno; 263 | } 264 | free(dname); 265 | 266 | if (!err && st_fs.f_type != BPF_FS_MAGIC) { 267 | err = -EMEDIUMTYPE; 268 | } 269 | 270 | return err; 271 | } 272 | 273 | int bpf_fs_check() 274 | { 275 | const char *path = BPF_DIR_MNT "/some_file"; 276 | int err; 277 | 278 | err = __bpf_fs_check_path(path); 279 | 280 | if (err == -EMEDIUMTYPE) { 281 | fprintf(stderr, 282 | "ERR: specified path %s is not on BPF FS\n\n" 283 | " You need to mount the BPF filesystem type like:\n" 284 | " mount -t bpf bpf /sys/fs/bpf/\n\n", 285 | path); 286 | } 287 | return err; 288 | } 289 | 
290 | 291 | int __bpf_fs_subdir_check_and_fix(const char *dir) 292 | { 293 | int err; 294 | 295 | err = access(dir, F_OK); 296 | if (err) { 297 | if (errno == EACCES) { 298 | fprintf(stderr,"ERR: " 299 | "Got root? dir access %s fail: %s\n", 300 | dir, strerror(errno)); 301 | return -1; 302 | } 303 | err = mkdir(dir, FILEMODE); 304 | if (err) { 305 | fprintf(stderr, "ERR: mkdir %s failed: %s\n", 306 | dir, strerror(errno)); 307 | return -1; 308 | } 309 | // printf("DEBUG: mkdir %s\n", dir); 310 | } 311 | 312 | return err; 313 | } 314 | 315 | int bpf_fs_check_and_fix() 316 | { 317 | const char *some_base_path = BPF_DIR_MNT "/some_file"; 318 | const char *dir_tc_globals = BPF_DIR_MNT "/tc/globals"; 319 | const char *dir_tc = BPF_DIR_MNT "/tc"; 320 | const char *target = BPF_DIR_MNT; 321 | bool did_mkdir = false; 322 | int err; 323 | 324 | err = __bpf_fs_check_path(some_base_path); 325 | 326 | if (err) { 327 | /* First fix step: mkdir /sys/fs/bpf if dir not exist */ 328 | struct stat sb = {0}; 329 | int ret; 330 | 331 | ret = stat(target, &sb); 332 | if (ret) { 333 | ret = mkdir(target, FILEMODE); 334 | if (ret) { 335 | fprintf(stderr, "mkdir %s failed: %s\n", target, 336 | strerror(errno)); 337 | return ret; 338 | } 339 | did_mkdir = true; 340 | } 341 | } 342 | 343 | if (err == -EMEDIUMTYPE || did_mkdir) { 344 | /* Fix step 2: Mount bpf filesystem */ 345 | if (mount("bpf", target, "bpf", 0, "mode=0755")) { 346 | fprintf(stderr, "ERR: mount -t bpf bpf %s failed: %s\n", 347 | target, strerror(errno)); 348 | return -1; 349 | } 350 | } 351 | 352 | /* Fix step 3: Check sub-directories exists */ 353 | err = __bpf_fs_subdir_check_and_fix(dir_tc); 354 | if (err) 355 | return err; 356 | 357 | err = __bpf_fs_subdir_check_and_fix(dir_tc_globals); 358 | if (err) 359 | return err; 360 | 361 | return 0; 362 | } 363 | 364 | 365 | bool map_txq_config_list_setup(int map_fd) { 366 | unsigned int possible_cpus = bpf_num_possible_cpus(); 367 | struct txq_config txq_cfg; 368 | int cpu, 
err; 369 | 370 | printf("Current configuration:\n"); 371 | printf("|-----------+---------------+-----------|\n" 372 | "| key (cpu) | queue_mapping | htb_major |\n" 373 | "|-----------+---------------+-----------|\n"); 374 | 375 | for (cpu = 0; cpu < possible_cpus; cpu++) { 376 | 377 | err = bpf_map_lookup_elem(map_fd, &cpu, &txq_cfg); 378 | if (err) { 379 | fprintf(stderr, 380 | "ERR: %s() lookup cpu-key:%d err(%d):%s\n", 381 | __func__, cpu, errno, strerror(errno)); 382 | return false; 383 | } 384 | 385 | printf("| %-6u | %-6u | 0x%-6X |\n", 386 | cpu, txq_cfg.queue_mapping, txq_cfg.htb_major); 387 | } 388 | 389 | printf("|-----------+---------------+-----------|\n"); 390 | return true; 391 | } 392 | 393 | /* 394 | Create a simple default base setup for the "map_txq_config", where the 395 | queue_mapping is CPU + 1, and HTB qdisc have handles equal to 396 | queue_mapping. 397 | 398 | |-----------+---------------+-----------| 399 | | key (cpu) | queue_mapping | htb_major | 400 | |-----------+---------------+-----------| 401 | | 0 | 1 | 1 | 402 | | 1 | 2 | 2 | 403 | | 2 | 3 | 3 | 404 | | 3 | 4 | 4 | 405 | |-----------+---------------+-----------| 406 | 407 | */ 408 | bool map_txq_config_base_setup(int map_fd) { 409 | unsigned int possible_cpus = bpf_num_possible_cpus(); 410 | struct txq_config txq_cfg; 411 | __u32 cpu; 412 | int err; 413 | 414 | if (map_fd < 0) { 415 | fprintf(stderr, "ERR: (bad map_fd:%d) " 416 | "cannot proceed without access to txq_config map\n", 417 | map_fd); 418 | return false; 419 | } 420 | 421 | for (cpu = 0; cpu < possible_cpus; cpu++) { 422 | txq_cfg.queue_mapping = cpu + 1; 423 | txq_cfg.htb_major = cpu + 1; 424 | 425 | err = bpf_map_update_elem(map_fd, &cpu, &txq_cfg, 0); 426 | if (err) { 427 | fprintf(stderr, 428 | "ERR: %s() updating cpu-key:%d err(%d):%s\n", 429 | __func__, cpu, errno, strerror(errno)); 430 | return false; 431 | } 432 | } 433 | 434 | return true; 435 | } 436 | 437 | #define CMD_MAX 2048 438 | #define CMD_MAX_TC 256 
439 | static char tc_cmd[CMD_MAX_TC] = "tc"; 440 | 441 | /* 442 | * TC requires attaching the bpf-object via the TC cmdline tool. 443 | * 444 | * Manually like: 445 | * $TC qdisc del dev $DEV clsact 446 | * $TC qdisc add dev $DEV clsact 447 | * $TC filter add dev $DEV egress bpf da obj $BPF_OBJ sec $SEC_NAME 448 | * $TC filter show dev $DEV egress 449 | * $TC filter del dev $DEV egress 450 | * 451 | * (The tc "replace" command does not seem to work as expected) 452 | */ 453 | int tc_egress_attach_bpf(const char* dev, const char* bpf_obj, 454 | const char* sec_name) 455 | { 456 | char cmd[CMD_MAX]; 457 | int ret = 0; 458 | 459 | /* Step-1: Delete clsact, which also removes filters */ 460 | memset(&cmd, 0, CMD_MAX); 461 | snprintf(cmd, CMD_MAX, 462 | "%s qdisc del dev %s clsact 2> /dev/null", 463 | tc_cmd, dev); 464 | if (verbose) printf(" - Run: %s\n", cmd); 465 | ret = system(cmd); 466 | if (!WIFEXITED(ret)) { 467 | fprintf(stderr, 468 | "ERR(%d): Cannot exec tc cmd\n Cmdline:%s\n", 469 | ret, cmd); /* raw status: WEXITSTATUS is only valid after a normal exit */ 470 | exit(EXIT_FAILURE); 471 | } else if (WEXITSTATUS(ret) == 2) { 472 | /* Unfortunately TC uses the same return code for many errors */ 473 | if (verbose) printf(" - (First time loading clsact?)\n"); 474 | } 475 | 476 | /* Step-2: Attach a new clsact qdisc */ 477 | memset(&cmd, 0, CMD_MAX); 478 | snprintf(cmd, CMD_MAX, 479 | "%s qdisc add dev %s clsact", 480 | tc_cmd, dev); 481 | if (verbose) printf(" - Run: %s\n", cmd); 482 | ret = system(cmd); 483 | if (ret) { 484 | fprintf(stderr, 485 | "ERR(%d): tc cannot attach qdisc hook\n Cmdline:%s\n", 486 | WEXITSTATUS(ret), cmd); 487 | exit(EXIT_FAILURE); 488 | } 489 | 490 | /* Step-3: Attach BPF program/object as egress filter */ 491 | memset(&cmd, 0, CMD_MAX); 492 | snprintf(cmd, CMD_MAX, 493 | "%s filter add dev %s " 494 | "egress prio 1 handle 1 bpf da obj %s sec %s", 495 | tc_cmd, dev, bpf_obj, sec_name); 496 | if (verbose) printf(" - Run: %s\n", cmd); 497 | ret = system(cmd); 498 | if (ret) { 499 | 
fprintf(stderr, 500 | "ERR(%d): tc cannot attach filter\n Cmdline:%s\n", 501 | WEXITSTATUS(ret), cmd); 502 | exit(EXIT_FAILURE); 503 | } 504 | 505 | return ret; 506 | } 507 | 508 | int tc_list_egress_filter(const char* dev) 509 | { 510 | char cmd[CMD_MAX]; 511 | int ret = 0; 512 | 513 | memset(&cmd, 0, CMD_MAX); 514 | snprintf(cmd, CMD_MAX, 515 | "%s filter show dev %s egress", 516 | tc_cmd, dev); 517 | if (verbose) printf(" - Run: %s\n", cmd); 518 | ret = system(cmd); 519 | if (ret) { 520 | fprintf(stderr, 521 | "ERR(%d): tc cannot list filters\n Cmdline:%s\n", 522 | ret, cmd); 523 | exit(EXIT_FAILURE); 524 | } 525 | return ret; 526 | } 527 | 528 | int tc_remove_egress_filter(const char* dev) 529 | { 530 | char cmd[CMD_MAX]; 531 | int ret = 0; 532 | 533 | memset(&cmd, 0, CMD_MAX); 534 | snprintf(cmd, CMD_MAX, 535 | /* Remove all egress filters on dev */ 536 | "%s filter delete dev %s egress", 537 | /* Alternatively could remove specific filter handle: 538 | "%s filter delete dev %s egress prio 1 handle 1 bpf", 539 | */ 540 | tc_cmd, dev); 541 | if (verbose) printf(" - Run: %s\n", cmd); 542 | ret = system(cmd); 543 | if (ret) { 544 | fprintf(stderr, 545 | "ERR(%d): tc cannot remove filters\n Cmdline:%s\n", 546 | ret, cmd); 547 | exit(EXIT_FAILURE); 548 | } 549 | return ret; 550 | } 551 | -------------------------------------------------------------------------------- /src/common_user.h: -------------------------------------------------------------------------------- 1 | /* This common_user.h is used by userspace programs. 
2 | */ 3 | #ifndef __COMMON_USER_H 4 | #define __COMMON_USER_H 5 | 6 | extern int verbose; /* common_user.c */ 7 | 8 | /* Also see: #include "common_kern_user.h" */ 9 | 10 | /* Exit return codes */ 11 | #define EXIT_OK 0 /* == EXIT_SUCCESS (stdlib.h) man exit(3) */ 12 | #define EXIT_FAIL 1 /* == EXIT_FAILURE (stdlib.h) man exit(3) */ 13 | #define EXIT_FAIL_OPTION 2 14 | #define EXIT_FAIL_XDP 3 15 | #define EXIT_FAIL_MAP 20 16 | #define EXIT_FAIL_MAP_KEY 21 17 | #define EXIT_FAIL_MAP_FILE 22 18 | #define EXIT_FAIL_MAP_FS 23 19 | #define EXIT_FAIL_IP 30 20 | #define EXIT_FAIL_CPU 31 21 | #define EXIT_FAIL_BPF 40 22 | #define EXIT_FAIL_BPF_ELF 41 23 | #define EXIT_FAIL_BPF_RELOCATE 42 24 | 25 | /* 26 | * Map files shared between TC and XDP program, are due to iproute2 27 | * limitations, located under /sys/fs/bpf/tc/globals/ 28 | */ 29 | /* Basedir due to iproute2 use this path */ 30 | #define BASEDIR_MAPS "/sys/fs/bpf/tc/globals" 31 | 32 | extern const char *mapfile_txq_config; /* located in common_user.c */ 33 | extern const char *mapfile_ip_hash; 34 | extern const char *mapfile_ifindex_type; 35 | extern const char *mapfile_cpu_map; 36 | /* 37 | * Gotcha need to mount: 38 | * mount -t bpf bpf /sys/fs/bpf/ 39 | */ 40 | 41 | /* iphash_modify operations */ 42 | #define ACTION_ADD (1 << 0) 43 | #define ACTION_DEL (1 << 1) 44 | 45 | int iphash_modify(int fd, char *ip_string, unsigned int action, 46 | __u32 cpu_idx, __u32 tc_handle, int txq_map_fd); 47 | 48 | bool locate_kern_object(char *execname, char *filename, size_t size); 49 | 50 | #define BPF_DIR_MNT "/sys/fs/bpf" 51 | int bpf_fs_check(); 52 | int bpf_fs_check_and_fix(); 53 | 54 | bool map_txq_config_list_setup(int map_fd); 55 | bool map_txq_config_base_setup (int map_fd); 56 | 57 | struct ip_hash_info; /* to use #include "common_kern_user.h" */ 58 | bool map_txq_config_check_ip_info(int map_fd, struct ip_hash_info *ip_info); 59 | 60 | int tc_egress_attach_bpf(const char* dev, const char* bpf_obj, 61 | const char* 
sec_name); 62 | int tc_list_egress_filter(const char* dev); 63 | int tc_remove_egress_filter(const char* dev); 64 | 65 | #endif /* __COMMON_USER_H */ 66 | -------------------------------------------------------------------------------- /src/howto_debug.org: -------------------------------------------------------------------------------- 1 | # -*- fill-column: 76; -*- 2 | #+Title: Howto verify/debug TXQ selection via skb->queue_mapping 3 | 4 | * Background 5 | 6 | The purpose of changing =skb->queue_mapping= is to influence the selection 7 | of the =net_device= "txq" (struct netdev_queue), which influences selection 8 | of the qdisc "root_lock" (via =txq->qdisc->q.lock=) and =txq->_xmit_lock=. 9 | When using the =MQ= qdisc the =txq->qdisc= points to different qdiscs and 10 | associated locks, and HARD_TX_LOCK (=txq->_xmit_lock=), for CPU scalability. 11 | 12 | * Common mistake 13 | 14 | The most common mistake is that XPS (Transmit Packet Steering) takes 15 | precedence over setting =skb->queue_mapping=. XPS is configured per DEVICE 16 | via =/sys/class/net/DEVICE/queues/tx-*/xps_cpus= via a CPU hex mask. To 17 | disable set mask=00. 18 | 19 | See current config via command: 20 | #+BEGIN_SRC bash 21 | $ grep -H . /sys/class/net/ixgbe2/queues/tx-*/xps_cpus 22 | /sys/class/net/ixgbe2/queues/tx-0/xps_cpus:00 23 | /sys/class/net/ixgbe2/queues/tx-1/xps_cpus:00 24 | /sys/class/net/ixgbe2/queues/tx-2/xps_cpus:00 25 | /sys/class/net/ixgbe2/queues/tx-3/xps_cpus:00 26 | /sys/class/net/ixgbe2/queues/tx-4/xps_cpus:00 27 | /sys/class/net/ixgbe2/queues/tx-5/xps_cpus:00 28 | #+END_SRC 29 | 30 | A script that makes configuring XPS easier is provided here: [[file:../bin/xps_setup.sh]]. 31 | 32 | * Debugging TXQ selection 33 | 34 | The recommended hook for changing the =skb->queue_mapping= is via TC egress 35 | hook on device (see kernel function =sch_handle_egress=), which happens in 36 | (__dev_queue_xmit) just before selecting the txq via =netdev_pick_tx=.
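To make the relation between =skb->queue_mapping= and the resulting txq index concrete, here is a minimal userspace sketch, not kernel code. The function name =model_txq_index= is hypothetical, and the wrap-around for out-of-range values is an assumption that only approximates what =skb_tx_hash()= and =netdev_cap_txqueue()= do in the kernel. It assumes the convention used by the TC programs in this project: queue_mapping is 1-indexed and zero means "not set".

```c
/* Hypothetical helper, not a kernel function: models how a 1-indexed
 * queue_mapping relates to the 0-indexed txq index.
 * Assumption: values beyond the number of TX queues wrap around.
 */
unsigned int model_txq_index(unsigned int queue_mapping,
			     unsigned int real_num_tx_queues)
{
	unsigned int idx;

	/* Zero means "not set". The real kernel would then pick a txq by
	 * other means; this simplified model just returns txq 0.
	 */
	if (queue_mapping == 0 || real_num_tx_queues == 0)
		return 0;

	/* 1-indexed queue_mapping to 0-indexed txq */
	idx = queue_mapping - 1;

	/* Reduce too-high values until they fit the available TX queues */
	while (idx >= real_num_tx_queues)
		idx -= real_num_tx_queues;

	return idx;
}
```

Under these assumptions, a device with 4 TX queues maps queue_mapping 2 to txq 1, while an out-of-range queue_mapping such as 6 is reduced to txq 1 as well. The perf probes are the way to check what the running kernel actually does.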
37 | 38 | For debugging and seeing both the =skb->queue_mapping= and the resulting txq 39 | index (which is usually queue_mapping - 1), we can install perf probes when 40 | calling =netdev_pick_tx= and observe return value from =__netdev_pick_tx=. 41 | 42 | ** Capping TXQ index 43 | 44 | If setting a high =queue_mapping= for debugging purposes, notice that the 45 | kernel will cap the =txq= index in =skb_tx_hash()= (and in other situation 46 | also in =netdev_cap_txqueue()=). 47 | 48 | ** Using perf probe to inspect 49 | 50 | Add two probes. First =netdev_pick_tx= to see the queue_mapping before it 51 | gets capped. And second return value from =__netdev_pick_tx=, which returns 52 | the "txq" index capped (if not capped it should be queue_mapping - 1). 53 | 54 | #+begin_example 55 | perf probe --add 'netdev_pick_tx dev->name:string queue_mapping_before_cap=skb->queue_mapping dev->real_num_tx_queues dev->num_tc' 56 | perf probe --add '__netdev_pick_tx%return txq_queue_mapping_minus_1_after_cap=$retval' 57 | #+end_example 58 | 59 | Record via: 60 | #+begin_example 61 | perf record -aR \ 62 | -e probe:__netdev_pick_tx__return \ 63 | -e probe:netdev_pick_tx sleep 2 64 | #+end_example 65 | 66 | View result via: 67 | #+begin_example 68 | perf script 69 | #+end_example 70 | 71 | Delete all probes again: 72 | #+BEGIN_EXAMPLE 73 | perf probe -d '*' 74 | #+END_EXAMPLE 75 | 76 | trace_net_dev_queue 77 | -------------------------------------------------------------------------------- /src/shared_maps.h: -------------------------------------------------------------------------------- 1 | #ifndef SHARED_MAPS_H 2 | #define SHARED_MAPS_H 3 | 4 | #include 5 | 6 | #include "common_kern_user.h" 7 | 8 | /* Pinned shared map: see mapfile_ip_hash */ 9 | struct { 10 | __uint(type, BPF_MAP_TYPE_LPM_TRIE); 11 | __uint(max_entries, IP_HASH_ENTRIES_MAX); 12 | __type(key, struct ip_hash_key); 13 | __type(value, struct ip_hash_info); 14 | __uint(pinning, LIBBPF_PIN_BY_NAME); 15 | 
__uint(map_flags, BPF_F_NO_PREALLOC); 16 | } map_ip_hash SEC(".maps"); 17 | 18 | /* Map shared with XDP programs */ 19 | struct { 20 | __uint(type, BPF_MAP_TYPE_ARRAY); 21 | __uint(max_entries, MAX_IFINDEX); 22 | __type(key, __u32); 23 | __type(value, __u32); 24 | __uint(pinning, LIBBPF_PIN_BY_NAME); 25 | } map_ifindex_type SEC(".maps"); 26 | 27 | #endif 28 | -------------------------------------------------------------------------------- /src/tc_classify_kern.c: -------------------------------------------------------------------------------- 1 | #define DEBUG 1 2 | /* SPDX-License-Identifier: GPL-2.0 */ 3 | #include 4 | #include 5 | #include /* TC_H_MAJ + TC_H_MIN */ 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | 14 | #include 15 | 16 | #include 17 | #include 18 | 19 | #include "common_kern_user.h" 20 | #include "shared_maps.h" 21 | 22 | /* More dynamic: let us create a map that contains the mapping table, to 23 | * allow more dynamic configuration. (See common_kern_user.h for struct txq_config) 24 | */ 25 | struct { 26 | __uint(type, BPF_MAP_TYPE_ARRAY); 27 | __uint(max_entries, MAX_CPUS); 28 | __type(key, __u32); 29 | __type(value, struct txq_config); 30 | __uint(pinning, LIBBPF_PIN_BY_NAME); 31 | } map_txq_config SEC(".maps"); 32 | 33 | /* Manual setup: 34 | 35 | tc qdisc del dev ixgbe2 clsact # clears all 36 | tc qdisc add dev ixgbe2 clsact 37 | tc filter add dev ixgbe2 egress bpf da obj tc_classify_kern.o sec tc 38 | tc filter list dev ixgbe2 egress 39 | 40 | */ 41 | 42 | struct vlan_hdr { 43 | __be16 h_vlan_TCI; 44 | __be16 h_vlan_encapsulated_proto; 45 | }; 46 | 47 | /* iproute2 uses another ELF map layout than libbpf. 
The PIN_GLOBAL_NS 48 | * will cause map to be exported to /sys/fs/bpf/tc/globals/ 49 | */ 50 | #define PIN_GLOBAL_NS 2 51 | struct bpf_elf_map { 52 | __u32 type; 53 | __u32 size_key; 54 | __u32 size_value; 55 | __u32 max_elem; 56 | __u32 flags; 57 | __u32 id; 58 | __u32 pinning; 59 | __u32 inner_id; 60 | __u32 inner_idx; 61 | }; 62 | 63 | /* 64 | CPU config map table (struct txq_config): 65 | 66 | |----------+---------------+-----------+-----------------| 67 | | Key: CPU | queue_mapping | htb_major | maps-to-MQ-leaf | 68 | |----------+---------------+-----------+-----------------| 69 | | 0 | 1 | 100: | 7FFF:1 | 70 | | 1 | 2 | 101: | 7FFF:2 | 71 | | 2 | 3 | 102: | 7FFF:3 | 72 | | 3 | 4 | 103: | 7FFF:4 | 73 | |----------+---------------+-----------+-----------------| 74 | 75 | Last column "maps-to-MQ-leaf" is not part of config, but illustrates 76 | that queue_mapping corresponds to MQ-leaf "minor" numbers, assuming 77 | MQ is created with handle 7FFF, like: 78 | 79 | # tc qdisc replace dev ixgbe2 root handle 7FFF: mq 80 | 81 | The HTB-qdisc major number "handle" is chosen by the user, when 82 | attaching the HTB qdisc to the MQ-leaf "parent", like: 83 | 84 | # tc qdisc add dev ixgbe2 parent 7FFF:1 handle 100: htb default 2 85 | # tc qdisc add dev ixgbe2 parent 7FFF:2 handle 101: htb default 2 86 | 87 | */ 88 | 89 | #define DEBUG 1 90 | #ifdef DEBUG 91 | /* Only use this for debug output. Notice output from bpf_trace_printk() 92 | * ends up in /sys/kernel/debug/tracing/trace_pipe 93 | */ 94 | #define bpf_debug(fmt, ...) \ 95 | ({ \ 96 | char ____fmt[] = "(tc) " fmt; \ 97 | bpf_trace_printk(____fmt, sizeof(____fmt), \ 98 | ##__VA_ARGS__); \ 99 | }) 100 | #else 101 | #define bpf_debug(fmt, ...) 
do { } while (0) 102 | #endif 103 | 104 | /* Wrap the macros from */ 105 | #define TC_H_MAJOR(x) TC_H_MAJ(x) 106 | #define TC_H_MINOR(x) TC_H_MIN(x) 107 | 108 | /* Parse Ethernet layer 2, extract network layer 3 offset and protocol 109 | * 110 | * Returns false on error and non-supported ether-type 111 | */ 112 | static __always_inline 113 | bool parse_eth(struct ethhdr *eth, void *data_end, 114 | __u16 *eth_proto, __u32 *l3_offset) 115 | { 116 | __u16 eth_type; 117 | __u64 offset; 118 | 119 | offset = sizeof(*eth); 120 | if ((void *)eth + offset > data_end) 121 | return false; 122 | 123 | eth_type = eth->h_proto; 124 | 125 | /* Skip non 802.3 Ethertypes */ 126 | if (bpf_ntohs(eth_type) < ETH_P_802_3_MIN) 127 | return false; 128 | 129 | /* Handle VLAN tagged packet */ 130 | if (eth_type == bpf_htons(ETH_P_8021Q) || 131 | eth_type == bpf_htons(ETH_P_8021AD)) { 132 | struct vlan_hdr *vlan_hdr; 133 | 134 | vlan_hdr = (void *)eth + offset; 135 | offset += sizeof(*vlan_hdr); 136 | if ((void *)eth + offset > data_end) 137 | return false; 138 | eth_type = vlan_hdr->h_vlan_encapsulated_proto; 139 | } 140 | /* Handle double VLAN tagged packet */ 141 | if (eth_type == bpf_htons(ETH_P_8021Q) || 142 | eth_type == bpf_htons(ETH_P_8021AD)) { 143 | struct vlan_hdr *vlan_hdr; 144 | 145 | vlan_hdr = (void *)eth + offset; 146 | offset += sizeof(*vlan_hdr); 147 | if ((void *)eth + offset > data_end) 148 | return false; 149 | eth_type = vlan_hdr->h_vlan_encapsulated_proto; 150 | } 151 | 152 | *eth_proto = bpf_ntohs(eth_type); 153 | *l3_offset = offset; 154 | return true; 155 | } 156 | 157 | static __always_inline 158 | void get_ipv4_addr(struct __sk_buff *skb, __u32 l3_offset, __u32 ifindex_type, 159 | struct ip_hash_key *key) 160 | { 161 | void *data_end = (void *)(long)skb->data_end; 162 | void *data = (void *)(long)skb->data; 163 | struct iphdr *iph = data + l3_offset; 164 | __u32 ipv4 = 0; 165 | 166 | if (iph + 1 > data_end) { 167 | //bpf_debug("Invalid IPv4 packet: L3off:%llu\n", 
l3_offset); 168 | return; 169 | } 170 | 171 | /* The IP-addr to match against depends on the "direction" of 172 | * the packet. This TC hook runs at egress. 173 | */ 174 | switch (ifindex_type) { 175 | case INTERFACE_WAN: /* Egress on WAN interface: match on src IP */ 176 | ipv4 = iph->saddr; 177 | break; 178 | case INTERFACE_LAN: /* Egress on LAN interface: match on dst IP */ 179 | ipv4 = iph->daddr; 180 | break; 181 | default: 182 | ipv4 = 0; 183 | } 184 | key->address.in6_u.u6_addr32[3] = ipv4; 185 | } 186 | 187 | 188 | static __always_inline 189 | void get_ipv6_addr(struct __sk_buff *skb, __u32 l3_offset, __u32 ifindex_type, 190 | struct ip_hash_key *key) 191 | { 192 | void *data_end = (void *)(long)skb->data_end; 193 | void *data = (void *)(long)skb->data; 194 | struct ipv6hdr *ip6h = data + l3_offset; 195 | 196 | if (ip6h + 1 > data_end) { 197 | //bpf_debug("Invalid IPv6 packet: L3off:%llu\n", l3_offset); 198 | return; 199 | } 200 | 201 | /* The IP-addr to match against depends on the "direction" of 202 | * the packet. This TC hook runs at egress. 203 | */ 204 | switch (ifindex_type) { 205 | case INTERFACE_WAN: /* Egress on WAN interface: match on src IP */ 206 | key->address = ip6h->saddr; 207 | break; 208 | case INTERFACE_LAN: /* Egress on LAN interface: match on dst IP */ 209 | key->address = ip6h->daddr; 210 | break; 211 | } 212 | } 213 | 214 | /* Localhost generated traffic gets assigned a classid MINOR number */ 215 | #define DEFAULT_LOCALHOST_MINOR 0x0003 216 | /* 217 | * Localhost generated traffic, goes into another default qdisc, but 218 | * needs fixup of class MAJOR number to match CPU. 
219 | */ 220 | static __always_inline 221 | __u32 localhost_default_classid(struct __sk_buff *skb, 222 | struct txq_config *txq_cfg) 223 | { 224 | __u32 cpu_major; 225 | 226 | if (!txq_cfg) 227 | return TC_ACT_SHOT; 228 | 229 | cpu_major = txq_cfg->htb_major << 16; 230 | 231 | if (skb->priority == 0) { 232 | skb->priority = cpu_major | DEFAULT_LOCALHOST_MINOR; 233 | } else { 234 | /* The classid (via skb->priority) is already set, we 235 | * allow this, but update major number (assigned to CPU) 236 | */ 237 | __u32 curr_minor = TC_H_MINOR(skb->priority); 238 | 239 | skb->priority = cpu_major | curr_minor; 240 | } 241 | return TC_ACT_OK; 242 | } 243 | 244 | /* Special types of traffic exists. 245 | * 246 | * Like LAN-to-LAN or WAN-to-WAN traffic. The LAN-to-LAN traffic can 247 | * also be between different VLANS, thus it is not possible to 248 | * identify this via comparing skb->ifindex and skb->ingress_ifindex. 249 | * 250 | * Instead allow other filters (e.g. iptables -t mangle -j CLASSIFY) 251 | * to set the TC-handle/classid (in skb->priority) and match the 252 | * special TC-minor classid here. 
253 | */ 254 | #define SPECIAL_MINOR_CLASSID_LOW 3 255 | #define SPECIAL_MINOR_CLASSID_HIGH 9 256 | static __always_inline 257 | bool special_minor_classid(struct __sk_buff *skb, 258 | struct txq_config *txq_cfg) 259 | { 260 | __u32 curr_minor; 261 | 262 | if (!txq_cfg) 263 | return false; 264 | 265 | if (skb->priority == 0) 266 | return false; /* no special pre-set classid */ 267 | 268 | curr_minor = TC_H_MINOR(skb->priority); 269 | 270 | if (curr_minor >= SPECIAL_MINOR_CLASSID_LOW && 271 | curr_minor <= SPECIAL_MINOR_CLASSID_HIGH) { 272 | /* The classid (via skb->priority) was already set 273 | * with a special minor-classid, but update major 274 | * number assigned to this CPU 275 | */ 276 | __u32 cpu_major = txq_cfg->htb_major << 16; 277 | 278 | skb->priority = cpu_major | curr_minor; 279 | return true; 280 | } 281 | return false; 282 | } 283 | 284 | /* Quick manual reload command: 285 | tc filter replace dev ixgbe2 prio 0xC000 handle 1 egress bpf da obj tc_classify_kern.o sec tc 286 | */ 287 | SEC("tc") 288 | int tc_iphash_to_cpu(struct __sk_buff *skb) 289 | { 290 | __u32 cpu = bpf_get_smp_processor_id(); 291 | struct ip_hash_info *ip_info; 292 | struct txq_config *txq_cfg; 293 | __u32 *ifindex_type; 294 | __u32 ifindex; 295 | __u32 action = TC_ACT_OK; 296 | 297 | /* For packet parsing */ 298 | void *data_end = (void *)(long)skb->data_end; 299 | void *data = (void *)(long)skb->data; 300 | struct ethhdr *eth = data; 301 | __u16 eth_proto = 0; 302 | __u32 l3_offset = 0; 303 | //__u32 ipv4 = bpf_ntohl(0xFFFFFFFF); // default not found 304 | struct ip_hash_key hash_key; 305 | 306 | txq_cfg = bpf_map_lookup_elem(&map_txq_config, &cpu); 307 | if (!txq_cfg) 308 | return TC_ACT_SHOT; 309 | 310 | if (txq_cfg->queue_mapping != 0) { 311 | skb->queue_mapping = txq_cfg->queue_mapping; 312 | } else { 313 | bpf_debug("Misconf: CPU:%u no conf (curr qm:%d)\n", 314 | cpu, skb->queue_mapping); 315 | } 316 | 317 | /* Localhost generated traffic, goes into another default qdisc */ 
318 | if (skb->ingress_ifindex == 0) { 319 | return localhost_default_classid(skb, txq_cfg); 320 | } 321 | 322 | if (special_minor_classid(skb, txq_cfg)) { 323 | /* SKB was pre-marked with special class id */ 324 | return TC_ACT_OK; 325 | } 326 | 327 | /* Ethernet header parsing: The protocol is already known via 328 | * skb->protocol (host-byte-order). But due to double VLAN 329 | * tagging, we still need to parse eth-headers. The 330 | * skb->{vlan_present,vlan_tci} can only show outer VLAN. 331 | */ 332 | if (!(parse_eth(eth, data_end, &eth_proto, &l3_offset))) { 333 | bpf_debug("Cannot parse L2: L3off:%llu proto:0x%x\n", 334 | l3_offset, eth_proto); 335 | return TC_ACT_OK; /* Skip */ 336 | } 337 | 338 | /* Get interface "direction" via map_ifindex_type */ 339 | ifindex = skb->ifindex; 340 | ifindex_type = bpf_map_lookup_elem(&map_ifindex_type, &ifindex); 341 | if (!ifindex_type) 342 | return TC_ACT_OK; 343 | 344 | /* Get IP addr to match against */ 345 | hash_key.prefixlen = 128; 346 | hash_key.address.in6_u.u6_addr32[0] = 0xFFFFFFFF; 347 | hash_key.address.in6_u.u6_addr32[1] = 0xFFFFFFFF; 348 | hash_key.address.in6_u.u6_addr32[2] = 0xFFFFFFFF; 349 | hash_key.address.in6_u.u6_addr32[3] = 0xFFFFFFFF; 350 | switch (eth_proto) { 351 | case ETH_P_IP: 352 | get_ipv4_addr(skb, l3_offset, *ifindex_type, &hash_key); 353 | break; 354 | case ETH_P_IPV6: 355 | get_ipv6_addr(skb, l3_offset, *ifindex_type, &hash_key); 356 | break; 357 | case ETH_P_ARP: /* Let OS handle ARP */ 358 | // TODO: Should we choose a special classid for these? 
359 | /* Fall-through */ 360 | default: 361 | // bpf_debug("Not handling eth_proto:0x%x\n", eth_proto); 362 | return TC_ACT_OK; 363 | } 364 | 365 | ip_info = bpf_map_lookup_elem(&map_ip_hash, &hash_key); 366 | if (!ip_info) { 367 | /* Check for 255.255.255.255/32 as a default if no 0.0.0.0/0 is provided */ 368 | hash_key.prefixlen = 128; 369 | hash_key.address.in6_u.u6_addr32[3] = 0xFFFFFFFF; 370 | ip_info = bpf_map_lookup_elem(&map_ip_hash, &hash_key); 371 | if (!ip_info) { 372 | bpf_debug("Misconf: FAILED lookup IP:0x%x ifindex_ingress:%d prio:%x\n", 373 | hash_key.address.in6_u.u6_addr32[3], skb->ingress_ifindex, skb->priority); 374 | // TODO: Assign to some default classid? 375 | return TC_ACT_OK; 376 | } 377 | } 378 | 379 | if (ip_info->cpu != cpu) { 380 | bpf_debug("Mismatch: Curr-CPU:%u but IP:%x wants CPU:%u\n", 381 | cpu, hash_key.address.in6_u.u6_addr32[3], ip_info->cpu); 382 | bpf_debug("Mismatch: more-info ifindex:%d ingress:%d skb->prio:%x\n", 383 | skb->ifindex, skb->ingress_ifindex, skb->priority); 384 | } 385 | 386 | /* Catch if TC handle major number mismatch, between CPU 387 | * config and ip_info config. 388 | * TODO: Can this be done setup time? 
389 | */ 390 | __u16 ip_info_major = (TC_H_MAJOR(ip_info->tc_handle) >> 16); 391 | if (txq_cfg->htb_major != ip_info_major) 392 | { 393 | // TODO: Could fixup MAJOR number 394 | bpf_debug("Misconf: TC major(%d) mismatch %x\n", 395 | txq_cfg->htb_major, ip_info->tc_handle); 396 | } 397 | 398 | /* Setup skb->priority (TC-handle) based on ip_info */ 399 | if (ip_info->tc_handle != 0) 400 | skb->priority = ip_info->tc_handle; 401 | 402 | //bpf_debug("Lookup IP:%x prio:0x%x tc_handle:0x%x\n", 403 | // ipv4, skb->priority, ip_info->tc_handle); 404 | 405 | //return TC_ACT_OK; 406 | return action; 407 | } 408 | 409 | char _license[] SEC("license") = "GPL"; 410 | -------------------------------------------------------------------------------- /src/tc_classify_user.c: -------------------------------------------------------------------------------- 1 | static const char *__doc__= 2 | "TC: Control program for tc_classify_kern.o\n" 3 | " - When using --dev, loads TC-egress filter calling BPF program\n" 4 | " - Config of map_txq_config, that control CPU to queue_mapping\n" 5 | " - List current queue_mapping (txq) config via --list\n" 6 | "\n" 7 | ; 8 | 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | 22 | #include "common_user.h" 23 | #include "common_kern_user.h" 24 | 25 | static int map_txq_config_fd = -1; 26 | 27 | const char *bpf_obj = "tc_classify_kern.o"; 28 | const char *sec_name = "tc"; 29 | 30 | static const struct option long_options[] = { 31 | {"help", no_argument, NULL, 'h' }, 32 | {"base-setup", no_argument, NULL, 'b' }, 33 | {"list", no_argument, NULL, 'l' }, 34 | {"quiet", no_argument, NULL, 'q' }, 35 | {"cpu", required_argument, NULL, 'c' }, 36 | {"queue-mapping",required_argument, NULL, 'm' }, 37 | {"htb-major-hex",required_argument, NULL, 'j' }, /* Hex base 16 */ 38 | {"dev-egress" ,required_argument, NULL, 'd' }, 39 | {0, 0, NULL, 0 } 40 | 
}; 41 | 42 | static void usage(const char *prog_name_argv0, const char *doctxt) 43 | { 44 | int i; 45 | printf("\nDOCUMENTATION:\n%s\n", doctxt); 46 | printf(" Usage: %s (options-see-below)\n", prog_name_argv0); 47 | printf(" Listing options:\n"); 48 | for (i = 0; long_options[i].name != 0; i++) { 49 | printf(" --%-12s", long_options[i].name); 50 | if (long_options[i].flag != NULL) 51 | printf(" flag (internal value:%d)", 52 | *long_options[i].flag); 53 | else 54 | printf(" short-option: -%c", 55 | long_options[i].val); 56 | printf("\n"); 57 | } 58 | printf("\n"); 59 | } 60 | 61 | int open_bpf_map_file(const char *file) 62 | { 63 | int fd; 64 | 65 | fd = bpf_obj_get(file); 66 | if (fd < 0) { 67 | fprintf(stderr, 68 | "WARN: Failed to open bpf map file:%s err(%d):%s\n", 69 | file, errno, strerror(errno)); 70 | return fd; 71 | } 72 | return fd; 73 | } 74 | 75 | bool single_cpu_setup(int map_fd, __s64 set_cpu, struct txq_config txq_cfg, 76 | bool set_queue_mapping, bool set_htb_major) 77 | { 78 | __u32 cpu; 79 | int err; 80 | 81 | if (!set_queue_mapping) { 82 | fprintf(stderr, "ERR: missing option --queue-mapping\n"); 83 | return false; 84 | } 85 | if (!set_htb_major) { 86 | fprintf(stderr, "ERR: missing option --htb-major\n"); 87 | return false; 88 | } 89 | if (set_cpu < 0) { 90 | fprintf(stderr, "ERR: missing option --cpu\n"); 91 | return false; 92 | } 93 | cpu = (__u32) set_cpu; 94 | 95 | err = bpf_map_update_elem(map_fd, &cpu, &txq_cfg, 0); 96 | if (err) { 97 | fprintf(stderr, 98 | "ERR: %s() updating cpu-key:%d err(%d):%s\n", 99 | __func__, cpu, errno, strerror(errno)); 100 | return false; 101 | } 102 | if (verbose) { 103 | printf("Set CPU=%u to use queue_mapping=%u + htb_major=0x%X:\n", 104 | cpu, txq_cfg.queue_mapping, txq_cfg.htb_major); 105 | map_txq_config_list_setup(map_fd); 106 | } 107 | return true; 108 | } 109 | 110 | static char ifname_buf[IF_NAMESIZE]; 111 | static char *ifname = NULL; 112 | static int ifindex = -1; 113 | 114 | int main(int argc, char 
**argv) 115 | { 116 | int opt, longindex = 0; 117 | struct txq_config txq_cfg; 118 | bool set_queue_mapping = false; 119 | bool set_htb_major = false; 120 | bool do_map_init = false; 121 | bool do_list = false; 122 | __s64 set_cpu = -1; 123 | char filename[512]; 124 | 125 | /* Depend on sharing pinned maps */ 126 | if (bpf_fs_check_and_fix()) { 127 | fprintf(stderr, "ERR: " 128 | "Need access to bpf-fs(%s) for pinned maps " 129 | "(%d): %s\n", BPF_DIR_MNT, errno, strerror(errno)); 130 | return EXIT_FAIL_MAP_FS; 131 | } 132 | 133 | /* Try opening txq_config map for CPU to queue_mapping */ 134 | map_txq_config_fd = open_bpf_map_file(mapfile_txq_config); 135 | 136 | if (!locate_kern_object(argv[0], filename, sizeof(filename))) { 137 | fprintf(stderr, "ERR: " 138 | "cannot locate BPF _kern.o ELF file:%s errno(%d):%s\n", 139 | filename, errno, strerror(errno)); 140 | return EXIT_FAIL_BPF_ELF; 141 | } 142 | 143 | /* Parse commands line args */ 144 | while ((opt = getopt_long(argc, argv, "hqblc:m:j:d:", 145 | long_options, &longindex)) != -1) { 146 | switch (opt) { 147 | case 'q': 148 | verbose = 0; 149 | break; 150 | case 'b': 151 | do_map_init = true; 152 | break; 153 | case 'l': 154 | do_list = true; 155 | break; 156 | case 'c': 157 | set_cpu = strtoul(optarg, NULL, 0); 158 | break; 159 | case 'm': 160 | set_queue_mapping = true; 161 | txq_cfg.queue_mapping = strtoul(optarg, NULL, 0); 162 | break; 163 | case 'j': 164 | set_htb_major = true; 165 | txq_cfg.htb_major = strtoul(optarg, NULL, 16); /* Hex */ 166 | break; 167 | case 'd': 168 | if (strlen(optarg) >= IF_NAMESIZE) { 169 | fprintf(stderr, "ERR: --dev name too long\n"); 170 | goto error; 171 | } 172 | ifname = (char *)&ifname_buf; 173 | strncpy(ifname, optarg, IF_NAMESIZE); 174 | ifindex = if_nametoindex(ifname); 175 | if (ifindex == 0) { 176 | fprintf(stderr, 177 | "ERR: --dev name unknown err(%d):%s\n", 178 | errno, strerror(errno)); 179 | goto error; 180 | } 181 | if (ifindex >= MAX_IFINDEX) { 182 | 
fprintf(stderr, 183 | "ERR: Fix MAX_IFINDEX err(%d):%s\n", 184 | errno, strerror(errno)); 185 | goto error; 186 | } 187 | break; 188 | case 'h': 189 | error: 190 | default: 191 | usage(argv[0], __doc__); 192 | return EXIT_FAIL_OPTION; 193 | } 194 | } 195 | 196 | if (verbose) 197 | printf("%s Map filename: %s\n", __doc__, mapfile_txq_config); 198 | 199 | if (ifindex > 0 && !do_list) { 200 | int err; 201 | 202 | if (verbose) 203 | printf("Dev:%s -- Loading: TC-clsact egress\n", ifname); 204 | 205 | err = tc_egress_attach_bpf(ifname, filename, sec_name); 206 | if (err) { 207 | fprintf(stderr, "ERR: dev:%s" 208 | " Fail TC-clsact loading %s sec:%s\n", 209 | ifname, filename, sec_name); 210 | return err; 211 | } 212 | 213 | if (map_txq_config_fd < 0) { 214 | /* Just loaded TC prog should have pinned it */ 215 | map_txq_config_fd = 216 | open_bpf_map_file(mapfile_txq_config); 217 | do_map_init = true; 218 | } 219 | } 220 | 221 | if (do_map_init) { 222 | if (!map_txq_config_base_setup(map_txq_config_fd)) 223 | return EXIT_FAIL_MAP; 224 | if (verbose) 225 | map_txq_config_list_setup(map_txq_config_fd); 226 | } 227 | 228 | if (set_cpu >= 0 || set_queue_mapping || set_htb_major) { 229 | 230 | if (map_txq_config_fd < 0) { 231 | fprintf(stderr, 232 | "ERR: cannot proceed without access to config map\n"); 233 | return EXIT_FAIL_MAP; 234 | } 235 | 236 | if (!single_cpu_setup(map_txq_config_fd, set_cpu, txq_cfg, 237 | set_queue_mapping, set_htb_major)) 238 | return EXIT_FAIL_OPTION; 239 | } 240 | 241 | if (do_list) { 242 | if (!map_txq_config_list_setup(map_txq_config_fd)) 243 | return EXIT_FAIL_MAP; 244 | 245 | if (ifindex > 0) 246 | tc_list_egress_filter(ifname); 247 | } 248 | 249 | 250 | return EXIT_OK; 251 | } 252 | -------------------------------------------------------------------------------- /src/tc_queue_mapping_kern.c: -------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | #include 3 | #include 4 | 
#include /* TC_H_MAJ + TC_H_MIN */ 5 | #include "bpf_helpers.h" 6 | 7 | /* Manual setup: 8 | 9 | tc qdisc add dev ixgbe2 clsact 10 | tc filter add dev ixgbe2 egress bpf da obj tc_queue_mapping_kern.o sec tc_qmap2cpu 11 | tc filter list dev ixgbe2 egress 12 | 13 | */ 14 | 15 | #define DEBUG 1 16 | #ifdef DEBUG 17 | /* Only use this for debug output. Notice output from bpf_trace_printk() 18 | * ends up in /sys/kernel/debug/tracing/trace_pipe 19 | */ 20 | #define bpf_debug(fmt, ...) \ 21 | ({ \ 22 | char ____fmt[] = fmt; \ 23 | bpf_trace_printk(____fmt, sizeof(____fmt), \ 24 | ##__VA_ARGS__); \ 25 | }) 26 | #else 27 | #define bpf_debug(fmt, ...) do { } while (0) 28 | #endif 29 | 30 | /* Wrap the macros from */ 31 | #define TC_H_MAJOR(x) TC_H_MAJ(x) 32 | #define TC_H_MINOR(x) TC_H_MIN(x) 33 | 34 | SEC("tc") 35 | int tc_cls_prog(struct __sk_buff *skb) 36 | { 37 | __u32 cpu = bpf_get_smp_processor_id(); 38 | __u16 txq_root_handle; 39 | 40 | /* The skb->queue_mapping is 1-indexed (zero means queue_mapping not 41 | * set). The underlying MQ leaves are also 1-indexed, which makes it 42 | * easier to reason about. 43 | */ 44 | txq_root_handle = cpu + 1; 45 | skb->queue_mapping = txq_root_handle; 46 | 47 | /* Details: The kernel double protects against setting a too high 48 | * queue_mapping. In skb_tx_hash() it will reduce number to be 49 | * less-than (or equal) dev->real_num_tx_queues. And netdev_pick_tx() 50 | * caps via netdev_cap_txqueue(). 51 | */ 52 | /* 53 | Do simple mapping of CPU to queue_mapping. 
54 | ----------------------------------------- 55 | Assuming MQ is created with handle 7FFF: 56 | tc qdisc replace dev ixgbe2 root handle 7FFF: mq 57 | 58 | And for each MQ-leaf HTBs are created 59 | # Foreach TXQ - create HTB leaf(s) under MQ 0x7FFF:TXQ 60 | tc qdisc add dev ixgbe2 parent 7FFF:1 handle 1: htb default 2 61 | tc qdisc add dev ixgbe2 parent 7FFF:2 handle 2: htb default 2 62 | tc qdisc add dev ixgbe2 parent 7FFF:3 handle 3: htb default 2 63 | tc qdisc add dev ixgbe2 parent 7FFF:4 handle 4: htb default 2 64 | 65 | Gives the following mapping table: 66 | |-----+---------------+---------+-----------| 67 | | CPU | queue_mapping | MQ-leaf | HTB major | 68 | |-----+---------------+---------+-----------| 69 | | 0 | 1 | 7FFF:1 | 1: | 70 | | 1 | 2 | 7FFF:2 | 2: | 71 | | 2 | 3 | 7FFF:3 | 3: | 72 | | 3 | 4 | 7FFF:4 | 4: | 73 | |-----+---------------+---------+-----------| 74 | 75 | */ 76 | /* The __u32 TC "handle" is stored in skb->priority */ 77 | bpf_debug("queue_mapping:%u major:%u minor:%u\n", 78 | skb->queue_mapping, 79 | TC_H_MAJOR(skb->priority) >> 16, 80 | TC_H_MINOR(skb->priority)); 81 | /*Changing the handle class from iptables 82 | * iptables -t mangle -A FORWARD -j CLASSIFY --set-class 0001:0004 83 | */ 84 | 85 | // Test can we write into skb->priority ? 
86 | // skb->priority = TC_H_MAKE(1 << 16, 42); 87 | 88 | return TC_ACT_OK; 89 | } 90 | 91 | #define USHRT_MAX ((__u16)(~0U)) 92 | #define NO_QUEUE_MAPPING USHRT_MAX 93 | 94 | #define barrier() __asm__ __volatile__("": : :"memory") 95 | 96 | SEC("tc") 97 | int tc_cls_prog_test(struct __sk_buff *skb) 98 | { 99 | /* Kernel should not allow this to take effect */ 100 | skb->queue_mapping = NO_QUEUE_MAPPING; 101 | barrier(); /* Don't let compiler trick us */ 102 | bpf_debug("Tried to change queue_mapping=NO_QUEUE_MAPPING now=%d\n", 103 | skb->queue_mapping); 104 | return TC_ACT_OK; 105 | } 106 | 107 | 108 | char _license[] SEC("license") = "GPL"; 109 | -------------------------------------------------------------------------------- /src/xdp_iphash_to_cpu_cmdline.c: -------------------------------------------------------------------------------- 1 | 2 | static const char *__doc__= 3 | " XDP ip_hash: command line tool"; 4 | 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | 15 | #include 16 | #include 17 | #include 18 | 19 | #include 20 | 21 | /* libbpf.h defines bpf_* function helpers for syscalls, 22 | * indirectly via ./tools/lib/bpf/bpf.h */ 23 | #include 24 | #include 25 | #include 26 | 27 | #include /* TC macros */ 28 | 29 | #include "common_user.h" 30 | #include "common_kern_user.h" 31 | 32 | #define TC_H_MAJOR(x) TC_H_MAJ(x) 33 | #define TC_H_MINOR(x) TC_H_MIN(x) 34 | 35 | static const struct option long_options[] = { 36 | {"help", no_argument, NULL, 'h' }, 37 | {"add", no_argument, NULL, 'a' }, 38 | {"del", no_argument, NULL, 'x' }, 39 | {"ip", required_argument, NULL, 'i' }, 40 | {"classid", required_argument, NULL, 't' }, 41 | {"cpu", required_argument, NULL, 'c' }, 42 | {"list", no_argument, NULL, 'l' }, 43 | {"clear", no_argument, NULL, 'e' }, 44 | {0, 0, NULL, 0 } 45 | }; 46 | 47 | static void usage(char *argv[]) 48 | { 49 | int i; 50 | printf("\nDOCUMENTATION:\n%s\n", __doc__); 51 | 
printf("\n"); 52 | printf(" Usage: %s (options-see-below)\n", 53 | argv[0]); 54 | printf(" Listing options:\n"); 55 | for (i = 0; long_options[i].name != 0; i++) { 56 | printf(" --%-12s", long_options[i].name); 57 | if (long_options[i].flag != NULL) 58 | printf(" flag (internal value:%d)", 59 | *long_options[i].flag); 60 | else 61 | printf(" short-option: -%c", 62 | long_options[i].val); 63 | printf("\n"); 64 | } 65 | printf("\n"); 66 | } 67 | 68 | static bool get_key_value_ip_info(int fd, struct ip_hash_key key, struct ip_hash_info *ip_info) 69 | { 70 | if ((bpf_map_lookup_elem(fd, &key, ip_info)) != 0) { 71 | fprintf(stderr, 72 | "ERR: bpf_map_lookup_elem failed key:%u errno(%d):%s\n", 73 | key.address.__in6_u.__u6_addr32[3], errno, strerror(errno)); 74 | return false; 75 | } 76 | return true; 77 | } 78 | 79 | static void iphash_print_ip(struct ip_hash_key ip, struct ip_hash_info *ip_info,int i) 80 | { 81 | char ip_txt[INET6_ADDRSTRLEN] = {0}; 82 | __u32 prefix = 128; 83 | 84 | if (!ip_info) { 85 | fprintf(stderr, "ERR: %s() NULL pointer\n", __func__); 86 | exit(EXIT_FAIL); 87 | } 88 | 89 | if (ip.address.__in6_u.__u6_addr32[0] == 0xFFFFFFFF && ip.address.__in6_u.__u6_addr32[1] == 0xFFFFFFFF && ip.address.__in6_u.__u6_addr32[2] == 0xFFFFFFFF) { 90 | // It's IPv4 91 | if (!inet_ntop(AF_INET, &ip.address.__in6_u.__u6_addr32[3], ip_txt, sizeof(ip_txt))) { 92 | fprintf(stderr, 93 | "ERR: Cannot convert u32 IP:0x%X to IP-txt\n", ip.address.__in6_u.__u6_addr32[3]); 94 | exit(EXIT_FAIL_IP); 95 | } 96 | prefix = ip.prefixlen - 96; 97 | } else { 98 | // It's IPv6 99 | if (!inet_ntop(AF_INET6, &ip.address, ip_txt, sizeof(ip_txt))) { 100 | fprintf(stderr, 101 | "ERR: Cannot convert u128 IP:0x%X to IP-txt\n", ip.address.__in6_u.__u6_addr32[0]); 102 | exit(EXIT_FAIL_IP); 103 | } 104 | prefix = ip.prefixlen; 105 | } 106 | 107 | if (i > 0) 108 | printf(",\n"); 109 | __u16 ip_info_major = (TC_H_MAJOR(ip_info->tc_handle) >> 16); 110 | __u16 ip_info_minor = 
(TC_H_MINOR(ip_info->tc_handle)); 111 | printf("\"%s/%u\" : { \"cpu\" : %u, \"tc_maj\" : \"%X\" , \"tc_min\" : \"%X\" }", 112 | ip_txt, prefix, ip_info->cpu, ip_info_major, ip_info_minor); 113 | } 114 | static void iphash_list_all_ip(int fd) 115 | { 116 | struct ip_hash_key key, *prev_key = NULL; 117 | struct ip_hash_info ip_info; 118 | int err; 119 | int i = 0; 120 | printf("{\n"); 121 | while ((err = bpf_map_get_next_key(fd, prev_key, &key)) == 0) { 122 | if (!get_key_value_ip_info(fd, key, &ip_info)) { 123 | err = -1; 124 | break; 125 | } 126 | iphash_print_ip(key, &ip_info, i); 127 | prev_key = &key; 128 | i++; 129 | } 130 | printf("}\n"); 131 | /* Make sure err was result of last key reached */ 132 | if (err < 0 && errno != ENOENT) 133 | fprintf(stderr, 134 | "WARN: %s() didn't list all entries: err(%d/%d):%s\n", 135 | __func__, err, errno, strerror(errno)); 136 | } 137 | static void iphash_clear_all_ip(int fd) 138 | { 139 | struct ip_hash_key key, *prev_key = NULL; 140 | 141 | while (bpf_map_get_next_key(fd, prev_key, &key) == 0) { 142 | bpf_map_delete_elem(fd, &key); 143 | prev_key = &key; 144 | } 145 | } 146 | int open_bpf_map(const char *file) 147 | { 148 | int fd; 149 | 150 | fd = bpf_obj_get(file); 151 | if (fd < 0) { 152 | printf("ERR: Failed to open bpf map file:%s err(%d):%s\n", 153 | file, errno, strerror(errno)); 154 | exit(EXIT_FAIL_MAP_FILE); 155 | } 156 | return fd; 157 | } 158 | 159 | /* Handle classid parsing based on iproute source */ 160 | int get_tc_classid(__u32 *h, const char *str) 161 | { 162 | __u32 major, minor; 163 | char *p; 164 | 165 | major = TC_H_ROOT; 166 | if (strcmp(str, "root") == 0) 167 | goto ok; 168 | major = TC_H_UNSPEC; 169 | if (strcmp(str, "none") == 0) 170 | goto ok; 171 | major = strtoul(str, &p, 16); 172 | if (p == str) { 173 | major = 0; 174 | if (*p != ':') 175 | return -1; 176 | } 177 | if (*p == ':') { 178 | if (major >= (1<<16)) 179 | return -1; 180 | major <<= 16; 181 | str = p+1; 182 | minor = strtoul(str, &p, 
16); 183 | if (*p != 0) 184 | return -1; 185 | if (minor >= (1<<16)) 186 | return -1; 187 | major |= minor; 188 | } else if (*p != 0) 189 | return -1; 190 | 191 | ok: 192 | *h = major; 193 | return 0; 194 | } 195 | 196 | 197 | int main(int argc, char **argv) { 198 | # define STR_MAX 42 /* For trivial input validation */ 199 | char _ip_string_buf[STR_MAX] = {}; 200 | char *ip_string = NULL; 201 | unsigned int action = 0; 202 | int longindex = 0; 203 | bool do_list = false; 204 | bool do_clear = false; 205 | int opt; 206 | int fd; 207 | __u32 cpu = -1; 208 | __u32 tc_handle = 0; 209 | bool provided_classid = false; 210 | 211 | while ((opt = getopt_long(argc, argv, "hac:t:i:le", 212 | long_options, &longindex)) != -1) { 213 | switch (opt) { 214 | case 'a': 215 | action |= ACTION_ADD; 216 | break; 217 | case 'x': 218 | action |= ACTION_DEL; 219 | break; 220 | case 'c': 221 | cpu = strtoul(optarg, NULL, 0); 222 | break; 223 | case 'i': 224 | if (!optarg || strlen(optarg) >= STR_MAX) { 225 | printf("ERR: src ip too long or NULL\n"); 226 | goto fail_opt; 227 | } 228 | ip_string = (char *)&_ip_string_buf; 229 | strncpy(ip_string, optarg, STR_MAX); 230 | break; 231 | case 't': /* classid parse like iproute2 into __u32 tc_handle */ 232 | if ( get_tc_classid(&tc_handle, optarg) < 0) { 233 | printf("ERR: classid tc syntax (HEX) major:minor\n"); 234 | goto fail_opt; 235 | } 236 | // printf("Got --classid=%s handle:0x%X\n", optarg, tc_handle); 237 | provided_classid = true; 238 | break; 239 | case 'l': 240 | do_list = true; 241 | break; 242 | case 'e': 243 | do_clear = true; 244 | break; 245 | case 'h': 246 | fail_opt: 247 | default: 248 | usage(argv); 249 | return EXIT_FAIL_OPTION; 250 | } 251 | } 252 | 253 | if (bpf_fs_check()) { 254 | return EXIT_FAIL_MAP_FS; 255 | } 256 | 257 | if (do_list) { 258 | fd = open_bpf_map(mapfile_ip_hash); 259 | iphash_list_all_ip(fd); 260 | close(fd); 261 | return EXIT_OK; 262 | } 263 | if (do_clear) { 264 | fd = open_bpf_map(mapfile_ip_hash); 
265 | iphash_clear_all_ip(fd); 266 | close(fd); 267 | return EXIT_OK; 268 | } 269 | if (action == 0) { 270 | printf("ERR: required option --add or --del missing\n"); 271 | goto fail_opt; 272 | } 273 | if (!ip_string) { 274 | printf("ERR: required option --ip missing\n"); 275 | goto fail_opt; 276 | } 277 | if (action == ACTION_ADD && cpu == -1) { 278 | printf("ERR: required option --cpu missing when using --add\n"); 279 | goto fail_opt; 280 | } 281 | if (action == ACTION_ADD && !provided_classid) { 282 | printf("ERR: required option --classid missing when using --add\n"); 283 | goto fail_opt; 284 | } 285 | if (action) { 286 | int res = 0; 287 | 288 | if (!ip_string) { 289 | fprintf(stderr, 290 | "ERR: action requires data, e.g. option --ip\n"); 291 | goto fail_opt; 292 | } 293 | 294 | if (ip_string) { 295 | int txq_fd = open_bpf_map(mapfile_txq_config); 296 | fd = open_bpf_map(mapfile_ip_hash); 297 | res = iphash_modify(fd, ip_string, action, cpu, 298 | tc_handle, txq_fd); 299 | close(fd); 300 | close(txq_fd); 301 | } 302 | return res; 303 | } 304 | } 305 | -------------------------------------------------------------------------------- /src/xdp_iphash_to_cpu_kern.c: -------------------------------------------------------------------------------- 1 | //#include 2 | #include 3 | 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | 15 | #include 16 | #include 17 | 18 | #include "common_kern_user.h" 19 | #include "shared_maps.h" 20 | 21 | struct vlan_hdr { 22 | __be16 h_vlan_TCI; 23 | __be16 h_vlan_encapsulated_proto; 24 | }; 25 | 26 | /* Special map type that can XDP_REDIRECT frames to another CPU */ 27 | struct { 28 | __uint(type, BPF_MAP_TYPE_CPUMAP); 29 | __uint(max_entries, MAX_CPUS); 30 | __type(key, __u32); 31 | __type(value, __u32); 32 | __uint(pinning, LIBBPF_PIN_BY_NAME); 33 | } cpu_map SEC(".maps"); 34 | 35 | struct { 36 | __uint(type, BPF_MAP_TYPE_ARRAY); 37 | __uint(max_entries,
MAX_CPUS); 38 | __type(key, __u32); 39 | __type(value, __u32); 40 | } cpus_available SEC(".maps"); 41 | 42 | #ifdef DEBUG 43 | /* Only use this for debug output. Notice output from bpf_trace_printk() 44 | * end-up in /sys/kernel/debug/tracing/trace_pipe 45 | */ 46 | #define bpf_debug(fmt, ...) \ 47 | ({ \ 48 | char ____fmt[] = "(xdp) " fmt; \ 49 | bpf_trace_printk(____fmt, sizeof(____fmt), \ 50 | ##__VA_ARGS__); \ 51 | }) 52 | #else 53 | #define bpf_debug(fmt, ...) { } while (0) 54 | #endif 55 | 56 | /* Parse Ethernet layer 2, extract network layer 3 offset and protocol 57 | * 58 | * Returns false on error and non-supported ether-type 59 | */ 60 | static __always_inline 61 | bool parse_eth(struct ethhdr *eth, void *data_end, 62 | __u16 *eth_proto, __u32 *l3_offset) 63 | { 64 | __u16 eth_type; 65 | __u64 offset; 66 | 67 | offset = sizeof(*eth); 68 | if ((void *)eth + offset > data_end) 69 | return false; 70 | 71 | eth_type = eth->h_proto; 72 | 73 | /* Skip non 802.3 Ethertypes */ 74 | if (bpf_ntohs(eth_type) < ETH_P_802_3_MIN) 75 | return false; 76 | 77 | /* Handle VLAN tagged packet */ 78 | if (eth_type == bpf_htons(ETH_P_8021Q) || 79 | eth_type == bpf_htons(ETH_P_8021AD)) { 80 | struct vlan_hdr *vlan_hdr; 81 | 82 | vlan_hdr = (void *)eth + offset; 83 | offset += sizeof(*vlan_hdr); 84 | if ((void *)eth + offset > data_end) 85 | return false; 86 | eth_type = vlan_hdr->h_vlan_encapsulated_proto; 87 | } 88 | /* Handle double VLAN tagged packet */ 89 | if (eth_type == bpf_htons(ETH_P_8021Q) || 90 | eth_type == bpf_htons(ETH_P_8021AD)) { 91 | struct vlan_hdr *vlan_hdr; 92 | 93 | vlan_hdr = (void *)eth + offset; 94 | offset += sizeof(*vlan_hdr); 95 | if ((void *)eth + offset > data_end) 96 | return false; 97 | eth_type = vlan_hdr->h_vlan_encapsulated_proto; 98 | } 99 | 100 | *eth_proto = bpf_ntohs(eth_type); 101 | *l3_offset = offset; 102 | return true; 103 | } 104 | 105 | static __always_inline struct ip_hash_info *get_ip_info(struct ip_hash_key *ip) 106 | { 107 | 
struct ip_hash_info *ip_info; 108 | 109 | ip_info = bpf_map_lookup_elem(&map_ip_hash, ip); 110 | if (!ip_info) { 111 | struct ip_hash_key null_addr; 112 | null_addr.prefixlen = 128; 113 | null_addr.address.in6_u.u6_addr32[0] = 0xFFFFFFFF; 114 | null_addr.address.in6_u.u6_addr32[1] = 0xFFFFFFFF; 115 | null_addr.address.in6_u.u6_addr32[2] = 0xFFFFFFFF; 116 | null_addr.address.in6_u.u6_addr32[3] = 0xFFFFFFFF; 117 | /* On LAN side (XDP-ingress) some uncategorized traffic is 118 | * expected, e.g. services like DHCP are running and IPs 119 | * contacting captive portal (which are not yet configured) 120 | */ 121 | // bpf_debug("cant find ip_info->cpu id for ip:%u\n", ip); 122 | // the all-ones key built above (every word 0xFFFFFFFF) is for default traffic 123 | ip_info = bpf_map_lookup_elem(&map_ip_hash, &null_addr); 124 | } 125 | return ip_info; 126 | } 127 | 128 | static __always_inline 129 | __u32 parse_ip(struct xdp_md *ctx, __u32 l3_offset, __u32 ifindex, __u16 eth_proto) 130 | { 131 | void *data_end = (void *)(long)ctx->data_end; 132 | void *data = (void *)(long)ctx->data; 133 | 134 | /* alias pointers used for v4/v6 based on version */ 135 | struct iphdr *iph = data + l3_offset; 136 | struct ipv6hdr *ip6h = data + l3_offset; 137 | __u32 *direction_lookup; 138 | __u32 direction; 139 | struct ip_hash_info *ip_info; 140 | //u32 *cpu_id_lookup; 141 | __u32 cpu_id; 142 | __u32 *cpu_lookup; 143 | __u32 cpu_dest; 144 | 145 | /* Setup the ip_hash_key lookup structure */ 146 | struct ip_hash_key lookup; 147 | lookup.prefixlen = 128; 148 | lookup.address.in6_u.u6_addr32[0] = 0xFFFFFFFF; 149 | lookup.address.in6_u.u6_addr32[1] = 0xFFFFFFFF; 150 | lookup.address.in6_u.u6_addr32[2] = 0xFFFFFFFF; 151 | lookup.address.in6_u.u6_addr32[3] = 0xFFFFFFFF; 152 | 153 | /* WAN or LAN interface?
*/ 154 | direction_lookup = bpf_map_lookup_elem(&map_ifindex_type, &ifindex); 155 | if (!direction_lookup) 156 | return XDP_PASS; 157 | direction = *direction_lookup; 158 | if (direction != INTERFACE_WAN && direction != INTERFACE_LAN) { 159 | bpf_debug("Can't determine ifindex(%u) direction\n", ifindex); 160 | return XDP_PASS; 161 | } 162 | 163 | /* Caller only passes IPv4 or IPv6, so branch on the EtherType 164 | * instead of re-checking the IP header version field 165 | */ 166 | if (eth_proto == ETH_P_IP) { /* IPv4 */ 167 | /* Hint: +1 is sizeof(struct iphdr) */ 168 | if (iph + 1 > data_end) { 169 | bpf_debug("Invalid IPv4 packet: L3off:%llu\n", l3_offset); 170 | return XDP_PASS; 171 | } 172 | 173 | /* Extract key, XDP operates at "ingress" */ 174 | if (direction == INTERFACE_WAN) { 175 | lookup.address.in6_u.u6_addr32[3] = iph->daddr; 176 | } else if (direction == INTERFACE_LAN) { 177 | lookup.address.in6_u.u6_addr32[3] = iph->saddr; 178 | } 179 | } else { /* IPv6 */ 180 | if (ip6h + 1 > data_end) { 181 | bpf_debug("Invalid IPv6 packet: L3off:%llu\n", l3_offset); 182 | return XDP_PASS; 183 | } 184 | if (direction == INTERFACE_WAN) 185 | lookup.address = ip6h->daddr; 186 | else if (direction == INTERFACE_LAN) 187 | lookup.address = ip6h->saddr; 188 | } 189 | 190 | ip_info = get_ip_info(&lookup); 191 | if (!ip_info) { 192 | bpf_debug("cant find default cpu_idx_lookup\n"); 193 | return XDP_PASS; 194 | } 195 | cpu_id = ip_info->cpu; 196 | 197 | /* The CPUMAP type doesn't allow to bpf_map_lookup_elem (see 198 | * verifier.c check_map_func_compatibility()). Thus, maintain 199 | * another map that says if a CPU is avail for redirect.
200 | */ 201 | cpu_lookup = bpf_map_lookup_elem(&cpus_available, &cpu_id); 202 | if (!cpu_lookup) { 203 | bpf_debug("cant find cpu_lookup\n"); 204 | return XDP_PASS; 205 | } 206 | cpu_dest = *cpu_lookup; 207 | if (cpu_dest >= MAX_CPUS) { 208 | /* _user side set/marked non-configured CPUs with MAX_CPUS */ 209 | bpf_debug("cpu_dest too high %i\n",cpu_dest); 210 | return XDP_PASS; 211 | } 212 | 213 | return bpf_redirect_map(&cpu_map, cpu_dest, 0); 214 | } 215 | 216 | static __always_inline 217 | __u32 handle_eth_protocol(struct xdp_md *ctx, __u16 eth_proto, __u32 l3_offset, 218 | __u32 ifindex) 219 | { 220 | __u32 action; 221 | 222 | switch (eth_proto) { 223 | case ETH_P_IP: 224 | case ETH_P_IPV6: 225 | action = parse_ip(ctx, l3_offset, ifindex, eth_proto); 226 | return action; 227 | break; 228 | case ETH_P_ARP: /* Let OS handle ARP */ 229 | /* Fall-through */ 230 | default: 231 | // ARP traffic is handled locally on RX CPU 232 | // bpf_debug("Not handling eth_proto:0x%x\n", eth_proto); 233 | return XDP_PASS; 234 | } 235 | return XDP_PASS; 236 | } 237 | 238 | SEC("xdp") 239 | int xdp_iphash_to_cpu(struct xdp_md *ctx) 240 | { 241 | void *data_end = (void *)(long)ctx->data_end; 242 | void *data = (void *)(long)ctx->data; 243 | __u32 ifindex = ctx->ingress_ifindex; 244 | struct ethhdr *eth = data; 245 | __u16 eth_proto = 0; 246 | __u32 l3_offset = 0; 247 | __u32 action; 248 | 249 | if (!(parse_eth(eth, data_end, &eth_proto, &l3_offset))) { 250 | bpf_debug("Cannot parse L2: L3off:%llu proto:0x%x\n", 251 | l3_offset, eth_proto); 252 | return XDP_PASS; /* Skip */ 253 | } 254 | 255 | action = handle_eth_protocol(ctx, eth_proto, l3_offset, ifindex); 256 | 257 | //stats_action_verdict(action); 258 | return action; 259 | } 260 | 261 | char _license[] SEC("license") = "GPL"; 262 | -------------------------------------------------------------------------------- /src/xdp_iphash_to_cpu_user.c: -------------------------------------------------------------------------------- 1 | static
const char *__doc__= 2 | " XDP: Lookup IPv4 and redirect to CPU hash\n" 3 | "\n" 4 | "This program loads the XDP eBPF program into the kernel.\n" 5 | "Use the cmdline tool for add/removing dest IPs to the hash\n" 6 | ; 7 | 8 | #include 9 | 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | 25 | #include 26 | #include 27 | #include 28 | 29 | #include 30 | #include /* dirname */ 31 | 32 | #include 33 | #include 34 | 35 | #include 36 | #include 37 | #include 38 | 39 | #include "common_user.h" 40 | #include "common_kern_user.h" 41 | 42 | static char ifname_buf[IF_NAMESIZE]; 43 | static char *ifname = NULL; 44 | static int ifindex = -1; 45 | 46 | static const struct option long_options[] = { 47 | {"help", no_argument, NULL, 'h' }, 48 | {"remove", no_argument, NULL, 'r' }, 49 | {"dev", required_argument, NULL, 'd' }, 50 | {"wan", no_argument, NULL, 'w' }, 51 | {"lan", no_argument, NULL, 'l' }, 52 | {"cpu", required_argument, NULL, 'c' }, 53 | {"quiet", no_argument, NULL, 'q' }, 54 | {"owner", required_argument, NULL, 'o' }, 55 | {"skb-mode", no_argument, NULL, 'S' }, 56 | {"all-cpus", no_argument, NULL, 'a' }, 57 | {"qsize", required_argument, NULL, 's' }, 58 | {0, 0, NULL, 0 } 59 | }; 60 | 61 | static void usage(char *argv[]) 62 | { 63 | int i; 64 | printf("\nDOCUMENTATION:\n%s\n", __doc__); 65 | printf(" Usage: %s (options-see-below)\n", 66 | argv[0]); 67 | printf(" Listing options:\n"); 68 | for (i = 0; long_options[i].name != 0; i++) { 69 | printf(" --%-12s", long_options[i].name); 70 | if (long_options[i].flag != NULL) 71 | printf(" flag (internal value:%d)", 72 | *long_options[i].flag); 73 | else 74 | printf(" short-option: -%c", 75 | long_options[i].val); 76 | printf("\n"); 77 | } 78 | printf("\n"); 79 | } 80 | 81 | /* TODO move to libbpf */ 82 | struct bpf_pinned_map { 83 | const char *name; 84 | const char *filename; 85 | 
int map_fd; 86 | }; 87 | 88 | /* bpf_prog_load_attr extended */ 89 | struct bpf_prog_load_attr_maps { 90 | const char *file; 91 | enum bpf_prog_type prog_type; 92 | enum bpf_attach_type expected_attach_type; 93 | int ifindex; 94 | int nr_pinned_maps; 95 | struct bpf_pinned_map *pinned_maps; 96 | }; 97 | 98 | /* below: shared pinned maps */ 99 | static int ip_hash_map_fd = -1; 100 | static int ifindex_type_map_fd = -1; 101 | 102 | /* below: private maps */ 103 | static int cpu_map_fd = -1; 104 | static int cpus_available_map_fd = -1; 105 | 106 | static int find_map_fd_by_name(struct bpf_object *obj, 107 | const char *mapname, 108 | struct bpf_prog_load_attr_maps *attr) 109 | { 110 | int map_fd, i; 111 | 112 | /* Prefer using libbpf function to find_fd_by_name */ 113 | map_fd = bpf_object__find_map_fd_by_name(obj, mapname); 114 | if (map_fd > 0) 115 | return map_fd; 116 | 117 | /* If an old TC tool created and pinned map then it have no "name". 118 | * In that case use the FD that was returned when opening pinned file. 
119 | */ 120 | for (i = 0; i < attr->nr_pinned_maps; i++) { 121 | struct bpf_pinned_map *pin_map = &attr->pinned_maps[i]; 122 | 123 | if (strcmp(mapname, pin_map->name) != 0) 124 | continue; 125 | 126 | /* Matched, use FD stored in bpf_pinned_map */ 127 | map_fd = pin_map->map_fd; 128 | if (verbose) 129 | printf("TC workaround for mapname: %s map_fd:%d\n", 130 | mapname, map_fd); 131 | } 132 | return map_fd; 133 | } 134 | 135 | static int init_map_fds(struct bpf_object *obj, 136 | struct bpf_prog_load_attr_maps *attr) 137 | { 138 | cpu_map_fd = find_map_fd_by_name(obj,"cpu_map", attr); 139 | cpus_available_map_fd= find_map_fd_by_name(obj,"cpus_available",attr); 140 | ip_hash_map_fd = find_map_fd_by_name(obj,"map_ip_hash", attr); 141 | ifindex_type_map_fd = find_map_fd_by_name(obj,"map_ifindex_type",attr); 142 | 143 | if (cpu_map_fd < 0 || ip_hash_map_fd < 0 || 144 | cpus_available_map_fd < 0 || ifindex_type_map_fd < 0) { 145 | fprintf(stderr, 146 | "FDs cpu_map:%d ip_hash:%d cpus_avail:%d ifindex:%d\n", 147 | cpu_map_fd, ip_hash_map_fd, 148 | cpus_available_map_fd, ifindex_type_map_fd); 149 | return -ENOENT; 150 | } 151 | 152 | return 0; 153 | } 154 | 155 | static int create_cpu_entry(__u32 cpu, __u32 queue_size) 156 | { 157 | int ret; 158 | 159 | /* Add a CPU entry to cpumap, as this allocate a cpu entry in 160 | * the kernel for the cpu. 161 | */ 162 | /* map: cpu_map */ 163 | ret = bpf_map_update_elem(cpu_map_fd, &cpu, &queue_size, 0); 164 | if (ret) { 165 | fprintf(stderr, "Create CPU entry failed err(%d):%s\n", 166 | errno, strerror(errno)); 167 | exit(EXIT_FAIL_BPF); 168 | } 169 | 170 | /* Inform bpf_prog's that a new CPU is available to select 171 | * from via another maps, because eBPF prog side cannot lookup 172 | * directly in cpu_map. 
173 | */ 174 | /* map = cpus_available */ 175 | ret = bpf_map_update_elem(cpus_available_map_fd, &cpu, &cpu, 0); 176 | if (ret) { 177 | fprintf(stderr, "Add to avail CPUs failed err(%d):%s\n", 178 | errno, strerror(errno)); 179 | exit(EXIT_FAIL_BPF); 180 | } 181 | 182 | if (verbose) 183 | printf("Added CPU:%u queue_size:%d\n", cpu, queue_size); 184 | 185 | return 0; 186 | } 187 | 188 | /* CPUs are zero-indexed. A special sentinel default value in map 189 | * cpus_available to mark CPU index'es not configured 190 | */ 191 | static void mark_cpus_available(bool cpus[MAX_CPUS], __u32 queue_size, bool add_all_cpu) 192 | { 193 | unsigned int possible_cpus = bpf_num_possible_cpus(); 194 | __u32 invalid_cpu = MAX_CPUS; 195 | int ret, i; 196 | 197 | /* add all available CPUs in system */ 198 | if (add_all_cpu == true) 199 | for (i = 0; i < possible_cpus; i++) 200 | cpus[i] = true; 201 | 202 | for (i = 0; i < MAX_CPUS; i++) { 203 | 204 | if (cpus[i] == true) { 205 | create_cpu_entry(i, queue_size); 206 | } else { 207 | /* map: cpus_available */ 208 | ret = bpf_map_update_elem(cpus_available_map_fd, 209 | &i, &invalid_cpu, 0); 210 | if (ret) { 211 | fprintf(stderr, "Failed marking CPU unavailable\n"); 212 | exit(EXIT_FAIL_BPF); 213 | } 214 | } 215 | } 216 | } 217 | 218 | static void remove_xdp_program(int ifindex, const char *ifname, __u32 xdp_flags) 219 | { 220 | const char *file = mapfile_ip_hash; 221 | __u32 dir = INTERFACE_NONE; 222 | 223 | if (verbose) { 224 | fprintf(stderr, "Removing XDP program on ifindex:%d device:%s\n", 225 | ifindex, ifname); 226 | } 227 | if (ifindex > -1) { 228 | bpf_xdp_attach(ifindex, -1, xdp_flags, NULL); 229 | if (bpf_map_update_elem(ifindex_type_map_fd, 230 | &ifindex, &dir, 0) < 0) { 231 | fprintf(stderr, "ERR: Clear ifindex type failed \n"); 232 | } 233 | } 234 | 235 | /* map file is possibly share, cannot remove it here */ 236 | if (verbose) 237 | fprintf(stderr, 238 | "INFO: not cleanup pinned map file:%s (use 'rm')\n", 239 | file); 240 
| } 241 | 242 | void chown_maps(uid_t owner, gid_t group, const char *file) 243 | { 244 | /* Change owner and group of the map file, as this allows 245 | * an unprivileged user to operate the cmdline tool. 246 | */ 247 | if (chown(file, owner, group) < 0) 248 | fprintf(stderr, 249 | "WARN: Cannot chown file:%s err(%d):%s\n", 250 | file, errno, strerror(errno)); 251 | } 252 | 253 | /* From: include/linux/err.h */ 254 | #define MAX_ERRNO 4095 255 | #define IS_ERR_VALUE(x) ((x) >= (unsigned long)-MAX_ERRNO) 256 | static inline bool IS_ERR_OR_NULL(const void *ptr) 257 | { 258 | return (!ptr) || IS_ERR_VALUE((unsigned long)ptr); 259 | } 260 | 261 | #define pr_warning printf 262 | 263 | /* As close as possible to libbpf bpf_prog_load_xattr(), with the 264 | * difference of handling pinned maps. 265 | */ 266 | int bpf_prog_load_xattr_maps(const struct bpf_prog_load_attr_maps *attr, 267 | struct bpf_object **pobj, int *prog_fd) 268 | { 269 | LIBBPF_OPTS(bpf_object_open_opts, opts, 270 | .pin_root_path = BASEDIR_MAPS); 271 | struct bpf_program *first_prog = NULL; 272 | struct bpf_object *obj; 273 | struct bpf_map *map; 274 | int err, i; 275 | 276 | if (!attr) 277 | return -EINVAL; 278 | if (!attr->file) 279 | return -EINVAL; 280 | 281 | obj = bpf_object__open_file(attr->file, &opts); 282 | if (IS_ERR_OR_NULL(obj)) 283 | return -ENOENT; 284 | 285 | first_prog = bpf_object__next_program(obj, NULL); 286 | 287 | /* Reset attr->pinned_maps.map_fd to identify successful file load */ 288 | for (i = 0; i < attr->nr_pinned_maps; i++) 289 | attr->pinned_maps[i].map_fd = -1; 290 | 291 | if (!first_prog) { 292 | pr_warning("object file doesn't contain bpf program\n"); 293 | bpf_object__close(obj); 294 | return -ENOENT; 295 | } 296 | 297 | err = bpf_object__load(obj); 298 | if (err) { 299 | bpf_object__close(obj); 300 | return -EINVAL; 301 | } 302 | 303 | /* Pin the maps that were not loaded via pinned filename */ 304 | bpf_object__for_each_map(map, obj) { 305 | const char* mapname
= bpf_map__name(map); 306 | 307 | for (i = 0; i < attr->nr_pinned_maps; i++) { 308 | struct bpf_pinned_map *pin_map = &attr->pinned_maps[i]; 309 | 310 | if (strcmp(mapname, pin_map->name) != 0) 311 | continue; 312 | 313 | /* Matched, check if map is already loaded */ 314 | if (pin_map->map_fd != -1) 315 | continue; 316 | 317 | pin_map->map_fd = bpf_map__fd(map); 318 | } 319 | } 320 | 321 | /* Help user if requested map name that doesn't exist */ 322 | for (i = 0; i < attr->nr_pinned_maps; i++) { 323 | struct bpf_pinned_map *pin_map = &attr->pinned_maps[i]; 324 | 325 | if (pin_map->map_fd < 0) 326 | pr_warning("%s() requested mapname:%s not seen\n", 327 | __func__, pin_map->name); 328 | } 329 | 330 | *pobj = obj; 331 | *prog_fd = bpf_program__fd(first_prog); 332 | return 0; 333 | } 334 | 335 | 336 | int main(int argc, char **argv) 337 | { 338 | struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; 339 | bool cpus[MAX_CPUS] = { false }; 340 | bool add_all_cpus = true; /* Default add all CPU if no-others are provided */ 341 | bool rm_xdp_prog = false; 342 | struct passwd *pwd = NULL; 343 | __u32 xdp_flags = 0; 344 | char filename[512]; 345 | __u32 qsize; 346 | int longindex = 0; 347 | uid_t owner = -1; /* -1 result in no-change of owner */ 348 | gid_t group = -1; 349 | int dir = 0; 350 | int added_cpus = 0; 351 | int add_cpu = -1; 352 | int err; 353 | int opt; 354 | 355 | /* libbpf */ 356 | struct bpf_prog_info info = {}; 357 | __u32 info_len = sizeof(info); 358 | struct bpf_object *obj; 359 | int prog_fd; 360 | 361 | struct bpf_pinned_map my_pinned_maps[4]; 362 | struct bpf_prog_load_attr_maps prog_load_attr_maps = { 363 | .prog_type = BPF_PROG_TYPE_XDP, 364 | .nr_pinned_maps = 3, 365 | }; 366 | my_pinned_maps[0].name = "map_ip_hash"; 367 | my_pinned_maps[0].filename = mapfile_ip_hash; 368 | my_pinned_maps[1].name = "map_ifindex_type"; 369 | my_pinned_maps[1].filename = mapfile_ifindex_type; 370 | my_pinned_maps[2].name = "cpu_map"; 371 | my_pinned_maps[2].filename = 
mapfile_cpu_map; 372 | 373 | prog_load_attr_maps.pinned_maps = my_pinned_maps; 374 | 375 | /* Notice: Choosing the queue size is very important when the CPU is 376 | * configured with power-saving states. 377 | * 378 | * Assume the deepest state takes 133 usec to wake up from (133/10^6 sec) 379 | * and the link speed is 10Gbit/s ((10*10^9/8) bytes/sec). The bytes that 380 | * can arrive within 133 usec at this speed: (10*10^9/8)*(133/10^6) = 381 | * 166250 bytes. With MTU size packets this is 110 packets, and with 382 | * minimum Ethernet frames (84 bytes incl. inter-frame gap) 1979 packets. 383 | * 384 | * Default cpumap queue is set to 2048 because the worst-case (small 385 | * packets) grows by +64 packets due to kthread wakeup delay (due to 386 | * xdp_do_flush), giving a worst-case of 2043 packets. 387 | * 388 | * A sysadmin can configure the system to avoid deep-sleep via: 389 | * tuned-adm profile network-latency 390 | */ 391 | qsize = 2048; 392 | 393 | /* Depend on sharing pinned maps */ 394 | if (bpf_fs_check_and_fix()) { 395 | fprintf(stderr, "ERR: " 396 | "Need access to bpf-fs(%s) for pinned maps " 397 | "(%d): %s\n", BPF_DIR_MNT, errno, strerror(errno)); 398 | return EXIT_FAIL_MAP_FS; 399 | } 400 | 401 | if (!locate_kern_object(argv[0], filename, sizeof(filename))) { 402 | fprintf(stderr, "ERR: " 403 | "cannot locate BPF _kern.o ELF file:%s errno(%d): %s\n", 404 | filename, errno, strerror(errno)); 405 | return EXIT_FAIL; 406 | } 407 | //prog_open_attr.file = filename; 408 | prog_load_attr_maps.file = filename; 409 | 410 | /* Parse command line args */ 411 | while ((opt = getopt_long(argc, argv, "hSrqd:wlc:o:s:a", 412 | long_options, &longindex)) != -1) { 413 | switch (opt) { 414 | case 'q': 415 | verbose = 0; 416 | break; 417 | case 'r': 418 | rm_xdp_prog = true; 419 | break; 420 | case 'o': /* extract owner and group from username */ 421 | if (!(pwd = getpwnam(optarg))) { 422 | fprintf(stderr, 423 | "ERR: unknown owner:%s err(%d):%s\n", 424 | optarg, errno, strerror(errno)); 425 | goto error;
426 | } 427 | owner = pwd->pw_uid; 428 | group = pwd->pw_gid; 429 | break; 430 | case 'd': 431 | if (strlen(optarg) >= IF_NAMESIZE) { 432 | fprintf(stderr, "ERR: --dev name too long\n"); 433 | goto error; 434 | } 435 | ifname = (char *)&ifname_buf; 436 | strncpy(ifname, optarg, IF_NAMESIZE); 437 | ifindex = if_nametoindex(ifname); 438 | if (ifindex == 0) { 439 | fprintf(stderr, 440 | "ERR: --dev name unknown err(%d):%s\n", 441 | errno, strerror(errno)); 442 | goto error; 443 | } 444 | if (ifindex >= MAX_IFINDEX) { 445 | fprintf(stderr, 446 | "ERR: Fix MAX_IFINDEX err(%d):%s\n", 447 | errno, strerror(errno)); 448 | goto error; 449 | } 450 | break; 451 | case 'S': 452 | xdp_flags |= XDP_FLAGS_SKB_MODE; 453 | break; 454 | case 'w': 455 | if (dir != 0) { 456 | fprintf(stderr, 457 | "ERR: set either --wan or --lan\n"); 458 | goto error; 459 | } 460 | dir = INTERFACE_WAN; 461 | break; 462 | case 'l': 463 | if (dir != 0) { 464 | fprintf(stderr, 465 | "ERR: set either --wan or --lan\n"); 466 | goto error; 467 | } 468 | dir = INTERFACE_LAN; 469 | break; 470 | case 'a': 471 | add_all_cpus = true; 472 | break; 473 | case 's': 474 | qsize = strtoul(optarg, NULL, 0); 475 | if (qsize == 0 || errno || (__s32)qsize < 0) { 476 | fprintf(stderr, "ERR(%d): Invalid --qsize %d\n", 477 | errno, qsize); 478 | goto error; 479 | } 480 | break; 481 | case 'c': 482 | add_all_cpus = false; 483 | add_cpu = strtoul(optarg, NULL, 0); 484 | if (add_cpu < 0 || add_cpu >= MAX_CPUS) { 485 | fprintf(stderr, 486 | "--cpu nr out of range for cpumap err(%d):%s\n", 487 | errno, strerror(errno)); 488 | goto error; 489 | } 490 | cpus[add_cpu] = true; 491 | added_cpus++; 492 | break; 493 | case 'h': 494 | error: 495 | default: 496 | usage(argv); 497 | return EXIT_FAIL_OPTION; 498 | } 499 | } 500 | if (ifindex == -1) { 501 | fprintf(stderr, "ERR: required option --dev missing\n"); 502 | usage(argv); 503 | return EXIT_FAIL_OPTION; 504 | } 505 | 506 | if (rm_xdp_prog) { 507 | remove_xdp_program(ifindex, ifname, xdp_flags); 508 |
return EXIT_OK; 509 | } 510 | /* Required option */ 511 | if (dir == 0) { 512 | fprintf(stderr,"ERR: set either --wan or --lan\n"); 513 | goto error; 514 | } 515 | 516 | /* Increase resource limits */ 517 | if (setrlimit(RLIMIT_MEMLOCK, &r)) { 518 | perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"); 519 | return 1; 520 | } 521 | 522 | if (bpf_prog_load_xattr_maps(&prog_load_attr_maps, &obj, &prog_fd)) { 523 | fprintf(stderr,"ERR: Failed loading BPF-prog\n"); 524 | return EXIT_FAIL_BPF; 525 | } 526 | 527 | if (owner >= 0) 528 | chown_maps(owner, group, mapfile_ip_hash); 529 | 530 | if (init_map_fds(obj, &prog_load_attr_maps) < 0) { 531 | fprintf(stderr, "bpf_object__find_map_fd_by_name failed\n"); 532 | return EXIT_FAIL; 533 | } 534 | 535 | /* The CPUMAP type doesn't allow to bpf_map_lookup_elem (from 536 | * eBPF prog side _kern.c). Thus, maintain another map that 537 | * says if a CPU is avail for redirect. 538 | */ 539 | mark_cpus_available(cpus, qsize, add_all_cpus); 540 | 541 | /* Set LAN or WAN type direction */ 542 | if (bpf_map_update_elem(ifindex_type_map_fd, &ifindex, &dir, 0) < 0) { 543 | fprintf(stderr, "ERR: create ifindex direction type failed \n"); 544 | return (EXIT_FAIL_BPF); 545 | } 546 | if ((err = bpf_xdp_attach(ifindex, prog_fd, xdp_flags, NULL)) < 0) { 547 | fprintf(stderr, "ERR: link set xdp fd failed (err:%d)\n", err); 548 | return EXIT_FAIL_XDP; 549 | } 550 | 551 | err = bpf_obj_get_info_by_fd(prog_fd, &info, &info_len); 552 | if (err) { 553 | fprintf(stderr, "ERR: can't get prog info - %s\n", 554 | strerror(errno)); 555 | return err; 556 | } 557 | 558 | if (verbose) { 559 | printf("Documentation:\n%s\n", __doc__); 560 | printf(" - Attached to device:%s (ifindex:%d) prog_id:%d\n", 561 | ifname, ifindex, info.id); 562 | } 563 | 564 | return EXIT_OK; 565 | } 566 | -------------------------------------------------------------------------------- /src/xdp_pass_kern.c: 
-------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | #include <linux/bpf.h> 3 | #include "bpf_helpers.h" 4 | 5 | SEC("xdp") 6 | int xdp_prog(struct xdp_md *ctx) 7 | { 8 | return XDP_PASS; 9 | } 10 | -------------------------------------------------------------------------------- /src/xdp_pass_user.c: -------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: GPL-2.0 */ 2 | static const char *__doc__ = "Simple XDP prog doing XDP_PASS\n"; 3 | 4 | #include <errno.h> 5 | #include <stdio.h> 6 | #include <stdlib.h> 7 | #include <stdbool.h> 8 | #include <string.h> 9 | #include <getopt.h> 10 | 11 | #include <bpf/bpf.h> 12 | #include <bpf/libbpf.h> 13 | 14 | #include <net/if.h> 15 | #include <linux/if_link.h> /* depend on kernel-headers installed */ 16 | 17 | static int ifindex = -1; 18 | static char ifname_buf[IF_NAMESIZE]; 19 | static char *ifname; 20 | static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST; 21 | 22 | static const struct option long_options[] = { 23 | {"help", no_argument, NULL, 'h' }, 24 | {"dev", required_argument, NULL, 'd' }, 25 | {"skb-mode", no_argument, NULL, 'S' }, 26 | {"native-mode", no_argument, NULL, 'N' }, 27 | {"force", no_argument, NULL, 'F' }, 28 | {"unload", no_argument, NULL, 'U' }, 29 | {0, 0, NULL, 0 } 30 | }; 31 | 32 | static void usage(char *argv[]) 33 | { 34 | int i; 35 | 36 | printf("\nDOCUMENTATION:\n %s\n", __doc__); 37 | printf(" Usage: %s (options-see-below)\n", argv[0]); 38 | printf(" Listing options:\n"); 39 | for (i = 0; long_options[i].name != 0; i++) { 40 | printf(" --%-12s", long_options[i].name); 41 | if (long_options[i].flag != NULL) 42 | printf(" flag (internal value:%d)", 43 | *long_options[i].flag); 44 | else 45 | printf(" short-option: -%c", 46 | long_options[i].val); 47 | printf("\n"); 48 | } 49 | printf("\n"); 50 | } 51 | 52 | /* Exit return codes */ 53 | #define EXIT_OK 0 54 | #define EXIT_FAIL 1 55 | #define EXIT_FAIL_OPTION 2 56 | #define EXIT_FAIL_XDP 3 57 | 58 | static int xdp_unload(int ifindex_unload) 59 | { 60 |
int err; 61 | 62 | if ((err = bpf_xdp_attach(ifindex_unload, -1, xdp_flags, NULL)) < 0) { 63 | fprintf(stderr, "ERR: link set xdp unload failed (err=%d):%s\n", 64 | err, strerror(-err)); 65 | return EXIT_FAIL_XDP; 66 | } 67 | return EXIT_OK; 68 | } 69 | 70 | int main(int argc, char **argv) 71 | { 72 | struct bpf_prog_info info = {}; 73 | __u32 info_len = sizeof(info); 74 | struct bpf_object *obj; 75 | int prog_fd, opt, err; 76 | bool unload = false; 77 | char filename[256]; 78 | int longindex = 0; 79 | int ret = EXIT_FAIL; 80 | 81 | /* Parse command line args */ 82 | while ((opt = getopt_long(argc, argv, "hd:SNFU", 83 | long_options, &longindex)) != -1) { 84 | switch (opt) { 85 | case 'd': 86 | if (strlen(optarg) >= IF_NAMESIZE) { 87 | fprintf(stderr, "ERR: --dev name too long\n"); 88 | goto error; 89 | } 90 | ifname = (char *)&ifname_buf; 91 | strncpy(ifname, optarg, IF_NAMESIZE); 92 | ifindex = if_nametoindex(ifname); 93 | if (ifindex == 0) { 94 | fprintf(stderr, 95 | "ERR: --dev name unknown err(%d):%s\n", 96 | errno, strerror(errno)); 97 | goto error; 98 | } 99 | break; 100 | case 'S': 101 | xdp_flags |= XDP_FLAGS_SKB_MODE; 102 | break; 103 | case 'N': 104 | xdp_flags |= XDP_FLAGS_DRV_MODE; 105 | break; 106 | case 'F': 107 | xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; 108 | break; 109 | case 'U': 110 | unload = true; 111 | break; 112 | case 'h': 113 | error: 114 | default: 115 | usage(argv); 116 | return EXIT_FAIL_OPTION; 117 | } 118 | } 119 | /* Required option */ 120 | if (ifindex == -1) { 121 | fprintf(stderr, "ERR: required option --dev missing\n"); 122 | usage(argv); 123 | return EXIT_FAIL_OPTION; 124 | } 125 | if (unload) 126 | return xdp_unload(ifindex); 127 | 128 | snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); 129 | 130 | obj = bpf_object__open_file(filename, NULL); 131 | if (!obj) 132 | return EXIT_FAIL; 133 | 134 | if (bpf_object__load(obj)) { 135 | fprintf(stderr, "ERR: load_bpf_file: %s\n", strerror(errno)); 136 | goto out; 137 | } 138 |
139 | prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj, "xdp_prog")); 140 | if ((err = bpf_xdp_attach(ifindex, prog_fd, xdp_flags, NULL)) < 0) { 141 | fprintf(stderr, "ERR: link set xdp fd failed (err=%d):%s\n", 142 | err, strerror(-err)); 143 | ret = EXIT_FAIL_XDP; 144 | goto out; 145 | } 146 | 147 | err = bpf_obj_get_info_by_fd(prog_fd, &info, &info_len); 148 | if (err) { 149 | fprintf(stderr, "ERR: can't get prog info - %s\n", 150 | strerror(errno)); 151 | goto out; 152 | } 153 | 154 | printf("Success: Load XDP prog id=%d on device:%s ifindex:%d\n", 155 | info.id, ifname, ifindex); 156 | ret = EXIT_OK; 157 | 158 | out: 159 | bpf_object__close(obj); 160 | return ret; 161 | } 162 | --------------------------------------------------------------------------------