├── Linux 数据包收发内核路径图--0919.png
├── README.md
├── mytracer-bak-flannel-brprobe.py
└── mytracer.py

/Linux 数据包收发内核路径图--0919.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mYu4N/mytracer/975e4f217b5d62e34bf883b3b057e29fe175cde5/Linux 数据包收发内核路径图--0919.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
2024-10-22
Added monitoring of the tcp_v4_rcv and skb_copy_datagram_iter functions
--------------------
# About Mytracer:
Mytracer is a tool for analyzing the path packets take through the Linux kernel on send and receive. It quickly prints the kernel functions a packet traverses, along with its forwarding path through the iptables tables and chains. It is useful for troubleshooting kernel networking problems such as latency and intermittent packet loss, and it is also a convenient aid for anyone studying kernel networking.

1. Prints the iptables table/chain path a packet traverses, without having to enable iptables TRACE separately;
2. Traces a given IP address and port, printing the related call stacks and finer-grained function calls;
3. Helps locate the kernel source of the functions on the packet-forwarding path.

The Linux kernel packet send/receive path diagram:
![Linux 数据包收发内核路径图--0919.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666327164115-a4020fdc-33b5-4741-9232-210952f7e1e7.png#clientId=u75e24528-a37d-4&errorMessage=unknown%20error&from=ui&id=ufc7c7448&name=Linux%20%E6%95%B0%E6%8D%AE%E5%8C%85%E6%94%B6%E5%8F%91%E5%86%85%E6%A0%B8%E8%B7%AF%E5%BE%84%E5%9B%BE--0919.png&originHeight=1495&originWidth=1905&originalType=binary&ratio=1&rotation=0&showTitle=false&size=1569478&status=error&style=none&taskId=u26b185b7-f146-4ff0-a6bc-424be14208c&title=)

## GitHub:
[https://github.com/mYu4N/mytracer](https://github.com/mYu4N/mytracer)
## Installation:

1. Requires kernel 4.19 or later, e.g. Alinux 2 / Alinux 3.
2. Depends on the eBPF stack; install the bcc-tools collection:
```
yum install kernel-devel-`uname -r` bcc-tools
```

3. Download mytracer.py, then run python mytracer.py -h in the same directory for detailed usage.
4. Viewing the kernel source behind a function requires kernel-debuginfo. It is a good idea to keep kernel-debuginfo and kernel-devel installed on the system, then use the faddr2line tool to map a function+offset to a source line number:
```
yum install kernel-debuginfo-`uname -r`
wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/faddr2line
```

## Usage:
```
python mytracer.py -h
usage: mytracer.py [-h] [-H IPADDR] [--proto PROTO] [--icmpid ICMPID]
                   [-c CATCH_COUNT] [-P PORT] [-p PID] [-N NETNS]
                   [--dropstack] [--callstack] [--iptable] [--route] [--keep]
                   [-T] [-t]

Trace any packet through TCP/IP stack

optional arguments:
  -h, --help            show this help message and exit
  -H IPADDR, --ipaddr IPADDR
                        ip address
  --proto PROTO         tcp|udp|icmp|any
  --icmpid ICMPID       trace icmp id
  -c CATCH_COUNT, --catch-count CATCH_COUNT
                        catch and print count
  -P PORT, --port PORT  udp or tcp port
  -p PID, --pid PID     trace this PID only
  -N NETNS, --netns NETNS
                        trace this Network Namespace only
  --dropstack           output kernel stack trace when drop packet
  --callstack           output kernel stack trace -- print the full call stack
  --iptable             output iptable path
  --route               output route path
  --keep                keep trace packet all lifetime
  -T, --time            show HH:MM:SS.ms timestamp (millisecond timestamp; replaced by the new format [2022-10-21 10:32:31.419514 ])
  -t, --timestamp       show timestamp in seconds at us resolution (elapsed seconds since start; rarely useful)

examples:
mytracer.py                                      # trace all packets
mytracer.py --proto=icmp -H 1.2.3.4 --icmpid 22  # trace icmp packet with addr=1.2.3.4 and icmpid=22
mytracer.py --proto=tcp -H 1.2.3.4 -P 22         # trace tcp packet with addr=1.2.3.4:22
mytracer.py --proto=udp -H 1.2.3.4 -P 22         # trace udp packet with addr=1.2.3.4:22
mytracer.py -t -T -p 1 --debug -P 80 -H 127.0.0.1 --proto=tcp --callstack --icmpid=100 -N 10000


Output field description
Column 1: time the eBPF program captured the kernel event; millisecond-resolution timestamps are supported
Column 2: inode number of the network namespace the skb currently belongs to (compare with lsns -t net)
Column 3: the device skb->dev points to (detection of nil still to be fixed)
Column 4: destination MAC address of the packet when the event was captured
Column 5: packet info: L4 protocol + L3 addresses + L4 ports (T for TCP, U for UDP, I for ICMP; other protocols print the raw protocol number)
Column 6: trace info: skb memory address + skb->pkt_type + name of the capturing function (for netfilter captures: pf number + table + chain + verdict)

In column 6, skb->pkt_type has the following meaning (include/uapi/linux/if_packet.h):
/* Packet types */
#define PACKET_HOST         0   /* To us                */
#define PACKET_BROADCAST    1   /* To all               */
#define PACKET_MULTICAST    2   /* To group             */
#define PACKET_OTHERHOST    3   /* To someone else      */
#define PACKET_OUTGOING     4   /* Outgoing of any type */
#define PACKET_LOOPBACK     5   /* MC/BRD frame looped back */
#define PACKET_USER         6   /* To user space        */
#define PACKET_KERNEL       7   /* To kernel space      */
/* Unused, PACKET_FASTROUTE and PACKET_LOOPBACK are invisible to user space */
#define PACKET_FASTROUTE    6   /* Fastrouted frame     */

In column 6, the pf number has the following meaning (include/uapi/linux/netfilter.h):
enum {
    NFPROTO_UNSPEC =  0,
    NFPROTO_INET   =  1,
    NFPROTO_IPV4   =  2,
    NFPROTO_ARP    =  3,
    NFPROTO_NETDEV =  5,
    NFPROTO_BRIDGE =  7,
    NFPROTO_IPV6   = 10,
    NFPROTO_DECNET = 12,
    NFPROTO_NUMPROTO,
};
```

# mytracer scenario walkthroughs

## Simulated packet-tracing scenarios with mytracer:
### Scenario 1: tracing an RST

- Scenario: accessing an address/port the pod is not listening on, so the target resets the connection
- Approach: trace the packets with tcpdump, tcpdrop, and mytracer, and compare
- Client: access from within the cluster, across nodes (not the node hosting the pod); here 192.168.88.154
- Server: a port the target pod does not listen on; the pod address here is 192.168.40.230:8080
- Capture point: the host of the target pod
- Container environment: ACK terway-eniip ipvlan
```
# kubectl get pods -o wide my-gotools-5bc6dfcd75-j4tmk
NAME                          READY   STATUS    RESTARTS   AGE   IP               NODE                      NOMINATED NODE   READINESS GATES
my-gotools-5bc6dfcd75-j4tmk   1/1     Running   6          10d   192.168.40.230   cn-beijing.192.168.0.17
# kubectl exec -it my-gotools-5bc6dfcd75-j4tmk -- bash
bash-5.1# netstat -antpl
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
bash-5.1#
```
#### The basic capture tool: tcpdump
In this scenario, a tcpdump capture does show the RST sent by the target, but it cannot directly reveal the root cause of the RST.
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666233645895-4ed5418c-0a72-49fa-9972-fad2acdedcce.png#clientId=uac1f7f0f-0e76-4&errorMessage=unknown%20error&from=paste&height=376&id=t9ML2&name=image.png&originHeight=752&originWidth=2816&originalType=binary&ratio=1&rotation=0&showTitle=false&size=585744&status=error&style=none&taskId=u90132a85-55fd-4df8-9417-5e3e80fa772&title=&width=1408)
#### tcpdrop from bcc-tools
[tcpdrop](https://www.brendangregg.com/blog/2018-05-31/linux-tcpdrop.html) works by hooking the tcp_drop function. It prints details of the source and destination of the packet, the kernel's TCP session state, the TCP header flags, and the kernel stack trace that led to the drop. However, the RST for a non-listening port is not issued through tcp_drop, so a tool based on tcp_drop cannot see it.
Flows and function paths captured by /usr/share/bcc/tools/tcpdrop:
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666233590684-004e9fad-58ee-4718-ba2d-92892b7075c4.png#clientId=uac1f7f0f-0e76-4&errorMessage=unknown%20error&from=paste&height=451&id=ub7455b34&name=image.png&originHeight=902&originWidth=1946&originalType=binary&ratio=1&rotation=0&showTitle=false&size=353794&status=error&style=none&taskId=u32d446f3-2474-4980-8610-6f51d74952a&title=&width=973)
Many packets are dropped above; even after filtering by source IP, there is still no sign of the 8080 access we want:
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666233681221-562fc8c9-1b9e-4faf-8aff-47be1d889f3c.png#clientId=uac1f7f0f-0e76-4&errorMessage=unknown%20error&from=paste&height=483&id=ub998ea1e&name=image.png&originHeight=966&originWidth=1992&originalType=binary&ratio=1&rotation=0&showTitle=false&size=419853&status=error&style=none&taskId=u29c55c09-28ca-4995-8970-0a859d06c66&title=&width=996)

#### Tracing with mytracer.py:
With mytracer's default options, tracing the given IP address and port prints the per-function trace records. The detailed call paths are not expanded by default, so this output alone cannot directly show which function sent the RST.
```
# python mytracer.py --proto=tcp -H 192.168.88.154 -P 8080
time                       NETWORK_NS   INTERFACE  DEST_MAC      PKT_INFO                                           TRACE_INFO
[2022-11-04 10:26:26.103401 ][4026531992] eth1 00163e0cb838 T_SYN:192.168.88.154:34060->192.168.40.230:8080 ffff9fdef19eb600.0:napi_gro_receive
[2022-11-04 10:26:26.103676 ][4026531992] eth1 00163e0cb838 T_SYN:192.168.88.154:34060->192.168.40.230:8080 ffff9fdef19eb600.0:__netif_receive_skb
[2022-11-04 10:26:26.103790 ][4026533234] eth0 00163e0cb838 T_SYN:192.168.88.154:34060->192.168.40.230:8080 ffff9fdef19eb600.0:ip_rcv
[2022-11-04 10:26:26.103900 ][4026533234] eth0 00163e0cb838 T_SYN:192.168.88.154:34060->192.168.40.230:8080 ffff9fdef19eb600.0:ip_rcv_finish
[2022-11-04 10:26:26.104033 ][4026533234] nil  6573223a2031 T_ACK,RST:192.168.40.230:8080->192.168.88.154:34060 ffff9fdef19eb700.0:ip_output
[2022-11-04 10:26:26.104196 ][4026533234] eth0 6573223a2031 T_ACK,RST:192.168.40.230:8080->192.168.88.154:34060 ffff9fdef19eb700.0:ip_finish_output
[2022-11-04 10:26:26.104322 ][4026533234] eth0 6573223a2031 T_ACK,RST:192.168.40.230:8080->192.168.88.154:34060 ffff9fdef19eb700.0:__dev_queue_xmit
[2022-11-04 10:26:26.104443 ][4026531992] eth1 eeffffffffff T_ACK,RST:192.168.40.230:8080->192.168.88.154:34060 ffff9fdef19eb700.0:__dev_queue_xmit
```
Notes (this simulated scenario is not the terway-ipvlan mode):
1. Multiple NICs appear in the trace: even in terway+ipvlan mode the inbound direction does not bypass the eth0 protocol stack, and inode 4026531992 corresponds to the host's own network namespace.
2. For an ipvlan pod, switching into the pod and accessing an address rejected by iptables goes straight through: ipvlan egress is not affected by iptables and goes directly to the underlying device, whereas non-ipvlan traffic is affected.
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1667529964965-f561620c-6a0a-4202-882c-c19ad51985ec.png#clientId=ue4e3aeb0-27ac-4&from=paste&height=868&id=oRYWV&name=image.png&originHeight=868&originWidth=2368&originalType=binary&ratio=1&rotation=0&showTitle=false&size=482870&status=done&style=none&taskId=u6a782d4d-db53-4583-8c82-f5db4aee926&title=&width=2368)

Use mytracer's --callstack option to print every call stack in full. It costs more than the default mode, but is more helpful for troubleshooting. (To keep this short, only the last SYN and the first RST are shown.)

```
# python mytracer.py --proto=tcp -H 192.168.88.154 -P 8080 --callstack
time                       NETWORK_NS   INTERFACE  DEST_MAC      PKT_INFO                                           TRACE_INFO
.......
[2022-11-04 10:35:04.538548 ][4026533234] eth0 00163e0cb838 T_SYN:192.168.88.154:36454->192.168.40.230:8080 ffff9fdef19ea900.0:ip_rcv_finish
ip_rcv_finish+0x1
ip_rcv+0x3d
__netif_receive_skb_one_core+0x42
netif_receive_skb_internal+0x34
napi_gro_receive+0xbf
receive_buf+0xee
virtnet_poll+0x137
net_rx_action+0x266
__softirqentry_text_start+0xd1
irq_exit+0xd2
do_IRQ+0x54
ret_from_intr+0x0
cpuidle_enter_state+0xcb
do_idle+0x1cc
cpu_startup_entry+0x5f
start_secondary+0x197
secondary_startup_64+0xa4
[2022-11-04 10:35:04.539035 ][4026533234] nil 6173683a3834 T_ACK,RST:192.168.40.230:8080->192.168.88.154:36454 ffff9fdef19ea200.0:ip_output
ip_output+0x1
ip_send_skb+0x15
ip_send_unicast_reply+0x2c5
tcp_v4_send_reset+0x3c6   # send_reset at +0x3c6, reached from tcp_v4_rcv, is what sent the reset
tcp_v4_rcv+0x6d3          # TCP receive path; the reset was triggered from offset +0x6d3 of this function, so examine this function first
ip_local_deliver_finish+0x9c
ip_local_deliver+0x42
ip_rcv+0x3d
__netif_receive_skb_one_core+0x42
netif_receive_skb_internal+0x34
napi_gro_receive+0xbf
receive_buf+0xee
virtnet_poll+0x137
net_rx_action+0x266
__softirqentry_text_start+0xd1
irq_exit+0xd2
do_IRQ+0x54
ret_from_intr+0x0
cpuidle_enter_state+0xcb
do_idle+0x1cc
cpu_startup_entry+0x5f
start_secondary+0x197
secondary_startup_64+0xa4
...
```

From this call stack we can see that the server itself rejected the request (SYN in, ACK+RST out). But how do we tell why it was rejected: no listener, or something else? Focus on the function path where the RST first appears.
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1667530116746-3341ab32-da82-42e8-b071-44f96c19bf0f.png#clientId=ue4e3aeb0-27ac-4&from=paste&height=127&id=ue9945523&name=image.png&originHeight=127&originWidth=877&originalType=binary&ratio=1&rotation=0&showTitle=false&size=39268&status=done&style=none&taskId=u9daa5fea-e3fb-42e3-bcb9-5c45ffaa1a0&title=&width=877)
For many readers, making sense of tcp_v4_rcv+0x6d3 and tcp_v4_send_reset+0x3c6 is the painful part. I previously wrote an article on kernel network transmission that computed function+offset with dropwatch, which is fairly involved; here we use a simpler way to locate the kernel code.
Viewing the kernel source behind a function requires kernel-debuginfo. Keep kernel-debuginfo and kernel-devel installed, then use faddr2line to map the function+offset to a source line:
```
yum install -y kernel-debuginfo.x86_64
wget https://raw.githubusercontent.com/torvalds/linux/master/scripts/faddr2line

# bash faddr2line /usr/lib/debug/lib/modules/4.19.91-26.5.al7.x86_64/vmlinux tcp_v4_send_reset+0x3c6
tcp_v4_send_reset+0x3c6/0x590:
tcp_v4_send_reset at net/ipv4/tcp_ipv4.c:780

# bash faddr2line /usr/lib/debug/lib/modules/4.19.91-26.5.al7.x86_64/vmlinux tcp_v4_rcv+0x6d3
tcp_v4_rcv+0x6d3/0xfc0:
__xfrm_policy_check2 at include/net/xfrm.h:1200
(inlined by) xfrm_policy_check at include/net/xfrm.h:1207
(inlined by) xfrm4_policy_check at include/net/xfrm.h:1212
(inlined by) tcp_v4_rcv at net/ipv4/tcp_ipv4.c:1833
```
##### Kernel source analysis:
Read the call stack from the bottom up. Start with the caller of tcp_v4_send_reset: the tcp_v4_rcv call chain. In tcp_v4_rcv below, when __inet_lookup_skb fails to find a listening socket, control jumps to the no_tcp_socket label.
If the triggering packet's checksum is valid, a Reset is sent in reply. Note also that tcp_v4_send_reset checks whether the received packet itself has the RST flag set: no reset is ever sent in reply to a reset.
```

----------tcp_rcv
tcp_v4_rcv at net/ipv4/tcp_ipv4.c:1833

lookup:
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
                           th->dest, sdif, &refcounted);
    if (!sk)
        goto no_tcp_socket;    /* no socket found (no listener / no session): jump straight to no_tcp_socket */

1832 no_tcp_socket:
1833     if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb))
1834         goto discard_it;
1835
1836     tcp_v4_fill_cb(skb, iph, th);
1837
1838     if (tcp_checksum_complete(skb)) {    /* checksum validation */
1839 csum_error:
1840         __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
1841 bad_packet:
1842         __TCP_INC_STATS(net, TCP_MIB_INERRS);
1843     } else {
1844         tcp_v4_send_reset(NULL, skb);    /* call tcp_v4_send_reset to send the reset */
1845     }
1846
1847 discard_it:
1848     /* Discard frame. */
1849     kfree_skb(skb);
1850     return 0;

The tcp_v4_send_reset call resolved above points straight at line 780
------tcp_v4_send_reset at net/ipv4/tcp_ipv4.c:780
650 static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
651 {
652     const struct tcphdr *th = tcp_hdr(skb);
653     struct {
654         struct tcphdr th;
655 #ifdef CONFIG_TCP_MD5SIG
656         __be32 opt[(TCPOLEN_MD5SIG_ALIGNED >> 2)];
657 #endif
658     } rep;
659     struct ip_reply_arg arg;
660 #ifdef CONFIG_TCP_MD5SIG
661     struct tcp_md5sig_key *key = NULL;
662     const __u8 *hash_location = NULL;
663     unsigned char newhash[16];
664     int genhash;
665     struct sock *sk1 = NULL;
666 #endif
667     struct net *net;
668     struct sock *ctl_sk;
669
670     /* Never send a reset in response to a reset. */
671     if (th->rst)
672         return;
673
674     /* If sk not NULL, it means we did a successful lookup and incoming
675      * route had to be correct. prequeue might have dropped our dst.
676      */
677     if (!sk && skb_rtable(skb)->rt_type != RTN_LOCAL)
678         return;
679
680     /* Swap the send and the receive. */
681     memset(&rep, 0, sizeof(rep));
682     rep.th.dest   = th->source;
683     rep.th.source = th->dest;
684     rep.th.doff   = sizeof(struct tcphdr) / 4;
685     rep.th.rst    = 1;
686
687     if (th->ack) {
688         rep.th.seq = th->ack_seq;
689     } else {
690         rep.th.ack = 1;
691         rep.th.ack_seq = htonl(ntohl(th->seq) + th->syn + th->fin +
692                                skb->len - (th->doff << 2));
693     }
694
695     memset(&arg, 0, sizeof(arg));
696     arg.iov[0].iov_base = (unsigned char *)&rep;
697     arg.iov[0].iov_len  = sizeof(rep.th);
698
699     net = sk ? sock_net(sk) : dev_net(skb_dst(skb)->dev);
700 #ifdef CONFIG_TCP_MD5SIG
701     rcu_read_lock();
702     hash_location = tcp_parse_md5sig_option(th);
703     if (sk && sk_fullsock(sk)) {
704         key = tcp_md5_do_lookup(sk, (union tcp_md5_addr *)
705                     &ip_hdr(skb)->saddr, AF_INET);
706     } else if (hash_location) {
707         /*
708          * active side is lost. Try to find listening socket through
709          * source port, and then find md5 key through listening socket.
710          * we are not loose security here:
711          * Incoming packet is checked with md5 hash with finding key,
712          * no RST generated if md5 hash doesn't match.
713          */
714         sk1 = __inet_lookup_listener(net, &tcp_hashinfo, NULL, 0,
715                          ip_hdr(skb)->saddr,
716                          th->source, ip_hdr(skb)->daddr,
717                          ntohs(th->source), inet_iif(skb),
718                          tcp_v4_sdif(skb));
719         /* don't send rst if it can't find key */
720         if (!sk1)
721             goto out;
722
723         key = tcp_md5_do_lookup(sk1, (union tcp_md5_addr *)
724                     &ip_hdr(skb)->saddr, AF_INET);
725         if (!key)
726             goto out;
727
728
729         genhash = tcp_v4_md5_hash_skb(newhash, key, NULL, skb);
730         if (genhash || memcmp(hash_location, newhash, 16) != 0)
731             goto out;
732
733     }
734
735     if (key) {
736         rep.opt[0] = htonl((TCPOPT_NOP << 24) |
737                    (TCPOPT_NOP << 16) |
738                    (TCPOPT_MD5SIG << 8) |
739                    TCPOLEN_MD5SIG);
740         /* Update length and the length the header thinks exists */
741         arg.iov[0].iov_len += TCPOLEN_MD5SIG_ALIGNED;
742         rep.th.doff = arg.iov[0].iov_len / 4;
743
744         tcp_v4_md5_hash_hdr((__u8 *) &rep.opt[1],
745                     key, ip_hdr(skb)->saddr,
746                     ip_hdr(skb)->daddr, &rep.th);
747     }
748 #endif
749     arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
750                       ip_hdr(skb)->saddr, /* XXX */
751                       arg.iov[0].iov_len, IPPROTO_TCP, 0);
752     arg.csumoffset = offsetof(struct tcphdr, check) / 2;
753     arg.flags = (sk && inet_sk_transparent(sk)) ? IP_REPLY_ARG_NOSRCCHECK : 0;
754
755     /* When socket is gone, all binding information is lost.
756      * routing might fail in this case. No choice here, if we choose to force
757      * input interface, we will misroute in case of asymmetric route.
758      */
759     if (sk) {
760         arg.bound_dev_if = sk->sk_bound_dev_if;
761         if (sk_fullsock(sk))
762             trace_tcp_send_reset(sk, skb);
763     }
764
765     BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
766              offsetof(struct inet_timewait_sock, tw_bound_dev_if));
767
768     arg.tos = ip_hdr(skb)->tos;
769     arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
770     local_bh_disable();
771     ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
772     if (sk)
773         ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
774             inet_twsk(sk)->tw_mark : sk->sk_mark;
775     ip_send_unicast_reply(ctl_sk,
776                   skb, &TCP_SKB_CB(skb)->header.h4.opt,
777                   ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
778                   &arg, arg.iov[0].iov_len);
779
780     ctl_sk->sk_mark = 0;                     /* reset sk_mark to 0 */
781     __TCP_INC_STATS(net, TCP_MIB_OUTSEGS);   /* bump the outgoing-segments counter */
782     __TCP_INC_STATS(net, TCP_MIB_OUTRSTS);   /* and the outgoing-resets counter; these two are not shown by netstat -st, they correspond to entries in /proc/net/snmp */
783     local_bh_enable();
784
785 #ifdef CONFIG_TCP_MD5SIG
786 out:
787     rcu_read_unlock();
788 #endif
789 }

```
##### Analysis summary:
The no_tcp_socket path is reached in essentially only these cases:

1. The session no longer exists because it was reclaimed; for example, on older kernels when the timewait bucket is full, sessions are reclaimed outright, and any further traffic on them triggers a reset.
2. The listener itself does not exist (in this test: curl to a non-listening port).

/* The following is excerpted from the web
_Calls to tcp_v4_send_reset that send a RESET:_

1. _TCP receive: in tcp_v4_rcv, if the checksum is bad, send a RESET;_
2. _TCP receive: in tcp_v4_rcv, if __inet_lookup_skb cannot find the socket the packet is addressed to, send a RESET;_
3. _TCP received a SYN, sent SYN-ACK, and is waiting for the final ACK of the handshake: in tcp_v4_do_rcv - tcp_v4_hnd_req - tcp_check_req, if the TCP header contains RST, or contains a SYN with an invalid sequence number, send a RESET;_
4. _TCP received the final ACK of the handshake and created the child socket: in tcp_v4_do_rcv - tcp_child_process - tcp_rcv_state_process - tcp_ack, if the awaited final ACK's sequence number is wrong: before(ack, prior_snd_una), send a RESET;_
5. 
_TCP received a packet in the ESTABLISHED state: in tcp_v4_do_rcv - tcp_rcv_established - tcp_validate_incoming, if a SYN appears inside the current receive window: th->syn && !before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt), send a RESET;_
6. _During TCP state transitions: tcp_rcv_state_process -_
   - _if the socket is in the LISTEN state and the packet carries an ACK, send a RESET;_
   - _if the socket is in FIN_WAIT_1 or FIN_WAIT_2, receive has already been shut down, and the packet carries new data, send a RESET;_
   - _if the socket is in FIN_WAIT_1; =to be continued=_

_Packet rejected by an iptables rule:_

- _send_reset: an iptables rule can specify -j REJECT --reject-with tcp-reset; a packet matching such a rule is dropped and a RESET is sent to the peer;_

Excerpted from the web */
### Scenario 2: tracing an iptables drop:

- Scenario: on a container node, insert a rule that rejects access to a specific public IP outside the cluster, so requests from the host to that address are denied.
```
# iptables -t filter -I OUTPUT 1 -m tcp --proto tcp --dst 140.205.60.46/32 -j DROP
```

- Approach: tcpdump / iptables counters / mytracer's --iptable option
- Client: a cluster node, or a pod running on one
- Server: a public IP; 140.205.60.46 in this environment.
- Capture point: the node initiating the request
- Container environment: ACK terway-eniip ipvlan

Note: for egress to IPs outside the cluster, ENI mode goes straight to the underlying device, so terway-eni + ipvlan pods are not affected by the host's iptables rules on the way out; an exception is a setup such as an istio sidecar that adds its own iptables rules.

#### The basic capture tool: tcpdump
Packets dropped by iptables cannot be captured with tcpdump.
```
On the node, via eth0: curl 140.205.60.46
# tcpdump -i any host 140.205.60.46 -nv -x
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

```
Why is nothing captured? Look back at the kernel path diagram at the top of this article: on egress, iptables sits before the capture point, so a packet dropped in an iptables chain never reaches the tap, and tcpdump cannot see it.
#### Inspecting the iptables chains
If traffic is light and iptables is already a suspect, the iptables counters alone may be enough. Below, the DROP counters of the first OUTPUT rule keep growing:
```
# iptables -t filter -L OUTPUT --line-number -n -v
Chain OUTPUT (policy ACCEPT 51 packets, 26067 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        8   480 DROP       tcp  --  *      *       0.0.0.0/0            140.205.60.46        tcp

# iptables -t filter -L OUTPUT --line-number -n -v
Chain OUTPUT (policy ACCEPT 51 packets, 26067 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        9   540 DROP       tcp  --  *      *       0.0.0.0/0            140.205.60.46        tcp
```
#### Tracing with mytracer:
As shown below, tracing the iptables chains for 140.205.60.46 on the requesting node shows the time, namespace inode, NIC, MAC address, TCP flags, the 5-tuple, and the iptables table/chain verdicts; the SYN is dropped at filter.OUTPUT.DROP:
```
curl 140.205.60.46
# python mytracer.py --proto tcp --iptable -H 140.205.60.46
time                       NETWORK_NS   INTERFACE  DEST_MAC      PKT_INFO                                           TRACE_INFO
[15:43:03 ][4026531992] nil 0d0000000000 T_SYN:192.168.0.17:47645->140.205.60.46:80 ffff9935587cbcf8.0:2.raw.OUTPUT.ACCEPT
[15:43:03 ][4026531992] nil 0d0000000000 T_SYN:192.168.0.17:47645->140.205.60.46:80 ffff9935587cbcf8.0:2.mangle.OUTPUT.ACCEPT
[15:43:03 ][4026531992] nil 0d0000000000 T_SYN:192.168.0.17:47645->140.205.60.46:80 ffff9935587cbcf8.0:2.nat.OUTPUT.ACCEPT
[15:43:03 ][4026531992] nil 0d0000000000 T_SYN:192.168.0.17:47645->140.205.60.46:80 ffff9935587cbcf8.0:2.filter.OUTPUT.DROP

```
Why not capture with --dropstack? Can the iptables drop be seen with --dropstack or --callstack? And does the reset for a missing listener count as a drop?
```
# python mytracer.py --proto tcp --callstack -H 140.205.60.46
time                       NETWORK_NS   INTERFACE  DEST_MAC      PKT_INFO                                           TRACE_INFO
# python mytracer.py --proto tcp --dropstack -H 140.205.60.46
time                       NETWORK_NS   INTERFACE  DEST_MAC      PKT_INFO                                           TRACE_INFO
```
For access rejected by iptables, --dropstack captures nothing, and switching to --callstack does not help either. The reason is that dropstack hooks kfree_skb. Similarly in scenario 1, the no-listener reset is sent directly by the reset path without going through kfree_skb, so dropstack cannot capture it there either; that no-listener access can, however, be captured with --callstack:
```
[2022-10-21 11:06:41.209742 ][4026531992] eth1 eeffffffffff T_ACK,RST:192.168.40.230:8080->192.168.88.154:34454 ffff993508ce2300.0:__dev_queue_xmit
__dev_queue_xmit+0x1
ipvlan_queue_xmit+0x20b
ipvlan_start_xmit+0x16
dev_hard_start_xmit+0xa4
__dev_queue_xmit+0x722
ip_finish_output2+0x1f5
ip_output+0x61
ip_send_skb+0x15
ip_send_unicast_reply+0x2c5
tcp_v4_send_reset+0x3c6
tcp_v4_rcv+0x6d3
ip_local_deliver_finish+0x9c
ip_local_deliver+0x42


drop_stack only collects kfree_skb events:

#if __BCC_dropstack
int kprobe____kfree_skb(struct pt_regs *ctx, struct sk_buff *skb)
{
    struct event_t event = {};

    if (do_trace_skb(&event, ctx, skb, NULL) < 0)
        return 0;

    event.flags |= ROUTE_EVENT_DROP;
    event.start_ns = bpf_ktime_get_ns();
    bpf_strncpy(event.func_name, __func__+8, FUNCNAME_MAX_LEN);
    get_stack(ctx, &event);
    route_event.perf_submit(ctx, &event, sizeof(event));
    return 0;
}
#endif
```

Remove the iptables reject rule, run mytracer again to watch the iptables path, and compare against the call stack of a normal access:
```
# iptables -t filter -L OUTPUT --line-number
Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination
1    DROP       tcp  --  anywhere             140.205.60.46        tcp
2    ACCEPT     udp  --  169.254.20.10        anywhere             udp spt:domain
3    ACCEPT     tcp  --  169.254.20.10        anywhere             tcp spt:domain
4    CILIUM_OUTPUT  all  --  anywhere             anywhere             /* cilium-feeder: CILIUM_OUTPUT */
5    KUBE-FIREWALL  all  --  anywhere             anywhere

# iptables -t filter -D OUTPUT 1
```
Run mytracer to watch the iptables path on eth0:
```
# python mytracer.py --proto tcp --iptable -H 140.205.60.46|grep eth0
[15:44:43 ][4026531992] eth0 000000000000 T_SYN:192.168.0.17:48789->140.205.60.46:80 ffff9935587ca4f8.0:2.mangle.POSTROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 000000000000 T_SYN:192.168.0.17:48789->140.205.60.46:80 ffff9935587ca4f8.0:2.nat.POSTROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,SYN:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2a00.0:2.raw.PREROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,SYN:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2a00.0:2.mangle.PREROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,SYN:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2a00.0:2.mangle.INPUT.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,SYN:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2a00.0:2.filter.INPUT.ACCEPT
[15:44:43 ][4026531992] eth0 ff005c90ff34 T_ACK:192.168.0.17:48789->140.205.60.46:80 ffff9934c50d2f00.0:2.mangle.POSTROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 bd78ffffff48 T_ACK,PSH:192.168.0.17:48789->140.205.60.46:80 ffff9935587ca8f8.0:2.mangle.POSTROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.raw.PREROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.mangle.PREROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.mangle.INPUT.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.filter.INPUT.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,PSH:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.raw.PREROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,PSH:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.mangle.PREROUTING.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,PSH:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.mangle.INPUT.ACCEPT
[15:44:43 ][4026531992] eth0 00163e0c327b T_ACK,PSH:140.205.60.46:80->192.168.0.17:48789 ffff9934c50d2500.0:2.filter.INPUT.ACCEPT
Switching to call stacks, captures now appear:
# python mytracer.py --proto tcp --callstack -H 140.205.60.46
......
[15:45:45 ][4026531992] eth0 000000000000 T_ACK:192.168.0.17:49501->140.205.60.46:80 ffff993496607500.0:__dev_queue_xmit
__dev_queue_xmit+0x1
ip_finish_output2+0x1f5
ip_output+0x61
__ip_queue_xmit+0x151
__tcp_transmit_skb+0x582
tcp_fin+0x14f
tcp_data_queue+0x51d
tcp_rcv_state_process+0x3ed
tcp_v4_do_rcv+0x5b
tcp_v4_rcv+0xc0c
ip_local_deliver_finish+0x9c
ip_local_deliver+0x42
ip_rcv+0x3d
__netif_receive_skb_one_core+0x42
netif_receive_skb_internal+0x34
napi_gro_receive+0xbf
receive_buf+0xee
virtnet_poll+0x137
net_rx_action+0x266
__softirqentry_text_start+0xd1
irq_exit+0xd2
do_IRQ+0x54
ret_from_intr+0x0
cpuidle_enter_state+0xcb
do_idle+0x1cc
cpu_startup_entry+0x5f
start_secondary+0x197
secondary_startup_64+0xa4
```

### Scenario 3: analyzing network latency (NIC)

- Scenario: use the tc command-line tool to inject delay on a NIC, simulating a slow-to-respond server
- Approach: use mytracer to analyze the latency
- Server: pod 192.168.88.27, with 300 ms of injected delay.
- Client: anything that can reach the pod address (here 192.168.0.17)
- Capture point: the server side

Pick a pod, log in to its host, switch into the pod's network namespace, and add 300 ms of delay with tc. tc delay applies to the egress direction; in this case it is set on the server (the pod).
```
Add 300 ms of delay
# tc qdisc add dev eth0 root netem delay 300ms
To remove it, replace add with del
# tc qdisc del dev eth0 root netem delay 300ms

```
From a new client session, test with curl; the delay has clearly taken effect.
```
# time curl -I 192.168.88.27
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Mon, 24 Oct 2022 08:49:18 GMT
Content-Type: text/html
Content-Length: 10671
Last-Modified: Mon, 08 Aug 2022 06:31:21 GMT
Connection: keep-alive
ETag: "62f0adb9-29af"
Accept-Ranges: bytes


real	0m0.609s
user	0m0.003s
sys	0m0.003s
```
A side question: why does curl report 0.6 s when only 300 ms of delay was added?
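One way to account for the 0.6 s is that netem delays every packet the server sends, and two server transmissions sit on curl's critical path: the SYN/ACK of the handshake and the HTTP response. A back-of-the-envelope sketch (the base RTT value below is an assumption chosen to match the measurement, not something taken from the capture):

```python
# Rough model: netem on the server's eth0 delays each egress packet by 300 ms.
# Before curl can print the response, two server transmissions are on the
# critical path: the SYN/ACK of the handshake and the HTTP response itself.
NETEM_DELAY_S = 0.300               # tc netem delay on the server (egress only)
SERVER_SENDS_ON_CRITICAL_PATH = 2   # SYN/ACK + HTTP response
BASE_RTT_S = 0.009                  # assumed undelayed round-trip + processing time

total = SERVER_SENDS_ON_CRITICAL_PATH * NETEM_DELAY_S + BASE_RTT_S
print(f"expected curl wall time ~= {total:.3f}s")  # ~0.609s, matching `real 0m0.609s`
```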
#### Capturing with tcpdump
From the captures alone, the slowness appears to be on the server side.
On the server, switch into the pod's namespace and capture:
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666602048279-f8e7232a-6103-4c74-acee-3f87fdd9249b.png#clientId=ub3bce9d4-8e4c-4&from=paste&height=105&id=ud6b3a44e&name=image.png&originHeight=210&originWidth=1802&originalType=binary&ratio=1&rotation=0&showTitle=false&size=99709&status=done&style=none&taskId=ua23ac429-4f57-4469-b419-194099299c5&title=&width=901)
The client-side packets:
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666602927642-1eca63df-742b-418b-a2ab-8379271d704d.png#clientId=ub3bce9d4-8e4c-4&from=paste&height=205&id=u366cff82&name=image.png&originHeight=410&originWidth=2872&originalType=binary&ratio=1&rotation=0&showTitle=false&size=570921&status=done&style=none&taskId=u2f5a8f26-718b-4409-899a-12ef56120c4&title=&width=1436)
#### Tracing with mytracer
Running mytracer on the server while capturing on both ends shows that the final FIN actually experienced three 0.3 s delays, yet curl reported 0.6 s. So curl's clock stops at the end of the transfer (when the FIN/ACK is sent) and does not cover the whole four-way close.

The default stack:
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666603030059-aa4e4858-7f84-447d-a64b-8e5fa4292319.png#clientId=ub3bce9d4-8e4c-4&from=paste&height=666&id=udce637a9&name=image.png&originHeight=1332&originWidth=1821&originalType=binary&ratio=1&rotation=0&showTitle=false&size=1165338&status=done&style=none&taskId=u2a3e2330-b62a-478e-9fbe-a915559df7a&title=&width=910.5)
Comparing mytracer's per-function latencies against the kernel path diagram, it is easy to see that tc's delay takes effect at __dev_queue_xmit+0x1, the hop right after ip_finish_output2+0x209, while tcpdump's capture point sits further down at dev_hard_start_xmit (see the path diagram at the start of this article). tcpdump therefore only sees the packet after it has been delayed.
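The ordering can be summarized in a small sketch (the path list below is simplified from the diagram, not an exhaustive kernel call chain):

```python
# Simplified egress order: the qdisc layer (where tc netem adds its delay)
# is entered in __dev_queue_xmit, while the AF_PACKET tap that tcpdump reads
# from sits at dev_hard_start_xmit, further down the path.
EGRESS_PATH = [
    "ip_finish_output2",
    "__dev_queue_xmit",     # qdisc layer: netem delay is applied here
    "dev_hard_start_xmit",  # packet tap: tcpdump observes the packet here
]

# Hence a netem-delayed packet is already late by the time tcpdump sees it.
delayed_at = EGRESS_PATH.index("__dev_queue_xmit")
captured_at = EGRESS_PATH.index("dev_hard_start_xmit")
print(delayed_at < captured_at)  # True
```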
![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666602822022-b610347c-9255-461a-8ed0-6d8b8f8a7bb5.png#clientId=ub3bce9d4-8e4c-4&from=paste&height=452&id=uaf532613&name=image.png&originHeight=904&originWidth=1668&originalType=binary&ratio=1&rotation=0&showTitle=false&size=319475&status=done&style=none&taskId=uefd5b750-6f7e-41d1-afd7-91ae0795d60&title=&width=834) 636 | 有些同学可能会遇到看到的是dev_queue_xmit函数,不带__,实际上dev_queue_xmit封装的是__dev_queue_xmit 637 | ``` 638 | int dev_queue_xmit(struct sk_buff *skb) 639 | { 640 | return __dev_queue_xmit(skb, NULL); 641 | } 642 | EXPORT_SYMBOL(dev_queue_xmit); 643 | ``` 644 | 继续往下可以使用faddr2line把这个函数地址对应的源码找出来看下,可以看到x01对应的3787行是这个__dev_queue_xmit发送函数的开始位置 645 | ``` 646 | # bash faddr2line /usr/lib/debug/lib/modules/4.19.91-26.5.al7.x86_64/vmlinux __dev_queue_xmit+0x1 647 | __dev_queue_xmit+0x1/0x910: 648 | __dev_queue_xmit at net/core/dev.c:3787 649 | 650 | #cat -n /usr/src/debug/kernel-4.19.91-26.5.al7/linux-4.19.91-26.5.al7.x86_64/net/core/dev.c 651 | 652 | 3786 static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) 653 | 3787 { 654 | 3788 struct net_device *dev = skb->dev; 655 | 3789 struct netdev_queue *txq; 656 | 3790 struct Qdisc *q; 657 | 3791 int rc = -ENOMEM; 658 | 3792 bool again = false; 659 | 3793 660 | 3794 skb_reset_mac_header(skb); 661 | 3795 662 | 3796 if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP)) 663 | 3797 __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED); 664 | 3798 665 | 3799 /* Disable soft irqs for various locks below. Also 666 | 3800 * stops preemption for RCU. 
667 | 3801 */ 668 | 3802 rcu_read_lock_bh(); 669 | 3803 670 | 3804 skb_update_prio(skb); 671 | 3805 672 | 3806 qdisc_pkt_len_init(skb); 673 | 3807 #ifdef CONFIG_NET_CLS_ACT 674 | 3808 skb->tc_at_ingress = 0; 675 | 3809 # ifdef CONFIG_NET_EGRESS 676 | 3810 if (static_branch_unlikely(&egress_needed_key)) { 677 | 3811 skb = sch_handle_egress(skb, &rc, dev); 678 | 3812 if (!skb) 679 | 3813 goto out; 680 | 3814 } 681 | 3815 # endif 682 | 3816 #endif 683 | 3817 /* If device/qdisc don't need skb->dst, release it right now while 684 | 3818 * its hot in this cpu cache. 685 | 3819 */ 686 | 3820 if (dev->priv_flags & IFF_XMIT_DST_RELEASE) 687 | 3821 skb_dst_drop(skb); 688 | 3822 else 689 | 3823 skb_dst_force(skb); 690 | 3824 691 | /* Fetch this netdevice's txq and the txq's Qdisc. The Qdisc does the queuing/congestion handling: 692 | * normally the packet is handed straight to the driver; only when the device is busy does the Qdisc come into play. */ 693 | 3825 txq = netdev_pick_tx(dev, skb, sb_dev); 694 | 3826 q = rcu_dereference_bh(txq->qdisc); 695 | 3827 696 | 3828 trace_net_dev_queue(skb); 697 | /* If the Qdisc has an enqueue callback, __dev_xmit_skb is called and we enter the congestion-controlled flow. 698 | * Note that taking this flow does not force an enqueue: only when the device is busy do the Qdisc enqueue/dequeue operations actually run. 699 | */ 700 | 3829 if (q->enqueue) { 701 | 3830 rc = __dev_xmit_skb(skb, q, dev, txq); 702 | 3831 goto out; 703 | 3832 } 704 | ``` 705 | 706 | 707 | Where tc sits in the transmit path; the diagram is borrowed from [@九善(wangrui.ruiwang)](/wangrui.ruiwang): 708 | ![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666840046786-72aefeb8-1da5-4eba-97b8-b77d62c68136.png#clientId=u2ac78acf-e5f7-4&from=paste&height=643&id=ua8d9daaa&name=image.png&originHeight=1286&originWidth=2556&originalType=binary&ratio=1&rotation=0&showTitle=false&size=1309341&status=done&style=none&taskId=u06179d84-0d7e-46cc-bbae-029d1de20b2&title=&width=1278) 709 | 710 | **Takeaways:** 711 | Latency analysis with mytracer takes some familiarity with the kernel network stack, but you do not have to walk the full call path for every case. A tc delay looks different from genuine application latency: when the application itself is slow (e.g. the tcp queue backs up because the application never calls recv), the delayed call stacks will most likely include functions such as tcp_data_queue. With mytracer it is usually enough to examine the context around where the delay appears, and pairing it with a packet capture works even better. 712 | 713 | 
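The delay-hunting workflow above boils down to diffing consecutive mytracer timestamps and seeing which kernel function the gap lands on. That post-processing can be scripted; the sketch below assumes each log line starts with a `[YYYY-mm-dd HH:MM:SS.ffffff]` timestamp (the sample lines and the 100 ms threshold are illustrative assumptions, not mytracer's exact output):

```python
import datetime

def find_gaps(lines, threshold_ms=100):
    """Return (previous_line, line) pairs whose leading timestamps
    differ by more than threshold_ms milliseconds."""
    gaps, prev = [], None
    for line in lines:
        if not line.startswith("["):
            continue
        # chars 1..26 hold "YYYY-mm-dd HH:MM:SS.ffffff"
        ts = datetime.datetime.strptime(line[1:27], "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and (ts - prev[0]).total_seconds() * 1000 > threshold_ms:
            gaps.append((prev[1], line))
        prev = (ts, line)
    return gaps

# hypothetical two-line excerpt with a ~300ms gap between functions
log = [
    "[2022-10-21 10:32:31.419514] 4026531992 curl eth0 ... ip_finish_output2",
    "[2022-10-21 10:32:31.719632] 4026531992 curl eth0 ... __dev_queue_xmit",
]
for before, after in find_gaps(log):
    print("gap after:", before.split()[-1], "->", after.split()[-1])
    # prints: gap after: ip_finish_output2 -> __dev_queue_xmit
```

Run it against a saved mytracer log; adjust the slice offsets if your timestamp format differs.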
714 | ### pwru, a great tool for newer kernels: 715 | pwru is an eBPF-based packet-tracing tool from cilium. It offers an even finer-grained view of packets in the kernel, but it requires a fairly recent kernel version, so it is not tested here. 716 | ![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666252497073-17fd4f8f-9fee-4cf3-8f85-a5fae7417b96.png#clientId=u70b47896-d039-4&errorMessage=unknown%20error&from=paste&height=475&id=u8fcdc1ff&name=image.png&originHeight=950&originWidth=1678&originalType=binary&ratio=1&rotation=0&showTitle=false&size=913760&status=error&style=none&taskId=ua918a758-1c24-4863-812e-6817e62cec7&title=&width=839) 717 | ## update 2022-10-20 718 | To make latency problems easier to analyze, the default time module has been replaced with datetime, which supports millisecond-level display; the effect is shown below: 719 | ![image.png](https://intranetproxy.alipay.com/skylark/lark/0/2022/png/26092/1666256443836-0bea1ec0-4ffb-4782-b465-8d2ca686164e.png#clientId=u70b47896-d039-4&errorMessage=unknown%20error&from=paste&height=297&id=ubd93bde4&name=image.png&originHeight=594&originWidth=2448&originalType=binary&ratio=1&rotation=0&showTitle=false&size=270033&status=error&style=none&taskId=u81b4c70c-580a-4f30-83c1-68ac5016e46&title=&width=1224) 720 | -------------------------------------------------------------------------------- /mytracer-bak-flannel-brprobe.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | import sys 5 | import socket 6 | from socket import inet_ntop, AF_INET, AF_INET6 7 | from bcc import BPF 8 | import ctypes as ct 9 | import subprocess 10 | from struct import pack 11 | import argparse 12 | import time 13 | import struct 14 | import datetime 15 | 16 | examples = """examples: 17 | mytracer.py # trace all packets 18 | mytracer.py --proto=icmp -H 140.205.60.46 --icmpid 22 # trace icmp packet with addr=140.205.60.46 and icmpid=22 19 | mytracer.py --proto=tcp -H 140.205.60.46 -P 22 # trace tcp packet with addr=140.205.60.46:22 20 | mytracer.py --proto=udp -H 140.205.60.46 -P 22 # trace udp packet with addr=140.205.60.46:22 21 | mytracer.py -T -p 1 --debug -P 80 -H 
127.0.0.1 --proto=tcp --callstack --icmpid=100 -N 10000 22 | """ 23 | 24 | parser = argparse.ArgumentParser( 25 | description="Trace any packet through TCP/IP stack", 26 | formatter_class=argparse.RawDescriptionHelpFormatter, 27 | epilog=examples) 28 | 29 | parser.add_argument("-H", "--ipaddr", type=str, 30 | help="ip address") 31 | 32 | parser.add_argument("--proto", type=str, 33 | help="tcp|udp|icmp|any ") 34 | 35 | parser.add_argument("--icmpid", type=int, default=0, 36 | help="trace icmp id") 37 | 38 | parser.add_argument("-c", "--catch-count", type=int, default=1000000, 39 | help="catch and print count") 40 | 41 | parser.add_argument("-P", "--port", type=int, default=0, 42 | help="udp or tcp port") 43 | 44 | parser.add_argument("-p", "--pid", type=int, default=0, 45 | help="trace this PID only") 46 | 47 | parser.add_argument("-N", "--netns", type=int, default=0, 48 | help="trace this Network Namespace only") 49 | 50 | parser.add_argument("--dropstack", action="store_true", 51 | help="output kernel stack trace when drop packet,kfree_skb") 52 | 53 | parser.add_argument("--callstack", action="store_true", 54 | help="output kernel stack trace") 55 | 56 | parser.add_argument("--iptable", action="store_true", 57 | help="output iptable path") 58 | 59 | parser.add_argument("--route", action="store_true", 60 | help="output route path") 61 | 62 | parser.add_argument("--keep", action="store_true", 63 | help="keep trace packet all lifetime") 64 | 65 | parser.add_argument("-T", "--time", action="store_true", 66 | help="show HH:MM:SS timestamp") 67 | 68 | parser.add_argument("--ebpf", action="store_true", 69 | help=argparse.SUPPRESS) 70 | 71 | parser.add_argument("--debug", action="store_true", 72 | help=argparse.SUPPRESS) 73 | 74 | args = parser.parse_args() 75 | if args.debug == True: 76 | print("pid=%d time=%d ipaddr=%s port=%d netns=%d proto=%s icmpid=%d dropstack=%d" % \ 77 | (args.pid,args.time,args.ipaddr, args.port,args.netns,args.proto,args.icmpid, args.dropstack)) 
78 | sys.exit() 79 | 80 | 81 | ipproto={} 82 | #ipproto["tcp"]="IPPROTO_TCP" 83 | ipproto["tcp"]="6" 84 | #ipproto["udp"]="IPPROTO_UDP" 85 | ipproto["udp"]="17" 86 | #ipproto["icmp"]="IPPROTO_ICMP" 87 | ipproto["icmp"]="1" 88 | proto = 0 if args.proto == None else (0 if ipproto.get(args.proto) == None else ipproto[args.proto]) 89 | #ipaddr=socket.htonl(struct.unpack("I",socket.inet_aton("0" if args.ipaddr == None else args.ipaddr))[0]) 90 | #port=socket.htons(args.port) 91 | ipaddr=(struct.unpack("I",socket.inet_aton("0" if args.ipaddr == None else args.ipaddr))[0]) 92 | port=(args.port) 93 | icmpid=socket.htons(args.icmpid) 94 | 95 | bpf_def="#define __BCC_ARGS__\n" 96 | bpf_args="#define __BCC_pid (%d)\n" % (args.pid) 97 | bpf_args+="#define __BCC_ipaddr (0x%x)\n" % (ipaddr) 98 | bpf_args+="#define __BCC_port (%d)\n" % (port) 99 | bpf_args+="#define __BCC_netns (%d)\n" % (args.netns) 100 | bpf_args+="#define __BCC_proto (%s)\n" % (proto) 101 | bpf_args+="#define __BCC_icmpid (%d)\n" % (icmpid) 102 | bpf_args+="#define __BCC_dropstack (%d)\n" % (args.dropstack) 103 | bpf_args+="#define __BCC_callstack (%d)\n" % (args.callstack) 104 | bpf_args+="#define __BCC_iptable (%d)\n" % (args.iptable) 105 | bpf_args+="#define __BCC_route (%d)\n" % (args.route) 106 | bpf_args+="#define __BCC_keep (%d)\n" % (args.keep) 107 | 108 | 109 | # bpf_text=open(r"mytracer.c", "r").read() 110 | bpf_text= """ 111 | #include 112 | #include 113 | #include 114 | #include 115 | #include 116 | #include 117 | #include 118 | #include 119 | #include 120 | #include 121 | #include 122 | #include 123 | 124 | #define ROUTE_EVENT_IF 0x0001 125 | #define ROUTE_EVENT_IPTABLE 0x0002 126 | #define ROUTE_EVENT_DROP 0x0004 127 | #define ROUTE_EVENT_NEW 0x0010 128 | 129 | #ifdef __BCC_ARGS__ 130 | __BCC_ARGS_DEFINE__ 131 | #else 132 | #define __BCC_pid 0 133 | #define __BCC_ipaddr 0 134 | #define __BCC_port 0 135 | #define __BCC_icmpid 0 136 | #define __BCC_dropstack 0 137 | #define __BCC_callstack 0 138 | 
#define __BCC_iptable 0 139 | #define __BCC_route 0 140 | #define __BCC_keep 0 141 | #define __BCC_proto 0 142 | #define __BCC_netns 0 143 | #endif 144 | 145 | /* route info as default */ 146 | #if !__BCC_dropstack && !__BCC_iptable && !__BCC_route 147 | #undef __BCC_route 148 | #define __BCC_route 1 149 | #endif 150 | 151 | #if (__BCC_dropstack) || (!__BCC_pid && !__BCC_ipaddr && !__BCC_port && !__BCC_icmpid &&! __BCC_proto && !__BCC_netns) 152 | #undef __BCC_keep 153 | #define __BCC_keep 0 154 | #endif 155 | 156 | BPF_STACK_TRACE(stacks, 2048); 157 | 158 | #define FUNCNAME_MAX_LEN 64 159 | struct event_t { 160 | char func_name[FUNCNAME_MAX_LEN]; 161 | u8 flags; 162 | 163 | // route info 164 | char comm[IFNAMSIZ]; 165 | char ifname[IFNAMSIZ]; 166 | 167 | u32 netns; 168 | 169 | // pkt info 170 | u8 dest_mac[6]; 171 | u32 len; 172 | u8 ip_version; 173 | u8 l4_proto; 174 | u64 saddr[2]; 175 | u64 daddr[2]; 176 | u8 icmptype; 177 | u16 icmpid; 178 | u16 icmpseq; 179 | u16 sport; 180 | u16 dport; 181 | u16 tcpflags; 182 | u32 seq; 183 | u32 ack_seq; 184 | // u32 tcp_len; 185 | 186 | // ipt info 187 | u32 hook; 188 | u8 pf; 189 | u32 verdict; 190 | char tablename[XT_TABLE_MAXNAMELEN]; 191 | 192 | void *skb; 193 | // skb info 194 | u8 pkt_type; //skb->pkt_type 195 | 196 | // call stack 197 | int kernel_stack_id; 198 | u64 kernel_ip; 199 | 200 | //time 201 | u64 test; 202 | }; 203 | BPF_PERF_OUTPUT(route_event); 204 | 205 | struct ipt_do_table_args 206 | { 207 | struct sk_buff *skb; 208 | const struct nf_hook_state *state; 209 | struct xt_table *table; 210 | }; 211 | BPF_HASH(cur_ipt_do_table_args, u32, struct ipt_do_table_args); 212 | 213 | union ___skb_pkt_type { 214 | __u8 value; 215 | struct { 216 | __u8 __pkt_type_offset[0]; 217 | __u8 pkt_type:3; 218 | __u8 pfmemalloc:1; 219 | __u8 ignore_df:1; 220 | 221 | __u8 nf_trace:1; 222 | __u8 ip_summed:2; 223 | }; 224 | }; 225 | 226 | #if __BCC_keep 227 | #endif 228 | 229 | #define MAC_HEADER_SIZE 14; 230 | #define 
member_address(source_struct, source_member) \ 231 | ({ \ 232 | void* __ret; \ 233 | __ret = (void*) (((char*)source_struct) + offsetof(typeof(*source_struct), source_member)); \ 234 | __ret; \ 235 | }) 236 | #define member_read(destination, source_struct, source_member) \ 237 | do{ \ 238 | bpf_probe_read( \ 239 | destination, \ 240 | sizeof(source_struct->source_member), \ 241 | member_address(source_struct, source_member) \ 242 | ); \ 243 | } while(0) 244 | 245 | enum { 246 | __TCP_FLAG_FIN, // 0 247 | __TCP_FLAG_SYN, // 1 248 | __TCP_FLAG_RST, // 2 249 | __TCP_FLAG_PSH, // 3 250 | __TCP_FLAG_ACK, // 4 251 | __TCP_FLAG_URG, // 5 252 | __TCP_FLAG_ECE, // 6 253 | __TCP_FLAG_CWR // 7 254 | }; 255 | 256 | static void bpf_strncpy(char *dst, const char *src, int n) 257 | { 258 | int i = 0, j; 259 | #define CPY(n) \ 260 | do { \ 261 | for (; i < n; i++) { \ 262 | if (src[i] == 0) return; \ 263 | dst[i] = src[i]; \ 264 | } \ 265 | } while(0) 266 | 267 | for (j = 10; j < 64; j += 10) 268 | CPY(j); 269 | CPY(64); 270 | #undef CPY 271 | } 272 | 273 | #define TCP_FLAGS_INIT(new_flags, orig_flags, flag) \ 274 | do { \ 275 | if (orig_flags & flag) { \ 276 | new_flags |= (1U<<__##flag); \ 277 | } \ 278 | } while (0) 279 | 280 | #define init_tcpflags_bits(new_flags, orig_flags) \ 281 | ({ \ 282 | new_flags = 0; \ 283 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_FIN); \ 284 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_SYN); \ 285 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_RST); \ 286 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_PSH); \ 287 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_ACK); \ 288 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_URG); \ 289 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_ECE); \ 290 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_CWR); \ 291 | }) 292 | 293 | 294 | static void get_stack(struct pt_regs *ctx, struct event_t *event) 295 | { 296 | event->kernel_stack_id = stacks.get_stackid(ctx, 0); 297 | if 
(event->kernel_stack_id >= 0) { 298 | u64 ip = PT_REGS_IP(ctx); 299 | u64 page_offset; 300 | // if ip isn't sane, leave key ips as zero for later checking 301 | #if defined(CONFIG_X86_64) && defined(__PAGE_OFFSET_BASE) 302 | // x64, 4.16, ..., 4.11, etc., but some earlier kernel didn't have it 303 | page_offset = __PAGE_OFFSET_BASE; 304 | #elif defined(CONFIG_X86_64) && defined(__PAGE_OFFSET_BASE_L4) 305 | // x64, 4.17, and later 306 | #if defined(CONFIG_DYNAMIC_MEMORY_LAYOUT) && defined(CONFIG_X86_5LEVEL) 307 | page_offset = __PAGE_OFFSET_BASE_L5; 308 | #else 309 | page_offset = __PAGE_OFFSET_BASE_L4; 310 | #endif 311 | #else 312 | // earlier x86_64 kernels, e.g., 4.6, comes here 313 | // arm64, s390, powerpc, x86_32 314 | page_offset = PAGE_OFFSET; 315 | #endif 316 | if (ip > page_offset) { 317 | event->kernel_ip = ip; 318 | } 319 | } 320 | return; 321 | } 322 | 323 | #define CALL_STACK(ctx, event) \ 324 | do { \ 325 | if (__BCC_callstack) \ 326 | get_stack(ctx, event); \ 327 | } while (0) 328 | 329 | 330 | /** 331 | * Common tracepoint handler. Detect IPv4/IPv6 and 332 | * emit event with address, interface and namespace. 
333 | */ 334 | static int 335 | do_trace_skb(struct event_t *event, void *ctx, struct sk_buff *skb, void *netdev) 336 | { 337 | struct net_device *dev; 338 | 339 | char *head; 340 | char *l2_header_address; 341 | char *l3_header_address; 342 | char *l4_header_address; 343 | 344 | u16 mac_header; 345 | u16 network_header; 346 | 347 | u8 proto_icmp_echo_request; 348 | u8 proto_icmp_echo_reply; 349 | u8 l4_offset_from_ip_header; 350 | 351 | struct icmphdr icmphdr; 352 | union tcp_word_hdr tcphdr; 353 | struct udphdr udphdr; 354 | 355 | // Get device pointer, we'll need it to get the name and network namespace 356 | event->ifname[0] = 0; 357 | if (netdev) 358 | dev = netdev; 359 | else 360 | member_read(&dev, skb, dev); 361 | 362 | bpf_probe_read(&event->ifname, IFNAMSIZ, dev->name); 363 | 364 | if (event->ifname[0] == 0 || dev == NULL) 365 | bpf_strncpy(event->ifname, "nil", IFNAMSIZ); 366 | 367 | event->flags |= ROUTE_EVENT_IF; 368 | 369 | #ifdef CONFIG_NET_NS 370 | struct net* net; 371 | 372 | // Get netns id. 
The code below is equivalent to: event->netns = dev->nd_net.net->ns.inum 373 | possible_net_t *skc_net = &dev->nd_net; 374 | member_read(&net, skc_net, net); 375 | struct ns_common *ns = member_address(net, ns); 376 | member_read(&event->netns, ns, inum); 377 | 378 | // maybe the skb->dev is not init, for this situation, we can get ns by sk->__sk_common.skc_net.net->ns.inum 379 | if (event->netns == 0) { 380 | struct sock *sk; 381 | struct sock_common __sk_common; 382 | struct ns_common* ns2; 383 | member_read(&sk, skb, sk); 384 | if (sk != NULL) { 385 | member_read(&__sk_common, sk, __sk_common); 386 | ns2 = member_address(__sk_common.skc_net.net, ns); 387 | member_read(&event->netns, ns2, inum); 388 | } 389 | } 390 | 391 | 392 | #endif 393 | 394 | member_read(&event->len, skb, len); 395 | member_read(&head, skb, head); 396 | member_read(&mac_header, skb, mac_header); 397 | member_read(&network_header, skb, network_header); 398 | 399 | if(network_header == 0) { 400 | network_header = mac_header + MAC_HEADER_SIZE; 401 | } 402 | 403 | l2_header_address = mac_header + head; 404 | bpf_probe_read(&event->dest_mac, 6, l2_header_address); 405 | 406 | l3_header_address = head + network_header; 407 | bpf_probe_read(&event->ip_version, sizeof(u8), l3_header_address); 408 | event->ip_version = event->ip_version >> 4 & 0xf; 409 | 410 | if (event->ip_version == 4) { 411 | struct iphdr iphdr; 412 | bpf_probe_read(&iphdr, sizeof(iphdr), l3_header_address); 413 | 414 | l4_offset_from_ip_header = iphdr.ihl * 4; 415 | event->l4_proto = iphdr.protocol; 416 | event->saddr[0] = iphdr.saddr; 417 | event->daddr[0] = iphdr.daddr; 418 | bpf_get_current_comm(event->comm, sizeof(event->comm)); 419 | 420 | if (event->l4_proto == IPPROTO_ICMP) { 421 | proto_icmp_echo_request = ICMP_ECHO; 422 | proto_icmp_echo_reply = ICMP_ECHOREPLY; 423 | } 424 | 425 | } else if (event->ip_version == 6) { 426 | // Assume no option header --> fixed size header 427 | struct ipv6hdr* ipv6hdr = (struct 
ipv6hdr*)l3_header_address; 428 | l4_offset_from_ip_header = sizeof(*ipv6hdr); 429 | 430 | bpf_probe_read(&event->l4_proto, sizeof(ipv6hdr->nexthdr), (char*)ipv6hdr + offsetof(struct ipv6hdr, nexthdr)); 431 | bpf_probe_read(event->saddr, sizeof(ipv6hdr->saddr), (char*)ipv6hdr + offsetof(struct ipv6hdr, saddr)); 432 | bpf_probe_read(event->daddr, sizeof(ipv6hdr->daddr), (char*)ipv6hdr + offsetof(struct ipv6hdr, daddr)); 433 | 434 | if (event->l4_proto == IPPROTO_ICMPV6) { 435 | proto_icmp_echo_request = ICMPV6_ECHO_REQUEST; 436 | proto_icmp_echo_reply = ICMPV6_ECHO_REPLY; 437 | } 438 | 439 | } else { 440 | return -1; 441 | } 442 | 443 | l4_header_address = l3_header_address + l4_offset_from_ip_header; 444 | switch (event->l4_proto) { 445 | case IPPROTO_ICMPV6: 446 | case IPPROTO_ICMP: 447 | bpf_probe_read(&icmphdr, sizeof(icmphdr), l4_header_address); 448 | if (icmphdr.type != proto_icmp_echo_request && icmphdr.type != proto_icmp_echo_reply) { 449 | return -1; 450 | } 451 | event->icmptype = icmphdr.type; 452 | event->icmpid = be16_to_cpu(icmphdr.un.echo.id); 453 | event->icmpseq = be16_to_cpu(icmphdr.un.echo.sequence); 454 | break; 455 | case IPPROTO_TCP: 456 | bpf_probe_read(&tcphdr, sizeof(tcphdr), l4_header_address); 457 | init_tcpflags_bits(event->tcpflags, tcp_flag_word(&tcphdr)); 458 | event->sport = be16_to_cpu(tcphdr.hdr.source); 459 | event->dport = be16_to_cpu(tcphdr.hdr.dest); 460 | event->seq = be32_to_cpu(tcphdr.hdr.seq); 461 | event->ack_seq = be32_to_cpu(tcphdr.hdr.ack_seq); 462 | // event->tcp_len = tcphdr.hdr.doff * 4; 463 | break; 464 | case IPPROTO_UDP: 465 | bpf_probe_read(&udphdr, sizeof(udphdr), l4_header_address); 466 | event->sport = be16_to_cpu(udphdr.source); 467 | event->dport = be16_to_cpu(udphdr.dest); 468 | break; 469 | default: 470 | return -1; 471 | } 472 | 473 | #if __BCC_keep 474 | #endif 475 | 476 | 477 | /* 478 | * netns filter 479 | */ 480 | if (__BCC_netns !=0 && event->netns != 0 && event->netns != __BCC_netns) { 481 | return 
-1; 482 | } 483 | 484 | /* 485 | * pid filter 486 | */ 487 | #if __BCC_pid 488 | u64 tgid = bpf_get_current_pid_tgid() >> 32; 489 | if (tgid != __BCC_pid) 490 | return -1; 491 | #endif 492 | 493 | /* 494 | * skb filter 495 | */ 496 | #if __BCC_ipaddr 497 | if (event->ip_version == 4) { 498 | if (__BCC_ipaddr != event->saddr[0] && __BCC_ipaddr != event->daddr[0]) 499 | return -1; 500 | } else { 501 | return -1; 502 | } 503 | #endif 504 | 505 | #if __BCC_proto 506 | if (__BCC_proto != event->l4_proto) 507 | return -1; 508 | #endif 509 | 510 | #if __BCC_port 511 | if ( (event->l4_proto == IPPROTO_UDP || event->l4_proto == IPPROTO_TCP) && 512 | (__BCC_port != event->sport && __BCC_port != event->dport)) 513 | return -1; 514 | #endif 515 | 516 | #if __BCC_icmpid 517 | if (__BCC_proto == IPPROTO_ICMP && __BCC_icmpid != event->icmpid) 518 | return -1; 519 | #endif 520 | 521 | #if __BCC_keep 522 | #endif 523 | 524 | return 0; 525 | } 526 | 527 | static int 528 | do_trace(void *ctx, struct sk_buff *skb, const char *func_name, void *netdev) 529 | { 530 | struct event_t event = {}; 531 | union ___skb_pkt_type type = {}; 532 | 533 | if (do_trace_skb(&event, ctx, skb, netdev) < 0) 534 | return 0; 535 | 536 | event.skb=skb; 537 | bpf_probe_read(&type.value, 1, ((char*)skb) + offsetof(typeof(*skb), __pkt_type_offset)); 538 | event.pkt_type = type.pkt_type; 539 | bpf_strncpy(event.func_name, func_name, FUNCNAME_MAX_LEN); 540 | CALL_STACK(ctx, &event); 541 | route_event.perf_submit(ctx, &event, sizeof(event)); 542 | out: 543 | return 0; 544 | } 545 | 546 | #if __BCC_route 547 | 548 | /* 549 | * netif rcv hook: 550 | * 1) int netif_rx(struct sk_buff *skb) 551 | * 2) int __netif_receive_skb(struct sk_buff *skb) 552 | * 3) gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb) 553 | * 4) ... 
554 | */ 555 | int kprobe__netif_rx(struct pt_regs *ctx, struct sk_buff *skb) 556 | { 557 | return do_trace(ctx, skb, __func__+8, NULL); 558 | } 559 | 560 | int kprobe____netif_receive_skb(struct pt_regs *ctx, struct sk_buff *skb) 561 | { 562 | return do_trace(ctx, skb, __func__+8, NULL); 563 | } 564 | 565 | int kprobe__tpacket_rcv(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 566 | { 567 | return do_trace(ctx, skb, __func__+8, orig_dev); 568 | } 569 | 570 | int kprobe__packet_rcv(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 571 | { 572 | return do_trace(ctx, skb, __func__+8, orig_dev); 573 | } 574 | 575 | int kprobe__napi_gro_receive(struct pt_regs *ctx, struct napi_struct *napi, struct sk_buff *skb) 576 | { 577 | return do_trace(ctx, skb, __func__+8, NULL); 578 | } 579 | 580 | 581 | /* 582 | * tcp recv hook: 583 | * 1) int tcp_v4_rcv(struct sk_buff *skb) 584 | * 2) ... 585 | */ 586 | 587 | int kprobe__tcp_v4_rcv(struct pt_regs *ctx, struct sk_buff *skb) 588 | { 589 | return do_trace(ctx, skb, __func__+8, NULL); 590 | } 591 | 592 | /* 593 | * skb copy hook: 594 | * 1) int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, struct iov_iter *to, int len) 595 | * 2) ... 596 | */ 597 | int kprobe__skb_copy_datagram_iter(struct pt_regs *ctx, const struct sk_buff *skb, int offset, struct iov_iter *to, int len) 598 | { 599 | return do_trace(ctx, skb, __func__+8, NULL); 600 | } 601 | 602 | /* 603 | * netif send hook: 604 | * 1) int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) 605 | * 2) ... 
606 | */ 607 | 608 | int kprobe____dev_queue_xmit(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *sb_dev) 609 | { 610 | return do_trace(ctx, skb, __func__+8, NULL); 611 | } 612 | 613 | /* 614 | * br process hook: 615 | * 1) rx_handler_result_t br_handle_frame(struct sk_buff **pskb) 616 | * 2) int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb) 617 | * 3) unsigned int br_nf_pre_routing(void *priv, struct sk_buff *skb, const struct nf_hook_state *state) 618 | * 4) int br_nf_pre_routing_finish(struct net *net, struct sock *sk, struct sk_buff *skb) 619 | * 5) int br_pass_frame_up(struct sk_buff *skb) 620 | * 6) int br_netif_receive_skb(struct net *net, struct sock *sk, struct sk_buff *skb) 621 | * 7) void br_forward(const struct net_bridge_port *to, struct sk_buff *skb, bool local_rcv, bool local_orig) 622 | * 8) int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb) 623 | * 9) unsigned int br_nf_forward_ip(void *priv,struct sk_buff *skb,const struct nf_hook_state *state) 624 | * 10)int br_nf_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb) 625 | * 11)unsigned int br_nf_post_routing(void *priv,struct sk_buff *skb,const struct nf_hook_state *state) 626 | * 12)int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff *skb) 627 | */ 628 | 629 | int kprobe__br_handle_frame(struct pt_regs *ctx, struct sk_buff **pskb) 630 | { 631 | return do_trace(ctx, *pskb, __func__+8, NULL); 632 | } 633 | 634 | int kprobe__br_handle_frame_finish(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 635 | { 636 | return do_trace(ctx, skb, __func__+8, NULL); 637 | } 638 | 639 | int kprobe__br_nf_pre_routing(struct pt_regs *ctx, void *priv, struct sk_buff *skb, const struct nf_hook_state *state) 640 | { 641 | return do_trace(ctx, skb, __func__+8, NULL); 642 | } 643 | 644 | int kprobe__br_nf_pre_routing_finish(struct pt_regs *ctx, struct net *net, struct sock *sk, struct 
sk_buff *skb) 645 | { 646 | return do_trace(ctx, skb, __func__+8, NULL); 647 | } 648 | 649 | int kprobe__br_pass_frame_up(struct pt_regs *ctx, struct sk_buff *skb) 650 | { 651 | return do_trace(ctx, skb, __func__+8, NULL); 652 | } 653 | 654 | int kprobe__br_netif_receive_skb(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 655 | { 656 | return do_trace(ctx, skb, __func__+8, NULL); 657 | } 658 | 659 | 660 | int kprobe__br_forward(struct pt_regs *ctx, const void *to, struct sk_buff *skb, bool local_rcv, bool local_orig) 661 | { 662 | return do_trace(ctx, skb, __func__+8, NULL); 663 | } 664 | 665 | int kprobe____br_forward(struct pt_regs *ctx, const void *to, struct sk_buff *skb, bool local_orig) 666 | { 667 | return do_trace(ctx, skb, __func__+8, NULL); 668 | } 669 | 670 | /* 671 | if the kernel version is 5.10.x kernel(alinux3),we need disable this probe(kprobe__deliver_clone). 672 | If the kernel version below 4.19(alinux2), this probe you can enable 673 | if you use flannel network ,please open this probe 674 | 675 | int kprobe__deliver_clone(struct pt_regs *ctx, const void *prev, struct sk_buff *skb, bool local_orig) 676 | { 677 | return do_trace(ctx, skb, __func__+8, NULL); 678 | } 679 | 680 | 681 | int kprobe__br_forward_finish(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 682 | { 683 | return do_trace(ctx, skb, __func__+8, NULL); 684 | } 685 | 686 | int kprobe__br_nf_forward_ip(struct pt_regs *ctx, void *priv,struct sk_buff *skb,const struct nf_hook_state *state) 687 | { 688 | return do_trace(ctx, skb, __func__+8, NULL); 689 | } 690 | 691 | int kprobe__br_nf_forward_finish(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 692 | { 693 | return do_trace(ctx, skb, __func__+8, NULL); 694 | } 695 | 696 | int kprobe__br_nf_post_routing(struct pt_regs *ctx, void *priv,struct sk_buff *skb,const struct nf_hook_state *state) 697 | { 698 | return do_trace(ctx, skb, __func__+8, NULL); 699 | 
} 700 | 701 | int kprobe__br_nf_dev_queue_xmit(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 702 | { 703 | return do_trace(ctx, skb, __func__+8, NULL); 704 | } 705 | */ 706 | 707 | /* 708 | * ip layer: 709 | * 1) int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 710 | * 2) int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) 711 | * 3) int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb) 712 | * 4) int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) 713 | * 5) int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb) 714 | * 6) ... 715 | */ 716 | 717 | int kprobe__ip_rcv(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 718 | { 719 | return do_trace(ctx, skb, __func__+8, NULL); 720 | } 721 | 722 | int kprobe__ip_rcv_finish(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 723 | { 724 | return do_trace(ctx, skb, __func__+8, NULL); 725 | } 726 | 727 | int kprobe__ip_output(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 728 | { 729 | return do_trace(ctx, skb, __func__+8, NULL); 730 | } 731 | 732 | int kprobe__ip_finish_output(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 733 | { 734 | return do_trace(ctx, skb, __func__+8, NULL); 735 | } 736 | 737 | #endif 738 | 739 | #if __BCC_iptable 740 | static int 741 | __ipt_do_table_in(struct pt_regs *ctx, struct sk_buff *skb, 742 | const struct nf_hook_state *state, struct xt_table *table) 743 | { 744 | u32 pid = bpf_get_current_pid_tgid(); 745 | 746 | struct ipt_do_table_args args = { 747 | .skb = skb, 748 | .state = state, 749 | .table = table, 750 | }; 751 | cur_ipt_do_table_args.update(&pid, &args); 752 | 753 | return 0; 754 | }; 755 | 756 | static int 757 | __ipt_do_table_out(struct pt_regs * ctx, struct 
sk_buff *skb) 758 | { 759 | struct event_t event = {}; 760 | union ___skb_pkt_type type = {}; 761 | struct ipt_do_table_args *args; 762 | u32 pid = bpf_get_current_pid_tgid(); 763 | 764 | args = cur_ipt_do_table_args.lookup(&pid); 765 | if (args == 0) 766 | return 0; 767 | 768 | cur_ipt_do_table_args.delete(&pid); 769 | 770 | if (do_trace_skb(&event, ctx, args->skb, NULL) < 0) 771 | return 0; 772 | 773 | event.flags |= ROUTE_EVENT_IPTABLE; 774 | member_read(&event.hook, args->state, hook); 775 | member_read(&event.pf, args->state, pf); 776 | member_read(&event.tablename, args->table, name); 777 | event.verdict = PT_REGS_RC(ctx); 778 | event.skb=args->skb; 779 | bpf_probe_read(&type.value, 1, ((char*)args->skb) + offsetof(typeof(*args->skb), __pkt_type_offset)); 780 | event.pkt_type = type.pkt_type; 781 | CALL_STACK(ctx, &event); 782 | route_event.perf_submit(ctx, &event, sizeof(event)); 783 | 784 | return 0; 785 | } 786 | 787 | int kprobe__ipt_do_table(struct pt_regs *ctx, struct sk_buff *skb, const struct nf_hook_state *state, struct xt_table *table) 788 | { 789 | return __ipt_do_table_in(ctx, skb, state, table); 790 | }; 791 | 792 | /* 793 | * tricky: use ebx as the 1st parms, thus get skb 794 | */ 795 | int kretprobe__ipt_do_table(struct pt_regs *ctx) 796 | { 797 | struct sk_buff *skb=(void*)ctx->bx; 798 | return __ipt_do_table_out(ctx, skb); 799 | } 800 | #endif 801 | 802 | 803 | #if __BCC_dropstack 804 | int kprobe____kfree_skb(struct pt_regs *ctx, struct sk_buff *skb) 805 | { 806 | struct event_t event = {}; 807 | 808 | if (do_trace_skb(&event, ctx, skb, NULL) < 0) 809 | return 0; 810 | 811 | event.flags |= ROUTE_EVENT_DROP; 812 | bpf_strncpy(event.func_name, __func__+8, FUNCNAME_MAX_LEN); 813 | get_stack(ctx, &event); 814 | route_event.perf_submit(ctx, &event, sizeof(event)); 815 | return 0; 816 | } 817 | #endif 818 | 819 | #if 0 820 | int kprobe__ip6t_do_table(struct pt_regs *ctx, struct sk_buff *skb, const struct nf_hook_state *state, struct xt_table 
*table) 821 | { 822 | return __ipt_do_table_in(ctx, skb, state, table); 823 | }; 824 | 825 | int kretprobe__ip6t_do_table(struct pt_regs *ctx) 826 | { 827 | struct sk_buff *skb=(void*)ctx->bx; return __ipt_do_table_out(ctx, skb); 828 | } 829 | #endif 830 | 831 | """ 832 | bpf_text=bpf_def + bpf_text 833 | bpf_text=bpf_text.replace("__BCC_ARGS_DEFINE__", bpf_args) 834 | 835 | if args.ebpf == True: 836 | print("%s" % (bpf_text)) 837 | sys.exit() 838 | 839 | # uapi/linux/if.h 840 | IFNAMSIZ = 16 841 | 842 | # uapi/linux/netfilter/x_tables.h 843 | XT_TABLE_MAXNAMELEN = 32 844 | 845 | # uapi/linux/netfilter.h 846 | NF_VERDICT_NAME = [ 847 | 'DROP', 848 | 'ACCEPT', 849 | 'STOLEN', 850 | 'QUEUE', 851 | 'REPEAT', 852 | 'STOP', 853 | ] 854 | 855 | # uapi/linux/netfilter.h 856 | # net/ipv4/netfilter/ip_tables.c 857 | HOOKNAMES = [ 858 | "PREROUTING", 859 | "INPUT", 860 | "FORWARD", 861 | "OUTPUT", 862 | "POSTROUTING", 863 | ] 864 | 865 | TCPFLAGS = [ 866 | "FIN", 867 | "SYN", 868 | "RST", 869 | "PSH", 870 | "ACK", 871 | "URG", 872 | "ECE", 873 | "CWR", 874 | ] 875 | 876 | ROUTE_EVENT_IF = 0x0001 877 | ROUTE_EVENT_IPTABLE = 0x0002 878 | ROUTE_EVENT_DROP = 0x0004 879 | ROUTE_EVENT_NEW = 0x0010 880 | FUNCNAME_MAX_LEN = 64 881 | 882 | class TestEvt(ct.Structure): 883 | _fields_ = [ 884 | ("func_name", ct.c_char * FUNCNAME_MAX_LEN), 885 | ("flags", ct.c_ubyte), 886 | ("comm", ct.c_char * IFNAMSIZ), 887 | ("ifname", ct.c_char * IFNAMSIZ), 888 | ("netns", ct.c_uint), 889 | 890 | ("dest_mac", ct.c_ubyte * 6), 891 | ("len", ct.c_uint), 892 | ("ip_version", ct.c_ubyte), 893 | ("l4_proto", ct.c_ubyte), 894 | ("saddr", ct.c_ulonglong * 2), 895 | ("daddr", ct.c_ulonglong * 2), 896 | ("icmptype", ct.c_ubyte), 897 | ("icmpid", ct.c_ushort), 898 | ("icmpseq", ct.c_ushort), 899 | ("sport", ct.c_ushort), 900 | ("dport", ct.c_ushort), 901 | ("tcpflags", ct.c_ushort), 902 | ("seq", ct.c_uint), 903 | ("ack_seq", ct.c_uint), 904 | 905 | ("hook", ct.c_uint), 906 | ("pf", ct.c_ubyte), 907 | ("verdict", ct.c_uint), 908 | ("tablename", 
ct.c_char * XT_TABLE_MAXNAMELEN), 909 | 910 | ("skb", ct.c_ulonglong), 911 | ("pkt_type", ct.c_ubyte), 912 | 913 | ("kernel_stack_id", ct.c_int), 914 | ("kernel_ip", ct.c_ulonglong), 915 | 916 | ("test", ct.c_ulonglong) 917 | ] 918 | 919 | 920 | def _get(l, index, default): 921 | ''' 922 | Get element at index in l or return the default 923 | ''' 924 | if index < len(l): 925 | return l[index] 926 | return default 927 | 928 | def _get_tcpflags(tcpflags): 929 | flag="" 930 | start=1 931 | for index in range(len(TCPFLAGS)): 932 | if (tcpflags & (1<<index)): 933 | if start: 934 | flag += TCPFLAGS[index] 935 | start = 0 936 | else: 937 | flag += "," + TCPFLAGS[index] 938 | return flag 939 | 940 | def trans_bytes_to_string(s): 941 | if isinstance(s, bytes): 942 | return s.decode() 943 | return s 944 | 945 | def print_stack(event): 946 | kernel_stack = [] 947 | stack_traces = b.get_table("stacks") 948 | if event.kernel_stack_id > 0: 949 | kernel_tmp = stack_traces.walk(event.kernel_stack_id) 950 | # fix kernel stack 951 | for addr in kernel_tmp: 952 | kernel_stack.append(addr) 953 | for addr in kernel_stack: 954 | print(" %s" % trans_bytes_to_string(b.sym(addr, -1, show_offset=True))) 955 | 956 | def time_str(event): 957 | if args.time: 958 | return "%-7s " % datetime.datetime.now() 959 | else: 960 | return "%-7s " % datetime.datetime.now() 961 | 962 | def event_printer(cpu, data, size): 963 | # Decode event 964 | event = ct.cast(data, ct.POINTER(TestEvt)).contents 965 | 966 | if event.ip_version == 4: 967 | saddr = inet_ntop(AF_INET, pack("=I", event.saddr[0])) 968 | daddr = inet_ntop(AF_INET, pack("=I", event.daddr[0])) 969 | elif event.ip_version == 6: 970 | saddr = inet_ntop(AF_INET6, event.saddr) 971 | daddr = inet_ntop(AF_INET6, event.daddr) 972 | else: 973 | return 974 | 975 | 976 | mac_info = ':'.join(f'{b:02x}' for b in event.dest_mac) 977 | 978 | if event.l4_proto == socket.IPPROTO_TCP: 979 | pkt_info = "T_%s:%s:%u->%s:%u" % (_get_tcpflags(event.tcpflags), saddr, event.sport, daddr, event.dport) 980 | elif event.l4_proto == socket.IPPROTO_UDP: 981 | pkt_info = "U:%s:%u->%s:%u" % (saddr, event.sport, daddr, event.dport) 982 | elif event.l4_proto == socket.IPPROTO_ICMP: 983 | if event.icmptype in [8, 128]: 984 | pkt_info = "I_request:%s->%s" % (saddr, daddr) 985 | elif event.icmptype in [0, 129]: 986 | pkt_info = "I_reply:%s->%s" % (saddr, 
daddr) 987 | else: 988 | pkt_info = "I:%s->%s" % (saddr, daddr) 989 | else: 990 | pkt_info = "%u:%s->%s" % (event.l4_proto, saddr, daddr) 991 | 992 | iptables = "" 993 | if event.flags & ROUTE_EVENT_IPTABLE == ROUTE_EVENT_IPTABLE: 994 | verdict = _get(NF_VERDICT_NAME, event.verdict, "~UNK~") 995 | hook = _get(HOOKNAMES, event.hook, "~UNK~") 996 | iptables = "%u.%s.%s.%s " % (event.pf, event.tablename, hook, verdict) 997 | 998 | trace_info = "%x.%u:%s%s" % (event.skb, event.pkt_type, iptables, trans_bytes_to_string(event.func_name)) 999 | 1000 | # Print event 1001 | print("[%-8s] [%-10s] %-10s %-12s %-12s %-12s %-12s %-40s %s" % (time_str(event), event.netns, trans_bytes_to_string(event.comm), trans_bytes_to_string(event.ifname), mac_info, event.seq, event.ack_seq, pkt_info, trace_info)) 1002 | 1003 | print_stack(event) 1004 | args.catch_count = args.catch_count - 1 1005 | if args.catch_count <= 0: 1006 | sys.exit(0) 1007 | 1008 | if __name__ == "__main__": 1009 | b = BPF(text=bpf_text) 1010 | b["route_event"].open_perf_buffer(event_printer) 1011 | print("%-29s %-12s %-10s %-12s %-12s %-12s %-12s %-40s %s" % ('Time', 'NETWORK_NS', 'COMMAND', 'INTERFACE', 'DEST_MAC', 'Seq', 'Ack', 'PKT_INFO', 'TRACE_INFO')) 1012 | 1013 | try: 1014 | while True: 1015 | b.kprobe_poll(10) 1016 | except KeyboardInterrupt: 1017 | sys.exit(0) 1018 | -------------------------------------------------------------------------------- /mytracer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | import sys 5 | import socket 6 | from socket import inet_ntop, AF_INET, AF_INET6 7 | from bcc import BPF 8 | import ctypes as ct 9 | import subprocess 10 | from struct import pack 11 | import argparse 12 | import time 13 | import struct 14 | import datetime 15 | 16 | examples = """examples: 17 | mytracer.py # trace all packets 18 | mytracer.py --proto=tcp -H 1.2.3.4 # trace tcp packet with addr=1.2.3.4 19 | mytracer.py 
--proto=tcp -H 1.2.3.4 --dropstack # trace drop packet with addr=1.2.3.4 20 | mytracer.py --proto=udp -H 1.2.3.4 -P 53 --callstack # trace udp kernel stack with addr=1.2.3.4 and port=53 21 | """ 22 | 23 | parser = argparse.ArgumentParser( 24 | description="Trace any packet through TCP/IP stack", 25 | formatter_class=argparse.RawDescriptionHelpFormatter, 26 | epilog=examples) 27 | 28 | parser.add_argument("-H", "--ipaddr", type=str, 29 | help="ip address") 30 | 31 | parser.add_argument("--proto", type=str, 32 | help="tcp|udp|icmp|any") 33 | 34 | parser.add_argument("--icmpid", type=int, default=0, 35 | help="trace icmp id") 36 | 37 | parser.add_argument("-c", "--catch-count", type=int, default=1000000, 38 | help="catch and print count") 39 | 40 | parser.add_argument("-P", "--port", type=int, default=0, 41 | help="udp or tcp port") 42 | 43 | parser.add_argument("-p", "--pid", type=int, default=0, 44 | help="trace this PID only") 45 | 46 | parser.add_argument("-N", "--netns", type=int, default=0, 47 | help="trace this Network Namespace only") 48 | 49 | parser.add_argument("--dropstack", action="store_true", 50 | help="output kernel stack trace when a packet is dropped (kfree_skb)") 51 | 52 | parser.add_argument("--callstack", action="store_true", 53 | help="output kernel stack trace") 54 | 55 | parser.add_argument("--iptable", action="store_true", 56 | help="output iptable path") 57 | 58 | parser.add_argument("--route", action="store_true", 59 | help="output route path") 60 | 61 | parser.add_argument("--keep", action="store_true", 62 | help="keep trace packet all lifetime") 63 | 64 | parser.add_argument("-T", "--time", action="store_true", 65 | help="show HH:MM:SS timestamp") 66 | 67 | parser.add_argument("--ebpf", action="store_true", 68 | help=argparse.SUPPRESS) 69 | 70 | parser.add_argument("--debug", action="store_true", 71 | help=argparse.SUPPRESS) 72 | 73 | args = parser.parse_args() 74 | if args.debug == True: 75 | print("pid=%d time=%d ipaddr=%s port=%d netns=%d
proto=%s icmpid=%d dropstack=%d" % \ 76 | (args.pid,args.time,args.ipaddr, args.port,args.netns,args.proto,args.icmpid, args.dropstack)) 77 | sys.exit() 78 | 79 | 80 | ipproto={} 81 | #ipproto["tcp"]="IPPROTO_TCP" 82 | ipproto["tcp"]="6" 83 | #ipproto["udp"]="IPPROTO_UDP" 84 | ipproto["udp"]="17" 85 | #ipproto["icmp"]="IPPROTO_ICMP" 86 | ipproto["icmp"]="1" 87 | proto = 0 if args.proto == None else (0 if ipproto.get(args.proto) == None else ipproto[args.proto]) 88 | #ipaddr=socket.htonl(struct.unpack("I",socket.inet_aton("0" if args.ipaddr == None else args.ipaddr))[0]) 89 | #port=socket.htons(args.port) 90 | ipaddr=(struct.unpack("I",socket.inet_aton("0" if args.ipaddr == None else args.ipaddr))[0]) 91 | port=(args.port) 92 | icmpid=socket.htons(args.icmpid) 93 | 94 | bpf_def="#define __BCC_ARGS__\n" 95 | bpf_args="#define __BCC_pid (%d)\n" % (args.pid) 96 | bpf_args+="#define __BCC_ipaddr (0x%x)\n" % (ipaddr) 97 | bpf_args+="#define __BCC_port (%d)\n" % (port) 98 | bpf_args+="#define __BCC_netns (%d)\n" % (args.netns) 99 | bpf_args+="#define __BCC_proto (%s)\n" % (proto) 100 | bpf_args+="#define __BCC_icmpid (%d)\n" % (icmpid) 101 | bpf_args+="#define __BCC_dropstack (%d)\n" % (args.dropstack) 102 | bpf_args+="#define __BCC_callstack (%d)\n" % (args.callstack) 103 | bpf_args+="#define __BCC_iptable (%d)\n" % (args.iptable) 104 | bpf_args+="#define __BCC_route (%d)\n" % (args.route) 105 | bpf_args+="#define __BCC_keep (%d)\n" % (args.keep) 106 | 107 | # bpf_text=open(r"mytracer.c", "r").read() 108 | bpf_text= """ 109 | #include <bcc/proto.h> 110 | #include <linux/sched.h> 111 | #include <linux/netdevice.h> 112 | #include <linux/netfilter.h> 113 | #include <linux/netfilter/x_tables.h> 114 | #include <net/sock.h> 115 | #include <uapi/linux/ip.h> 116 | #include <uapi/linux/ipv6.h> 117 | #include <uapi/linux/icmp.h> 118 | #include <uapi/linux/icmpv6.h> 119 | #include <uapi/linux/tcp.h> 120 | #include <uapi/linux/udp.h> 121 | 122 | #define ROUTE_EVENT_IF 0x0001 123 | #define ROUTE_EVENT_IPTABLE 0x0002 124 | #define ROUTE_EVENT_DROP 0x0004 125 | #define ROUTE_EVENT_NEW 0x0010 126 | 127 | #ifdef __BCC_ARGS__ 128 | __BCC_ARGS_DEFINE__ 129 | #else 130 | #define __BCC_pid 0 131 | #define
__BCC_ipaddr 0 132 | #define __BCC_port 0 133 | #define __BCC_icmpid 0 134 | #define __BCC_dropstack 0 135 | #define __BCC_callstack 0 136 | #define __BCC_iptable 0 137 | #define __BCC_route 0 138 | #define __BCC_keep 0 139 | #define __BCC_proto 0 140 | #define __BCC_netns 0 141 | #endif 142 | 143 | /* route info as default */ 144 | #if !__BCC_dropstack && !__BCC_iptable && !__BCC_route 145 | #undef __BCC_route 146 | #define __BCC_route 1 147 | #endif 148 | 149 | #if (__BCC_dropstack) || (!__BCC_pid && !__BCC_ipaddr && !__BCC_port && !__BCC_icmpid &&! __BCC_proto && !__BCC_netns) 150 | #undef __BCC_keep 151 | #define __BCC_keep 0 152 | #endif 153 | 154 | BPF_STACK_TRACE(stacks, 2048); 155 | 156 | #define FUNCNAME_MAX_LEN 64 157 | struct event_t { 158 | char func_name[FUNCNAME_MAX_LEN]; 159 | u8 flags; 160 | 161 | // route info 162 | char comm[IFNAMSIZ]; 163 | char ifname[IFNAMSIZ]; 164 | 165 | u32 netns; 166 | 167 | // pkt info 168 | u8 dest_mac[6]; 169 | u32 len; 170 | u8 ip_version; 171 | u8 l4_proto; 172 | u64 saddr[2]; 173 | u64 daddr[2]; 174 | u8 icmptype; 175 | u16 icmpid; 176 | u16 icmpseq; 177 | u16 sport; 178 | u16 dport; 179 | u16 tcpflags; 180 | u32 seq; 181 | u32 ack_seq; 182 | // u32 tcp_len; 183 | 184 | // ipt info 185 | u32 hook; 186 | u8 pf; 187 | u32 verdict; 188 | char tablename[XT_TABLE_MAXNAMELEN]; 189 | 190 | void *skb; 191 | // skb info 192 | u8 pkt_type; //skb->pkt_type 193 | 194 | // call stack 195 | int kernel_stack_id; 196 | u64 kernel_ip; 197 | 198 | //time 199 | u64 test; 200 | }; 201 | BPF_PERF_OUTPUT(route_event); 202 | 203 | struct ipt_do_table_args 204 | { 205 | struct sk_buff *skb; 206 | const struct nf_hook_state *state; 207 | struct xt_table *table; 208 | }; 209 | BPF_HASH(cur_ipt_do_table_args, u32, struct ipt_do_table_args); 210 | 211 | union ___skb_pkt_type { 212 | __u8 value; 213 | struct { 214 | __u8 __pkt_type_offset[0]; 215 | __u8 pkt_type:3; 216 | __u8 pfmemalloc:1; 217 | __u8 ignore_df:1; 218 | 219 | __u8 nf_trace:1; 220 
| __u8 ip_summed:2; 221 | }; 222 | }; 223 | 224 | #if __BCC_keep 225 | #endif 226 | 227 | #define MAC_HEADER_SIZE 14; 228 | #define member_address(source_struct, source_member) \ 229 | ({ \ 230 | void* __ret; \ 231 | __ret = (void*) (((char*)source_struct) + offsetof(typeof(*source_struct), source_member)); \ 232 | __ret; \ 233 | }) 234 | #define member_read(destination, source_struct, source_member) \ 235 | do{ \ 236 | bpf_probe_read( \ 237 | destination, \ 238 | sizeof(source_struct->source_member), \ 239 | member_address(source_struct, source_member) \ 240 | ); \ 241 | } while(0) 242 | 243 | enum { 244 | __TCP_FLAG_FIN, // 0 245 | __TCP_FLAG_SYN, // 1 246 | __TCP_FLAG_RST, // 2 247 | __TCP_FLAG_PSH, // 3 248 | __TCP_FLAG_ACK, // 4 249 | __TCP_FLAG_URG, // 5 250 | __TCP_FLAG_ECE, // 6 251 | __TCP_FLAG_CWR // 7 252 | }; 253 | 254 | static void bpf_strncpy(char *dst, const char *src, int n) 255 | { 256 | int i = 0, j; 257 | #define CPY(n) \ 258 | do { \ 259 | for (; i < n; i++) { \ 260 | if (src[i] == 0) return; \ 261 | dst[i] = src[i]; \ 262 | } \ 263 | } while(0) 264 | 265 | for (j = 10; j < 64; j += 10) 266 | CPY(j); 267 | CPY(64); 268 | #undef CPY 269 | } 270 | 271 | #define TCP_FLAGS_INIT(new_flags, orig_flags, flag) \ 272 | do { \ 273 | if (orig_flags & flag) { \ 274 | new_flags |= (1U<<__##flag); \ 275 | } \ 276 | } while (0) 277 | 278 | #define init_tcpflags_bits(new_flags, orig_flags) \ 279 | ({ \ 280 | new_flags = 0; \ 281 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_FIN); \ 282 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_SYN); \ 283 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_RST); \ 284 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_PSH); \ 285 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_ACK); \ 286 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_URG); \ 287 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_ECE); \ 288 | TCP_FLAGS_INIT(new_flags, orig_flags, TCP_FLAG_CWR); \ 289 | }) 290 | 291 | 292 | static void get_stack(struct 
pt_regs *ctx, struct event_t *event) 293 | { 294 | event->kernel_stack_id = stacks.get_stackid(ctx, 0); 295 | if (event->kernel_stack_id >= 0) { 296 | u64 ip = PT_REGS_IP(ctx); 297 | u64 page_offset; 298 | // if ip isn't sane, leave key ips as zero for later checking 299 | #if defined(CONFIG_X86_64) && defined(__PAGE_OFFSET_BASE) 300 | // x64, 4.16, ..., 4.11, etc., but some earlier kernel didn't have it 301 | page_offset = __PAGE_OFFSET_BASE; 302 | #elif defined(CONFIG_X86_64) && defined(__PAGE_OFFSET_BASE_L4) 303 | // x64, 4.17, and later 304 | #if defined(CONFIG_DYNAMIC_MEMORY_LAYOUT) && defined(CONFIG_X86_5LEVEL) 305 | page_offset = __PAGE_OFFSET_BASE_L5; 306 | #else 307 | page_offset = __PAGE_OFFSET_BASE_L4; 308 | #endif 309 | #else 310 | // earlier x86_64 kernels, e.g., 4.6, comes here 311 | // arm64, s390, powerpc, x86_32 312 | page_offset = PAGE_OFFSET; 313 | #endif 314 | if (ip > page_offset) { 315 | event->kernel_ip = ip; 316 | } 317 | } 318 | return; 319 | } 320 | 321 | #define CALL_STACK(ctx, event) \ 322 | do { \ 323 | if (__BCC_callstack) \ 324 | get_stack(ctx, event); \ 325 | } while (0) 326 | 327 | 328 | /** 329 | * Common tracepoint handler. Detect IPv4/IPv6 and 330 | * emit event with address, interface and namespace. 
331 | */ 332 | static int 333 | do_trace_skb(struct event_t *event, void *ctx, struct sk_buff *skb, void *netdev) 334 | { 335 | struct net_device *dev; 336 | 337 | char *head; 338 | char *l2_header_address; 339 | char *l3_header_address; 340 | char *l4_header_address; 341 | 342 | u16 mac_header; 343 | u16 network_header; 344 | 345 | u8 proto_icmp_echo_request; 346 | u8 proto_icmp_echo_reply; 347 | u8 l4_offset_from_ip_header; 348 | 349 | struct icmphdr icmphdr; 350 | union tcp_word_hdr tcphdr; 351 | struct udphdr udphdr; 352 | 353 | // Get device pointer, we'll need it to get the name and network namespace 354 | event->ifname[0] = 0; 355 | if (netdev) 356 | dev = netdev; 357 | else 358 | member_read(&dev, skb, dev); 359 | 360 | bpf_probe_read(&event->ifname, IFNAMSIZ, dev->name); 361 | 362 | if (event->ifname[0] == 0 || dev == NULL) 363 | bpf_strncpy(event->ifname, "nil", IFNAMSIZ); 364 | 365 | event->flags |= ROUTE_EVENT_IF; 366 | 367 | #ifdef CONFIG_NET_NS 368 | struct net* net; 369 | 370 | // Get netns id. 
The code below is equivalent to: event->netns = dev->nd_net.net->ns.inum 371 | possible_net_t *skc_net = &dev->nd_net; 372 | member_read(&net, skc_net, net); 373 | struct ns_common *ns = member_address(net, ns); 374 | member_read(&event->netns, ns, inum); 375 | 376 | // maybe the skb->dev is not init, for this situation, we can get ns by sk->__sk_common.skc_net.net->ns.inum 377 | if (event->netns == 0) { 378 | struct sock *sk; 379 | struct sock_common __sk_common; 380 | struct ns_common* ns2; 381 | member_read(&sk, skb, sk); 382 | if (sk != NULL) { 383 | member_read(&__sk_common, sk, __sk_common); 384 | ns2 = member_address(__sk_common.skc_net.net, ns); 385 | member_read(&event->netns, ns2, inum); 386 | } 387 | } 388 | 389 | 390 | #endif 391 | 392 | member_read(&event->len, skb, len); 393 | member_read(&head, skb, head); 394 | member_read(&mac_header, skb, mac_header); 395 | member_read(&network_header, skb, network_header); 396 | 397 | if(network_header == 0) { 398 | network_header = mac_header + MAC_HEADER_SIZE; 399 | } 400 | 401 | l2_header_address = mac_header + head; 402 | bpf_probe_read(&event->dest_mac, 6, l2_header_address); 403 | 404 | l3_header_address = head + network_header; 405 | bpf_probe_read(&event->ip_version, sizeof(u8), l3_header_address); 406 | event->ip_version = event->ip_version >> 4 & 0xf; 407 | 408 | if (event->ip_version == 4) { 409 | struct iphdr iphdr; 410 | bpf_probe_read(&iphdr, sizeof(iphdr), l3_header_address); 411 | 412 | l4_offset_from_ip_header = iphdr.ihl * 4; 413 | event->l4_proto = iphdr.protocol; 414 | event->saddr[0] = iphdr.saddr; 415 | event->daddr[0] = iphdr.daddr; 416 | bpf_get_current_comm(event->comm, sizeof(event->comm)); 417 | 418 | if (event->l4_proto == IPPROTO_ICMP) { 419 | proto_icmp_echo_request = ICMP_ECHO; 420 | proto_icmp_echo_reply = ICMP_ECHOREPLY; 421 | } 422 | 423 | } else if (event->ip_version == 6) { 424 | // Assume no option header --> fixed size header 425 | struct ipv6hdr* ipv6hdr = (struct 
ipv6hdr*)l3_header_address; 426 | l4_offset_from_ip_header = sizeof(*ipv6hdr); 427 | 428 | bpf_probe_read(&event->l4_proto, sizeof(ipv6hdr->nexthdr), (char*)ipv6hdr + offsetof(struct ipv6hdr, nexthdr)); 429 | bpf_probe_read(event->saddr, sizeof(ipv6hdr->saddr), (char*)ipv6hdr + offsetof(struct ipv6hdr, saddr)); 430 | bpf_probe_read(event->daddr, sizeof(ipv6hdr->daddr), (char*)ipv6hdr + offsetof(struct ipv6hdr, daddr)); 431 | 432 | if (event->l4_proto == IPPROTO_ICMPV6) { 433 | proto_icmp_echo_request = ICMPV6_ECHO_REQUEST; 434 | proto_icmp_echo_reply = ICMPV6_ECHO_REPLY; 435 | } 436 | 437 | } else { 438 | return -1; 439 | } 440 | 441 | l4_header_address = l3_header_address + l4_offset_from_ip_header; 442 | switch (event->l4_proto) { 443 | case IPPROTO_ICMPV6: 444 | case IPPROTO_ICMP: 445 | bpf_probe_read(&icmphdr, sizeof(icmphdr), l4_header_address); 446 | if (icmphdr.type != proto_icmp_echo_request && icmphdr.type != proto_icmp_echo_reply) { 447 | return -1; 448 | } 449 | event->icmptype = icmphdr.type; 450 | event->icmpid = be16_to_cpu(icmphdr.un.echo.id); 451 | event->icmpseq = be16_to_cpu(icmphdr.un.echo.sequence); 452 | break; 453 | case IPPROTO_TCP: 454 | bpf_probe_read(&tcphdr, sizeof(tcphdr), l4_header_address); 455 | init_tcpflags_bits(event->tcpflags, tcp_flag_word(&tcphdr)); 456 | event->sport = be16_to_cpu(tcphdr.hdr.source); 457 | event->dport = be16_to_cpu(tcphdr.hdr.dest); 458 | event->seq = be32_to_cpu(tcphdr.hdr.seq); 459 | event->ack_seq = be32_to_cpu(tcphdr.hdr.ack_seq); 460 | // event->tcp_len = tcphdr.hdr.doff * 4; 461 | break; 462 | case IPPROTO_UDP: 463 | bpf_probe_read(&udphdr, sizeof(udphdr), l4_header_address); 464 | event->sport = be16_to_cpu(udphdr.source); 465 | event->dport = be16_to_cpu(udphdr.dest); 466 | break; 467 | default: 468 | return -1; 469 | } 470 | 471 | #if __BCC_keep 472 | #endif 473 | 474 | 475 | /* 476 | * netns filter 477 | */ 478 | if (__BCC_netns !=0 && event->netns != 0 && event->netns != __BCC_netns) { 479 | return 
-1; 480 | } 481 | 482 | /* 483 | * pid filter 484 | */ 485 | #if __BCC_pid 486 | u64 tgid = bpf_get_current_pid_tgid() >> 32; 487 | if (tgid != __BCC_pid) 488 | return -1; 489 | #endif 490 | 491 | /* 492 | * skb filter 493 | */ 494 | #if __BCC_ipaddr 495 | if (event->ip_version == 4) { 496 | if (__BCC_ipaddr != event->saddr[0] && __BCC_ipaddr != event->daddr[0]) 497 | return -1; 498 | } else { 499 | return -1; 500 | } 501 | #endif 502 | 503 | #if __BCC_proto 504 | if (__BCC_proto != event->l4_proto) 505 | return -1; 506 | #endif 507 | 508 | #if __BCC_port 509 | if ( (event->l4_proto == IPPROTO_UDP || event->l4_proto == IPPROTO_TCP) && 510 | (__BCC_port != event->sport && __BCC_port != event->dport)) 511 | return -1; 512 | #endif 513 | 514 | #if __BCC_icmpid 515 | if (__BCC_proto == IPPROTO_ICMP && __BCC_icmpid != event->icmpid) 516 | return -1; 517 | #endif 518 | 519 | #if __BCC_keep 520 | #endif 521 | 522 | return 0; 523 | } 524 | 525 | static int 526 | do_trace(void *ctx, struct sk_buff *skb, const char *func_name, void *netdev) 527 | { 528 | struct event_t event = {}; 529 | union ___skb_pkt_type type = {}; 530 | 531 | if (do_trace_skb(&event, ctx, skb, netdev) < 0) 532 | return 0; 533 | 534 | event.skb=skb; 535 | bpf_probe_read(&type.value, 1, ((char*)skb) + offsetof(typeof(*skb), __pkt_type_offset)); 536 | event.pkt_type = type.pkt_type; 537 | bpf_strncpy(event.func_name, func_name, FUNCNAME_MAX_LEN); 538 | CALL_STACK(ctx, &event); 539 | route_event.perf_submit(ctx, &event, sizeof(event)); 540 | out: 541 | return 0; 542 | } 543 | 544 | #if __BCC_route 545 | 546 | /* 547 | * netif rcv hook: 548 | * 1) int netif_rx(struct sk_buff *skb) 549 | * 2) int __netif_receive_skb(struct sk_buff *skb) 550 | * 3) gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb) 551 | * 4) ... 
552 | */ 553 | int kprobe__netif_rx(struct pt_regs *ctx, struct sk_buff *skb) 554 | { 555 | return do_trace(ctx, skb, __func__+8, NULL); 556 | } 557 | 558 | int kprobe____netif_receive_skb(struct pt_regs *ctx, struct sk_buff *skb) 559 | { 560 | return do_trace(ctx, skb, __func__+8, NULL); 561 | } 562 | 563 | int kprobe__tpacket_rcv(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 564 | { 565 | return do_trace(ctx, skb, __func__+8, orig_dev); 566 | } 567 | 568 | int kprobe__packet_rcv(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 569 | { 570 | return do_trace(ctx, skb, __func__+8, orig_dev); 571 | } 572 | 573 | 574 | int kprobe__napi_gro_receive(struct pt_regs *ctx, struct napi_struct *napi, struct sk_buff *skb) 575 | { 576 | return do_trace(ctx, skb, __func__+8, NULL); 577 | } 578 | 579 | 580 | /* 581 | * tcp recv hook: 582 | * 1) int __tcp_v4_rcv(struct sk_buff *skb, struct net_device *sb_dev) 583 | * 2) ... 
584 | */ 585 | 586 | int kprobe__tcp_v4_rcv(struct pt_regs *ctx, struct sk_buff *skb) 587 | { 588 | return do_trace(ctx, skb, __func__+8, NULL); 589 | } 590 | 591 | /* if dropstack cannot caputre error packet,please open this probe,use callstack args 592 | int kprobe__tcp_validate_incoming(struct pt_regs *ctx, struct sock *sk, struct sk_buff *skb, const struct tcphdr *th, int syn_inerr) 593 | { 594 | return do_trace(ctx, skb, __func__+8, NULL); 595 | } 596 | 597 | int kprobe__tcp_rcv_state_process(struct pt_regs *ctx, struct sock *sk, struct sk_buff *skb) 598 | { 599 | return do_trace(ctx, skb, __func__+8, NULL); 600 | } 601 | 602 | int kprobe____kfree_skb(struct pt_regs *ctx, struct sk_buff *skb) 603 | { 604 | return do_trace(ctx, skb, __func__+8, NULL); 605 | } 606 | 607 | 608 | int kprobe__kfree_skb(struct pt_regs *ctx, struct sk_buff *skb) 609 | { 610 | return do_trace(ctx, skb, __func__+8, NULL); 611 | } 612 | 613 | int kprobe__consume_skb(struct pt_regs *ctx, struct sk_buff *skb) 614 | { 615 | return do_trace(ctx, skb, __func__+8, NULL); 616 | } 617 | 618 | 619 | int kprobe__tcp_check_req(struct pt_regs *ctx, struct sock *sk, struct sk_buff *skb, struct request_sock *req, bool fastopen, bool *req_stolen) 620 | 621 | { 622 | return do_trace(ctx, skb, __func__+8, NULL); 623 | } 624 | 625 | */ 626 | 627 | /* 628 | * skb copy hook: 629 | * 1) int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, struct iov_iter *to, int len) 630 | * 2) ... 631 | */ 632 | int kprobe__skb_copy_datagram_iter(struct pt_regs *ctx, const struct sk_buff *skb, int offset, struct iov_iter *to, int len) 633 | { 634 | return do_trace(ctx, skb, __func__+8, NULL); 635 | } 636 | 637 | /* 638 | * netif send hook: 639 | * 1) int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) 640 | * 2) ... 
641 | */ 642 | 643 | int kprobe____dev_queue_xmit(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *sb_dev) 644 | { 645 | return do_trace(ctx, skb, __func__+8, NULL); 646 | } 647 | 648 | /* 649 | * ip layer: 650 | * 1) int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 651 | * 2) int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb) 652 | * 3) int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb) 653 | * 4) int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb) 654 | * 5) int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb) 655 | * 6) ... 656 | */ 657 | 658 | int kprobe__ip_rcv(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev) 659 | { 660 | return do_trace(ctx, skb, __func__+8, NULL); 661 | } 662 | 663 | int kprobe__ip_rcv_finish(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 664 | { 665 | return do_trace(ctx, skb, __func__+8, NULL); 666 | } 667 | 668 | int kprobe__ip_output(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 669 | { 670 | return do_trace(ctx, skb, __func__+8, NULL); 671 | } 672 | 673 | int kprobe__ip_finish_output(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 674 | { 675 | return do_trace(ctx, skb, __func__+8, NULL); 676 | } 677 | 678 | int kprobe____ip_local_out(struct pt_regs *ctx, struct net *net, struct sock *sk, struct sk_buff *skb) 679 | { 680 | return do_trace(ctx, skb, __func__+8, NULL); 681 | } 682 | 683 | #endif 684 | 685 | #if __BCC_iptable 686 | static int 687 | __ipt_do_table_in(struct pt_regs *ctx, struct sk_buff *skb, 688 | const struct nf_hook_state *state, struct xt_table *table) 689 | { 690 | u32 pid = bpf_get_current_pid_tgid(); 691 | 692 | struct ipt_do_table_args args = { 693 | .skb = skb, 694 | .state = state, 695 | .table = table, 
696 | }; 697 | cur_ipt_do_table_args.update(&pid, &args); 698 | 699 | return 0; 700 | }; 701 | 702 | static int 703 | __ipt_do_table_out(struct pt_regs * ctx, struct sk_buff *skb) 704 | { 705 | struct event_t event = {}; 706 | union ___skb_pkt_type type = {}; 707 | struct ipt_do_table_args *args; 708 | u32 pid = bpf_get_current_pid_tgid(); 709 | 710 | args = cur_ipt_do_table_args.lookup(&pid); 711 | if (args == 0) 712 | return 0; 713 | 714 | cur_ipt_do_table_args.delete(&pid); 715 | 716 | if (do_trace_skb(&event, ctx, args->skb, NULL) < 0) 717 | return 0; 718 | 719 | event.flags |= ROUTE_EVENT_IPTABLE; 720 | member_read(&event.hook, args->state, hook); 721 | member_read(&event.pf, args->state, pf); 722 | member_read(&event.tablename, args->table, name); 723 | event.verdict = PT_REGS_RC(ctx); 724 | event.skb=args->skb; 725 | bpf_probe_read(&type.value, 1, ((char*)args->skb) + offsetof(typeof(*args->skb), __pkt_type_offset)); 726 | event.pkt_type = type.pkt_type; 727 | CALL_STACK(ctx, &event); 728 | route_event.perf_submit(ctx, &event, sizeof(event)); 729 | 730 | return 0; 731 | } 732 | 733 | int kprobe__ipt_do_table(struct pt_regs *ctx, struct sk_buff *skb, const struct nf_hook_state *state, struct xt_table *table) 734 | { 735 | return __ipt_do_table_in(ctx, skb, state, table); 736 | }; 737 | 738 | /* 739 | * tricky: use ebx as the 1st parms, thus get skb 740 | */ 741 | int kretprobe__ipt_do_table(struct pt_regs *ctx) 742 | { 743 | struct sk_buff *skb=(void*)ctx->bx; 744 | return __ipt_do_table_out(ctx, skb); 745 | } 746 | #endif 747 | 748 | 749 | #if __BCC_dropstack 750 | int kprobe____kfree_skb(struct pt_regs *ctx, struct sk_buff *skb) 751 | { 752 | struct event_t event = {}; 753 | 754 | if (do_trace_skb(&event, ctx, skb, NULL) < 0) 755 | return 0; 756 | 757 | event.flags |= ROUTE_EVENT_DROP; 758 | bpf_strncpy(event.func_name, __func__+8, FUNCNAME_MAX_LEN); 759 | get_stack(ctx, &event); 760 | route_event.perf_submit(ctx, &event, sizeof(event)); 761 | return 0; 762 
| } 763 | #endif 764 | 765 | #if 0 766 | int kprobe__ip6t_do_table(struct pt_regs *ctx, struct sk_buff *skb, const struct nf_hook_state *state, struct xt_table *table) 767 | { 768 | return __ipt_do_table_in(ctx, skb, state, table); 769 | }; 770 | 771 | int kretprobe__ip6t_do_table(struct pt_regs *ctx) 772 | { 773 | return __ipt_do_table_out(ctx); 774 | } 775 | #endif 776 | 777 | """ 778 | bpf_text=bpf_def + bpf_text 779 | bpf_text=bpf_text.replace("__BCC_ARGS_DEFINE__", bpf_args) 780 | 781 | if args.ebpf == True: 782 | print("%s" % (bpf_text)) 783 | sys.exit() 784 | 785 | # uapi/linux/if.h 786 | IFNAMSIZ = 16 787 | 788 | # uapi/linux/netfilter/x_tables.h 789 | XT_TABLE_MAXNAMELEN = 32 790 | 791 | # uapi/linux/netfilter.h 792 | NF_VERDICT_NAME = [ 793 | 'DROP', 794 | 'ACCEPT', 795 | 'STOLEN', 796 | 'QUEUE', 797 | 'REPEAT', 798 | 'STOP', 799 | ] 800 | 801 | # uapi/linux/netfilter.h 802 | # net/ipv4/netfilter/ip_tables.c 803 | HOOKNAMES = [ 804 | "PREROUTING", 805 | "INPUT", 806 | "FORWARD", 807 | "OUTPUT", 808 | "POSTROUTING", 809 | ] 810 | 811 | TCPFLAGS = [ 812 | "FIN", 813 | "SYN", 814 | "RST", 815 | "PSH", 816 | "ACK", 817 | "URG", 818 | "ECE", 819 | "CWR", 820 | ] 821 | 822 | ROUTE_EVENT_IF = 0x0001 823 | ROUTE_EVENT_IPTABLE = 0x0002 824 | ROUTE_EVENT_DROP = 0x0004 825 | ROUTE_EVENT_NEW = 0x0010 826 | FUNCNAME_MAX_LEN = 64 827 | 828 | class TestEvt(ct.Structure): 829 | _fields_ = [ 830 | ("func_name", ct.c_char * FUNCNAME_MAX_LEN), 831 | ("flags", ct.c_ubyte), 832 | ("comm", ct.c_char * IFNAMSIZ), 833 | ("ifname", ct.c_char * IFNAMSIZ), 834 | ("netns", ct.c_uint), 835 | 836 | ("dest_mac", ct.c_ubyte * 6), 837 | ("len", ct.c_uint), 838 | ("ip_version", ct.c_ubyte), 839 | ("l4_proto", ct.c_ubyte), 840 | ("saddr", ct.c_ulonglong * 2), 841 | ("daddr", ct.c_ulonglong * 2), 842 | ("icmptype", ct.c_ubyte), 843 | ("icmpid", ct.c_ushort), 844 | ("icmpseq", ct.c_ushort), 845 | ("sport", ct.c_ushort), 846 | ("dport", ct.c_ushort), 847 | ("tcpflags", ct.c_ushort), 848 | 
("seq", ct.c_uint), 849 | ("ack_seq", ct.c_uint), 850 | 851 | ("hook", ct.c_uint), 852 | ("pf", ct.c_ubyte), 853 | ("verdict", ct.c_uint), 854 | ("tablename", ct.c_char * XT_TABLE_MAXNAMELEN), 855 | 856 | ("skb", ct.c_ulonglong), 857 | ("pkt_type", ct.c_ubyte), 858 | 859 | ("kernel_stack_id", ct.c_int), 860 | ("kernel_ip", ct.c_ulonglong), 861 | 862 | ("test", ct.c_ulonglong) 863 | ] 864 | 865 | 866 | def _get(l, index, default): 867 | ''' 868 | Get element at index in l or return the default 869 | ''' 870 | if index < len(l): 871 | return l[index] 872 | return default 873 | 874 | def _get_tcpflags(tcpflags): 875 | flag="" 876 | start=1 877 | for index in range(len(TCPFLAGS)): 878 | if (tcpflags & (1<<index)): 879 | if start: 880 | flag += TCPFLAGS[index] 881 | start = 0 882 | else: 883 | flag += "," + TCPFLAGS[index] 884 | return flag 885 | 886 | def trans_bytes_to_string(to_convert): 887 | return str(to_convert, encoding="utf-8") 888 | 889 | def print_stack(event): 890 | if not (args.callstack or (args.dropstack and (event.flags & ROUTE_EVENT_DROP))): 891 | return 892 | stack_traces = b.get_table("stacks") 893 | kernel_stack = [] 894 | if event.kernel_stack_id > 0: 895 | kernel_tmp = stack_traces.walk(event.kernel_stack_id) 896 | # fix kernel stack 897 | for addr in kernel_tmp: 898 | kernel_stack.append(addr) 899 | for addr in kernel_stack: 900 | print(" %s" % trans_bytes_to_string(b.sym(addr, -1, show_offset=True))) 901 | 902 | def time_str(event): 903 | # -T is kept for compatibility; the timestamp is always printed in the new format 904 | return "%-7s " % datetime.datetime.now() 905 | 906 | 907 | 908 | def event_printer(cpu, data, size): 909 | # Decode event 910 | event = ct.cast(data, ct.POINTER(TestEvt)).contents 911 | 912 | if event.ip_version == 4: 913 | saddr = inet_ntop(AF_INET, pack("=I", event.saddr[0])) 914 | daddr = inet_ntop(AF_INET, pack("=I", event.daddr[0])) 915 | elif event.ip_version == 6: 916 | saddr = inet_ntop(AF_INET6, event.saddr) 917 | daddr = inet_ntop(AF_INET6, event.daddr) 918 | else: 919 | return 920 | 921 | 922 | mac_info = ':'.join(f'{b:02x}' for b in event.dest_mac) 923 | 924 | if event.l4_proto == socket.IPPROTO_TCP: 925 | pkt_info = "T_%s:%s:%u->%s:%u" % (_get_tcpflags(event.tcpflags), saddr, event.sport, daddr, event.dport) 926 | elif event.l4_proto == socket.IPPROTO_UDP: 927 | pkt_info = "U:%s:%u->%s:%u" % (saddr, event.sport, daddr, event.dport) 928 | elif event.l4_proto == socket.IPPROTO_ICMP: 929 | if
event.icmptype in [8, 128]: 930 | pkt_info = "I_request:%s->%s" % (saddr, daddr) 931 | elif event.icmptype in [0, 129]: 932 | pkt_info = "I_reply:%s->%s" % (saddr, daddr) 933 | else: 934 | pkt_info = "I:%s->%s" % (saddr, daddr) 935 | else: 936 | pkt_info = "%u:%s->%s" % (event.l4_proto, saddr, daddr) 937 | 938 | iptables = "" 939 | if event.flags & ROUTE_EVENT_IPTABLE == ROUTE_EVENT_IPTABLE: 940 | verdict = _get(NF_VERDICT_NAME, event.verdict, "~UNK~") 941 | hook = _get(HOOKNAMES, event.hook, "~UNK~") 942 | iptables = "%u.%s.%s.%s " % (event.pf, event.tablename, hook, verdict) 943 | 944 | trace_info = "%x.%u:%s%s" % (event.skb, event.pkt_type, iptables, trans_bytes_to_string(event.func_name)) 945 | 946 | # Print event 947 | print("[%-8s] [%-10s] %-10s %-12s %-12s %-12s %-12s %-40s %s" % (time_str(event), event.netns, trans_bytes_to_string(event.comm), trans_bytes_to_string(event.ifname), mac_info, event.seq, event.ack_seq, pkt_info, trace_info)) 948 | 949 | print_stack(event) 950 | args.catch_count = args.catch_count - 1 951 | if args.catch_count <= 0: 952 | sys.exit(0) 953 | 954 | if __name__ == "__main__": 955 | b = BPF(text=bpf_text) 956 | b["route_event"].open_perf_buffer(event_printer) 957 | print("%-29s %-12s %-10s %-12s %-12s %-12s %-12s %-40s %s" % ('Time', 'NETWORK_NS', 'COMMAND', 'INTERFACE', 'DEST_MAC', 'Seq', 'Ack', 'PKT_INFO', 'TRACE_INFO')) 958 | 959 | try: 960 | while True: 961 | b.kprobe_poll(10) 962 | except KeyboardInterrupt: 963 | sys.exit(0) 964 | --------------------------------------------------------------------------------
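For readers studying the output format: the `T_SYN,ACK:...` prefix in the `PKT_INFO` column is produced by testing each bit of the event's `tcpflags` field against the `TCPFLAGS` table, in the order FIN, SYN, RST, PSH, ACK, URG, ECE, CWR (the bit layout set by `init_tcpflags_bits()` on the BPF side). A minimal standalone sketch of that decoding, mirroring `_get_tcpflags` from mytracer.py (not part of the repository itself):

```python
# Standalone sketch of mytracer.py's TCP-flag decoding for the PKT_INFO column.
# Bit i of the event's tcpflags field corresponds to TCPFLAGS[i].
TCPFLAGS = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]

def get_tcpflags(tcpflags: int) -> str:
    """Return a comma-separated flag string for a tcpflags bitmask."""
    return ",".join(name for i, name in enumerate(TCPFLAGS) if tcpflags & (1 << i))

if __name__ == "__main__":
    print(get_tcpflags(0b00010010))  # bits 1 and 4 set: prints "SYN,ACK"
```

This makes it easy to read a trace line such as `T_SYN,ACK:1.2.3.4:80->...` back into the raw flag bits when comparing against tcpdump output.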