├── 2_UsingSystemTap.md
├── 6_UnderstandingSystemTapErrors.md
├── flight_record_mode_in_memory.gif
├── 1_Introduction.md
├── 5_UsefulSystemTapScripts.md
├── 1_3_LimitationsOfSystemTap.md
├── README.md
├── 3_UnderstandingHowSystemTapWorks.md
├── 1_2_SystemTapCapabilities.md
├── 3_6_Tapsets.md
├── 1_1_DocumentationGuide.md
├── 3_4_AssociativeArrays.md
├── 3_1_Architecture.md
├── 4_UserSpaceProbing.md
├── 4_2_AccessingUserSpaceTargetVariables.md
├── 6_2_RuntimeErrorsAndWarnings.md
├── SUMMARY.md
├── 4_1_UserSpaceEvents.md
├── 5_4_IdentifyingContendedUserSpaceLocks.md
├── 2_2_GeneratingInstrumentationForOtherComputers.md
├── 4_3_UserSpaceStackBacktraces.md
├── 6_1_ParseAndSemanticErrors.md
├── 2_1_InstallationAndSetup.md
├── 2_3_RunningSystemTapScripts.md
├── 3_5_ArrayOperationsInSystemTap.md
├── 3_3_BasicSystemTapHandlerConstructs.md
├── 3_2_SystemTapScripts.md
├── 5_1_Network.md
├── 5_3_Profiling.md
├── 5_2_Disk.md
└── LICENSE


/2_UsingSystemTap.md:
--------------------------------------------------------------------------------
1 | # 2. Using SystemTap
2 | 
3 | This chapter explains how to install SystemTap and introduces how to run SystemTap scripts.
4 | 


--------------------------------------------------------------------------------
/6_UnderstandingSystemTapErrors.md:
--------------------------------------------------------------------------------
1 | # 6. Understanding SystemTap Errors
2 | 
3 | This chapter explains the error messages commonly encountered when using SystemTap.
4 | 


--------------------------------------------------------------------------------
/flight_record_mode_in_memory.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/spacewander/SystemTapBeginnersGuide_zh/HEAD/flight_record_mode_in_memory.gif


--------------------------------------------------------------------------------
/1_Introduction.md:
--------------------------------------------------------------------------------
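1 | # 1. Introduction
2 | 
3 | SystemTap is a tracing and profiling tool that lets users study and monitor the activity of the operating system (the Linux kernel in particular) in fine detail. It provides information similar to the output of tools such as `netstat`, `ps`, `top`, and `iostat`, and it offers more filtering and analysis options for the information it collects.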
4 | 


--------------------------------------------------------------------------------
/5_UsefulSystemTapScripts.md:
--------------------------------------------------------------------------------
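1 | # 5. Useful SystemTap Scripts
2 | 
3 | This chapter lists several SystemTap scripts for monitoring and investigating kernel subsystems. If you have installed the `systemtap-testsuite` RPM, all of these examples are available under `/usr/share/systemtap/testsuite/systemtap.examples/`.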
4 | 


--------------------------------------------------------------------------------
/1_3_LimitationsOfSystemTap.md:
--------------------------------------------------------------------------------
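1 | # 1.3. Limitations of SystemTap
2 | 
3 | The current version of SystemTap offers a wide range of options for probing kernel-space events, and these work across many kernel versions. Probing user-space events, however, depends on kernel support (the utrace mechanism), which many kernels lack. As a result, only some kernel versions support user-space probing.
4 | At present, the SystemTap community is focusing on improving SystemTap's user-space probing capabilities.
5 | (Translator's note: this guide was written in 2013. Kernels today generally support user-space probing; older kernels rely on utrace, newer ones on uprobes.)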
6 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
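1 | A Chinese translation of the SystemTap Beginners Guide, translated from https://sourceware.org/systemtap/SystemTap_Beginners_Guide/ .
2 | 
3 | The opening Preface and Chapter 7, "References", were judged to add little and are not translated.
4 | 
5 | [Read the book here](https://www.gitbook.com/book/spacewander/systemtapbeginnersguide_zh/details)
6 | 
7 | If you find errors or omissions, please [report them](https://github.com/spacewander/SystemTapBeginnersGuide_zh/issues).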
8 | 


--------------------------------------------------------------------------------
/3_UnderstandingHowSystemTapWorks.md:
--------------------------------------------------------------------------------
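1 | # 3. Understanding How SystemTap Works
2 | 
3 | SystemTap lets users write and reuse simple scripts to examine the many activities of a running Linux system. With SystemTap scripts you can extract, filter, and summarize data quickly and safely, which makes it far easier to diagnose complex performance or functional problems.
4 | 
5 | The essential idea behind a SystemTap script is to name events of interest and give each of them a handler. While the script runs, SystemTap monitors the named events; when an event occurs, the Linux kernel temporarily switches to the handler and then resumes its previous work.
6 | 
7 | Many kinds of events can be monitored: entering or exiting a function, timer expiration, session termination, and so on. A handler is a series of SystemTap statements that specify the work to do when the event occurs, typically extracting data from the event context, storing it in internal variables, and printing results.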
8 | 


--------------------------------------------------------------------------------
/1_2_SystemTapCapabilities.md:
--------------------------------------------------------------------------------
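1 | # 1.2. SystemTap Capabilities
2 | 
3 | **Flexibility**: SystemTap provides a domain-specific language in which users can write custom scripts to investigate and monitor a wide variety of kernel functions, system calls, and other events that occur in kernel space. In this sense SystemTap is not so much a tool as an ecosystem for building your own kernel forensic and monitoring tools.
4 | 
5 | **Ease of use**: as mentioned earlier, SystemTap spares users the tedious instrument, recompile, install, and reboot cycle when probing kernel-space events.
6 | 
7 | Most of the scripts listed in Chapter 5, [Useful SystemTap Scripts](5_UsefulSystemTapScripts.md), demonstrate forensic and monitoring capabilities that comparable tools (such as `top`, `oprofile`, and `ps`) do not offer. They illustrate the power of SystemTap and should broaden your view of what your own SystemTap scripts can do.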
8 | 


--------------------------------------------------------------------------------
/3_6_Tapsets.md:
--------------------------------------------------------------------------------
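1 | # 3.6. Tapsets
2 | 
3 | Tapsets are built-in scripts containing commonly used probes and functions that you can reuse in SystemTap scripts. When a user runs a SystemTap script, SystemTap checks the script's events and handlers and loads the tapsets they use before translating the script to C. (Recall the SystemTap session start-up sequence described at the beginning of this chapter.)
4 | Like SystemTap scripts, tapsets use the `.stp` file name extension. By default they are located in `/usr/share/systemtap/tapset/`. Unlike SystemTap scripts, tapsets cannot be run directly; they serve only as a library.
5 | The tapset library lets users define events and functions at a higher level of abstraction. Tapsets provide aliases for commonly probed kernel functions, so users do not need to remember the full kernel function names (which can differ between kernel versions). Tapsets also provide commonly used helper functions, such as `thread_indent()`, which we saw earlier.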
6 | 


--------------------------------------------------------------------------------
/1_1_DocumentationGuide.md:
--------------------------------------------------------------------------------
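 1 | # 1.1. Documentation Goals
 2 | 
 3 | SystemTap lets users monitor the current activity of a Linux system for later analysis. This helps administrators and developers track down the root cause of a bug or performance problem.
 4 | 
 5 | Before SystemTap, monitoring a running kernel sometimes meant instrumenting the code, recompiling, installing, and rebooting. SystemTap frees programmers from that tedious sequence: to get the same work done, you simply run a SystemTap script of your own.
 6 | 
 7 | SystemTap, however, was designed for users who already know the kernel well, and it remains hard to pick up for kernel newcomers. To make matters worse, much of the existing SystemTap documentation assumes considerable kernel experience, which makes learning SystemTap a long and difficult road.
 8 | 
 9 | To lower the barrier to entry, we wrote the SystemTap Beginners Guide. It covers the following:
10 | 1. An introduction to SystemTap and its usage, with installation instructions for the various kernel variants.
11 | 2. Sample SystemTap scripts that monitor the detailed activity of different system components, with instructions on how to run them and how to analyze their output.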
12 | 


--------------------------------------------------------------------------------
/3_4_AssociativeArrays.md:
--------------------------------------------------------------------------------
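 1 | # 3.4. Associative Arrays
 2 | 
 3 | SystemTap supports associative arrays. An associative array is like the map, dict, or hash of other programming languages: think of it as a set of unique keys, each with an associated value.
 4 | 
 5 | Associative arrays must be declared as global variables. The syntax for accessing an element is similar to awk: `array_name[index_expression]`.
 6 | Here `array_name` is the name of the array and `index_expression` refers to one unique key in it. For example, to store the ages of tom, dick, and harry in the array `foo`, you could write: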
 7 | ```
 8 | foo["tom"] = 23
 9 | foo["dick"] = 24
10 | foo["harry"] = 25
11 | ```
12 | 
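13 | An array statement can specify up to **nine** index expressions, separated by `,`. This lets a single key carry several pieces of information. In the following line, the key of the array `device` consists of five expressions: the process ID, the executable name, the user ID, the parent process ID, and the string "W". The value `devname` is associated with that key.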
14 | ```
15 | device[pid(),execname(),uid(),ppid(),"W"] = devname
16 | ```
17 | 
18 | > All associative arrays must be declared as global, whether or not they are used in more than one probe.
19 | 
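The multi-expression keys described above can be sketched as a complete script. The snippet below is a minimal illustration, not taken from the guide: the file name `multi_index.stp`, the three-part key, and the five-second timer are all assumptions chosen for the example. It writes the script to the current directory so it can be inspected before running.

```shell
#!/bin/sh
# Write a small SystemTap script that keys an array on three index
# expressions (PID, executable name, UID). The file name is hypothetical.
cat > multi_index.stp <<'EOF'
global reads

probe vfs.read {
  # One counter per distinct (pid, execname, uid) combination.
  reads[pid(), execname(), uid()]++
}

probe timer.s(5) {
  # Iterate in descending order of the stored value, top five entries only.
  foreach ([p, name, u] in reads- limit 5)
    printf("%-16s pid=%d uid=%d reads=%d\n", name, p, u, reads[p, name, u])
  exit()
}
EOF
echo "wrote multi_index.stp"
```

With SystemTap installed, running `stap multi_index.stp` as root should print up to five lines after five seconds and then exit.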


--------------------------------------------------------------------------------
/3_1_Architecture.md:
--------------------------------------------------------------------------------
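 1 | # 3.1. Architecture
 2 | 
 3 | Running a SystemTap script starts a corresponding SystemTap session. The session proceeds roughly as follows:
 4 | 
 5 | First, SystemTap checks the tapsets used by the script against the tapset library (normally `/usr/share/systemtap/tapset/`) to make sure they all exist, and then substitutes each located tapset with its definition from the library. (Translator's note: a tapset is a collection of taps, literally "stethoscopes", meaning predefined SystemTap events or functions. The full tapset list is at https://sourceware.org/systemtap/tapsets/ )
 6 | 
 7 | SystemTap then translates the script to C and runs the system C compiler to build a kernel module from it. The tools for this step are in the `systemtap` package (see Section 2.1, "Installation and Setup").
 8 | 
 9 | SystemTap then loads the module and enables all probes (events and their handlers) in the script. This step is performed by `staprun` from the `systemtap-runtime` package (see Section 2.1, "Installation and Setup").
10 | 
11 | Each time a monitored event occurs, its handler is executed.
12 | 
13 | Once the SystemTap session terminates, the probes are disabled and the kernel module is unloaded.
14 | 
15 | This whole sequence is driven from a single command-line program: `stap`. It is the main front end to SystemTap. For more information about `stap`, see `man stap` (once SystemTap is installed on your machine).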
16 | 


--------------------------------------------------------------------------------
/4_UserSpaceProbing.md:
--------------------------------------------------------------------------------
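 1 | # 4. User-Space Probing
 2 | 
 3 | SystemTap initially focused on kernel-space probing. Because user-space probing helps diagnose problems in many situations, SystemTap has supported probing user-space processes since version 0.6. SystemTap can probe function entry and return in user-space processes, predefined markers in user code, and user-process events.
 4 | 
 5 | User-space probing requires the uprobes module. Linux kernel 3.5 and later already include `uprobes`. To verify that the current kernel supports uprobes natively, run the following command: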
 6 | ```
 7 | grep CONFIG_UPROBES /boot/config-`uname -r`
 8 | ```
 9 | 
10 | If uprobes is built into the current kernel, the command prints:
11 | ```
12 | CONFIG_UPROBES=y
13 | ```
14 | 
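15 | If your kernel is older than 3.5, SystemTap builds the uprobes module automatically. However, user-space event tracing still requires the utrace kernel extensions. More details about utrace are available at http://sourceware.org/systemtap/wiki/utrace . To verify that the current kernel provides the necessary utrace support, enter the following command in a terminal: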
16 | ```
17 | grep CONFIG_UTRACE /boot/config-`uname -r`
18 | ```
19 | 
20 | If the kernel supports user-space probing, the command prints:
21 | ```
22 | CONFIG_UTRACE=y
23 | ```
24 | 
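The two checks above can be combined into one helper. This is a sketch under assumptions: the function name `check_userspace_support` and its output strings are invented for illustration, and the kernel configuration is assumed to live in `/boot/config-<release>`, as in the commands above.

```shell
#!/bin/sh
# check_userspace_support CONFIG_FILE
# Report which user-space probing mechanism a kernel config file advertises.
# The function name and messages are hypothetical.
check_userspace_support() {
  config=$1
  if grep -q '^CONFIG_UPROBES=y' "$config"; then
    echo "uprobes: built in"
  elif grep -q '^CONFIG_UTRACE=y' "$config"; then
    echo "utrace: available"
  else
    echo "no user-space probing support found"
  fi
}

config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
  check_userspace_support "$config"
else
  echo "kernel config not readable: $config"
fi
```

Taking the file as an argument keeps the check usable for a config copied from another machine, for example a cross-instrumentation target.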


--------------------------------------------------------------------------------
/4_2_AccessingUserSpaceTargetVariables.md:
--------------------------------------------------------------------------------
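 1 | # 4.2. Accessing User-Space Target Variables
 2 | 
 3 | You can access user-space target variables with the same syntax used for kernel-space target variables, described under "Target Variables" in Section 3.3. On Linux, user code and kernel code use separate address spaces; when you use the `->` operator, SystemTap accesses the appropriate address space.
 4 | 
 5 | For pointers to base types such as integers and strings, the functions below read user-space data. The first argument of each function is the pointer to the data item (`address`).
 6 | 
 7 | **user_char(address)**
 8 | 
 9 | Obtains the character at `address` for the current user process.
10 | 
11 | **user_short(address)**
12 | 
13 | Obtains the short integer at `address` for the current user process.
14 | 
15 | **user_int(address)**
16 | 
17 | Obtains the int at `address` for the current user process.
18 | 
19 | **user_long(address)**
20 | 
21 | Obtains the long integer at `address` for the current user process.
22 | 
23 | **user_string(address)**
24 | 
25 | Obtains the string at `address` for the current user process.
26 | 
27 | **user_string_n(address, n)**
28 | 
29 | Obtains the string at `address` for the current user process, limited to the first `n` bytes.
30 | 
31 | Translator's note: these functions are used in the handlers of `process(PATH).xxx` events, where the current user process is `PATH`. For example:
32 | ```
33 | probe process(@1).syscall {
34 |     ...
35 |     user_string(field) # field points to an address in the address space of @1
36 | }
37 | ```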
38 | 
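As a concrete instance of the translator's note, the sketch below reads the buffer passed to `write` with `user_string_n`. It rests on several assumptions that are not in the guide: the file name `write_peek.stp` is made up, the target is a 64-bit x86 process (where system call number 1 is `write(fd, buf, count)`), and the data written is printable text.

```shell
#!/bin/sh
# Write a SystemTap script that prints what a process writes.
# Assumption: x86_64, where system call number 1 is write(fd, buf, count).
cat > write_peek.stp <<'EOF'
probe process(@1).syscall {
  # $arg2 is the user-space buffer pointer, $arg3 the byte count.
  if ($syscall == 1)
    printf("%s[%d] write: %s\n", execname(), pid(),
           user_string_n($arg2, $arg3))
}
EOF
echo "wrote write_peek.stp"
```

A possible invocation is `stap -c "ls /" write_peek.stp /bin/ls`, where `/bin/ls` becomes `@1`; on other architectures the system call number would need adjusting.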


--------------------------------------------------------------------------------
/6_2_RuntimeErrorsAndWarnings.md:
--------------------------------------------------------------------------------
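 1 | # 6.2. Runtime Errors and Warnings
 2 | 
 3 | Runtime errors and warnings occur once the SystemTap instrumentation has been installed and is collecting data.
 4 | 
 5 | **WARNING: Number of errors: N, skipped probes: M**
 6 | 
 7 | Errors and/or skipped probes occurred during this run. `N` and `M` count the errors and the probes that were not executed because of conditions such as a handler taking too long to run.
 8 | 
 9 | **division by 0**
10 | 
11 | The script attempted a division by zero.
12 | 
13 | **aggregate element not found**
14 | 
15 | An extractor function other than `@count` was called on an aggregate that has no values yet. This is similar to a division by zero. For more on aggregates, see the "Using aggregates" part of [3.5 Array Operations](3_5_ArrayOperationsInSystemTap.md).
16 | 
17 | **aggregation overflow**
18 | 
19 | An array of aggregates contains too many distinct keys at this time. (Translator's note: this concerns the number of distinct entries stored in the array; it is separate from the limit of nine index expressions per key mentioned in Section 3.4.)
20 | 
21 | **MAXNESTING exceeded**
22 | 
23 | Too many levels of function call nesting were attempted. The default limit is 10 levels. Recompile the script with `-DMAXNESTING=NN` to change it.
24 | 
25 | **MAXACTION exceeded**
26 | 
27 | The handler tried to execute too many statements. By default a probe handler may execute at most 1000 statements. Recompile the script with `-DMAXACTION=NN` or `-DMAXACTION_INTERRUPTIBLE=NN` to change the limit.
28 | 
29 | **kernel/user string copy fault at ADDR**
30 | 
31 | The handler attempted to copy a string from kernel or user space at an invalid address (`ADDR`).
32 | 
33 | **pointer dereference fault**
34 | 
35 | A fault occurred during a pointer dereference, for example while evaluating a target variable.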
36 | 
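The "aggregate element not found" case can usually be avoided by checking `@count` before calling any other extractor. The sketch below shows the pattern; the file name `guard_empty.stp`, the deliberately unmatched program name, and the two-second timer are assumptions made for the illustration.

```shell
#!/bin/sh
# Write a SystemTap script that guards extractor calls on an aggregate
# that may still be empty. The file name is hypothetical.
cat > guard_empty.stp <<'EOF'
global sizes

probe vfs.read {
  # Deliberately unlikely to match, so the aggregate normally stays empty.
  if (execname() == "no_such_program")
    sizes <<< $count
}

probe timer.s(2) {
  # @count is safe on an empty aggregate; @avg is only reached when
  # at least one value has been recorded.
  if (@count(sizes) > 0)
    printf("average request: %d bytes\n", @avg(sizes))
  else
    printf("no samples recorded\n")
  exit()
}
EOF
echo "wrote guard_empty.stp"
```

The statement and nesting limits are a separate matter and are raised on the command line, for example `stap -DMAXACTION=5000 guard_empty.stp`.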


--------------------------------------------------------------------------------
/SUMMARY.md:
--------------------------------------------------------------------------------
 1 | # Summary
 2 | 
 3 | * [Introduction](1_Introduction.md)
 4 |     * [Documentation Goals](1_1_DocumentationGuide.md)
 5 |     * [SystemTap Capabilities](1_2_SystemTapCapabilities.md)
 6 |     * [Limitations of SystemTap](1_3_LimitationsOfSystemTap.md)
 7 | * [Using SystemTap](2_UsingSystemTap.md)
 8 |     * [Installation and Setup](2_1_InstallationAndSetup.md)
 9 |     * [Generating Instrumentation for Other Computers](2_2_GeneratingInstrumentationForOtherComputers.md)
10 |     * [Running SystemTap Scripts](2_3_RunningSystemTapScripts.md)
11 | * [Understanding How SystemTap Works](3_UnderstandingHowSystemTapWorks.md)
12 |     * [Architecture](3_1_Architecture.md)
13 |     * [SystemTap Scripts](3_2_SystemTapScripts.md)
14 |     * [Basic SystemTap Handler Constructs](3_3_BasicSystemTapHandlerConstructs.md)
15 |     * [Associative Arrays](3_4_AssociativeArrays.md)
16 |     * [Array Operations](3_5_ArrayOperationsInSystemTap.md)
17 |     * [Tapsets](3_6_Tapsets.md)
18 | * [User-Space Probing](4_UserSpaceProbing.md)
19 |     * [User-Space Events](4_1_UserSpaceEvents.md)
20 |     * [Accessing User-Space Target Variables](4_2_AccessingUserSpaceTargetVariables.md)
21 |     * [User-Space Stack Backtraces](4_3_UserSpaceStackBacktraces.md)
22 | * [Useful SystemTap Scripts](5_UsefulSystemTapScripts.md)
23 |     * [Network](5_1_Network.md)
24 |     * [Disk](5_2_Disk.md)
25 |     * [Profiling](5_3_Profiling.md)
26 |     * [Identifying Contended User-Space Locks](5_4_IdentifyingContendedUserSpaceLocks.md)
27 | * [Understanding SystemTap Errors](6_UnderstandingSystemTapErrors.md)
28 |     * [Parse and Semantic Errors](6_1_ParseAndSemanticErrors.md)
29 |     * [Runtime Errors and Warnings](6_2_RuntimeErrorsAndWarnings.md)
30 | 


--------------------------------------------------------------------------------
/4_1_UserSpaceEvents.md:
--------------------------------------------------------------------------------
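 1 | # 4.1. User-Space Events
 2 | 
 3 | All user-space event probes begin with `process`. You can restrict a process event to a specific running process by process ID, or to an executable by its path name. SystemTap consults the `PATH` environment variable, so you can use either an absolute path or the name you would type on the command line to run the executable.
 4 | 
 5 | Because SystemTap needs debug information to statically analyze where to place probes, several user-space events require a PID or the path to the executable (both written as `PATH` below). For many `process` events, though, the PID and executable path are optional. Events listed below with `PATH` require a process ID or executable path; for the `process` events listed without it, the PID and path are optional.
 6 | 
 7 | **process("PATH").function("function")**
 8 | 
 9 | Entry to the user-space function `function` of the executable `PATH`. This event is the user-space counterpart of `kernel.function("function")`. It accepts wildcards and the `.return` suffix.
10 | 
11 | **process("PATH").statement("statement")**
12 | 
13 | The earliest instruction in the code for `statement`. This event is the user-space counterpart of `kernel.statement("statement")`.
14 | 
15 | **process("PATH").mark("marker")**
16 | 
17 | A static probe point defined in `PATH`. You can use wildcards to cover several probe points with a single probe. Some static probe points make numbered arguments (`$1`, `$2`, and so on) available.
18 | Some user-space packages, such as Java, provide these static probe points. Most packages that provide them also supply convenient aliases for the raw markers. The following is an example from the x86_64 Java hotspot JVM:
19 | ```
20 | probe hotspot.gc_begin =
21 |   process("/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/lib/amd64/server/libjvm.so").mark("gc__begin")
22 | ```
23 | 
24 | **process.begin**
25 | 
26 | A user-space process is created. You can limit this to a particular process ID or executable path; without such a limit, the creation of any process triggers the event.
27 | 
28 | **process.thread.begin**
29 | 
30 | A user-space thread is created. You can limit this to a particular process ID or executable path.
31 | 
32 | **process.end**
33 | 
34 | A user-space process is destroyed. You can limit this to a particular process ID or executable path.
35 | 
36 | **process.thread.end**
37 | 
38 | A user-space thread is destroyed. You can limit this to a particular process ID or executable path.
39 | 
40 | **process.syscall**
41 | 
42 | A user-space process makes a system call. The system call number is available in the `$syscall` context variable, and the first six arguments in `$arg1` through `$arg6`. Adding the `.return` suffix catches the exit from the system call; in `syscall.return`, the return value is available in the `$return` context variable.
43 | You can limit this to a particular process ID or executable path.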
44 | 
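The process lifetime events above can be tried with a short script. The sketch below is illustrative only: the file name `proc_watch.stp` and the ten-second run time are assumptions. It uses the unrestricted forms of the events, so every process on the system is reported.

```shell
#!/bin/sh
# Write a SystemTap script that logs process creation and exit
# system-wide for ten seconds. The file name is hypothetical.
cat > proc_watch.stp <<'EOF'
probe process.begin {
  printf("%s[%d] started\n", execname(), pid())
}

probe process.end {
  printf("%s[%d] exited\n", execname(), pid())
}

# Stop on its own so the session does not need Ctrl+C.
probe timer.s(10) {
  exit()
}
EOF
echo "wrote proc_watch.stp"
```

To watch a single program instead, the probe points could be narrowed to forms such as `process("/bin/ls").begin`.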


--------------------------------------------------------------------------------
/5_4_IdentifyingContendedUserSpaceLocks.md:
--------------------------------------------------------------------------------
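 1 | # 5.4. Identifying Contended User-Space Locks
 2 | 
 3 | This section shows how to report contention on user-space locks over a given period. A picture of lock contention helps you decide whether a performance problem is caused by contention on a `futex`.
 4 | Put simply, futex contention occurs when several processes try to acquire the same lock at the same time. Because only one process can hold the lock while the others wait for it to become available again, contention degrades performance.
 5 | The `futexes.stp` script below reports lock contention by probing the `futex` system call: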
 6 | 
 7 | futexes.stp
 8 | ```
 9 | #! /usr/bin/env stap
10 | 
11 | # This script tries to identify contended user-space locks by hooking
12 | # into the futex system call.
13 | 
14 | global FUTEX_WAIT = 0 /*, FUTEX_WAKE = 1 */
15 | global FUTEX_PRIVATE_FLAG = 128 /* linux 2.6.22+ */
16 | global FUTEX_CLOCK_REALTIME = 256 /* linux 2.6.29+ */
17 | 
18 | global lock_waits # long-lived stats on (tid,lock) blockage elapsed time
19 | global process_names # long-lived pid-to-execname mapping
20 | 
21 | probe syscall.futex.return {  
22 |   if (($op & ~(FUTEX_PRIVATE_FLAG|FUTEX_CLOCK_REALTIME)) != FUTEX_WAIT) next
23 |   process_names[pid()] = execname()
24 |   elapsed = gettimeofday_us() - @entry(gettimeofday_us())
25 |   lock_waits[pid(), $uaddr] <<< elapsed
26 | }
27 | 
28 | probe end {
29 |   foreach ([pid+, lock] in lock_waits) 
30 |     printf ("%s[%d] lock %p contended %d times, %d avg us\n",
31 |             process_names[pid], pid, lock, @count(lock_waits[pid,lock]),
32 |             @avg(lock_waits[pid,lock]))
33 | }
34 | ```
35 | 
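36 | `futexes.stp` must be stopped manually with Ctrl+C. On exit it prints the following information:
37 | * The name and ID of each process involved in the contention
38 | * The address of the contended lock variable
39 | * The number of times the lock was contended
40 | * The average time spent waiting for the lock
41 | 
42 | The following is an excerpt of the output of `futexes.stp` after running for about 20 seconds: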
43 | ```
44 | [...]
45 | automount[2825] lock 0x00bc7784 contended 18 times, 999931 avg us
46 | synergyc[3686] lock 0x0861e96c contended 192 times, 101991 avg us
47 | synergyc[3758] lock 0x08d98744 contended 192 times, 101990 avg us
48 | synergyc[3938] lock 0x0982a8b4 contended 192 times, 101997 avg us
49 | [...]
50 | ```
51 | 


--------------------------------------------------------------------------------
/2_2_GeneratingInstrumentationForOtherComputers.md:
--------------------------------------------------------------------------------
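 1 | # 2.2. Generating Instrumentation for Other Computers
 2 | 
 3 | When a user runs a SystemTap script, SystemTap builds a kernel module from it and loads the module into the kernel, which lets it extract the requested data directly from the kernel. (See the "SystemTap session" part of Section 3.1, "Architecture".)
 4 | 
 5 | Normally, SystemTap scripts can only run on systems where SystemTap is installed (see Section 2.1, "Installation and Setup"). To run SystemTap on ten systems, you would have to install SystemTap on each of them. In some cases this is neither desirable nor practical. For instance, company policy may forbid administrators from installing RPMs that provide compilers or debug information on certain machines, which rules out installing SystemTap. To work around this, SystemTap provides cross-instrumentation.
 6 | 
 7 | Cross-instrumentation is the process of running a SystemTap script on one computer to generate a SystemTap instrumentation module that can be used on another. It offers the following benefits:
 8 | 
 9 | * The kernel information packages for the various machines need to be installed on a single development machine only.
10 | * Each target machine needs just one RPM to use the generated instrumentation module: `systemtap-runtime`. (Translator's note: so a new package must still be installed.)
11 | 
12 | For clarity, this section uses the following terms:
13 | 
14 | * Instrumentation module: the kernel module built from a SystemTap script. It is built on the host system and loaded on the target kernel of the target system.
15 | * Host system: the system on which SystemTap scripts are compiled into instrumentation modules for the target systems.
16 | * Target system: the system on which the instrumentation module is used.
17 | * Target kernel: the kernel of the target system. This kernel loads and runs the instrumentation module.
18 | 
19 | > **To build a usable instrumentation module, the host system must have the same hardware architecture and Linux distribution as the target system.**
20 | 
21 | Complete the following steps to configure the host system and target systems:
22 | 
23 | 1. Install the `systemtap-runtime` package on each target system.
24 | 2. Determine the kernel version of each target system by running `uname -r`.
25 | 3. Install SystemTap on the host system; this is where you will build the instrumentation modules for the target systems. For installation instructions, see "Installing SystemTap" in Section 2.1, "Installation and Setup". (Translator's note: see the official [Wiki](https://sourceware.org/systemtap/wiki).)
26 | 4. On the host system, install the target kernel and its kernel information packages, as described under "Installing the Required Kernel Information Packages Manually" in Section 2.1. If the target systems run different kernel versions, repeat this step for each version.
27 | 
28 | After these steps, you can build instrumentation modules for all target systems on the host system.
29 | 
30 | To build an instrumentation module, run the following command on the host system (**be sure to substitute real values for `kernel_version` and the other placeholders**):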
31 | 
32 | ```
33 | stap -r kernel_version script -m module_name
34 | ```
35 | 
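36 | Here `kernel_version` is the version of the target kernel (the output of `uname -r` on the target system), `script` is the script to be compiled into an instrumentation module, and `module_name` is the name you give the module.
37 | 
38 | > To determine the hardware architecture of the running kernel, run:
39 | > `uname -m`
40 | 
41 | Once the instrumentation module is built, copy it to the target system and load it there: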
42 | 
43 | ```
44 | staprun module_name.ko
45 | ```
46 | 
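47 | For example, to build the instrumentation module `simple.ko` for a target kernel of version `2.6.18-92.1.10.el5` (on x86_64), use the following command. The script is passed inline with `-e` here; a script file such as `simple.stp` could be named in its place: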
48 | 
49 | ```
50 | stap -r 2.6.18-92.1.10.el5 -e 'probe vfs.read {exit()}' -m simple
51 | ```
52 | 
53 | This creates a module named `simple.ko`. Copy it to the target system and run the following command there:
54 | 
55 | ```
56 | staprun simple.ko
57 | ```
58 | 
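When several target kernels are involved, the build command has to be repeated for each one. The sketch below prints the commands rather than running them, so they can be reviewed first. The helper name `build_cmd`, the script name `simple.stp`, and the module naming scheme are assumptions for illustration; restricting module names to letters, digits, and underscores is a deliberately conservative choice.

```shell
#!/bin/sh
# build_cmd KERNEL_VERSION SCRIPT MODULE_NAME
# Print the stap command that builds MODULE_NAME for KERNEL_VERSION.
# The helper name is hypothetical.
build_cmd() {
  printf 'stap -r %s %s -m %s\n' "$1" "$2" "$3"
}

script=simple.stp
for kver in 2.6.18-92.1.10.el5 2.6.18-53.1.13.el5; do
  # Derive a per-kernel module name using only letters, digits, underscores.
  mod="simple_$(printf '%s' "$kver" | tr -c 'A-Za-z0-9' '_')"
  build_cmd "$kver" "$script" "$mod"
done
```

Each printed line can then be run on the host system, and the resulting `.ko` file copied to the matching target and loaded with `staprun`.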


--------------------------------------------------------------------------------
/4_3_UserSpaceStackBacktraces.md:
--------------------------------------------------------------------------------
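 1 | # 4.3. User-Space Stack Backtraces
 2 | 
 3 | The `pp` (probe point) function returns the name of the event that triggered the current handler, with wildcards and aliases expanded. If the event is associated with a specific function, the output of `pp` includes the name of that function. In many cases, however, the same event can be triggered from different modules of a program, particularly when the function lives in a shared library. Fortunately, SystemTap provides user-space stack backtraces, which show how the event was triggered.
 4 | 
 5 | Compiler optimizations can eliminate stack frame pointers, which confuses user-space stack backtraces, so a backtrace requires compiler-generated debug information. SystemTap's user-space backtrace mechanism uses that information to unwind the stack. This is currently implemented only for 32-bit and 64-bit x86 processors; other architectures are not yet supported. To make the debug information available for unwinding, pass `-d executable` for each executable and `--ldd` for shared libraries.
 6 | 
 7 | For example, you can use the `ubacktrace` (user-space backtrace) function to see how `xmalloc` is called in the `ls` command. With the debuginfo for `ls` installed, the following SystemTap command prints a backtrace each time `xmalloc` is called: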
 8 | ```
 9 | stap -d /bin/ls --ldd \
10 | -e 'probe process("ls").function("xmalloc") {print_usyms(ubacktrace())}' \
11 | -c "ls /"
12 | ```
13 | 
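14 | Translator's note: for the command above to work, you need the debuginfo for coreutils. How to install it depends on your distribution. If you use Ubuntu, as I do, see [this answer on askubuntu](http://askubuntu.com/questions/427318/how-can-i-install-a-debug-build-for-coreutils) and run `sudo apt-get install coreutils-dbgsym`.
15 | 
16 | When run, the command above produces output similar to the following: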
17 | ```
18 | bin dev   lib     media  net         proc   sbin     sys  var
19 | boot    etc   lib64   misc   op_session  profilerc  selinux  tmp
20 | cgroup  home  lost+found  mnt    opt         root   srv  usr
21 |  0x4116c0 : xmalloc+0x0/0x20 [/bin/ls]
22 |  0x4116fc : xmemdup+0x1c/0x40 [/bin/ls]
23 |  0x40e68b : clone_quoting_options+0x3b/0x50 [/bin/ls]
24 |  0x4087e4 : main+0x3b4/0x1900 [/bin/ls]
25 |  0x3fa441ec5d : __libc_start_main+0xfd/0x1d0 [/lib64/libc-2.12.so]
26 |  0x402799 : _start+0x29/0x2c [/bin/ls]
27 |  0x4116c0 : xmalloc+0x0/0x20 [/bin/ls]
28 |  0x4116fc : xmemdup+0x1c/0x40 [/bin/ls]
29 |  0x40e68b : clone_quoting_options+0x3b/0x50 [/bin/ls]
30 |  0x40884a : main+0x41a/0x1900 [/bin/ls]
31 |  0x3fa441ec5d : __libc_start_main+0xfd/0x1d0 [/lib64/libc-2.12.so]
32 |  ...
33 | ```
34 | 
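35 | For more on the functions available for user-space stack backtraces, see the `ucontext-symbols.stp` and `ucontext-unwind.stp` tapsets. Descriptions of the functions in those tapsets are also available in the [SystemTap Tapset Reference Manual](https://sourceware.org/systemtap/tapsets/).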
36 | 


--------------------------------------------------------------------------------
/6_1_ParseAndSemanticErrors.md:
--------------------------------------------------------------------------------
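 1 | # 6.1. Parse and Semantic Errors
 2 | 
 3 | Parse and semantic errors occur while SystemTap parses the script and translates it into C. For example, assigning an invalid value to a variable or array produces a type error.
 4 | 
 5 | **parse error: expected foo, saw bar**
 6 | 
 7 | The script contains a grammatical or typographical error. SystemTap detected an incorrect construct in the script and points to the offending probe.
 8 | For example, the following SystemTap script is invalid because its probes lack handlers: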
 9 | ```
10 | probe vfs.read
11 | probe vfs.write
12 | ```
13 | 
14 | Running this script produces the following error message, which reports that the parser did not expect the `probe` keyword at line 2, column 1.
15 | ```
16 | parse error: expected one of '. , ( ? ! { = +='
17 |     saw: keyword at perror.stp:2:1
18 | 1 parse error(s).
19 | ```
20 | 
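21 | **parse error: embedded code in unprivileged script**
22 | 
23 | The script contains unsafe embedded C code. SystemTap lets you embed C code in `%{...%}` blocks, which is useful when writing tapsets. Doing so is unsafe, however, and SystemTap reports this error when it finds such code in a script.
24 | 
25 | If you are sure the code is safe and you are a member of the `stapdev` group (or have root privileges), run the script in "guru" mode with the `-g` option to suppress this error (`stap -g script`).
26 | 
27 | **semantic error: type mismatch for identifier 'foo' ... string vs. long**
28 | 
29 | The function `foo` in the script used the wrong type (for example `%s` or `%d`). In the following example the format specifier should be `%s`, not `%d`, because `execname()` returns a string: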
30 | ```
31 | probe syscall.open
32 | {
33 |   printf ("%d(%d) open\n", execname(), pid())
34 | }
35 | ```
36 | 
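37 | **semantic error: unresolved type for identifier 'foo'**
38 | 
39 | An identifier (variable) was used, but its type (number or string) could not be determined. This happens, for example, if you use a variable in a `printf` statement that is never assigned a value.
40 | 
41 | **semantic error: Expecting symbol or array index expression**
42 | 
43 | SystemTap could not perform an assignment because the target of the assignment is not valid. The following example triggers this error: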
44 | ```
45 | probe begin { printf("x") = 1 }
46 | ```
47 | 
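48 | **while searching for arity N function, semantic error: unresolved function call**
49 | 
50 | A function call or array index expression in the script used an invalid number of arguments. In SystemTap, *arity* refers either to the number of indices of an array or to the number of parameters of a function.
51 | 
52 | **semantic error: array locals not supported, missing global declaration?**
53 | 
54 | An array was used in the script without being declared as a global variable (in a SystemTap script, a global may be declared after the point of use). A similar message appears if an array is used with an inconsistent number of indices. (A SystemTap array can be indexed by a tuple of values, not just a single subscript.)
55 | 
56 | **semantic error: variable 'foo' modified during 'foreach' iteration**
57 | 
58 | The array `foo` was modified (assigned to or deleted from) while being iterated by a `foreach` loop. This error also appears if a function is called on `foo` within the active `foreach` loop.
59 | 
60 | **semantic error: probe point mismatch at position N, while resolving probe point foo**
61 | 
62 | SystemTap could not find a definition for the event or SystemTap function `foo`. This usually means that nothing in the tapset library matches `foo`. `N` refers to the line and column of the error.
63 | 
64 | **semantic error: no match for probe point, while resolving probe point foo**
65 | 
66 | SystemTap could not resolve the event or handler function `foo` for some reason. For example, the script contains the event `kernel.function("something")` and `something` does not exist. In some cases the error also means that the script refers to a kernel file name or source line number that does not exist.
67 | 
68 | **semantic error: unresolved target-symbol expression**
69 | 
70 | A handler in the script refers to a target variable that cannot be resolved. The error means that the target variable does not exist in the context of the handler, possibly because the compiler optimized it away.
71 | 
72 | **semantic error: libdwfl failure**
73 | 
74 | A problem occurred while processing the debug information. In most cases the installed `kernel-debuginfo` package does not exactly match the kernel being probed. The installed `kernel-debuginfo` package may also have consistency or correctness problems of its own.
75 | 
76 | **semantic error: cannot find foo debuginfo**
77 | 
78 | SystemTap could not find a suitable `kernel-debuginfo` package.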
79 | 
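For comparison with the failing snippets above, the sketch below collects corrected versions in one script: each probe has a handler, and the format specifiers match the argument types. The file name `fixed.stp` and the handler bodies are assumptions made for the example.

```shell
#!/bin/sh
# Write corrected counterparts of the faulty examples in this section.
# The file name is hypothetical.
cat > fixed.stp <<'EOF'
# Every probe needs a handler block, even a trivial one.
probe vfs.read {
  printf("read by %s\n", execname())
  exit()
}

# execname() returns a string, so it is printed with %s; pid() is a
# number, so it is printed with %d.
probe syscall.open {
  printf("%s(%d) open\n", execname(), pid())
}
EOF
echo "wrote fixed.stp"
```

With SystemTap installed, `stap -p2 fixed.stp` stops after the semantic analysis pass, which is enough to confirm that the parse and type errors are gone without loading a module.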


--------------------------------------------------------------------------------
/2_1_InstallationAndSetup.md:
--------------------------------------------------------------------------------
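  1 | # 2.1. Installation and Setup
  2 | 
  3 | To use SystemTap, install the `-devel`, `-debuginfo`, and `-debuginfo-common-arch` packages that match the target kernel version. To run SystemTap on more than one kernel, install the `-devel` and `-debuginfo` packages for each of those kernel versions.
  4 | 
  5 | The following subsections describe the procedure in detail. (Translator's note: the SystemTap [wiki](https://sourceware.org/systemtap/wiki/) has installation steps for the various Linux distributions. This section applies to RHEL only and may be out of date; consider skipping it and following the official documentation. If you use Ubuntu, see the Ubuntu [wiki](https://wiki.ubuntu.com/Kernel/Systemtap).)
  6 | 
  7 | > Many users confuse `-debuginfo` with `-debug`. SystemTap needs the `-debuginfo` package for the kernel, not the `-debug` package.
  8 | 
  9 | ## Installing SystemTap
 10 | 
 11 | To use SystemTap, install the following RPMs:
 12 | 
 13 | * systemtap
 14 | * systemtap-runtime
 15 | 
 16 | Run the following command as root: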
 17 | 
 18 | ```
 19 | yum install systemtap systemtap-runtime
 20 | ```
 21 | 
 22 | Note that SystemTap also needs the kernel debug information packages it depends on. On recent systems it is enough to run the following command as root:
 23 | 
 24 | ```
 25 | stap-prep
 26 | ```
 27 | 
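 28 | If this command does not work, install the packages manually as described below.
 29 | 
 30 | ## Installing the Required Kernel Information Packages Manually
 31 | 
 32 | SystemTap needs information about the kernel in order to place instrumentation in it. This information also helps SystemTap generate the appropriate instrumentation code.
 33 | 
 34 | The necessary kernel information is contained in the `-devel`, `-debuginfo`, and `-debuginfo-common` packages for the specific kernel version. For the "standard" kernel (one compiled with the regular configuration), the required packages are named: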
 35 | 
 36 | ```
 37 | kernel-debuginfo
 38 | kernel-debuginfo-common
 39 | kernel-devel
 40 | ```
 41 | 
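 42 | Likewise, the packages required for a PAE kernel are `kernel-PAE-debuginfo`, `kernel-PAE-debuginfo-common`, and `kernel-PAE-devel`. (Translator's note: PAE stands for Physical Address Extension, which 32-bit Linux can use to extend the addressable memory.)
 43 | 
 44 | To determine the kernel version of the running system, enter: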
 45 | ```
 46 | uname -r
 47 | ```
 48 | 
 49 | For example, to use SystemTap on kernel 2.6.18-53.1.13.el5 on an i686 machine, download and install the following RPMs:
 50 | 
 51 | ```
 52 | kernel-debuginfo-2.6.18-53.1.13.el5.i686.rpm
 53 | kernel-debuginfo-common-2.6.18-53.1.13.el5.i686.rpm
 54 | kernel-devel-2.6.18-53.1.13.el5.i686.rpm
 55 | ```
 56 | 
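 57 | > The version, variant, and architecture of the `-devel`, `-debuginfo`, and `-debuginfo-common` packages you install must match the target kernel exactly.
 58 | 
 59 | The easiest way to install the required kernel information packages is with the `yum install` and `debuginfo-install` commands. `debuginfo-install` ships in the `yum-utils` package, version 1.1.10 or later, and also requires a yum repository from which the `-debuginfo` and `-debuginfo-common` packages can be downloaded and installed.
 60 | Once the repositories are configured, the following commands install the packages for a specific kernel: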
 61 | 
 62 | ```
 63 | yum install kernelname-devel-version
 64 | debuginfo-install kernelname-version
 65 | ```
 66 | 
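 67 | Replace `kernelname` with the appropriate kernel variant name (for example, `kernel-PAE`) and `version` with the version of the target kernel. For example, to install the kernel information packages for the `kernel-PAE-2.6.18-53.1.13.el5` kernel, run: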
 68 | 
 69 | ```
 70 | yum install kernel-PAE-devel-2.6.18-53.1.13.el5
 71 | debuginfo-install kernel-PAE-2.6.18-53.1.13.el5
 72 | ```
 73 | 
 74 | If you downloaded the required packages manually instead, run the following command as root to install them:
 75 | 
 76 | ```
 77 | rpm --force -ivh package_names
 78 | ```
 79 | 
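 80 | ## Verifying the Installation
 81 | 
 82 | If the kernel you are running is the target kernel, you can test the installation right away. If not, reboot the system and load the target kernel first.
 83 | 
 84 | Run the following command to start the test: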
 85 | 
 86 | ```
 87 | stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'
 88 | ```
 89 | 
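 90 | This command tells SystemTap to print `read performed` and then exit as soon as a virtual file system read occurs. If SystemTap was installed successfully, the output resembles the following: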
 91 | 
 92 | ```
 93 | Pass 1: parsed user script and 45 library script(s) in 340usr/0sys/358real ms.
 94 | Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) in 290usr/260sys/568real ms.
 95 | Pass 3: translated to C into "/tmp/stapiArgLX/stap_e5886fa50499994e6a87aacdc43cd392_399.c" in 490usr/430sys/938real ms.
 96 | Pass 4: compiled C into "stap_e5886fa50499994e6a87aacdc43cd392_399.ko" in 3310usr/430sys/3714real ms.
 97 | Pass 5: starting run.
 98 | read performed
 99 | Pass 5: run completed in 10usr/40sys/73real ms.
100 | ```
101 | 
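102 | The last three lines, starting with `Pass 5`, show that SystemTap successfully inserted and ran the kernel instrumentation, caught the event being probed (here, a virtual file system read), and executed a valid handler (printing "read performed" and exiting cleanly).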
103 | 
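The package naming rules above can be captured in a small helper that lists what a given kernel needs. This is a sketch only: the function name `required_pkgs` is invented for the example, and it simply follows the `kernelname-suffix-version` pattern used in this section without checking whether the packages exist in any repository.

```shell
#!/bin/sh
# required_pkgs KERNEL_VERSION [KERNEL_NAME]
# Print the kernel information packages SystemTap needs for one kernel.
# KERNEL_NAME defaults to "kernel"; use e.g. "kernel-PAE" for a PAE kernel.
# The function name is hypothetical.
required_pkgs() {
  kver=$1
  flavor=${2:-kernel}
  for suffix in devel debuginfo debuginfo-common; do
    echo "${flavor}-${suffix}-${kver}"
  done
}

# Packages for the kernel that is running right now.
required_pkgs "$(uname -r)"
```

For instance, `required_pkgs 2.6.18-53.1.13.el5 kernel-PAE` lists the three PAE packages named earlier; on an RPM-based system each name could then be checked with `rpm -q`.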


--------------------------------------------------------------------------------
/2_3_RunningSystemTapScripts.md:
--------------------------------------------------------------------------------
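  1 | # 2.3. Running SystemTap Scripts
  2 | 
  3 | SystemTap ships with several command-line tools for monitoring system activity. The `stap` command reads probing instructions from a SystemTap script, translates them to C, builds a kernel module, and loads it into the running Linux kernel. The `staprun` command runs SystemTap instrumentation, that is, kernel modules built from SystemTap scripts, for example during cross-instrumentation.
  4 | 
  5 | Running `stap` and `staprun` requires elevated privileges. Because not every SystemTap user can be granted root access, you can allow non-root users to run SystemTap by adding their accounts to one of the following groups:
  6 | 
  7 | 1. **stapdev**
  8 | Members of this group can use `stap` to run SystemTap scripts, or `staprun` to run SystemTap instrumentation modules.
  9 | Running `stap` involves compiling a SystemTap script into a kernel module and loading it into the kernel, which requires elevated privileges; these are granted to `stapdev` members. Unfortunately, this also means that `stapdev` members can do much of what root can do. Grant `stapdev` membership only to users who could be trusted with root access.
 10 | 
 11 | 2. **stapusr**
 12 | Members of this group can only use `staprun` to run SystemTap instrumentation modules, and only modules located in `/lib/modules/kernel_version/systemtap/`. This directory must be owned by root and writable only by root.
 13 | 
 14 | The `stap` command reads a SystemTap script from a file or from standard input. To have `stap` read the script from a file, give the file name on the command line: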
 15 | 
 16 | ```
 17 | stap file_name
 18 | ```
 19 | 
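 20 | To have `stap` read the SystemTap script from standard input, use `-` in place of the file name. Any command-line options must come before the `-`. For example, to make `stap` more verbose, enter: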
 21 | 
 22 | ```
 23 | echo "probe timer.s(1) {exit()}" | stap -v -
 24 | ```
 25 | 
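 26 | Commonly used `stap` options are listed below:
 27 | 
 28 | **-v**
 29 | Makes the output of the SystemTap session more verbose. Repeat the option to get more detail on the script's execution, for example: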
 30 | ```
 31 | stap -vvv script.stp
 32 | ```
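 33 | This option is useful when a script fails at run time and you want more detailed output. For more about SystemTap error messages, see Chapter 6, "Understanding SystemTap Errors".
 34 | 
 35 | **-o file_name**
 36 | Sends standard output to `file_name`.
 37 | 
 38 | **-S size[,count]**
 39 | Limits each output file to `size` megabytes and the number of files kept to `count`. This option implements logrotate-style operation; each output file gets a sequence number as its suffix. (Translator's note: logrotate splits logs into files named xxx.1, xxx.2, xxx.3, and so on. When a log file reaches the maximum size, a new one is started; when the number of files reaches the maximum, the oldest is deleted.)
 40 | 
 41 | **-x process_id**
 42 | Sets the SystemTap handler function `target()` to the specified PID. For more on `target()`, see the [SystemTap function list](http://linux.die.net/man/5/stapfuncs).
 43 | 
 44 | **-c 'command'**
 45 | Runs `command` and exits when `command` finishes. This option also sets `target()` to the PID of the `command` process.
 46 | 
 47 | **-e 'code'**
 48 | Runs the given `code` directly as the script. (Translator's note: for example `stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}'`)
 49 | 
 50 | **-F**
 51 | Enters SystemTap's flight recorder mode and runs the script in the background. For more on flight recorder mode, see "Flight Recorder Mode" below.
 52 | 
 53 | For more about `stap`, see the `stap(1)` man page. For more about `staprun` and cross-instrumentation, see Section 2.2, "Generating Instrumentation for Other Computers", or the `staprun(8)` man page.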
 54 | 
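 55 | ## Flight Recorder Mode
 56 | 
 57 | SystemTap's flight recorder mode lets you run a SystemTap script for a long time while focusing on the most recent output. Flight recorder mode limits the amount of output generated.
 58 | 
 59 | There are two variants of flight recorder mode: in-memory and file. In both, the SystemTap script runs as a background process.
 60 | 
 61 | ### In-Memory Flight Recorder
 62 | 
 63 | When flight recorder mode (`-F`) is used without an output file (`-o`), SystemTap stores the script's output in a buffer in kernel memory. Once the instrumentation module is loaded and the probes are running, the instrumentation detaches and runs in the background. When an event of interest occurs, you can reattach to the instrumentation to see the recent output in the memory buffer and any subsequent output.
 64 | 
 65 | To run SystemTap in in-memory flight recorder mode, run `stap` with the `-F` option: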
 66 | 
 67 | ```
 68 | stap -F iotime.stp
 69 | ```
 70 | 
 71 | Once the script starts, `stap` prints a message like the following, telling you how to reconnect to the running script:
 72 | 
 73 | ```
 74 | Disconnecting from systemtap module.
 75 | To reconnect, type "staprun -A stap_5dd0073edcb1f13f7565d8c343063e68_19556"
 76 | ```
 77 | 
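 78 | When an event of interest occurs, run that command to reattach to the running script, print the recent data in the memory buffer, and receive the output that follows: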
 79 | 
 80 | ```
 81 | staprun -A stap_5dd0073edcb1f13f7565d8c343063e68_19556
 82 | ```
 83 | 
 84 | By default the buffer is 1MB. Use the `-s` option to change it (the size is in MB and is rounded to a power of 2). For example, `-s2` sets the buffer size to 2MB.
 85 | 
 86 | ![flight_record_mode_in_memory](./flight_record_mode_in_memory.gif)
 87 | 
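 88 | ### File Flight Recorder
 89 | 
 90 | Flight recorder mode can also store its output in files. Use the `-o` option to name the file, and the `-S` option to control the size and number of the output files.
 91 | 
 92 | The following command starts SystemTap in file flight recorder mode, writing to `/tmp/pfaults.log.[0-9]+`, with each file no larger than 1MB and only the latest two files kept: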
 93 | 
 94 | ```
 95 | stap -F -o /tmp/pfaults.log -S 1,2  pfaults.stp
 96 | ```
 97 | 
 98 | This command prints the process ID to standard output (7590 in this example). After a while, send the process a `SIGTERM` to stop it:
 99 | 
100 | ```
101 | kill -s SIGTERM 7590
102 | ```
103 | 
104 | In this example only the two most recent files are kept; SystemTap removed the older ones. Verify this with `ls -sh /tmp/pfaults.log.*`:
105 | 
106 | ```
107 | 1020K /tmp/pfaults.log.5    44K /tmp/pfaults.log.6
108 | ```
109 | 
110 | To see the latest data, read the file with the highest sequence number, here `/tmp/pfaults.log.6`.
111 | 
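Finding the file with the highest sequence number can be scripted. A plain `ls` sorts names as text, which would place `.10` before `.6`, so the sketch below sorts on the numeric suffix instead. The helper name `latest_log` is an assumption for illustration; the `/tmp/pfaults.log` prefix comes from the example above.

```shell
#!/bin/sh
# latest_log PREFIX
# Print the file PREFIX.N with the largest numeric suffix N, if any.
# The function name is hypothetical.
latest_log() {
  for f in "$1".*; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$f" ] || continue
    # Emit "suffix path" so sort -n can order by the sequence number.
    printf '%s %s\n' "${f##*.}" "$f"
  done | sort -n | tail -n 1 | cut -d' ' -f2-
}

latest_log /tmp/pfaults.log
```

The result can be passed straight to a pager or `tail`, for example `tail -n 20 "$(latest_log /tmp/pfaults.log)"`.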


--------------------------------------------------------------------------------
/3_5_ArrayOperationsInSystemTap.md:
--------------------------------------------------------------------------------
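  1 | # 3.5. Array Operations
  2 | 
  3 | This section lists some of the most commonly used array operations in SystemTap.
  4 | 
  5 | ## Assigning an Associated Value
  6 | 
  7 | Use `=` to set the value associated with a given key, as in: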
  8 | ```
  9 | foo[tid()] = gettimeofday_s()
 10 | ```
 11 | 
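 12 | SystemTap uses the result of `tid()` as the key and assigns the result of `gettimeofday_s()` to it. If the key already exists in `foo`, the value previously associated with it is overwritten.
 13 | 
 14 | ## Reading Values from Arrays
 15 | 
 16 | Use `array_name[index_expression]` to read the value associated with a key. For example: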
 17 | ```
 18 | delta = gettimeofday_s() - foo[tid()]
 19 | ```
 20 | 
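 21 | > If the array has no key matching `index_expression`, the read returns 0 (in numeric operations) or the empty string (in string operations) by default.
 22 | 
 23 | ## Incrementing Associated Values
 24 | 
 25 | Use `++` to increment the value associated with a key: `array_name[index_expression] ++`. In the following example, each `vfs.read` event adds one to the value associated with the current process name: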
 26 | ```
 27 | probe vfs.read
 28 | {
 29 |   reads[execname()] ++
 30 | }
 31 | ```
 32 | 
 33 | You can test whether an array contains a given key with `if (index_expression in array_name)`.
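
As a minimal sketch of the membership test, the script below times `read` system calls per thread and only computes a duration when an entry timestamp was actually recorded. The array name `start` and the choice of `syscall.read` are illustrative assumptions, not part of the original guide.

```
global start

probe syscall.read
{
  # Remember when this thread entered read()
  start[tid()] = gettimeofday_us()
}

probe syscall.read.return
{
  # Without this test, a missing key would read as 0 and give a bogus duration
  if (tid() in start) {
    printf("%s(%d) read took %d us\n", execname(), tid(),
           gettimeofday_us() - start[tid()])
    delete start[tid()]
  }
}
```

The test matters for threads that were already inside `read` when the script started: their return fires without a matching entry.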
 34 | 
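 35 | ## Iterating over array elements
 36 | 
 37 | Once you have collected enough information in an array, you usually need to iterate over it. In the example above, after counting the reads of each process, you would want to walk the array and print the result for every process. How do you do that?
 38 | 
 39 | The best way is the `foreach` statement. Consider this example: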
 40 | ```
 41 | global reads
 42 | probe vfs.read
 43 | {
 44 |   reads[execname()] ++
 45 | }
 46 | 
 47 | probe timer.s(3)
 48 | {
 49 |   foreach (count in reads)
 50 |     printf("%s : %d \n", count, reads[count])
 51 | }
 52 | ```
 53 | 
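 54 | In the `foreach` statement of the second probe, `count` refers to each key of `reads` in turn, so `reads[count]` reads the value associated with that key.
 55 | 
 56 | This `foreach` statement visits every element of `reads`. What if you don't want to walk the whole array, or want a particular order? Append `+` to the array name to iterate in ascending order of value, or `-` for descending order. You can also add `limit` followed by a number to cap the number of iterations.
 57 | Here is an example similar to the previous probe: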
 58 | ```
 59 | probe timer.s(3)
 60 | {
 61 |   foreach (count in reads- limit 10)
 62 |     printf("%s : %d \n", count, reads[count])
 63 | }
 64 | ```
 65 | 
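 66 | The `foreach` statement above iterates over the array in descending order of the associated values. `limit 10` makes `foreach` iterate only 10 times (that is, it prints the 10 highest values).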
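
The `+` and `-` suffixes can also be attached to the key variable instead of the array name, in which case the iteration is ordered by key. The following sketch reuses the `reads` array from above and prints the first five process names in alphabetical order; the variable name `name` and the limit of 5 are illustrative.

```
global reads

probe vfs.read
{
  reads[execname()] ++
}

probe timer.s(3)
{
  # "name+" sorts by key (the process name), ascending
  foreach (name+ in reads limit 5)
    printf("%s : %d\n", name, reads[name])
}
```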
 67 | 
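 68 | ## Clearing an array or an array element
 69 | 
 70 | Sometimes you need to delete an element of an array, or clear the whole array so that it can be reused in another probe. The `vfs.read` example above reports every three seconds how many reads each process has issued, but the counts keep accumulating. To report the counts for each three-second interval instead, clear the array every three seconds. The `delete` operator removes a single element or the entire array. Consider the following example: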
 71 | ```
 72 | global reads
 73 | probe vfs.read
 74 | {
 75 |   reads[execname()] ++
 76 | }
 77 | 
 78 | probe timer.s(3)
 79 | {
 80 |   foreach (count in reads)
 81 |     printf("%s : %d \n", count, reads[count])
 82 |   delete reads
 83 | }
 84 | ```
 85 | 
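 86 | In the example above, the second probe prints only the number of reads each process made within the last three seconds. The `delete` statement here clears the whole `reads` array.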
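
`delete` can also remove a single element. The sketch below drops the entry for `stapio` (SystemTap's own I/O helper process) before printing, on the assumption that you do not want the tool's own reads in the report; any other key would work the same way.

```
global reads

probe vfs.read
{
  reads[execname()] ++
}

probe timer.s(3)
{
  # Remove one key only; the rest of the array is untouched
  delete reads["stapio"]

  foreach (name in reads-)
    printf("%s : %d\n", name, reads[name])

  # Clear everything for the next interval
  delete reads
}
```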
 87 | 
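 88 | ## Using aggregates
 89 | 
 90 | When you need to process new values quickly and the volume of data is large, consider using aggregates, which process data as a stream. An aggregate can be a global variable or a value in an array. Use the `<<<` operator to add new data to an aggregate.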
 91 | 
 92 | ```
 93 | global reads
 94 | probe vfs.read
 95 | {
 96 |   reads[execname()] <<< $count
 97 | }
 98 | ```
 99 | 
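100 | In the example above, `$count` is the `count` argument of `vfs_read`, that is, the number of bytes requested by the call. `<<<` stores the value of `$count` in the aggregate associated with `execname()` in the `reads` array. Note that the value is stored in the aggregate: it is neither added to the previous value nor does it overwrite it. It is as if every key of `reads` had multiple associated values, with each firing of the probe adding one more.
101 | 
102 | To extract summary results from an aggregate, use the syntax `@extractor(variable/array index expression)`. `extractor` can be any of the following functions:
103 | 
104 | **count**
105 | 
106 | Returns the number of values stored in `variable/array index expression`. In the example above, `@count(reads[execname()])` returns how many values are stored in that process's aggregate, which is the number of reads it made.
107 | 
108 | **sum**
109 | 
110 | Returns the sum of the values stored in `variable/array index expression`. In the example above, `@sum(reads[execname()])` returns the total number of bytes requested by that process.
111 | 
112 | **min**
113 | 
114 | Returns the smallest of the values stored in `variable/array index expression`.
115 | 
116 | **max**
117 | 
118 | Returns the largest of the values stored in `variable/array index expression`.
119 | 
120 | **avg**
121 | 
122 | Returns the average of the values stored in `variable/array index expression`.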
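
To tie the extractors together, here is a hedged sketch that feeds `$count` into a per-process aggregate and prints all five summaries every three seconds. It assumes kernel debuginfo is installed so that `$count` resolves; the output format is illustrative.

```
global reads

probe vfs.read
{
  # One sample per call: the number of bytes requested
  reads[execname()] <<< $count
}

probe timer.s(3)
{
  foreach (name in reads)
    printf("%s: n=%d sum=%d min=%d max=%d avg=%d\n", name,
           @count(reads[name]), @sum(reads[name]),
           @min(reads[name]), @max(reads[name]), @avg(reads[name]))
  delete reads
}
```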
123 | 
124 | An array of aggregates can be indexed with multiple index expressions (up to 9). This lets you attach more context to each entry in the array. For example:
125 | ```
126 | global reads
127 | probe vfs.read
128 | {
129 |   reads[execname(),pid()] <<< 1
130 | }
131 | 
132 | probe timer.s(3)
133 | {
134 |   foreach([var1,var2] in reads)
135 |     printf("%s (%d) : %d \n", var1, var2, @count(reads[var1,var2]))
136 | }
137 | ```
138 | 
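139 | In the example above, the first probe records the number of `vfs.read` calls for each process. Unlike the earlier examples, this array is indexed by both the process name and the PID.
140 | 
141 | In the second probe, the `foreach` statement walks the array and prints the data for each process. Note that `var1` and `var2` refer to the process name and the PID respectively.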
142 | 


--------------------------------------------------------------------------------
/3_3_BasicSystemTapHandlerConstructs.md:
--------------------------------------------------------------------------------
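  1 | # 3.3. Basic handler constructs
  2 | 
  3 | SystemTap supports several basic constructs in handlers. Their syntax is mostly similar to C or awk. Knowing the most common ones helps you write clearer SystemTap scripts.
  4 | 
  5 | ## Variables
  6 | 
  7 | Handlers can of course use variables: pick a good name, assign it the value of a function or expression, and then use it. SystemTap infers the type of a variable automatically. For example, if you assign `gettimeofday_s()` to the variable `foo`, then `foo` is numeric and can be printed with `%d` in `printf()`.
  8 | By default a variable is only available within the probe that defines it, so its lifetime is a single run of the handler. You can also declare variables outside the probes with `global`, which lets probes share them.
  9 | 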
 10 | ```
 11 | global count_jiffies, count_ms
 12 | probe timer.jiffies(100) { count_jiffies ++ }
 13 | probe timer.ms(100) { count_ms ++ }
 14 | probe timer.ms(12345)
 15 | {
 16 |   hz=(1000*count_jiffies) / count_ms
 17 |   printf ("jiffies:ms ratio %d:%d => CONFIG_HZ=%d\n",
 18 |     count_jiffies, count_ms, hz)
 19 |   exit ()
 20 | }
 21 | ```
 22 | 
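 23 | In the example above, `timer-jiffies.stp` works out the kernel's `CONFIG_HZ` setting by counting jiffies and milliseconds. The `global` statement makes `count_jiffies` and `count_ms` available in every probe.
 24 | 
 25 | > The example above uses `++` to increment a variable by one. In the following probe, `count_jiffies` is incremented by 1 every 100 jiffies: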
 26 | > ```
 27 | > probe timer.jiffies(100) { count_jiffies ++ }
 28 | > ```
 29 | > SystemTap knows that `count_jiffies` is an integer. Because `count_jiffies` is never given an initial value, it starts at zero by default.
 30 | 
 31 | 
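 32 | ## Target variables
 33 | 
 34 | Events tied to kernel code, such as `kernel.function("function")` and `kernel.statement("statement")`, let you use target variables to read the values of variables visible at that point in the code. Use the `-L` option to list the target variables available at a probe point. If kernel debuginfo is installed, the following command lists the target variables available in `vfs_read`: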
 35 | ```
 36 | stap -L 'kernel.function("vfs_read")'
 37 | ```
 38 | 
 39 | It produces output similar to this:
 40 | ```
 41 | kernel.function("vfs_read@fs/read_write.c:277") $file:struct file* $buf:char* $count:size_t $pos:loff_t*
 42 | ```
 43 | 
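 44 | Each target variable starts with `$` and ends with `:` followed by its type. The output above shows that four variables are available at the entry of `vfs_read`: `$file` (a pointer to the structure describing the file), `$buf` (a pointer to the user-space buffer that receives the data read), `$count` (the number of bytes to read), and `$pos` (the position at which the read starts).
 45 | Variables that are not local, such as global variables or static variables defined in a file, can be accessed with `@var("varname@src/file.c")`.
 46 | SystemTap keeps the type information of target variables and lets you access their members with `->`. Unlike in C, `->` is used both to follow a pointer and to access a member of an embedded structure. `->` can be chained when digging through complex structures. For example, the static target variable `files_stat` in `fs/file_table.c` holds some of the tunable parameters of the file system. To read one of its fields, write: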
 47 | 
 48 | ```
 49 | stap -e 'probe kernel.function("vfs_read") {
 50 |            printf ("current files_stat max_files: %d\n",
 51 |                    @var("files_stat@fs/file_table.c")->max_files);
 52 |            exit(); }'
 53 | ```
 54 | 
 55 | The output looks similar to this:
 56 | ```
 57 | current files_stat max_files: 386070
 58 | ```
 59 | 
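 60 | Several functions take a pointer to a basic type and read the data at that kernel-space address; they are listed below. Section 4.2 covers the corresponding functions for user-space data.
 61 | 
 62 | **kernel_char(address)**
 63 | 
 64 | Reads a char from a kernel-space address.
 65 | 
 66 | **kernel_short(address)**
 67 | 
 68 | Reads a short from a kernel-space address.
 69 | 
 70 | **kernel_int(address)**
 71 | 
 72 | Reads an int from a kernel-space address.
 73 | 
 74 | **kernel_long(address)**
 75 | 
 76 | Reads a long from a kernel-space address.
 77 | 
 78 | **kernel_string(address)**
 79 | 
 80 | Reads a string from a kernel-space address.
 81 | 
 82 | **kernel_string_n(address, n)**
 83 | 
 84 | Reads a string of length n from a kernel-space address.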
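
A hedged sketch of how these helpers combine with target variables: the probe below prints the name of the file being read by following `$file` to its dentry and converting the name pointer with `kernel_string()`. The member path `f_path->dentry->d_name->name` is an assumption about the running kernel's `struct file` layout, and the script needs kernel debuginfo.

```
probe kernel.function("vfs_read")
{
  # "->" follows pointers and embedded structs alike, so the chain can mix both
  fname = kernel_string($file->f_path->dentry->d_name->name)
  printf("%s(%d) requested %d bytes from %s\n",
         execname(), pid(), $count, fname)
}
```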
 85 | 
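 86 | ### Pretty printing target variables
 87 | 
 88 | In some situations you may want to print all the variables currently in scope, for example to log low-level changes. SystemTap provides several operations that generate strings describing particular target variables:
 89 | 
 90 | **$$vars**
 91 | 
 92 | Expands to the value of every variable in scope. Equivalent to `sprintf("parm1=%x ... parmN=%x var1=%x ... varN=%x", parm1, ..., parmN, var1, ..., varN)`. A variable whose value cannot be found at run time is printed as `=?`.
 93 | 
 94 | **$$locals**
 95 | 
 96 | Like `$$vars`, but only local variables are included.
 97 | 
 98 | **$$parms**
 99 | 
100 | Like `$$vars`, but only function parameters are included.
101 | 
102 | **$$return**
103 | 
104 | Available only in `return` probes. If the probed function returns a value, it is equivalent to `sprintf("return=%x", $return)`; otherwise it is an empty string.
105 | 
106 | The following example prints the parameters of `vfs_read`: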
107 | ```
108 | stap -e 'probe kernel.function("vfs_read") {printf("%s\n", $$parms); exit(); }'
109 | ```
110 | 
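111 | `vfs_read` takes four parameters: `file`, `buf`, `count`, and `pos`. `$$parms` generates a string describing them. In this example all of them except `count` are pointers. The command above produces: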
112 | ```
113 | file=0xffff8800b40d4c80 buf=0x7fff634403e0 count=0x2004 pos=0xffff8800af96df48
114 | ```
115 | 
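116 | Printing bare addresses is not very useful. To print the data a pointer refers to, append the `$` suffix. The following command uses the `$` suffix to print the actual values of the `vfs_read` parameters: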
117 | ```
118 | stap -e 'probe kernel.function("vfs_read") {printf("%s\n", $$parms$); exit(); }'
119 | ```
120 | 
121 | The output:
122 | ```
123 | file={.f_u={...}, .f_path={...}, .f_op=0xffffffffa06e1d80, .f_lock={...}, .f_count={...}, .f_flags=34818, .f_mode=31, .f_pos=0, .f_owner={...}, .f_cred=0xffff88013148fc80, .f_ra={...}, .f_version=0, .f_security=0xffff8800b8dce560, .private_data=0x0, .f_ep_links={...}, .f_mapping=0xffff880037f8fdf8} buf="" count=8196 pos=-131938753921208
124 | ```
125 | 
126 | The `$` suffix alone does not expand structures nested inside a structure. To expand nested structures, use the `$$` suffix. Here is an example using `$$`:
127 | ```
128 | stap -e 'probe kernel.function("vfs_read") {printf("%s\n", $$parms$$); exit(); }'
129 | ```
130 | 
131 | Note that `$$` output is subject to the maximum string length. The output of the command above is truncated for that reason:
132 | ```
133 | file={.f_u={.fu_list={.next=0xffff8801336ca0e8, .prev=0xffff88012ded0840}, .fu_rcuhead={.next=0xffff8801336ca0e8, .func=0xffff88012ded0840}}, .f_path={.mnt=0xffff880132fc97c0, .dentry=0xffff88001a889cc0}, .f_op=0xffffffffa06f64c0, .f_lock={.raw_lock={.slock=196611}}, .f_count={.counter=2}, .f_flags=34818, .f_mode=31, .f_pos=0, .f_owner={.lock={.raw_lock={.lock=16777216}}, .pid=0x0, .pid_type=0, .uid=0, .euid=0, .signum=0}, .f_cred=0xffff880130129a80, .f_ra={.start=0, .size=0, .async_size=0, .ra_pages=32, .
134 | ```
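
The commands above all probe function entry. As a small sketch of `$$return`, the probe below fires when `vfs_read` returns and prints the return-value string; pairing it with `ppfunc()` is an illustrative choice.

```
probe kernel.function("vfs_read").return
{
  # In a .return probe, $$return expands to a string of the form "return=0x..."
  printf("%s %s\n", ppfunc(), $$return)
  exit()
}
```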
135 | 
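136 | ## Conditional statements
137 | 
138 | Sometimes a SystemTap script is complex enough to need conditional statements. SystemTap supports C-style conditionals (`if`/`else`) and loops (`while`, `for`), as well as iteration of the form `foreach (VAR in ARRAY) {}`.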
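
A minimal sketch of `if`/`else` in a handler: one probe listens to two events and uses the name of the probed function to decide which counter to increment. The variable names `countread` and `countwrite` and the five-second run time are illustrative.

```
global countread, countwrite

probe kernel.function("vfs_read"), kernel.function("vfs_write")
{
  # ppfunc() returns the name of the function that triggered the probe
  if (ppfunc() == "vfs_read")
    countread ++
  else
    countwrite ++
}

probe timer.s(5) { exit() }

probe end
{
  printf("VFS reads total %d\nVFS writes total %d\n", countread, countwrite)
}
```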
139 | 
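140 | ## Command-line arguments
141 | 
142 | Command-line arguments are accessed with `$` or `@` followed by a number giving the position of the argument. `$` treats the argument as an integer; `@` treats it as a string.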
143 | 
144 | ```
145 | probe kernel.function(@1) { }
146 | probe kernel.function(@1).return { }
147 | ```
148 | 
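149 | The script above expects the user to pass the function to probe as a command-line argument. A script can take several command-line arguments, referred to as `@1`, `@2`, and so on, in the order the user enters them.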
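
The two forms can be mixed in one script. The sketch below takes a kernel function name as its first argument (read as a string with `@1`) and a PID as its second (read as an integer with `$2`); the script name `args.stp` used in the comment is hypothetical.

```
# Run as: stap args.stp vfs_read 1234
probe kernel.function(@1)
{
  # $2 is substituted as an integer literal, so it can be compared with pid()
  if (pid() == $2)
    printf("%s called by %s(%d)\n", ppfunc(), execname(), pid())
}
```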
150 | 


--------------------------------------------------------------------------------
/3_2_SystemTapScripts.md:
--------------------------------------------------------------------------------
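  1 | # 3.2. Scripts
  2 | 
  3 | In most cases, SystemTap scripts are the foundation of each SystemTap session. The script determines what kind of information is collected and what is done with it once collected.
  4 | 
  5 | As mentioned at the start of this chapter, a SystemTap script consists of two parts: events and handlers. Once a SystemTap session is under way, SystemTap monitors the operating system for the specified events and runs the corresponding handler when each event occurs.
  6 | 
  7 | > An event and its handler are collectively called a probe. A SystemTap script can have multiple probes.
  8 | > The handler of a probe is commonly called the probe body.
  9 | 
 10 | In application-development terms, using events and handlers is like inserting logging statements at particular points in a program. Each time the program runs, those logs help you follow its flow of execution.
 11 | 
 12 | SystemTap scripts let you insert instrumentation without recompiling the code, and handlers are not limited to printing data. An event triggers its handler; the handler records the data of interest and prints it in the format you specify.
 13 | 
 14 | SystemTap scripts use the `.stp` file extension, and a probe is written like this: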
 15 | 
 16 | ```
 17 | probe   event {statements}
 18 | ```
 19 | 
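 20 | Translator's note: if you have written awk scripts, this should look familiar.
 21 | 
 22 | SystemTap lets a probe name multiple events, separated by commas. If a probe specifies several events, SystemTap runs the handler whenever any one of them occurs.
 23 | 
 24 | Each probe has its own statement block. The block is enclosed in braces (`{}`) and contains all the statements to run when the event occurs. SystemTap executes them in order; statements usually need no special separator or terminator.
 25 | 
 26 | > Statement blocks in SystemTap scripts follow the same syntax as in C, and they can be nested.
 27 | 
 28 | SystemTap lets you write functions to factor out logic shared between probes. Rather than copying the same statements into several probes, put them in a function, like this: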
 29 | 
 30 | ```
 31 | function function_name(arguments) {statements}
 32 | 
 33 | probe event {function_name(arguments)}
 34 | ```
 35 | 
 36 | When the probe fires, the statements in `function_name` are executed. `arguments` are the optional values passed to the function.
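
As a concrete sketch of the template above, the following script moves a shared `printf` into a function and calls it from two probes. The function name `report` and the choice of the `open` and `close` system calls are illustrative.

```
function report(label)
{
  # Runs in the context of whichever probe called it
  printf("%s: %s(%d)\n", label, execname(), pid())
}

probe syscall.open  { report("open") }
probe syscall.close { report("close") }
```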
 37 | 
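 38 | > This section gives only a brief overview of the structure of SystemTap scripts. For more detail, read on to Chapter 5, Useful SystemTap Scripts; each of its sections walks through one script, covering the events it monitors, its handlers, and its output.
 39 | 
 40 | ## Events
 41 | 
 42 | SystemTap events fall broadly into two categories: synchronous and asynchronous.
 43 | 
 44 | ### Synchronous events
 45 | 
 46 | A synchronous event occurs when any process reaches a particular location in the kernel. Because it has a clear context, it can serve as a reference point for other events.
 47 | 
 48 | Synchronous events include:
 49 | 
 50 | **syscall.system_call**
 51 | 
 52 | Entry to the system call `system_call`. To monitor the exit from a system call instead, append `.return`. For example, to monitor entry to and exit from the `close` system call, use `syscall.close` and `syscall.close.return`.
 53 | 
 54 | **vfs.file_operation**
 55 | 
 56 | Entry to the `file_operation` operation of the Virtual File System (VFS). As with system-call events, append `.return` to monitor the corresponding exit.
 57 | Translator's note: the valid values of `file_operation` depend on the operations defined in `struct file_operations` in the current kernel (probably in `include/linux/fs.h`, though the location varies between versions; look up `file_operations` at http://lxr.free-electrons.com/ident ).
 58 | 
 59 | **kernel.function("function")**
 60 | 
 61 | Entry to the kernel function `function`. For example, `kernel.function("sys_open")` is the event that occurs when the kernel function `sys_open` is called. Likewise, `kernel.function("sys_open").return` fires when `sys_open` returns.
 62 | 
 63 | When defining probe events you can use wildcards such as `*`. You can also restrict the traced functions by kernel source file name. Consider the following example: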
 64 | 
 65 | >     probe kernel.function("*@net/socket.c") { }
 66 | >     probe kernel.function("*@net/socket.c").return { }
 67 | 
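 68 | In the example above, the first probe fires on entry to every function in `net/socket.c`. The second fires on exit from each of those functions. Note that the handlers are empty here, so nothing is recorded or printed even when the events fire.
 69 | Translator's note: the example uses the syntax for probing functions in kernel source. The full form is `func_name@file_name[:line_num]`, made up of a function name, a file name, and a line number. Here the function name is `*`, which matches any function. The line number is optional and is omitted in the example above. To select functions within a range of lines, say from line x to y, use `:x-y` as the line number.
 70 | 
 71 | **kernel.trace("tracepoint")**
 72 | 
 73 | Reaching the static kernel tracepoint named `tracepoint`. Recent kernels (2.6.30 and later) include instrumentation for specific events, which are usually marked as static tracepoints. For example, `kernel.trace("kfree_skb")` is the event of the kernel freeing a network buffer. (Translator's note: to see which tracepoints the current kernel defines, run `sudo perf list`.)
 74 | 
 75 | **module("module").function("function")**
 76 | 
 77 | Entry to the function `function` in the module `module`. For example: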
 78 | 
 79 | >     probe module("ext3").function("*") { }
 80 | >     probe module("ext3").function("*").return { }
 81 | 
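 82 | The first probe in the example above fires on entry to every function of the ext3 module. The second fires when those functions exit. Everything works just as with `kernel.function()`.
 83 | 
 84 | The kernel modules of a system normally live under `/lib/modules/kernel_version`, where `kernel_version` is the current kernel version. Module files have the extension `.ko`.
 85 | (Translator's note: running `find -name '*.ko' -printf '%f\n' | sed 's/\.ko$//' ` in that directory lists all kernel modules.)
 86 | 
 87 | ### Asynchronous events
 88 | 
 89 | Asynchronous events are not tied to a particular instruction or location in code.
 90 | They mainly include counters, timers, and similar constructs.
 91 | 
 92 | **begin**
 93 | 
 94 | The start of a SystemTap session; it fires as soon as the script starts running.
 95 | 
 96 | **end**
 97 | 
 98 | The end of a SystemTap session; it fires when the script finishes.
 99 | 
100 | **timer events**
101 | 
102 | Used to run a handler periodically. For example: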
103 | 
104 | >     probe timer.s(4)
105 | >     {
106 | >         printf("hello world\n")
107 | >     }
108 | 
109 | In the example above, `hello world` is printed every 4 seconds. Timers with other units are also available:
110 | 
111 | ```
112 | timer.ms(milliseconds)
113 | timer.us(microseconds)
114 | timer.ns(nanoseconds)
115 | timer.hz(hertz)
116 | timer.jiffies(jiffies)
117 | ```
118 | 
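119 | Timer events are typically combined with other events: the other events collect information, while the timer event periodically prints the current state so you can see how the data changes over time.
120 | 
121 | > For reasons of space, the remaining SystemTap events are not covered here. To learn more, see `man stapprobes`. The `SEE ALSO` section of that man page points to other man pages that describe the events supported by particular subsystems and components.
122 | 
123 | ## Handlers
124 | 
125 | Consider the following sample script: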
126 | ```
127 | probe begin
128 | {
129 |   printf ("hello world\n")
130 |   exit ()
131 | }
132 | ```
133 | 
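134 | In the example above, the `begin` event fires when the session starts and triggers the handler inside `{}`, which prints `hello world` followed by a newline and then exits.
135 | 
136 | > A SystemTap script keeps running until the `exit()` function executes. To stop a script midway, interrupt it with `Ctrl+c`.
137 | 
138 | **printf**
139 | 
140 | `printf()` is one of the simplest SystemTap functions. It can be combined with many other functions to print data. A typical call looks like this: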
141 | 
142 |     printf ("format string\n", arguments)
143 |     
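144 | `format string` specifies how `arguments` are printed. The format string in the previous example contains no format specifiers. In a format string, use `%s` for a string and `%d` for a number. A format string can contain several specifiers, each matching one argument; arguments are separated by commas.
145 | 
146 | > The SystemTap `printf` statement is similar to C's `printf` in both syntax and format strings.
147 | 
148 | Here is one more example: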
149 | ```
150 | probe syscall.open
151 | {
152 |   printf ("%s(%d) open\n", execname(), pid())
153 | }
154 | ```
155 | 
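156 | In the example above, each time `open` is called SystemTap prints the name and PID of the calling program, followed by the word `open`. The output of this probe looks like this: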
157 | ```
158 | vmware-guestd(2206) open
159 | hald(2360) open
160 | hald(2360) open
161 | hald(2360) open
162 | df(3433) open
163 | df(3433) open
164 | df(3433) open
165 | hald(2360) open
166 | ```
167 | 
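168 | You can use other SystemTap functions inside `printf()`. The example above uses `execname()` (the name of the process that triggered the event) and `pid()` (the current process ID).
169 | 
170 | Commonly used SystemTap functions include:
171 | 
172 | **tid()**
173 | 
174 | The current thread ID (tid).
175 | 
176 | **uid()**
177 | 
178 | The current user ID (uid).
179 | 
180 | **cpu()**
181 | 
182 | The current CPU number.
183 | 
184 | **gettimeofday_s()**
185 | 
186 | The number of seconds since the epoch.
187 | 
188 | **ctime()**
189 | 
190 | Converts a number of seconds since the epoch, such as the value returned by `gettimeofday_s()`, into a date string.
191 | 
192 | **pp()**
193 | 
194 | A string describing the probe point currently being handled.
195 | 
196 | **thread_indent()**
197 | 
198 | This function helps you organize your output. It takes an indentation delta as its argument and adds it to the current thread's indentation counter (the number of spaces used for indentation). It returns an identifying string padded with the appropriate indentation.
199 | The string consists of a timestamp (the number of microseconds since the thread's first call to `thread_indent()`), a process name, and a thread ID. This makes the order and nesting of function calls, and the time between them, easy to see.
200 | If a function is entered and exits immediately, it is easy to see that the two events belong together. In most cases, however, other functions are called between a function's entry and its exit. Indentation makes it much clearer where each call begins and ends.
201 | 
202 | Consider the following example using `thread_indent()`: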
203 | ```
204 | probe kernel.function("*@net/socket.c").call
205 | {
206 |   printf ("%s -> %s\n", thread_indent(1), probefunc())
207 | }
208 | probe kernel.function("*@net/socket.c").return
209 | {
210 |   printf ("%s <- %s\n", thread_indent(-1), probefunc())
211 | }
212 | ```
213 | 
214 | Its output looks roughly like this; note the number of spaces before each arrow:
215 | 
216 | ```
217 | 0 ftp(7223): -> sys_socketcall
218 | 1159 ftp(7223):  -> sys_socket
219 | 2173 ftp(7223):   -> __sock_create
220 | 2286 ftp(7223):    -> sock_alloc_inode
221 | 2737 ftp(7223):    <- sock_alloc_inode
222 | 3349 ftp(7223):    -> sock_alloc
223 | 3389 ftp(7223):    <- sock_alloc
224 | 3417 ftp(7223):   <- __sock_create
225 | 4117 ftp(7223):   -> sock_create
226 | 4160 ftp(7223):   <- sock_create
227 | 4301 ftp(7223):   -> sock_map_fd
228 | 4644 ftp(7223):    -> sock_map_file
229 | 4699 ftp(7223):    <- sock_map_file
230 | 4715 ftp(7223):   <- sock_map_fd
231 | 4732 ftp(7223):  <- sys_socket
232 | 4775 ftp(7223): <- sys_socketcall
233 | ```
234 | 
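235 | The output above contains the following information:
236 | * The number of microseconds since the thread's first call to `thread_indent()`.
237 | * The process name and its thread ID.
238 | * Spaces used for indentation. These first three items are produced by `thread_indent()`.
239 | * `->` for function entry, `<-` for function exit.
240 | * The name of the function that triggered the event.
241 | 
242 | **name**
243 | 
244 | The name of the system call. This variable is only available in handlers triggered by `syscall.system_call` events.
245 | 
246 | **target()**
247 | 
248 | When you run a script `script` with `stap script -x PID` or `stap script -c command`, `target()` returns the PID you specified or the PID of the process started for the command. For example: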
249 | 
250 | ```
251 | probe syscall.* {
252 |   if (pid() == target())
253 |     printf("%s\n", name)
254 | }
255 | ```
256 | 
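257 | When the script above is run with the `-x PID` option, it watches all system calls (`syscall.*`) and prints the names of those made by the specified process.
258 | You could of course replace `target()` with the PID you want. Using `target()`, however, makes the script reusable: you pass the PID at run time instead of editing a hard-coded value each time.
259 | 
260 | For more information about SystemTap functions, see `man stapfuncs`.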
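
To see several of the functions listed above side by side, here is a small sketch that prints one line for the first `read` system call it observes and then exits. The choice of `syscall.read` and the output layout are illustrative.

```
probe syscall.read
{
  # ctime() turns the epoch seconds from gettimeofday_s() into a readable date
  printf("%s cpu=%d uid=%d tid=%d %s\n",
         ctime(gettimeofday_s()), cpu(), uid(), tid(), pp())
  exit()
}
```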
261 | 


--------------------------------------------------------------------------------
/5_1_Network.md:
--------------------------------------------------------------------------------
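  1 | # 5.1. Network
  2 | 
  3 | The scripts in the following sections show how to trace network-related functions and profile network activity.
  4 | 
  5 | ## Profiling network activity
  6 | 
  7 | This section shows how to profile network activity with SystemTap. The `nettop.stp` script below gives a view of how much network traffic each process generates.
  8 | 
  9 | nettop.stp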
 10 | ```
 11 | #! /usr/bin/env stap
 12 | 
 13 | global ifxmit, ifrecv
 14 | global ifmerged
 15 | 
 16 | probe netdev.transmit
 17 | {
 18 |   ifxmit[pid(), dev_name, execname(), uid()] <<< length
 19 | }
 20 | 
 21 | probe netdev.receive
 22 | {
 23 |   ifrecv[pid(), dev_name, execname(), uid()] <<< length
 24 | }
 25 | 
 26 | function print_activity()
 27 | {
 28 |   printf("%5s %5s %-7s %7s %7s %7s %7s %-15s\n",
 29 |          "PID", "UID", "DEV", "XMIT_PK", "RECV_PK",
 30 |          "XMIT_KB", "RECV_KB", "COMMAND")
 31 | 
 32 |   foreach ([pid, dev, exec, uid] in ifrecv) {
 33 |       ifmerged[pid, dev, exec, uid] += @count(ifrecv[pid,dev,exec,uid]);
 34 |   }
 35 |   foreach ([pid, dev, exec, uid] in ifxmit) {
 36 |       ifmerged[pid, dev, exec, uid] += @count(ifxmit[pid,dev,exec,uid]);
 37 |   }
 38 |   foreach ([pid, dev, exec, uid] in ifmerged-) {
 39 |     n_xmit = @count(ifxmit[pid, dev, exec, uid])
 40 |     n_recv = @count(ifrecv[pid, dev, exec, uid])
 41 |     printf("%5d %5d %-7s %7d %7d %7d %7d %-15s\n",
 42 |            pid, uid, dev, n_xmit, n_recv,
 43 |            n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0,
 44 |            n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0,
 45 |            exec)
 46 |   }
 47 | 
 48 |   print("\n")
 49 | 
 50 |   delete ifxmit
 51 |   delete ifrecv
 52 |   delete ifmerged
 53 | }
 54 | 
 55 | probe timer.ms(5000), end, error
 56 | {
 57 |   print_activity()
 58 | }
 59 | ```
 60 | 
 61 | Note these expressions in `print_activity()`:
 62 | ```
 63 | n_xmit ? @sum(ifxmit[pid, dev, exec, uid])/1024 : 0
 64 | n_recv ? @sum(ifrecv[pid, dev, exec, uid])/1024 : 0
 65 | ```
 66 | 
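 67 | These are ternary conditional expressions, which behave like `if/else`. The second one is equivalent to the following pseudocode: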
 68 | ```
 69 | if n_recv != 0 then
 70 |   @sum(ifrecv[pid, dev, exec, uid])/1024
 71 | else
 72 |   0
 73 | ```
 74 | 
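 75 | `nettop.stp` tracks the processes that generate network traffic and prints the following for each of them:
 76 | * PID: the PID of the process.
 77 | * UID: the UID of the process owner.
 78 | * DEV: the network device the process used, such as `eth0` or `eth1`.
 79 | * XMIT_PK: the number of packets transmitted.
 80 | * RECV_PK: the number of packets received.
 81 | * XMIT_KB: the amount of data sent, in KB.
 82 | * RECV_KB: the amount of data received, in KB.
 83 | 
 84 | `nettop.stp` samples every 5 seconds. Change `probe timer.ms(5000)` to adjust the interval. Here is its output over a 20-second period: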
 85 | ```
 86 | [...]
 87 |   PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
 88 |     0     0 eth0          0       5       0       0 swapper
 89 | 11178     0 eth0          2       0       0       0 synergyc
 90 | 
 91 |   PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
 92 |  2886     4 eth0         79       0       5       0 cups-polld
 93 | 11362     0 eth0          0      61       0       5 firefox
 94 |     0     0 eth0          3      32       0       3 swapper
 95 |  2886     4 lo            4       4       0       0 cups-polld
 96 | 11178     0 eth0          3       0       0       0 synergyc
 97 | 
 98 |   PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
 99 |     0     0 eth0          0       6       0       0 swapper
100 |  2886     4 lo            2       2       0       0 cups-polld
101 | 11178     0 eth0          3       0       0       0 synergyc
102 |  3611     0 eth0          0       1       0       0 Xorg
103 | 
104 |   PID   UID DEV     XMIT_PK RECV_PK XMIT_KB RECV_KB COMMAND
105 |     0     0 eth0          3      42       0       2 swapper
106 | 11178     0 eth0         43       1       3       0 synergyc
107 | 11362     0 eth0          0       7       0       0 firefox
108 |  3897     0 eth0          0       1       0       0 multiload-apple
109 | [...]
110 | ```
111 | 
112 | 
113 | 
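114 | ## Tracing kernel functions called in network socket code
115 | 
116 | This section shows how to trace calls to the functions in the kernel's `net/socket.c`. It helps you see in detail how each process interacts with the kernel's networking code.
117 | 
118 | socket-trace.stp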
119 | ```
120 | #! /usr/bin/env stap
121 | 
122 | probe kernel.function("*@net/socket.c").call {
123 |   printf ("%s -> %s\n", thread_indent(1), ppfunc())
124 | }
125 | 
126 | probe kernel.function("*@net/socket.c").return {
127 |   printf ("%s <- %s\n", thread_indent(-1), ppfunc())
128 | }
129 | ```
130 | 
131 | `socket-trace.stp` is the script we already met in Chapter 3 when introducing `thread_indent()`. Here is an excerpt of its output over 3 seconds:
132 | ```
133 | [...]
134 | 0 Xorg(3611): -> sock_poll
135 | 3 Xorg(3611): <- sock_poll
136 | 0 Xorg(3611): -> sock_poll
137 | 3 Xorg(3611): <- sock_poll
138 | 0 gnome-terminal(11106): -> sock_poll
139 | 5 gnome-terminal(11106): <- sock_poll
140 | 0 scim-bridge(3883): -> sock_poll
141 | 3 scim-bridge(3883): <- sock_poll
142 | 0 scim-bridge(3883): -> sys_socketcall
143 | 4 scim-bridge(3883):  -> sys_recv
144 | 8 scim-bridge(3883):   -> sys_recvfrom
145 | 12 scim-bridge(3883):    -> sock_from_file
146 | 16 scim-bridge(3883):    <- sock_from_file
147 | 20 scim-bridge(3883):    -> sock_recvmsg
148 | 24 scim-bridge(3883):    <- sock_recvmsg
149 | 28 scim-bridge(3883):   <- sys_recvfrom
150 | 31 scim-bridge(3883):  <- sys_recv
151 | 35 scim-bridge(3883): <- sys_socketcall
152 | [...]
153 | ```
154 | 
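155 | ## Monitoring incoming TCP connections
156 | 
157 | This section shows how to monitor incoming TCP connections. It helps you spot unauthorized, suspicious, or otherwise unwanted network connections as they happen.
158 | 
159 | tcp_connections.stp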
160 | ```
161 | #! /usr/bin/env stap
162 | 
163 | probe begin {
164 |   printf("%6s %16s %6s %6s %16s\n",
165 |          "UID", "CMD", "PID", "PORT", "IP_SOURCE")
166 | }
167 | 
168 | probe kernel.function("tcp_accept").return?,
169 |       kernel.function("inet_csk_accept").return? {
170 |   sock = $return
171 |   if (sock != 0)
172 |     printf("%6d %16s %6d %6d %16s\n", uid(), execname(), pid(),
173 |            inet_get_local_port(sock), inet_get_ip_source(sock))
174 | }
175 | ```
176 | 
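177 | While `tcp_connections.stp` runs, it prints the following for each incoming TCP connection as it is accepted:
178 | * The current UID
179 | * The name of the program accepting the connection
180 | * The PID of the process accepting the connection
181 | * The local port of the connection and the remote IP address it came from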
182 | 
183 | ```
184 | UID            CMD    PID   PORT        IP_SOURCE
185 | 0             sshd   3165     22      10.64.0.227
186 | 0             sshd   3165     22      10.64.0.227
187 | ```
188 | 
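189 | ## Monitoring TCP packets
190 | 
191 | This section shows how to monitor TCP packets received by the system. It helps you analyze the network traffic of your applications.
192 | 
193 | tcpdumplike.stp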
194 | ```
195 | #! /usr/bin/env stap
196 | 
197 | // A TCP dump like example
198 | 
199 | probe begin, timer.s(1) {
200 |   printf("-----------------------------------------------------------------\n")
201 |   printf("       Source IP         Dest IP  SPort  DPort  U  A  P  R  S  F \n")
202 |   printf("-----------------------------------------------------------------\n")
203 | }
204 | 
205 | probe udp.recvmsg /* ,udp.sendmsg */ {
206 |   printf(" %15s %15s  %5d  %5d  UDP\n",
207 |          saddr, daddr, sport, dport)
208 | }
209 | 
210 | probe tcp.receive {
211 |   printf(" %15s %15s  %5d  %5d  %d  %d  %d  %d  %d  %d\n",
212 |          saddr, daddr, sport, dport, urg, ack, psh, rst, syn, fin)
213 | }
214 | ```
215 | 
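216 | While `tcpdumplike.stp` runs, it prints the following for each TCP packet as it is received:
217 | * Source and destination IP addresses (saddr and daddr)
218 | * Source and destination ports (sport and dport)
219 | * Packet flags
220 | 
221 | `tcpdumplike.stp` uses the following variables of the `tcp.receive` probe to obtain the packet flags: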
222 | * urg - urgent
223 | * ack - acknowledgement
224 | * psh - push
225 | * rst - reset
226 | * syn - synchronize
227 | * fin - finished
228 | 
229 | Each of these is 1 or 0, indicating whether the corresponding flag is set in the packet.
230 | 
231 | ```
232 | -----------------------------------------------------------------
233 |        Source IP         Dest IP  SPort  DPort  U  A  P  R  S  F
234 | -----------------------------------------------------------------
235 |   209.85.229.147       10.0.2.15     80  20373  0  1  1  0  0  0
236 |   92.122.126.240       10.0.2.15     80  53214  0  1  0  0  1  0
237 |   92.122.126.240       10.0.2.15     80  53214  0  1  0  0  0  0
238 |   209.85.229.118       10.0.2.15     80  63433  0  1  0  0  1  0
239 |   209.85.229.118       10.0.2.15     80  63433  0  1  0  0  0  0
240 |   209.85.229.147       10.0.2.15     80  21141  0  1  1  0  0  0
241 |   209.85.229.147       10.0.2.15     80  21141  0  1  1  0  0  0
242 |   209.85.229.147       10.0.2.15     80  21141  0  1  1  0  0  0
243 |   209.85.229.147       10.0.2.15     80  21141  0  1  1  0  0  0
244 |   209.85.229.147       10.0.2.15     80  21141  0  1  1  0  0  0
245 |   209.85.229.118       10.0.2.15     80  63433  0  1  1  0  0  0
246 | [...]
247 | ```
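
Because the flags are plain 0/1 values, they can be used directly in conditions. As a hedged sketch, the probe below reports only packets with SYN set and ACK clear, which normally mark the first packet of a new inbound connection attempt; the output format is illustrative.

```
probe tcp.receive
{
  # SYN without ACK: the opening packet of a TCP handshake
  if (syn && !ack)
    printf("%15s:%d -> %15s:%d SYN\n", saddr, sport, daddr, dport)
}
```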
248 | 
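249 | ## Monitoring network packet drops in the kernel
250 | 
251 | The Linux network stack drops packets under some circumstances. Some Linux kernels include the static tracepoint `kernel.trace("kfree_skb")`, which helps you track where packets are discarded. `dropwatch.stp` uses it to trace packet drops; the script reports every five seconds where packets were dropped.
252 | 
253 | dropwatch.stp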
254 | ```
255 | #! /usr/bin/env stap
256 | 
257 | ############################################################
258 | # Dropwatch.stp
259 | # Author: Neil Horman <nhorman@redhat.com>
260 | # An example script to mimic the behavior of the dropwatch utility
261 | # http://fedorahosted.org/dropwatch
262 | ############################################################
263 | 
264 | # Array to hold the list of drop points we find
265 | global locations
266 | 
267 | # Note when we turn the monitor on and off
268 | probe begin { printf("Monitoring for dropped packets\n") }
269 | probe end { printf("Stopping dropped packet monitor\n") }
270 | 
271 | # increment a drop counter for every location we drop at
272 | probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }
273 | 
274 | # Every 5 seconds report our drop locations
275 | probe timer.sec(5)
276 | {
277 |   printf("\n")
278 |   foreach (l in locations-) {
279 |     printf("%d packets dropped at %s\n",
280 |            @count(locations[l]), symname(l))
281 |   }
282 |   delete locations
283 | }
284 | ```
285 | 
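286 | `kernel.trace("kfree_skb")` traces the places in the kernel where network packets are dropped. It has two arguments: `$skb`, a pointer to the buffer being freed, and `$location`, the kernel location where the buffer is freed. When the function name for the kernel address stored in `$location` can be resolved, `dropwatch.stp` maps the address to that function. This mapping is not enabled by default. With SystemTap 1.4 and later, enable it with the `--all-modules` option: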
287 | ```
288 | stap --all-modules dropwatch.stp
289 | ```
290 | 
291 | On older versions of SystemTap, you can emulate the `--all-modules` option with the following command:
292 | ```
293 | stap -dkernel \
294 | `cat /proc/modules | awk 'BEGIN { ORS = " " } {print "-d"$1}'` \
295 | dropwatch.stp
296 | ```
297 | 
298 | Running `dropwatch.stp` for 15 seconds produces output similar to the following. The number of drops is aggregated by function name or address.
299 | ```
300 | Monitoring for dropped packets
301 | 
302 | 1762 packets dropped at unix_stream_recvmsg
303 | 4 packets dropped at tun_do_read
304 | 2 packets dropped at nf_hook_slow
305 | 
306 | 467 packets dropped at unix_stream_recvmsg
307 | 20 packets dropped at nf_hook_slow
308 | 6 packets dropped at tun_do_read
309 | 
310 | 446 packets dropped at unix_stream_recvmsg
311 | 4 packets dropped at tun_do_read
312 | 4 packets dropped at nf_hook_slow
313 | Stopping dropped packet monitor
314 | ```
315 | 
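316 | When the machine running the script supports neither `--all-modules` nor `/proc/modules`, `symname` prints only the raw address. You can look up the function for an address in `/boot/System.map-$(uname -r)`. In the following excerpt of `/boot/System.map-$(uname -r)`, the address `0xffffffff8149a8ed` lies between the start of `unix_stream_recvmsg` (`0xffffffff8149a5e0`) and the next symbol, so it maps to `unix_stream_recvmsg`: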
317 | ```
318 | [...]
319 | ffffffff8149a420 t unix_dgram_poll
320 | ffffffff8149a5e0 t unix_stream_recvmsg
321 | ffffffff8149ad00 t unix_find_other
322 | [...]
323 | ```
324 | 


--------------------------------------------------------------------------------
/5_3_Profiling.md:
--------------------------------------------------------------------------------
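  1 | # 5.3. Profiling
  2 | 
  3 | The scripts in the following sections show how to profile kernel activity by monitoring function calls.
  4 | 
  5 | ## Counting function calls
  6 | 
  7 | This section shows how to count how many times a kernel function is called within a 30-second sample. With wildcards, the same script can count several kernel functions at once.
  8 | 
  9 | functioncallcount.stp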
 10 | ```
 11 | #! /usr/bin/env stap
 12 | # The following line command will probe all the functions
 13 | # in kernel's memory management code:
 14 | #
 15 | # stap  functioncallcount.stp "*@mm/*.c"
 16 | 
 17 | probe kernel.function(@1).call {  # probe functions listed on commandline
 18 |   called[ppfunc()] <<< 1  # add a count efficiently
 19 | }
 20 | 
 21 | global called
 22 | 
 23 | probe end {
 24 |   foreach (fn in called-)  # Sort by call count (in decreasing order)
 25 |   #       (fn+ in called)  # Sort by function name
 26 |     printf("%s %d\n", fn, @count(called[fn]))
 27 |   exit()
 28 | }
 29 | ```
 30 | 
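 31 | `functioncallcount.stp` takes the name of a kernel function as its argument. Wildcards are allowed, so you can monitor several kernel functions at once.
 32 | The output lists the name of each function called and the number of times it was called during the sample. Here is an excerpt of the output of `stap functioncallcount.stp "*@mm/*.c"`: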
 33 | 
 34 | ```
 35 | [...]
 36 | __vma_link 97
 37 | __vma_link_file 66
 38 | __vma_link_list 97
 39 | __vma_link_rb 97
 40 | __xchg 103
 41 | add_page_to_active_list 102
 42 | add_page_to_inactive_list 19
 43 | add_to_page_cache 19
 44 | add_to_page_cache_lru 7
 45 | all_vm_events 6
 46 | alloc_pages_node 4630
 47 | alloc_slabmgmt 67
 48 | anon_vma_alloc 62
 49 | anon_vma_free 62
 50 | anon_vma_lock 66
 51 | anon_vma_prepare 98
 52 | anon_vma_unlink 97
 53 | anon_vma_unlock 66
 54 | arch_get_unmapped_area_topdown 94
 55 | arch_get_unmapped_exec_area 3
 56 | arch_unmap_area_topdown 97
 57 | atomic_add 2
 58 | atomic_add_negative 97
 59 | atomic_dec_and_test 5153
 60 | atomic_inc 470
 61 | atomic_inc_and_test 1
 62 | [...]
 63 | ```
 64 | 
 65 | ## Tracing the call graph
 66 | 
 67 | This section shows how to trace the chain of function calls and returns.
 68 | 
 69 | para-callgraph.stp
 70 | ```
 71 | #! /usr/bin/env stap
 72 | 
 73 | function trace(entry_p, extra) {
 74 |   %( $# > 1 %? if (tid() in trace) %)
 75 |   printf("%s%s%s %s\n",
 76 |          thread_indent (entry_p),
 77 |          (entry_p>0?"->":"<-"),
 78 |          ppfunc (),
 79 |          extra)
 80 | }
 81 | 
 82 | 
 83 | %( $# > 1 %?
 84 | global trace
 85 | probe $2.call {
 86 |   trace[tid()] = 1
 87 | }
 88 | probe $2.return {
 89 |   delete trace[tid()]
 90 | }
 91 | %)
 92 | 
 93 | probe $1.call   { trace(1, $$parms) }
 94 | probe $1.return { trace(-1, $$return) }
 95 | ```
 96 | 
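 97 | `para-callgraph.stp` takes two command-line arguments:
 98 | 1. The functions to trace (`$1`).
 99 | 2. An optional trigger function (`$2`), which turns tracing on and off per thread. Tracing in a thread continues for as long as the trigger function has not returned.
100 | `para-callgraph.stp` uses `thread_indent()`, so its output also includes a timestamp, the process name, and the ID of the thread running `$1`. For more on `thread_indent()`, see Section 3.2.
101 | (Translator's note: do not imitate this script's coding style. `trace` is an array in the first two probes and a function in the last two. `$#` is the number of arguments, as anyone who has written shell scripts will recognize. `%( $# > 1 %? if (tid() in trace) %)` is a preprocessor conditional; see "5.8.1 Conditions" in the [langref](https://sourceware.org/systemtap/langref/Language_elements.html).)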
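
The preprocessor conditional can be tried in isolation. In the sketch below, the `%( ... %? ... %: ... %)` construct is resolved before the script is compiled, so only one of the two `printf` statements ends up in the generated module; the messages are illustrative.

```
probe begin
{
  %( $# > 0 %?
    # Kept only when at least one command-line argument was given
    printf("first argument: %s\n", @1)
  %:
    # Kept when the script is run without arguments
    printf("no arguments given\n")
  %)
  exit()
}
```

Because the unused branch is discarded during preprocessing, the reference to `@1` does not cause an error when the script is run without arguments.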
102 | 
103 | Here is an excerpt of the output of `stap para-callgraph.stp 'kernel.function("*@fs/*.c")' 'kernel.function("sys_read")'`:
104 | ```
105 | [...]
106 |    267 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
107 |    269 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
108 |      0 gnome-terminal(2921):->fput file=0xffff880111eebbc0
109 |      2 gnome-terminal(2921):<-fput
110 |      0 gnome-terminal(2921):->fget_light fd=0x3 fput_needed=0xffff88010544df54
111 |      3 gnome-terminal(2921):<-fget_light return=0xffff8801116ce980
112 |      0 gnome-terminal(2921):->vfs_read file=0xffff8801116ce980 buf=0xc86504 count=0x1000 pos=0xffff88010544df48
113 |      4 gnome-terminal(2921): ->rw_verify_area read_write=0x0 file=0xffff8801116ce980 ppos=0xffff88010544df48 count=0x1000
114 |      7 gnome-terminal(2921): <-rw_verify_area return=0x1000
115 |     12 gnome-terminal(2921): ->do_sync_read filp=0xffff8801116ce980 buf=0xc86504 len=0x1000 ppos=0xffff88010544df48
116 |     15 gnome-terminal(2921): <-do_sync_read return=0xfffffffffffffff5
117 |     18 gnome-terminal(2921):<-vfs_read return=0xfffffffffffffff5
118 |      0 gnome-terminal(2921):->fput file=0xffff8801116ce980
119 | ```
120 | 
121 | 
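122 | ## Measuring the Time a Thread Spends in Kernel and User Space
123 | 
124 | This section shows how to measure how much run time a given thread spends in kernel space and in user space.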
125 | 
126 | thread-times.stp
127 | ```
128 | #! /usr/bin/env stap
129 | 
130 | probe perf.sw.cpu_clock!, timer.profile {
131 |   // NB: To avoid contention on SMP machines, no global scalars/arrays used,
132 |   // only contention-free statistics aggregates.
133 |   tid=tid(); e=execname()
134 |   if (!user_mode())
135 |     kticks[e,tid] <<< 1
136 |   else
137 |     uticks[e,tid] <<< 1
138 |   ticks <<< 1
139 |   tids[e,tid] <<< 1
140 | }
141 | 
142 | global uticks%, kticks%, ticks
143 | 
144 | global tids%
145 | 
146 | probe timer.s(5), end {
147 |   allticks = @count(ticks)
148 |   printf ("%16s %5s %7s %7s (of %d ticks)\n",
149 |           "comm", "tid", "%user", "%kernel", allticks)
150 |   foreach ([e,tid] in tids- limit 20) {
151 |     uscaled = @count(uticks[e,tid])*10000/allticks
152 |     kscaled = @count(kticks[e,tid])*10000/allticks
153 |     printf ("%16s %5d %3d.%02d%% %3d.%02d%%\n",
154 |       e, tid, uscaled/100, uscaled%100, kscaled/100, kscaled%100)
155 |   }
156 |   printf("\n")
157 | 
158 |   delete uticks
159 |   delete kticks
160 |   delete ticks
161 |   delete tids
162 | }
163 | ```
164 | 
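165 | `thread-times.stp` lists the 20 threads that consumed the most CPU time in each 5-second sample, along with the total number of CPU ticks in that period. For each entry, the output also shows the percentage of CPU time spent in user space and in kernel space.
166 | Its output looks like this: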
167 | 
168 | ```
169 |   tid   %user %kernel (of 20002 ticks)
170 |     0   0.00%  87.88%
171 | 32169   5.24%   0.03%
172 |  9815   3.33%   0.36%
173 |  9859   0.95%   0.00%
174 |  3611   0.56%   0.12%
175 |  9861   0.62%   0.01%
176 | 11106   0.37%   0.02%
177 | 32167   0.08%   0.08%
178 |  3897   0.01%   0.08%
179 |  3800   0.03%   0.00%
180 |  2886   0.02%   0.00%
181 |  3243   0.00%   0.01%
182 |  3862   0.01%   0.00%
183 |  3782   0.00%   0.00%
184 | 21767   0.00%   0.00%
185 |  2522   0.00%   0.00%
186 |  3883   0.00%   0.00%
187 |  3775   0.00%   0.00%
188 |  3943   0.00%   0.00%
189 |  3873   0.00%   0.00%
190 |  ```
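The percentages in this output come from integer-only arithmetic: the script scales each count to hundredths of a percent (`*10000/allticks`) and then prints the quotient and remainder of a division by 100. The following is a minimal Python sketch of that calculation, not part of the original script. The function name `scaled_percent` and the tick count 17578 are hypothetical, chosen only to illustrate the arithmetic.

```python
def scaled_percent(count, allticks):
    """Format count/allticks as a percentage using integer math only.

    Mirrors the script's `uscaled = count*10000/allticks` followed by
    printf("%3d.%02d%%", uscaled/100, uscaled%100). For non-negative
    values, Python's // matches SystemTap's truncating division.
    """
    scaled = count * 10000 // allticks  # hundredths of a percent
    return "%3d.%02d%%" % (scaled // 100, scaled % 100)


# 17578 is a hypothetical kernel tick count, not a value from the guide.
print(scaled_percent(17578, 20002))
```

Because the value is truncated rather than rounded, a thread with very few ticks shows up as `0.00%` even though it was sampled at least once.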
191 |  
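192 | ## Monitoring Polling Applications
193 | 
194 | This section shows how to monitor which applications are polling. This lets you track down unnecessary or excessive polling and pinpoint where CPU usage or power consumption can be improved.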
195 | 
196 | timeout.stp
197 | ```
198 | #! /usr/bin/env stap
199 | # Copyright (C) 2009 Red Hat, Inc.
200 | # Written by Ulrich Drepper <drepper@redhat.com>
201 | # Modified by William Cohen <wcohen@redhat.com>
202 | 
203 | global process, timeout_count, to
204 | global poll_timeout, epoll_timeout, select_timeout, itimer_timeout
205 | global nanosleep_timeout, futex_timeout, signal_timeout
206 | 
207 | probe syscall.poll, syscall.epoll_wait {
208 |   if (timeout) to[pid()]=timeout
209 | }
210 | 
211 | probe syscall.poll.return {
212 |   if ($return == 0 && to[pid()] > 0 ) {
213 |     poll_timeout[pid()]++
214 |     timeout_count[pid()]++
215 |     process[pid()] = execname()
216 |     delete to[pid()]
217 |   }
218 | }
219 | 
220 | probe syscall.epoll_wait.return {
221 |   if ($return == 0 && to[pid()] > 0 ) {
222 |     epoll_timeout[pid()]++
223 |     timeout_count[pid()]++
224 |     process[pid()] = execname()
225 |     delete to[pid()]
226 |   }
227 | }
228 | 
229 | probe syscall.select.return {
230 |   if ($return == 0) {
231 |     select_timeout[pid()]++
232 |     timeout_count[pid()]++
233 |     process[pid()] = execname()
234 |   }
235 | }
236 | 
237 | probe syscall.futex.return {
238 |   if (errno_str($return) == "ETIMEDOUT") {
239 |     futex_timeout[pid()]++
240 |     timeout_count[pid()]++
241 |     process[pid()] = execname()
242 |   }
243 | }
244 | 
245 | probe syscall.nanosleep.return {
246 |   if ($return == 0) {
247 |     nanosleep_timeout[pid()]++
248 |     timeout_count[pid()]++
249 |     process[pid()] = execname()
250 |   }
251 | }
252 | 
253 | probe kernel.function("it_real_fn") {
254 |   itimer_timeout[pid()]++
255 |   timeout_count[pid()]++
256 |   process[pid()] = execname()
257 | }
258 | 
259 | probe syscall.rt_sigtimedwait.return {
260 |   if (errno_str($return) == "EAGAIN") {
261 |     signal_timeout[pid()]++
262 |     timeout_count[pid()]++
263 |     process[pid()] = execname()
264 |   }
265 | }
266 | 
267 | probe syscall.exit {
268 |   if (pid() in process) {
269 |     delete process[pid()]
270 |     delete timeout_count[pid()]
271 |     delete poll_timeout[pid()]
272 |     delete epoll_timeout[pid()]
273 |     delete select_timeout[pid()]
274 |     delete itimer_timeout[pid()]
275 |     delete futex_timeout[pid()]
276 |     delete nanosleep_timeout[pid()]
277 |     delete signal_timeout[pid()]
278 |   }
279 | }
280 | 
281 | probe timer.s(1) {
282 |   ansi_clear_screen()
283 |   printf ("  pid |   poll  select   epoll  itimer   futex nanosle  signal| process\n")
284 |   foreach (p in timeout_count- limit 20) {
285 |      printf ("%5d |%7d %7d %7d %7d %7d %7d %7d| %-.38s\n", p,
286 |               poll_timeout[p], select_timeout[p],
287 |               epoll_timeout[p], itimer_timeout[p],
288 |               futex_timeout[p], nanosleep_timeout[p],
289 |               signal_timeout[p], process[p])
290 |   }
291 | }
292 | ```
293 | 
294 | `timeout.stp` tracks the following calls and counts only the times a call returned because of a timeout:
295 | * poll
296 | * select
297 | * epoll
298 | * itimer
299 | * futex
300 | * nanosleep
301 | * signal
302 | 
303 | ```
304 |   uid |   poll  select   epoll  itimer   futex nanosle  signal| process
305 | 28937 | 148793       0       0    4727   37288       0       0| firefox
306 | 22945 |      0   56949       0       1       0       0       0| scim-bridge
307 |     0 |      0       0       0   36414       0       0       0| swapper
308 |  4275 |  23140       0       0       1       0       0       0| mixer_applet2
309 |  4191 |      0   14405       0       0       0       0       0| scim-launcher
310 | 22941 |   7908       1       0      62       0       0       0| gnome-terminal
311 |  4261 |      0       0       0       2       0    7622       0| escd
312 |  3695 |      0       0       0       0       0    7622       0| gdm-binary
313 |  3483 |      0    7206       0       0       0       0       0| dhcdbd
314 |  4189 |   6916       0       0       2       0       0       0| scim-panel-gtk
315 |  1863 |   5767       0       0       0       0       0       0| iscsid
316 |  2562 |      0    2881       0       1       0    1438       0| pcscd
317 |  4257 |   4255       0       0       1       0       0       0| gnome-power-man
318 |  4278 |   3876       0       0      60       0       0       0| multiload-apple
319 |  4083 |      0    1331       0    1728       0       0       0| Xorg
320 |  3921 |   1603       0       0       0       0       0       0| gam_server
321 |  4248 |   1591       0       0       0       0       0       0| nm-applet
322 |  3165 |      0    1441       0       0       0       0       0| xterm
323 | 29548 |      0    1440       0       0       0       0       0| httpd
324 |  1862 |      0       0       0       0       0    1438       0| iscsid
325 | ```
326 | 
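327 | To lengthen the sampling interval, edit the timer in the last probe (`timer.s(1)`). The output of `timeout.stp` lists the top 20 polling applications by name and PID, together with the cumulative number of timeouts for each kind of polling call. The first column holds the PID, since every array in the script is keyed by `pid()`, even though the sample header above labels it `uid`. In the excerpt above, firefox polls excessively because of a plug-in module.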
328 | 
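329 | ## Tracking the Most Frequently Used System Calls
330 | 
331 | The `timeout.stp` script in the previous section helps you find applications that poll excessively by monitoring a subset of system calls. Likewise, if you suspect that certain system calls are being made too often, similar monitoring will reveal them. The `topsys.stp` script below does exactly that: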
332 | 
333 | topsys.stp
334 | ```
335 | #! /usr/bin/env stap
336 | #
337 | # This script continuously lists the top 20 systemcalls in the interval 
338 | # 5 seconds
339 | #
340 | 
341 | global syscalls_count
342 | 
343 | probe syscall.* {
344 |   syscalls_count[name] <<< 1
345 | }
346 | 
347 | function print_systop () {
348 |   printf ("%25s %10s\n", "SYSCALL", "COUNT")
349 |   foreach (syscall in syscalls_count- limit 20) {
350 |     printf("%25s %10d\n", syscall, @count(syscalls_count[syscall]))
351 |   }
352 |   delete syscalls_count
353 | }
354 | 
355 | probe timer.s(5) {
356 |   print_systop ()
357 |   printf("--------------------------------------------------------------\n")
358 | }
359 | ```
360 | 
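361 | `topsys.stp` lists the 20 most frequently called system calls every 5 seconds, together with the number of times each was called during that interval. Sample output: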
362 | 
363 | ```
364 | --------------------------------------------------------------
365 |                   SYSCALL      COUNT
366 |              gettimeofday       1857
367 |                      read       1821
368 |                     ioctl       1568
369 |                      poll       1033
370 |                     close        638
371 |                      open        503
372 |                    select        455
373 |                     write        391
374 |                    writev        335
375 |                     futex        303
376 |                   recvmsg        251
377 |                    socket        137
378 |             clock_gettime        124
379 |            rt_sigprocmask        121
380 |                    sendto        120
381 |                 setitimer        106
382 |                      stat         90
383 |                      time         81
384 |                 sigreturn         72
385 |                     fstat         66
386 | --------------------------------------------------------------
387 | ```
388 | 
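389 | ## Tracking System Call Volume per Process
390 | 
391 | This section shows how to find the processes that make the most system calls. The previous section showed how to find the most frequently called system calls, and the one before it showed how to find the applications that poll the most. Counting the system calls made by each process gives you additional data when investigating polling processes and other resource hogs.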
392 | 
393 | `syscalls_by_proc.stp`
394 | ```
395 | #! /usr/bin/env stap
396 | 
397 | # Copyright (C) 2006 IBM Corp.
398 | #
399 | # This file is part of systemtap, and is free software.  You can
400 | # redistribute it and/or modify it under the terms of the GNU General
401 | # Public License (GPL); either version 2, or (at your option) any
402 | # later version.
403 | 
404 | #
405 | # Print the system call count by process name in descending order.
406 | #
407 | 
408 | global syscalls
409 | 
410 | probe begin {
411 |   print ("Collecting data... Type Ctrl-C to exit and display results\n")
412 | }
413 | 
414 | probe nd_syscall.* {
415 |   syscalls[execname()]++
416 | }
417 | 
418 | probe end {
419 |   printf ("%-10s %-s\n", "#SysCalls", "Process Name")
420 |   foreach (proc in syscalls-)
421 |     printf("%-10d %-s\n", syscalls[proc], proc)
422 | }
423 | ```
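424 | `syscalls_by_proc.stp` lists processes in descending order of the number of system calls they made during the run, together with each count. The `foreach` loop has no `limit` clause, so every process seen is printed. Sample output: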
425 | 
426 | ```
427 | Collecting data... Type Ctrl-C to exit and display results
428 | #SysCalls  Process Name
429 | 1577       multiload-apple
430 | 692        synergyc
431 | 408        pcscd
432 | 376        mixer_applet2
433 | 299        gnome-terminal
434 | 293        Xorg
435 | 206        scim-panel-gtk
436 | 95         gnome-power-man
437 | 90         artsd
438 | 85         dhcdbd
439 | 84         scim-bridge
440 | 78         gnome-screensav
441 | 66         scim-launcher
442 | [...]
443 | ```
444 | 
445 | To print process IDs instead of process names, use the following script:
446 | 
447 | syscalls_by_pid.stp
448 | ```
449 | #! /usr/bin/env stap
450 | 
451 | # Copyright (C) 2006 IBM Corp.
452 | #
453 | # This file is part of systemtap, and is free software.  You can
454 | # redistribute it and/or modify it under the terms of the GNU General
455 | # Public License (GPL); either version 2, or (at your option) any
456 | # later version.
457 | 
458 | #
459 | # Print the system call count by process ID in descending order.
460 | #
461 | 
462 | global syscalls
463 | 
464 | probe begin {
465 |   print ("Collecting data... Type Ctrl-C to exit and display results\n")
466 | }
467 | 
468 | probe nd_syscall.* {
469 |   syscalls[pid()]++
470 | }
471 | 
472 | probe end {
473 |   printf ("%-10s %-s\n", "#SysCalls", "PID")
474 |   foreach (pid in syscalls-)
475 |     printf("%-10d %-d\n", syscalls[pid], pid)
476 | }
477 | ```
478 | 
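479 | As the output indicates, you need to press Ctrl-C to exit the script and display the results. To make the script exit automatically after a fixed time, add a `timer.s()` probe. For example, to have the script exit after 5 seconds, add the following probe: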
480 | ```
481 | probe timer.s(5)
482 | {
483 |     exit()
484 | }
485 | ```
486 | 


--------------------------------------------------------------------------------
/5_2_Disk.md:
--------------------------------------------------------------------------------
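  1 | # 5.2. Disk
  2 | 
  3 | The scripts in the following sections show how to monitor disk and I/O activity.
  4 | 
  5 | ## Summarizing Disk Read/Write Traffic
  6 | 
  7 | This section shows how to find the processes doing the heaviest disk reads and writes.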
  8 | 
  9 | disktop.stp
 10 | ```
 11 | #!/usr/bin/env stap
 12 | #
 13 | # Copyright (C) 2007 Oracle Corp.
 14 | #
 15 | # Get the status of reading/writing disk every 5 seconds,
 16 | # output top ten entries
 17 | #
 18 | # This is free software,GNU General Public License (GPL);
 19 | # either version 2, or (at your option) any later version.
 20 | #
 21 | # Usage:
 22 | #  ./disktop.stp
 23 | #
 24 | 
 25 | global io_stat,device
 26 | global read_bytes,write_bytes
 27 | 
 28 | probe vfs.read.return {
 29 |   if ($return>0) {
 30 |     if (devname!="N/A") {/*skip read from cache*/
 31 |       io_stat[pid(),execname(),uid(),ppid(),"R"] += $return
 32 |       device[pid(),execname(),uid(),ppid(),"R"] = devname
 33 |       read_bytes += $return
 34 |     }
 35 |   }
 36 | }
 37 | 
 38 | probe vfs.write.return {
 39 |   if ($return>0) {
 40 |     if (devname!="N/A") { /*skip update cache*/
 41 |       io_stat[pid(),execname(),uid(),ppid(),"W"] += $return
 42 |       device[pid(),execname(),uid(),ppid(),"W"] = devname
 43 |       write_bytes += $return
 44 |     }
 45 |   }
 46 | }
 47 | 
 48 | probe timer.ms(5000) {
 49 |   /* skip non-read/write disk */
 50 |   if (read_bytes+write_bytes) {
 51 | 
 52 |     printf("\n%-25s, %-8s%4dKb/sec, %-7s%6dKb, %-7s%6dKb\n\n",
 53 |            ctime(gettimeofday_s()),
 54 |            "Average:", ((read_bytes+write_bytes)/1024)/5,
 55 |            "Read:",read_bytes/1024,
 56 |            "Write:",write_bytes/1024)
 57 | 
 58 |     /* print header */
 59 |     printf("%8s %8s %8s %25s %8s %4s %12s\n",
 60 |            "UID","PID","PPID","CMD","DEVICE","T","BYTES")
 61 |   }
 62 |   /* print top ten I/O */
 63 |   foreach ([process,cmd,userid,parent,action] in io_stat- limit 10)
 64 |     printf("%8d %8d %8d %25s %8s %4s %12d\n",
 65 |            userid,process,parent,cmd,
 66 |            device[process,cmd,userid,parent,action],
 67 |            action,io_stat[process,cmd,userid,parent,action])
 68 | 
 69 |   /* clear data */
 70 |   delete io_stat
 71 |   delete device
 72 |   read_bytes = 0
 73 |   write_bytes = 0
 74 | }
 75 | 
 76 | probe end{
 77 |   delete io_stat
 78 |   delete device
 79 |   delete read_bytes
 80 |   delete write_bytes
 81 | }
 82 | ```
 83 | 
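 84 | `disktop.stp` prints the ten processes doing the heaviest disk reads and writes, with the following data for each:
 85 | * UID - user ID of the process owner
 86 | * PID - process ID
 87 | * PPID - process ID of the parent process
 88 | * CMD - name of the process
 89 | * DEVICE - the device being read from or written to
 90 | * T - the type of operation: `W` for write, `R` for read
 91 | * BYTES - the amount of data read or written, in bytes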
 92 | 
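 93 | `disktop.stp` uses `ctime()` and `gettimeofday_s()` to print the current time. `gettimeofday_s()` returns the number of seconds since the epoch (January 1, 1970), and `ctime()` converts it into a human-readable timestamp.
 94 | In this script, `$return` is a local variable holding the number of bytes read from or written to the virtual file system. `$return` can only be used in function return probes (here, `vfs.read.return` and `vfs.write.return`).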
 95 | 
 96 | Sample output from this script:
 97 | ```
 98 | [...]
 99 | Mon Sep 29 03:38:28 2008 , Average:  19Kb/sec, Read: 7Kb, Write: 89Kb
100 | 
101 | UID      PID     PPID                       CMD   DEVICE    T    BYTES
102 | 0    26319    26294                   firefox     sda5    W        90229
103 | 0     2758     2757           pam_timestamp_c     sda5    R         8064
104 | 0     2885        1                     cupsd     sda5    W         1678
105 | 
106 | Mon Sep 29 03:38:38 2008 , Average:   1Kb/sec, Read: 7Kb, Write: 1Kb
107 | 
108 | UID      PID     PPID                       CMD   DEVICE    T    BYTES
109 | 0     2758     2757           pam_timestamp_c     sda5    R         8064
110 | 0     2885        1                     cupsd     sda5    W         1678
111 | ```
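The summary line of the first sample can be reproduced from the per-process rows beneath it. The following Python sketch mirrors the integer arithmetic in the script's `printf` call; it assumes the three rows shown are the only disk I/O in that interval, and the function name `summary` is made up for illustration.

```python
def summary(read_bytes, write_bytes, interval_s=5):
    """Return (average KB/s, KB read, KB written) using integer division,
    as in the script: ((read_bytes+write_bytes)/1024)/5."""
    avg_kb_per_s = ((read_bytes + write_bytes) // 1024) // interval_s
    return avg_kb_per_s, read_bytes // 1024, write_bytes // 1024


# Byte counts taken from the rows of the first sample interval above.
read_bytes = 8064             # pam_timestamp_c, T = R
write_bytes = 90229 + 1678    # firefox + cupsd, T = W

print(summary(read_bytes, write_bytes))
```

The result matches the header `Average:  19Kb/sec, Read: 7Kb, Write: 89Kb`. Each value is truncated, which is why the 8064 bytes read are reported as 7Kb.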
112 | 
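113 | ## Tracking I/O Time for Each File Read or Write
114 | 
115 | This section shows how to monitor how long each process takes to read from or write to any file. This helps you identify files that are slow to load on a system.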
116 | 
117 | iotime.stp
118 | ```
119 | #! /usr/bin/env stap
120 | 
121 | /*
122 |  * Copyright (C) 2006-2007 Red Hat Inc.
123 |  *
124 |  * This copyrighted material is made available to anyone wishing to use,
125 |  * modify, copy, or redistribute it subject to the terms and conditions
126 |  * of the GNU General Public License v.2.
127 |  *
128 |  * You should have received a copy of the GNU General Public License
129 |  * along with this program.  If not, see <http://www.gnu.org/licenses/>.
130 |  *
131 |  * Print out the amount of time spent in the read and write systemcall
132 |  * when each file opened by the process is closed. Note that the systemtap
133 |  * script needs to be running before the open operations occur for
134 |  * the script to record data.
135 |  *
136 |  * This script could be used to find out which files are slow to load
137 |  * on a machine. e.g.
138 |  *
139 |  * stap iotime.stp -c 'firefox'
140 |  *
141 |  * Output format is:
142 |  * timestamp pid (executable) info_type path ...
143 |  *
144 |  * 200283135 2573 (cupsd) access /etc/printcap read: 0 write: 7063
145 |  * 200283143 2573 (cupsd) iotime /etc/printcap time: 69
146 |  *
147 |  */
148 | 
149 | global start
150 | global time_io
151 | 
152 | function timestamp:long() { return gettimeofday_us() - start }
153 | 
154 | function proc:string() { return sprintf("%d (%s)", pid(), execname()) }
155 | 
156 | probe begin { start = gettimeofday_us() }
157 | 
158 | global filehandles, fileread, filewrite
159 | 
160 | probe syscall.open.return {
161 |   filename = user_string($filename)
162 |   if ($return != -1) {
163 |     filehandles[pid(), $return] = filename
164 |   } else {
165 |     printf("%d %s access %s fail\n", timestamp(), proc(), filename)
166 |   }
167 | }
168 | 
169 | probe syscall.read.return {
170 |   p = pid()
171 |   fd = $fd
172 |   bytes = $return
173 |   time = gettimeofday_us() - @entry(gettimeofday_us())
174 |   if (bytes > 0)
175 |     fileread[p, fd] += bytes
176 |   time_io[p, fd] <<< time
177 | }
178 | 
179 | probe syscall.write.return {
180 |   p = pid()
181 |   fd = $fd
182 |   bytes = $return
183 |   time = gettimeofday_us() - @entry(gettimeofday_us())
184 |   if (bytes > 0)
185 |     filewrite[p, fd] += bytes
186 |   time_io[p, fd] <<< time
187 | }
188 | 
189 | probe syscall.close {
190 |   if ([pid(), $fd] in filehandles) {
191 |     printf("%d %s access %s read: %d write: %d\n",
192 |            timestamp(), proc(), filehandles[pid(), $fd],
193 |            fileread[pid(), $fd], filewrite[pid(), $fd])
194 |     if (@count(time_io[pid(), $fd]))
195 |       printf("%d %s iotime %s time: %d\n",  timestamp(), proc(),
196 |              filehandles[pid(), $fd], @sum(time_io[pid(), $fd]))
197 |    }
198 |   delete fileread[pid(), $fd]
199 |   delete filewrite[pid(), $fd]
200 |   delete filehandles[pid(), $fd]
201 |   delete time_io[pid(),$fd]
202 | }
203 | ```
204 | 
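205 | `iotime.stp` tracks every `open`, `close`, `read`, and `write` system call. For each file accessed, `iotime.stp` records the time spent in reads and writes and the amount of data read and written (in bytes).
206 | Although the local variable `$count` is available in the read and write probes (`syscall.read` and `syscall.write`), it holds the amount of data the system call was asked to transfer. To get the amount actually read or written, use `$return`.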
207 | 
208 | ```
209 | [...]
210 | 825946 3364 (NetworkManager) access /sys/class/net/eth0/carrier read: 8190 write: 0
211 | 825955 3364 (NetworkManager) iotime /sys/class/net/eth0/carrier time: 9
212 | [...]
213 | 117061 2460 (pcscd) access /dev/bus/usb/003/001 read: 43 write: 0
214 | 117065 2460 (pcscd) iotime /dev/bus/usb/003/001 time: 7
215 | [...]
216 | 3973737 2886 (sendmail) access /proc/loadavg read: 4096 write: 0
217 | 3973744 2886 (sendmail) iotime /proc/loadavg time: 11
218 | [...]
219 | ```
220 | 
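221 | The script prints the following data:
222 | * A timestamp, in microseconds
223 | * The PID and process name
224 | * An `access` or `iotime` flag
225 | * The file accessed
226 | 
227 | If a process read or wrote any data, `access` and `iotime` lines appear in pairs. The `access` line reports that the process accessed the file and ends with the amount of data read and written (in bytes). The `iotime` line shows the time spent reading or writing (in microseconds). An `access` line with no `iotime` line after it means the process did not read or write any data.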
228 | 
229 | 
230 | ## Tracking Cumulative I/O
231 | 
232 | This section shows how to track the cumulative amount of I/O.
233 | 
234 | traceio.stp
235 | ```
236 | #! /usr/bin/env stap
237 | # traceio.stp
238 | # Copyright (C) 2007 Red Hat, Inc., Eugene Teo <eteo@redhat.com>
239 | # Copyright (C) 2009 Kai Meyer <kai@unixlords.com>
240 | #   Fixed a bug that allows this to run longer
241 | #   And added the humanreadable function
242 | #
243 | # This program is free software; you can redistribute it and/or modify
244 | # it under the terms of the GNU General Public License version 2 as
245 | # published by the Free Software Foundation.
246 | #
247 | 
248 | global reads, writes, total_io
249 | 
250 | probe vfs.read.return {
251 |   if ($return > 0) {
252 |     reads[pid(),execname()] += $return
253 |     total_io[pid(),execname()] += $return
254 |   }
255 | }
256 | 
257 | probe vfs.write.return {
258 |   if ($return > 0) {
259 |     writes[pid(),execname()] += $return
260 |     total_io[pid(),execname()] += $return
261 |   }
262 | }
263 | 
264 | function humanreadable(bytes) {
265 |   if (bytes > 1024*1024*1024) {
266 |     return sprintf("%d GiB", bytes/1024/1024/1024)
267 |   } else if (bytes > 1024*1024) {
268 |     return sprintf("%d MiB", bytes/1024/1024)
269 |   } else if (bytes > 1024) {
270 |     return sprintf("%d KiB", bytes/1024)
271 |   } else {
272 |     return sprintf("%d   B", bytes)
273 |   }
274 | }
275 | 
276 | probe timer.s(1) {
277 |   foreach([p,e] in total_io- limit 10)
278 |     printf("%8d %15s r: %12s w: %12s\n",
279 |            p, e, humanreadable(reads[p,e]),
280 |            humanreadable(writes[p,e]))
281 |   printf("\n")
282 |   # Note we don't zero out reads, writes and total_io,
283 |   # so the values are cumulative since the script started.
284 | }
285 | ```
286 | 
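287 | `traceio.stp` prints, once per second, the ten processes with the most cumulative I/O, showing each process's running read and write totals. Like the disk read/write script at the start of this chapter, it uses the local variable `$return` to obtain the amount of data actually read or written.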
288 | 
289 | ```
290 | [...]
291 |            Xorg r:   583401 KiB w:        0 KiB
292 |        floaters r:       96 KiB w:     7130 KiB
293 | multiload-apple r:      538 KiB w:      537 KiB
294 |            sshd r:       71 KiB w:       72 KiB
295 | pam_timestamp_c r:      138 KiB w:        0 KiB
296 |         staprun r:       51 KiB w:       51 KiB
297 |           snmpd r:       46 KiB w:        0 KiB
298 |           pcscd r:       28 KiB w:        0 KiB
299 |      irqbalance r:       27 KiB w:        4 KiB
300 |           cupsd r:        4 KiB w:       18 KiB
301 | 
302 |            Xorg r:   588140 KiB w:        0 KiB
303 |        floaters r:       97 KiB w:     7143 KiB
304 | multiload-apple r:      543 KiB w:      542 KiB
305 |            sshd r:       72 KiB w:       72 KiB
306 | pam_timestamp_c r:      138 KiB w:        0 KiB
307 |         staprun r:       51 KiB w:       51 KiB
308 |           snmpd r:       46 KiB w:        0 KiB
309 |           pcscd r:       28 KiB w:        0 KiB
310 |      irqbalance r:       27 KiB w:        4 KiB
311 |           cupsd r:        4 KiB w:       18 KiB
312 | ```
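The unit selection in the script's `humanreadable()` function uses strict `>` comparisons and integer division, which has a few edge cases worth knowing when reading the numbers. The following is a line-for-line Python port, offered as an illustration only and not as part of the original script.

```python
def humanreadable(nbytes):
    """Python port of the humanreadable() function in traceio.stp.

    Comparisons are strict, so exactly 1024 bytes is still printed in B,
    and every value is truncated to a whole number of units.
    """
    if nbytes > 1024 * 1024 * 1024:
        return "%d GiB" % (nbytes // 1024 // 1024 // 1024)
    elif nbytes > 1024 * 1024:
        return "%d MiB" % (nbytes // 1024 // 1024)
    elif nbytes > 1024:
        return "%d KiB" % (nbytes // 1024)
    else:
        return "%d   B" % nbytes


for value in (1024, 1025, 5 * 1024 * 1024):
    print(value, "->", humanreadable(value))
```

Because values are truncated, two consecutive samples can show the same figure even though the underlying byte counter has grown.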
313 | 
314 | ## Monitoring I/O on a Specific Device
315 | 
316 | This section shows how to monitor I/O activity on a specific device.
317 | 
318 | traceio2.stp
319 | ```
320 | #! /usr/bin/env stap
321 | 
322 | global device_of_interest
323 | 
324 | probe begin {
325 |   /* The following is not the most efficient way to do this.
326 |       One could directly put the result of usrdev2kerndev()
327 |       into device_of_interest.  However, want to test out
328 |       the other device functions */
329 |   dev = usrdev2kerndev($1)
330 |   device_of_interest = MKDEV(MAJOR(dev), MINOR(dev))
331 | }
332 | 
333 | probe vfs.write, vfs.read
334 | {
335 |   if (dev == device_of_interest)
336 |     printf ("%s(%d) %s 0x%x\n",
337 |             execname(), pid(), ppfunc(), dev)
338 | }
339 | ```
340 | 
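341 | `traceio2.stp` takes one argument: the device number. To get the device number of the device that holds a directory named `directory`, run `stat -c "0x%D" directory`.
342 | `usrdev2kerndev()` converts the device number into the format the kernel understands. Its result is passed through `MAJOR()` and `MINOR()` to extract the major and minor numbers, which `MKDEV()` then recombines into the kernel's device number.
343 | The output of `traceio2.stp` includes the name and PID of each process doing the read or write, the function called (`vfs_read` or `vfs_write`), and the kernel device number.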
344 | 
345 | Below is the output of `stap traceio2.stp 0x805`, where `0x805` is the device number of `/home`. `/home` resides on `/dev/sda5`, which is the device we want to monitor.
346 | ```
347 | [...]
348 | synergyc(3722) vfs_read 0x800005
349 | synergyc(3722) vfs_read 0x800005
350 | cupsd(2889) vfs_write 0x800005
351 | cupsd(2889) vfs_write 0x800005
352 | cupsd(2889) vfs_write 0x800005
353 | [...]
354 | ```
355 | 
356 | ## Monitoring Reads and Writes to a File
357 | 
358 | This section shows how to monitor reads from and writes to a specific file in real time.
359 | 
360 | inodewatch.stp
361 | ```
362 | #! /usr/bin/env stap
363 | 
364 | probe vfs.write, vfs.read
365 | {
366 |   # dev and ino are defined by vfs.write and vfs.read
367 |   if (dev == MKDEV($1,$2) # major/minor device
368 |       && ino == $3)
369 |     printf ("%s(%d) %s 0x%x/%u\n",
370 |       execname(), pid(), ppfunc(), dev, ino)
371 | }
372 | ```
373 | 
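374 | `inodewatch.stp` takes the following command-line arguments, in order:
375 | 1. The file's major device number
376 | 2. The file's minor device number
377 | 3. The file's inode number
378 | 
379 | To get this information, run `stat -c '%D %i' filename`, where `filename` is an absolute path.
380 | For example, to monitor `/etc/crontab`, first run `stat -c '%D %i' /etc/crontab`. The output should look like this: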
381 | ```
382 | 805 1078319
383 | ```
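384 | `805` is the device number in hexadecimal: the lowest two digits are the minor number and the remaining digits are the major number. `1078319` is the inode number. To monitor `/etc/crontab`, run `stap inodewatch.stp 0x8 0x05 1078319` (the `0x` prefix marks the values as hexadecimal).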
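Splitting the `stat` output by hand is easy to get wrong, so the step can be scripted. The following Python sketch applies the rule stated above (lowest two hexadecimal digits are the minor number, the rest is the major number) to build the `stap` command line. The function name `inodewatch_args` is hypothetical, and the sketch only assembles the arguments; it does not run `stat` or `stap`.

```python
def inodewatch_args(stat_output):
    """Turn the output of `stat -c '%D %i' FILE` into stap arguments.

    Follows the rule from the text: the lowest two hex digits of the
    device number are the minor number, the remaining digits the major.
    """
    dev_hex, inode = stat_output.split()
    dev = int(dev_hex, 16)
    major = dev >> 8     # everything above the lowest two hex digits
    minor = dev & 0xFF   # lowest two hex digits
    return ["stap", "inodewatch.stp", hex(major), hex(minor), inode]


print(" ".join(inodewatch_args("805 1078319")))
```

The sketch prints the minor number as `0x5` rather than `0x05`; both spell the same value.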
385 | 
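386 | The output of this command contains the name and PID of each process that reads or writes the file, the function called (`vfs_read` or `vfs_write`), the device number (in hexadecimal), and the inode number. Below is the output of `stap inodewatch.stp 0x8 0x05 1078319` when `cat /etc/crontab` is run while the script is running: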
387 | ```
388 | cat(16437) vfs_read 0x800005/1078319
389 | cat(16437) vfs_read 0x800005/1078319
390 | ```
391 | 
392 | ## Monitoring Changes to File Attributes
393 | 
394 | This section shows how to monitor, in real time, changes to the attributes of a specific file.
395 | 
396 | inodewatch2.stp
397 | ```
398 | #! /usr/bin/env stap
399 | 
400 | global ATTR_MODE = 1
401 | 
402 | probe kernel.function("setattr_copy")!,
403 |       kernel.function("generic_setattr")!,
404 |       kernel.function("inode_setattr") {
405 |   dev_nr = $inode->i_sb->s_dev
406 |   inode_nr = $inode->i_ino
407 | 
408 |   if (dev_nr == MKDEV($1,$2) # major/minor device
409 |       && inode_nr == $3
410 |       && $attr->ia_valid & ATTR_MODE)
411 |     printf ("%s(%d) %s 0x%x/%u %o %d\n",
412 |       execname(), pid(), ppfunc(), dev_nr, inode_nr, $attr->ia_mode, uid())
413 | }
414 | ```
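415 | Like the script in the previous section, `inodewatch2.stp` takes the target file's device number (major and minor) and inode number as arguments. You can obtain them the same way as in the previous section.
416 | The output of `inodewatch2.stp` is also similar to that of the previous script, except that it additionally shows the new file mode (in octal) and the UID of the user who made the change. Below is the output while monitoring `/home/joe/bigfile` when user `joe` runs `chmod 777 /home/joe/bigfile` and `chmod 666 /home/joe/bigfile`: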
417 | 
418 | ```
419 | chmod(17448) inode_setattr 0x800005/6011835 100777 500
420 | chmod(17449) inode_setattr 0x800005/6011835 100666 500
421 | ```
422 | 
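423 | ## Periodically Printing Block I/O Wait Times
424 | 
425 | This section shows how to track how long each block I/O request waits to complete. This helps you determine whether too many block I/O operations are outstanding at a given time.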
426 | 
427 | ioblktime.stp
428 | ```
429 | #! /usr/bin/env stap
430 | 
431 | global req_time%[25000], etimes
432 | 
433 | probe ioblock.request
434 | {
435 |   req_time[$bio] = gettimeofday_us()
436 | }
437 | 
438 | probe ioblock.end
439 | {
440 |   t = gettimeofday_us()
441 |   s =  req_time[$bio]
442 |   delete req_time[$bio]
443 |   if (s) {
444 |     etimes[devname, bio_rw_str(rw)] <<< t - s
445 |   }
446 | }
447 | 
448 | /* for time being delete things that get merged with others */
449 | probe kernel.trace("block_bio_frontmerge"),
450 |       kernel.trace("block_bio_backmerge")
451 | {
452 |   delete req_time[$bio]
453 | }
454 | 
455 | probe timer.s(10), end {
456 |   ansi_clear_screen()
457 |   printf("%10s %3s %10s %10s %10s\n",
458 |          "device", "rw", "total (us)", "count", "avg (us)")
459 |   foreach ([dev,rw] in etimes - limit 20) {
460 |     printf("%10s %3s %10d %10d %10d\n", dev, rw,
461 |            @sum(etimes[dev,rw]), @count(etimes[dev,rw]), @avg(etimes[dev,rw]))
462 |   }
463 |   delete etimes
464 | }
465 | ```
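466 | `ioblktime.stp` computes the average block I/O wait time per device and refreshes the display every 10 seconds. To change the refresh interval, edit the timer in `probe timer.s(10), end {`.
467 | Sometimes there are so many outstanding block I/O operations on a device that the script exceeds the default `MAXMAPENTRIES` limit. When an array is declared without an explicit size, SystemTap uses `MAXMAPENTRIES` as its maximum number of entries. The default is 2048, but you can override it with the `stap` option `-DMAXMAPENTRIES=10000`.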
468 | 
469 | ```
470 |     device  rw total (us)      count   avg (us)
471 |        sda   W       9659          6       1609
472 |       dm-0   W      20278          6       3379
473 |       dm-0   R      20524          5       4104
474 |        sda   R      19277          5       3855
475 | ```
476 | 
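477 | The output above shows the device name, the operation type (`rw`), the total wait time (`total (us)`), the number of operations (`count`), and the average wait time (`avg (us)`). All times are in microseconds.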
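The `avg (us)` column can be checked against the other two columns: it is the total divided by the count, truncated to an integer. The following Python sketch recomputes it from the sample rows above. The variable and function names are made up for illustration.

```python
# (device, rw, total_us, count) rows copied from the sample output above.
rows = [
    ("sda", "W", 9659, 6),
    ("dm-0", "W", 20278, 6),
    ("dm-0", "R", 20524, 5),
    ("sda", "R", 19277, 5),
]


def avg_us(total_us, count):
    """Integer average, matching what the sample's avg (us) column shows."""
    return total_us // count


for dev, rw, total_us, count in rows:
    print("%10s %3s %10d %10d %10d" % (dev, rw, total_us, count, avg_us(total_us, count)))
```

All four recomputed averages (1609, 3379, 4104, 3855) agree with the sample, which confirms that the fractional part is dropped rather than rounded.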
478 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                     GNU GENERAL PUBLIC LICENSE
  2 |                        Version 2, June 1991
  3 | 
  4 |  Copyright (C) 1989, 1991 Free Software Foundation, Inc., <http://fsf.org/>
  5 |  51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  6 |  Everyone is permitted to copy and distribute verbatim copies
  7 |  of this license document, but changing it is not allowed.
  8 | 
  9 |                             Preamble
 10 | 
 11 |   The licenses for most software are designed to take away your
 12 | freedom to share and change it.  By contrast, the GNU General Public
 13 | License is intended to guarantee your freedom to share and change free
 14 | software--to make sure the software is free for all its users.  This
 15 | General Public License applies to most of the Free Software
 16 | Foundation's software and to any other program whose authors commit to
 17 | using it.  (Some other Free Software Foundation software is covered by
 18 | the GNU Lesser General Public License instead.)  You can apply it to
 19 | your programs, too.
 20 | 
 21 |   When we speak of free software, we are referring to freedom, not
 22 | price.  Our General Public Licenses are designed to make sure that you
 23 | have the freedom to distribute copies of free software (and charge for
 24 | this service if you wish), that you receive source code or can get it
 25 | if you want it, that you can change the software or use pieces of it
 26 | in new free programs; and that you know you can do these things.
 27 | 
 28 |   To protect your rights, we need to make restrictions that forbid
 29 | anyone to deny you these rights or to ask you to surrender the rights.
 30 | These restrictions translate to certain responsibilities for you if you
 31 | distribute copies of the software, or if you modify it.
 32 | 
 33 |   For example, if you distribute copies of such a program, whether
 34 | gratis or for a fee, you must give the recipients all the rights that
 35 | you have.  You must make sure that they, too, receive or can get the
 36 | source code.  And you must show them these terms so they know their
 37 | rights.
 38 | 
 39 |   We protect your rights with two steps: (1) copyright the software, and
 40 | (2) offer you this license which gives you legal permission to copy,
 41 | distribute and/or modify the software.
 42 | 
 43 |   Also, for each author's protection and ours, we want to make certain
 44 | that everyone understands that there is no warranty for this free
 45 | software.  If the software is modified by someone else and passed on, we
 46 | want its recipients to know that what they have is not the original, so
 47 | that any problems introduced by others will not reflect on the original
 48 | authors' reputations.
 49 | 
 50 |   Finally, any free program is threatened constantly by software
 51 | patents.  We wish to avoid the danger that redistributors of a free
 52 | program will individually obtain patent licenses, in effect making the
 53 | program proprietary.  To prevent this, we have made it clear that any
 54 | patent must be licensed for everyone's free use or not licensed at all.
 55 | 
 56 |   The precise terms and conditions for copying, distribution and
 57 | modification follow.
 58 | 
 59 |                     GNU GENERAL PUBLIC LICENSE
 60 |    TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
 61 | 
 62 |   0. This License applies to any program or other work which contains
 63 | a notice placed by the copyright holder saying it may be distributed
 64 | under the terms of this General Public License.  The "Program", below,
 65 | refers to any such program or work, and a "work based on the Program"
 66 | means either the Program or any derivative work under copyright law:
 67 | that is to say, a work containing the Program or a portion of it,
 68 | either verbatim or with modifications and/or translated into another
 69 | language.  (Hereinafter, translation is included without limitation in
 70 | the term "modification".)  Each licensee is addressed as "you".
 71 | 
 72 | Activities other than copying, distribution and modification are not
 73 | covered by this License; they are outside its scope.  The act of
 74 | running the Program is not restricted, and the output from the Program
 75 | is covered only if its contents constitute a work based on the
 76 | Program (independent of having been made by running the Program).
 77 | Whether that is true depends on what the Program does.
 78 | 
 79 |   1. You may copy and distribute verbatim copies of the Program's
 80 | source code as you receive it, in any medium, provided that you
 81 | conspicuously and appropriately publish on each copy an appropriate
 82 | copyright notice and disclaimer of warranty; keep intact all the
 83 | notices that refer to this License and to the absence of any warranty;
 84 | and give any other recipients of the Program a copy of this License
 85 | along with the Program.
 86 | 
 87 | You may charge a fee for the physical act of transferring a copy, and
 88 | you may at your option offer warranty protection in exchange for a fee.
 89 | 
 90 |   2. You may modify your copy or copies of the Program or any portion
 91 | of it, thus forming a work based on the Program, and copy and
 92 | distribute such modifications or work under the terms of Section 1
 93 | above, provided that you also meet all of these conditions:
 94 | 
 95 |     a) You must cause the modified files to carry prominent notices
 96 |     stating that you changed the files and the date of any change.
 97 | 
 98 |     b) You must cause any work that you distribute or publish, that in
 99 |     whole or in part contains or is derived from the Program or any
100 |     part thereof, to be licensed as a whole at no charge to all third
101 |     parties under the terms of this License.
102 | 
103 |     c) If the modified program normally reads commands interactively
104 |     when run, you must cause it, when started running for such
105 |     interactive use in the most ordinary way, to print or display an
106 |     announcement including an appropriate copyright notice and a
107 |     notice that there is no warranty (or else, saying that you provide
108 |     a warranty) and that users may redistribute the program under
109 |     these conditions, and telling the user how to view a copy of this
110 |     License.  (Exception: if the Program itself is interactive but
111 |     does not normally print such an announcement, your work based on
112 |     the Program is not required to print an announcement.)
113 | 
114 | These requirements apply to the modified work as a whole.  If
115 | identifiable sections of that work are not derived from the Program,
116 | and can be reasonably considered independent and separate works in
117 | themselves, then this License, and its terms, do not apply to those
118 | sections when you distribute them as separate works.  But when you
119 | distribute the same sections as part of a whole which is a work based
120 | on the Program, the distribution of the whole must be on the terms of
121 | this License, whose permissions for other licensees extend to the
122 | entire whole, and thus to each and every part regardless of who wrote it.
123 | 
124 | Thus, it is not the intent of this section to claim rights or contest
125 | your rights to work written entirely by you; rather, the intent is to
126 | exercise the right to control the distribution of derivative or
127 | collective works based on the Program.
128 | 
129 | In addition, mere aggregation of another work not based on the Program
130 | with the Program (or with a work based on the Program) on a volume of
131 | a storage or distribution medium does not bring the other work under
132 | the scope of this License.
133 | 
134 |   3. You may copy and distribute the Program (or a work based on it,
135 | under Section 2) in object code or executable form under the terms of
136 | Sections 1 and 2 above provided that you also do one of the following:
137 | 
138 |     a) Accompany it with the complete corresponding machine-readable
139 |     source code, which must be distributed under the terms of Sections
140 |     1 and 2 above on a medium customarily used for software interchange; or,
141 | 
142 |     b) Accompany it with a written offer, valid for at least three
143 |     years, to give any third party, for a charge no more than your
144 |     cost of physically performing source distribution, a complete
145 |     machine-readable copy of the corresponding source code, to be
146 |     distributed under the terms of Sections 1 and 2 above on a medium
147 |     customarily used for software interchange; or,
148 | 
149 |     c) Accompany it with the information you received as to the offer
150 |     to distribute corresponding source code.  (This alternative is
151 |     allowed only for noncommercial distribution and only if you
152 |     received the program in object code or executable form with such
153 |     an offer, in accord with Subsection b above.)
154 | 
155 | The source code for a work means the preferred form of the work for
156 | making modifications to it.  For an executable work, complete source
157 | code means all the source code for all modules it contains, plus any
158 | associated interface definition files, plus the scripts used to
159 | control compilation and installation of the executable.  However, as a
160 | special exception, the source code distributed need not include
161 | anything that is normally distributed (in either source or binary
162 | form) with the major components (compiler, kernel, and so on) of the
163 | operating system on which the executable runs, unless that component
164 | itself accompanies the executable.
165 | 
166 | If distribution of executable or object code is made by offering
167 | access to copy from a designated place, then offering equivalent
168 | access to copy the source code from the same place counts as
169 | distribution of the source code, even though third parties are not
170 | compelled to copy the source along with the object code.
171 | 
172 |   4. You may not copy, modify, sublicense, or distribute the Program
173 | except as expressly provided under this License.  Any attempt
174 | otherwise to copy, modify, sublicense or distribute the Program is
175 | void, and will automatically terminate your rights under this License.
176 | However, parties who have received copies, or rights, from you under
177 | this License will not have their licenses terminated so long as such
178 | parties remain in full compliance.
179 | 
180 |   5. You are not required to accept this License, since you have not
181 | signed it.  However, nothing else grants you permission to modify or
182 | distribute the Program or its derivative works.  These actions are
183 | prohibited by law if you do not accept this License.  Therefore, by
184 | modifying or distributing the Program (or any work based on the
185 | Program), you indicate your acceptance of this License to do so, and
186 | all its terms and conditions for copying, distributing or modifying
187 | the Program or works based on it.
188 | 
189 |   6. Each time you redistribute the Program (or any work based on the
190 | Program), the recipient automatically receives a license from the
191 | original licensor to copy, distribute or modify the Program subject to
192 | these terms and conditions.  You may not impose any further
193 | restrictions on the recipients' exercise of the rights granted herein.
194 | You are not responsible for enforcing compliance by third parties to
195 | this License.
196 | 
197 |   7. If, as a consequence of a court judgment or allegation of patent
198 | infringement or for any other reason (not limited to patent issues),
199 | conditions are imposed on you (whether by court order, agreement or
200 | otherwise) that contradict the conditions of this License, they do not
201 | excuse you from the conditions of this License.  If you cannot
202 | distribute so as to satisfy simultaneously your obligations under this
203 | License and any other pertinent obligations, then as a consequence you
204 | may not distribute the Program at all.  For example, if a patent
205 | license would not permit royalty-free redistribution of the Program by
206 | all those who receive copies directly or indirectly through you, then
207 | the only way you could satisfy both it and this License would be to
208 | refrain entirely from distribution of the Program.
209 | 
210 | If any portion of this section is held invalid or unenforceable under
211 | any particular circumstance, the balance of the section is intended to
212 | apply and the section as a whole is intended to apply in other
213 | circumstances.
214 | 
215 | It is not the purpose of this section to induce you to infringe any
216 | patents or other property right claims or to contest validity of any
217 | such claims; this section has the sole purpose of protecting the
218 | integrity of the free software distribution system, which is
219 | implemented by public license practices.  Many people have made
220 | generous contributions to the wide range of software distributed
221 | through that system in reliance on consistent application of that
222 | system; it is up to the author/donor to decide if he or she is willing
223 | to distribute software through any other system and a licensee cannot
224 | impose that choice.
225 | 
226 | This section is intended to make thoroughly clear what is believed to
227 | be a consequence of the rest of this License.
228 | 
229 |   8. If the distribution and/or use of the Program is restricted in
230 | certain countries either by patents or by copyrighted interfaces, the
231 | original copyright holder who places the Program under this License
232 | may add an explicit geographical distribution limitation excluding
233 | those countries, so that distribution is permitted only in or among
234 | countries not thus excluded.  In such case, this License incorporates
235 | the limitation as if written in the body of this License.
236 | 
237 |   9. The Free Software Foundation may publish revised and/or new versions
238 | of the General Public License from time to time.  Such new versions will
239 | be similar in spirit to the present version, but may differ in detail to
240 | address new problems or concerns.
241 | 
242 | Each version is given a distinguishing version number.  If the Program
243 | specifies a version number of this License which applies to it and "any
244 | later version", you have the option of following the terms and conditions
245 | either of that version or of any later version published by the Free
246 | Software Foundation.  If the Program does not specify a version number of
247 | this License, you may choose any version ever published by the Free Software
248 | Foundation.
249 | 
250 |   10. If you wish to incorporate parts of the Program into other free
251 | programs whose distribution conditions are different, write to the author
252 | to ask for permission.  For software which is copyrighted by the Free
253 | Software Foundation, write to the Free Software Foundation; we sometimes
254 | make exceptions for this.  Our decision will be guided by the two goals
255 | of preserving the free status of all derivatives of our free software and
256 | of promoting the sharing and reuse of software generally.
257 | 
258 |                             NO WARRANTY
259 | 
260 |   11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
268 | REPAIR OR CORRECTION.
269 | 
270 |   12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
278 | POSSIBILITY OF SUCH DAMAGES.
279 | 
280 |                      END OF TERMS AND CONDITIONS
281 | 
282 |             How to Apply These Terms to Your New Programs
283 | 
284 |   If you develop a new program, and you want it to be of the greatest
285 | possible use to the public, the best way to achieve this is to make it
286 | free software which everyone can redistribute and change under these terms.
287 | 
288 |   To do so, attach the following notices to the program.  It is safest
289 | to attach them to the start of each source file to most effectively
290 | convey the exclusion of warranty; and each file should have at least
291 | the "copyright" line and a pointer to where the full notice is found.
292 | 
293 |     {description}
294 |     Copyright (C) {year}  {fullname}
295 | 
296 |     This program is free software; you can redistribute it and/or modify
297 |     it under the terms of the GNU General Public License as published by
298 |     the Free Software Foundation; either version 2 of the License, or
299 |     (at your option) any later version.
300 | 
301 |     This program is distributed in the hope that it will be useful,
302 |     but WITHOUT ANY WARRANTY; without even the implied warranty of
303 |     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
304 |     GNU General Public License for more details.
305 | 
306 |     You should have received a copy of the GNU General Public License along
307 |     with this program; if not, write to the Free Software Foundation, Inc.,
308 |     51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
309 | 
310 | Also add information on how to contact you by electronic and paper mail.
311 | 
312 | If the program is interactive, make it output a short notice like this
313 | when it starts in an interactive mode:
314 | 
315 |     Gnomovision version 69, Copyright (C) year name of author
316 |     Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
317 |     This is free software, and you are welcome to redistribute it
318 |     under certain conditions; type `show c' for details.
319 | 
320 | The hypothetical commands `show w' and `show c' should show the appropriate
321 | parts of the General Public License.  Of course, the commands you use may
322 | be called something other than `show w' and `show c'; they could even be
323 | mouse-clicks or menu items--whatever suits your program.
324 | 
325 | You should also get your employer (if you work as a programmer) or your
326 | school, if any, to sign a "copyright disclaimer" for the program, if
327 | necessary.  Here is a sample; alter the names:
328 | 
329 |   Yoyodyne, Inc., hereby disclaims all copyright interest in the program
330 |   `Gnomovision' (which makes passes at compilers) written by James Hacker.
331 | 
332 |   {signature of Ty Coon}, 1 April 1989
333 |   Ty Coon, President of Vice
334 | 
335 | This General Public License does not permit incorporating your program into
336 | proprietary programs.  If your program is a subroutine library, you may
337 | consider it more useful to permit linking proprietary applications with the
338 | library.  If this is what you want to do, use the GNU Lesser General
339 | Public License instead of this License.
340 | 
341 | 


--------------------------------------------------------------------------------