├── README.md
├── docs
│   ├── cpu-ctxt.md
│   ├── cpu-load.md
│   ├── cpu-percentage.md
│   ├── images
│   │   └── linux_io.png
│   ├── io-usage.md
│   ├── memory-usage.md
│   ├── net-util.md
│   └── references.md
├── packer
│   ├── README.md
│   ├── bootstrap.sh
│   └── packer.json
└── scripts
    ├── cpu
    │   ├── dummy1.sh
    │   ├── dummy2.sh
    │   ├── dummy3.sh
    │   ├── dummy4.sh
    │   └── dummy_app.py
    ├── disk
    │   └── writer.sh
    └── memory
        ├── buffer.sh
        ├── cache.sh
        ├── dentry.py
        ├── dentry2.py
        └── hog.sh

/README.md:
--------------------------------------------------------------------------------
# Linux Metrics Workshop
While you can learn a lot by emitting metrics from your application, some insights can only be gained by looking at OS metrics. In this hands-on workshop, we will cover the basics of Linux metric collection for monitoring, performance tuning and capacity planning.

## Topics
1. CPU
   1. [CPU Percentage](docs/cpu-percentage.md)
   2. [CPU Load](docs/cpu-load.md)
   3. [Context Switches](docs/cpu-ctxt.md)
2. Memory
   1. [Memory Usage](docs/memory-usage.md)
3. IO
   1. [IO Usage](docs/io-usage.md)
4. Network
   1. [Network Utilization](docs/net-util.md)
5. [References](docs/references.md)

## Setup
The workshop was designed to run on AWS EC2 t2.small instances with general purpose SSD, Ubuntu 24.04 amd64, and transparent huge pages disabled.
You can build an AMI with all the dependencies installed using the attached [packer](https://www.packer.io/) template.

A pre-built AMI `ami-01cfe208cd769212d` is available in `eu-central-1`.

If you run on your own instance, make sure you have only 1 CPU (easier to read the metrics) and that you disable transparent huge pages (`echo never > /sys/kernel/mm/transparent_hugepage/enabled`).
--------------------------------------------------------------------------------
/docs/cpu-ctxt.md:
--------------------------------------------------------------------------------
# CPU Metrics

## Context Switches

### Recall: Linux Process Context Switches
A mechanism to store the current process *state*, i.e. registers, memory maps, and kernel structs (e.g. the TSS on 32-bit x86), and load another process (or a new one). Context switches are usually computationally expensive (although optimizations exist), yet inevitable. For example, they are used to allow multi-tasking (e.g. preemption), and to switch between user and kernel modes.
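The kernel exposes these events in `/proc`; for a quick look at the per-process counters before we begin (shown here for your current shell):

```bash
# Cumulative counters since the process started:
# "voluntary"    = the task gave up the CPU (e.g. a blocking syscall)
# "nonvoluntary" = the scheduler preempted it
grep ctxt_switches /proc/$$/status
```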
### Task CS1: Context Switches

1. Execute `vmstat 2` and write down the current context switch rate (`cs` field)
2. Raise that number by executing `stress -i 10`
   1. What is the current context switch rate?
   2. What is causing this rate? Multi-tasking? Interrupts? Switches between kernel and user modes?
   3. Kill the `stress` command, and watch the rate drop
3. Now let's see how a high context switch rate affects a dummy application
   1. Run the dummy application `perf stat -e cs python3 scripts/cpu/dummy_app.py` (which calls a dummy function 1000 times, and prints its runtime percentiles)
   2. Write down the current CPU usage, the application percentiles and the context switch rate
   3. **In the same session**, raise the context switch rate using `stress -i 10 &` and re-run the dummy application. Write down the current CPU usage, the application percentiles and the context switch rate.
   4. Describe the change in the percentiles. Did the high context switch rate affect most of the `foo()` runs (i.e. the 50th percentile)? If not, why?
4. Finally, observe the behaviour when running `stress` in a different scheduling task group
   1. Move one of your sessions to a different cgroup: `sudo mkdir /sys/fs/cgroup/a; echo $$ | sudo tee /sys/fs/cgroup/a/cgroup.procs` (on older, cgroup-v1 systems the equivalent is `sudo mkdir /sys/fs/cgroup/cpu/a; echo $$ | sudo tee /sys/fs/cgroup/cpu/a/tasks`)
   2. Run stress again: `stress -i 10` or `stress -c 10`
   3. Compare the CPU usage to **3.iii** (it should be roughly the same) and compare the context switch rate (which should also be roughly the same)
   4. Re-run the dummy application and describe the change in the percentiles (and the process context switches) vs **3.iv**

### Discussion

- Can performance measurements in a staging environment truly estimate performance in production?
- Why did we run the `stress` command and our dummy application in the same session?
- Read about the [magic kernel patch](http://www.phoronix.com/scan.php?page=article&item=linux_2637_video&num=1)

### Tools

- Most tools use `/proc/stat` to fetch global context switching information (the `ctxt` field), and `/proc/[PID]/status` for process-specific information (the `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` fields)
- From the command-line you can use
  - `vmstat <interval>` for global information
  - `pidstat -w -p <PID>` for process-specific information

#### Next: [Memory Usage](memory-usage.md)
--------------------------------------------------------------------------------
/docs/cpu-load.md:
--------------------------------------------------------------------------------
# CPU Metrics

## CPU Load

### Recall: Linux Process State

From `man 1 ps`:
```
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped, either by a job control signal or because it is being traced
Z    defunct ("zombie") process, terminated but not reaped by its parent
```

### Task CL1: CPU Load

1. What is the Load Average metric? Use the Linux Process States and `man 5 proc` (search for loadavg)
2. Start the disk stress script (NOTE: avoid running it on your own SSD):

   ```bash
   scripts/disk/writer.sh
   ```

3. Run the following command and look at the Load values for about a minute until `ldavg-1` stabilizes:

   ```bash
   sar -q 1 100
   ```

   1. What is the writing speed of our script (ignore the first value, this is the [EBS General Purpose IOPS Burst](http://aws.amazon.com/ebs/details/#GP))?
   2. What is the current Load Average? Why? Which processes contribute to this number?
   3. What are the CPU %user, %IO-wait and %idle?
4. While the previous script is running, start a single CPU stress:

   ```bash
   stress -c 1 -t 3600
   ```

   Wait another minute, and answer the questions above again.
5. Stop all the scripts

### Discussion

- Why are processes waiting for IO included in the Load Average?
- Assuming we have 1 CPU core and a Load of 5, is our CPU core at 100% utilization?
- How can we know if load is going up or down?
- Does a load average of 70 indicate a problem?

### Tools

- Most tools use `/proc/loadavg` to fetch Load Average and run queue information.
- Have a look at `/proc/pressure/cpu` for the newer [PSI metrics](https://docs.kernel.org/accounting/psi.html) (both files are sampled below)
- To get readings over a specific interval of time, you can use:
  - `sar -q <interval>`
    - `-q` queue length and load averages
  - or simply `uptime`
- Use eBPF-based tools to observe the run queue: `runqlat` and `runqlen`
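Both files are plain text, so you can sample them directly; for example:

```bash
# 1/5/15-minute load averages, runnable/total scheduling entities, last PID
cat /proc/loadavg
# PSI: share of wall-clock time in which tasks stalled waiting for a CPU
cat /proc/pressure/cpu
```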
#### Next: [Context Switches](cpu-ctxt.md)
--------------------------------------------------------------------------------
/docs/cpu-percentage.md:
--------------------------------------------------------------------------------
# CPU Metrics

## CPU Percentage
Let's start with the most common CPU metric.
Fire up `top`, and let's start figuring out what the different CPU percentage values are.
```
%Cpu(s):  2.3 us,  0.6 sy,  0.0 ni, 96.7 id,  0.2 wa
```

### Task CP1: CPU Percentage
For each of the following scripts (`dummy1.sh`, `dummy2.sh`, `dummy3.sh`, `dummy4.sh`) under the `scripts/cpu/` directory:

1. Run the script
2. While the script is running, look at `top` in another terminal window
3. Without looking at the code, try to figure out what the script is doing (find the percentage fields' description in `man 1 top`)
4. Stop the script (use `Ctrl+C` or wait 2 minutes for it to time out)
5. Verify your answer by reading the script content

### Tools

- Most tools use `/proc/stat` to fetch CPU percentages. Note that it reports cumulative time (in clock ticks), not percentages, so percentages are computed from two samples, as sketched below.
- To get a percentage over a specific interval of time, you can use:
  - `sar -P ALL -u <interval>`
    - `-P` per-processor statistics
    - `-u` CPU utilization
  - or `mpstat` (similar usage and output)
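Here is a minimal sketch of what these tools do under the hood (simplified: iowait is folded into idle, and the irq/softirq/steal fields are ignored):

```bash
# First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
# Counters are in USER_HZ clock ticks (typically 100 per second) since boot.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 2
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
idle=$(( (i2 + w2) - (i1 + w1) ))
echo "CPU busy: $(( 100 * busy / (busy + idle) ))%"
```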
### Discussion

- What's the difference between %IO-wait and %idle?
- Is the entire CPU load created by a process accounted to that process?

#### Time Stolen and Amazon EC2

You may have noticed the `st` label. From `man 1 top`:
```
st : time stolen from this vm by the hypervisor
```
Amazon EC2 uses the hypervisor to regulate the machine's CPU usage (to match the instance type's EC2 Compute Units). If you see inconsistent stolen percentages over time, then you might be using [Burstable Performance Instances](http://aws.amazon.com/ec2/instance-types/#burst).

#### Next: [CPU Load](cpu-load.md)
--------------------------------------------------------------------------------
/docs/images/linux_io.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/natict/linux-metrics/1be276d0bf7f6cd68304613aa0d959d78ce45193/docs/images/linux_io.png
--------------------------------------------------------------------------------
/docs/io-usage.md:
--------------------------------------------------------------------------------
# IO Metrics

## IO Usage

### Recall: Linux IO, Merges, IOPS

Linux IO performance is affected by many factors, including your application workload, choice of file-system, IO scheduler (e.g. [cfq](https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt) and [deadline](https://www.kernel.org/doc/Documentation/block/deadline-iosched.txt), replaced on recent multi-queue kernels by `mq-deadline`, `bfq`, `kyber` and `none`), queue configuration, device driver, the underlying device(s), caches and more.

![Linux IO](images/linux_io.png)

#### Merged Reads/Writes

From [the Kernel Documentation](https://www.kernel.org/doc/Documentation/iostats.txt): *"Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O"*

#### IOPS

IOPS are input/output operations per second. Some operations take longer than others, e.g. HDDs can do sequential read operations much faster than random write operations. Here are some rough estimations [from Wikipedia](https://en.wikipedia.org/wiki/IOPS) and [Amazon EBS Product Details](http://aws.amazon.com/ebs/details/):

| Device/Type           | IOPS      |
|-----------------------|-----------|
| 7.2k-10k RPM SATA HDD | 75-150    |
| 10k-15k RPM SAS HDD   | 140-210   |
| SATA SSD              | 1k-120k   |
| AWS EC2 gp2           | up to 10k |
| AWS EC2 io1           | up to 20k |

### Task I1: IO Usage

1. Start by running `iostat -xd 2`, and examine the output fields. Let's go over the important ones together (note that newer `sysstat` releases rename some of them, e.g. `avgqu-sz` becomes `aqu-sz`):
   - **rrqm/s** & **wrqm/s** - Number of read/write requests merged per second
   - **r/s** & **w/s** - Read/write requests (after merges) per second. Their sum is the IOPS!
   - **rkB/s** & **wkB/s** - Number of kB read/written per second, i.e. **IO throughput**
   - **avgqu-sz** - Average request queue size for this device. Check out `/sys/block/<device>/queue/nr_requests` for the maximum queue size.
   - **r_await**, **w_await**, **await** - The average time (in ms) for read/write/both requests to be served, including time spent in the queue, i.e. **IO latency**
2. Write down these fields' values while our system is at rest
3. In a new session, let's benchmark our device's *write performance* by running:

   ```bash
   fio --directory=/tmp --name fio_test_file --direct=1 --rw=randwrite --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
   ```

   This will clone 16 processes to perform non-buffered (direct) random writes for 3 minutes.
   1. Compare the values you see in `iostat` to the values you wrote down earlier. Do they make sense?
   2. Look at the `fio` results and try to see if the number of IOPS makes sense (we are using EBS gp2 volumes).
4. Repeat the previous task, this time benchmarking **read performance**:

   ```bash
   fio --directory=/tmp --name fio_test_file --direct=1 --rw=randread --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
   ```

5. Finally, repeat the **read performance** benchmark with 1 process:

   ```bash
   fio --directory=/tmp --name fio_test_file --direct=1 --rw=randread --bs=16k --size=100M --numjobs=1 --time_based --runtime=180 --group_reporting --norandommap
   ```

   1. Read about the `svctm` field in `man 1 iostat` (it is deprecated, and removed entirely in newer `sysstat` releases). Compare the value we got now to the value we got for 16 processes. Is there a difference? If so, why?
   2. Repeat the previous question for the `%util` field.

6. `fio` also supports other IO patterns (by changing the `--rw=` parameter), including:
   - `read` - Sequential reads
   - `write` - Sequential writes
   - `rw` - Mixed sequential reads and writes
   - `randrw` - Mixed random reads and writes

   If time permits, explore these IO patterns to learn more about EBS gp2's performance under different workloads; one possible starting point is sketched below.
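For example, a mixed random workload (the 70/30 split via `--rwmixread` is an arbitrary choice here; all other parameters match the commands above):

```bash
# 70% random reads / 30% random writes, same file, size and runtime as before
fio --directory=/tmp --name fio_test_file --direct=1 --rw=randrw --rwmixread=70 \
    --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 \
    --group_reporting --norandommap
```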
### Discussion

- Why do we need an IO queue? What does it enable the kernel to do?
- Why are the `svctm` and `%util` iostat fields essentially useless in a modern environment? (read [Marc Brooker's excellent blog post](https://brooker.co.za/blog/2014/07/04/iostat-pct.html))
- What is the difference in how the kernel handles reads and writes? How does that affect metrics and application behaviour?

### Tools

- Most tools use `/proc/diskstats` to fetch global IO statistics.
- Per-process IO statistics are usually fetched from `/proc/[pid]/io`, which is documented in `man 5 proc` (see the example below)
- From the command-line you can use
  - `iostat -xd <interval>` for per-device information
    - `-d` device utilization
    - `-x` extended statistics
  - `sudo iotop` for a `top`-like interface (easily find the process doing the most reads/writes)
    - `-o` only show processes or threads actually doing I/O
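For example, to inspect a process's counters by hand (note that `read_bytes`/`write_bytes` count IO that actually hit the block layer, while `rchar`/`wchar` count logical IO, page cache hits included):

```bash
cat /proc/self/io      # the cat process's own counters
sudo cat /proc/1/io    # another process's counters (requires privileges)
```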
#### Next: [Network Utilization](net-util.md)
--------------------------------------------------------------------------------
/docs/memory-usage.md:
--------------------------------------------------------------------------------
# Memory Metrics

## Memory Usage

Is free memory really free? What's the difference between cached memory and buffers? Why can't I drop all the caches? Let's get our hands dirty and find out...

### Task M1: Memory Usage, Caches and Buffers

1. Fire up `top`, and write down how much `free` memory you have (**keep it running for the rest of this module**)
2. Start the memory hog `scripts/memory/hog.sh`, and let it run until it gets killed (if it hangs, use `Ctrl+C`)
3. Compare the amount it managed to allocate to the number you wrote down. Are they (almost) the same? If not, why?
4. Read about the `buffer` and `cached Mem` values in `man 5 proc` (under `meminfo`)
   1. Run the memory hog `scripts/memory/hog.sh`
   2. Write down the `buffer` size
   3. Now run the buffer balloon `sudo scripts/memory/buffer.sh`
   4. Check the `buffer` size again
   5. Read the script, and see if you can make sense of the results...
   6. Repeat the 5 steps above with the `cached Mem` value and `sudo scripts/memory/cache.sh`
5. Let's see how `cached Mem` affects application performance
   1. As *root*, drop the caches using `echo 3 > /proc/sys/vm/drop_caches`
   2. Time a dummy Python application: `time python3 -c 'print("Hello World")'` (you can repeat these 2 steps multiple times)
   3. Now re-run our dummy Python application, but this time without dropping the cached memory. Can you see the difference?
6. Run the `scripts/memory/dentry.py` script and observe the memory usage using `free`. What is using the memory? How does it affect performance? What tools can show you kernel memory usage?
7. Run the `scripts/memory/dentry2.py` script and try dropping the caches. Does it make a difference? What's the difference between `dentry.py` and `dentry2.py`?

### Discussion

- Why wasn't our memory hog able to grab all the `cached` memory?
- What will happen if we remove the following line from `scripts/memory/hog.sh` (try it!)? Why?

  ```c
  tmp[0] = 0;
  ```

- Assuming a server has some amount of free memory, can we assume it has enough memory to support its current workload? If not, why?


### Tools

- Most tools use `/proc/meminfo` to fetch memory usage information (the fields most relevant to this module are sampled below).
  - A simple example is the `free` utility
- To get usage information over some period, use `sar -r <interval>`
  - Here you can also see how many dirty pages you have (try running `sync` while `sar` is running)
  - The `%commit` field is also interesting, especially if it's larger than 100...
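For example:

```bash
# Free vs. actually-available memory, plus the caches our scripts inflate
grep -E '^(MemFree|MemAvailable|Buffers|Cached|Dirty|Slab)' /proc/meminfo
```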
#### Next: [IO Usage](io-usage.md)
--------------------------------------------------------------------------------
/docs/net-util.md:
--------------------------------------------------------------------------------
# Network Metrics

## Network Utilization

### Task NP1: Network Utilization

1. Assuming we have two machines connected with Gigabit Ethernet interfaces, what is the maximum expected **throughput in kilobytes per second**?
2. For the following task you'll need two machines, or a partner:

   | Machine/s | Command | Notes |
   |:---------:|---------|-------|
   | A + B | `sar -n DEV 2` | Write down the receive/transmit packets/KB per second. Keep this running for the entire length of the task |
   | A + B | `sar -n EDEV 2` | These are the error statistics, read about them in `man 1 sar`. Keep this running for the entire length of the task |
   | A | `ip a` | Write down A's IP address |
   | A | `iperf -s` | This will start a server for our little benchmark |
   | B | `iperf -c <A's IP> -t 30` | Start the benchmark client for 30 seconds |
   | A | `iperf -su` | Replace the previous TCP server with a UDP one |
   | B | `iperf -c <A's IP> -u -b 1000M -l 8k` | Repeat the benchmark with UDP traffic |

   1. When running the client on B, use the `sar` data to determine A's link utilization (in %, assuming Gigabit Ethernet)
   2. What are the major differences between TCP and UDP traffic observable with `sar`?
   3. Start to decrease the UDP buffer length (i.e. from `8k` to `4k`, `2k`, `1k`, `512`, `128`, `64`).
      1. Does the **throughput in KB** increase or decrease?
      2. What about the **throughput in packets**?
      3. Look carefully at the `iperf` client and server reports. Can you see any packet loss? Can you also see it in `ifconfig`?

### Network Errors

While Linux provides multiple metrics for network errors, including collisions, errors, and packet drops, the [kernel documentation](https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net-statistics) indicates that these metrics' meanings are driver-specific, and tightly coupled with your underlying networking layers.

### Tools

- Most tools use `/proc/net/dev` to fetch network device information (sampled below).
  - For example, try running `sar -n DEV <interval>`
- Connection information for TCP, UDP and raw sockets can be fetched from `/proc/net/tcp`, `/proc/net/udp` and `/proc/net/raw`
  - For parsed socket information use `netstat -tuwnp`
    - `-t`, `-u`, `-w`: TCP, UDP and raw sockets
    - `-n`: no DNS resolving
    - `-p`: the process owning this socket
- The most comprehensive command-line utility is `netstat`, covering metrics from interface statistics to socket information.
- Check out `iptraf` for interactive traffic monitoring (no socket information)
- Finally, `nethogs` provides a `top`-like experience, allowing you to find which process is taking up the most bandwidth (TCP only)
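`/proc/net/dev` is also easy to sample by hand. A rough receive-throughput estimate for a single interface (`eth0` is an assumption here, substitute your own; this also relies on the usual `iface: rx_bytes ...` field spacing):

```bash
rx1=$(awk '/eth0:/ {print $2}' /proc/net/dev); sleep 1
rx2=$(awk '/eth0:/ {print $2}' /proc/net/dev)
echo "received: $(( (rx2 - rx1) / 1024 )) KB/s"
```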
### Discussion

- What could be the reasons for packet drops? Which of these reasons can be measured on the receiving side?
- Why can't you see the `%ifutil` value on EC2?
  - **Hint**: Network device speed is usually found in `/sys/class/net/<interface>/speed`
  - **Workaround**: The `nicstat` utility allows you to specify the speed and duplex of your network interface from the command line:
    ```
    nicstat -S eth0:1000fd -n 2
    ```
--------------------------------------------------------------------------------
/docs/references.md:
--------------------------------------------------------------------------------
# References

This workshop borrows heavily from the following talks/blog-posts:

- Harald van Breederode's "Understanding Linux Load Average" [Part 1](https://prutser.wordpress.com/2012/04/23/understanding-linux-load-average-part-1/) and [Part 2](https://prutser.wordpress.com/2012/05/05/understanding-linux-load-average-part-2/)
- Brendan Gregg's [Linux Load Averages: Solving the Mystery](http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html)
- Vidar Holen's ["Help! Linux ate my RAM!"](http://www.linuxatemyram.com/play.html)
- Ben Mildren's [Monitoring IO performance using iostat & pt-diskstats](https://www.percona.com/live/mysql-conference-2013/sites/default/files/slides/Monitoring-Linux-IO.pdf)
- Amazon EC2 User Guide for Linux Instances, [Benchmark Volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_piops.html)
- Brendan Gregg's [USE Method: Linux Performance Checklist](http://www.brendangregg.com/USEmethod/use-linux.html)
- Serhiy Topchiy's [Testing Amazon EC2 network speed](http://epamcloud.blogspot.co.il/2013/03/testing-amazon-ec2-network-speed.html)
--------------------------------------------------------------------------------
/packer/README.md:
--------------------------------------------------------------------------------
# AMI builder for Linux Metrics workshop

## Usage

```
packer build -var 'aws_access_key=...' -var 'aws_secret_key=...' -var 'subnet_id=...' -var 'vpc_id=...' packer.json
```
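Before building, you can sanity-check the template with the same variables (`validate` is a standard `packer` subcommand; the `...` values are placeholders, as above):

```
packer validate -var 'aws_access_key=...' -var 'aws_secret_key=...' -var 'subnet_id=...' -var 'vpc_id=...' packer.json
```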
--------------------------------------------------------------------------------
/packer/bootstrap.sh:
--------------------------------------------------------------------------------
#!/bin/bash

if [ $(id -u) -ne 0 ]; then
  echo "You must run this script as root. Attempting to sudo" 1>&2
  exec sudo -n bash "$0" "$@"
fi

# Wait for cloud-init
sleep 10

# Install packages
DIST="$(lsb_release -cs)"
echo "deb http://repo.netdata.cloud/repos/stable/ubuntu/ ${DIST}/" > /etc/apt/sources.list.d/netdata.list
wget -O - https://repo.netdata.cloud/netdatabot.gpg.key | gpg --dearmor -o /etc/apt/trusted.gpg.d/netdata.gpg
apt update
export DEBIAN_PRIORITY=critical
export DEBIAN_FRONTEND=noninteractive
apt install -y sysstat stress procps build-essential linux-tools-generic fio iotop iperf iptraf-ng net-tools \
  nicstat git glibc-doc manpages-dev bpftrace bpfcc-tools netdata \
  ubuntu-dbgsym-keyring

echo "deb http://ddebs.ubuntu.com ${DIST} main restricted universe multiverse
deb http://ddebs.ubuntu.com ${DIST}-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com ${DIST}-proposed main restricted universe multiverse" > /etc/apt/sources.list.d/ddebs.list

apt update
apt install -y bpftrace-dbgsym

# Change netdata systemd definition (e.g. to renice it)
mkdir -p /etc/systemd/system/netdata.service.d/

cat > /etc/systemd/system/netdata.service.d/override.conf <<'EOF'
[Service]
# The minimum netdata Out-Of-Memory (OOM) score.
# netdata (via [global].OOM score in netdata.conf) can only increase the value set here.
# To decrease it, set the minimum here and set the same or a higher value in netdata.conf.
# Valid values: -1000 (never kill netdata) to 1000 (always kill netdata).
OOMScoreAdjust=-1000

# Valid policies: other (the system default) | batch | idle | fifo | rr
# To give netdata the max priority, set CPUSchedulingPolicy=rr and CPUSchedulingPriority=99
CPUSchedulingPolicy=rr

# This sets the scheduling priority (for policies: rr and fifo).
# Priority gets values 1 (lowest) to 99 (highest).
CPUSchedulingPriority=99

Nice=-10
EOF

systemctl daemon-reload
systemctl restart netdata.service

# Disable transparent huge pages
cat > /etc/rc.local <<'EOF'
#!/bin/sh -e
#
# rc.local
#

THP_DIR="/sys/kernel/mm/transparent_hugepage"

if test -f $THP_DIR/enabled; then
   echo never > $THP_DIR/enabled
fi

if test -f $THP_DIR/defrag; then
   echo never > $THP_DIR/defrag
fi

exit 0
EOF

# systemd's rc-local service only runs /etc/rc.local if it is executable
chmod +x /etc/rc.local

# Clone linux-metrics in the user's home
cat > /etc/profile.d/clone-linux-metrics.sh <<'EOF'
#!/bin/bash
if [[ ! -e ~/linux-metrics ]] && [[ $(id -u) -ne 0 ]]; then
  echo "unable to find linux-metrics directory - cloning..."
  pushd ~
  git clone https://github.com/natict/linux-metrics.git
  popd
fi
EOF
--------------------------------------------------------------------------------
/packer/packer.json:
--------------------------------------------------------------------------------
{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "subnet_id": "",
    "vpc_id": "",
    "aws_region": "eu-central-1",
    "source_ami": "ami-0fc2927a21c4cd7e2"
  },
  "builders": [
    {
      "name": "linux-metrics",
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "{{user `aws_region`}}",
      "source_ami": "{{user `source_ami`}}",
      "instance_type": "t2.medium",
      "ssh_username": "ubuntu",
      "ami_name": "linux-metrics-{{timestamp}}",
      "subnet_id": "{{user `subnet_id`}}",
      "vpc_id": "{{user `vpc_id`}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "bootstrap.sh"
    }
  ]
}
--------------------------------------------------------------------------------
/scripts/cpu/dummy1.sh:
--------------------------------------------------------------------------------
#!/bin/bash

stress -q -c 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy2.sh:
--------------------------------------------------------------------------------
#!/bin/bash

stress -q -m 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy3.sh:
--------------------------------------------------------------------------------
#!/bin/bash

nice -n 10 -- stress -q -c 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy4.sh:
--------------------------------------------------------------------------------
#!/bin/bash

stress -q -d 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy_app.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3

import time
import math


def percentile(N, percent):
    if not N:
        return None
    N = sorted(N)
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return N[int(k)]
    d0 = N[int(f)] * (c - k)
    d1 = N[int(c)] * (k - f)
    return d0 + d1


def foo():
    for i in range(20000):
        x = math.sqrt(i)


if __name__ == "__main__":
    m = []

    for _ in range(1000):
        start = time.time()
        foo()
        m.append(time.time() - start)

    print(
        "50th, 90th and 99th percentile: %f, %f, %f"
        % (percentile(m, 0.5), percentile(m, 0.9), percentile(m, 0.99))
    )
--------------------------------------------------------------------------------
/scripts/disk/writer.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Rewrite a 1GB file forever
while true; do dd if=/dev/zero of=/tmp/test.1G bs=1M count=1024; done
--------------------------------------------------------------------------------
/scripts/memory/buffer.sh:
--------------------------------------------------------------------------------
#!/bin/bash

if [[ $UID != 0 ]]; then
  echo "Please run this script as root"
  exit 1
fi

root_dev="$(mount | grep 'on / ' | cut -d' ' -f1)"  # /dev/xvda

# Read 4GB of blocks from the root device
dd if=$root_dev of=/dev/null bs=1M count=4096
--------------------------------------------------------------------------------
/scripts/memory/cache.sh:
--------------------------------------------------------------------------------
#!/bin/bash -e

if [[ $UID != 0 ]]; then
  echo "Please run this script as root"
  exit 1
fi

# Read all readable files in /usr (Generates ~1.5GB cache)
# find /usr -readable -type f -exec dd if={} of=/dev/null status=none \;

# Just create a large file
SIZE_IN_MB=4096
dd if=/dev/zero of=/tmp/bigfile bs=1MB count=${SIZE_IN_MB}

echo "Press Enter to continue..."; read
rm /tmp/bigfile
--------------------------------------------------------------------------------
/scripts/memory/dentry.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import os
import uuid
import sys

try:
    directory = sys.argv[1]
except IndexError:
    directory = "./trash"

if not os.path.exists(directory):
    os.makedirs(directory)

try:
    while True:
        # Create an empty file (and a dentry cache entry) per iteration
        open(os.path.join(directory, str(uuid.uuid4())), 'w').close()
except Exception:
    input("press Enter to terminate")
--------------------------------------------------------------------------------
/scripts/memory/dentry2.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import os

try:
    while True:
        os.makedirs("t")
        os.chdir("t")
except Exception:
    input("press Enter to terminate\n")
--------------------------------------------------------------------------------
/scripts/memory/hog.sh:
--------------------------------------------------------------------------------
#!/bin/bash

HOG_C=/tmp/hog.c
HOG=/tmp/hog

cat >$HOG_C <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>


int main(int argc, char* argv[]) {
    long page_size = sysconf(_SC_PAGESIZE);

    long count = 0;
    while(1) {
        char* tmp = (char*) malloc(page_size);
        if (tmp) {
            tmp[0] = 0;
            count += page_size;
            if (count % (page_size*1024) == 0) {
                printf("Allocated %ld KB\n", count/1024);
            }
        }
    }

    return 0;
}
EOF

gcc -o $HOG $HOG_C

exec $HOG
--------------------------------------------------------------------------------