├── README.md
├── docs
│   ├── cpu-ctxt.md
│   ├── cpu-load.md
│   ├── cpu-percentage.md
│   ├── images
│   │   └── linux_io.png
│   ├── io-usage.md
│   ├── memory-usage.md
│   ├── net-util.md
│   └── references.md
├── packer
│   ├── README.md
│   ├── bootstrap.sh
│   └── packer.json
└── scripts
    ├── cpu
    │   ├── dummy1.sh
    │   ├── dummy2.sh
    │   ├── dummy3.sh
    │   ├── dummy4.sh
    │   └── dummy_app.py
    ├── disk
    │   └── writer.sh
    └── memory
        ├── buffer.sh
        ├── cache.sh
        ├── dentry.py
        ├── dentry2.py
        └── hog.sh

/README.md:
--------------------------------------------------------------------------------
# Linux Metrics Workshop
While you can learn a lot by emitting metrics from your application, some insights can only be gained by looking at OS metrics. In this hands-on workshop, we will cover the basics of Linux metric collection for monitoring, performance tuning and capacity planning.

## Topics
1. CPU
   1. [CPU Percentage](docs/cpu-percentage.md)
   2. [CPU Load](docs/cpu-load.md)
   3. [Context Switches](docs/cpu-ctxt.md)
2. Memory
   1. [Memory Usage](docs/memory-usage.md)
3. IO
   1. [IO Usage](docs/io-usage.md)
4. Network
   1. [Network Utilization](docs/net-util.md)
5. [References](docs/references.md)

## Setup
The workshop was designed to run on AWS EC2 t2.small instances with general purpose SSD, Ubuntu 24.04 amd64, and transparent huge pages disabled.
You can build an AMI with all the dependencies installed using the attached [packer](https://www.packer.io/) template.

A pre-built AMI `ami-01cfe208cd769212d` is available in `eu-central-1`.

If you run on your own instance, make sure you have only 1 CPU (easier to read the metrics) and that you disable transparent huge pages (`echo never > /sys/kernel/mm/transparent_hugepage/enabled`).
--------------------------------------------------------------------------------
/docs/cpu-ctxt.md:
--------------------------------------------------------------------------------
# CPU Metrics

## Context Switches

### Recall: Linux Process Context Switches
A mechanism to store the current process *state*, i.e. registers, memory maps, and kernel structs (e.g. the TSS on 32-bit x86), and load another process (or a new one). Context switches are usually computationally expensive (although optimizations exist), yet inevitable. For example, they are used to allow multi-tasking (e.g. preemption), and to switch between user and kernel modes.
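The kernel exposes these events in `/proc`; for a quick look at the per-process counters before we begin (shown here for your current shell):

```bash
# Cumulative counters since the process started:
# "voluntary"    = the task gave up the CPU (e.g. a blocking syscall)
# "nonvoluntary" = the scheduler preempted it
grep ctxt_switches /proc/$$/status
```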
### Task CS1: Context Switches

1. Execute `vmstat 2` and write down the current context switch rate (`cs` field)
2. Raise that number by executing `stress -i 10`
   1. What is the current context switch rate?
   2. What is causing this rate? Multi-tasking? Interrupts? Switches between kernel and user modes?
   3. Kill the `stress` command, and watch the rate drop
3. Now let's see how a high context switch rate affects a dummy application
   1. Run the dummy application `perf stat -e cs python3 scripts/cpu/dummy_app.py` (which calls a dummy function 1000 times, and prints its runtime percentiles)
   2. Write down the current CPU usage, the application percentiles and the context switch rate
   3. **In the same session**, raise the context switch rate using `stress -i 10 &` and re-run the dummy application. Write down the current CPU usage, the application percentiles and the context switch rate.
   4. Describe the change in the percentiles. Did the high context switch rate affect most of the `foo()` runs (i.e. the 50th percentile)? If not, why?
4. Finally, observe the behaviour when running `stress` in a different scheduling task group
   1. Move one of your sessions to a different cgroup: `sudo mkdir /sys/fs/cgroup/a; echo $$ | sudo tee /sys/fs/cgroup/a/cgroup.procs` (on older, cgroup-v1 systems the equivalent is `sudo mkdir /sys/fs/cgroup/cpu/a; echo $$ | sudo tee /sys/fs/cgroup/cpu/a/tasks`)
   2. Run stress again: `stress -i 10` or `stress -c 10`
   3. Compare the CPU usage to **3.iii** (it should be roughly the same) and compare the context switch rate (which should also be roughly the same)
   4. Re-run the dummy application and describe the change in the percentiles (and the process context switches) vs **3.iv**

### Discussion

- Can performance measurements in a staging environment truly estimate performance in production?
- Why did we run the `stress` command and our dummy application in the same session?
- Read about the [magic kernel patch](http://www.phoronix.com/scan.php?page=article&item=linux_2637_video&num=1)

### Tools

- Most tools use `/proc/stat` to fetch global context switching information (the `ctxt` field), and `/proc/[PID]/status` for process-specific information (the `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` fields)
- From the command-line you can use
  - `vmstat <interval>` for global information
  - `pidstat -w -p <PID>` for process-specific information

#### Next: [Memory Usage](memory-usage.md)
--------------------------------------------------------------------------------
/docs/cpu-load.md:
--------------------------------------------------------------------------------
# CPU Metrics

## CPU Load

### Recall: Linux Process State

From `man 1 ps`:
```
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped, either by a job control signal or because it is being traced
Z    defunct ("zombie") process, terminated but not reaped by its parent
```

### Task CL1: CPU Load

1. What is the Load Average metric? Use the Linux Process States and `man 5 proc` (search for loadavg)
2. Start the disk stress script (NOTE: avoid running it on your own SSD):

   ```bash
   scripts/disk/writer.sh
   ```

3. Run the following command and look at the Load values for about a minute until `ldavg-1` stabilizes:

   ```bash
   sar -q 1 100
   ```

   1. What is the writing speed of our script (ignore the first value, this is the [EBS General Purpose IOPS Burst](http://aws.amazon.com/ebs/details/#GP))?
   2. What is the current Load Average? Why? Which processes contribute to this number?
   3. What are the CPU %user, %IO-wait and %idle?
4. While the previous script is running, start a single CPU stress:

   ```bash
   stress -c 1 -t 3600
   ```

   Wait another minute, and answer the questions above again.
5. Stop all the scripts

### Discussion

- Why are processes waiting for IO included in the Load Average?
- Assuming we have 1 CPU core and a Load of 5, is our CPU core at 100% utilization?
- How can we know if load is going up or down?
- Does a load average of 70 indicate a problem?

### Tools

- Most tools use `/proc/loadavg` to fetch Load Average and run queue information.
- Have a look at `/proc/pressure/cpu` for the newer [PSI metrics](https://docs.kernel.org/accounting/psi.html) (both files are sampled below)
- To get readings over a specific interval of time, you can use:
  - `sar -q <interval>`
    - `-q` queue length and load averages
  - or simply `uptime`
- Use eBPF-based tools to observe the run queue: `runqlat` and `runqlen`
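Both files are plain text, so you can sample them directly; for example:

```bash
# 1/5/15-minute load averages, runnable/total scheduling entities, last PID
cat /proc/loadavg
# PSI: share of wall-clock time in which tasks stalled waiting for a CPU
cat /proc/pressure/cpu
```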
#### Next: [Context Switches](cpu-ctxt.md)
--------------------------------------------------------------------------------
/docs/cpu-percentage.md:
--------------------------------------------------------------------------------
# CPU Metrics

## CPU Percentage
Let's start with the most common CPU metric.
Fire up `top`, and let's start figuring out what the different CPU percentage values are.
```
%Cpu(s):  2.3 us,  0.6 sy,  0.0 ni, 96.7 id,  0.2 wa
```

### Task CP1: CPU Percentage
For each of the following scripts (`dummy1.sh`, `dummy2.sh`, `dummy3.sh`, `dummy4.sh`) under the `scripts/cpu/` directory:

1. Run the script
2. While the script is running, look at `top` in another terminal window
3. Without looking at the code, try to figure out what the script is doing (find the percentage fields' description in `man 1 top`)
4. Stop the script (use `Ctrl+C` or wait 2 minutes for it to time out)
5. Verify your answer by reading the script content

### Tools

- Most tools use `/proc/stat` to fetch CPU percentages. Note that it reports cumulative time (in clock ticks), not percentages, so percentages are computed from two samples, as sketched below.
- To get a percentage over a specific interval of time, you can use:
  - `sar -P ALL -u <interval>`
    - `-P` per-processor statistics
    - `-u` CPU utilization
  - or `mpstat` (similar usage and output)
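Here is a minimal sketch of what these tools do under the hood (simplified: iowait is folded into idle, and the irq/softirq/steal fields are ignored):

```bash
# First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
# Counters are in USER_HZ clock ticks (typically 100 per second) since boot.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 2
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
idle=$(( (i2 + w2) - (i1 + w1) ))
echo "CPU busy: $(( 100 * busy / (busy + idle) ))%"
```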
### Discussion

- What's the difference between %IO-wait and %idle?
- Is the entire CPU load created by a process accounted to that process?

#### Time Stolen and Amazon EC2

You may have noticed the `st` label. From `man 1 top`:
```
st : time stolen from this vm by the hypervisor
```
Amazon EC2 uses the hypervisor to regulate the machine's CPU usage (to match the instance type's EC2 Compute Units). If you see inconsistent stolen percentages over time, then you might be using [Burstable Performance Instances](http://aws.amazon.com/ec2/instance-types/#burst).

#### Next: [CPU Load](cpu-load.md)
--------------------------------------------------------------------------------
/docs/images/linux_io.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/natict/linux-metrics/1be276d0bf7f6cd68304613aa0d959d78ce45193/docs/images/linux_io.png
--------------------------------------------------------------------------------
/docs/io-usage.md:
--------------------------------------------------------------------------------
# IO Metrics

## IO Usage

### Recall: Linux IO, Merges, IOPS

Linux IO performance is affected by many factors, including your application workload, choice of file-system, IO scheduler (e.g. [cfq](https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt) and [deadline](https://www.kernel.org/doc/Documentation/block/deadline-iosched.txt), replaced on recent multi-queue kernels by `mq-deadline`, `bfq`, `kyber` and `none`), queue configuration, device driver, the underlying device(s), caches and more.

![Linux IO](images/linux_io.png)

#### Merged Reads/Writes

From [the Kernel Documentation](https://www.kernel.org/doc/Documentation/iostats.txt): *"Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O"*

#### IOPS

IOPS are input/output operations per second. Some operations take longer than others, e.g. HDDs can do sequential read operations much faster than random write operations. Here are some rough estimations [from Wikipedia](https://en.wikipedia.org/wiki/IOPS) and [Amazon EBS Product Details](http://aws.amazon.com/ebs/details/):

| Device/Type           | IOPS      |
|-----------------------|-----------|
| 7.2k-10k RPM SATA HDD | 75-150    |
| 10k-15k RPM SAS HDD   | 140-210   |
| SATA SSD              | 1k-120k   |
| AWS EC2 gp2           | up to 10k |
| AWS EC2 io1           | up to 20k |

### Task I1: IO Usage

1. Start by running `iostat -xd 2`, and examine the output fields. Let's go over the important ones together (note that newer `sysstat` releases rename some of them, e.g. `avgqu-sz` becomes `aqu-sz`):
   - **rrqm/s** & **wrqm/s** - Number of read/write requests merged per second
   - **r/s** & **w/s** - Read/write requests (after merges) per second. Their sum is the IOPS!
   - **rkB/s** & **wkB/s** - Number of kB read/written per second, i.e. **IO throughput**
   - **avgqu-sz** - Average request queue size for this device. Check out `/sys/block/<device>/queue/nr_requests` for the maximum queue size.
   - **r_await**, **w_await**, **await** - The average time (in ms) for read/write/both requests to be served, including time spent in the queue, i.e. **IO latency**
2. Write down these fields' values while our system is at rest
3. In a new session, let's benchmark our device's *write performance* by running:

   ```bash
   fio --directory=/tmp --name fio_test_file --direct=1 --rw=randwrite --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
   ```

   This will clone 16 processes to perform non-buffered (direct) random writes for 3 minutes.
   1. Compare the values you see in `iostat` to the values you wrote down earlier. Do they make sense?
   2. Look at the `fio` results and try to see if the number of IOPS makes sense (we are using EBS gp2 volumes).
4. Repeat the previous task, this time benchmarking **read performance**:

   ```bash
   fio --directory=/tmp --name fio_test_file --direct=1 --rw=randread --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
   ```

5. Finally, repeat the **read performance** benchmark with 1 process:

   ```bash
   fio --directory=/tmp --name fio_test_file --direct=1 --rw=randread --bs=16k --size=100M --numjobs=1 --time_based --runtime=180 --group_reporting --norandommap
   ```

   1. Read about the `svctm` field in `man 1 iostat` (it is deprecated, and removed entirely in newer `sysstat` releases). Compare the value we got now to the value we got for 16 processes. Is there a difference? If so, why?
   2. Repeat the previous question for the `%util` field.

6. `fio` also supports other IO patterns (by changing the `--rw=` parameter), including:
   - `read` - Sequential reads
   - `write` - Sequential writes
   - `rw` - Mixed sequential reads and writes
   - `randrw` - Mixed random reads and writes

   If time permits, explore these IO patterns to learn more about EBS gp2's performance under different workloads; one possible starting point is sketched below.
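For example, a mixed random workload (the 70/30 split via `--rwmixread` is an arbitrary choice here; all other parameters match the commands above):

```bash
# 70% random reads / 30% random writes, same file, size and runtime as before
fio --directory=/tmp --name fio_test_file --direct=1 --rw=randrw --rwmixread=70 \
    --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 \
    --group_reporting --norandommap
```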
### Discussion

- Why do we need an IO queue? What does it enable the kernel to do?
- Why are the `svctm` and `%util` iostat fields essentially useless in a modern environment? (read [Marc Brooker's excellent blog post](https://brooker.co.za/blog/2014/07/04/iostat-pct.html))
- What is the difference in how the kernel handles reads and writes? How does that affect metrics and application behaviour?

### Tools

- Most tools use `/proc/diskstats` to fetch global IO statistics.
- Per-process IO statistics are usually fetched from `/proc/[pid]/io`, which is documented in `man 5 proc` (see the example below)
- From the command-line you can use
  - `iostat -xd <interval>` for per-device information
    - `-d` device utilization
    - `-x` extended statistics
  - `sudo iotop` for a `top`-like interface (easily find the process doing the most reads/writes)
    - `-o` only show processes or threads actually doing I/O
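For example, to inspect a process's counters by hand (note that `read_bytes`/`write_bytes` count IO that actually hit the block layer, while `rchar`/`wchar` count logical IO, page cache hits included):

```bash
cat /proc/self/io      # the cat process's own counters
sudo cat /proc/1/io    # another process's counters (requires privileges)
```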
#### Next: [Network Utilization](net-util.md)
--------------------------------------------------------------------------------
/docs/memory-usage.md:
--------------------------------------------------------------------------------
# Memory Metrics

## Memory Usage

Is free memory really free? What's the difference between cached memory and buffers? Why can't I drop all the caches? Let's get our hands dirty and find out...

### Task M1: Memory Usage, Caches and Buffers

1. Fire up `top`, and write down how much `free` memory you have (**keep it running for the rest of this module**)
2. Start the memory hog `scripts/memory/hog.sh`, and let it run until it gets killed (if it hangs, use `Ctrl+C`)
3. Compare the amount it managed to allocate to the number you wrote down. Are they (almost) the same? If not, why?
4. Read about the `buffer` and `cached Mem` values in `man 5 proc` (under `meminfo`)
   1. Run the memory hog `scripts/memory/hog.sh`
   2. Write down the `buffer` size
   3. Now run the buffer balloon `sudo scripts/memory/buffer.sh`
   4. Check the `buffer` size again
   5. Read the script, and see if you can make sense of the results...
   6. Repeat the 5 steps above with the `cached Mem` value and `sudo scripts/memory/cache.sh`
5. Let's see how `cached Mem` affects application performance
   1. As *root*, drop the caches using `echo 3 > /proc/sys/vm/drop_caches`
   2. Time a dummy Python application: `time python3 -c 'print("Hello World")'` (you can repeat these 2 steps multiple times)
   3. Now re-run our dummy Python application, but this time without dropping the cached memory. Can you see the difference?
6. Run the `scripts/memory/dentry.py` script and observe the memory usage using `free`. What is using the memory? How does it affect performance? What tools can show you kernel memory usage?
7. Run the `scripts/memory/dentry2.py` script and try dropping the caches. Does it make a difference? What's the difference between `dentry.py` and `dentry2.py`?

### Discussion

- Why wasn't our memory hog able to grab all the `cached` memory?
- What will happen if we remove the following line from `scripts/memory/hog.sh` (try it!)? Why?

  ```c
  tmp[0] = 0;
  ```

- Assuming a server has some amount of free memory, can we assume it has enough memory to support its current workload? If not, why?


### Tools

- Most tools use `/proc/meminfo` to fetch memory usage information (the fields most relevant to this module are sampled below).
  - A simple example is the `free` utility
- To get usage information over some period, use `sar -r <interval>`
  - Here you can also see how many dirty pages you have (try running `sync` while `sar` is running)
  - The `%commit` field is also interesting, especially if it's larger than 100...
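For example:

```bash
# Free vs. actually-available memory, plus the caches our scripts inflate
grep -E '^(MemFree|MemAvailable|Buffers|Cached|Dirty|Slab)' /proc/meminfo
```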
#### Next: [IO Usage](io-usage.md)
--------------------------------------------------------------------------------
/docs/net-util.md:
--------------------------------------------------------------------------------
# Network Metrics

## Network Utilization

### Task NP1: Network Utilization

1. Assuming we have two machines connected with Gigabit Ethernet interfaces, what is the maximum expected **throughput in kilobytes per second**?
2. For the following task you'll need two machines, or a partner:

   | Machine/s | Command | Notes |
   |:---------:|---------|-------|
   | A + B | `sar -n DEV 2` | Write down the receive/transmit packets/KB per second. Keep this running for the entire length of the task |
   | A + B | `sar -n EDEV 2` | These are the error statistics, read about them in `man 1 sar`. Keep this running for the entire length of the task |
   | A | `ip a` | Write down A's IP address |
   | A | `iperf -s` | This will start a server for our little benchmark |
   | B | `iperf -c <A's IP> -t 30` | Start the benchmark client for 30 seconds |
   | A | `iperf -su` | Replace the previous TCP server with a UDP one |
   | B | `iperf -c <A's IP> -u -b 1000M -l 8k` | Repeat the benchmark with UDP traffic |

   1. When running the client on B, use the `sar` data to determine A's link utilization (in %, assuming Gigabit Ethernet)
   2. What are the major differences between TCP and UDP traffic observable with `sar`?
   3. Start to decrease the UDP buffer length (i.e. from `8k` to `4k`, `2k`, `1k`, `512`, `128`, `64`).
      1. Does the **throughput in KB** increase or decrease?
      2. What about the **throughput in packets**?
      3. Look carefully at the `iperf` client and server reports. Can you see any packet loss? Can you also see it in `ifconfig`?

### Network Errors

While Linux provides multiple metrics for network errors, including collisions, errors, and packet drops, the [kernel documentation](https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net-statistics) indicates that these metrics' meanings are driver-specific, and tightly coupled with your underlying networking layers.

### Tools

- Most tools use `/proc/net/dev` to fetch network device information (sampled below).
  - For example, try running `sar -n DEV <interval>`
- Connection information for TCP, UDP and raw sockets can be fetched from `/proc/net/tcp`, `/proc/net/udp` and `/proc/net/raw`
  - For parsed socket information use `netstat -tuwnp`
    - `-t`, `-u`, `-w`: TCP, UDP and raw sockets
    - `-n`: no DNS resolving
    - `-p`: the process owning this socket
- The most comprehensive command-line utility is `netstat`, covering metrics from interface statistics to socket information.
- Check out `iptraf` for interactive traffic monitoring (no socket information)
- Finally, `nethogs` provides a `top`-like experience, allowing you to find which process is taking up the most bandwidth (TCP only)
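`/proc/net/dev` is also easy to sample by hand. A rough receive-throughput estimate for a single interface (`eth0` is an assumption here, substitute your own; this also relies on the usual `iface: rx_bytes ...` field spacing):

```bash
rx1=$(awk '/eth0:/ {print $2}' /proc/net/dev); sleep 1
rx2=$(awk '/eth0:/ {print $2}' /proc/net/dev)
echo "received: $(( (rx2 - rx1) / 1024 )) KB/s"
```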
### Discussion

- What could be the reasons for packet drops? Which of these reasons can be measured on the receiving side?
- Why can't you see the `%ifutil` value on EC2?
  - **Hint**: Network device speed is usually found in `/sys/class/net/<interface>/speed`
  - **Workaround**: The `nicstat` utility allows you to specify the speed and duplex of your network interface from the command line:
    ```
    nicstat -S eth0:1000fd -n 2
    ```
--------------------------------------------------------------------------------
/docs/references.md:
--------------------------------------------------------------------------------
# References

This workshop borrows heavily from the following talks/blog-posts:

- Harald van Breederode's "Understanding Linux Load Average" [Part 1](https://prutser.wordpress.com/2012/04/23/understanding-linux-load-average-part-1/) and [Part 2](https://prutser.wordpress.com/2012/05/05/understanding-linux-load-average-part-2/)
- Brendan Gregg's [Linux Load Averages: Solving the Mystery](http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html)
- Vidar Holen's ["Help! Linux ate my RAM!"](http://www.linuxatemyram.com/play.html)
- Ben Mildren's [Monitoring IO performance using iostat & pt-diskstats](https://www.percona.com/live/mysql-conference-2013/sites/default/files/slides/Monitoring-Linux-IO.pdf)
- Amazon EC2 User Guide for Linux Instances, [Benchmark Volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_piops.html)
- Brendan Gregg's [USE Method: Linux Performance Checklist](http://www.brendangregg.com/USEmethod/use-linux.html)
- Serhiy Topchiy's [Testing Amazon EC2 network speed](http://epamcloud.blogspot.co.il/2013/03/testing-amazon-ec2-network-speed.html)
--------------------------------------------------------------------------------
/packer/README.md:
--------------------------------------------------------------------------------
# AMI builder for Linux Metrics workshop

## Usage

```
packer build -var 'aws_access_key=...' -var 'aws_secret_key=...' -var 'subnet_id=...' -var 'vpc_id=...' packer.json
```
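Before building, you can sanity-check the template with the same variables (`validate` is a standard `packer` subcommand; the `...` values are placeholders, as above):

```
packer validate -var 'aws_access_key=...' -var 'aws_secret_key=...' -var 'subnet_id=...' -var 'vpc_id=...' packer.json
```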
--------------------------------------------------------------------------------
/packer/bootstrap.sh:
--------------------------------------------------------------------------------
#!/bin/bash

if [ $(id -u) -ne 0 ]; then
  echo "You must run this script as root. Attempting to sudo" 1>&2
  exec sudo -n bash "$0" "$@"
fi

# Wait for cloud-init
sleep 10

# Install packages
DIST="$(lsb_release -cs)"
echo "deb http://repo.netdata.cloud/repos/stable/ubuntu/ ${DIST}/" > /etc/apt/sources.list.d/netdata.list
wget -O - https://repo.netdata.cloud/netdatabot.gpg.key | gpg --dearmor -o /etc/apt/trusted.gpg.d/netdata.gpg
apt update
export DEBIAN_PRIORITY=critical
export DEBIAN_FRONTEND=noninteractive
apt install -y sysstat stress procps build-essential linux-tools-generic fio iotop iperf iptraf-ng net-tools \
  nicstat git glibc-doc manpages-dev bpftrace bpfcc-tools netdata \
  ubuntu-dbgsym-keyring

echo "deb http://ddebs.ubuntu.com ${DIST} main restricted universe multiverse
deb http://ddebs.ubuntu.com ${DIST}-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com ${DIST}-proposed main restricted universe multiverse" > /etc/apt/sources.list.d/ddebs.list

apt update
apt install -y bpftrace-dbgsym

# Change netdata systemd definition (e.g. to renice it)
mkdir -p /etc/systemd/system/netdata.service.d/

cat > /etc/systemd/system/netdata.service.d/override.conf <<'EOF'
[Service]
# The minimum netdata Out-Of-Memory (OOM) score.
# netdata (via [global].OOM score in netdata.conf) can only increase the value set here.
# To decrease it, set the minimum here and set the same or a higher value in netdata.conf.
# Valid values: -1000 (never kill netdata) to 1000 (always kill netdata).
OOMScoreAdjust=-1000

# Valid policies: other (the system default) | batch | idle | fifo | rr
# To give netdata the max priority, set CPUSchedulingPolicy=rr and CPUSchedulingPriority=99
CPUSchedulingPolicy=rr

# This sets the scheduling priority (for policies: rr and fifo).
# Priority gets values 1 (lowest) to 99 (highest).
CPUSchedulingPriority=99

Nice=-10
EOF

systemctl daemon-reload
systemctl restart netdata.service

# Disable transparent huge pages
cat > /etc/rc.local <<'EOF'
#!/bin/sh -e
#
# rc.local
#

THP_DIR="/sys/kernel/mm/transparent_hugepage"

if test -f $THP_DIR/enabled; then
   echo never > $THP_DIR/enabled
fi

if test -f $THP_DIR/defrag; then
   echo never > $THP_DIR/defrag
fi

exit 0
EOF

# systemd's rc-local service only runs /etc/rc.local if it is executable
chmod +x /etc/rc.local

# Clone linux-metrics in the user's home
cat > /etc/profile.d/clone-linux-metrics.sh <<'EOF'
#!/bin/bash
if [[ ! -e ~/linux-metrics ]] && [[ $(id -u) -ne 0 ]]; then
  echo "unable to find linux-metrics directory - cloning..."
  pushd ~
  git clone https://github.com/natict/linux-metrics.git
  popd
fi
EOF
--------------------------------------------------------------------------------
/packer/packer.json:
--------------------------------------------------------------------------------
{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "subnet_id": "",
    "vpc_id": "",
    "aws_region": "eu-central-1",
    "source_ami": "ami-0fc2927a21c4cd7e2"
  },
  "builders": [
    {
      "name": "linux-metrics",
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "{{user `aws_region`}}",
      "source_ami": "{{user `source_ami`}}",
      "instance_type": "t2.medium",
      "ssh_username": "ubuntu",
      "ami_name": "linux-metrics-{{timestamp}}",
      "subnet_id": "{{user `subnet_id`}}",
      "vpc_id": "{{user `vpc_id`}}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "bootstrap.sh"
    }
  ]
}
--------------------------------------------------------------------------------
/scripts/cpu/dummy1.sh:
--------------------------------------------------------------------------------
#!/bin/bash

stress -q -c 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy2.sh:
--------------------------------------------------------------------------------
#!/bin/bash

stress -q -m 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy3.sh:
--------------------------------------------------------------------------------
#!/bin/bash

nice -n 10 -- stress -q -c 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy4.sh:
--------------------------------------------------------------------------------
#!/bin/bash

stress -q -d 1 -t 120
--------------------------------------------------------------------------------
/scripts/cpu/dummy_app.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3

import time
import math


def percentile(N, percent):
    if not N:
        return None
    N = sorted(N)
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return N[int(k)]
    d0 = N[int(f)] * (c - k)
    d1 = N[int(c)] * (k - f)
    return d0 + d1


def foo():
    for i in range(20000):
        x = math.sqrt(i)


if __name__ == "__main__":
    m = []

    for _ in range(1000):
        start = time.time()
        foo()
        m.append(time.time() - start)

    print(
        "50th, 90th and 99th percentile: %f, %f, %f"
        % (percentile(m, 0.5), percentile(m, 0.9), percentile(m, 0.99))
    )
--------------------------------------------------------------------------------
/scripts/disk/writer.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Rewrite a 1GB file forever
while true; do dd if=/dev/zero of=/tmp/test.1G bs=1M count=1024; done
--------------------------------------------------------------------------------
/scripts/memory/buffer.sh:
--------------------------------------------------------------------------------
#!/bin/bash

if [[ $UID != 0 ]]; then
  echo "Please run this script as root"
  exit 1
fi

root_dev="$(mount | grep 'on / ' | cut -d' ' -f1)"  # /dev/xvda

# Read 4GB of blocks from the root device
dd if=$root_dev of=/dev/null bs=1M count=4096
--------------------------------------------------------------------------------
/scripts/memory/cache.sh:
--------------------------------------------------------------------------------
#!/bin/bash -e

if [[ $UID != 0 ]]; then
  echo "Please run this script as root"
  exit 1
fi

# Read all readable files in /usr (Generates ~1.5GB cache)
# find /usr -readable -type f -exec dd if={} of=/dev/null status=none \;

# Just create a large file
SIZE_IN_MB=4096
dd if=/dev/zero of=/tmp/bigfile bs=1MB count=${SIZE_IN_MB}

echo "Press Enter to continue..."; read
rm /tmp/bigfile
--------------------------------------------------------------------------------
/scripts/memory/dentry.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import os
import uuid
import sys

try:
    directory = sys.argv[1]
except IndexError:
    directory = "./trash"

if not os.path.exists(directory):
    os.makedirs(directory)

try:
    while True:
        # Create an empty file (and a dentry cache entry) per iteration
        open(os.path.join(directory, str(uuid.uuid4())), 'w').close()
except Exception:
    input("press Enter to terminate")
--------------------------------------------------------------------------------
/scripts/memory/dentry2.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import os

try:
    while True:
        os.makedirs("t")
        os.chdir("t")
except Exception:
    input("press Enter to terminate\n")
--------------------------------------------------------------------------------
/scripts/memory/hog.sh:
--------------------------------------------------------------------------------
#!/bin/bash

HOG_C=/tmp/hog.c
HOG=/tmp/hog

cat >$HOG_C <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>


int main(int argc, char* argv[]) {
    long page_size = sysconf(_SC_PAGESIZE);

    long count = 0;
    while(1) {
        char* tmp = (char*) malloc(page_size);
        if (tmp) {
            tmp[0] = 0;
            count += page_size;
            if (count % (page_size*1024) == 0) {
                printf("Allocated %ld KB\n", count/1024);
            }
        }
    }

    return 0;
}
EOF

gcc -o $HOG $HOG_C

exec $HOG
--------------------------------------------------------------------------------