├── README.md ├── docs ├── cpu-ctxt.md ├── cpu-load.md ├── cpu-percentage.md ├── images │ └── linux_io.png ├── io-usage.md ├── memory-usage.md ├── net-util.md └── references.md ├── packer ├── README.md ├── bootstrap.sh └── packer.json ├── saltstack ├── README.md ├── Saltfile ├── master ├── passwords │ ├── README.md │ ├── parseme.py │ ├── sample_sorted.txt │ └── uuids.json ├── pillars │ ├── defaults │ │ └── init.sls │ └── top.sls ├── roster ├── states │ ├── bootstrap │ │ ├── init.sls │ │ ├── iptables_redirect │ │ ├── netdata.service │ │ ├── rc-local.service │ │ └── rc.local │ ├── control │ │ └── init.sls │ ├── top.sls │ └── users │ │ ├── init.sls │ │ ├── root_bashrc │ │ └── workshop_sudoers └── tools │ └── install_python.sh └── scripts ├── cpu ├── dummy1.sh ├── dummy2.sh ├── dummy3.sh ├── dummy4.sh └── dummy_app.py ├── disk ├── fio1.sh ├── fio2.sh ├── fio3.sh └── writer.sh └── memory ├── buffer.sh ├── cache.sh ├── dentry.py ├── dentry2.py └── hog.sh /README.md: -------------------------------------------------------------------------------- 1 | # Linux Metrics Workshop 2 | While you can learn a lot by emitting metrics from your application, some insights can only be gained by looking at OS metrics. In this hands-on workshop, we will cover the basics of Linux metric collection for monitoring, performance tuning and capacity planning. 3 | 4 | ## Topics 5 | 1. CPU 6 | 1. [CPU Percentage](docs/cpu-percentage.md) 7 | 1. [CPU Load](docs/cpu-load.md) 8 | 1. [Context Switches](docs/cpu-ctxt.md) 9 | 1. Memory 10 | 1. [Memory Usage](docs/memory-usage.md) 11 | 1. IO 12 | 1. [IO Usage](docs/io-usage.md) 13 | 1. Network 14 | 1. [Network Utilization](docs/net-util.md) 15 | 1. [References](docs/references.md) 16 | 17 | ## Setup 18 | The workshop was designed to run on AWS EC2 t2.small instances with general purpose SSD, Ubuntu 18.04 amd64, and transparent huge pages disabled. 
19 | You can build an AMI with all the dependencies installed using the attached [packer](https://www.packer.io/) template. 20 | 21 | If you run on your own instance, make sure you have only 1 CPU (easier to read the metrics) and that you disable transparent huge pages (`echo never > /sys/kernel/mm/transparent_hugepage/enabled`) 22 | -------------------------------------------------------------------------------- /docs/cpu-ctxt.md: -------------------------------------------------------------------------------- 1 | # CPU Metrics 2 | 3 | You will need 3 ssh terminals 4 | 5 | ## Context Switches 6 | 7 | ### Recall: Linux Process Context Switches 8 | A mechanism to store the current process *state*, ie. registers, memory maps, and kernel structs (eg. the TSS on 32-bit), and load another (or a new one). Context switches are usually computationally expensive (although optimizations exist), yet inevitable. For example, they are used to allow multi-tasking (eg. preemption), and to switch between user and kernel modes. 9 | 10 | Interprocess context switches are classified as *voluntary* or *involuntary*. A voluntary context switch occurs when a thread blocks because it 11 | requires a resource that is unavailable. An involuntary context switch takes place when a thread executes for the duration of its time slice or when 12 | the system identifies a higher-priority thread to run. 13 | 14 | ### Task CS1: Context Switches 15 | 16 | 1. Execute `vmstat 2` in a session (#1) and write down the current context switch rate (`cs` field): 17 | ```bash 18 | (term 1) root:~# vmstat 2 19 | ``` 20 | 2. Raise that number by executing `stress -i 10` in a new session (#2): 21 | ```bash 22 | (term 2) root:~# stress -i 10 23 | ``` 24 | 1. What is the current context switch rate? 25 | 2. What is causing this rate? Multi-tasking? Interrupts? Switches between kernel and user modes? 26 | 3. Kill the `stress` command in session #2, and watch the rate drop. 27 | 3. 
Now let's see how a high context switch rate affects a dummy application. 28 | 1. On session #2 run the dummy application `dummy_app.py` (which calls a dummy function 5000 times, and prints its runtime percentiles): 29 | ```bash 30 | (term 2) root:~# perf stat -e cs python linux-metrics/scripts/cpu/dummy_app.py 31 | ``` 32 | 2. Write down the current CPU usage, the application percentiles and the context switch rate. 33 | 3. **In the same session (#2)**, raise the context switch rate using `stress -i 10 -t 150 &` and re-run the dummy application. Write down the current CPU usage, the application percentiles and the context switch rate. 34 | ```bash 35 | (term 2) root:~# stress -i 10 -t 150 & 36 | (term 2) root:~# perf stat -e cs python linux-metrics/scripts/cpu/dummy_app.py 37 | ``` 38 | 4. Describe the change in the percentiles. Did the high context switch rate affect most of the `foo()` runs (ie. the 50th percentile)? If not, why? 39 | 4. Observe the behaviour when running `stress` in a different scheduling task group: 40 | 1. Open a new session (#3) and move it to a different cgroup: 41 | ```bash 42 | (term 3) root:~# mkdir -p /sys/fs/cgroup/cpu/grp/c; echo $$ | sudo tee /sys/fs/cgroup/cpu/grp/c/tasks 43 | ``` 44 | 2. Run `stress` again in the new session (#3), using `stress -i 10 -t 150` or `stress -c 10 -t 150`: 45 | ```bash 46 | (term 3) root:~# stress -i 10 -t 150 47 | ``` 48 | 3. Compare the CPU usage to **3.iii** (it should be roughly the same) and compare the context switch rate (which should also be the same). 49 | 4. Re-run the dummy application in the previous session (#2) and describe the change in the percentiles (and process context switches) vs **3.iv**. 50 | 5. What happens when processes compete for cpu time under a cgroup hierarchy? 51 | 1. Move the second session to a new cgroup: 52 | ```bash 53 | (term 2) root:~# mkdir -p /sys/fs/cgroup/cpu/grp/b; echo $$ | sudo tee /sys/fs/cgroup/cpu/grp/b/tasks 54 | ``` 55 | 2. Run `stress` in session #2 and the dummy application under `perf` in session #3. 
What do you observe? 56 | ```bash 57 | (term 2) root:~# stress -i 10 -t 150 58 | (term 3) root:~# perf stat -e cs python linux-metrics/scripts/cpu/dummy_app.py 59 | ``` 60 | 2. Lower `cpu.shares` for the stress cgroup (#2) and raise it for cgroup (#3): 61 | ```bash 62 | (term 2) root:~# echo 200 > /sys/fs/cgroup/cpu/grp/b/cpu.shares 63 | (term 3) root:~# echo 1000 > /sys/fs/cgroup/cpu/grp/c/cpu.shares 64 | ``` 65 | 3. In session #2 run `stress` again and in session #3 run the dummy application under `perf`: 66 | ```bash 67 | (term 2) root:~# stress -i 10 -t 150 68 | (term 3) root:~# perf stat -e cs python linux-metrics/scripts/cpu/dummy_app.py 69 | ``` 70 | 4. What do you observe? 71 | 72 | ### Discussion 73 | 74 | - Can performance measurements on a staging environment truly estimate performance on production? 75 | - Why did we run the `stress` command and our dummy application in the same session? 76 | 77 | ### Tools 78 | 79 | - Most tools use `/proc/stat` to fetch global context switching information (`ctxt` field), and `/proc/[PID]/status` for process specific information (`voluntary_context_switches` and `nonvoluntary_context_switches` fields). 80 | - From the command-line you can use: 81 | - `vmstat <interval>` for global information. 82 | - `pidstat -w -p <PID>` for process specific information (voluntary/involuntary context switches). 83 | 84 | ### Further reading 85 | 86 | - `man sched` 87 | - http://www.linfo.org/context_switch.html 88 | - https://wiki.archlinux.org/index.php/cgroups 89 | 90 | #### Next: [Memory Usage](memory-usage.md) 91 | -------------------------------------------------------------------------------- /docs/cpu-load.md: -------------------------------------------------------------------------------- 1 | # CPU Metrics 2 | 3 | ## CPU Load 4 | On Linux, Load Average is the average number of runnable (running + ready to run) tasks, plus tasks in uninterruptible state, over the last 1, 5, and 15 minutes. 
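The kernel maintains each of the three numbers as an exponentially-damped moving average, recomputed every 5 seconds from the count of active tasks. A minimal sketch of that update rule (a simplification we wrote for illustration — the real kernel uses fixed-point arithmetic and folds per-CPU counts, and the function name and constants below are ours):

```python
import math

def update_load(load, n_active, period_s=60.0, tick_s=5.0):
    """One exponentially-damped update: decay the previous average
    toward the current number of active (runnable + uninterruptible)
    tasks. period_s selects the 1-, 5- or 15-minute average."""
    decay = math.exp(-tick_s / period_s)
    return load * decay + n_active * (1.0 - decay)

# Hold 4 active tasks for one minute of 5-second ticks, starting idle:
load = 0.0
for _ in range(12):
    load = update_load(load, 4)
print(round(load, 2))  # -> 2.53: the 1-minute average still lags the true load
```

Note how even after a full minute the average lags the actual task count — which is why Task CL1 below asks you to wait until `ldavg-1` stabilizes.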
5 | 6 | 7 | ### Recall: Linux Process State 8 | 9 | From `man 1 ps`: 10 | ``` 11 | D uninterruptible sleep (usually IO) 12 | R running or runnable (on run queue) 13 | S interruptible sleep (waiting for an event to complete) 14 | T stopped, either by a job control signal or because it is being traced 15 | Z defunct ("zombie") process, terminated but not reaped by its parent 16 | ``` 17 | 18 | ### Task CL1: CPU Load 19 | Open 3 terminals (ssh connections). 20 | 21 | 1. Where are the Load Averages? Use the Linux Process States and `man 5 proc` (search for loadavg): 22 | ```bash 23 | (term 1) root:~# man 5 proc 24 | (term 1) root:~# cat /proc/loadavg 25 | 46.26 12.59 4.39 1/106 7023 26 | ``` 27 | 2. Start the disk stress script (**NOTE: Do not run this on your own laptop !!!**): 28 | 29 | ```bash 30 | (term 1) root:~# /bin/sh linux-metrics/scripts/disk/writer.sh 31 | ``` 32 | 33 | 3. Run the following command and look at the Load values for about a minute until `ldavg-1` stabilizes: 34 | 35 | ```bash 36 | (term 2) root:~# sar -q 1 100 37 | ``` 38 | * What is the writing speed of the script? 39 | * What is the current Load Average? Why? Which processes contribute to this number? 40 | ```bash 41 | (term 2) root:~# top 42 | ``` 43 | * What are CPU `%user`, `%sy`, `%IO-wait` and `%idle`? 44 | 45 | 4. While the previous script is running, start a single CPU stress: 46 | 47 | ```bash 48 | (term 3) root:~# stress -c 1 -t 3600 49 | ``` 50 | Wait another minute, and answer the questions above. 51 | 52 | 5. Stop all scripts. 53 | 54 | ### Discussion 55 | 56 | - Why are processes waiting for IO included in the Load Average? 57 | - Assuming we have 1 CPU core and a Load of 5, is our CPU core at 100% utilization? 58 | - How can we know if load is going up or down? 59 | - Does a load average of 70 indicate a problem? 
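One of the discussion questions — how to tell whether load is going up or down — can be answered from a single `/proc/loadavg` sample, because the shorter window reacts to change first. A small sketch (the helper name and the "rising/falling" labels are ours, not standard terminology):

```python
def load_trend(loadavg_line):
    """Classify a /proc/loadavg sample by comparing the 1-minute
    average against the 15-minute average."""
    fields = loadavg_line.split()
    one_min, fifteen_min = float(fields[0]), float(fields[2])
    if one_min > fifteen_min:
        return "rising"   # recent load exceeds the long-term average
    if one_min < fifteen_min:
        return "falling"
    return "steady"

# The sample from Task CL1 shows load climbing sharply:
print(load_trend("46.26 12.59 4.39 1/106 7023"))  # -> rising
```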
60 | 61 | ### Further reading 62 | - [Understanding Linux CPU Load](http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages) 63 | 64 | ### Tools 65 | 66 | - Most tools use `/proc/loadavg` to fetch Load Average and run queue information. 67 | - To get Load Average information over a specific interval of time, you can use: 68 | - `sar -q <interval>` 69 | - `-q` queue length and load averages 70 | - or simply `uptime` 71 | 72 | #### Next: [Context Switches](cpu-ctxt.md) 73 | -------------------------------------------------------------------------------- /docs/cpu-percentage.md: -------------------------------------------------------------------------------- 1 | # CPU Metrics 2 | 3 | You will need 3 ssh terminals 4 | 5 | ## CPU Percentage 6 | Let's start with the most common CPU metric. 7 | Fire up `top`, and let's start figuring out what the different CPU percentage values are. 8 | ```bash 9 | (term 1) root:~# top 10 | ``` 11 | 12 | The output will look like: 13 | ```bash 14 | %Cpu(s): 2.3 us, 0.6 sy, 0.0 ni, 96.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st 15 | ``` 16 | 17 | ### Task CP1: CPU Percentage 18 | For each of the following scripts (`dummy1.sh`, `dummy2.sh`, `dummy3.sh`, `dummy4.sh`) under the `scripts/cpu/` directory: 19 | 20 | 1. Run the script: 21 | ```bash 22 | (term 2) root:~# /bin/sh linux-metrics/scripts/cpu/dummy1.sh 23 | ``` 24 | 2. While the script is running, look at `top` in terminal window 1. 25 | 3. Without looking at the code, try to figure out what the script is doing (find the percentage fields description in `man 1 top`) 26 | 4. Stop the script (use `Ctrl+C` or wait 2 minutes for it to time out) 27 | 5. Verify your answer by reading the script content 28 | 29 | ### Tools 30 | 31 | - Most tools use `/proc/stat` to fetch CPU percentage. Note that it displays the amount of time measured in units of USER_HZ 32 | (1/100ths of a second on most architectures), also called jiffies, and not a percentage. 
33 | - To get a percentage over a specific interval of time, you can use: 34 | - `sar <interval>` 35 | - or `sar -P ALL -u <interval>` (for details on multiple CPUs) 36 | - `-P` per-processor statistics 37 | - `-u` CPU utilization 38 | - or `mpstat` (similar usage and output) 39 | ```bash 40 | (term 3) root:~# sar 1 41 | ``` 42 | or 43 | ```bash 44 | (term 3) root:~# mpstat 1 45 | ``` 46 | 47 | ### Discussion 48 | 49 | - What's the difference between `%IO-wait` and `%idle`? 50 | - Is the entire CPU load created by a process accounted to that process? 51 | 52 | ### Further reading 53 | 54 | - `man proc` 55 | - `man sar` 56 | 57 | #### Time Stolen and Amazon EC2 58 | 59 | You may have noticed the `st` label. From `man 1 top`: 60 | ``` 61 | st : time stolen from this vm by the hypervisor 62 | ``` 63 | Amazon EC2 uses the hypervisor to regulate the machine CPU usage (to match the instance type's EC2 Compute Units). If you see inconsistent stolen percentage over time, then you might be using [Burstable Performance Instances](http://aws.amazon.com/ec2/instance-types/#burst). 64 | 65 | #### Next: [CPU Load](cpu-load.md) 66 | -------------------------------------------------------------------------------- /docs/images/linux_io.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kargig/linux-metrics/bf5a2447bcf67872f2c756d1774f57adcc2f2811/docs/images/linux_io.png -------------------------------------------------------------------------------- /docs/io-usage.md: -------------------------------------------------------------------------------- 1 | # IO Metrics 2 | 3 | You will need 2 ssh terminals 4 | 5 | ## IO Usage 6 | 7 | ### Recall: Linux IO, Merges, IOPS 8 | 9 | Linux IO performance is affected by many factors, including your application workload, choice of file-system, IO scheduler choice (eg. 
[cfq](https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt), [deadline](https://www.kernel.org/doc/Documentation/block/deadline-iosched.txt)), queue configuration, device driver, underlying device(s) caches and more. 10 | 11 | ![Linux IO](images/linux_io.png) 12 | 13 | #### Merged Reads/Writes 14 | 15 | From [the Kernel Documentation](https://www.kernel.org/doc/Documentation/iostats.txt): *"Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O"* 16 | 17 | #### IOPS 18 | 19 | IOPS are input/output operations per second. Some operations take longer than others, eg. HDDs can do sequential reading operations much faster than random writing operations. Here are some rough estimations from [Wikipedia](https://en.wikipedia.org/wiki/IOPS) and [Amazon EBS Product Details](http://aws.amazon.com/ebs/details/): 20 | 21 | | Device/Type | IOPS | 22 | |-----------------------|-----------| 23 | | 7.2k-10k RPM SATA HDD | 75-150 | 24 | | 10k-15k RPM SAS HDD | 140-210 | 25 | | SATA SSD | 1k-120k | 26 | | AWS EC2 gp2 | up to 10k | 27 | | AWS EC2 io1 | up to 20k | 28 | 29 | ### Task I1: IO Usage 30 | 31 | 1. Start by running `iostat`, and examine the output fields. Let's go over the important ones together: 32 | ```bash 33 | (term 1) root:~# iostat -xd 2 34 | ``` 35 | - **rrqm/s** & **wrqm/s**- Number of read/write requests merged per-second. 36 | - **r/s** & **w/s**- Read/Write requests (after merges) per-second. Their sum is the **IOPS**! 37 | - **rkB/s** & **wkB/s**- Number of kB read/written per-second, ie. **IO throughput**. 38 | - **avgqu-sz**- Average request queue size for this device. Check out `/sys/block/<device>/queue/nr_requests` for the maximum queue size. 39 | - **r_await**, **w_await**, **await**- The average time (in ms.) for read/write/both requests to be served, including time spent in the queue, ie. 
**IO latency** 40 | 2. Please write down these fields' values when our system is at rest. 41 | 3. In a new session, let's benchmark our device *write performance* by running: 42 | 43 | ```bash 44 | (term 2) root:~# /bin/sh linux-metrics/scripts/disk/fio1.sh 45 | ``` 46 | 47 | This will clone 16 processes to perform non-buffered (direct) random writes for 3 minutes. 48 | 1. Compare the values you see in `iostat` to the values you wrote down earlier. Do they make sense? 49 | 2. Look at `fio` results and try to see if the number of IOPS makes sense (we are using EBS gp2 volumes). 50 | 4. Repeat the previous task, this time benchmarking **read performance**: 51 | 52 | ```bash 53 | (term 2) root:~# /bin/sh linux-metrics/scripts/disk/fio2.sh 54 | ``` 55 | 56 | 5. Finally, repeat the **read performance** benchmark with 1 process: 57 | 58 | ```bash 59 | (term 2) root:~# /bin/sh linux-metrics/scripts/disk/fio3.sh 60 | ``` 61 | 1. Read about the `svctm` field in `man 1 iostat`. Compare the value we got now to the value we got for 16 processes. Is there a difference? If so, why? 62 | 2. Repeat the previous question for the `%util` field. 63 | 64 | 6. `fio` also supports other IO patterns (by changing the `--rw=` parameter), including: 65 | - `read` Sequential reads 66 | - `write` Sequential writes 67 | - `rw` Mixed sequential reads and writes 68 | - `randrw` Mixed random reads and writes 69 | 70 | If time permits, explore these IO patterns to learn more about EBS gp2's performance under different workloads. 71 | 72 | ### Discussion 73 | 74 | - Why do we need an IO queue? What does it enable the kernel to perform? Read a few more things on IO queue depths [here](https://blog.docbert.org/queue-depth-iops-and-latency/) 75 | - Why are the `svctm` and `%util` iostat fields essentially useless in a modern environment? (read [Marc Brooker's excellent blog post](https://brooker.co.za/blog/2014/07/04/iostat-pct.html)) 76 | - What is the difference in how the kernel handles reads and writes? 
How does that affect metrics and application behaviour? 77 | 78 | ### Tools 79 | 80 | - Most tools use `/proc/diskstats` to fetch global IO statistics. 81 | - Per-process IO statistics are usually fetched from `/proc/[pid]/io`, which is documented in `man 5 proc`. 82 | - From the command-line you can use: 83 | - `iostat -xd <interval>` for per-device information 84 | - `-d` device utilization 85 | - `-x` extended statistics 86 | - `sudo iotop` for a `top`-like interface (easily find the process doing most reads/writes) 87 | - `-o` only show processes or threads actually doing I/O 88 | 89 | #### Next: [Network Utilization](net-util.md) 90 | -------------------------------------------------------------------------------- /docs/memory-usage.md: -------------------------------------------------------------------------------- 1 | # Memory Metrics 2 | 3 | ## Memory Usage 4 | 5 | Is free memory really free? What's the difference between cached memory and buffers? Let's get our hands dirty and find out... 6 | 7 | You will need 3 open terminals for this task. **DO NOT RUN ANY SCRIPTS ON YOUR LAPTOP!** 8 | 9 | ### Task M1: Memory usage, Caches and Buffers 10 | 11 | 1. Fire up `top` on Terminal 1, and write down how much `free` memory you have (**keep it running for the rest of this module**): 12 | ```bash 13 | (term 1) root:~# top 14 | ``` 15 | 2. Start the memory hog `hog.sh` on Terminal 2, and let it run until it gets killed (if it hangs, use `Ctrl+C`): 16 | ```bash 17 | (term 2) root:~# /bin/sh linux-metrics/scripts/memory/hog.sh 18 | ``` 19 | 3. Go to Terminal 1 and compare the current `free` memory to the number you wrote down. Are they (almost) the same? If not, why? 20 | 4. Read about the `Buffers` and `Cached` values in `man 5 proc` (under `meminfo`): 21 | 1. Run the memory hog on Terminal 2 `scripts/memory/hog.sh`: 22 | ```bash 23 | (term 2) root:~# /bin/sh linux-metrics/scripts/memory/hog.sh 24 | ``` 25 | 2. Write down the `buffer` size from Terminal 1. 26 | 3. 
Now run the buffer balloon `buffer.sh` on Terminal 2: 27 | ```bash 28 | (term 2) root:~# /bin/sh linux-metrics/scripts/memory/buffer.sh 29 | ``` 30 | 4. Check the `buffer` size again. 31 | 5. Read the script, and see if you can make sense of the results. 32 | 6. Repeat all 5 steps above with the `cached Mem` value. 33 | 7. Repeat all steps for `cache.sh`: 34 | ```bash 35 | (term 2) root:~# /bin/sh linux-metrics/scripts/memory/cache.sh 36 | ``` 37 | 5. Let's see how `cached Mem` affects application performance: 38 | 1. Drop the cache: 39 | ```bash 40 | (term 2) root:~# echo 3 > /proc/sys/vm/drop_caches 41 | ``` 42 | 2. Time a dummy Python application (you can repeat these 2 steps multiple times): 43 | ```bash 44 | (term 2) root:~# time python -c 'print "Hello World"' 45 | ``` 46 | 3. Now re-run our dummy Python application, but this time without flushing the cached memory. Can you see the difference? 47 | 6. Run the `dentry.py` script and observe the memory usage using `free`. What is using the memory? How does it affect performance? 48 | ```bash 49 | (term 2) root:~# python linux-metrics/scripts/memory/dentry.py 50 | (term 2) root:~# echo 3 > /proc/sys/vm/drop_caches 51 | (term 2) root:~# time ls trash/ >/dev/null 52 | (term 2) root:~# time ls trash/ >/dev/null 53 | ``` 54 | 7. Run the `dentry2.py` script and try dropping the caches. Does it make a difference? 55 | ```bash 56 | (term 2) root:~# python linux-metrics/scripts/memory/dentry2.py 57 | ``` 58 | 59 | ### Discussion 60 | 61 | - What's the difference between `dentry.py` and `dentry2.py`? 62 | - Assuming a server has some amount of free memory, can we assume it has enough memory to support its current workload? If not, why? 63 | - Why wasn't our memory hog able to grab all the `cached` memory? 64 | - Run the following stress test, what do you see? 
65 | ```bash 66 | (term 2) root:~# stress -m 18 --vm-bytes 100M -t 600s 67 | ``` 68 | 69 | 70 | ### Tools 71 | 72 | - Most tools use `/proc/meminfo` to fetch memory usage information. 73 | - A simple example is the `free` utility. 74 | - What does the 2nd line of `free` tell us? 75 | - To get usage information over some period, use `sar -r <interval>`. 76 | - Here you can also see how many dirty pages you have (try running `sync` while `sar` is running). 77 | 78 | #### Next: [IO Usage](io-usage.md) 79 | -------------------------------------------------------------------------------- /docs/net-util.md: -------------------------------------------------------------------------------- 1 | # Network Metrics 2 | 3 | ## Network Utilization 4 | 5 | ### Task NP1: Network Utilization 6 | 7 | 1. Assuming we have two machines connected with Gigabit Ethernet interfaces, what is the maximum expected **throughput in kilobytes**? 8 | 2. For the following task you'll need two machines, or a partner: 9 | 10 | | Machine/s | Command | Notes | 11 | |:---------:|---------|-------| 12 | | A + B | `sar -n DEV 2` | Write down the receive/transmit packets/KB per-second. Keep this running for the entire length of the task | 13 | | A + B | `sar -n EDEV 2` | These are the error statistics, read about them in `man 1 sar`. Keep this running for the entire length of the task | 14 | | A | `ip a` | Write down A's IP address | 15 | | A | `iperf -s` | This will start a server for our little benchmark | 16 | | B | `iperf -c <A's IP> -t 30` | Start the benchmark client for 30 seconds | 17 | | A | `iperf -su` | Replace the previous TCP server with a UDP one | 18 | | B | `iperf -c 172.30.0.251 -u -b 1000M -l 8k` | Repeat the benchmark with UDP traffic | 19 | 20 | 1. When running the client on B, use `sar` data to determine A's link utilization (in %, assuming Gigabit Ethernet). 21 | 2. What are the major differences between TCP and UDP traffic observable with `sar`? 22 | 3. Start to decrease the UDP buffer length (ie. 
from `8k` to `4k`, `2k`, `1k`, `512`, `128`, `64`). 23 | 1. Does the **throughput in KB** increase or decrease? 24 | 2. What about the **throughput in packets**? 25 | 3. Look carefully at the `iperf` client and server report. Can you see any packet loss? Can you also see it in `ifconfig`? 26 | 27 | ### Network Errors 28 | 29 | While Linux provides multiple metrics for network errors, including collisions, errors, and packet drops, the [kernel documentation](https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net-statistics) indicates that the meaning of these metrics is driver-specific, and tightly coupled with your underlying networking layers. 30 | 31 | ### Tools 32 | 33 | - Most tools use `/proc/net/dev` to fetch network device information. 34 | - For example, try running `sar -n DEV <interval>`. 35 | - Connection information for TCP, UDP and raw sockets can be fetched using `/proc/net/tcp`, `/proc/net/udp`, `/proc/net/raw` 36 | - For parsed socket information use `netstat -tuwnp`. 37 | - `-t`, `-u`, `-w`: TCP, UDP and raw sockets 38 | - `-n`: no DNS resolving 39 | - `-p`: the process owning this socket 40 | - The most comprehensive command-line utility is `netstat`, covering metrics from interface statistics to socket information. 41 | - Check `iptraf` for interactive traffic monitoring (no socket information). 42 | - Finally, `nethogs` provides a `top`-like experience, allowing you to find which process is taking up the most bandwidth (TCP only). 43 | 44 | ### Discussion 45 | 46 | - What could be the reasons for packet drops? Which of these reasons can be measured on the receiving side? 47 | - Why can't you see the `%ifutil` value on EC2? 48 | - **Hint**: Network device speed is usually found in `/sys/class/net/<interface>/speed`. 
49 | - **Workaround**: The `nicstat` utility allows you to specify the speed and duplex of your network interface from the command line: 50 | ``` 51 | nicstat -S eth0:1000fd -n 2 52 | ``` 53 | 54 | -------------------------------------------------------------------------------- /docs/references.md: -------------------------------------------------------------------------------- 1 | # References 2 | 3 | This workshop borrows heavily from the following talks/blog-posts: 4 | 5 | - Harald van Breederode's "Understanding Linux Load Average" [Part 1](https://prutser.wordpress.com/2012/04/23/understanding-linux-load-average-part-1/) and [Part 2](https://prutser.wordpress.com/2012/05/05/understanding-linux-load-average-part-2/) 6 | - Brendan Gregg's [Linux Load Averages: Solving the Mystery](http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html) 7 | - Vidar Holen's ["Help! Linux ate my RAM!"](http://www.linuxatemyram.com/play.html) 8 | - Ben Mildren's [Monitoring IO performance using iostat & pt-diskstats](https://www.percona.com/live/mysql-conference-2013/sites/default/files/slides/Monitoring-Linux-IO.pdf) 9 | - Amazon EC2 User Guide for Linux Instances, [Benchmark Volumes](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_piops.html) 10 | - Brendan Gregg's [USE Method: Linux Performance Checklist](http://www.brendangregg.com/USEmethod/use-linux.html) 11 | - Serhiy Topchiy's [Testing Amazon EC2 network speed](http://epamcloud.blogspot.co.il/2013/03/testing-amazon-ec2-network-speed.html) 12 | -------------------------------------------------------------------------------- /packer/README.md: -------------------------------------------------------------------------------- 1 | # AMI builder for Linux Metrics workshop 2 | 3 | ## Usage 4 | 5 | ``` 6 | packer build -var 'aws_access_key=...' -var 'aws_secret_key=...' -var 'subnet_id=...' -var 'vpc_id=...' 
packer.json 7 | ``` 8 | -------------------------------------------------------------------------------- /packer/bootstrap.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $(id -u) -ne 0 ]; then 4 | echo "You must run this script as root. Attempting to sudo" 1>&2 5 | exec sudo -n bash "$0" "$@" 6 | fi 7 | 8 | # Wait for cloud-init 9 | sleep 10 10 | 11 | # sysdig repo 12 | curl -s https://s3.amazonaws.com/download.draios.com/DRAIOS-GPG-KEY.public | apt-key add - 13 | curl -s -o /etc/apt/sources.list.d/draios.list http://download.draios.com/stable/deb/draios.list 14 | 15 | # Install packages 16 | apt-get update 17 | export DEBIAN_PRIORITY=critical 18 | export DEBIAN_FRONTEND=noninteractive 19 | 20 | apt-get -y install procps sysstat stress python2.7 gcc vim vim-youcompleteme linux-tools-common linux-tools-generic linux-tools-$(uname -r) fio iotop iperf iptraf nethogs nicstat git build-essential manpages-dev glibc-doc 21 | apt-get -y install linux-headers-$(uname -r) sysdig 22 | 23 | # BCC 24 | apt-get install -y bison build-essential cmake flex git libedit-dev \ 25 | libllvm3.9 llvm-3.9-dev libclang-3.9-dev python zlib1g-dev libelf-dev luajit luajit-5.1-dev arping netperf 26 | 27 | cd ~ 28 | git clone https://github.com/iovisor/bcc.git 29 | mkdir bcc/build; cd bcc/build 30 | cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local 31 | make install 32 | ldconfig 33 | 34 | cd ~ 35 | LATEST_NETDATA="$(wget -q -O - https://raw.githubusercontent.com/firehol/binary-packages/master/netdata-latest.gz.run)" 36 | wget -q -O /tmp/netdata.gz.run "https://raw.githubusercontent.com/firehol/binary-packages/master/${LATEST_NETDATA}" 37 | bash /tmp/netdata.gz.run --quiet --accept 38 | rm /tmp/netdata.gz.run 39 | 40 | # Change netdata systemd definition (e.g. 
to renice) 41 | cat > /etc/systemd/system/netdata.service <<'EOF' 42 | [Unit] 43 | Description=Real time performance monitoring 44 | After=network.target httpd.service squid.service nfs-server.service mysqld.service mysql.service named.service postfix.service 45 | 46 | [Service] 47 | Type=simple 48 | User=netdata 49 | Group=netdata 50 | ExecStart=/opt/netdata/usr/sbin/netdata -D 51 | 52 | # The minimum netdata Out-Of-Memory (OOM) score. 53 | # netdata (via [global].OOM score in netdata.conf) can only increase the value set here. 54 | # To decrease it, set the minimum here and set the same or a higher value in netdata.conf. 55 | # Valid values: -1000 (never kill netdata) to 1000 (always kill netdata). 56 | OOMScoreAdjust=-1000 57 | 58 | Nice=-10 59 | 60 | # saving a big db on slow disks may need some time 61 | TimeoutStopSec=60 62 | 63 | # restart netdata if it crashes 64 | Restart=on-failure 65 | RestartSec=30 66 | 67 | [Install] 68 | WantedBy=multi-user.target 69 | EOF 70 | 71 | systemctl daemon-reload 72 | systemctl restart netdata.service 73 | 74 | # Disable transparent huge pages 75 | cat > /etc/rc.local <<'EOF' 76 | #!/bin/sh -e 77 | # 78 | # rc.local 79 | # 80 | 81 | THP_DIR="/sys/kernel/mm/transparent_hugepage" 82 | 83 | if test -f $THP_DIR/enabled; then 84 | echo never > $THP_DIR/enabled 85 | fi 86 | 87 | if test -f $THP_DIR/defrag; then 88 | echo never > $THP_DIR/defrag 89 | fi 90 | 91 | exit 0 92 | EOF 93 | 94 | # Clone linux-metrics in the user's home 95 | cat > /etc/profile.d/clone-linux-metrics.sh <<'EOF' 96 | #!/bin/bash 97 | if [[ ! -e ~/linux-metrics ]] && [[ $(id -u) -ne 0 ]]; then 98 | echo "unable to find linux-metrics directory- cloning..." 
99 | pushd ~ 100 | git clone https://github.com/natict/linux-metrics.git 101 | popd 102 | fi 103 | EOF 104 | -------------------------------------------------------------------------------- /packer/packer.json: -------------------------------------------------------------------------------- 1 | { 2 | "variables": { 3 | "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}", 4 | "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}", 5 | "subnet_id": "", 6 | "vpc_id": "", 7 | "aws_region": "eu-west-1", 8 | "source_ami": "ami-05801d0a3c8e4c443" 9 | }, 10 | "builders": [ 11 | { 12 | "name": "linux-metrics", 13 | "type": "amazon-ebs", 14 | "access_key": "{{user `aws_access_key`}}", 15 | "secret_key": "{{user `aws_secret_key`}}", 16 | "region": "{{user `aws_region`}}", 17 | "source_ami": "{{user `source_ami`}}", 18 | "instance_type": "m4.large", 19 | "ssh_username": "ubuntu", 20 | "ami_name": "linux-metrics-{{timestamp}}", 21 | "subnet_id": "{{user `subnet_id`}}", 22 | "vpc_id": "{{user `vpc_id`}}" 23 | } 24 | ], 25 | "provisioners": [ 26 | { 27 | "type": "shell", 28 | "script": "bootstrap.sh" 29 | } 30 | ] 31 | } 32 | -------------------------------------------------------------------------------- /saltstack/README.md: -------------------------------------------------------------------------------- 1 | ### Saltstack installation of target hosts for the Linux Metrics Workshop 2 | 3 | Tested on Debian 8 (Jessie). 4 | 5 | Requirements: 6 | 7 | 1. Root/User with sudo rights and the same key on all target hosts 8 | 2. SSHd running on all target hosts and accepting password authentication 9 | 3. Python installed on target hosts (check tools directory) 10 | 4. 
Control machine with salt-ssh installed 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /saltstack/Saltfile: -------------------------------------------------------------------------------- 1 | salt-ssh: 2 | config_dir: ./ 3 | log_file: /tmp/salt/master.log 4 | log_level_logfile: debug 5 | ssh_max_procs: 30 6 | ssh_wipe: False 7 | roster_file: ./roster 8 | -------------------------------------------------------------------------------- /saltstack/master: -------------------------------------------------------------------------------- 1 | conf_file: ./master 2 | cachedir: /tmp/saltcache 3 | file_roots: 4 | base: 5 | - ./states 6 | pillar_roots: 7 | base: 8 | - ./pillars 9 | timeout: 45 10 | pki_dir: ./pki/salt/master 11 | state_verbose: True 12 | roster_defaults: 13 | user: root 14 | priv: ~/.ssh/id_rsa 15 | thin_dir: /tmp/salt-thin 16 | -------------------------------------------------------------------------------- /saltstack/passwords/README.md: -------------------------------------------------------------------------------- 1 | ## Print generated password 2 | 3 | The users salt state has created the `workshop` user, with a password made of the first 10 characters of the server's salt "UUID" grain. 
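The derivation is simple enough to reproduce by hand; a minimal Python sketch, using the sample values from `uuids.json`:

```python
# Sketch of the password derivation used by the users state
# ({%- set password=grains['uuid'][:10] %} in states/users/init.sls):
# the password is simply the first 10 characters of the host's UUID grain.
# Sample UUIDs taken from uuids.json.
uuids = {
    "ws1": "136fe95c-6a97-3147-be25-946e3270eef9",
    "ws2": "b3fcaf3a-12b5-6f43-b21a-8062f0b391ad",
}

for host, full_uuid in sorted(uuids.items()):
    print("{0}: {1}".format(host, full_uuid[:10]))  # e.g. ws1: 136fe95c-6
```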
All servers' full UUIDs can be printed in JSON format by running: 4 | ``` 5 | $ salt-ssh ws* grains.get uuid --out=json --static > uuids.json 6 | ``` 7 | Passwords can be printed on screen by running: 8 | ``` 9 | $ python parseme.py 10 | ``` 11 | -------------------------------------------------------------------------------- /saltstack/passwords/parseme.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | # Print "host: password" pairs from the UUIDs collected with salt-ssh; 4 | # each password is the first 10 characters of the host's UUID. 5 | raw_data = json.load(open('uuids.json')) 6 | raw_list = [] 7 | for key, value in raw_data.items(): 8 | raw_list.append((key, value[:10])) 9 | 10 | user_list = sorted(raw_list) 11 | for user in user_list: 12 | print "{0}.example.com: {1}".format(user[0], user[1]) 13 | -------------------------------------------------------------------------------- /saltstack/passwords/sample_sorted.txt: -------------------------------------------------------------------------------- 1 | ws1.example.com: 9137141d-2 2 | ws2.example.com: 48d90b48-7 3 | -------------------------------------------------------------------------------- /saltstack/passwords/uuids.json: -------------------------------------------------------------------------------- 1 | { 2 | "ws1": "136fe95c-6a97-3147-be25-946e3270eef9", 3 | "ws2": "b3fcaf3a-12b5-6f43-b21a-8062f0b391ad" 4 | } 5 | -------------------------------------------------------------------------------- /saltstack/pillars/defaults/init.sls: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kargig/linux-metrics/bf5a2447bcf67872f2c756d1774f57adcc2f2811/saltstack/pillars/defaults/init.sls -------------------------------------------------------------------------------- /saltstack/pillars/top.sls: -------------------------------------------------------------------------------- 1 | base: 2 | '*': 3 | - defaults 4 | 
-------------------------------------------------------------------------------- /saltstack/roster: -------------------------------------------------------------------------------- 1 | ws1: 2 | host: 'ws1.example.com' 3 | ws2: 4 | host: 'ws2.example.com' 5 | -------------------------------------------------------------------------------- /saltstack/states/bootstrap/init.sls: -------------------------------------------------------------------------------- 1 | update: 2 | cmd.run: 3 | - name: 'apt-get -qq update' 4 | - order: 1 5 | upgrade: 6 | module.run: 7 | - name: pkg.upgrade 8 | - dist_upgrade: True 9 | - order: 2 10 | 11 | draios_repo: 12 | pkgrepo.managed: 13 | - humanname: Draios 14 | - name: deb http://download.draios.com/stable/deb stable-$(ARCH)/ 15 | - file: /etc/apt/sources.list.d/draios.list 16 | - keyid: EC51E8C4 17 | - keyserver: keyserver.ubuntu.com 18 | - require: 19 | - pkg: base_packages 20 | 21 | base_packages: 22 | pkg.installed: 23 | - pkgs: 24 | - wget 25 | - curl 26 | - telnet 27 | - strace 28 | - git 29 | - tree 30 | - rsync 31 | - less 32 | - sysstat 33 | - ethtool 34 | - screen 35 | - bzip2 36 | - lsof 37 | - sudo 38 | - haveged 39 | - file 40 | - lftp 41 | - unzip 42 | - htop 43 | - heirloom-mailx 44 | 45 | - python-apt 46 | - pkg-config 47 | - multitail 48 | - vim 49 | - ngrep 50 | - netcat-openbsd 51 | - dnsutils 52 | - vim-scripts 53 | - iptables-persistent 54 | - bc 55 | - aptitude 56 | - python-msgpack 57 | - python-m2crypto 58 | - python-pip 59 | - lsb-release 60 | - debconf-utils 61 | - apt-transport-https 62 | - psmisc 63 | - atop 64 | - iotop 65 | - whois 66 | - procps 67 | 68 | - stress 69 | - linux-tools 70 | - fio 71 | - iptraf 72 | - nethogs 73 | - iperf 74 | - nicstat 75 | - linux-headers-amd64 76 | - fortunes 77 | - bash-completion 78 | 79 | 80 | linux_metrics_workshop: 81 | git.latest: 82 | - name: https://github.com/kargig/linux-metrics/ 83 | - rev: master 84 | - target: 
/root/linux-metrics 85 | - remote: origin 86 | - force_reset: True 87 | 88 | netdata: 89 | cmd.run: 90 | - name: curl https://my-netdata.io/kickstart-static64.sh >/tmp/kickstart-static64.sh && sh /tmp/kickstart-static64.sh --dont-wait 91 | - unless: ls -la /opt/netdata/ 92 | - require: 93 | - pkg: base_packages 94 | service.running: 95 | - require: 96 | - cmd: netdata 97 | - watch: 98 | - file: netdata 99 | file.managed: 100 | - name: /etc/systemd/system/netdata.service 101 | - source: salt://bootstrap/netdata.service 102 | - require: 103 | - cmd: netdata 104 | rc.local: 105 | file.managed: 106 | - name: /etc/rc.local 107 | - source: salt://bootstrap/rc.local 108 | - mode: 700 109 | 110 | rc-local.service: 111 | file.managed: 112 | - name: /etc/systemd/system/rc-local.service 113 | - source: salt://bootstrap/rc-local.service 114 | service.running: 115 | - enable: True 116 | - watch: 117 | - file: rc-local.service 118 | - require: 119 | - file: rc-local.service 120 | 121 | sysstat: 122 | file.replace: 123 | - name: /etc/default/sysstat 124 | - pattern: "^ENABLED=.*" 125 | - repl: "ENABLED=true" 126 | - append_if_not_found: True 127 | service.running: 128 | - enable: True 129 | - watch: 130 | - file: /etc/default/sysstat 131 | 132 | locale: 133 | file.replace: 134 | - name: /etc/locale.gen 135 | - pattern: "^# el_GR.UTF-8 UTF-8" 136 | - repl: "el_GR.UTF-8 UTF-8" 137 | - append_if_not_found: True 138 | cmd.wait: 139 | - name: locale-gen 140 | - watch: 141 | - file: /etc/locale.gen 142 | 143 | iptables: 144 | file.managed: 145 | - name: /etc/iptables/rules.v4 146 | - source: salt://bootstrap/iptables_redirect 147 | cmd.wait: 148 | - name: iptables-restore < /etc/iptables/rules.v4 149 | - watch: 150 | - file: iptables -------------------------------------------------------------------------------- /saltstack/states/bootstrap/rc.local: -------------------------------------------------------------------------------- 1 | #!/bin/sh -e 2 | THP_DIR="/sys/kernel/mm/transparent_hugepage" 3 | if test -f $THP_DIR/enabled; then 4 | echo never > $THP_DIR/enabled 5 | fi 6 | if test -f $THP_DIR/defrag; then 7 | echo never > $THP_DIR/defrag 8 | fi 9 | echo 0 > /sys/devices/system/cpu/cpu1/online 10 | exit 0 11 | -------------------------------------------------------------------------------- /saltstack/states/control/init.sls: 
-------------------------------------------------------------------------------- 1 | salt: 2 | pkg.latest: 3 | - pkgs: 4 | - python2.7 5 | - python-apt 6 | - python-msgpack 7 | - python-m2crypto 8 | - python-pip 9 | - apt-transport-https 10 | - pkg-config 11 | 12 | salt-ssh: 13 | pip.installed: 14 | - require: 15 | - pkg: salt 16 | 17 | -------------------------------------------------------------------------------- /saltstack/states/top.sls: -------------------------------------------------------------------------------- 1 | base: 2 | '*': 3 | - bootstrap 4 | - users 5 | -------------------------------------------------------------------------------- /saltstack/states/users/init.sls: -------------------------------------------------------------------------------- 1 | {%- set password=grains['uuid'][:10] %} 2 | sudo: 3 | pkg.installed 4 | 5 | workshop: 6 | user.present: 7 | - home: /home/workshop 8 | - password: {{password}} 9 | - hash_password: True 10 | - shell: /bin/bash 11 | - fullname: workshop user 12 | - require: 13 | - pkg: sudo 14 | workshop_bashrc: 15 | file.managed: 16 | - name: /home/workshop/.bashrc 17 | - user: workshop 18 | - group: workshop 19 | - source: salt://users/root_bashrc 20 | - require: 21 | - user: workshop 22 | 23 | /etc/sudoers.d/workshop: 24 | file.managed: 25 | - source: salt://users/workshop_sudoers 26 | - user: root 27 | - group: root 28 | - mode: 440 29 | - template: jinja 30 | - check_cmd: /usr/sbin/visudo -c -f 31 | - require: 32 | - pkg: sudo 33 | - user: workshop 34 | 35 | root_bashrc: 36 | file.managed: 37 | - name: /root/.bashrc 38 | - source: salt://users/root_bashrc 39 | 40 | 41 | linux_metrics_workshop_2: 42 | git.latest: 43 | - name: https://github.com/kargig/linux-metrics/ 44 | - rev: master 45 | - target: /home/workshop/linux-metrics 46 | - remote: origin 47 | - user: workshop 48 | - force_reset: True 49 | - require: 50 | - user: workshop 51 | 
-------------------------------------------------------------------------------- /saltstack/states/users/root_bashrc: -------------------------------------------------------------------------------- 1 | ## set colors for GNU ls 2 | export LS_COLORS='*.swp=00;44;37:*,v=5;34;93:*.vim=35:no=0:fi=0:di=32:ln=36:or=1;40:mi=1;40:pi=31:so=33:bd=44;37:cd=44;37:ex=35:*.jpg=1;32:*.jpeg=1;32:*.JPG=1;32:*.gif=1;32:*.png=1;32:*.ppm=1;32:*.pgm=1;32:*.pbm=1;32:*.c=1;32:*.C=1;33:*.h=1;33:*.cc=1;33:*.awk=1;33:*.pl=1;33:*.gz=31:*.tar=31:*.zip=31:*.lha=1;31:*.lzh=1;31:*.arj=1;31:*.bz2=31:*.tgz=31:*.taz=1;31:*.html=36:*.htm=1;34:*.doc=1;34:*.txt=1;34:*.o=1;36:*.a=1;36' 3 | export LS_OPTIONS='--color=auto' 4 | alias ls='ls $LS_OPTIONS' 5 | alias ll='ls $LS_OPTIONS -l' 6 | alias l='ls $LS_OPTIONS -lA' 7 | export LESSCHARSET=utf-8 8 | export EDITOR=vim 9 | # custom exports for coloured less 10 | export LESS_TERMCAP_mb=$'\E[01;31m' 11 | export LESS_TERMCAP_md=$'\E[01;31m' 12 | export LESS_TERMCAP_me=$'\E[0m' 13 | export LESS_TERMCAP_se=$'\E[0m' 14 | export LESS_TERMCAP_so=$'\E[01;47;34m' 15 | export LESS_TERMCAP_ue=$'\E[0m' 16 | export LESS_TERMCAP_us=$'\E[01;32m' 17 | 18 | -------------------------------------------------------------------------------- /saltstack/states/users/workshop_sudoers: -------------------------------------------------------------------------------- 1 | %fosscomm ALL=(ALL:ALL) NOPASSWD:ALL 2 | -------------------------------------------------------------------------------- /saltstack/tools/install_python.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | DOMAIN='example.com' 3 | for i in 1 2 ; do ssh ws${i}.${DOMAIN} 'DEBIAN_FRONTEND=noninteractive apt-get -y -qq install python2.7 >/dev/null' ; done 4 | -------------------------------------------------------------------------------- /scripts/cpu/dummy1.sh: 
-------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | stress -q -c 1 -t 120 4 | -------------------------------------------------------------------------------- /scripts/cpu/dummy2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | stress -q -m 1 -t 120 4 | -------------------------------------------------------------------------------- /scripts/cpu/dummy3.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | nice -n 10 -- stress -q -c 1 -t 120 4 | -------------------------------------------------------------------------------- /scripts/cpu/dummy4.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | stress -q -d 1 -t 120 4 | -------------------------------------------------------------------------------- /scripts/cpu/dummy_app.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import time 4 | import math 5 | 6 | def percentile(N, percent): 7 | if not N: 8 | return None 9 | N = sorted(N) 10 | k = (len(N)-1) * percent 11 | f = math.floor(k) 12 | c = math.ceil(k) 13 | if f == c: 14 | return N[int(k)] 15 | d0 = N[int(f)] * (c-k) 16 | d1 = N[int(c)] * (k-f) 17 | return d0+d1 18 | 19 | def foo(): 20 | for i in xrange(20000): 21 | x = math.sqrt(i) 22 | 23 | if __name__ == "__main__": 24 | m = [] 25 | 26 | for _ in xrange(5000): 27 | start = time.time() 28 | foo() 29 | m.append(time.time() - start) 30 | 31 | print "50th, 75th, 90th and 99th percentile: %f, %f, %f, %f" % ( 32 | percentile(m, 0.5), percentile(m, 0.75), percentile(m, 0.9), percentile(m, 0.99)) 33 | -------------------------------------------------------------------------------- /scripts/disk/fio1.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | fio --directory=/tmp --name 
fio_test_file --direct=1 --rw=randwrite --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap 4 | -------------------------------------------------------------------------------- /scripts/disk/fio2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | fio --directory=/tmp --name fio_test_file --direct=1 --rw=randread --bs=16k --size=100M --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap 4 | -------------------------------------------------------------------------------- /scripts/disk/fio3.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | fio --directory=/tmp --name fio_test_file --direct=1 --rw=randread --bs=16k --size=100M --numjobs=1 --time_based --runtime=180 --group_reporting --norandommap 4 | -------------------------------------------------------------------------------- /scripts/disk/writer.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | while true; do dd if=/dev/zero of=/tmp/test.1G bs=1M count=1024; done 4 | -------------------------------------------------------------------------------- /scripts/memory/buffer.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | root_dev="$(mount | grep 'on / ' | cut -d' ' -f1)" # /dev/xvda 4 | 5 | # Read 4GB of blocks from the root device (bs=1M * count=4096) 6 | dd if=$root_dev of=/dev/null bs=1M count=4096 7 | -------------------------------------------------------------------------------- /scripts/memory/cache.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Read all readable files in /usr (Generates ~1.5GB cache) 4 | # find /usr -readable -type f -exec dd if={} of=/dev/null status=none \; 5 | 6 | # Just create a large file 7 | SIZE_IN_MB=4096 8 | dd if=/dev/zero of=/tmp/bigfile bs=1MB 
count=${SIZE_IN_MB} 9 | 10 | echo "Press Enter to continue..."; read 11 | #rm /tmp/bigfile 12 | -------------------------------------------------------------------------------- /scripts/memory/dentry.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import uuid 5 | import sys 6 | import time 7 | 8 | try: 9 | directory = sys.argv[1] 10 | except IndexError: 11 | directory = "./trash" 12 | 13 | if not os.path.exists(directory): 14 | os.makedirs(directory) 15 | 16 | try: 17 | t_end = time.time() + 30 18 | while time.time() < t_end: 19 | open(os.path.join(directory, str(uuid.uuid4())), 'w').close() 20 | except Exception: 21 | raw_input("press any key to terminate") 22 | -------------------------------------------------------------------------------- /scripts/memory/dentry2.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import time 5 | 6 | try: 7 | t_end = time.time() + 30 8 | while time.time() < t_end: 9 | os.makedirs("t") 10 | os.chdir("t") 11 | except Exception: 12 | raw_input("press any key to terminate\n") 13 | -------------------------------------------------------------------------------- /scripts/memory/hog.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | HOG_C=/tmp/hog.c 4 | HOG=/tmp/hog 5 | rm -rf $HOG $HOG_C 6 | 7 | cat >$HOG_C <<'EOF' 8 | #include <stdio.h> 9 | #include <stdlib.h> 10 | #include <unistd.h> 11 | 12 | 13 | int main(int argc, char* argv[]) { 14 | long page_size = sysconf(_SC_PAGESIZE); 15 | 16 | long count = 0; 17 | while(1) { 18 | char* tmp = (char*) malloc(page_size); 19 | if (tmp) { 20 | tmp[0] = 0; 21 | count += page_size; 22 | if (count % (page_size*1024) == 0) { 23 | printf("Allocated %ld KB\n", count/1024); 24 | usleep(10000); 25 | } 26 | } 27 | } 28 | 29 | return 0; 30 | } 31 | EOF 32 | 33 | gcc -o $HOG $HOG_C 34 | 35 | exec $HOG 36 | 
--------------------------------------------------------------------------------