1 | # Simple Awk
Is awk simple? Or will this guide make learning awk simple? Ha-ha, you can decide that after you read this guide.
3 |
4 | # Recommended books (not written by me)
5 | [Definitive Guide to sed - by Daniel Goldman](https://amzn.to/3cnzQCS)
6 |
7 | [Sed & Awk - Dale Dougherty & Arnold Robbins](https://amzn.to/3kKck7Y)
8 |
9 | [Effective awk programming - by Arnold Robbins](https://amzn.to/30scfhM)
10 |
11 | # More guides that I wrote
12 | [useful-sed](https://github.com/adrianscheff/useful-sed) - Useful sed tips, techniques & tricks for daily usage
13 |
14 | [wizardly-tips-vim](https://github.com/adrianscheff/wizardly-tips-vim) - Less known Vim tips & tricks
15 |
16 | [quick-grep](https://github.com/adrianscheff/quick-grep) - Quick grep reference and tutorial.
17 |
18 | [convenient-utils-linux](https://github.com/adrianscheff/convenient-utils-linux) - Linux utils to make life easier and more convenient.
19 |
20 | -----
21 |
22 | ### Before you start
23 | * PRACTICE! Don't just read the commands. Type them into the terminal yourself. Experiment!
24 | * Create some dummy files that will make experimentation easy. Use them with the examples below. Expand on them.
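* For example, a couple of throwaway files in the spirit of the examples below (the names `f1` and `f2` are just suggestions) can be created like this:
```
printf 'a story by frodo and bilbo\nGollum likes fish\nthe fellowship sets out\n' > f1
printf 'bilbo baggins again\nToo many orcs.\n' > f2
```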
25 |
26 |
27 | ### Prerequisites
* Make sure you have 'gawk' installed. It has more features than the usual default 'mawk'. You can see which implementation of awk you're using by typing `man awk` and looking at the header (with gawk, `awk --version` works too). If you have 'mawk', install gawk with `sudo apt-get install gawk`
29 |
30 | ### Intro
31 | * Awk operates on records and fields. A record is by default a line. A field is a "word" by default. You can change how a record/field is defined by changing the field separator var (FS) and the record separator (RS).
32 | * You give awk some files on which to operate. Awk starts reading records (lines) from these files one by one. It splits each record (line) into fields (words).
33 | * In your awk program you have code that will look like this `/bilbo/ {print $0}`. `/bilbo/` is a pattern. `{print $0}` is an action (inside the curly braces).
34 | * On each record (line) that matches pattern (`/bilbo/`) execute actions inside curly braces (`{print $0}`).
35 | * The special variable `$0` represents the whole record (line). `print` prints.
36 |
37 |
38 | ### Note about lines/records fields/words
39 | * Usually you'll want to keep working with records which are "lines". That is - strings delimited by newlines. I'll also be referring to records as lines to make things simpler.
40 | * You can change how awk interprets records. Then records will be other things which are not "lines" (like strings separated by commas for example).
* The same goes for fields. Fields are strings separated by whitespace by default. You can of course change the separator to a comma or something else.
* To recap: a record is a line by default, but you can change that by changing RS (the Record Separator). A field is a word (separated by whitespace) by default, but you can change that by changing FS (the Field Separator) - see the sketch right below.
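* A minimal sketch of what changing the separators looks like (assuming a comma-separated line and a semicolon-separated "file"):
```
echo "frodo,sam,merry,pippin" | awk 'BEGIN{FS=","}{print $2}'
# sam

printf 'one;two;three' | awk 'BEGIN{RS=";"}{print NR, $0}'
# 1 one
# 2 two
# 3 three
```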
43 |
44 |
45 | ### How to call awk
46 | * You call it like `awk '{print $0}' file1 file2`
* In the examples below I sometimes show only the awk program (the text inside ''), not the full blown bash command.
* Once your awk program grows to more than a rule or two, it's easier to put it in an awk script. Make a new file and put inside it:
49 | ```
50 | #!/usr/bin/awk -f
51 | BEGIN {print "BEGINNING"}
/Gollum/ {print "I like it raw and wriggling"}
53 | ```
54 | * Name it something like `myscript.awk` and make it executable with `chmod +x myscript.awk`
55 | * Call it with `./myscript.awk file1 file2`
56 |
57 |
58 | ### Simple pattern
59 |
60 | * So you can read this awk program `/bilbo/ {print $0}` as:
61 | >> for each record (line) in all the files passed to awk
62 | >>> if record (line) matches `/bilbo/` pattern
63 | >>>> print the whole record (line)
64 |
65 | ### Field
66 | * How about `/bilbo/ {print $1}`? `$1` represents the first field (word) from the current record (line). `$2` is second field (word) and so on.
67 | * You can read `/bilbo/ {print $1}` as:
68 | >> for each record (line) in all the files passed to awk
69 | >>> if record (line) matches `/bilbo/` pattern
70 | >>>> print the first field (word) from the record (line)
71 |
72 |
73 | ### Pattern AND Pattern
74 | * Patterns can be more complex. Check this out `/bilbo/&&/frodo/{print "my precious"}`
75 | * You can read this as:
76 | >> On each record (line) that matches `/bilbo/` AND `/frodo/`
77 | >>> print the string "my precious"
78 |
79 | ### Pattern OR Pattern
80 | * Check this other pattern out `/bilbo/||/frodo/{print "Is it you mister Frodo?"}`
81 | * You can read this as:
82 | >> On each record (line) that matches `/bilbo/` OR `/frodo/`
83 | >>> print the string "Is it you mister Frodo?"
84 |
85 |
86 |
87 | ### NOT Pattern
88 | * What about `! /frodo/ { print "Pohtatoes" }`? (note the extra spaces put there for clarity. You could also eliminate them to save typing)
89 | * Read it as:
90 | >> On each record (line) that DOESN'T match `/frodo/`
91 | >>> Print "Pohtatoes"
92 |
93 | ### IF Pattern present then check for Pattern, ELSE check for Pattern
94 | * Here's a more complex example `/frodo/ ? /ring/ : /orcs/{ print "Either frodo with the ring, or the orcs" }`
95 | * `a?b:c` is a ternary operator. It reads: if a then do b, else do c.
96 | * Read it as:
97 | >> Read record.
98 | >>> If it matches `/frodo/`
99 | >>>> Does it also match `/ring/`? If yes then print "Either frodo with the ring, or the orcs"
100 | >>> If it doesn't match `/frodo/`
101 | >>>> Does it match `/orcs/`? If yes then print "Either frodo with the ring, or the orcs"
* The action is executed if the record (line) either contains "frodo" together with "ring", or contains no "frodo" but does contain "orcs".
103 |
104 |
105 | ### Pattern Range
106 | * How about this one? `/Shire/ , /Osgiliath/ { print $0 }`?
107 | * The comma separated regex expressions are a "pattern range".
108 | * Read this as:
109 | >> Execute command for each record (line)
110 | >>> Between the record (line) that matches `/Shire/` (including that record (line))
111 | >>> And record (line) that matches `/Osgiliath/` (including that record(line))
112 | * If you have a file that looks like:
113 | ```
114 | What is it mister Frodo?
115 | Do you miss the Shire?
116 | I miss the shire too.
This Osgiliath is too drab for me.
118 | Too many orcs.
119 | ```
120 | * The above command will simply print the records(lines):
121 | ```
122 | Do you miss the Shire?
123 | I miss the shire too.
This Osgiliath is too drab for me.
125 | ```
126 |
127 |
128 | ### BEGIN Pattern
129 | * You'll be interested in this one. `BEGIN {print "And so it begins"}`
130 | * BEGIN is a special pattern, triggered right at the beginning, before any input from files is read.
* NOTE - BEGIN will execute its command even if no files are passed
132 | * It reads:
133 | >> Before any input is read
>>> Print "And so it begins"
135 |
136 | ### END Pattern
137 | * If something begins, it has to end, right? `END {print "There and back, by Bilbo Baggins"}`
* END is triggered when awk has finished reading all the input. So it needs some input (files or stdin) to finish reading before it triggers
139 | * It reads:
140 | >> After all input was read
141 | >>> Print "There and back, by Bilbo Baggins"
142 |
143 |
144 | ### BEGINFILE, ENDFILE Patterns
* If you pass multiple files to awk it treats them as one contiguous stream of input. But what if you want to run some commands when awk starts (or finishes) reading a particular file? That's what BEGINFILE and ENDFILE (gawk extensions) are for.
146 | * `BEGINFILE {print "A new chapter is beginning mister Frodo"}`
147 | * It reads:
148 | >> Before input is read from a file
149 | >>> print "A new chapter is beginning mister Frodo"
150 |
151 | ### NO Pattern
* This is a simple one: `{print $1}`.
153 | * When no pattern is provided, just the command in curly braces, it is executed for all records (lines).
154 | * This reads
155 | >> For all records (lines):
156 | >>> print the first word
157 |
158 | ### Conditional Pattern
159 | * `NF>4{print}`
160 | * The pattern is `NF>4`. If true then exec commands inside curly braces.
161 | * NF is the Number of Fields(words). If the current record (line) has more than 4 fields (words), print the record (line)
162 |
163 |
164 |
165 | ### About commands
166 | * All commands MUST be inside curly braces. You can't just call `print $0`. You have to put that command inside curly braces - `{print $0}`
167 |
168 |
169 | ### Variables
170 | * Awk has some built in variables. They are written in uppercase.
171 | * They can be useful in a variety of cases.
172 | * `FILENAME` is the name of the current file being processed.
* `BEGINFILE {print "we are beginning to process " FILENAME}` will print the string followed by the name of the file when awk begins processing each file.
174 | * `NR` is the Record Number (or Number Record if you'd like). In simple terms it's the record (line) count. If awk is processing line number 5 then NR is 5.
175 | * Let's say you have a 10 line file and a 5 line file. You pass both files to awk. Awk finishes the first 10 lines and is now at line 3 from the second file. How much will the NR be?
* You might be tempted to say 3 - but it's 13. NR counts ALL the records (lines) awk has processed so far, across all files, not just the records (lines) belonging to the current file.
* If you want the record (line) count for the current file you use FNR. F - from File. File Number Record.
178 | * In the case above NR would be 13 but FNR would be 3.
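* A quick one-liner to see the difference (assuming `f1` and `f2` are two small dummy files with two lines each) - the output will look something like the comments:
```
awk '{print FILENAME, "NR="NR, "FNR="FNR}' f1 f2
# f1 NR=1 FNR=1
# f1 NR=2 FNR=2
# f2 NR=3 FNR=1
# f2 NR=4 FNR=2
```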
179 |
180 | ### More Variables
181 | * Here's a list of the most important variables with a short description
182 | * `FILENAME` - name of file
183 | * `FNR` - File Number Record (input record number for current file - according to official docs). File record (line) count.
184 | * `FS` - Field separator (word separator if you'd like). Space by default.
* `IGNORECASE` - if set to 1, case is ignored (this one is a gawk extension). Very important. `/frodo/` would match `Have you seen my old ring Frodo?` if IGNORECASE is 1. If set to 0 it would not. Use it like this:
186 | ```
187 | BEGIN {IGNORECASE=1}
188 | /frodo/ {print "do you remember the taste of strawberries Frodo?"}
189 | ```
190 | * `NF` - number of fields (words) in the current record (line). You could use this to count words for example.
191 | * `NR` - Number Record. Global record (line) count if you will.
192 | * `RS` - Record Separator. A newline by default.
193 |
194 |
195 | ### Programming intro
196 | * Awk is a full blown programming language. It has all the operators & syntax you would expect from a "regular" programming language such as C or python.
197 | * This is taken directly from the manual
198 | ```
199 | Operators
200 | The operators in AWK, in order of decreasing precedence, are:
201 |
202 | (...) Grouping
203 |
204 | $ Field reference.
205 |
206 | ++ -- Increment and decrement, both prefix and postfix.
207 |
208 | ^ Exponentiation (** may also be used, and **= for the assignment operator).
209 |
210 | + - ! Unary plus, unary minus, and logical negation.
211 |
212 | * / % Multiplication, division, and modulus.
213 |
214 | + - Addition and subtraction.
215 |
216 | space String concatenation.
217 |
218 | | |& Piped I/O for getline, print, and printf.
219 |
220 | < > <= >= == !=
221 | The regular relational operators.
222 |
223 | ~ !~ Regular expression match, negated match. NOTE: Do not use a constant regular expression (/foo/) on the left-hand side of a ~ or !~. Only use one on the right-hand side. The expression
224 | /foo/ ~ exp has the same meaning as (($0 ~ /foo/) ~ exp). This is usually not what you want.
225 |
226 | && Logical AND.
227 |
228 | || Logical OR.
229 |
230 | ?: The C conditional expression. This has the form expr1 ? expr2 : expr3. If expr1 is true, the value of the expression is expr2, otherwise it is expr3. Only one of expr2 and expr3 is
231 | evaluated.
232 |
233 | = += -= *= /= %= ^=
234 | Assignment. Both absolute assignment (var = value) and operator-assignment (the other forms) are supported.
235 |
236 | Control Statements
237 | The control statements are as follows:
238 |
239 | if (condition) statement [ else statement ]
240 | while (condition) statement
241 | do statement while (condition)
242 | for (expr1; expr2; expr3) statement
243 | for (var in array) statement
244 | break
245 | continue
246 | delete array[index]
247 | delete array
248 | exit [ expression ]
249 | { statements }
250 | switch (expression) {
251 | case value|regex : statement
252 | ...
253 | [ default: statement ]
254 | }
255 |
256 | ```
257 |
258 |
259 | ### Programming usage
260 | * Here's how you could use the "programming" part of awk:
261 | ```
262 | #!/usr/bin/awk -f
263 | BEGIN {
264 | IGNORECASE=1
265 | hobitses=0
266 | }
267 | /fellowship/ {
268 | if (index($0,"samwise") >0 ) {
269 | hobitses+=1
270 | print "Hurry up hobitses"
271 | }
272 | }
273 | END {
274 | print "Found a total of " hobitses " hobitses"
275 | }
276 | ```
277 |
278 | * Let's break this down.
279 | * You see that this is an awk script because of the shebang (the line beginning with `#!/...`)
280 | * Before reading any input set the built in var IGNORECASE to 1. "frodo" will match "Frodo", "FRODO", "frodo", etc.
281 | * We also set a custom variable for our own usage and set its initial value to 0 (hobitses)
282 | * Check all records (lines) and on those that match `/fellowship/` execute the following:
>> Check if we have the string "samwise" inside the record (line). index() is a built in function. It takes two strings. If the second string is contained within the first it returns its position (starting at 1), otherwise it returns 0.
284 | >> If index() returns a value bigger than 0 ("samwise" was found inside the current record (line)) do the following:
285 | >>> increase hobitses by 1
286 | >>> print a message
287 | * After processing all the records print a message with value of our custom variable "hobitses".
288 |
289 |
290 | ### Options
291 | * You can pass options to awk. Here are some useful ones:
* `-f` read the awk program from a source file: `awk -f source.awk file.txt`. You put all the awk code in source.awk. NOTE - with `-f` you don't need the shebang line (`#!/usr/bin/awk -f`); see the sketch after this list.
* `-F` - field separator. Use it to define what a "word" is. For example if you have a .csv you can make the separator a comma, like this: `awk -F, '{print $2}' file.txt` - will print the second "word", using the comma as the separator.
294 | * `-v` assign a variable. Eg: `awk -v count=0 '/bilbo/{count+=1;print "Found another one. Now count is " count}' f1`. init count to 0. On records matching `/bilbo/` increment count by 1. Print the message.
* `-e` - pass program text. Useful for passing multiple chunks of awk code. Eg: `awk -e 'BEGIN {IGNORECASE=1}' -e '/bilbo/{print "Found him"}' file.txt`
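* For example, a hypothetical `source.awk` to be used with `-f` could look like this:
```
# source.awk - no shebang needed, we invoke awk explicitly
BEGIN   {IGNORECASE=1}
/bilbo/ {print "Found him on line", FNR}

# run it with:
#   awk -f source.awk file.txt
```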
296 |
297 |
298 | ### String concatenation
299 | * `print "hello" "world"` will output "helloworld". No space between.
300 | * `print "hello","world"` will output "hello world". A space between by using a comma in print.
301 |
302 | ### System commands
303 | * You can execute system commands like so `awk '{system("ls "$1" -la")}' file.txt `.
* Let's break it down. On every record (line) we call the system() function with an argument built dynamically: "ls " followed by the first field (word) followed by " -la". If the field was "myfile.txt" the command becomes "ls myfile.txt -la"
* Note the space at the end of "ls ". If you don't put the space the command would look like "lsmyfile.txt -la", which obviously won't work.
306 | * It will output something like:
307 | ```
308 | -rw-rw-r-- 1 me me 0 Nov 14 17:40 f1
309 | -rw-rw-r-- 1 me me 59 Nov 14 17:41 f2
310 | -rw-rw-r-- 1 me me 20 Nov 12 15:42 col1
311 | ```
* Basically we run `ls -la` on the first word of every line of every file passed to awk.
313 |
314 |
315 | ### Writing dynamically to files
316 | * If you find a certain pattern you might like to write that to a file (with some extra info maybe).
317 | * `awk '/Bilbo/{print "I found him. First word is " $1 >> "appended.txt"}' file.txt `. It will append the print message (followed by first field(word)) to the file appended.txt
* Use '>' instead of '>>' to overwrite the file (instead of appending)
319 |
320 |
321 | ### System commands with stdin
322 | * You can pipe with print into certain system commands. Here's an example `awk '{print "file.txt" | "xargs -n1 ls -l"}'`
323 | * "file.txt" is piped into the command "xargs -n1 ls -l". Similar to `echo "file.txt" | xargs -n ls -l`
324 | * The advantage is that you can pass "dynamic" arguments. Certain fields (words) for example.
325 | * The output looks like:
326 | ```
327 | -rw-rw-r-- 1 me me 0 Nov 14 17:40 file.txt
328 | ```
329 |
330 | ### Getline example
331 | * `"date" | getline cur_date` - run "date", store into variable cur_date. This a simple use for getline.
332 | * Here's a cooler one `awk '{"du "$0" |cut -f1" |getline cur_size;print "for file " $0 " size is " cur_size}' filenames.txt`
* Let's break it down. "du "$0" |cut -f1" will expand to something like "du myfile.txt | cut -f1". This linux command outputs the size of the file in kb.
334 | * We pass this value to the variable "cur_size".
335 | * We use a semicolon to separate commands. Next we print the value of the variable "cur_size"
336 |
337 |
338 | ### $ - the positional variable
* $ is not a "normal" variable but rather an operator triggered by the dollar sign (according to the grymoire awk tutorial)
* `X=1; print $X` means `print $1`.
* You can do more fancy stuff with this. `{print $(NF-1)}` prints the next-to-last field (word). NF is the number of fields (words) in the record (line), so `$NF` is the last field (word) (if there are 5 fields NF is 5 and $NF points to the 5th field (word))
342 |
343 | ### Modify the positional variable
344 | * You can modify a certain field (word) and print the record (line) containing that modified field.
345 | * `echo "Meat's back on the menu" | awk '{$5="NO_MENUS_IN_ORC_LAND";print}'` will output `Meat's back on the NO_MENUS_IN_ORC_LAND`
* We assign a value to field number 5. Then we print the record (print with no arguments prints the current record, just like `print $0`)
347 |
348 |
349 | ### Selective .csv column print
350 | * You have a .csv in the form:
351 | ```
352 | city,area,population
353 | LA,400,100
Miami,500,101
355 | Buenos Aires,800,102
356 | ```
357 | * As you can see you have 3 columns. You would like to print the 1st and 3rd column only.
358 | * This should do the trick `awk -F, '{print $1,$3}' cities.csv`
359 | * We set a custom field (word) separator with `-F,`.
* Note the comma between the positional arguments. If you remember, `{print "a" "b"}` outputs "ab". You need the comma to separate the values with a space: `{print "a","b"}`.
361 | * You can get fancy and skip the header row of the csv with `awk -F, '{if (NR>1) print $1,$3}' cities.csv `. If Number Record is bigger than 1 (not the first) only then print.
362 |
363 | ### Custom field separator with OFS
364 | * `{print "a", "b"}` will output "a b". When a comma is present awk uses the output field separator (OFS) which is a space by default.
365 | * You can change OFS to something else, like `::` for example. `{OFS="::";print "a", "b"}` will output `a::b`
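* Here is the one-liner form, plus a common trick: assigning `$1=$1` forces awk to rebuild the record, so an existing line gets re-joined with the new OFS.
```
awk 'BEGIN{OFS="::"; print "a","b"}'
# a::b

echo "a b c" | awk '{OFS="-"; $1=$1; print}'
# a-b-c
```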
366 |
367 |
368 | ### Mix with command line text
369 | * Let's say you have a file produced by the `ls` command, like this:
370 | ```
371 | drwxr-xr-x 2 root root 69632 Nov 13 19:21 .
372 | drwxr-xr-x 16 root root 4096 Nov 9 07:35 ..
373 | -rwxr-xr-x 1 root root 59888 Dec 5 2020 [
374 | -rwxr-xr-x 1 root root 18456 Feb 7 2021 411toppm
375 | -rwxr-xr-x 1 root root 39 Aug 15 2020 7z
376 | -rwxr-xr-x 1 root root 40 Aug 15 2020 7za
377 | -rwxr-xr-x 1 root root 40 Aug 15 2020 7zr
378 | -rwxr-xr-x 1 root root 35344 Jul 1 00:42 aa-enabled
379 | -rwxr-xr-x 1 root root 35344 Jul 1 00:42 aa-exec
380 | ```
381 | * you want to print the name of the executable but only if it's smaller than 50 bytes. Use this `awk '{if ($5<50) print $9}' bin_10 `
382 | * If the fifth field (containing size in bytes) is smaller than 50 print the 9th field (name of executable)
383 |
384 | ### Math on text.
385 | * This is one of the cool things about awk. You can take some text, perform all kinds of programming magic on it, spit it out nice and modified. Usually you need to write a lot of boilerplate if you're using a general scripting language like python.
386 | * Let's take the cities .csv again:
387 | ```
388 | city,area,population
389 | LA,400,100
Miami,500,101
391 | Buenos Aires,800,102
392 | ```
393 | * Check out this script.
394 | ```
395 | #!/usr/bin/awk -f
396 | BEGIN {
397 | total=0
398 | FS=","
399 | }
400 | {
401 | if (FNR>1) {
402 | real_pop=$3 * 1000
403 | total+=real_pop
404 | print "Real population of", $1, "is" ,real_pop
405 | }
406 | }
407 |
408 | END {
409 |
410 | print "Total Population:", total
411 | }
412 | ```
413 |
414 | * The output will be:
415 | ```
416 | Real population of LA is 100000
417 | Real population of Miami is 101000
418 | Real population of Buenos Aires is 102000
419 | Total Population: 303000
420 | ```
421 | * By now you should understand what this does. We start by creating a variable and setting it to 0. We set the Field separator to a comma because we have a csv.
422 | * If the File Number Record is bigger than 1 do stuff (note how we use FNR, not NR. That's because we want to skip the header of every file, not just the first file).
423 | * We do some math, addition/multiplication. Finally we print the total. This works with multiple csv files.
* With a bit of care we could cram it onto one or two lines in the terminal.
425 |
426 |
427 | ### Fancy line numbers
428 | * Look at this beaut: `awk '{print "(" NR ")" ,$0}' f1`
429 | * It will print the line number in parenthesis followed by the actual line (from a file that doesn't have line numbering). Something like :
430 | ```
431 | (1) line one
432 | (2) line two
433 | ```
* Carefully study the spaces and commas in print. We know that without a comma, strings are concatenated with no separator in between, so `"(" NR ")"` outputs something like `(9)`. We could even drop the spaces (`"("NR")"`) but I've kept them because they make things clearer. Next we print the actual line. Note the comma - it means awk puts a space (OFS) between the parenthesised line number and the actual line.
435 |
436 |
437 | ### Print words by their number
438 | * It's time to get fancy and change Field Separator and Record separator.
* By default RS is a newline. What happens if you change it to an empty string?
* Awk then switches to "paragraph mode": records are separated by blank lines, so a file with no blank lines (like the one below) is read as one single record.
* FS (the Field Separator) stays the same (in this mode a newline also acts as a field separator). But since the whole file is treated as a single record, the positional variables now refer to the word number in the FILE (not in a record/line).
442 | * Sounds complicated? It's not. You have a file like this:
443 | ```
444 | first second third
445 | fourth fifth sixth
446 | seventh eight
447 | ```
448 | * This is the awk script.
449 | ```
450 | #!/usr/bin/awk -f
451 | BEGIN { RS="" }
452 | {
453 | print $1, $8
454 | }
455 | ```
456 | * It will output `first eight`.
457 | * Look at the script again. If RS would be a newline `print $1, $8` would print something like:
458 | ```
459 | first
460 | fourth
461 | seventh
462 | ```
* It prints the first field of each record, but since records are plain lines it can't find an eighth field (word).
464 | * But when the whole file is treated as a single record the positional variable ($) refers to field number in whole file.
465 | * You can put this on a one liner and use it daily: `awk 'BEGIN{RS=""}{print $1020}' file.txt `
466 |
467 |
468 | ### Pass stdin to awk (and show nicely formatted size of files)
469 | * `ls -lh | awk '{print $9,"has size of",$5}'`
470 | * Generate text with `ls -lh` that looks something like:
471 | ```
472 | -rwxr-xr-x 1 root root 39 Aug 15 2020 7z
473 | -rwxr-xr-x 1 root root 40 Aug 15 2020 7za
474 | -rwxr-xr-x 1 root root 40 Aug 15 2020 7zr
475 | ```
476 | * pipe it to awk. Print field $9 (filename) and field $5 (size)
477 |
478 |
479 |
480 | ### Pass both stdin and file to awk
481 | * `echo "Coming from stdin" | awk "{print}" file.txt -`
482 | * In the above command awk first processes file.txt then stdin (passed by using the final dash)
483 |
484 |
485 | ### Check if text coming from stdin or file
* The script is `{ if (FILENAME!="-") print $0 }`.
* When you pass it both stdin and files - `echo "Coming from stdin" | awk '{ if (FILENAME!="-") print $0 }' file.txt -` - awk won't print the text coming from stdin (gawk sets FILENAME to "-" while reading standard input).
488 |
489 |
490 | ### Arrays intro
* `{myarr=["one","two","three"]}` - WON'T work, even though this is the syntax many languages use to declare arrays
492 | * Use instead something like `{myarr[0]="one";myarr[1]="two"}`
493 | * Arrays are associative. You associate an index with a value. Like a dictionary.
494 | ```
495 | #!/usr/bin/awk -f
496 | {
497 | myarr["hobits"]="hobitses"
498 | print(myarr["hobits"])
499 | }
500 | ```
* As you can see you can use both strings and numbers as an index (as a key). Under the hood though, array subscripts are always strings.
* That is, `{print myarr[5.1]}` will look up the string index '5.1' - see the tiny check below.
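* A tiny check of that claim - the numeric index 5.1 and the string "5.1" end up as the same key:
```
awk 'BEGIN{ a[5.1]="ring"; if ("5.1" in a) print "same key, value is", a["5.1"] }'
# same key, value is ring
```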
503 |
504 | ### Store lines in array
505 | ```
506 | #!/usr/bin/awk -f
507 | /bilbo/ {
508 | myarr[NR]=$0
509 | }
510 | END{
511 | for (i in myarr){
512 | print "subscript is",i
513 | print myarr[i]
514 | }
515 | }
516 | ```
517 | * For each record (line) that matches `/bilbo/` store the record(line) in 'myarr', using the current NR (Number Record - or current line count) as a subscript/index
518 |
* At the end, iterate over the indexes of 'myarr' (NOT over the elements) with a 'for..in' loop.
520 | * It will print something like:
521 | ```
522 | subscript is 3
523 | a story by frodo and bilbo
524 | subscript is 13
525 | bilbo again and frodo
526 | ```
* Note how this array is not contiguous - it is sparse. The indexes in between (1-2, 4-12 in the example above) don't exist and are not created automatically. That kinda makes sense, since awk uses strings, NOT integers, as indexes/subscripts - see the sketch below.
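* Because the array is sparse, test for an index with `in` before using it - merely referencing `myarr[4]` would silently create an empty element. A small sketch (file.txt being any of your dummy files):
```
awk '/bilbo/ {myarr[NR]=$0}
     END {
       if (4 in myarr) print "record 4 matched:", myarr[4]
       else            print "record 4 did not match (index does not exist)"
     }' file.txt
```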
528 |
529 |
530 | ### Delete array elems
531 | * `{delete myarr[1]}` will delete element at index '1' (where '1' is a string).
532 | * `{delete myarr}` - will delete ALL elements of the array
533 |
534 |
535 | ### Array index concatenation
536 | ```
537 | #!/usr/bin/awk -f
538 | BEGIN{
539 | arr[1,2]="one"
540 | arr["abc","bcd"]="two"
541 | arr["abc",1]="three"
542 | arr["foo" "bar"]="four"
543 | arr["bar" "foo"]="five"
544 | for (i in arr){
545 | print "at index",i,"value is",arr[i]
546 | }
547 | }
548 | ```
549 | * Will output:
550 | ```
551 | at index abc1 value is three
552 | at index abcbcd value is two
553 | at index foobar value is four
554 | at index barfoo value is five
555 | at index 12 value is one
556 | ```
* When you separate subscripts with a comma, awk joins them with the built in SUBSEP variable (an unprintable character, "\034" by default). That's why `arr[1,2]` shows up looking like index "12" - there is actually an invisible SUBSEP between the 1 and the 2. When you put two strings next to each other with just a space ("foo" "bar") that's plain string concatenation, so the index really is "foobar". Integers are converted to strings when used as indexes.
558 |
559 | * Another thing to note is that the 'for..in' loop doesn't enumerate elems in the order they were added.
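* A sketch showing that the comma form really uses SUBSEP, and how to test for such an index with `(i,j) in arr`:
```
awk 'BEGIN{
  arr[1,2]="one"                              # the real key is "1" SUBSEP "2"
  if ((1,2) in arr)   print "found arr[1,2]"
  if (!("12" in arr)) print "\"12\" is a different key"
}'
```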
560 |
561 |
562 | ### Ordered array indexes
563 | ```
564 | #!/usr/bin/awk -f
565 | BEGIN{
566 | i=0
567 | arr[""]=0
568 | }
569 | /bilbo/{
570 | arr[i++]=$0
571 | }
572 | END{
for (j=0;j<i;j++) print arr[j]
}
```
* We keep our own integer counter ('i') as the index, so in the END block we can loop from 0 up to i and print the stored records (lines) in the order they were added (a plain 'for..in' loop would not guarantee that order).

### The next statement
* `awk '{if (NF>4) next; print}' file.txt`
699 | * If NF (Number of Fields) is bigger than 4 stop processing current input record and get the "next" one. Else print the record.
700 | * This will in effect print lines that have 4 or less fields (words)
701 |
702 | ### Some math funcs
703 | ```
704 | {
705 | print "log",log($1),$2
706 | print "col",cos($1),$2
707 | print "sin",sin($1),$2
708 | print "rand",rand()
709 | }
710 | ```
* Pretty self explanatory math funcs. rand() returns a random float between 0 and 1.
712 |
### Print some random integers (1000 rand ints between 0 and 99)
714 | ```
715 | #!/usr/bin/awk -f
716 | BEGIN {
717 | for (i=0; i<1000; i++) printf("%d\n",rand()*100)
718 | }
719 | ```
* We create random ints between 0 and 99 by multiplying the random float (between 0 and 1) by 100 and cutting off the fractional part with `%d` (decimal) in printf, which treats the number as an int (and drops the digits after the point)
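* Note that gawk's rand() produces the same sequence on every run unless you seed it. A minimal sketch with srand():
```
#!/usr/bin/awk -f
BEGIN {
    srand()                              # seed from the current time so each run differs
    for (i = 0; i < 5; i++) printf("%d\n", rand() * 100)
}
```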
721 |
722 | ### String funcs - index
723 | ```
724 | #!/usr/bin/awk -f
725 | {
726 | i=index($0,"bilbo")
727 | print $0
728 | if (i>0) print ">>>> INDEX IS",i
729 | }
730 | ```
731 | * the 'index()' func searches for the index of the second string ("bilbo") in the first string (the current record/line). If none found it returns 0.
732 |
733 | ### String funcs - length
734 | * very basic but very much needed functionality - the length of a string.
735 | * `awk '{print "[",$0,"]","has length of",length($0)}' file.txt` outputs something like:
736 | ```
737 | [ Hello world world ] has length of 17
738 | [ hello andback again Again ] has length of 25
739 | [ a story by frodo and bilbo with ring ] has length of 36
740 | ```
741 |
742 | ### String funcs - split
743 | * Split a string into parts based on a separator (regex or string).
744 | ```
745 | #!/usr/bin/awk -f
746 | BEGIN {
747 | mystring="The best | leaf in the Shire, isn't it?"
748 | n=split(mystring,array," ")
749 | for (i in array) print array[i]
750 | print "TOTAL splits:",n
751 | }
752 | ```
753 | * will output:
754 | ```
755 | The
756 | best
757 | leaf
758 | in
759 | the
760 | Shire,
761 | isn't
762 | it?
763 | TOTAL splits: 8
764 | ```
* The first parameter is the string. The second parameter is an array in which to store the split parts of the string - we can pass an uninitialized variable directly. The last param is the separator.
* The function returns the number of splits (stored above in the 'n' variable). The split components end up in the array ("array") passed as the second argument.
767 |
768 |
769 | ### String funcs - substr
770 | ```
771 | #!/usr/bin/awk -f
772 | BEGIN {
773 | mystring="lotr is cool"
774 | n=substr(mystring,1,3)
775 | print n
776 | }
777 |
778 | ```
779 | * will output:
780 | `lot`
* substr() returns a portion of the string. The first arg is the string. The second is the index from where to start cutting - indexing begins at 1, NOT 0, so the first char is at index 1. The final (optional) arg is the length of the cut.
782 | * Read it as: get 3 chars from the first char (including the first char).
783 | * If you don't pass the third arg (length) it will cut until the end of string.
784 | * Use it in a one-liner to print the first X chars of every line ` awk '{print substr($0,1,10)}' file.txt`
785 |
786 |
787 | ### String funcs - gensub
788 | ```
789 | #!/usr/bin/awk -f
790 | BEGIN {
791 | mystring="Run Halifax. Show us the meaning of haste."
792 | res=gensub(/[Hh]ali/,"Shadow","g",mystring)
793 | print res
794 | }
795 | ```
796 | * will output:
`Run Shadowfax. Show us the meaning of haste.`
* The first arg is the regex to search for. The second is the replacement string. The third is a flag: it can be "g" for global or an integer. If it's "g" it replaces all occurrences of the regex in mystring; if it's for example 2 it replaces only the second occurrence. The last arg is the target string (where to do the search & replace). gensub() is a gawk extension.
799 | * gensub will return a new string, leaving the original intact.
800 | * If you look at it you'll notice that it's similar to sed's substitute: `s/[Hh]ali/Shadow/g`
* If the last (target) string is not supplied, gensub() uses $0 (the current record)
802 |
803 |
804 |
805 | ### String func - gsub
* gsub() is similar to gensub() but has some differences. It doesn't take the flag argument and replaces globally by default. It also performs the search & replace directly on the target string, in place. It returns the number of substitutions made (whereas gensub() returns the modified string)
* This makes it very useful for modifying records nicely and rapidly.
808 | * `awk '{gsub(/bilbo/,"MASTER BILBO");print}' file.txt` will search for `/bilbo/` on the current record (line) and replace it with "MASTER BILBO".
809 |
810 |
811 | ### String func - sub
812 | * Just like gsub() but only replaces the FIRST occurrence.
813 | * `awk '{sub(/bilbo/,"MASTER BILBO");print}' file.txt` will search for `/bilbo/` on the current record (line) and replace it with "MASTER BILBO", just the FIRST occurrence on the record
814 |
815 |
816 | ### String func - match
817 | * Returns 0 if a match is not found, otherwise the starting index for the match (index STARTING AT 1, not 0)
818 | * Look at this:
819 | ```
820 | #!/usr/bin/awk -f
821 | {
822 | i=match($0,/bilbo/)
823 | if (i>0){
824 | res=substr($0,i,5)
825 | print res
826 | }
827 | }
828 |
829 | ```
* It returns the starting index of `/bilbo/` in the current record. If that's bigger than 0 we take a 5-char substring starting at 'i'
831 |
832 |
833 | ### String func - tolower
834 | * `echo "My Precious!" | awk '{print tolower($0)}'`
835 | * Will output `my precious!`
836 |
837 | ### String func - toupper
838 | * `echo "My Precious!" | awk '{print toupper($0)}'`
839 | * Will output `MY PRECIOUS!`
840 |
841 | ### String func - asort
* Sorting an array of numbers or strings might come in handy. Note that asort() is a gawk extension.
843 | ```
844 | #!/usr/bin/awk -f
845 | BEGIN {
846 | for (i=0;i<5;i++){
847 | arr[i]=rand()*100
848 | print arr[i]
849 | }
850 | print ">> SORTED"
851 | asort(arr)
852 | for (i in arr)print arr[i]
853 | }
854 |
855 | ```
856 | * will output:
857 | ```
858 | 92.4046
859 | 59.3909
860 | 30.6394
861 | 57.8941
862 | 74.0133
863 | >> SORTED
864 | 30.6394
865 | 57.8941
866 | 59.3909
867 | 74.0133
868 | 92.4046
869 | ```
870 |
871 | ### Time func - strftime
* `awk 'BEGIN {print strftime("%a %b %e %H:%M:%S %Z %Y")}'` will output something like `Sun Nov 14 15:09:56 EET 2021`
* Here are the format specifiers (taken from grymoire.com). (The man page also lists format specifiers, but not as completely.)
874 | ```
875 | %a The locale's abbreviated weekday name
876 | %A The locale's full weekday name
877 | %b The locale's abbreviated month name
878 | %B The locale's full month name
879 | %c The locale's "appropriate" date and time representation
880 | %d The day of the month as a decimal number (01--31)
881 | %H The hour (24-hour clock) as a decimal number (00--23)
882 | %I The hour (12-hour clock) as a decimal number (01--12)
883 | %j The day of the year as a decimal number (001--366)
884 | %m The month as a decimal number (01--12)
885 | %M The minute as a decimal number (00--59)
886 | %p The locale's equivalent of the AM/PM
887 | %S The second as a decimal number (00--61).
888 | %U The week number of the year (Sunday is first day of week)
889 | %w The weekday as a decimal number (0--6). Sunday is day 0
890 | %W The week number of the year (Monday is first day of week)
891 | %x The locale's "appropriate" date representation
892 | %X The locale's "appropriate" time representation
893 | %y The year without century as a decimal number (00--99)
894 | %Y The year with century as a decimal number
895 | %Z The time zone name or abbreviation
896 | %% A literal %.
897 | ```
898 |
899 |
900 | ### Extract the inverse of a regex match
* When the match() func finds a match it sets RSTART and RLENGTH. These are 2 built in awk vars holding the starting position and the length of the match.
* You can use these vars to extract the inverse of a regex match. This might come in handy in certain advanced scenarios.
903 | ```
904 | #!/usr/bin/awk -f
905 | {
906 | reg="[Bb]ilbo"
907 | if (match($0,reg)){
908 | bef=substr($0,1,RSTART-1)
909 | aft=substr($0,RSTART+RLENGTH)
910 | pat=substr($0,RSTART,RLENGTH)
911 | print bef,"|",pat,"|",aft
912 | }
913 | else print $0
914 | }
915 |
916 | ```
917 | * This will output something like:
918 | ```
919 | Hello world world
920 | hello andback again Again
921 | a story by frodo and | bilbo | with ring with bilbo
922 | | Bilbo | baggins baggins baggins
923 | ```
* We start by checking for a match. If a match exists, match() returns a value greater than 0 and the commands inside the if run.
* 'bef' is the substring before the match. We cut it from position 1 (string indexing starts at 1), taking RSTART-1 characters - everything up to just before the match.
* 'aft' is the substring after the match. It starts at RSTART+RLENGTH, which is the first char AFTER the end of the match, and since we pass no length it runs to the end of the string. If you're used to 0-based indexing you might be tempted to add 1, but with 1-based indexing RSTART+RLENGTH is already correct.
* (By the way, comments in awk scripts start with `#` - you'll see some in the user declared funcs example further down.)
928 |
929 | ### Put all lines on one line
930 | * Check out this wacky oneliner `awk '{a=a $0}END{print a}' file.txt`
931 | * It will concatenate all lines on a single line (without any separators between lines).
* Here's how it works. We use a variable "a". Simply using a variable creates it, and uninitialized variables have the value "" (the empty string). You can use whatever var name you like - I used "a" because it's short and simple.
* On every record (line) we set "a" to the previous value of "a" followed by the current record (line). Awk concatenates values simply by putting them one after another. The line would even work without any spaces - eg `{a=a$0}` - but I kept the space for clarity.
934 | * After all input records are done processing print the final value of "a". "a" now contains all lines concatenated together.
935 | * Feel free to add some separators between lines for clarity, like: `awk '{a=a"|"$0}END{print a}' file.txt`
936 |
937 |
938 |
939 | ### User declared funcs
940 | ```
941 |
942 | #!/usr/bin/awk -f
943 |
944 | # Declare custom func outside
945 | function throw_ring(who){
946 | if (who=="gollum"){
947 | return 0
948 | }
949 | else if (who=="frodo"){
950 | return 1
951 | }
952 | }
953 |
954 | # On all records matching /ring/
955 | /ring/{
956 | # Find gollum or frodo
957 | if (match($0,"gollum")){
958 | was_thrown=throw_ring("gollum")
959 | #use ternary if/else. If was_thrown is true (or bigger than 0) return "THROWN". Else return "NOT THROWN"
960 | print "the ring throw status is ", was_thrown?"THROWN":"NOT THROWN"
961 | }
962 |
963 | if (match($0,"frodo")){
964 | was_thrown=throw_ring("frodo")
965 | print "the ring throw status is ", was_thrown?"THROWN":"NOT THROWN"
966 | }
967 |
968 | }
969 |
970 | ```
* This is a rather large script but simple once you break it into pieces.
972 | * Start by declaring a custom "throw_ring" function.
973 | * on records that match `/ring/` exec the following operations:
* check if that record (which already contains `/ring/`) contains "gollum". If yes, call throw_ring("gollum") and store the result in "was_thrown".
* Use the ternary operator to print either "THROWN" or "NOT THROWN" depending on that value.
976 | * Do the same for "frodo"
977 | * The output will look like: `the ring throw status is THROWN`
978 |
979 | ### More about conditional patterns
* Conditional patterns are powerful and succinct. We can use them to check for certain conditions before executing any commands inside the curly braces.
981 | * `NF>4{print}` is equivalent to `{if(NF>4)print}`. As you can see the first one is clearer and shorter. It uses a conditional pattern. If true, exec commands inside curly braces.
982 | * The second runs for every record and inside curly braces use a conditional 'if'.
983 | * We can use fairly advanced conditional patterns, using parenthesis and &&, || and even functions.
* `awk 'match($0,/bilbo/){print}' file.txt` - if we find `/bilbo/` inside the current record, print it. Obviously we could've dispensed with match() and simply used a regular pattern.
985 | * But the power comes from combination with other tests. `awk '(NF>6)&& match($0,/bilbo/) {print}' file.txt`. If Number of Fields is bigger than 6 AND we have `/bilbo/` inside current record, print.
* We can replace match() with the match operator - `~`. Like this: `NF>6 && $0~/bilbo/ {print}`. If the Number of Fields is bigger than 6 and `/bilbo/` matches the current record, print.
987 |
988 |
989 | ### Advanced conditional patterns
990 | * Here's a mind blowing one: `awk 'NF==3,/bilbo/{print}' f1`. The comma signifies a range (as for pattern range). But we use a conditional pattern as the start of the range and a regex as the end of the range. Pretty powerful stuff.
* The line above will print starting from the first record where the Number of Fields is 3, up to (and including) the next record containing `/bilbo/`
* Here's a simpler but highly useful oneliner: `awk 'NF!=0{print}' file.txt`. If the Number of Fields is different from 0, print. In effect it skips empty lines.
* This one will make your eyes pop: `awk 'rand()>0.5{print}' file.txt`. It prints roughly half of the lines from 'file.txt', depending on the result of 'rand()' for each record.
994 |
995 |
### Print the first N lines of every file
997 | * `awk 'FNR<=5{print}' f1 f2 f3`
998 | * The line above will print the first 5 lines of every file passed to awk (without any separation between them). It uses a conditional pattern. If File Number Record smaller or equal to 5, print.
999 |
1000 |
1001 | ### Print until you encounter this pattern. Then move to next file.
1002 | * `awk '{if ($0~/bilbo/) nextfile;print}' f1 f2`.
1003 | * Print records (lines) from current file until you encounter `/bilbo/` in current record (line). Then move to the next file (skipping the rest of the current file).
1004 | * Do the same for the rest of the files.
1005 | * It will NOT print the line containing `/bilbo/`.
1006 |
1007 |
1008 | ### Print every Nth line of file
1009 | * `awk '{getline;print}' file.txt` - print every second line of the file.
1010 | * Why? For every record start with `getline`. This command sets '$0' from the next input record - it basically reads the next line into '$0'. Print then prints the newly updated '$0'
1011 | * `awk '{getline;getline;print}' file.txt` - print every third line of the file. We consume two lines then print the third one. And so on. Add more `getline` to increase `Nth` printing.
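* An alternative that avoids getline is a conditional pattern on NR with the modulo operator (here: every 3rd line). Change the 3 to whatever N you need, instead of stacking getline calls.
```
awk 'NR%3==0{print}' file.txt
```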
1012 |
1013 | # About
1014 | * Truth be told I'm also writing this for myself. With so many possible use cases it's easy to forget them all.
* I tried to make this as simple and clear as possible. Even complete awk newbs should be able to pick this up easily.
1016 | * I don't claim to be an awk guru. As such the way I express myself might not be up to official lingo. Or some techniques used here won't follow best practices. Kindly let me know and I'll correct if possible.
* Some techniques shown here are not production code but dumbed down (or otherwise not 100% perfect) for demonstration purposes.
1018 |
1019 |
1020 |