├── ChangeLog ├── README.md ├── awk-sys-prog.texi ├── awk.pdf ├── cleanit ├── du.awk └── texinfo.tex /ChangeLog: -------------------------------------------------------------------------------- 1 | 2024-06-20 Arnold D. Robbins 2 | 3 | * awk-sys-prog.texi: Update for 2024. Bump edition to 1.0. 4 | 5 | 2018-03-05 Arnold D. Robbins 6 | 7 | * awk-sys-prog.texi: Update for 2018, new preface, lots 8 | of new or changed prose, some typo fixes. 9 | 10 | 2018-02-25 Arnold D. Robbins 11 | 12 | * README.md: Updated. 13 | * awk.pdf: New file. 14 | 15 | 2018-02-22 Arnold D. Robbins 16 | 17 | * README.md: New file. 18 | * texinfo.tex: Updated. 19 | 20 | 2015-07-03 Arnold D. Robbins 21 | 22 | * awk-sys-prog.texi: Fix sectioning. Further cleanup edits. 23 | Bump version and date. 24 | 25 | 2015-06-08 Arnold D. Robbins 26 | 27 | * awk-sys-prog.texi: Finish incorporating comments. 28 | Polish some. Counterpoints section is still awkward. 29 | 30 | 2015-06-06 Arnold D. Robbins 31 | 32 | * awk-sys-prog.texi: Continue incorporating comments. 33 | * texinfo.tex: Update to a current version. 34 | 35 | 2015-06-05 Arnold D. Robbins 36 | 37 | * awk-sys-prog.texi: Start incorporating comments. 38 | 39 | 2014-08-18 Arnold D. Robbins 40 | 41 | * awk-sys-prog.texi, du.awk, texinfo.tex, cleanit: 42 | Initial checkin to git, although it's mostly done. 43 | Last modified August 13, 2013. 44 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Introduction 2 | 3 | This is a paper I wrote on why Awk may finally fill the bill as a systems 4 | programming language. More detail on it is found in the paper itself. 5 | I am publishing it on GitHub because I want to. So there. 6 | 7 | To format the paper, just use `texi2pdf awk-sys-prog.texi`. 8 | 9 | The file `awk.pdf` is Henry Spencer's original paper, to which my paper 10 | is a reply. It is made available by permission. 
I thank him. 11 | 12 | Arnold Robbins 13 | [arnold at skeeve.com](mailto:arnold@skeeve.com) 14 | 15 | ##### Last Updated: 16 | 17 | Sun Feb 25 21:15:31 IST 2018 18 | -------------------------------------------------------------------------------- /awk-sys-prog.texi: -------------------------------------------------------------------------------- 1 | \input texinfo @c -*-texinfo-*- 2 | @c %**start of header (This is for running Texinfo on a region.) 3 | @setfilename awk-sys-prog.info 4 | @settitle AWK As A Major Systems Programming Language --- Revisited 5 | @c %**end of header (This is for running Texinfo on a region.) 6 | 7 | @set xref-automatic-section-title 8 | 9 | @c The following information should be updated here only! 10 | @c This sets the edition of the document. 11 | 12 | @c These apply across the board. 13 | @set UPDATE-MONTH June 2024 14 | @set EDITION 1.0 15 | 16 | @set TITLE AWK As A Major Systems Programming Language---Revisited 17 | 18 | @iftex 19 | @set DOCUMENT book 20 | @set CHAPTER chapter 21 | @set APPENDIX appendix 22 | @set SECTION section 23 | @set SUBSECTION subsection 24 | @end iftex 25 | @ifinfo 26 | @set DOCUMENT Info file 27 | @set CHAPTER major node 28 | @set APPENDIX major node 29 | @set SECTION minor node 30 | @set SUBSECTION node 31 | @end ifinfo 32 | @ifhtml 33 | @set DOCUMENT Web page 34 | @set CHAPTER chapter 35 | @set APPENDIX appendix 36 | @set SECTION section 37 | @set SUBSECTION subsection 38 | @end ifhtml 39 | @ifdocbook 40 | @set DOCUMENT book 41 | @set CHAPTER chapter 42 | @set APPENDIX appendix 43 | @set SECTION section 44 | @set SUBSECTION subsection 45 | @end ifdocbook 46 | @ifplaintext 47 | @set DOCUMENT book 48 | @set CHAPTER chapter 49 | @set APPENDIX appendix 50 | @set SECTION section 51 | @set SUBSECTION subsection 52 | @end ifplaintext 53 | 54 | @tex 55 | \gdef\urlcolor{0.5 0.09 0.12} 56 | @end tex 57 | 58 | @c For HTML, spell out email addresses, to avoid problems with 59 | @c address harvesters for spammers. 
60 | @ifhtml 61 | @macro EMAIL{real,spelled} 62 | ``\spelled\'' 63 | @end macro 64 | @end ifhtml 65 | @ifnothtml 66 | @macro EMAIL{real,spelled} 67 | @email{\real\} 68 | @end macro 69 | @end ifnothtml 70 | 71 | @c merge the function and variable indexes into the concept index 72 | @ifinfo 73 | @synindex fn cp 74 | @synindex vr cp 75 | @end ifinfo 76 | @iftex 77 | @syncodeindex fn cp 78 | @syncodeindex vr cp 79 | @end iftex 80 | @ifxml 81 | @syncodeindex fn cp 82 | @syncodeindex vr cp 83 | @end ifxml 84 | 85 | @c If "finalout" is commented out, the printed output will show 86 | @c black boxes that mark lines that are too long. Thus, it is 87 | @c unwise to comment it out when running a master in case there are 88 | @c overfulls which are deemed okay. 89 | 90 | @iftex 91 | @finalout 92 | @end iftex 93 | 94 | @copying 95 | Copyright @copyright{} 2013, 2015, 2018, 2024 96 | Arnold David Robbins 97 | 98 | All Rights Reserved. 99 | @sp 2 100 | 101 | This is Edition @value{EDITION} of @cite{@value{TITLE}}. 102 | @end copying 103 | 104 | @c Uncomment this for the release. Leaving it off saves paper 105 | @c during editing and review. 106 | @c @setchapternewpage odd 107 | 108 | @titlepage 109 | @title @value{TITLE} 110 | @subtitle @value{UPDATE-MONTH} 111 | @author Arnold D. Robbins 112 | 113 | @c Include the Distribution inside the titlepage environment so 114 | @c that headings are turned off. Headings on and off do not work. 115 | 116 | @page 117 | @vskip 0pt plus 1filll 118 | Published by: 119 | @sp 1 120 | 121 | Arnold David Robbins @* 122 | P.O. Box 354 @* 123 | Nof Ayalon @* 124 | D.N. Shimshon 9978500 @* 125 | ISRAEL @* 126 | Email: @EMAIL{arnold@@skeeve.com,arnold AT skeeve.com} @* 127 | URL: @uref{https://www.skeeve.com/} @* 128 | 129 | @sp 2 130 | @insertcopying 131 | @end titlepage 132 | 133 | @unnumbered Preface 134 | 135 | I started this paper in 2013, and in 2015 sent it out for review to the people 136 | listed later on. 
After incorporating comments, I sent it to Rik Farrow, the 137 | editor of the USENIX magazine @cite{;login:} to see if he would publish it. 138 | He declined to do so, for reasonably good reasons. 139 | 140 | The paper languished, forgotten, until early 2018 when I came across it and 141 | decided to polish it off, put it up on GitHub, and make it available from 142 | my home page in HTML. 143 | 144 | In 2024, I took a fresh look at it, and decided to polish it a 145 | little bit more. 146 | 147 | If you are interested in language design and evolution in general, and in Awk 148 | in particular, I hope you will enjoy reading this paper. If not, then why 149 | are you bothering looking at it now? 150 | 151 | @noindent 152 | Arnold Robbins @* 153 | Nof Ayalon, ISRAEL @* 154 | June, 2024 155 | 156 | @chapter Introduction 157 | 158 | At the March 1991 USENIX conference, Henry Spencer presented a paper 159 | entitled @cite{AWK As A Major Systems Programming Language}. In it, 160 | he described his experiences using the original version of @command{awk} 161 | to write two significant ``systems'' programs---a clone for a reasonable 162 | subset of the @command{nroff} formatter@footnote{The Amazingly Workable 163 | Formatter, @command{awf}, is available from 164 | @uref{ftp://ftp.freefriends.org/arnold/Awkstuff/awf.tgz}.}, 165 | and a simple parser generator. 166 | 167 | He described what @command{awk} did well, as well as what it didn't, and 168 | presented a list of things that @command{awk} would need to acquire 169 | in order to take the position of a reasonable alternative to C for 170 | systems programming tasks on Unix systems. 171 | 172 | In particular, @command{awk} lies about in the middle of the spectrum 173 | between C, which is ``close to the metal,'' and the shell, which is 174 | quite high-level. A language at this level that is useful for 175 | doing systems programming is very desirable. 
176 | 177 | This paper reviews Henry's wish list, and describes some of the events that 178 | have occurred in the Unix/Linux world since 1991. 179 | It presents a case that @command{gawk}---GNU Awk---fills most of the 180 | major needs Henry listed way back in 1991, and then describes the 181 | author's opinion as to why other languages have successfully filled the 182 | systems programming role which @command{awk} did not. It discusses 183 | how the current version of @command{gawk} may 184 | finally be able to join the ranks of other popular, powerful, scripting 185 | languages in common use today, and ends off with some counter-arguments 186 | and the author's responses to them. 187 | 188 | @subheading Acknowledgements 189 | 190 | Thanks to Andrew Schorr, Henry Spencer, Nelson H.F.@: Beebe, and Brian Kernighan 191 | for reviewing an earlier draft of this paper. 192 | 193 | @chapter That Was Then @dots{} 194 | 195 | In this @value{SECTION} we review the state of the Unix world in 196 | 1991, as well as the state of @command{awk}, and then list what Henry Spencer 197 | saw as missing for @command{awk}. 198 | 199 | @section The Unix World in 1991 200 | 201 | Undoubtedly, many readers of this paper were not using 202 | computers in 1991, so this @value{SECTION} provides the context 203 | in which Henry's paper was written. In March of 1991: 204 | 205 | @itemize @bullet 206 | 207 | @item 208 | Commercial Unix systems were the norm, with offerings from 209 | AT&T, 210 | Digital Equipment Corporation, 211 | Hewlett Packard, 212 | IBM, 213 | Sun Microsystems, 214 | and many others, 215 | all vying for market share. Microsoft Windows existed, but was 216 | primarily a layer on top of MS-DOS and was not taken seriously. 217 | 218 | @item 219 | Very few sites still ran the original Bell Labs or direct-from-UCB 220 | variants of Unix; those did not keep up with the available hardware 221 | and AT&T was itself trying to succeed in the Unix hardware market. 
222 | 223 | @item 224 | GNU/Linux did not exist! Some unencumbered BSD variants were available, 225 | but they were still under the cloud of the AT&T/UCB lawsuit.@footnote{See 226 | the @uref{https://en.wikipedia.org/wiki/USL_v._BSDi, Wikipedia article}, 227 | and @uref{https://www.bell-labs.com/usr/dmr/www/bsdi/bsdisuit.html, 228 | some notes at the late Dennis Ritchie's website}. There are undoubtedly 229 | other sources of information as well.} 230 | 231 | @item 232 | So-called ``new'' @command{awk} was about 2.5 years old. 233 | The book by Aho, Weinberger and Kernighan was published in 234 | October of 1987, so most people knew about new @command{awk}, but they 235 | just couldn't get it. 236 | 237 | Who could? New @command{awk} was available to educational institutions 238 | from the Bell Labs research group, and to those who had Unix source 239 | licenses for System V Releases 3.1, 3.2, and 4. By this time, source 240 | licensees were an extremely rare breed, since the cost for commercial 241 | licenses had skyrocketed, and even for educational licensees it had 242 | increased greatly.@footnote{Especially for budget-strapped educational 243 | institutions, source licenses were increasingly an expensive luxury, 244 | since SVR4 rarely ran on hardware that they had.} If I recall correctly, 245 | an educational license cost around US $1,000, considerably more than 246 | the earlier Unix licenses. 247 | 248 | @item 249 | PERL@footnote{I've been told that one of the reasons Larry Wall created 250 | PERL is that he either didn't know about new @command{awk}, or he couldn't 251 | get it.} existed and was starting to gain in popularity. In 1991, 252 | ``PERL'' most likely meant PERL 3 or a very early version of PERL 4. 253 | The World Wide Web, which was one of the major reasons for PERL's growth 254 | in popularity, had not yet really taken off. 
255 | 256 | @item 257 | Other implementations of new @command{awk} were available: 258 | 259 | @itemize + 260 | @item 261 | MKS Awk for PC systems (MS-DOS). 262 | 263 | @item 264 | GNU Awk was available and relatively stable, but could not be 265 | called ``solid.'' 266 | @end itemize 267 | 268 | @noindent 269 | The problem with the first of these is that source code was not 270 | available. And the latter came with (to quote Henry) ``troublesome 271 | licenses.'' (Actually, Henry no longer remembers whether his statement 272 | about ``troublesome licenses'' referred to the GPL, or to the Bell Labs 273 | source licenses.) 274 | 275 | @item 276 | Michael Brennan's @command{mawk} (also GPL'ed) was @emph{not} yet available. 277 | Version 1.0 was accepted for posting in @code{comp.sources.reviewed} 278 | on September 30, 1991, half a year after Henry's paper was published. 279 | @end itemize 280 | 281 | @section What Awk Lacked In 1991 282 | 283 | Here is a summary of what was wrong with the @command{awk} picture 284 | in 1991. These are in the same order as presented in Henry's paper. 285 | We qualify each issue in order to later discuss how it has been addressed 286 | over time. 287 | 288 | @enumerate 1 289 | @item 290 | New @command{awk} was not widely available. Most Unix vendors 291 | still shipped only old @command{awk}. (Here is where he mentions that 292 | ``the independently-available implementations either cost substantial 293 | amounts of money or come with troublesome [sic] licenses.'') His point then 294 | was that for portability, @command{awk} programs had to be restricted 295 | to old @command{awk}. 296 | 297 | This could be considered a quality of implementation issue, although 298 | it's really a ``lack of available implementation'' issue. 299 | 300 | @item 301 | There is no way to tell @command{awk} to start matching all its 302 | patterns over again against the existing @code{$0}. 303 | This is a language design issue. 
304 | 305 | @item 306 | There is no array assignment. 307 | (Language design issue.) 308 | 309 | @item 310 | Getting an error message out to standard error is difficult. 311 | (Implementation issue.) 312 | 313 | @item 314 | There is no precise language specification for @command{awk}. 315 | This leads to gratuitous portability problems. 316 | This too is thus a quality of implementation issue, in that without 317 | a specification, it's difficult to produce uniform, high quality 318 | implementations. 319 | 320 | @item 321 | The existing widely available implementation is slow; a much 322 | faster implementation is needed and the best thing of all would be 323 | an optimizing compiler. 324 | (Implementation issue.) 325 | 326 | @item 327 | There is no @command{awk}-level debugger. 328 | (Support tool or quality of implementation issue.) 329 | 330 | @item 331 | There is no @command{awk}-level profiler. 332 | (Support tool or quality of implementation issue.) 333 | @end enumerate 334 | 335 | In private email, Henry added the following items, saying 336 | ``there are a couple more things I'd add now, in hindsight.'' 337 | These are direct quotes: 338 | 339 | @enumerate 9 340 | @item 341 | [I can't believe I didn't discuss this in the paper, because I was 342 | certainly aware of it then!] Lack of any convenient mechanism for adding 343 | libraries. When @command{awk} is being invoked from a shell file, the 344 | shell file can do substitutions or use multiple @option{-f} options, but 345 | those are mechanisms outside the language, and not very convenient ones. 
346 | What's really wanted is something like you get in Python etc., where one 347 | little statement up near the top says ``arrange for this program to have 348 | the xyz library available when it runs.'' 349 | 350 | @item 351 | I think it was Rob Pike who later said (roughly): ``It says something 352 | bad about Awk that in a language with integrated regular expressions, 353 | you end up using @code{substr()} so often.'' My paper did allude 354 | to the difficulty of finding out @emph{where} something matched in 355 | old-@command{awk} programs, but even in new @command{awk}, what you get 356 | is a number that you then have to feed to @code{substr()}. The language 357 | could really use some more convenient way of dissecting a string using 358 | regexp matching. [Caveat: I have not looked lately at Gawk to see if 359 | it has one.] 360 | @end enumerate 361 | 362 | The first of these is somewhere between a language design and a 363 | language implementation issue. The latter is a language design issue. 364 | 365 | @chapter @dots{} And This Is Now 366 | 367 | Fast forward to 2024. Where do things stand? 368 | 369 | @section What Awk Has Today 370 | 371 | The state of the @command{awk} world is much better now. 372 | In the same order: 373 | 374 | @enumerate 1 375 | @item 376 | New @command{awk} is the standard version of @command{awk} 377 | today on GNU/Linux, BSD, and commercial Unix systems. 378 | The one notable exception is Solaris, where @file{/usr/bin/awk} 379 | is still the old one; on all other systems, plain @command{awk} 380 | is some version of new @command{awk}. 381 | 382 | @item 383 | There remains no way to tell @command{awk} to start matching all its 384 | patterns over again against the existing @code{$0}. Furthermore, 385 | this is a feature that has not been called for by the @command{awk} 386 | community, except in Henry's paper. (We do acknowledge that this might 387 | be a useful feature.) 
388 | 389 | @item 390 | There continues to be no array assignment. 391 | However, this function in @command{gawk}, which has arrays of arrays, can do 392 | the trick nicely. It is also efficient, since @command{gawk} uses 393 | reference counted strings internally: 394 | 395 | @example 396 | function copy_array(dest, source, i, count) 397 | @{ 398 | delete dest 399 | 400 | for (i in source) @{ 401 | if (typeof(source[i]) == "array") 402 | count += copy_array(dest[i], source[i]) 403 | else @{ 404 | dest[i] = source[i] 405 | count++ 406 | @} 407 | @} 408 | 409 | return count 410 | @} 411 | @end example 412 | 413 | @item 414 | Getting error messages out is easier. All modern systems have 415 | a @file{/dev/stderr} special file to which error messages 416 | may be sent directly. @command{gawk}, @command{mawk} and 417 | Brian Kernighan's @command{awk} all have @file{"/dev/stderr"} built in 418 | for I/O redirections, so even on systems without a real 419 | @file{/dev/stderr} special file, you can still send error 420 | messages to standard error. 421 | 422 | @item 423 | Perhaps most important of all, with the 424 | @uref{https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html, 425 | POSIX standard}, 426 | there is a formal standard specification 427 | for @command{awk}. As with all formal standards, it isn't 428 | perfect. But it provides an excellent starting point, as well 429 | as chapter and verse to cite when explaining the behavior 430 | of a standards-compliant version of @command{awk}. 431 | 432 | Additionally, the second edition of @cite{The AWK Programming 433 | Language} is now available. 434 | 435 | @item 436 | There are a number of freely available implementations, with 437 | different licenses, such that everyone ought to be able to find 438 | a suitable one: 439 | 440 | @itemize @bullet 441 | @item 442 | Brian Kernighan's @command{awk} is the direct lineal 443 | descendant of Unix @command{awk}. He calls it the ``One 444 | True Awk'' (sic). 
It is available from 445 | GitHub: 446 | 447 | @example 448 | $ @kbd{git clone https://github.com/onetrueawk/awk bwkawk} 449 | @end example 450 | 451 | @item 452 | GNU Awk, @command{gawk}, is available from the Free Software Foundation. 453 | You may use an HTTPS downloader: 454 | @url{https://ftp.gnu.org/gnu/gawk/gawk-5.3.0.tar.gz} is the current 455 | version. There may be a newer one. 456 | 457 | @item 458 | Michael Brennan's @command{awk}, known as @command{mawk}. 459 | In 2009, Thomas Dickey took on @command{mawk} maintenance. 460 | Basic information is available on 461 | @uref{https://www.invisible-island.net/mawk, the project's web page}. 462 | The download URL is 463 | @url{https://invisible-island.net/datafiles/release/mawk.tar.gz}. 464 | 465 | In 2017 Michael published a beta of @command{mawk} 2.0. It's available from 466 | the project's @uref{https://github.com/mikebrennan000/mawk-2, GitHub page}. 467 | 468 | @item 469 | MKS Awk was used for Solaris's @file{/usr/xpg4/bin/awk}, which is 470 | their standards-compliant version of new @command{awk}. For a while it was 471 | available as part of Open Solaris, but is no longer so. 472 | Some years ago, we were able to make this version compile and run on 473 | GNU/Linux after just a few hours' work. 474 | 475 | Although Open Solaris is now history, the 476 | @uref{https://illumos.org, Illumos project} 477 | does make the MKS Awk available. You can view the files one at a time from 478 | @uref{https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/awk_xpg4}. 479 | 480 | @item 481 | Other, more esoteric versions as well. See the 482 | @uref{https://en.wikipedia.org/wiki/Awk_language#Versions_and_implementations, 483 | Wikipedia article}, and also the 484 | @uref{https://www.gnu.org/software/gawk/manual/html_node/Other-Versions.html#Other-Versions, 485 | @command{gawk} documentation}. 
486 | @end itemize 487 | 488 | @end enumerate 489 | 490 | @section And What GNU Awk Has Today 491 | 492 | The more difficult of the quality of implementation issues are addressed 493 | by @command{gawk}. In particular: 494 | 495 | @enumerate 7 496 | @item 497 | Beginning with version 4.0 in 2011, @command{gawk} provides an 498 | @command{awk}-level debugger which is 499 | modeled after GDB. 500 | This is a full debugger, with breakpoints, watchpoints, single statement 501 | stepping and expression evaluation capabilities. 502 | (Older versions had a separate executable named @command{dgawk}. Today 503 | it's built into regular @command{gawk}.) 504 | 505 | @item 506 | @command{gawk} has provided an @command{awk}-level statement profiler for 507 | many years (@command{pgawk}). 508 | Although there is no direct correlation with CPU time used, the 509 | statement level profiler remains a powerful tool for understanding 510 | program behavior. 511 | 512 | @item 513 | Since version 4.0, @command{gawk} has had an @samp{@@include} facility 514 | whereby @command{gawk} goes and finds the named @command{awk} source 515 | program. For much longer it has searched for files specified with 516 | @option{-f} along the path named by the @env{AWKPATH} environment variable. 517 | The @samp{@@include} mechanism also uses @env{AWKPATH}. 518 | 519 | @item 520 | In terms of getting at the pieces of text matched by a regular expression, 521 | @command{gawk} provides an optional third argument to the @code{match()} 522 | function. This argument is an array which @command{gawk} fills in with both 523 | the matched text for the full regexp and subexpressions, and index and length 524 | information for use with @code{substr()}. @command{gawk} also provides the 525 | @code{gensub()} general substitution function, an enhanced version of the 526 | @code{split()} function, and the @code{patsplit()} function for specifying 527 | contents instead of separators using a regexp. 
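As a minimal sketch of the three-argument @code{match()}, the following pulls a key and a value out of a string without any calls to @code{substr()}. The function name @code{parse_assignment} and the sample input are ours, chosen purely for illustration. The code requires @command{gawk}; other versions of @command{awk} do not accept the third argument.

```awk
# Split a "key=value" string into its parts using gawk's
# three-argument match(). Returns 1 on success, 0 otherwise.
# parse_assignment() is a hypothetical helper, not part of gawk.
function parse_assignment(str, result,    parts)
{
    delete result
    if (! match(str, /^([[:alpha:]_][[:alnum:]_]*)=(.*)$/, parts))
        return 0

    result["key"] = parts[1]                # text of the first group
    result["value"] = parts[2]              # text of the second group
    result["vstart"] = parts[2, "start"]    # 1-based index of group 2
    result["vlen"] = parts[2, "length"]     # length of group 2
    return 1
}

BEGIN {
    if (parse_assignment("AWKPATH=/usr/share/awk", info))
        printf "%s -> %s (at %d, length %d)\n",
            info["key"], info["value"], info["vstart"], info["vlen"]
}
```

The start and length entries are there for code that still wants to call @code{substr()} on the original string; when only the matched text is needed, the plain numeric subscripts are enough.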
528 | @end enumerate 529 | 530 | While @command{gawk} has almost always been faster than Brian 531 | Kernighan's @command{awk}, performance improvements bring 532 | it closer to @command{mawk}'s performance level (a byte-code based 533 | execution engine and internal improvements in array indexing). 534 | 535 | And @command{gawk} clearly has the most features of any version, 536 | many of which considerably increase the power of the language. 537 | 538 | @section So Where Does Awk Stand? 539 | 540 | Despite all of the above, @command{gawk} is not as popular as other 541 | scripting languages. Since 1991, we can point to four major scripting 542 | languages which have enjoyed, or currently enjoy, differing levels of 543 | popularity: PERL, Tcl/Tk, Python, and Ruby. We think it is fair to say 544 | that Python is the most popular scripting language in the 545 | third decade of the 21st century. 546 | 547 | Is @command{awk}, as we've described it up to this point, now ready to 548 | compete with the other languages? Not quite yet. 549 | 550 | @chapter Key Reasons Why Other Languages Have Gained Popularity 551 | 552 | In retrospect, it seems clear (at least to us!) that there are two 553 | major reasons that all of the previously mentioned languages have enjoyed 554 | significant popularity. The first is their @emph{extensibility}. 555 | The second is @emph{namespace management}. 556 | 557 | One certainly cannot attribute their popularity to improved syntax. 558 | In the opinion of many, PERL and Ruby both suffer from terrible syntax. 559 | Tcl's syntax is readable but nothing special. Python's syntax is elegant, 560 | although slightly unusual. The point here is that they all differ greatly 561 | in syntax, and none really offers the clean pattern--action paradigm 562 | that is @command{awk}'s trademark, yet they are all popular languages. 563 | 564 | If not syntax, then what? 
We believe that their popularity stems from 565 | the fact that all of these languages are easily @emph{extensible}. This is true 566 | with both ``modules'' in the scripting language, and more importantly, 567 | with access to C level facilities via dynamic library loading. 568 | 569 | Furthermore, these languages allow you to group related functions and 570 | variables into packages or modules: they let you manage the namespace. 571 | 572 | @command{awk}, on the other hand, has always been closed. An @command{awk} 573 | program cannot even change its working directory, much less open 574 | a connection to an SQL database or a socket to a server on the 575 | Internet somewhere (although @command{gawk} can do the latter). 576 | 577 | If one examines the number of extensions available for PERL on CPAN, 578 | or for Python such as PyQt or the Python tk bindings, it becomes 579 | clear that extensibility is the real key to power (and from there 580 | to popularity). 581 | 582 | Furthermore, in @command{awk}, all global variables and functions 583 | share a single namespace. This prevents many good software development 584 | practices based on the principle of information hiding. 585 | 586 | To summarize: A reasonable language definition, efficient implementations, 587 | debuggers and profilers are necessary but not sufficient for true power. 588 | The final ingredients are @emph{extensibility} and @emph{namespaces}. 589 | 590 | @chapter Filling The Extensibility Gap 591 | 592 | With version 4.1, @command{gawk} (finally) provides a defined C API 593 | for extending the core language. 594 | 595 | @section API Overview 596 | 597 | The API makes it possible to write functions in C or C++ that are 598 | callable from an @command{awk} program as if the function were 599 | written in @command{awk}. The most straightforward way to think 600 | of these functions is as user-defined functions that happen to be 601 | implemented in a different language. 
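From the @command{awk} side, calling such a function might look like the following sketch. It assumes that the @samp{filefuncs} extension shipped with @command{gawk} is installed where @command{gawk} can find it. The helper name @code{describe} is ours and is not part of @command{gawk} or of the extension.

```awk
# Load the filefuncs extension, which provides stat() among others.
@load "filefuncs"

# describe() is a hypothetical helper for illustration only.
# It returns a short description of PATH, or "" if stat() fails.
function describe(path,    data)
{
    delete data    # make sure DATA is an (empty) array

    # stat() is implemented in C, but it is called just like any
    # awk function. It fills in DATA and returns 0 on success.
    if (stat(path, data) != 0)
        return ""

    return sprintf("%s: %s, %d bytes", path, data["type"], data["size"])
}

BEGIN {
    print describe("/")
}
```

Nothing at the call site reveals that @code{stat()} is written in C; only the @samp{@@load} line at the top shows that an extension is involved.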
602 | 603 | The API provides the following facilities: 604 | 605 | @itemize @bullet 606 | @item 607 | Structures that map @code{awk} string, numeric, and undefined values 608 | into C types that can be worked with. 609 | 610 | @item 611 | Management of function parameters, including the ability to convert 612 | a parameter whose original type is undefined, into an array. That is, 613 | there is full call-by-reference for arrays. Scalars are passed by 614 | value, of course. 615 | 616 | @item 617 | Access to the symbol table. Extension functions can read all @command{awk} 618 | variables, and create and update new variables. As an initial, relatively 619 | arbitrary design decision, extensions cannot update special variables such as 620 | @code{NR} or @code{NF}, with the single exception of @code{PROCINFO}. 621 | 622 | @item 623 | Full array management, including the ability to create arrays, and arrays 624 | of arrays, and the ability to add and delete elements from an array. It 625 | is also possible to ``flatten'' an array into a data structure that 626 | makes it simple for C code to loop over all the elements of an array. 627 | 628 | @item 629 | The ability to run a procedure when @command{gawk} exits. This is conceptually 630 | the same as the C @code{atexit()} function. 631 | 632 | @item 633 | Hooks into the built-in I/O redirection mechanisms in @command{gawk}. 634 | In particular, there are separate facilities for input redirections 635 | with @code{getline} and @samp{<}, output redirections with 636 | @code{print} or @code{printf} and @samp{>} or @samp{>>}, and two-way 637 | pipelines with @command{gawk}'s @samp{|&} operator. 638 | 639 | @end itemize 640 | 641 | @section Discussion 642 | 643 | Considerable thought went into the design of the API. 
644 | The @command{gawk} documentation provides a 645 | @uref{https://www.gnu.org/software/gawk/manual/html_node/Dynamic-Extensions.html#Dynamic-Extensions, 646 | full description of the API itself}, 647 | with examples (over 50 pages worth!), as well as 648 | @uref{https://www.gnu.org/software/gawk/manual/html_node/Extension-Design.html#Extension-Design, some discussion of the goals and design decisions} 649 | behind the API (in an appendix). 650 | The development was done over the course of 651 | about a year and a half, together with the developers of @command{xgawk}, 652 | a fork of @command{gawk} that added features that made using extensions 653 | easier, and included an extension for processing XML files in a way that 654 | fit naturally with the pattern--action paradigm. While it may not be 655 | perfect, the @command{gawk} developers feel that it is a good start. 656 | 657 | @strong{FIXME}: Henry Spencer suggests adding more info on the API and 658 | on the design decisions. 659 | I think this paper is long enough, and the full doc is quite big. 660 | It'd be hard to pull API doc into this paper in a reasonable fashion, 661 | although it would be possible to review some of the design decisions. 662 | Comments? 663 | 664 | The major @command{xgawk} additions to the C code base have been merged 665 | into @command{gawk}, and the extensions from that project have been 666 | rewritten to use the new API. As a result, the @command{xgawk} project 667 | developers renamed their project @code{gawkextlib}, and the project now 668 | provides only extensions.@footnote{For more information, see the 669 | @uref{https://sourceforge.net/projects/gawkextlib/, 670 | @code{gawkextlib} project page}.} 671 | 672 | It is notable that functions written in @command{awk} can do a number 673 | of things that extension functions cannot, such as modify any 674 | variables, do I/O, call @command{awk} built-in functions, 675 | and call other user-defined functions. 
676 | 677 | While it would certainly be possible to provide APIs for all of these 678 | features for extension functions, this seemed to be overkill. Instead, 679 | the @command{gawk} developers took the view that extension functions 680 | should provide access to external facilities, and provide communication 681 | to the @command{awk} level via function parameters and/or global variables, 682 | including associative arrays, which are the only real data structure. 683 | 684 | Consider a simple example. The standard @command{du} program 685 | can recursively walk one or more arbitrary file hierarchies, call 686 | @code{stat()} to retrieve file information, and then sum up the blocks 687 | used. In the process, @command{du} must track hard links, so that no 688 | file is accounted for or reported more than once. 689 | 690 | The @samp{filefuncs} extension shipped with @command{gawk} provides a 691 | @code{stat()} function that takes a pathname and fills in an associative 692 | array with the information retrieved from @code{stat()}. The array 693 | elements have names like @code{"size"}, @code{"mtime"} and so on, with 694 | corresponding appropriate values. (Compare this to PERL's @code{stat()} 695 | function that returns a linearly-indexed array!) 696 | 697 | The @code{fts()} function in the @samp{filefuncs} extension builds on 698 | @code{stat()} to create a multidimensional array of arrays that describes 699 | the requested file hierarchies, with each element being an array filled 700 | in by @code{stat()}. Directories are arrays containing elements for each 701 | directory entry, with an element named @code{"."} for the array itself. 702 | 703 | Given that @code{fts()} does the heavy lifting, @command{du} can be 704 | written quite nicely, and quite portably@footnote{The @command{awk} 705 | version of @command{du} works on Unix, GNU/Linux, Mac OS X, and MS 706 | Windows. On Windows only Cygwin is currently supported. 
We hope to one 707 | day support MinGW also.}, in @command{awk}. @xref{du in awk}, for the 708 | code, which weighs in at under 250 lines. Much of this is comments and 709 | argument parsing. 710 | 711 | @section Future Work 712 | 713 | The extension facility is relatively new, and undoubtedly has introduced new 714 | ``dark corners'' into @command{gawk}. These remain to be uncovered, 715 | and any new bugs need to be shaken out and removed. 716 | 717 | Some issues are known and may not be resolvable. For example, 64-bit 718 | integer values such as the timestamps in @code{stat()} data on modern 719 | systems don't fit into @command{awk}'s 64-bit double-precision 720 | numbers, which have only 53 bits of significand. This is also a 721 | problem for the bit-manipulation functions. 722 | 723 | With respect to namespaces, in 2017 I (finally) figured out how 724 | namespaces in @command{awk} ought to work to provide the needed 725 | functionality while retaining backwards compatibility. 726 | This was released with @command{gawk} 5.0. 727 | 728 | One or two of the sample extensions shipped with @command{gawk} 729 | and in @code{gawkextlib} have been modified to take advantage 730 | of namespaces. 731 | 732 | @chapter Counterpoints 733 | 734 | Brian Kernighan raised several counterpoints in response to 735 | an earlier draft of the paper. They are worth addressing (or 736 | at least trying to): 737 | 738 | @quotation 739 | I'm not 100% convinced by your basic premise, that the lack of an 740 | extension mechanism is the main / a big reason why Awk isn't used for 741 | the kinds of system programming tasks that Perl, Python, etc., are. 742 | It's absolutely a factor---without such a mechanism, there's just no 743 | way to do a lot of important computations.
But how does that trade off 744 | against just having built-in mechanisms for the core system programming 745 | facilities (as Perl does) or a handful of core libraries like @code{sys}, 746 | @code{os}, @code{regex}, etc., for Python? 747 | @end quotation 748 | 749 | I think that Perl's original inclusion of most of the Unix system calls 750 | was, @emph{from a language design standpoint}, ultimately a mistake. At 751 | the time it was first done, there was no other choice: dynamic loading 752 | of libraries didn't exist on Unix systems in the early and mid-1980s 753 | (nor did shared libraries, for that matter). But having all those 754 | built-in functions bloats the language, making it harder to learn, 755 | document, and maintain, and I definitely did not wish to go down that 756 | path for @command{gawk}. 757 | 758 | With respect to Python, the question is: how are those libraries 759 | implemented? Are they built into the interpreter and separated from the 760 | ``core'' language simply by the language design? Or are they dynamically 761 | loaded modules? 762 | 763 | If the latter, that sounds like an argument @emph{for} the case of having 764 | extensions, not against it. And indeed, this merely emphasizes the 765 | point made at the end of the previous section, which is that to make an 766 | extension facility really scalable, you also need some sort of namespace / 767 | module capability. 768 | 769 | Thus, Brian is correct: an extension facility is needed, but the 770 | last part of the puzzle would be a module facility in the language. 771 | I think that I have solved this, and invite the curious reader to 772 | check out the current versions of @command{gawk}. 773 | 774 | @quotation 775 | I'm also not convinced that Awk is the right language for writing 776 | things that need extensions. It was originally designed for 1-liners, 777 | and a lot of its constructs don't scale up to bigger programs.
The 778 | notation for function locals is appalling (all my fault too, which makes 779 | it worse). There's little chance to recover from random spelling 780 | mistakes and typos; the use of mere adjacency for concatenation looks 781 | ever more like a bad idea. 782 | @end quotation 783 | 784 | This is hard to argue with. Nonetheless, @command{gawk}'s @option{--lint} 785 | option may be of help here, as well as the @option{--dump-variables} 786 | option which produces a list of all variables used in the program. 787 | 788 | @quotation 789 | Awk is fine for its original purpose, but I find myself writing Python 790 | for anything that's going to be bigger than say 10-20 lines unless the 791 | lines are basically just longer pattern-action sequences. (That 792 | notation is a win, of course, which you point out.) 793 | @end quotation 794 | 795 | I have worked for several years in Python. For string manipulation 796 | and processing records, you still have to write all the manual 797 | stuff: open the file, read lines in a loop, split them, etc. Awk 798 | does all this stuff for me. 799 | 800 | Additionally, I think that with discipline, it's possible to write fairly 801 | good-sized, understandable and maintainable @command{awk} programs; 802 | in my experience @command{awk} does scale up well beyond the one-liner 803 | range. 804 | 805 | Not to mention that Brian published (twice now!) a whole book of @command{awk} 806 | programs larger than one line. @code{:-)} (See the Resources section.) 807 | 808 | Some of my own, good-sized @command{awk} programs are available 809 | from GitHub: 810 | 811 | @table @asis 812 | @item The TexiWeb Jr.@: literate programming system 813 | See @uref{https://github.com/arnoldrobbins/texiwebjr}. 814 | The suite has two programs that total over 1,300 lines 815 | of @command{awk}. (They share some code.) 816 | 817 | @item Prepinfo 818 | See @uref{https://github.com/arnoldrobbins/prepinfo}. 
819 | This script processes Texinfo files, updating menus 820 | as needed. This version is rewritten in TexiWeb Jr.@:; it's 821 | about 350 lines of @command{awk}. 822 | 823 | @item Sortmail 824 | See @uref{https://github.com/arnoldrobbins/sortmail}. 825 | This script sorts a Unix mbox format mailbox by thread. 826 | I use it daily. It's also written in TexiWeb Jr.@: and 827 | is about 330 lines of @command{awk}. 828 | 829 | @end table 830 | 831 | Brian continues: 832 | 833 | @quotation 834 | The @command{du} example is awfully big, though it does show off some of the 835 | language features. Could you get the same mileage with something 836 | quite a bit shorter? 837 | @end quotation 838 | 839 | My definition of ``small'' and ``big'' has changed over time. 250 lines 840 | may be big for a script, but the @command{du.awk} program is much smaller 841 | than a full implementation in C: GNU @command{du} is over 1,100 lines 842 | of C, plus all the libraries it relies upon in the GNU Coreutils. 843 | 844 | With respect to shorter examples, nothing springs to mind immediately. 845 | However, @command{gawk} comes with several useful extensions that 846 | are worth exploring, much more than we've covered here. 847 | 848 | For example, the @code{readdir} extension in the @command{gawk} 849 | distribution causes @command{gawk} to read directories and return one 850 | record per directory entry in an easy-to-parse format: 851 | 852 | @example 853 | $ @kbd{gawk -lreaddir '@{ print @}' .} 854 | @print{} 2109292/mail.mbx/f 855 | @print{} 2109295/awk-sys-prog.texi/f 856 | @print{} 2100007/./d 857 | @print{} 2100056/texinfo.tex/f 858 | @print{} 2100055/cleanit/f 859 | @print{} 2109282/awk-sys-prog.pdf/f 860 | @print{} 2100009/du.awk/f 861 | @print{} 2100010/.git/d 862 | @print{} 2098025/../d 863 | @print{} 2109294/ChangeLog/f 864 | @end example 865 | 866 | How cool is that?!? 
@code{:-)} 867 | 868 | Also, the @code{gawkextlib} project provides some very interesting 869 | extensions. Of particular interest are the XML and JSON extensions, 870 | but there are a number of others, and the project is worth checking out. 871 | 872 | In 2018 I wrote here: 873 | 874 | @quotation 875 | In short, it's too early to really tell. This is the beginning of 876 | an experiment. I hope it will be a fun journey for me, the other 877 | @command{gawk} maintainers, and the larger community of @command{awk} 878 | users. 879 | @end quotation 880 | 881 | In 2024, I have to say that extensions haven't particularly 882 | caught on. This saddens me, but it seems to be typical of @command{awk} 883 | users that they use what's in the language and aren't interested 884 | in extending it, or they don't know that they can. Sigh. 885 | 886 | @chapter Conclusion 887 | 888 | It has taken much longer than any @command{awk} fan would like, but finally, 889 | GNU Awk fills in almost all the gaps listed by Henry Spencer for 890 | @command{awk} to be really useful as a systems programming language. 891 | 892 | In addition, experience from other popular languages has shown that 893 | extensibility and namespaces are the keys to true power, 894 | usability, and popularity. 895 | 896 | With the release of @command{gawk} 4.1, we feel that @command{gawk} 897 | (and thus the Awk language) is now almost on par with the basic capabilities 898 | of other popular languages. With @command{gawk} 5.0, we hope(d) to truly 899 | reach par. 900 | 901 | Is it too late in the game? 902 | In 2024, sadly, it does seem to be. But at least I had fun 903 | adding the new features to @command{gawk}. 904 | 905 | I hope that this paper will have piqued @emph{your} 906 | curiosity, and that you will take the time to give @command{gawk} 907 | a fresh look. 908 | 909 | @appendix Resources 910 | 911 | @enumerate 912 | @item 913 | @cite{The AWK Programming Language}, 914 | second edition, 915 | Alfred V.
Aho, Brian W. Kernighan, and Peter J. Weinberger. 916 | Addison-Wesley, 2023. 917 | ISBN-13: 978-0138269722, ISBN-10: 0138269726. 918 | 919 | @item 920 | @cite{Effective awk Programming}, fourth edition. 921 | Arnold Robbins. 922 | O'Reilly Media, 2015. 923 | ISBN-13: 978-1491904619, ISBN-10: 1491904615. 924 | 925 | @item 926 | Online version of the @command{gawk} documentation: 927 | @uref{https://www.gnu.org/software/gawk/manual/}. 928 | 929 | @item 930 | The @code{gawkextlib} project: 931 | @uref{https://sourceforge.net/projects/gawkextlib/}. 932 | @end enumerate 933 | 934 | @node du in awk 935 | @appendix Awk Code For @command{du} 936 | 937 | Here is the @command{du} program, written in Awk. 938 | Besides demonstrating the power of the @code{stat()} and @code{fts()} 939 | extensions and @command{gawk}'s multidimensional arrays, 940 | it also shows the @code{switch} statement and the built-in 941 | bit manipulation functions @code{and()}, @code{or()}, and @code{compl()}. 942 | 943 | The output is not identical to GNU @command{du}'s, since filenames are 944 | not sorted. However, @command{gawk}'s built-in sorting facilities 945 | should make sorting the output straightforward; we leave that as the 946 | traditional ``exercise for the reader.'' 947 | 948 | @smallexample 949 | #! /usr/local/bin/gawk -f 950 | 951 | # du.awk --- write POSIX du utility in awk. 952 | # See https://pubs.opengroup.org/onlinepubs/9699919799/utilities/du.html 953 | # 954 | # Most of the heavy lifting is done by the fts() function in the "filefuncs" 955 | # extension. 956 | # 957 | # We think this conforms to POSIX, except for the default block size, which 958 | # is set to 1024. Following GNU standards, set POSIXLY_CORRECT in the 959 | # environment to force 512-byte blocks. 
960 | # 961 | # Arnold Robbins 962 | # arnold@@skeeve.com 963 | 964 | @@include "getopt" 965 | @@load "filefuncs" 966 | 967 | BEGIN @{ 968 | FALSE = 0 969 | TRUE = 1 970 | 971 | BLOCK_SIZE = 1024 # Sane default for the past 30 years 972 | if ("POSIXLY_CORRECT" in ENVIRON) 973 | BLOCK_SIZE = 512 # POSIX default 974 | 975 | compute_scale() 976 | 977 | fts_flags = FTS_PHYSICAL 978 | sum_only = FALSE 979 | all_files = FALSE 980 | 981 | while ((c = getopt(ARGC, ARGV, "aHkLsx")) != -1) @{ 982 | switch (c) @{ 983 | case "a": 984 | # report size of all files 985 | all_files = TRUE; 986 | break 987 | case "H": 988 | # follow symbolic links named on the command line 989 | fts_flags = or(fts_flags, FTS_COMFOLLOW) 990 | break 991 | case "k": 992 | BLOCK_SIZE = 1024 # 1K block size 993 | break 994 | case "L": 995 | # follow all symbolic links 996 | 997 | # fts_flags &= ~FTS_PHYSICAL 998 | fts_flags = and(fts_flags, compl(FTS_PHYSICAL)) 999 | 1000 | # fts_flags |= FTS_LOGICAL 1001 | fts_flags = or(fts_flags, FTS_LOGICAL) 1002 | break 1003 | case "s": 1004 | # do sums only 1005 | sum_only = TRUE 1006 | break 1007 | case "x": 1008 | # don't cross filesystems 1009 | fts_flags = or(fts_flags, FTS_XDEV) 1010 | break 1011 | case "?": 1012 | default: 1013 | usage() 1014 | break 1015 | @} 1016 | @} 1017 | 1018 | # if both -a and -s 1019 | if (all_files && sum_only) 1020 | usage() 1021 | 1022 | for (i = 0; i < Optind; i++) 1023 | delete ARGV[i] 1024 | 1025 | if (Optind >= ARGC) @{ 1026 | delete ARGV # clear all, just to be safe 1027 | ARGV[1] = "." # default to current directory 1028 | @} 1029 | 1030 | fts(ARGV, fts_flags, filedata) # all the magic happens here 1031 | 1032 | # now walk the trees 1033 | if (sum_only) 1034 | sum_walk(filedata) 1035 | else if (all_files) 1036 | all_walk(filedata) 1037 | else 1038 | top_walk(filedata) 1039 | @} 1040 | 1041 | # usage --- print a message and die 1042 | 1043 | function usage() 1044 | @{ 1045 | print "usage: du [-a|-s] [-kx] [-H|-L] [file] ..." 
> "/dev/stderr" 1046 | exit 1 1047 | @} 1048 | 1049 | # compute_scale --- compute the scale factor for block size calculations 1050 | 1051 | function compute_scale( stat_info, blocksize) 1052 | @{ 1053 | stat(".", stat_info) 1054 | 1055 | if (! ("devbsize" in stat_info)) @{ 1056 | printf("du.awk: you must be using filefuncs extension from gawk 4.1.1 or later\n") > "/dev/stderr" 1057 | exit 1 1058 | @} 1059 | 1060 | # Use "devbsize", which is the units for the count of blocks 1061 | # in "blocks". 1062 | blocksize = stat_info["devbsize"] 1063 | if (blocksize > BLOCK_SIZE) 1064 | SCALE = blocksize / BLOCK_SIZE 1065 | else # I can't really imagine this would be true 1066 | SCALE = BLOCK_SIZE / blocksize 1067 | @} 1068 | 1069 | # islinked --- return true if a file has been seen already 1070 | 1071 | function islinked(stat_info, device, inode, ret) 1072 | @{ 1073 | device = stat_info["dev"] 1074 | inode = stat_info["ino"] 1075 | 1076 | ret = ((device, inode) in Files_seen) 1077 | 1078 | return ret 1079 | @} 1080 | 1081 | # file_blocks --- return number of blocks if a file has not been seen yet 1082 | 1083 | function file_blocks(stat_info, device, inode) 1084 | @{ 1085 | if (islinked(stat_info)) 1086 | return 0 1087 | 1088 | device = stat_info["dev"] 1089 | inode = stat_info["ino"] 1090 | 1091 | Files_seen[device, inode]++ 1092 | 1093 | return block_count(stat_info) # delegate actual counting 1094 | @} 1095 | 1096 | # block_count --- return number of blocks from a stat() result array 1097 | 1098 | function block_count(stat_info, result) 1099 | @{ 1100 | if ("blocks" in stat_info) 1101 | result = int(stat_info["blocks"] / SCALE) 1102 | else 1103 | # otherwise round up from size 1104 | result = int((stat_info["size"] + (BLOCK_SIZE - 1)) / BLOCK_SIZE) 1105 | 1106 | return result 1107 | @} 1108 | 1109 | # sum_dir --- data on a single directory 1110 | 1111 | function sum_dir(directory, do_print, i, sum, count) 1112 | @{ 1113 | for (i in directory) @{ 1114 | if ("." 
in directory[i]) @{ # directory 1115 | count = sum_dir(directory[i], do_print) 1116 | count += file_blocks(directory[i]["."]) 1117 | if (do_print) 1118 | printf("%d\t%s\n", count, directory[i]["."]["path"]) 1119 | @} else @{ # regular file 1120 | count = file_blocks(directory[i]["stat"]) 1121 | @} 1122 | sum += count 1123 | @} 1124 | 1125 | return sum 1126 | @} 1127 | 1128 | # simple_walk --- summarize directories --- print info per parameter 1129 | 1130 | function simple_walk(filedata, do_print, i, sum, path) 1131 | @{ 1132 | for (i in filedata) @{ 1133 | if ("." in filedata[i]) @{ # directory 1134 | sum = sum_dir(filedata[i], do_print) 1135 | path = filedata[i]["."]["path"] 1136 | @} else @{ # regular file 1137 | sum = file_blocks(filedata[i]["stat"]) 1138 | path = filedata[i]["path"] 1139 | @} 1140 | printf("%d\t%s\n", sum, path) 1141 | @} 1142 | @} 1143 | 1144 | # sum_walk --- summarize directories --- print info only for the top set of directories 1145 | 1146 | function sum_walk(filedata) 1147 | @{ 1148 | simple_walk(filedata, FALSE) 1149 | @} 1150 | 1151 | # top_walk --- data on the main arguments only 1152 | 1153 | function top_walk(filedata) 1154 | @{ 1155 | simple_walk(filedata, TRUE) 1156 | @} 1157 | 1158 | # all_walk --- data on every file 1159 | 1160 | function all_walk(filedata, i, sum, count) 1161 | @{ 1162 | for (i in filedata) @{ 1163 | if ("." in filedata[i]) @{ # directory 1164 | count = all_walk(filedata[i]) 1165 | sum += count 1166 | printf("%s\t%s\n", count, filedata[i]["."]["path"]) 1167 | @} else @{ # regular file 1168 | if (! 
islinked(filedata[i]["stat"])) @{ 1169 | count = file_blocks(filedata[i]["stat"]) 1170 | sum += count 1171 | if (i != ".") 1172 | printf("%d\t%s\n", count, filedata[i]["path"]) 1173 | @} 1174 | @} 1175 | @} 1176 | return sum 1177 | @} 1178 | @end smallexample 1179 | 1180 | @bye 1181 | -------------------------------------------------------------------------------- /awk.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arnoldrobbins/awk-sys-prog/bc1059565ced3934b42055cd38cf0171038f3132/awk.pdf -------------------------------------------------------------------------------- /cleanit: -------------------------------------------------------------------------------- 1 | rm awk-sys-prog.aux 2 | rm awk-sys-prog.cp 3 | rm awk-sys-prog.fn 4 | rm awk-sys-prog.ky 5 | rm awk-sys-prog.log 6 | rm awk-sys-prog.pg 7 | rm awk-sys-prog.toc 8 | rm awk-sys-prog.tp 9 | rm awk-sys-prog.vr 10 | -------------------------------------------------------------------------------- /du.awk: -------------------------------------------------------------------------------- 1 | #! /usr/local/bin/gawk -f 2 | 3 | # du.awk --- write POSIX du utility in awk. 4 | # See http://pubs.opengroup.org/onlinepubs/9699919799/utilities/du.html 5 | # 6 | # Most of the heavy lifting is done by the fts() function in the "filefuncs" 7 | # extension. 8 | # 9 | # We think this conforms to POSIX, except for the default block size, which 10 | # is set to 1024. Following GNU standards, set POSIXLY_CORRECT in the 11 | # environment to force 512-byte blocks. 
12 | # 13 | # Arnold Robbins 14 | # arnold@skeeve.com 15 | 16 | @include "getopt" 17 | @load "filefuncs" 18 | 19 | BEGIN { 20 | FALSE = 0 21 | TRUE = 1 22 | 23 | BLOCK_SIZE = 1024 # Sane default for the past 30 years 24 | if ("POSIXLY_CORRECT" in ENVIRON) 25 | BLOCK_SIZE = 512 # POSIX default 26 | 27 | compute_scale() 28 | 29 | fts_flags = FTS_PHYSICAL 30 | sum_only = FALSE 31 | all_files = FALSE 32 | 33 | while ((c = getopt(ARGC, ARGV, "aHkLsx")) != -1) { 34 | switch (c) { 35 | case "a": 36 | # report size of all files 37 | all_files = TRUE; 38 | break 39 | case "H": 40 | # follow symbolic links named on the command line 41 | fts_flags = or(fts_flags, FTS_COMFOLLOW) 42 | break 43 | case "k": 44 | BLOCK_SIZE = 1024 # 1K block size 45 | break 46 | case "L": 47 | # follow all symbolic links 48 | 49 | # fts_flags &= ~FTS_PHYSICAL 50 | fts_flags = and(fts_flags, compl(FTS_PHYSICAL)) 51 | 52 | # fts_flags |= FTS_LOGICAL 53 | fts_flags = or(fts_flags, FTS_LOGICAL) 54 | break 55 | case "s": 56 | # do sums only 57 | sum_only = TRUE 58 | break 59 | case "x": 60 | # don't cross filesystems 61 | fts_flags = or(fts_flags, FTS_XDEV) 62 | break 63 | case "?": 64 | default: 65 | usage() 66 | break 67 | } 68 | } 69 | 70 | # if both -a and -s 71 | if (all_files && sum_only) 72 | usage() 73 | 74 | for (i = 0; i < Optind; i++) 75 | delete ARGV[i] 76 | 77 | if (Optind >= ARGC) { 78 | delete ARGV # clear all, just to be safe 79 | ARGV[1] = "." # default to current directory 80 | } 81 | 82 | fts(ARGV, fts_flags, filedata) # all the magic happens here 83 | 84 | # now walk the trees 85 | if (sum_only) 86 | sum_walk(filedata) 87 | else if (all_files) 88 | all_walk(filedata) 89 | else 90 | top_walk(filedata) 91 | } 92 | 93 | # usage --- print a message and die 94 | 95 | function usage() 96 | { 97 | print "usage: du [-a|-s] [-kx] [-H|-L] [file] ..." 
> "/dev/stderr" 98 | exit 1 99 | } 100 | 101 | # compute_scale --- compute the scale factor for block size calculations 102 | 103 | function compute_scale( stat_info, blocksize) 104 | { 105 | stat(".", stat_info) 106 | 107 | if (! ("devbsize" in stat_info)) { 108 | printf("du.awk: you must be using filefuncs extension from gawk 4.1.1 or later\n") > "/dev/stderr" 109 | exit 1 110 | } 111 | 112 | # Use "devbsize", which is the units for the count of blocks 113 | # in "blocks". 114 | blocksize = stat_info["devbsize"] 115 | if (blocksize > BLOCK_SIZE) 116 | SCALE = blocksize / BLOCK_SIZE 117 | else # I can't really imagine this would be true 118 | SCALE = BLOCK_SIZE / blocksize 119 | } 120 | 121 | # islinked --- return true if a file has been seen already 122 | 123 | function islinked(stat_info, device, inode, ret) 124 | { 125 | device = stat_info["dev"] 126 | inode = stat_info["ino"] 127 | 128 | ret = ((device, inode) in Files_seen) 129 | 130 | return ret 131 | } 132 | 133 | # file_blocks --- return number of blocks if a file has not been seen yet 134 | 135 | function file_blocks(stat_info, device, inode) 136 | { 137 | if (islinked(stat_info)) 138 | return 0 139 | 140 | device = stat_info["dev"] 141 | inode = stat_info["ino"] 142 | 143 | Files_seen[device, inode]++ 144 | 145 | return block_count(stat_info) # delegate actual counting 146 | } 147 | 148 | # block_count --- return number of blocks from a stat() result array 149 | 150 | function block_count(stat_info, result) 151 | { 152 | if ("blocks" in stat_info) 153 | result = int(stat_info["blocks"] / SCALE) 154 | else 155 | # otherwise round up from size 156 | result = int((stat_info["size"] + (BLOCK_SIZE - 1)) / BLOCK_SIZE) 157 | 158 | return result 159 | } 160 | 161 | # sum_dir --- data on a single directory 162 | 163 | function sum_dir(directory, do_print, i, sum, count) 164 | { 165 | for (i in directory) { 166 | if ("." 
in directory[i]) { # directory 167 | count = sum_dir(directory[i], do_print) 168 | count += file_blocks(directory[i]["."]) 169 | if (do_print) 170 | printf("%d\t%s\n", count, directory[i]["."]["path"]) 171 | } else { # regular file 172 | count = file_blocks(directory[i]["stat"]) 173 | } 174 | sum += count 175 | } 176 | 177 | return sum 178 | } 179 | 180 | # simple_walk --- summarize directories --- print info per parameter 181 | 182 | function simple_walk(filedata, do_print, i, sum, path) 183 | { 184 | for (i in filedata) { 185 | if ("." in filedata[i]) { # directory 186 | sum = sum_dir(filedata[i], do_print) 187 | path = filedata[i]["."]["path"] 188 | } else { # regular file 189 | sum = file_blocks(filedata[i]["stat"]) 190 | path = filedata[i]["path"] 191 | } 192 | printf("%d\t%s\n", sum, path) 193 | } 194 | } 195 | 196 | # sum_walk --- summarize directories --- print info only for the top set of directories 197 | 198 | function sum_walk(filedata) 199 | { 200 | simple_walk(filedata, FALSE) 201 | } 202 | 203 | # top_walk --- data on the main arguments only 204 | 205 | function top_walk(filedata) 206 | { 207 | simple_walk(filedata, TRUE) 208 | } 209 | 210 | # all_walk --- data on every file 211 | 212 | function all_walk(filedata, i, sum, count) 213 | { 214 | for (i in filedata) { 215 | if ("." in filedata[i]) { # directory 216 | count = all_walk(filedata[i]) 217 | sum += count 218 | printf("%s\t%s\n", count, filedata[i]["."]["path"]) 219 | } else { # regular file 220 | if (! islinked(filedata[i]["stat"])) { 221 | count = file_blocks(filedata[i]["stat"]) 222 | sum += count 223 | if (i != ".") 224 | printf("%d\t%s\n", count, filedata[i]["path"]) 225 | } 226 | } 227 | } 228 | return sum 229 | } 230 | --------------------------------------------------------------------------------