├── Bugs
├── ChangeLog
└── fmtlatex


/Bugs:
--------------------------------------------------------------------------------
The following produces an extra blank line:

some text leading up to
%
\(
x^2 + y^2 = z^2
\)
%
a fantastic formula

Removing any of the indents or removing the comment line removes the fault.


--------------------------------------------------------------------------------
/ChangeLog:
--------------------------------------------------------------------------------
2013-09-25  Andrew Stacey

	* fmtlatex: The argument slurper broke if the argument ended
	several lines after it began.  Fixed this by adding a checker to
	see if there are any closing braces/brackets still on the line and
	bailing out if not.

	Due to the way it works (always on the previous line), the line
	before `\documentclass` got put after `\begin{document}` if the
	preamble was being skipped.  So when we meet `\documentclass` then
	we flush the output to prevent this.

2009-07-08  Andrew Edgell Stacey

	* fmtlatex: Added a caveat to breaking at start of maths:
	shouldn't break if this is at the start of a line.

2008-05-19  Andrew Edgell Stacey

	* fmtlatex: Improved detection of preamble so that it detects the
	start of the preamble as well as the end (via 'documentclass').
	Added a switch to force processing of preamble.
	Comments on their own line are never indented, and whitespace
	between the start of a line and a % is ignored.


--------------------------------------------------------------------------------
/fmtlatex:
--------------------------------------------------------------------------------
#!/usr/bin/perl -w
#
# fmtlatex - put a LaTeX document into a canonical formatting suitable
# for comparing genuine differences rather than just formatting
# differences
#
# Copyright (C) 2007 Andrew Stacey
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation; either version 2 of the
# License, or any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
# USA.
#

use strict;
use Getopt::Long qw(:config auto_help bundling);
use Pod::Usage;

my (
    $shortmath,
    $shortarray,
    $indent,
    $parsep,
    $state,
    $descend,
    $ascend,
    $previous,
    $preamble,
    $dopreamble,
    $curindent,
    $man,
    $help,
    $gpl,
    $version,
    $verstring
    );

$verstring = "fmtlatex version 1.1, Copyright (C) 2007--2013 Andrew Stacey\nfmtlatex comes with ABSOLUTELY NO WARRANTY; for details type\n `fmtlatex --man'.\nThis is free software, and you are welcome to redistribute it under certain\nconditions; type `fmtlatex --man' for details.";

# Default options

# How long should inline maths be before it gets its own line?
$shortmath = 20;

# Ditto for array-like environments
$shortarray = 40;

# Do we want indentation?
$indent = " ";

# Commandline options

GetOptions (
    "p|preamble" => \$dopreamble,
    "i|indent=i" => sub {my ($a, $b) = @_; $indent = $indent x $b;},
    "m|math=i" => \$shortmath,
    "a|array=i" => \$shortarray,
    "man" => \$man,
    "h|?|help" => \$help,
    "gpl" => \$gpl,
    "V|version" => \$version
    ) or pod2usage(2);

die "$verstring\n" if $version;
pod2usage(-exitval => 1, -verbose => 1) if $help;
pod2usage(-verbose => 2 ) if $man;
exec 'perldoc perlgpl' if $gpl;


# Regular expression to match lines that start or separate
# paragraphs.  Copied from Emacs, had to translate \> to \b and $$
# seemed in the wrong place (needed a preceding backslash in the Emacs
# version).


# If the start of the line matches something here then it is not part
# of the previous paragraph.

$parsep = '^ (?: \f |
             \s* (?: $ |
                     % |
                     \$\$ |
                     \\\\[][] |
                     \\\\ (?: b(?:egin | ibitem) |
                              c(?:aption | hapter) |
                              end |
                              footnote |
                              item |
                              label |
                              marginpar |
                              n(?:ew(?:lin | pag)e | oindent) |
                              par(?:agraph | box | t) |
                              s(?:ection |
                                  ub(?:paragraph |
                                       s(?:(?:ubs)?section)
                                    )
                               ) |
                              [a-z]*(page | s(?:kip | space))
                          ) \b
                  )
           )';

# Certain commands should be on their own line.  The following regexp
# determines which.  Note, we only add an eol after them if they
# (essentially) _begin_ the current line.

$state = '^ \s* (?: \$\$ |
                    \\\\[][] |
                    \\\\ (?: b(?:egin | ibitem) |
                             c(?:aption | hapter) |
                             end |
                             footnote |
                             item |
                             label |
                             marginpar |
                             n(?:ew(?:lin | pag)e | oindent) |
                             par(?:agraph | box | t) |
                             s(?:ection |
                                 ub(?:paragraph |
                                      s(?:(?:ubs)?section)
                                   )
                              ) |
                             [a-z]*(page | s(?:kip | space))
                         ) \b
                )';


# Regular expression to match a descent or ascent of depth

$descend = ' \\\\ \[ |
             \\\\ begin \b |
             \{ ';

$ascend = ' \\\\ \] |
            \\\\ end \b |
            \}';

# Actually, we work on the previous line
$previous = "";

# And we start with no indentation
$curindent = 0;

while (<>) {
    # Skip between '\documentclass' and '\begin{document}'

    if (/^[^%]*\\documentclass\b/) {
        if ($previous) {
            &output($previous);
            $previous = "";
        }
        $preamble = 1;
    }

    if (s/^([^%]*\\begin\{document\})\s*//) {
        $dopreamble or $preamble and print $1 . "\n";
        $preamble = 0;
    }

    $dopreamble or $preamble and print and next;

    # Do we remove the eol from the previous line?  Check the current
    # line against the paragraph separator and remove the eol if it
    # doesn't match.

    if (!/$parsep/x
        and ($previous !~ /%/)
        and ($previous !~ /^\s*$/)) {
        # This line appears to be in the same paragraph as the
        # previous one, so we ought to chomp the previous one.  There
        # are two cases where we don't: if the previous line was
        # (essentially) blank then we don't chomp it, and if the
        # previous line ended in a comment then we don't.  If we do
        # chomp, we add a whitespace to ensure that the whitespace
        # count is not affected.
        if ($previous =~ s/[\r\n]+$//) {
            $previous .= " ";
        }
    }
    # does $previous have an eol?

    if ($previous =~ /\n/) {
        &output($previous);
        $previous = "";
    }
    $previous .= $_;
}

&output($previous);

exit 0;

sub output {
    my ($line) = @_;
    my $comment = "";
    $line =~ s/[\r\n]+$//;


    $line =~ s/(?:^|(?<=[^\\]))(%.*)// and $comment = $1;

    # So much for removing eols, do we want to add any in?

    # Rule 1.  New sentence, new line.  Use TeX's algorithm: .!? end a
    # sentence unless . is preceded by a capital letter.

    $line =~ s/([^A-Z]\.|^\.|\?|!) +/$1\n/g;

    # Rule 3.  Long inline maths is on its own line.

    # Break on starting maths:
    #
    # Must have a whitespace to break on, and don't break if the break
    # point is already at the start

    $line =~ s/([^\n\s])[\n\s]+(\S*(?:\\\(|\$))/$1\n$2/g;

    # If the closing delimiter occurs less than $shortmath from the
    # start, remove the break; otherwise add an ending break.

    $line =~ s/(^\S*(?:\\\(|\$).{$shortmath,}\\\)\S*) */$1\n/mg;
    $line =~ s/\n(\S*(?:\\\(|\$).{0,$shortmath}\\\)\S*)/ $1/g;

    # Rule 4.  Tables and arrays are split according to lines, and
    # entries if long enough.  Ought to make this the same throughout
    # the table but that involves remembering too much information.

    # Break after & and \\'s.
    # We want to ensure that we don't _add_ any whitespace here so if there
    # wasn't any originally we add a comment string before the eol.  Also,
    # don't break if we are already (effectively) broken.

    $line =~ s/(\&|\\\\)\s+(?!%|$)/$1\n/g;
    $line =~ s/(\&|\\\\)(?!\s|%|$)/$1%\n/g;

    # If two adjacent column entries are shorter than the given
    # minimum length, remove the eol between them.  Can't just add 'g'
    # modifier to the regexp as we may need to join several entries
    # together and so need to backtrack.

    while ($line =~ s/(^.{0,$shortarray}\&)\n(.{0,$shortarray}(?:\&|\\\\|\n))/$1 $2/s) {};

    # Rule 2.  Change-of-states that occur at the start of a line get their
    # own line.  Need to include any arguments on that line.

    my $newline = "";
    while ($line =~ s/($state)//x) {
        $newline .= $1;
        my $nextarg;
        ($nextarg, $line) = &nextarg($line);
        while ($nextarg !~ /^ *$/) {
            $newline .= $nextarg;
            ($nextarg, $line) = &nextarg($line);
        }
        $newline .= "\n";
    }
    $line = $newline . $line;
    # Make sure we haven't inadvertently _added_ a double eol or any
    # excessive whitespace

    $line =~ s/\n+/\n/g;
    $line =~ s/^ +//mg;
    $line =~ s/ +/ /g;

    # Add some basic contextual indentation.

    if ($indent) {
        my @lines = split("\n", $line);
        for (my $i = 0; $i <= $#lines; $i++) {
            # If the line _starts_ with an ascender or an \item, cancel an indent
            my $offset = ($lines[$i] =~ /^(?:$ascend|\\item)/x ? -1 : 0);
            # If the line _starts_ with inline maths, add an indent.  Note this and the
            # previous can't _both_ match so we're okay setting $offset to 1 if this
            # matches.
            $offset = ($lines[$i] =~ /^\s*\\\(/ ? 1 : $offset);
            $lines[$i] = $indent x ($curindent + $offset) . $lines[$i];

            # Count the ascenders and descenders in the line to get the current
            # indent for the next line.
            # Actually work on a 'decommentised' version of the line so that stuff in
            # comments doesn't affect the count.
            my $decomment = $lines[$i];
            $decomment =~ s/%.*//;
            while ($decomment =~ /$descend/gx) {
                $curindent++;
            }
            while ($decomment =~ /$ascend/gx) {
                $curindent--;
            }
        }
        $line = join("\n", @lines) . "\n";
    }

    # Finally, don't indent comments

    $line =~ s/^ +%/%/;

    $line =~ s/[\r\n]+$//;
    print $line .
        $comment . "\n";
}


# Extract the next block from the string
sub nextarg {
    my ($string) = @_;
    my $first = "";

    # Bail out if we didn't actually get anything
    return ("", "") unless ($string);

    if ($string =~ s/^(\{.*?\})//s) {
        # We need to ensure that the braces are balanced.
        $first = $1;
        my $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
        while ($c > 0 && $string =~ /}/) {
            $string =~ s/^(.*?})//s;
            $first .= $1;
            # Count the number of opening and closing braces in $first
            $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
        }
        return ($first, $string);
    }

    if ($string =~ s/^(\[.*?\])//s) {
        # We need to ensure that the curly braces are balanced.
        $first = $1;
        my $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
        while ($c > 0 && $string =~ /\]/) {
            $string =~ s/^(.*?])//s;
            $first .= $1;
            # Count the number of opening and closing braces in $first
            $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
        }
        return ($first, $string);
    }
    # All else, just return first character

    $string =~ s/(.)//s;

    return ($1, $string);
}

__END__

=head1 NAME

fmtlatex - format a LaTeX document putting it into a canonical format suitable for finding genuine differences rather than formatting differences

=head1 SYNOPSIS

fmtlatex [OPTIONS] [FILES]

Format [FILES] or standard input according to the rules, modified by [OPTIONS] if specified, sending output to standard output.

Example: fmtlatex < paper.tex > paper.fmt.tex

 Options:
   -p, --preamble      process the preamble
   -m, --math NUM      inline math longer than NUM gets its own line
                       (default: 20)
   -a, --array NUM     array-like entries longer than NUM get their own line
                       (default: 40)
   -i, --indent NUM    add indentation using NUM multiples of " "
                       (default: 1)
   --man               print man page
   -h, -?, --help      print basic help
   -V, --version       print version information
   --gpl               print GPL

=head1 DESCRIPTION

B<fmtlatex> formats a LaTeX document into a standardised layout according to the following rules:

=over 4

=item B<1.> Each sentence is on its own line.

=item B<2.> Change-of-state commands are on their own line (e.g. \begin or \[)

=item B<3.> Long pieces of inline maths are on their own line

=item B<4.> Long array entries are on their own line, otherwise array rows are on their own line.

=back

The other guiding principle is that the original author is assumed to be intelligent.

B<fmtlatex> works by going through a LaTeX document, supplied either on the command line or via standard input, line by line and removing or adding end-of-line characters according to the above rules.  The purpose is to produce a document in a standardised form that can be compared with a previous version using, say, I<diff> so that the differences shown are due to actual differences of content and not accidental differences of formatting.

Minor modification of the algorithm can be done by changing the values of certain options.  Major modification can be done by hacking the program.

=head1 OPTIONS

=over 8

=item B<-p, --preamble>

Determines whether or not to process the preamble.  The default is to skip the preamble (everything between C<\documentclass> and C<\begin{document}>).

=item B<-m NUM, --math NUM>

Designates how long a stretch of inline maths has to be before it qualifies for its own line.  Note: the length is of the maths but there may be extra junk on the line as the program only inserts eols at existing whitespace.

Default: 20

=item B<-a NUM, --array NUM>

Designates how long an array entry has to be before it qualifies for its own line.  Note: both the current and next array entries have to be shorter than this to qualify for a single line.

Default: 40

=item B<-i NUM, --indent NUM>

The program also inserts basic indentation according to how "deep" the start of the current line is (it counts the number of '\begin's, '\['s, and '{'s and their corresponding terminators) and indents accordingly.  NUM designates how many spaces correspond to one level of indentation.

=item B<--man>

Display full man page.

=item B<-h, -?, --help>

Display basic help page.

=item B<-V, --version>

Display version information.

=item B<--gpl>

Display GPL.

=back

=head1 NOTES

The indentation and inline math algorithms ignore '$' but the paragraph separation knows about them.  Basically, you shouldn't be using '$'s in a LaTeX document.  Coding for things like C<$a=b$$c=d$> is a hassle and I'd rather not bother if I didn't have to.

This program shouldn't introduce any TeXtual changes into your document.  A simple way to test this (before you delete the original) is to C<latex> the original and the formatted version each a few times, then C<dvips> them, and C<diff> the resulting PostScript files.
You should see something like

  shell% diff original.ps new.ps
  3c3
  < %%Title: original.dvi
  ---
  > %%Title: new.dvi
  12c12
  < %DVIPSCommandLine: dvips original.dvi
  ---
  > %DVIPSCommandLine: dvips new.dvi
  14c14
  < %DVIPSSource:  TeX output 2007.12.20:1128
  ---
  > %DVIPSSource:  TeX output 2007.12.20:1159
  1560c1560
  < TeXDict begin 39139632 55387786 1000 600 600 (original.dvi)
  ---
  > TeXDict begin 39139632 55387786 1000 600 600 (new.dvi)

That is, no significant change at all.


=head1 BUGS

Whad'dya mean, "I've found a bug."?  Impossible.  There are no bugs,
only features.  To tell me about features, send an email to

  stacey@math.ntnu.no

With the following in the subject line:

  Howay man, av fund ah borg een furmatleetach.

=head1 AUTHOR

Andrew Stacey

=head1 LICENSE

Copyright (C) 2007--2013 Andrew Stacey

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or any
later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.


--------------------------------------------------------------------------------