├── Bugs
├── ChangeLog
└── fmtlatex


/Bugs:
--------------------------------------------------------------------------------
 1 | The following produces an extra blank line:
 2 | 
 3 | some text leading up to
 4 | %
 5 |   \(
 6 |   x^2 + y^2 = z^2
 7 |   \)
 8 | %
 9 | a fantastic formula
10 | 
11 | Removing any of the indents or removing the comment line removes the fault.
12 | 
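One way to reproduce the report is sketched below. It is a minimal harness, not part of the repository: the path `./fmtlatex` and the helper `count_blank_lines` are assumptions made for illustration. It feeds the exact input above to the script and compares the number of blank lines before and after.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The input from the report, indentation and comment lines included.
my $input = <<'END';
some text leading up to
%
  \(
  x^2 + y^2 = z^2
  \)
%
a fantastic formula
END

# Count the lines that are empty or contain only whitespace.
sub count_blank_lines {
    my ($text) = @_;
    my @lines = split /\n/, $text;
    return scalar grep { /^\s*$/ } @lines;
}

my $script = './fmtlatex';    # assumed location of the script
if (-e $script) {
    my ($fh, $file) = tempfile(SUFFIX => '.tex', UNLINK => 1);
    print {$fh} $input;
    close $fh;
    my $perl   = $^X;         # the perl binary running this harness
    my $output = qx{$perl $script $file};
    print $output;
    printf "blank lines in: %d, out: %d\n",
        count_blank_lines($input), count_blank_lines($output);
}
else {
    print "fmtlatex not found at $script; nothing to run\n";
}
```

Removing the indents or a `%` line from `$input` and rerunning should, per the report, bring the two counts back into agreement.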


--------------------------------------------------------------------------------
/ChangeLog:
--------------------------------------------------------------------------------
 1 | 2013-09-25  Andrew Stacey  <andrew.stacey@math.ntnu.no>
 2 | 
 3 | 	* fmtlatex: The argument slurper broke if the argument ended
 4 | 	several lines after it began.  Fixed this by adding a checker to
 5 | 	see if there are any closing braces/brackets still on the line and
 6 | 	bailing out if not.
 7 | 
 8 | 	Due to the way it works (always on the previous line), the line
 9 | 	before `\documentclass` got put after `\begin{document}` if the
10 | 	preamble was being skipped.  So when we meet `\documentclass` then
11 | 	we flush the output to prevent this.
12 | 
13 | 2009-07-08  Andrew Edgell Stacey  <andrew.stacey@math.ntnu.no>
14 | 
15 | 	* fmtlatex: Added a caveat to breaking at start of maths:
16 | 	shouldn't break if this is at the start of a line.
17 | 
18 | 2008-05-19  Andrew Edgell Stacey  <andrew.stacey@math.ntnu.no>
19 | 
20 | 	* fmtlatex: Improved detection of preamble so that it detects the
21 | 	start of the preamble as well as the end (via 'documentclass').
22 | 	Added a switch to force processing of preamble.
23 | 	Comments on their own line are never indented, and whitespace
24 | 	between the start of a line and a % is ignored.
25 | 
26 | 
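The 2013 entry describes the slurper bailing out when no closing brace remains. A minimal sketch of that check, modelled on the brace branch of `nextarg` in fmtlatex, is given below; the name `slurp_braced` is hypothetical and the bracket branch is left out.

```perl
use strict;
use warnings;

# Pull one brace-delimited argument off the front of $string.
# Returns (argument, remainder); the argument is empty if $string
# does not start with an opening brace.
sub slurp_braced {
    my ($string) = @_;
    return ('', $string) unless $string =~ s/^(\{.*?\})//s;
    my $arg = $1;

    # tr with an empty replacement list counts without modifying.
    my $depth = ($arg =~ tr/{//) - ($arg =~ tr/}//);

    # Extend the argument only while a closing brace is still available;
    # if none is left, give up rather than loop forever.
    while ($depth > 0 && $string =~ /\}/) {
        $string =~ s/^(.*?\})//s;
        $arg .= $1;
        $depth = ($arg =~ tr/{//) - ($arg =~ tr/}//);
    }
    return ($arg, $string);
}

my ($arg, $rest) = slurp_braced('{a{b}c} rest');
print "argument: $arg\nremainder:$rest\n";
```

With an unbalanced input such as `{a{b} tail`, the loop is never entered and the partial argument `{a{b}` is returned as it stands, which is the "bailing out" behaviour the entry refers to.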


--------------------------------------------------------------------------------
/fmtlatex:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/perl -w
  2 | #
  3 | # fmtlatex - put a LaTeX document into a canonical formatting suitable
  4 | # for comparing genuine differences rather than just formatting
  5 | # differences
  6 | # 
  7 | # Copyright (C) 2007   Andrew Stacey <stacey@math.ntnu.no>
  8 | # 
  9 | # This program is free software; you can redistribute it and/or
 10 | # modify it under the terms of the GNU General Public License as
 11 | # published by the Free Software Foundation; either version 2 of the
 12 | # License, or any later version.
 13 | # 
 14 | # This program is distributed in the hope that it will be useful, but
 15 | # WITHOUT ANY WARRANTY; without even the implied warranty of
 16 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 17 | # General Public License for more details.
 18 | # 
 19 | # You should have received a copy of the GNU General Public License
 20 | # along with this program; if not, write to the Free Software
 21 | # Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
 22 | # USA.
 23 | # 
 24 | 
 25 | use strict;
 26 | use Getopt::Long qw(:config auto_help bundling);
 27 | use Pod::Usage;
 28 | 
 29 | my (
 30 |     $shortmath,
 31 |     $shortarray,
 32 |     $indent,
 33 |     $parsep,
 34 |     $state,
 35 |     $descend,
 36 |     $ascend,
 37 |     $previous,
 38 |     $preamble,
 39 |     $dopreamble,
 40 |     $curindent,
 41 |     $man,
 42 |     $help,
 43 |     $gpl,
 44 |     $version,
 45 |     $verstring
 46 |     );
 47 | 
 48 | $verstring = "fmtlatex version 1.1, Copyright (C) 2007--2013 Andrew Stacey\nfmtlatex comes with ABSOLUTELY NO WARRANTY; for details type\n `fmtlatex --man'.\nThis is free software, and you are welcome to redistribute it under certain\nconditions; type `fmtlatex --man' for details.";
 49 | 
 50 | # Default options
 51 | 
 52 | # How long should inline maths be before it gets its own line?
 53 | $shortmath = 20;
 54 | 
 55 | # Ditto for array-like environments
 56 | $shortarray = 40;
 57 | 
 58 | # Do we want indentation?
 59 | $indent = " ";
 60 | 
 61 | # Commandline options
 62 | 
 63 | GetOptions (
 64 |     "p|preamble" => \$dopreamble,
 65 |     "i|indent=i" => sub {my (undef, $n) = @_; $indent = " " x $n;},
 66 |     "m|math=i" => \$shortmath,
 67 |     "a|array=i" => \$shortarray,
 68 |     "man" => \$man,
 69 |     "h|?|help" => \$help,
 70 |     "gpl" => \$gpl,
 71 |     "V|version" => \$version
 72 |     ) or pod2usage(2);
 73 | 
 74 | die "$verstring\n" if $version;
 75 | pod2usage(-exitval => 1, -verbose => 1) if $help;
 76 | pod2usage(-verbose => 2 ) if $man;
 77 | exec 'perldoc perlgpl' if $gpl;
 78 | 
 79 | 
 80 | # Regular expression to match lines that start or separate
 81 | # paragraphs.  Copied from Emacs, had to translate \> to \b and $$
 82 | # seemed in the wrong place (needed preceding backslash in Emacs
 83 | # version).
 84 | 
 85 | 
 86 | # If the start of the line matches something here then it is not part
 87 | # of the previous paragraph.
 88 | 
 89 | $parsep = '^ (?: \f |
 90 |                     \s* (?: $        |
 91 |                             %        |
 92 |                             \$\$     |
 93 |                             \\\\[][] |
 94 |                             \\\\ (?: b(?:egin   | ibitem)            |
 95 |                                      c(?:aption | hapter)            |
 96 |                                      end                             |
 97 |                                      footnote                        |
 98 |                                      item                            |
 99 |                                      label                           |
100 |                                      marginpar                       |
101 |                                      n(?:ew(?:lin | pag)e | oindent) |
102 |                                      par(?:agraph | box | t)         |
103 |                                      s(?:ection |
104 |                                          ub(?:paragraph |
105 |                                               s(?:(?:ubs)?section)
106 |                                            )
107 |                                       )                              |
108 |                                      [a-z]*(page | s(?:kip | space))
109 |                                  ) \b
110 |                         )
111 |                 )';
112 | 
113 | # Certain commands should be on their own line.  The following regexp
114 | # determines which.  Note, we only add an eol after them if they
115 | # (essentially) _begin_ the current line.
116 | 
117 | $state = '^ \s* (?: \$\$   |
118 |                         \\\\[][] |
119 |                         \\\\ (?: b(?:egin   | ibitem)            |
120 |                                  c(?:aption | hapter)            |
121 |                                  end                             |
122 |                                  footnote                        |
123 |                                  item                            |
124 |                                  label                           |
125 |                                  marginpar                       |
126 |                                  n(?:ew(?:lin | pag)e | oindent) |
127 |                                  par(?:agraph | box | t)         |
128 |                                  s(?:ection |
129 |                                      ub(?:paragraph |
130 |                                           s(?:(?:ubs)?section)
131 |                                        )
132 |                                   )                              |
133 |                                  [a-z]*(page | s(?:kip | space))
134 |                               ) \b
135 |                       )';
136 | 
137 | 
138 | # Regular expression to match a descent or ascent of depth
139 | 
140 | $descend = ' \\\\ \[       |
141 |                 \\\\ begin \b |
142 |                 \{ ';
143 | 
144 | $ascend =  ' \\\\ \]     |
145 |                 \\\\ end \b |
146 |                 \}';
147 | 
148 | # Actually, we work on the previous line
149 | $previous = "";
150 | 
151 | # And we start with no indentation
152 | $curindent = 0;
153 | 
154 | while (<>) {
155 |     # Skip between '\documentclass' and '\begin{document}'
156 | 
157 |     if (/^[^%]*\\documentclass\b/) {
158 | 	if ($previous) {
159 | 	    &output($previous);
160 | 	    $previous = "";
161 | 	}
162 | 	$preamble = 1;
163 |     }
164 | 
165 |     if (!$dopreamble and $preamble and s/^([^%]*\\begin\{document\})\s*//) {
166 | 	print $1 . "\n";
167 | 	$preamble = 0;
168 |     }
169 | 
170 |     $dopreamble or $preamble and print and next;
171 | 
172 |     # Do we remove the eol from the previous line?  Check the current
173 |     # line against the paragraph separator and remove the eol if it
174 |     # doesn't match.
175 | 
176 |     if (!/$parsep/x
177 | 	and ($previous !~ /%/)
178 | 	and ($previous !~ /^\s*$/)) {
179 | 	# This line appears to be in the same paragraph as the
180 | 	# previous one, so we ought to chomp the previous one.  There
181 | 	# are two cases where we don't: if the previous line was
182 | 	# (essentially) blank then we don't chomp it, and if the
183 | 	# previous line ended in a comment then we don't.  If we do
184 | 	# chomp, we add a whitespace to ensure that the whitespace
185 | 	# count is not affected.
186 | 	if ($previous =~ s/[\r\n]+$//) {
187 | 	    $previous .=  " ";
188 | 	}
189 |     }
190 |     # does $previous have an eol?
191 | 
192 |     if ($previous =~ /\n/) {
193 | 	&output($previous);
194 | 	$previous = "";
195 |     }
196 |     $previous .= $_;
197 | }
198 | 
199 | &output($previous);
200 | 
201 | exit 0;
202 | 
203 | sub output {
204 |     my ($line) = @_;
205 |     my $comment = "";
206 |     $line =~ s/[\r\n]+$//;
207 | 
208 | 
209 |     $line =~ s/(?:^|(?<=[^\\]))(%.*)// and $comment = $1;
210 | 
211 |     # So much for removing eols, do we want to add any in?
212 | 
213 |     # Rule 1.  New sentence, new line.  Use TeX's algorithm: .!? end a
214 |     # sentence unless . is preceded by a capital letter.
215 | 
216 |     $line =~ s/([^A-Z]\.|^\.|\?|!) +/$1\n/g;
217 | 
218 |     # Rule 3.  Long inline maths is on its own line.
219 | 
220 |     # Break on starting maths:
221 |     #
222 |     # Must have a whitespace to break on, and don't break if the break
223 |     # point is already at the start
224 | 
225 |     $line =~ s/([^\n\s])[\n\s]+(\S*(?:\\\(|\$))/$1\n$2/g;
226 | 
227 |     # If the closing delimiter occurs less than $shortmath from the
228 |     # start, remove the break; otherwise add an ending break.
229 | 	    
230 |     $line =~ s/(^\S*(?:\\\(|\$).{$shortmath,}\\\)\S*) */$1\n/mg;
231 |     $line =~ s/\n(\S*(?:\\\(|\$).{0,$shortmath}\\\)\S*)/ $1/g;
232 | 
233 |     # Rule 4.  Tables and arrays are split according to lines, and
234 |     # entries if long enough.  Ought to make this the same throughout
235 |     # the table but that involves remembering too much information.
236 | 
237 |     # Break after & and \\'s.
238 |     # We want to ensure that we don't _add_ any whitespace here so if there
239 |     # wasn't any originally we add a comment string before the eol.  Also,
240 |     # don't break if we are already (effectively) broken.
241 | 
242 |     $line =~ s/(\&|\\\\)\s+(?!%|$)/$1\n/g;
243 |     $line =~ s/(\&|\\\\)(?!\s|%|$)/$1%\n/g;
244 | 
245 |     # If two adjacent column entries are shorter than the given
246 |     # minimum length, remove the eol between them.  Can't just add 'g'
247 |     # modifier to the regexp as we may need to join several entries
248 |     # together and so need to backtrack.
249 | 
250 |     while ($line =~ s/(^.{0,$shortarray}\&)\n(.{0,$shortarray}(?:\&|\\\\|\n))/$1 $2/s) {};
251 | 
252 |     # Rule 2.  Change-of-states that occur at the start of a line get their
253 |     # own line.  Need to include any arguments on that line.
254 | 
255 |     my $newline = "";
256 |     while ($line =~ s/($state)//x) {
257 | 	$newline .= $1;
258 | 	my $nextarg;
259 | 	($nextarg, $line) = &nextarg($line);
260 | 	while ($nextarg !~ /^ *$/) {
261 | 	    $newline .= $nextarg;
262 | 	    ($nextarg, $line) = &nextarg($line);
263 | 	}
264 | 	$newline .= "\n";
265 |     }
266 |     $line = $newline . $line;
267 |     # Make sure we haven't inadvertently _added_ a double eol or any
268 |     # excessive whitespace
269 | 
270 |     $line =~ s/\n+/\n/g;
271 |     $line =~ s/^ +//mg;
272 |     $line =~ s/  +/ /g;
273 | 
274 |     # Add some basic contextual indentation.
275 | 
276 |     if ($indent) {
277 | 	my @lines = split("\n", $line);
278 | 	for (my $i = 0; $i <= $#lines; $i++) {
279 | # If the line _starts_ with an ascender or an \item, cancel an indent
280 | 	    my $offset = ($lines[$i] =~ /^(?:$ascend|\\item)/x ? -1 : 0);
281 | # If the line _starts_ with inline maths, add an indent.  Note this and the
282 | # previous can't _both_ match so we're okay setting $offset to 1 if this
283 | # matches.
284 | 	    $offset = ($lines[$i] =~ /^\s*\\\(/ ? 1 : $offset);
285 | 	    $lines[$i] = $indent x ($curindent + $offset) . $lines[$i];
286 | 
287 | # Count the ascenders and descenders in the line to get the current
288 | # indent for the next line.
289 | # Actually work on a 'decommentised' version of the line so that stuff in
290 | # comments doesn't affect the count.
291 | 	    my $decomment = $lines[$i];
292 | 	    $decomment =~ s/%.*//;
293 | 	    while ($decomment =~ /$descend/gx) {
294 | 		$curindent++;
295 | 	    }
296 | 	    while ($decomment =~ /$ascend/gx) {
297 | 		$curindent--;
298 | 	    }
299 | 	}
300 | 	$line = join("\n", @lines) . "\n";
301 |     }
302 | 
303 | # Finally, don't indent comments
304 | 
305 |     $line =~ s/^ +%/%/;
306 | 
307 |     $line =~ s/[\r\n]+$//;
308 |     print $line . $comment . "\n";
309 | }
310 | 
311 | 
312 | # Extract the next block from the string
313 | sub nextarg {
314 |     my ($string) = @_;
315 |     my $first = "";
316 | 
317 | # Bail out if we didn't actually get anything
318 |     return ("", "") unless (defined $string and length $string);
319 | 
320 |     if ($string =~ s/^(\{.*?\})//s) {
321 | # We need to ensure that the braces are balanced.
322 | 	$first = $1;
323 | 	my $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
324 | 	while ($c > 0 && $string =~ /}/) {
325 | 	    $string =~ s/^(.*?})//s;
326 | 	    $first .= $1;
327 | # Count the number of opening and closing braces in $first
328 | 	    $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
329 | 	}
330 | 	return ($first, $string);
331 |     }
332 | 
333 |     if ($string =~ s/^(\[.*?\])//s) {
334 | # We need to ensure that the curly braces are balanced.
335 | 	$first = $1;
336 | 	my $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
337 | 	while ($c > 0 && $string =~ /\]/) {
338 | 	    $string =~ s/^(.*?])//s;
339 | 	    $first .= $1;
340 | # Count the number of opening and closing braces in $first
341 | 	    $c = ($first =~ tr/{/{/) - ($first =~ tr/}/}/);
342 | 	}
343 | 	return ($first, $string);
344 |     }
345 | # All else, just return first character
346 | 
347 |     $string =~ s/(.)//s;
348 | 
349 |     return ($1, $string);
350 | }
351 | 
352 | __END__
353 | 
354 | =head1 NAME
355 | 
356 | fmtlatex - format a LaTeX document putting it into a canonical format suitable for finding genuine differences rather than formatting differences
357 | 
358 | =head1 SYNOPSIS
359 | 
360 |  fmtlatex [OPTIONS] [FILES]
361 | 
362 | Format [FILES] or standard input according to the rules, modified by [OPTIONS] if specified, sending output to standard output.
363 | 
364 | Example: fmtlatex < paper.tex > paper.fmt.tex
365 | 
366 |  Options:
367 |   -p, --preamble    process the preamble
368 |   -m, --math NUM    inline math longer than NUM gets its own line
369 |                     (default: 20)
370 |   -a, --array NUM   array-like entries longer than NUM get their own line
371 |                     (default: 40)
372 |   -i, --indent NUM  add indentation using NUM multiples of " "
373 |                     (default: 1)
374 |   --man             print man page
375 |   -h, -?, --help    print basic help
376 |   -V, --version     print version information
377 |   --gpl             print GPL
378 | 
379 | =head1 DESCRIPTION
380 | 
381 | B<fmtlatex> formats a LaTeX document into a standardised layout according to the following rules:
382 | 
383 | =item B<1.> Each sentence is on its own line.
384 | 
385 | =item B<2.> Change-of-state commands are on their own line (e.g. \begin or \[)
386 | 
387 | =item B<3.> Long pieces of inline maths are on their own line
388 | 
389 | =item B<4.> Long array entries are on their own line, otherwise array rows are on their own line.
390 | 
391 | The other guiding principle is that the original author is assumed to be intelligent.
392 | 
393 | B<fmtlatex> works by going through a LaTeX document, supplied either on the command line or via standard input, line by line and removing or adding end-of-line characters according to the above rules.  The purpose is to produce a document in a standardised form that can be compared with a previous version using, say, I<diff> so that the differences shown are due to actual differences of content and not accidental differences of formatting.
394 | 
395 | Minor modification of the algorithm can be done by changing the values of certain options.  Major modification can be done by hacking the program.
396 | 
397 | =head1 OPTIONS
398 | 
399 | =over 8
400 | 
401 | =item B<-p, --preamble>
402 | 
403 | Determines whether or not to process the preamble.  The default is to skip the preamble (everything between C<documentclass> and C<begin{document}>).
404 | 
405 | =item B<-m NUM, --math NUM>
406 | 
407 | Designates how long a stretch of inline maths has to be before it qualifies for its own line.  Note: the length is of the maths but there may be extra junk on the line as the program only inserts eols at existing whitespace.
408 | 
409 | Default: 20
410 | 
411 | =item B<-a NUM, --array NUM>
412 | 
413 | Designates how long an array entry has to be before it qualifies for its own line.  Note: both the current and next array entries have to be shorter than this to qualify for a single line.
414 | 
415 | Default: 40
416 | 
417 | =item B<-i NUM, --indent NUM>
418 | 
419 | The program also inserts basic indentation according to how "deep" the start of the current line is (it counts the number of '\begin's, '\['s, and '{'s and their corresponding terminators) and indents accordingly.  NUM designates how many spaces correspond to one indentation.
420 | 
421 | =item B<--man>
422 | 
423 | Display full man page.
424 | 
425 | =item B<-h, -?, --help>
426 | 
427 | Display basic help page.
428 | 
429 | =item B<-V, --version>
430 | 
431 | Display version information.
432 | 
433 | =item B<--gpl>
434 | 
435 | Display GPL.
436 | 
437 | =head1 NOTES
438 | 
439 | The indentation and inline math algorithms ignore '$' but the paragraph separation knows about them.  Basically, you shouldn't be using '$'s in a LaTeX document.  Coding for things like C<$a=b$$c=d$> is a hassle and I'd rather not bother if I didn't have to.
440 | 
441 | This program shouldn't introduce any TeXtual changes into your document.  A simple way to test this (before you delete the original) is to C<latex> the original and the formatted version each a few times, then C<dvips> them, and C<diff> the resulting postscript files.  You should see something like
442 | 
443 |  shell% diff original.ps new.ps
444 |  3c3
445 |  < %%Title: original.dvi
446 |  ---
447 |  > %%Title: new.dvi
448 |  12c12
449 |  < %DVIPSCommandLine: dvips original.dvi
450 |  ---
451 |  > %DVIPSCommandLine: dvips new.dvi
452 |  14c14
453 |  < %DVIPSSource:  TeX output 2007.12.20:1128
454 |  ---
455 |  > %DVIPSSource:  TeX output 2007.12.20:1159
456 |  1560c1560
457 |  < TeXDict begin 39139632 55387786 1000 600 600 (original.dvi)
458 |  ---
459 |  > TeXDict begin 39139632 55387786 1000 600 600 (new.dvi)
460 | 
461 | That is, no significant change at all.
462 | 
463 | 
464 | =head1 BUGS
465 | 
466 | Whad'dya mean, "I've found a bug."?  Impossible.  There are no bugs,
467 | only features.  To tell me about features, send an email to 
468 | 
469 | stacey@math.ntnu.no
470 | 
471 | With the following in the subject line:
472 | 
473 | Howay man, av fund ah borg een furmatleetach.
474 | 
475 | =head1 AUTHOR
476 | 
477 | Andrew Stacey
478 | 
479 | =head1 LICENSE
480 | 
481 | Copyright (C) 2007--2013   Andrew Stacey
482 | 
483 | This program is free software; you can redistribute it and/or modify
484 | it under the terms of the GNU General Public License as published by
485 | the Free Software Foundation; either version 2 of the License, or any
486 | later version.
487 | 
488 | This program is distributed in the hope that it will be useful, but
489 | WITHOUT ANY WARRANTY; without even the implied warranty of
490 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
491 | General Public License for more details.
492 | 
493 | You should have received a copy of the GNU General Public License
494 | along with this program; if not, write to the Free Software
495 | Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
496 | USA.
497 | 
498 | 
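Rule 1 of the DESCRIPTION ("each sentence is on its own line") can be tried in isolation. The sketch below lifts the substitution from `output` into a standalone helper; the name `split_sentences` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Break a line after '.', '?' or '!' followed by spaces, except when the
# full stop follows a capital letter (TeX's rule for initials and the like).
sub split_sentences {
    my ($line) = @_;
    $line =~ s/([^A-Z]\.|^\.|\?|!) +/$1\n/g;
    return split /\n/, $line;
}

my @sentences = split_sentences(
    'See the book by J. Smith. It works! Does it? Yes.');
print "$_\n" for @sentences;
```

The full stop after the initial "J" does not end a sentence because it follows a capital letter, so the sample is split after "Smith.", "works!" and "it?" only.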

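The depth counting behind the `-i` option can be sketched in the same way. The two patterns below are copied from fmtlatex; the helper `indent_lines` is hypothetical, and it is simplified in that it leaves out the extra indent the script gives to lines starting with `\(`.

```perl
use strict;
use warnings;

# Depth patterns copied from fmtlatex; they are matched with the /x flag.
my $descend = ' \\\\ \[ | \\\\ begin \b | \{ ';
my $ascend  = ' \\\\ \] | \\\\ end \b | \} ';

# Prefix each line with $indent repeated once per level of nesting.
sub indent_lines {
    my ($indent, @lines) = @_;
    my $depth = 0;
    my @out;
    for my $text (@lines) {
        # A line that starts with a closer or an \item steps back one level.
        my $offset = $text =~ /^(?:$ascend|\\item)/x ? -1 : 0;
        push @out, ($indent x ($depth + $offset)) . $text;

        # Count openers and closers outside comments to get the depth
        # that applies to the next line.
        (my $code = $text) =~ s/%.*//;
        $depth++ while $code =~ /$descend/gx;
        $depth-- while $code =~ /$ascend/gx;
    }
    return @out;
}

print "$_\n" for indent_lines('  ',
    '\begin{itemize}', '\item first', 'text', '\end{itemize}');
```

Here `\begin{itemize}` opens two levels (`\begin` and `{`) and closes one (`}`), so the body sits one level deep, while `\item` and `\end` lines are pulled back to the level of the `\begin`.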

--------------------------------------------------------------------------------