├── Dockerfile
├── README.md
├── data
│   └── test.pdf
├── docker-compose.yml
└── maxtract
    ├── README
    ├── anderson_post.opt
    ├── connectedcomp_ci.opt-STATIC
    ├── extractElements.opt
    ├── linearizer.opt
    ├── maxtract.opt
    ├── pdf2tiff
    ├── pdftk
    ├── process_file.sh
    ├── test1.pdf
    └── test2.pdf


/Dockerfile:
--------------------------------------------------------------------------------
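# 32-bit base image: the bundled Maxtract binaries are 32-bit Linux executables
FROM i386/ubuntu:14.04

# Tools installed for the Maxtract scripts
RUN apt-get update -qy && apt-get install -y pdftk poppler-utils imagemagick
# No WORKDIR is set, so the binaries and scripts land in /, next to the /data mount
COPY maxtract .

CMD ["./process_file.sh", "test.pdf"]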


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Maxtract - Extract LaTeX source from pdf files
  2 | This is an attempt to package the Maxtract program in a working Docker image.
  3 | The executables are [downloaded from the official site](http://www.cs.bham.ac.uk/research/groupings/reasoning/sdag/maxtract_lin32.tgz) and placed in a working environment inside a Docker image.
  4 | 
  5 | ## How to build the Docker image
  6 | ```
  7 | docker build . -t maxtract
  8 | ```
  9 | 
 10 | ## How to use the Docker image
 11 | Create a `data` directory and put your PDF files in it.
 12 | This repository already ships Maxtract's default `test.pdf` file there.
 13 | ```
 14 | docker run -v "$(pwd)/data:/data" -it maxtract:latest ./process_file.sh test.pdf
 15 | ```
 16 | 
 17 | This should produce a file `test.tex` inside the `data` folder (`process_file.sh` strips the `.pdf` suffix before appending the output extension).
 18 | 
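To process every PDF in `data` in one go, the same `docker run` call can be wrapped in a loop. This is only a sketch: `process_all_pdfs` and the `DOCKER` override are hypothetical names introduced here, not part of the repository, and the function assumes the image was built with the `maxtract` tag as shown above.

```shell
# Hypothetical helper: run the maxtract image once per PDF found in a directory.
# Set DOCKER=echo to print the commands instead of executing them (dry run).
process_all_pdfs() {
    local data_dir=${1:-data}
    local docker_cmd=${DOCKER:-docker}
    local pdf
    for pdf in "$data_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDF files
        [ -e "$pdf" ] || continue
        "$docker_cmd" run --rm -v "$(cd "$data_dir" && pwd):/data" maxtract:latest \
            ./process_file.sh "$(basename "$pdf")"
    done
}
```

Calling `DOCKER=echo process_all_pdfs data` first shows what would be run; dropping the override starts one container per file.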
 19 | ## Caveats
 20 | 
 21 | - I'm not the author of Maxtract; I just put together a working image because I needed it.
 22 | 
 23 | - The produced LaTeX files are good but need some tuning by hand to be more usable.
 24 | 
 25 | - If you want to know how the original program works, you should look at the paper or [at the code in other repositories](https://github.com/zorkow/MaxTract).
 26 |   If you want to know how I got it to work, look at `maxtract/process_file.sh`.
 27 |   No magic involved: I just read the help of all the given commands and figured out a way to get them to work together.
 28 | 
 29 | - To compile the resulting tex file, you can use this preamble:
 30 |   ```
 31 |   \documentclass[a4paper,11pt]{article}
 32 |   \usepackage{ifthen}
 33 |   \usepackage{amsmath,amssymb}
 34 |   
 35 |   \begin{document}
 36 |   ```
 37 |   Then insert the whole produced tex file, followed by:
 38 |   ```
 39 |   \end{document}
 40 |   ```
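The wrapping step from the last caveat can also be scripted. The following is a minimal sketch, assuming the produced file is a bare fragment without its own preamble; `wrap_tex` is a hypothetical helper name and not something shipped with Maxtract.

```shell
# Hypothetical helper: wrap a Maxtract output fragment into a compilable document.
# Prints the full document on stdout, using the preamble suggested above.
wrap_tex() {
    local fragment=$1
    cat <<'EOF'
\documentclass[a4paper,11pt]{article}
\usepackage{ifthen}
\usepackage{amsmath,amssymb}

\begin{document}
EOF
    cat "$fragment"
    # The leading newline keeps \end{document} on its own line even if the
    # fragment does not end with a newline
    printf '\n%s\n' '\end{document}'
}
```

For example, `wrap_tex data/test.tex > data/test_full.tex` (file names chosen for illustration) gives a file that can be passed to `pdflatex`.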
 41 | 
 42 | ## Additional Usage
 43 | By default we use the LaTeX output, but you can change this through the arguments of the `process_file.sh` script.
 44 | In particular, the script is invoked as:
 45 | ```
 46 | process_file.sh PDF_FILE_TO_PROCESS DRIVER LAYOUT TYPE
 47 | ```
 48 | where the defaults are `DRIVER=latex2`, `LAYOUT=latex`, `TYPE=LINES`.
 49 | Be aware that many combinations produce errors, so look carefully at the terminal output.
 50 | 
 51 | For convenience, the relevant help output of `anderson_post.opt` is reproduced here (do not ask me what the options mean, because I have no idea):
 52 | ```
 53 | $ ./anderson_post.opt -drivers
 54 |     dummy:       A dummy expression driver that serves as default.
 55 |     festival:    A basic driver for the Festival Speech Synthesis System based
 56 |                  exclusively on character information.
 57 |     index:       A driver to provide XML annotation for Indexing.
 58 |     latex:       A simple LaTeX driver based exclusively on character information.
 59 |                  (Retained for compatibility!)
 60 |     latex2:      A complex LaTeX driver, using font and spacing information.
 61 |                  (Retained for compabitility!)
 62 |     latex_com:   A complex LaTeX driver, using font and spacing information.
 63 |     latex_med:   A medium LaTeX driver, using some font and size information.
 64 |     latex_sim:   A simple LaTeX driver based exclusively on character information.
 65 |     mathml:      A simple MathML driver based exclusively on character information.
 66 |     pdf:         A LaTeX driver that adds Latex and MathML annotations for math
 67 |                  expressions in PDF.
 68 |     pdf_latex:   A LaTeX driver that adds Latex annotations for math expressions in
 69 |                  PDF.
 70 |     pdf_mathml:  A LaTeX driver that adds MathML annotations for math expressions
 71 |                  in PDF.
 72 |     word:        A simple Word driver ignoring all Math.
 73 |     xml:         A simple XML translation with linearity and minimal argument tags
 74 |     xml_bbox:    A complex XML translation with linearity, minimal argument tags
 75 |                  and bounding box information for all sub-expressions.
 76 |     xml_tree:    A simple XML translation faithful to the tree structure
 77 | 
 78 | $ ./anderson_post.opt -layouts
 79 |     dummy:           A dummy layout driver that serves as default.
 80 |                      Implements:
 81 |     festival:        Layout analysis for Festival documents.
 82 |                      Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
 83 |     festival_latex:  Layout analysis for Latex documents to simple text.
 84 |                      Implements: EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC
 85 |     gts:             Ground truth data.
 86 |                      Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
 87 |     index:           Layout analysis to provide XML annotation for Indexing.
 88 |                      Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
 89 |     latex:           Layout analysis for Latex documents.
 90 |                      Implements: EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC
 91 |     legacy:          Invoking legacy drivers.
 92 |                      Implements: MKM09-latex MKM09-mathml DAS10-latex EuDML-latex
 93 |                                  EuDML-mathml ICDAR11-latex ICDAR11-gts
 94 |     mathml:          Layout analysis for Mathml documents.
 95 |                      Implements: EXPR EXPRS CEXPRS LINE LINES AREA PAGE DOC
 96 |     pdf_both:        Multilayer PDF documents containing LaTeX code and normal text
 97 |                      underneath.
 98 |                      Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
 99 |     pdf_code:        Multilayer PDF documents containing LaTeX code underneath.
100 |                      Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
101 |     pdf_text:        Multilayer PDF documents containing plain text underneath.
102 |                      Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
103 |     tralics:         Translation to MathML using Tralics via a given Latex driver.
104 |                      Implements: LINE LINES AREA PAGE DOC
105 |     word:            Getting Words out of documents, etc.
106 |                      Implements: LINE LINES AREA PAGE DOC
107 |     xml:             Layout analysis for Latex documents.
108 |                      Implements: EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC
109 | 
110 | $ ./anderson_post.opt -types
111 |     AREA:    Parsing an area.
112 |              Implemented by drivers: festival festival_latex gts index latex mathml
113 |                                      pdf_both pdf_code pdf_text tralics word xml
114 |     CEXPRS:  Parsing expressions for comparison with image files.
115 |              Implemented by drivers: festival_latex latex mathml xml
116 |     DOC:     Parsing an entire document.
117 |              Implemented by drivers: festival festival_latex gts index latex mathml
118 |                                      pdf_both pdf_code pdf_text tralics word xml
119 |     EXPR:    Parsing single Math expressions.
120 |              Implemented by drivers: festival festival_latex gts index latex mathml
121 |                                      pdf_both pdf_code pdf_text xml
122 |     EXPRS:   Parsing several Math expressions.
123 |              Implemented by drivers: festival festival_latex gts index latex mathml
124 |                                      pdf_both pdf_code pdf_text xml
125 |     LEXPRS:  Parsing expressions separately and including separate files into a
126 |              master file (rather than including the actual expressions).
127 |              Implemented by drivers: festival_latex latex xml
128 |     LINE:    Parsing a single line.
129 |              Implemented by drivers: festival festival_latex gts index latex mathml
130 |                                      pdf_both pdf_code pdf_text tralics word xml
131 |     LINES:   Parsing several lines.
132 |              Implemented by drivers: festival festival_latex gts index latex mathml
133 |                                      pdf_both pdf_code pdf_text tralics word xml
134 |     PAGE:    Parsing a page.
135 |              Implemented by drivers: festival festival_latex gts index latex mathml
136 |                                      pdf_both pdf_code pdf_text tralics word xml
137 |     Some drivers might implement other, undocumented layout types.
138 | ```
139 | 
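Since many combinations fail, it can help to check a `LAYOUT`/`TYPE` pair against the `-layouts` listing before starting a run. The sketch below transcribes that listing into a Bash associative array (Bash 4 or newer); `LAYOUT_TYPES` and `layout_supports` are hypothetical names, the `dummy` and `legacy` layouts are left out, and passing this check does not guarantee that a given driver works with the pair.

```shell
# Types each layout implements, transcribed from `./anderson_post.opt -layouts`.
declare -A LAYOUT_TYPES=(
    [festival]="EXPR EXPRS LINE LINES AREA PAGE DOC"
    [festival_latex]="EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC"
    [gts]="EXPR EXPRS LINE LINES AREA PAGE DOC"
    [index]="EXPR EXPRS LINE LINES AREA PAGE DOC"
    [latex]="EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC"
    [mathml]="EXPR EXPRS CEXPRS LINE LINES AREA PAGE DOC"
    [pdf_both]="EXPR EXPRS LINE LINES AREA PAGE DOC"
    [pdf_code]="EXPR EXPRS LINE LINES AREA PAGE DOC"
    [pdf_text]="EXPR EXPRS LINE LINES AREA PAGE DOC"
    [tralics]="LINE LINES AREA PAGE DOC"
    [word]="LINE LINES AREA PAGE DOC"
    [xml]="EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC"
)

# Hypothetical helper: succeed only if LAYOUT lists TYPE among its implemented types.
layout_supports() {
    local layout=$1 type=$2
    # An empty layout or type never matches
    [ -n "$layout" ] && [ -n "$type" ] || return 1
    case " ${LAYOUT_TYPES[$layout]} " in
        *" $type "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A call such as `layout_supports mathml LINES && ./process_file.sh test.pdf mathml mathml LINES` would then skip pairs that the help output already rules out.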
140 | ## Conclusions and other things
141 | Have fun with it.
142 | If you are one of the authors of the original project, please provide the community with working source files, as well as a description of the drivers and their differences.
143 | 


--------------------------------------------------------------------------------
/data/test.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/data/test.pdf


--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: '3'
2 | services:
3 |   main:
4 |     build: .
5 |     command: ./process_file.sh "${FILE-test.pdf}"
6 |     volumes:
7 |       - ./data:/data:rw
8 | 
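The `${FILE-test.pdf}` expression in the compose file uses the same default-value syntax as the shell: the default applies only when `FILE` is unset, not when it is set to an empty string. The sketch below reproduces that rule in plain Bash; `resolve_compose_file` is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: print the file name the compose command would receive,
# following the "${FILE-test.pdf}" rule (default only when FILE is unset).
resolve_compose_file() {
    echo "${FILE-test.pdf}"
}
```

With Compose itself, the equivalent would be something like `FILE=paper.pdf docker-compose up`, where `paper.pdf` is an example name for a file placed in `data`.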


--------------------------------------------------------------------------------
/maxtract/README:
--------------------------------------------------------------------------------
 1 | maxtract v.1752 
 2 | 32 bit linux 
 3 | 15/11/2012
 4 | 
 5 | ============================= DESCRIPTION ==================================
 6 | 
 7 | A command line tool that reads a PDF and returns different formats. The tool is written in OCaml
 8 | and uses pdftk for decompressing the PDF file.
 9 | 
10 | A directory with a random name is created in /tmp during operation, which is
11 | removed after completion.
12 | 
13 | ============================= REQUIREMENTS ==================================
14 | 
15 | -- PDFtk which can be obtained from http://www.pdflabs.com/docs/install-pdftk/
16 |    If under ubuntu or debian 
17 |       apt-get install pdftk
18 | 
19 | ============================= INSTALLATION ==================================
20 | 
21 | Maxtract does not require installation, as a binary has been released.
22 | 
23 | ============================== USAGE ========================================
24 | 
25 | Maxtract must be run from the directory where the binaries have been extracted to.
26 | 
27 | Running
28 |  $  ./maxtract.opt -help
29 | 
30 | Produces the following option list
31 | 
32 | usage: ./maxtract.opt [-f file] [-o file] [-tex] [-txt] [-lay] [-mat]
33 |   -f : -f Name of input PDF file
34 |   -o : -o Name of output file
35 |  -tex:    Create a latex file (default)
36 |  -lay:    Create latex for a layered PDF with latex and text layers
37 |  -ann:    Create latex for an annotated PDF file with latex annotations
38 |  -txt:    Create a simple text file
39 |  -mat:    Create a text file with math in latex
40 | 
41 | 
42 | === Further explanation ========================================================
43 | 
44 | -f Is always required and takes a PDF filename which has been validated by pdftester
45 | 
46 | -o Is always required and specifies an output filename -including extension.
47 | 
48 | [-tex -lay -txt -mat] Are the choices of output options. -tex is the
49 | default. Any combinations of output driver can be chosen, if more than one is
50 | chosen extensions are automatically assigned.  
51 | 
52 | .tex output from -lay and -ann option needs to be pdflatexed twice to produce the desired
53 | pdf file.
54 | 
55 | If a file cannot be processed '0' is returned to stdout.
56 | 
57 | ===EXAMPLES===
58 | 
59 | ./maxtract.opt -f test1.pdf -o test
60 | produces test.tex
61 | 
62 | ./maxtract.opt -f test1.pdf -o test -txt
63 | produces test.txt
64 | 
65 | ./maxtract.opt -f test1.pdf -o test.tex -lay -txt
66 | produces test.txt and test-lay.tex (remember to run pdflatex twice on the resulting .tex file)
67 | 
68 | ============================= SUPPORT ======================================
69 | 
70 | Any issues, please contact j.baker@cs.bham.ac.uk 
71 | 
72 | 
73 | 
74 | 


--------------------------------------------------------------------------------
/maxtract/anderson_post.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/anderson_post.opt


--------------------------------------------------------------------------------
/maxtract/connectedcomp_ci.opt-STATIC:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/connectedcomp_ci.opt-STATIC


--------------------------------------------------------------------------------
/maxtract/extractElements.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/extractElements.opt


--------------------------------------------------------------------------------
/maxtract/linearizer.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/linearizer.opt


--------------------------------------------------------------------------------
/maxtract/maxtract.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/maxtract.opt


--------------------------------------------------------------------------------
/maxtract/pdf2tiff:
--------------------------------------------------------------------------------
#!/bin/bash
# Convert each PDF argument to a 300 dpi TIFF (tiffg4 device) with Ghostscript.

usage () {
    echo 'Script needs arguments of the form inputfile.pdf and generates inputfile.tif'
}

# Render one PDF next to itself: dir/name.pdf -> dir/name.tif
convert_pdf () {
    local directory filename
    directory=$(dirname "$1")
    filename=$(basename "$1")

    pushd "$directory" > /dev/null || return
    gs -q -sDEVICE=tiffg4 -sOutputFile="$(basename "$filename" .pdf).tif" -r300 -dNOPAUSE -dBATCH "$filename" 2> /dev/null
    popd > /dev/null
}

# Convert the argument only if it is an existing file whose name ends in .pdf.
# (Named check_and_convert instead of `test` to avoid shadowing the shell builtin.)
check_and_convert () {
    if [ -f "$1" ] && [[ "$1" == *.pdf ]]; then
        convert_pdf "$1"
    else
        usage
    fi
}

if [ "$#" -eq 0 ]; then
    usage
fi

for arg in "$@"; do
    check_and_convert "$arg"
done

exit 0


--------------------------------------------------------------------------------
/maxtract/pdftk:
--------------------------------------------------------------------------------
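#!/bin/bash
# Stand-in for `pdftk INFILE output OUTFILE uncompress`, implemented with qpdf.
# Note: qpdf must be installed for this wrapper to work; the Dockerfile in this
# repository does not install it.

# Log the arguments without consuming them. The original loop used `shift`,
# which left $1 and $3 empty by the time they were read below.
for arg in "$@"; do
	echo "$arg"
done

INFILE=$1
OUTFILE=$3

echo "Uncompressing $INFILE to $OUTFILE"
qpdf --stream-data=uncompress "$INFILE" "$OUTFILE"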


--------------------------------------------------------------------------------
/maxtract/process_file.sh:
--------------------------------------------------------------------------------
#!/bin/bash -e

BFILE=${1:?Usage: process_file.sh PDF_FILE_TO_PROCESS [DRIVER] [LAYOUT] [TYPE]}
DRIVER=${2:-latex2}
LAYOUT=${3:-latex}
TYPE=${4:-LINES}
FILEDIR=data
TEMPDIR=process

# ./extractElements.opt -u -f gaal.pdf -d /tmp/process
echo "Creating temporary directory"
mkdir "/tmp/$TEMPDIR"
echo "Uncompressing the file stream"
pdftk "$FILEDIR/$BFILE" output "/tmp/$TEMPDIR/$BFILE" uncompress
echo "Extracting elements"
# -u: the input PDF is already uncompressed
LANG=/usr/lib/locale/en_US ./extractElements.opt -u -f "$BFILE" -d "/tmp/$TEMPDIR"
echo "Linearizing streams"
# Split the output of find on newlines only, so paths containing spaces survive
IFS=$'\n'
for file in $(find "/tmp/$TEMPDIR" -type f -iname "*.jsonf"); do
    ./linearizer.opt -f "$file"
done
echo "Computing LaTeX"
for file in $(find "/tmp/$TEMPDIR" -type f -iname "*.lin"); do
    ./anderson_post.opt --driver "$DRIVER" --layout "$LAYOUT" --type "$TYPE" -auto "$file"
done
echo "Removing unnecessary files"
find "/tmp/$TEMPDIR" -type f -iname "*.json" -delete
find "/tmp/$TEMPDIR" -type f -iname "*.jsonf" -delete
find "/tmp/$TEMPDIR" -type f -iname "*.bb" -delete
find "/tmp/$TEMPDIR" -type f -iname "*.lin" -delete
# -f: do not abort (because of -e) when no .pdf or .tif file is present
rm -f "/tmp/$TEMPDIR"/*.pdf "/tmp/$TEMPDIR"/*.tif
echo "Now outputting those files"
ls "/tmp/$TEMPDIR"/*
# Use the extension that follows the first "<digits>." in the listing
EXTENSION=$(ls -1 "/tmp/$TEMPDIR"/* | grep -oP "[0-9]+\.\K[a-zA-Z0-9]+" | head -n 1)
echo "Computed file extension: $EXTENSION"
ORIGBASENAME=$(basename "$BFILE" .pdf)
# Concatenate the remaining output files, in sorted order, into a single file
find "/tmp/$TEMPDIR" -type f -print | sort | xargs -d '\n' -n1 cat > "$FILEDIR/$ORIGBASENAME.$EXTENSION"
# Give the output the same numeric owner and group as the input file
original_owner=$(stat -c '%u:%g' "$FILEDIR/$BFILE")
chown "$original_owner" "$FILEDIR/$ORIGBASENAME.$EXTENSION"
chmod ug+rw "$FILEDIR/$ORIGBASENAME.$EXTENSION"
rm -rf "/tmp/$TEMPDIR"
echo "Everything went well. Your output is in $ORIGBASENAME.$EXTENSION"
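The `EXTENSION` step in `process_file.sh` relies on GNU grep's `-P` mode and the `\K` escape, which discards everything matched so far and keeps only the extension. The sketch below isolates that pipeline so it can be tried on its own. `first_numbered_extension` is a hypothetical name, and the paths in the usage note are made up, because the real intermediate file names are not shown in this repository.

```shell
# Hypothetical helper mirroring the EXTENSION pipeline of process_file.sh:
# read paths on stdin and print the text following the first "<digits>." match.
first_numbered_extension() {
    grep -oP '[0-9]+\.\K[a-zA-Z0-9]+' | head -n 1
}
```

For a made-up listing such as `/tmp/process/page-0.tex`, this prints `tex`. If a name contained digits directly before an earlier dot (for example a hypothetical `/tmp/process/test1.pdf-0.tex`), the first match would be `pdf` instead, so the computed extension is worth checking in the script's terminal output.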


--------------------------------------------------------------------------------
/maxtract/test1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/test1.pdf


--------------------------------------------------------------------------------
/maxtract/test2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/test2.pdf


--------------------------------------------------------------------------------