├── Dockerfile
├── README.md
├── data
│   └── test.pdf
├── docker-compose.yml
└── maxtract
    ├── README
    ├── anderson_post.opt
    ├── connectedcomp_ci.opt-STATIC
    ├── extractElements.opt
    ├── linearizer.opt
    ├── maxtract.opt
    ├── pdf2tiff
    ├── pdftk
    ├── process_file.sh
    ├── test1.pdf
    └── test2.pdf


/Dockerfile:
--------------------------------------------------------------------------------
FROM i386/ubuntu:14.04

RUN apt update -qy && apt install -y pdftk poppler-utils imagemagick
COPY maxtract .

CMD ["./process_file.sh", "test.pdf"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Maxtract - Extract LaTeX source from pdf files
This is an attempt to package the Maxtract program in a working Docker image.
The executables are [downloaded from the official site](http://www.cs.bham.ac.uk/research/groupings/reasoning/sdag/maxtract_lin32.tgz) and placed in a working environment inside the image.

## How to build the docker
```
docker build . -t maxtract
```

## How to use the docker
Create a `data` directory and put your PDF files in it.
This repository already ships the default `test.pdf` file from Maxtract in `data`.
```
docker run -v "$(pwd)/data:/data" -it maxtract:latest ./process_file.sh test.pdf
```

This should produce a file `test.tex` inside the `data` folder.

## Caveats

- I'm not the author of Maxtract; I just put together a working image because I needed one.

- The LaTeX files produced are good, but need some manual tuning to be more usable.

- If you want to know how the original program works, look at the paper or [at the code in other repositories](https://github.com/zorkow/MaxTract).
  If you want to know how I got it to work, look at `maxtract/process_file.sh`.
  No magic involved: I just read the help output of each command and worked out how to chain them together.

- To compile the resulting tex file you can use the following preamble:
  ```
  \documentclass[a4paper,11pt]{article}
  \usepackage{ifthen}
  \usepackage{amsmath,amssymb}

  \begin{document}
  ```
  Then insert the whole tex file produced and after it:
  ```
  \end{document}
  ```

## Additional Usage
By default the LaTeX output is used, but you can change this through the arguments of the `process_file.sh` script.
The script is invoked as:
```
process_file.sh PDF_FILE_TO_PROCESS DRIVER LAYOUT TYPE
```
where the defaults are `DRIVER=latex2`, `LAYOUT=latex`, `TYPE=LINES`.
Be aware that many combinations produce errors, so check the terminal output carefully.

For convenience, the relevant help output of `anderson_post.opt` is reproduced here (do not ask me what the options mean, because I have no idea):
```
$ ./anderson_post.opt -drivers
dummy: A dummy expression driver that serves as default.
festival: A basic driver for the Festival Speech Synthesis System based
    exclusively on character information.
index: A driver to provide XML annotation for Indexing.
latex: A simple LaTeX driver based exclusively on character information.
    (Retained for compatibility!)
latex2: A complex LaTeX driver, using font and spacing information.
    (Retained for compabitility!)
latex_com: A complex LaTeX driver, using font and spacing information.
latex_med: A medium LaTeX driver, using some font and size information.
latex_sim: A simple LaTeX driver based exclusively on character information.
mathml: A simple MathML driver based exclusively on character information.
pdf: A LaTeX driver that adds Latex and MathML annotations for math
    expressions in PDF.
pdf_latex: A LaTeX driver that adds Latex annotations for math expressions in
    PDF.
pdf_mathml: A LaTeX driver that adds MathML annotations for math expressions
    in PDF.
word: A simple Word driver ignoring all Math.
xml: A simple XML translation with linearity and minimal argument tags
xml_bbox: A complex XML translation with linearity, minimal argument tags
    and bounding box information for all sub-expressions.
xml_tree: A simple XML translation faithful to the tree structure

$ ./anderson_post.opt -layouts
dummy: A dummy layout driver that serves as default.
    Implements:
festival: Layout analysis for Festival documents.
    Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
festival_latex: Layout analysis for Latex documents to simple text.
    Implements: EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC
gts: Ground truth data.
    Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
index: Layout analysis to provide XML annotation for Indexing.
    Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
latex: Layout analysis for Latex documents.
    Implements: EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC
legacy: Invoking legacy drivers.
    Implements: MKM09-latex MKM09-mathml DAS10-latex EuDML-latex
        EuDML-mathml ICDAR11-latex ICDAR11-gts
mathml: Layout analysis for Mathml documents.
    Implements: EXPR EXPRS CEXPRS LINE LINES AREA PAGE DOC
pdf_both: Multilayer PDF documents containing LaTeX code and normal text
    underneath.
    Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
pdf_code: Multilayer PDF documents containing LaTeX code underneath.
    Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
pdf_text: Multilayer PDF documents containing plain text underneath.
    Implements: EXPR EXPRS LINE LINES AREA PAGE DOC
tralics: Translation to MathML using Tralics via a given Latex driver.
    Implements: LINE LINES AREA PAGE DOC
word: Getting Words out of documents, etc.
    Implements: LINE LINES AREA PAGE DOC
xml: Layout analysis for Latex documents.
    Implements: EXPR EXPRS CEXPRS LEXPRS LINE LINES AREA PAGE DOC

$ ./anderson_post.opt -types
AREA: Parsing an area.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text tralics word xml
CEXPRS: Parsing expressions for comparison with image files.
    Implemented by drivers: festival_latex latex mathml xml
DOC: Parsing an entire document.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text tralics word xml
EXPR: Parsing single Math expressions.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text xml
EXPRS: Parsing several Math expressions.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text xml
LEXPRS: Parsing expressions separately and including separate files into a
    master file (rather than including the actual expressions).
    Implemented by drivers: festival_latex latex xml
LINE: Parsing a single line.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text tralics word xml
LINES: Parsing several lines.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text tralics word xml
PAGE: Parsing a page.
    Implemented by drivers: festival festival_latex gts index latex mathml
        pdf_both pdf_code pdf_text tralics word xml
Some drivers might implement other, undocumented layout types.
```

## Conclusions and other things
Have fun with it.
If you are one of the authors of the original project, please provide the community with working source files, as well as a description of the drivers and how they differ.

--------------------------------------------------------------------------------
/data/test.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/data/test.pdf

--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
version: '3'
services:
  main:
    build: .
    command: ./process_file.sh "${FILE-test.pdf}"
    volumes:
      - ./data:/data:rw

--------------------------------------------------------------------------------
/maxtract/README:
--------------------------------------------------------------------------------
maxtract v.1752
32 bit linux
15/11/2012

============================= DESCRIPTION ==================================

A command line tool that reads a PDF and returns different formats. The tool is written in OCaml
and uses pdftk to decompress the PDF file.

A directory with a random name is created in /tmp during operation, which is
removed after completion.

============================= REQUIREMENTS ==================================

-- PDFtk which can be obtained from http://www.pdflabs.com/docs/install-pdftk/
   On Ubuntu or Debian:
       apt-get install pdftk

============================= INSTALLATION ==================================

Maxtract does not require installation, as a binary has been released.

============================== USAGE ========================================

Maxtract must be run from the directory where the binaries have been extracted to.
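
Because the binaries only work from their own directory, a caller elsewhere on the filesystem has to change into it first. The sketch below is one hedged way to do that. `MAXTRACT_HOME`, `abs_path` and `run_in_maxtract_dir` are hypothetical names, not part of Maxtract, and the sketch assumes `maxtract.opt` accepts absolute paths for `-f` and `-o`, which this README does not confirm.

```shell
# Hypothetical helper: make a path absolute so it survives a later `cd`.
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

# Hypothetical wrapper: run maxtract.opt from its own directory.
# MAXTRACT_HOME is an assumed variable pointing at the extracted binaries.
# Assumption: maxtract.opt accepts absolute paths for -f and -o.
run_in_maxtract_dir() {
    input=$(abs_path "$1")
    output=$(abs_path "$2")
    # The subshell keeps the caller's working directory unchanged.
    ( cd "${MAXTRACT_HOME:?set MAXTRACT_HOME first}" && ./maxtract.opt -f "$input" -o "$output" )
}
```

With `MAXTRACT_HOME` set, a call such as `run_in_maxtract_dir paper.pdf paper.tex` would resolve both paths against the caller's directory before switching to the binaries.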

Running
    $ ./maxtract.opt -help

produces the following option list:

usage: ./maxtract.opt [-f file] [-o file] [-tex] [-txt] [-lay] [-mat]
  -f : -f Name of input PDF file
  -o : -o Name of output file
  -tex: Create a latex file (default)
  -lay: Create latex for a layered PDF with latex and text layers
  -ann: Create latex for an annotated PDF file with latex annotations
  -txt: Create a simple text file
  -mat: Create a text file with math in latex


=== Further explanation ========================================================

-f Is always required and takes a PDF filename which has been validated by pdftester.

-o Is always required and specifies an output filename, including the extension.

[-tex -lay -txt -mat] Are the choices of output options. -tex is the
default. Any combination of output drivers can be chosen; if more than one is
chosen, extensions are assigned automatically.

.tex output from the -lay and -ann options needs to be pdflatexed twice to produce the desired
pdf file.

If a file cannot be processed, '0' is written to stdout.
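
The '0' marker above suggests that a caller should inspect stdout rather than rely on the exit status alone. A minimal sketch, assuming the failure output is exactly the single character `0`, possibly surrounded by whitespace; `maxtract_failed` and `convert_or_report` are hypothetical helper names:

```shell
# Hypothetical helper: succeeds (status 0) when the captured stdout is the '0' failure marker.
# Assumption: on failure stdout holds only '0', optionally with surrounding whitespace.
maxtract_failed() {
    [ "$(printf '%s' "$1" | tr -d '[:space:]')" = "0" ]
}

# Hypothetical wrapper: run the conversion and report failures on stderr.
convert_or_report() {
    out=$(./maxtract.opt -f "$1" -o "$2")
    if maxtract_failed "$out"; then
        echo "maxtract could not process $1" >&2
        return 1
    fi
}
```

Stripping whitespace before comparing keeps the check tolerant of a trailing newline, while output such as `10` or real LaTeX is not mistaken for a failure.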

===EXAMPLES===

./maxtract.opt -f test1.pdf -o test
produces test.tex

./maxtract.opt -f test1.pdf -o test -txt
produces test.txt

./maxtract.opt -f test1.pdf -o test.tex -lay -txt
produces test.txt and test-lay.tex (Remember to pdflatex test_layer.tex twice)

============================= SUPPORT ======================================

Any issues, please contact j.baker@cs.bham.ac.uk



--------------------------------------------------------------------------------
/maxtract/anderson_post.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/anderson_post.opt

--------------------------------------------------------------------------------
/maxtract/connectedcomp_ci.opt-STATIC:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/connectedcomp_ci.opt-STATIC

--------------------------------------------------------------------------------
/maxtract/extractElements.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/extractElements.opt

--------------------------------------------------------------------------------
/maxtract/linearizer.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/linearizer.opt

--------------------------------------------------------------------------------
/maxtract/maxtract.opt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/maxtract.opt

--------------------------------------------------------------------------------
/maxtract/pdf2tiff:
--------------------------------------------------------------------------------
#!/bin/bash
# Convert each PDF given on the command line to a 300 dpi G4 TIFF next to it.

# Run Ghostscript from inside the PDF's directory so the .tif lands beside it.
script () {
    directory=$(dirname "$1")
    filename=$(basename "$1")

    pushd "$directory" > /dev/null
    gs -q -sDEVICE='tiffg4' -sOutputFile="$(basename "$filename" .pdf).tif" -r'300' -dNOPAUSE -dBATCH "$filename" 2> /dev/null
    popd > /dev/null
}

# Convert the argument if it is an existing *.pdf file, otherwise print usage.
# (Renamed from `test` so the function no longer shadows the shell builtin.)
convert_if_pdf () {
    if [ -f "$1" ] && [[ "$1" == *.pdf ]]
    then
        script "$1"
    else
        usage
    fi
}

usage () {
    echo 'Script needs arguments of the form inputfile.pdf and generates inputfile.tif'
}

case "$#" in
    0 ) usage;;
    1 ) convert_if_pdf "$1";;
    * ) for arg in "$@"
        do
            convert_if_pdf "$arg"
        done;;
esac

exit 0

--------------------------------------------------------------------------------
/maxtract/pdftk:
--------------------------------------------------------------------------------
#!/bin/bash
# Uncompresses the file
# Call shape used by process_file.sh: pdftk INFILE output OUTFILE uncompress

# Echo the arguments without shifting them away, so $1 and $3 stay available below.
for arg in "$@"; do
    echo "$arg"
done

INFILE=$1
OUTFILE=$3

echo "Uncompressing $INFILE to $OUTFILE"
qpdf --stream-data=uncompress "$INFILE" "$OUTFILE"

--------------------------------------------------------------------------------
/maxtract/process_file.sh:
--------------------------------------------------------------------------------
#!/bin/bash -e

BFILE=$1
DRIVER=${2:-latex2}
LAYOUT=${3:-latex}
TYPE=${4:-LINES}
FILEDIR=data
TEMPDIR=process

# ./extractElements.opt -u -f gaal.pdf -d /tmp/process
echo "Creating temporary directory"
mkdir "/tmp/$TEMPDIR"
echo "Uncompressing the file stream"
pdftk "$FILEDIR/$BFILE" output "/tmp/$TEMPDIR/$BFILE" uncompress
echo "Extracting elements"
LANG=/usr/lib/locale/en_US ./extractElements.opt -u -f "$BFILE" -d "/tmp/$TEMPDIR" # -u if it is already uncompressed
echo "Linearizing streams"
IFS=$'\n';
for file in $(find "/tmp/$TEMPDIR" -type f -iname "*.jsonf"); do
    ./linearizer.opt -f "$file";
done
echo "Computing LaTeX"
for file in $(find "/tmp/$TEMPDIR" -type f -iname "*.lin"); do
    ./anderson_post.opt --driver "$DRIVER" --layout "$LAYOUT" --type "$TYPE" -auto "$file";
done
echo "Removing unnecessary files"
find "/tmp/$TEMPDIR" -type f -iname "*.json" -delete
find "/tmp/$TEMPDIR" -type f -iname "*.jsonf" -delete
find "/tmp/$TEMPDIR" -type f -iname "*.bb" -delete
find "/tmp/$TEMPDIR" -type f -iname "*.lin" -delete
# -f: a missing .pdf or .tif must not abort the script (it runs with -e)
rm -f "/tmp/$TEMPDIR"/*.pdf "/tmp/$TEMPDIR"/*.tif
echo "Now outputting those files"
ls "/tmp/$TEMPDIR"/*
# Take the extension from the first file whose name ends in "<digits>.<ext>"
EXTENSION=$(ls -1 "/tmp/$TEMPDIR"/* | grep -oP "[0-9]+\.\K[a-zA-Z0-9]+" | head -n 1)
echo "Computed file extension: $EXTENSION"
ORIGBASENAME=$(basename "$BFILE" .pdf);
# Concatenate the per-page outputs, in sorted order, into one file next to the input
find "/tmp/$TEMPDIR" -type f -print | sort | xargs -n1 cat > "$FILEDIR/$ORIGBASENAME.$EXTENSION"
# Give the output the same owner and group as the input PDF (numeric ids, via GNU stat)
original_owner=$(stat -c '%u:%g' "$FILEDIR/$BFILE")
chown "$original_owner" "$FILEDIR/$ORIGBASENAME.$EXTENSION"
chmod ug+rw "$FILEDIR/$ORIGBASENAME.$EXTENSION"
rm -rf "/tmp/$TEMPDIR"
echo "Everything went well. Your output is in $ORIGBASENAME.$EXTENSION"


--------------------------------------------------------------------------------
/maxtract/test1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/test1.pdf

--------------------------------------------------------------------------------
/maxtract/test2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/trenta3/maxtract-docker/8889110751e504dc5c317f148d480994c24cb974/maxtract/test2.pdf

--------------------------------------------------------------------------------
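
The argument defaults and output naming in `maxtract/process_file.sh` can be tried without the binaries. This sketch only mirrors the `${N:-default}` expansions and the `basename` call from that script; the function names `resolve_args` and `output_name` are hypothetical and do not exist in the repository.

```shell
# Hypothetical helper: print DRIVER, LAYOUT and TYPE as process_file.sh would resolve them.
# $1 is the PDF name; $2..$4 are optional overrides.
resolve_args() {
    printf '%s %s %s\n' "${2:-latex2}" "${3:-latex}" "${4:-LINES}"
}

# Hypothetical helper: name of the combined output for a given input and extension.
output_name() {
    printf '%s.%s\n' "$(basename "$1" .pdf)" "$2"
}

resolve_args test.pdf                 # latex2 latex LINES
resolve_args test.pdf mathml mathml   # mathml mathml LINES
output_name test.pdf tex              # test.tex
```

Because `:-` also substitutes the default for an empty string, passing `""` for a middle argument is a way to keep its default while overriding a later one.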