Tools for Converting LaTeX to XML
I spent some time surveying the available tools for converting LaTeX to XHTML+MathML or, more generally, LaTeX to XML. My criteria were the following:

- The project must be free and open source.
- It should produce clean, semantic XHTML+MathML or XML output.
- It should be able to handle macro definitions using standard LaTeX commands.
- The possibility of adding support for additional LaTeX packages (e.g., natbib or hyperref) is a plus.
- Tools that require little or no manual intervention or modification of the LaTeX source are preferred.

I was pleasantly surprised at how many projects I found that actually met most of these criteria. Those that I am aware of are described in more detail below.

Overall, I was most impressed with LaTeXML. In my opinion, its usage is the most straightforward and it produces very clean, general XML output that can be turned into easily customizable XHTML+MathML documents.

LaTeXML

LaTeXML is a Perl module which parses the actual LaTeX document and emits XML output for later post-processing (for example, for conversion to XHTML+MathML). With a proper XSLT stylesheet, one can obtain custom XHTML+MathML output. The LaTeXML homepage itself was generated using LaTeXML. The project is still active (at the time of writing), it’s very well documented, and it has a Trac and a mailing list.

If you use Debian GNU/Linux, you can install the relevant dependencies with

sudo apt-get install libparse-recdescent-perl libimage-magick-perl \
    libxml-libxml-common-perl libxml-libxslt-perl

The package is installed using the usual procedure for Perl modules:

perl Makefile.PL
make
make test
sudo make install

The usage is straightforward. First convert the LaTeX document, say mydoc.tex, to XML and then post-process the XML, converting it to XHTML+MathML:

latexml --dest=mydoc.xml mydoc
latexmlpost --dest=somewhere/mydoc.xhtml mydoc.xml
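
The two-step conversion above can be wrapped in a small script. The following is a minimal sketch in Python, not part of LaTeXML itself: the function names build_commands and convert are made up for this example, and it assumes that latexml and latexmlpost are on the PATH and that only the --dest option shown above is needed.

```python
import subprocess


def build_commands(stem, out_dir):
    """Return the two command lines for LaTeX -> XML -> XHTML+MathML.

    `stem` is the document name without the .tex extension and
    `out_dir` is the directory for the final XHTML file (both are
    hypothetical parameters chosen for this sketch).
    """
    xml_file = f"{stem}.xml"
    xhtml_file = f"{out_dir}/{stem}.xhtml"
    return [
        # Step 1: parse the LaTeX source into LaTeXML's XML format.
        ["latexml", f"--dest={xml_file}", stem],
        # Step 2: post-process the XML into XHTML+MathML.
        ["latexmlpost", f"--dest={xhtml_file}", xml_file],
    ]


def convert(stem, out_dir):
    """Run both steps in order, stopping at the first one that fails."""
    for cmd in build_commands(stem, out_dir):
        subprocess.run(cmd, check=True)
```

Calling convert("mydoc", "somewhere") would run the same two commands as above. Keeping the command construction separate from the execution makes it easy to inspect what will be run before anything is actually invoked.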

LaTeXML is a project of NIST and is therefore in the public domain.

Tralics

Tralics is written in C++ and also parses the LaTeX source directly. It is also extremely fast. It is licensed under the French CeCILL open source license, which is GPL-compatible.

Compiling it is straightforward:

tar zxvf tralics-src-2.13.5.tar.gz
cd tralics-2.13.5/src
make

To convert a LaTeX document to XML:

tralics doc.tex

A file called doc.xml will be created. Tralics handles unknown commands from unsupported packages (hyperref, for example) by including an <error> tag:

<error n='\hypersetup' l='35' c='Undefined command'/>

So, apparently, it should never fail to parse a document as long as it is valid LaTeX.
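
Because unsupported commands end up as <error> elements rather than as a failed run, it is worth scanning the output for them. The following Python sketch is one way to do that, under some assumptions: the function name collect_errors is made up, the attributes n, l and c are read as command name, line number and message based only on the example above, and the surrounding <doc> and <p> elements in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET


def collect_errors(xml_text):
    """Return (n, l, c) attribute triples for every <error> element.

    The attribute meanings (name, line, message) are an assumption
    inferred from a single example, not from the Tralics documentation.
    """
    root = ET.fromstring(xml_text)
    return [
        (err.get("n"), err.get("l"), err.get("c"))
        for err in root.iter("error")
    ]


# Hypothetical fragment: only the <error> element is taken from real
# output; the <doc> wrapper and the <p> element are invented.
sample = r"""<doc>
  <p>Some text</p>
  <error n='\hypersetup' l='35' c='Undefined command'/>
</doc>"""

for name, line, message in collect_errors(sample):
    print(f"line {line}: {message}: {name}")
```

An empty result suggests that every command in the document was understood; anything else gives the list of commands that need attention.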

The XML file can then be converted to XHTML+MathML using a stylesheet. Several examples are provided in the “Extra files” package.

Hermes

Hermes is a grammar-based DVI parser for translating LaTeX to Unicode-encoded XML+MathML. It works by first including a set of TeX macros in the original LaTeX document which insert specials in the DVI file. It then constructs XML output by parsing the semantic DVI file.

Some examples are provided here. In particular, there is a collection of articles from arxiv-math that were translated to XHTML+MathML.

Hermes is very complete in terms of functionality, but there are still a few glitches here and there. In particular, it has trouble handling spaces properly (see some of the examples). It also requires two steps just to get the XML file, as you first have to create a “seed” LaTeX document (which essentially just adds a \include dtx line that pulls in the extra macro definitions).

TeX4ht

TeX4ht, available in the Debian package tex4ht, is probably the most widely used LaTeX to (X)HTML tool. It supports conversion to HTML, XHTML+MathML, OpenDocument, and DocBook. Direct XHTML+MathML conversion is possible using a command like the following:

htlatex filename "xhtml,mathml" " -cunihtf" "-cvalidate"

See the documentation for details about the available options.

The rendered result of the direct XHTML+MathML conversion looks very nice, but the markup itself didn’t seem very clean or semantic. It seems that it’s possible to heavily customize the output if you like, but the methods for doing so aren’t exactly obvious. I didn’t test its DocBook conversion, although this may also be a promising route.

LXir

LXir is another DVI-parsing LaTeX to XML translator. You must first include \RequirePackage{lxir} in your LaTeX document and run latex to obtain a DVI file. Then running lxir doc.dvi will produce an XML file that can be processed using xsltproc.

LXir still has some problems, however. It will fail if it encounters commands from any unsupported packages. Even after removing all external package dependencies from my document, LXir still failed to process the standard \author{foo \and bar} structure. Once I removed that, there were still errors in the generated MathML.

Overall, LXir looks promising, and I think it’s a project worth keeping an eye on, but it doesn’t seem ready for production use (at least not for anything containing mathematics).
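
When comparing the output of tools like these, a cheap first check is whether the generated file is well-formed XML and actually contains MathML. The Python sketch below does only that; it does not validate the MathML and would not catch every kind of error mentioned above. The function name count_math and the sample document are made up for this example, and it assumes the output uses the standard MathML namespace and no named entities that would require a DTD.

```python
import xml.etree.ElementTree as ET

# The standard MathML namespace URI.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def count_math(xhtml_text):
    """Count <math> elements in the MathML namespace.

    Raises xml.etree.ElementTree.ParseError if the input is not
    well-formed XML.
    """
    root = ET.fromstring(xhtml_text)
    return sum(1 for _ in root.iter(f"{{{MATHML_NS}}}math"))


# Invented sample document for illustration.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Euler: <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mi>e</mi></math></p>
  </body>
</html>"""

print(count_math(sample))
```

A count of zero for a document that contains formulas, or a ParseError, is a quick sign that something went wrong before any stylesheet or browser gets involved.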

GELLMU

There is also an alternative markup language called GELLMU which supports XHTML+MathML, HTML, PDF, and DVI output. While it does meet most of my criteria, I’d rather write real LaTeX than pseudo-LaTeX. It’s certainly debatable, but I consider LaTeX to be an archival format. At the very least, an acceptable LaTeX-to-XML tool will eventually emerge. Clean LaTeX code is very structured and LaTeX is going to be with us for a very long time. Thus, it would simplify things if I were able to store my originals in LaTeX format.