├── LICENSE
├── README.md
├── djvu2pdf
└── djvu2pdf_toc_parser.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 Hordur Freyr Yngvason 
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | # WARNING
 3 | 
 4 | This script is very fragile. Don't use it unless you know what it
 5 | does!
 6 | 
 7 | 
 8 | # About
 9 | 
10 | Generates a compressed PDF from a DjVu file and tries to include the
11 | text layers from the original DjVu file. The behavior for files
12 | without embedded text is unknown.
13 | 
14 | 
15 | # (nontrivial) Dependencies
16 | 
17 | - [`djvused`](http://djvu.sourceforge.net/): To extract metadata like the TOC and the number of pages.
18 | - [`ddjvu`](http://djvu.sourceforge.net/): To convert the DjVu file into a multipage TIFF, which `tiffsplit` (from libtiff) then splits into pages.
19 | - [`djvu2hocr`](http://jwilk.net/software/ocrodjvu): To extract the OCR layers for `pdfbeads`.
20 | - [`pdfbeads`](http://rubygems.org/gems/pdfbeads): To combine TIFF images and OCR content into a highly
21 |   compressed PDF file.
22 | - `djvu2pdf_toc_parser.py`: A Python script to convert the TOC for `pdfbeads`.
23 | 
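To illustrate the TOC conversion, here is a minimal sketch in Python. It is not the code in `djvu2pdf_toc_parser.py`: it swaps the character-by-character walk for a regex tokenizer, and the name `outline_to_toc` and the sample outline are hypothetical. It assumes the outline has the shape the parser expects, that is `(bookmarks ("Title" "#page" children...) ...)`, and it emits the same `"Title" "page"` lines with one tab per nesting level.

```python
import re

# A token is a parenthesis or a double-quoted string (backslash escapes allowed).
# Unquoted symbols such as `bookmarks` are not tokens and are skipped.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"')


def outline_to_toc(outline):
    """Convert an s-expression outline into tab-indented TOC lines.

    Hypothetical helper for illustration only.
    """
    lines = []
    depth = 0
    strings = []  # quoted strings seen since the last opening parenthesis
    for token in TOKEN.findall(outline):
        if token == "(":
            depth += 1
            strings = []
        elif token == ")":
            depth -= 1
        else:
            strings.append(token)
            if len(strings) == 2:
                title, page = strings
                # Replace escaped double quotes with single quotes.
                title = '"' + title[1:-1].replace('\\"', "'") + '"'
                # Drop the '#' prefix from the page reference.
                target = page[1:-1]
                if target.startswith("#"):
                    target = target[1:]
                # Depth 1 is the `(bookmarks` wrapper, depth 2 a top-level entry.
                indent = "\t" * (depth - 2)
                lines.append('{0}{1} "{2}"'.format(indent, title, target))
    return lines


example = r'''(bookmarks
 ("Chapter 1"
  "#5"
  ("Section \"A\""
   "#7" ) )
 ("Chapter 2"
  "#20" ) )
'''

print("\n".join(outline_to_toc(example)))
```

The nested `Section` entry comes out indented by one tab below `Chapter 1`, and an empty outline yields no lines at all.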
24 | # TODO
25 | 
26 | ## Handle arguments
27 | 
28 | The basic use case for now is only `djvu2pdf [input] [output]`, but we
29 | should at least make sure that
30 | 
31 | - exactly two arguments are provided
32 | - `input` exists
33 | - `output` is writable (otherwise we'd lose a lot of precious
34 |   work)
35 |   
36 | Furthermore, it might be nice to have the option to supply a
37 | `pdfbeads`-compatible TOC along with the input file (the indentation-based
38 | syntax is nice, so one might decide to write a TOC by hand). This feature
39 | could be introduced through the flag `--toc=[table of contents file]`.
40 | 
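The three argument checks listed above could look roughly like this. It is only a sketch, written in Python like the TOC helper rather than in bash, and `validate_args` is a hypothetical name that does not exist in this repository. It treats an output file that does not exist yet as writable when its directory is writable, which is an assumption about what "writable" should mean here.

```python
import os


def validate_args(argv):
    """Return (input_path, output_path) or raise ValueError.

    Hypothetical helper; `argv` is the argument list without the program name.
    """
    if len(argv) != 2:
        raise ValueError("usage: djvu2pdf [input] [output]")
    file_in, file_out = (os.path.abspath(p) for p in argv)
    if not os.path.isfile(file_in):
        raise ValueError("input does not exist: " + file_in)
    # A file that does not exist yet is writable if its directory is.
    if os.path.exists(file_out):
        target = file_out
    else:
        target = os.path.dirname(file_out)
    if not os.access(target, os.W_OK):
        raise ValueError("output is not writable: " + file_out)
    return file_in, file_out
```

Running these checks before any page is extracted means a bad output path is reported up front instead of after the expensive conversion.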
41 | 
42 | ## Handle errors along the way
43 | 
44 | - If `input` is not a djvu file, then we should fail instantly.
45 | - If some dependency isn't installed, we should quit immediately (but
46 |   it is not our job to make sure they are set up correctly if they are
47 |   there).
48 | - If there is no embedded text then we should not output any temporary
49 |   html files along the way.
50 | 
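The first two checks above could be sketched as follows, again in Python and with hypothetical names (`REQUIRED_TOOLS`, `missing_tools`, `looks_like_djvu`). The tool list is taken from the commands the `djvu2pdf` script calls. The file check relies on DjVu files starting with the 8 bytes `AT&TFORM`; it only inspects that header and does not validate the rest of the file.

```python
import shutil

# Commands invoked by the djvu2pdf script (assumed to be looked up on PATH).
REQUIRED_TOOLS = (
    "djvused",
    "ddjvu",
    "tiffsplit",
    "djvu2hocr",
    "pdfbeads",
    "djvu2pdf_toc_parser.py",
)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def looks_like_djvu(path):
    """Return True if the file starts with the DjVu magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(8) == b"AT&TFORM"
```

A caller would abort with a message as soon as `missing_tools()` returns a non-empty list or `looks_like_djvu(input)` is false, before creating the temporary directory.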


--------------------------------------------------------------------------------
/djvu2pdf:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | 
 4 | #
 5 | # Set up paths
 6 | # 
 7 | 
 8 | initial_directory=$(pwd)
 9 | file_in=$(readlink -f "$1")
10 | file_out=$(readlink -f "$2")
11 | file_out_basename=$(basename "$file_out")
12 | 
13 | #
14 | # Set up the temporary storage
15 | # 
16 | 
17 | tmpdir=$(mktemp -d)
18 | cd "$tmpdir" || exit 1
19 | 
20 | function cleanup() 
21 | {
22 |     cd "$initial_directory"
23 |     rm -rf "$tmpdir"
24 | }
25 | trap "cleanup" EXIT # makes sure we clean up after ourselves
26 | 
27 | #
28 | # Extract raw pages and text
29 | #
30 | 
31 | ddjvu -format=tiff "$file_in" tmp_multipage.tiff
32 | tiffsplit tmp_multipage.tiff tmp_page_
33 | rm tmp_multipage.tiff
34 | num_pages=$(djvused -e 'n' "$file_in")
35 | strlen_num_pages="${#num_pages}"
36 | i=0
37 | for page_alpha in tmp_page_*; do
38 |     i=$((i+1))
39 |     j=$(printf "%0${strlen_num_pages}d" "$i")
40 |     mv "$page_alpha" "tmp_page_${j}.tiff"
41 | 
42 |     # `pdfbeads` needs the OCR content as one HTML file per page.
43 |     # `djvu2hocr` can extract it all at once, but then it goes into
44 |     # one big file which we would need to split afterwards. That
45 |     # would require HTML parsing, which is just too much.  The
46 |     # s/ocrx/ocr/g substitution is a small hack to make `pdfbeads`
47 |     # understand the output from `djvu2hocr`.
48 | 
49 |     djvu2hocr "$file_in" -p "$i" | sed 's/ocrx/ocr/g' > "tmp_page_${j}.html"
50 | done
51 | 
52 | 
53 | #
54 | # Generate the TOC
55 | #
56 | 
57 | djvused -e 'print-outline' "$file_in" > toc.txt
58 | 
59 | # The output returned by `djvused` has an s-expression-like
60 | # tree structure, which is incompatible with the indentation-based
61 | # structure used by `pdfbeads`.
62 | 
63 | djvu2pdf_toc_parser.py < toc.txt > toc.out.txt
64 | 
65 | 
66 | #
67 | # Generate the final PDF
68 | #
69 | 
70 | pdfbeads --toc toc.out.txt -o output.pdf
71 | mv output.pdf "$file_out"
72 | 


--------------------------------------------------------------------------------
/djvu2pdf_toc_parser.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | 
 3 | import sys
 4 | 
 5 | def parse_sexp(toc_input, toc_output, indent_str, i):
 6 |     """
 7 |     Translate TOC in the s-exp format output by ``djvused`` to a
 8 |     format understood by ``pdfbeads``. 
 9 | 
10 |     ``toc_input[i:]`` is the string to parse, and ``indent_str`` is
11 |     the string of tabulations for our current level in the output
12 |     TOC. The output from ``djvused`` should not include the
13 |     '(bookmarks' prefix. Anything up until the opening parenthesis is
14 |     disregarded, as well as anything after its matching closing one.
15 | 
16 |     If the input is badly formatted, weird things will happen.
17 |     """
18 |     
19 |     while True:
20 |         
21 |         if toc_input[i] == '(':
22 |             i += 1
23 |             i, title = next_quote(toc_input, i)
24 |             i, page  = next_quote(toc_input, i)
25 |             
26 |             # `djvused` outputs page numbers prefixed with '#' (the
27 |             # second character of the `page` variable) which we don't
28 |             # want in our output
29 |             page = page[0]+page[2:] 
30 |             
31 |             toc_output += ["{0}{1} {2}".format(indent_str, title, page)]
32 |             
33 |             i = parse_sexp(toc_input, toc_output, indent_str+'\t', i)
34 | 
35 |         elif toc_input[i] == ')':
36 |             return i+1
37 |         else:
38 |             i += 1
39 | 
40 | def next_quote(str, i):
41 |     """
42 |     Finds the next substring of ``str[i:]`` enclosed by non-escaped
43 |     double-quotes and returns the tuple ``j, res`` where ``j`` is the
44 |     index just past the closing quote and ``res`` is the quoted string
45 |     (quotes included) with escaped double-quotes replaced by single-quotes.
46 |     """
47 |     
48 |     # Find the opening quote. This is simple because we can safely
49 |     # assume there is only whitespace in between
50 |     j = i
51 |     while str[j] != '"':
52 |         j += 1
53 |     i = j
54 |     j += 1
55 |     output = ['"']
56 |     
57 |     # Find the closing quote. This is a bit more involved because the
58 |     # output may include escaped double quotes, which `pdfbeads`
59 |     # cannot handle correctly. To circumvent this bug, we replace
60 |     # every escaped double quote inside the literal with a single quote.
61 |     while True:
62 |         if str[j] == '"':
63 |             if str[j-1] == "\\":
64 |                 output.pop()
65 |                 output += ["'"]
66 |             else:
67 |                 output += [str[j]]
68 |                 break
69 |         else:
70 |             output += [str[j]]
71 |         j += 1
72 |         
73 |     return j+1, ''.join(output)
74 | 
75 | if __name__ == '__main__':
76 |     toc_input = sys.stdin.read()
77 |     
78 |     # It's possible that the file does not have a table of contents,
79 |     # in which case we won't read anything at all
80 |     if len(toc_input) > 0:
81 |         # We must skip the '(bookmarks' prefix as it doesn't fit the
82 |         # general pattern expected by `parse_sexp`. 
83 |         toc_output = []
84 |         parse_sexp(toc_input[1:], toc_output, '', 0) 
85 |         print('\n'.join(toc_output))
86 | 


--------------------------------------------------------------------------------