├── LICENSE
├── README.md
├── djvu2pdf
└── djvu2pdf_toc_parser.py


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Hordur Freyr Yngvason

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# WARNING

This script is very fragile. Don't use it unless you know what it
does!


# About

Generates a compressed PDF from a DjVu file and tries to include the
text layers from the original DjVu file. I have no idea what happens if
there is no embedded text.


# (nontrivial) Dependencies

- [`djvused`](http://djvu.sourceforge.net/): To extract metadata like the TOC and the number of pages.
- [`ddjvu`](http://djvu.sourceforge.net/): To convert the DjVu file into a multi-page TIFF file.
- `tiffsplit` (part of the libtiff tools): To split the multi-page TIFF into one file per page.
- [`djvu2hocr`](http://jwilk.net/software/ocrodjvu): To extract the OCR layers for `pdfbeads`.
- [`pdfbeads`](http://rubygems.org/gems/pdfbeads): To combine TIFF images and OCR content into a highly
  compressed PDF file.
- `djvu2pdf_toc_parser.py`: A Python script to convert the TOC for `pdfbeads`.

# TODO

## Handle arguments

The only supported use case for now is `djvu2pdf [input] [output]`, but we
should at least make sure that

- exactly two arguments are provided
- `input` exists
- `output` is writable (otherwise we'd lose a lot of precious
  work)

Furthermore, it might be nice to have the option to supply a
`pdfbeads`-compatible TOC alongside the input file (the indentation-based
syntax is nice, so one might decide to write a TOC by hand). This feature
could be introduced through the flag `--toc=[table of contents file]`.


## Handle errors along the way

- If `input` is not a DjVu file, then we should fail instantly.
- If some dependency isn't installed, we should quit immediately (but
  it is not our job to make sure the dependencies are set up correctly
  if they are there).
- If there is no embedded text, then we should not output any temporary
  HTML files along the way.
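
The checks listed under the two TODO headings above could be sketched as
shell functions along the following lines. This is only a rough sketch and
not part of the script: the function names `check_args`, `check_deps` and
`is_djvu` are hypothetical, "writable" is interpreted as "the output's
directory is writable", and the DjVu test assumes that `head -c` is
available and relies on DjVu files starting with the bytes `AT&TFORM`.

```shell
# Sketch only: check_args, check_deps and is_djvu are hypothetical helpers.

# Succeeds only if exactly two arguments are given, the input exists,
# and the directory of the output file is writable (an assumption about
# what "writable" should mean here).
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: djvu2pdf [input] [output]" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "djvu2pdf: input '$1' does not exist" >&2
        return 1
    fi
    out_dir=$(dirname "$2")
    if [ ! -d "$out_dir" ] || [ ! -w "$out_dir" ]; then
        echo "djvu2pdf: cannot write to '$out_dir'" >&2
        return 1
    fi
    return 0
}

# Succeeds only if every command named in the arguments is on the PATH.
check_deps() {
    for dep in "$@"; do
        if ! command -v "$dep" >/dev/null 2>&1; then
            echo "djvu2pdf: missing dependency: $dep" >&2
            return 1
        fi
    done
    return 0
}

# Succeeds only if the file starts with the DjVu magic bytes "AT&TFORM".
is_djvu() {
    [ "$(head -c 8 "$1" 2>/dev/null)" = "AT&TFORM" ]
}
```

In `djvu2pdf` these would presumably run before the temporary directory is
created, for example `check_args "$@" || exit 1`, followed by
`check_deps djvused ddjvu tiffsplit djvu2hocr pdfbeads djvu2pdf_toc_parser.py || exit 1`
and `is_djvu "$1" || exit 1`.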

--------------------------------------------------------------------------------
/djvu2pdf:
--------------------------------------------------------------------------------
#!/bin/bash


#
# Set up paths
#

initial_directory=$(pwd)
file_in=$(readlink -f "$1")
file_out=$(readlink -f "$2")
file_out_basename=$(basename "$file_out")

#
# Set up the temporary storage
#

tmpdir=$(mktemp -d)
cd "$tmpdir"

function cleanup()
{
    rm -rf "$tmpdir"
    cd "$initial_directory"
}
trap "cleanup" EXIT # makes sure we clean up after ourselves

#
# Extract raw pages and text
#

ddjvu -format=tiff "$file_in" tmp_multipage.tiff
tiffsplit tmp_multipage.tiff tmp_page_
rm tmp_multipage.tiff
num_pages=$(djvused -e 'n' "$file_in")
strlen_num_pages="${#num_pages}"
i=0
for page_alpha in tmp_page_*; do
    i=$((i+1))
    # Zero-pad the page number so the files sort in page order
    j=$(printf "%0${strlen_num_pages}d" "$i")
    mv "$page_alpha" "tmp_page_${j}.tiff"

    # OCR content needs to have one html file per page for `pdfbeads`;
    # `djvu2hocr` is capable of extracting it all at once, but then it
    # goes into one big file which we would need to split afterwards.
    # That would require html parsing, which is just too much. The
    # s/ocrx/ocr/g substitution is a small hack to make `pdfbeads`
    # understand the output from `djvu2hocr`.

    djvu2hocr "$file_in" -p "$i" | sed 's/ocrx/ocr/g' > "tmp_page_${j}.html"
done


#
# Generate the TOC
#

djvused -e 'print-outline' "$file_in" > toc.txt

# The output returned by `djvused` has an s-expression-like
# tree structure, which is incompatible with the indentation-based
# structure used by `pdfbeads`

djvu2pdf_toc_parser.py < toc.txt > toc.out.txt


#
# Generate the final PDF
#

pdfbeads --toc toc.out.txt -o output.pdf
mv output.pdf "$file_out"
--------------------------------------------------------------------------------
/djvu2pdf_toc_parser.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

import sys

def parse_sexp(toc_input, toc_output, indent_str, i):
    """
    Translate TOC in the s-exp format output by ``djvused`` to a
    format understood by ``pdfbeads``.

    ``toc_input[i:]`` is the string to parse, and ``indent_str`` is
    the string of tabulations for our current level in the output
    TOC. The output from ``djvused`` should not include the
    '(bookmarks' prefix. Anything up until the opening parenthesis is
    disregarded, as well as anything after its matching closing
    parenthesis.

    If the input is badly formatted, weird things will happen.
    """

    while True:

        if toc_input[i] == '(':
            i += 1
            i, title = next_quote(toc_input, i)
            i, page = next_quote(toc_input, i)

            # `djvused` outputs page numbers prefixed with '#' (the
            # second character of the `page` variable) which we don't
            # want in our output
            page = page[0]+page[2:]

            toc_output += ["{0}{1} {2}".format(indent_str, title, page)]

            i = parse_sexp(toc_input, toc_output, indent_str+'\t', i)

        elif toc_input[i] == ')':
            return i+1

        i += 1

def next_quote(str, i):
    """
    Finds the next substring ``str[k:j]`` of ``str[i:]`` enclosed by
    non-escaped double-quotes and returns the tuple ``j, res`` where
    ``res`` is ``str[k:j]`` with escaped double-quotes replaced by
    single-quotes.
    """

    # Find the opening quote. This is simple because we can safely
    # assume there is only whitespace in between
    j = i
    while str[j] != '"':
        j += 1
    i = j
    j += 1
    output = ['"']

    # Find the closing quote. This is a bit more involved because the
    # output may include escaped double quotes, which `pdfbeads`
    # cannot handle correctly. To circumvent this bug, we replace
    # every escaped double quote inside the literal with a single quote.
    while True:
        if str[j] == '"':
            if str[j-1] == "\\":
                output.pop()
                output += ["'"]
            else:
                output += [str[j]]
                break
        else:
            output += [str[j]]
        j += 1

    return j+1, ''.join(output)

if __name__ == '__main__':
    toc_input = sys.stdin.read()

    # It's possible that the file does not have a table of contents,
    # in which case we won't read anything at all
    if len(toc_input) > 0:
        # We must skip the '(bookmarks' prefix as it doesn't fit the
        # general pattern expected by `parse_sexp`.
        toc_output = []
        parse_sexp(toc_input[1:], toc_output, '', 0)
        print('\n'.join(toc_output))
--------------------------------------------------------------------------------