├── requirements.txt
├── .gitignore
├── LICENSE
├── README.md
└── PdfQRSplit.py


/requirements.txt:
--------------------------------------------------------------------------------
1 | zxing
2 | pypdf4
3 | pillow
4 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | # Created by https://www.toptal.com/developers/gitignore/api/venv
 2 | # Edit at https://www.toptal.com/developers/gitignore?templates=venv
 3 | 
 4 | ### venv ###
 5 | # Virtualenv
 6 | # http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
 7 | .Python
 8 | [Bb]in
 9 | [Ii]nclude
10 | [Ll]ib
11 | [Ll]ib64
12 | [Ll]ocal
13 | [Ss]cripts
14 | pyvenv.cfg
15 | .venv
16 | pip-selfcheck.json
17 | 
18 | # End of https://www.toptal.com/developers/gitignore/api/venv
19 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Florian Knodt
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # PdfQRSplit
 2 | 
 3 | *PdfQRSplit* is a small utility that splits a multi-page PDF document into separate PDF files at pages containing a specified barcode. This concept is known as a "separator page" and is used with high-volume document scanners to scan a large number of unrelated documents in bulk.
 4 | 
 5 | Although named "*QR*", this tool also works with most other barcode types.
 6 | 
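The separator-page idea can be sketched independently of any PDF handling. The following is a minimal illustration, not the script's actual code: `split_on_separator`, `is_separator` and the list-of-strings "pages" are hypothetical names chosen for this example, while `keep_page` and `keep_page_next` mirror the `-k` and `--keep-page-next` options.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def split_on_separator(
    pages: Sequence[Page],
    is_separator: Callable[[Page], bool],
    keep_page: bool = False,
    keep_page_next: bool = False,
) -> List[List[Page]]:
    """Group pages into documents, closing a document at each separator page."""
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_separator(page):
            if keep_page:
                # The separator becomes the last page of the previous document.
                current.append(page)
            documents.append(current)
            # Optionally the separator also opens the next document.
            current = [page] if keep_page_next else []
        else:
            current.append(page)
    # Whatever is left at the end of the input forms the final document.
    documents.append(current)
    return documents


if __name__ == "__main__":
    scanned = ["a", "b", "SEP", "c", "SEP", "d"]
    print(split_on_separator(scanned, lambda p: p == "SEP"))
```

In this sketch a document is emitted at every separator and at the end of the input regardless of how many pages were collected, so two consecutive separators yield an empty document. The script's loop appears to follow the same pattern.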
 7 | ## Installation and requirements
 8 | 
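 9 | Python 3.6 or newer is required (the script uses f-strings). You also need **zxing** (barcode recognition), **pypdf4** (PDF handling) and **pillow** (image handling). The **zxing** package wraps the Java ZXing library, so a Java runtime must also be available. All three packages can be installed using pip: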
10 | 
11 | ```
12 | pip install zxing pypdf4 pillow
13 | ```
14 | or
15 | ```
16 | pip install -r requirements.txt
17 | ```
18 | 
19 | ## Usage
20 | ```
21 | usage: PdfQRSplit.py [-h] [-p PREFIX] [-s SEPARATOR] [-k] [--keep-page-next] [-b BRIGHTNESS] [-v] [-d] inputfile
22 | 
23 | Split PDF-file into separate files based on a separator barcode
24 | 
25 | positional arguments:
26 |   inputfile             Filename or glob to process
27 | 
28 | optional arguments:
29 |   -h, --help            show this help message and exit
30 |   -p PREFIX, --prefix PREFIX
31 |                         Prefix for generated PDF files. Default: split
32 |   -s SEPARATOR, --separator SEPARATOR
33 |                         Barcode content used to find separator pages. Default: ADAR-NEXTDOC
34 |   -k, --keep-page       Keep separator page in previous document
35 |   --keep-page-next      Keep separator page in next document
36 |   -b BRIGHTNESS, --brightness BRIGHTNESS
37 |                         brightness threshold for barcode preparation (0-255). Default: 128
38 |   -v, --verbose         Show verbose processing messages
39 |   -d, --debug           Show debug messages
40 | ```
41 | 
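The `-b` option controls how page images are reduced to pure black and white before barcode detection: the script clamps the threshold to 0-255, then maps every grayscale value above it to white (255) and everything else to black (0). Below is a minimal pure-Python sketch of that mapping. The helper names `clamp_threshold` and `binarize` are hypothetical, and a plain list of grayscale values stands in for a Pillow image.

```python
from typing import List


def clamp_threshold(value: int) -> int:
    """Limit the threshold to the valid grayscale range 0-255."""
    return max(0, min(255, value))


def binarize(pixels: List[int], threshold: int = 128) -> List[int]:
    """Map grayscale values above the threshold to 255 and all others to 0."""
    threshold = clamp_threshold(threshold)
    return [255 if value > threshold else 0 for value in pixels]


if __name__ == "__main__":
    row = [0, 100, 128, 129, 255]
    print(binarize(row))       # default threshold of 128
    print(binarize(row, 90))   # lower threshold: more pixels become white
```

A lower threshold turns more of the image white, and a higher one turns more of it black. Which direction helps depends on the scan, so it may be worth trying a few values when barcodes are missed.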
42 | ### Example
43 | 
 44 | Take the file **input.pdf** and search all pages for barcodes containing the text *"SPLITME"*. Whenever a separator is found (and again at the end of the input file), the pages collected so far are written to a separate file; because `-k` is given, that file includes the page containing the separator barcode. Since no prefix was given, the first file is named "*split_0_0.pdf*": *split* is the default prefix, the first 0 indicates that it was generated from the first (and in this case only) input file, and the second 0 indicates that it is the first document extracted from that file.
45 | 
 46 | ```python .\PdfQRSplit.py .\input.pdf -s "SPLITME" -k -v```
47 | 
48 | ```
49 | Processing file .\input.pdf containing 66 pages
50 |   Analyzing page 1
51 |   Analyzing page 2
52 |   [...]
53 |   Analyzing page 6
54 |     Found separator - writing 6 pages to split_0_0.pdf
55 |   Analyzing page 7
56 |   [...]
57 |   Analyzing page 13
58 |     Found separator - writing 7 pages to split_0_1.pdf
59 |   Analyzing page 14
60 |   [...]
61 | Split 1 given files into 19 files
62 | ```
63 | 
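The output naming scheme described above (`<prefix>_<input index>_<document index>.pdf`, both indices starting at 0) can be illustrated with a short sketch. The functions `output_name` and `planned_outputs` are hypothetical helpers written for this example and are not part of the script.

```python
from typing import List


def output_name(prefix: str, input_index: int, document_index: int) -> str:
    """Build one output file name from the prefix and the two zero-based indices."""
    return f"{prefix}_{input_index}_{document_index}.pdf"


def planned_outputs(prefix: str, documents_per_input: List[int]) -> List[str]:
    """List all output names, given how many documents each input file yields."""
    names: List[str] = []
    for input_index, count in enumerate(documents_per_input):
        for document_index in range(count):
            names.append(output_name(prefix, input_index, document_index))
    return names


if __name__ == "__main__":
    # Two input files: the first yields two documents, the second yields one.
    print(planned_outputs("split", [2, 1]))
```

Because the input index is part of the name, documents from several files matched by a glob do not overwrite each other as long as they share one run and one prefix.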
64 | ## Thanks
65 | 
66 | This script is based on ["pdf_split_tool" by Thiago Carvalho D'Ávila (staticdev)](https://github.com/staticdev/pdf-split-tool/).
67 | 


--------------------------------------------------------------------------------
/PdfQRSplit.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import glob
  3 | import sys
  4 | import io
  5 | import PyPDF4
  6 | import zxing
  7 | 
  8 | from typing import List
  9 | from tempfile import TemporaryDirectory
 10 | from PIL import Image
 11 | 
 12 | parser = argparse.ArgumentParser(description='Split PDF-file into separate files based on a separator barcode')
 13 | parser.add_argument('filename', metavar='inputfile', type=str,
 14 |                     help='Filename or glob to process')
 15 | parser.add_argument('-p', '--prefix', default="split",
 16 |                     help='Prefix for generated PDF files. Default: split')
 17 | parser.add_argument('-s', '--separator', default="ADAR-NEXTDOC",
 18 |                     help='Barcode content used to find separator pages. Default: ADAR-NEXTDOC')
 19 | parser.add_argument('-k', '--keep-page', action='store_true',
 20 |                     help='Keep separator page in previous document')
 21 | parser.add_argument('--keep-page-next', action='store_true',
 22 |                     help='Keep separator page in next document')
 23 | parser.add_argument('-b', '--brightness', type=int, default=128,
 24 |                     help='brightness threshold for barcode preparation (0-255). Default: 128')
 25 | parser.add_argument('-v', '--verbose', action='store_true',
 26 |                     help='Show verbose processing messages')
 27 | parser.add_argument('-d', '--debug', action='store_true',
 28 |                     help='Show debug messages')
 29 | 
 30 | class PdfQrSplit:
 31 |     def __init__(self, filepath: str, verbose: bool, debug: bool, brightness: int = 128) -> None:
 32 |         self.filepath = filepath
 33 |         self.verbose = verbose
 34 |         self.debug = debug
 35 |         self.brightness = brightness
 36 |         self.input_pdf = PyPDF4.PdfFileReader(filepath)
 37 |         self.total_pages = self.input_pdf.getNumPages()
 38 |         if verbose:
 39 |             print(
 40 |                 "Processing file {} containing {} pages".format(
 41 |                     filepath, self.total_pages
 42 |                 )
 43 |             )
 44 | 
 45 |     def split_qr(self, split_text: str, ifiles: int) -> int:
 46 |         """Creates new files based on barcode contents.
 47 |         Args:
 48 |             split_text: Barcode content to recognize a separator page
 49 |         Returns:
 50 |             int: Number of generated files.
 51 |         """
 52 |         pdfs_count = 0
 53 |         current_page = 0
 54 | 
 55 |         reader = zxing.BarCodeReader()
 56 |         pdf_writer = PyPDF4.PdfFileWriter()
 57 | 
 58 |         while current_page != self.total_pages:
 59 | 
 60 |             if self.verbose:
 61 |                 print("  Analyzing page {}".format((current_page+1)))
 62 | 
 63 |             page = self.input_pdf.getPage(current_page)
 64 | 
 65 |             xObject = page['/Resources']['/XObject'].getObject()
 66 | 
 67 |             with TemporaryDirectory() as temp_dir:
 68 |                 if self.debug:
 69 |                     print("    Writing page images to temporary directory {}".format(temp_dir))
 70 | 
 71 |                 split = False
 72 |                 for obj in xObject:
 73 |                     tgtn=False
 74 |                     if xObject[obj]['/Subtype'] == '/Image':
 75 |                         data = xObject[obj].getData()
 76 | 
 77 |                         if '/FlateDecode' in xObject[obj]['/Filter']  or \
 78 |                             '/DCTDecode' in xObject[obj]['/Filter'] or \
 79 |                             '/JPXDecode' in xObject[obj]['/Filter']:
 80 |                                 tgtn = temp_dir + "/" + obj[1:] + ".png"
 81 |                                 img = Image.open(io.BytesIO(data))
 82 |                                 fn = lambda x : 255 if x > self.brightness else 0
 83 |                                 img = img.convert('L').point(fn, mode='1')
 84 |                                 img.save(tgtn)
 85 |                         elif self.debug:
 86 |                             print(f"      Unknown filter type {xObject[obj]['/Filter']}")
 87 |                         
 88 |                         if tgtn:
 89 |                             if self.debug:
 90 |                                 print("      Wrote image {}; Checking for separator barcode".format(tgtn))
 91 |                             barcode = reader.decode(tgtn)
 92 |                             if barcode and barcode.parsed and split_text in barcode.parsed:
 93 |                                 if self.debug:
 94 |                                     print("        Found separator barcode")
 95 |                                 split = True
 96 | 
 97 |                 if split:
 98 |                     if args.keep_page:
 99 |                         pdf_writer.addPage(page)
100 |                     
101 |                     output = args.prefix + '_' + str(ifiles) + '_' + str(pdfs_count) + '.pdf'
102 |                     if self.verbose:
103 |                         print("    Found separator - writing {} pages to {}".format(pdf_writer.getNumPages(), output))
104 |                     with open(output, 'wb') as output_pdf:
105 |                         pdf_writer.write(output_pdf)
106 | 
107 |                     pdf_writer = PyPDF4.PdfFileWriter()
108 |                     pdfs_count += 1
109 |                     # Due to a bug in PyPDF4, PdfFileReader breaks after PdfFileWriter.write is invoked, so reopen the input file
110 |                     self.input_pdf = PyPDF4.PdfFileReader(self.filepath)
111 | 
112 |                     if args.keep_page_next:
113 |                         pdf_writer.addPage(page)
114 |                 else:
115 |                     pdf_writer.addPage(page)
116 |             
117 |             current_page += 1
118 | 
119 |         output = args.prefix + '_' + str(ifiles) + '_' + str(pdfs_count) + '.pdf'
120 |         if self.verbose:
121 |             print("    End of input - writing {} pages to {}".format(pdf_writer.getNumPages(), output))
122 |         with open(output, 'wb') as output_pdf:
123 |             pdf_writer.write(output_pdf)
124 |         pdfs_count += 1
125 |         
126 |         return pdfs_count
127 | 
128 | args = parser.parse_args()
129 | 
130 | if args.debug:
131 |     args.verbose = True
132 | 
133 | if args.brightness < 0:
134 |     args.brightness = 0
135 | if args.brightness > 255:
136 |     args.brightness = 255
137 | 
138 | filepaths = glob.glob(args.filename)
139 | if not filepaths:
140 |     sys.exit("Error: no file found, check the documentation for more info.")
141 |     
142 | ofiles = 0
143 | ifiles = 0
144 | 
145 | for filepath in filepaths:
146 |     splitter = PdfQrSplit(filepath, args.verbose, args.debug, brightness=args.brightness)
147 |     ofiles += splitter.split_qr(args.separator, ifiles)
148 |     ifiles += 1
149 | 
150 | print(
151 |     "Split {} given files into {} files".format(
152 |         ifiles, ofiles
153 |     )
154 | )
155 | 


--------------------------------------------------------------------------------