├── requirements.txt
├── .gitignore
├── LICENSE
├── README.md
└── PdfQRSplit.py

/requirements.txt:
--------------------------------------------------------------------------------
zxing
pypdf4
pillow

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Created by https://www.toptal.com/developers/gitignore/api/venv
# Edit at https://www.toptal.com/developers/gitignore?templates=venv

### venv ###
# Virtualenv
# http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
.Python
[Bb]in
[Ii]nclude
[Ll]ib
[Ll]ib64
[Ll]ocal
[Ss]cripts
pyvenv.cfg
.venv
pip-selfcheck.json

# End of https://www.toptal.com/developers/gitignore/api/venv

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Florian Knodt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# PdfQRSplit

*PdfQRSplit* is a small utility to split a multi-page PDF document into separate PDF files based on pages containing a specified barcode. This concept is known as a "separator page" and is used in combination with high-volume document scanners to scan a large number of unrelated documents in bulk.

Despite the "*QR*" in its name, this tool also works with most other barcode types.

## Installation and requirements

Python 3.6 or newer is required (the script uses f-strings). You also need **zxing** (barcode recognition), **pypdf4** (PDF handling) and **pillow** (image handling) - all of them can be installed using pip:

```
pip install zxing pypdf4 pillow
```
or
```
pip install -r requirements.txt
```

## Usage
```
usage: PdfQRSplit.py [-h] [-p PREFIX] [-s SEPARATOR] [-k] [--keep-page-next] [-b BRIGHTNESS] [-v] [-d] inputfile

Split PDF-file into separate files based on a separator barcode

positional arguments:
  inputfile             Filename or glob to process

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        Prefix for generated PDF files. Default: split
  -s SEPARATOR, --separator SEPARATOR
                        Barcode content used to find separator pages. Default: ADAR-NEXTDOC
  -k, --keep-page       Keep separator page in previous document
  --keep-page-next      Keep separator page in next document
  -b BRIGHTNESS, --brightness BRIGHTNESS
                        brightness threshold for barcode preparation (0-255). Default: 128
  -v, --verbose         Show verbose processing messages
  -d, --debug           Show debug messages
```

### Example

This example takes the file **input.pdf** and searches every page for a barcode containing the text *"SPLITME"*. Whenever such a page is found, and again at the end of the input file, the pages collected so far are written to a separate file. Because `-k` is given, the page carrying the separator barcode is included in that file. Since no prefix was given, the first file is named "*split_0_0.pdf*": *split* is the default prefix, the first 0 indicates that it was generated from the first (and in this case only) input file, and the second 0 indicates that it is the first document extracted from that file.

```python .\PdfQRSplit.py .\input.pdf -s "SPLITME" -k -v```

```
Processing file .\input.pdf containing 66 pages
 Analyzing page 1
 Analyzing page 2
[...]
 Analyzing page 6
 Found separator - writing 6 pages to split_0_0.pdf
 Analyzing page 7
[...]
 Analyzing page 13
 Found separator - writing 7 pages to split_0_1.pdf
 Analyzing page 14
[...]
Split 1 given files into 19 files
```

## Thanks

This script is based on ["pdf_split_tool" by Thiago Carvalho D'Ávila (staticdev)](https://github.com/staticdev/pdf-split-tool/).
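
## Sketch of the splitting logic

The grouping and naming behaviour described in the Example section can be illustrated without any PDF or barcode handling. The following is only a minimal sketch: `split_pages` and `output_name` are hypothetical helpers that do not exist in `PdfQRSplit.py`, and the list of booleans stands in for the result of the barcode check on each page.

```python
from typing import List


def split_pages(is_separator: List[bool],
                keep_page: bool = False,
                keep_page_next: bool = False) -> List[List[int]]:
    """Group zero-based page indices into documents (hypothetical helper)."""
    documents: List[List[int]] = []
    current: List[int] = []
    for index, separator in enumerate(is_separator):
        if separator:
            if keep_page:
                # -k: the separator page closes the previous document
                current.append(index)
            documents.append(current)
            # --keep-page-next: the separator page opens the next document
            current = [index] if keep_page_next else []
        else:
            current.append(index)
    # Pages remaining at the end of the input form the last document
    documents.append(current)
    return documents


def output_name(prefix: str, input_index: int, document_index: int) -> str:
    """Build names such as split_0_0.pdf (hypothetical helper)."""
    return "{}_{}_{}.pdf".format(prefix, input_index, document_index)


# Assumed input: 14 pages with separators on pages 6 and 13, processed with -k
flags = [i in (5, 12) for i in range(14)]
for number, pages in enumerate(split_pages(flags, keep_page=True)):
    print(output_name("split", 0, number), len(pages), "pages")
```

With these assumed inputs the first two documents contain 6 and 7 pages, matching the sample output above. Like the script, the sketch emits a document at every separator and at the end of the input even when no pages were collected, so two consecutive separator pages yield an empty document.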
--------------------------------------------------------------------------------
/PdfQRSplit.py:
--------------------------------------------------------------------------------
import argparse
import glob
import sys
import io
import PyPDF4
import zxing

from tempfile import TemporaryDirectory
from PIL import Image

parser = argparse.ArgumentParser(description='Split PDF-file into separate files based on a separator barcode')
parser.add_argument('filename', metavar='inputfile', type=str,
                    help='Filename or glob to process')
parser.add_argument('-p', '--prefix', default="split",
                    help='Prefix for generated PDF files. Default: split')
parser.add_argument('-s', '--separator', default="ADAR-NEXTDOC",
                    help='Barcode content used to find separator pages. Default: ADAR-NEXTDOC')
parser.add_argument('-k', '--keep-page', action='store_true',
                    help='Keep separator page in previous document')
parser.add_argument('--keep-page-next', action='store_true',
                    help='Keep separator page in next document')
parser.add_argument('-b', '--brightness', type=int, default=128,
                    help='brightness threshold for barcode preparation (0-255). Default: 128')
parser.add_argument('-v', '--verbose', action='store_true',
                    help='Show verbose processing messages')
parser.add_argument('-d', '--debug', action='store_true',
                    help='Show debug messages')

class PdfQrSplit:
    def __init__(self, filepath: str, verbose: bool, debug: bool, brightness: int = 128) -> None:
        self.filepath = filepath
        self.verbose = verbose
        self.debug = debug
        self.brightness = brightness
        # The second positional argument of PdfFileReader is `strict`, not a file mode
        self.input_pdf = PyPDF4.PdfFileReader(filepath)
        self.total_pages = self.input_pdf.getNumPages()
        if verbose:
            print(
                "Processing file {} containing {} pages".format(
                    filepath, self.total_pages
                )
            )

    def split_qr(self, split_text: str, ifiles: int) -> int:
        """Creates new files based on barcode contents.
        Note: reads the prefix and keep-page options from the module-level `args`.
        Args:
            split_text: Barcode content to recognize a separator page
            ifiles: Index of the current input file, used in output filenames
        Returns:
            int: Number of generated files.
        """
        pdfs_count = 0
        current_page = 0

        reader = zxing.BarCodeReader()
        pdf_writer = PyPDF4.PdfFileWriter()

        while current_page != self.total_pages:

            if self.verbose:
                print(" Analyzing page {}".format((current_page+1)))

            page = self.input_pdf.getPage(current_page)

            xObject = page['/Resources']['/XObject'].getObject()

            with TemporaryDirectory() as temp_dir:
                if self.debug:
                    print(" Writing page images to temporary directory {}".format(temp_dir))

                split = False
                for obj in xObject:
                    tgtn = False
                    if xObject[obj]['/Subtype'] == '/Image':
                        data = xObject[obj].getData()

                        if '/FlateDecode' in xObject[obj]['/Filter'] or \
                           '/DCTDecode' in xObject[obj]['/Filter'] or \
                           '/JPXDecode' in xObject[obj]['/Filter']:
                            tgtn = temp_dir + "/" + obj[1:] + ".png"
                            img = Image.open(io.BytesIO(data))
                            # Binarize: pixels brighter than the threshold become white, all others black
                            fn = lambda x: 255 if x > self.brightness else 0
                            img = img.convert('L').point(fn, mode='1')
                            img.save(tgtn)
                        elif self.debug:
                            print(f" Unknown filter type {xObject[obj]['/Filter']}")

                        if tgtn:
                            if self.debug:
                                print(" Wrote image {}; Checking for separator barcode".format(tgtn))
                            barcode = reader.decode(tgtn)
                            # Use the split_text argument; also skip results without decoded content
                            if barcode and barcode.parsed and split_text in barcode.parsed:
                                if self.debug:
                                    print(" Found separator barcode")
                                split = True

            if split:
                if args.keep_page:
                    pdf_writer.addPage(page)

                output = args.prefix + '_' + str(ifiles) + '_' + str(pdfs_count) + '.pdf'
                if self.verbose:
                    print(" Found separator - writing {} pages to {}".format(pdf_writer.getNumPages(), output))
                with open(output, 'wb') as output_pdf:
                    pdf_writer.write(output_pdf)

                pdf_writer = PyPDF4.PdfFileWriter()
                pdfs_count += 1
                # Due to a bug in PyPDF4 PdfFileReader breaks when invoking PdfFileWriter.write - reopen file
                self.input_pdf = PyPDF4.PdfFileReader(self.filepath)

                if args.keep_page_next:
                    pdf_writer.addPage(page)
            else:
                pdf_writer.addPage(page)

            current_page += 1

        # Write the pages collected after the last separator
        output = args.prefix + '_' + str(ifiles) + '_' + str(pdfs_count) + '.pdf'
        if self.verbose:
            print(" End of input - writing {} pages to {}".format(pdf_writer.getNumPages(), output))
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)
        pdfs_count += 1

        return pdfs_count

args = parser.parse_args()

if args.debug:
    args.verbose = True

# Clamp the brightness threshold to the valid 0-255 range
if args.brightness < 0:
    args.brightness = 0
if args.brightness > 255:
    args.brightness = 255

filepaths = glob.glob(args.filename)
if not filepaths:
    sys.exit("Error: no file found, check the documentation for more info.")

ofiles = 0
ifiles = 0

for filepath in filepaths:
    splitter = PdfQrSplit(filepath, args.verbose, args.debug, brightness=args.brightness)
    ofiles += splitter.split_qr(args.separator, ifiles)
    ifiles += 1

print(
    "Split {} given files into {} files".format(
        ifiles, ofiles
    )
)

--------------------------------------------------------------------------------