├── .gitignore ├── CHANGELOG ├── LICENSE ├── MANIFEST.in ├── PDF_Samples ├── AutoCad_Diagram.pdf ├── AutoCad_Simple.pdf ├── README.txt └── SF424_page2.pdf ├── PyPDF2 ├── __init__.py ├── _version.py ├── filters.py ├── generic.py ├── merger.py ├── pdf.py ├── utils.py └── xmp.py ├── README ├── Sample_Code ├── 2-up.py ├── README.txt ├── basic_features.py └── basic_merging.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.swp 3 | -------------------------------------------------------------------------------- /CHANGELOG: -------------------------------------------------------------------------------- 1 | Version 1.20, 2014-01-?? 2 | ------------------------ 3 | 4 | - Many Python 3 support changes (with contributions from TWAC and cgammans) 5 | 6 | - Updated FAQ; link included in README 7 | 8 | - Allow more (unnecessary) escape sequences 9 | 10 | - Prevent exception when reading a null object in decoding parameters 11 | 12 | - Corrected error in reading destination types (added a slash since they 13 | are name objects) 14 | 15 | - Corrected TypeError in scaleTo() method 16 | 17 | - addBookmark() method in PdfFileMerger now returns bookmark (so nested 18 | bookmarks can be created) 19 | 20 | - Additions to Sample Code and Sample PDFs 21 | 22 | - changes to allow 2up script to work (by Dylan McNamee) 23 | 24 | - changes to metadata encoding (by Chris Hiestand) 25 | 26 | - New methods for links: addLink() (by Enrico Lambertini) and ignoreLinks() 27 | 28 | 29 | Version 1.19, 2013-10-08 30 | ------------------------ 31 | 32 | BUGFIXES: 33 | - Removed pop in sweepIndirectReferences to prevent infinite loop 34 | (provided by ian-su-sirca) 35 | 36 | - Fixed bug caused by whitespace when parsing PDFs generated by AutoCad 37 | 38 | - Fixed a bug caused by reading a 'null' ASCII value in a dictionary 39 | object (primarily in PDFs generated by AutoCad). 
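As a quick sketch of the 1.20 PdfFileMerger note above (addBookmark() returning the bookmark so that it can be passed back in as a parent); the file names here are placeholders:

    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    merger.append(open("input.pdf", "rb"))
    part = merger.addBookmark("Part I", 0)            # the returned bookmark...
    merger.addBookmark("Chapter 1", 1, parent=part)   # ...can now be used to nest a child
    merger.write(open("output.pdf", "wb"))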
40 | 41 | FEATURES: 42 | - Added new folders for PyPDF2 sample code and example PDFs; see README 43 | for each folder 44 | 45 | - Added a method for debugging purposes to show current location while 46 | parsing 47 | 48 | - Ability to create custom metadata (by jamma313) 49 | 50 | - Ability to access and customize document layout and view mode 51 | (by Joshua Arnott) 52 | 53 | OTHER: 54 | - Added and corrected some documentation 55 | 56 | - Added some more warnings and exception messages 57 | 58 | - Removed old test/debugging code 59 | 60 | UPCOMING: 61 | - More bugfixes (We have received many problematic PDFs via email, we 62 | will work with them) 63 | 64 | - Documentation - It's time for PyPDF2 to get its own documentation 65 | now that it has grown well beyond the original pyPdf 66 | 67 | - A FAQ to answer common questions 68 | 69 | 70 | Version 1.18, 2013-08-19 71 | ------------------------ 72 | 73 | - Fixed a bug where older versions of objects were incorrectly added to the 74 | cache, resulting in outdated or missing pages, images, and other objects 75 | (from speedplane) 76 | 77 | - Fixed a bug in parsing the xref table where new xref values were 78 | overwritten; also cleaned up code (from speedplane) 79 | 80 | - New method mergeRotatedAroundPointPage which merges a page while rotating 81 | it around a point (from speedplane) 82 | 83 | - Updated Destination syntax to respect PDF 1.6 specifications (from 84 | jamma313) 85 | 86 | - Prevented infinite loop when a PdfFileReader object was instantiated 87 | with an empty file (from Jerome Nexedi) 88 | 89 | Other Changes: 90 | 91 | - Downloads now available via PyPI 92 | https://pypi.python.org/pypi?:action=display&name=PyPDF2 93 | 94 | - Installation through the pip library is fixed 95 | 96 | 97 | Version 1.17, 2013-07-25 98 | ------------------------ 99 | 100 | - Removed one (from pdf.py) of the two Destination classes. Both 101 | classes had the same name, but were slightly different in content, 102 | causing some errors. (from Janne Vanhala) 103 | 104 | - Corrected and Expanded README file to demonstrate PdfFileMerger 105 | 106 | - Added filter for LZW encoded streams (from Michal Horejsek) 107 | 108 | - PyPDF2 issue tracker enabled on Github to allow community 109 | discussion and collaboration 110 | 111 | 112 | Versions -1.16, -2013-06-30 113 | --------------------------- 114 | 115 | - Note: This ChangeLog has not been kept up-to-date for a while. 116 | Hopefully we can keep better track of it from now on. Some of the 117 | changes listed here come from previous versions 1.14 and 1.15; they 118 | were only vaguely defined. With the new _version.py file we should 119 | have more structured and better documented versioning from now on. 120 | 121 | - Defined PyPDF2.__version__ 122 | 123 | - Fixed encrypt() method (from Martijn The) 124 | 125 | - Improved error handling on PDFs with truncated streams (from cecilkorik) 126 | 127 | - Python 3 support (from kushal-kumaran) 128 | 129 | - Fixed example code in README (from Jeremy Bethmont) 130 | 131 | - Fixed a bug caused by a DecimalError exception (from Adam Morris) 132 | 133 | - Many other bug fixes and features by: 134 | 135 | jeansch 136 | Anton Vlasenko 137 | Joseph Walton 138 | Jan Oliver Oelerich 139 | Fabian Henze 140 | And any others I missed. 141 | Thanks for contributing! 142 | 143 | 144 | Version 1.13, 2010-12-04 145 | ------------------------ 146 | 147 | - Fixed a typo in code for reading a "\b" escape character in strings. 148 | 149 | - Improved __repr__ in FloatObject.
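For reference, the mergeRotatedAroundPointPage() method noted under 1.18 above can be used roughly as follows (a sketch only; file names and coordinates are made up):

    from PyPDF2 import PdfFileReader, PdfFileWriter

    base = PdfFileReader(open("base.pdf", "rb")).getPage(0)
    stamp = PdfFileReader(open("stamp.pdf", "rb")).getPage(0)
    base.mergeRotatedAroundPointPage(stamp, 45, 300, 400)  # rotate 45 degrees about (300, 400)
    writer = PdfFileWriter()
    writer.addPage(base)
    writer.write(open("merged.pdf", "wb"))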
150 | 151 | - Fixed a bug in reading octal escape sequences in strings. 152 | 153 | - Added getWidth and getHeight methods to the RectangleObject class. 154 | 155 | - Fixed compatibility warnings with Python 2.4 and 2.5. 156 | 157 | - Added addBlankPage and insertBlankPage methods on PdfFileWriter class. 158 | 159 | - Fixed a bug with circular references in page's object trees (typically 160 | annotations) that prevented correctly writing out a copy of those pages. 161 | 162 | - New merge page functions allow application of a transformation matrix. 163 | 164 | - To all patch contributors: I did a poor job of keeping this ChangeLog 165 | up-to-date for this release, so I am missing attributions here for any 166 | changes you submitted. Sorry! I'll do better in the future. 167 | 168 | 169 | Version 1.12, 2008-09-02 170 | ------------------------ 171 | 172 | - Added support for XMP metadata. 173 | 174 | - Fix reading files with xref streams with multiple /Index values. 175 | 176 | - Fix extracting content streams that use graphics operators longer than 2 177 | characters. Affects merging PDF files. 178 | 179 | 180 | Version 1.11, 2008-05-09 181 | ------------------------ 182 | 183 | - Patch from Hartmut Goebel to permit RectangleObjects to accept NumberObject 184 | or FloatObject values. 185 | 186 | - PDF compatibility fixes. 187 | 188 | - Fix to read object xref stream in correct order. 189 | 190 | - Fix for comments inside content streams. 191 | 192 | 193 | Version 1.10, 2007-10-04 194 | ------------------------ 195 | 196 | - Text strings from PDF files are returned as Unicode string objects when 197 | pyPdf determines that they can be decoded (as UTF-16 strings, or as 198 | PDFDocEncoding strings). Unicode objects are also written out when 199 | necessary. This means that string objects in pyPdf can be either 200 | generic.ByteStringObject instances, or generic.TextStringObject instances. 201 | 202 | - The extractText method now returns a unicode string object. 203 | 204 | - All document information properties now return unicode string objects. In 205 | the event that a document provides docinfo properties that are not decoded by 206 | pyPdf, the raw byte strings can be accessed with an "_raw" property (ie. 207 | title_raw rather than title) 208 | 209 | - generic.DictionaryObject instances have been enhanced to be easier to use. 210 | Values coming out of dictionary objects will automatically be de-referenced 211 | (.getObject will be called on them), unless accessed by the new "raw_get" 212 | method. DictionaryObjects can now only contain PdfObject instances (as keys 213 | and values), making it easier to debug where non-PdfObject values (which 214 | cannot be written out) are entering dictionaries. 215 | 216 | - Support for reading named destinations and outlines in PDF files. Original 217 | patch by Ashish Kulkarni. 218 | 219 | - Stream compatibility reading enhancements for malformed PDF files. 220 | 221 | - Cross reference table reading enhancements for malformed PDF files. 222 | 223 | - Encryption documentation. 224 | 225 | - Replace some "assert" statements with error raising. 226 | 227 | - Minor optimizations to FlateDecode algorithm increase speed when using PNG 228 | predictors. 229 | 230 | Version 1.9, 2006-12-15 231 | ----------------------- 232 | 233 | - Fix several serious bugs introduced in version 1.8, caused by a failure to 234 | run through our PDF test suite before releasing that version. 235 | 236 | - Fix bug in NullObject reading and writing. 
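To make the 1.10 and 1.12 items above concrete, a minimal sketch (the file name is a placeholder):

    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(open("sample.pdf", "rb"))
    info = reader.documentInfo                # docinfo values come back as unicode where possible
    print(info.title, info.title_raw)         # the "_raw" variant keeps the undecoded byte string
    print(reader.getPage(0).extractText())    # extractText() returns a unicode string
    xmp = reader.getXmpMetadata()             # None when the document carries no XMP metadata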
237 | 238 | Version 1.8, 2006-12-14 239 | ----------------------- 240 | 241 | - Add support for decryption with the standard PDF security handler. This 242 | allows for decrypting PDF files given the proper user or owner password. 243 | 244 | - Add support for encryption with the standard PDF security handler. 245 | 246 | - Add new pythondoc documentation. 247 | 248 | - Fix bug in ASCII85 decode that occurs when whitespace exists inside the 249 | two terminating characters of the stream. 250 | 251 | Version 1.7, 2006-12-10 252 | ----------------------- 253 | 254 | - Fix a bug when using a single page object in two PdfFileWriter objects. 255 | 256 | - Adjust PyPDF to be tolerant of whitespace characters that don't belong 257 | during a stream object. 258 | 259 | - Add documentInfo property to PdfFileReader. 260 | 261 | - Add numPages property to PdfFileReader. 262 | 263 | - Add pages property to PdfFileReader. 264 | 265 | - Add extractText function to PdfFileReader. 266 | 267 | 268 | Version 1.6, 2006-06-06 269 | ----------------------- 270 | 271 | - Add basic support for comments in PDF files. This allows us to read some 272 | ReportLab PDFs that could not be read before. 273 | 274 | - Add "auto-repair" for finding xref table at slightly bad locations. 275 | 276 | - New StreamObject backend, cleaner and more powerful. Allows the use of 277 | stream filters more easily, including compressed streams. 278 | 279 | - Add a graphics state push/pop around page merges. Improves quality of 280 | page merges when one page's content stream leaves the graphics 281 | in an abnormal state. 282 | 283 | - Add PageObject.compressContentStreams function, which filters all content 284 | streams and compresses them. This will reduce the size of PDF pages, 285 | especially after they could have been decompressed in a mergePage 286 | operation. 287 | 288 | - Support inline images in PDF content streams. 289 | 290 | - Add support for using .NET framework compression when zlib is not 291 | available. This does not make pyPdf compatible with IronPython, but it 292 | is a first step. 293 | 294 | - Add support for reading the document information dictionary, and extracting 295 | title, author, subject, producer and creator tags. 296 | 297 | - Add patch to support NullObject and multiple xref streams, from Bradley 298 | Lawrence. 299 | 300 | 301 | Version 1.5, 2006-01-28 302 | ----------------------- 303 | 304 | - Fix a bug where merging pages did not work in "no-rename" cases when the 305 | second page has an array of content streams. 306 | 307 | - Remove some debugging output that should not have been present. 308 | 309 | 310 | Version 1.4, 2006-01-27 311 | ----------------------- 312 | 313 | - Add capability to merge pages from multiple PDF files into a single page 314 | using the PageObject.mergePage function. See example code (README or web 315 | site) for more information. 316 | 317 | - Add ability to modify a page's MediaBox, CropBox, BleedBox, TrimBox, and 318 | ArtBox properties through PageObject. See example code (README or web site) 319 | for more information. 320 | 321 | - Refactor pdf.py into multiple files: generic.py (contains objects like 322 | NameObject, DictionaryObject), filters.py (contains filter code), 323 | utils.py (various). This does not affect importing PdfFileReader 324 | or PdfFileWriter. 325 | 326 | - Add new decoding functions for standard PDF filters ASCIIHexDecode and 327 | ASCII85Decode. 328 | 329 | - Change url and download_url to refer to new pybrary.net web site. 
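A short sketch of the two 1.4 features above, page merging and box editing (file names are placeholders):

    from PyPDF2 import PdfFileReader, PdfFileWriter

    page = PdfFileReader(open("report.pdf", "rb")).getPage(0)
    watermark = PdfFileReader(open("watermark.pdf", "rb")).getPage(0)
    page.mergePage(watermark)                 # draw the watermark on top of the page content
    page.mediaBox.upperRight = (400, 400)     # trim the visible area via the RectangleObject
    writer = PdfFileWriter()
    writer.addPage(page)
    writer.write(open("cropped.pdf", "wb"))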
330 | 331 | 332 | Version 1.3, 2006-01-23 333 | ----------------------- 334 | 335 | - Fix new bug introduced in 1.2 where PDF files with \r line endings did not 336 | work properly anymore. A new test suite developed with various PDF files 337 | should prevent regression bugs from now on. 338 | 339 | - Fix a bug where inheriting attributes from page nodes did not work. 340 | 341 | 342 | Version 1.2, 2006-01-23 343 | ----------------------- 344 | 345 | - Improved support for files with CRLF-based line endings, fixing a common 346 | reported problem stating "assertion error: assert line == "%%EOF"". 347 | 348 | - Software author/maintainer is now officially a proud married person, which 349 | is sure to result in better software... somehow. 350 | 351 | 352 | Version 1.1, 2006-01-18 353 | ----------------------- 354 | 355 | - Add capability to rotate pages. 356 | 357 | - Improved PDF reading support to properly manage inherited attributes from 358 | /Type=/Pages nodes. This means that page groups that are rotated or have 359 | different media boxes or whatever will now work properly. 360 | 361 | - Added PDF 1.5 support. Namely cross-reference streams and object streams. 362 | This release can mangle Adobe's PDFReference16.pdf successfully. 363 | 364 | 365 | Version 1.0, 2006-01-17 366 | ----------------------- 367 | 368 | - First distutils-capable true public release. Supports a wide variety of PDF 369 | files that I found sitting around on my system. 370 | 371 | - Does not support some PDF 1.5 features, such as object streams, 372 | cross-reference streams. 373 | 374 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2006-2008, Mathieu Fenniak 2 | Some contributions copyright (c) 2007, Ashish Kulkarni 3 | 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are 8 | met: 9 | 10 | * Redistributions of source code must retain the above copyright notice, 11 | this list of conditions and the following disclaimer. 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | * The name of the author may not be used to endorse or promote products 16 | derived from this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | POSSIBILITY OF SUCH DAMAGE. 
29 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include CHANGELOG 2 | -------------------------------------------------------------------------------- /PDF_Samples/AutoCad_Diagram.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/talumbau/PyPDF2/24b270d876518d15773224b5d0d6c2206db29f64/PDF_Samples/AutoCad_Diagram.pdf -------------------------------------------------------------------------------- /PDF_Samples/AutoCad_Simple.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/talumbau/PyPDF2/24b270d876518d15773224b5d0d6c2206db29f64/PDF_Samples/AutoCad_Simple.pdf -------------------------------------------------------------------------------- /PDF_Samples/README.txt: -------------------------------------------------------------------------------- 1 | PDF Sample Folder 2 | ----------------- 3 | 4 | PDF files are generated by a large variety of sources 5 | for many different purposes. One of the goals of PyPDF2 6 | is to be able to read/write any PDF instance that Adobe 7 | can. 8 | 9 | This is a catalog of various PDF files. The 10 | files may not have worked with PyPDF2 but do now, they 11 | may be complicated or unconventional files, or they may 12 | just be good for testing. The purpose is to insure that 13 | when changes to PyPDF2 are made, we keep them in mind. 14 | 15 | If you have confidential PDFs that don't work with 16 | PyPDF2, feel free to still e-mail them for debugging - 17 | we won't add PDFs without expressed permission. 18 | 19 | (This folder is available through GitHub only) 20 | 21 | 22 | Feel free to add any type of PDF file or sample code, 23 | either by 24 | 25 | 1) sending it via email to PyPDF2@phaseit.net 26 | 2) including it in a pull request on GitHub -------------------------------------------------------------------------------- /PDF_Samples/SF424_page2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/talumbau/PyPDF2/24b270d876518d15773224b5d0d6c2206db29f64/PDF_Samples/SF424_page2.pdf -------------------------------------------------------------------------------- /PyPDF2/__init__.py: -------------------------------------------------------------------------------- 1 | from .pdf import PdfFileReader, PdfFileWriter 2 | from .merger import PdfFileMerger 3 | from ._version import __version__ 4 | __all__ = ["pdf", "PdfFileMerger"] 5 | -------------------------------------------------------------------------------- /PyPDF2/_version.py: -------------------------------------------------------------------------------- 1 | __version__ = '1.20b' 2 | 3 | -------------------------------------------------------------------------------- /PyPDF2/filters.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 
12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | 31 | """ 32 | Implementation of stream filters for PDF. 33 | """ 34 | __author__ = "Mathieu Fenniak" 35 | __author_email__ = "biziqe@mathieu.fenniak.net" 36 | 37 | from .utils import PdfReadError 38 | from sys import version_info 39 | if version_info < ( 3, 0 ): 40 | from cStringIO import StringIO 41 | else: 42 | from io import StringIO 43 | 44 | try: 45 | import zlib 46 | def decompress(data): 47 | return zlib.decompress(data) 48 | def compress(data): 49 | return zlib.compress(data) 50 | except ImportError: 51 | # Unable to import zlib. Attempt to use the System.IO.Compression 52 | # library from the .NET framework. 
(IronPython only) 53 | import System 54 | from System import IO, Collections, Array 55 | def _string_to_bytearr(buf): 56 | retval = Array.CreateInstance(System.Byte, len(buf)) 57 | for i in range(len(buf)): 58 | retval[i] = ord(buf[i]) 59 | return retval 60 | def _bytearr_to_string(bytes): 61 | retval = "" 62 | for i in range(bytes.Length): 63 | retval += chr(bytes[i]) 64 | return retval 65 | def _read_bytes(stream): 66 | ms = IO.MemoryStream() 67 | buf = Array.CreateInstance(System.Byte, 2048) 68 | while True: 69 | bytes = stream.Read(buf, 0, buf.Length) 70 | if bytes == 0: 71 | break 72 | else: 73 | ms.Write(buf, 0, bytes) 74 | retval = ms.ToArray() 75 | ms.Close() 76 | return retval 77 | def decompress(data): 78 | bytes = _string_to_bytearr(data) 79 | ms = IO.MemoryStream() 80 | ms.Write(bytes, 0, bytes.Length) 81 | ms.Position = 0 # fseek 0 82 | gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Decompress) 83 | bytes = _read_bytes(gz) 84 | retval = _bytearr_to_string(bytes) 85 | gz.Close() 86 | return retval 87 | def compress(data): 88 | bytes = _string_to_bytearr(data) 89 | ms = IO.MemoryStream() 90 | gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress, True) 91 | gz.Write(bytes, 0, bytes.Length) 92 | gz.Close() 93 | ms.Position = 0 # fseek 0 94 | bytes = ms.ToArray() 95 | retval = _bytearr_to_string(bytes) 96 | ms.Close() 97 | return retval 98 | 99 | 100 | class FlateDecode(object): 101 | def decode(data, decodeParms): 102 | data = decompress(data) 103 | predictor = 1 104 | if decodeParms: 105 | try: 106 | predictor = decodeParms.get("/Predictor", 1) 107 | except AttributeError: 108 | pass # usually an array with a null object was read 109 | 110 | # predictor 1 == no predictor 111 | if predictor != 1: 112 | columns = decodeParms["/Columns"] 113 | # PNG prediction: 114 | if predictor >= 10 and predictor <= 15: 115 | output = StringIO() 116 | # PNG prediction can vary from row to row 117 | rowlength = columns + 1 118 | assert len(data) % rowlength == 0 119 | prev_rowdata = (0,) * rowlength 120 | for row in range(len(data) // rowlength): 121 | rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]] 122 | filterByte = rowdata[0] 123 | if filterByte == 0: 124 | pass 125 | elif filterByte == 1: 126 | for i in range(2, rowlength): 127 | rowdata[i] = (rowdata[i] + rowdata[i-1]) % 256 128 | elif filterByte == 2: 129 | for i in range(1, rowlength): 130 | rowdata[i] = (rowdata[i] + prev_rowdata[i]) % 256 131 | else: 132 | # unsupported PNG filter 133 | raise PdfReadError("Unsupported PNG filter %r" % filterByte) 134 | prev_rowdata = rowdata 135 | output.write(''.join([chr(x) for x in rowdata[1:]])) 136 | data = output.getvalue() 137 | else: 138 | # unsupported predictor 139 | raise PdfReadError("Unsupported flatedecode predictor %r" % predictor) 140 | return data 141 | decode = staticmethod(decode) 142 | 143 | def encode(data): 144 | return compress(data) 145 | encode = staticmethod(encode) 146 | 147 | class ASCIIHexDecode(object): 148 | def decode(data, decodeParms=None): 149 | retval = "" 150 | char = "" 151 | x = 0 152 | while True: 153 | c = data[x] 154 | if c == ">": 155 | break 156 | elif c.isspace(): 157 | x += 1 158 | continue 159 | char += c 160 | if len(char) == 2: 161 | retval += chr(int(char, base=16)) 162 | char = "" 163 | x += 1 164 | assert char == "" 165 | return retval 166 | decode = staticmethod(decode) 167 | 168 | class LZWDecode(object): 169 | """Taken from: 170 | 
http://www.java2s.com/Open-Source/Java-Document/PDF/PDF-Renderer/com/sun/pdfview/decode/LZWDecode.java.htm 171 | """ 172 | class decoder(object): 173 | def __init__(self, data): 174 | self.STOP=257 175 | self.CLEARDICT=256 176 | self.data=data 177 | self.bytepos=0 178 | self.bitpos=0 179 | self.dict=[""]*4096 180 | for i in range(256): 181 | self.dict[i]=chr(i) 182 | self.resetDict() 183 | 184 | def resetDict(self): 185 | self.dictlen=258 186 | self.bitspercode=9 187 | 188 | 189 | def nextCode(self): 190 | fillbits=self.bitspercode 191 | value=0 192 | while fillbits>0 : 193 | if self.bytepos >= len(self.data): 194 | return -1 195 | nextbits=ord(self.data[self.bytepos]) 196 | bitsfromhere=8-self.bitpos 197 | if bitsfromhere>fillbits: 198 | bitsfromhere=fillbits 199 | value |= (((nextbits >> (8-self.bitpos-bitsfromhere)) & 200 | (0xff >> (8-bitsfromhere))) << 201 | (fillbits-bitsfromhere)) 202 | fillbits -= bitsfromhere 203 | self.bitpos += bitsfromhere 204 | if self.bitpos >=8: 205 | self.bitpos=0 206 | self.bytepos = self.bytepos+1 207 | return value 208 | 209 | def decode(self): 210 | """ algorithm derived from: 211 | http://www.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html 212 | and the PDFReference 213 | """ 214 | cW = self.CLEARDICT; 215 | baos="" 216 | while True: 217 | pW = cW; 218 | cW = self.nextCode(); 219 | if cW == -1: 220 | raise PdfReadError("Missed the stop code in LZWDecode!") 221 | if cW == self.STOP: 222 | break; 223 | elif cW == self.CLEARDICT: 224 | self.resetDict(); 225 | elif pW == self.CLEARDICT: 226 | baos+=self.dict[cW] 227 | else: 228 | if cW < self.dictlen: 229 | baos += self.dict[cW] 230 | p=self.dict[pW]+self.dict[cW][0] 231 | self.dict[self.dictlen]=p 232 | self.dictlen+=1 233 | else: 234 | p=self.dict[pW]+self.dict[pW][0] 235 | baos+=p 236 | self.dict[self.dictlen] = p; 237 | self.dictlen+=1 238 | if (self.dictlen >= (1 << self.bitspercode) - 1 and 239 | self.bitspercode < 12): 240 | self.bitspercode+=1 241 | return baos 242 | 243 | 244 | 245 | @staticmethod 246 | def decode(data,decodeParams=None): 247 | return LZWDecode.decoder(data).decode() 248 | 249 | class ASCII85Decode(object): 250 | def decode(data, decodeParms=None): 251 | retval = "" 252 | group = [] 253 | x = 0 254 | hitEod = False 255 | # remove all whitespace from data 256 | data = [y for y in data if not (y in ' \n\r\t')] 257 | while not hitEod: 258 | c = data[x] 259 | if len(retval) == 0 and c == "<" and data[x+1] == "~": 260 | x += 2 261 | continue 262 | #elif c.isspace(): 263 | # x += 1 264 | # continue 265 | elif c == 'z': 266 | assert len(group) == 0 267 | retval += '\x00\x00\x00\x00' 268 | continue 269 | elif c == "~" and data[x+1] == ">": 270 | if len(group) != 0: 271 | # cannot have a final group of just 1 char 272 | assert len(group) > 1 273 | cnt = len(group) - 1 274 | group += [ 85, 85, 85 ] 275 | hitEod = cnt 276 | else: 277 | break 278 | else: 279 | c = ord(c) - 33 280 | assert c >= 0 and c < 85 281 | group += [ c ] 282 | if len(group) >= 5: 283 | b = group[0] * (85**4) + \ 284 | group[1] * (85**3) + \ 285 | group[2] * (85**2) + \ 286 | group[3] * 85 + \ 287 | group[4] 288 | assert b < (2**32 - 1) 289 | c4 = chr((b >> 0) % 256) 290 | c3 = chr((b >> 8) % 256) 291 | c2 = chr((b >> 16) % 256) 292 | c1 = chr(b >> 24) 293 | retval += (c1 + c2 + c3 + c4) 294 | if hitEod: 295 | retval = retval[:-4+hitEod] 296 | group = [] 297 | x += 1 298 | return retval 299 | decode = staticmethod(decode) 300 | 301 | def decodeStreamData(stream): 302 | from .generic import NameObject 303 | 
filters = stream.get("/Filter", ()) 304 | if len(filters) and not isinstance(filters[0], NameObject): 305 | # we have a single filter instance 306 | filters = (filters,) 307 | data = stream._data 308 | for filterType in filters: 309 | if filterType == "/FlateDecode": 310 | data = FlateDecode.decode(data, stream.get("/DecodeParms")) 311 | elif filterType == "/ASCIIHexDecode": 312 | data = ASCIIHexDecode.decode(data) 313 | elif filterType == "/LZWDecode": 314 | data = LZWDecode.decode(data, stream.get("/DecodeParms")) 315 | elif filterType == "/ASCII85Decode": 316 | data = ASCII85Decode.decode(data) 317 | elif filterType == "/Crypt": 318 | decodeParams = stream.get("/DecodeParams", {}) 319 | if "/Name" not in decodeParams and "/Type" not in decodeParams: 320 | pass 321 | else: 322 | raise NotImplementedError("/Crypt filter with /Name or /Type not supported yet") 323 | else: 324 | # unsupported filter 325 | raise NotImplementedError("unsupported filter %s" % filterType) 326 | return data 327 | -------------------------------------------------------------------------------- /PyPDF2/generic.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | 31 | """ 32 | Implementation of generic PDF objects (dictionary, number, string, and so on) 33 | """ 34 | __author__ = "Mathieu Fenniak" 35 | __author_email__ = "biziqe@mathieu.fenniak.net" 36 | 37 | import re 38 | from .utils import readNonWhitespace, RC4_encrypt 39 | from .utils import b_, u_, chr_, ord_ 40 | from .utils import PdfStreamError 41 | import warnings 42 | from . import filters 43 | from . 
import utils 44 | import decimal 45 | import codecs 46 | #import debugging 47 | 48 | def readObject(stream, pdf): 49 | tok = stream.read(1) 50 | stream.seek(-1, 1) # reset to start 51 | if tok == b_('t') or tok == b_('f'): 52 | # boolean object 53 | return BooleanObject.readFromStream(stream) 54 | elif tok == b_('('): 55 | # string object 56 | return readStringFromStream(stream) 57 | elif tok == b_('/'): 58 | # name object 59 | return NameObject.readFromStream(stream) 60 | elif tok == b_('['): 61 | # array object 62 | return ArrayObject.readFromStream(stream, pdf) 63 | elif tok == b_('n'): 64 | # null object 65 | return NullObject.readFromStream(stream) 66 | elif tok == b_('<'): 67 | # hexadecimal string OR dictionary 68 | peek = stream.read(2) 69 | stream.seek(-2, 1) # reset to start 70 | if peek == b_('<<'): 71 | return DictionaryObject.readFromStream(stream, pdf) 72 | else: 73 | return readHexStringFromStream(stream) 74 | elif tok == b_('%'): 75 | # comment 76 | while tok not in (b_('\r'), b_('\n')): 77 | tok = stream.read(1) 78 | tok = readNonWhitespace(stream) 79 | stream.seek(-1, 1) 80 | return readObject(stream, pdf) 81 | else: 82 | # number object OR indirect reference 83 | if tok == b_('+') or tok == b_('-'): 84 | # number 85 | return NumberObject.readFromStream(stream) 86 | peek = stream.read(20) 87 | stream.seek(-len(peek), 1) # reset to start 88 | if re.match(b_(r"(\d+)\s(\d+)\sR[^a-zA-Z]"), peek) != None: 89 | return IndirectObject.readFromStream(stream, pdf) 90 | else: 91 | return NumberObject.readFromStream(stream) 92 | 93 | class PdfObject(object): 94 | def getObject(self): 95 | """Resolves indirect references.""" 96 | return self 97 | 98 | 99 | class NullObject(PdfObject): 100 | def writeToStream(self, stream, encryption_key): 101 | stream.write(b_("null")) 102 | 103 | def readFromStream(stream): 104 | nulltxt = stream.read(4) 105 | if nulltxt != b_("null"): 106 | raise utils.PdfReadError("Could not read Null object") 107 | return NullObject() 108 | readFromStream = staticmethod(readFromStream) 109 | 110 | 111 | class BooleanObject(PdfObject): 112 | def __init__(self, value): 113 | self.value = value 114 | 115 | def writeToStream(self, stream, encryption_key): 116 | if self.value: 117 | stream.write(b_("true")) 118 | else: 119 | stream.write(b_("false")) 120 | 121 | def readFromStream(stream): 122 | word = stream.read(4) 123 | if word == b_("true"): 124 | return BooleanObject(True) 125 | elif word == b_("fals"): 126 | stream.read(1) 127 | return BooleanObject(False) 128 | else: 129 | raise utils.PdfReadError('Could not read Boolean object') 130 | readFromStream = staticmethod(readFromStream) 131 | 132 | 133 | class ArrayObject(list, PdfObject): 134 | def writeToStream(self, stream, encryption_key): 135 | stream.write(b_("[")) 136 | for data in self: 137 | stream.write(b_(" ")) 138 | data.writeToStream(stream, encryption_key) 139 | stream.write(b_(" ]")) 140 | 141 | def readFromStream(stream, pdf): 142 | arr = ArrayObject() 143 | tmp = stream.read(1) 144 | if tmp != b_("["): 145 | raise utils.PdfReadError("Could not read array") 146 | while True: 147 | # skip leading whitespace 148 | tok = stream.read(1) 149 | while tok.isspace(): 150 | tok = stream.read(1) 151 | stream.seek(-1, 1) 152 | # check for array ending 153 | peekahead = stream.read(1) 154 | if peekahead == b_("]"): 155 | break 156 | stream.seek(-1, 1) 157 | # read and append obj 158 | arr.append(readObject(stream, pdf)) 159 | return arr 160 | readFromStream = staticmethod(readFromStream) 161 | 162 | 163 | class 
IndirectObject(PdfObject): 164 | def __init__(self, idnum, generation, pdf): 165 | self.idnum = idnum 166 | self.generation = generation 167 | self.pdf = pdf 168 | 169 | def getObject(self): 170 | return self.pdf.getObject(self).getObject() 171 | 172 | def __repr__(self): 173 | return "IndirectObject(%r, %r)" % (self.idnum, self.generation) 174 | 175 | def __eq__(self, other): 176 | return ( 177 | other != None and 178 | isinstance(other, IndirectObject) and 179 | self.idnum == other.idnum and 180 | self.generation == other.generation and 181 | self.pdf is other.pdf 182 | ) 183 | 184 | def __ne__(self, other): 185 | return not self.__eq__(other) 186 | 187 | def writeToStream(self, stream, encryption_key): 188 | stream.write(b_("%s %s R" % (self.idnum, self.generation))) 189 | 190 | def readFromStream(stream, pdf): 191 | idnum = b_("") 192 | while True: 193 | tok = stream.read(1) 194 | if not tok: 195 | # stream has truncated prematurely 196 | raise PdfStreamError("Stream has ended unexpectedly") 197 | if tok.isspace(): 198 | break 199 | idnum += tok 200 | generation = b_("") 201 | while True: 202 | tok = stream.read(1) 203 | if not tok: 204 | # stream has truncated prematurely 205 | raise PdfStreamError("Stream has ended unexpectedly") 206 | if tok.isspace(): 207 | break 208 | generation += tok 209 | r = stream.read(1) 210 | if r != b_("R"): 211 | raise utils.PdfReadError("Error reading indirect object reference at byte %s" % utils.hexStr(stream.tell())) 212 | return IndirectObject(int(idnum), int(generation), pdf) 213 | readFromStream = staticmethod(readFromStream) 214 | 215 | 216 | class FloatObject(decimal.Decimal, PdfObject): 217 | def __new__(cls, value="0", context=None): 218 | try: 219 | return decimal.Decimal.__new__(cls, utils.str_(value), context) 220 | except: 221 | return decimal.Decimal.__new__(cls, utils.str_(value)) 222 | def __repr__(self): 223 | if self == self.to_integral(): 224 | return str(self.quantize(decimal.Decimal(1))) 225 | else: 226 | # XXX: this adds useless extraneous zeros. 227 | return "%.5f" % self 228 | 229 | def as_numeric(self): 230 | return float(b_(repr(self))) 231 | 232 | def writeToStream(self, stream, encryption_key): 233 | stream.write(b_(repr(self))) 234 | 235 | 236 | class NumberObject(int, PdfObject): 237 | def __init__(self, value): 238 | int.__init__(value) 239 | 240 | def as_numeric(self): 241 | return int(b_(repr(self))) 242 | 243 | def writeToStream(self, stream, encryption_key): 244 | stream.write(b_(repr(self))) 245 | 246 | def readFromStream(stream): 247 | num = b_("") 248 | while True: 249 | tok = stream.read(1) 250 | if tok != b_('+') and tok != b_('-') and tok != b_('.') and not tok.isdigit(): 251 | stream.seek(-1, 1) 252 | break 253 | num += tok 254 | if num.find(b_(".")) != -1: 255 | return FloatObject(num) 256 | else: 257 | return NumberObject(num) 258 | readFromStream = staticmethod(readFromStream) 259 | 260 | 261 | ## 262 | # Given a string (either a "str" or "unicode"), create a ByteStringObject or a 263 | # TextStringObject to represent the string. 
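# Illustrative behaviour of the factory below (assumed, not a doctest):
#   createStringObject(u"hello")              -> TextStringObject(u"hello")
#   createStringObject(b"\xfe\xff\x00h\x00i") -> TextStringObject(u"hi") with autodetect_utf16 set
#   byte strings that PDFDocEncoding cannot decode -> ByteStringObject (raw bytes kept as-is)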
264 | def createStringObject(string): 265 | if isinstance(string, utils.string_type): 266 | return TextStringObject(string) 267 | elif isinstance(string, utils.bytes_type): 268 | try: 269 | if string.startswith(codecs.BOM_UTF16_BE): 270 | retval = TextStringObject(string.decode("utf-16")) 271 | retval.autodetect_utf16 = True 272 | return retval 273 | else: 274 | # This is probably a big performance hit here, but we need to 275 | # convert string objects into the text/unicode-aware version if 276 | # possible... and the only way to check if that's possible is 277 | # to try. Some strings are strings, some are just byte arrays. 278 | retval = TextStringObject(decode_pdfdocencoding(string)) 279 | retval.autodetect_pdfdocencoding = True 280 | return retval 281 | except UnicodeDecodeError: 282 | return ByteStringObject(string) 283 | else: 284 | raise TypeError("createStringObject should have str or unicode arg") 285 | 286 | 287 | def readHexStringFromStream(stream): 288 | stream.read(1) 289 | txt = "" 290 | x = b_("") 291 | while True: 292 | tok = readNonWhitespace(stream) 293 | if not tok: 294 | # stream has truncated prematurely 295 | raise PdfStreamError("Stream has ended unexpectedly") 296 | if tok == b_(">"): 297 | break 298 | x += tok 299 | if len(x) == 2: 300 | txt += chr(int(x, base=16)) 301 | x = b_("") 302 | if len(x) == 1: 303 | x += b_("0") 304 | if len(x) == 2: 305 | txt += chr(int(x, base=16)) 306 | return createStringObject(b_(txt)) 307 | 308 | 309 | def readStringFromStream(stream): 310 | tok = stream.read(1) 311 | parens = 1 312 | txt = b_("") 313 | while True: 314 | tok = stream.read(1) 315 | if not tok: 316 | # stream has truncated prematurely 317 | raise PdfStreamError("Stream has ended unexpectedly") 318 | if tok == b_("("): 319 | parens += 1 320 | elif tok == b_(")"): 321 | parens -= 1 322 | if parens == 0: 323 | break 324 | elif tok == b_("\\"): 325 | tok = stream.read(1) 326 | if tok == b_("n"): 327 | tok = b_("\n") 328 | elif tok == b_("r"): 329 | tok = b_("\r") 330 | elif tok == b_("t"): 331 | tok = b_("\t") 332 | elif tok == b_("b"): 333 | tok = b_("\b") 334 | elif tok == b_("f"): 335 | tok = b_("\f") 336 | elif tok == b_("("): 337 | tok = b_("(") 338 | elif tok == b_(")"): 339 | tok = b_(")") 340 | elif tok == b_("\\"): 341 | tok = b_("\\") 342 | elif tok in (b_(" "), b_("/"), b_("%"), b_("<"), b_(">"), b_("["), b_("]")): 343 | # odd/unnessecary escape sequences we have encountered 344 | tok = b_(tok) 345 | elif tok.isdigit(): 346 | # "The number ddd may consist of one, two, or three 347 | # octal digits; high-order overflow shall be ignored. 348 | # Three octal digits shall be used, with leading zeros 349 | # as needed, if the next character of the string is also 350 | # a digit." (PDF reference 7.3.4.2, p 16) 351 | for i in range(2): 352 | ntok = stream.read(1) 353 | if ntok.isdigit(): 354 | tok += ntok 355 | else: 356 | break 357 | tok = b_(chr(int(tok, base=8))) 358 | elif tok in b_("\n\r"): 359 | # This case is hit when a backslash followed by a line 360 | # break occurs. If it's a multi-char EOL, consume the 361 | # second character: 362 | tok = stream.read(1) 363 | if not tok in b_("\n\r"): 364 | stream.seek(-1, 1) 365 | # Then don't add anything to the actual string, since this 366 | # line break was escaped: 367 | tok = b_('') 368 | else: 369 | raise utils.PdfReadError("Unexpected escaped string") 370 | txt += tok 371 | return createStringObject(txt) 372 | 373 | 374 | ## 375 | # Represents a string object where the text encoding could not be determined. 
376 | # This occurs quite often, as the PDF spec doesn't provide an alternate way to 377 | # represent strings -- for example, the encryption data stored in files (like 378 | # /O) is clearly not text, but is still stored in a "String" object. 379 | class ByteStringObject(utils.bytes_type, PdfObject): 380 | 381 | ## 382 | # For compatibility with TextStringObject.original_bytes. This method 383 | # returns self. 384 | original_bytes = property(lambda self: self) 385 | 386 | def writeToStream(self, stream, encryption_key): 387 | bytearr = self 388 | if encryption_key: 389 | bytearr = RC4_encrypt(encryption_key, bytearr) 390 | stream.write(b_("<")) 391 | stream.write(utils.hexencode(bytearr)) 392 | stream.write(b_(">")) 393 | 394 | 395 | ## 396 | # Represents a string object that has been decoded into a real unicode string. 397 | # If read from a PDF document, this string appeared to match the 398 | # PDFDocEncoding, or contained a UTF-16BE BOM mark to cause UTF-16 decoding to 399 | # occur. 400 | class TextStringObject(utils.string_type, PdfObject): 401 | autodetect_pdfdocencoding = False 402 | autodetect_utf16 = False 403 | 404 | ## 405 | # It is occasionally possible that a text string object gets created where 406 | # a byte string object was expected due to the autodetection mechanism -- 407 | # if that occurs, this "original_bytes" property can be used to 408 | # back-calculate what the original encoded bytes were. 409 | original_bytes = property(lambda self: self.get_original_bytes()) 410 | 411 | def get_original_bytes(self): 412 | # We're a text string object, but the library is trying to get our raw 413 | # bytes. This can happen if we auto-detected this string as text, but 414 | # we were wrong. It's pretty common. Return the original bytes that 415 | # would have been used to create this object, based upon the autodetect 416 | # method. 417 | if self.autodetect_utf16: 418 | return codecs.BOM_UTF16_BE + self.encode("utf-16be") 419 | elif self.autodetect_pdfdocencoding: 420 | return encode_pdfdocencoding(self) 421 | else: 422 | raise Exception("no information about original bytes") 423 | 424 | def writeToStream(self, stream, encryption_key): 425 | # Try to write the string out as a PDFDocEncoding encoded string. It's 426 | # nicer to look at in the PDF file. Sadly, we take a performance hit 427 | # here for trying... 
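# Illustrative examples (assumed): u"Title" fits PDFDocEncoding and is written as the literal
# string (Title), whereas something like u"\u4e2d" cannot be encoded and falls back to a
# BOM-prefixed UTF-16BE hex string via the ByteStringObject branch below.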
428 | try: 429 | bytearr = encode_pdfdocencoding(self) 430 | except UnicodeEncodeError: 431 | bytearr = codecs.BOM_UTF16_BE + self.encode("utf-16be") 432 | if encryption_key: 433 | bytearr = RC4_encrypt(encryption_key, bytearr) 434 | obj = ByteStringObject(bytearr) 435 | obj.writeToStream(stream, None) 436 | else: 437 | stream.write(b_("(")) 438 | for c in bytearr: 439 | if not chr_(c).isalnum() and c != b_(' '): 440 | stream.write(b_("\\%03o" % ord_(c))) 441 | else: 442 | stream.write(b_(chr_(c))) 443 | stream.write(b_(")")) 444 | 445 | 446 | class NameObject(str, PdfObject): 447 | delimiterCharacters = b_("("), b_(")"), b_("<"), b_(">"), b_("["), b_("]"), b_("{"), b_("}"), b_("/"), b_("%") 448 | 449 | def __init__(self, data): 450 | str.__init__(data) 451 | 452 | def writeToStream(self, stream, encryption_key): 453 | stream.write(b_(self)) 454 | 455 | def readFromStream(stream): 456 | debug = False 457 | if debug: print((stream.tell())) 458 | name = stream.read(1) 459 | if name != b_("/"): 460 | raise utils.PdfReadError("name read error") 461 | while True: 462 | tok = stream.read(1) 463 | if not tok: 464 | # stream has truncated prematurely 465 | raise PdfStreamError("Stream has ended unexpectedly") 466 | if tok.isspace() or tok in NameObject.delimiterCharacters: 467 | stream.seek(-1, 1) 468 | break 469 | name += tok 470 | if debug: print(name) 471 | return NameObject(name.decode('utf-8')) 472 | readFromStream = staticmethod(readFromStream) 473 | 474 | 475 | class DictionaryObject(dict, PdfObject): 476 | 477 | def __init__(self, *args, **kwargs): 478 | if len(args) == 0: 479 | self.update(kwargs) 480 | elif len(args) == 1: 481 | arr = args[0] 482 | # If we're passed a list/tuple, make a dict out of it 483 | if not hasattr(arr, "iteritems"): 484 | newarr = {} 485 | for k, v in arr: 486 | newarr[k] = v 487 | arr = newarr 488 | self.update(arr) 489 | else: 490 | raise TypeError("dict expected at most 1 argument, got 3") 491 | 492 | def update(self, arr): 493 | # note, a ValueError halfway through copying values 494 | # will leave half the values in this dict. 495 | for k, v in list(arr.items()): 496 | self.__setitem__(k, v) 497 | 498 | def raw_get(self, key): 499 | return dict.__getitem__(self, key) 500 | 501 | def __setitem__(self, key, value): 502 | if not isinstance(key, PdfObject): 503 | raise ValueError("key must be PdfObject") 504 | if not isinstance(value, PdfObject): 505 | raise ValueError("value must be PdfObject") 506 | return dict.__setitem__(self, key, value) 507 | 508 | def setdefault(self, key, value=None): 509 | if not isinstance(key, PdfObject): 510 | raise ValueError("key must be PdfObject") 511 | if not isinstance(value, PdfObject): 512 | raise ValueError("value must be PdfObject") 513 | return dict.setdefault(self, key, value) 514 | 515 | def __getitem__(self, key): 516 | return dict.__getitem__(self, key).getObject() 517 | 518 | ## 519 | # Retrieves XMP (Extensible Metadata Platform) data relevant to the 520 | # this object, if available. 521 | #
522 | # Stability: Added in v1.12, will exist for all future v1.x releases. 523 | # @return Returns a {@link #xmp.XmpInformation XmlInformation} instance 524 | # that can be used to access XMP metadata from the document. Can also 525 | # return None if no metadata was found on the document root. 526 | def getXmpMetadata(self): 527 | metadata = self.get("/Metadata", None) 528 | if metadata == None: 529 | return None 530 | metadata = metadata.getObject() 531 | from . import xmp 532 | if not isinstance(metadata, xmp.XmpInformation): 533 | metadata = xmp.XmpInformation(metadata) 534 | self[NameObject("/Metadata")] = metadata 535 | return metadata 536 | 537 | ## 538 | # Read-only property that accesses the {@link 539 | # #DictionaryObject.getXmpData getXmpData} function. 540 | #
541 | # Stability: Added in v1.12, will exist for all future v1.x releases. 542 | xmpMetadata = property(lambda self: self.getXmpMetadata(), None, None) 543 | 544 | def writeToStream(self, stream, encryption_key): 545 | stream.write(b_("<<\n")) 546 | for key, value in list(self.items()): 547 | key.writeToStream(stream, encryption_key) 548 | stream.write(b_(" ")) 549 | value.writeToStream(stream, encryption_key) 550 | stream.write(b_("\n")) 551 | stream.write(b_(">>")) 552 | 553 | def readFromStream(stream, pdf): 554 | # This method is broken in Python 3+ and needs work, 555 | # especially when finding endstream marker 556 | debug = False 557 | tmp = stream.read(2) 558 | if tmp != b_("<<"): 559 | raise utils.PdfReadError("Dictionary read error at byte %s: stream must begin with '<<'" % utils.hexStr(stream.tell())) 560 | data = {} 561 | while True: 562 | tok = readNonWhitespace(stream) 563 | if tok == b_('\x00'): 564 | continue 565 | if not tok: 566 | # stream has truncated prematurely 567 | raise PdfStreamError("Stream has ended unexpectedly") 568 | 569 | if debug: print(("Tok:", tok)) 570 | if tok == b_(">"): 571 | stream.read(1) 572 | break 573 | stream.seek(-1, 1) 574 | key = readObject(stream, pdf) 575 | tok = readNonWhitespace(stream) 576 | stream.seek(-1, 1) 577 | value = readObject(stream, pdf) 578 | if key in data: 579 | # multiple definitions of key not permitted 580 | raise utils.PdfReadError("Multiple definitions in dictionary at byte %s for key %s" \ 581 | % (utils.hexStr(stream.tell()), key)) 582 | data[key] = value 583 | pos = stream.tell() 584 | s = readNonWhitespace(stream) 585 | if s == b_('s') and stream.read(5) == b_('tream'): 586 | eol = stream.read(1) 587 | # odd PDF file output has spaces after 'stream' keyword but before EOL. 588 | # patch provided by Danial Sandler 589 | while eol == b_(' '): 590 | eol = stream.read(1) 591 | assert eol in (b_("\n"), b_("\r")) 592 | if eol == b_("\r"): 593 | # read \n after 594 | if stream.read(1) != b_('\n'): 595 | stream.seek(-1, 1) 596 | # this is a stream object, not a dictionary 597 | assert "/Length" in data 598 | length = data["/Length"] 599 | if debug: print(data) 600 | if isinstance(length, IndirectObject): 601 | t = stream.tell() 602 | length = pdf.getObject(length) 603 | stream.seek(t, 0) 604 | data["__streamdata__"] = stream.read(length) 605 | if debug: print("here") 606 | #if debug: print(binascii.hexlify(data["__streamdata__"])) 607 | e = readNonWhitespace(stream) 608 | ndstream = stream.read(8) 609 | if (e + ndstream) != b_("endstream"): 610 | # (sigh) - the odd PDF file has a length that is too long, so 611 | # we need to read backwards to find the "endstream" ending. 612 | # ReportLab (unknown version) generates files with this bug, 613 | # and Python users into PDF files tend to be our audience. 614 | # we need to do this to correct the streamdata and chop off 615 | # an extra character. 616 | pos = stream.tell() 617 | stream.seek(-10, 1) 618 | end = stream.read(9) 619 | if end == b_("endstream"): 620 | # we found it by looking back one character further. 621 | data["__streamdata__"] = data["__streamdata__"][:-1] 622 | else: 623 | if debug: print(("E", e, ndstream, debugging.toHex(end))) 624 | stream.seek(pos, 0) 625 | raise utils.PdfReadError("Unable to find 'endstream' marker after stream at byte %s." 
% utils.hexStr(stream.tell())) 626 | else: 627 | stream.seek(pos, 0) 628 | if "__streamdata__" in data: 629 | return StreamObject.initializeFromDictionary(data) 630 | else: 631 | retval = DictionaryObject() 632 | retval.update(data) 633 | return retval 634 | readFromStream = staticmethod(readFromStream) 635 | 636 | class TreeObject(DictionaryObject): 637 | def __init__(self): 638 | DictionaryObject.__init__(self) 639 | 640 | def hasChildren(self): 641 | return '/First' in self 642 | 643 | def __iter__(self): 644 | return self.children() 645 | 646 | def children(self): 647 | if not self.hasChildren(): 648 | raise StopIteration 649 | 650 | child = self['/First'] 651 | while True: 652 | yield child 653 | if child == self['/Last']: 654 | raise StopIteration 655 | child = child['/Next'] 656 | 657 | def addChild(self, child, pdf): 658 | childObj = child.getObject() 659 | child = pdf.getReference(childObj) 660 | assert isinstance(child, IndirectObject) 661 | 662 | if '/First' not in self: 663 | self[NameObject('/First')] = child 664 | self[NameObject('/Count')] = NumberObject(0) 665 | prev = None 666 | else: 667 | prev = self['/Last'] 668 | 669 | self[NameObject('/Last')] = child 670 | self[NameObject('/Count')] = NumberObject(self[NameObject('/Count')] + 1) 671 | 672 | if prev: 673 | prevRef = pdf.getReference(prev) 674 | assert isinstance(prevRef, IndirectObject) 675 | childObj[NameObject('/Prev')] = prevRef 676 | prev[NameObject('/Next')] = child 677 | 678 | parentRef = pdf.getReference(self) 679 | assert isinstance(parentRef, IndirectObject) 680 | childObj[NameObject('/Parent')] = parentRef 681 | 682 | def removeChild(self, child): 683 | childObj = child.getObject() 684 | 685 | if NameObject('/Parent') not in childObj: 686 | raise ValueError("Removed child does not appear to be a tree item") 687 | elif childObj[NameObject('/Parent')] != self: 688 | raise ValueError("Removed child is not a member of this tree") 689 | 690 | found = False 691 | prevRef = None 692 | prev = None 693 | curRef = self[NameObject('/First')] 694 | cur = curRef.getObject() 695 | lastRef = self[NameObject('/Last')] 696 | last = lastRef.getObject() 697 | while cur != None: 698 | if cur == childObj: 699 | if prev == None: 700 | if NameObject('/Next') in cur: 701 | # Removing first tree node 702 | nextRef = cur[NameObject('/Next')] 703 | next = nextRef.getObject() 704 | del next[NameObject('/Prev')] 705 | self[NameObject('/First')] = nextRef 706 | self[NameObject('/Count')] = self[NameObject('/Count')] - 1 707 | 708 | else: 709 | # Removing only tree node 710 | assert self[NameObject('/Count')] == 1 711 | del self[NameObject('/Count')] 712 | del self[NameObject('/First')] 713 | if NameObject('/Last') in self: 714 | del self[NameObject('/Last')] 715 | else: 716 | if NameObject('/Next') in cur: 717 | # Removing middle tree node 718 | nextRef = cur[NameObject('/Next')] 719 | next = nextRef.getObject() 720 | next[NameObject('/Prev')] = prevRef 721 | prev[NameObject('/Next')] = nextRef 722 | self[NameObject('/Count')] = self[NameObject('/Count')] - 1 723 | else: 724 | # Removing last tree node 725 | assert cur == last 726 | del prev[NameObject('/Next')] 727 | self[NameObject('/Last')] = prevRef 728 | self[NameObject('/Count')] = self[NameObject('/Count')] - 1 729 | found = True 730 | break 731 | 732 | 733 | prevRef = curRef 734 | prev = cur 735 | if NameObject('/Next') in cur: 736 | curRef = cur[NameObject('/Next')] 737 | cur = curRef.getObject() 738 | else: 739 | curRef = None 740 | cur = None 741 | 742 | if not found: 743 | 
raise ValueError("Removal couldn't find item in tree") 744 | 745 | del childObj[NameObject('/Parent')] 746 | if NameObject('/Next') in childObj: 747 | del childObj[NameObject('/Next')] 748 | if NameObject('/Prev') in childObj: 749 | del childObj[NameObject('/Prev')] 750 | 751 | def emptyTree(self): 752 | for child in self: 753 | childObj = child.getObject() 754 | del childObj[NameObject('/Parent')] 755 | if NameObject('/Next') in childObj: 756 | del childObj[NameObject('/Next')] 757 | if NameObject('/Prev') in childObj: 758 | del childObj[NameObject('/Prev')] 759 | 760 | if NameObject('/Count') in self: 761 | del self[NameObject('/Count')] 762 | if NameObject('/First') in self: 763 | del self[NameObject('/First')] 764 | if NameObject('/Last') in self: 765 | del self[NameObject('/Last')] 766 | 767 | 768 | class StreamObject(DictionaryObject): 769 | def __init__(self): 770 | self._data = None 771 | self.decodedSelf = None 772 | 773 | def writeToStream(self, stream, encryption_key): 774 | self[NameObject("/Length")] = NumberObject(len(self._data)) 775 | DictionaryObject.writeToStream(self, stream, encryption_key) 776 | del self["/Length"] 777 | stream.write(b_("\nstream\n")) 778 | data = self._data 779 | if encryption_key: 780 | data = RC4_encrypt(encryption_key, data) 781 | stream.write(data) 782 | stream.write(b_("\nendstream")) 783 | 784 | def initializeFromDictionary(data): 785 | if "/Filter" in data: 786 | retval = EncodedStreamObject() 787 | else: 788 | retval = DecodedStreamObject() 789 | retval._data = data["__streamdata__"] 790 | del data["__streamdata__"] 791 | del data["/Length"] 792 | retval.update(data) 793 | return retval 794 | initializeFromDictionary = staticmethod(initializeFromDictionary) 795 | 796 | def flateEncode(self): 797 | if "/Filter" in self: 798 | f = self["/Filter"] 799 | if isinstance(f, ArrayObject): 800 | f.insert(0, NameObject("/FlateDecode")) 801 | else: 802 | newf = ArrayObject() 803 | newf.append(NameObject("/FlateDecode")) 804 | newf.append(f) 805 | f = newf 806 | else: 807 | f = NameObject("/FlateDecode") 808 | retval = EncodedStreamObject() 809 | retval[NameObject("/Filter")] = f 810 | retval._data = filters.FlateDecode.encode(self._data) 811 | return retval 812 | 813 | 814 | class DecodedStreamObject(StreamObject): 815 | def getData(self): 816 | return self._data 817 | 818 | def setData(self, data): 819 | self._data = data 820 | 821 | 822 | class EncodedStreamObject(StreamObject): 823 | def __init__(self): 824 | self.decodedSelf = None 825 | 826 | def getData(self): 827 | if self.decodedSelf: 828 | # cached version of decoded object 829 | return self.decodedSelf.getData() 830 | else: 831 | # create decoded object 832 | decoded = DecodedStreamObject() 833 | 834 | decoded._data = filters.decodeStreamData(self) 835 | for key, value in list(self.items()): 836 | if not key in ("/Length", "/Filter", "/DecodeParms"): 837 | decoded[key] = value 838 | self.decodedSelf = decoded 839 | return decoded._data 840 | 841 | def setData(self, data): 842 | raise utils.PdfReadError("Creating EncodedStreamObject is not currently supported") 843 | 844 | 845 | class RectangleObject(ArrayObject): 846 | def __init__(self, arr): 847 | # must have four points 848 | assert len(arr) == 4 849 | # automatically convert arr[x] into NumberObject(arr[x]) if necessary 850 | ArrayObject.__init__(self, [self.ensureIsNumber(x) for x in arr]) 851 | 852 | def ensureIsNumber(self, value): 853 | if not isinstance(value, (NumberObject, FloatObject)): 854 | value = FloatObject(value) 855 | return 
value 856 | 857 | def __repr__(self): 858 | return "RectangleObject(%s)" % repr(list(self)) 859 | 860 | def getLowerLeft_x(self): 861 | return self[0] 862 | 863 | def getLowerLeft_y(self): 864 | return self[1] 865 | 866 | def getUpperRight_x(self): 867 | return self[2] 868 | 869 | def getUpperRight_y(self): 870 | return self[3] 871 | 872 | def getUpperLeft_x(self): 873 | return self.getLowerLeft_x() 874 | 875 | def getUpperLeft_y(self): 876 | return self.getUpperRight_y() 877 | 878 | def getLowerRight_x(self): 879 | return self.getUpperRight_x() 880 | 881 | def getLowerRight_y(self): 882 | return self.getLowerLeft_y() 883 | 884 | def getLowerLeft(self): 885 | return self.getLowerLeft_x(), self.getLowerLeft_y() 886 | 887 | def getLowerRight(self): 888 | return self.getLowerRight_x(), self.getLowerRight_y() 889 | 890 | def getUpperLeft(self): 891 | return self.getUpperLeft_x(), self.getUpperLeft_y() 892 | 893 | def getUpperRight(self): 894 | return self.getUpperRight_x(), self.getUpperRight_y() 895 | 896 | def setLowerLeft(self, value): 897 | self[0], self[1] = [self.ensureIsNumber(x) for x in value] 898 | 899 | def setLowerRight(self, value): 900 | self[2], self[1] = [self.ensureIsNumber(x) for x in value] 901 | 902 | def setUpperLeft(self, value): 903 | self[0], self[3] = [self.ensureIsNumber(x) for x in value] 904 | 905 | def setUpperRight(self, value): 906 | self[2], self[3] = [self.ensureIsNumber(x) for x in value] 907 | 908 | def getWidth(self): 909 | return self.getUpperRight_x() - self.getLowerLeft_x() 910 | 911 | def getHeight(self): 912 | return self.getUpperRight_y() - self.getLowerLeft_y() 913 | 914 | lowerLeft = property(getLowerLeft, setLowerLeft, None, None) 915 | lowerRight = property(getLowerRight, setLowerRight, None, None) 916 | upperLeft = property(getUpperLeft, setUpperLeft, None, None) 917 | upperRight = property(getUpperRight, setUpperRight, None, None) 918 | 919 | 920 | ## 921 | # A class representing a destination within a PDF file. 922 | # See section 8.2.1 of the PDF 1.6 reference. 923 | # Stability: Added in v1.10, will exist for all v1.x releases. 924 | class Destination(TreeObject): 925 | def __init__(self, title, page, typ, *args): 926 | DictionaryObject.__init__(self) 927 | self[NameObject("/Title")] = title 928 | self[NameObject("/Page")] = page 929 | self[NameObject("/Type")] = typ 930 | 931 | # from table 8.2 of the PDF 1.6 reference.
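# Each destination type takes a specific set of positional arguments,
# matching the branches below:
#   /XYZ          -> (left, top, zoom)
#   /FitR         -> (left, bottom, right, top)
#   /FitH, /FitBH -> (top,)
#   /FitV, /FitBV -> (left,)
#   /Fit, /FitB   -> no extra arguments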
932 | if typ == "/XYZ": 933 | (self[NameObject("/Left")], self[NameObject("/Top")], 934 | self[NameObject("/Zoom")]) = args 935 | elif typ == "/FitR": 936 | (self[NameObject("/Left")], self[NameObject("/Bottom")], 937 | self[NameObject("/Right")], self[NameObject("/Top")]) = args 938 | elif typ in ["/FitH", "/FitBH"]: 939 | self[NameObject("/Top")], = args 940 | elif typ in ["/FitV", "/FitBV"]: 941 | self[NameObject("/Left")], = args 942 | elif typ in ["/Fit", "/FitB"]: 943 | pass 944 | else: 945 | raise utils.PdfReadError("Unknown Destination Type: %r" % typ) 946 | 947 | def getDestArray(self): 948 | return ArrayObject([self.raw_get('/Page'), self['/Type']] + [self[x] for x in ['/Left', '/Bottom', '/Right', '/Top', '/Zoom'] if x in self]) 949 | 950 | def writeToStream(self, stream, encryption_key): 951 | stream.write(b_("<<\n")) 952 | key = NameObject('/D') 953 | key.writeToStream(stream, encryption_key) 954 | stream.write(b_(" ")) 955 | value = self.getDestArray() 956 | value.writeToStream(stream, encryption_key) 957 | 958 | key = NameObject("/S") 959 | key.writeToStream(stream, encryption_key) 960 | stream.write(b_(" ")) 961 | value = NameObject("/GoTo") 962 | value.writeToStream(stream, encryption_key) 963 | 964 | stream.write(b_("\n")) 965 | stream.write(b_(">>")) 966 | 967 | ## 968 | # Read-only property accessing the destination title. 969 | # @return A string. 970 | title = property(lambda self: self.get("/Title")) 971 | 972 | ## 973 | # Read-only property accessing the destination page. 974 | # @return An integer. 975 | page = property(lambda self: self.get("/Page")) 976 | 977 | ## 978 | # Read-only property accessing the destination type. 979 | # @return A string. 980 | typ = property(lambda self: self.get("/Type")) 981 | 982 | ## 983 | # Read-only property accessing the zoom factor. 984 | # @return A number, or None if not available. 985 | zoom = property(lambda self: self.get("/Zoom", None)) 986 | 987 | ## 988 | # Read-only property accessing the left horizontal coordinate. 989 | # @return A number, or None if not available. 990 | left = property(lambda self: self.get("/Left", None)) 991 | 992 | ## 993 | # Read-only property accessing the right horizontal coordinate. 994 | # @return A number, or None if not available. 995 | right = property(lambda self: self.get("/Right", None)) 996 | 997 | ## 998 | # Read-only property accessing the top vertical coordinate. 999 | # @return A number, or None if not available. 1000 | top = property(lambda self: self.get("/Top", None)) 1001 | 1002 | ## 1003 | # Read-only property accessing the bottom vertical coordinate. 1004 | # @return A number, or None if not available. 
1005 | bottom = property(lambda self: self.get("/Bottom", None)) 1006 | 1007 | 1008 | class Bookmark(Destination): 1009 | def writeToStream(self, stream, encryption_key): 1010 | stream.write(b_("<<\n")) 1011 | for key in [NameObject(x) for x in ['/Title', '/Parent', '/First', '/Last', '/Next', '/Prev'] if x in self]: 1012 | key.writeToStream(stream, encryption_key) 1013 | stream.write(b_(" ")) 1014 | value = self.raw_get(key) 1015 | value.writeToStream(stream, encryption_key) 1016 | stream.write(b_("\n")) 1017 | key = NameObject('/Dest') 1018 | key.writeToStream(stream, encryption_key) 1019 | stream.write(b_(" ")) 1020 | value = self.getDestArray() 1021 | value.writeToStream(stream, encryption_key) 1022 | stream.write(b_("\n")) 1023 | stream.write(b_(">>")) 1024 | 1025 | 1026 | def encode_pdfdocencoding(unicode_string): 1027 | retval = b_('') 1028 | for c in unicode_string: 1029 | try: 1030 | retval += b_(chr(_pdfDocEncoding_rev[c])) 1031 | except KeyError: 1032 | raise UnicodeEncodeError("pdfdocencoding", c, -1, -1, 1033 | "does not exist in translation table") 1034 | return retval 1035 | 1036 | def decode_pdfdocencoding(byte_array): 1037 | retval = u_('') 1038 | for b in byte_array: 1039 | c = _pdfDocEncoding[ord_(b)] 1040 | if c == u_('\u0000'): 1041 | raise UnicodeDecodeError("pdfdocencoding", utils.barray(b), -1, -1, 1042 | "does not exist in translation table") 1043 | retval += c 1044 | return retval 1045 | 1046 | _pdfDocEncoding = ( 1047 | u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), 1048 | u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), 1049 | u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), 1050 | u_('\u02d8'), u_('\u02c7'), u_('\u02c6'), u_('\u02d9'), u_('\u02dd'), u_('\u02db'), u_('\u02da'), u_('\u02dc'), 1051 | u_('\u0020'), u_('\u0021'), u_('\u0022'), u_('\u0023'), u_('\u0024'), u_('\u0025'), u_('\u0026'), u_('\u0027'), 1052 | u_('\u0028'), u_('\u0029'), u_('\u002a'), u_('\u002b'), u_('\u002c'), u_('\u002d'), u_('\u002e'), u_('\u002f'), 1053 | u_('\u0030'), u_('\u0031'), u_('\u0032'), u_('\u0033'), u_('\u0034'), u_('\u0035'), u_('\u0036'), u_('\u0037'), 1054 | u_('\u0038'), u_('\u0039'), u_('\u003a'), u_('\u003b'), u_('\u003c'), u_('\u003d'), u_('\u003e'), u_('\u003f'), 1055 | u_('\u0040'), u_('\u0041'), u_('\u0042'), u_('\u0043'), u_('\u0044'), u_('\u0045'), u_('\u0046'), u_('\u0047'), 1056 | u_('\u0048'), u_('\u0049'), u_('\u004a'), u_('\u004b'), u_('\u004c'), u_('\u004d'), u_('\u004e'), u_('\u004f'), 1057 | u_('\u0050'), u_('\u0051'), u_('\u0052'), u_('\u0053'), u_('\u0054'), u_('\u0055'), u_('\u0056'), u_('\u0057'), 1058 | u_('\u0058'), u_('\u0059'), u_('\u005a'), u_('\u005b'), u_('\u005c'), u_('\u005d'), u_('\u005e'), u_('\u005f'), 1059 | u_('\u0060'), u_('\u0061'), u_('\u0062'), u_('\u0063'), u_('\u0064'), u_('\u0065'), u_('\u0066'), u_('\u0067'), 1060 | u_('\u0068'), u_('\u0069'), u_('\u006a'), u_('\u006b'), u_('\u006c'), u_('\u006d'), u_('\u006e'), u_('\u006f'), 1061 | u_('\u0070'), u_('\u0071'), u_('\u0072'), u_('\u0073'), u_('\u0074'), u_('\u0075'), u_('\u0076'), u_('\u0077'), 1062 | u_('\u0078'), u_('\u0079'), u_('\u007a'), u_('\u007b'), u_('\u007c'), u_('\u007d'), u_('\u007e'), u_('\u0000'), 1063 | u_('\u2022'), u_('\u2020'), u_('\u2021'), u_('\u2026'), u_('\u2014'), u_('\u2013'), u_('\u0192'), u_('\u2044'), 1064 | u_('\u2039'), u_('\u203a'), u_('\u2212'), u_('\u2030'), 
u_('\u201e'), u_('\u201c'), u_('\u201d'), u_('\u2018'), 1065 | u_('\u2019'), u_('\u201a'), u_('\u2122'), u_('\ufb01'), u_('\ufb02'), u_('\u0141'), u_('\u0152'), u_('\u0160'), 1066 | u_('\u0178'), u_('\u017d'), u_('\u0131'), u_('\u0142'), u_('\u0153'), u_('\u0161'), u_('\u017e'), u_('\u0000'), 1067 | u_('\u20ac'), u_('\u00a1'), u_('\u00a2'), u_('\u00a3'), u_('\u00a4'), u_('\u00a5'), u_('\u00a6'), u_('\u00a7'), 1068 | u_('\u00a8'), u_('\u00a9'), u_('\u00aa'), u_('\u00ab'), u_('\u00ac'), u_('\u0000'), u_('\u00ae'), u_('\u00af'), 1069 | u_('\u00b0'), u_('\u00b1'), u_('\u00b2'), u_('\u00b3'), u_('\u00b4'), u_('\u00b5'), u_('\u00b6'), u_('\u00b7'), 1070 | u_('\u00b8'), u_('\u00b9'), u_('\u00ba'), u_('\u00bb'), u_('\u00bc'), u_('\u00bd'), u_('\u00be'), u_('\u00bf'), 1071 | u_('\u00c0'), u_('\u00c1'), u_('\u00c2'), u_('\u00c3'), u_('\u00c4'), u_('\u00c5'), u_('\u00c6'), u_('\u00c7'), 1072 | u_('\u00c8'), u_('\u00c9'), u_('\u00ca'), u_('\u00cb'), u_('\u00cc'), u_('\u00cd'), u_('\u00ce'), u_('\u00cf'), 1073 | u_('\u00d0'), u_('\u00d1'), u_('\u00d2'), u_('\u00d3'), u_('\u00d4'), u_('\u00d5'), u_('\u00d6'), u_('\u00d7'), 1074 | u_('\u00d8'), u_('\u00d9'), u_('\u00da'), u_('\u00db'), u_('\u00dc'), u_('\u00dd'), u_('\u00de'), u_('\u00df'), 1075 | u_('\u00e0'), u_('\u00e1'), u_('\u00e2'), u_('\u00e3'), u_('\u00e4'), u_('\u00e5'), u_('\u00e6'), u_('\u00e7'), 1076 | u_('\u00e8'), u_('\u00e9'), u_('\u00ea'), u_('\u00eb'), u_('\u00ec'), u_('\u00ed'), u_('\u00ee'), u_('\u00ef'), 1077 | u_('\u00f0'), u_('\u00f1'), u_('\u00f2'), u_('\u00f3'), u_('\u00f4'), u_('\u00f5'), u_('\u00f6'), u_('\u00f7'), 1078 | u_('\u00f8'), u_('\u00f9'), u_('\u00fa'), u_('\u00fb'), u_('\u00fc'), u_('\u00fd'), u_('\u00fe'), u_('\u00ff') 1079 | ) 1080 | 1081 | assert len(_pdfDocEncoding) == 256 1082 | 1083 | _pdfDocEncoding_rev = {} 1084 | for i in range(256): 1085 | char = _pdfDocEncoding[i] 1086 | if char == u_("\u0000"): 1087 | continue 1088 | assert char not in _pdfDocEncoding_rev 1089 | _pdfDocEncoding_rev[char] = i 1090 | 1091 | -------------------------------------------------------------------------------- /PyPDF2/merger.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | from .generic import * 31 | from .pdf import PdfFileReader, PdfFileWriter 32 | from sys import version_info 33 | if version_info < ( 3, 0 ): 34 | from cStringIO import StringIO 35 | else: 36 | from io import StringIO 37 | from io import FileIO as file 38 | 39 | class _MergedPage(object): 40 | """ 41 | _MergedPage is used internally by PdfFileMerger to collect necessary information on each page that is being merged. 42 | """ 43 | def __init__(self, pagedata, src, id): 44 | self.src = src 45 | self.pagedata = pagedata 46 | self.out_pagedata = None 47 | self.id = id 48 | 49 | class PdfFileMerger(object): 50 | """ 51 | PdfFileMerger merges multiple PDFs into a single PDF. It can concatenate, 52 | slice, insert, or any combination of the above. 53 | 54 | See the functions "merge" (or "append") and "write" (or "overwrite") for 55 | usage information. 56 | """ 57 | 58 | def __init__(self, strict=True): 59 | """ 60 | >>> PdfFileMerger() 61 | 62 | Initializes a PdfFileMerger, no parameters required 63 | """ 64 | self.inputs = [] 65 | self.pages = [] 66 | self.output = PdfFileWriter() 67 | self.bookmarks = [] 68 | self.named_dests = [] 69 | self.id_count = 0 70 | self.strict = strict 71 | 72 | def merge(self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True): 73 | """ 74 | >>> merge(position, file, bookmark=None, pages=None, import_bookmarks=True) 75 | 76 | Merges the pages from the source document specified by "file" into the output 77 | file at the page number specified by "position". 78 | 79 | Optionally, you may specify a bookmark to be applied at the beginning of the 80 | included file by supplying the text of the bookmark in the "bookmark" parameter. 81 | 82 | You may prevent the source document's bookmarks from being imported by 83 | specifying "import_bookmarks" as False. 84 | 85 | You may also use the "pages" parameter to merge only the specified range of 86 | pages from the source document into the output document. 87 | """ 88 | 89 | # This parameter is passed to self.inputs.append and means 90 | # that the stream used was created in this method. 91 | my_file = False 92 | 93 | # If the fileobj parameter is a string, assume it is a path 94 | # and create a file object at that location. If it is a file, 95 | # copy the file's contents into a StringIO stream object; if 96 | # it is a PdfFileReader, copy that reader's stream into a 97 | # StringIO stream. 
98 | # If fileobj is none of the above types, it is not modified 99 | if type(fileobj) in (str, str): 100 | fileobj = file(fileobj, 'rb') 101 | my_file = True 102 | elif isinstance(fileobj, file): 103 | fileobj.seek(0) 104 | filecontent = fileobj.read() 105 | fileobj = StringIO(filecontent) 106 | my_file = True 107 | elif isinstance(fileobj, PdfFileReader): 108 | orig_tell = fileobj.stream.tell() 109 | fileobj.stream.seek(0) 110 | filecontent = StringIO(fileobj.stream.read()) 111 | fileobj.stream.seek(orig_tell) # reset the stream to its original location 112 | fileobj = filecontent 113 | my_file = True 114 | 115 | # Create a new PdfFileReader instance using the stream 116 | # (either file or StringIO) created above 117 | pdfr = PdfFileReader(fileobj, strict=self.strict) 118 | 119 | # Find the range of pages to merge 120 | if pages == None: 121 | pages = (0, pdfr.getNumPages()) 122 | elif type(pages) in (int, float, str, str): 123 | raise TypeError('"pages" must be a tuple of (start, end)') 124 | 125 | srcpages = [] 126 | if bookmark: 127 | bookmark = Bookmark(TextStringObject(bookmark), NumberObject(self.id_count), NameObject('/Fit')) 128 | 129 | outline = [] 130 | if import_bookmarks: 131 | outline = pdfr.getOutlines() 132 | outline = self._trim_outline(pdfr, outline, pages) 133 | 134 | if bookmark: 135 | self.bookmarks += [bookmark, outline] 136 | else: 137 | self.bookmarks += outline 138 | 139 | dests = pdfr.namedDestinations 140 | dests = self._trim_dests(pdfr, dests, pages) 141 | self.named_dests += dests 142 | 143 | # Gather all the pages that are going to be merged 144 | for i in range(*pages): 145 | pg = pdfr.getPage(i) 146 | 147 | id = self.id_count 148 | self.id_count += 1 149 | 150 | mp = _MergedPage(pg, pdfr, id) 151 | 152 | srcpages.append(mp) 153 | 154 | self._associate_dests_to_pages(srcpages) 155 | self._associate_bookmarks_to_pages(srcpages) 156 | 157 | 158 | # Slice to insert the pages at the specified position 159 | self.pages[position:position] = srcpages 160 | 161 | # Keep track of our input files so we can close them later 162 | self.inputs.append((fileobj, pdfr, my_file)) 163 | 164 | 165 | def append(self, fileobj, bookmark=None, pages=None, import_bookmarks=True): 166 | """ 167 | >>> append(file, bookmark=None, pages=None, import_bookmarks=True): 168 | 169 | Identical to the "merge" function, but assumes you want to concatenate all pages 170 | onto the end of the file instead of specifying a position. 
171 | """ 172 | 173 | self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks) 174 | 175 | 176 | def write(self, fileobj): 177 | """ 178 | >>> write(file) 179 | 180 | Writes all data that has been merged to "file" (which can be a filename or any 181 | kind of file-like object) 182 | """ 183 | my_file = False 184 | if type(fileobj) in (str, str): 185 | fileobj = file(fileobj, 'wb') 186 | my_file = True 187 | 188 | 189 | # Add pages to the PdfFileWriter 190 | # The commented out line below was replaced with the two lines below it to allow PdfFileMerger to work with PyPdf 1.13 191 | for page in self.pages: 192 | self.output.addPage(page.pagedata) 193 | page.out_pagedata = self.output.getReference(self.output._pages.getObject()["/Kids"][-1].getObject()) 194 | #idnum = self.output._objects.index(self.output._pages.getObject()["/Kids"][-1].getObject()) + 1 195 | #page.out_pagedata = IndirectObject(idnum, 0, self.output) 196 | 197 | # Once all pages are added, create bookmarks to point at those pages 198 | self._write_dests() 199 | self._write_bookmarks() 200 | 201 | # Write the output to the file 202 | self.output.write(fileobj) 203 | 204 | if my_file: 205 | fileobj.close() 206 | 207 | 208 | 209 | def close(self): 210 | """ 211 | >>> close() 212 | 213 | Shuts all file descriptors (input and output) and clears all memory usage 214 | """ 215 | self.pages = [] 216 | for fo, pdfr, mine in self.inputs: 217 | if mine: 218 | fo.close() 219 | 220 | self.inputs = [] 221 | self.output = None 222 | 223 | def addMetadata(self, infos): 224 | """See addMetadata method in PdfFileWriter class""" 225 | self.output.addMetadata(infos) 226 | 227 | def setPageLayout(self, layout): 228 | """See setPageLayout() methods in pdf.py""" 229 | self.output.setPageLayout(layout) 230 | 231 | def setPageMode(self, mode): 232 | """See setPageMode() methods in pdf.py""" 233 | self.output.setPageMode(mode) 234 | 235 | def _trim_dests(self, pdf, dests, pages): 236 | """ 237 | Removes any named destinations that are not a part of the specified page set 238 | """ 239 | new_dests = [] 240 | prev_header_added = True 241 | for k, o in list(dests.items()): 242 | for j in range(*pages): 243 | if pdf.getPage(j).getObject() == o['/Page'].getObject(): 244 | o[NameObject('/Page')] = o['/Page'].getObject() 245 | assert str(k) == str(o['/Title']) 246 | new_dests.append(o) 247 | break 248 | return new_dests 249 | 250 | def _trim_outline(self, pdf, outline, pages): 251 | """ 252 | Removes any outline/bookmark entries that are not a part of the specified page set 253 | """ 254 | new_outline = [] 255 | prev_header_added = True 256 | for i, o in enumerate(outline): 257 | if isinstance(o, list): 258 | sub = self._trim_outline(pdf, o, pages) 259 | if sub: 260 | if not prev_header_added: 261 | new_outline.append(outline[i-1]) 262 | new_outline.append(sub) 263 | else: 264 | prev_header_added = False 265 | for j in range(*pages): 266 | if pdf.getPage(j).getObject() == o['/Page'].getObject(): 267 | o[NameObject('/Page')] = o['/Page'].getObject() 268 | new_outline.append(o) 269 | prev_header_added = True 270 | break 271 | return new_outline 272 | 273 | def _write_dests(self): 274 | dests = self.named_dests 275 | 276 | for v in dests: 277 | pageno = None 278 | pdf = None 279 | if '/Page' in v: 280 | for i, p in enumerate(self.pages): 281 | if p.id == v['/Page']: 282 | v[NameObject('/Page')] = p.out_pagedata 283 | pageno = i 284 | pdf = p.src 285 | break 286 | if pageno != None: 287 | self.output.addNamedDestinationObject(v) 288 | 289 | def 
_write_bookmarks(self, bookmarks=None, parent=None): 290 | 291 | if bookmarks == None: 292 | bookmarks = self.bookmarks 293 | 294 | 295 | last_added = None 296 | for b in bookmarks: 297 | if isinstance(b, list): 298 | self._write_bookmarks(b, last_added) 299 | continue 300 | 301 | pageno = None 302 | pdf = None 303 | if '/Page' in b: 304 | for i, p in enumerate(self.pages): 305 | if p.id == b['/Page']: 306 | #b[NameObject('/Page')] = p.out_pagedata 307 | args = [NumberObject(p.id), NameObject(b['/Type'])] 308 | #nothing more to add 309 | #if b['/Type'] == '/Fit' or b['/Type'] == '/FitB' 310 | if b['/Type'] == '/FitH' or b['/Type'] == '/FitBH': 311 | if '/Top' in b and not isinstance(b['/Top'], NullObject): 312 | args.append(FloatObject(b['/Top'])) 313 | else: 314 | args.append(FloatObject(0)) 315 | del b['/Top'] 316 | elif b['/Type'] == '/FitV' or b['/Type'] == '/FitBV': 317 | if '/Left' in b and not isinstance(b['/Left'], NullObject): 318 | args.append(FloatObject(b['/Left'])) 319 | else: 320 | args.append(FloatObject(0)) 321 | del b['/Left'] 322 | elif b['/Type'] == '/XYZ': 323 | if '/Left' in b and not isinstance(b['/Left'], NullObject): 324 | args.append(FloatObject(b['/Left'])) 325 | else: 326 | args.append(FloatObject(0)) 327 | if '/Top' in b and not isinstance(b['/Top'], NullObject): 328 | args.append(FloatObject(b['/Top'])) 329 | else: 330 | args.append(FloatObject(0)) 331 | if '/Zoom' in b and not isinstance(b['/Zoom'], NullObject): 332 | args.append(FloatObject(b['/Zoom'])) 333 | else: 334 | args.append(FloatObject(0)) 335 | del b['/Top'], b['/Zoom'], b['/Left'] 336 | elif b['/Type'] == '/FitR': 337 | if '/Left' in b and not isinstance(b['/Left'], NullObject): 338 | args.append(FloatObject(b['/Left'])) 339 | else: 340 | args.append(FloatObject(0)) 341 | if '/Bottom' in b and not isinstance(b['/Bottom'], NullObject): 342 | args.append(FloatObject(b['/Bottom'])) 343 | else: 344 | args.append(FloatObject(0)) 345 | if '/Right' in b and not isinstance(b['/Right'], NullObject): 346 | args.append(FloatObject(b['/Right'])) 347 | else: 348 | args.append(FloatObject(0)) 349 | if '/Top' in b and not isinstance(b['/Top'], NullObject): 350 | args.append(FloatObject(b['/Top'])) 351 | else: 352 | args.append(FloatObject(0)) 353 | del b['/Left'], b['/Right'], b['/Bottom'], b['/Top'] 354 | 355 | b[NameObject('/A')] = DictionaryObject({NameObject('/S'): NameObject('/GoTo'), NameObject('/D'): ArrayObject(args)}) 356 | 357 | pageno = i 358 | pdf = p.src 359 | break 360 | if pageno != None: 361 | del b['/Page'], b['/Type'] 362 | last_added = self.output.addBookmarkDict(b, parent) 363 | 364 | def _associate_dests_to_pages(self, pages): 365 | for nd in self.named_dests: 366 | pageno = None 367 | np = nd['/Page'] 368 | 369 | if isinstance(np, NumberObject): 370 | continue 371 | 372 | for p in pages: 373 | if np.getObject() == p.pagedata.getObject(): 374 | pageno = p.id 375 | 376 | if pageno != None: 377 | nd[NameObject('/Page')] = NumberObject(pageno) 378 | else: 379 | raise ValueError("Unresolved named destination '%s'" % (nd['/Title'],)) 380 | 381 | def _associate_bookmarks_to_pages(self, pages, bookmarks=None): 382 | if bookmarks == None: 383 | bookmarks = self.bookmarks 384 | 385 | for b in bookmarks: 386 | if isinstance(b, list): 387 | self._associate_bookmarks_to_pages(pages, b) 388 | continue 389 | 390 | pageno = None 391 | bp = b['/Page'] 392 | 393 | if isinstance(bp, NumberObject): 394 | continue 395 | 396 | for p in pages: 397 | if bp.getObject() == p.pagedata.getObject(): 398 | pageno = p.id 
399 | 400 | if pageno != None: 401 | b[NameObject('/Page')] = NumberObject(pageno) 402 | else: 403 | raise ValueError("Unresolved bookmark '%s'" % (b['/Title'],)) 404 | 405 | def findBookmark(self, bookmark, root=None): 406 | if root == None: 407 | root = self.bookmarks 408 | 409 | for i, b in enumerate(root): 410 | if isinstance(b, list): 411 | res = self.findBookmark(bookmark, b) 412 | if res: 413 | return [i] + res 414 | if b == bookmark or b['/Title'] == bookmark: 415 | return [i] 416 | 417 | return None 418 | 419 | def addBookmark(self, title, pagenum, parent=None): 420 | """ 421 | Add a bookmark to the pdf, using the specified title and pointing at 422 | the specified page number. A parent can be specified to make this a 423 | nested bookmark below the parent. 424 | """ 425 | 426 | if parent == None: 427 | iloc = [len(self.bookmarks)-1] 428 | elif isinstance(parent, list): 429 | iloc = parent 430 | else: 431 | iloc = self.findBookmark(parent) 432 | 433 | dest = Bookmark(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826)) 434 | 435 | if parent == None: 436 | self.bookmarks.append(dest) 437 | else: 438 | bmparent = self.bookmarks 439 | for i in iloc[:-1]: 440 | bmparent = bmparent[i] 441 | npos = iloc[-1]+1 442 | if npos < len(bmparent) and isinstance(bmparent[npos], list): 443 | bmparent[npos].append(dest) 444 | else: 445 | bmparent.insert(npos, [dest]) 446 | return dest 447 | 448 | 449 | def addNamedDestination(self, title, pagenum): 450 | """ 451 | Add a destination to the pdf, using the specified title and pointing 452 | at the specified page number. 453 | """ 454 | 455 | dest = Destination(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826)) 456 | self.named_dests.append(dest) 457 | 458 | 459 | class OutlinesObject(list): 460 | def __init__(self, pdf, tree, parent=None): 461 | list.__init__(self) 462 | self.tree = tree 463 | self.pdf = pdf 464 | self.parent = parent 465 | 466 | def remove(self, index): 467 | obj = self[index] 468 | del self[index] 469 | self.tree.removeChild(obj) 470 | 471 | def add(self, title, page): 472 | pageRef = self.pdf.getObject(self.pdf._pages)['/Kids'][pagenum] 473 | action = DictionaryObject() 474 | action.update({ 475 | NameObject('/D') : ArrayObject([pageRef, NameObject('/FitH'), NumberObject(826)]), 476 | NameObject('/S') : NameObject('/GoTo') 477 | }) 478 | actionRef = self.pdf._addObject(action) 479 | bookmark = TreeObject() 480 | 481 | bookmark.update({ 482 | NameObject('/A'): actionRef, 483 | NameObject('/Title'): createStringObject(title), 484 | }) 485 | 486 | pdf._addObject(bookmark) 487 | 488 | self.tree.addChild(bookmark) 489 | 490 | def removeAll(self): 491 | for child in [x for x in self.tree.children()]: 492 | self.tree.removeChild(child) 493 | self.pop() 494 | -------------------------------------------------------------------------------- /PyPDF2/utils.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 
12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | """ 31 | Utility functions for PDF library. 32 | """ 33 | __author__ = "Mathieu Fenniak" 34 | __author_email__ = "biziqe@mathieu.fenniak.net" 35 | 36 | #custom implementation of warnings.formatwarning 37 | def _formatwarning(message, category, filename, lineno, line=None): 38 | file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name 39 | return "%s: %s [%s:%s]\n" % (category.__name__, message, file, lineno) 40 | 41 | def readUntilWhitespace(stream, maxchars=None): 42 | """ 43 | Reads non-whitespace characters and returns them. 44 | Stops upon encountering whitespace or when maxchars is reached. 45 | """ 46 | txt = b_("") 47 | while True: 48 | tok = stream.read(1) 49 | if tok.isspace() or not tok: 50 | break 51 | txt += tok 52 | if len(txt) == maxchars: 53 | break 54 | return txt 55 | 56 | def readNonWhitespace(stream): 57 | """ 58 | Finds and reads the next non-whitespace character (ignores whitespace). 59 | """ 60 | tok = b_(' ') 61 | while tok == b_('\n') or tok == b_('\r') or tok == b_(' ') or tok == b_('\t'): 62 | tok = stream.read(1) 63 | return tok 64 | 65 | def skipOverWhitespace(stream): 66 | """ 67 | Similar to readNonWhitespace, but returns a Boolean if more than 68 | one whitespace character was read. 
69 | """ 70 | tok = b_(' ') 71 | cnt = 0; 72 | while tok == b_('\n') or tok == b_('\r') or tok == b_(' ') or tok == b_('\t'): 73 | tok = stream.read(1) 74 | cnt+=1 75 | return (cnt > 1) 76 | 77 | def skipOverComment(stream): 78 | tok = stream.read(1) 79 | stream.seek(-1, 1) 80 | if tok == b_('%'): 81 | while tok not in (b_('\n'), b_('\r')): 82 | tok = stream.read(1) 83 | 84 | class ConvertFunctionsToVirtualList(object): 85 | def __init__(self, lengthFunction, getFunction): 86 | self.lengthFunction = lengthFunction 87 | self.getFunction = getFunction 88 | 89 | def __len__(self): 90 | return self.lengthFunction() 91 | 92 | def __getitem__(self, index): 93 | if not isinstance(index, int): 94 | raise TypeError("sequence indices must be integers") 95 | len_self = len(self) 96 | if index < 0: 97 | # support negative indexes 98 | index = len_self + index 99 | if index < 0 or index >= len_self: 100 | raise IndexError("sequence index out of range") 101 | return self.getFunction(index) 102 | 103 | def RC4_encrypt(key, plaintext): 104 | S = [i for i in range(256)] 105 | j = 0 106 | for i in range(256): 107 | j = (j + S[i] + ord_(key[i % len(key)])) % 256 108 | S[i], S[j] = S[j], S[i] 109 | i, j = 0, 0 110 | retval = b_("") 111 | for x in range(len(plaintext)): 112 | i = (i + 1) % 256 113 | j = (j + S[i]) % 256 114 | S[i], S[j] = S[j], S[i] 115 | t = S[(S[i] + S[j]) % 256] 116 | retval += b_(chr(ord_(plaintext[x]) ^ t)) 117 | return retval 118 | 119 | def matrixMultiply(a, b): 120 | return [[sum([float(i)*float(j) 121 | for i, j in zip(row, col)] 122 | ) for col in zip(*b)] 123 | for row in a] 124 | 125 | def markLocation(stream): 126 | """Creates text file showing current location in context.""" 127 | # Mainly for debugging 128 | RADIUS = 5000 129 | stream.seek(-RADIUS, 1) 130 | outputDoc = open('PyPDF2_pdfLocation.txt', 'w') 131 | outputDoc.write(stream.read(RADIUS)) 132 | outputDoc.write('HERE') 133 | outputDoc.write(stream.read(RADIUS)) 134 | outputDoc.close() 135 | stream.seek(-RADIUS, 1) 136 | 137 | class PyPdfError(Exception): 138 | pass 139 | 140 | class PdfReadError(PyPdfError): 141 | pass 142 | 143 | class PageSizeNotDefinedError(PyPdfError): 144 | pass 145 | 146 | class PdfReadWarning(UserWarning): 147 | pass 148 | 149 | class PdfStreamError(PdfReadError): 150 | pass 151 | 152 | def hexStr(num): 153 | return hex(num).replace('L', '') 154 | 155 | import sys 156 | 157 | def b_(s): 158 | if sys.version_info[0] < 3: 159 | return s 160 | else: 161 | if type(s) == bytes: 162 | return s 163 | else: 164 | return s.encode('latin-1') 165 | 166 | def u_(s): 167 | if sys.version_info[0] < 3: 168 | return unicode(s, 'unicode_escape') 169 | else: 170 | return s 171 | 172 | 173 | def str_(b): 174 | if sys.version_info[0] < 3: 175 | return b 176 | else: 177 | if type(b) == bytes: 178 | return b.decode('latin-1') 179 | else: 180 | return b 181 | 182 | def ord_(b): 183 | if sys.version_info[0] < 3: 184 | return ord(b) 185 | else: 186 | return b 187 | 188 | def chr_(c): 189 | if sys.version_info[0] < 3: 190 | return c 191 | else: 192 | return chr(c) 193 | 194 | def barray(b): 195 | if sys.version_info[0] < 3: 196 | return b 197 | else: 198 | return bytearray(b) 199 | 200 | def hexencode(b): 201 | if sys.version_info[0] < 3: 202 | return b.encode('hex') 203 | else: 204 | import codecs 205 | coder = codecs.getencoder('hex_codec') 206 | return coder(b)[0] 207 | 208 | if sys.version_info[0] < 3: 209 | string_type = unicode 210 | bytes_type = str 211 | else: 212 | string_type = str 213 | bytes_type = bytes 214 | 
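# A minimal usage sketch of the helpers defined above; the sample values
# ("key", "hello", the chapters list) are hypothetical. Running this module
# directly executes the sketch; importing it does not.
if __name__ == "__main__":
    # ConvertFunctionsToVirtualList wraps a (length, getter) pair of
    # functions so they behave like a read-only sequence.
    chapters = ["intro", "body", "appendix"]
    virtual = ConvertFunctionsToVirtualList(lambda: len(chapters),
                                            lambda i: chapters[i])
    assert len(virtual) == 3
    assert virtual[-1] == "appendix"   # negative indexes are supported

    # b_ / str_ bridge bytes and text on both Python 2 and Python 3.
    assert str_(b_("PDF")) == "PDF"

    # RC4 is a symmetric stream cipher: applying it twice with the same
    # key restores the original plaintext.
    ciphertext = RC4_encrypt(b_("key"), b_("hello"))
    assert str_(RC4_encrypt(b_("key"), ciphertext)) == "hello"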
-------------------------------------------------------------------------------- /PyPDF2/xmp.py: -------------------------------------------------------------------------------- 1 | import re 2 | import datetime 3 | import decimal 4 | from .generic import PdfObject 5 | from xml.dom import getDOMImplementation 6 | from xml.dom.minidom import parseString 7 | from .utils import u_ 8 | 9 | RDF_NAMESPACE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" 10 | DC_NAMESPACE = "http://purl.org/dc/elements/1.1/" 11 | XMP_NAMESPACE = "http://ns.adobe.com/xap/1.0/" 12 | PDF_NAMESPACE = "http://ns.adobe.com/pdf/1.3/" 13 | XMPMM_NAMESPACE = "http://ns.adobe.com/xap/1.0/mm/" 14 | 15 | # What is the PDFX namespace, you might ask? I might ask that too. It's 16 | # a completely undocumented namespace used to place "custom metadata" 17 | # properties, which are arbitrary metadata properties with no semantic or 18 | # documented meaning. Elements in the namespace are key/value-style storage, 19 | # where the element name is the key and the content is the value. The keys 20 | # are transformed into valid XML identifiers by substituting an invalid 21 | # identifier character with \u2182 followed by the unicode hex ID of the 22 | # original character. A key like "my car" is therefore "my\u21820020car". 23 | # 24 | # \u2182, in case you're wondering, is the unicode character 25 | # \u{ROMAN NUMERAL TEN THOUSAND}, a straightforward and obvious choice for 26 | # escaping characters. 27 | # 28 | # Intentional users of the pdfx namespace should be shot on sight. A 29 | # custom data schema and sensical XML elements could be used instead, as is 30 | # suggested by Adobe's own documentation on XMP (under "Extensibility of 31 | # Schemas"). 32 | # 33 | # Information presented here on the /pdfx/ schema is a result of limited 34 | # reverse engineering, and does not constitute a full specification. 35 | PDFX_NAMESPACE = "http://ns.adobe.com/pdfx/1.3/" 36 | 37 | iso8601 = re.compile(""" 38 | (?P[0-9]{4}) 39 | (- 40 | (?P[0-9]{2}) 41 | (- 42 | (?P[0-9]+) 43 | (T 44 | (?P[0-9]{2}): 45 | (?P[0-9]{2}) 46 | (:(?P[0-9]{2}(.[0-9]+)?))? 47 | (?PZ|[-+][0-9]{2}:[0-9]{2}) 48 | )? 49 | )? 50 | )? 51 | """, re.VERBOSE) 52 | 53 | ## 54 | # An object that represents Adobe XMP metadata. 
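# Instances are normally obtained from a PdfFileReader via its
# getXmpMetadata() method, which typically returns None when the document
# has no XMP stream. A minimal sketch, assuming an "example.pdf" exists:
#
#   reader = PdfFileReader(open("example.pdf", "rb"))
#   xmp = reader.getXmpMetadata()
#   if xmp is not None:
#       print(xmp.dc_title)         # language-keyed dict of titles
#       print(xmp.xmp_createDate)   # creation date as a UTC datetime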
55 | class XmpInformation(PdfObject): 56 | 57 | def __init__(self, stream): 58 | self.stream = stream 59 | docRoot = parseString(self.stream.getData()) 60 | self.rdfRoot = docRoot.getElementsByTagNameNS(RDF_NAMESPACE, "RDF")[0] 61 | self.cache = {} 62 | 63 | def writeToStream(self, stream, encryption_key): 64 | self.stream.writeToStream(stream, encryption_key) 65 | 66 | def getElement(self, aboutUri, namespace, name): 67 | for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"): 68 | if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri: 69 | attr = desc.getAttributeNodeNS(namespace, name) 70 | if attr != None: 71 | yield attr 72 | for element in desc.getElementsByTagNameNS(namespace, name): 73 | yield element 74 | 75 | def getNodesInNamespace(self, aboutUri, namespace): 76 | for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"): 77 | if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri: 78 | for i in range(desc.attributes.length): 79 | attr = desc.attributes.item(i) 80 | if attr.namespaceURI == namespace: 81 | yield attr 82 | for child in desc.childNodes: 83 | if child.namespaceURI == namespace: 84 | yield child 85 | 86 | def _getText(self, element): 87 | text = "" 88 | for child in element.childNodes: 89 | if child.nodeType == child.TEXT_NODE: 90 | text += child.data 91 | return text 92 | 93 | def _converter_string(value): 94 | return value 95 | 96 | def _converter_date(value): 97 | m = iso8601.match(value) 98 | year = int(m.group("year")) 99 | month = int(m.group("month") or "1") 100 | day = int(m.group("day") or "1") 101 | hour = int(m.group("hour") or "0") 102 | minute = int(m.group("minute") or "0") 103 | second = decimal.Decimal(m.group("second") or "0") 104 | seconds = second.to_integral(decimal.ROUND_FLOOR) 105 | milliseconds = (second - seconds) * 1000000 106 | tzd = m.group("tzd") or "Z" 107 | dt = datetime.datetime(year, month, day, hour, minute, seconds, milliseconds) 108 | if tzd != "Z": 109 | tzd_hours, tzd_minutes = [int(x) for x in tzd.split(":")] 110 | tzd_hours *= -1 111 | if tzd_hours < 0: 112 | tzd_minutes *= -1 113 | dt = dt + datetime.timedelta(hours=tzd_hours, minutes=tzd_minutes) 114 | return dt 115 | _test_converter_date = staticmethod(_converter_date) 116 | 117 | def _getter_bag(namespace, name, converter): 118 | def get(self): 119 | cached = self.cache.get(namespace, {}).get(name) 120 | if cached: 121 | return cached 122 | retval = [] 123 | for element in self.getElement("", namespace, name): 124 | bags = element.getElementsByTagNameNS(RDF_NAMESPACE, "Bag") 125 | if len(bags): 126 | for bag in bags: 127 | for item in bag.getElementsByTagNameNS(RDF_NAMESPACE, "li"): 128 | value = self._getText(item) 129 | value = converter(value) 130 | retval.append(value) 131 | ns_cache = self.cache.setdefault(namespace, {}) 132 | ns_cache[name] = retval 133 | return retval 134 | return get 135 | 136 | def _getter_seq(namespace, name, converter): 137 | def get(self): 138 | cached = self.cache.get(namespace, {}).get(name) 139 | if cached: 140 | return cached 141 | retval = [] 142 | for element in self.getElement("", namespace, name): 143 | seqs = element.getElementsByTagNameNS(RDF_NAMESPACE, "Seq") 144 | if len(seqs): 145 | for seq in seqs: 146 | for item in seq.getElementsByTagNameNS(RDF_NAMESPACE, "li"): 147 | value = self._getText(item) 148 | value = converter(value) 149 | retval.append(value) 150 | else: 151 | value = converter(self._getText(element)) 152 | retval.append(value) 153 | ns_cache = 
self.cache.setdefault(namespace, {}) 154 | ns_cache[name] = retval 155 | return retval 156 | return get 157 | 158 | def _getter_langalt(namespace, name, converter): 159 | def get(self): 160 | cached = self.cache.get(namespace, {}).get(name) 161 | if cached: 162 | return cached 163 | retval = {} 164 | for element in self.getElement("", namespace, name): 165 | alts = element.getElementsByTagNameNS(RDF_NAMESPACE, "Alt") 166 | if len(alts): 167 | for alt in alts: 168 | for item in alt.getElementsByTagNameNS(RDF_NAMESPACE, "li"): 169 | value = self._getText(item) 170 | value = converter(value) 171 | retval[item.getAttribute("xml:lang")] = value 172 | else: 173 | retval["x-default"] = converter(self._getText(element)) 174 | ns_cache = self.cache.setdefault(namespace, {}) 175 | ns_cache[name] = retval 176 | return retval 177 | return get 178 | 179 | def _getter_single(namespace, name, converter): 180 | def get(self): 181 | cached = self.cache.get(namespace, {}).get(name) 182 | if cached: 183 | return cached 184 | value = None 185 | for element in self.getElement("", namespace, name): 186 | if element.nodeType == element.ATTRIBUTE_NODE: 187 | value = element.nodeValue 188 | else: 189 | value = self._getText(element) 190 | break 191 | if value != None: 192 | value = converter(value) 193 | ns_cache = self.cache.setdefault(namespace, {}) 194 | ns_cache[name] = value 195 | return value 196 | return get 197 | 198 | ## 199 | # Contributors to the resource (other than the authors). An unsorted 200 | # array of names. 201 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 202 | dc_contributor = property(_getter_bag(DC_NAMESPACE, "contributor", _converter_string)) 203 | 204 | ## 205 | # Text describing the extent or scope of the resource. 206 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 207 | dc_coverage = property(_getter_single(DC_NAMESPACE, "coverage", _converter_string)) 208 | 209 | ## 210 | # A sorted array of names of the authors of the resource, listed in order 211 | # of precedence. 212 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 213 | dc_creator = property(_getter_seq(DC_NAMESPACE, "creator", _converter_string)) 214 | 215 | ## 216 | # A sorted array of dates (datetime.datetime instances) of significance to 217 | # the resource. The dates and times are in UTC. 218 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 219 | dc_date = property(_getter_seq(DC_NAMESPACE, "date", _converter_date)) 220 | 221 | ## 222 | # A language-keyed dictionary of textual descriptions of the content of the 223 | # resource. 224 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 225 | dc_description = property(_getter_langalt(DC_NAMESPACE, "description", _converter_string)) 226 | 227 | ## 228 | # The mime-type of the resource. 229 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 230 | dc_format = property(_getter_single(DC_NAMESPACE, "format", _converter_string)) 231 | 232 | ## 233 | # Unique identifier of the resource. 234 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 235 | dc_identifier = property(_getter_single(DC_NAMESPACE, "identifier", _converter_string)) 236 | 237 | ## 238 | # An unordered array specifying the languages used in the resource. 239 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 240 | dc_language = property(_getter_bag(DC_NAMESPACE, "language", _converter_string)) 241 | 242 | ## 243 | # An unordered array of publisher names. 244 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 245 | dc_publisher = property(_getter_bag(DC_NAMESPACE, "publisher", _converter_string)) 246 | 247 | ## 248 | # An unordered array of text descriptions of relationships to other 249 | # documents. 250 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 251 | dc_relation = property(_getter_bag(DC_NAMESPACE, "relation", _converter_string)) 252 | 253 | ## 254 | # A language-keyed dictionary of textual descriptions of the rights the 255 | # user has to this resource. 256 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 257 | dc_rights = property(_getter_langalt(DC_NAMESPACE, "rights", _converter_string)) 258 | 259 | ## 260 | # Unique identifier of the work from which this resource was derived. 261 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 262 | dc_source = property(_getter_single(DC_NAMESPACE, "source", _converter_string)) 263 | 264 | ## 265 | # An unordered array of descriptive phrases or keywords that specify the 266 | # topic of the content of the resource. 267 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 268 | dc_subject = property(_getter_bag(DC_NAMESPACE, "subject", _converter_string)) 269 | 270 | ## 271 | # A language-keyed dictionary of the title of the resource. 272 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 273 | dc_title = property(_getter_langalt(DC_NAMESPACE, "title", _converter_string)) 274 | 275 | ## 276 | # An unordered array of textual descriptions of the document type. 277 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 278 | dc_type = property(_getter_bag(DC_NAMESPACE, "type", _converter_string)) 279 | 280 | ## 281 | # An unformatted text string representing document keywords. 282 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 283 | pdf_keywords = property(_getter_single(PDF_NAMESPACE, "Keywords", _converter_string)) 284 | 285 | ## 286 | # The PDF file version, for example 1.0, 1.3. 287 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 288 | pdf_pdfversion = property(_getter_single(PDF_NAMESPACE, "PDFVersion", _converter_string)) 289 | 290 | ## 291 | # The name of the tool that created the PDF document. 292 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 293 | pdf_producer = property(_getter_single(PDF_NAMESPACE, "Producer", _converter_string)) 294 | 295 | ## 296 | # The date and time the resource was originally created. The date and 297 | # time are returned as a UTC datetime.datetime object. 298 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 299 | xmp_createDate = property(_getter_single(XMP_NAMESPACE, "CreateDate", _converter_date)) 300 | 301 | ## 302 | # The date and time the resource was last modified. The date and time 303 | # are returned as a UTC datetime.datetime object. 304 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 305 | xmp_modifyDate = property(_getter_single(XMP_NAMESPACE, "ModifyDate", _converter_date)) 306 | 307 | ## 308 | # The date and time that any metadata for this resource was last 309 | # changed. The date and time are returned as a UTC datetime.datetime 310 | # object. 311 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 312 | xmp_metadataDate = property(_getter_single(XMP_NAMESPACE, "MetadataDate", _converter_date)) 313 | 314 | ## 315 | # The name of the first known tool used to create the resource. 316 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 317 | xmp_creatorTool = property(_getter_single(XMP_NAMESPACE, "CreatorTool", _converter_string)) 318 | 319 | ## 320 | # The common identifier for all versions and renditions of this resource. 321 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 322 | xmpmm_documentId = property(_getter_single(XMPMM_NAMESPACE, "DocumentID", _converter_string)) 323 | 324 | ## 325 | # An identifier for a specific incarnation of a document, updated each 326 | # time a file is saved. 327 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 328 | xmpmm_instanceId = property(_getter_single(XMPMM_NAMESPACE, "InstanceID", _converter_string)) 329 | 330 | def custom_properties(self): 331 | if not hasattr(self, "_custom_properties"): 332 | self._custom_properties = {} 333 | for node in self.getNodesInNamespace("", PDFX_NAMESPACE): 334 | key = node.localName 335 | while True: 336 | # see documentation about PDFX_NAMESPACE earlier in file 337 | idx = key.find(u_("\u2182")) 338 | if idx == -1: 339 | break 340 | key = key[:idx] + chr(int(key[idx+1:idx+5], base=16)) + key[idx+5:] 341 | if node.nodeType == node.ATTRIBUTE_NODE: 342 | value = node.nodeValue 343 | else: 344 | value = self._getText(node) 345 | self._custom_properties[key] = value 346 | return self._custom_properties 347 | 348 | ## 349 | # Retrieves custom metadata properties defined in the undocumented pdfx 350 | # metadata schema. 351 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 352 | # @return Returns a dictionary of key/value items for custom metadata 353 | # properties. 354 | custom_properties = property(custom_properties) 355 | 356 | 357 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | PyPDF2 2 | ------------------------------------------------- 3 | 4 | PyPDF2 is a pure-python PDF library capable of 5 | splitting, merging together, cropping, and transforming 6 | the pages of PDF files. It can also add custom 7 | data, viewing options, and passwords to PDF files. 8 | It can retrieve text and metadata from PDFs as well 9 | as merge entire files together. 10 | 11 | See sample code folder for helpful examples. 12 | 13 | Documentation: 14 | FAQ: 15 | PyPI: 16 | GitHub: 17 | Homepage: 18 | -------------------------------------------------------------------------------- /Sample_Code/2-up.py: -------------------------------------------------------------------------------- 1 | from PyPDF2 import PdfFileWriter, PdfFileReader 2 | import sys 3 | import math 4 | 5 | def main(): 6 | if (len(sys.argv) != 3): 7 | print("usage: python 2-up.py input_file output_file") 8 | sys.exit(1) 9 | print ("2-up input " + sys.argv[1]) 10 | input1 = PdfFileReader(open(sys.argv[1], "rb")) 11 | output = PdfFileWriter() 12 | for iter in range (0, input1.getNumPages()-1, 2): 13 | lhs = input1.getPage(iter) 14 | rhs = input1.getPage(iter+1) 15 | lhs.mergeTranslatedPage(rhs, lhs.mediaBox.getUpperRight_x(),0, True) 16 | output.addPage(lhs) 17 | print (str(iter) + " "), 18 | sys.stdout.flush() 19 | 20 | print("writing " + sys.argv[2]) 21 | outputStream = file(sys.argv[2], "wb") 22 | output.write(outputStream) 23 | print("done.") 24 | 25 | if __name__ == "__main__": 26 | main() 27 | -------------------------------------------------------------------------------- /Sample_Code/README.txt: -------------------------------------------------------------------------------- 1 | PyPDF2 Sample Code Folder 2 | ------------------------- 3 | 4 | This will contain demonstrations of the many features 5 | PyPDF2 is capable of. Example code should make it easy 6 | for users to know how to use all aspects of PyPDF2. 7 | 8 | 9 | 10 | Feel free to add any type of PDF file or sample code, 11 | either by 12 | 13 | 1) sending it via email to PyPDF2@phaseit.net 14 | 2) including it in a pull request on GitHub -------------------------------------------------------------------------------- /Sample_Code/basic_features.py: -------------------------------------------------------------------------------- 1 | from PyPDF2 import PdfFileWriter, PdfFileReader 2 | 3 | output = PdfFileWriter() 4 | input1 = PdfFileReader(open("document1.pdf", "rb")) 5 | 6 | # print how many pages input1 has: 7 | print "document1.pdf has %d pages." 
% input1.getNumPages() 8 | 9 | # add page 1 from input1 to output document, unchanged 10 | output.addPage(input1.getPage(0)) 11 | 12 | # add page 2 from input1, but rotated clockwise 90 degrees 13 | output.addPage(input1.getPage(1).rotateClockwise(90)) 14 | 15 | # add page 3 from input1, rotated the other way: 16 | output.addPage(input1.getPage(2).rotateCounterClockwise(90)) 17 | # alt: output.addPage(input1.getPage(2).rotateClockwise(270)) 18 | 19 | # add page 4 from input1, but first add a watermark from another PDF: 20 | page4 = input1.getPage(3) 21 | watermark = PdfFileReader(open("watermark.pdf", "rb")) 22 | page4.mergePage(watermark.getPage(0)) 23 | output.addPage(page4) 24 | 25 | 26 | # add page 5 from input1, but crop it to half size: 27 | page5 = input1.getPage(4) 28 | page5.mediaBox.upperRight = ( 29 | page5.mediaBox.getUpperRight_x() / 2, 30 | page5.mediaBox.getUpperRight_y() / 2 31 | ) 32 | output.addPage(page5) 33 | 34 | # encrypt your new PDF and add a password 35 | password = "secret" 36 | output.encrypt(password) 37 | 38 | # finally, write "output" to document-output.pdf 39 | outputStream = file("PyPDF2-output.pdf", "wb") 40 | output.write(outputStream) 41 | -------------------------------------------------------------------------------- /Sample_Code/basic_merging.py: -------------------------------------------------------------------------------- 1 | from PyPDF2 import PdfFileMerger 2 | 3 | merger = PdfFileMerger() 4 | 5 | input1 = open("document1.pdf", "rb") 6 | input2 = open("document2.pdf", "rb") 7 | input3 = open("document3.pdf", "rb") 8 | 9 | # add the first 3 pages of input1 document to output 10 | merger.append(fileobj = input1, pages = (0,3)) 11 | 12 | # insert the first page of input2 into the output beginning after the second page 13 | merger.merge(position = 2, fileobj = input2, pages = (0,1)) 14 | 15 | # append entire input3 document to the end of the output document 16 | merger.append(input3) 17 | 18 | # Write to an output PDF document 19 | output = open("document-output.pdf", "wb") 20 | merger.write(output) 21 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from distutils.core import setup 4 | import re 5 | 6 | long_description = """ 7 | A Pure-Python library built as a PDF toolkit. It is capable of: 8 | 9 | - extracting document information (title, author, ...), 10 | - splitting documents page by page, 11 | - merging documents page by page, 12 | - cropping pages, 13 | - merging multiple pages into a single page, 14 | - encrypting and decrypting PDF files. 15 | 16 | By being Pure-Python, it should run on any Python platform without any 17 | dependencies on external libraries. It can also work entirely on StringIO 18 | objects rather than file streams, allowing for PDF manipulation in memory. 19 | It is therefore a useful tool for websites that manage or manipulate PDFs. 20 | """ 21 | 22 | VERSIONFILE="PyPDF2/_version.py" 23 | verstrline = open(VERSIONFILE, "rt").read() 24 | VSRE = r"^__version__ = ['\"]([^'\"]*)['\"]" 25 | mo = re.search(VSRE, verstrline, re.M) 26 | if mo: 27 | verstr = mo.group(1) 28 | else: 29 | raise RuntimeError("Unable to find version string in %s." 
% (VERSIONFILE)) 30 | 31 | setup( 32 | name="PyPDF2", 33 | version=verstr, 34 | description="PDF toolkit", 35 | long_description=long_description, 36 | author="Mathieu Fenniak", 37 | author_email="biziqe@mathieu.fenniak.net", 38 | maintainer="Phaseit, Inc.", 39 | maintainer_email="PyPDF2@phaseit.net", 40 | url="http://mstamy2.github.com/PyPDF2", 41 | download_url="http://github.com/mstamy2/PyPDF2/tarball/master", 42 | classifiers = [ 43 | "Development Status :: 5 - Production/Stable", 44 | "Intended Audience :: Developers", 45 | "License :: OSI Approved :: BSD License", 46 | "Programming Language :: Python :: 2", 47 | "Programming Language :: Python :: 3", 48 | "Operating System :: OS Independent", 49 | "Topic :: Software Development :: Libraries :: Python Modules", 50 | ], 51 | packages=["PyPDF2"], 52 | ) 53 | 54 | --------------------------------------------------------------------------------