├── .gitignore ├── CHANGELOG ├── LICENSE ├── MANIFEST.in ├── PDF_Samples ├── AutoCad_Diagram.pdf ├── AutoCad_Simple.pdf ├── README.txt └── SF424_page2.pdf ├── PyPDF2 ├── __init__.py ├── _version.py ├── filters.py ├── generic.py ├── merger.py ├── pdf.py ├── utils.py └── xmp.py ├── README ├── Sample_Code ├── 2-up.py ├── README.txt ├── basic_features.py └── basic_merging.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.swp 3 | -------------------------------------------------------------------------------- /CHANGELOG: -------------------------------------------------------------------------------- 1 | Version 1.20, 2014-01-?? 2 | ------------------------ 3 | 4 | - Many Python 3 support changes (with contributions from TWAC and cgammans) 5 | 6 | - Updated FAQ; link included in README 7 | 8 | - Allow more (unnecessary) escape sequences 9 | 10 | - Prevent exception when reading a null object in decoding parameters 11 | 12 | - Corrected error in reading destination types (added a slash since they 13 | are name objects) 14 | 15 | - Corrected TypeError in scaleTo() method 16 | 17 | - addBookmark() method in PdfFileMerger now returns bookmark (so nested 18 | bookmarks can be created) 19 | 20 | - Additions to Sample Code and Sample PDFs 21 | 22 | - changes to allow 2up script to work (by Dylan McNamee) 23 | 24 | - changes to metadata encoding (by Chris Hiestand) 25 | 26 | - New methods for links: addLink() (by Enrico Lambertini) and ignoreLinks() 27 | 28 | 29 | Version 1.19, 2013-10-08 30 | ------------------------ 31 | 32 | BUGFIXES: 33 | - Removed pop in sweepIndirectReferences to prevent infinite loop 34 | (provided by ian-su-sirca) 35 | 36 | - Fixed bug caused by whitespace when parsing PDFs generated by AutoCad 37 | 38 | - Fixed a bug caused by reading a 'null' ASCII value in a dictionary 39 | object (primarily in PDFs generated by AutoCad). 
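As a quick sketch of the 1.20 PdfFileMerger note above (addBookmark() returning the bookmark so that it can be passed back in as a parent); the file names here are placeholders:

    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    merger.append(open("input.pdf", "rb"))
    part = merger.addBookmark("Part I", 0)            # the returned bookmark...
    merger.addBookmark("Chapter 1", 1, parent=part)   # ...can now be used to nest a child
    merger.write(open("output.pdf", "wb"))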
40 | 41 | FEATURES: 42 | - Added new folders for PyPDF2 sample code and example PDFs; see README 43 | for each folder 44 | 45 | - Added a method for debugging purposes to show current location while 46 | parsing 47 | 48 | - Ability to create custom metadata (by jamma313) 49 | 50 | - Ability to access and customize document layout and view mode 51 | (by Joshua Arnott) 52 | 53 | OTHER: 54 | - Added and corrected some documentation 55 | 56 | - Added some more warnings and exception messages 57 | 58 | - Removed old test/debugging code 59 | 60 | UPCOMING: 61 | - More bugfixes (We have received many problematic PDFs via email, we 62 | will work with them) 63 | 64 | - Documentation - It's time for PyPDF2 to get its own documentation 65 | now that it has grown well beyond the original pyPdf 66 | 67 | - A FAQ to answer common questions 68 | 69 | 70 | Version 1.18, 2013-08-19 71 | ------------------------ 72 | 73 | - Fixed a bug where older versions of objects were incorrectly added to the 74 | cache, resulting in outdated or missing pages, images, and other objects 75 | (from speedplane) 76 | 77 | - Fixed a bug in parsing the xref table where new xref values were 78 | overwritten; also cleaned up code (from speedplane) 79 | 80 | - New method mergeRotatedAroundPointPage which merges a page while rotating 81 | it around a point (from speedplane) 82 | 83 | - Updated Destination syntax to respect PDF 1.6 specifications (from 84 | jamma313) 85 | 86 | - Prevented infinite loop when a PdfFileReader object was instantiated 87 | with an empty file (from Jerome Nexedi) 88 | 89 | Other Changes: 90 | 91 | - Downloads now available via PyPI 92 | https://pypi.python.org/pypi?:action=display&name=PyPDF2 93 | 94 | - Installation through the pip library is fixed 95 | 96 | 97 | Version 1.17, 2013-07-25 98 | ------------------------ 99 | 100 | - Removed one (from pdf.py) of the two Destination classes. Both 101 | classes had the same name, but were slightly different in content, 102 | causing some errors. (from Janne Vanhala) 103 | 104 | - Corrected and Expanded README file to demonstrate PdfFileMerger 105 | 106 | - Added filter for LZW encoded streams (from Michal Horejsek) 107 | 108 | - PyPDF2 issue tracker enabled on Github to allow community 109 | discussion and collaboration 110 | 111 | 112 | Versions -1.16, -2013-06-30 113 | --------------------------- 114 | 115 | - Note: This ChangeLog has not been kept up-to-date for a while. 116 | Hopefully we can keep better track of it from now on. Some of the 117 | changes listed here come from previous versions 1.14 and 1.15; they 118 | were only vaguely defined. With the new _version.py file we should 119 | have more structured and better documented versioning from now on. 120 | 121 | - Defined PyPDF2.__version__ 122 | 123 | - Fixed encrypt() method (from Martijn The) 124 | 125 | - Improved error handling on PDFs with truncated streams (from cecilkorik) 126 | 127 | - Python 3 support (from kushal-kumaran) 128 | 129 | - Fixed example code in README (from Jeremy Bethmont) 130 | 131 | - Fixed a bug caused by a DecimalError exception (from Adam Morris) 132 | 133 | - Many other bug fixes and features by: 134 | 135 | jeansch 136 | Anton Vlasenko 137 | Joseph Walton 138 | Jan Oliver Oelerich 139 | Fabian Henze 140 | And any others I missed. 141 | Thanks for contributing! 142 | 143 | 144 | Version 1.13, 2010-12-04 145 | ------------------------ 146 | 147 | - Fixed a typo in code for reading a "\b" escape character in strings. 148 | 149 | - Improved __repr__ in FloatObject.
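For reference, the mergeRotatedAroundPointPage() method noted under 1.18 above can be used roughly as follows (a sketch only; file names and coordinates are made up):

    from PyPDF2 import PdfFileReader, PdfFileWriter

    base = PdfFileReader(open("base.pdf", "rb")).getPage(0)
    stamp = PdfFileReader(open("stamp.pdf", "rb")).getPage(0)
    base.mergeRotatedAroundPointPage(stamp, 45, 300, 400)  # rotate 45 degrees about (300, 400)
    writer = PdfFileWriter()
    writer.addPage(base)
    writer.write(open("merged.pdf", "wb"))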
150 | 151 | - Fixed a bug in reading octal escape sequences in strings. 152 | 153 | - Added getWidth and getHeight methods to the RectangleObject class. 154 | 155 | - Fixed compatibility warnings with Python 2.4 and 2.5. 156 | 157 | - Added addBlankPage and insertBlankPage methods on PdfFileWriter class. 158 | 159 | - Fixed a bug with circular references in page's object trees (typically 160 | annotations) that prevented correctly writing out a copy of those pages. 161 | 162 | - New merge page functions allow application of a transformation matrix. 163 | 164 | - To all patch contributors: I did a poor job of keeping this ChangeLog 165 | up-to-date for this release, so I am missing attributions here for any 166 | changes you submitted. Sorry! I'll do better in the future. 167 | 168 | 169 | Version 1.12, 2008-09-02 170 | ------------------------ 171 | 172 | - Added support for XMP metadata. 173 | 174 | - Fix reading files with xref streams with multiple /Index values. 175 | 176 | - Fix extracting content streams that use graphics operators longer than 2 177 | characters. Affects merging PDF files. 178 | 179 | 180 | Version 1.11, 2008-05-09 181 | ------------------------ 182 | 183 | - Patch from Hartmut Goebel to permit RectangleObjects to accept NumberObject 184 | or FloatObject values. 185 | 186 | - PDF compatibility fixes. 187 | 188 | - Fix to read object xref stream in correct order. 189 | 190 | - Fix for comments inside content streams. 191 | 192 | 193 | Version 1.10, 2007-10-04 194 | ------------------------ 195 | 196 | - Text strings from PDF files are returned as Unicode string objects when 197 | pyPdf determines that they can be decoded (as UTF-16 strings, or as 198 | PDFDocEncoding strings). Unicode objects are also written out when 199 | necessary. This means that string objects in pyPdf can be either 200 | generic.ByteStringObject instances, or generic.TextStringObject instances. 201 | 202 | - The extractText method now returns a unicode string object. 203 | 204 | - All document information properties now return unicode string objects. In 205 | the event that a document provides docinfo properties that are not decoded by 206 | pyPdf, the raw byte strings can be accessed with an "_raw" property (ie. 207 | title_raw rather than title) 208 | 209 | - generic.DictionaryObject instances have been enhanced to be easier to use. 210 | Values coming out of dictionary objects will automatically be de-referenced 211 | (.getObject will be called on them), unless accessed by the new "raw_get" 212 | method. DictionaryObjects can now only contain PdfObject instances (as keys 213 | and values), making it easier to debug where non-PdfObject values (which 214 | cannot be written out) are entering dictionaries. 215 | 216 | - Support for reading named destinations and outlines in PDF files. Original 217 | patch by Ashish Kulkarni. 218 | 219 | - Stream compatibility reading enhancements for malformed PDF files. 220 | 221 | - Cross reference table reading enhancements for malformed PDF files. 222 | 223 | - Encryption documentation. 224 | 225 | - Replace some "assert" statements with error raising. 226 | 227 | - Minor optimizations to FlateDecode algorithm increase speed when using PNG 228 | predictors. 229 | 230 | Version 1.9, 2006-12-15 231 | ----------------------- 232 | 233 | - Fix several serious bugs introduced in version 1.8, caused by a failure to 234 | run through our PDF test suite before releasing that version. 235 | 236 | - Fix bug in NullObject reading and writing. 
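To make the 1.10 and 1.12 items above concrete, a minimal sketch (the file name is a placeholder):

    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(open("sample.pdf", "rb"))
    info = reader.documentInfo                # docinfo values come back as unicode where possible
    print(info.title, info.title_raw)         # the "_raw" variant keeps the undecoded byte string
    print(reader.getPage(0).extractText())    # extractText() returns a unicode string
    xmp = reader.getXmpMetadata()             # None when the document carries no XMP metadata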
237 | 238 | Version 1.8, 2006-12-14 239 | ----------------------- 240 | 241 | - Add support for decryption with the standard PDF security handler. This 242 | allows for decrypting PDF files given the proper user or owner password. 243 | 244 | - Add support for encryption with the standard PDF security handler. 245 | 246 | - Add new pythondoc documentation. 247 | 248 | - Fix bug in ASCII85 decode that occurs when whitespace exists inside the 249 | two terminating characters of the stream. 250 | 251 | Version 1.7, 2006-12-10 252 | ----------------------- 253 | 254 | - Fix a bug when using a single page object in two PdfFileWriter objects. 255 | 256 | - Adjust PyPDF to be tolerant of whitespace characters that don't belong 257 | during a stream object. 258 | 259 | - Add documentInfo property to PdfFileReader. 260 | 261 | - Add numPages property to PdfFileReader. 262 | 263 | - Add pages property to PdfFileReader. 264 | 265 | - Add extractText function to PdfFileReader. 266 | 267 | 268 | Version 1.6, 2006-06-06 269 | ----------------------- 270 | 271 | - Add basic support for comments in PDF files. This allows us to read some 272 | ReportLab PDFs that could not be read before. 273 | 274 | - Add "auto-repair" for finding xref table at slightly bad locations. 275 | 276 | - New StreamObject backend, cleaner and more powerful. Allows the use of 277 | stream filters more easily, including compressed streams. 278 | 279 | - Add a graphics state push/pop around page merges. Improves quality of 280 | page merges when one page's content stream leaves the graphics 281 | in an abnormal state. 282 | 283 | - Add PageObject.compressContentStreams function, which filters all content 284 | streams and compresses them. This will reduce the size of PDF pages, 285 | especially after they could have been decompressed in a mergePage 286 | operation. 287 | 288 | - Support inline images in PDF content streams. 289 | 290 | - Add support for using .NET framework compression when zlib is not 291 | available. This does not make pyPdf compatible with IronPython, but it 292 | is a first step. 293 | 294 | - Add support for reading the document information dictionary, and extracting 295 | title, author, subject, producer and creator tags. 296 | 297 | - Add patch to support NullObject and multiple xref streams, from Bradley 298 | Lawrence. 299 | 300 | 301 | Version 1.5, 2006-01-28 302 | ----------------------- 303 | 304 | - Fix a bug where merging pages did not work in "no-rename" cases when the 305 | second page has an array of content streams. 306 | 307 | - Remove some debugging output that should not have been present. 308 | 309 | 310 | Version 1.4, 2006-01-27 311 | ----------------------- 312 | 313 | - Add capability to merge pages from multiple PDF files into a single page 314 | using the PageObject.mergePage function. See example code (README or web 315 | site) for more information. 316 | 317 | - Add ability to modify a page's MediaBox, CropBox, BleedBox, TrimBox, and 318 | ArtBox properties through PageObject. See example code (README or web site) 319 | for more information. 320 | 321 | - Refactor pdf.py into multiple files: generic.py (contains objects like 322 | NameObject, DictionaryObject), filters.py (contains filter code), 323 | utils.py (various). This does not affect importing PdfFileReader 324 | or PdfFileWriter. 325 | 326 | - Add new decoding functions for standard PDF filters ASCIIHexDecode and 327 | ASCII85Decode. 328 | 329 | - Change url and download_url to refer to new pybrary.net web site. 
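A short sketch of the two 1.4 features above, page merging and box editing (file names are placeholders):

    from PyPDF2 import PdfFileReader, PdfFileWriter

    page = PdfFileReader(open("report.pdf", "rb")).getPage(0)
    watermark = PdfFileReader(open("watermark.pdf", "rb")).getPage(0)
    page.mergePage(watermark)                 # draw the watermark on top of the page content
    page.mediaBox.upperRight = (400, 400)     # trim the visible area via the RectangleObject
    writer = PdfFileWriter()
    writer.addPage(page)
    writer.write(open("cropped.pdf", "wb"))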
330 | 331 | 332 | Version 1.3, 2006-01-23 333 | ----------------------- 334 | 335 | - Fix new bug introduced in 1.2 where PDF files with \r line endings did not 336 | work properly anymore. A new test suite developed with various PDF files 337 | should prevent regression bugs from now on. 338 | 339 | - Fix a bug where inheriting attributes from page nodes did not work. 340 | 341 | 342 | Version 1.2, 2006-01-23 343 | ----------------------- 344 | 345 | - Improved support for files with CRLF-based line endings, fixing a common 346 | reported problem stating "assertion error: assert line == "%%EOF"". 347 | 348 | - Software author/maintainer is now officially a proud married person, which 349 | is sure to result in better software... somehow. 350 | 351 | 352 | Version 1.1, 2006-01-18 353 | ----------------------- 354 | 355 | - Add capability to rotate pages. 356 | 357 | - Improved PDF reading support to properly manage inherited attributes from 358 | /Type=/Pages nodes. This means that page groups that are rotated or have 359 | different media boxes or whatever will now work properly. 360 | 361 | - Added PDF 1.5 support. Namely cross-reference streams and object streams. 362 | This release can mangle Adobe's PDFReference16.pdf successfully. 363 | 364 | 365 | Version 1.0, 2006-01-17 366 | ----------------------- 367 | 368 | - First distutils-capable true public release. Supports a wide variety of PDF 369 | files that I found sitting around on my system. 370 | 371 | - Does not support some PDF 1.5 features, such as object streams, 372 | cross-reference streams. 373 | 374 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2006-2008, Mathieu Fenniak 2 | Some contributions copyright (c) 2007, Ashish Kulkarni 3 | 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are 8 | met: 9 | 10 | * Redistributions of source code must retain the above copyright notice, 11 | this list of conditions and the following disclaimer. 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | * The name of the author may not be used to endorse or promote products 16 | derived from this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | POSSIBILITY OF SUCH DAMAGE. 
29 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include CHANGELOG 2 | -------------------------------------------------------------------------------- /PDF_Samples/AutoCad_Diagram.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/talumbau/PyPDF2/24b270d876518d15773224b5d0d6c2206db29f64/PDF_Samples/AutoCad_Diagram.pdf -------------------------------------------------------------------------------- /PDF_Samples/AutoCad_Simple.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/talumbau/PyPDF2/24b270d876518d15773224b5d0d6c2206db29f64/PDF_Samples/AutoCad_Simple.pdf -------------------------------------------------------------------------------- /PDF_Samples/README.txt: -------------------------------------------------------------------------------- 1 | PDF Sample Folder 2 | ----------------- 3 | 4 | PDF files are generated by a large variety of sources 5 | for many different purposes. One of the goals of PyPDF2 6 | is to be able to read/write any PDF instance that Adobe 7 | can. 8 | 9 | This is a catalog of various PDF files. The 10 | files may not have worked with PyPDF2 but do now, they 11 | may be complicated or unconventional files, or they may 12 | just be good for testing. The purpose is to insure that 13 | when changes to PyPDF2 are made, we keep them in mind. 14 | 15 | If you have confidential PDFs that don't work with 16 | PyPDF2, feel free to still e-mail them for debugging - 17 | we won't add PDFs without expressed permission. 18 | 19 | (This folder is available through GitHub only) 20 | 21 | 22 | Feel free to add any type of PDF file or sample code, 23 | either by 24 | 25 | 1) sending it via email to PyPDF2@phaseit.net 26 | 2) including it in a pull request on GitHub -------------------------------------------------------------------------------- /PDF_Samples/SF424_page2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/talumbau/PyPDF2/24b270d876518d15773224b5d0d6c2206db29f64/PDF_Samples/SF424_page2.pdf -------------------------------------------------------------------------------- /PyPDF2/__init__.py: -------------------------------------------------------------------------------- 1 | from .pdf import PdfFileReader, PdfFileWriter 2 | from .merger import PdfFileMerger 3 | from ._version import __version__ 4 | __all__ = ["pdf", "PdfFileMerger"] 5 | -------------------------------------------------------------------------------- /PyPDF2/_version.py: -------------------------------------------------------------------------------- 1 | __version__ = '1.20b' 2 | 3 | -------------------------------------------------------------------------------- /PyPDF2/filters.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 
12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | 31 | """ 32 | Implementation of stream filters for PDF. 33 | """ 34 | __author__ = "Mathieu Fenniak" 35 | __author_email__ = "biziqe@mathieu.fenniak.net" 36 | 37 | from .utils import PdfReadError 38 | from sys import version_info 39 | if version_info < ( 3, 0 ): 40 | from cStringIO import StringIO 41 | else: 42 | from io import StringIO 43 | 44 | try: 45 | import zlib 46 | def decompress(data): 47 | return zlib.decompress(data) 48 | def compress(data): 49 | return zlib.compress(data) 50 | except ImportError: 51 | # Unable to import zlib. Attempt to use the System.IO.Compression 52 | # library from the .NET framework. 
(IronPython only) 53 | import System 54 | from System import IO, Collections, Array 55 | def _string_to_bytearr(buf): 56 | retval = Array.CreateInstance(System.Byte, len(buf)) 57 | for i in range(len(buf)): 58 | retval[i] = ord(buf[i]) 59 | return retval 60 | def _bytearr_to_string(bytes): 61 | retval = "" 62 | for i in range(bytes.Length): 63 | retval += chr(bytes[i]) 64 | return retval 65 | def _read_bytes(stream): 66 | ms = IO.MemoryStream() 67 | buf = Array.CreateInstance(System.Byte, 2048) 68 | while True: 69 | bytes = stream.Read(buf, 0, buf.Length) 70 | if bytes == 0: 71 | break 72 | else: 73 | ms.Write(buf, 0, bytes) 74 | retval = ms.ToArray() 75 | ms.Close() 76 | return retval 77 | def decompress(data): 78 | bytes = _string_to_bytearr(data) 79 | ms = IO.MemoryStream() 80 | ms.Write(bytes, 0, bytes.Length) 81 | ms.Position = 0 # fseek 0 82 | gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Decompress) 83 | bytes = _read_bytes(gz) 84 | retval = _bytearr_to_string(bytes) 85 | gz.Close() 86 | return retval 87 | def compress(data): 88 | bytes = _string_to_bytearr(data) 89 | ms = IO.MemoryStream() 90 | gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress, True) 91 | gz.Write(bytes, 0, bytes.Length) 92 | gz.Close() 93 | ms.Position = 0 # fseek 0 94 | bytes = ms.ToArray() 95 | retval = _bytearr_to_string(bytes) 96 | ms.Close() 97 | return retval 98 | 99 | 100 | class FlateDecode(object): 101 | def decode(data, decodeParms): 102 | data = decompress(data) 103 | predictor = 1 104 | if decodeParms: 105 | try: 106 | predictor = decodeParms.get("/Predictor", 1) 107 | except AttributeError: 108 | pass # usually an array with a null object was read 109 | 110 | # predictor 1 == no predictor 111 | if predictor != 1: 112 | columns = decodeParms["/Columns"] 113 | # PNG prediction: 114 | if predictor >= 10 and predictor <= 15: 115 | output = StringIO() 116 | # PNG prediction can vary from row to row 117 | rowlength = columns + 1 118 | assert len(data) % rowlength == 0 119 | prev_rowdata = (0,) * rowlength 120 | for row in range(len(data) // rowlength): 121 | rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]] 122 | filterByte = rowdata[0] 123 | if filterByte == 0: 124 | pass 125 | elif filterByte == 1: 126 | for i in range(2, rowlength): 127 | rowdata[i] = (rowdata[i] + rowdata[i-1]) % 256 128 | elif filterByte == 2: 129 | for i in range(1, rowlength): 130 | rowdata[i] = (rowdata[i] + prev_rowdata[i]) % 256 131 | else: 132 | # unsupported PNG filter 133 | raise PdfReadError("Unsupported PNG filter %r" % filterByte) 134 | prev_rowdata = rowdata 135 | output.write(''.join([chr(x) for x in rowdata[1:]])) 136 | data = output.getvalue() 137 | else: 138 | # unsupported predictor 139 | raise PdfReadError("Unsupported flatedecode predictor %r" % predictor) 140 | return data 141 | decode = staticmethod(decode) 142 | 143 | def encode(data): 144 | return compress(data) 145 | encode = staticmethod(encode) 146 | 147 | class ASCIIHexDecode(object): 148 | def decode(data, decodeParms=None): 149 | retval = "" 150 | char = "" 151 | x = 0 152 | while True: 153 | c = data[x] 154 | if c == ">": 155 | break 156 | elif c.isspace(): 157 | x += 1 158 | continue 159 | char += c 160 | if len(char) == 2: 161 | retval += chr(int(char, base=16)) 162 | char = "" 163 | x += 1 164 | assert char == "" 165 | return retval 166 | decode = staticmethod(decode) 167 | 168 | class LZWDecode(object): 169 | """Taken from: 170 | 
http://www.java2s.com/Open-Source/Java-Document/PDF/PDF-Renderer/com/sun/pdfview/decode/LZWDecode.java.htm 171 | """ 172 | class decoder(object): 173 | def __init__(self, data): 174 | self.STOP=257 175 | self.CLEARDICT=256 176 | self.data=data 177 | self.bytepos=0 178 | self.bitpos=0 179 | self.dict=[""]*4096 180 | for i in range(256): 181 | self.dict[i]=chr(i) 182 | self.resetDict() 183 | 184 | def resetDict(self): 185 | self.dictlen=258 186 | self.bitspercode=9 187 | 188 | 189 | def nextCode(self): 190 | fillbits=self.bitspercode 191 | value=0 192 | while fillbits>0 : 193 | if self.bytepos >= len(self.data): 194 | return -1 195 | nextbits=ord(self.data[self.bytepos]) 196 | bitsfromhere=8-self.bitpos 197 | if bitsfromhere>fillbits: 198 | bitsfromhere=fillbits 199 | value |= (((nextbits >> (8-self.bitpos-bitsfromhere)) & 200 | (0xff >> (8-bitsfromhere))) << 201 | (fillbits-bitsfromhere)) 202 | fillbits -= bitsfromhere 203 | self.bitpos += bitsfromhere 204 | if self.bitpos >=8: 205 | self.bitpos=0 206 | self.bytepos = self.bytepos+1 207 | return value 208 | 209 | def decode(self): 210 | """ algorithm derived from: 211 | http://www.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html 212 | and the PDFReference 213 | """ 214 | cW = self.CLEARDICT; 215 | baos="" 216 | while True: 217 | pW = cW; 218 | cW = self.nextCode(); 219 | if cW == -1: 220 | raise PdfReadError("Missed the stop code in LZWDecode!") 221 | if cW == self.STOP: 222 | break; 223 | elif cW == self.CLEARDICT: 224 | self.resetDict(); 225 | elif pW == self.CLEARDICT: 226 | baos+=self.dict[cW] 227 | else: 228 | if cW < self.dictlen: 229 | baos += self.dict[cW] 230 | p=self.dict[pW]+self.dict[cW][0] 231 | self.dict[self.dictlen]=p 232 | self.dictlen+=1 233 | else: 234 | p=self.dict[pW]+self.dict[pW][0] 235 | baos+=p 236 | self.dict[self.dictlen] = p; 237 | self.dictlen+=1 238 | if (self.dictlen >= (1 << self.bitspercode) - 1 and 239 | self.bitspercode < 12): 240 | self.bitspercode+=1 241 | return baos 242 | 243 | 244 | 245 | @staticmethod 246 | def decode(data,decodeParams=None): 247 | return LZWDecode.decoder(data).decode() 248 | 249 | class ASCII85Decode(object): 250 | def decode(data, decodeParms=None): 251 | retval = "" 252 | group = [] 253 | x = 0 254 | hitEod = False 255 | # remove all whitespace from data 256 | data = [y for y in data if not (y in ' \n\r\t')] 257 | while not hitEod: 258 | c = data[x] 259 | if len(retval) == 0 and c == "<" and data[x+1] == "~": 260 | x += 2 261 | continue 262 | #elif c.isspace(): 263 | # x += 1 264 | # continue 265 | elif c == 'z': 266 | assert len(group) == 0 267 | retval += '\x00\x00\x00\x00' 268 | continue 269 | elif c == "~" and data[x+1] == ">": 270 | if len(group) != 0: 271 | # cannot have a final group of just 1 char 272 | assert len(group) > 1 273 | cnt = len(group) - 1 274 | group += [ 85, 85, 85 ] 275 | hitEod = cnt 276 | else: 277 | break 278 | else: 279 | c = ord(c) - 33 280 | assert c >= 0 and c < 85 281 | group += [ c ] 282 | if len(group) >= 5: 283 | b = group[0] * (85**4) + \ 284 | group[1] * (85**3) + \ 285 | group[2] * (85**2) + \ 286 | group[3] * 85 + \ 287 | group[4] 288 | assert b < (2**32 - 1) 289 | c4 = chr((b >> 0) % 256) 290 | c3 = chr((b >> 8) % 256) 291 | c2 = chr((b >> 16) % 256) 292 | c1 = chr(b >> 24) 293 | retval += (c1 + c2 + c3 + c4) 294 | if hitEod: 295 | retval = retval[:-4+hitEod] 296 | group = [] 297 | x += 1 298 | return retval 299 | decode = staticmethod(decode) 300 | 301 | def decodeStreamData(stream): 302 | from .generic import NameObject 303 | 
filters = stream.get("/Filter", ()) 304 | if len(filters) and not isinstance(filters[0], NameObject): 305 | # we have a single filter instance 306 | filters = (filters,) 307 | data = stream._data 308 | for filterType in filters: 309 | if filterType == "/FlateDecode": 310 | data = FlateDecode.decode(data, stream.get("/DecodeParms")) 311 | elif filterType == "/ASCIIHexDecode": 312 | data = ASCIIHexDecode.decode(data) 313 | elif filterType == "/LZWDecode": 314 | data = LZWDecode.decode(data, stream.get("/DecodeParms")) 315 | elif filterType == "/ASCII85Decode": 316 | data = ASCII85Decode.decode(data) 317 | elif filterType == "/Crypt": 318 | decodeParams = stream.get("/DecodeParams", {}) 319 | if "/Name" not in decodeParams and "/Type" not in decodeParams: 320 | pass 321 | else: 322 | raise NotImplementedError("/Crypt filter with /Name or /Type not supported yet") 323 | else: 324 | # unsupported filter 325 | raise NotImplementedError("unsupported filter %s" % filterType) 326 | return data 327 | -------------------------------------------------------------------------------- /PyPDF2/generic.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | 31 | """ 32 | Implementation of generic PDF objects (dictionary, number, string, and so on) 33 | """ 34 | __author__ = "Mathieu Fenniak" 35 | __author_email__ = "biziqe@mathieu.fenniak.net" 36 | 37 | import re 38 | from .utils import readNonWhitespace, RC4_encrypt 39 | from .utils import b_, u_, chr_, ord_ 40 | from .utils import PdfStreamError 41 | import warnings 42 | from . import filters 43 | from . 
import utils 44 | import decimal 45 | import codecs 46 | #import debugging 47 | 48 | def readObject(stream, pdf): 49 | tok = stream.read(1) 50 | stream.seek(-1, 1) # reset to start 51 | if tok == b_('t') or tok == b_('f'): 52 | # boolean object 53 | return BooleanObject.readFromStream(stream) 54 | elif tok == b_('('): 55 | # string object 56 | return readStringFromStream(stream) 57 | elif tok == b_('/'): 58 | # name object 59 | return NameObject.readFromStream(stream) 60 | elif tok == b_('['): 61 | # array object 62 | return ArrayObject.readFromStream(stream, pdf) 63 | elif tok == b_('n'): 64 | # null object 65 | return NullObject.readFromStream(stream) 66 | elif tok == b_('<'): 67 | # hexadecimal string OR dictionary 68 | peek = stream.read(2) 69 | stream.seek(-2, 1) # reset to start 70 | if peek == b_('<<'): 71 | return DictionaryObject.readFromStream(stream, pdf) 72 | else: 73 | return readHexStringFromStream(stream) 74 | elif tok == b_('%'): 75 | # comment 76 | while tok not in (b_('\r'), b_('\n')): 77 | tok = stream.read(1) 78 | tok = readNonWhitespace(stream) 79 | stream.seek(-1, 1) 80 | return readObject(stream, pdf) 81 | else: 82 | # number object OR indirect reference 83 | if tok == b_('+') or tok == b_('-'): 84 | # number 85 | return NumberObject.readFromStream(stream) 86 | peek = stream.read(20) 87 | stream.seek(-len(peek), 1) # reset to start 88 | if re.match(b_(r"(\d+)\s(\d+)\sR[^a-zA-Z]"), peek) != None: 89 | return IndirectObject.readFromStream(stream, pdf) 90 | else: 91 | return NumberObject.readFromStream(stream) 92 | 93 | class PdfObject(object): 94 | def getObject(self): 95 | """Resolves indirect references.""" 96 | return self 97 | 98 | 99 | class NullObject(PdfObject): 100 | def writeToStream(self, stream, encryption_key): 101 | stream.write(b_("null")) 102 | 103 | def readFromStream(stream): 104 | nulltxt = stream.read(4) 105 | if nulltxt != b_("null"): 106 | raise utils.PdfReadError("Could not read Null object") 107 | return NullObject() 108 | readFromStream = staticmethod(readFromStream) 109 | 110 | 111 | class BooleanObject(PdfObject): 112 | def __init__(self, value): 113 | self.value = value 114 | 115 | def writeToStream(self, stream, encryption_key): 116 | if self.value: 117 | stream.write(b_("true")) 118 | else: 119 | stream.write(b_("false")) 120 | 121 | def readFromStream(stream): 122 | word = stream.read(4) 123 | if word == b_("true"): 124 | return BooleanObject(True) 125 | elif word == b_("fals"): 126 | stream.read(1) 127 | return BooleanObject(False) 128 | else: 129 | raise utils.PdfReadError('Could not read Boolean object') 130 | readFromStream = staticmethod(readFromStream) 131 | 132 | 133 | class ArrayObject(list, PdfObject): 134 | def writeToStream(self, stream, encryption_key): 135 | stream.write(b_("[")) 136 | for data in self: 137 | stream.write(b_(" ")) 138 | data.writeToStream(stream, encryption_key) 139 | stream.write(b_(" ]")) 140 | 141 | def readFromStream(stream, pdf): 142 | arr = ArrayObject() 143 | tmp = stream.read(1) 144 | if tmp != b_("["): 145 | raise utils.PdfReadError("Could not read array") 146 | while True: 147 | # skip leading whitespace 148 | tok = stream.read(1) 149 | while tok.isspace(): 150 | tok = stream.read(1) 151 | stream.seek(-1, 1) 152 | # check for array ending 153 | peekahead = stream.read(1) 154 | if peekahead == b_("]"): 155 | break 156 | stream.seek(-1, 1) 157 | # read and append obj 158 | arr.append(readObject(stream, pdf)) 159 | return arr 160 | readFromStream = staticmethod(readFromStream) 161 | 162 | 163 | class 
IndirectObject(PdfObject): 164 | def __init__(self, idnum, generation, pdf): 165 | self.idnum = idnum 166 | self.generation = generation 167 | self.pdf = pdf 168 | 169 | def getObject(self): 170 | return self.pdf.getObject(self).getObject() 171 | 172 | def __repr__(self): 173 | return "IndirectObject(%r, %r)" % (self.idnum, self.generation) 174 | 175 | def __eq__(self, other): 176 | return ( 177 | other != None and 178 | isinstance(other, IndirectObject) and 179 | self.idnum == other.idnum and 180 | self.generation == other.generation and 181 | self.pdf is other.pdf 182 | ) 183 | 184 | def __ne__(self, other): 185 | return not self.__eq__(other) 186 | 187 | def writeToStream(self, stream, encryption_key): 188 | stream.write(b_("%s %s R" % (self.idnum, self.generation))) 189 | 190 | def readFromStream(stream, pdf): 191 | idnum = b_("") 192 | while True: 193 | tok = stream.read(1) 194 | if not tok: 195 | # stream has truncated prematurely 196 | raise PdfStreamError("Stream has ended unexpectedly") 197 | if tok.isspace(): 198 | break 199 | idnum += tok 200 | generation = b_("") 201 | while True: 202 | tok = stream.read(1) 203 | if not tok: 204 | # stream has truncated prematurely 205 | raise PdfStreamError("Stream has ended unexpectedly") 206 | if tok.isspace(): 207 | break 208 | generation += tok 209 | r = stream.read(1) 210 | if r != b_("R"): 211 | raise utils.PdfReadError("Error reading indirect object reference at byte %s" % utils.hexStr(stream.tell())) 212 | return IndirectObject(int(idnum), int(generation), pdf) 213 | readFromStream = staticmethod(readFromStream) 214 | 215 | 216 | class FloatObject(decimal.Decimal, PdfObject): 217 | def __new__(cls, value="0", context=None): 218 | try: 219 | return decimal.Decimal.__new__(cls, utils.str_(value), context) 220 | except: 221 | return decimal.Decimal.__new__(cls, utils.str_(value)) 222 | def __repr__(self): 223 | if self == self.to_integral(): 224 | return str(self.quantize(decimal.Decimal(1))) 225 | else: 226 | # XXX: this adds useless extraneous zeros. 227 | return "%.5f" % self 228 | 229 | def as_numeric(self): 230 | return float(b_(repr(self))) 231 | 232 | def writeToStream(self, stream, encryption_key): 233 | stream.write(b_(repr(self))) 234 | 235 | 236 | class NumberObject(int, PdfObject): 237 | def __init__(self, value): 238 | int.__init__(value) 239 | 240 | def as_numeric(self): 241 | return int(b_(repr(self))) 242 | 243 | def writeToStream(self, stream, encryption_key): 244 | stream.write(b_(repr(self))) 245 | 246 | def readFromStream(stream): 247 | num = b_("") 248 | while True: 249 | tok = stream.read(1) 250 | if tok != b_('+') and tok != b_('-') and tok != b_('.') and not tok.isdigit(): 251 | stream.seek(-1, 1) 252 | break 253 | num += tok 254 | if num.find(b_(".")) != -1: 255 | return FloatObject(num) 256 | else: 257 | return NumberObject(num) 258 | readFromStream = staticmethod(readFromStream) 259 | 260 | 261 | ## 262 | # Given a string (either a "str" or "unicode"), create a ByteStringObject or a 263 | # TextStringObject to represent the string. 
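# Illustrative behaviour of the factory below (assumed, not a doctest):
#   createStringObject(u"hello")              -> TextStringObject(u"hello")
#   createStringObject(b"\xfe\xff\x00h\x00i") -> TextStringObject(u"hi") with autodetect_utf16 set
#   byte strings that PDFDocEncoding cannot decode -> ByteStringObject (raw bytes kept as-is)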
264 | def createStringObject(string): 265 | if isinstance(string, utils.string_type): 266 | return TextStringObject(string) 267 | elif isinstance(string, utils.bytes_type): 268 | try: 269 | if string.startswith(codecs.BOM_UTF16_BE): 270 | retval = TextStringObject(string.decode("utf-16")) 271 | retval.autodetect_utf16 = True 272 | return retval 273 | else: 274 | # This is probably a big performance hit here, but we need to 275 | # convert string objects into the text/unicode-aware version if 276 | # possible... and the only way to check if that's possible is 277 | # to try. Some strings are strings, some are just byte arrays. 278 | retval = TextStringObject(decode_pdfdocencoding(string)) 279 | retval.autodetect_pdfdocencoding = True 280 | return retval 281 | except UnicodeDecodeError: 282 | return ByteStringObject(string) 283 | else: 284 | raise TypeError("createStringObject should have str or unicode arg") 285 | 286 | 287 | def readHexStringFromStream(stream): 288 | stream.read(1) 289 | txt = "" 290 | x = b_("") 291 | while True: 292 | tok = readNonWhitespace(stream) 293 | if not tok: 294 | # stream has truncated prematurely 295 | raise PdfStreamError("Stream has ended unexpectedly") 296 | if tok == b_(">"): 297 | break 298 | x += tok 299 | if len(x) == 2: 300 | txt += chr(int(x, base=16)) 301 | x = b_("") 302 | if len(x) == 1: 303 | x += b_("0") 304 | if len(x) == 2: 305 | txt += chr(int(x, base=16)) 306 | return createStringObject(b_(txt)) 307 | 308 | 309 | def readStringFromStream(stream): 310 | tok = stream.read(1) 311 | parens = 1 312 | txt = b_("") 313 | while True: 314 | tok = stream.read(1) 315 | if not tok: 316 | # stream has truncated prematurely 317 | raise PdfStreamError("Stream has ended unexpectedly") 318 | if tok == b_("("): 319 | parens += 1 320 | elif tok == b_(")"): 321 | parens -= 1 322 | if parens == 0: 323 | break 324 | elif tok == b_("\\"): 325 | tok = stream.read(1) 326 | if tok == b_("n"): 327 | tok = b_("\n") 328 | elif tok == b_("r"): 329 | tok = b_("\r") 330 | elif tok == b_("t"): 331 | tok = b_("\t") 332 | elif tok == b_("b"): 333 | tok = b_("\b") 334 | elif tok == b_("f"): 335 | tok = b_("\f") 336 | elif tok == b_("("): 337 | tok = b_("(") 338 | elif tok == b_(")"): 339 | tok = b_(")") 340 | elif tok == b_("\\"): 341 | tok = b_("\\") 342 | elif tok in (b_(" "), b_("/"), b_("%"), b_("<"), b_(">"), b_("["), b_("]")): 343 | # odd/unnessecary escape sequences we have encountered 344 | tok = b_(tok) 345 | elif tok.isdigit(): 346 | # "The number ddd may consist of one, two, or three 347 | # octal digits; high-order overflow shall be ignored. 348 | # Three octal digits shall be used, with leading zeros 349 | # as needed, if the next character of the string is also 350 | # a digit." (PDF reference 7.3.4.2, p 16) 351 | for i in range(2): 352 | ntok = stream.read(1) 353 | if ntok.isdigit(): 354 | tok += ntok 355 | else: 356 | break 357 | tok = b_(chr(int(tok, base=8))) 358 | elif tok in b_("\n\r"): 359 | # This case is hit when a backslash followed by a line 360 | # break occurs. If it's a multi-char EOL, consume the 361 | # second character: 362 | tok = stream.read(1) 363 | if not tok in b_("\n\r"): 364 | stream.seek(-1, 1) 365 | # Then don't add anything to the actual string, since this 366 | # line break was escaped: 367 | tok = b_('') 368 | else: 369 | raise utils.PdfReadError("Unexpected escaped string") 370 | txt += tok 371 | return createStringObject(txt) 372 | 373 | 374 | ## 375 | # Represents a string object where the text encoding could not be determined. 
376 | # This occurs quite often, as the PDF spec doesn't provide an alternate way to 377 | # represent strings -- for example, the encryption data stored in files (like 378 | # /O) is clearly not text, but is still stored in a "String" object. 379 | class ByteStringObject(utils.bytes_type, PdfObject): 380 | 381 | ## 382 | # For compatibility with TextStringObject.original_bytes. This method 383 | # returns self. 384 | original_bytes = property(lambda self: self) 385 | 386 | def writeToStream(self, stream, encryption_key): 387 | bytearr = self 388 | if encryption_key: 389 | bytearr = RC4_encrypt(encryption_key, bytearr) 390 | stream.write(b_("<")) 391 | stream.write(utils.hexencode(bytearr)) 392 | stream.write(b_(">")) 393 | 394 | 395 | ## 396 | # Represents a string object that has been decoded into a real unicode string. 397 | # If read from a PDF document, this string appeared to match the 398 | # PDFDocEncoding, or contained a UTF-16BE BOM mark to cause UTF-16 decoding to 399 | # occur. 400 | class TextStringObject(utils.string_type, PdfObject): 401 | autodetect_pdfdocencoding = False 402 | autodetect_utf16 = False 403 | 404 | ## 405 | # It is occasionally possible that a text string object gets created where 406 | # a byte string object was expected due to the autodetection mechanism -- 407 | # if that occurs, this "original_bytes" property can be used to 408 | # back-calculate what the original encoded bytes were. 409 | original_bytes = property(lambda self: self.get_original_bytes()) 410 | 411 | def get_original_bytes(self): 412 | # We're a text string object, but the library is trying to get our raw 413 | # bytes. This can happen if we auto-detected this string as text, but 414 | # we were wrong. It's pretty common. Return the original bytes that 415 | # would have been used to create this object, based upon the autodetect 416 | # method. 417 | if self.autodetect_utf16: 418 | return codecs.BOM_UTF16_BE + self.encode("utf-16be") 419 | elif self.autodetect_pdfdocencoding: 420 | return encode_pdfdocencoding(self) 421 | else: 422 | raise Exception("no information about original bytes") 423 | 424 | def writeToStream(self, stream, encryption_key): 425 | # Try to write the string out as a PDFDocEncoding encoded string. It's 426 | # nicer to look at in the PDF file. Sadly, we take a performance hit 427 | # here for trying... 
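# Illustrative examples (assumed): u"Title" fits PDFDocEncoding and is written as the literal
# string (Title), whereas something like u"\u4e2d" cannot be encoded and falls back to a
# BOM-prefixed UTF-16BE hex string via the ByteStringObject branch below.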
428 | try: 429 | bytearr = encode_pdfdocencoding(self) 430 | except UnicodeEncodeError: 431 | bytearr = codecs.BOM_UTF16_BE + self.encode("utf-16be") 432 | if encryption_key: 433 | bytearr = RC4_encrypt(encryption_key, bytearr) 434 | obj = ByteStringObject(bytearr) 435 | obj.writeToStream(stream, None) 436 | else: 437 | stream.write(b_("(")) 438 | for c in bytearr: 439 | if not chr_(c).isalnum() and c != b_(' '): 440 | stream.write(b_("\\%03o" % ord_(c))) 441 | else: 442 | stream.write(b_(chr_(c))) 443 | stream.write(b_(")")) 444 | 445 | 446 | class NameObject(str, PdfObject): 447 | delimiterCharacters = b_("("), b_(")"), b_("<"), b_(">"), b_("["), b_("]"), b_("{"), b_("}"), b_("/"), b_("%") 448 | 449 | def __init__(self, data): 450 | str.__init__(data) 451 | 452 | def writeToStream(self, stream, encryption_key): 453 | stream.write(b_(self)) 454 | 455 | def readFromStream(stream): 456 | debug = False 457 | if debug: print((stream.tell())) 458 | name = stream.read(1) 459 | if name != b_("/"): 460 | raise utils.PdfReadError("name read error") 461 | while True: 462 | tok = stream.read(1) 463 | if not tok: 464 | # stream has truncated prematurely 465 | raise PdfStreamError("Stream has ended unexpectedly") 466 | if tok.isspace() or tok in NameObject.delimiterCharacters: 467 | stream.seek(-1, 1) 468 | break 469 | name += tok 470 | if debug: print(name) 471 | return NameObject(name.decode('utf-8')) 472 | readFromStream = staticmethod(readFromStream) 473 | 474 | 475 | class DictionaryObject(dict, PdfObject): 476 | 477 | def __init__(self, *args, **kwargs): 478 | if len(args) == 0: 479 | self.update(kwargs) 480 | elif len(args) == 1: 481 | arr = args[0] 482 | # If we're passed a list/tuple, make a dict out of it 483 | if not hasattr(arr, "iteritems"): 484 | newarr = {} 485 | for k, v in arr: 486 | newarr[k] = v 487 | arr = newarr 488 | self.update(arr) 489 | else: 490 | raise TypeError("dict expected at most 1 argument, got 3") 491 | 492 | def update(self, arr): 493 | # note, a ValueError halfway through copying values 494 | # will leave half the values in this dict. 495 | for k, v in list(arr.items()): 496 | self.__setitem__(k, v) 497 | 498 | def raw_get(self, key): 499 | return dict.__getitem__(self, key) 500 | 501 | def __setitem__(self, key, value): 502 | if not isinstance(key, PdfObject): 503 | raise ValueError("key must be PdfObject") 504 | if not isinstance(value, PdfObject): 505 | raise ValueError("value must be PdfObject") 506 | return dict.__setitem__(self, key, value) 507 | 508 | def setdefault(self, key, value=None): 509 | if not isinstance(key, PdfObject): 510 | raise ValueError("key must be PdfObject") 511 | if not isinstance(value, PdfObject): 512 | raise ValueError("value must be PdfObject") 513 | return dict.setdefault(self, key, value) 514 | 515 | def __getitem__(self, key): 516 | return dict.__getitem__(self, key).getObject() 517 | 518 | ## 519 | # Retrieves XMP (Extensible Metadata Platform) data relevant to the 520 | # this object, if available. 521 | #
522 | # Stability: Added in v1.12, will exist for all future v1.x releases. 523 | # @return Returns a {@link #xmp.XmpInformation XmlInformation} instance 524 | # that can be used to access XMP metadata from the document. Can also 525 | # return None if no metadata was found on the document root. 526 | def getXmpMetadata(self): 527 | metadata = self.get("/Metadata", None) 528 | if metadata == None: 529 | return None 530 | metadata = metadata.getObject() 531 | from . import xmp 532 | if not isinstance(metadata, xmp.XmpInformation): 533 | metadata = xmp.XmpInformation(metadata) 534 | self[NameObject("/Metadata")] = metadata 535 | return metadata 536 | 537 | ## 538 | # Read-only property that accesses the {@link 539 | # #DictionaryObject.getXmpData getXmpData} function. 540 | #
541 | # Stability: Added in v1.12, will exist for all future v1.x releases. 542 | xmpMetadata = property(lambda self: self.getXmpMetadata(), None, None) 543 | 544 | def writeToStream(self, stream, encryption_key): 545 | stream.write(b_("<<\n")) 546 | for key, value in list(self.items()): 547 | key.writeToStream(stream, encryption_key) 548 | stream.write(b_(" ")) 549 | value.writeToStream(stream, encryption_key) 550 | stream.write(b_("\n")) 551 | stream.write(b_(">>")) 552 | 553 | def readFromStream(stream, pdf): 554 | # This method is broken in Python 3+ and needs work, 555 | # especially when finding endstream marker 556 | debug = False 557 | tmp = stream.read(2) 558 | if tmp != b_("<<"): 559 | raise utils.PdfReadError("Dictionary read error at byte %s: stream must begin with '<<'" % utils.hexStr(stream.tell())) 560 | data = {} 561 | while True: 562 | tok = readNonWhitespace(stream) 563 | if tok == b_('\x00'): 564 | continue 565 | if not tok: 566 | # stream has truncated prematurely 567 | raise PdfStreamError("Stream has ended unexpectedly") 568 | 569 | if debug: print(("Tok:", tok)) 570 | if tok == b_(">"): 571 | stream.read(1) 572 | break 573 | stream.seek(-1, 1) 574 | key = readObject(stream, pdf) 575 | tok = readNonWhitespace(stream) 576 | stream.seek(-1, 1) 577 | value = readObject(stream, pdf) 578 | if key in data: 579 | # multiple definitions of key not permitted 580 | raise utils.PdfReadError("Multiple definitions in dictionary at byte %s for key %s" \ 581 | % (utils.hexStr(stream.tell()), key)) 582 | data[key] = value 583 | pos = stream.tell() 584 | s = readNonWhitespace(stream) 585 | if s == b_('s') and stream.read(5) == b_('tream'): 586 | eol = stream.read(1) 587 | # odd PDF file output has spaces after 'stream' keyword but before EOL. 588 | # patch provided by Danial Sandler 589 | while eol == b_(' '): 590 | eol = stream.read(1) 591 | assert eol in (b_("\n"), b_("\r")) 592 | if eol == b_("\r"): 593 | # read \n after 594 | if stream.read(1) != b_('\n'): 595 | stream.seek(-1, 1) 596 | # this is a stream object, not a dictionary 597 | assert "/Length" in data 598 | length = data["/Length"] 599 | if debug: print(data) 600 | if isinstance(length, IndirectObject): 601 | t = stream.tell() 602 | length = pdf.getObject(length) 603 | stream.seek(t, 0) 604 | data["__streamdata__"] = stream.read(length) 605 | if debug: print("here") 606 | #if debug: print(binascii.hexlify(data["__streamdata__"])) 607 | e = readNonWhitespace(stream) 608 | ndstream = stream.read(8) 609 | if (e + ndstream) != b_("endstream"): 610 | # (sigh) - the odd PDF file has a length that is too long, so 611 | # we need to read backwards to find the "endstream" ending. 612 | # ReportLab (unknown version) generates files with this bug, 613 | # and Python users into PDF files tend to be our audience. 614 | # we need to do this to correct the streamdata and chop off 615 | # an extra character. 616 | pos = stream.tell() 617 | stream.seek(-10, 1) 618 | end = stream.read(9) 619 | if end == b_("endstream"): 620 | # we found it by looking back one character further. 621 | data["__streamdata__"] = data["__streamdata__"][:-1] 622 | else: 623 | if debug: print(("E", e, ndstream, debugging.toHex(end))) 624 | stream.seek(pos, 0) 625 | raise utils.PdfReadError("Unable to find 'endstream' marker after stream at byte %s." 
% utils.hexStr(stream.tell())) 626 | else: 627 | stream.seek(pos, 0) 628 | if "__streamdata__" in data: 629 | return StreamObject.initializeFromDictionary(data) 630 | else: 631 | retval = DictionaryObject() 632 | retval.update(data) 633 | return retval 634 | readFromStream = staticmethod(readFromStream) 635 | 636 | class TreeObject(DictionaryObject): 637 | def __init__(self): 638 | DictionaryObject.__init__(self) 639 | 640 | def hasChildren(self): 641 | return '/First' in self 642 | 643 | def __iter__(self): 644 | return self.children() 645 | 646 | def children(self): 647 | if not self.hasChildren(): 648 | raise StopIteration 649 | 650 | child = self['/First'] 651 | while True: 652 | yield child 653 | if child == self['/Last']: 654 | raise StopIteration 655 | child = child['/Next'] 656 | 657 | def addChild(self, child, pdf): 658 | childObj = child.getObject() 659 | child = pdf.getReference(childObj) 660 | assert isinstance(child, IndirectObject) 661 | 662 | if '/First' not in self: 663 | self[NameObject('/First')] = child 664 | self[NameObject('/Count')] = NumberObject(0) 665 | prev = None 666 | else: 667 | prev = self['/Last'] 668 | 669 | self[NameObject('/Last')] = child 670 | self[NameObject('/Count')] = NumberObject(self[NameObject('/Count')] + 1) 671 | 672 | if prev: 673 | prevRef = pdf.getReference(prev) 674 | assert isinstance(prevRef, IndirectObject) 675 | childObj[NameObject('/Prev')] = prevRef 676 | prev[NameObject('/Next')] = child 677 | 678 | parentRef = pdf.getReference(self) 679 | assert isinstance(parentRef, IndirectObject) 680 | childObj[NameObject('/Parent')] = parentRef 681 | 682 | def removeChild(self, child): 683 | childObj = child.getObject() 684 | 685 | if NameObject('/Parent') not in childObj: 686 | raise ValueError("Removed child does not appear to be a tree item") 687 | elif childObj[NameObject('/Parent')] != self: 688 | raise ValueError("Removed child is not a member of this tree") 689 | 690 | found = False 691 | prevRef = None 692 | prev = None 693 | curRef = self[NameObject('/First')] 694 | cur = curRef.getObject() 695 | lastRef = self[NameObject('/Last')] 696 | last = lastRef.getObject() 697 | while cur != None: 698 | if cur == childObj: 699 | if prev == None: 700 | if NameObject('/Next') in cur: 701 | # Removing first tree node 702 | nextRef = cur[NameObject('/Next')] 703 | next = nextRef.getObject() 704 | del next[NameObject('/Prev')] 705 | self[NameObject('/First')] = nextRef 706 | self[NameObject('/Count')] = self[NameObject('/Count')] - 1 707 | 708 | else: 709 | # Removing only tree node 710 | assert self[NameObject('/Count')] == 1 711 | del self[NameObject('/Count')] 712 | del self[NameObject('/First')] 713 | if NameObject('/Last') in self: 714 | del self[NameObject('/Last')] 715 | else: 716 | if NameObject('/Next') in cur: 717 | # Removing middle tree node 718 | nextRef = cur[NameObject('/Next')] 719 | next = nextRef.getObject() 720 | next[NameObject('/Prev')] = prevRef 721 | prev[NameObject('/Next')] = nextRef 722 | self[NameObject('/Count')] = self[NameObject('/Count')] - 1 723 | else: 724 | # Removing last tree node 725 | assert cur == last 726 | del prev[NameObject('/Next')] 727 | self[NameObject('/Last')] = prevRef 728 | self[NameObject('/Count')] = self[NameObject('/Count')] - 1 729 | found = True 730 | break 731 | 732 | 733 | prevRef = curRef 734 | prev = cur 735 | if NameObject('/Next') in cur: 736 | curRef = cur[NameObject('/Next')] 737 | cur = curRef.getObject() 738 | else: 739 | curRef = None 740 | cur = None 741 | 742 | if not found: 743 | 
raise ValueError("Removal couldn't find item in tree") 744 | 745 | del childObj[NameObject('/Parent')] 746 | if NameObject('/Next') in childObj: 747 | del childObj[NameObject('/Next')] 748 | if NameObject('/Prev') in childObj: 749 | del childObj[NameObject('/Prev')] 750 | 751 | def emptyTree(self): 752 | for child in self: 753 | childObj = child.getObject() 754 | del childObj[NameObject('/Parent')] 755 | if NameObject('/Next') in childObj: 756 | del childObj[NameObject('/Next')] 757 | if NameObject('/Prev') in childObj: 758 | del childObj[NameObject('/Prev')] 759 | 760 | if NameObject('/Count') in self: 761 | del self[NameObject('/Count')] 762 | if NameObject('/First') in self: 763 | del self[NameObject('/First')] 764 | if NameObject('/Last') in self: 765 | del self[NameObject('/Last')] 766 | 767 | 768 | class StreamObject(DictionaryObject): 769 | def __init__(self): 770 | self._data = None 771 | self.decodedSelf = None 772 | 773 | def writeToStream(self, stream, encryption_key): 774 | self[NameObject("/Length")] = NumberObject(len(self._data)) 775 | DictionaryObject.writeToStream(self, stream, encryption_key) 776 | del self["/Length"] 777 | stream.write(b_("\nstream\n")) 778 | data = self._data 779 | if encryption_key: 780 | data = RC4_encrypt(encryption_key, data) 781 | stream.write(data) 782 | stream.write(b_("\nendstream")) 783 | 784 | def initializeFromDictionary(data): 785 | if "/Filter" in data: 786 | retval = EncodedStreamObject() 787 | else: 788 | retval = DecodedStreamObject() 789 | retval._data = data["__streamdata__"] 790 | del data["__streamdata__"] 791 | del data["/Length"] 792 | retval.update(data) 793 | return retval 794 | initializeFromDictionary = staticmethod(initializeFromDictionary) 795 | 796 | def flateEncode(self): 797 | if "/Filter" in self: 798 | f = self["/Filter"] 799 | if isinstance(f, ArrayObject): 800 | f.insert(0, NameObject("/FlateDecode")) 801 | else: 802 | newf = ArrayObject() 803 | newf.append(NameObject("/FlateDecode")) 804 | newf.append(f) 805 | f = newf 806 | else: 807 | f = NameObject("/FlateDecode") 808 | retval = EncodedStreamObject() 809 | retval[NameObject("/Filter")] = f 810 | retval._data = filters.FlateDecode.encode(self._data) 811 | return retval 812 | 813 | 814 | class DecodedStreamObject(StreamObject): 815 | def getData(self): 816 | return self._data 817 | 818 | def setData(self, data): 819 | self._data = data 820 | 821 | 822 | class EncodedStreamObject(StreamObject): 823 | def __init__(self): 824 | self.decodedSelf = None 825 | 826 | def getData(self): 827 | if self.decodedSelf: 828 | # cached version of decoded object 829 | return self.decodedSelf.getData() 830 | else: 831 | # create decoded object 832 | decoded = DecodedStreamObject() 833 | 834 | decoded._data = filters.decodeStreamData(self) 835 | for key, value in list(self.items()): 836 | if not key in ("/Length", "/Filter", "/DecodeParms"): 837 | decoded[key] = value 838 | self.decodedSelf = decoded 839 | return decoded._data 840 | 841 | def setData(self, data): 842 | raise utils.PdfReadError("Creating EncodedStreamObject is not currently supported") 843 | 844 | 845 | class RectangleObject(ArrayObject): 846 | def __init__(self, arr): 847 | # must have four points 848 | assert len(arr) == 4 849 | # automatically convert arr[x] into NumberObject(arr[x]) if necessary 850 | ArrayObject.__init__(self, [self.ensureIsNumber(x) for x in arr]) 851 | 852 | def ensureIsNumber(self, value): 853 | if not isinstance(value, (NumberObject, FloatObject)): 854 | value = FloatObject(value) 855 | return 
value 856 | 857 | def __repr__(self): 858 | return "RectangleObject(%s)" % repr(list(self)) 859 | 860 | def getLowerLeft_x(self): 861 | return self[0] 862 | 863 | def getLowerLeft_y(self): 864 | return self[1] 865 | 866 | def getUpperRight_x(self): 867 | return self[2] 868 | 869 | def getUpperRight_y(self): 870 | return self[3] 871 | 872 | def getUpperLeft_x(self): 873 | return self.getLowerLeft_x() 874 | 875 | def getUpperLeft_y(self): 876 | return self.getUpperRight_y() 877 | 878 | def getLowerRight_x(self): 879 | return self.getUpperRight_x() 880 | 881 | def getLowerRight_y(self): 882 | return self.getLowerLeft_y() 883 | 884 | def getLowerLeft(self): 885 | return self.getLowerLeft_x(), self.getLowerLeft_y() 886 | 887 | def getLowerRight(self): 888 | return self.getLowerRight_x(), self.getLowerRight_y() 889 | 890 | def getUpperLeft(self): 891 | return self.getUpperLeft_x(), self.getUpperLeft_y() 892 | 893 | def getUpperRight(self): 894 | return self.getUpperRight_x(), self.getUpperRight_y() 895 | 896 | def setLowerLeft(self, value): 897 | self[0], self[1] = [self.ensureIsNumber(x) for x in value] 898 | 899 | def setLowerRight(self, value): 900 | self[2], self[1] = [self.ensureIsNumber(x) for x in value] 901 | 902 | def setUpperLeft(self, value): 903 | self[0], self[3] = [self.ensureIsNumber(x) for x in value] 904 | 905 | def setUpperRight(self, value): 906 | self[2], self[3] = [self.ensureIsNumber(x) for x in value] 907 | 908 | def getWidth(self): 909 | return self.getUpperRight_x() - self.getLowerLeft_x() 910 | 911 | def getHeight(self): 912 | return self.getUpperRight_y() - self.getLowerLeft_y() 913 | 914 | lowerLeft = property(getLowerLeft, setLowerLeft, None, None) 915 | lowerRight = property(getLowerRight, setLowerRight, None, None) 916 | upperLeft = property(getUpperLeft, setUpperLeft, None, None) 917 | upperRight = property(getUpperRight, setUpperRight, None, None) 918 | 919 | 920 | ## 921 | # A class representing a destination within a PDF file. 922 | # See section 8.2.1 of the PDF 1.6 reference. 923 | # Stability: Added in v1.10, will exist for all v1.x releases. 924 | class Destination(TreeObject): 925 | def __init__(self, title, page, typ, *args): 926 | DictionaryObject.__init__(self) 927 | self[NameObject("/Title")] = title 928 | self[NameObject("/Page")] = page 929 | self[NameObject("/Type")] = typ 930 | 931 | # from table 8.2 of the PDF 1.6 reference.
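# Each destination type takes a specific set of positional arguments,
# matching the branches below:
#   /XYZ          -> (left, top, zoom)
#   /FitR         -> (left, bottom, right, top)
#   /FitH, /FitBH -> (top,)
#   /FitV, /FitBV -> (left,)
#   /Fit, /FitB   -> no extra arguments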
932 | if typ == "/XYZ": 933 | (self[NameObject("/Left")], self[NameObject("/Top")], 934 | self[NameObject("/Zoom")]) = args 935 | elif typ == "/FitR": 936 | (self[NameObject("/Left")], self[NameObject("/Bottom")], 937 | self[NameObject("/Right")], self[NameObject("/Top")]) = args 938 | elif typ in ["/FitH", "/FitBH"]: 939 | self[NameObject("/Top")], = args 940 | elif typ in ["/FitV", "/FitBV"]: 941 | self[NameObject("/Left")], = args 942 | elif typ in ["/Fit", "/FitB"]: 943 | pass 944 | else: 945 | raise utils.PdfReadError("Unknown Destination Type: %r" % typ) 946 | 947 | def getDestArray(self): 948 | return ArrayObject([self.raw_get('/Page'), self['/Type']] + [self[x] for x in ['/Left', '/Bottom', '/Right', '/Top', '/Zoom'] if x in self]) 949 | 950 | def writeToStream(self, stream, encryption_key): 951 | stream.write(b_("<<\n")) 952 | key = NameObject('/D') 953 | key.writeToStream(stream, encryption_key) 954 | stream.write(b_(" ")) 955 | value = self.getDestArray() 956 | value.writeToStream(stream, encryption_key) 957 | 958 | key = NameObject("/S") 959 | key.writeToStream(stream, encryption_key) 960 | stream.write(b_(" ")) 961 | value = NameObject("/GoTo") 962 | value.writeToStream(stream, encryption_key) 963 | 964 | stream.write(b_("\n")) 965 | stream.write(b_(">>")) 966 | 967 | ## 968 | # Read-only property accessing the destination title. 969 | # @return A string. 970 | title = property(lambda self: self.get("/Title")) 971 | 972 | ## 973 | # Read-only property accessing the destination page. 974 | # @return An integer. 975 | page = property(lambda self: self.get("/Page")) 976 | 977 | ## 978 | # Read-only property accessing the destination type. 979 | # @return A string. 980 | typ = property(lambda self: self.get("/Type")) 981 | 982 | ## 983 | # Read-only property accessing the zoom factor. 984 | # @return A number, or None if not available. 985 | zoom = property(lambda self: self.get("/Zoom", None)) 986 | 987 | ## 988 | # Read-only property accessing the left horizontal coordinate. 989 | # @return A number, or None if not available. 990 | left = property(lambda self: self.get("/Left", None)) 991 | 992 | ## 993 | # Read-only property accessing the right horizontal coordinate. 994 | # @return A number, or None if not available. 995 | right = property(lambda self: self.get("/Right", None)) 996 | 997 | ## 998 | # Read-only property accessing the top vertical coordinate. 999 | # @return A number, or None if not available. 1000 | top = property(lambda self: self.get("/Top", None)) 1001 | 1002 | ## 1003 | # Read-only property accessing the bottom vertical coordinate. 1004 | # @return A number, or None if not available. 
1005 | bottom = property(lambda self: self.get("/Bottom", None)) 1006 | 1007 | 1008 | class Bookmark(Destination): 1009 | def writeToStream(self, stream, encryption_key): 1010 | stream.write(b_("<<\n")) 1011 | for key in [NameObject(x) for x in ['/Title', '/Parent', '/First', '/Last', '/Next', '/Prev'] if x in self]: 1012 | key.writeToStream(stream, encryption_key) 1013 | stream.write(b_(" ")) 1014 | value = self.raw_get(key) 1015 | value.writeToStream(stream, encryption_key) 1016 | stream.write(b_("\n")) 1017 | key = NameObject('/Dest') 1018 | key.writeToStream(stream, encryption_key) 1019 | stream.write(b_(" ")) 1020 | value = self.getDestArray() 1021 | value.writeToStream(stream, encryption_key) 1022 | stream.write(b_("\n")) 1023 | stream.write(b_(">>")) 1024 | 1025 | 1026 | def encode_pdfdocencoding(unicode_string): 1027 | retval = b_('') 1028 | for c in unicode_string: 1029 | try: 1030 | retval += b_(chr(_pdfDocEncoding_rev[c])) 1031 | except KeyError: 1032 | raise UnicodeEncodeError("pdfdocencoding", c, -1, -1, 1033 | "does not exist in translation table") 1034 | return retval 1035 | 1036 | def decode_pdfdocencoding(byte_array): 1037 | retval = u_('') 1038 | for b in byte_array: 1039 | c = _pdfDocEncoding[ord_(b)] 1040 | if c == u_('\u0000'): 1041 | raise UnicodeDecodeError("pdfdocencoding", utils.barray(b), -1, -1, 1042 | "does not exist in translation table") 1043 | retval += c 1044 | return retval 1045 | 1046 | _pdfDocEncoding = ( 1047 | u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), 1048 | u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), 1049 | u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), u_('\u0000'), 1050 | u_('\u02d8'), u_('\u02c7'), u_('\u02c6'), u_('\u02d9'), u_('\u02dd'), u_('\u02db'), u_('\u02da'), u_('\u02dc'), 1051 | u_('\u0020'), u_('\u0021'), u_('\u0022'), u_('\u0023'), u_('\u0024'), u_('\u0025'), u_('\u0026'), u_('\u0027'), 1052 | u_('\u0028'), u_('\u0029'), u_('\u002a'), u_('\u002b'), u_('\u002c'), u_('\u002d'), u_('\u002e'), u_('\u002f'), 1053 | u_('\u0030'), u_('\u0031'), u_('\u0032'), u_('\u0033'), u_('\u0034'), u_('\u0035'), u_('\u0036'), u_('\u0037'), 1054 | u_('\u0038'), u_('\u0039'), u_('\u003a'), u_('\u003b'), u_('\u003c'), u_('\u003d'), u_('\u003e'), u_('\u003f'), 1055 | u_('\u0040'), u_('\u0041'), u_('\u0042'), u_('\u0043'), u_('\u0044'), u_('\u0045'), u_('\u0046'), u_('\u0047'), 1056 | u_('\u0048'), u_('\u0049'), u_('\u004a'), u_('\u004b'), u_('\u004c'), u_('\u004d'), u_('\u004e'), u_('\u004f'), 1057 | u_('\u0050'), u_('\u0051'), u_('\u0052'), u_('\u0053'), u_('\u0054'), u_('\u0055'), u_('\u0056'), u_('\u0057'), 1058 | u_('\u0058'), u_('\u0059'), u_('\u005a'), u_('\u005b'), u_('\u005c'), u_('\u005d'), u_('\u005e'), u_('\u005f'), 1059 | u_('\u0060'), u_('\u0061'), u_('\u0062'), u_('\u0063'), u_('\u0064'), u_('\u0065'), u_('\u0066'), u_('\u0067'), 1060 | u_('\u0068'), u_('\u0069'), u_('\u006a'), u_('\u006b'), u_('\u006c'), u_('\u006d'), u_('\u006e'), u_('\u006f'), 1061 | u_('\u0070'), u_('\u0071'), u_('\u0072'), u_('\u0073'), u_('\u0074'), u_('\u0075'), u_('\u0076'), u_('\u0077'), 1062 | u_('\u0078'), u_('\u0079'), u_('\u007a'), u_('\u007b'), u_('\u007c'), u_('\u007d'), u_('\u007e'), u_('\u0000'), 1063 | u_('\u2022'), u_('\u2020'), u_('\u2021'), u_('\u2026'), u_('\u2014'), u_('\u2013'), u_('\u0192'), u_('\u2044'), 1064 | u_('\u2039'), u_('\u203a'), u_('\u2212'), u_('\u2030'), 
u_('\u201e'), u_('\u201c'), u_('\u201d'), u_('\u2018'), 1065 | u_('\u2019'), u_('\u201a'), u_('\u2122'), u_('\ufb01'), u_('\ufb02'), u_('\u0141'), u_('\u0152'), u_('\u0160'), 1066 | u_('\u0178'), u_('\u017d'), u_('\u0131'), u_('\u0142'), u_('\u0153'), u_('\u0161'), u_('\u017e'), u_('\u0000'), 1067 | u_('\u20ac'), u_('\u00a1'), u_('\u00a2'), u_('\u00a3'), u_('\u00a4'), u_('\u00a5'), u_('\u00a6'), u_('\u00a7'), 1068 | u_('\u00a8'), u_('\u00a9'), u_('\u00aa'), u_('\u00ab'), u_('\u00ac'), u_('\u0000'), u_('\u00ae'), u_('\u00af'), 1069 | u_('\u00b0'), u_('\u00b1'), u_('\u00b2'), u_('\u00b3'), u_('\u00b4'), u_('\u00b5'), u_('\u00b6'), u_('\u00b7'), 1070 | u_('\u00b8'), u_('\u00b9'), u_('\u00ba'), u_('\u00bb'), u_('\u00bc'), u_('\u00bd'), u_('\u00be'), u_('\u00bf'), 1071 | u_('\u00c0'), u_('\u00c1'), u_('\u00c2'), u_('\u00c3'), u_('\u00c4'), u_('\u00c5'), u_('\u00c6'), u_('\u00c7'), 1072 | u_('\u00c8'), u_('\u00c9'), u_('\u00ca'), u_('\u00cb'), u_('\u00cc'), u_('\u00cd'), u_('\u00ce'), u_('\u00cf'), 1073 | u_('\u00d0'), u_('\u00d1'), u_('\u00d2'), u_('\u00d3'), u_('\u00d4'), u_('\u00d5'), u_('\u00d6'), u_('\u00d7'), 1074 | u_('\u00d8'), u_('\u00d9'), u_('\u00da'), u_('\u00db'), u_('\u00dc'), u_('\u00dd'), u_('\u00de'), u_('\u00df'), 1075 | u_('\u00e0'), u_('\u00e1'), u_('\u00e2'), u_('\u00e3'), u_('\u00e4'), u_('\u00e5'), u_('\u00e6'), u_('\u00e7'), 1076 | u_('\u00e8'), u_('\u00e9'), u_('\u00ea'), u_('\u00eb'), u_('\u00ec'), u_('\u00ed'), u_('\u00ee'), u_('\u00ef'), 1077 | u_('\u00f0'), u_('\u00f1'), u_('\u00f2'), u_('\u00f3'), u_('\u00f4'), u_('\u00f5'), u_('\u00f6'), u_('\u00f7'), 1078 | u_('\u00f8'), u_('\u00f9'), u_('\u00fa'), u_('\u00fb'), u_('\u00fc'), u_('\u00fd'), u_('\u00fe'), u_('\u00ff') 1079 | ) 1080 | 1081 | assert len(_pdfDocEncoding) == 256 1082 | 1083 | _pdfDocEncoding_rev = {} 1084 | for i in range(256): 1085 | char = _pdfDocEncoding[i] 1086 | if char == u_("\u0000"): 1087 | continue 1088 | assert char not in _pdfDocEncoding_rev 1089 | _pdfDocEncoding_rev[char] = i 1090 | 1091 | -------------------------------------------------------------------------------- /PyPDF2/merger.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | from .generic import * 31 | from .pdf import PdfFileReader, PdfFileWriter 32 | from sys import version_info 33 | if version_info < ( 3, 0 ): 34 | from cStringIO import StringIO 35 | else: 36 | from io import StringIO 37 | from io import FileIO as file 38 | 39 | class _MergedPage(object): 40 | """ 41 | _MergedPage is used internally by PdfFileMerger to collect necessary information on each page that is being merged. 42 | """ 43 | def __init__(self, pagedata, src, id): 44 | self.src = src 45 | self.pagedata = pagedata 46 | self.out_pagedata = None 47 | self.id = id 48 | 49 | class PdfFileMerger(object): 50 | """ 51 | PdfFileMerger merges multiple PDFs into a single PDF. It can concatenate, 52 | slice, insert, or any combination of the above. 53 | 54 | See the functions "merge" (or "append") and "write" (or "overwrite") for 55 | usage information. 56 | """ 57 | 58 | def __init__(self, strict=True): 59 | """ 60 | >>> PdfFileMerger() 61 | 62 | Initializes a PdfFileMerger, no parameters required 63 | """ 64 | self.inputs = [] 65 | self.pages = [] 66 | self.output = PdfFileWriter() 67 | self.bookmarks = [] 68 | self.named_dests = [] 69 | self.id_count = 0 70 | self.strict = strict 71 | 72 | def merge(self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True): 73 | """ 74 | >>> merge(position, file, bookmark=None, pages=None, import_bookmarks=True) 75 | 76 | Merges the pages from the source document specified by "file" into the output 77 | file at the page number specified by "position". 78 | 79 | Optionally, you may specify a bookmark to be applied at the beginning of the 80 | included file by supplying the text of the bookmark in the "bookmark" parameter. 81 | 82 | You may prevent the source document's bookmarks from being imported by 83 | specifying "import_bookmarks" as False. 84 | 85 | You may also use the "pages" parameter to merge only the specified range of 86 | pages from the source document into the output document. 87 | """ 88 | 89 | # This parameter is passed to self.inputs.append and means 90 | # that the stream used was created in this method. 91 | my_file = False 92 | 93 | # If the fileobj parameter is a string, assume it is a path 94 | # and create a file object at that location. If it is a file, 95 | # copy the file's contents into a StringIO stream object; if 96 | # it is a PdfFileReader, copy that reader's stream into a 97 | # StringIO stream. 
98 | # If fileobj is none of the above types, it is not modified 99 | if type(fileobj) in (str, str): 100 | fileobj = file(fileobj, 'rb') 101 | my_file = True 102 | elif isinstance(fileobj, file): 103 | fileobj.seek(0) 104 | filecontent = fileobj.read() 105 | fileobj = StringIO(filecontent) 106 | my_file = True 107 | elif isinstance(fileobj, PdfFileReader): 108 | orig_tell = fileobj.stream.tell() 109 | fileobj.stream.seek(0) 110 | filecontent = StringIO(fileobj.stream.read()) 111 | fileobj.stream.seek(orig_tell) # reset the stream to its original location 112 | fileobj = filecontent 113 | my_file = True 114 | 115 | # Create a new PdfFileReader instance using the stream 116 | # (either file or StringIO) created above 117 | pdfr = PdfFileReader(fileobj, strict=self.strict) 118 | 119 | # Find the range of pages to merge 120 | if pages == None: 121 | pages = (0, pdfr.getNumPages()) 122 | elif type(pages) in (int, float, str, str): 123 | raise TypeError('"pages" must be a tuple of (start, end)') 124 | 125 | srcpages = [] 126 | if bookmark: 127 | bookmark = Bookmark(TextStringObject(bookmark), NumberObject(self.id_count), NameObject('/Fit')) 128 | 129 | outline = [] 130 | if import_bookmarks: 131 | outline = pdfr.getOutlines() 132 | outline = self._trim_outline(pdfr, outline, pages) 133 | 134 | if bookmark: 135 | self.bookmarks += [bookmark, outline] 136 | else: 137 | self.bookmarks += outline 138 | 139 | dests = pdfr.namedDestinations 140 | dests = self._trim_dests(pdfr, dests, pages) 141 | self.named_dests += dests 142 | 143 | # Gather all the pages that are going to be merged 144 | for i in range(*pages): 145 | pg = pdfr.getPage(i) 146 | 147 | id = self.id_count 148 | self.id_count += 1 149 | 150 | mp = _MergedPage(pg, pdfr, id) 151 | 152 | srcpages.append(mp) 153 | 154 | self._associate_dests_to_pages(srcpages) 155 | self._associate_bookmarks_to_pages(srcpages) 156 | 157 | 158 | # Slice to insert the pages at the specified position 159 | self.pages[position:position] = srcpages 160 | 161 | # Keep track of our input files so we can close them later 162 | self.inputs.append((fileobj, pdfr, my_file)) 163 | 164 | 165 | def append(self, fileobj, bookmark=None, pages=None, import_bookmarks=True): 166 | """ 167 | >>> append(file, bookmark=None, pages=None, import_bookmarks=True): 168 | 169 | Identical to the "merge" function, but assumes you want to concatenate all pages 170 | onto the end of the file instead of specifying a position. 
171 | """ 172 | 173 | self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks) 174 | 175 | 176 | def write(self, fileobj): 177 | """ 178 | >>> write(file) 179 | 180 | Writes all data that has been merged to "file" (which can be a filename or any 181 | kind of file-like object) 182 | """ 183 | my_file = False 184 | if type(fileobj) in (str, str): 185 | fileobj = file(fileobj, 'wb') 186 | my_file = True 187 | 188 | 189 | # Add pages to the PdfFileWriter 190 | # The commented out line below was replaced with the two lines below it to allow PdfFileMerger to work with PyPdf 1.13 191 | for page in self.pages: 192 | self.output.addPage(page.pagedata) 193 | page.out_pagedata = self.output.getReference(self.output._pages.getObject()["/Kids"][-1].getObject()) 194 | #idnum = self.output._objects.index(self.output._pages.getObject()["/Kids"][-1].getObject()) + 1 195 | #page.out_pagedata = IndirectObject(idnum, 0, self.output) 196 | 197 | # Once all pages are added, create bookmarks to point at those pages 198 | self._write_dests() 199 | self._write_bookmarks() 200 | 201 | # Write the output to the file 202 | self.output.write(fileobj) 203 | 204 | if my_file: 205 | fileobj.close() 206 | 207 | 208 | 209 | def close(self): 210 | """ 211 | >>> close() 212 | 213 | Shuts all file descriptors (input and output) and clears all memory usage 214 | """ 215 | self.pages = [] 216 | for fo, pdfr, mine in self.inputs: 217 | if mine: 218 | fo.close() 219 | 220 | self.inputs = [] 221 | self.output = None 222 | 223 | def addMetadata(self, infos): 224 | """See addMetadata method in PdfFileWriter class""" 225 | self.output.addMetadata(infos) 226 | 227 | def setPageLayout(self, layout): 228 | """See setPageLayout() methods in pdf.py""" 229 | self.output.setPageLayout(layout) 230 | 231 | def setPageMode(self, mode): 232 | """See setPageMode() methods in pdf.py""" 233 | self.output.setPageMode(mode) 234 | 235 | def _trim_dests(self, pdf, dests, pages): 236 | """ 237 | Removes any named destinations that are not a part of the specified page set 238 | """ 239 | new_dests = [] 240 | prev_header_added = True 241 | for k, o in list(dests.items()): 242 | for j in range(*pages): 243 | if pdf.getPage(j).getObject() == o['/Page'].getObject(): 244 | o[NameObject('/Page')] = o['/Page'].getObject() 245 | assert str(k) == str(o['/Title']) 246 | new_dests.append(o) 247 | break 248 | return new_dests 249 | 250 | def _trim_outline(self, pdf, outline, pages): 251 | """ 252 | Removes any outline/bookmark entries that are not a part of the specified page set 253 | """ 254 | new_outline = [] 255 | prev_header_added = True 256 | for i, o in enumerate(outline): 257 | if isinstance(o, list): 258 | sub = self._trim_outline(pdf, o, pages) 259 | if sub: 260 | if not prev_header_added: 261 | new_outline.append(outline[i-1]) 262 | new_outline.append(sub) 263 | else: 264 | prev_header_added = False 265 | for j in range(*pages): 266 | if pdf.getPage(j).getObject() == o['/Page'].getObject(): 267 | o[NameObject('/Page')] = o['/Page'].getObject() 268 | new_outline.append(o) 269 | prev_header_added = True 270 | break 271 | return new_outline 272 | 273 | def _write_dests(self): 274 | dests = self.named_dests 275 | 276 | for v in dests: 277 | pageno = None 278 | pdf = None 279 | if '/Page' in v: 280 | for i, p in enumerate(self.pages): 281 | if p.id == v['/Page']: 282 | v[NameObject('/Page')] = p.out_pagedata 283 | pageno = i 284 | pdf = p.src 285 | break 286 | if pageno != None: 287 | self.output.addNamedDestinationObject(v) 288 | 289 | def 
_write_bookmarks(self, bookmarks=None, parent=None): 290 | 291 | if bookmarks == None: 292 | bookmarks = self.bookmarks 293 | 294 | 295 | last_added = None 296 | for b in bookmarks: 297 | if isinstance(b, list): 298 | self._write_bookmarks(b, last_added) 299 | continue 300 | 301 | pageno = None 302 | pdf = None 303 | if '/Page' in b: 304 | for i, p in enumerate(self.pages): 305 | if p.id == b['/Page']: 306 | #b[NameObject('/Page')] = p.out_pagedata 307 | args = [NumberObject(p.id), NameObject(b['/Type'])] 308 | #nothing more to add 309 | #if b['/Type'] == '/Fit' or b['/Type'] == '/FitB' 310 | if b['/Type'] == '/FitH' or b['/Type'] == '/FitBH': 311 | if '/Top' in b and not isinstance(b['/Top'], NullObject): 312 | args.append(FloatObject(b['/Top'])) 313 | else: 314 | args.append(FloatObject(0)) 315 | del b['/Top'] 316 | elif b['/Type'] == '/FitV' or b['/Type'] == '/FitBV': 317 | if '/Left' in b and not isinstance(b['/Left'], NullObject): 318 | args.append(FloatObject(b['/Left'])) 319 | else: 320 | args.append(FloatObject(0)) 321 | del b['/Left'] 322 | elif b['/Type'] == '/XYZ': 323 | if '/Left' in b and not isinstance(b['/Left'], NullObject): 324 | args.append(FloatObject(b['/Left'])) 325 | else: 326 | args.append(FloatObject(0)) 327 | if '/Top' in b and not isinstance(b['/Top'], NullObject): 328 | args.append(FloatObject(b['/Top'])) 329 | else: 330 | args.append(FloatObject(0)) 331 | if '/Zoom' in b and not isinstance(b['/Zoom'], NullObject): 332 | args.append(FloatObject(b['/Zoom'])) 333 | else: 334 | args.append(FloatObject(0)) 335 | del b['/Top'], b['/Zoom'], b['/Left'] 336 | elif b['/Type'] == '/FitR': 337 | if '/Left' in b and not isinstance(b['/Left'], NullObject): 338 | args.append(FloatObject(b['/Left'])) 339 | else: 340 | args.append(FloatObject(0)) 341 | if '/Bottom' in b and not isinstance(b['/Bottom'], NullObject): 342 | args.append(FloatObject(b['/Bottom'])) 343 | else: 344 | args.append(FloatObject(0)) 345 | if '/Right' in b and not isinstance(b['/Right'], NullObject): 346 | args.append(FloatObject(b['/Right'])) 347 | else: 348 | args.append(FloatObject(0)) 349 | if '/Top' in b and not isinstance(b['/Top'], NullObject): 350 | args.append(FloatObject(b['/Top'])) 351 | else: 352 | args.append(FloatObject(0)) 353 | del b['/Left'], b['/Right'], b['/Bottom'], b['/Top'] 354 | 355 | b[NameObject('/A')] = DictionaryObject({NameObject('/S'): NameObject('/GoTo'), NameObject('/D'): ArrayObject(args)}) 356 | 357 | pageno = i 358 | pdf = p.src 359 | break 360 | if pageno != None: 361 | del b['/Page'], b['/Type'] 362 | last_added = self.output.addBookmarkDict(b, parent) 363 | 364 | def _associate_dests_to_pages(self, pages): 365 | for nd in self.named_dests: 366 | pageno = None 367 | np = nd['/Page'] 368 | 369 | if isinstance(np, NumberObject): 370 | continue 371 | 372 | for p in pages: 373 | if np.getObject() == p.pagedata.getObject(): 374 | pageno = p.id 375 | 376 | if pageno != None: 377 | nd[NameObject('/Page')] = NumberObject(pageno) 378 | else: 379 | raise ValueError("Unresolved named destination '%s'" % (nd['/Title'],)) 380 | 381 | def _associate_bookmarks_to_pages(self, pages, bookmarks=None): 382 | if bookmarks == None: 383 | bookmarks = self.bookmarks 384 | 385 | for b in bookmarks: 386 | if isinstance(b, list): 387 | self._associate_bookmarks_to_pages(pages, b) 388 | continue 389 | 390 | pageno = None 391 | bp = b['/Page'] 392 | 393 | if isinstance(bp, NumberObject): 394 | continue 395 | 396 | for p in pages: 397 | if bp.getObject() == p.pagedata.getObject(): 398 | pageno = p.id 
399 | 400 | if pageno != None: 401 | b[NameObject('/Page')] = NumberObject(pageno) 402 | else: 403 | raise ValueError("Unresolved bookmark '%s'" % (b['/Title'],)) 404 | 405 | def findBookmark(self, bookmark, root=None): 406 | if root == None: 407 | root = self.bookmarks 408 | 409 | for i, b in enumerate(root): 410 | if isinstance(b, list): 411 | res = self.findBookmark(bookmark, b) 412 | if res: 413 | return [i] + res 414 | if b == bookmark or b['/Title'] == bookmark: 415 | return [i] 416 | 417 | return None 418 | 419 | def addBookmark(self, title, pagenum, parent=None): 420 | """ 421 | Add a bookmark to the pdf, using the specified title and pointing at 422 | the specified page number. A parent can be specified to make this a 423 | nested bookmark below the parent. 424 | """ 425 | 426 | if parent == None: 427 | iloc = [len(self.bookmarks)-1] 428 | elif isinstance(parent, list): 429 | iloc = parent 430 | else: 431 | iloc = self.findBookmark(parent) 432 | 433 | dest = Bookmark(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826)) 434 | 435 | if parent == None: 436 | self.bookmarks.append(dest) 437 | else: 438 | bmparent = self.bookmarks 439 | for i in iloc[:-1]: 440 | bmparent = bmparent[i] 441 | npos = iloc[-1]+1 442 | if npos < len(bmparent) and isinstance(bmparent[npos], list): 443 | bmparent[npos].append(dest) 444 | else: 445 | bmparent.insert(npos, [dest]) 446 | return dest 447 | 448 | 449 | def addNamedDestination(self, title, pagenum): 450 | """ 451 | Add a destination to the pdf, using the specified title and pointing 452 | at the specified page number. 453 | """ 454 | 455 | dest = Destination(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826)) 456 | self.named_dests.append(dest) 457 | 458 | 459 | class OutlinesObject(list): 460 | def __init__(self, pdf, tree, parent=None): 461 | list.__init__(self) 462 | self.tree = tree 463 | self.pdf = pdf 464 | self.parent = parent 465 | 466 | def remove(self, index): 467 | obj = self[index] 468 | del self[index] 469 | self.tree.removeChild(obj) 470 | 471 | def add(self, title, page): 472 | pageRef = self.pdf.getObject(self.pdf._pages)['/Kids'][pagenum] 473 | action = DictionaryObject() 474 | action.update({ 475 | NameObject('/D') : ArrayObject([pageRef, NameObject('/FitH'), NumberObject(826)]), 476 | NameObject('/S') : NameObject('/GoTo') 477 | }) 478 | actionRef = self.pdf._addObject(action) 479 | bookmark = TreeObject() 480 | 481 | bookmark.update({ 482 | NameObject('/A'): actionRef, 483 | NameObject('/Title'): createStringObject(title), 484 | }) 485 | 486 | pdf._addObject(bookmark) 487 | 488 | self.tree.addChild(bookmark) 489 | 490 | def removeAll(self): 491 | for child in [x for x in self.tree.children()]: 492 | self.tree.removeChild(child) 493 | self.pop() 494 | -------------------------------------------------------------------------------- /PyPDF2/utils.py: -------------------------------------------------------------------------------- 1 | # vim: sw=4:expandtab:foldmethod=marker 2 | # 3 | # Copyright (c) 2006, Mathieu Fenniak 4 | # All rights reserved. 5 | # 6 | # Redistribution and use in source and binary forms, with or without 7 | # modification, are permitted provided that the following conditions are 8 | # met: 9 | # 10 | # * Redistributions of source code must retain the above copyright notice, 11 | # this list of conditions and the following disclaimer. 
12 | # * Redistributions in binary form must reproduce the above copyright notice, 13 | # this list of conditions and the following disclaimer in the documentation 14 | # and/or other materials provided with the distribution. 15 | # * The name of the author may not be used to endorse or promote products 16 | # derived from this software without specific prior written permission. 17 | # 18 | # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 21 | # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 22 | # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 23 | # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 24 | # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 25 | # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 26 | # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 27 | # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 28 | # POSSIBILITY OF SUCH DAMAGE. 29 | 30 | """ 31 | Utility functions for PDF library. 32 | """ 33 | __author__ = "Mathieu Fenniak" 34 | __author_email__ = "biziqe@mathieu.fenniak.net" 35 | 36 | #custom implementation of warnings.formatwarning 37 | def _formatwarning(message, category, filename, lineno, line=None): 38 | file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name 39 | return "%s: %s [%s:%s]\n" % (category.__name__, message, file, lineno) 40 | 41 | def readUntilWhitespace(stream, maxchars=None): 42 | """ 43 | Reads non-whitespace characters and returns them. 44 | Stops upon encountering whitespace or when maxchars is reached. 45 | """ 46 | txt = b_("") 47 | while True: 48 | tok = stream.read(1) 49 | if tok.isspace() or not tok: 50 | break 51 | txt += tok 52 | if len(txt) == maxchars: 53 | break 54 | return txt 55 | 56 | def readNonWhitespace(stream): 57 | """ 58 | Finds and reads the next non-whitespace character (ignores whitespace). 59 | """ 60 | tok = b_(' ') 61 | while tok == b_('\n') or tok == b_('\r') or tok == b_(' ') or tok == b_('\t'): 62 | tok = stream.read(1) 63 | return tok 64 | 65 | def skipOverWhitespace(stream): 66 | """ 67 | Similar to readNonWhitespace, but returns a Boolean if more than 68 | one whitespace character was read. 
69 | """ 70 | tok = b_(' ') 71 | cnt = 0; 72 | while tok == b_('\n') or tok == b_('\r') or tok == b_(' ') or tok == b_('\t'): 73 | tok = stream.read(1) 74 | cnt+=1 75 | return (cnt > 1) 76 | 77 | def skipOverComment(stream): 78 | tok = stream.read(1) 79 | stream.seek(-1, 1) 80 | if tok == b_('%'): 81 | while tok not in (b_('\n'), b_('\r')): 82 | tok = stream.read(1) 83 | 84 | class ConvertFunctionsToVirtualList(object): 85 | def __init__(self, lengthFunction, getFunction): 86 | self.lengthFunction = lengthFunction 87 | self.getFunction = getFunction 88 | 89 | def __len__(self): 90 | return self.lengthFunction() 91 | 92 | def __getitem__(self, index): 93 | if not isinstance(index, int): 94 | raise TypeError("sequence indices must be integers") 95 | len_self = len(self) 96 | if index < 0: 97 | # support negative indexes 98 | index = len_self + index 99 | if index < 0 or index >= len_self: 100 | raise IndexError("sequence index out of range") 101 | return self.getFunction(index) 102 | 103 | def RC4_encrypt(key, plaintext): 104 | S = [i for i in range(256)] 105 | j = 0 106 | for i in range(256): 107 | j = (j + S[i] + ord_(key[i % len(key)])) % 256 108 | S[i], S[j] = S[j], S[i] 109 | i, j = 0, 0 110 | retval = b_("") 111 | for x in range(len(plaintext)): 112 | i = (i + 1) % 256 113 | j = (j + S[i]) % 256 114 | S[i], S[j] = S[j], S[i] 115 | t = S[(S[i] + S[j]) % 256] 116 | retval += b_(chr(ord_(plaintext[x]) ^ t)) 117 | return retval 118 | 119 | def matrixMultiply(a, b): 120 | return [[sum([float(i)*float(j) 121 | for i, j in zip(row, col)] 122 | ) for col in zip(*b)] 123 | for row in a] 124 | 125 | def markLocation(stream): 126 | """Creates text file showing current location in context.""" 127 | # Mainly for debugging 128 | RADIUS = 5000 129 | stream.seek(-RADIUS, 1) 130 | outputDoc = open('PyPDF2_pdfLocation.txt', 'w') 131 | outputDoc.write(stream.read(RADIUS)) 132 | outputDoc.write('HERE') 133 | outputDoc.write(stream.read(RADIUS)) 134 | outputDoc.close() 135 | stream.seek(-RADIUS, 1) 136 | 137 | class PyPdfError(Exception): 138 | pass 139 | 140 | class PdfReadError(PyPdfError): 141 | pass 142 | 143 | class PageSizeNotDefinedError(PyPdfError): 144 | pass 145 | 146 | class PdfReadWarning(UserWarning): 147 | pass 148 | 149 | class PdfStreamError(PdfReadError): 150 | pass 151 | 152 | def hexStr(num): 153 | return hex(num).replace('L', '') 154 | 155 | import sys 156 | 157 | def b_(s): 158 | if sys.version_info[0] < 3: 159 | return s 160 | else: 161 | if type(s) == bytes: 162 | return s 163 | else: 164 | return s.encode('latin-1') 165 | 166 | def u_(s): 167 | if sys.version_info[0] < 3: 168 | return unicode(s, 'unicode_escape') 169 | else: 170 | return s 171 | 172 | 173 | def str_(b): 174 | if sys.version_info[0] < 3: 175 | return b 176 | else: 177 | if type(b) == bytes: 178 | return b.decode('latin-1') 179 | else: 180 | return b 181 | 182 | def ord_(b): 183 | if sys.version_info[0] < 3: 184 | return ord(b) 185 | else: 186 | return b 187 | 188 | def chr_(c): 189 | if sys.version_info[0] < 3: 190 | return c 191 | else: 192 | return chr(c) 193 | 194 | def barray(b): 195 | if sys.version_info[0] < 3: 196 | return b 197 | else: 198 | return bytearray(b) 199 | 200 | def hexencode(b): 201 | if sys.version_info[0] < 3: 202 | return b.encode('hex') 203 | else: 204 | import codecs 205 | coder = codecs.getencoder('hex_codec') 206 | return coder(b)[0] 207 | 208 | if sys.version_info[0] < 3: 209 | string_type = unicode 210 | bytes_type = str 211 | else: 212 | string_type = str 213 | bytes_type = bytes 214 | 
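# A minimal usage sketch of the helpers defined above; the sample values
# ("key", "hello", the chapters list) are hypothetical. Running this module
# directly executes the sketch; importing it does not.
if __name__ == "__main__":
    # ConvertFunctionsToVirtualList wraps a (length, getter) pair of
    # functions so they behave like a read-only sequence.
    chapters = ["intro", "body", "appendix"]
    virtual = ConvertFunctionsToVirtualList(lambda: len(chapters),
                                            lambda i: chapters[i])
    assert len(virtual) == 3
    assert virtual[-1] == "appendix"   # negative indexes are supported

    # b_ / str_ bridge bytes and text on both Python 2 and Python 3.
    assert str_(b_("PDF")) == "PDF"

    # RC4 is a symmetric stream cipher: applying it twice with the same
    # key restores the original plaintext.
    ciphertext = RC4_encrypt(b_("key"), b_("hello"))
    assert str_(RC4_encrypt(b_("key"), ciphertext)) == "hello"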
-------------------------------------------------------------------------------- /PyPDF2/xmp.py: -------------------------------------------------------------------------------- 1 | import re 2 | import datetime 3 | import decimal 4 | from .generic import PdfObject 5 | from xml.dom import getDOMImplementation 6 | from xml.dom.minidom import parseString 7 | from .utils import u_ 8 | 9 | RDF_NAMESPACE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#" 10 | DC_NAMESPACE = "http://purl.org/dc/elements/1.1/" 11 | XMP_NAMESPACE = "http://ns.adobe.com/xap/1.0/" 12 | PDF_NAMESPACE = "http://ns.adobe.com/pdf/1.3/" 13 | XMPMM_NAMESPACE = "http://ns.adobe.com/xap/1.0/mm/" 14 | 15 | # What is the PDFX namespace, you might ask? I might ask that too. It's 16 | # a completely undocumented namespace used to place "custom metadata" 17 | # properties, which are arbitrary metadata properties with no semantic or 18 | # documented meaning. Elements in the namespace are key/value-style storage, 19 | # where the element name is the key and the content is the value. The keys 20 | # are transformed into valid XML identifiers by substituting an invalid 21 | # identifier character with \u2182 followed by the unicode hex ID of the 22 | # original character. A key like "my car" is therefore "my\u21820020car". 23 | # 24 | # \u2182, in case you're wondering, is the unicode character 25 | # \u{ROMAN NUMERAL TEN THOUSAND}, a straightforward and obvious choice for 26 | # escaping characters. 27 | # 28 | # Intentional users of the pdfx namespace should be shot on sight. A 29 | # custom data schema and sensical XML elements could be used instead, as is 30 | # suggested by Adobe's own documentation on XMP (under "Extensibility of 31 | # Schemas"). 32 | # 33 | # Information presented here on the /pdfx/ schema is a result of limited 34 | # reverse engineering, and does not constitute a full specification. 35 | PDFX_NAMESPACE = "http://ns.adobe.com/pdfx/1.3/" 36 | 37 | iso8601 = re.compile(""" 38 | (?P[0-9]{4}) 39 | (- 40 | (?P[0-9]{2}) 41 | (- 42 | (?P[0-9]+) 43 | (T 44 | (?P[0-9]{2}): 45 | (?P[0-9]{2}) 46 | (:(?P[0-9]{2}(.[0-9]+)?))? 47 | (?PZ|[-+][0-9]{2}:[0-9]{2}) 48 | )? 49 | )? 50 | )? 51 | """, re.VERBOSE) 52 | 53 | ## 54 | # An object that represents Adobe XMP metadata. 
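# Instances are normally obtained from a PdfFileReader via its
# getXmpMetadata() method, which typically returns None when the document
# has no XMP stream. A minimal sketch, assuming an "example.pdf" exists:
#
#   reader = PdfFileReader(open("example.pdf", "rb"))
#   xmp = reader.getXmpMetadata()
#   if xmp is not None:
#       print(xmp.dc_title)         # language-keyed dict of titles
#       print(xmp.xmp_createDate)   # creation date as a UTC datetime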
55 | class XmpInformation(PdfObject): 56 | 57 | def __init__(self, stream): 58 | self.stream = stream 59 | docRoot = parseString(self.stream.getData()) 60 | self.rdfRoot = docRoot.getElementsByTagNameNS(RDF_NAMESPACE, "RDF")[0] 61 | self.cache = {} 62 | 63 | def writeToStream(self, stream, encryption_key): 64 | self.stream.writeToStream(stream, encryption_key) 65 | 66 | def getElement(self, aboutUri, namespace, name): 67 | for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"): 68 | if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri: 69 | attr = desc.getAttributeNodeNS(namespace, name) 70 | if attr != None: 71 | yield attr 72 | for element in desc.getElementsByTagNameNS(namespace, name): 73 | yield element 74 | 75 | def getNodesInNamespace(self, aboutUri, namespace): 76 | for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"): 77 | if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri: 78 | for i in range(desc.attributes.length): 79 | attr = desc.attributes.item(i) 80 | if attr.namespaceURI == namespace: 81 | yield attr 82 | for child in desc.childNodes: 83 | if child.namespaceURI == namespace: 84 | yield child 85 | 86 | def _getText(self, element): 87 | text = "" 88 | for child in element.childNodes: 89 | if child.nodeType == child.TEXT_NODE: 90 | text += child.data 91 | return text 92 | 93 | def _converter_string(value): 94 | return value 95 | 96 | def _converter_date(value): 97 | m = iso8601.match(value) 98 | year = int(m.group("year")) 99 | month = int(m.group("month") or "1") 100 | day = int(m.group("day") or "1") 101 | hour = int(m.group("hour") or "0") 102 | minute = int(m.group("minute") or "0") 103 | second = decimal.Decimal(m.group("second") or "0") 104 | seconds = second.to_integral(decimal.ROUND_FLOOR) 105 | milliseconds = (second - seconds) * 1000000 106 | tzd = m.group("tzd") or "Z" 107 | dt = datetime.datetime(year, month, day, hour, minute, seconds, milliseconds) 108 | if tzd != "Z": 109 | tzd_hours, tzd_minutes = [int(x) for x in tzd.split(":")] 110 | tzd_hours *= -1 111 | if tzd_hours < 0: 112 | tzd_minutes *= -1 113 | dt = dt + datetime.timedelta(hours=tzd_hours, minutes=tzd_minutes) 114 | return dt 115 | _test_converter_date = staticmethod(_converter_date) 116 | 117 | def _getter_bag(namespace, name, converter): 118 | def get(self): 119 | cached = self.cache.get(namespace, {}).get(name) 120 | if cached: 121 | return cached 122 | retval = [] 123 | for element in self.getElement("", namespace, name): 124 | bags = element.getElementsByTagNameNS(RDF_NAMESPACE, "Bag") 125 | if len(bags): 126 | for bag in bags: 127 | for item in bag.getElementsByTagNameNS(RDF_NAMESPACE, "li"): 128 | value = self._getText(item) 129 | value = converter(value) 130 | retval.append(value) 131 | ns_cache = self.cache.setdefault(namespace, {}) 132 | ns_cache[name] = retval 133 | return retval 134 | return get 135 | 136 | def _getter_seq(namespace, name, converter): 137 | def get(self): 138 | cached = self.cache.get(namespace, {}).get(name) 139 | if cached: 140 | return cached 141 | retval = [] 142 | for element in self.getElement("", namespace, name): 143 | seqs = element.getElementsByTagNameNS(RDF_NAMESPACE, "Seq") 144 | if len(seqs): 145 | for seq in seqs: 146 | for item in seq.getElementsByTagNameNS(RDF_NAMESPACE, "li"): 147 | value = self._getText(item) 148 | value = converter(value) 149 | retval.append(value) 150 | else: 151 | value = converter(self._getText(element)) 152 | retval.append(value) 153 | ns_cache = 
self.cache.setdefault(namespace, {}) 154 | ns_cache[name] = retval 155 | return retval 156 | return get 157 | 158 | def _getter_langalt(namespace, name, converter): 159 | def get(self): 160 | cached = self.cache.get(namespace, {}).get(name) 161 | if cached: 162 | return cached 163 | retval = {} 164 | for element in self.getElement("", namespace, name): 165 | alts = element.getElementsByTagNameNS(RDF_NAMESPACE, "Alt") 166 | if len(alts): 167 | for alt in alts: 168 | for item in alt.getElementsByTagNameNS(RDF_NAMESPACE, "li"): 169 | value = self._getText(item) 170 | value = converter(value) 171 | retval[item.getAttribute("xml:lang")] = value 172 | else: 173 | retval["x-default"] = converter(self._getText(element)) 174 | ns_cache = self.cache.setdefault(namespace, {}) 175 | ns_cache[name] = retval 176 | return retval 177 | return get 178 | 179 | def _getter_single(namespace, name, converter): 180 | def get(self): 181 | cached = self.cache.get(namespace, {}).get(name) 182 | if cached: 183 | return cached 184 | value = None 185 | for element in self.getElement("", namespace, name): 186 | if element.nodeType == element.ATTRIBUTE_NODE: 187 | value = element.nodeValue 188 | else: 189 | value = self._getText(element) 190 | break 191 | if value != None: 192 | value = converter(value) 193 | ns_cache = self.cache.setdefault(namespace, {}) 194 | ns_cache[name] = value 195 | return value 196 | return get 197 | 198 | ## 199 | # Contributors to the resource (other than the authors). An unsorted 200 | # array of names. 201 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 202 | dc_contributor = property(_getter_bag(DC_NAMESPACE, "contributor", _converter_string)) 203 | 204 | ## 205 | # Text describing the extent or scope of the resource. 206 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 207 | dc_coverage = property(_getter_single(DC_NAMESPACE, "coverage", _converter_string)) 208 | 209 | ## 210 | # A sorted array of names of the authors of the resource, listed in order 211 | # of precedence. 212 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 213 | dc_creator = property(_getter_seq(DC_NAMESPACE, "creator", _converter_string)) 214 | 215 | ## 216 | # A sorted array of dates (datetime.datetime instances) of significance to 217 | # the resource. The dates and times are in UTC. 218 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 219 | dc_date = property(_getter_seq(DC_NAMESPACE, "date", _converter_date)) 220 | 221 | ## 222 | # A language-keyed dictionary of textual descriptions of the content of the 223 | # resource. 224 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 225 | dc_description = property(_getter_langalt(DC_NAMESPACE, "description", _converter_string)) 226 | 227 | ## 228 | # The mime-type of the resource. 229 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 230 | dc_format = property(_getter_single(DC_NAMESPACE, "format", _converter_string)) 231 | 232 | ## 233 | # Unique identifier of the resource. 234 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 235 | dc_identifier = property(_getter_single(DC_NAMESPACE, "identifier", _converter_string)) 236 | 237 | ## 238 | # An unordered array specifying the languages used in the resource. 239 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 240 | dc_language = property(_getter_bag(DC_NAMESPACE, "language", _converter_string)) 241 | 242 | ## 243 | # An unordered array of publisher names. 244 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 245 | dc_publisher = property(_getter_bag(DC_NAMESPACE, "publisher", _converter_string)) 246 | 247 | ## 248 | # An unordered array of text descriptions of relationships to other 249 | # documents. 250 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 251 | dc_relation = property(_getter_bag(DC_NAMESPACE, "relation", _converter_string)) 252 | 253 | ## 254 | # A language-keyed dictionary of textual descriptions of the rights the 255 | # user has to this resource. 256 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 257 | dc_rights = property(_getter_langalt(DC_NAMESPACE, "rights", _converter_string)) 258 | 259 | ## 260 | # Unique identifier of the work from which this resource was derived. 261 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 262 | dc_source = property(_getter_single(DC_NAMESPACE, "source", _converter_string)) 263 | 264 | ## 265 | # An unordered array of descriptive phrases or keywords that specify the 266 | # topic of the content of the resource. 267 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 268 | dc_subject = property(_getter_bag(DC_NAMESPACE, "subject", _converter_string)) 269 | 270 | ## 271 | # A language-keyed dictionary of the title of the resource. 272 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 273 | dc_title = property(_getter_langalt(DC_NAMESPACE, "title", _converter_string)) 274 | 275 | ## 276 | # An unordered array of textual descriptions of the document type. 277 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 278 | dc_type = property(_getter_bag(DC_NAMESPACE, "type", _converter_string)) 279 | 280 | ## 281 | # An unformatted text string representing document keywords. 282 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 283 | pdf_keywords = property(_getter_single(PDF_NAMESPACE, "Keywords", _converter_string)) 284 | 285 | ## 286 | # The PDF file version, for example 1.0, 1.3. 287 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 288 | pdf_pdfversion = property(_getter_single(PDF_NAMESPACE, "PDFVersion", _converter_string)) 289 | 290 | ## 291 | # The name of the tool that created the PDF document. 292 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 293 | pdf_producer = property(_getter_single(PDF_NAMESPACE, "Producer", _converter_string)) 294 | 295 | ## 296 | # The date and time the resource was originally created. The date and 297 | # time are returned as a UTC datetime.datetime object. 298 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 299 | xmp_createDate = property(_getter_single(XMP_NAMESPACE, "CreateDate", _converter_date)) 300 | 301 | ## 302 | # The date and time the resource was last modified. The date and time 303 | # are returned as a UTC datetime.datetime object. 304 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 305 | xmp_modifyDate = property(_getter_single(XMP_NAMESPACE, "ModifyDate", _converter_date)) 306 | 307 | ## 308 | # The date and time that any metadata for this resource was last 309 | # changed. The date and time are returned as a UTC datetime.datetime 310 | # object. 311 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 312 | xmp_metadataDate = property(_getter_single(XMP_NAMESPACE, "MetadataDate", _converter_date)) 313 | 314 | ## 315 | # The name of the first known tool used to create the resource. 316 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 317 | xmp_creatorTool = property(_getter_single(XMP_NAMESPACE, "CreatorTool", _converter_string)) 318 | 319 | ## 320 | # The common identifier for all versions and renditions of this resource. 321 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 322 | xmpmm_documentId = property(_getter_single(XMPMM_NAMESPACE, "DocumentID", _converter_string)) 323 | 324 | ## 325 | # An identifier for a specific incarnation of a document, updated each 326 | # time a file is saved. 327 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 328 | xmpmm_instanceId = property(_getter_single(XMPMM_NAMESPACE, "InstanceID", _converter_string)) 329 | 330 | def custom_properties(self): 331 | if not hasattr(self, "_custom_properties"): 332 | self._custom_properties = {} 333 | for node in self.getNodesInNamespace("", PDFX_NAMESPACE): 334 | key = node.localName 335 | while True: 336 | # see documentation about PDFX_NAMESPACE earlier in file 337 | idx = key.find(u_("\u2182")) 338 | if idx == -1: 339 | break 340 | key = key[:idx] + chr(int(key[idx+1:idx+5], base=16)) + key[idx+5:] 341 | if node.nodeType == node.ATTRIBUTE_NODE: 342 | value = node.nodeValue 343 | else: 344 | value = self._getText(node) 345 | self._custom_properties[key] = value 346 | return self._custom_properties 347 | 348 | ## 349 | # Retrieves custom metadata properties defined in the undocumented pdfx 350 | # metadata schema. 351 | #
Stability: Added in v1.12, will exist for all future v1.x releases. 352 | # @return Returns a dictionary of key/value items for custom metadata 353 | # properties. 354 | custom_properties = property(custom_properties) 355 | 356 | 357 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | PyPDF2 2 | ------------------------------------------------- 3 | 4 | PyPDF2 is a pure-python PDF library capable of 5 | splitting, merging together, cropping, and transforming 6 | the pages of PDF files. It can also add custom 7 | data, viewing options, and passwords to PDF files. 8 | It can retrieve text and metadata from PDFs as well 9 | as merge entire files together. 10 | 11 | See sample code folder for helpful examples. 12 | 13 | Documentation: 14 | FAQ: 15 | PyPI: 16 | GitHub: 17 | Homepage: 18 | -------------------------------------------------------------------------------- /Sample_Code/2-up.py: -------------------------------------------------------------------------------- 1 | from PyPDF2 import PdfFileWriter, PdfFileReader 2 | import sys 3 | import math 4 | 5 | def main(): 6 | if (len(sys.argv) != 3): 7 | print("usage: python 2-up.py input_file output_file") 8 | sys.exit(1) 9 | print ("2-up input " + sys.argv[1]) 10 | input1 = PdfFileReader(open(sys.argv[1], "rb")) 11 | output = PdfFileWriter() 12 | for iter in range (0, input1.getNumPages()-1, 2): 13 | lhs = input1.getPage(iter) 14 | rhs = input1.getPage(iter+1) 15 | lhs.mergeTranslatedPage(rhs, lhs.mediaBox.getUpperRight_x(),0, True) 16 | output.addPage(lhs) 17 | print (str(iter) + " "), 18 | sys.stdout.flush() 19 | 20 | print("writing " + sys.argv[2]) 21 | outputStream = file(sys.argv[2], "wb") 22 | output.write(outputStream) 23 | print("done.") 24 | 25 | if __name__ == "__main__": 26 | main() 27 | -------------------------------------------------------------------------------- /Sample_Code/README.txt: -------------------------------------------------------------------------------- 1 | PyPDF2 Sample Code Folder 2 | ------------------------- 3 | 4 | This will contain demonstrations of the many features 5 | PyPDF2 is capable of. Example code should make it easy 6 | for users to know how to use all aspects of PyPDF2. 7 | 8 | 9 | 10 | Feel free to add any type of PDF file or sample code, 11 | either by 12 | 13 | 1) sending it via email to PyPDF2@phaseit.net 14 | 2) including it in a pull request on GitHub -------------------------------------------------------------------------------- /Sample_Code/basic_features.py: -------------------------------------------------------------------------------- 1 | from PyPDF2 import PdfFileWriter, PdfFileReader 2 | 3 | output = PdfFileWriter() 4 | input1 = PdfFileReader(open("document1.pdf", "rb")) 5 | 6 | # print how many pages input1 has: 7 | print "document1.pdf has %d pages." 
% input1.getNumPages() 8 | 9 | # add page 1 from input1 to output document, unchanged 10 | output.addPage(input1.getPage(0)) 11 | 12 | # add page 2 from input1, but rotated clockwise 90 degrees 13 | output.addPage(input1.getPage(1).rotateClockwise(90)) 14 | 15 | # add page 3 from input1, rotated the other way: 16 | output.addPage(input1.getPage(2).rotateCounterClockwise(90)) 17 | # alt: output.addPage(input1.getPage(2).rotateClockwise(270)) 18 | 19 | # add page 4 from input1, but first add a watermark from another PDF: 20 | page4 = input1.getPage(3) 21 | watermark = PdfFileReader(open("watermark.pdf", "rb")) 22 | page4.mergePage(watermark.getPage(0)) 23 | output.addPage(page4) 24 | 25 | 26 | # add page 5 from input1, but crop it to half size: 27 | page5 = input1.getPage(4) 28 | page5.mediaBox.upperRight = ( 29 | page5.mediaBox.getUpperRight_x() / 2, 30 | page5.mediaBox.getUpperRight_y() / 2 31 | ) 32 | output.addPage(page5) 33 | 34 | # encrypt your new PDF and add a password 35 | password = "secret" 36 | output.encrypt(password) 37 | 38 | # finally, write "output" to document-output.pdf 39 | outputStream = file("PyPDF2-output.pdf", "wb") 40 | output.write(outputStream) 41 | -------------------------------------------------------------------------------- /Sample_Code/basic_merging.py: -------------------------------------------------------------------------------- 1 | from PyPDF2 import PdfFileMerger 2 | 3 | merger = PdfFileMerger() 4 | 5 | input1 = open("document1.pdf", "rb") 6 | input2 = open("document2.pdf", "rb") 7 | input3 = open("document3.pdf", "rb") 8 | 9 | # add the first 3 pages of input1 document to output 10 | merger.append(fileobj = input1, pages = (0,3)) 11 | 12 | # insert the first page of input2 into the output beginning after the second page 13 | merger.merge(position = 2, fileobj = input2, pages = (0,1)) 14 | 15 | # append entire input3 document to the end of the output document 16 | merger.append(input3) 17 | 18 | # Write to an output PDF document 19 | output = open("document-output.pdf", "wb") 20 | merger.write(output) 21 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | from distutils.core import setup 4 | import re 5 | 6 | long_description = """ 7 | A Pure-Python library built as a PDF toolkit. It is capable of: 8 | 9 | - extracting document information (title, author, ...), 10 | - splitting documents page by page, 11 | - merging documents page by page, 12 | - cropping pages, 13 | - merging multiple pages into a single page, 14 | - encrypting and decrypting PDF files. 15 | 16 | By being Pure-Python, it should run on any Python platform without any 17 | dependencies on external libraries. It can also work entirely on StringIO 18 | objects rather than file streams, allowing for PDF manipulation in memory. 19 | It is therefore a useful tool for websites that manage or manipulate PDFs. 20 | """ 21 | 22 | VERSIONFILE="PyPDF2/_version.py" 23 | verstrline = open(VERSIONFILE, "rt").read() 24 | VSRE = r"^__version__ = ['\"]([^'\"]*)['\"]" 25 | mo = re.search(VSRE, verstrline, re.M) 26 | if mo: 27 | verstr = mo.group(1) 28 | else: 29 | raise RuntimeError("Unable to find version string in %s." 
% (VERSIONFILE)) 30 | 31 | setup( 32 | name="PyPDF2", 33 | version=verstr, 34 | description="PDF toolkit", 35 | long_description=long_description, 36 | author="Mathieu Fenniak", 37 | author_email="biziqe@mathieu.fenniak.net", 38 | maintainer="Phaseit, Inc.", 39 | maintainer_email="PyPDF2@phaseit.net", 40 | url="http://mstamy2.github.com/PyPDF2", 41 | download_url="http://github.com/mstamy2/PyPDF2/tarball/master", 42 | classifiers = [ 43 | "Development Status :: 5 - Production/Stable", 44 | "Intended Audience :: Developers", 45 | "License :: OSI Approved :: BSD License", 46 | "Programming Language :: Python :: 2", 47 | "Programming Language :: Python :: 3", 48 | "Operating System :: OS Independent", 49 | "Topic :: Software Development :: Libraries :: Python Modules", 50 | ], 51 | packages=["PyPDF2"], 52 | ) 53 | 54 | --------------------------------------------------------------------------------