├── .github
    └── workflows
    │   └── main.yml
├── README.md
├── minipdf
    ├── __init__.py
    ├── filters.py
    ├── lzw.py
    ├── minipdf.py
    └── minipdfo.py
├── scripts
    ├── mkpdfT20170432.py
    ├── mkpdfjpeg.py
    ├── mkpdfjs
    ├── mkpdfotf
    ├── mkpdftext
    └── mkpdfxfa
├── setup.py
└── tests
    └── test_general.py


/.github/workflows/main.yml:
--------------------------------------------------------------------------------
# This is a basic workflow to help you get started with Actions

name: CI

# Controls when the action will run. Triggers the workflow on push or pull request
# events but only for the master branch
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "test"
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.6
      uses: actions/setup-python@v1
      with:
        python-version: 3.6


    # Runs a single command using the runners shell
    - name: Test
      run: |
        python -m unittest discover -v $GITHUB_WORKSPACE/tests

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![CI](https://github.com/feliam/miniPDF/workflows/CI/badge.svg)

miniPDF
=======

A python library for making PDF files in a very low level way.

The legendary minipdf python library reaches github. This is a cleaner version of the old micro lib used in more than 10 PDF related exploits.

## Features
It supports only the most basic file structure (PDF32000:7.5.1), that is, without incremental updates or linearization.
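
As a rough illustration of that basic structure, the sketch below assembles a header, a body, a cross-reference table and a trailer by hand in plain Python 3, without using minipdf at all. The function name `build_minimal_pdf` and the two-object body are assumptions made for this example only; the resulting file has no pages, so it demonstrates the layout and nothing more.

```python
def build_minimal_pdf():
    """Hand-assemble a header, body, xref table and trailer (hypothetical helper, not part of minipdf)."""
    # Two indirect objects: a Catalog pointing at an empty Pages node.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]

    out = bytearray(b"%PDF-1.5\n")              # 1. one-line header
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"   # 2. body

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)  # 3. cross-reference table
    out += b"0000000000 65535 f \n"              # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off         # every entry is exactly 20 bytes

    # 4. trailer: object count, root object and the location of the xref table
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets, xref_pos


if __name__ == "__main__":
    pdf, offsets, xref_pos = build_minimal_pdf()
    print(pdf.decode("ascii"))
```

The offsets are recorded while the body is written because each cross-reference entry stores the byte position of its object, and `startxref` stores the byte position of the table itself.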

![](https://feliam.files.wordpress.com/2010/01/pdffilestructure.jpg?w=240)
* A one-line header identifying the version of the PDF file
* A body containing the objects that make up the document contained in the file
* A cross-reference table containing information about the indirect objects in the file
* A trailer dictionary giving the location of the cross-reference table and other special objects within the body of the file

It also supports all basic PDF types: null, references, strings, numbers, arrays and dictionaries.

## Example: A minimal text displaying PDF
As an example, let's create a minimal text-displaying PDF file in Python using minipdf. The following graph outlines the simplest possible structure:
![](http://feliam.files.wordpress.com/2010/01/minimalpdfstructure.jpg?w=600)

### The python script
First we import the lib and create a PDFDoc object representing a document in memory …
```python
from minipdf import *
doc = PDFDoc()
```

As shown in the last figure, the main object is the *Catalog*. The next lines build a *Catalog* dictionary object, add it to the document and set it as the root object…

```python
catalog = PDFDict()
catalog['Type'] = PDFName('Catalog')
doc += catalog
doc.setRoot(catalog)
```

At this point we don't even have a valid PDF, but if we output the incomplete document, this is what it looks like:

```
%PDF-1.5
%���
1 0 obj
<>
endobj
xref
0 2
0000000000 65535 f
0000000015 00000 n
trailer
<>
startxref
50
%%EOF
```

As you can see, it's only a matter of adding all the different PDF objects, linked together from the *Catalog*. The library allows adding them in almost any order. Let's try to follow the basic tree structure. To add a *page*, first we need a *pages* dictionary.

```
pages = PDFDict()
pages['Type'] = PDFName('Pages')
doc += pages
```

Which should be linked from the *Catalog*.

```
catalog['Pages'] = PDFRef(pages)
```

Then a *page*.

```
#page
page = PDFDict()
page['Type'] = PDFName('Page')
page['MediaBox'] = PDFArray([0, 0, 612, 792])
doc += page

#add parent reference in page
page['Parent'] = PDFRef(pages)
```

Which should be linked from the *pages* dictionary.

```
pages['Kids'] = PDFArray([PDFRef(page)])
pages['Count'] = PDFNum(1)
```

Now we add some content to the page. This is called a *content stream*. The text to display is taken from the first command line argument.

```
import sys

contents = PDFStream('''BT
/F1 24 Tf 0 700 Td
%s Tj
ET
'''%PDFString(sys.argv[1]))
doc += contents
```

The *content stream* is linked from the page.

```
page['Contents'] = PDFRef(contents)
```

Note that in the *content stream* we are referencing a font name, */F1*. We shall define this font.

```
font = PDFDict()
font['Name'] = PDFName('F1')
font['Subtype'] = PDFName('Type1')
font['BaseFont'] = PDFName('Helvetica')
```

Associate each defined font with a name in a font map.

```
fontname = PDFDict()
fontname['F1'] = font
```

And add/link all that from the */Font* field of the *resource* dictionary.

```
#resources
resources = PDFDict()
resources['Font'] = fontname
doc += resources
```

Then link the resources to their page under the *Resources* field.

```
page['Resources'] = PDFRef(resources)
```

We are done! Just print the resulting document.

```
print(doc)
```

--------------------------------------------------------------------------------
/minipdf/__init__.py:
--------------------------------------------------------------------------------
from .minipdf import *

__version__ = '0.1'

--------------------------------------------------------------------------------
/minipdf/filters.py:
--------------------------------------------------------------------------------
import zlib,struct
from StringIO import StringIO
import logging
logger = logging.getLogger("FILTER")

#Some code in this file was inspired by the Ghostscript C code.
#TODO: document and test A LOT, add at least 1 test per filter/params combination
#Needs refactoring. The parameters part of the filters is not as clean as it could be.

#7.4 Filters
#Stream filters are introduced in 7.3.8, "Stream Objects." An option when reading
#stream data is to decode it using a filter to produce the original non-encoded
#data. Whether to do so and which decoding filter or filters to use may be specified
#in the stream dictionary.

class PDFFilter(object):
    #Subclasses override this with a dictionary of default parameters
    default = None

    def __init__(self,params=None):
        self.setParams(params)

    def getDefaultParams(self):
        return self.default

    def getParams(self):
        return (self.params == {} or self.params == None) and self.getDefaultParams() or self.params

    def setParams(self,params=None):
        self.params = {}
        #Both the defaults and the given params may be None
        self.params.update(self.getDefaultParams() or {})
        self.params.update(params or {})
        #self.params = not params and self.getDefaultParams() or params

    def decode(self, data):
        pass

    def encode(self, data):
        pass

#################################ASCIIHexDecode#########################################
class ASCIIHexDecode(PDFFilter):
    '''Decodes data encoded in an ASCII hexadecimal
    representation, reproducing the original binary data.

    The ASCIIHexDecode filter shall produce one byte of binary data for each
    pair of ASCII hexadecimal digits (0-9 and A-F or a-f). All white-space
    characters shall be ignored. A GREATER-THAN SIGN (3Eh) indicates EOD.
    Any other characters shall cause an error. If the filter encounters the EOD
    marker after reading an odd number of hexadecimal digits, it shall behave as
    if a 0 (zero) followed the last digit.
    '''
    default = {}
    name = 'ASCIIHexDecode'
    def __init__(self,params={}):
        PDFFilter.__init__(self,params)

    def decode(self, data):
        result = ""
        for c in data:
            if c in "0123456789ABCDEFabcdef":
                result+=c
            elif c == '>':
                break
            elif c in "\x20\r\n\t\x0c\x00":
                #white-space characters are ignored
                continue
            else:
                raise ValueError("Invalid character %r in ASCIIHexDecode data" % c)
        result = result + '0'*(len(result)%2)
        return result.decode('hex')

    def encode(self, data):
        return data.encode('hex')

#################################ASCII85Decode#########################################
class ASCII85Decode(PDFFilter):
    '''
    7.4.3 ASCII85Decode Filter
    The ASCII85Decode filter decodes data that has been encoded in ASCII base-85
    encoding and produces binary data. The following paragraphs describe the process
    for encoding binary data in ASCII base-85; the ASCII85Decode filter reverses this
    process. The ASCII base-85 encoding shall use the ASCII characters ! through u and
    the character z, with the 2-character sequence ~> as its EOD marker. The ASCII85Decode
    filter shall ignore all white-space characters. Any other characters, and any character
    sequences that represent impossible combinations in the ASCII base-85 encoding shall
    cause an error.
    '''
    name='ASCII85Decode'
    def __init__(self,params={}):
        self.pad=False
        #This does not work. Most streams encoded with this use all chars. TODO:recheck
        #self._b85chars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~"
        self._b85chars = [chr(x) for x in range(0,0xff)]
        self._b85chars2 = [(a + b) for a in self._b85chars for b in self._b85chars]
        self._b85dec = {}
        self.default = {}
        for i, c in enumerate(self._b85chars):
            self._b85dec[c] = i
        PDFFilter.__init__(self,params)

    def decode(self, text):
        """decode base85-encoded text"""
        #Remove whitespaces...
        text = ''.join([ c for c in text if not c in "\x20\r\n\t\x0c\x00" ])

        #Cut the stream at the eod
        eod = text.find('~>')
        if eod != -1:
            text=text[:eod]

        l = len(text)
        out = []
        for i in range(0, len(text), 5):
            chunk = text[i:i+5]
            acc = 0
            for j, c in enumerate(chunk):
                acc = acc * 85 + self._b85dec[c]

            #This does not work. Most streams encoded with this use all chars and overflow. TODO:recheck
            # try:
            #     acc = acc * 85 + self._b85dec[c]
            # except KeyError:
            #     raise TypeError('Bad base85 character at byte %d' % (i + j))
            #if acc > 4294967295:
            #    raise OverflowError('Base85 overflow in hunk starting at byte %d' % i)
            out.append(acc&0xffffffff)

        # Pad final chunk if necessary
        cl = l % 5
        if cl:
            acc *= 85 ** (5 - cl)
            if cl > 1:
                acc += 0xffffff >> (cl - 2) * 8
            out[-1] = acc

        out = struct.pack('>%dL' % (len(out)), *out)
        if cl:
            out = out[:-(5 - cl)]

        return out

    def encode(self, text):
        """encode text in base85 format"""
        l = len(text)
        r = l % 4
        if r:
            text += '\0' * (4 - r)
        longs = len(text) >> 2
        words = struct.unpack('>%dL' % (longs), text)

        out = ''.join(self._b85chars[(word // 52200625) % 85] +
                      self._b85chars2[(word // 7225) % 7225] +
                      self._b85chars2[word % 7225]
                      for word in words)

        if self.pad:
            return out

        # Trim padding
        olen = l % 4
        if olen:
            olen += 1
        olen += l // 4 * 5
        return out[:olen]



class Predictor():
    '''
    7.4.4.4 LZW and Flate Predictor Functions
    LZW and Flate encoding compress more compactly if their input data is highly
    predictable.
    One way of increasing the predictability of many continuous-tone
    sampled images is to replace each sample with the difference between that sample
    and a predictor function applied to earlier neighboring samples. If the predictor
    function works well, the postprediction data clusters toward 0.

     1 No prediction (the default value)
     2 TIFF Predictor 2
    10 PNG prediction (on encoding, PNG None on all rows)
    11 PNG prediction (on encoding, PNG Sub on all rows)
    12 PNG prediction (on encoding, PNG Up on all rows)
    13 PNG prediction (on encoding, PNG Average on all rows)
    14 PNG prediction (on encoding, PNG Paeth on all rows)
    15 PNG prediction (on encoding, PNG optimum)

    '''
    def __init__(self,n=1,columns=1,bits=8):
        assert n in [1,2,10,11,12,13,14,15]
        self.predictor = n
        self.columns=columns
        self.bits=bits

    def encode(self):
        raise NotImplementedError("Unsupported Predictor encoder")

    def decode(self, data):
        def decode_row(rowdata,prev_rowdata):
            if self.predictor == 1:
                return rowdata
            if self.predictor == 2:
                #TIFF_PREDICTOR
                bpp = (self.bits + 7) // 8
                for i in range(bpp+1, rowlength):
                    rowdata[i] = (rowdata[i] + rowdata[i-bpp]) % 256
            # PNG prediction
            elif self.predictor >= 10 and self.predictor <= 15:
                filterByte = rowdata[0]
                if filterByte == 0:
                    pass
                elif filterByte == 1:
                    # prior (PNG Sub): add the byte bpp positions to the left
                    bpp = (self.bits + 7) // 8
                    for i in range(bpp+1, rowlength):
                        rowdata[i] = (rowdata[i] + rowdata[i-bpp]) % 256
                elif filterByte == 2:
                    # up
                    for i in range(1, rowlength):
                        rowdata[i] = (rowdata[i] + prev_rowdata[i]) % 256
                elif filterByte == 3:
                    # average
                    bpp = (self.bits + 7) // 8
                    # the first bpp bytes have no left neighbour
                    for i in range(1, bpp+1):
                        rowdata[i] = (rowdata[i] + prev_rowdata[i]//2) % 256
                    for j in range(bpp+1, rowlength):
                        rowdata[j] = (rowdata[j] + (rowdata[j-bpp] +
                                                     prev_rowdata[j])//2) % 256
                elif filterByte == 4:
                    # paeth filtering
                    bpp = (self.bits + 7) // 8
                    # the first bpp bytes have no left neighbour, so the predictor is the byte above
                    for i in range(1, bpp+1):
                        rowdata[i] = (rowdata[i] + prev_rowdata[i]) % 256
                    for j in range(bpp+1, rowlength):
                        # fetch pixels
                        a = rowdata[j-bpp]
                        b = prev_rowdata[j]
                        c = prev_rowdata[j-bpp]

                        # distances to surrounding pixels
                        pa = abs(b - c)
                        pb = abs(a - c)
                        pc = abs(a + b - 2*c)

                        # pick predictor with the shortest distance
                        if pa <= pb and pa <= pc :
                            pred = a
                        elif pb <= pc:
                            pred = b
                        else:
                            pred = c
                        rowdata[j] = (rowdata[j] + pred) % 256

                else:
                    raise ValueError("Unsupported PNG filter %r" % filterByte)
            return rowdata
        #begin
        rowlength = self.columns + 1
        assert len(data) % rowlength == 0
        if self.predictor == 1 :
            return data
        output = StringIO()
        # PNG prediction can vary from row to row
        prev_rowdata = (0,) * rowlength
        for row in range(0,len(data) // rowlength):
            # print (row*rowlength),((row+1)*rowlength),len(data) / rowlength
            rowdata = decode_row([ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]],prev_rowdata)
            if self.predictor in [1,2]:
                output.write(''.join([chr(x) for x in rowdata[0:]]))
            else:
                output.write(''.join([chr(x) for x in rowdata[1:]]))
            prev_rowdata = rowdata
        data = output.getvalue()
        return data


class FlateDecode(PDFFilter):
    '''
    The Flate method is based on the public-domain zlib/deflate compression method,
    which is a variable-length Lempel-Ziv adaptive compression method cascaded with
    adaptive Huffman coding.
    It is fully defined in Internet RFCs 1950, ZLIB Compressed
    Data Format Specification, and 1951, DEFLATE Compressed Data Format Specification
    '''
    default = { 'Predictor': 1,
                'Columns' : 0,
                'Colors' : 1,
                'BitsPerComponent': 8}
    name = "Fl"
    #name = "FlateDecode"
    def __init__(self,params={}):
        PDFFilter.__init__(self,params)

    def decode(self, data):
        p = self.getParams()
        #zlib is imported at the top of this module
        data = zlib.decompress(data)
        data = Predictor(int(p['Predictor']),int(p['Columns']),int(p['BitsPerComponent'])).decode(data)
        return data

    def encode(self, data):
        assert self.getParams()['Predictor'] == 1
        return zlib.compress(data)

import lzw
class LZWDecode(PDFFilter):
    '''
    7.4.4.2 Details of LZW Encoding
    LZW (Lempel-Ziv-Welch) is a variable-length, adaptive compression method
    that has been adopted as one of the standard compression methods in the
    Tag Image File Format (TIFF) standard.

    Data encoded using the LZW compression method shall consist of a sequence
    of codes that are 9 to 12 bits long. Each code shall represent a single
    character of input data (0-255), a clear-table marker (256), an EOD marker
    (257), or a table entry representing a multiple-character sequence that has
    been encountered previously in the input (258 or greater).
    '''
    default = { 'Predictor': 1,
                'Columns' : 1,
                'Colors' : 1,
                'BitsPerComponent': 8,
                'EarlyChange': 1 }
    name = "LZW" #Decode
    def __init__(self,params=None):
        #Caller supplied params are merged over the defaults
        PDFFilter.__init__(self, params or {})

    def decode(self, data):
        p = self.getParams()
        assert p['EarlyChange']==1
        #lzw.decompress returns an iterator of 1-byte strings
        data = ''.join(lzw.decompress(data))
        data = Predictor(int(p['Predictor']),int(p['Columns']),int(p['BitsPerComponent'])).decode(data)
        return data

    def encode(self, data):
        assert self.getParams()['EarlyChange']==1
        assert self.getParams()['Predictor']==1
        return ''.join(lzw.compress(data))

class RunLengthDecode(PDFFilter):
    '''Decompresses data encoded using a byte-oriented run-length encoding algorithm,
    reproducing the original text or binary data (typically monochrome image data,
    or any data that contains frequent long runs of a single byte value).

    The RunLengthDecode filter decodes data that has been encoded in a simple byte-oriented
    format based on run length. The encoded data shall be a sequence of runs, where each run
    shall consist of a length byte followed by 1 to 128 bytes of data. If the length byte is
    in the range 0 to 127, the following length + 1 (1 to 128) bytes shall be copied literally
    during decompression. If length is in the range 129 to 255, the following single byte shall
    be copied 257 - length (2 to 128) times during decompression. A length value of 128 shall
    denote EOD.
    '''
    name = "RunLengthDecode"
    default = {}
    def __init__(self,params={}):
        PDFFilter.__init__(self,params)

    def decode(self, data):
        inp = StringIO(data)
        out = StringIO()
        while True:
            c = inp.read(1)
            if not c:
                #Ran out of input without seeing the EOD marker
                break
            n = ord(c)
            if n < 128:
                #Literal run: copy the next n+1 bytes
                out.write(inp.read(n+1))
            elif n == 128:
                #EOD marker
                break
            else:
                #Repeated run: copy the next byte 257-n times
                out.write(inp.read(1)*(257-n))
        return out.getvalue()

    def encode(self, data):
        #Trivial encoding x2 in size
        out = StringIO()
        for c in data:
            out.write("\x00"+c)
        return out.getvalue()




### filter multiplexers....
def defilterData(filtername,stream,params=None):
    logger.debug("Filtering stream with %s"%repr((filtername,params)))
    if filtername == "FlateDecode":
        return FlateDecode(params).decode(stream)
    elif filtername == "LZWDecode":
        return LZWDecode(params).decode(stream)
    elif filtername == "ASCIIHexDecode":
        return ASCIIHexDecode(params).decode(stream)
    elif filtername == "ASCII85Decode":
        return ASCII85Decode(params).decode(stream)
    elif filtername == "RunLengthDecode":
        return RunLengthDecode(params).decode(stream)

def filterData(filtername,stream,params=None):
    if filtername == "FlateDecode":
        return FlateDecode(params).encode(stream)
    elif filtername == "ASCIIHexDecode":
        return ASCIIHexDecode(params).encode(stream)

if __name__ == "__main__":
    pass

--------------------------------------------------------------------------------
/minipdf/lzw.py:
--------------------------------------------------------------------------------

"""
From https://code.google.com/p/python-lzw/
A stream friendly, simple compression library, built around
iterators. See L{compress} and L{decompress} for the easiest way to
get started.

After the TIFF implementation of LZW, as described at
U{http://www.fileformat.info/format/tiff/corion-lzw.htm}


In an even-nuttier-shell, lzw compresses input bytes with integer
codes. Starting with codes 0-255 that code to themselves, and two
control codes, we work our way through a stream of bytes. When we
encounter a pair of codes c1,c2 we add another entry to our code table
with the lowest available code and the value value(c1) + value(c2)[0]

Of course, there are details :)

The Details
===========

    Our control codes are

        - CLEAR_CODE (codepoint 256). When this code is encountered, we flush
          the codebook and start over.
        - END_OF_INFO_CODE (codepoint 257). This code is reserved for
          encoder/decoders over the integer codepoint stream (like the
          mechanical bit that unpacks bits into codepoints)

    When dealing with bytes, codes are emitted as variable
    length bit strings packed into the stream of bytes.

    codepoints are written with varying length
        - initially 9 bits
        - at 512 entries 10 bits
        - at 1024 entries 11 bits
        - at 2048 entries 12 bits
        - with max of 4095 entries in a table (including Clear and EOI)

    code points are stored with their MSB in the most significant bit
    available in the output character.

>>> import lzw
>>>
>>> mybytes = lzw.readbytes("README.txt")
>>> lessbytes = lzw.compress(mybytes)
>>> newbytes = b"".join(lzw.decompress(lessbytes))
>>> oldbytes = b"".join(lzw.readbytes("README.txt"))
>>> oldbytes == newbytes
True


"""

__author__ = "Joe Bowers"
__license__ = "MIT License"
__version__ = "0.01.01"
__status__ = "Development"
__email__ = "joerbowers@gmail.com"
__url__ = "http://www.joe-bowers.com/static/lzw"

import struct
import itertools


CLEAR_CODE = 256
END_OF_INFO_CODE = 257

DEFAULT_MIN_BITS = 9
DEFAULT_MAX_BITS = 12




def compress(plaintext_bytes):
    """
    Given an iterable of bytes, returns a (hopefully shorter) iterable
    of bytes that you can store in a file or pass over the network or
    what-have-you, and later use to get back your original bytes with
    L{decompress}. This is the best place to start using this module.
    """
    encoder = ByteEncoder()
    return encoder.encodetobytes(plaintext_bytes)


def decompress(compressed_bytes):
    """
    Given an iterable of bytes that were the result of a call to
    L{compress}, returns an iterator over the uncompressed bytes.
    """
    decoder = ByteDecoder()
    return decoder.decodefrombytes(compressed_bytes)





class ByteEncoder(object):
    """
    Takes a stream of uncompressed bytes and produces a stream of
    compressed bytes, usable by L{ByteDecoder}. Combines an L{Encoder}
    with a L{BitPacker}.


    >>> import lzw
    >>>
    >>> enc = lzw.ByteEncoder(12)
    >>> bigstr = b"gabba gabba yo gabba gabba gabba yo gabba gabba gabba yo gabba gabba gabba yo"
    >>> encoding = enc.encodetobytes(bigstr)
    >>> encoded = b"".join( b for b in encoding )
    >>> encoded
    '3\\x98LF#\\x08\\x82\\x05\\x04\\x83\\x1eM\\xf0x\\x1c\\x16\\x1b\\t\\x88C\\xe1q(4"\\x1f\\x17\\x85C#1X\\xec.\\x00'
    >>>
    >>> dec = lzw.ByteDecoder()
    >>> decoding = dec.decodefrombytes(encoded)
    >>> decoded = b"".join(decoding)
    >>> decoded == bigstr
    True

    """

    def __init__(self, max_width=DEFAULT_MAX_BITS):
        """
        max_width is the maximum width in bits we want to see in the
        output stream of codepoints.
        """
        self._encoder = Encoder(max_code_size=2**max_width)
        self._packer = BitPacker(initial_code_size=self._encoder.code_size())


    def encodetobytes(self, bytesource):
        """
        Returns an iterator of bytes, adjusting our packed width
        between minwidth and maxwidth when it detects an overflow is
        about to occur. Dual of L{ByteDecoder.decodefrombytes}.
        """
        codepoints = self._encoder.encode(bytesource)
        codebytes = self._packer.pack(codepoints)

        return codebytes


class ByteDecoder(object):
    """
    Decodes, combines bit-unpacking and interpreting a codepoint
    stream, suitable for use with bytes generated by
    L{ByteEncoder}.

    See L{ByteEncoder} for a usage example.
    """
    def __init__(self):
        """
        """

        self._decoder = Decoder()
        self._unpacker = BitUnpacker(initial_code_size=self._decoder.code_size())
        self.remaining = []

    def decodefrombytes(self, bytesource):
        """
        Given an iterator over BitPacked, Encoded bytes, Returns an
        iterator over the uncompressed bytes. Dual of
        L{ByteEncoder.encodetobytes}.
        See L{ByteEncoder} for an
        example of use.
        """
        codepoints = self._unpacker.unpack(bytesource)
        clearbytes = self._decoder.decode(codepoints)

        return clearbytes


class BitPacker(object):
    """
    Translates a stream of lzw codepoints into a variable width packed
    stream of bytes, for use by L{BitUnpacker}. One of a (potential)
    set of encoders for a stream of LZW codepoints, intended to behave
    as closely to the TIFF variable-width encoding scheme as
    possible.

    The inbound stream of integer lzw codepoints are packed into
    variable width bit fields, starting at the smallest number of bits
    it can and then increasing the bit width as it anticipates the LZW
    code size growing to overflow.

    This class knows all kinds of intimate things about how its
    upstream codepoint processors work; it knows the control codes
    CLEAR_CODE and END_OF_INFO_CODE, and (more intimately still), it
    makes assumptions about the rate of growth of its consumer's
    codebook. This is ok, as long as the underlying encoder/decoders
    don't know any intimate details about their BitPackers/Unpackers
    """

    def __init__(self, initial_code_size):
        """
        Takes an initial code book size (that is, the count of known
        codes at the beginning of encoding, or after a clear)
        """
        self._initial_code_size = initial_code_size


    def pack(self, codepoints):
        """
        Given an iterator of integer codepoints, returns an iterator
        over bytes containing the codepoints packed into varying
        lengths, with bit width growing to accommodate an input code
        that it assumes will grow by one entry per codepoint seen.

        Widths will be reset to the given initial_code_size when the
        LZW CLEAR_CODE or END_OF_INFO_CODE code appears in the input,
        and bytes following END_OF_INFO_CODE will be aligned to the
        next byte boundary.

        >>> import lzw
        >>> pkr = lzw.BitPacker(258)
        >>> [ b for b in pkr.pack([ 1, 257]) ] == [ chr(0), chr(0xC0), chr(0x40) ]
        True
        """
        tailbits = []
        codesize = self._initial_code_size

        minwidth = 8
        while (1 << minwidth) < codesize:
            minwidth = minwidth + 1

        nextwidth = minwidth

        for pt in codepoints:

            newbits = inttobits(pt, nextwidth)
            tailbits = tailbits + newbits

            # PAY ATTENTION. This calculation should be driven by the
            # size of the upstream codebook, right now we're just trusting
            # that everybody intends to follow the TIFF spec.
            codesize = codesize + 1

            if pt == END_OF_INFO_CODE:
                while len(tailbits) % 8:
                    tailbits.append(0)

            if pt in [ CLEAR_CODE, END_OF_INFO_CODE ]:
                nextwidth = minwidth
                codesize = self._initial_code_size
            elif codesize >= (2 ** nextwidth):
                nextwidth = nextwidth + 1

            while len(tailbits) > 8:
                nextbits = tailbits[:8]
                nextbytes = bitstobytes(nextbits)
                for bt in nextbytes:
                    yield struct.pack("B", bt)

                tailbits = tailbits[8:]


        if tailbits:
            tail = bitstobytes(tailbits)
            for bt in tail:
                yield struct.pack("B", bt)




class BitUnpacker(object):
    """
    An adaptive-width bit unpacker, intended to decode streams written
    by L{BitPacker} into integer codepoints. Like L{BitPacker}, knows
    about code size changes and control codes.
    """

    def __init__(self, initial_code_size):
        """
        initial_code_size is the starting size of the codebook
        associated with the to-be-unpacked stream.
        """
        self._initial_code_size = initial_code_size


    def unpack(self, bytesource):
        """
        Given an iterator of bytes, returns an iterator of integer
        code points. Auto-magically adjusts point width when it sees
        an almost-overflow in the input stream, or an LZW CLEAR_CODE
        or END_OF_INFO_CODE

        Trailing bits at the end of the given iterator, after the last
        codepoint, will be dropped on the floor.

        At the end of the iteration, or when an END_OF_INFO_CODE seen
        the unpacker will ignore the bits after the code until it
        reaches the next aligned byte. END_OF_INFO_CODE will *not*
        stop the generator, just reset the alignment and the width


        >>> import lzw
        >>> unpk = lzw.BitUnpacker(initial_code_size=258)
        >>> [ i for i in unpk.unpack([ chr(0), chr(0xC0), chr(0x40) ]) ]
        [1, 257]
        """
        bits = []
        offset = 0
        ignore = 0

        codesize = self._initial_code_size
        minwidth = 8
        while (1 << minwidth) < codesize:
            minwidth = minwidth + 1

        pointwidth = minwidth

        for nextbit in bytestobits(bytesource):

            offset = (offset + 1) % 8
            if ignore > 0:
                ignore = ignore - 1
                continue

            bits.append(nextbit)

            if len(bits) == pointwidth:
                codepoint = intfrombits(bits)
                bits = []

                yield codepoint

                codesize = codesize + 1

                if codepoint in [ CLEAR_CODE, END_OF_INFO_CODE ]:
                    codesize = self._initial_code_size
                    pointwidth = minwidth
                else:
                    # is this too late?
                    while codesize >= (2 ** pointwidth):
                        pointwidth = pointwidth + 1

                if codepoint == END_OF_INFO_CODE:
                    ignore = (8 - offset) % 8



class Decoder(object):
    """
    Uncompresses a stream of lzw code points, as created by
    L{Encoder}.
    Given a list of integer code points, with all
    unpacking foolishness complete, turns that list of codepoints into
    a list of uncompressed bytes. See L{BitUnpacker} for what this
    doesn't do.
    """
    def __init__(self):
        """
        Creates a new Decoder. Decoders should not be reused for
        different streams.
        """
        self._clear_codes()
        self.remainder = []


    def code_size(self):
        """
        Returns the current size of the Decoder's code book, that is,
        its mapping of codepoints to byte strings. The return value of
        this method will change as the decoder encounters more encoded
        input, or control codes.
        """
        return len(self._codepoints)


    def decode(self, codepoints):
        """
        Given an iterable of integer codepoints, yields the
        corresponding bytes, one at a time, as byte strings of length
        E{1}. Retains the state of the codebook from call to call, so
        if you have another stream, you'll likely need another
        decoder!

        Decoders will NOT handle END_OF_INFO_CODE (rather, they will
        handle the code by throwing an exception); END_OF_INFO should
        be handled by the upstream codepoint generator (see
        L{BitUnpacker}, for example)

        >>> import lzw
        >>> dec = lzw.Decoder()
        >>> ''.join(dec.decode([103, 97, 98, 98, 97, 32, 258, 260, 262, 121, 111, 263, 259, 261, 256]))
        'gabba gabba yo gabba'

        """
        codepoints = [ cp for cp in codepoints ]

        for cp in codepoints:
            decoded = self._decode_codepoint(cp)
            for character in decoded:
                yield character



    def _decode_codepoint(self, codepoint):
        """
        Will raise a ValueError if given an END_OF_INFORMATION
        code. EOI codes should be handled by callers if they're
        present in our source stream.
402 | 403 | >>> import lzw 404 | >>> dec = lzw.Decoder() 405 | >>> beforesize = dec.code_size() 406 | >>> dec._decode_codepoint(0x80) 407 | '\\x80' 408 | >>> dec._decode_codepoint(0x81) 409 | '\\x81' 410 | >>> beforesize + 1 == dec.code_size() 411 | True 412 | >>> dec._decode_codepoint(256) 413 | '' 414 | >>> beforesize == dec.code_size() 415 | True 416 | """ 417 | 418 | ret = b"" 419 | 420 | if codepoint == CLEAR_CODE: 421 | self._clear_codes() 422 | elif codepoint == END_OF_INFO_CODE: 423 | raise ValueError("End of information code not supported directly by this Decoder") 424 | else: 425 | if codepoint in self._codepoints: 426 | ret = self._codepoints[ codepoint ] 427 | if None != self._prefix: 428 | self._codepoints[ len(self._codepoints) ] = self._prefix + ret[0] 429 | 430 | else: 431 | ret = self._prefix + self._prefix[0] 432 | self._codepoints[ len(self._codepoints) ] = ret 433 | 434 | self._prefix = ret 435 | 436 | return ret 437 | 438 | 439 | def _clear_codes(self): 440 | self._codepoints = dict( (pt, struct.pack("B", pt)) for pt in range(256) ) 441 | self._codepoints[CLEAR_CODE] = CLEAR_CODE 442 | self._codepoints[END_OF_INFO_CODE] = END_OF_INFO_CODE 443 | self._prefix = None 444 | 445 | 446 | class Encoder(object): 447 | """ 448 | Given an iterator of bytes, returns an iterator of integer 449 | codepoints, suitable for use by L{Decoder}. The core of the 450 | "compression" side of lzw compression/decompression. 
451 | """ 452 | def __init__(self, max_code_size=(2**DEFAULT_MAX_BITS)): 453 | """ 454 | When the encoding codebook grows larger than max_code_size, 455 | the Encoder will clear its codebook and emit a CLEAR_CODE 456 | """ 457 | 458 | self.closed = False 459 | 460 | self._max_code_size = max_code_size 461 | self._buffer = '' 462 | self._clear_codes() 463 | 464 | if max_code_size < self.code_size(): 465 | raise ValueError("Max code size too small, (must be at least {0})".format(self.code_size())) 466 | 467 | 468 | def code_size(self): 469 | """ 470 | Returns a count of the known codes, including codes that are 471 | implicit in the data but have not yet been produced by the 472 | iterator. 473 | """ 474 | return len(self._prefixes) 475 | 476 | 477 | def flush(self): 478 | """ 479 | Yields any buffered codepoints, followed by a CLEAR_CODE, and 480 | clears the codebook as a side effect. 481 | """ 482 | 483 | flushed = [] 484 | 485 | if self._buffer: 486 | yield self._prefixes[ self._buffer ] 487 | self._buffer = '' 488 | 489 | yield CLEAR_CODE 490 | self._clear_codes() 491 | 492 | 493 | 494 | 495 | def encode(self, bytesource): 496 | """ 497 | Given an iterator over bytes, yields the 498 | corresponding stream of codepoints. 499 | Will clear the codes at the end of the stream. 500 | 501 | >>> import lzw 502 | >>> enc = lzw.Encoder() 503 | >>> [ cp for cp in enc.encode("gabba gabba yo gabba") ] 504 | [103, 97, 98, 98, 97, 32, 258, 260, 262, 121, 111, 263, 259, 261, 256] 505 | 506 | """ 507 | for b in bytesource: 508 | for point in self._encode_byte(b): 509 | yield point 510 | 511 | if self.code_size() >= self._max_code_size: 512 | for pt in self.flush(): 513 | yield pt 514 | 515 | for point in self.flush(): 516 | yield point 517 | 518 | 519 | def _encode_byte(self, byte): 520 | # Yields one or zero bytes, AND changes the internal state of 521 | # the codebook and prefix buffer. 
522 | # 523 | # Unless you're in self.encode(), you almost certainly don't 524 | # want to call this. 525 | 526 | new_prefix = self._buffer 527 | 528 | if new_prefix + byte in self._prefixes: 529 | new_prefix = new_prefix + byte 530 | elif new_prefix: 531 | encoded = self._prefixes[ new_prefix ] 532 | self._add_code(new_prefix + byte) 533 | new_prefix = byte 534 | 535 | yield encoded 536 | 537 | self._buffer = new_prefix 538 | 539 | 540 | 541 | 542 | def _clear_codes(self): 543 | 544 | # Teensy hack, CLEAR_CODE and END_OF_INFO_CODE aren't 545 | # equal to any possible string. 546 | 547 | self._prefixes = dict( (struct.pack("B", codept), codept) for codept in range(256) ) 548 | self._prefixes[ CLEAR_CODE ] = CLEAR_CODE 549 | self._prefixes[ END_OF_INFO_CODE ] = END_OF_INFO_CODE 550 | 551 | 552 | def _add_code(self, newstring): 553 | self._prefixes[ newstring ] = len(self._prefixes) 554 | 555 | 556 | 557 | class PagingEncoder(object): 558 | """ 559 | UNTESTED. Handles encoding of multiple chunks or streams of encodable data, 560 | separated with control codes. Dual of PagingDecoder. 561 | """ 562 | def __init__(self, initial_code_size, max_code_size): 563 | self._initial_code_size = initial_code_size 564 | self._max_code_size = max_code_size 565 | 566 | 567 | def encodepages(self, pages): 568 | """ 569 | Given an iterator of iterators of bytes, produces a single 570 | iterator containing a delimited sequence of independently 571 | compressed LZW sequences, all beginning on a byte-aligned 572 | spot, all beginning with a CLEAR code and all terminated with 573 | an END_OF_INFORMATION code (and zero to seven trailing junk 574 | bits.) 575 | 576 | The dual of PagingDecoder.decodepages 577 | 578 | >>> import lzw 579 | >>> enc = lzw.PagingEncoder(257, 2**12) 580 | >>> coded = enc.encodepages([ "say hammer yo hammer mc hammer go hammer", 581 | ... "and the rest can go and play", 582 | ... "can't touch this" ]) 583 | ...
584 | >>> b"".join(coded) 585 | '\\x80\\x1c\\xcc\\'\\x91\\x01\\xa0\\xc2m6\\x99NB\\x03\\xc9\\xbe\\x0b\\x07\\x84\\xc2\\xcd\\xa68|"\\x14 3\\xc3\\xa0\\xd1c\\x94\\x02\\x02\\x80\\x18M\\xc6A\\x01\\xd0\\xd0e\\x10\\x1c\\x8c\\xa73\\xa0\\x80\\xc7\\x02\\x10\\x19\\xcd\\xe2\\x08\\x14\\x10\\xe0l0\\x9e`\\x10\\x10\\x80\\x18\\xcc&\\xe19\\xd0@t7\\x9dLf\\x889\\xa0\\xd2s\\x80@@' 586 | 587 | """ 588 | 589 | for page in pages: 590 | 591 | encoder = Encoder(max_code_size=self._max_code_size) 592 | codepoints = encoder.encode(page) 593 | codes_and_eoi = itertools.chain([ CLEAR_CODE ], codepoints, [ END_OF_INFO_CODE ]) 594 | 595 | packer = BitPacker(initial_code_size=encoder.code_size()) 596 | packed = packer.pack(codes_and_eoi) 597 | 598 | for byte in packed: 599 | yield byte 600 | 601 | 602 | 603 | 604 | class PagingDecoder(object): 605 | """ 606 | UNTESTED. Dual of PagingEncoder, knows how to handle independantly encoded, 607 | END_OF_INFO_CODE delimited chunks of an inbound byte stream 608 | """ 609 | 610 | def __init__(self, initial_code_size): 611 | self._initial_code_size = initial_code_size 612 | self._remains = [] 613 | 614 | def next_page(self, codepoints): 615 | """ 616 | Iterator over the next page of codepoints. 617 | """ 618 | self._remains = [] 619 | 620 | try: 621 | while 1: 622 | cp = codepoints.next() 623 | if cp != END_OF_INFO_CODE: 624 | yield cp 625 | else: 626 | self._remains = codepoints 627 | break 628 | 629 | except StopIteration: 630 | pass 631 | 632 | 633 | def decodepages(self, bytesource): 634 | """ 635 | Takes an iterator of bytes, returns an iterator of iterators 636 | of uncompressed data. Expects input to conform to the output 637 | conventions of PagingEncoder(), in particular that "pages" are 638 | separated with an END_OF_INFO_CODE and padding up to the next 639 | byte boundary. 640 | 641 | BUG: Dangling trailing page on decompression. 
642 | 643 | >>> import lzw 644 | >>> pgdec = lzw.PagingDecoder(initial_code_size=257) 645 | >>> pgdecoded = pgdec.decodepages( 646 | ... ''.join([ '\\x80\\x1c\\xcc\\'\\x91\\x01\\xa0\\xc2m6', 647 | ... '\\x99NB\\x03\\xc9\\xbe\\x0b\\x07\\x84\\xc2', 648 | ... '\\xcd\\xa68|"\\x14 3\\xc3\\xa0\\xd1c\\x94', 649 | ... '\\x02\\x02\\x80\\x18M\\xc6A\\x01\\xd0\\xd0e', 650 | ... '\\x10\\x1c\\x8c\\xa73\\xa0\\x80\\xc7\\x02\\x10', 651 | ... '\\x19\\xcd\\xe2\\x08\\x14\\x10\\xe0l0\\x9e`\\x10', 652 | ... '\\x10\\x80\\x18\\xcc&\\xe19\\xd0@t7\\x9dLf\\x889', 653 | ... '\\xa0\\xd2s\\x80@@' ]) 654 | ... ) 655 | >>> [ b"".join(pg) for pg in pgdecoded ] 656 | ['say hammer yo hammer mc hammer go hammer', 'and the rest can go and play', "can't touch this", ''] 657 | 658 | """ 659 | 660 | # TODO: WE NEED A CODE SIZE POLICY OBJECT THAT ISN'T THIS. 661 | # honestly, we should have a "codebook" object we need to pass 662 | # to bit packing/unpacking tools, etc, such that we don't have 663 | # to roll all of these code size assumptions everyplace. 664 | 665 | unpacker = BitUnpacker(initial_code_size=self._initial_code_size) 666 | codepoints = unpacker.unpack(bytesource) 667 | 668 | self._remains = codepoints 669 | while self._remains: 670 | nextpoints = self.next_page(self._remains) 671 | nextpoints = [ nx for nx in nextpoints ] 672 | 673 | decoder = Decoder() 674 | decoded = decoder.decode(nextpoints) 675 | decoded = [ dec for dec in decoded ] 676 | 677 | yield decoded 678 | 679 | 680 | 681 | ######################################### 682 | # Conveniences. 683 | 684 | 685 | # PYTHON V2 686 | def unpackbyte(b): 687 | """ 688 | Given a one-byte long byte string, returns an integer. Equivalent 689 | to struct.unpack("B", b) 690 | """ 691 | (ret,) = struct.unpack("B", b) 692 | return ret 693 | 694 | 695 | # PYTHON V3 696 | # def unpackbyte(b): return b 697 | 698 | 699 | def filebytes(fileobj, buffersize=1024): 700 | """ 701 | Convenience for iterating over the bytes in a file. 
Given a 702 | file-like object (with a read(int) method), returns an iterator 703 | over the bytes of that file. 704 | """ 705 | buff = fileobj.read(buffersize) 706 | while buff: 707 | for byte in buff: yield byte 708 | buff = fileobj.read(buffersize) 709 | 710 | 711 | def readbytes(filename, buffersize=1024): 712 | """ 713 | Opens a file named by filename and iterates over the L{filebytes} 714 | found therein. Will close the file when the bytes run out. 715 | """ 716 | with open(filename, "rb") as infile: 717 | for byte in filebytes(infile, buffersize): 718 | yield byte 719 | 720 | 721 | 722 | def writebytes(filename, bytesource): 723 | """ 724 | Convenience for emitting the bytes we generate to a file. Given a 725 | filename, opens and truncates the file, dumps the bytes 726 | from bytesource into it, and closes it 727 | """ 728 | 729 | with open(filename, "wb") as outfile: 730 | for bt in bytesource: 731 | outfile.write(bt) 732 | 733 | 734 | def inttobits(anint, width=None): 735 | """ 736 | Produces an array of booleans representing the given argument as 737 | an unsigned integer, MSB first. If width is given, will pad the 738 | MSBs to the given width (but will NOT truncate overflowing 739 | results) 740 | 741 | >>> import lzw 742 | >>> lzw.inttobits(304, width=16) 743 | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0] 744 | 745 | """ 746 | remains = anint 747 | retreverse = [] 748 | while remains: 749 | retreverse.append(remains & 1) 750 | remains = remains >> 1 751 | 752 | retreverse.reverse() 753 | 754 | ret = retreverse 755 | if None != width: 756 | ret_head = [ 0 ] * (width - len(ret)) 757 | ret = ret_head + ret 758 | 759 | return ret 760 | 761 | 762 | def intfrombits(bits): 763 | """ 764 | Given a list of boolean values, interprets them as a binary 765 | encoded, MSB-first unsigned integer (with True == 1 and False 766 | == 0) and returns the result. 
767 | 768 | >>> import lzw 769 | >>> lzw.intfrombits([ 1, 0, 0, 1, 1, 0, 0, 0, 0 ]) 770 | 304 771 | """ 772 | ret = 0 773 | lsb_first = [ b for b in bits ] 774 | lsb_first.reverse() 775 | 776 | for bit_index in range(len(lsb_first)): 777 | if lsb_first[ bit_index ]: 778 | ret = ret | (1 << bit_index) 779 | 780 | return ret 781 | 782 | 783 | def bytestobits(bytesource): 784 | """ 785 | Breaks a given iterable of bytes into an iterable of boolean 786 | values representing those bytes as unsigned integers. 787 | 788 | >>> import lzw 789 | >>> [ x for x in lzw.bytestobits(b"\\x01\\x30") ] 790 | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0] 791 | """ 792 | for b in bytesource: 793 | 794 | value = unpackbyte(b) 795 | 796 | for bitplusone in range(8, 0, -1): 797 | bitindex = bitplusone - 1 798 | nextbit = 1 & (value >> bitindex) 799 | yield nextbit 800 | 801 | 802 | def bitstobytes(bits): 803 | """ 804 | Interprets an indexable list of booleans as bits, MSB first, to be 805 | packed into a list of integers from 0 to 255, MSB first, with LSBs 806 | zero-padded. Note this padding behavior means that round-trips of 807 | bytestobits(bitstobytes(x, width=W)) may not yield what you expect 808 | them to if W % 8 != 0 809 | 810 | Does *NOT* pack the returned values into a bytearray or the like.
811 | 812 | >>> import lzw 813 | >>> bitstobytes([0, 0, 0, 0, 0, 0, 0, 0, "Yes, I'm True"]) == [ 0x00, 0x80 ] 814 | True 815 | >>> bitstobytes([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]) == [ 0x01, 0x30 ] 816 | True 817 | """ 818 | ret = [] 819 | nextbyte = 0 820 | nextbit = 7 821 | for bit in bits: 822 | if bit: 823 | nextbyte = nextbyte | (1 << nextbit) 824 | 825 | if nextbit: 826 | nextbit = nextbit - 1 827 | else: 828 | ret.append(nextbyte) 829 | nextbit = 7 830 | nextbyte = 0 831 | 832 | if nextbit < 7: ret.append(nextbyte) 833 | return ret 834 | 835 | -------------------------------------------------------------------------------- /minipdf/minipdf.py: -------------------------------------------------------------------------------- 1 | ########################################################################## 2 | #### Felipe Andres Manzano * felipe.andres.manzano@gmail.com #### 3 | #### http://twitter.com/feliam * http://wordpress.com/feliam #### 4 | ########################################################################## 5 | import struct 6 | 7 | # For constructing a minimal pdf file 8 | ## PDF REference 3rd edition:: 3.2 Objects 9 | class PDFObject(object): 10 | def __init__(self): 11 | self.n = None 12 | self.v = None 13 | 14 | def __str__(self): 15 | raise NotImplementedError() # NotImplemented is a constant, not an exception, and cannot be called 16 | 17 | 18 | ## PDF REference 3rd edition:: 3.2.1 Booleans Objects 19 | class PDFBool(PDFObject): 20 | def __init__(self, val): 21 | assert type(val) == bool 22 | super(PDFBool, self).__init__() 23 | self.val = val 24 | 25 | def __str__(self): 26 | if self.val: 27 | return "true" 28 | return "false" 29 | 30 | 31 | ## PDF REference 3rd edition:: 3.2.2 Numeric Objects 32 | class PDFNum(PDFObject): 33 | def __init__(self, s): 34 | PDFObject.__init__(self) 35 | self.s = s 36 | 37 | def __str__(self): 38 | return "%s" % self.s 39 | 40 | 41 | ## PDF REference 3rd edition:: 3.2.3 String Objects 42 | class PDFString(PDFObject): 43 | def __init__(self, s): 44 | PDFObject.__init__(self)
45 | self.s = s 46 | 47 | def __str__(self): 48 | return "(%s)" % self.s.replace(")", "\\%03o" % ord(")")) 49 | 50 | 51 | ## PDF REference 3rd edition:: 3.2.3 String Objects / Hexadecimal Strings 52 | class PDFHexString(PDFObject): 53 | def __init__(self, s): 54 | PDFObject.__init__(self) 55 | self.s = s 56 | 57 | def __str__(self): 58 | return "<" + "".join(["%02x" % ord(c) for c in self.s]) + ">" 59 | 60 | 61 | ## A convenient type of literal Strings 62 | class PDFOctalString(PDFObject): 63 | def __init__(self, s): 64 | PDFObject.__init__(self) 65 | self.s = "".join(["\\%03o" % ord(c) for c in s]) 66 | 67 | def __str__(self): 68 | return "(%s)" % self.s 69 | 70 | 71 | ## PDF REference 3rd edition:: 3.2.4 Name Objects 72 | class PDFName(PDFObject): 73 | def __init__(self, s): 74 | PDFObject.__init__(self) 75 | self.s = s 76 | 77 | def __str__(self): 78 | whitespaces = "\x00\x09\x0a\x0c\x0d\x20" 79 | delimiters = "()<>[]{}/%" 80 | badchars = whitespaces + delimiters + "#" 81 | result = "/" 82 | for c in self.s: 83 | if c in badchars: 84 | result += "#%02x" % ord(c) 85 | else: 86 | result += c 87 | return result 88 | 89 | 90 | ## PDF REference 3rd edition:: 3.2.5 Array Objects 91 | class PDFArray(PDFObject): 92 | def __init__(self, s): 93 | PDFObject.__init__(self) 94 | assert type(s) == type([]) 95 | self.s = s 96 | 97 | def append(self, o): 98 | self.s.append(o) 99 | return self 100 | 101 | def __len__(self): 102 | return len(self.s) 103 | 104 | def __str__(self): 105 | return "[%s]" % (" ".join([o.__str__() for o in self.s])) 106 | 107 | 108 | ## PDF REference 3rd edition:: 3.2.6 Dictionary Objects 109 | class PDFDict(PDFObject, dict): 110 | def __init__(self, d={}): 111 | super(PDFDict, self).__init__() 112 | for k in d: 113 | self[k] = d[k] 114 | 115 | def __str__(self): 116 | s = "<<" 117 | for name in self: 118 | s += "%s %s " % (PDFName(name), self[name]) 119 | s += ">>" 120 | return s 121 | 122 | 123 | ## PDF REference 3rd edition:: 3.2.7 Stream Objects 124 
| class PDFStream(PDFDict): 125 | def __init__(self, stream=""): 126 | super(PDFStream, self).__init__() 127 | self.stream = stream 128 | self.filtered = self.stream 129 | self["Length"] = len(stream) 130 | self.filters = [] 131 | 132 | def appendFilter(self, filter): 133 | self.filters.append(filter) 134 | self._applyFilters() # yeah every time .. so what! 135 | 136 | def _applyFilters(self): 137 | self.filtered = self.stream 138 | for f in reversed(self.filters): 139 | self.filtered = f.encode(self.filtered) 140 | if len(self.filters) > 0: 141 | self["Length"] = len(self.filtered) 142 | self["Filter"] = PDFArray([PDFName(f.name) for f in self.filters]) 143 | # Add Filter parameters ? 144 | 145 | def __str__(self): 146 | self._applyFilters() # yeah every time .. so what! 147 | s = "" 148 | s += PDFDict.__str__(self) 149 | s += "\nstream\n" 150 | s += self.filtered 151 | s += "\nendstream" 152 | return s 153 | 154 | 155 | ## PDF REference 3rd edition:: 3.2.8 Null Object 156 | class PDFNull(PDFObject): 157 | def __init__(self): 158 | PDFObject.__init__(self) 159 | 160 | def __str__(self): 161 | return "null" 162 | 163 | 164 | ## PDF REference 3rd edition:: 3.2.9 Indirect Objects 165 | class PDFRef(PDFObject): 166 | def __init__(self, obj): 167 | PDFObject.__init__(self) 168 | self.obj = [obj] 169 | 170 | def __str__(self): 171 | if self.obj[0].n is None: 172 | raise Exception( 173 | "Cannot take a reference of " 174 | + str(self.obj[0]) 175 | + " because it is not added to any PDF Document" 176 | ) 177 | 178 | return "%d %d R" % (self.obj[0].n, self.obj[0].v) 179 | 180 | 181 | ## PDF REference 3rd edition:: 3.4 File Structure 182 | ## Simplest file structure...
183 | class PDFDoc(list): 184 | def __init__(self, obfuscate=0): 185 | self.info = None 186 | self.root = None 187 | 188 | def setRoot(self, root): 189 | self.root = root 190 | 191 | def setInfo(self, info): 192 | self.info = info 193 | 194 | def __iadd__(self, x): 195 | self.append(x) 196 | return self 197 | 198 | def append(self, obj): 199 | assert isinstance(obj, PDFObject) 200 | if obj.v != None or obj.n != None: 201 | raise Exception( 202 | "Object " + repr(obj) + " has been already added to a PDF Document!" 203 | ) 204 | obj.v = 0 205 | obj.n = 1 + len(self) 206 | super(PDFDoc, self).append(obj) 207 | 208 | def __str__(self): 209 | doc1 = "%PDF-1.5\n%\xE7\xF3\xCF\xD3\n" 210 | xref = {} 211 | for obj in self: 212 | # doc1+=file('/dev/urandom','r').read(100000) 213 | xref[obj.n] = len(doc1) 214 | doc1 += "%d %d obj\n" % (obj.n, obj.v) 215 | doc1 += str(obj) 216 | doc1 += "\nendobj\n" 217 | # doc1+=file('/dev/urandom','r').read(100000) 218 | posxref = len(doc1) 219 | doc1 += "xref\n" 220 | doc1 += "0 %d\n" % (len(self) + 1) 221 | doc1 += "0000000000 65535 f \n" 222 | for xr in xref.keys(): 223 | doc1 += "%010d %05d n \n" % (xref[xr], 0) 224 | doc1 += "trailer\n" 225 | trailer = PDFDict() 226 | trailer["Size"] = len(self) + 1 227 | trailer["Root"] = PDFRef(self.root) 228 | if self.info: 229 | trailer["Info"] = PDFRef(self.info) 230 | doc1 += str(trailer) 231 | doc1 += "\nstartxref\n%d\n" % posxref 232 | doc1 += "%%EOF" 233 | return doc1 234 | -------------------------------------------------------------------------------- /minipdf/minipdfo.py: -------------------------------------------------------------------------------- 1 | ## UNFINISHED -UNFINISHED -UNFINISHED -UNFINISHED -UNFINISHED -UNFINISHED -UNFINISHED -UNFINISHED - 2 | ## This is a preliminary version, lots of tricks are still to be implemented... 3 | ## Anyway if you find some trick not coded here, let me know! 4 | ## Let me implement some of the ways... 
5 | ## http://feliam.wordpress.com 6 | 7 | ## Ref: Some of this first pointed out in ... 8 | ## http://blog.didierstevens.com/2008/04/29/pdf-let-me-count-the-ways/ 9 | 10 | 11 | import random 12 | 13 | decoys=[ 14 | '''<>''', 15 | '''<>''', 16 | '''<> >> >> /Type /Page /Contents 4 0 R >>''', 17 | '''4 0 obj''', 18 | '''endstream''', 19 | '''endobj''', 20 | '''xref''', 21 | '''0000000339 00000 n ''', 22 | '''trailer''', 23 | '''<>''', 24 | '''startxref''', 25 | ] 26 | delimiters = list('()<>[]{}/%') 27 | whitespaces = list('\x20\x0a\x0c\x0d') 28 | EOL = '\x0A' 29 | 30 | def putSome(l): 31 | some = "" 32 | size = random.randint(0,5) 33 | for i in range(0,size): 34 | some += random.choice(l) 35 | return some 36 | 37 | def getSeparator(): 38 | if random.randint(0,100)<40: 39 | return random.choice(["\00","\x09","\x0a","\x0d","\x20"]) 40 | elif random.randint(0,100)<101: 41 | return "%"+random.choice(decoys)+EOL 42 | 43 | import struct 44 | 45 | #For constructing a minimal pdf file 46 | class PDFObject: 47 | def __init__(self): 48 | self.n=None 49 | self.v=None 50 | 51 | def __str__(self): 52 | raise NotImplementedError("Fail") 53 | 54 | class PDFDict(PDFObject): 55 | def __init__(self, d={}): 56 | PDFObject.__init__(self) 57 | self.dict = {} 58 | for k in d: 59 | self.dict[k]=d[k] 60 | 61 | def add(self,name,obj): 62 | self.dict[name] = obj 63 | 64 | def __str__(self): 65 | s="<<" 66 | s+=random.choice(["\00","\x09","\x0a","\x0c","\x0d","\x20"]) 67 | for name in self.dict: 68 | s+="%s"%PDFName(name).__str__() 69 | s+=getSeparator() 70 | s+="%s"%self.dict[name] 71 | s+=getSeparator() 72 | s+=">>" 73 | s+=getSeparator() 74 | return s 75 | 76 | class PDFStream(PDFDict): 77 | def __init__(self,stream=""): 78 | PDFDict.__init__(self) 79 | self.stream=stream 80 | self.filtered=self.stream 81 | self.filters = [] 82 | 83 | def appendFilter(self, filter): 84 | self.filters.append(filter) 85 | self._applyFilters() #yeah every time .. so what!
86 | 87 | def _applyFilters(self): 88 | self.filtered = self.stream 89 | for f in self.filters: 90 | self.filtered = f.encode(self.filtered) 91 | self.add('Length', len(self.filtered)) 92 | if len(self.filters)>0: 93 | self.add('Filter', PDFArray([PDFName(f.name) for f in self.filters])) 94 | #Add Filter parameters ? 95 | 96 | def __str__(self): 97 | self._applyFilters() #yeah every time .. so what! 98 | s="" 99 | s+=PDFDict.__str__(self) 100 | s+="\nstream\n" 101 | s+=self.filtered 102 | s+="\nendstream" 103 | return s 104 | 105 | class PDFArray(PDFObject): 106 | def __init__(self,s): 107 | PDFObject.__init__(self) 108 | self.s=s 109 | def __str__(self): 110 | return "[%s]"%(random.choice(whitespaces).join([ o.__str__() for o in self.s])) 111 | 112 | 113 | ##7.3.5 Name Objects 114 | class PDFName(PDFObject): 115 | def __init__(self,s): 116 | PDFObject.__init__(self) 117 | self.s=s 118 | def __str__(self): 119 | obfuscated = "" 120 | for c in self.s: 121 | r=random.randint(0,100) 122 | if (ord(c)<=ord('!') or ord(c) >= ord('~')) or r < 50: 123 | obfuscated+='#%02x'%ord(c) 124 | else: 125 | obfuscated+=c 126 | return "/%s"%obfuscated 127 | 128 | ##7.3.4.3 Hexadecimal Strings 129 | class PDFHexString(PDFObject): 130 | def __init__(self,s): 131 | PDFObject.__init__(self) 132 | self.s=s 133 | 134 | def __str__(self): 135 | return "<" + "".join(["%02x"%ord(c) for c in self.s]) + ">" 136 | 137 | class PDFOctalString(PDFObject): 138 | def __init__(self,s): 139 | PDFObject.__init__(self) 140 | self.s="".join(["\\%03o"%ord(c) for c in s]) 141 | 142 | def __str__(self): 143 | return "(%s)"%self.s 144 | 145 | 146 | ##7.3.4.2 Literal Strings 147 | class PDFString(PDFObject): 148 | escapes = {'\x0a': '\\n', 149 | '\x0d': '\\r', 150 | '\x09': '\\t', 151 | '\x08': '\\b', 152 | '\x0c': '\\f', 153 | '(': '\\(', 154 | ')': '\\)', 155 | '\\': '\\\\', } 156 | 157 | def __init__(self,s): 158 | PDFObject.__init__(self) 159 | self.s=s 160 | 161 | def __str__(self): 162 | if random.randint(0,100)
< 10: 163 | return PDFHexString(self.s).__str__() 164 | obfuscated = "" 165 | for c in self.s: 166 | if random.randint(0,100)>70: 167 | obfuscated+='\\%03o'%ord(c) 168 | elif c in self.escapes.keys(): 169 | obfuscated+=self.escapes[c] 170 | else: 171 | obfuscated+=c 172 | if random.randint(0,100) <10 : 173 | obfuscated+='\\\n' 174 | return "(%s)"%obfuscated 175 | 176 | 177 | class PDFNum(PDFObject): 178 | def __init__(self,s): 179 | PDFObject.__init__(self) 180 | self.s=s 181 | 182 | def __str__(self): 183 | sign = "" 184 | if random.randint(0,100)>50: 185 | if self.s>0: 186 | sign = '+' 187 | elif self.s<0: 188 | sign = '-' 189 | elif random.randint(0,100)>50: 190 | sign = '-' 191 | else: 192 | sign = '+' 193 | obfuscated = "" 194 | obfuscated += sign 195 | obfuscated += putSome(['0']) 196 | obfuscated += "%s"%self.s 197 | if type(self.s)==type(0): 198 | if random.randint(0,100)>60: 199 | obfuscated += "."+putSome(['0']) 200 | else: 201 | if random.randint(0,100)>60: 202 | obfuscated += putSome(['0']) 203 | return obfuscated 204 | 205 | class PDFBool(PDFObject): 206 | def __init__(self,s): 207 | PDFObject.__init__(self) 208 | self.s=s 209 | 210 | def __str__(self): 211 | if self.s: 212 | return "true" 213 | return "false" 214 | 215 | class PDFRef(PDFObject): 216 | def __init__(self,obj): 217 | PDFObject.__init__(self) 218 | self.obj=[obj] 219 | def __str__(self): 220 | return "%d %d R"%(self.obj[0].n,self.obj[0].v) 221 | 222 | class PDFNull(PDFObject): 223 | def __init__(self): 224 | PDFObject.__init__(self) 225 | 226 | def __str__(self): 227 | return "null" 228 | 229 | 230 | class PDFDoc(): 231 | def __init__(self,obfuscate=0): 232 | self.objs=[] 233 | self.info=None 234 | self.root=None 235 | 236 | def setRoot(self,root): 237 | self.root=root 238 | 239 | def setInfo(self,info): 240 | self.info=info 241 | 242 | def _add(self,obj): 243 | obj.v=0 244 | obj.n=1+len(self.objs) 245 | self.objs.append(obj) 246 | 247 | def add(self,obj): 248 | if type(obj) != type([]): 
249 | self._add(obj) 250 | else: 251 | for o in obj: 252 | self._add(o) 253 | 254 | def _header(self): 255 | ##Adobe supplement to ISO 32000 3.4.1 File Header 256 | header = "%"+random.choice(['!PS','PDF'])+"-%d.%d"%(random.randint(0,0xffffffff),random.randint(0,0xffffffff)) 257 | crap = "" 258 | while len(crap) < random.randint(4,1024): 259 | crap=crap+chr(random.choice(list(set(range(0,256))- set([0x0a,0x0d])))) 260 | while len(header)<1024: 261 | header = chr(random.randint(0,255))+header 262 | 263 | return header+"\n%"+crap+"\n" 264 | 265 | def __str__(self): 266 | doc1 = self._header() 267 | xref = {} 268 | for obj in self.objs: 269 | xref[obj.n] = len(doc1) 270 | doc1+="%d %d obj\n"%(obj.n,obj.v) 271 | doc1+=obj.__str__() 272 | doc1+="\nendobj\n" 273 | posxref=len(doc1) 274 | doc1+="xref\n" 275 | doc1+="0 %d\n"%(len(self.objs)+1) 276 | doc1+="0000000000 65535 f \n" 277 | for xr in xref.keys(): 278 | doc1+= "%010d %05d n \n"%(xref[xr],0) 279 | doc1+="trailer\n" 280 | trailer = PDFDict() 281 | trailer.add("Size",len(self.objs)+1) 282 | trailer.add("Root",PDFRef(self.root)) 283 | if self.info: 284 | trailer.add("Info",PDFRef(self.info)) 285 | doc1+=trailer.__str__() 286 | doc1+="\nstartxref\n%d\n"%posxref 287 | doc1+="%%EOF" 288 | 289 | return doc1 290 | 291 | -------------------------------------------------------------------------------- /scripts/mkpdfT20170432.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | #More info at http://github.com/feliam/miniPDF 3 | ''' 4 | Make a minimal PDF file displaying some text.
5 | ''' 6 | import sys, StringIO 7 | from minipdf import PDFDoc, PDFName, PDFNum, PDFString, PDFRef, PDFStream, PDFDict, PDFArray 8 | 9 | import optparse 10 | parser = optparse.OptionParser(usage="%prog [options] [TEXT]", description=__doc__) 11 | parser.add_option("-i", "--input", metavar="IFILE", help="read text from IFILE (otherwise stdin)") 12 | parser.add_option("-o", "--output", metavar="OFILE", help="write output to OFILE (otherwise stdout)") 13 | (options, args) = parser.parse_args() 14 | 15 | if options.input: 16 | file_input = file(options.input) 17 | elif len(args) > 0: 18 | file_input = StringIO.StringIO(" ".join(args)) 19 | else: 20 | file_input = sys.stdin 21 | 22 | 23 | if not options.output is None: 24 | file_output = file(options.output, "w") 25 | else: 26 | file_output = sys.stdout 27 | 28 | 29 | #The document 30 | doc = PDFDoc() 31 | 32 | #font 33 | font = PDFDict() 34 | font['Name'] = PDFName('F1') 35 | font['Subtype'] = PDFName('Type1') 36 | font['BaseFont'] = PDFName('Helvetica') 37 | 38 | #name:font map 39 | fontname = PDFDict() 40 | fontname['F1'] = font 41 | 42 | #resources 43 | resources = PDFDict() 44 | resources['Font'] = fontname 45 | doc += resources 46 | 47 | #contents 48 | contents = PDFStream((file_input.read()).encode('zlib') ) 49 | doc += contents 50 | 51 | contents['Filter'] = PDFName('FlateDecode') 52 | decodeparams = PDFDict() 53 | decodeparams['Columns'] = PDFNum(2) 54 | decodeparams['Colors'] = PDFNum(1) 55 | decodeparams['BitsPerComponent'] = PDFNum(16) 56 | decodeparams['Predictor'] = PDFNum(2) 57 | contents['DecodeParms'] = decodeparams 58 | 59 | #page 60 | page = PDFDict() 61 | page['Type'] = PDFName('Page') 62 | page['MediaBox'] = PDFArray([0, 0, 612, 792]) 63 | page['Contents'] = PDFRef(contents) 64 | page['Resources'] = PDFRef(resources) 65 | doc += page 66 | 67 | #pages 68 | pages = PDFDict() 69 | pages['Type'] = PDFName('Pages') 70 | pages['Kids'] = PDFArray([PDFRef(page)]) 71 | pages['Count'] = PDFNum(1) 72 | doc += 
pages 73 | 74 | #add parent reference in page 75 | page['Parent'] = PDFRef(pages) 76 | 77 | #catalog 78 | catalog = PDFDict() 79 | catalog['Type'] = PDFName('Catalog') 80 | catalog['Pages'] = PDFRef(pages) 81 | doc += catalog 82 | 83 | doc.setRoot(catalog) 84 | 85 | file_output.write(str(doc)) 86 | 87 | #@feliam 88 | -------------------------------------------------------------------------------- /scripts/mkpdfjpeg.py: -------------------------------------------------------------------------------- 1 | ''' This generates a minimal pdf that renders a jpeg file ''' 2 | import os 3 | import zlib 4 | import sys 5 | from minipdf import * 6 | 7 | def make(data): 8 | doc = PDFDoc() 9 | 10 | #pages 11 | pages = PDFDict() 12 | pages['Type'] = PDFName('Pages') 13 | 14 | #catalog 15 | catalog = PDFDict() 16 | catalog['Type'] = PDFName('Catalog') 17 | catalog['Pages'] = PDFRef(pages) 18 | 19 | #lets add those to doc just for showing up the Ref object. 20 | doc+=catalog 21 | doc+=pages 22 | #Set the pdf root 23 | doc.setRoot(catalog) 24 | 25 | try: 26 | from PIL import Image 27 | from StringIO import StringIO 28 | im = Image.open( StringIO(data)) 29 | _width, _height = im.size 30 | filter_name = {'JPEG2000':'JPXDecode', 31 | 'JPEG': 'DCTDecode'} [im.format] 32 | except Exception as e: 33 | _width=256 34 | _height=256 35 | filter_name='DCTDecode' # fallback when PIL is unavailable: assume a plain JPEG 36 | ##XOBJECT 37 | xobj = PDFStream(data) 38 | xobj['Type'] = PDFName('XObject') 39 | xobj['Subtype'] = PDFName('Image') 40 | xobj['Width'] = PDFNum(_width) 41 | xobj['Height'] = PDFNum(_height) 42 | xobj['ColorSpace'] = PDFName('DeviceRGB') 43 | xobj['BitsPerComponent'] = PDFNum(8) 44 | xobj['Filter'] = PDFName(filter_name) 45 | 46 | contents = PDFStream('q %d 0 0 %d 0 0 cm /Im1 Do Q' % (_width, _height)) 47 | resources = PDFDict() 48 | resources['ProcSet'] = PDFArray([PDFName('PDF'), PDFName('ImageC'), PDFName('ImageI'), PDFName('ImageB')]) 49 | 50 | Im1=PDFDict() 51 | Im1['Im1'] = PDFRef(xobj) 52 | resources['XObject'] = Im1 53 | 54 | #The pdf page 55
| page = PDFDict() 56 | page['Type'] = PDFName('Page') 57 | page['Parent'] = PDFRef(pages) 58 | page['MediaBox'] = PDFArray([ 0, 0, _width, _height]) 59 | page['Contents'] = PDFRef(contents) 60 | page['Resources'] = PDFRef(resources) 61 | 62 | [doc.append(x) for x in [xobj, contents, resources, page]] 63 | pages['Count'] = PDFNum(1) 64 | pages['Kids'] = PDFArray([PDFRef(page)]) 65 | return doc 66 | 67 | 68 | 69 | ##Main 70 | if __name__=='__main__': 71 | print make(file(sys.argv[1], 'rb').read()) 72 | 73 | -------------------------------------------------------------------------------- /scripts/mkpdfjs: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | #More info at http://github.com/feliam/miniPDF 3 | ''' 4 | Make a minimal PDF file displaying some text. 5 | ''' 6 | import sys, StringIO 7 | from minipdf import PDFDoc, PDFName, PDFNum, PDFString, PDFRef, PDFStream, PDFDict, PDFArray 8 | 9 | import optparse 10 | parser = optparse.OptionParser(usage='%prog [options] [TEXT]', description=__doc__) 11 | parser.add_option('-i', '--input', metavar='IFILE', help='read text from IFILE (otherwise stdin)') 12 | parser.add_option('-o', '--output', metavar='OFILE', help='write output to OFILE (otherwise stdout)') 13 | (options, args) = parser.parse_args() 14 | 15 | if options.input: 16 | file_input = file(options.input) 17 | elif len(args) > 0: 18 | file_input = StringIO.StringIO(' '.join(args)) 19 | else: 20 | file_input = sys.stdin 21 | 22 | 23 | if not options.output is None: 24 | file_output = file(options.output, 'w') 25 | else: 26 | file_output = sys.stdout 27 | 28 | 29 | #The document 30 | doc = PDFDoc() 31 | 32 | #font 33 | font = PDFDict() 34 | font['Name'] = PDFName('F1') 35 | font['Subtype'] = PDFName('Type1') 36 | font['BaseFont'] = PDFName('Helvetica') 37 | 38 | #name:font map 39 | fontname = PDFDict() 40 | fontname['F1'] = font 41 | 42 | #resources 43 | resources = PDFDict() 44 | resources['Font'] = 
fontname 45 | doc += resources 46 | 47 | #contents 48 | contents = PDFStream('''BT 49 | /F1 24 Tf 0 700 Td 50 | %s Tj 51 | ET 52 | '''%PDFString('Mini pdf lib')) 53 | doc += contents 54 | 55 | #Example js.. 56 | #console.println('ALERT!'); 57 | 58 | #Add OpenAction javascript to the Document 59 | jsStream = PDFStream(file_input.read()) 60 | doc += jsStream 61 | 62 | actionJS = PDFDict() 63 | actionJS['S'] = PDFName('JavaScript') 64 | actionJS['JS'] = PDFRef(jsStream) 65 | doc += actionJS 66 | 67 | #page 68 | page = PDFDict() 69 | page['Type'] = PDFName('Page') 70 | page['MediaBox'] = PDFArray([0, 0, 612, 792]) 71 | page['Contents'] = PDFRef(contents) 72 | page['Resources'] = PDFRef(resources) 73 | doc += page 74 | 75 | #pages 76 | pages = PDFDict() 77 | pages['Type'] = PDFName('Pages') 78 | pages['Kids'] = PDFArray([PDFRef(page)]) 79 | pages['Count'] = PDFNum(1) 80 | doc += pages 81 | 82 | #add parent reference in page 83 | page['Parent'] = PDFRef(pages) 84 | 85 | #catalog 86 | catalog = PDFDict() 87 | catalog['Type'] = PDFName('Catalog') 88 | catalog['Pages'] = PDFRef(pages) 89 | catalog['OpenAction'] = PDFRef(actionJS) 90 | doc += catalog 91 | 92 | doc.setRoot(catalog) 93 | 94 | file_output.write(str(doc)) 95 | 96 | #@feliam 97 | -------------------------------------------------------------------------------- /scripts/mkpdfotf: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | #More info at http://github.com/feliam/miniPDF 3 | ''' 4 | Make a minimal PDF file displaying some text using a specified OTF font file.
5 | ''' 6 | 7 | import sys, StringIO 8 | from minipdf import PDFDoc, PDFName, PDFNum, PDFString, PDFRef, PDFStream, PDFDict, PDFArray 9 | 10 | import optparse 11 | parser = optparse.OptionParser(usage="%prog [options] [TEXT]", description=__doc__) 12 | parser.add_option("-i", "--input", metavar="IFILE", help="read text from IFILE (otherwise stdin)") 13 | parser.add_option("-o", "--output", metavar="OFILE", help="write output to OFILE (otherwise stdout)") 14 | (options, args) = parser.parse_args() 15 | 16 | if options.input: 17 | file_input = file(options.input) 18 | elif len(args) > 0: 19 | file_input = StringIO.StringIO(" ".join(args)) 20 | else: 21 | file_input = sys.stdin 22 | 23 | 24 | if not options.output is None: 25 | file_output = file(options.output, "w") 26 | else: 27 | file_output = sys.stdout 28 | 29 | 30 | #The document 31 | doc = PDFDoc() 32 | 33 | #the font file 34 | fontstream = PDFStream(file_input.read()) 35 | fontstream['Subtype'] = PDFName("OpenType") 36 | doc += fontstream 37 | #fontdescriptor 38 | fontdesc = PDFDict() 39 | fontdesc['Type'] = PDFName("FontDescriptor") 40 | #fontdesc['ItalicAngle'] = PDFNum(0)) 41 | #fontdesc['Descent'] = PDFNum(-1)) 42 | #fontdesc['FontFamily'] = PDFString("Minion Pro SmBd")) 43 | fontdesc['Flags'] = PDFNum(34) 44 | #fontdesc['XHeight'] = PDFNum(440)) 45 | #fontdesc['FontWeight'] = PDFNum(600)) 46 | #fontdesc['StemV'] = PDFNum(112)) 47 | #fontdesc['Ascent'] = PDFNum(1011)) 48 | #fontdesc['CapHeight'] = PDFNum(651)) 49 | #fontdesc['FontStrech'] = PDFName("Normal")) 50 | #fontdesc['FontBox'] = "[-308 -360 1684 1011]") 51 | fontdesc['FontFile3'] = PDFRef(fontstream) 52 | #fontdesc['FontName'] = PDFName("KIOPAZ+MinionPro-Semibold")) 53 | 54 | 55 | doc += fontdesc 56 | 57 | #font 58 | font = PDFDict() 59 | font['Type'] = PDFName("Font") 60 | font['Subtype'] = PDFName("TrueType") 61 | #font['FirstChar'] = PDFNum(31)) 62 | #font['LastChar'] = PDFNum(122)) 63 | #font['MediaBox'] = "[10 10 1000 1000]") 64 | font['BBox'] 
= "[10 12 10 12 100 12 100]" 65 | font['BaseFont'] = PDFName("s") 66 | font['Encoding'] = PDFName("WinAnsiEncoding") 67 | font['FontDescriptor'] = PDFRef(fontdesc) 68 | doc += font 69 | 70 | #name:font map 71 | fontnames = PDFDict() 72 | fontnames['F1'] = PDFRef(font) 73 | 74 | #resources 75 | resources = PDFDict() 76 | resources['Font'] = fontnames 77 | 78 | #contents 79 | contents= PDFStream( 80 | '''BT 81 | /F1 24 Tf 82 | 240 700 Td 83 | (uCon 2009) Tj 84 | ET''') 85 | doc += contents 86 | 87 | #page 88 | page = PDFDict() 89 | page['Type'] = PDFName("Page") 90 | page['Resources'] = resources 91 | page['Contents'] = PDFRef(contents) 92 | doc += page 93 | 94 | #contents 95 | dummycontents= PDFStream( 96 | '''fegwbhk qew; sd/ dsfls d ''') 97 | doc += dummycontents 98 | 99 | #page 100 | dummypage = PDFDict() 101 | dummypage['Type'] = PDFName("Page") 102 | dummypage['Resources'] = resources 103 | dummypage['Contents'] = PDFRef(dummycontents) 104 | doc += dummypage 105 | 106 | 107 | #pages 108 | pages = PDFDict() 109 | pages['Type'] = PDFName("Pages") 110 | pages['Kids'] = PDFArray([PDFRef(dummypage),PDFRef(page)]) 111 | pages['Count'] = PDFNum(2) 112 | doc += pages 113 | 114 | #add parent reference in page 115 | page['Parent'] = PDFRef(pages) 116 | #add parent reference in page 117 | dummypage['Parent'] = PDFRef(pages) 118 | 119 | 120 | #catalog 121 | catalog = PDFDict() 122 | catalog['Type'] = PDFName("Catalog") 123 | catalog['Pages'] = PDFRef(pages) 124 | doc += catalog 125 | doc.setRoot(catalog) 126 | 127 | file_output.write(str(doc)) 128 | 129 | -------------------------------------------------------------------------------- /scripts/mkpdftext: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | #More info at http://github.com/feliam/miniPDF 3 | ''' 4 | Make a minimal PDF file displaying some text. 
5 | ''' 6 | import sys 7 | try: 8 | from StringIO import StringIO 9 | except ImportError: 10 | from io import StringIO 11 | from minipdf import PDFDoc, PDFName, PDFNum, PDFString, PDFRef, PDFStream, PDFDict, PDFArray 12 | 13 | import optparse 14 | parser = optparse.OptionParser(usage="%prog [options] [TEXT]", description=__doc__) 15 | parser.add_option("-i", "--input", metavar="IFILE", help="read text from IFILE (otherwise stdin)") 16 | parser.add_option("-o", "--output", metavar="OFILE", help="write output to OFILE (otherwise stdout)") 17 | (options, args) = parser.parse_args() 18 | 19 | if options.input: 20 | file_input = open(options.input) 21 | elif len(args) > 0: 22 | file_input = StringIO(" ".join(args)) 23 | else: 24 | file_input = sys.stdin 25 | 26 | 27 | if not options.output is None: 28 | file_output = open(options.output, "w") 29 | else: 30 | file_output = sys.stdout 31 | 32 | 33 | #The document 34 | doc = PDFDoc() 35 | 36 | #font 37 | font = PDFDict() 38 | font['Name'] = PDFName('F1') 39 | font['Subtype'] = PDFName('Type1') 40 | font['BaseFont'] = PDFName('Helvetica') 41 | 42 | #name:font map 43 | fontname = PDFDict() 44 | fontname['F1'] = font 45 | 46 | #resources 47 | resources = PDFDict() 48 | resources['Font'] = fontname 49 | doc += resources 50 | 51 | #contents 52 | contents = PDFStream('''BT 53 | /F1 24 Tf 0 700 Td 54 | %s Tj 55 | ET 56 | '''%PDFString(file_input.read())) 57 | doc += contents 58 | 59 | #page 60 | page = PDFDict() 61 | page['Type'] = PDFName('Page') 62 | page['MediaBox'] = PDFArray([0, 0, 612, 792]) 63 | page['Contents'] = PDFRef(contents) 64 | page['Resources'] = PDFRef(resources) 65 | doc += page 66 | 67 | #pages 68 | pages = PDFDict() 69 | pages['Type'] = PDFName('Pages') 70 | pages['Kids'] = PDFArray([PDFRef(page)]) 71 | pages['Count'] = PDFNum(1) 72 | doc += pages 73 | 74 | #add parent reference in page 75 | page['Parent'] = PDFRef(pages) 76 | 77 | #catalog 78 | catalog = PDFDict() 79 | catalog['Type'] = PDFName('Catalog') 80 |
catalog['Pages'] = PDFRef(pages) 81 | doc += catalog 82 | 83 | doc.setRoot(catalog) 84 | 85 | file_output.write(str(doc)) 86 | 87 | #@feliam 88 | -------------------------------------------------------------------------------- /scripts/mkpdfxfa: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | #More info at http://github.com/feliam/miniPDF 3 | ''' 4 | Make a minimal PDF with XFA 5 | ''' 6 | import sys, StringIO 7 | from minipdf import PDFDoc, PDFName, PDFNum, PDFString, PDFRef, PDFStream, PDFDict, PDFArray, PDFBool 8 | 9 | import optparse 10 | parser = optparse.OptionParser(usage="%prog [options] [TEXT]", description=__doc__) 11 | parser.add_option("-i", "--input", metavar="IFILE", help="read XFA from IFILE (otherwise stdin)") 12 | parser.add_option("-o", "--output", metavar="OFILE", help="write output to OFILE (otherwise stdout)") 13 | (options, args) = parser.parse_args() 14 | 15 | if options.input: 16 | file_input = file(options.input) 17 | elif len(args) > 0: 18 | file_input = StringIO.StringIO(" ".join(args)) 19 | else: 20 | file_input = sys.stdin 21 | 22 | 23 | if not options.output is None: 24 | file_output = file(options.output, "w") 25 | else: 26 | file_output = sys.stdout 27 | 28 | 29 | #The document 30 | doc = PDFDoc() 31 | 32 | #font 33 | font = PDFDict() 34 | font['Name'] = PDFName('F1') 35 | font['Subtype'] = PDFName('Type1') 36 | font['BaseFont'] = PDFName('Helvetica') 37 | 38 | #name:font map 39 | fontname = PDFDict() 40 | fontname['F1'] = font 41 | 42 | #resources 43 | resources = PDFDict() 44 | resources['Font'] = fontname 45 | doc += resources 46 | 47 | #contents 48 | contents = PDFStream('''BT 49 | /F1 24 Tf 0 700 Td 50 | %s Tj 51 | ET 52 | '''%PDFString(file_input.read())) 53 | doc += contents 54 | 55 | #page 56 | page = PDFDict() 57 | page['Type'] = PDFName('Page') 58 | page['MediaBox'] = PDFArray([0, 0, 612, 792]) 59 | page['Contents'] = PDFRef(contents) 60 | page['Resources'] = 
PDFRef(resources) 61 | doc += page 62 | 63 | #pages 64 | pages = PDFDict() 65 | pages['Type'] = PDFName('Pages') 66 | pages['Kids'] = PDFArray([PDFRef(page)]) 67 | pages['Count'] = PDFNum(1) 68 | doc += pages 69 | 70 | #add parent reference in page 71 | page['Parent'] = PDFRef(pages) 72 | 73 | 74 | xfa = PDFStream(file(sys.argv[1]).read()) 75 | doc+=xfa 76 | 77 | #form 78 | form = PDFDict() 79 | form['XFA'] = PDFRef(xfa) 80 | doc+=form 81 | 82 | #catalog 83 | catalog = PDFDict() 84 | catalog['Type'] = PDFName("Catalog") 85 | catalog['Pages'] =PDFRef(pages) 86 | catalog['NeedsRendering'] = PDFBool(True) 87 | catalog['AcroForm'] = PDFRef(form) 88 | 89 | 90 | adbe = PDFDict() 91 | adbe['BaseVersion'] = '/1.7' 92 | adbe['ExtensionLevel'] = PDFNum(3) 93 | 94 | extensions = PDFDict() 95 | extensions['ADBE'] = adbe 96 | 97 | #attach the extensions to the catalog built above; creating a new 98 | #catalog here would drop /AcroForm and /NeedsRendering 99 | catalog['Extensions'] = extensions 100 | 101 | 102 | 103 | doc += catalog 104 | 105 | doc.setRoot(catalog) 106 | 107 | file_output.write(str(doc)) 108 | 109 | #@feliam 110 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from setuptools import setup, find_packages 3 | from setuptools.command.test import test as TestCommand 4 | import io 5 | import codecs 6 | import os 7 | import sys 8 | 9 | import minipdf 10 | 11 | def read(*filenames, **kwargs): 12 | encoding = kwargs.get('encoding', 'utf-8') 13 | sep = kwargs.get('sep', '\n') 14 | buf = [] 15 | for filename in filenames: 16 | with io.open(filename, encoding=encoding) as f: 17 | buf.append(f.read()) 18 | return sep.join(buf) 19 | 20 | long_description = read('README.md') 21 | 22 | class PyTest(TestCommand): 23 | def finalize_options(self): 24 | TestCommand.finalize_options(self) 25 | self.test_args = [] 26 | self.test_suite = True 27 | 28 | def
run_tests(self): 29 | import pytest 30 | errcode = pytest.main(self.test_args) 31 | sys.exit(errcode) 32 | 33 | setup( 34 | name='minipdf', 35 | version=minipdf.__version__, 36 | url='https://github.com/feliam/miniPDF', 37 | license='Apache Software License', 38 | author='Felipe Andres Manzano', 39 | tests_require=['pytest'], 40 | install_requires=[], 41 | cmdclass={'test': PyTest}, 42 | author_email='feliam@binamuse.com', 43 | description='A python library for making PDF files in a very low level way.', 44 | long_description=long_description, 45 | packages=['minipdf'], 46 | include_package_data=True, 47 | platforms='any', 48 | test_suite='minipdf.test.test_minipdf', 49 | classifiers = [ 50 | 'Programming Language :: Python', 51 | 'Development Status :: 4 - Beta', 52 | 'Natural Language :: English', 53 | 'Intended Audience :: Developers', 54 | 'License :: OSI Approved :: Apache Software License', 55 | 'Operating System :: OS Independent', 56 | 'Topic :: Software Development :: Libraries :: Python Modules', 57 | 'Topic :: Software Development :: Libraries :: Application Frameworks', 58 | 'Topic :: Internet :: WWW/HTTP :: Dynamic Content', 59 | ], 60 | scripts=['scripts/mkpdftext', 'scripts/mkpdfxfa', 'scripts/mkpdfotf', 'scripts/mkpdfjs'], 61 | #extras_require={ 62 | # 'testing': ['pytest'], 63 | #} 64 | ) 65 | -------------------------------------------------------------------------------- /tests/test_general.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | from minipdf import PDFDoc, PDFName, PDFNum, PDFString, PDFRef, PDFStream, PDFDict, PDFArray 3 | 4 | 5 | class ManticoreTest(unittest.TestCase): 6 | _multiprocess_can_split_ = True 7 | 8 | def setUp(self): 9 | pass 10 | def tearDown(self): 11 | pass 12 | 13 | def test_name(self): 14 | self.assertEqual(str(PDFName("NAME")), '/NAME') 15 | self.assertEqual(str(PDFName("/NAME")), '/#2fNAME') 16 | 17 | 
--------------------------------------------------------------------------------
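The `test_name` case in tests/test_general.py expects `str(PDFName("/NAME"))` to come out as `/#2fNAME`, i.e. a `/` inside a name is written as `#` plus its two-digit hex code. The sketch below shows one way that escaping could work. It is not minipdf's implementation: `escape_pdf_name` and `DELIMITERS` are hypothetical names, and the rule used (escape `#`, the PDF delimiter characters, and anything outside the printable range `!` to `~`) is an assumption based on the name-object rules in the PDF specification, with lowercase hex chosen only to match the test.

```python
# Hypothetical helper, NOT part of minipdf: it only mirrors the behaviour
# that tests/test_general.py expects from str(PDFName(...)).
# Assumption: every character of `name` has a code point in 0..255.

# PDF delimiter characters, which cannot appear literally in a name.
DELIMITERS = set('()<>[]{}/%')


def escape_pdf_name(name):
    """Return `name` serialized as a PDF name object, e.g. 'NAME' -> '/NAME'."""
    out = []
    for ch in name:
        code = ord(ch)
        # Escape whitespace/control chars, non-printable chars, '#', delimiters.
        if code < 0x21 or code > 0x7e or ch == '#' or ch in DELIMITERS:
            out.append('#%02x' % code)
        else:
            out.append(ch)
    return '/' + ''.join(out)


if __name__ == '__main__':
    print(escape_pdf_name('NAME'))
    print(escape_pdf_name('/NAME'))
```

Keeping the escaping in one small function like this makes it easy to add further cases to the test file (spaces, `#`, parentheses) without building a whole document first.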