├── .gitignore
├── README.md
├── __init__.py
├── install
    ├── install.sh
    └── lcms-1.19.tar.gz
├── parser.py
├── pdfreader
    ├── __init__.py
    ├── book.pdf
    ├── brochureint.pdf
    ├── brochureprint.pdf
    ├── buttons.pdf
    ├── ff.pdf
    ├── ff.png
    ├── ffa.pdf
    ├── hello world.pdf
    ├── lib
    │   ├── __init__.py
    │   ├── book.py
    │   ├── char.py
    │   ├── line.py
    │   ├── page.py
    │   └── paragraph.py
    ├── main.py
    ├── pff.pdf
    ├── pffl.pdf
    ├── png.pdf
    └── util
    │   ├── __init__.py
    │   └── convert.py
├── psd_rb
    ├── Gemfile
    ├── json_type.json
    ├── ruby
    │   ├── factories
    │   │   ├── group_unit.rb
    │   │   ├── image_unit.rb
    │   │   ├── text_unit.rb
    │   │   ├── unit_factory.rb
    │   │   └── unknown_unit.rb
    │   ├── panda_psd.rb
    │   ├── unit_manager.rb
    │   └── util.rb
    └── test.rb
└── psdtools
    ├── A2 16 column.ai
    ├── Logo.psd
    ├── __init__.py
    ├── brochure.indd
    ├── brochure_rtangfold_11x17_OUTh.indd
    ├── buttons.psd
    ├── main.py
    └── test-text.psd


/.gitignore:
--------------------------------------------------------------------------------
 1 | 
 2 | pdfreader/.DS_Store
 3 | 
 4 | *.sublime-project
 5 | 
 6 | *.sublime-workspace
 7 | 
 8 | psdtools/.DS_Store
 9 | 
10 | psdtools/images/.DS_Store
11 | 
12 | *.pyc
13 | 
14 | /.DS_Store
15 | /install/.DS_Store
16 | /metadata.json
17 | /install/install02500809/
18 | /images/
19 | /psd_rb/.idea/
20 | /psd_rb/Gemfile.lock
21 | /psd_rb/.DS_Store
22 | /psd_rb/psds/
23 | /psd_rb/output/
24 | /psd_rb/output.txt


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | content-extractor
  2 | =================
  3 | 
  4 | Content-extractor is a Python-based project. Python was chosen mainly because of its libraries (the best ones for this job are in Python, IMO).
  5 | Currently, this project parses PDF and PSD files to extract meaningful content, such as text and images, and links both together in a single JSON string.
  6 | 
  7 | 
  8 | Dependencies
  9 | =================
 10 | 
 11 | Content-extractor is built upon the following:
 12 | 
 13 | - [psd-tools](https://pypi.python.org/pypi/psd-tools/) To extract images and text from psd files
 14 | - [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/#intro) To extract text from pdf files as xml
 15 | - [beautiful soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) To convert the extracted xml as json
 16 | - [pdfimages](http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/) (from poppler-utils) To extract images from pdf files as image.ppm
 17 | - [ImageMagick](http://www.imagemagick.org/script/index.php) To convert the ppm images to png
 18 | 
 19 | 
 20 | Installation
 21 | =================
 22 | 
 23 | Since there are a lot of dependencies, and most of them have their own dependencies, I have written a shell script to simplify the installation process.
 24 | 
 25 | The script assumes the following:
 26 | 
 27 |  - If you don't have [apt-get](http://doc.ubuntu-fr.org/apt-get), then you should have [brew](http://mxcl.github.io/homebrew/).
 28 |  - You have [curl](http://pwet.fr/man/linux/commandes/curl) installed (the shell executable, not the php module).
 29 |  - You have [python 2.7.x](http://www.python.org/download/) installed on your system.
 30 |  - You have [unzip](http://www.info-zip.org/mans/unzip.html) installed on your system.
 31 | 
 32 | When you launch the script, it installs [pip](https://pypi.python.org/pypi/pip) if it isn't already present on your system.
 33 | 
 34 | ```Shell
 35 | $ cd install
 36 | $ sh install.sh
 37 | ```
 38 | 
 39 | How to use it
 40 | =================
 41 | 
 42 | For any supported format (currently PDF/PSD) you can use `parser.py [file_path] [image_path]`; it will automatically do the job.
 43 | 
 44 | ```Shell
 45 | # Writes metadata.json and extracts the images into the images folder
 46 | ./parser.py psdtools/work.psd './images/'
 47 | ./parser.py pdfreader/book.pdf './images/'
 48 | ```
 49 | 
 50 | You can also import parser.py into your own Python project and use it the following way:
 51 | 
 52 | ```Python
 53 | # Returns a string containing the JSON and extracts the images into the images folder
 54 | import parser
 55 | json = parser.parse("psdtools/work.psd", "./images/")
 56 | json = parser.parse("pdfreader/book.pdf", "./images/")
 57 | ```
 58 | 
 59 | You can also use the pdfreader and psdtools scripts independently, like so:
 60 | 
 61 | 
 62 | ```Shell
 63 | # Shell:
 64 | $ ./psdtools/main.py psdtools/work.psd './images/'
 65 | $ ./pdfreader/main.py pdfreader/book.pdf './images/'
 66 | ```
 67 | ```Python
 68 | # Python:
 69 | # PSD
 70 | from psdtools import main
 71 | json = main.run("psdtools/work.psd", "./images/")
 72 | # PDF
 73 | from pdfreader import main
 74 | json = main.run("pdfreader/book.pdf", "./images/")
 75 | ```
 76 | 
 77 | 
 78 | ./pdfreader/main.py is just a simplified interface to the very powerful pdfreader/util/convert.py. I have rewritten convert.py as a class, but it is originally [pdf2txt.py](http://www.unixuser.org/~euske/python/pdfminer/#pdf2txt) from [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/index.html).
 79 | However, you can still use convert.py as if it were the original [pdf2txt.py](http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt) tool; [here is the documentation](http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt).
 80 | 
 81 | ```Shell
 82 | $ pdfreader/util/convert.py -o output.html samples/naacl06-shinyama.pdf
 83 | (extract text as an HTML file whose filename is output.html)
 84 | 
 85 | $ pdfreader/util/convert.py -V -c euc-jp -o output.html samples/jo.pdf
 86 | (extract a Japanese HTML file in vertical writing, CMap is required)
 87 | 
 88 | $ pdfreader/util/convert.py -P mypassword -o output.txt secret.pdf
 89 | (extract a text from an encrypted PDF file)
 90 | ```
 91 | 
 92 | convert.py can also be imported into a Python project (but fewer options are available, because I have not implemented them all)
 93 | 
 94 | ```Python
 95 | # @see pdfreader/main.py:text_to_dict as example
 96 | from util.convert import converter
 97 | convert = converter()
 98 | xml = convert.as_xml().add_input_file(fileinput).run()
 99 | ```
100 | 
101 | How does it work
102 | =================
103 | 
104 | The information is extracted in a way that keeps the parent-child relations.
105 | 
106 |  - For a pdf file:
107 |     - A PDF file can have many pages, and the JSON string mirrors that: each page has many images and many paragraphs.
108 |     - Each paragraph has a width, a height, a content (string), an x and a y position, a font, and a font size.
109 |     - For a PDF file the original image name can't be extracted, so the images are named as follows: [uniqId]_p[pageNumber].jpg/png. uniqId is a unique ID and pageNumber is the page that contains the image. You shouldn't need to parse the name, since the JSON string already contains this information.
110 |     - Each image is extracted directly as a jpg when possible; otherwise it is extracted as ppm/pbm and then converted to png.
111 | 
112 |  - For a psd file:
113 |     - A PSD file can have many groups, which are translated into pages in the JSON string. Each group in a PSD file can have many layers; a layer can be either text (paragraphs) or an image (images).
114 |     - Each text layer has a width, a height, a content (string), and an x and a y position. The font and font size are currently not extracted (because psd-tools doesn't do it).
115 |     - For a PSD file the image is named after the layer it comes from, but since many layers can share the same name, the following pattern is applied to guarantee a unique name: [uniqId]_[layerName]_[groupName].png.
116 | 
117 |  - Limitations with PSD files:
118 |    - Text: because of the psd-tools library, the font and the bold, italic, and underline attributes are not available.
119 | 
120 |  - Limitations with PDF files:
121 |    - Text: because of the pdfminer library, a paragraph can't have more than one font. Underlines can't be extracted either. However, the bold and italic attributes are extracted as HTML tags (`<b>`, `<i>`) and integrated directly into the string.
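The two image naming patterns described above can be sketched as follows. This is only an illustration: the helper names `pdf_image_name` and `psd_image_name` are hypothetical, and `uuid.uuid1()` is an assumption standing in for whatever ID generator the project really uses.

```python
import uuid


def pdf_image_name(page_number, extension="png"):
    """Build a name shaped like [uniqId]_p[pageNumber].[extension]."""
    # Assumption: uuid1() plays the role of the unique ID.
    return "%s_p%d.%s" % (uuid.uuid1(), page_number, extension)


def psd_image_name(layer_name, group_name):
    """Build a name shaped like [uniqId]_[layerName]_[groupName].png."""
    # The unique prefix keeps two layers with the same name from colliding.
    return "%s_%s_%s.png" % (uuid.uuid1(), layer_name, group_name)
```

Because the unique ID comes first, two layers called `logo` in the same group still end up in two different files.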
122 | 
123 | Below is a simplified example, taken from book.pdf, of what the JSON string looks like.
124 | 
125 | JSON Format (from pdfreader/book.pdf 'simplified')
126 | =================
127 | 
128 | ```JSON
129 | {
130 |     "pages": [
131 |         {
132 |             "images": [
133 |                 "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"
134 |             ],
135 |             "paragraphs": [
136 |                 {
137 |                     "size": 98,
138 |                     "width": 587,
139 |                     "string": "Book Title",
140 |                     "y": -98,
141 |                     "x": -324,
142 |                     "font": "Georgia",
143 |                     "height": 705
144 |                 }
145 |             ]
146 |         },
147 |         {
148 |             "images": [
149 |                 "96f4e9ee-c1eb-11e2-ad2b-040ccedc7e34_p1.png"
150 |             ],
151 |             "paragraphs": [
152 |                 {
153 |                     "size": 24,
154 |                     "width": 138,
155 |                     "string": "CHAPTER 1",
156 |                     "y": -24,
157 |                     "x": -88,
158 |                     "font": "Georgia",
159 |                     "height": 711
160 |                 },
161 |                 {
162 |                     "size": 33,
163 |                     "width": 489,
164 |                     "string": "Lorem ipsum dolor sit amet, consectetur \n<i>adipisicing</i> <b>elit, sed</b> <i><b>do eiusmod</i></b>\ntempor incididunt ut labore et dolore \nmagna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.",
165 |                     "y": -229,
166 |                     "x": -439,
167 |                     "font": "Georgia",
168 |                     "height": 269
169 |                 }
170 |             ]
171 |         },
172 |         {
173 |             "paragraphs": [
174 |                 {
175 |                     "size": 24,
176 |                     "width": 133,
177 |                     "string": "SECTION 1",
178 |                     "y": -24,
179 |                     "x": -83,
180 |                     "font": "Georgia",
181 |                     "height": 711
182 |                 }
183 |             ]
184 |         }
185 |     ]
186 | }
187 | ```
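A consumer of this JSON string could walk it roughly as shown here. This is a minimal sketch, not part of the project: `summarize` and `SAMPLE` are hypothetical names, and the sample is a trimmed, slightly adapted version of the example above. Note that a page may have no `images` key at all, as the third page of the example shows, so the sketch reads it with a default.

```python
import json
import re

# Trimmed sample in the same shape as the example above (adapted for brevity).
SAMPLE = """
{
    "pages": [
        {"images": ["961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"],
         "paragraphs": [{"size": 98, "width": 587, "string": "Book Title",
                         "y": -98, "x": -324, "font": "Georgia", "height": 705}]},
        {"paragraphs": [{"size": 24, "width": 133, "string": "<b>SECTION</b> 1",
                         "y": -24, "x": -83, "font": "Georgia", "height": 711}]}
    ]
}
"""

# Matches only the bold/italic tags that are embedded in the strings.
TAG = re.compile(r"</?[bi]>")


def summarize(json_string):
    """Return one (page_index, image_count, plain_texts) tuple per page."""
    summary = []
    for index, page in enumerate(json.loads(json_string)["pages"]):
        images = page.get("images", [])  # the key can be missing
        texts = [TAG.sub("", p["string"]) for p in page.get("paragraphs", [])]
        summary.append((index, len(images), texts))
    return summary
```

Stripping the tags with a regular expression is enough here because only `<b>` and `<i>` are ever embedded in the strings.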
188 | 
189 | How to improve it
190 | =================
191 | 
192 |  - The psd-tools library should be improved so that it can extract the font and the bold, italic, and underline attributes
193 |  - The pdfminer library should be improved so that it can extract the underline attribute
194 |  - Follow PEP 8?
195 |  - Check whether the given image folder exists; if it doesn't, create it, and if that fails, raise an error and stop
196 |  - Add support for more file formats
197 |  - Speed is not an issue, but why not improve it?
198 |  - The psd-tools library should be improved so that it can extract layers containing fx effects (is this even possible?)
199 |  - Depending on the platform or on a parameter, the `\n` in the string attribute of the JSON should become `<br/>` or `\r\n`
200 |  - Reduce the number of dependencies?
201 |  - What else?
202 | 
203 | Contributing
204 | =================
205 | 
206 | You're welcome to contribute to this project in any way you can. If you don't know how to code or don't have time, don't worry: you can still post an issue, and I will be happy to answer and fix it as fast as possible.
207 | Want to code? Fork it and submit a pull request! Pull requests that come with an example of what has been improved will be merged first.
208 | 


--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 | 
4 | if __name__ == '__main__': print __version__
5 | 


--------------------------------------------------------------------------------
/install/install.sh:
--------------------------------------------------------------------------------
  1 | mkdir install02500809
  2 | cd install02500809
  3 | echo "changing mode of /usr/local/etc to 775\n"
  4 | sudo chmod 775 /usr/local/etc
  5 | # We check the python2 availability
  6 | echo "Checking python2 availability\n"
  7 | type python2 >/dev/null 2>&1 || { 
  8 |                                     echo "python2 unavailable\nNow creating python2 as alias of python\n"
  9 |                                     alias python2="python"
 10 |                                 }
 11 | echo "Done python2\n"
 12 | # We suppose that the system has apt-get command already installed (if pdfimages is not available)
 13 | # We suppose that the system has curl command already installed
 14 | # We check that pip exist on the system
 15 | echo "Checking pip availability\n"
 16 | type pip >/dev/null 2>&1 || { 
 17 |                                 echo "pip unavailable\nNow installing pip\n"
 18 |                                 mkdir pipInstall
 19 |                                 cd pipInstall
 20 |                                 curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
 21 |                                 sudo python get-pip.py
 22 |                                 cd ..
 23 |                                 rm -rf pipInstall
 24 |                             }
 25 | echo "Done pip\n"
 26 | 
 27 | # pil dependencies
 28 | echo "Installing pil dependencies\n"
 29 | type apt-get >/dev/null 2>&1 && {
 30 |                                     sudo apt-get install libjpeg libjpeg-dev libfreetype6 libfreetype6-dev zlib1g-dev
 31 |                                  } || {
 32 |                                     type brew >/dev/null 2>&1 && {
 33 |                                                                     brew install jpeg freetype
 34 |                                                                     brew tap homebrew/dupes
 35 |                                                                     brew install zlib
 36 |                                                                  }
 37 |                                  }
 38 | mkdir lcmsextract
 39 | tar xzf ../lcms-1.19.tar.gz --directory=./lcmsextract
 40 | cd lcmsextract/lcms-1.19
 41 | ./configure
 42 | make
 43 | sudo make install
 44 | cd ../..
 45 | rm -rf lcmsextract
 46 | echo "Done pil dependencies\n"
 47 | 
 48 | # psd-tools dependencies
 49 | echo "Installing psd-tools dependencies\n"
 50 | echo "Now installing packbits\n"
 51 | sudo pip install -U packbits
 52 | echo "Now installing docopt\n"
 53 | sudo pip install -U docopt
 54 | echo "Now installing pil\n"
 55 | sudo pip install -U pil
 56 | echo "Done psd-tools dependencies\n"
 57 | 
 58 | 
 59 | # Installing psd-tools
 60 | echo "Installing psd-tools\n"
 61 | sudo pip install -U psd-tools
 62 | echo "Done psd-tools\n"
 63 | 
 64 | # Installing pdfminer
 65 | echo "Installing pdfminer\n"
 66 | mkdir pdfminerInstall
 67 | cd pdfminerInstall
 68 | curl -O https://codeload.github.com/euske/pdfminer/zip/master
 69 | unzip master
 70 | cd pdfminer-master
 71 | make cmap
 72 | sudo python setup.py install
 73 | cd ../..
 74 | rm -rf pdfminerInstall
 75 | echo "Done pdfminer\n"
 76 | 
 77 | # Installing beautifulsoup
 78 | echo "Installing beautifulsoup\n"
 79 | sudo pip install -U beautifulsoup4
 80 | echo "Done beautifulsoup\n"
 81 | 
 82 | # If pdfimages is not available we install the lib poppler-utils which contains it 
 83 | echo "Checking pdfimages\n"
 84 | type pdfimages >/dev/null 2>&1 || {
 85 |                                     echo "pdfimages unavailable\nNow installing poppler-utils\n"
 86 |                                     type apt-get >/dev/null 2>&1 && {
 87 |                                                                         sudo apt-get install poppler-utils
 88 |                                                                      } || {
 89 |                                                                         type brew >/dev/null 2>&1 && {
 90 |                                                                                                         brew install fontconfig
 91 |                                                                                                         brew link --overwrite fontconfig
 92 |                                                                                                         brew install poppler
 93 |                                                                                                         brew link --overwrite poppler
 94 |                                                                                                      }
 95 |                                                                      }
 96 |                                   }
 97 | echo "Done pdfimages\n"
 98 | 
 99 | # If convert is not available we install the lib imagemagick which contains it
100 | echo "Checking convert\n"
101 | type convert >/dev/null 2>&1 || { 
102 |                                     echo "convert unavailable\nNow installing imagemagick\n"
103 |                                     mkdir imagemagickInstall
104 |                                     cd imagemagickInstall
105 |                                     curl -O http://www.imagemagick.org/download/ImageMagick.tar.gz
106 |                                     tar xvfz ImageMagick.tar.gz
107 |                                     cd ImageMagick-*/ # the extracted folder name depends on the downloaded version
108 |                                     ./configure
109 |                                     make
110 |                                     sudo make install
111 |                                     sudo ldconfig /usr/local/lib
112 |                                     cd ../..
113 |                                     rm -rf imagemagickInstall
114 |                                 }
115 | echo "Done convert\n"
116 | cd ..
117 | rm -rf install02500809
118 | 


--------------------------------------------------------------------------------
/install/lcms-1.19.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/install/lcms-1.19.tar.gz


--------------------------------------------------------------------------------
/parser.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python2
 2 | 
 3 | def file_exist(file_path):
 4 |     import os.path
 5 |     if os.path.exists(file_path) and os.path.isfile(file_path):
 6 |         return True
 7 |     return False
 8 | 
 9 | def get_4b_magic_number(file_path):
10 |     try:
11 |         with open(file_path, 'rb') as binaryFile: # closes the file even on error
12 |             magicNumber = binaryFile.read(4)
13 |     except(IOError), e:
14 |         print "%s: unable to open/read.\n%s" % (file_path, e)
15 |     else:
16 |         return magicNumber
17 |     return False
18 | 
19 | def parse(file_path, image_folder):
20 |     if file_exist(file_path):
21 |         magicNumber = get_4b_magic_number(file_path)
22 |         if magicNumber == "8BPS": #PSD file
23 |             from psdtools import main
24 |             return main.run(file_path, image_folder)
25 |         elif magicNumber == "%PDF": #PDF file
26 |             from pdfreader import main
27 |             return main.run(file_path, image_folder)
28 |         elif magicNumber is not False: # readable, but neither a PSD nor a PDF
29 |             print "%s: not a psd nor a pdf (unknown Magic Number)" % file_path
30 |     else:
31 |         print "%s: file not found." % file_path
32 | 
33 | if __name__ == '__main__':
34 |     import sys
35 |     if len(sys.argv) == 3:
36 |         json = parse(sys.argv[1], sys.argv[2])
37 |         if json is not None: # parse() returns None when it could not handle the file
38 |             target = open("metadata.json", 'w') # 'w' over-writes any existing metadata.json
39 |             target.write(json)
40 |             target.close()
41 |     else:
42 |         print "usage: %s pdf_or_psd_file_path generated_images_path/ (eg: python %s book.pdf/.psd './images/')" % (sys.argv[0], sys.argv[0])
43 | 


--------------------------------------------------------------------------------
/pdfreader/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 | 
4 | if __name__ == '__main__': print __version__
5 | 


--------------------------------------------------------------------------------
/pdfreader/book.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/book.pdf


--------------------------------------------------------------------------------
/pdfreader/brochureint.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/brochureint.pdf


--------------------------------------------------------------------------------
/pdfreader/brochureprint.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/brochureprint.pdf


--------------------------------------------------------------------------------
/pdfreader/buttons.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/buttons.pdf


--------------------------------------------------------------------------------
/pdfreader/ff.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ff.pdf


--------------------------------------------------------------------------------
/pdfreader/ff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ff.png


--------------------------------------------------------------------------------
/pdfreader/ffa.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ffa.pdf


--------------------------------------------------------------------------------
/pdfreader/hello world.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/hello world.pdf


--------------------------------------------------------------------------------
/pdfreader/lib/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 | 
4 | if __name__ == '__main__': print __version__
5 | 


--------------------------------------------------------------------------------
/pdfreader/lib/char.py:
--------------------------------------------------------------------------------
 1 | class char(object):
 2 | 
 3 |     _size = 0
 4 |     _width = 0
 5 |     _height = 0
 6 |     _x = 0
 7 |     _y = 0
 8 |     _font = None
 9 |     _isBold = False
10 |     _isItalic = False
11 |     _char = ''
12 | 
13 |     @property
14 |     def font(self):
15 |         return self._font
16 | 
17 |     @property
18 |     def size(self):
19 |         return self._size
20 | 
21 |     @property
22 |     def x(self):
23 |         return self._x
24 | 
25 |     @property
26 |     def y(self):
27 |         return self._y
28 | 
29 |     def __init__(self, xml_char):
30 |         self._font = xml_char.get('font').split('+')[-1] if xml_char.get('font') is not None else None # [-1] also accepts names without a subset prefix
31 |         if self._font is not None:
32 |             (left, top, self._width, self._height) = xml_char.get('bbox').split(',')
33 |             self._width = int(float(self._width))
34 |             self._height = int(float(self._height))
35 |             self._x = int(float(left)) - self._width
36 |             self._y = int(float(top)) - self._height
37 |             self._size = int(float(xml_char.get('size')))
38 |             self._isBold = True if len(self._font.split('-')) == 2 and 'Bold' in self._font.split('-')[1] else False
39 |             self._isItalic = True if len(self._font.split('-')) == 2 and 'Italic' in self._font.split('-')[1] else False
40 |             self._font = self._font.split('-')[0]
41 |             #
42 |             self._char = xml_char.string
43 | 
44 |     def isBold(self):
45 |         return self._isBold
46 | 
47 |     def isItalic(self):
48 |         return self._isItalic
49 | 
50 |     def __str__(self):
51 |         if self._font is None:
52 |             return ""
53 |         try:
54 |             char = str(self._char)
55 |             return char
56 |         except UnicodeEncodeError:
57 |             return ""
58 | 


--------------------------------------------------------------------------------
/pdfreader/lib/line.py:
--------------------------------------------------------------------------------
  1 | from cStringIO import StringIO
  2 | from char import char
  3 | 
  4 | class line(object):
  5 | 
  6 |     _width = 0
  7 |     _height = 0
  8 |     _x = 0
  9 |     _y = 0
 10 |     _font = None
 11 |     _lines = []
 12 | 
 13 |     @property
 14 |     def font(self):
 15 |         return self._font
 16 | 
 17 |     @property
 18 |     def size(self):
 19 |         return self._size
 20 | 
 21 |     @property
 22 |     def x(self):
 23 |         return self._x
 24 | 
 25 |     @property
 26 |     def y(self):
 27 |         return self._y
 28 | 
 29 | 
 30 |     def __init__(self, xml_line):
 31 |         (left, top, self._width, self._height) = xml_line.get('bbox').split(',')
 32 |         self._width = int(float(self._width))
 33 |         self._height = int(float(self._height))
 34 |         self._x = int(float(left)) - self._width
 35 |         self._y = int(float(top)) - self._height
 36 |         #
 37 |         xml_string = xml_line.find_all('text')
 38 |         self._chars = []
 39 |         for c in xml_string:
 40 |             self._chars.append(char(c))
 41 |         #
 42 |         self._font = self._chars[0].font if len(self._chars) > 0 else None
 43 |         self._size = self._chars[0].size if len(self._chars) > 0 else None
 44 | 
 45 |     _italic_on = False
 46 |     def handle_italic(self, c, string):
 47 |         if self._italic_on is False and c.isItalic() is True:
 48 |             string.write('<i>')
 49 |             self._italic_on = True
 50 |         if self._italic_on is True and c.isItalic() is False:
 51 |             string.write('</i>')
 52 |             self._italic_on = False
 53 | 
 54 |     _bold_on = False
 55 |     def handle_bold(self, c, string):
 56 |         if self._bold_on is False and c.isBold() is True:
 57 |             string.write('<b>')
 58 |             self._bold_on = True
 59 |         if self._bold_on is True and c.isBold() is False:
 60 |             string.write('</b>')
 61 |             self._bold_on = False
 62 | 
 63 | 
 64 |     def __str__(self):
 65 |         string = StringIO()
 66 |         for c in self._chars:
 67 |             self.handle_italic(c, string)
 68 |             self.handle_bold(c, string)
 69 |             string.write(str(c))
 70 |         return string.getvalue()
 71 | 
 72 | 
 73 | 
 74 | 
 75 | 
 76 | 
 77 | 
 78 | 
 79 | if __name__ == '__main__':
 80 |     import sys
 81 |     from bs4 import BeautifulSoup
 82 |     file="""<?xml version="1.0" encoding="utf-8" ?>
 83 | <pages>
 84 | <page id="1" bbox="0.000,0.000,1024.000,748.000" rotate="0">
 85 | <textbox id="0" bbox="263.163,702.557,365.297,721.036">
 86 | <textline bbox="263.163,702.557,365.297,721.036">
 87 | <text font="OWDOPW+Georgia" bbox="263.163,702.557,271.263,721.036" size="18.479">L</text>
 88 | <text font="OWDOPW+Georgia-Bold" bbox="271.927,702.557,281.904,721.036" size="18.479">O</text>
 89 | <text font="OWDOPW+Georgia-BoldItalic" bbox="282.582,702.557,291.996,721.036" size="18.479">R</text>
 90 | <text font="OWDOPW+Georgia-BoldItalic" bbox="292.659,702.557,301.416,721.036" size="18.479">E</text>
 91 | <text font="OWDOPW+Georgia-Italic" bbox="302.094,702.557,314.525,721.036" size="18.479">M</text>
 92 | <text> </text>
 93 | <text font="OWDOPW+Georgia" bbox="315.202,702.557,318.434,721.036" size="18.479"> </text>
 94 | <text> </text>
 95 | <text font="OWDOPW+Georgia" bbox="319.111,702.557,324.341,721.036" size="18.479">I</text>
 96 | <text font="OWDOPW+Georgia" bbox="325.005,702.557,333.186,721.036" size="18.479">P</text>
 97 | <text font="OWDOPW+Georgia" bbox="333.849,702.557,341.373,721.036" size="18.479">S</text>
 98 | <text font="OWDOPW+Georgia" bbox="342.050,702.557,352.188,721.036" size="18.479">U</text>
 99 | <text font="OWDOPW+Georgia" bbox="352.865,702.557,365.297,721.036" size="18.479">M</text>
100 | <text>
101 | </text>
102 | </textline>
103 | </textbox>
104 | </page>
105 | </pages>
106 |     """
107 |     soup = BeautifulSoup(file)
108 |     l = line(soup.pages.page.textbox.textline)
109 |     print ("[%s] font[%s]" % (l, l.font))
110 | 


--------------------------------------------------------------------------------
/pdfreader/lib/page.py:
--------------------------------------------------------------------------------
  1 | from paragraph import paragraph
  2 | from cStringIO import StringIO
  3 | 
  4 | class page(object):
  5 | 
  6 |     _paragraphs = []
  7 | 
  8 |     @property
  9 |     def paragraphs(self):
 10 |         return self._paragraphs
 11 | 
 12 | 
 13 |     def __init__(self, xml_page):
 14 |         xml = xml_page.find_all('textbox')
 15 |         self._paragraphs = []
 16 |         for p in xml:
 17 |             self._paragraphs.append(paragraph(p))
 18 | 
 19 |     def toDict(self):
 20 |         array = {'paragraphs': []}
 21 |         for p in self._paragraphs:
 22 |             array['paragraphs'].append(p.toDict())
 23 |         return array
 24 | 
 25 | 
 26 |     def __str__(self):
 27 |         string = StringIO()
 28 |         count = 0
 29 |         for p in self._paragraphs:
 30 |             if count != 0:
 31 |                 string.write("\n\n")
 32 |             string.write(str(p))
 33 |             count = 1
 34 |         return string.getvalue()
 35 | 
 36 | 
 37 | 
 38 | 
 39 | 
 40 | 
 41 | 
 42 | 
 43 | 
 44 | 
 45 | 
 46 | 
 47 | 
 48 | 
 49 | 
 50 | 
 51 | 
 52 | 
 53 | 
 54 | 
 55 | 
 56 | 
 57 | 
 58 | 
 59 | 
 60 | 
 61 | 
 62 | 
 63 | 
 64 | 
 65 | if __name__ == '__main__':
 66 |     import sys
 67 |     from bs4 import BeautifulSoup
 68 |     file="""<?xml version="1.0" encoding="utf-8" ?>
 69 | <pages>
 70 | <page id="1" bbox="0.000,0.000,1024.000,748.000" rotate="0">
 71 | <textbox id="0" bbox="50.000,40.406,489.850,269.078">
 72 | <textline bbox="50.000,236.006,485.115,269.078">
 73 | <text font="OWDOPW+Georgia" bbox="263.163,702.557,271.263,721.036" size="18.479">L</text>
 74 | <text font="OWDOPW+Georgia-Bold" bbox="271.927,702.557,281.904,721.036" size="18.479">O</text>
 75 | <text font="OWDOPW+Georgia-BoldItalic" bbox="282.582,702.557,291.996,721.036" size="18.479">R</text>
 76 | <text font="OWDOPW+Georgia-BoldItalic" bbox="292.659,702.557,301.416,721.036" size="18.479">E</text>
 77 | <text font="OWDOPW+Georgia-Italic" bbox="302.094,702.557,314.525,721.036" size="18.479">M</text>
 78 | <text font="OWDOPW+Georgia" bbox="315.202,702.557,318.434,721.036" size="18.479"> </text>
 79 | <text font="OWDOPW+Georgia" bbox="319.111,702.557,324.341,721.036" size="18.479">I</text>
 80 | <text font="OWDOPW+Georgia" bbox="325.005,702.557,333.186,721.036" size="18.479">P</text>
 81 | <text font="OWDOPW+Georgia" bbox="333.849,702.557,341.373,721.036" size="18.479">S</text>
 82 | <text font="OWDOPW+Georgia" bbox="342.050,702.557,352.188,721.036" size="18.479">U</text>
 83 | <text font="OWDOPW+Georgia" bbox="352.865,702.557,365.297,721.036" size="18.479">M</text>
 84 | <text font="OWDOPW+Georgia" bbox="191.847,236.006,197.631,269.078" size="33.072"> </text>
 85 | <text font="OWDOPW+Georgia" bbox="197.636,236.006,211.412,269.078" size="33.072">d</text>
 86 | <text font="OWDOPW+Georgia" bbox="211.417,236.006,224.353,269.078" size="33.072">o</text>
 87 | <text font="OWDOPW+Georgia" bbox="224.355,236.006,231.219,269.078" size="33.072">l</text>
 88 | <text font="OWDOPW+Georgia" bbox="231.222,236.006,244.158,269.078" size="33.072">o</text>
 89 | <text font="OWDOPW+Georgia" bbox="244.160,236.006,254.000,269.078" size="33.072">r</text>
 90 | <text font="OWDOPW+Georgia" bbox="253.993,236.006,259.777,269.078" size="33.072"> </text>
 91 | <text font="OWDOPW+Georgia" bbox="259.782,236.006,270.150,269.078" size="33.072">s</text>
 92 | <text font="OWDOPW+Georgia" bbox="270.152,236.006,277.184,269.078" size="33.072">i</text>
 93 | <text font="OWDOPW+Georgia" bbox="277.184,236.006,285.464,269.078" size="33.072">t</text>
 94 | <text font="OWDOPW+Georgia" bbox="285.469,236.006,291.253,269.078" size="33.072"> </text>
 95 | <text font="OWDOPW+Georgia" bbox="291.258,236.006,303.354,269.078" size="33.072">a</text>
 96 | <text font="OWDOPW+Georgia" bbox="303.351,236.006,324.495,269.078" size="33.072">m</text>
 97 | <text font="OWDOPW+Georgia" bbox="324.493,236.006,336.085,269.078" size="33.072">e</text>
 98 | <text font="OWDOPW+Georgia" bbox="336.094,236.006,344.374,269.078" size="33.072">t</text>
 99 | <text font="OWDOPW+Georgia" bbox="344.379,236.006,350.859,269.078" size="33.072">,</text>
100 | <text font="OWDOPW+Georgia" bbox="350.847,236.006,356.631,269.078" size="33.072"> </text>
101 | <text font="OWDOPW+Georgia" bbox="356.636,236.006,367.532,269.078" size="33.072">c</text>
102 | <text font="OWDOPW+Georgia" bbox="367.534,236.006,380.470,269.078" size="33.072">o</text>
103 | <text font="OWDOPW+Georgia" bbox="380.473,236.006,394.657,269.078" size="33.072">n</text>
104 | <text font="OWDOPW+Georgia" bbox="394.652,236.006,405.020,269.078" size="33.072">s</text>
105 | <text font="OWDOPW+Georgia" bbox="405.022,236.006,416.614,269.078" size="33.072">e</text>
106 | <text font="OWDOPW+Georgia" bbox="416.624,236.006,427.520,269.078" size="33.072">c</text>
107 | <text font="OWDOPW+Georgia" bbox="427.522,236.006,435.802,269.078" size="33.072">t</text>
108 | <text font="OWDOPW+Georgia" bbox="435.807,236.006,447.399,269.078" size="33.072">e</text>
109 | <text font="OWDOPW+Georgia" bbox="447.409,236.006,455.689,269.078" size="33.072">t</text>
110 | <text font="OWDOPW+Georgia" bbox="455.694,236.006,469.494,269.078" size="33.072">u</text>
111 | <text font="OWDOPW+Georgia" bbox="469.498,236.006,479.338,269.078" size="33.072">r</text>
112 | <text font="OWDOPW+Georgia" bbox="479.331,236.006,485.115,269.078" size="33.072"> </text>
113 | <text>
114 | </text>
115 | </textline>
116 | <textline bbox="50.000,203.606,420.075,237.278">
117 | <text font="OTOBVA+Georgia-Italic" bbox="50.000,203.606,63.752,236.654" size="33.048">a</text>
118 | <text font="OTOBVA+Georgia-Italic" bbox="63.747,203.606,77.547,236.654" size="33.048">d</text>
119 | <text font="OTOBVA+Georgia-Italic" bbox="77.542,203.606,84.670,236.654" size="33.048">i</text>
120 | <text font="OTOBVA+Georgia-Italic" bbox="84.690,203.606,98.562,236.654" size="33.048">p</text>
121 | <text font="OTOBVA+Georgia-Italic" bbox="98.557,203.606,105.685,236.654" size="33.048">i</text>
122 | <text font="OTOBVA+Georgia-Italic" bbox="105.704,203.606,116.048,236.654" size="33.048">s</text>
123 | <text font="OTOBVA+Georgia-Italic" bbox="116.043,203.606,123.171,236.654" size="33.048">i</text>
124 | <text font="OTOBVA+Georgia-Italic" bbox="123.190,203.606,134.086,236.654" size="33.048">c</text>
125 | <text font="OTOBVA+Georgia-Italic" bbox="134.082,203.606,141.210,236.654" size="33.048">i</text>
126 | <text font="OTOBVA+Georgia-Italic" bbox="141.229,203.606,155.389,236.654" size="33.048">n</text>
127 | <text font="OTOBVA+Georgia-Italic" bbox="155.384,203.606,169.136,236.654" size="33.048">g</text>
128 | <text font="OWDOPW+Georgia" bbox="169.109,203.606,174.893,236.678" size="33.072"> </text>
129 | <text font="YFNXOA+Georgia-Bold" bbox="174.898,203.606,188.626,237.278" size="33.672">e</text>
130 | <text font="YFNXOA+Georgia-Bold" bbox="188.622,203.606,196.878,237.278" size="33.672">l</text>
131 | <text font="YFNXOA+Georgia-Bold" bbox="196.873,203.606,205.369,237.278" size="33.672">i</text>
132 | <text font="YFNXOA+Georgia-Bold" bbox="205.364,203.606,214.892,237.278" size="33.672">t</text>
133 | <text font="YFNXOA+Georgia-Bold" bbox="214.911,203.606,222.783,237.278" size="33.672">,</text>
134 | <text font="YFNXOA+Georgia-Bold" bbox="222.778,203.606,228.874,237.278" size="33.672"> </text>
135 | <text font="YFNXOA+Georgia-Bold" bbox="228.870,203.606,241.182,237.278" size="33.672">s</text>
136 | <text font="YFNXOA+Georgia-Bold" bbox="241.177,203.606,254.905,237.278" size="33.672">e</text>
137 | <text font="YFNXOA+Georgia-Bold" bbox="254.900,203.606,270.812,237.278" size="33.672">d</text>
138 | <text font="OWDOPW+Georgia" bbox="270.816,203.606,276.600,236.678" size="33.072"> </text>
139 | <text font="UXGHCH+Georgia-BoldItalic" bbox="276.606,203.606,292.517,237.062" size="33.456">d</text>
140 | <text font="UXGHCH+Georgia-BoldItalic" bbox="292.508,203.606,307.772,237.062" size="33.456">o</text>
141 | <text font="UXGHCH+Georgia-BoldItalic" bbox="307.765,203.606,313.861,237.062" size="33.456"> </text>
142 | <text font="UXGHCH+Georgia-BoldItalic" bbox="313.858,203.606,327.226,237.062" size="33.456">e</text>
143 | <text font="UXGHCH+Georgia-BoldItalic" bbox="327.229,203.606,336.013,237.062" size="33.456">i</text>
144 | <text font="UXGHCH+Georgia-BoldItalic" bbox="336.005,203.606,352.445,237.062" size="33.456">u</text>
145 | <text font="UXGHCH+Georgia-BoldItalic" bbox="352.448,203.606,364.856,237.062" size="33.456">s</text>
146 | <text font="UXGHCH+Georgia-BoldItalic" bbox="364.858,203.606,388.906,237.062" size="33.456">m</text>
147 | <text font="UXGHCH+Georgia-BoldItalic" bbox="388.906,203.606,404.170,237.062" size="33.456">o</text>
148 | <text font="UXGHCH+Georgia-BoldItalic" bbox="404.163,203.606,420.075,237.062" size="33.456">d</text>
149 | <text>
150 | </text>
151 | </textline>
152 | <textline bbox="50.000,170.206,452.162,204.656">
153 | <text font="OWDOPW+Georgia" bbox="50.000,170.206,58.625,204.656" size="34.450">t</text>
154 | <text font="OWDOPW+Georgia" bbox="58.630,170.206,70.705,204.656" size="34.450">e</text>
155 | <text font="OWDOPW+Georgia" bbox="70.710,170.206,92.735,204.656" size="34.450">m</text>
156 | <text font="OWDOPW+Georgia" bbox="92.740,170.206,107.015,204.656" size="34.450">p</text>
157 | <text font="OWDOPW+Georgia" bbox="107.020,170.206,120.495,204.656" size="34.450">o</text>
158 | <text font="OWDOPW+Georgia" bbox="120.500,170.206,130.750,204.656" size="34.450">r</text>
159 | <text font="OWDOPW+Georgia" bbox="130.737,170.425,136.521,203.497" size="33.072"> </text>
160 | <text font="OWDOPW+Georgia" bbox="136.526,170.425,143.558,203.497" size="33.072">i</text>
161 | <text font="OWDOPW+Georgia" bbox="143.558,170.425,157.742,203.497" size="33.072">n</text>
162 | <text font="OWDOPW+Georgia" bbox="157.738,170.425,168.634,203.497" size="33.072">c</text>
163 | <text font="OWDOPW+Georgia" bbox="168.636,170.425,175.668,203.497" size="33.072">i</text>
164 | <text font="OWDOPW+Georgia" bbox="175.668,170.425,189.444,203.497" size="33.072">d</text>
165 | <text font="OWDOPW+Georgia" bbox="189.449,170.425,196.481,203.497" size="33.072">i</text>
166 | <text font="OWDOPW+Georgia" bbox="196.481,170.425,210.257,203.497" size="33.072">d</text>
167 | <text font="OWDOPW+Georgia" bbox="210.262,170.425,224.062,203.497" size="33.072">u</text>
168 | <text font="OWDOPW+Georgia" bbox="224.066,170.425,238.250,203.497" size="33.072">n</text>
169 | <text font="OWDOPW+Georgia" bbox="238.246,170.425,246.526,203.497" size="33.072">t</text>
170 | <text font="OWDOPW+Georgia" bbox="246.530,170.425,252.314,203.497" size="33.072"> </text>
171 | <text font="OWDOPW+Georgia" bbox="252.319,170.425,266.119,203.497" size="33.072">u</text>
172 | <text font="OWDOPW+Georgia" bbox="266.124,170.425,274.404,203.497" size="33.072">t</text>
173 | <text font="OWDOPW+Georgia" bbox="274.409,170.425,280.193,203.497" size="33.072"> </text>
174 | <text font="OWDOPW+Georgia" bbox="280.197,170.425,287.061,203.497" size="33.072">l</text>
175 | <text font="OWDOPW+Georgia" bbox="287.066,170.425,299.162,203.497" size="33.072">a</text>
176 | <text font="OWDOPW+Georgia" bbox="299.167,170.425,312.607,203.497" size="33.072">b</text>
177 | <text font="OWDOPW+Georgia" bbox="312.612,170.425,325.548,203.497" size="33.072">o</text>
178 | <text font="OWDOPW+Georgia" bbox="325.553,170.425,335.393,203.497" size="33.072">r</text>
179 | <text font="OWDOPW+Georgia" bbox="335.374,170.425,346.966,203.497" size="33.072">e</text>
180 | <text font="OWDOPW+Georgia" bbox="346.970,170.425,352.754,203.497" size="33.072"> </text>
181 | <text font="OWDOPW+Georgia" bbox="352.759,170.425,364.351,203.497" size="33.072">e</text>
182 | <text font="OWDOPW+Georgia" bbox="364.356,170.425,372.636,203.497" size="33.072">t</text>
183 | <text font="OWDOPW+Georgia" bbox="372.641,170.425,378.425,203.497" size="33.072"> </text>
184 | <text font="OWDOPW+Georgia" bbox="378.430,170.425,392.206,203.497" size="33.072">d</text>
185 | <text font="OWDOPW+Georgia" bbox="392.210,170.425,405.146,203.497" size="33.072">o</text>
186 | <text font="OWDOPW+Georgia" bbox="405.151,170.425,412.015,203.497" size="33.072">l</text>
187 | <text font="OWDOPW+Georgia" bbox="412.020,170.425,424.956,203.497" size="33.072">o</text>
188 | <text font="OWDOPW+Georgia" bbox="424.961,170.425,434.801,203.497" size="33.072">r</text>
189 | <text font="OWDOPW+Georgia" bbox="434.782,170.425,446.374,203.497" size="33.072">e</text>
190 | <text font="OWDOPW+Georgia" bbox="446.378,170.425,452.162,203.497" size="33.072"> </text>
191 | <text>
192 | </text>
193 | </textline>
194 | <textline bbox="50.000,137.606,489.490,170.678">
195 | <text font="OWDOPW+Georgia" bbox="50.000,137.606,71.144,170.678" size="33.072">m</text>
196 | <text font="OWDOPW+Georgia" bbox="71.142,137.606,83.238,170.678" size="33.072">a</text>
197 | <text font="OWDOPW+Georgia" bbox="83.235,137.606,95.451,170.678" size="33.072">g</text>
198 | <text font="OWDOPW+Georgia" bbox="95.458,137.606,109.642,170.678" size="33.072">n</text>
199 | <text font="OWDOPW+Georgia" bbox="109.638,137.606,121.734,170.678" size="33.072">a</text>
200 | <text font="OWDOPW+Georgia" bbox="121.731,137.606,127.515,170.678" size="33.072"> </text>
201 | <text font="OWDOPW+Georgia" bbox="127.520,137.606,139.616,170.678" size="33.072">a</text>
202 | <text font="OWDOPW+Georgia" bbox="139.614,137.606,146.478,170.678" size="33.072">l</text>
203 | <text font="OWDOPW+Georgia" bbox="146.480,137.606,153.512,170.678" size="33.072">i</text>
204 | <text font="OWDOPW+Georgia" bbox="153.512,137.606,166.952,170.678" size="33.072">q</text>
205 | <text font="OWDOPW+Georgia" bbox="166.942,137.606,180.742,170.678" size="33.072">u</text>
206 | <text font="OWDOPW+Georgia" bbox="180.747,137.606,192.843,170.678" size="33.072">a</text>
207 | <text font="OWDOPW+Georgia" bbox="192.841,137.606,199.321,170.678" size="33.072">.</text>
208 | <text font="OWDOPW+Georgia" bbox="199.309,137.606,205.093,170.678" size="33.072"> </text>
209 | <text font="OWDOPW+Georgia" bbox="205.098,137.606,223.242,170.678" size="33.072">U</text>
210 | <text font="OWDOPW+Georgia" bbox="223.249,137.606,231.529,170.678" size="33.072">t</text>
211 | <text font="OWDOPW+Georgia" bbox="231.534,137.606,237.318,170.678" size="33.072"> </text>
212 | <text font="OWDOPW+Georgia" bbox="237.322,137.606,248.914,170.678" size="33.072">e</text>
213 | <text font="OWDOPW+Georgia" bbox="248.924,137.606,263.108,170.678" size="33.072">n</text>
214 | <text font="OWDOPW+Georgia" bbox="263.103,137.606,270.135,170.678" size="33.072">i</text>
215 | <text font="OWDOPW+Georgia" bbox="270.135,137.606,291.279,170.678" size="33.072">m</text>
216 | <text font="OWDOPW+Georgia" bbox="291.277,137.606,297.061,170.678" size="33.072"> </text>
217 | <text font="OWDOPW+Georgia" bbox="297.066,137.606,309.162,170.678" size="33.072">a</text>
218 | <text font="OWDOPW+Georgia" bbox="309.159,137.606,322.935,170.678" size="33.072">d</text>
219 | <text font="OWDOPW+Georgia" bbox="322.940,137.606,328.724,170.678" size="33.072"> </text>
220 | <text font="OWDOPW+Georgia" bbox="328.729,137.606,349.873,170.678" size="33.072">m</text>
221 | <text font="OWDOPW+Georgia" bbox="349.870,137.606,356.902,170.678" size="33.072">i</text>
222 | <text font="OWDOPW+Georgia" bbox="356.902,137.606,371.086,170.678" size="33.072">n</text>
223 | <text font="OWDOPW+Georgia" bbox="371.082,137.606,378.114,170.678" size="33.072">i</text>
224 | <text font="OWDOPW+Georgia" bbox="378.114,137.606,399.258,170.678" size="33.072">m</text>
225 | <text font="OWDOPW+Georgia" bbox="399.255,137.606,405.039,170.678" size="33.072"> </text>
226 | <text font="OWDOPW+Georgia" bbox="405.044,137.606,416.972,170.678" size="33.072">v</text>
227 | <text font="OWDOPW+Georgia" bbox="416.962,137.606,428.554,170.678" size="33.072">e</text>
228 | <text font="OWDOPW+Georgia" bbox="428.564,137.606,442.748,170.678" size="33.072">n</text>
229 | <text font="OWDOPW+Georgia" bbox="442.743,137.606,449.775,170.678" size="33.072">i</text>
230 | <text font="OWDOPW+Georgia" bbox="449.775,137.606,461.871,170.678" size="33.072">a</text>
231 | <text font="OWDOPW+Georgia" bbox="461.869,137.606,483.013,170.678" size="33.072">m</text>
232 | <text font="OWDOPW+Georgia" bbox="483.010,137.606,489.490,170.678" size="33.072">,</text>
233 | <text>
234 | </text>
235 | </textline>
236 | </textbox>
237 | 
238 | <textbox id="1" bbox="50.000,598.517,314.528,697.733">
239 | <textline bbox="50.000,598.517,314.528,697.733">
240 | <text font="OWDOPW+Georgia" bbox="50.000,598.517,104.432,697.733" size="99.216">U</text>
241 | <text font="OWDOPW+Georgia" bbox="104.454,598.517,147.006,697.733" size="99.216">n</text>
242 | <text font="OWDOPW+Georgia" bbox="146.991,598.517,171.831,697.733" size="99.216">t</text>
243 | <text font="OWDOPW+Georgia" bbox="171.846,598.517,192.942,697.733" size="99.216">i</text>
244 | <text font="OWDOPW+Georgia" bbox="192.942,598.517,217.782,697.733" size="99.216">t</text>
245 | <text font="OWDOPW+Georgia" bbox="217.796,598.517,238.388,697.733" size="99.216">l</text>
246 | <text font="OWDOPW+Georgia" bbox="238.395,598.517,273.171,697.733" size="99.216">e</text>
247 | <text font="OWDOPW+Georgia" bbox="273.200,598.517,314.528,697.733" size="99.216">d</text>
248 | <text>
249 | </text>
250 | </textline>
251 | </textbox>
252 | <textbox id="2" bbox="66.250,478.004,181.273,502.808">
253 | <textline bbox="66.250,478.004,181.273,502.808">
254 | <text font="OWDOPW+Georgia" bbox="66.250,478.004,77.122,502.808" size="24.804">L</text>
255 | <text font="OWDOPW+Georgia" bbox="78.013,478.793,88.727,498.636" size="19.843">O</text>
256 | <text font="OWDOPW+Georgia" bbox="89.628,478.793,99.737,498.636" size="19.843">R</text>
257 | <text font="OWDOPW+Georgia" bbox="100.639,478.793,110.042,498.636" size="19.843">E</text>
258 | <text font="OWDOPW+Georgia" bbox="110.943,478.793,124.292,498.636" size="19.843">M</text>
259 | <text> </text>
260 | <text font="OWDOPW+Georgia" bbox="125.193,478.793,128.664,498.636" size="19.843"> </text>
261 | <text> </text>
262 | <text font="OWDOPW+Georgia" bbox="129.566,478.004,136.586,502.808" size="24.804">I</text>
263 | <text> </text>
264 | <text font="OWDOPW+Georgia" bbox="137.480,478.793,146.264,498.636" size="19.843">P</text>
265 | <text> </text>
266 | <text font="OWDOPW+Georgia" bbox="147.163,478.793,155.241,498.636" size="19.843">S</text>
267 | <text font="OWDOPW+Georgia" bbox="156.140,478.793,167.026,498.636" size="19.843">U</text>
268 | <text font="OWDOPW+Georgia" bbox="167.925,478.793,181.273,498.636" size="19.843">M</text>
269 | <text>
270 | </text>
271 | </textline>
272 | </textbox>
273 | </page>
274 | </pages>
275 |     """
276 |     soup = BeautifulSoup(file)
277 |     p = page(soup.pages.page)
278 |     print("[%s] json[%s]" % (p, p.toDict()))
279 | 


--------------------------------------------------------------------------------
/pdfreader/lib/paragraph.py:
--------------------------------------------------------------------------------
  1 | from line import line
  2 | from cStringIO import StringIO
  3 | 
  4 | class paragraph(object):
  5 | 
  6 |     _width = 0
  7 |     _height = 0
  8 |     _x = 0
  9 |     _y = 0
 10 |     _font = None
 11 |     _lines = []
 12 | 
 13 |     @property
 14 |     def font(self):
 15 |         return self._font
 16 | 
 17 |     @property
 18 |     def size(self):
 19 |         return self._size
 20 | 
 21 |     @property
 22 |     def x(self):
 23 |         return self._x
 24 | 
 25 |     @property
 26 |     def y(self):
 27 |         return self._y
 28 | 
 29 |     def __init__(self, xml_paragraph):
 30 |         # bbox is "x0,y0,x1,y1" (pdfminer-style corners: lower-left, upper-right).
 31 |         (x0, y0, x1, y1) = [float(v) for v in xml_paragraph.get('bbox').split(',')]
 32 |         self._width = int(x1 - x0)
 33 |         self._height = int(y1 - y0)
 34 |         self._x, self._y = int(x0), int(y0)
 35 |         # Build one line object per <textline> child.
 36 |         xml = xml_paragraph.find_all('textline')
 37 |         self._lines = []
 38 |         for l in xml:
 39 |             self._lines.append(line(l))
 40 |         #
 41 |         self._font = self._lines[0].font if self._lines else None
 42 |         self._size = self._lines[0].size if self._lines else None
 43 | 
 44 |     def toDict(self):
 45 |         dico = {'font': self._font,
 46 |                 'size': self._size,
 47 |                 'x': self._x,
 48 |                 'y': self._y,
 49 |                 'width': self._width,
 50 |                 'height': self._height
 51 |                 }
 52 |         string = StringIO()
 53 |         count = 0
 54 |         for l in self._lines:
 55 |             if count != 0:
 56 |                 string.write("\n")
 57 |             string.write(str(l))
 58 |             count = 1
 59 |         dico.update({'string': string.getvalue()})
 60 |         return dico
 61 | 
 62 |     def __str__(self):
 63 |         string = StringIO()
 64 |         count = 0
 65 |         for l in self._lines:
 66 |             if count != 0:
 67 |                 string.write("\n")
 68 |             string.write(str(l))
 69 |             count = 1
 70 |         return string.getvalue()
 71 | 
 72 | 
 73 | 
 74 | 
 75 | 
 76 | 
 77 | 
 78 | 
 79 | 
 80 | 
 81 | 
 82 | 
 83 | 
 84 | 
 85 | 
 86 | 
 87 | 
 88 | 
 89 | 
 90 | 
 91 | 
 92 | 
 93 | if __name__ == '__main__':
 94 |     import sys
 95 |     import json
 96 |     from bs4 import BeautifulSoup
 97 |     file="""<?xml version="1.0" encoding="utf-8" ?>
 98 | <pages>
 99 | <page id="1" bbox="0.000,0.000,1024.000,748.000" rotate="0">
100 | <textbox id="2" bbox="50.000,40.406,489.850,269.078">
101 | <textline bbox="50.000,236.006,485.115,269.078">
102 | <text font="OWDOPW+Georgia-Bold" bbox="263.163,702.557,271.263,721.036" size="18.479">L</text>
103 | <text font="OWDOPW+Georgia-Bold" bbox="271.927,702.557,281.904,721.036" size="18.479">O</text>
104 | <text font="OWDOPW+Georgia-BoldItalic" bbox="282.582,702.557,291.996,721.036" size="18.479">R</text>
105 | <text font="OWDOPW+Georgia-BoldItalic" bbox="292.659,702.557,301.416,721.036" size="18.479">E</text>
106 | <text font="OWDOPW+Georgia-Italic" bbox="302.094,702.557,314.525,721.036" size="18.479">M</text>
107 | <text font="OWDOPW+Georgia" bbox="315.202,702.557,318.434,721.036" size="18.479"> </text>
108 | <text font="OWDOPW+Georgia" bbox="319.111,702.557,324.341,721.036" size="18.479">I</text>
109 | <text font="OWDOPW+Georgia" bbox="325.005,702.557,333.186,721.036" size="18.479">P</text>
110 | <text font="OWDOPW+Georgia" bbox="333.849,702.557,341.373,721.036" size="18.479">S</text>
111 | <text font="OWDOPW+Georgia" bbox="342.050,702.557,352.188,721.036" size="18.479">U</text>
112 | <text font="OWDOPW+Georgia" bbox="352.865,702.557,365.297,721.036" size="18.479">M</text>
113 | <text font="OWDOPW+Georgia" bbox="191.847,236.006,197.631,269.078" size="33.072"> </text>
114 | <text font="OWDOPW+Georgia" bbox="197.636,236.006,211.412,269.078" size="33.072">d</text>
115 | <text font="OWDOPW+Georgia" bbox="211.417,236.006,224.353,269.078" size="33.072">o</text>
116 | <text font="OWDOPW+Georgia" bbox="224.355,236.006,231.219,269.078" size="33.072">l</text>
117 | <text font="OWDOPW+Georgia" bbox="231.222,236.006,244.158,269.078" size="33.072">o</text>
118 | <text font="OWDOPW+Georgia" bbox="244.160,236.006,254.000,269.078" size="33.072">r</text>
119 | <text font="OWDOPW+Georgia" bbox="253.993,236.006,259.777,269.078" size="33.072"> </text>
120 | <text font="OWDOPW+Georgia" bbox="259.782,236.006,270.150,269.078" size="33.072">s</text>
121 | <text font="OWDOPW+Georgia" bbox="270.152,236.006,277.184,269.078" size="33.072">i</text>
122 | <text font="OWDOPW+Georgia" bbox="277.184,236.006,285.464,269.078" size="33.072">t</text>
123 | <text font="OWDOPW+Georgia" bbox="285.469,236.006,291.253,269.078" size="33.072"> </text>
124 | <text font="OWDOPW+Georgia" bbox="291.258,236.006,303.354,269.078" size="33.072">a</text>
125 | <text font="OWDOPW+Georgia" bbox="303.351,236.006,324.495,269.078" size="33.072">m</text>
126 | <text font="OWDOPW+Georgia" bbox="324.493,236.006,336.085,269.078" size="33.072">e</text>
127 | <text font="OWDOPW+Georgia" bbox="336.094,236.006,344.374,269.078" size="33.072">t</text>
128 | <text font="OWDOPW+Georgia" bbox="344.379,236.006,350.859,269.078" size="33.072">,</text>
129 | <text font="OWDOPW+Georgia" bbox="350.847,236.006,356.631,269.078" size="33.072"> </text>
130 | <text font="OWDOPW+Georgia" bbox="356.636,236.006,367.532,269.078" size="33.072">c</text>
131 | <text font="OWDOPW+Georgia" bbox="367.534,236.006,380.470,269.078" size="33.072">o</text>
132 | <text font="OWDOPW+Georgia" bbox="380.473,236.006,394.657,269.078" size="33.072">n</text>
133 | <text font="OWDOPW+Georgia" bbox="394.652,236.006,405.020,269.078" size="33.072">s</text>
134 | <text font="OWDOPW+Georgia" bbox="405.022,236.006,416.614,269.078" size="33.072">e</text>
135 | <text font="OWDOPW+Georgia" bbox="416.624,236.006,427.520,269.078" size="33.072">c</text>
136 | <text font="OWDOPW+Georgia" bbox="427.522,236.006,435.802,269.078" size="33.072">t</text>
137 | <text font="OWDOPW+Georgia" bbox="435.807,236.006,447.399,269.078" size="33.072">e</text>
138 | <text font="OWDOPW+Georgia" bbox="447.409,236.006,455.689,269.078" size="33.072">t</text>
139 | <text font="OWDOPW+Georgia" bbox="455.694,236.006,469.494,269.078" size="33.072">u</text>
140 | <text font="OWDOPW+Georgia" bbox="469.498,236.006,479.338,269.078" size="33.072">r</text>
141 | <text font="OWDOPW+Georgia" bbox="479.331,236.006,485.115,269.078" size="33.072"> </text>
142 | <text>
143 | </text>
144 | </textline>
145 | <textline bbox="50.000,203.606,420.075,237.278">
146 | <text font="OTOBVA+Georgia-Italic" bbox="50.000,203.606,63.752,236.654" size="33.048">a</text>
147 | <text font="OTOBVA+Georgia-Italic" bbox="63.747,203.606,77.547,236.654" size="33.048">d</text>
148 | <text font="OTOBVA+Georgia-Italic" bbox="77.542,203.606,84.670,236.654" size="33.048">i</text>
149 | <text font="OTOBVA+Georgia-Italic" bbox="84.690,203.606,98.562,236.654" size="33.048">p</text>
150 | <text font="OTOBVA+Georgia-Italic" bbox="98.557,203.606,105.685,236.654" size="33.048">i</text>
151 | <text font="OTOBVA+Georgia-Italic" bbox="105.704,203.606,116.048,236.654" size="33.048">s</text>
152 | <text font="OTOBVA+Georgia-Italic" bbox="116.043,203.606,123.171,236.654" size="33.048">i</text>
153 | <text font="OTOBVA+Georgia-Italic" bbox="123.190,203.606,134.086,236.654" size="33.048">c</text>
154 | <text font="OTOBVA+Georgia-Italic" bbox="134.082,203.606,141.210,236.654" size="33.048">i</text>
155 | <text font="OTOBVA+Georgia-Italic" bbox="141.229,203.606,155.389,236.654" size="33.048">n</text>
156 | <text font="OTOBVA+Georgia-Italic" bbox="155.384,203.606,169.136,236.654" size="33.048">g</text>
157 | <text font="OWDOPW+Georgia" bbox="169.109,203.606,174.893,236.678" size="33.072"> </text>
158 | <text font="YFNXOA+Georgia-Bold" bbox="174.898,203.606,188.626,237.278" size="33.672">e</text>
159 | <text font="YFNXOA+Georgia-Bold" bbox="188.622,203.606,196.878,237.278" size="33.672">l</text>
160 | <text font="YFNXOA+Georgia-Bold" bbox="196.873,203.606,205.369,237.278" size="33.672">i</text>
161 | <text font="YFNXOA+Georgia-Bold" bbox="205.364,203.606,214.892,237.278" size="33.672">t</text>
162 | <text font="YFNXOA+Georgia-Bold" bbox="214.911,203.606,222.783,237.278" size="33.672">,</text>
163 | <text font="YFNXOA+Georgia-Bold" bbox="222.778,203.606,228.874,237.278" size="33.672"> </text>
164 | <text font="YFNXOA+Georgia-Bold" bbox="228.870,203.606,241.182,237.278" size="33.672">s</text>
165 | <text font="YFNXOA+Georgia-Bold" bbox="241.177,203.606,254.905,237.278" size="33.672">e</text>
166 | <text font="YFNXOA+Georgia-Bold" bbox="254.900,203.606,270.812,237.278" size="33.672">d</text>
167 | <text font="OWDOPW+Georgia" bbox="270.816,203.606,276.600,236.678" size="33.072"> </text>
168 | <text font="UXGHCH+Georgia-BoldItalic" bbox="276.606,203.606,292.517,237.062" size="33.456">d</text>
169 | <text font="UXGHCH+Georgia-BoldItalic" bbox="292.508,203.606,307.772,237.062" size="33.456">o</text>
170 | <text font="UXGHCH+Georgia-BoldItalic" bbox="307.765,203.606,313.861,237.062" size="33.456"> </text>
171 | <text font="UXGHCH+Georgia-BoldItalic" bbox="313.858,203.606,327.226,237.062" size="33.456">e</text>
172 | <text font="UXGHCH+Georgia-BoldItalic" bbox="327.229,203.606,336.013,237.062" size="33.456">i</text>
173 | <text font="UXGHCH+Georgia-BoldItalic" bbox="336.005,203.606,352.445,237.062" size="33.456">u</text>
174 | <text font="UXGHCH+Georgia-BoldItalic" bbox="352.448,203.606,364.856,237.062" size="33.456">s</text>
175 | <text font="UXGHCH+Georgia-BoldItalic" bbox="364.858,203.606,388.906,237.062" size="33.456">m</text>
176 | <text font="UXGHCH+Georgia-BoldItalic" bbox="388.906,203.606,404.170,237.062" size="33.456">o</text>
177 | <text font="UXGHCH+Georgia-BoldItalic" bbox="404.163,203.606,420.075,237.062" size="33.456">d</text>
178 | <text>
179 | </text>
180 | </textline>
181 | <textline bbox="50.000,170.206,452.162,204.656">
182 | <text font="OWDOPW+Georgia" bbox="50.000,170.206,58.625,204.656" size="34.450">t</text>
183 | <text font="OWDOPW+Georgia" bbox="58.630,170.206,70.705,204.656" size="34.450">e</text>
184 | <text font="OWDOPW+Georgia" bbox="70.710,170.206,92.735,204.656" size="34.450">m</text>
185 | <text font="OWDOPW+Georgia" bbox="92.740,170.206,107.015,204.656" size="34.450">p</text>
186 | <text font="OWDOPW+Georgia" bbox="107.020,170.206,120.495,204.656" size="34.450">o</text>
187 | <text font="OWDOPW+Georgia" bbox="120.500,170.206,130.750,204.656" size="34.450">r</text>
188 | <text font="OWDOPW+Georgia" bbox="130.737,170.425,136.521,203.497" size="33.072"> </text>
189 | <text font="OWDOPW+Georgia" bbox="136.526,170.425,143.558,203.497" size="33.072">i</text>
190 | <text font="OWDOPW+Georgia" bbox="143.558,170.425,157.742,203.497" size="33.072">n</text>
191 | <text font="OWDOPW+Georgia" bbox="157.738,170.425,168.634,203.497" size="33.072">c</text>
192 | <text font="OWDOPW+Georgia" bbox="168.636,170.425,175.668,203.497" size="33.072">i</text>
193 | <text font="OWDOPW+Georgia" bbox="175.668,170.425,189.444,203.497" size="33.072">d</text>
194 | <text font="OWDOPW+Georgia" bbox="189.449,170.425,196.481,203.497" size="33.072">i</text>
195 | <text font="OWDOPW+Georgia" bbox="196.481,170.425,210.257,203.497" size="33.072">d</text>
196 | <text font="OWDOPW+Georgia" bbox="210.262,170.425,224.062,203.497" size="33.072">u</text>
197 | <text font="OWDOPW+Georgia" bbox="224.066,170.425,238.250,203.497" size="33.072">n</text>
198 | <text font="OWDOPW+Georgia" bbox="238.246,170.425,246.526,203.497" size="33.072">t</text>
199 | <text font="OWDOPW+Georgia" bbox="246.530,170.425,252.314,203.497" size="33.072"> </text>
200 | <text font="OWDOPW+Georgia" bbox="252.319,170.425,266.119,203.497" size="33.072">u</text>
201 | <text font="OWDOPW+Georgia" bbox="266.124,170.425,274.404,203.497" size="33.072">t</text>
202 | <text font="OWDOPW+Georgia" bbox="274.409,170.425,280.193,203.497" size="33.072"> </text>
203 | <text font="OWDOPW+Georgia" bbox="280.197,170.425,287.061,203.497" size="33.072">l</text>
204 | <text font="OWDOPW+Georgia" bbox="287.066,170.425,299.162,203.497" size="33.072">a</text>
205 | <text font="OWDOPW+Georgia" bbox="299.167,170.425,312.607,203.497" size="33.072">b</text>
206 | <text font="OWDOPW+Georgia" bbox="312.612,170.425,325.548,203.497" size="33.072">o</text>
207 | <text font="OWDOPW+Georgia" bbox="325.553,170.425,335.393,203.497" size="33.072">r</text>
208 | <text font="OWDOPW+Georgia" bbox="335.374,170.425,346.966,203.497" size="33.072">e</text>
209 | <text font="OWDOPW+Georgia" bbox="346.970,170.425,352.754,203.497" size="33.072"> </text>
210 | <text font="OWDOPW+Georgia" bbox="352.759,170.425,364.351,203.497" size="33.072">e</text>
211 | <text font="OWDOPW+Georgia" bbox="364.356,170.425,372.636,203.497" size="33.072">t</text>
212 | <text font="OWDOPW+Georgia" bbox="372.641,170.425,378.425,203.497" size="33.072"> </text>
213 | <text font="OWDOPW+Georgia" bbox="378.430,170.425,392.206,203.497" size="33.072">d</text>
214 | <text font="OWDOPW+Georgia" bbox="392.210,170.425,405.146,203.497" size="33.072">o</text>
215 | <text font="OWDOPW+Georgia" bbox="405.151,170.425,412.015,203.497" size="33.072">l</text>
216 | <text font="OWDOPW+Georgia" bbox="412.020,170.425,424.956,203.497" size="33.072">o</text>
217 | <text font="OWDOPW+Georgia" bbox="424.961,170.425,434.801,203.497" size="33.072">r</text>
218 | <text font="OWDOPW+Georgia" bbox="434.782,170.425,446.374,203.497" size="33.072">e</text>
219 | <text font="OWDOPW+Georgia" bbox="446.378,170.425,452.162,203.497" size="33.072"> </text>
220 | <text>
221 | </text>
222 | </textline>
223 | <textline bbox="50.000,137.606,489.490,170.678">
224 | <text font="OWDOPW+Georgia" bbox="50.000,137.606,71.144,170.678" size="33.072">m</text>
225 | <text font="OWDOPW+Georgia" bbox="71.142,137.606,83.238,170.678" size="33.072">a</text>
226 | <text font="OWDOPW+Georgia" bbox="83.235,137.606,95.451,170.678" size="33.072">g</text>
227 | <text font="OWDOPW+Georgia" bbox="95.458,137.606,109.642,170.678" size="33.072">n</text>
228 | <text font="OWDOPW+Georgia" bbox="109.638,137.606,121.734,170.678" size="33.072">a</text>
229 | <text font="OWDOPW+Georgia" bbox="121.731,137.606,127.515,170.678" size="33.072"> </text>
230 | <text font="OWDOPW+Georgia" bbox="127.520,137.606,139.616,170.678" size="33.072">a</text>
231 | <text font="OWDOPW+Georgia" bbox="139.614,137.606,146.478,170.678" size="33.072">l</text>
232 | <text font="OWDOPW+Georgia" bbox="146.480,137.606,153.512,170.678" size="33.072">i</text>
233 | <text font="OWDOPW+Georgia" bbox="153.512,137.606,166.952,170.678" size="33.072">q</text>
234 | <text font="OWDOPW+Georgia" bbox="166.942,137.606,180.742,170.678" size="33.072">u</text>
235 | <text font="OWDOPW+Georgia" bbox="180.747,137.606,192.843,170.678" size="33.072">a</text>
236 | <text font="OWDOPW+Georgia" bbox="192.841,137.606,199.321,170.678" size="33.072">.</text>
237 | <text font="OWDOPW+Georgia" bbox="199.309,137.606,205.093,170.678" size="33.072"> </text>
238 | <text font="OWDOPW+Georgia" bbox="205.098,137.606,223.242,170.678" size="33.072">U</text>
239 | <text font="OWDOPW+Georgia" bbox="223.249,137.606,231.529,170.678" size="33.072">t</text>
240 | <text font="OWDOPW+Georgia" bbox="231.534,137.606,237.318,170.678" size="33.072"> </text>
241 | <text font="OWDOPW+Georgia" bbox="237.322,137.606,248.914,170.678" size="33.072">e</text>
242 | <text font="OWDOPW+Georgia" bbox="248.924,137.606,263.108,170.678" size="33.072">n</text>
243 | <text font="OWDOPW+Georgia" bbox="263.103,137.606,270.135,170.678" size="33.072">i</text>
244 | <text font="OWDOPW+Georgia" bbox="270.135,137.606,291.279,170.678" size="33.072">m</text>
245 | <text font="OWDOPW+Georgia" bbox="291.277,137.606,297.061,170.678" size="33.072"> </text>
246 | <text font="OWDOPW+Georgia" bbox="297.066,137.606,309.162,170.678" size="33.072">a</text>
247 | <text font="OWDOPW+Georgia" bbox="309.159,137.606,322.935,170.678" size="33.072">d</text>
248 | <text font="OWDOPW+Georgia" bbox="322.940,137.606,328.724,170.678" size="33.072"> </text>
249 | <text font="OWDOPW+Georgia" bbox="328.729,137.606,349.873,170.678" size="33.072">m</text>
250 | <text font="OWDOPW+Georgia" bbox="349.870,137.606,356.902,170.678" size="33.072">i</text>
251 | <text font="OWDOPW+Georgia" bbox="356.902,137.606,371.086,170.678" size="33.072">n</text>
252 | <text font="OWDOPW+Georgia" bbox="371.082,137.606,378.114,170.678" size="33.072">i</text>
253 | <text font="OWDOPW+Georgia" bbox="378.114,137.606,399.258,170.678" size="33.072">m</text>
254 | <text font="OWDOPW+Georgia" bbox="399.255,137.606,405.039,170.678" size="33.072"> </text>
255 | <text font="OWDOPW+Georgia" bbox="405.044,137.606,416.972,170.678" size="33.072">v</text>
256 | <text font="OWDOPW+Georgia" bbox="416.962,137.606,428.554,170.678" size="33.072">e</text>
257 | <text font="OWDOPW+Georgia" bbox="428.564,137.606,442.748,170.678" size="33.072">n</text>
258 | <text font="OWDOPW+Georgia" bbox="442.743,137.606,449.775,170.678" size="33.072">i</text>
259 | <text font="OWDOPW+Georgia" bbox="449.775,137.606,461.871,170.678" size="33.072">a</text>
260 | <text font="OWDOPW+Georgia" bbox="461.869,137.606,483.013,170.678" size="33.072">m</text>
261 | <text font="OWDOPW+Georgia" bbox="483.010,137.606,489.490,170.678" size="33.072">,</text>
262 | <text>
263 | </text>
264 | </textline>
265 | <textline bbox="50.000,105.206,489.850,138.278">
266 | <text font="OWDOPW+Georgia" bbox="50.000,105.206,63.440,138.278" size="33.072">q</text>
267 | <text font="OWDOPW+Georgia" bbox="63.430,105.206,77.230,138.278" size="33.072">u</text>
268 | <text font="OWDOPW+Georgia" bbox="77.235,105.206,84.267,138.278" size="33.072">i</text>
269 | <text font="OWDOPW+Georgia" bbox="84.267,105.206,94.635,138.278" size="33.072">s</text>
270 | <text font="OWDOPW+Georgia" bbox="94.638,105.206,100.422,138.278" size="33.072"> </text>
271 | <text font="OWDOPW+Georgia" bbox="100.426,105.206,114.610,138.278" size="33.072">n</text>
272 | <text font="OWDOPW+Georgia" bbox="114.606,105.206,127.542,138.278" size="33.072">o</text>
273 | <text font="OWDOPW+Georgia" bbox="127.544,105.206,137.912,138.278" size="33.072">s</text>
274 | <text font="OWDOPW+Georgia" bbox="137.914,105.206,146.194,138.278" size="33.072">t</text>
275 | <text font="OWDOPW+Georgia" bbox="146.199,105.206,156.039,138.278" size="33.072">r</text>
276 | <text font="OWDOPW+Georgia" bbox="156.032,105.206,169.832,138.278" size="33.072">u</text>
277 | <text font="OWDOPW+Georgia" bbox="169.837,105.206,183.613,138.278" size="33.072">d</text>
278 | <text font="OWDOPW+Georgia" bbox="183.618,105.206,189.402,138.278" size="33.072"> </text>
279 | <text font="OWDOPW+Georgia" bbox="189.406,105.206,200.998,138.278" size="33.072">e</text>
280 | <text font="OWDOPW+Georgia" bbox="201.008,105.206,213.128,138.278" size="33.072">x</text>
281 | <text font="OWDOPW+Georgia" bbox="213.126,105.206,224.718,138.278" size="33.072">e</text>
282 | <text font="OWDOPW+Georgia" bbox="224.727,105.206,234.567,138.278" size="33.072">r</text>
283 | <text font="OWDOPW+Georgia" bbox="234.560,105.206,245.456,138.278" size="33.072">c</text>
284 | <text font="OWDOPW+Georgia" bbox="245.458,105.206,252.490,138.278" size="33.072">i</text>
285 | <text font="OWDOPW+Georgia" bbox="252.490,105.206,260.770,138.278" size="33.072">t</text>
286 | <text font="OWDOPW+Georgia" bbox="260.775,105.206,272.871,138.278" size="33.072">a</text>
287 | <text font="OWDOPW+Georgia" bbox="272.869,105.206,281.149,138.278" size="33.072">t</text>
288 | <text font="OWDOPW+Georgia" bbox="281.154,105.206,288.186,138.278" size="33.072">i</text>
289 | <text font="OWDOPW+Georgia" bbox="288.186,105.206,301.122,138.278" size="33.072">o</text>
290 | <text font="OWDOPW+Georgia" bbox="301.124,105.206,315.308,138.278" size="33.072">n</text>
291 | <text font="OWDOPW+Georgia" bbox="315.303,105.206,321.087,138.278" size="33.072"> </text>
292 | <text font="OWDOPW+Georgia" bbox="321.092,105.206,334.892,138.278" size="33.072">u</text>
293 | <text font="OWDOPW+Georgia" bbox="334.897,105.206,341.761,138.278" size="33.072">l</text>
294 | <text font="OWDOPW+Georgia" bbox="341.763,105.206,348.627,138.278" size="33.072">l</text>
295 | <text font="OWDOPW+Georgia" bbox="348.630,105.206,360.726,138.278" size="33.072">a</text>
296 | <text font="OWDOPW+Georgia" bbox="360.723,105.206,381.867,138.278" size="33.072">m</text>
297 | <text font="OWDOPW+Georgia" bbox="381.865,105.206,392.761,138.278" size="33.072">c</text>
298 | <text font="OWDOPW+Georgia" bbox="392.763,105.206,405.699,138.278" size="33.072">o</text>
299 | <text font="OWDOPW+Georgia" bbox="405.702,105.206,411.486,138.278" size="33.072"> </text>
300 | <text font="OWDOPW+Georgia" bbox="411.490,105.206,418.354,138.278" size="33.072">l</text>
301 | <text font="OWDOPW+Georgia" bbox="418.357,105.206,430.453,138.278" size="33.072">a</text>
302 | <text font="OWDOPW+Georgia" bbox="430.450,105.206,443.890,138.278" size="33.072">b</text>
303 | <text font="OWDOPW+Georgia" bbox="443.893,105.206,456.829,138.278" size="33.072">o</text>
304 | <text font="OWDOPW+Georgia" bbox="456.831,105.206,466.671,138.278" size="33.072">r</text>
305 | <text font="OWDOPW+Georgia" bbox="466.664,105.206,473.696,138.278" size="33.072">i</text>
306 | <text font="OWDOPW+Georgia" bbox="473.696,105.206,484.064,138.278" size="33.072">s</text>
307 | <text font="OWDOPW+Georgia" bbox="484.066,105.206,489.850,138.278" size="33.072"> </text>
308 | <text>
309 | </text>
310 | </textline>
311 | <textline bbox="50.000,72.806,366.807,105.878">
312 | <text font="OWDOPW+Georgia" bbox="50.000,72.806,64.184,105.878" size="33.072">n</text>
313 | <text font="OWDOPW+Georgia" bbox="64.179,72.806,71.211,105.878" size="33.072">i</text>
314 | <text font="OWDOPW+Georgia" bbox="71.211,72.806,81.579,105.878" size="33.072">s</text>
315 | <text font="OWDOPW+Georgia" bbox="81.582,72.806,88.614,105.878" size="33.072">i</text>
316 | <text font="OWDOPW+Georgia" bbox="88.614,72.806,94.398,105.878" size="33.072"> </text>
317 | <text font="OWDOPW+Georgia" bbox="94.402,72.806,108.202,105.878" size="33.072">u</text>
318 | <text font="OWDOPW+Georgia" bbox="108.207,72.806,116.487,105.878" size="33.072">t</text>
319 | <text font="OWDOPW+Georgia" bbox="116.492,72.806,122.276,105.878" size="33.072"> </text>
320 | <text font="OWDOPW+Georgia" bbox="122.281,72.806,134.377,105.878" size="33.072">a</text>
321 | <text font="OWDOPW+Georgia" bbox="134.374,72.806,141.238,105.878" size="33.072">l</text>
322 | <text font="OWDOPW+Georgia" bbox="141.241,72.806,148.273,105.878" size="33.072">i</text>
323 | <text font="OWDOPW+Georgia" bbox="148.273,72.806,161.713,105.878" size="33.072">q</text>
324 | <text font="OWDOPW+Georgia" bbox="161.703,72.806,175.503,105.878" size="33.072">u</text>
325 | <text font="OWDOPW+Georgia" bbox="175.508,72.806,182.540,105.878" size="33.072">i</text>
326 | <text font="OWDOPW+Georgia" bbox="182.540,72.806,196.244,105.878" size="33.072">p</text>
327 | <text font="OWDOPW+Georgia" bbox="196.251,72.806,202.035,105.878" size="33.072"> </text>
328 | <text font="OWDOPW+Georgia" bbox="202.040,72.806,213.632,105.878" size="33.072">e</text>
329 | <text font="OWDOPW+Georgia" bbox="213.642,72.806,225.762,105.878" size="33.072">x</text>
330 | <text font="OWDOPW+Georgia" bbox="225.759,72.806,231.543,105.878" size="33.072"> </text>
331 | <text font="OWDOPW+Georgia" bbox="231.548,72.806,243.140,105.878" size="33.072">e</text>
332 | <text font="OWDOPW+Georgia" bbox="243.150,72.806,255.246,105.878" size="33.072">a</text>
333 | <text font="OWDOPW+Georgia" bbox="255.243,72.806,261.027,105.878" size="33.072"> </text>
334 | <text font="OWDOPW+Georgia" bbox="261.032,72.806,271.928,105.878" size="33.072">c</text>
335 | <text font="OWDOPW+Georgia" bbox="271.930,72.806,284.866,105.878" size="33.072">o</text>
336 | <text font="OWDOPW+Georgia" bbox="284.869,72.806,306.013,105.878" size="33.072">m</text>
337 | <text font="OWDOPW+Georgia" bbox="306.010,72.806,327.154,105.878" size="33.072">m</text>
338 | <text font="OWDOPW+Georgia" bbox="327.152,72.806,340.088,105.878" size="33.072">o</text>
339 | <text font="OWDOPW+Georgia" bbox="340.090,72.806,353.866,105.878" size="33.072">d</text>
340 | <text font="OWDOPW+Georgia" bbox="353.871,72.806,366.807,105.878" size="33.072">o</text>
341 | <text>
342 | </text>
343 | </textline>
344 | <textline bbox="50.000,40.406,164.070,73.478">
345 | <text font="OWDOPW+Georgia" bbox="50.000,40.406,60.896,73.478" size="33.072">c</text>
346 | <text font="OWDOPW+Georgia" bbox="60.898,40.406,73.834,73.478" size="33.072">o</text>
347 | <text font="OWDOPW+Georgia" bbox="73.837,40.406,88.021,73.478" size="33.072">n</text>
348 | <text font="OWDOPW+Georgia" bbox="88.023,40.406,98.391,73.478" size="33.072">s</text>
349 | <text font="OWDOPW+Georgia" bbox="98.394,40.406,109.986,73.478" size="33.072">e</text>
350 | <text font="OWDOPW+Georgia" bbox="109.988,40.406,123.428,73.478" size="33.072">q</text>
351 | <text font="OWDOPW+Georgia" bbox="123.406,40.406,137.206,73.478" size="33.072">u</text>
352 | <text font="OWDOPW+Georgia" bbox="137.209,40.406,149.305,73.478" size="33.072">a</text>
353 | <text font="OWDOPW+Georgia" bbox="149.307,40.406,157.587,73.478" size="33.072">t</text>
354 | <text font="OWDOPW+Georgia" bbox="157.590,40.406,164.070,73.478" size="33.072">.</text>
355 | <text>
356 | </text>
357 | </textline>
358 | </textbox>
359 | </page>
360 | </pages>
361 |     """
362 |     soup = BeautifulSoup(file)
363 |     p = paragraph(soup.pages.page.textbox)
364 |     print ("[%s] font[%s] json[%s]" % (p, p.font, p.toDict()))
365 | 
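
The test above feeds pdfminer-style XML to `paragraph` through BeautifulSoup. As a rough illustration of the same structure, here is a minimal sketch that joins the one-character `<text>` elements of each `<textline>` back into a string. It assumes the standard library's `xml.etree.ElementTree` instead of the project's own classes, and `textlines_to_strings` and `SAMPLE` are hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the XML above.
SAMPLE = """<pages>
<page>
<textbox>
<textline bbox="50.000,72.806,71.211,105.878">
<text font="OWDOPW+Georgia" bbox="50.000,72.806,64.184,105.878" size="33.072">n</text>
<text font="OWDOPW+Georgia" bbox="64.179,72.806,71.211,105.878" size="33.072">i</text>
<text>
</text>
</textline>
<textline bbox="50.000,40.406,73.834,73.478">
<text font="OWDOPW+Georgia" bbox="50.000,40.406,60.896,73.478" size="33.072">c</text>
<text font="OWDOPW+Georgia" bbox="60.898,40.406,73.834,73.478" size="33.072">o</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def textlines_to_strings(xml_string):
    """Return one string per <textline>, built from its <text> children."""
    root = ET.fromstring(xml_string)
    lines = []
    for textline in root.iter('textline'):
        # Each <text> holds a single character; the attribute-less <text>
        # at the end of a line holds only the newline.
        chars = [t.text or '' for t in textline.findall('text')]
        lines.append(''.join(chars).rstrip('\n'))
    return lines


print(textlines_to_strings(SAMPLE))
```

The per-character `font`, `bbox` and `size` attributes are ignored here; the project's `paragraph` class is what turns them into the fields of the final dictionary.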


--------------------------------------------------------------------------------
/pdfreader/main.py:
--------------------------------------------------------------------------------
  1 | from util.convert import converter
  2 | from lib.book import book
  3 | 
  4 | def text_to_dict(fileinput):
  5 |     """
  6 |     Extract the text into an XML string.
  7 |     """
  8 |     convert = converter()
  9 |     xml = convert.as_xml().add_input_file(fileinput).run()
 10 |     """
 11 |     Parse the XML string and transform it into a dictionary.
 12 |     """
 13 |     b = book(xml)
 14 |     return b.toDict()
 15 | 
 16 | 
 17 | """
 18 | Extract the images from the PDF file.
 19 | Images are named using the following convention: tempimg-[pageNumber]-[imageNumber].[ext] (eg: tempimg-001-000.ppm)
 20 | """
 21 | import subprocess
 22 | import sys
 23 | 
 24 | def extract_images(file):
 25 |     subprocess.call(['/usr/local/bin/pdfimages', '-p', '-j', file, 'tempimg'], stderr=sys.stdout)
 26 | 
 27 | 
 28 | 
 29 | """
 30 | Match the extracted images (jpg, ppm, pbm) by file-name pattern and convert the ppm/pbm ones to png.
 31 | At the same time, dict_book is updated with the names of the resulting images.
 32 | """
 33 | import glob
 34 | import json
 35 | import uuid
 36 | import re
 37 | import os
 38 | 
 39 | def get_img_names_by_page_number():
 40 |     image_list = {}
 41 |     nb_images = 0
 42 |     ppm_images = glob.glob('./tempimg*.*')
 43 |     for image in ppm_images:
 44 |         match = re.match(r'\./tempimg-(\d+)-(\d+)\.(jpg|ppm|pbm)$', image, re.I)
 45 |         if match:
 46 |             page_num = int(match.group(1)) - 1
 47 |             image_num = int(match.group(2))
 48 |             nb_images += 1
 49 |             if page_num not in image_list:
 50 |                 image_list[page_num] = {}
 51 |             image_list[page_num].update({image_num:image})
 52 |     return image_list, nb_images
 53 | 
 54 | def rename_imgs__update_dict(image_list, dict_book, image_folder):
 55 |     image_num = 1
 56 |     for page in image_list.iterkeys():
 57 |         for image in image_list[page].iterkeys():
 58 |             print "Processing image %d" % image_num
 59 |             image_num += 1
 60 |             if 'images' not in dict_book['pages'][page]:
 61 |                 dict_book['pages'][page].update({'images':[]})
 62 |             if "jpg" in image_list[page][image]:
 63 |                 image_name = "%s_p%d.jpg" % (uuid.uuid1(), page)
 64 |                 dict_book['pages'][page]['images'].append(image_name)
 65 |                 subprocess.call(['mv', image_list[page][image], image_folder+image_name], stderr=sys.stdout)
 66 |             elif "ppm" in image_list[page][image] or "pbm" in image_list[page][image]:
 67 |                 image_name = "%s_p%d.png" % (uuid.uuid1(), page)
 68 |                 dict_book['pages'][page]['images'].append(image_name)
 69 |                 subprocess.call(['/usr/local/bin/convert', image_list[page][image], image_folder+image_name], stderr=sys.stdout)
 70 |                 os.remove(image_list[page][image])
 71 |     return dict_book
 72 | 
 73 | def get_images_update_dict(dict_book, image_folder):
 74 |     image_list, nb_images = get_img_names_by_page_number()
 75 |     print "%d images to process" % nb_images
 76 |     dict_book = rename_imgs__update_dict(image_list, dict_book, image_folder)
 77 |     return dict_book
 78 | 
 79 | 
 80 | def run(pdf_file, image_folder):
 81 |     print "Reading PDF"
 82 |     dict_book = text_to_dict(pdf_file)
 83 |     print "Extracting images"
 84 |     extract_images(pdf_file)
 85 |     dict_book = get_images_update_dict(dict_book, image_folder)
 86 |     return json.dumps(dict_book)
 87 | 
 88 | 
 89 | if __name__ == '__main__':
 90 |     import sys
 91 |     if len(sys.argv) == 3:
 92 |         print run(pdf_file=sys.argv[1], image_folder=sys.argv[2])
 93 |     else:
 94 |         print "usage: %s pdf_file_path generated_images_path/ (eg: python %s book.pdf './images/')" % (sys.argv[0], sys.argv[0])
 95 | 
 96 | 
 97 | 
 98 | """
 99 | # Here is an example of what dict_book looks like
100 | print json.dumps(dict_book)
101 | {
102 |     "pages": [
103 |         {
104 |             "images": [
105 |                 "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"
106 |             ],
107 |             "paragraphs": [
108 |                 {
109 |                     "size": 98,
110 |                     "width": 587,
111 |                     "string": "Book Title",
112 |                     "y": -98,
113 |                     "x": -324,
114 |                     "font": "Georgia",
115 |                     "height": 705
116 |                 }
117 |             ]
118 |         },
119 |         {
120 |             "images": [
121 |                 "96f4e9ee-c1eb-11e2-ad2b-040ccedc7e34_p1.png"
122 |             ],
123 |             "paragraphs": [
124 |                 {
125 |                     "size": 24,
126 |                     "width": 138,
127 |                     "string": "CHAPTER 1",
128 |                     "y": -24,
129 |                     "x": -88,
130 |                     "font": "Georgia",
131 |                     "height": 711
132 |                 },
133 |                 {
134 |                     "size": 33,
135 |                     "width": 489,
136 |                     "string": "Lorem ipsum dolor sit amet, consectetur \n<i>adipisicing</i> <b>elit, sed</b> <i><b>do eiusmod</i></b>\ntempor incididunt ut labore et dolore \nmagna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.",
137 |                     "y": -229,
138 |                     "x": -439,
139 |                     "font": "Georgia",
140 |                     "height": 269
141 |                 }
142 |             ]
143 |         },
144 |         {
145 |             "paragraphs": [
146 |                 {
147 |                     "size": 24,
148 |                     "width": 133,
149 |                     "string": "SECTION 1",
150 |                     "y": -24,
151 |                     "x": -83,
152 |                     "font": "Georgia",
153 |                     "height": 711
154 |                 }
155 |             ]
156 |         }
157 |     ]
158 | }
159 | """
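
The example above shows the shape of the JSON returned by `run()`. As a rough sketch of how a caller might consume it, the snippet below walks the pages and collects the paragraph strings and image names. `summarize_book` and `SAMPLE_BOOK` are hypothetical names introduced for this illustration, and the sample is a trimmed copy of the structure above.

```python
import json

# Hypothetical, trimmed sample in the same shape as the dict_book above.
SAMPLE_BOOK = json.dumps({
    "pages": [
        {
            "images": ["961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"],
            "paragraphs": [
                {"size": 98, "string": "Book Title", "font": "Georgia"}
            ],
        },
        {
            "paragraphs": [
                {"size": 24, "string": "SECTION 1", "font": "Georgia"}
            ],
        },
    ]
})


def summarize_book(book_json):
    """Return, for each page, its index, image names and paragraph strings."""
    book = json.loads(book_json)
    summary = []
    for page_num, page in enumerate(book["pages"]):
        # The "images" key only exists on pages where images were found.
        images = page.get("images", [])
        texts = [p["string"] for p in page.get("paragraphs", [])]
        summary.append({"page": page_num, "images": images, "texts": texts})
    return summary


for entry in summarize_book(SAMPLE_BOOK):
    print(entry)
```

Using `page.get("images", [])` avoids a `KeyError` on pages without images, such as the last page of the example.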


--------------------------------------------------------------------------------
/pdfreader/pff.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/pff.pdf


--------------------------------------------------------------------------------
/pdfreader/pffl.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/pffl.pdf


--------------------------------------------------------------------------------
/pdfreader/png.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/png.pdf


--------------------------------------------------------------------------------
/pdfreader/util/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 | 
4 | if __name__ == '__main__': print __version__
5 | 


--------------------------------------------------------------------------------
/pdfreader/util/convert.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python
  2 | from pdfminer.pdfdocument import PDFDocument
  3 | from pdfminer.pdfparser import  PDFParser
  4 | from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
  5 | from pdfminer.pdfpage import PDFPage
  6 | from pdfminer.pdfdevice import PDFDevice, TagExtractor
  7 | from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
  8 | from pdfminer.cmapdb import CMapDB
  9 | from pdfminer.layout import LAParams
 10 | from pdfminer.image import ImageWriter
 11 | 
 12 | 
 13 | class converter(object):
 14 | 
 15 |     def usage(self):
 16 |         print ('usage: %s [-d] [-p pagenos] [-m maxpages] [-P password] [-o output] [-C] '
 17 |                '[-n] [-A] [-V] [-M char_margin] [-L line_margin] [-W word_margin] [-F boxes_flow] '
 18 |                '[-Y layout_mode] [-O output_dir] [-t text|html|xml|tag] [-c codec] [-s scale] file ...' % self._argv[0])
 19 |         return 100
 20 | 
 21 |     def __init__(self):
 22 |         # debug option
 23 |         self._debug = 0
 24 |         # input option
 25 |         self._password = ''
 26 |         self._pagenos = set()
 27 |         self._maxpages = 0
 28 |         # output option
 29 |         self._outfile = None
 30 |         self._outtype = None
 31 |         self._imagewriter = None
 32 |         self._layoutmode = 'normal'
 33 |         self._codec = 'utf-8'
 34 |         self._pageno = 1
 35 |         self._scale = 1
 36 |         self._caching = True
 37 |         self._showpageno = True
 38 |         self._laparams = LAParams()
 39 |         self._argv = ['convert.py']
 40 |         self._args = []
 41 | 
 42 |     def mainIniter(self, argv):
 43 |         import getopt
 44 |         try:
 45 |             (opts, args) = getopt.getopt(argv[1:], 'dp:m:P:o:CnAVM:L:W:F:Y:O:t:c:s:')
 46 |         except getopt.GetoptError:
 47 |             return self.usage()
 48 |         if not args: return self.usage()
 49 |         for (k, v) in opts:
 50 |             if k == '-d': self._debug += 1
 51 |             elif k == '-p': self._pagenos.update( int(x)-1 for x in v.split(',') )
 52 |             elif k == '-m': self._maxpages = int(v)
 53 |             elif k == '-P': self._password = v
 54 |             elif k == '-o': self._outfile = v
 55 |             elif k == '-C': self._caching = False
 56 |             elif k == '-n': self._laparams = None
 57 |             elif k == '-A': self._laparams.all_texts = True
 58 |             elif k == '-V': self._laparams.detect_vertical = True
 59 |             elif k == '-M': self._laparams.char_margin = float(v)
 60 |             elif k == '-L': self._laparams.line_margin = float(v)
 61 |             elif k == '-W': self._laparams.word_margin = float(v)
 62 |             elif k == '-F': self._laparams.boxes_flow = float(v)
 63 |             elif k == '-Y': self._layoutmode = v
 64 |             elif k == '-O': self._imagewriter = ImageWriter(v)
 65 |             elif k == '-t': self._outtype = v
 66 |             elif k == '-c': self._codec = v
 67 |             elif k == '-s': self._scale = float(v)
 68 |         #
 69 |         self._argv = argv
 70 |         self._args = args
 71 |         #
 72 |         PDFDocument.debug = self._debug
 73 |         PDFParser.debug = self._debug
 74 |         CMapDB.debug = self._debug
 75 |         PDFResourceManager.debug = self._debug
 76 |         PDFPageInterpreter.debug = self._debug
 77 |         PDFDevice.debug = self._debug
 78 |         return self.run()
 79 | 
 80 |     """
 81 |     Output options
 82 |     """
 83 |     def with_debug(self, todo=True):
 84 |         self._debug = 1 if todo else 0
 85 |         return self
 86 | 
 87 |     def as_text(self, todo=True):
 88 |         self._outtype = 'text' if todo else None
 89 |         return self
 90 | 
 91 |     def as_xml(self, todo=True):
 92 |         self._outtype = 'xml' if todo else None
 93 |         return self
 94 | 
 95 |     def as_html(self, todo=True):
 96 |         self._outtype = 'html' if todo else None
 97 |         return self
 98 | 
 99 |     def as_tag(self, todo=True):
100 |         self._outtype = 'tag' if todo else None
101 |         return self
102 | 
103 |     def add_input_file(self, filename):
104 |         self._args.append(filename)
105 |         return self
106 | 
107 |     """
108 |     Process the pdf file(s)
109 |     """
110 |     def run(self):
111 |         rsrcmgr = PDFResourceManager(caching=self._caching)
112 |         if not self._outtype:
113 |             self._outtype = 'text'
114 |             if __name__ == '__main__':
115 |                 if self._outfile:
116 |                     if self._outfile.endswith('.htm') or self._outfile.endswith('.html'):
117 |                         self._outtype = 'html'
118 |                     elif self._outfile.endswith('.xml'):
119 |                         self._outtype = 'xml'
120 |                     elif self._outfile.endswith('.tag'):
121 |                         self._outtype = 'tag'
122 |         if __name__ == '__main__':
123 |             if self._outfile:
124 |                 outfp = open(self._outfile, 'w')
125 |             else:
126 |                 outfp = sys.stdout
127 |         else:
128 |             from cStringIO import StringIO
129 |             outfp = StringIO()
130 |         if self._outtype == 'text':
131 |             device = TextConverter(rsrcmgr, outfp, codec=self._codec, laparams=self._laparams, imagewriter=self._imagewriter)
132 |         elif self._outtype == 'xml':
133 |             device = XMLConverter(rsrcmgr, outfp, codec=self._codec, laparams=self._laparams, imagewriter=self._imagewriter)
134 |         elif self._outtype == 'html':
135 |             device = HTMLConverter(rsrcmgr, outfp, codec=self._codec, scale=self._scale, layoutmode=self._layoutmode, laparams=self._laparams, imagewriter=self._imagewriter)
136 |         elif self._outtype == 'tag':
137 |             device = TagExtractor(rsrcmgr, outfp, codec=self._codec)
138 |         else:
139 |             return self.usage()
140 |         for fname in self._args:
141 |             fp = open(fname, 'rb')
142 |             interpreter = PDFPageInterpreter(rsrcmgr, device)
143 | 
144 |             for page in PDFPage.get_pages(fp, self._pagenos, maxpages=self._maxpages, password=self._password, caching=self._caching, check_extractable=True):
145 |                 interpreter.process_page(page)
146 | 
147 |             fp.close()
148 |         device.close()
149 |         if __name__ == '__main__':
150 |             outfp.close()
151 |         else:
152 |             return outfp.getvalue()
153 | 
154 | 
155 | 
156 | 
157 | if __name__ == '__main__':
158 |     import sys
159 |     sys.exit(converter().mainIniter(sys.argv))
160 | 
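
When no output type is set and the module runs as a script, `run()` falls back to the extension of the output file, with `text` as the default. That fallback could be factored into a small helper along these lines. This is only a sketch: `guess_outtype` is a hypothetical name and is not part of the module above.

```python
def guess_outtype(outfile, default='text'):
    """Guess the output type from the output file name.

    Mirrors the extension checks in converter.run(): .htm/.html -> 'html',
    .xml -> 'xml', .tag -> 'tag', anything else -> the default.
    """
    if not outfile:
        return default
    if outfile.endswith(('.htm', '.html')):
        return 'html'
    if outfile.endswith('.xml'):
        return 'xml'
    if outfile.endswith('.tag'):
        return 'tag'
    return default


print(guess_outtype('book.xml'))
print(guess_outtype(None))
```

Keeping the mapping in one function would make the fallback easy to test without opening any PDF.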


--------------------------------------------------------------------------------
/psd_rb/Gemfile:
--------------------------------------------------------------------------------
1 | source 'https://rubygems.org'
2 | 
3 | gem 'psd'
4 | 


--------------------------------------------------------------------------------
/psd_rb/json_type.json:
--------------------------------------------------------------------------------
  1 | [{
  2 |     "folders": [{
  3 |         "created_at": "2014-04-17T08:23:42Z",
  4 |         "_id": "534f8f8e94da26c97d00003a",
  5 |         "description": "jheuh iuwh iwuhe iuhiuhwiuehriguh iuwheuirhgwhiuerh uihwgiuewrhguiehu hweurh u ieuwh eiuh iuh weuh ihwe iuhwe uhreuhgreugh wieh wiuehwueirghwei hiuh iuewh uiwh uiwhu u weiu rui wiuhewiug iuwer wegwe hguihrgiuwe  reh erwgheihg e g ehgiuheguihw eugi hwegh opj s dfspo sdk kdf",
  6 |         "data": {
  7 |             "orientation": "0",
  8 |             "projectColors": [{
  9 |                 "alpha": 1,
 10 |                 "color": 16777215
 11 |             }, {
 12 |                 "alpha": 1,
 13 |                 "color": 16777215
 14 |             }, {
 15 |                 "alpha": 1,
 16 |                 "color": 16777215
 17 |             }, {
 18 |                 "alpha": 1,
 19 |                 "color": 16777215
 20 |             }, {
 21 |                 "alpha": 1,
 22 |                 "color": 16777215
 23 |             }, {
 24 |                 "alpha": 1,
 25 |                 "color": 16777215
 26 |             }, {
 27 |                 "alpha": 1,
 28 |                 "color": 16777215
 29 |             }, {
 30 |                 "alpha": 1,
 31 |                 "color": 16777215
 32 |             }, {
 33 |                 "alpha": 1,
 34 |                 "color": 16777215
 35 |             }, {
 36 |                 "alpha": 1,
 37 |                 "color": 16777215
 38 |             }, {
 39 |                 "alpha": 1,
 40 |                 "color": 16777215
 41 |             }, {
 42 |                 "alpha": 1,
 43 |                 "color": 16777215
 44 |             }, {
 45 |                 "alpha": 1,
 46 |                 "color": 16777215
 47 |             }, {
 48 |                 "alpha": 1,
 49 |                 "color": 16777215
 50 |             }, {
 51 |                 "alpha": 1,
 52 |                 "color": 16777215
 53 |             }, {
 54 |                 "alpha": 1,
 55 |                 "color": 16777215
 56 |             }, {
 57 |                 "alpha": 1,
 58 |                 "color": 16777215
 59 |             }, {
 60 |                 "alpha": 1,
 61 |                 "color": 16777215
 62 |             }, {
 63 |                 "alpha": 1,
 64 |                 "color": 16777215
 65 |             }, {
 66 |                 "alpha": 1,
 67 |                 "color": 16777215
 68 |             }, {
 69 |                 "alpha": 1,
 70 |                 "color": 16777215
 71 |             }, {
 72 |                 "alpha": 1,
 73 |                 "color": 16777215
 74 |             }, {
 75 |                 "alpha": 1,
 76 |                 "color": 16777215
 77 |             }, {
 78 |                 "alpha": 1,
 79 |                 "color": 16777215
 80 |             }],
 81 |             "lastFontColor": 16777215,
 82 |             "navigation": true,
 83 |             "tags": null,
 84 |             "width": 1024,
 85 |             "height": 768
 86 |         },
 87 |         "sid": "534f8f8e94da26c97d00003a",
 88 |         "user_id": "52820a5994da266b19000002",
 89 |         "type": "ProjectFolder",
 90 |         "folderName": "V42",
 91 |         "last_published_at": "2014-04-17T08:23:42+00:00",
 92 |         "child_ids": ["534f8f8e94da26c97d00003b"],
 93 |         "unit_ids": ["534f8f8e94da26c97d00002e", "534f8f8e94da26c97d000030"],
 94 |         "resource_ids": ["5379fec120d5ac4294000051"],
 95 |         "font_ids": ["527a597cab134d6a3e000005"],
 96 |         "updated_at": "2014-05-19T12:53:32Z",
 97 |         "_type": "Folder"
 98 |     }, {
 99 |         "created_at": "2014-04-17T08:23:42Z",
100 |         "_id": "534f8f8e94da26c97d00003b",
101 |         "unit_ids": ["534f8f8e94da26c97d00002c"],
102 |         "data": {
103 |             "index": 0
104 |         },
105 |         "sid": "534f8f8e94da26c97d00003a",
106 |         "folderName": "Untitled",
107 |         "parent_ids": ["534f8f8e94da26c97d00003a"],
108 |         "type": "ViewFolder",
109 |         "updated_at": "2014-05-19T12:53:32Z",
110 |         "_type": "Folder"
111 |     }],
112 |     "units": [{
113 |         "created_at": "2014-04-17T08:23:42Z",
114 |         "_id": "534f8f8e94da26c97d00002c",
115 |         "folder_ids": ["534f8f8e94da26c97d00003b"],
116 |         "stype": "TMWorld",
117 |         "type": "world",
118 |         "sid": "534f8f8e94da26c97d00003a",
119 |         "child_ids": ["534f8f8e94da26c97d00002d", "534f8f8e94da26c97d000038", "534f8f8e94da26c97d000039", "5379feb120d5ac429400004a", "5379feb120d5ac429400004e", "5379feb120d5ac429400004f"],
120 |         "data": {
121 |             "index": 0,
122 |             "unitName": "Untitled",
123 |             "bindings": {
124 |                 "ressources": null,
125 |                 "properties": {}
126 |             }
127 |         },
128 |         "updated_at": "2014-05-19T12:53:32Z",
129 |         "_type": "Unit"
130 |     }, {
131 |         "created_at": "2014-04-17T08:23:42Z",
132 |         "folder_ids": ["534f8f8e94da26c97d00003a"],
133 |         "stype": "TMWorld",
134 |         "type": "world",
135 |         "_id": "534f8f8e94da26c97d00002e",
136 |         "sid": "534f8f8e94da26c97d00003a",
137 |         "child_ids": ["534f8f8e94da26c97d00002f"],
138 |         "data": {
139 |             "index": 0,
140 |             "unitName": "masterOverView"
141 |         },
142 |         "updated_at": "2014-04-17T08:23:42Z",
143 |         "_type": "Unit"
144 |     }, {
145 |         "created_at": "2014-04-17T08:23:42Z",
146 |         "folder_ids": ["534f8f8e94da26c97d00003a"],
147 |         "stype": "TMWorld",
148 |         "type": "world",
149 |         "_id": "534f8f8e94da26c97d000030",
150 |         "sid": "534f8f8e94da26c97d00003a",
151 |         "child_ids": ["534f8f8e94da26c97d000031"],
152 |         "data": {
153 |             "index": 0,
154 |             "unitName": "masterUnderView"
155 |         },
156 |         "updated_at": "2014-04-17T08:23:42Z",
157 |         "_type": "Unit"
158 |     }, {
159 |         "created_at": "2014-04-17T08:23:42Z",
160 |         "_id": "534f8f8e94da26c97d00002d",
161 |         "flat_child_ids": ["534f8f8e94da26c97d000038", "534f8f8e94da26c97d000039", "5379feb120d5ac429400004a"],
162 |         "stype": "statestack",
163 |         "type": "graphicunit",
164 |         "sid": "534f8f8e94da26c97d00003a",
165 |         "parent_ids": ["534f8f8e94da26c97d00002c"],
166 |         "resource_ids": ["5379fec120d5ac4294000051"],
167 |         "data": {
168 |             "stacks": {
169 |                 "5347a59d5d04707f8f000020": {
170 |                     "masterBack": null,
171 |                     "actions": {},
172 |                     "thumb": "5379fec120d5ac4294000051",
173 |                     "ref": null,
174 |                     "sactions": null,
175 |                     "masterFront": null,
176 |                     "units": [0, 1, 2],
177 |                     "color": {
178 |                         "solid": {
179 |                             "alpha": 1,
180 |                             "color": 102
181 |                         }
182 |                     },
183 |                     "name": "Untitled"
184 |                 },
185 |                 "o": ["5347a59d5d04707f8f000020"]
186 |             },
187 |             "internalWPid": "5347a59d5d04707f8f00002d",
188 |             "world": {
189 |                 "args": [0, 0, 1024, 768, 0],
190 |                 "width": 1024,
191 |                 "height": 768,
192 |                 "type": "TMWorld"
193 |             },
194 |             "unitName": "RootStateStack1451",
195 |             "bindings": {
196 |                 "ressources": null,
197 |                 "properties": {}
198 |             },
199 |             "indexChild": 1
200 |         },
201 |         "updated_at": "2014-05-19T12:53:31Z",
202 |         "_type": "Unit"
203 |     }, {
204 |         "created_at": "2014-04-17T08:23:42Z",
205 |         "_id": "534f8f8e94da26c97d000038",
206 |         "stype": "Text",
207 |         "type": "media",
208 |         "sid": "534f8f8e94da26c97d00003a",
209 |         "flat_parent_ids": ["534f8f8e94da26c97d00002d"],
210 |         "parent_ids": ["534f8f8e94da26c97d00002c"],
211 |         "resource_ids": ["534f8f8e94da26c97d00002b"],
212 |         "data": {
213 |             "opacity": 0.5,
214 |             "world": {
215 |                 "args": [160, 31, 200, 183, 0],
216 |                 "width": 1024,
217 |                 "height": 768,
218 |                 "type": "TMWorld"
219 |             },
220 |             "unitName": "Lorem ipsum dolor sit amet, co",
221 |             "bindings": {
222 |                 "ressources": null,
223 |                 "properties": {}
224 |             },
225 |             "indexChild": 1
226 |         },
227 |         "updated_at": "2014-05-19T12:53:31Z",
228 |         "_type": "Unit"
229 |     }, {
230 |         "created_at": "2014-04-17T08:23:42Z",
231 |         "_id": "534f8f8e94da26c97d000039",
232 |         "stype": "image",
233 |         "type": "media",
234 |         "sid": "534f8f8e94da26c97d00003a",
235 |         "flat_parent_ids": ["534f8f8e94da26c97d00002d"],
236 |         "parent_ids": ["534f8f8e94da26c97d00002c"],
237 |         "resource_ids": ["5347a61c5d04707f8f000059"],
238 |         "data": {
239 |             "isAutoName": false,
240 |             "actions": {
241 |                 "4": null
242 |             },
243 |             "world": {
244 |                 "args": [536, 31, 329, 329, 0],
245 |                 "width": 1024,
246 |                 "height": 768,
247 |                 "type": "TMWorld"
248 |             },
249 |             "unitName": "seal_bear.jpg",
250 |             "bindings": {
251 |                 "ressources": null,
252 |                 "properties": {}
253 |             },
254 |             "indexChild": 2
255 |         },
256 |         "updated_at": "2014-05-19T12:53:31Z",
257 |         "_type": "Unit"
258 |     }, {
259 |         "created_at": "2014-05-19T12:53:31Z",
260 |         "_id": "5379feb120d5ac429400004a",
261 |         "stype": "group",
262 |         "flat_parent_ids": ["534f8f8e94da26c97d00002d"],
263 |         "data": {
264 |             "stacks": {
265 |                 "5379feb120d5ac429400004b": {
266 |                     "ref": null,
267 |                     "units": [0, 1],
268 |                     "actions": {},
269 |                     "thumb": null,
270 |                     "name": "Untitled"
271 |                 },
272 |                 "o": ["5379feb120d5ac429400004b"]
273 |             },
274 |             "world": {
275 |                 "args": [243, 511, 321, 188, 0],
276 |                 "width": 1024,
277 |                 "height": 768,
278 |                 "type": "TMWorld"
279 |             },
280 |             "unitName": "Group_5774",
281 |             "indexChild": 3,
282 |             "internalWPid": "5379feb120d5ac429400004c"
283 |         },
284 |         "flat_child_ids": ["5379feb120d5ac429400004e", "5379feb120d5ac429400004f"],
285 |         "type": "container",
286 |         "parent_ids": ["534f8f8e94da26c97d00002c"],
287 |         "sid": "534f8f8e94da26c97d00003a",
288 |         "updated_at": "2014-05-19T12:53:31Z",
289 |         "_type": "Unit"
290 |     }, {
291 |         "created_at": "2014-05-19T12:53:31Z",
292 |         "_id": "5379feb120d5ac429400004e",
293 |         "stype": "ellipse",
294 |         "flat_parent_ids": ["5379feb120d5ac429400004a"],
295 |         "data": {
296 |             "world": {
297 |                 "args": [46, 0, 135, 134, "20.20475968803733"],
298 |                 "width": 321,
299 |                 "height": 188,
300 |                 "type": "TMWorld"
301 |             },
302 |             "unitName": "Ellipse_3723",
303 |             "bindings": {
304 |                 "ressources": null,
305 |                 "properties": {}
306 |             },
307 |             "background": {
308 |                 "solid": {
309 |                     "alpha": 1,
310 |                     "color": 10412991
311 |                 }
312 |             },
313 |             "indexChild": 1
314 |         },
315 |         "type": "primitive",
316 |         "parent_ids": ["534f8f8e94da26c97d00002c"],
317 |         "sid": "534f8f8e94da26c97d00003a",
318 |         "updated_at": "2014-05-19T12:53:31Z",
319 |         "_type": "Unit"
320 |     }, {
321 |         "created_at": "2014-05-19T12:53:31Z",
322 |         "_id": "5379feb120d5ac429400004f",
323 |         "stype": "path",
324 |         "flat_parent_ids": ["5379feb120d5ac429400004a"],
325 |         "data": {
326 |             "shapeBounds": [0, 0, 204, 102],
327 |             "background": {
328 |                 "solid": {
329 |                     "alpha": 1,
330 |                     "color": 10412991
331 |                 }
332 |             },
333 |             "shape": "triangle",
334 |             "world": {
335 |                 "args": [117, 86, 204, 102, 0],
336 |                 "width": 321,
337 |                 "height": 188,
338 |                 "type": "TMWorld"
339 |             },
340 |             "unitName": "Shape_3937",
341 |             "bindings": {
342 |                 "ressources": null,
343 |                 "properties": {}
344 |             },
345 |             "indexChild": 2,
346 |             "path": "M0 212.132 212.132 0 424.264 212.132 0 212.132Z"
347 |         },
348 |         "type": "primitive",
349 |         "parent_ids": ["534f8f8e94da26c97d00002c"],
350 |         "sid": "534f8f8e94da26c97d00003a",
351 |         "updated_at": "2014-05-19T12:53:31Z",
352 |         "_type": "Unit"
353 |     }, {
354 |         "created_at": "2014-04-17T08:23:42Z",
355 |         "_id": "534f8f8e94da26c97d00002f",
356 |         "stype": "statestack",
357 |         "type": "graphicunit",
358 |         "sid": "534f8f8e94da26c97d00003a",
359 |         "parent_ids": ["534f8f8e94da26c97d00002e"],
360 |         "resource_ids": ["5347a59d5d04707f8f000023"],
361 |         "data": {
362 |             "stacks": {
363 |                 "5347a59d5d04707f8f000024": {
364 |                     "masterBack": null,
365 |                     "actions": {},
366 |                     "thumb": "5347a59d5d04707f8f000023",
367 |                     "ref": null,
368 |                     "sactions": null,
369 |                     "masterFront": null,
370 |                     "units": null,
371 |                     "color": {
372 |                         "solid": {
373 |                             "alpha": 0,
374 |                             "color": 16777215
375 |                         }
376 |                     },
377 |                     "name": "Untitled"
378 |                 },
379 |                 "o": ["5347a59d5d04707f8f000024"]
380 |             },
381 |             "internalWPid": "5347a59d5d04707f8f000029",
382 |             "world": {
383 |                 "args": [0, 0, 1024, 768, 0],
384 |                 "width": 1024,
385 |                 "height": 768,
386 |                 "type": "TMWorld"
387 |             },
388 |             "unitName": "RootStateStack1447",
389 |             "bindings": {
390 |                 "ressources": null,
391 |                 "properties": {}
392 |             },
393 |             "indexChild": 1
394 |         },
395 |         "updated_at": "2014-05-19T12:53:31Z",
396 |         "_type": "Unit"
397 |     }, {
398 |         "created_at": "2014-04-17T08:23:42Z",
399 |         "_id": "534f8f8e94da26c97d000031",
400 |         "stype": "statestack",
401 |         "type": "graphicunit",
402 |         "sid": "534f8f8e94da26c97d00003a",
403 |         "parent_ids": ["534f8f8e94da26c97d000030"],
404 |         "resource_ids": ["5347a59d5d04707f8f000027"],
405 |         "data": {
406 |             "stacks": {
407 |                 "5347a59d5d04707f8f000028": {
408 |                     "masterBack": null,
409 |                     "actions": {},
410 |                     "thumb": "5347a59d5d04707f8f000027",
411 |                     "ref": null,
412 |                     "sactions": null,
413 |                     "masterFront": null,
414 |                     "units": null,
415 |                     "color": {
416 |                         "solid": {
417 |                             "alpha": 1,
418 |                             "color": 16777215
419 |                         }
420 |                     },
421 |                     "name": "Untitled"
422 |                 },
423 |                 "o": ["5347a59d5d04707f8f000028"]
424 |             },
425 |             "internalWPid": "5347a59d5d04707f8f00002b",
426 |             "world": {
427 |                 "args": [0, 0, 1024, 768, 0],
428 |                 "width": 1024,
429 |                 "height": 768,
430 |                 "type": "TMWorld"
431 |             },
432 |             "unitName": "RootStateStack1448",
433 |             "bindings": {
434 |                 "ressources": null,
435 |                 "properties": {}
436 |             },
437 |             "indexChild": 1
438 |         },
439 |         "updated_at": "2014-05-19T12:53:31Z",
440 |         "_type": "Unit"
441 |     }],
442 |     "resources": [{
443 |         "created_at": "2014-04-17T08:23:42Z",
444 |         "font_ids": ["527a597cab134d6a3e000005"],
445 |         "unit_ids": ["534f8f8e94da26c97d000038"],
446 |         "content": "<TextFlow columnCount=\"1\" columnGap=\"10\" fontLookup=\"embeddedCFF\" leadingModel=\"box\" renderingMode=\"cff\" textAlignLast=\"left\" whiteSpaceCollapse=\"preserve\" version=\"3.0.0\" xmlns=\"http://ns.adobe.com/textLayout/2008\"><p lineHeight=\"120%\" paragraphSpaceAfter=\"0\" paragraphSpaceBefore=\"0\"><span color=\"#ffffff\" fontFamily=\"Droid Sans Regular\" fontSize=\"30.5\">Lorem ipsum dolor sit amet, consectetur adipisicing elit.</span></p></TextFlow>",
447 |         "_id": "534f8f8e94da26c97d00002b",
448 |         "sid": "534f8f8e94da26c97d00003a",
449 |         "type": "ressourceTextLayout",
450 |         "updated_at": "2014-04-17T08:23:42Z",
451 |         "_type": "Resource"
452 |     }, {
453 |         "created_at": "2014-04-11T08:22:42Z",
454 |         "_id": "5347a61c5d04707f8f000059",
455 |         "type": "ressourceImage",
456 |         "unit_ids": ["5347a61c5d04707f8f00005a", "5347a81b94da26f432000011", "534bfa8894da26f432000025", "534bfb1194da2679f0000011", "534c0a5794da263072000011", "534c0fcb94da263072000025", "534d520294da262aae000011", "534d526b94da262aae000025", "534e34e994da267660000011", "534e368d94da267660000025", "534e36a294da26c974000011", "534e434a94da266b7b000011", "534e4abe94da26c1cf000011", "534e4d3694da267beb000011", "534e572d94da26c857000011", "534e59a394da26c857000025", "534e5ade94da26c97d000025", "534f8f8e94da26c97d000039", "535116c394da26c97d00004d", "53511e2694da26c97d000061"],
457 |         "data": {
458 |             "creationDate": 1394547794000,
459 |             "width": 1024,
460 |             "ressourceSize": 119186,
461 |             "modificationDate": 1394547794000,
462 |             "height": 1024
463 |         },
464 |         "sid": "53464ef394da26a11400000d",
465 |         "name": "seal_bear.jpg",
466 |         "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a61c5d04707f8f000059",
467 |         "updated_at": "2014-04-18T13:38:34Z",
468 |         "_type": "Resource"
469 |     }, {
470 |         "created_at": "2014-04-11T08:22:42Z",
471 |         "_id": "5347a59d5d04707f8f000023",
472 |         "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a59d5d04707f8f000023",
473 |         "unit_ids": ["5347a59d5d04707f8f000022", "5347a81b94da26f432000007", "534bfa8894da26f43200001b", "534bfb1194da2679f0000007", "534c0a5794da263072000007", "534c0fcb94da26307200001b", "534d520294da262aae000007", "534d526b94da262aae00001b", "534e34e994da267660000007", "534e368d94da26766000001b", "534e36a294da26c974000007", "534e434a94da266b7b000007", "534e4abe94da26c1cf000007", "534e4d3694da267beb000007", "534e572d94da26c857000007", "534e59a394da26c85700001b", "534e5ade94da26c97d00001b", "534f8f8e94da26c97d00002f", "535116c394da26c97d000043", "53511e2694da26c97d000057"],
474 |         "type": "ressourceImage",
475 |         "sid": "53464ef394da26a11400000d",
476 |         "data": {
477 |             "height": 768,
478 |             "width": 1024
479 |         },
480 |         "updated_at": "2014-04-18T13:38:34Z",
481 |         "_type": "Resource"
482 |     }, {
483 |         "created_at": "2014-04-11T08:22:42Z",
484 |         "_id": "5347a59d5d04707f8f000027",
485 |         "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a59d5d04707f8f000027",
486 |         "unit_ids": ["5347a59d5d04707f8f000026", "5347a81b94da26f432000009", "534bfa8894da26f43200001d", "534bfb1194da2679f0000009", "534c0a5794da263072000009", "534c0fcb94da26307200001d", "534d520294da262aae000009", "534d526b94da262aae00001d", "534e34e994da267660000009", "534e368d94da26766000001d", "534e36a294da26c974000009", "534e434a94da266b7b000009", "534e4abe94da26c1cf000009", "534e4d3694da267beb000009", "534e572d94da26c857000009", "534e59a394da26c85700001d", "534e5ade94da26c97d00001d", "534f8f8e94da26c97d000031", "535116c394da26c97d000045", "53511e2694da26c97d000059"],
487 |         "type": "ressourceImage",
488 |         "sid": "53464ef394da26a11400000d",
489 |         "data": {
490 |             "height": 768,
491 |             "width": 1024
492 |         },
493 |         "updated_at": "2014-04-18T13:38:34Z",
494 |         "_type": "Resource"
495 |     }]
496 | }]


--------------------------------------------------------------------------------
/psd_rb/ruby/factories/group_unit.rb:
--------------------------------------------------------------------------------
 1 | class GroupUnit
 2 | 
 3 |   def initialize(node)
 4 |     @node = node
 5 |   end
 6 | 
 7 |   def type
 8 |     'GroupUnit'
 9 |   end
10 | 
11 |   def to_json
12 |     as_json
13 |   end
14 | 
15 |   def as_json
16 |     {
17 |       created_at: '2014-05-19T12:53:31Z',
18 |       _id: '5379feb120d5ac429400004a',
19 |       stype: 'group',
20 |       flat_parent_ids: ['534f8f8e94da26c97d00002d'],
21 |       data: {
22 |         stacks: {
23 |           '5379feb120d5ac429400004b'=> {
24 |             ref: nil,
25 |             units: [0, 1],
26 |             actions: {},
27 |             thumb: nil,
28 |             name: 'Untitled'
29 |           },
30 |           o: ['5379feb120d5ac429400004b']
31 |         },
32 |         world: {
33 |           args: [243, 511, 321, 188, 0],
34 |           width: 1024,
35 |           height: 768,
36 |           type: 'TMWorld'
37 |         },
38 |         unitName: 'Group_5774',
39 |         indexChild: 3,
40 |         internalWPid: '5379feb120d5ac429400004c'
41 |       },
42 |       flat_child_ids: %w(5379feb120d5ac429400004e 5379feb120d5ac429400004f),
43 |       type: 'container',
44 |       parent_ids: ['534f8f8e94da26c97d00002c'],
45 |       sid: '534f8f8e94da26c97d00003a',
46 |       updated_at: '2014-05-19T12:53:31Z',
47 |       _type: 'Unit'
48 |     }
49 |   end
50 | end


--------------------------------------------------------------------------------
/psd_rb/ruby/factories/image_unit.rb:
--------------------------------------------------------------------------------
 1 | class ImageUnit
 2 | 
 3 |   def initialize(node)
 4 |     @node = node
 5 |   end
 6 | 
 7 |   def type
 8 |     'ImageUnit' + (@node.phas_crop ? ' with crop' : '')
 9 |   end
10 | 
11 |   def to_json
12 |     as_json
13 |   end
14 | 
15 |   def as_json
16 |     {
17 |       stype: 'image',
18 |       type: 'media',
19 |       # flat_parent_ids: ["534f8f8e94da26c97d00002d"],
20 |       # parent_ids: ["534f8f8e94da26c97d00002c"],
21 |       # resource_ids: ["5347a61c5d04707f8f000059"],
22 |       data: {
23 |         isAutoName: false,
24 |         world: {
25 |           args: [536, 31, 329, 329, 0],
26 |           width: 1024,
27 |           height: 768,
28 |           type: 'TMWorld'
29 |         },
30 |         unitName: @node.ppng_name,
31 |         bindings: {
32 |           ressources: nil,
33 |           properties: {}
34 |         },
35 |         # indexChild: 2
36 |       },
37 |       _type: 'Unit'
38 |     }
39 |   end
40 | 
41 | end
42 | 


--------------------------------------------------------------------------------
/psd_rb/ruby/factories/text_unit.rb:
--------------------------------------------------------------------------------
 1 | class TextUnit
 2 | 
 3 |   def initialize(node)
 4 |     @node = node
 5 |   end
 6 | 
 7 |   def type
 8 |     'TextUnit'
 9 |   end
10 | 
11 |   def to_json
12 |     as_json
13 |   end
14 | 
15 |   def as_json
16 |     {
17 |       stype: 'Text',
18 |       type: 'media',
19 |       # flat_parent_ids: ["534f8f8e94da26c97d00002d"],
20 |       # parent_ids: ["534f8f8e94da26c97d00002c"],
21 |       # resource_ids: ["534f8f8e94da26c97d00002b"],
22 |       data: {
23 |         opacity: 0.5,
24 |         world: {
25 |           args: [160, 31, 200, 183, 0],
26 |           width: 1024,
27 |           height: 768,
28 |           type: 'TMWorld'
29 |         },
30 |         unitName: @node.pget_text,
31 |         bindings: {
32 |           ressources: nil,
33 |           properties: {}
34 |         },
35 |         # indexChild: 1
36 |       },
37 |       _type: 'Unit'
38 |     }
39 |   end
40 | end
41 | 


--------------------------------------------------------------------------------
/psd_rb/ruby/factories/unit_factory.rb:
--------------------------------------------------------------------------------
 1 | require_relative 'text_unit'
 2 | require_relative 'image_unit'
 3 | require_relative 'group_unit'
 4 | require_relative 'unknown_unit'
 5 | 
 6 | class UnitFactory
 7 | 
 8 |   def self.create_unit(node)
 9 |     if node.phas_text
10 |       return TextUnit.new(node)
11 |     elsif node.pis_group
12 |       return GroupUnit.new(node)
13 |     else
14 |       return ImageUnit.new(node)
15 |     end
16 |   end
17 | 
18 | end
19 | 
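The branch order in `UnitFactory.create_unit` can be tried without a real PSD file. The sketch below is only an illustration: `StubNode` and `unit_kind` are hypothetical names that do not exist in the project, and the stub models just the two predicates the factory consults.

```ruby
# Hypothetical stand-in for a PSD layer node; only the two predicates
# that UnitFactory.create_unit consults are modelled here.
StubNode = Struct.new(:phas_text, :pis_group)

# Mirrors the branch order of UnitFactory.create_unit, returning a label
# instead of instantiating the unit classes.
def unit_kind(node)
  if node.phas_text
    :text
  elsif node.pis_group
    :group
  else
    :image
  end
end

KINDS = [
  unit_kind(StubNode.new(true, false)),   # text layer
  unit_kind(StubNode.new(false, true)),   # group
  unit_kind(StubNode.new(false, false))   # anything else
].freeze

puts KINDS.inspect
```

With this order, a layer that is neither text nor a group is always treated as an image, so `UnknownUnit` is required by the factory but never returned.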


--------------------------------------------------------------------------------
/psd_rb/ruby/factories/unknown_unit.rb:
--------------------------------------------------------------------------------
 1 | class UnknownUnit
 2 | 
 3 |     def initialize(node)
 4 |       @node = node
 5 |     end
 6 | 
 7 |     def type
 8 |       'UnknownUnit'
 9 |     end
10 | 
11 |     def to_json
12 |       as_json
13 |     end
14 | 
15 |     def as_json
16 |       {
17 |           Unit: 'Unknown'
18 |       }
19 |     end
20 | end


--------------------------------------------------------------------------------
/psd_rb/ruby/panda_psd.rb:
--------------------------------------------------------------------------------
 1 | require 'psd'
 2 | require_relative 'util'
 3 | require_relative 'unit_manager'
 4 | 
 5 | class PandaPsd
 6 | 
 7 |   def initialize(file:nil)
 8 |     @errors = []
 9 |     @info = []
10 |     @orientation = nil
11 |     @unitManager = UnitManager.new
12 |     @psd = PSD.new(file)
13 |     @psd.parse!
14 |   end
15 | 
16 |   attr_reader :info, :errors
17 | 
18 |   public
19 | 
20 |   def check_integrity
21 |     check_dimensions
22 |     self
23 |   end
24 | 
25 |   def export
26 |     @unitManager.export
27 |   end
28 | 
29 |   def parse
30 |     check_integrity
31 |     go_through
32 | 
33 | 
34 |     # #get an image
35 |     # so = @psd.pget_layer_by_name('Smart Object (path d\'illustrator)')
36 |     # so.pas_png
37 |     #
38 |     # #get a shape
39 |     # so = @psd.pget_layer_by_name('Path rond')
40 |     # so.pas_png
41 |     #
42 |     # #get a crop position
43 |     # so = @psd.pget_layer_by_name('image croppée')
44 |     # if so.phas_mask
45 |     #   so.pas_png # get the whole image without cropping
46 |     #   puts so.pget_mask_position
47 |     # end
48 |     #
49 |     # #get a text
50 |     # so = @psd.pget_layer_by_name('Texte')
51 |     # if so.phas_text
52 |     #   pp so.pget_text_html
53 |     #   pp so.pget_positions
54 |     #   pp so.pget_dimensions
55 |     #   pp so.hidden?
56 |     #   pp so.pvisible
57 |     # end
58 |     # puts UnitFactory::create_unit(so).as_json
59 | 
60 |     self
61 |   end
62 | 
63 |   private
64 | 
65 |   def go_through
66 |     return nil unless @errors.empty?
67 |     @psd.tree.descendants_layers.each do |layer|
68 |       @unitManager.create_unit(layer) if layer.pvisible_tree?
69 |     end
70 |   end
71 | 
72 |   def check_dimensions
73 |     return nil unless @errors.empty?
74 |     if @psd.pget_dimensions == {width: 2048, height: 1536}
75 |       @orientation = 'landscape'
76 |     elsif @psd.pget_dimensions == {width: 1536, height: 2048}
77 |       @orientation = 'portrait'
78 |     else
79 |       @errors << 'Invalid dimensions: only 2048x1536 (landscape) or 1536x2048 (portrait) are supported.'
80 |     end
81 |   end
82 | 
83 | end
84 | 
85 | 
86 | 
87 | 
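`pget_dimensions` (defined in `util.rb`) returns a hash keyed by `:width` and `:height`, so any dimension check has to compare against a value of the same shape. A minimal sketch of that comparison, assuming a lookup table is acceptable; `ORIENTATIONS` and `orientation_for` are hypothetical names, not part of the project:

```ruby
# Hypothetical lookup table (not part of the project). The two sizes are the
# ones check_dimensions accepts; keys have the shape pget_dimensions returns.
ORIENTATIONS = {
  { width: 2048, height: 1536 } => 'landscape',
  { width: 1536, height: 2048 } => 'portrait'
}.freeze

# Returns the orientation name, or nil when the size is not supported.
def orientation_for(dimensions)
  ORIENTATIONS[dimensions]
end

puts orientation_for({ width: 2048, height: 1536 }).inspect
puts orientation_for({ width: 800, height: 600 }).inspect

# A hash never equals an array, whatever the values are.
puts({ width: 2048, height: 1536 } == [2048, 1536])
```

A `nil` result corresponds to the error branch of `check_dimensions`.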


--------------------------------------------------------------------------------
/psd_rb/ruby/unit_manager.rb:
--------------------------------------------------------------------------------
 1 | require_relative 'factories/unit_factory'
 2 | 
 3 | class UnitManager
 4 | 
 5 |   def initialize
 6 |     @units = []
 7 |   end
 8 |   attr_reader :units
 9 | 
10 |   def create_unit(layer)
11 |     unit = UnitFactory::create_unit(layer)
12 |     @units << unit
13 |     unit
14 |   end
15 | 
16 |   def export
17 |     @units.map{|unit| unit.as_json}
18 |   end
19 | 
20 | end
21 | 


--------------------------------------------------------------------------------
/psd_rb/ruby/util.rb:
--------------------------------------------------------------------------------
  1 | 
  2 | PSD.send(:define_method, 'pget_dimensions') do
  3 |   {width:tree.document_width, height:tree.document_height}
  4 | end
  5 | 
  6 | PSD.send(:define_method, 'pget_layer_by_name') do |name|
  7 |   tree.children_layers.find { |layer| layer.name == name }
  8 | end
  9 | 
 10 | 
 11 | PSD::Node::Base.send(:define_method, 'pget_layer_by_name') do |name|
 12 |   children_layers.find { |layer| layer.name == name }
 13 | end
 14 | 
 15 | PSD::Node::Base.send(:define_method, 'pget_dimensions') do
 16 |   {width:width, height:height}
 17 | end
 18 | 
 19 | PSD::Node::Base.send(:define_method, 'ppng_name') do
 20 |   name.gsub(/\s/, '_')+'.png'
 21 | end
 22 | 
 23 | PSD::Node::Base.send(:define_method, 'pas_png') do
 24 |   image.save_as_png '../output/'+ppng_name
 25 | end
 26 | 
 27 | PSD::Node::Base.send(:define_method, 'pis_group') do
 28 |   group?
 29 | end
 30 | 
 31 | PSD::Node::Base.send(:define_method, 'pvisible_tree?') do
 32 |   # A layer counts as visible only when it and every non-root
 33 |   # ancestor are visible. Walk up the tree iteratively and stop
 34 |   # at the first hidden node.
 35 |   node = self
 36 |   until node.root?
 37 |     return false unless node.visible?
 38 |     node = node.parent
 39 |   end
 40 |   # Reached the root without meeting a hidden node.
 41 |   true
 42 | end
 43 | 
 44 | PSD::Node::Base.send(:define_method, 'pget_mask_position') do
 45 |   {
 46 |     top:    mask.top-top,
 47 |     left:   mask.left-left,
 48 |     right:  right-mask.right,
 49 |     bottom: bottom-mask.bottom
 50 |   }
 51 | end
 52 | 
 53 | PSD::Node::Base.send(:define_method, 'phas_mask') do
 54 |   (mask.size > 0)
 55 | end
 56 | 
 57 | PSD::Node::Base.send(:define_method, 'phas_crop') do
 58 |   phas_mask && (pget_mask_position != {top:0, left:0, right:0, bottom:0})
 59 | end
 60 | 
 61 | PSD::Node::Base.send(:define_method, 'phas_text') do
 62 |   text != nil
 63 | end
 64 | 
 65 | PSD::Node::Base.send(:define_method, 'pget_text') do
 66 |   text[:value] if phas_text
 67 | end
 68 | 
 69 | PSD::Node::Base.send(:define_method, 'pget_text_html') do
 70 |   '<TextFlow columnCount="1" columnGap="34" fontLookup="embeddedCFF" leadingModel="box" renderingMode="cff" textAlignLast="left" whiteSpaceCollapse="preserve" version="3.0.0" xmlns="http://ns.adobe.com/textLayout/2008">'+
 71 |     '<p lineHeight="135%" paragraphSpaceAfter="0" paragraphSpaceBefore="0">'+
 72 |       "<span color=\"##{'%02x%02x%02x' % text[:font][:colors].first[0, 3]}\" fontFamily=\"#{text[:font][:name]}\" fontSize=\"#{text[:font][:sizes].first}\">"+
 73 |       pget_text+
 74 |       '</span>'+
 75 |     '</p>'+
 76 |   '</TextFlow>' if phas_text
 77 | end
 78 | 
 79 | PSD::Node::Base.send(:define_method, 'pget_positions') do
 80 |   {
 81 |     top:    top,
 82 |     left:   left,
 83 |     right:  right,
 84 |     bottom: bottom
 85 |   }
 86 | end
 87 | 
 88 | 
 89 | 
 90 | 
 91 | 
 92 | 
 93 | 
 94 | 
 95 | 
 96 | 
 97 | 
 98 | # module Panda
 99 | 
100 |   # module TreeUtil
101 |   #
102 |   #   def get_layer_by_name(node, name)
103 |   #     return nil unless @errors.empty?
104 |   #     node = node.tree if node.instance_of?(PSD)
105 |   #     # noinspection RubyResolve
106 |   #     node.children_layers.map { |layer| return layer if layer.name == name }.compact.first
107 |   #   end
108 | 
109 |   #   def get_dimensions(node)
110 |   #     return nil unless @errors.empty?
111 |   #     # noinspection RubyResolve
112 |   #     return {width: node.tree.document_width, height: node.tree.document_height} if node.instance_of?(PSD)
113 |   #     {width:node.width, height:node.height}
114 |   #   end
115 |   #
116 |   # end
117 | 
118 |   # module NodeUtil
119 | 
120 |     # def is_visible(node)
121 |     #   return nil unless @errors.empty?
122 |     #   node.visible && !node.hidden?
123 |     # end
124 | 
125 |     # def as_png(node, name:nil)
126 |     #   return nil unless @errors.empty?
127 |     #   name = node.name.gsub(/\s/, '_')+'.png' if name.nil? || name.empty?
128 |     #   node.image.save_as_png './output/'+name
129 |     # end
130 | 
131 |     # def has_mask(node)
132 |     #   return nil unless @errors.empty?
133 |     #   (node.mask.size > 0) && (get_mask_position(node) != {top:0, left:0, right:0, bottom:0})
134 |     # end
135 | 
136 |     # def get_mask_position(node)
137 |     #   return nil unless @errors.empty?
138 |     #   node.mask.instance_exec {{
139 |     #       top:    top-node.top,
140 |     #       left:   left-node.left,
141 |     #       right:  node.right-right,
142 |     #       bottom: node.bottom-bottom
143 |     #   }}
144 |     # end
145 | 
146 |     # def has_text(node)
147 |     #   return nil unless @errors.empty?
148 |     #   node.text != nil
149 |     # end
150 | 
151 |     # def get_text(node)
152 |     #   return nil unless @errors.empty?
153 |     #   node.text[:value]
154 |     # end
155 | 
156 |     # def get_text_html(node)
157 |     #   return nil unless @errors.empty?
158 |     #   "<TextFlow columnCount=\"1\" columnGap=\"34\" fontLookup=\"embeddedCFF\" leadingModel=\"box\" renderingMode=\"cff\" textAlignLast=\"left\" whiteSpaceCollapse=\"preserve\" version=\"3.0.0\" xmlns=\"http://ns.adobe.com/textLayout/2008\">"+
159 |     #       "<p lineHeight=\"135%\" paragraphSpaceAfter=\"0\" paragraphSpaceBefore=\"0\">"+
160 |     #       "<span color=\"##{'%02x%02x%02x' % node.text[:font][:colors].first[0, 3]}\" fontFamily=\"#{node.text[:font][:name]}\" fontSize=\"#{node.text[:font][:sizes].first}\">"+
161 |     #       get_text(node)+
162 |     #       "</span>"+
163 |     #       "</p>"+
164 |     #       "</TextFlow>"
165 |     # end
166 | 
167 |     # def get_positions(node)
168 |     #   return nil unless @errors.empty?
169 |     #   node.instance_exec {{
170 |     #       top:    top,
171 |     #       left:   left,
172 |     #       right:  right,
173 |     #       bottom: bottom
174 |     #   }}
175 |     # end
176 | 
177 |   # end
178 | 
179 |   # class PsdUtil
180 |   #   # include TreeUtil
181 |   #   # include NodeUtil
182 |   #
183 |   #   def initialize(file:nil)
184 |   #     @foreground_page_name = 'foreground'
185 |   #     @background_page_name = 'background'
186 |   #   end
187 |   #
188 |   #
189 |   # end
190 | # end
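The arithmetic behind `pget_mask_position` and `phas_crop` is easier to follow with concrete numbers. The sketch below is an illustration under assumptions: `Box` and `mask_offsets` are hypothetical names, and the coordinates are made up rather than taken from a real PSD.

```ruby
# Hypothetical rectangle type standing in for a layer and its mask.
Box = Struct.new(:top, :left, :right, :bottom)

# Same arithmetic as pget_mask_position: the distance from each layer
# edge inward to the matching mask edge.
def mask_offsets(layer, mask)
  {
    top:    mask.top - layer.top,
    left:   mask.left - layer.left,
    right:  layer.right - mask.right,
    bottom: layer.bottom - mask.bottom
  }
end

NO_CROP = { top: 0, left: 0, right: 0, bottom: 0 }.freeze

layer = Box.new(0, 0, 400, 300)    # made-up layer bounds
mask  = Box.new(20, 50, 350, 300)  # made-up mask bounds
OFFSETS = mask_offsets(layer, mask)

puts OFFSETS.inspect
puts OFFSETS != NO_CROP            # same test phas_crop applies
```

Here the mask trims 20 units from the top and 50 from each side and leaves the bottom edge alone, so the layer counts as cropped.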


--------------------------------------------------------------------------------
/psd_rb/test.rb:
--------------------------------------------------------------------------------
 1 | require_relative 'ruby/panda_psd'
 2 | require 'benchmark'
 3 | require 'pp'
 4 | 
 5 | 
 6 | Benchmark.bm(10) do |x|
 7 |   x.report('PandaPsd:') {
 8 |     psd = PandaPsd.new(file:'../psds/rmn_v3.psd').parse
 9 |     psd = PandaPsd.new(file:'psds/micka.psd').parse
10 |     pp 'infos  -> '+psd.info.join(', ')
11 |     pp 'errors -> '+psd.errors.join(', ')
12 |   }
13 | end
14 | 


--------------------------------------------------------------------------------
/psdtools/A2 16 column.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/A2 16 column.ai


--------------------------------------------------------------------------------
/psdtools/Logo.psd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/Logo.psd


--------------------------------------------------------------------------------
/psdtools/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 | 
4 | if __name__ == '__main__': print __version__
5 | 


--------------------------------------------------------------------------------
/psdtools/brochure.indd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/brochure.indd


--------------------------------------------------------------------------------
/psdtools/brochure_rtangfold_11x17_OUTh.indd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/brochure_rtangfold_11x17_OUTh.indd


--------------------------------------------------------------------------------
/psdtools/buttons.psd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/buttons.psd


--------------------------------------------------------------------------------
/psdtools/main.py:
--------------------------------------------------------------------------------
  1 | from psd_tools import PSDImage
  2 | from psd_tools.constants import BlendMode
  3 | import psd_tools.user_api.psd_image
  4 | import json
  5 | import uuid
  6 | 
  7 | """
  8 | {"pages":[{"images":[], "paragraphs":[{"size":0, "width":0, "string":"", "x":0, "y":0, "font":"", "height":0}]}]}
  9 | or
 10 | {
 11 |     "pages": [
 12 |         {
 13 |             "images": [
 14 |                 "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"
 15 |             ],
 16 |             "paragraphs": [
 17 |                 {
 18 |                     "size": 98,
 19 |                     "width": 587,
 20 |                     "string": "Book Title",
 21 |                     "y": -98,
 22 |                     "x": -324,
 23 |                     "font": "Georgia",
 24 |                     "height": 705
 25 |                 }
 26 |             ]
 27 |         }
 28 |     ]
 29 | }
 30 | """
 31 | class ImportPSD(object):
 32 | 
 33 |     """Parse a PSD file and store its extracted data."""
 34 |     # Class-level defaults; __init__ sets the real per-instance values.
 35 |     imagePath = './'
 36 |     psd = None
 37 |     PSDFilePath = None
 38 | 
 39 |     def __init__(self, PSDFilePath, imagePath):
 40 |         # Keep results per instance so successive imports do not share pages.
 41 |         self.data = {'pages': []}
 42 |         self.PSDFilePath = PSDFilePath
 43 |         self.imagePath = imagePath
 44 |         self.psd = PSDImage.load(PSDFilePath)
 45 |         self.nb_layers = self.countLayers(layers=self.psd.layers)
 46 |         print "Layers to process: %d" % self.nb_layers
 47 | 
 48 |     def countLayers(self, layers, layer_num=0):
 49 |         # Count layers recursively, descending only into visible groups.
 50 |         for sheet in layers:
 51 |             if isinstance(sheet, psd_tools.user_api.psd_image.Layer):
 52 |                 layer_num += 1
 53 |             elif isinstance(sheet, psd_tools.user_api.psd_image.Group):
 54 |                 if sheet.visible_global:
 55 |                     layer_num = self.countLayers(layers=sheet.layers, layer_num=layer_num)
 56 |         return layer_num
 57 | 
 58 |     def parse(self):
 59 |         # The root layers form one page; each visible group adds its own page.
 60 |         self.data['pages'].append(self.browseSheets(sheets=self.psd.layers, parentName="root"))
 61 | 
 62 |     def browseSheets(self, sheets, parentName, page_num=1):
 63 |         # Collect the text and image layers found directly in `sheets`.
 64 |         array = {}
 65 |         for sheet in sheets:
 66 |             if isinstance(sheet, psd_tools.user_api.psd_image.Layer):
 67 |                 print "Processing layer: %d" % self.nb_layers
 68 |                 self.nb_layers -= 1
 69 |                 # This sheet is a layer; it may be an image or some text.
 70 |                 if sheet.text_data is not None:
 71 |                     # Text layer
 72 |                     dic = {'name': sheet.name,
 73 |                            'x': sheet.bbox.x1,
 74 |                            'y': sheet.bbox.y2,
 75 |                            'width': sheet.bbox.width,
 76 |                            'height': sheet.bbox.height,
 77 |                            'string': sheet.text_data.text,
 78 |                            'font': None
 79 |                            }
 80 |                     if 'paragraphs' not in array:
 81 |                         array['paragraphs'] = []
 82 |                     array['paragraphs'].append(dic)
 83 |                 else:
 84 |                     # Image layer
 85 |                     imageName = str(uuid.uuid1())+'_'+sheet.name+'_'+parentName+'.png'
 86 |                     image = sheet.as_PIL()  # render once instead of twice
 87 |                     if image is not None:
 88 |                         image.save(self.imagePath+imageName)
 89 |                     else:
 90 |                         imageName = None
 91 |                     if 'images' not in array:
 92 |                         array['images'] = []
 93 |                     array['images'].append(imageName)
 94 |             elif isinstance(sheet, psd_tools.user_api.psd_image.Group):
 95 |                 # This sheet is a group and may contain many layers.
 96 |                 if sheet.visible_global:
 97 |                     arr = self.browseSheets(sheets=sheet.layers, parentName=sheet.name, page_num=page_num+1)
 98 |                     self.data['pages'].append(arr)
 99 |         return array
100 | 
101 |     def toJson(self):
102 |         """Serialize the parsed data dict to a JSON string."""
103 |         return json.dumps(self.data)
104 | 
105 | 
106 | 
107 | def run(psd_file, image_folder):
108 |     importedPSD = ImportPSD(PSDFilePath=psd_file, imagePath=image_folder)
109 |     importedPSD.parse()
110 |     jsonString = importedPSD.toJson()
111 |     return jsonString
112 | 
113 | 
114 | 
115 | 
116 | if __name__ == '__main__':
117 |     import sys
118 |     if len(sys.argv) == 3:
119 |         print run(sys.argv[1], sys.argv[2])
120 |     else:
121 |         print "usage: %s psd_file_path generated_images_path/ (e.g.: python %s Logo.psd './images/')" % (sys.argv[0], sys.argv[0])
122 | 


--------------------------------------------------------------------------------
/psdtools/test-text.psd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/test-text.psd


--------------------------------------------------------------------------------