├── .gitignore
├── README.md
├── __init__.py
├── install
│   ├── install.sh
│   └── lcms-1.19.tar.gz
├── parser.py
├── pdfreader
│   ├── __init__.py
│   ├── book.pdf
│   ├── brochureint.pdf
│   ├── brochureprint.pdf
│   ├── buttons.pdf
│   ├── ff.pdf
│   ├── ff.png
│   ├── ffa.pdf
│   ├── hello world.pdf
│   ├── lib
│   │   ├── __init__.py
│   │   ├── book.py
│   │   ├── char.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── paragraph.py
│   ├── main.py
│   ├── pff.pdf
│   ├── pffl.pdf
│   ├── png.pdf
│   └── util
│       ├── __init__.py
│       └── convert.py
├── psd_rb
│   ├── Gemfile
│   ├── json_type.json
│   ├── ruby
│   │   ├── factories
│   │   │   ├── group_unit.rb
│   │   │   ├── image_unit.rb
│   │   │   ├── text_unit.rb
│   │   │   ├── unit_factory.rb
│   │   │   └── unknown_unit.rb
│   │   ├── panda_psd.rb
│   │   ├── unit_manager.rb
│   │   └── util.rb
│   └── test.rb
└── psdtools
    ├── A2 16 column.ai
    ├── Logo.psd
    ├── __init__.py
    ├── brochure.indd
    ├── brochure_rtangfold_11x17_OUTh.indd
    ├── buttons.psd
    ├── main.py
    └── test-text.psd
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | pdfreader/.DS_Store
3 |
4 | *.sublime-project
5 |
6 | *.sublime-workspace
7 |
8 | psdtools/.DS_Store
9 |
10 | psdtools/images/.DS_Store
11 |
12 | *.pyc
13 |
14 | /.DS_Store
15 | /install/.DS_Store
16 | /metadata.json
17 | /install/install02500809/
18 | /images/
19 | /psd_rb/.idea/
20 | /psd_rb/Gemfile.lock
21 | /psd_rb/.DS_Store
22 | /psd_rb/psds/
23 | /psd_rb/output/
24 | /psd_rb/output.txt
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | content-extractor
2 | =================
3 |
4 | Content-extractor is a Python-based project. The only reason for this choice is the availability of the libraries (the best ones are in Python, IMO).
5 | Currently, this project parses pdf and psd files to extract meaningful content, such as texts and images, both linked under a common json string.
6 |
7 |
8 | Dependencies
9 | =================
10 |
11 | Content-extractor is built upon the following:
12 |
13 | - [psd-tools](https://pypi.python.org/pypi/psd-tools/) To extract images and text from psd files
14 | - [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/#intro) To extract text from pdf files as xml
15 | - [beautiful soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) To convert the extracted xml as json
16 | - [pdfimages](http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/) (from poppler-utils) To extract images from pdf files as image.ppm
17 | - [ImageMagick](http://www.imagemagick.org/script/index.php) To convert the ppm images to png
18 |
19 |
20 | Installation
21 | =================
22 |
23 | Since there are a lot of dependencies, and most of them have their own dependencies, I have made a shell script to simplify the installation process.
24 |
25 | The script assumes the following:
26 |
27 | - If you don't have [apt-get](http://doc.ubuntu-fr.org/apt-get), then you should have [brew](http://mxcl.github.io/homebrew/).
28 | - You have [curl](http://pwet.fr/man/linux/commandes/curl) installed (the shell executable, not the php module).
29 | - You have [python 2.7.x](http://www.python.org/download/) installed on your system.
30 | - You have [unzip](http://www.info-zip.org/mans/unzip.html) installed on your system.
31 |
32 | When you launch the script, it installs [pip](https://pypi.python.org/pypi/pip) if it isn't already present on your system.
33 |
34 | ```Shell
35 | $ cd install
36 | $ sh install.sh
37 | ```
38 |
39 | How to use it
40 | =================
41 |
42 | For any supported extension (currently pdf/psd) you can use `parser.py [file_path] [image_path]`; it will automatically do the job.
43 |
44 | ```Shell
45 | # Will write a metadata.json and extract the images into the folder images
46 | ./parser.py psdtools/work.psd './images/'
47 | ./parser.py pdfreader/book.pdf './images/'
48 | ```
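
The automatic dispatch relies on the first four bytes of the file, which `parser.py` compares against `8BPS` (PSD) and `%PDF` (PDF). A minimal sketch of that check follows; the names `MAGIC_NUMBERS`, `detect_type` and `detect_file_type` are hypothetical helpers used for illustration only and are not part of the project.

```python
# Signatures tested by parser.py: the first 4 bytes of the file.
MAGIC_NUMBERS = {b"8BPS": "psd", b"%PDF": "pdf"}


def detect_type(header):
    """Hypothetical helper: map the leading bytes of a file to 'psd', 'pdf' or None."""
    return MAGIC_NUMBERS.get(header[:4])


def detect_file_type(file_path):
    """Read only the first 4 bytes of the file and classify them."""
    with open(file_path, "rb") as binary_file:
        return detect_type(binary_file.read(4))
```

Looking at the content rather than the extension means a renamed file is still routed to the right extractor, and an unknown file is rejected before any heavy library is imported.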
49 |
50 | You can also import parser.py into your own Python project and use it the following way:
51 |
52 | ```Python
53 | # Will return a string containing the json and extract the images into the folder images
54 | from parser import parse
55 | json = parse("psdtools/work.psd", "./images/")
56 | json = parse("pdfreader/book.pdf", "./images/")
57 | ```
58 |
59 | You can also use the pdfreader and psdtools scripts independently, like so:
60 |
61 |
62 | ```Shell
63 | # Shell:
64 | $ ./psdtools/main.py psdtools/work.psd './images/'
65 | $ ./pdfreader/main.py pdfreader/book.pdf './images/'
66 | ```
67 | ```Python
68 | # Python:
69 | # PSD
70 | from psdtools import main
71 | json = main.run("psdtools/work.psd", "./images/")
72 | # PDF
73 | from pdfreader import main
74 | json = main.run("pdfreader/book.pdf", "./images/")
75 | ```
76 |
77 |
78 | ./pdfreader/main.py is just a simplified interface to the very powerful pdfreader/util/convert.py. I have rewritten convert.py as a class, but it is originally [pdf2txt.py](http://www.unixuser.org/~euske/python/pdfminer/#pdf2txt) from [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/index.html).
79 | However, you can still use convert.py as if it were the original [pdf2txt.py](http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt) tool; [here is the documentation](http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt).
80 |
81 | ```Shell
82 | $ pdfreader/util/convert.py -o output.html samples/naacl06-shinyama.pdf
83 | (extract text as an HTML file whose filename is output.html)
84 |
85 | $ pdfreader/util/convert.py -V -c euc-jp -o output.html samples/jo.pdf
86 | (extract a Japanese HTML file in vertical writing, CMap is required)
87 |
88 | $ pdfreader/util/convert.py -P mypassword -o output.txt secret.pdf
89 | (extract a text from an encrypted PDF file)
90 | ```
91 |
92 | convert.py can also be imported into a Python project (but fewer options are available, because I have not implemented them all)
93 |
94 | ```Python
95 | # @see pdfreader/main.py:text_to_dict as example
96 | from util.convert import converter
97 | convert = converter()
98 | xml = convert.as_xml().add_input_file(fileinput).run()
99 | ```
100 |
101 | How does it work
102 | =================
103 |
104 | The information is extracted with the goal of keeping the parent-child relations.
105 |
106 | - For a pdf file:
107 |   - A pdf file can have many pages, and the json string follows the same structure. Each page has many images and many paragraphs.
108 |   - Each paragraph has a width, a height, a content (string), an x and y position, a font, and a font size.
109 |   - For a pdf file the original image name can't be extracted, so the images are named as follows: [uniqId]_p[pageNumber].jpg/png. uniqId is a unique id, pageNumber is the page that contains the image. You shouldn't need to parse this information out of the name since the json string contains it.
110 |   - Each image is extracted directly as a jpg when possible; otherwise it is extracted as ppm/pbm and then converted to png.
111 |
112 | - For a psd file:
113 |   - A psd file can have many groups, which are translated into pages in the json string. Each group in a psd file can have many layers; a layer can be either text (paragraphs) or an image (images).
114 |   - Each text layer has a width, a height, a content (string), and an x and y position. The font and font size are currently not extracted (because psd-tools doesn't do it).
115 |   - For a psd file the image is named after the layer it comes from, but since many layers can have the same name, the following pattern is applied to guarantee a unique name: [uniqId]_[layerName]_[groupName].png.
116 |
117 | - PSD files:
118 |   - Text: Because of the psd-tools library, you can't know the font or the bold, italic and underline attributes.
119 |
120 | - PDF files:
121 |   - Text: Because of the pdfminer library, you can't have many fonts in the same paragraph. It is also not possible to extract the underlines. However, the bold and italic attributes are extracted as html and directly integrated into the string.
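
The two image naming conventions described above can be sketched as follows. This is only an illustration: `pdf_image_name` and `psd_image_name` are hypothetical helper names, and the use of `uuid.uuid1()` for the unique id mirrors what `pdfreader/main.py` does for pdf files; it is an assumption for psd files.

```python
import uuid


def pdf_image_name(page_number, extension="png"):
    """[uniqId]_p[pageNumber].jpg/png, e.g. for an image found on page 0."""
    return "%s_p%d.%s" % (uuid.uuid1(), page_number, extension)


def psd_image_name(layer_name, group_name):
    """[uniqId]_[layerName]_[groupName].png, unique even when layer names repeat."""
    return "%s_%s_%s.png" % (uuid.uuid1(), layer_name, group_name)
```

Because the unique id comes first, two layers that share a name inside the same group still produce distinct file names.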
122 |
123 | Below is a simplified example, taken from book.pdf, of what the json string looks like.
124 |
125 | JSON Format (from pdfreader/book.pdf 'simplified')
126 | =================
127 |
128 | ```JSON
129 | {
130 |     "pages": [
131 |         {
132 |             "images": [
133 |                 "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"
134 |             ],
135 |             "paragraphs": [
136 |                 {
137 |                     "size": 98,
138 |                     "width": 587,
139 |                     "string": "Book Title",
140 |                     "y": -98,
141 |                     "x": -324,
142 |                     "font": "Georgia",
143 |                     "height": 705
144 |                 }
145 |             ]
146 |         },
147 |         {
148 |             "images": [
149 |                 "96f4e9ee-c1eb-11e2-ad2b-040ccedc7e34_p1.png"
150 |             ],
151 |             "paragraphs": [
152 |                 {
153 |                     "size": 24,
154 |                     "width": 138,
155 |                     "string": "CHAPTER 1",
156 |                     "y": -24,
157 |                     "x": -88,
158 |                     "font": "Georgia",
159 |                     "height": 711
160 |                 },
161 |                 {
162 |                     "size": 33,
163 |                     "width": 489,
164 |                     "string": "Lorem ipsum dolor sit amet, consectetur \nadipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore \nmagna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.",
165 |                     "y": -229,
166 |                     "x": -439,
167 |                     "font": "Georgia",
168 |                     "height": 269
169 |                 }
170 |             ]
171 |         },
172 |         {
173 |             "paragraphs": [
174 |                 {
175 |                     "size": 24,
176 |                     "width": 133,
177 |                     "string": "SECTION 1",
178 |                     "y": -24,
179 |                     "x": -83,
180 |                     "font": "Georgia",
181 |                     "height": 711
182 |                 }
183 |             ]
184 |         }
185 |     ]
186 | }
187 | ```
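
A consumer of this format could walk the structure as sketched below. The function name `summarize` and the shortened `SAMPLE` string are illustrative assumptions; the only thing taken from the format above is that each page holds a `paragraphs` list and an optional `images` list.

```python
import json

# Shortened sample in the format shown above (the second page has no "images" key).
SAMPLE = """
{"pages": [
    {"images": ["961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"],
     "paragraphs": [{"size": 98, "width": 587, "string": "Book Title",
                     "y": -98, "x": -324, "font": "Georgia", "height": 705}]},
    {"paragraphs": [{"size": 24, "width": 133, "string": "SECTION 1",
                     "y": -24, "x": -83, "font": "Georgia", "height": 711}]}
]}
"""


def summarize(json_string):
    """Return one (page_number, image_count, [paragraph strings]) tuple per page."""
    book = json.loads(json_string)
    summary = []
    for number, page in enumerate(book["pages"]):
        images = page.get("images", [])  # the key is absent when a page has no image
        texts = [p["string"] for p in page.get("paragraphs", [])]
        summary.append((number, len(images), texts))
    return summary
```

Using `page.get("images", [])` rather than `page["images"]` matters here, since pages without images simply omit the key.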
188 |
189 | How to improve it
190 | =================
191 |
192 | - The psd-tools library should be improved to extract the font, bold, italic and underline attributes
193 | - The pdfminer library should be improved to extract the underline attribute
194 | - Matching PEP 8?
195 | - Check whether the proposed image folder exists; if not, create it, and if that fails, raise an error and stop
196 | - Adding support for more file formats
197 | - Speed is not an issue, but why not improve it?
198 | - The psd-tools library should be improved to extract layers containing fx effects (is this even possible?)
199 | - Depending on the platform or a parameter, the `\n` in the string attribute of the JSON should become `<br>` or `\r\n`
200 | - Reduce the number of dependencies?
201 | - What else?
202 |
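
The image folder check listed above could look like the following sketch. `ensure_image_folder` is a hypothetical name and is not implemented in the project; the approach (try to create, tolerate an already existing directory, re-raise anything else) is one possible choice among others.

```python
import errno
import os


def ensure_image_folder(image_folder):
    """Create image_folder if needed; raise OSError if it can't be used as a folder."""
    try:
        os.makedirs(image_folder)
    except OSError as error:
        # An existing directory is fine; anything else (a file with the same
        # name, missing permissions, ...) is reported to the caller.
        if error.errno != errno.EEXIST or not os.path.isdir(image_folder):
            raise
    return image_folder
```

Attempting the creation and handling the error avoids the race between checking for the folder and creating it.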
203 | Contributing
204 | =================
205 |
206 | You're welcome to contribute to this project in any way you can. If you don't know how to code or don't have time, don't worry: you can still post an issue, and I will be happy to answer and fix it as fast as possible.
207 | Want to code? Fork it and submit a pull request! Also, pull requests that come with an example of what has been improved will be merged first.
208 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 |
4 | if __name__ == '__main__': print __version__
5 |
--------------------------------------------------------------------------------
/install/install.sh:
--------------------------------------------------------------------------------
1 | mkdir install02500809
2 | cd install02500809
3 | echo "changing mode of /usr/local/etc to 775\n"
4 | sudo chmod 775 /usr/local/etc
5 | # We check the python2 availability
6 | echo "Checking python2 availability\n"
7 | type python2 >/dev/null 2>&1 || {
8 | echo "python2 unavailable\nNow creating python2 as alias of python\n"
9 | alias python2="python"
10 | }
11 | echo "Done python2\n"
12 | # We suppose that the system has apt-get command already installed (if pdfimages is not available)
13 | # We suppose that the system has curl command already installed
14 | # We check that pip exist on the system
15 | echo "Checking pip availability\n"
16 | type pip >/dev/null 2>&1 || {
17 | echo "pip unavailable\nNow installing pip\n"
18 | mkdir pipInstall
19 | cd pipInstall
20 | curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
21 | sudo python get-pip.py
22 | cd ..
23 | rm -rf pipInstall
24 | }
25 | echo "Done pip\n"
26 |
27 | # pil dependencies
28 | echo "Installing pil dependencies\n"
29 | type apt-get >/dev/null 2>&1 && {
30 | sudo apt-get install libjpeg libjpeg-dev libfreetype6 libfreetype6-dev zlib1g-dev
31 | } || {
32 | type brew >/dev/null 2>&1 && {
33 | brew install jpeg freetype
34 | brew tap homebrew/dupes
35 | brew install zlib
36 | }
37 | }
38 | mkdir lcmsextract
39 | tar xzf ../lcms-1.19.tar.gz --directory=./lcmsextract
40 | cd lcmsextract/lcms-1.19
41 | ./configure
42 | make
43 | sudo make install
44 | cd ../..
45 | rm -rf lcmsextract
46 | echo "Done pil dependencies\n"
47 |
48 | # psd-tools dependencies
49 | echo "Installing psd-tools dependencies\n"
50 | echo "Now installing packbits\n"
51 | sudo pip install -U packbits
52 | echo "Now installing docopt\n"
53 | sudo pip install -U docopt
54 | echo "Now installing pil\n"
55 | sudo pip install -U pil
56 | echo "Done psd-tools dependencies\n"
57 |
58 |
59 | # Installing psd-tools
60 | echo "Installing psd-tools\n"
61 | sudo pip install -U psd-tools
62 | echo "Done psd-tools\n"
63 |
64 | # Installing pdfminer
65 | echo "Installing pdfminer\n"
66 | mkdir pdfminerInstall
67 | cd pdfminerInstall
68 | curl -O https://codeload.github.com/euske/pdfminer/zip/master
69 | unzip master
70 | cd pdfminer-master
71 | make cmap
72 | sudo python setup.py install
73 | cd ../..
74 | rm -rf pdfminerInstall
75 | echo "Done pdfminer\n"
76 |
77 | # Installing beautifulsoup
78 | echo "Installing beautifulsoup\n"
79 | sudo pip install -U beautifulsoup4
80 | echo "Done beautifulsoup\n"
81 |
82 | # If pdfimages is not available we install the lib poppler-utils which contains it
83 | echo "Checking pdfimages\n"
84 | type pdfimages >/dev/null 2>&1 || {
85 | echo "pdfimages unavailable\nNow installing poppler-utils\n"
86 | type apt-get >/dev/null 2>&1 && {
87 | sudo apt-get install poppler-utils
88 | } || {
89 | type brew >/dev/null 2>&1 && {
90 | brew install fontconfig
91 | brew link --overwrite fontconfig
92 | brew install poppler
93 | brew link --overwrite poppler
94 | }
95 | }
96 | }
97 | echo "Done pdfimages\n"
98 |
99 | # If convert is not available we install the lib imagemagick which contains it
100 | echo "Checking convert\n"
101 | type convert >/dev/null 2>&1 || {
102 | echo "convert unavailable\nNow installing imagemagick\n"
103 | mkdir imagemagickInstall
104 | cd imagemagickInstall
105 | curl -O http://www.imagemagick.org/download/ImageMagick.tar.gz
106 | tar xvfz ImageMagick.tar.gz
107 | cd ImageMagick-*/ # the downloaded tarball is the latest release, so the version in the folder name varies
108 | ./configure
109 | make
110 | sudo make install
111 | sudo ldconfig /usr/local/lib
112 | cd ../..
113 | rm -rf imagemagickInstall
114 | }
115 | echo "Done convert\n"
116 | cd ..
117 | rm -rf install02500809
118 |
--------------------------------------------------------------------------------
/install/lcms-1.19.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/install/lcms-1.19.tar.gz
--------------------------------------------------------------------------------
/parser.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 |
3 | import os.path
4 |
5 | def file_exist(file_path):
6 |     # isfile() is False for missing paths, so no separate exists() check is needed
7 |     return os.path.isfile(file_path)
8 |
9 | def get_4b_magic_number(file_path):
10 |     """ Returns the first 4 bytes of the file, or False if it can't be read """
11 |     try:
12 |         # 'with' closes the file even if read() fails
13 |         with open(file_path, 'rb') as binaryFile:
14 |             return binaryFile.read(4)
15 |     except IOError as e:
16 |         print "%s: unable to open/read.\n%s" % (file_path, e)
17 |     return False
18 |
19 | def parse(file_path, image_folder):
20 |     """ Returns the json string, or None if the file can't be parsed """
21 |     if not file_exist(file_path):
22 |         print "%s: file not found." % file_path
23 |         return None
24 |     magicNumber = get_4b_magic_number(file_path)
25 |     if magicNumber is False:
26 |         return None # the error has already been printed
27 |     if magicNumber == "8BPS": #PSD file
28 |         from psdtools import main
29 |     elif magicNumber == "%PDF": #PDF file
30 |         from pdfreader import main
31 |     else:
32 |         print "%s: not a psd nor a pdf (unknown Magic Number)" % file_path
33 |         return None
34 |     return main.run(file_path, image_folder)
35 |
36 | if __name__ == '__main__':
37 |     import sys
38 |     if len(sys.argv) == 3:
39 |         json = parse(sys.argv[1], sys.argv[2])
40 |         if json is not None:
41 |             # We write the json into a file called metadata.json
42 |             target = open("metadata.json", 'w') # 'w' over-writes, 'a' would append
43 |             target.write(json)
44 |             target.close()
45 |     else:
46 |         print "usage: %s pdf_or_psd_file_path generated_images_path/ (eg: python %s book.pdf/.psd './images/')" % (sys.argv[0], sys.argv[0])
47 |
--------------------------------------------------------------------------------
/pdfreader/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 |
4 | if __name__ == '__main__': print __version__
5 |
--------------------------------------------------------------------------------
/pdfreader/book.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/book.pdf
--------------------------------------------------------------------------------
/pdfreader/brochureint.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/brochureint.pdf
--------------------------------------------------------------------------------
/pdfreader/brochureprint.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/brochureprint.pdf
--------------------------------------------------------------------------------
/pdfreader/buttons.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/buttons.pdf
--------------------------------------------------------------------------------
/pdfreader/ff.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ff.pdf
--------------------------------------------------------------------------------
/pdfreader/ff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ff.png
--------------------------------------------------------------------------------
/pdfreader/ffa.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ffa.pdf
--------------------------------------------------------------------------------
/pdfreader/hello world.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/hello world.pdf
--------------------------------------------------------------------------------
/pdfreader/lib/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 |
4 | if __name__ == '__main__': print __version__
5 |
--------------------------------------------------------------------------------
/pdfreader/lib/char.py:
--------------------------------------------------------------------------------
1 | class char(object):
2 |
3 |     _size = 0
4 |     _width = 0
5 |     _height = 0
6 |     _x = 0
7 |     _y = 0
8 |     _font = None
9 |     _isBold = False
10 |     _isItalic = False
11 |     _char = ''
12 |
13 |     @property
14 |     def font(self):
15 |         return self._font
16 |
17 |     @property
18 |     def size(self):
19 |         return self._size
20 |
21 |     @property
22 |     def x(self):
23 |         return self._x
24 |
25 |     @property
26 |     def y(self):
27 |         return self._y
28 |
29 |     def __init__(self, xml_char):
30 |         # Font names look like "PREFIX+Font-Style"; [-1] also works when there is no "PREFIX+"
31 |         self._font = xml_char.get('font').split('+')[-1] if xml_char.get('font') is not None else None
32 |         if self._font is not None:
33 |             (left, top, self._width, self._height) = xml_char.get('bbox').split(',')
34 |             self._width = int(float(self._width))
35 |             self._height = int(float(self._height))
36 |             self._x = int(float(left)) - self._width
37 |             self._y = int(float(top)) - self._height
38 |             self._size = int(float(xml_char.get('size')))
39 |             # The style ("Bold", "Italic", "BoldItalic", ...) is the part after the dash
40 |             style = self._font.split('-')
41 |             self._isBold = len(style) == 2 and 'Bold' in style[1]
42 |             self._isItalic = len(style) == 2 and 'Italic' in style[1]
43 |             self._font = style[0]
44 |         #
45 |         self._char = xml_char.string
46 |
47 |     def isBold(self):
48 |         return self._isBold
49 |
50 |     def isItalic(self):
51 |         return self._isItalic
52 |
53 |     def __str__(self):
54 |         if self._font is None:
55 |             return ""
56 |         try:
57 |             char = str(self._char)
58 |             return char
59 |         except UnicodeEncodeError:
60 |             return ""
61 |
--------------------------------------------------------------------------------
/pdfreader/lib/line.py:
--------------------------------------------------------------------------------
1 | from cStringIO import StringIO
2 | from char import char
3 |
4 | class line(object):
5 |
6 |     _width = 0
7 |     _height = 0
8 |     _x = 0
9 |     _y = 0
10 |     _font = None
11 |     _lines = []
12 |
13 |     @property
14 |     def font(self):
15 |         return self._font
16 |
17 |     @property
18 |     def size(self):
19 |         return self._size
20 |
21 |     @property
22 |     def x(self):
23 |         return self._x
24 |
25 |     @property
26 |     def y(self):
27 |         return self._y
28 |
29 |
30 |     def __init__(self, xml_line):
31 |         (left, top, self._width, self._height) = xml_line.get('bbox').split(',')
32 |         self._width = int(float(self._width))
33 |         self._height = int(float(self._height))
34 |         self._x = int(float(left)) - self._width
35 |         self._y = int(float(top)) - self._height
36 |         #
37 |         xml_string = xml_line.find_all('text')
38 |         self._chars = []
39 |         for c in xml_string:
40 |             self._chars.append(char(c))
41 |         #
42 |         self._font = self._chars[0].font if len(self._chars) > 0 else None
43 |         self._size = self._chars[0].size if len(self._chars) > 0 else None
44 |
45 |     _italic_on = False
46 |     def handle_italic(self, c, string):
47 |         if self._italic_on is False and c.isItalic() is True:
48 |             string.write('<i>')
49 |             self._italic_on = True
50 |         if self._italic_on is True and c.isItalic() is False:
51 |             string.write('</i>')
52 |             self._italic_on = False
53 |
54 |     _bold_on = False
55 |     def handle_bold(self, c, string):
56 |         if self._bold_on is False and c.isBold() is True:
57 |             string.write('<b>')
58 |             self._bold_on = True
59 |         if self._bold_on is True and c.isBold() is False:
60 |             string.write('</b>')
61 |             self._bold_on = False
62 |
63 |
64 |     def __str__(self):
65 |         string = StringIO()
66 |         for c in self._chars:
67 |             self.handle_italic(c, string)
68 |             self.handle_bold(c, string)
69 |             string.write(str(c))
70 |         return string.getvalue()
71 |
72 |
73 |
74 |
75 |
76 |
77 |
78 |
79 | if __name__ == '__main__':
80 |     import sys
81 |     from bs4 import BeautifulSoup
82 |     file="""
83 |
84 |
85 |
86 |
87 | L
88 | O
89 | R
90 | E
91 | M
92 |
93 |
94 |
95 | I
96 | P
97 | S
98 | U
99 | M
100 |
101 |
102 |
103 |
104 |
105 |
106 | """
107 |     soup = BeautifulSoup(file)
108 |     l = line(soup.pages.page.textbox.textline)
109 |     print ("[%s] font[%s]" % (l, l.font))
110 |
--------------------------------------------------------------------------------
/pdfreader/lib/page.py:
--------------------------------------------------------------------------------
1 | from paragraph import paragraph
2 | from cStringIO import StringIO
3 |
4 | class page(object):
5 |
6 |     _paragraphs = []
7 |
8 |     @property
9 |     def paragraphs(self):
10 |         return self._paragraphs
11 |
12 |
13 |     def __init__(self, xml_page):
14 |         xml = xml_page.find_all('textbox')
15 |         self._paragraphs = []
16 |         for p in xml:
17 |             self._paragraphs.append(paragraph(p))
18 |
19 |     def toDict(self):
20 |         array = {'paragraphs': []}
21 |         for p in self._paragraphs:
22 |             array['paragraphs'].append(p.toDict())
23 |         return array
24 |
25 |
26 |     def __str__(self):
27 |         string = StringIO()
28 |         count = 0
29 |         for p in self._paragraphs:
30 |             if count != 0:
31 |                 string.write("\n\n")
32 |             string.write(str(p))
33 |             count = 1
34 |         return string.getvalue()
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 |
50 |
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
60 |
61 |
62 |
63 |
64 |
65 | if __name__ == '__main__':
66 |     import sys
67 |     from bs4 import BeautifulSoup
68 |     file="""
69 |
70 |
71 |
72 |
73 | L
74 | O
75 | R
76 | E
77 | M
78 |
79 | I
80 | P
81 | S
82 | U
83 | M
84 |
85 | d
86 | o
87 | l
88 | o
89 | r
90 |
91 | s
92 | i
93 | t
94 |
95 | a
96 | m
97 | e
98 | t
99 | ,
100 |
101 | c
102 | o
103 | n
104 | s
105 | e
106 | c
107 | t
108 | e
109 | t
110 | u
111 | r
112 |
113 |
114 |
115 |
116 |
117 | a
118 | d
119 | i
120 | p
121 | i
122 | s
123 | i
124 | c
125 | i
126 | n
127 | g
128 |
129 | e
130 | l
131 | i
132 | t
133 | ,
134 |
135 | s
136 | e
137 | d
138 |
139 | d
140 | o
141 |
142 | e
143 | i
144 | u
145 | s
146 | m
147 | o
148 | d
149 |
150 |
151 |
152 |
153 | t
154 | e
155 | m
156 | p
157 | o
158 | r
159 |
160 | i
161 | n
162 | c
163 | i
164 | d
165 | i
166 | d
167 | u
168 | n
169 | t
170 |
171 | u
172 | t
173 |
174 | l
175 | a
176 | b
177 | o
178 | r
179 | e
180 |
181 | e
182 | t
183 |
184 | d
185 | o
186 | l
187 | o
188 | r
189 | e
190 |
191 |
192 |
193 |
194 |
195 | m
196 | a
197 | g
198 | n
199 | a
200 |
201 | a
202 | l
203 | i
204 | q
205 | u
206 | a
207 | .
208 |
209 | U
210 | t
211 |
212 | e
213 | n
214 | i
215 | m
216 |
217 | a
218 | d
219 |
220 | m
221 | i
222 | n
223 | i
224 | m
225 |
226 | v
227 | e
228 | n
229 | i
230 | a
231 | m
232 | ,
233 |
234 |
235 |
236 |
237 |
238 |
239 |
240 | U
241 | n
242 | t
243 | i
244 | t
245 | l
246 | e
247 | d
248 |
249 |
250 |
251 |
252 |
253 |
254 | L
255 | O
256 | R
257 | E
258 | M
259 |
260 |
261 |
262 | I
263 |
264 | P
265 |
266 | S
267 | U
268 | M
269 |
270 |
271 |
272 |
273 |
274 |
275 | """
276 |     soup = BeautifulSoup(file)
277 |     p = page(soup.pages.page)
278 |     print ("[%s] json[%s]" % (p, p.toDict()))
279 |
--------------------------------------------------------------------------------
/pdfreader/lib/paragraph.py:
--------------------------------------------------------------------------------
1 | from line import line
2 | from cStringIO import StringIO
3 |
4 | class paragraph(object):
5 |
6 |     _width = 0
7 |     _height = 0
8 |     _x = 0
9 |     _y = 0
10 |     _font = None
11 |     _lines = []
12 |
13 |     @property
14 |     def font(self):
15 |         return self._font
16 |
17 |     @property
18 |     def size(self):
19 |         return self._size
20 |
21 |     @property
22 |     def x(self):
23 |         return self._x
24 |
25 |     @property
26 |     def y(self):
27 |         return self._y
28 |
29 |     def __init__(self, xml_paragraph):
30 |         (left, top, self._width, self._height) = xml_paragraph.get('bbox').split(',')
31 |         self._width = int(float(self._width))
32 |         self._height = int(float(self._height))
33 |         self._x = int(float(left)) - self._width
34 |         self._y = int(float(top)) - self._height
35 |         #
36 |         xml = xml_paragraph.find_all('textline')
37 |         self._lines = []
38 |         for l in xml:
39 |             self._lines.append(line(l))
40 |         #
41 |         self._font = self._lines[0].font if len(self._lines) > 0 else None
42 |         self._size = self._lines[0].size if len(self._lines) > 0 else None
43 |
44 |     def toDict(self):
45 |         dico = dict({ 'font': self._font,
46 |                       'size': self._size,
47 |                       'x': self._x,
48 |                       'y': self._y,
49 |                       'width': self._width,
50 |                       'height': self._height
51 |                     })
52 |         string = StringIO()
53 |         count = 0
54 |         for l in self._lines:
55 |             if count != 0:
56 |                 string.write("\n")
57 |             string.write(str(l))
58 |             count = 1
59 |         dico.update({'string': string.getvalue()})
60 |         return dico
61 |
62 |     def __str__(self):
63 |         string = StringIO()
64 |         count = 0
65 |         for l in self._lines:
66 |             if count != 0:
67 |                 string.write("\n")
68 |             string.write(str(l))
69 |             count = 1
70 |         return string.getvalue()
71 |
72 |
73 |
74 |
75 |
76 |
77 |
78 |
79 |
80 |
81 |
82 |
83 |
84 |
85 |
86 |
87 |
88 |
89 |
90 |
91 |
92 |
93 | if __name__ == '__main__':
94 |     import sys
95 |     import json
96 |     from bs4 import BeautifulSoup
97 |     file="""
98 |
99 |
100 |
101 |
102 | L
103 | O
104 | R
105 | E
106 | M
107 |
108 | I
109 | P
110 | S
111 | U
112 | M
113 |
114 | d
115 | o
116 | l
117 | o
118 | r
119 |
120 | s
121 | i
122 | t
123 |
124 | a
125 | m
126 | e
127 | t
128 | ,
129 |
130 | c
131 | o
132 | n
133 | s
134 | e
135 | c
136 | t
137 | e
138 | t
139 | u
140 | r
141 |
142 |
143 |
144 |
145 |
146 | a
147 | d
148 | i
149 | p
150 | i
151 | s
152 | i
153 | c
154 | i
155 | n
156 | g
157 |
158 | e
159 | l
160 | i
161 | t
162 | ,
163 |
164 | s
165 | e
166 | d
167 |
168 | d
169 | o
170 |
171 | e
172 | i
173 | u
174 | s
175 | m
176 | o
177 | d
178 |
179 |
180 |
181 |
182 | t
183 | e
184 | m
185 | p
186 | o
187 | r
188 |
189 | i
190 | n
191 | c
192 | i
193 | d
194 | i
195 | d
196 | u
197 | n
198 | t
199 |
200 | u
201 | t
202 |
203 | l
204 | a
205 | b
206 | o
207 | r
208 | e
209 |
210 | e
211 | t
212 |
213 | d
214 | o
215 | l
216 | o
217 | r
218 | e
219 |
220 |
221 |
222 |
223 |
224 | m
225 | a
226 | g
227 | n
228 | a
229 |
230 | a
231 | l
232 | i
233 | q
234 | u
235 | a
236 | .
237 |
238 | U
239 | t
240 |
241 | e
242 | n
243 | i
244 | m
245 |
246 | a
247 | d
248 |
249 | m
250 | i
251 | n
252 | i
253 | m
254 |
255 | v
256 | e
257 | n
258 | i
259 | a
260 | m
261 | ,
262 |
263 |
264 |
265 |
266 | q
267 | u
268 | i
269 | s
270 |
271 | n
272 | o
273 | s
274 | t
275 | r
276 | u
277 | d
278 |
279 | e
280 | x
281 | e
282 | r
283 | c
284 | i
285 | t
286 | a
287 | t
288 | i
289 | o
290 | n
291 |
292 | u
293 | l
294 | l
295 | a
296 | m
297 | c
298 | o
299 |
300 | l
301 | a
302 | b
303 | o
304 | r
305 | i
306 | s
307 |
308 |
309 |
310 |
311 |
312 | n
313 | i
314 | s
315 | i
316 |
317 | u
318 | t
319 |
320 | a
321 | l
322 | i
323 | q
324 | u
325 | i
326 | p
327 |
328 | e
329 | x
330 |
331 | e
332 | a
333 |
334 | c
335 | o
336 | m
337 | m
338 | o
339 | d
340 | o
341 |
342 |
343 |
344 |
345 | c
346 | o
347 | n
348 | s
349 | e
350 | q
351 | u
352 | a
353 | t
354 | .
355 |
356 |
357 |
358 |
359 |
360 |
361 | """
362 |     soup = BeautifulSoup(file)
363 |     p = paragraph(soup.pages.page.textbox)
364 |     print ("[%s] font[%s] json[%s]" % (p, p.font, p.toDict()))
365 |
--------------------------------------------------------------------------------
/pdfreader/main.py:
--------------------------------------------------------------------------------
1 | from util.convert import converter
2 | from lib.book import book
3 |
4 | def text_to_dict(fileinput):
5 |     """
6 |     Extract the text of the pdf into an xml string
7 |     """
8 |     convert = converter()
9 |     xml = convert.as_xml().add_input_file(fileinput).run()
10 |     """
11 |     Parse the xml string to transform it into a dictionary
12 |     """
13 |     b = book(xml)
14 |     return b.toDict()
15 |
16 |
17 | """
18 | Extracting the images out of the pdf file
19 | images are named respecting the following convention: tempimg-[pageNumber]-[imageNumber].[jpg|ppm|pbm] (eg: tempimg-001-000.ppm)
20 | """
21 | import subprocess
22 | import sys
23 |
24 | def extract_images(file):
25 |     # Arguments are passed as a list so that paths containing spaces (eg: "hello world.pdf") are handled
26 |     subprocess.call(['/usr/local/bin/pdfimages', '-p', '-j', file, 'tempimg'], stderr=sys.stdout)
27 |
28 |
29 | """
30 | Matching the extracted images (jpg, ppm, pbm) by name; jpg images are moved as-is, ppm/pbm images are converted to png
31 | At the same time, dict_book is updated with the names of the resulting images
32 | """
33 | import glob
34 | import json
35 | import uuid
36 | import re
37 | import os
38 |
39 | def get_img_names_by_page_number():
40 |     image_list = {}
41 |     nb_images = 0
42 |     ppm_images = glob.glob('./tempimg*.*')
43 |     for image in ppm_images:
44 |         match = re.match(r'\./tempimg-(\d+)-(\d+)\.(jpg|ppm|pbm)$', image, re.I)
45 |         if match:
46 |             page_num = int(match.group(1)) - 1
47 |             image_num = int(match.group(2))
48 |             nb_images += 1
49 |             if page_num not in image_list:
50 |                 image_list[page_num] = {}
51 |             image_list[page_num].update({image_num:image})
52 |     return image_list, nb_images
53 |
54 | def rename_imgs__update_dict(image_list, dict_book, image_folder):
55 |     image_num = 1
56 |     for page in image_list.iterkeys():
57 |         for image in image_list[page].iterkeys():
58 |             print "Processing image %d" % image_num
59 |             image_num += 1
60 |             if 'images' not in dict_book['pages'][page]:
61 |                 dict_book['pages'][page].update({'images':[]})
62 |             if image_list[page][image].lower().endswith('.jpg'):
63 |                 image_name = "%s_p%d.jpg" % (uuid.uuid1(), page)
64 |                 dict_book['pages'][page]['images'].append(image_name)
65 |                 subprocess.call(['mv', image_list[page][image], os.path.join(image_folder, image_name)], stderr=sys.stdout)
66 |             elif image_list[page][image].lower().endswith(('.ppm', '.pbm')):
67 |                 image_name = "%s_p%d.png" % (uuid.uuid1(), page)
68 |                 dict_book['pages'][page]['images'].append(image_name)
69 |                 subprocess.call(['/usr/local/bin/convert', image_list[page][image], os.path.join(image_folder, image_name)], stderr=sys.stdout)
70 |                 os.remove(image_list[page][image])
71 |     return dict_book
72 |
73 | def get_images_update_dict(dict_book, image_folder):
74 |     image_list, nb_images = get_img_names_by_page_number()
75 |     print "%d images to process" % nb_images
76 |     dict_book = rename_imgs__update_dict(image_list, dict_book, image_folder)
77 |     return dict_book
78 |
79 |
80 | def run(pdf_file, image_folder):
81 |     print "Reading PDF"
82 |     dict_book = text_to_dict(pdf_file)
83 |     print "Extracting images"
84 |     extract_images(pdf_file)
85 |     dict_book = get_images_update_dict(dict_book, image_folder)
86 |     return json.dumps(dict_book)
87 |
88 |
89 | if __name__ == '__main__':
90 |     import sys
91 |     if len(sys.argv) == 3:
92 |         print run(pdf_file=sys.argv[1], image_folder=sys.argv[2])
93 |     else:
94 |         print "usage: %s pdf_file_path generated_images_path/ (eg: python %s book.pdf './images/')" % (sys.argv[0], sys.argv[0])
95 |
96 |
97 |
98 | """
99 | # here is an example of how look the dict_book
100 | print json.dumps(dict_book)
101 | {
102 | "pages": [
103 | {
104 | "images": [
105 | "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"
106 | ],
107 | "paragraphs": [
108 | {
109 | "size": 98,
110 | "width": 587,
111 | "string": "Book Title",
112 | "y": -98,
113 | "x": -324,
114 | "font": "Georgia",
115 | "height": 705
116 | }
117 | ]
118 | },
119 | {
120 | "images": [
121 | "96f4e9ee-c1eb-11e2-ad2b-040ccedc7e34_p1.png"
122 | ],
123 | "paragraphs": [
124 | {
125 | "size": 24,
126 | "width": 138,
127 | "string": "CHAPTER 1",
128 | "y": -24,
129 | "x": -88,
130 | "font": "Georgia",
131 | "height": 711
132 | },
133 | {
134 | "size": 33,
135 | "width": 489,
136 | "string": "Lorem ipsum dolor sit amet, consectetur \nadipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore \nmagna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.",
137 | "y": -229,
138 | "x": -439,
139 | "font": "Georgia",
140 | "height": 269
141 | }
142 | ]
143 | },
144 | {
145 | "paragraphs": [
146 | {
147 | "size": 24,
148 | "width": 133,
149 | "string": "SECTION 1",
150 | "y": -24,
151 | "x": -83,
152 | "font": "Georgia",
153 | "height": 711
154 | }
155 | ]
156 | }
157 | ]
158 | }
159 | """
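The `dict_book` structure shown above can be consumed page by page. The following is a minimal sketch, not part of the repository: the `summarize` helper and the shortened `SAMPLE` data are hypothetical, and the code assumes (as in the third page of the example) that the `images` key is simply absent when a page has no image. It is written for a current Python 3 interpreter, whereas the project itself targets Python 2.

```python
import json

# Hypothetical sample shaped like the dict_book example (values shortened).
SAMPLE = json.loads("""
{"pages": [
  {"images": ["a_p0.png"],
   "paragraphs": [{"string": "Book Title", "font": "Georgia", "size": 98}]},
  {"paragraphs": [{"string": "SECTION 1", "font": "Georgia", "size": 24}]}
]}
""")


def summarize(dict_book):
    """Return one (page_index, image_count, paragraph_strings) tuple per page."""
    summary = []
    for index, page in enumerate(dict_book["pages"]):
        # "images" is missing on pages without images, so default to an empty list.
        images = page.get("images", [])
        strings = [p["string"] for p in page.get("paragraphs", [])]
        summary.append((index, len(images), strings))
    return summary


SUMMARY = summarize(SAMPLE)
```

Using `dict.get` with a default keeps the walk safe on pages that carry only paragraphs or only images.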
--------------------------------------------------------------------------------
/pdfreader/pff.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/pff.pdf
--------------------------------------------------------------------------------
/pdfreader/pffl.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/pffl.pdf
--------------------------------------------------------------------------------
/pdfreader/png.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/png.pdf
--------------------------------------------------------------------------------
/pdfreader/util/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 |
4 | if __name__ == '__main__': print __version__
5 |
--------------------------------------------------------------------------------
/pdfreader/util/convert.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | from pdfminer.pdfdocument import PDFDocument
3 | from pdfminer.pdfparser import PDFParser
4 | from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
5 | from pdfminer.pdfpage import PDFPage
6 | from pdfminer.pdfdevice import PDFDevice, TagExtractor
7 | from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
8 | from pdfminer.cmapdb import CMapDB
9 | from pdfminer.layout import LAParams
10 | from pdfminer.image import ImageWriter
11 |
12 |
13 | class converter(object):
14 |
15 |     def usage(self):
16 |         print ('usage: %s [-d] [-p pagenos] [-m maxpages] [-P password] [-o output] [-C] '
17 |                '[-n] [-A] [-V] [-M char_margin] [-L line_margin] [-W word_margin] [-F boxes_flow] '
18 |                '[-Y layout_mode] [-O output_dir] [-t text|html|xml|tag] [-c codec] [-s scale] file ...' % self._argv[0])
19 |         return 100
20 |
21 |     def __init__(self):
22 |         # debug option
23 |         self._debug = 0
24 |         # input option
25 |         self._password = ''
26 |         self._pagenos = set()
27 |         self._maxpages = 0
28 |         # output option
29 |         self._outfile = None
30 |         self._outtype = None
31 |         self._imagewriter = None
32 |         self._layoutmode = 'normal'
33 |         self._codec = 'utf-8'
34 |         self._pageno = 1
35 |         self._scale = 1
36 |         self._caching = True
37 |         self._showpageno = True
38 |         self._laparams = LAParams()
39 |         self._argv = ['convert.py']
40 |         self._args = []
41 |
42 |     def mainIniter(self, argv):
43 |         import getopt
44 |         try:
45 |             (opts, args) = getopt.getopt(argv[1:], 'dp:m:P:o:CnAVM:L:W:F:Y:O:t:c:s:')
46 |         except getopt.GetoptError:
47 |             return self.usage()
48 |         if not args: return self.usage()
49 |         for (k, v) in opts:
50 |             if k == '-d': self._debug += 1
51 |             elif k == '-p': self._pagenos.update( int(x)-1 for x in v.split(',') )
52 |             elif k == '-m': self._maxpages = int(v)
53 |             elif k == '-P': self._password = v
54 |             elif k == '-o': self._outfile = v
55 |             elif k == '-C': self._caching = False
56 |             elif k == '-n': self._laparams = None
57 |             elif k == '-A': self._laparams.all_texts = True
58 |             elif k == '-V': self._laparams.detect_vertical = True
59 |             elif k == '-M': self._laparams.char_margin = float(v)
60 |             elif k == '-L': self._laparams.line_margin = float(v)
61 |             elif k == '-W': self._laparams.word_margin = float(v)
62 |             elif k == '-F': self._laparams.boxes_flow = float(v)
63 |             elif k == '-Y': self._layoutmode = v
64 |             elif k == '-O': self._imagewriter = ImageWriter(v)
65 |             elif k == '-t': self._outtype = v
66 |             elif k == '-c': self._codec = v
67 |             elif k == '-s': self._scale = float(v)
68 |         #
69 |         self._argv = argv
70 |         self._args = args
71 |         #
72 |         PDFDocument.debug = self._debug
73 |         PDFParser.debug = self._debug
74 |         CMapDB.debug = self._debug
75 |         PDFResourceManager.debug = self._debug
76 |         PDFPageInterpreter.debug = self._debug
77 |         PDFDevice.debug = self._debug
78 |         return self.run()
79 |
80 |     """
81 |     Output options
82 |     """
83 |     def with_debug(self, todo=True):
84 |         self._debug = 1 if todo else 0
85 |         return self
86 |
87 |     def as_text(self, todo=True):
88 |         self._outtype = 'text' if todo else None
89 |         return self
90 |
91 |     def as_xml(self, todo=True):
92 |         self._outtype = 'xml' if todo else None
93 |         return self
94 |
95 |     def as_html(self, todo=True):
96 |         self._outtype = 'html' if todo else None
97 |         return self
98 |
99 |     def as_tag(self, todo=True):
100 |         self._outtype = 'tag' if todo else None
101 |         return self
102 |
103 |     def add_input_file(self, filename):
104 |         self._args.append(filename)
105 |         return self
106 |
107 |     """
108 |     Process the pdf file(s)
109 |     """
110 |     def run(self):
111 |         rsrcmgr = PDFResourceManager(caching=self._caching)
112 |         if not self._outtype:
113 |             self._outtype = 'text'
114 |             if __name__ == '__main__':
115 |                 if self._outfile:
116 |                     if self._outfile.endswith('.htm') or self._outfile.endswith('.html'):
117 |                         self._outtype = 'html'
118 |                     elif self._outfile.endswith('.xml'):
119 |                         self._outtype = 'xml'
120 |                     elif self._outfile.endswith('.tag'):
121 |                         self._outtype = 'tag'
122 |         if __name__ == '__main__':
123 |             if self._outfile:
124 |                 outfp = file(self._outfile, 'w')
125 |             else:
126 |                 outfp = sys.stdout
127 |         else:
128 |             from cStringIO import StringIO
129 |             outfp = StringIO()
130 |         if self._outtype == 'text':
131 |             device = TextConverter(rsrcmgr, outfp, codec=self._codec, laparams=self._laparams, imagewriter=self._imagewriter)
132 |         elif self._outtype == 'xml':
133 |             device = XMLConverter(rsrcmgr, outfp, codec=self._codec, laparams=self._laparams, imagewriter=self._imagewriter)
134 |         elif self._outtype == 'html':
135 |             device = HTMLConverter(rsrcmgr, outfp, codec=self._codec, scale=self._scale, layoutmode=self._layoutmode, laparams=self._laparams, imagewriter=self._imagewriter)
136 |         elif self._outtype == 'tag':
137 |             device = TagExtractor(rsrcmgr, outfp, codec=self._codec)
138 |         else:
139 |             return self.usage()
140 |         for fname in self._args:
141 |             fp = file(fname, 'rb')
142 |             interpreter = PDFPageInterpreter(rsrcmgr, device)
143 |
144 |             for page in PDFPage.get_pages(fp, self._pagenos, maxpages=self._maxpages, password=self._password, caching=self._caching, check_extractable=True):
145 |                 interpreter.process_page(page)
146 |
147 |             fp.close()
148 |         device.close()
149 |         if __name__ == '__main__':
150 |             outfp.close()
151 |         else:
152 |             return outfp.getvalue()
153 |
154 |
155 |
156 |
157 | if __name__ == '__main__':
158 |     import sys
159 |     sys.exit(converter().mainIniter(sys.argv))
160 |
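The option handling in `mainIniter` relies on the standard `getopt` module. The sketch below only illustrates how that option string splits a command line and how `-p` page numbers become zero-based. The `parse` helper and the sample arguments are hypothetical, and only two of the options are interpreted here.

```python
import getopt

# Same option string as in mainIniter: a letter followed by ':' takes a value.
OPTSTRING = 'dp:m:P:o:CnAVM:L:W:F:Y:O:t:c:s:'


def parse(argv):
    """Split argv (without the program name) into output type, pages and files."""
    opts, args = getopt.getopt(argv, OPTSTRING)
    pagenos = set()
    outtype = None
    for key, value in opts:
        if key == '-p':
            # Page numbers are 1-based on the command line, 0-based internally.
            pagenos.update(int(x) - 1 for x in value.split(','))
        elif key == '-t':
            outtype = value
    return outtype, pagenos, args


RESULT = parse(['-t', 'xml', '-p', '1,3', 'book.pdf'])
```

Anything left over after the recognised options (here `book.pdf`) is returned as the list of input files, which is what `run` iterates over through `self._args`.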
--------------------------------------------------------------------------------
/psd_rb/Gemfile:
--------------------------------------------------------------------------------
1 | source 'https://rubygems.org'
2 |
3 | gem 'psd'
4 |
--------------------------------------------------------------------------------
/psd_rb/json_type.json:
--------------------------------------------------------------------------------
1 | [{
2 | "folders": [{
3 | "created_at": "2014-04-17T08:23:42Z",
4 | "_id": "534f8f8e94da26c97d00003a",
5 | "description": "jheuh iuwh iwuhe iuhiuhwiuehriguh iuwheuirhgwhiuerh uihwgiuewrhguiehu hweurh u ieuwh eiuh iuh weuh ihwe iuhwe uhreuhgreugh wieh wiuehwueirghwei hiuh iuewh uiwh uiwhu u weiu rui wiuhewiug iuwer wegwe hguihrgiuwe reh erwgheihg e g ehgiuheguihw eugi hwegh opj s dfspo sdk kdf",
6 | "data": {
7 | "orientation": "0",
8 | "projectColors": [{
9 | "alpha": 1,
10 | "color": 16777215
11 | }, {
12 | "alpha": 1,
13 | "color": 16777215
14 | }, {
15 | "alpha": 1,
16 | "color": 16777215
17 | }, {
18 | "alpha": 1,
19 | "color": 16777215
20 | }, {
21 | "alpha": 1,
22 | "color": 16777215
23 | }, {
24 | "alpha": 1,
25 | "color": 16777215
26 | }, {
27 | "alpha": 1,
28 | "color": 16777215
29 | }, {
30 | "alpha": 1,
31 | "color": 16777215
32 | }, {
33 | "alpha": 1,
34 | "color": 16777215
35 | }, {
36 | "alpha": 1,
37 | "color": 16777215
38 | }, {
39 | "alpha": 1,
40 | "color": 16777215
41 | }, {
42 | "alpha": 1,
43 | "color": 16777215
44 | }, {
45 | "alpha": 1,
46 | "color": 16777215
47 | }, {
48 | "alpha": 1,
49 | "color": 16777215
50 | }, {
51 | "alpha": 1,
52 | "color": 16777215
53 | }, {
54 | "alpha": 1,
55 | "color": 16777215
56 | }, {
57 | "alpha": 1,
58 | "color": 16777215
59 | }, {
60 | "alpha": 1,
61 | "color": 16777215
62 | }, {
63 | "alpha": 1,
64 | "color": 16777215
65 | }, {
66 | "alpha": 1,
67 | "color": 16777215
68 | }, {
69 | "alpha": 1,
70 | "color": 16777215
71 | }, {
72 | "alpha": 1,
73 | "color": 16777215
74 | }, {
75 | "alpha": 1,
76 | "color": 16777215
77 | }, {
78 | "alpha": 1,
79 | "color": 16777215
80 | }],
81 | "lastFontColor": 16777215,
82 | "navigation": true,
83 | "tags": null,
84 | "width": 1024,
85 | "height": 768
86 | },
87 | "sid": "534f8f8e94da26c97d00003a",
88 | "user_id": "52820a5994da266b19000002",
89 | "type": "ProjectFolder",
90 | "folderName": "V42",
91 | "last_published_at": "2014-04-17T08:23:42+00:00",
92 | "child_ids": ["534f8f8e94da26c97d00003b"],
93 | "unit_ids": ["534f8f8e94da26c97d00002e", "534f8f8e94da26c97d000030"],
94 | "resource_ids": ["5379fec120d5ac4294000051"],
95 | "font_ids": ["527a597cab134d6a3e000005"],
96 | "updated_at": "2014-05-19T12:53:32Z",
97 | "_type": "Folder"
98 | }, {
99 | "created_at": "2014-04-17T08:23:42Z",
100 | "_id": "534f8f8e94da26c97d00003b",
101 | "unit_ids": ["534f8f8e94da26c97d00002c"],
102 | "data": {
103 | "index": 0
104 | },
105 | "sid": "534f8f8e94da26c97d00003a",
106 | "folderName": "Untitled",
107 | "parent_ids": ["534f8f8e94da26c97d00003a"],
108 | "type": "ViewFolder",
109 | "updated_at": "2014-05-19T12:53:32Z",
110 | "_type": "Folder"
111 | }],
112 | "units": [{
113 | "created_at": "2014-04-17T08:23:42Z",
114 | "_id": "534f8f8e94da26c97d00002c",
115 | "folder_ids": ["534f8f8e94da26c97d00003b"],
116 | "stype": "TMWorld",
117 | "type": "world",
118 | "sid": "534f8f8e94da26c97d00003a",
119 | "child_ids": ["534f8f8e94da26c97d00002d", "534f8f8e94da26c97d000038", "534f8f8e94da26c97d000039", "5379feb120d5ac429400004a", "5379feb120d5ac429400004e", "5379feb120d5ac429400004f"],
120 | "data": {
121 | "index": 0,
122 | "unitName": "Untitled",
123 | "bindings": {
124 | "ressources": null,
125 | "properties": {}
126 | }
127 | },
128 | "updated_at": "2014-05-19T12:53:32Z",
129 | "_type": "Unit"
130 | }, {
131 | "created_at": "2014-04-17T08:23:42Z",
132 | "folder_ids": ["534f8f8e94da26c97d00003a"],
133 | "stype": "TMWorld",
134 | "type": "world",
135 | "_id": "534f8f8e94da26c97d00002e",
136 | "sid": "534f8f8e94da26c97d00003a",
137 | "child_ids": ["534f8f8e94da26c97d00002f"],
138 | "data": {
139 | "index": 0,
140 | "unitName": "masterOverView"
141 | },
142 | "updated_at": "2014-04-17T08:23:42Z",
143 | "_type": "Unit"
144 | }, {
145 | "created_at": "2014-04-17T08:23:42Z",
146 | "folder_ids": ["534f8f8e94da26c97d00003a"],
147 | "stype": "TMWorld",
148 | "type": "world",
149 | "_id": "534f8f8e94da26c97d000030",
150 | "sid": "534f8f8e94da26c97d00003a",
151 | "child_ids": ["534f8f8e94da26c97d000031"],
152 | "data": {
153 | "index": 0,
154 | "unitName": "masterUnderView"
155 | },
156 | "updated_at": "2014-04-17T08:23:42Z",
157 | "_type": "Unit"
158 | }, {
159 | "created_at": "2014-04-17T08:23:42Z",
160 | "_id": "534f8f8e94da26c97d00002d",
161 | "flat_child_ids": ["534f8f8e94da26c97d000038", "534f8f8e94da26c97d000039", "5379feb120d5ac429400004a"],
162 | "stype": "statestack",
163 | "type": "graphicunit",
164 | "sid": "534f8f8e94da26c97d00003a",
165 | "parent_ids": ["534f8f8e94da26c97d00002c"],
166 | "resource_ids": ["5379fec120d5ac4294000051"],
167 | "data": {
168 | "stacks": {
169 | "5347a59d5d04707f8f000020": {
170 | "masterBack": null,
171 | "actions": {},
172 | "thumb": "5379fec120d5ac4294000051",
173 | "ref": null,
174 | "sactions": null,
175 | "masterFront": null,
176 | "units": [0, 1, 2],
177 | "color": {
178 | "solid": {
179 | "alpha": 1,
180 | "color": 102
181 | }
182 | },
183 | "name": "Untitled"
184 | },
185 | "o": ["5347a59d5d04707f8f000020"]
186 | },
187 | "internalWPid": "5347a59d5d04707f8f00002d",
188 | "world": {
189 | "args": [0, 0, 1024, 768, 0],
190 | "width": 1024,
191 | "height": 768,
192 | "type": "TMWorld"
193 | },
194 | "unitName": "RootStateStack1451",
195 | "bindings": {
196 | "ressources": null,
197 | "properties": {}
198 | },
199 | "indexChild": 1
200 | },
201 | "updated_at": "2014-05-19T12:53:31Z",
202 | "_type": "Unit"
203 | }, {
204 | "created_at": "2014-04-17T08:23:42Z",
205 | "_id": "534f8f8e94da26c97d000038",
206 | "stype": "Text",
207 | "type": "media",
208 | "sid": "534f8f8e94da26c97d00003a",
209 | "flat_parent_ids": ["534f8f8e94da26c97d00002d"],
210 | "parent_ids": ["534f8f8e94da26c97d00002c"],
211 | "resource_ids": ["534f8f8e94da26c97d00002b"],
212 | "data": {
213 | "opacity": 0.5,
214 | "world": {
215 | "args": [160, 31, 200, 183, 0],
216 | "width": 1024,
217 | "height": 768,
218 | "type": "TMWorld"
219 | },
220 | "unitName": "Lorem ipsum dolor sit amet, co",
221 | "bindings": {
222 | "ressources": null,
223 | "properties": {}
224 | },
225 | "indexChild": 1
226 | },
227 | "updated_at": "2014-05-19T12:53:31Z",
228 | "_type": "Unit"
229 | }, {
230 | "created_at": "2014-04-17T08:23:42Z",
231 | "_id": "534f8f8e94da26c97d000039",
232 | "stype": "image",
233 | "type": "media",
234 | "sid": "534f8f8e94da26c97d00003a",
235 | "flat_parent_ids": ["534f8f8e94da26c97d00002d"],
236 | "parent_ids": ["534f8f8e94da26c97d00002c"],
237 | "resource_ids": ["5347a61c5d04707f8f000059"],
238 | "data": {
239 | "isAutoName": false,
240 | "actions": {
241 | "4": null
242 | },
243 | "world": {
244 | "args": [536, 31, 329, 329, 0],
245 | "width": 1024,
246 | "height": 768,
247 | "type": "TMWorld"
248 | },
249 | "unitName": "seal_bear.jpg",
250 | "bindings": {
251 | "ressources": null,
252 | "properties": {}
253 | },
254 | "indexChild": 2
255 | },
256 | "updated_at": "2014-05-19T12:53:31Z",
257 | "_type": "Unit"
258 | }, {
259 | "created_at": "2014-05-19T12:53:31Z",
260 | "_id": "5379feb120d5ac429400004a",
261 | "stype": "group",
262 | "flat_parent_ids": ["534f8f8e94da26c97d00002d"],
263 | "data": {
264 | "stacks": {
265 | "5379feb120d5ac429400004b": {
266 | "ref": null,
267 | "units": [0, 1],
268 | "actions": {},
269 | "thumb": null,
270 | "name": "Untitled"
271 | },
272 | "o": ["5379feb120d5ac429400004b"]
273 | },
274 | "world": {
275 | "args": [243, 511, 321, 188, 0],
276 | "width": 1024,
277 | "height": 768,
278 | "type": "TMWorld"
279 | },
280 | "unitName": "Group_5774",
281 | "indexChild": 3,
282 | "internalWPid": "5379feb120d5ac429400004c"
283 | },
284 | "flat_child_ids": ["5379feb120d5ac429400004e", "5379feb120d5ac429400004f"],
285 | "type": "container",
286 | "parent_ids": ["534f8f8e94da26c97d00002c"],
287 | "sid": "534f8f8e94da26c97d00003a",
288 | "updated_at": "2014-05-19T12:53:31Z",
289 | "_type": "Unit"
290 | }, {
291 | "created_at": "2014-05-19T12:53:31Z",
292 | "_id": "5379feb120d5ac429400004e",
293 | "stype": "ellipse",
294 | "flat_parent_ids": ["5379feb120d5ac429400004a"],
295 | "data": {
296 | "world": {
297 | "args": [46, 0, 135, 134, "20.20475968803733"],
298 | "width": 321,
299 | "height": 188,
300 | "type": "TMWorld"
301 | },
302 | "unitName": "Ellipse_3723",
303 | "bindings": {
304 | "ressources": null,
305 | "properties": {}
306 | },
307 | "background": {
308 | "solid": {
309 | "alpha": 1,
310 | "color": 10412991
311 | }
312 | },
313 | "indexChild": 1
314 | },
315 | "type": "primitive",
316 | "parent_ids": ["534f8f8e94da26c97d00002c"],
317 | "sid": "534f8f8e94da26c97d00003a",
318 | "updated_at": "2014-05-19T12:53:31Z",
319 | "_type": "Unit"
320 | }, {
321 | "created_at": "2014-05-19T12:53:31Z",
322 | "_id": "5379feb120d5ac429400004f",
323 | "stype": "path",
324 | "flat_parent_ids": ["5379feb120d5ac429400004a"],
325 | "data": {
326 | "shapeBounds": [0, 0, 204, 102],
327 | "background": {
328 | "solid": {
329 | "alpha": 1,
330 | "color": 10412991
331 | }
332 | },
333 | "shape": "triangle",
334 | "world": {
335 | "args": [117, 86, 204, 102, 0],
336 | "width": 321,
337 | "height": 188,
338 | "type": "TMWorld"
339 | },
340 | "unitName": "Shape_3937",
341 | "bindings": {
342 | "ressources": null,
343 | "properties": {}
344 | },
345 | "indexChild": 2,
346 | "path": "M0 212.132 212.132 0 424.264 212.132 0 212.132Z"
347 | },
348 | "type": "primitive",
349 | "parent_ids": ["534f8f8e94da26c97d00002c"],
350 | "sid": "534f8f8e94da26c97d00003a",
351 | "updated_at": "2014-05-19T12:53:31Z",
352 | "_type": "Unit"
353 | }, {
354 | "created_at": "2014-04-17T08:23:42Z",
355 | "_id": "534f8f8e94da26c97d00002f",
356 | "stype": "statestack",
357 | "type": "graphicunit",
358 | "sid": "534f8f8e94da26c97d00003a",
359 | "parent_ids": ["534f8f8e94da26c97d00002e"],
360 | "resource_ids": ["5347a59d5d04707f8f000023"],
361 | "data": {
362 | "stacks": {
363 | "5347a59d5d04707f8f000024": {
364 | "masterBack": null,
365 | "actions": {},
366 | "thumb": "5347a59d5d04707f8f000023",
367 | "ref": null,
368 | "sactions": null,
369 | "masterFront": null,
370 | "units": null,
371 | "color": {
372 | "solid": {
373 | "alpha": 0,
374 | "color": 16777215
375 | }
376 | },
377 | "name": "Untitled"
378 | },
379 | "o": ["5347a59d5d04707f8f000024"]
380 | },
381 | "internalWPid": "5347a59d5d04707f8f000029",
382 | "world": {
383 | "args": [0, 0, 1024, 768, 0],
384 | "width": 1024,
385 | "height": 768,
386 | "type": "TMWorld"
387 | },
388 | "unitName": "RootStateStack1447",
389 | "bindings": {
390 | "ressources": null,
391 | "properties": {}
392 | },
393 | "indexChild": 1
394 | },
395 | "updated_at": "2014-05-19T12:53:31Z",
396 | "_type": "Unit"
397 | }, {
398 | "created_at": "2014-04-17T08:23:42Z",
399 | "_id": "534f8f8e94da26c97d000031",
400 | "stype": "statestack",
401 | "type": "graphicunit",
402 | "sid": "534f8f8e94da26c97d00003a",
403 | "parent_ids": ["534f8f8e94da26c97d000030"],
404 | "resource_ids": ["5347a59d5d04707f8f000027"],
405 | "data": {
406 | "stacks": {
407 | "5347a59d5d04707f8f000028": {
408 | "masterBack": null,
409 | "actions": {},
410 | "thumb": "5347a59d5d04707f8f000027",
411 | "ref": null,
412 | "sactions": null,
413 | "masterFront": null,
414 | "units": null,
415 | "color": {
416 | "solid": {
417 | "alpha": 1,
418 | "color": 16777215
419 | }
420 | },
421 | "name": "Untitled"
422 | },
423 | "o": ["5347a59d5d04707f8f000028"]
424 | },
425 | "internalWPid": "5347a59d5d04707f8f00002b",
426 | "world": {
427 | "args": [0, 0, 1024, 768, 0],
428 | "width": 1024,
429 | "height": 768,
430 | "type": "TMWorld"
431 | },
432 | "unitName": "RootStateStack1448",
433 | "bindings": {
434 | "ressources": null,
435 | "properties": {}
436 | },
437 | "indexChild": 1
438 | },
439 | "updated_at": "2014-05-19T12:53:31Z",
440 | "_type": "Unit"
441 | }],
442 | "resources": [{
443 | "created_at": "2014-04-17T08:23:42Z",
444 | "font_ids": ["527a597cab134d6a3e000005"],
445 | "unit_ids": ["534f8f8e94da26c97d000038"],
446 | "content": "Lorem ipsum dolor sit amet, consectetur adipisicing elit.\n",
447 | "_id": "534f8f8e94da26c97d00002b",
448 | "sid": "534f8f8e94da26c97d00003a",
449 | "type": "ressourceTextLayout",
450 | "updated_at": "2014-04-17T08:23:42Z",
451 | "_type": "Resource"
452 | }, {
453 | "created_at": "2014-04-11T08:22:42Z",
454 | "_id": "5347a61c5d04707f8f000059",
455 | "type": "ressourceImage",
456 | "unit_ids": ["5347a61c5d04707f8f00005a", "5347a81b94da26f432000011", "534bfa8894da26f432000025", "534bfb1194da2679f0000011", "534c0a5794da263072000011", "534c0fcb94da263072000025", "534d520294da262aae000011", "534d526b94da262aae000025", "534e34e994da267660000011", "534e368d94da267660000025", "534e36a294da26c974000011", "534e434a94da266b7b000011", "534e4abe94da26c1cf000011", "534e4d3694da267beb000011", "534e572d94da26c857000011", "534e59a394da26c857000025", "534e5ade94da26c97d000025", "534f8f8e94da26c97d000039", "535116c394da26c97d00004d", "53511e2694da26c97d000061"],
457 | "data": {
458 | "creationDate": 1394547794000,
459 | "width": 1024,
460 | "ressourceSize": 119186,
461 | "modificationDate": 1394547794000,
462 | "height": 1024
463 | },
464 | "sid": "53464ef394da26a11400000d",
465 | "name": "seal_bear.jpg",
466 | "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a61c5d04707f8f000059",
467 | "updated_at": "2014-04-18T13:38:34Z",
468 | "_type": "Resource"
469 | }, {
470 | "created_at": "2014-04-11T08:22:42Z",
471 | "_id": "5347a59d5d04707f8f000023",
472 | "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a59d5d04707f8f000023",
473 | "unit_ids": ["5347a59d5d04707f8f000022", "5347a81b94da26f432000007", "534bfa8894da26f43200001b", "534bfb1194da2679f0000007", "534c0a5794da263072000007", "534c0fcb94da26307200001b", "534d520294da262aae000007", "534d526b94da262aae00001b", "534e34e994da267660000007", "534e368d94da26766000001b", "534e36a294da26c974000007", "534e434a94da266b7b000007", "534e4abe94da26c1cf000007", "534e4d3694da267beb000007", "534e572d94da26c857000007", "534e59a394da26c85700001b", "534e5ade94da26c97d00001b", "534f8f8e94da26c97d00002f", "535116c394da26c97d000043", "53511e2694da26c97d000057"],
474 | "type": "ressourceImage",
475 | "sid": "53464ef394da26a11400000d",
476 | "data": {
477 | "height": 768,
478 | "width": 1024
479 | },
480 | "updated_at": "2014-04-18T13:38:34Z",
481 | "_type": "Resource"
482 | }, {
483 | "created_at": "2014-04-11T08:22:42Z",
484 | "_id": "5347a59d5d04707f8f000027",
485 | "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a59d5d04707f8f000027",
486 | "unit_ids": ["5347a59d5d04707f8f000026", "5347a81b94da26f432000009", "534bfa8894da26f43200001d", "534bfb1194da2679f0000009", "534c0a5794da263072000009", "534c0fcb94da26307200001d", "534d520294da262aae000009", "534d526b94da262aae00001d", "534e34e994da267660000009", "534e368d94da26766000001d", "534e36a294da26c974000009", "534e434a94da266b7b000009", "534e4abe94da26c1cf000009", "534e4d3694da267beb000009", "534e572d94da26c857000009", "534e59a394da26c85700001d", "534e5ade94da26c97d00001d", "534f8f8e94da26c97d000031", "535116c394da26c97d000045", "53511e2694da26c97d000059"],
487 | "type": "ressourceImage",
488 | "sid": "53464ef394da26a11400000d",
489 | "data": {
490 | "height": 768,
491 | "width": 1024
492 | },
493 | "updated_at": "2014-04-18T13:38:34Z",
494 | "_type": "Resource"
495 | }]
496 | }]
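In this fixture, units point at resources through `resource_ids`, and some of those ids (for example the thumbnail `5379fec120d5ac4294000051`) have no matching entry in `resources`. A minimal sketch of resolving those links follows. The `resources_for_units` helper and the trimmed `SAMPLE` are hypothetical, and the code takes a single top-level object, whereas the fixture wraps it in a one-element list.

```python
# Hypothetical, trimmed-down object using ids taken from the fixture.
SAMPLE = {
    "units": [
        {"_id": "534f8f8e94da26c97d000038", "stype": "Text",
         "resource_ids": ["534f8f8e94da26c97d00002b"]},
        {"_id": "534f8f8e94da26c97d00002d", "stype": "statestack",
         "resource_ids": ["5379fec120d5ac4294000051"]},
        {"_id": "5379feb120d5ac429400004a", "stype": "group"},
    ],
    "resources": [
        {"_id": "534f8f8e94da26c97d00002b", "type": "ressourceTextLayout",
         "content": "Lorem ipsum dolor sit amet"},
    ],
}


def resources_for_units(doc):
    """Map each unit id to the types of the resources it references."""
    by_id = {resource["_id"]: resource for resource in doc["resources"]}
    links = {}
    for unit in doc["units"]:
        # Some units carry no resource_ids at all; unknown ids are skipped.
        ids = unit.get("resource_ids", [])
        links[unit["_id"]] = [by_id[i]["type"] for i in ids if i in by_id]
    return links


LINKS = resources_for_units(SAMPLE)
```

Indexing the resources by `_id` once keeps the lookup per unit constant-time and makes dangling references easy to spot as empty lists.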
--------------------------------------------------------------------------------
/psd_rb/ruby/factories/group_unit.rb:
--------------------------------------------------------------------------------
1 | class GroupUnit
2 |
3 | def initialize(node)
4 | @node = node
5 | end
6 |
7 | def type
8 | 'GroupUnit'
9 | end
10 |
11 | def to_json
12 | as_json
13 | end
14 |
15 | def as_json
16 | {
17 | created_at: '2014-05-19T12:53:31Z',
18 | _id: '5379feb120d5ac429400004a',
19 | stype: 'group',
20 | flat_parent_ids: ['534f8f8e94da26c97d00002d'],
21 | data: {
22 | stacks: {
23 | '5379feb120d5ac429400004b'=> {
24 | ref: nil,
25 | units: [0, 1],
26 | actions: {},
27 | thumb: nil,
28 | name: 'Untitled'
29 | },
30 | o: ['5379feb120d5ac429400004b']
31 | },
32 | world: {
33 | args: [243, 511, 321, 188, 0],
34 | width: 1024,
35 | height: 768,
36 | type: 'TMWorld'
37 | },
38 | unitName: 'Group_5774',
39 | indexChild: 3,
40 | internalWPid: '5379feb120d5ac429400004c'
41 | },
42 | flat_child_ids: %w(5379feb120d5ac429400004e 5379feb120d5ac429400004f),
43 | type: 'container',
44 | parent_ids: ['534f8f8e94da26c97d00002c'],
45 | sid: '534f8f8e94da26c97d00003a',
46 | updated_at: '2014-05-19T12:53:31Z',
47 | _type: 'Unit'
48 | }
49 | end
50 | end
--------------------------------------------------------------------------------
/psd_rb/ruby/factories/image_unit.rb:
--------------------------------------------------------------------------------
1 | class ImageUnit
2 |
3 | def initialize(node)
4 | @node = node
5 | end
6 |
7 | def type
8 | 'ImageUnit'+( (@node.phas_crop)?(' with crop'):('') )
9 | end
10 |
11 | def to_json
12 | as_json
13 | end
14 |
15 | def as_json
16 | {
17 | stype: 'image',
18 | type: 'media',
19 | # flat_parent_ids: ["534f8f8e94da26c97d00002d"],
20 | # parent_ids: ["534f8f8e94da26c97d00002c"],
21 | # resource_ids: ["5347a61c5d04707f8f000059"],
22 | data: {
23 | isAutoName: false,
24 | world: {
25 | args: [536, 31, 329, 329, 0],
26 | width: 1024,
27 | height: 768,
28 | type: 'TMWorld'
29 | },
30 | unitName: @node.ppng_name,
31 | bindings: {
32 | ressources: nil,
33 | properties: {}
34 | },
35 | # indexChild: 2
36 | },
37 | _type: 'Unit'
38 | }
39 | end
40 |
41 | end
42 |
--------------------------------------------------------------------------------
/psd_rb/ruby/factories/text_unit.rb:
--------------------------------------------------------------------------------
1 | class TextUnit
2 |
3 | def initialize(node)
4 | @node = node
5 | end
6 |
7 | def type
8 | 'TextUnit'
9 | end
10 |
11 | def to_json
12 | as_json
13 | end
14 |
15 | def as_json
16 | {
17 | stype: 'Text',
18 | type: 'media',
19 | # flat_parent_ids: ["534f8f8e94da26c97d00002d"],
20 | # parent_ids: ["534f8f8e94da26c97d00002c"],
21 | # resource_ids: ["534f8f8e94da26c97d00002b"],
22 | data: {
23 | opacity: 0.5,
24 | world: {
25 | args: [160, 31, 200, 183, 0],
26 | width: 1024,
27 | height: 768,
28 | type: 'TMWorld'
29 | },
30 | unitName: @node.pget_text,
31 | bindings: {
32 | ressources: nil,
33 | properties: {}
34 | },
35 | # indexChild: 1
36 | },
37 | _type: 'Unit'
38 | }
39 | end
40 | end
41 |
--------------------------------------------------------------------------------
/psd_rb/ruby/factories/unit_factory.rb:
--------------------------------------------------------------------------------
1 | require_relative 'text_unit'
2 | require_relative 'image_unit'
3 | require_relative 'group_unit'
4 | require_relative 'unknown_unit'
5 |
6 | class UnitFactory
7 |
8 | def self.create_unit(node)
9 | if node.phas_text
10 | return TextUnit.new(node)
11 | elsif node.pis_group
12 | return GroupUnit.new(node)
13 | else
14 | return ImageUnit.new(node)
15 | end
16 | end
17 |
18 | end
19 |
--------------------------------------------------------------------------------
/psd_rb/ruby/factories/unknown_unit.rb:
--------------------------------------------------------------------------------
1 | class UnknownUnit
2 |
3 | def initialize(node)
4 | @node = node
5 | end
6 |
7 | def type
8 | 'UnknownUnit'
9 | end
10 |
11 | def to_json
12 | as_json
13 | end
14 |
15 | def as_json
16 | {
17 | Unit: 'Unknown'
18 | }
19 | end
20 | end
--------------------------------------------------------------------------------
/psd_rb/ruby/panda_psd.rb:
--------------------------------------------------------------------------------
1 | require 'psd'
2 | require_relative 'util'
3 | require_relative 'unit_manager'
4 |
5 | class PandaPsd
6 |
7 | def initialize(file:nil)
8 | @errors = []
9 | @info = []
10 | @orientation = nil
11 | @unitManager = UnitManager.new
12 | @psd = PSD.new(file)
13 | @psd.parse!
14 | end
15 |
16 | attr_reader :info, :errors
17 |
18 | public
19 |
20 | def check_integrity
21 | check_dimensions
22 | self
23 | end
24 |
25 | def export
26 | @unitManager.export
27 | end
28 |
29 | def parse
30 | check_integrity
31 | go_through
32 |
33 |
34 | # #get an image
35 | # so = @psd.pget_layer_by_name('Smart Object (path d\'illustrator)')
36 | # so.pas_png
37 | #
38 | # #get a shape
39 | # so = @psd.pget_layer_by_name('Path rond')
40 | # so.pas_png
41 | #
42 | # #get a crop position
43 | # so = @psd.pget_layer_by_name('image croppée')
44 | # if so.phas_mask
45 | # so.pas_png # get the whole image without cropping
46 | # puts so.pget_mask_position
47 | # end
48 | #
49 | # #get a text
50 | # so = @psd.pget_layer_by_name('Texte')
51 | # if so.phas_text
52 | # pp so.pget_text_html
53 | # pp so.pget_positions
54 | # pp so.pget_dimensions
55 | # pp so.hidden?
56 | # pp so.pvisible
57 | # end
58 | # puts UnitFactory::create_unit(so).as_json
59 |
60 | self
61 | end
62 |
63 | private
64 |
65 | def go_through
66 | return nil unless @errors.empty?
67 | @psd.tree.descendants_layers.each do |layer|
68 | @unitManager.create_unit(layer) if layer.pvisible_tree?
69 | end
70 | end
71 |
72 | def check_dimensions
73 | return nil unless @errors.empty?
74 | if @psd.pget_dimensions == {width: 2048, height: 1536}
75 | @orientation = 'landscape'
76 | elsif @psd.pget_dimensions == {width: 1536, height: 2048}
77 | @orientation = 'portrait'
78 | else
79 | @errors << 'Dimensions are wrong. Only 2048x1536 or 1536x2048.'
80 | end
81 | end
82 |
83 | end
84 |
85 |
86 |
87 |
--------------------------------------------------------------------------------
/psd_rb/ruby/unit_manager.rb:
--------------------------------------------------------------------------------
1 | require_relative 'factories/unit_factory'
2 |
3 | class UnitManager
4 |
5 |   def initialize
6 |     @units = []
7 |   end
8 |   attr_reader :units
9 |
10 |   def create_unit(layer)
11 |     unit = UnitFactory::create_unit(layer)
12 |     @units << unit
13 |     unit
14 |   end
15 |
16 |   def export
17 |     @units.map(&:as_json)
18 |   end
19 |
20 | end
21 |
--------------------------------------------------------------------------------
/psd_rb/ruby/util.rb:
--------------------------------------------------------------------------------
1 |
2 | PSD.send(:define_method, 'pget_dimensions') do
3 |   {width: tree.document_width, height: tree.document_height}
4 | end
5 |
6 | PSD.send(:define_method, 'pget_layer_by_name') do |name|
7 |   tree.children_layers.find { |layer| layer.name == name }
8 | end
9 |
10 |
11 | PSD::Node::Base.send(:define_method, 'pget_layer_by_name') do |name|
12 |   children_layers.find { |layer| layer.name == name }
13 | end
14 |
15 | PSD::Node::Base.send(:define_method, 'pget_dimensions') do
16 |   {width: width, height: height}
17 | end
18 |
19 | PSD::Node::Base.send(:define_method, 'ppng_name') do
20 |   name.gsub(/\s/, '_') + '.png'
21 | end
22 |
23 | PSD::Node::Base.send(:define_method, 'pas_png') do
24 |   image.save_as_png '../output/' + ppng_name
25 | end
26 |
27 | PSD::Node::Base.send(:define_method, 'pis_group') do
28 |   group?
29 | end
30 |
31 | PSD::Node::Base.send(:define_method, 'pvisible_tree?') do
32 |   # A node is visible in the tree only if it and every one of its
33 |   # ancestors (excluding the root) are visible.
34 |   node = self
35 |   visible_in_tree = true
36 |   until node.root?
37 |     visible_in_tree = node.visible?
38 |     break unless visible_in_tree
39 |     node = node.parent
40 |   end
41 |   visible_in_tree
42 | end
43 |
44 | PSD::Node::Base.send(:define_method, 'pget_mask_position') do
45 |   {
46 |     top: mask.top - top,
47 |     left: mask.left - left,
48 |     right: right - mask.right,
49 |     bottom: bottom - mask.bottom
50 |   }
51 | end
52 |
53 | PSD::Node::Base.send(:define_method, 'phas_mask') do
54 |   mask.size > 0
55 | end
56 |
57 | PSD::Node::Base.send(:define_method, 'phas_crop') do
58 |   phas_mask && (pget_mask_position != {top: 0, left: 0, right: 0, bottom: 0})
59 | end
60 |
61 | PSD::Node::Base.send(:define_method, 'phas_text') do
62 |   !text.nil?
63 | end
64 |
65 | PSD::Node::Base.send(:define_method, 'pget_text') do
66 |   text[:value] if phas_text
67 | end
68 |
69 | PSD::Node::Base.send(:define_method, 'pget_text_html') do
70 | ''+
71 | ''+
72 | ""+
73 | pget_text+
74 | ''+
75 | '
'+
76 | '' if phas_text
77 | end
78 |
79 | PSD::Node::Base.send(:define_method, 'pget_positions') do
80 |   {
81 |     top: top,
82 |     left: left,
83 |     right: right,
84 |     bottom: bottom
85 |   }
86 | end
87 |
88 |
89 |
90 |
91 |
92 |
93 |
94 |
95 |
96 |
97 |
98 | # module Panda
99 |
100 | # module TreeUtil
101 | #
102 | # def get_layer_by_name(node, name)
103 | # return nil unless @errors.empty?
104 | # node = node.tree if node.instance_of?(PSD)
105 | # # noinspection RubyResolve
106 | # node.children_layers.map { |layer| return layer if layer.name == name }.compact.first
107 | # end
108 |
109 | # def get_dimensions(node)
110 | # return nil unless @errors.empty?
111 | # # noinspection RubyResolve
112 | # return {width: node.tree.document_width, height: node.tree.document_height} if node.instance_of?(PSD)
113 | # {width:node.width, height:node.height}
114 | # end
115 | #
116 | # end
117 |
118 | # module NodeUtil
119 |
120 | # def is_visible(node)
121 | # return nil unless @errors.empty?
122 | # node.visible && !node.hidden?
123 | # end
124 |
125 | # def as_png(node, name:nil)
126 | # return nil unless @errors.empty?
127 | # name = node.name.gsub(/\s/, '_')+'.png' if name.nil? || name.empty?
128 | # node.image.save_as_png './output/'+name
129 | # end
130 |
131 | # def has_mask(node)
132 | # return nil unless @errors.empty?
133 | # (node.mask.size > 0) && (get_mask_position(node) != {top:0, left:0, right:0, bottom:0})
134 | # end
135 |
136 | # def get_mask_position(node)
137 | # return nil unless @errors.empty?
138 | # node.mask.instance_exec {{
139 | # top: top-node.top,
140 | # left: left-node.left,
141 | # right: node.right-right,
142 | # bottom: node.bottom-bottom
143 | # }}
144 | # end
145 |
146 | # def has_text(node)
147 | # return nil unless @errors.empty?
148 | # node.text != nil
149 | # end
150 |
151 | # def get_text(node)
152 | # return nil unless @errors.empty?
153 | # node.text[:value]
154 | # end
155 |
156 | # def get_text_html(node)
157 | # return nil unless @errors.empty?
158 | # ""+
159 | # ""+
160 | # ""+
161 | # get_text(node)+
162 | # ""+
163 | # "
"+
164 | # ""
165 | # end
166 |
167 | # def get_positions(node)
168 | # return nil unless @errors.empty?
169 | # node.instance_exec {{
170 | # top: top,
171 | # left: left,
172 | # right: right,
173 | # bottom: bottom
174 | # }}
175 | # end
176 |
177 | # end
178 |
179 | # class PsdUtil
180 | # # include TreeUtil
181 | # # include NodeUtil
182 | #
183 | # def initialize(file:nil)
184 | # @foreground_page_name = 'foreground'
185 | # @background_page_name = 'background'
186 | # end
187 | #
188 | #
189 | # end
190 | # end
--------------------------------------------------------------------------------
/psd_rb/test.rb:
--------------------------------------------------------------------------------
1 | require_relative 'ruby/panda_psd'
2 | require 'benchmark'
3 | require 'pp'
4 |
5 |
6 | Benchmark.bm(10) do |x|
7 |   x.report('PandaPsd:') {
8 |     psd = PandaPsd.new(file:'../psds/rmn_v3.psd').parse
9 |     psd = PandaPsd.new(file:'psds/micka.psd').parse
10 |     pp 'infos -> '+psd.info.join(', ')
11 |     pp 'errors -> '+psd.errors.join(', ')
12 |   }
13 | end
14 |
--------------------------------------------------------------------------------
/psdtools/A2 16 column.ai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/A2 16 column.ai
--------------------------------------------------------------------------------
/psdtools/Logo.psd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/Logo.psd
--------------------------------------------------------------------------------
/psdtools/__init__.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | __version__ = '20130515'
3 |
4 | if __name__ == '__main__': print(__version__)
5 |
--------------------------------------------------------------------------------
/psdtools/brochure.indd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/brochure.indd
--------------------------------------------------------------------------------
/psdtools/brochure_rtangfold_11x17_OUTh.indd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/brochure_rtangfold_11x17_OUTh.indd
--------------------------------------------------------------------------------
/psdtools/buttons.psd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/buttons.psd
--------------------------------------------------------------------------------
/psdtools/main.py:
--------------------------------------------------------------------------------
1 | from psd_tools import PSDImage
2 | from psd_tools.constants import BlendMode
3 | import psd_tools.user_api.psd_image
4 | import json
5 | import uuid
6 |
7 | """
8 | {"pages":[{"images":[], "paragraphs":[{"size":0, "width":0, "string":"", "x":0, "y":0, "font":"", "height":0}]}]}
9 | or
10 | {
11 | "pages": [
12 | {
13 | "images": [
14 | "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"
15 | ],
16 | "paragraphs": [
17 | {
18 | "size": 98,
19 | "width": 587,
20 | "string": "Book Title",
21 | "y": -98,
22 | "x": -324,
23 | "font": "Georgia",
24 | "height": 705
25 | }
26 | ]
27 | }
28 | ]
29 | }
30 | """
31 | class ImportPSD(object):
32 |     """Parse a PSD file and store its text and image data."""
33 |
34 |     def __init__(self, PSDFilePath, imagePath):
35 |         # Keep all state on the instance. With class-level attributes and
36 |         # @classmethod methods, every import would share (and keep
37 |         # appending to) the same 'pages' list.
38 |         self.data = {'pages': []}
39 |         self.PSDFilePath = PSDFilePath
40 |         self.imagePath = imagePath
41 |         self.psd = PSDImage.load(PSDFilePath)
42 |         self.nb_layers = self.countLayers(layers=self.psd.layers)
43 |         print("Layers to process: %d" % self.nb_layers)
44 |
45 |     def countLayers(self, layers, layer_num=0):
46 |         """Count the layers, descending only into globally visible groups."""
47 |         for sheet in layers:
48 |             if isinstance(sheet, psd_tools.user_api.psd_image.Layer):
49 |                 layer_num += 1
50 |             elif isinstance(sheet, psd_tools.user_api.psd_image.Group):
51 |                 if sheet.visible_global:
52 |                     layer_num = self.countLayers(layers=sheet.layers, layer_num=layer_num)
53 |         return layer_num
54 |
55 |     def parse(self):
56 |         """Walk the layer tree and store the top-level page in self.data."""
57 |         self.data['pages'].append(self.browseSheets(sheets=self.psd.layers, parentName="root"))
58 |
59 |     def browseSheets(self, sheets, parentName, page_num=1):
60 |         """Collect the paragraphs and images of one level of the layer tree."""
61 |         array = {}
62 |         for sheet in sheets:
63 |             if isinstance(sheet, psd_tools.user_api.psd_image.Layer):
64 |                 # This sheet is a layer: it may be an image or some text.
65 |                 print("Processing layer: %d" % self.nb_layers)
66 |                 self.nb_layers -= 1
67 |                 if sheet.text_data is not None:
68 |                     # Text layer: record its bounding box and content.
69 |                     # The font is not extracted yet.
70 |                     dic = {
71 |                         'name': sheet.name,
72 |                         'x': sheet.bbox.x1,
73 |                         'y': sheet.bbox.y2,
74 |                         'width': sheet.bbox.width,
75 |                         'height': sheet.bbox.height,
76 |                         'string': sheet.text_data.text,
77 |                         'font': None
78 |                     }
79 |                     if 'paragraphs' not in array:
80 |                         array['paragraphs'] = []
81 |                     array['paragraphs'].append(dic)
82 |                 else:
83 |                     # Image layer: render it once and save it as a PNG.
84 |                     # Layers that cannot be rendered are recorded as None.
85 |                     image = sheet.as_PIL()
86 |                     if image is not None:
87 |                         imageName = str(uuid.uuid1())+'_'+sheet.name+'_'+parentName+'.png'
88 |                         image.save(self.imagePath+imageName)
89 |                     else:
90 |                         imageName = None
91 |                     if 'images' not in array:
92 |                         array['images'] = []
93 |                     array['images'].append(imageName)
94 |             elif isinstance(sheet, psd_tools.user_api.psd_image.Group):
95 |                 # This sheet is a group and may contain many layers.
96 |                 # Each visible group is stored as its own page.
97 |                 if sheet.visible_global:
98 |                     arr = self.browseSheets(sheets=sheet.layers, parentName=sheet.name, page_num=page_num+1)
99 |                     self.data['pages'].append(arr)
100 |         return array
101 |
102 |     def toJson(self):
103 |         """Convert the parsed data into a JSON string."""
104 |
105 |
106 |
107 | def run(pdf_file, image_folder):  # pdf_file is the path to the PSD file
108 |     importedPSD = ImportPSD(PSDFilePath=pdf_file, imagePath=image_folder)
109 |     importedPSD.parse()
110 |     jsonString = importedPSD.toJson()
111 |     return jsonString
112 |
113 |
114 |
115 |
116 | if __name__ == '__main__':
117 |     import sys
118 |     if len(sys.argv) == 3:
119 |         print(run(sys.argv[1], sys.argv[2]))
120 |     else:
121 |         print("usage: %s psd_file_path generated_images_path/ (eg: python %s Logo.psd './images/')" % (sys.argv[0], sys.argv[0]))
122 |
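The docstring at the top of this file documents the JSON shape that `run` returns. As a minimal sketch of how a caller might consume that output, the snippet below parses a sample string copied from that docstring and lists each page's image names and paragraph strings. The `summarize` helper and the `SAMPLE` constant are hypothetical and not part of the repository. It assumes, as `browseSheets` suggests, that a page may lack the `images` or `paragraphs` key and that an image entry may be `None`.

```python
import json

# Sample output copied from the docstring at the top of psdtools/main.py.
SAMPLE = '''{"pages": [{"images": ["961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png"],
 "paragraphs": [{"size": 98, "width": 587, "string": "Book Title",
                 "y": -98, "x": -324, "font": "Georgia", "height": 705}]}]}'''


def summarize(json_string):
    """Return a (page_index, image_names, paragraph_strings) tuple per page.

    Hypothetical helper for illustration only. Missing 'images' or
    'paragraphs' keys are treated as empty lists.
    """
    summary = []
    for index, page in enumerate(json.loads(json_string)['pages']):
        # Skip image entries that are None (layers that could not be rendered).
        images = [name for name in page.get('images', []) if name is not None]
        strings = [p['string'] for p in page.get('paragraphs', [])]
        summary.append((index, images, strings))
    return summary


if __name__ == '__main__':
    for index, images, strings in summarize(SAMPLE):
        print("page %d: %d image(s), text: %s" % (index, len(images), strings))
```

In practice the string passed to `summarize` would be the return value of `run(...)` rather than a hard-coded sample.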
--------------------------------------------------------------------------------
/psdtools/test-text.psd:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/test-text.psd
--------------------------------------------------------------------------------