├── .gitignore ├── README.md ├── __init__.py ├── install ├── install.sh └── lcms-1.19.tar.gz ├── parser.py ├── pdfreader ├── __init__.py ├── book.pdf ├── brochureint.pdf ├── brochureprint.pdf ├── buttons.pdf ├── ff.pdf ├── ff.png ├── ffa.pdf ├── hello world.pdf ├── lib │ ├── __init__.py │ ├── book.py │ ├── char.py │ ├── line.py │ ├── page.py │ └── paragraph.py ├── main.py ├── pff.pdf ├── pffl.pdf ├── png.pdf └── util │ ├── __init__.py │ └── convert.py ├── psd_rb ├── Gemfile ├── json_type.json ├── ruby │ ├── factories │ │ ├── group_unit.rb │ │ ├── image_unit.rb │ │ ├── text_unit.rb │ │ ├── unit_factory.rb │ │ └── unknown_unit.rb │ ├── panda_psd.rb │ ├── unit_manager.rb │ └── util.rb └── test.rb └── psdtools ├── A2 16 column.ai ├── Logo.psd ├── __init__.py ├── brochure.indd ├── brochure_rtangfold_11x17_OUTh.indd ├── buttons.psd ├── main.py └── test-text.psd /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | pdfreader/.DS_Store 3 | 4 | *.sublime-project 5 | 6 | *.sublime-workspace 7 | 8 | psdtools/.DS_Store 9 | 10 | psdtools/images/.DS_Store 11 | 12 | *.pyc 13 | 14 | /.DS_Store 15 | /install/.DS_Store 16 | /metadata.json 17 | /install/install02500809/ 18 | /images/ 19 | /psd_rb/.idea/ 20 | /psd_rb/Gemfile.lock 21 | /psd_rb/.DS_Store 22 | /psd_rb/psds/ 23 | /psd_rb/output/ 24 | /psd_rb/output.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | content-extractor 2 | ================= 3 | 4 | Content-extractor is a Python-based project. The only reason for this is the availability of the libraries.
(The best ones are in Python, IMO.) 5 | Currently, this project parses pdf and psd files to extract meaningful content, such as text and images, both linked under a common json string. 6 | 7 | 8 | Dependencies 9 | ================= 10 | 11 | Content-extractor is built upon the following: 12 | 13 | - [psd-tools](https://pypi.python.org/pypi/psd-tools/) To extract images and text from psd files 14 | - [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/#intro) To extract text from pdf files as xml 15 | - [beautiful soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) To convert the extracted xml to json 16 | - [pdfimages](http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/) (from poppler-utils) To extract images from pdf files as image.ppm 17 | - [ImageMagick](http://www.imagemagick.org/script/index.php) To convert the ppm images to png 18 | 19 | 20 | Installation 21 | ================= 22 | 23 | Since there are a lot of dependencies, and most of them have their own dependencies, I have made a shell script to simplify the installation process. 24 | 25 | The script assumes the following: 26 | 27 | - If you don't have [apt-get](http://doc.ubuntu-fr.org/apt-get), then you should have [brew](http://mxcl.github.io/homebrew/). 28 | - You have [curl](http://pwet.fr/man/linux/commandes/curl) installed (the shell executable, not the php module). 29 | - You have [python 2.7.x](http://www.python.org/download/) installed on your system. 30 | - You have [unzip](http://www.info-zip.org/mans/unzip.html) installed on your system. 31 | 32 | When you launch the script, it installs [pip](https://pypi.python.org/pypi/pip) if it isn't already present on your system. 33 | 34 | ```Shell 35 | $ cd install 36 | $ sh install.sh 37 | ``` 38 | 39 | How to use it 40 | ================= 41 | 42 | For any supported extension (currently pdf/psd) you can use `parser.py [file_path] [image_path]`; it will automatically do the job.
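As a rough illustration of how `parser.py` picks a backend: it reads the first four bytes of the file and compares them with the `8BPS` (psd) and `%PDF` (pdf) signatures. The sketch below shows that check on its own, written in Python 3 syntax rather than the project's Python 2; `MAGIC_TO_KIND` and `detect_kind` are hypothetical names used only for this example.

```Python
import os
import tempfile

# Four-byte signatures checked by parser.py: "8BPS" for psd, "%PDF" for pdf.
# MAGIC_TO_KIND and detect_kind are illustrative names, not project code.
MAGIC_TO_KIND = {b"8BPS": "psd", b"%PDF": "pdf"}


def detect_kind(file_path):
    """Return "psd", "pdf", or None based on the first four bytes of the file."""
    with open(file_path, "rb") as handle:
        magic = handle.read(4)
    return MAGIC_TO_KIND.get(magic)


if __name__ == "__main__":
    # Write a throwaway file that starts with the pdf signature and classify it.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
        tmp.write(b"%PDF-1.4\n")
    print(detect_kind(tmp.name))
    os.remove(tmp.name)
```

An unknown signature maps to `None`, which a caller can treat as "not a psd nor a pdf" instead of relying on the file extension.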
43 | 44 | ```Shell 45 | # Will write a metadata.json and extract the images into the folder images 46 | $ ./parser.py psdtools/work.psd './images/' 47 | $ ./parser.py pdfreader/book.pdf './images/' 48 | ``` 49 | 50 | You can also import parser.py into your own python project and use it the following way: 51 | 52 | ```Python 53 | # Will return a string containing the json and extract the images into the folder images 54 | import parser 55 | json = parser.parse("psdtools/work.psd", "./images/") 56 | json = parser.parse("pdfreader/book.pdf", "./images/") 57 | ``` 58 | 59 | You can also use the pdfreader and psdtools scripts independently, like so: 60 | 61 | 62 | ```Shell 63 | # Shell: 64 | $ ./psdtools/main.py psdtools/work.psd './images/' 65 | $ ./pdfreader/main.py pdfreader/book.pdf './images/' 66 | ``` 67 | ```Python 68 | # Python: 69 | # PSD 70 | from psdtools import main 71 | json = main.run("psdtools/work.psd", "./images/") 72 | # PDF 73 | from pdfreader import main 74 | json = main.run("pdfreader/book.pdf", "./images/") 75 | ``` 76 | 77 | 78 | ./pdfreader/main.py is just a simplified interface to the very powerful pdfreader/util/convert.py. I have rewritten convert.py as a class, but it is originally [pdf2txt.py](http://www.unixuser.org/~euske/python/pdfminer/#pdf2txt) from [pdfminer](http://www.unixuser.org/~euske/python/pdfminer/index.html). 79 | However, you can still use convert.py as if it were the original [pdf2txt.py](http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt) tool; [here is the documentation](http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt).
80 | 81 | ```Shell 82 | $ pdfreader/util/convert.py -o output.html samples/naacl06-shinyama.pdf 83 | (extract text as an HTML file whose filename is output.html) 84 | 85 | $ pdfreader/util/convert.py -V -c euc-jp -o output.html samples/jo.pdf 86 | (extract a Japanese HTML file in vertical writing, CMap is required) 87 | 88 | $ pdfreader/util/convert.py -P mypassword -o output.txt secret.pdf 89 | (extract text from an encrypted PDF file) 90 | ``` 91 | 92 | convert.py can also be imported in a python project (but fewer options are available, since I have not implemented them all) 93 | 94 | ```Python 95 | # @see pdfreader/main.py:text_to_dict as example 96 | from util.convert import converter 97 | convert = converter() 98 | xml = convert.as_xml().add_input_file(fileinput).run() 99 | ``` 100 | 101 | How does it work 102 | ================= 103 | 104 | The information is extracted with the aim of keeping the parent-child relations. 105 | 106 | - For a pdf file: 107 | - A pdf file can have many pages, and the json string follows the same structure. Each page has many images and many paragraphs. 108 | - Each paragraph has a width, a height, a content (string), a y and x position, a font, and a font size. 109 | - For a pdf file the original image name can't be extracted, so the images are named as follows: [uniqId]_p[pageNumber].jpg/png. uniqId is a unique ID, and pageNumber is the page that contains the image. You shouldn't need to read this information from the name, since the json string contains it. 110 | - Each image is extracted directly as a jpg when possible; otherwise it is extracted as ppm/pbm and then converted to png. 111 | 112 | - For a psd file: 113 | - A psd file can have many groups, which are translated into pages in the json string. Each group in a psd file can have many layers; a layer can be either text (paragraphs) or an image (images). 114 | - Each text layer has a width, a height, a content (string), and a y and x position.
The font and font size are currently not extracted (because psd-tools doesn't provide them). 115 | - For a psd file, each image is named after the layer it comes from, but since many layers can share the same name, the following pattern is applied to guarantee a unique name: [uniqId]_[layerName]_[groupName].png. 116 | 117 | - PSD files: 118 | - Text: Because of the psd-tools library, you can't know the font or the bold, italic, and underline attributes. 119 | 120 | - PDF files: 121 | - Text: Because of the pdfminer library, you can't have several fonts in the same paragraph. It is also not possible to extract underlines. However, the bold and italic attributes are extracted as html and integrated directly into the string. 122 | 123 | Below is a simplified example, taken from book.pdf, of what the json string looks like. 124 | 125 | JSON Format (from pdfreader/book.pdf 'simplified') 126 | ================= 127 | 128 | ```JSON 129 | { 130 | "pages": [ 131 | { 132 | "images": [ 133 | "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png" 134 | ], 135 | "paragraphs": [ 136 | { 137 | "size": 98, 138 | "width": 587, 139 | "string": "Book Title", 140 | "y": -98, 141 | "x": -324, 142 | "font": "Georgia", 143 | "height": 705 144 | } 145 | ] 146 | }, 147 | { 148 | "images": [ 149 | "96f4e9ee-c1eb-11e2-ad2b-040ccedc7e34_p1.png" 150 | ], 151 | "paragraphs": [ 152 | { 153 | "size": 24, 154 | "width": 138, 155 | "string": "CHAPTER 1", 156 | "y": -24, 157 | "x": -88, 158 | "font": "Georgia", 159 | "height": 711 160 | }, 161 | { 162 | "size": 33, 163 | "width": 489, 164 | "string": "Lorem ipsum dolor sit amet, consectetur \nadipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore \nmagna aliqua.
Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.", 165 | "y": -229, 166 | "x": -439, 167 | "font": "Georgia", 168 | "height": 269 169 | } 170 | ] 171 | }, 172 | { 173 | "paragraphs": [ 174 | { 175 | "size": 24, 176 | "width": 133, 177 | "string": "SECTION 1", 178 | "y": -24, 179 | "x": -83, 180 | "font": "Georgia", 181 | "height": 711 182 | } 183 | ] 184 | } 185 | ] 186 | } 187 | ``` 188 | 189 | How to improve it 190 | ================= 191 | 192 | - The psd-tools library should be improved so that it can extract the font and the bold, italic, and underline attributes 193 | - The pdfminer library should be improved so that it can extract the underline attribute 194 | - Follow PEP 8? 195 | - Check whether the given image folder exists; if not, create it; if that fails, raise an error and stop 196 | - Add support for more file formats 197 | - Speed is not an issue, but why not improve it? 198 | - The psd-tools library should be improved so that it can extract layers containing fx effects (is this even possible?) 199 | - Depending on the platform or a parameter, the `\n` in the string attribute of the JSON should become `<br>` or `\r\n` 200 | - Reduce the number of dependencies? 201 | - What else? 202 | 203 | Contributing 204 | ================= 205 | 206 | You're welcome to contribute to this project in any way you can. If you don't know how to code or don't have time, don't worry: you can still post an issue, and I will be happy to answer and fix it as fast as possible. 207 | Want to code? Fork it and submit a pull request! Also, pull requests that come with an example of what has been improved will be merged first. 208 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | __version__ = '20130515' 3 | 4 | if __name__ == '__main__': print __version__ 5 | -------------------------------------------------------------------------------- /install/install.sh: -------------------------------------------------------------------------------- 1 | mkdir install02500809 2 | cd install02500809 3 | echo "changing mode of /usr/local/etc to 775\n" 4 | sudo chmod 775 /usr/local/etc 5 | # We check the python2 availability 6 | echo "Checking python2 availability\n" 7 | type python2 >/dev/null 2>&1 || { 8 | echo "python2 unavailable\nNow creating python2 as alias of python\n" 9 | alias python2="python" 10 | } 11 | echo "Done python2\n" 12 | # We suppose that the system has the apt-get command already installed (if pdfimages is not available) 13 | # We suppose that the system has the curl command already installed 14 | # We check that pip exists on the system 15 | echo "Checking pip availability\n" 16 | type pip >/dev/null 2>&1 || { 17 | echo "pip unavailable\nNow installing pip\n" 18 | mkdir pipInstall 19 | cd pipInstall 20 | curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py 21 | sudo python get-pip.py 22 | cd ..
23 | rm -rf pipInstall 24 | } 25 | echo "Done pip\n" 26 | 27 | # pil dependencies 28 | echo "Installing pil dependencies\n" 29 | type apt-get >/dev/null 2>&1 && { 30 | sudo apt-get install libjpeg libjpeg-dev libfreetype6 libfreetype6-dev zlib1g-dev 31 | } || { 32 | type brew >/dev/null 2>&1 && { 33 | brew install jpeg freetype 34 | brew tap homebrew/dupes 35 | brew install zlib 36 | } 37 | } 38 | mkdir lcmsextract 39 | tar xzf ../lcms-1.19.tar.gz --directory=./lcmsextract 40 | cd lcmsextract/lcms-1.19 41 | ./configure 42 | make 43 | sudo make install 44 | cd ../.. 45 | rm -rf lcmsextract 46 | echo "Done pil dependencies\n" 47 | 48 | # psd-tools dependencies 49 | echo "Installing psd-tools dependencies\n" 50 | echo "Now installing packbits\n" 51 | sudo pip install -U packbits 52 | echo "Now installing docopt\n" 53 | sudo pip install -U docopt 54 | echo "Now installing pil\n" 55 | sudo pip install -U pil 56 | echo "Done psd-tools dependencies\n" 57 | 58 | 59 | # Installing psd-tools 60 | echo "Installing psd-tools\n" 61 | sudo pip install -U psd-tools 62 | echo "Done psd-tools\n" 63 | 64 | # Installing pdfminer 65 | echo "Installing pdfminer\n" 66 | mkdir pdfminerInstall 67 | cd pdfminerInstall 68 | curl -O https://codeload.github.com/euske/pdfminer/zip/master 69 | unzip master 70 | cd pdfminer-master 71 | make cmap 72 | sudo python setup.py install 73 | cd ../..
74 | rm -rf pdfminerInstall 75 | echo "Done pdfminer\n" 76 | 77 | # Installing beautifulsoup 78 | echo "Installing beautifulsoup\n" 79 | sudo pip install -U beautifulsoup4 80 | echo "Done beautifulsoup\n" 81 | 82 | # If pdfimages is not available we install the lib poppler-utils which contains it 83 | echo "Checking pdfimages\n" 84 | type pdfimages >/dev/null 2>&1 || { 85 | echo "pdfimages unavailable\nNow installing poppler-utils\n" 86 | type apt-get >/dev/null 2>&1 && { 87 | sudo apt-get install poppler-utils 88 | } || { 89 | type brew >/dev/null 2>&1 && { 90 | brew install fontconfig 91 | brew link --overwrite fontconfig 92 | brew install poppler 93 | brew link --overwrite poppler 94 | } 95 | } 96 | } 97 | echo "Done pdfimages\n" 98 | 99 | # If convert is not available we install the lib imagemagick which contains it 100 | echo "Checking convert\n" 101 | type convert >/dev/null 2>&1 || { 102 | echo "convert unavailable\nNow installing imagemagick\n" 103 | mkdir imagemagickInstall 104 | cd imagemagickInstall 105 | curl -O http://www.imagemagick.org/download/ImageMagick.tar.gz 106 | tar xvfz ImageMagick.tar.gz 107 | cd ImageMagick-6.8.5-8 108 | ./configure 109 | make 110 | sudo make install 111 | sudo ldconfig /usr/local/lib 112 | cd ../.. 113 | rm -rf imagemagickInstall 114 | } 115 | echo "Done convert\n" 116 | cd ..
117 | rm -rf install02500809 118 | -------------------------------------------------------------------------------- /install/lcms-1.19.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/install/lcms-1.19.tar.gz -------------------------------------------------------------------------------- /parser.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | 3 | def file_exist(file_path): 4 | import os.path 5 | if os.path.exists(file_path) and os.path.isfile(file_path): 6 | return True 7 | return False 8 | 9 | def get_4b_magic_number(file_path): 10 | try: 11 | binaryFile = open(file_path, 'rb') 12 | magicNumber = binaryFile.read(4) 13 | except(IOError), e: 14 | print "%s: unable to open/read.\n%s" % (file_path, e) 15 | else: 16 | return magicNumber 17 | return False 18 | 19 | def parse(file_path, image_folder): 20 | if file_exist(file_path): 21 | magicNumber = get_4b_magic_number(file_path) 22 | if magicNumber is not False: 23 | if magicNumber == "8BPS": #PSD file 24 | from psdtools import main 25 | elif magicNumber == "%PDF": #PDF file 26 | from pdfreader import main 27 | return main.run(file_path, image_folder) 28 | else: 29 | print "%s: not a psd nor a pdf (unknown Magic Number)" % file_path 30 | else: 31 | print "%s: file not found." 
% file_path 32 | 33 | if __name__ == '__main__': 34 | import sys 35 | if len(sys.argv) == 3: 36 | json = parse(sys.argv[1], sys.argv[2]) 37 | """ We write the json into a file called metadata.json """ 38 | target = open("metadata.json", 'w+') # a will append, w will over-write 39 | target.write(json) 40 | target.close() 41 | else: 42 | print "usage: %s pdf_or_psd_file_path generated_images_path/ (eg: python %s book.pdf/.psd './images/')" % (sys.argv[0], sys.argv[0]) 43 | -------------------------------------------------------------------------------- /pdfreader/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | __version__ = '20130515' 3 | 4 | if __name__ == '__main__': print __version__ 5 | -------------------------------------------------------------------------------- /pdfreader/book.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/book.pdf -------------------------------------------------------------------------------- /pdfreader/brochureint.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/brochureint.pdf -------------------------------------------------------------------------------- /pdfreader/brochureprint.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/brochureprint.pdf -------------------------------------------------------------------------------- /pdfreader/buttons.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/buttons.pdf -------------------------------------------------------------------------------- /pdfreader/ff.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ff.pdf -------------------------------------------------------------------------------- /pdfreader/ff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ff.png -------------------------------------------------------------------------------- /pdfreader/ffa.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/ffa.pdf -------------------------------------------------------------------------------- /pdfreader/hello world.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/hello world.pdf -------------------------------------------------------------------------------- /pdfreader/lib/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | __version__ = '20130515' 3 | 4 | if __name__ == '__main__': print __version__ 5 | -------------------------------------------------------------------------------- /pdfreader/lib/char.py: -------------------------------------------------------------------------------- 1 | class char(object): 2 | 3 | _size = 0 4 | _width = 0 5 | _height = 0 6 | _x = 0 7 | _y = 0 8 | _font = None 9 | _isBold = False 10 | _isItalic = False 11 | _char = '' 
12 | 13 | @property 14 | def font(self): 15 | return self._font 16 | 17 | @property 18 | def size(self): 19 | return self._size 20 | 21 | @property 22 | def x(self): 23 | return self._x 24 | 25 | @property 26 | def y(self): 27 | return self._y 28 | 29 | def __init__(self, xml_char): 30 | self._font = xml_char.get('font').split('+')[1] if xml_char.get('font') != None else None 31 | if self._font is not None: 32 | (left, top, self._width, self._height) = xml_char.get('bbox').split(',') 33 | self._width = int(float(self._width)) 34 | self._height = int(float(self._height)) 35 | self._x = int(float(left)) - self._width 36 | self._y = int(float(top)) - self._height 37 | self._size = int(float(xml_char.get('size'))) 38 | self._isBold = True if len(self._font.split('-')) == 2 and 'Bold' in self._font.split('-')[1] else False 39 | self._isItalic = True if len(self._font.split('-')) == 2 and 'Italic' in self._font.split('-')[1] else False 40 | self._font = self._font.split('-')[0] 41 | # 42 | self._char = xml_char.string 43 | 44 | def isBold(self): 45 | return self._isBold 46 | 47 | def isItalic(self): 48 | return self._isItalic 49 | 50 | def __str__(self): 51 | if self._font is None: 52 | return "" 53 | try: 54 | char = str(self._char) 55 | return char 56 | except UnicodeEncodeError: 57 | return "" 58 | -------------------------------------------------------------------------------- /pdfreader/lib/line.py: -------------------------------------------------------------------------------- 1 | from cStringIO import StringIO 2 | from char import char 3 | 4 | class line(object): 5 | 6 | _width = 0 7 | _height = 0 8 | _x = 0 9 | _y = 0 10 | _font = None 11 | _lines = [] 12 | 13 | @property 14 | def font(self): 15 | return self._font 16 | 17 | @property 18 | def size(self): 19 | return self._size 20 | 21 | @property 22 | def x(self): 23 | return self._x 24 | 25 | @property 26 | def y(self): 27 | return self._y 28 | 29 | 30 | def __init__(self, xml_line): 31 | (left, top, 
self._width, self._height) = xml_line.get('bbox').split(',') 32 | self._width = int(float(self._width)) 33 | self._height = int(float(self._height)) 34 | self._x = int(float(left)) - self._width 35 | self._y = int(float(top)) - self._height 36 | # 37 | xml_string = xml_line.find_all('text') 38 | self._chars = [] 39 | for c in xml_string: 40 | self._chars.append(char(c)) 41 | # 42 | self._font = self._chars[0].font if len(self._chars) > 0 else None 43 | self._size = self._chars[0].size if len(self._chars) > 0 else None 44 | 45 | _italic_on = False 46 | def handle_italic(self, c, string): 47 | if self._italic_on is False and c.isItalic() is True: 48 | string.write('') 49 | self._italic_on = True 50 | if self._italic_on is True and c.isItalic() is False: 51 | string.write('') 52 | self._italic_on = False 53 | 54 | _bold_on = False 55 | def handle_bold(self, c, string): 56 | if self._bold_on is False and c.isBold() is True: 57 | string.write('') 58 | self._bold_on = True 59 | if self._bold_on is True and c.isBold() is False: 60 | string.write('') 61 | self._bold_on = False 62 | 63 | 64 | def __str__(self): 65 | string = StringIO() 66 | for c in self._chars: 67 | self.handle_italic(c, string) 68 | self.handle_bold(c, string) 69 | string.write(str(c)) 70 | return string.getvalue() 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | if __name__ == '__main__': 80 | import sys 81 | from bs4 import BeautifulSoup 82 | file=""" 83 | 84 | 85 | 86 | 87 | L 88 | O 89 | R 90 | E 91 | M 92 | 93 | 94 | 95 | I 96 | P 97 | S 98 | U 99 | M 100 | 101 | 102 | 103 | 104 | 105 | 106 | """ 107 | soup = BeautifulSoup(file) 108 | l = line(soup.pages.page.textbox.textline) 109 | print ("[%s] font[%s]" % (l, l.font)) 110 | -------------------------------------------------------------------------------- /pdfreader/lib/page.py: -------------------------------------------------------------------------------- 1 | from paragraph import paragraph 2 | from cStringIO import StringIO 3 | 4 | class 
page(object): 5 | 6 | _paragraphs = [] 7 | 8 | @property 9 | def paragraphs(self): 10 | return self._paragraphs 11 | 12 | 13 | def __init__(self, xml_page): 14 | xml = xml_page.find_all('textbox') 15 | self._paragraphs = [] 16 | for p in xml: 17 | self._paragraphs.append(paragraph(p)) 18 | 19 | def toDict(self): 20 | array = {'paragraphs': []} 21 | for p in self._paragraphs: 22 | array['paragraphs'].append(p.toDict()) 23 | return array 24 | 25 | 26 | def __str__(self): 27 | string = StringIO() 28 | count = 0 29 | for p in self._paragraphs: 30 | if count != 0: 31 | string.write("\n\n") 32 | string.write(str(p)) 33 | count = 1 34 | return string.getvalue() 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | if __name__ == '__main__': 66 | import sys 67 | from bs4 import BeautifulSoup 68 | file=""" 69 | 70 | 71 | 72 | 73 | L 74 | O 75 | R 76 | E 77 | M 78 | 79 | I 80 | P 81 | S 82 | U 83 | M 84 | 85 | d 86 | o 87 | l 88 | o 89 | r 90 | 91 | s 92 | i 93 | t 94 | 95 | a 96 | m 97 | e 98 | t 99 | , 100 | 101 | c 102 | o 103 | n 104 | s 105 | e 106 | c 107 | t 108 | e 109 | t 110 | u 111 | r 112 | 113 | 114 | 115 | 116 | 117 | a 118 | d 119 | i 120 | p 121 | i 122 | s 123 | i 124 | c 125 | i 126 | n 127 | g 128 | 129 | e 130 | l 131 | i 132 | t 133 | , 134 | 135 | s 136 | e 137 | d 138 | 139 | d 140 | o 141 | 142 | e 143 | i 144 | u 145 | s 146 | m 147 | o 148 | d 149 | 150 | 151 | 152 | 153 | t 154 | e 155 | m 156 | p 157 | o 158 | r 159 | 160 | i 161 | n 162 | c 163 | i 164 | d 165 | i 166 | d 167 | u 168 | n 169 | t 170 | 171 | u 172 | t 173 | 174 | l 175 | a 176 | b 177 | o 178 | r 179 | e 180 | 181 | e 182 | t 183 | 184 | d 185 | o 186 | l 187 | o 188 | r 189 | e 190 | 191 | 192 | 193 | 194 | 195 | m 196 | a 197 | g 198 | n 199 | a 200 | 201 | a 202 | l 203 | i 204 | q 205 | u 206 | a 207 | . 
208 | 209 | U 210 | t 211 | 212 | e 213 | n 214 | i 215 | m 216 | 217 | a 218 | d 219 | 220 | m 221 | i 222 | n 223 | i 224 | m 225 | 226 | v 227 | e 228 | n 229 | i 230 | a 231 | m 232 | , 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | U 241 | n 242 | t 243 | i 244 | t 245 | l 246 | e 247 | d 248 | 249 | 250 | 251 | 252 | 253 | 254 | L 255 | O 256 | R 257 | E 258 | M 259 | 260 | 261 | 262 | I 263 | 264 | P 265 | 266 | S 267 | U 268 | M 269 | 270 | 271 | 272 | 273 | 274 | 275 | """ 276 | soup = BeautifulSoup(file) 277 | p = page(soup.pages.page) 278 | print ("[%s] json[%s]" % (p, p.toDict())) 279 | -------------------------------------------------------------------------------- /pdfreader/lib/paragraph.py: -------------------------------------------------------------------------------- 1 | from line import line 2 | from cStringIO import StringIO 3 | 4 | class paragraph(object): 5 | 6 | _width = 0 7 | _height = 0 8 | _x = 0 9 | _y = 0 10 | _font = None 11 | _lines = [] 12 | 13 | @property 14 | def font(self): 15 | return self._font 16 | 17 | @property 18 | def size(self): 19 | return self._size 20 | 21 | @property 22 | def x(self): 23 | return self._x 24 | 25 | @property 26 | def y(self): 27 | return self._y 28 | 29 | def __init__(self, xml_paragraph): 30 | (left, top, self._width, self._height) = xml_paragraph.get('bbox').split(',') 31 | self._width = int(float(self._width)) 32 | self._height = int(float(self._height)) 33 | self._x = int(float(left)) - self._width 34 | self._y = int(float(top)) - self._height 35 | # 36 | xml = xml_paragraph.find_all('textline') 37 | self._lines = [] 38 | for l in xml: 39 | self._lines.append(line(l)) 40 | # 41 | self._font = self._lines[0].font if len(self._lines) > 0 else None 42 | self._size = self._lines[0].size if len(self._lines) > 0 else None 43 | 44 | def toDict(self): 45 | dico = dict({ 'font': self._font, 46 | 'size': self._size, 47 | 'x': self._x, 48 | 'y': self._y, 49 | 'width': self._width, 50 | 'height': self._height 
51 | }) 52 | string = StringIO() 53 | count = 0 54 | for l in self._lines: 55 | if count != 0: 56 | string.write("\n") 57 | string.write(str(l)) 58 | count = 1 59 | dico.update({'string': string.getvalue()}) 60 | return dico 61 | 62 | def __str__(self): 63 | string = StringIO() 64 | count = 0 65 | for l in self._lines: 66 | if count != 0: 67 | string.write("\n") 68 | string.write(str(l)) 69 | count = 1 70 | return string.getvalue() 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | if __name__ == '__main__': 94 | import sys 95 | import json 96 | from bs4 import BeautifulSoup 97 | file=""" 98 | 99 | 100 | 101 | 102 | L 103 | O 104 | R 105 | E 106 | M 107 | 108 | I 109 | P 110 | S 111 | U 112 | M 113 | 114 | d 115 | o 116 | l 117 | o 118 | r 119 | 120 | s 121 | i 122 | t 123 | 124 | a 125 | m 126 | e 127 | t 128 | , 129 | 130 | c 131 | o 132 | n 133 | s 134 | e 135 | c 136 | t 137 | e 138 | t 139 | u 140 | r 141 | 142 | 143 | 144 | 145 | 146 | a 147 | d 148 | i 149 | p 150 | i 151 | s 152 | i 153 | c 154 | i 155 | n 156 | g 157 | 158 | e 159 | l 160 | i 161 | t 162 | , 163 | 164 | s 165 | e 166 | d 167 | 168 | d 169 | o 170 | 171 | e 172 | i 173 | u 174 | s 175 | m 176 | o 177 | d 178 | 179 | 180 | 181 | 182 | t 183 | e 184 | m 185 | p 186 | o 187 | r 188 | 189 | i 190 | n 191 | c 192 | i 193 | d 194 | i 195 | d 196 | u 197 | n 198 | t 199 | 200 | u 201 | t 202 | 203 | l 204 | a 205 | b 206 | o 207 | r 208 | e 209 | 210 | e 211 | t 212 | 213 | d 214 | o 215 | l 216 | o 217 | r 218 | e 219 | 220 | 221 | 222 | 223 | 224 | m 225 | a 226 | g 227 | n 228 | a 229 | 230 | a 231 | l 232 | i 233 | q 234 | u 235 | a 236 | . 
237 | 238 | U 239 | t 240 | 241 | e 242 | n 243 | i 244 | m 245 | 246 | a 247 | d 248 | 249 | m 250 | i 251 | n 252 | i 253 | m 254 | 255 | v 256 | e 257 | n 258 | i 259 | a 260 | m 261 | , 262 | 263 | 264 | 265 | 266 | q 267 | u 268 | i 269 | s 270 | 271 | n 272 | o 273 | s 274 | t 275 | r 276 | u 277 | d 278 | 279 | e 280 | x 281 | e 282 | r 283 | c 284 | i 285 | t 286 | a 287 | t 288 | i 289 | o 290 | n 291 | 292 | u 293 | l 294 | l 295 | a 296 | m 297 | c 298 | o 299 | 300 | l 301 | a 302 | b 303 | o 304 | r 305 | i 306 | s 307 | 308 | 309 | 310 | 311 | 312 | n 313 | i 314 | s 315 | i 316 | 317 | u 318 | t 319 | 320 | a 321 | l 322 | i 323 | q 324 | u 325 | i 326 | p 327 | 328 | e 329 | x 330 | 331 | e 332 | a 333 | 334 | c 335 | o 336 | m 337 | m 338 | o 339 | d 340 | o 341 | 342 | 343 | 344 | 345 | c 346 | o 347 | n 348 | s 349 | e 350 | q 351 | u 352 | a 353 | t 354 | . 355 | 356 | 357 | 358 | 359 | 360 | 361 | """ 362 | soup = BeautifulSoup(file) 363 | p = paragraph(soup.pages.page.textbox) 364 | print ("[%s] font[%s] json[%s]" % (p, p.font, p.toDict())) 365 | -------------------------------------------------------------------------------- /pdfreader/main.py: -------------------------------------------------------------------------------- 1 | from util.convert import converter 2 | from lib.book import book 3 | 4 | def text_to_dict(fileinput): 5 | """ 6 | extracting the text into a xml string 7 | """ 8 | convert = converter() 9 | xml = convert.as_xml().add_input_file(fileinput).run() 10 | """ 11 | Parsing the xml string to transform it into a dictionary 12 | """ 13 | b = book(xml) 14 | return b.toDict() 15 | 16 | 17 | """ 18 | Extracting the images out of the pdf file 19 | images are named respecting the following convention: tempppm-[pageNumber]-[imageNumber].ppm (eg: tempppm-001-000.ppm) 20 | """ 21 | import subprocess 22 | import sys 23 | 24 | def extract_images(file): 25 | subprocess.call('/usr/local/bin/pdfimages -p -j '+file+' tempimg', shell=True, 
stderr=sys.stdout) 26 | 27 | 28 | 29 | """ 30 | Matching the extracted jpg/ppm/pbm images against a pattern; ppm/pbm images are converted to png 31 | At the same time, dict_book is updated with the names of the resulting images 32 | """ 33 | import glob 34 | import json 35 | import uuid 36 | import re 37 | import os 38 | 39 | def get_img_names_by_page_number(): 40 | image_list = {} 41 | nb_images = 0 42 | ppm_images = glob.glob('./tempimg*.*') 43 | for image in ppm_images: 44 | match = re.match( r'\./tempimg\-(\d+)\-(\d+)\.(jpg|ppm|pbm)', image, re.M|re.I) 45 | if match: 46 | page_num = int(match.group(1)) - 1 47 | image_num = int(match.group(2)) 48 | nb_images += 1 49 | if page_num not in image_list: 50 | image_list[page_num] = {} 51 | image_list[page_num].update({image_num:image}) 52 | return image_list, nb_images 53 | 54 | def rename_imgs__update_dict(image_list, dict_book, image_folder): 55 | image_num = 1 56 | for page in image_list.iterkeys(): 57 | for image in image_list[page].iterkeys(): 58 | print "Processing image %d" % image_num 59 | image_num += 1 60 | if 'images' not in dict_book['pages'][page]: 61 | dict_book['pages'][page].update({'images':[]}) 62 | if "jpg" in image_list[page][image]: 63 | image_name = "%s_p%d.jpg" % (uuid.uuid1(), page) 64 | dict_book['pages'][page]['images'].append(image_name) 65 | subprocess.call('mv %s %s' % (image_list[page][image], image_folder+image_name), shell=True, stderr=sys.stdout) 66 | elif "ppm" in image_list[page][image] or "pbm" in image_list[page][image]: 67 | image_name = "%s_p%d.png" % (uuid.uuid1(), page) 68 | dict_book['pages'][page]['images'].append(image_name) 69 | subprocess.call('/usr/local/bin/convert %s %s'%(image_list[page][image], image_folder+image_name), shell=True, stderr=sys.stdout) 70 | os.remove(image_list[page][image]) 71 | return dict_book 72 | 73 | def get_images_update_dict(dict_book, image_folder): 74 | image_list, nb_images = get_img_names_by_page_number() 75 | print "%d images to process" % nb_images 76 | dict_book =
rename_imgs__update_dict(image_list, dict_book, image_folder) 77 | return dict_book 78 | 79 | 80 | def run(pdf_file, image_folder): 81 | print "Reading PDF" 82 | dict_book = text_to_dict(pdf_file) 83 | print "Extracting images" 84 | extract_images(pdf_file) 85 | dict_book = get_images_update_dict(dict_book, image_folder) 86 | return json.dumps(dict_book) 87 | 88 | 89 | if __name__ == '__main__': 90 | import sys 91 | if len(sys.argv) == 3: 92 | print run(pdf_file=sys.argv[1], image_folder=sys.argv[2]) 93 | else: 94 | print "usage: %s pdf_file_path generated_images_path/ (eg: python %s book.pdf './images/')" % (sys.argv[0], sys.argv[0]) 95 | 96 | 97 | 98 | """ 99 | # Here is an example of what dict_book looks like 100 | print json.dumps(dict_book) 101 | { 102 | "pages": [ 103 | { 104 | "images": [ 105 | "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png" 106 | ], 107 | "paragraphs": [ 108 | { 109 | "size": 98, 110 | "width": 587, 111 | "string": "Book Title", 112 | "y": -98, 113 | "x": -324, 114 | "font": "Georgia", 115 | "height": 705 116 | } 117 | ] 118 | }, 119 | { 120 | "images": [ 121 | "96f4e9ee-c1eb-11e2-ad2b-040ccedc7e34_p1.png" 122 | ], 123 | "paragraphs": [ 124 | { 125 | "size": 24, 126 | "width": 138, 127 | "string": "CHAPTER 1", 128 | "y": -24, 129 | "x": -88, 130 | "font": "Georgia", 131 | "height": 711 132 | }, 133 | { 134 | "size": 33, 135 | "width": 489, 136 | "string": "Lorem ipsum dolor sit amet, consectetur \nadipisicing elit, sed do eiusmod\ntempor incididunt ut labore et dolore \nmagna aliqua. 
Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.", 137 | "y": -229, 138 | "x": -439, 139 | "font": "Georgia", 140 | "height": 269 141 | } 142 | ] 143 | }, 144 | { 145 | "paragraphs": [ 146 | { 147 | "size": 24, 148 | "width": 133, 149 | "string": "SECTION 1", 150 | "y": -24, 151 | "x": -83, 152 | "font": "Georgia", 153 | "height": 711 154 | } 155 | ] 156 | } 157 | ] 158 | } 159 | """ -------------------------------------------------------------------------------- /pdfreader/pff.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/pff.pdf -------------------------------------------------------------------------------- /pdfreader/pffl.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/pffl.pdf -------------------------------------------------------------------------------- /pdfreader/png.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/pdfreader/png.pdf -------------------------------------------------------------------------------- /pdfreader/util/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | __version__ = '20130515' 3 | 4 | if __name__ == '__main__': print __version__ 5 | -------------------------------------------------------------------------------- /pdfreader/util/convert.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | from pdfminer.pdfdocument import PDFDocument 3 | from pdfminer.pdfparser import PDFParser 4 | from pdfminer.pdfinterp import 
PDFResourceManager, PDFPageInterpreter 5 | from pdfminer.pdfpage import PDFPage 6 | from pdfminer.pdfdevice import PDFDevice, TagExtractor 7 | from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter 8 | from pdfminer.cmapdb import CMapDB 9 | from pdfminer.layout import LAParams 10 | from pdfminer.image import ImageWriter 11 | 12 | 13 | class converter(object): 14 | 15 | def usage(self): 16 | print ('usage: %s [-d] [-p pagenos] [-m maxpages] [-P password] [-o output] [-C] ' 17 | '[-n] [-A] [-V] [-M char_margin] [-L line_margin] [-W word_margin] [-F boxes_flow] ' 18 | '[-Y layout_mode] [-O output_dir] [-t text|html|xml|tag] [-c codec] [-s scale] file ...' % self._argv[0]) 19 | return 100 20 | 21 | def __init__(self): 22 | # debug option 23 | self._debug = 0 24 | # input option 25 | self._password = '' 26 | self._pagenos = set() 27 | self._maxpages = 0 28 | # output option 29 | self._outfile = None 30 | self._outtype = None 31 | self._imagewriter = None 32 | self._layoutmode = 'normal' 33 | self._codec = 'utf-8' 34 | self._pageno = 1 35 | self._scale = 1 36 | self._caching = True 37 | self._showpageno = True 38 | self._laparams = LAParams() 39 | self._argv = ['convert.py'] 40 | self._args = [] 41 | 42 | def mainIniter(self, argv): 43 | import getopt 44 | try: 45 | (opts, args) = getopt.getopt(argv[1:], 'dp:m:P:o:CnAVM:L:W:F:Y:O:t:c:s:') 46 | except getopt.GetoptError: 47 | return self.usage() 48 | if not args: return self.usage() 49 | for (k, v) in opts: 50 | if k == '-d': self._debug += 1 51 | elif k == '-p': self._pagenos.update( int(x)-1 for x in v.split(',') ) 52 | elif k == '-m': self._maxpages = int(v) 53 | elif k == '-P': self._password = v 54 | elif k == '-o': self._outfile = v 55 | elif k == '-C': self._caching = False 56 | elif k == '-n': self._laparams = None 57 | elif k == '-A': self._laparams.all_texts = True 58 | elif k == '-V': self._laparams.detect_vertical = True 59 | elif k == '-M': self._laparams.char_margin = float(v) 60 | elif k 
== '-L': self._laparams.line_margin = float(v) 61 | elif k == '-W': self._laparams.word_margin = float(v) 62 | elif k == '-F': self._laparams.boxes_flow = float(v) 63 | elif k == '-Y': self._layoutmode = v 64 | elif k == '-O': self._imagewriter = ImageWriter(v) 65 | elif k == '-t': self._outtype = v 66 | elif k == '-c': self._codec = v 67 | elif k == '-s': self._scale = float(v) 68 | # 69 | self._argv = argv 70 | self._args = args 71 | # 72 | PDFDocument.debug = self._debug 73 | PDFParser.debug = self._debug 74 | CMapDB.debug = self._debug 75 | PDFResourceManager.debug = self._debug 76 | PDFPageInterpreter.debug = self._debug 77 | PDFDevice.debug = self._debug 78 | return self.run() 79 | 80 | """ 81 | Output options 82 | """ 83 | def with_debug(self, todo=True): 84 | self._debug = 1 if todo else 0 85 | return self 86 | 87 | def as_text(self, todo=True): 88 | self._outtype = 'text' if todo else None 89 | return self 90 | 91 | def as_xml(self, todo=True): 92 | self._outtype = 'xml' if todo else None 93 | return self 94 | 95 | def as_html(self, todo=True): 96 | self._outtype = 'html' if todo else None 97 | return self 98 | 99 | def as_tag(self, todo=True): 100 | self._outtype = 'tag' if todo else None 101 | return self 102 | 103 | def add_input_file(self, filename): 104 | self._args.append(filename) 105 | return self 106 | 107 | """ 108 | Process the pdf file(s) 109 | """ 110 | def run(self): 111 | rsrcmgr = PDFResourceManager(caching=self._caching) 112 | if not self._outtype: 113 | self._outtype = 'text' 114 | if __name__ == '__main__': 115 | if self._outfile: 116 | if self._outfile.endswith('.htm') or self._outfile.endswith('.html'): 117 | self._outtype = 'html' 118 | elif self._outfile.endswith('.xml'): 119 | self._outtype = 'xml' 120 | elif self._outfile.endswith('.tag'): 121 | self._outtype = 'tag' 122 | if __name__ == '__main__': 123 | if self._outfile: 124 | outfp = file(self._outfile, 'w') 125 | else: 126 | outfp = sys.stdout 127 | else: 128 | from cStringIO 
import StringIO 129 | outfp = StringIO() 130 | if self._outtype == 'text': 131 | device = TextConverter(rsrcmgr, outfp, codec=self._codec, laparams=self._laparams, imagewriter=self._imagewriter) 132 | elif self._outtype == 'xml': 133 | device = XMLConverter(rsrcmgr, outfp, codec=self._codec, laparams=self._laparams, imagewriter=self._imagewriter) 134 | elif self._outtype == 'html': 135 | device = HTMLConverter(rsrcmgr, outfp, codec=self._codec, scale=self._scale, layoutmode=self._layoutmode, laparams=self._laparams, imagewriter=self._imagewriter) 136 | elif self._outtype == 'tag': 137 | device = TagExtractor(rsrcmgr, outfp, codec=self._codec) 138 | else: 139 | return self.usage() 140 | for fname in self._args: 141 | fp = file(fname, 'rb') 142 | interpreter = PDFPageInterpreter(rsrcmgr, device) 143 | 144 | for page in PDFPage.get_pages(fp, self._pagenos, maxpages=self._maxpages, password=self._password, caching=self._caching, check_extractable=True): 145 | interpreter.process_page(page) 146 | 147 | fp.close() 148 | device.close() 149 | if __name__ == '__main__': 150 | outfp.close() 151 | else: 152 | return outfp.getvalue() 153 | 154 | 155 | 156 | 157 | if __name__ == '__main__': 158 | import sys 159 | sys.exit(converter().mainIniter(sys.argv)) 160 | -------------------------------------------------------------------------------- /psd_rb/Gemfile: -------------------------------------------------------------------------------- 1 | source 'https://rubygems.org' 2 | 3 | gem 'psd' 4 | -------------------------------------------------------------------------------- /psd_rb/json_type.json: -------------------------------------------------------------------------------- 1 | [{ 2 | "folders": [{ 3 | "created_at": "2014-04-17T08:23:42Z", 4 | "_id": "534f8f8e94da26c97d00003a", 5 | "description": "jheuh iuwh iwuhe iuhiuhwiuehriguh iuwheuirhgwhiuerh uihwgiuewrhguiehu hweurh u ieuwh eiuh iuh weuh ihwe iuhwe uhreuhgreugh wieh wiuehwueirghwei hiuh iuewh uiwh uiwhu u weiu rui wiuhewiug 
iuwer wegwe hguihrgiuwe reh erwgheihg e g ehgiuheguihw eugi hwegh opj s dfspo sdk kdf", 6 | "data": { 7 | "orientation": "0", 8 | "projectColors": [{ 9 | "alpha": 1, 10 | "color": 16777215 11 | }, { 12 | "alpha": 1, 13 | "color": 16777215 14 | }, { 15 | "alpha": 1, 16 | "color": 16777215 17 | }, { 18 | "alpha": 1, 19 | "color": 16777215 20 | }, { 21 | "alpha": 1, 22 | "color": 16777215 23 | }, { 24 | "alpha": 1, 25 | "color": 16777215 26 | }, { 27 | "alpha": 1, 28 | "color": 16777215 29 | }, { 30 | "alpha": 1, 31 | "color": 16777215 32 | }, { 33 | "alpha": 1, 34 | "color": 16777215 35 | }, { 36 | "alpha": 1, 37 | "color": 16777215 38 | }, { 39 | "alpha": 1, 40 | "color": 16777215 41 | }, { 42 | "alpha": 1, 43 | "color": 16777215 44 | }, { 45 | "alpha": 1, 46 | "color": 16777215 47 | }, { 48 | "alpha": 1, 49 | "color": 16777215 50 | }, { 51 | "alpha": 1, 52 | "color": 16777215 53 | }, { 54 | "alpha": 1, 55 | "color": 16777215 56 | }, { 57 | "alpha": 1, 58 | "color": 16777215 59 | }, { 60 | "alpha": 1, 61 | "color": 16777215 62 | }, { 63 | "alpha": 1, 64 | "color": 16777215 65 | }, { 66 | "alpha": 1, 67 | "color": 16777215 68 | }, { 69 | "alpha": 1, 70 | "color": 16777215 71 | }, { 72 | "alpha": 1, 73 | "color": 16777215 74 | }, { 75 | "alpha": 1, 76 | "color": 16777215 77 | }, { 78 | "alpha": 1, 79 | "color": 16777215 80 | }], 81 | "lastFontColor": 16777215, 82 | "navigation": true, 83 | "tags": null, 84 | "width": 1024, 85 | "height": 768 86 | }, 87 | "sid": "534f8f8e94da26c97d00003a", 88 | "user_id": "52820a5994da266b19000002", 89 | "type": "ProjectFolder", 90 | "folderName": "V42", 91 | "last_published_at": "2014-04-17T08:23:42+00:00", 92 | "child_ids": ["534f8f8e94da26c97d00003b"], 93 | "unit_ids": ["534f8f8e94da26c97d00002e", "534f8f8e94da26c97d000030"], 94 | "resource_ids": ["5379fec120d5ac4294000051"], 95 | "font_ids": ["527a597cab134d6a3e000005"], 96 | "updated_at": "2014-05-19T12:53:32Z", 97 | "_type": "Folder" 98 | }, { 99 | "created_at": 
"2014-04-17T08:23:42Z", 100 | "_id": "534f8f8e94da26c97d00003b", 101 | "unit_ids": ["534f8f8e94da26c97d00002c"], 102 | "data": { 103 | "index": 0 104 | }, 105 | "sid": "534f8f8e94da26c97d00003a", 106 | "folderName": "Untitled", 107 | "parent_ids": ["534f8f8e94da26c97d00003a"], 108 | "type": "ViewFolder", 109 | "updated_at": "2014-05-19T12:53:32Z", 110 | "_type": "Folder" 111 | }], 112 | "units": [{ 113 | "created_at": "2014-04-17T08:23:42Z", 114 | "_id": "534f8f8e94da26c97d00002c", 115 | "folder_ids": ["534f8f8e94da26c97d00003b"], 116 | "stype": "TMWorld", 117 | "type": "world", 118 | "sid": "534f8f8e94da26c97d00003a", 119 | "child_ids": ["534f8f8e94da26c97d00002d", "534f8f8e94da26c97d000038", "534f8f8e94da26c97d000039", "5379feb120d5ac429400004a", "5379feb120d5ac429400004e", "5379feb120d5ac429400004f"], 120 | "data": { 121 | "index": 0, 122 | "unitName": "Untitled", 123 | "bindings": { 124 | "ressources": null, 125 | "properties": {} 126 | } 127 | }, 128 | "updated_at": "2014-05-19T12:53:32Z", 129 | "_type": "Unit" 130 | }, { 131 | "created_at": "2014-04-17T08:23:42Z", 132 | "folder_ids": ["534f8f8e94da26c97d00003a"], 133 | "stype": "TMWorld", 134 | "type": "world", 135 | "_id": "534f8f8e94da26c97d00002e", 136 | "sid": "534f8f8e94da26c97d00003a", 137 | "child_ids": ["534f8f8e94da26c97d00002f"], 138 | "data": { 139 | "index": 0, 140 | "unitName": "masterOverView" 141 | }, 142 | "updated_at": "2014-04-17T08:23:42Z", 143 | "_type": "Unit" 144 | }, { 145 | "created_at": "2014-04-17T08:23:42Z", 146 | "folder_ids": ["534f8f8e94da26c97d00003a"], 147 | "stype": "TMWorld", 148 | "type": "world", 149 | "_id": "534f8f8e94da26c97d000030", 150 | "sid": "534f8f8e94da26c97d00003a", 151 | "child_ids": ["534f8f8e94da26c97d000031"], 152 | "data": { 153 | "index": 0, 154 | "unitName": "masterUnderView" 155 | }, 156 | "updated_at": "2014-04-17T08:23:42Z", 157 | "_type": "Unit" 158 | }, { 159 | "created_at": "2014-04-17T08:23:42Z", 160 | "_id": "534f8f8e94da26c97d00002d", 161 | 
"flat_child_ids": ["534f8f8e94da26c97d000038", "534f8f8e94da26c97d000039", "5379feb120d5ac429400004a"], 162 | "stype": "statestack", 163 | "type": "graphicunit", 164 | "sid": "534f8f8e94da26c97d00003a", 165 | "parent_ids": ["534f8f8e94da26c97d00002c"], 166 | "resource_ids": ["5379fec120d5ac4294000051"], 167 | "data": { 168 | "stacks": { 169 | "5347a59d5d04707f8f000020": { 170 | "masterBack": null, 171 | "actions": {}, 172 | "thumb": "5379fec120d5ac4294000051", 173 | "ref": null, 174 | "sactions": null, 175 | "masterFront": null, 176 | "units": [0, 1, 2], 177 | "color": { 178 | "solid": { 179 | "alpha": 1, 180 | "color": 102 181 | } 182 | }, 183 | "name": "Untitled" 184 | }, 185 | "o": ["5347a59d5d04707f8f000020"] 186 | }, 187 | "internalWPid": "5347a59d5d04707f8f00002d", 188 | "world": { 189 | "args": [0, 0, 1024, 768, 0], 190 | "width": 1024, 191 | "height": 768, 192 | "type": "TMWorld" 193 | }, 194 | "unitName": "RootStateStack1451", 195 | "bindings": { 196 | "ressources": null, 197 | "properties": {} 198 | }, 199 | "indexChild": 1 200 | }, 201 | "updated_at": "2014-05-19T12:53:31Z", 202 | "_type": "Unit" 203 | }, { 204 | "created_at": "2014-04-17T08:23:42Z", 205 | "_id": "534f8f8e94da26c97d000038", 206 | "stype": "Text", 207 | "type": "media", 208 | "sid": "534f8f8e94da26c97d00003a", 209 | "flat_parent_ids": ["534f8f8e94da26c97d00002d"], 210 | "parent_ids": ["534f8f8e94da26c97d00002c"], 211 | "resource_ids": ["534f8f8e94da26c97d00002b"], 212 | "data": { 213 | "opacity": 0.5, 214 | "world": { 215 | "args": [160, 31, 200, 183, 0], 216 | "width": 1024, 217 | "height": 768, 218 | "type": "TMWorld" 219 | }, 220 | "unitName": "Lorem ipsum dolor sit amet, co", 221 | "bindings": { 222 | "ressources": null, 223 | "properties": {} 224 | }, 225 | "indexChild": 1 226 | }, 227 | "updated_at": "2014-05-19T12:53:31Z", 228 | "_type": "Unit" 229 | }, { 230 | "created_at": "2014-04-17T08:23:42Z", 231 | "_id": "534f8f8e94da26c97d000039", 232 | "stype": "image", 233 | "type": 
"media", 234 | "sid": "534f8f8e94da26c97d00003a", 235 | "flat_parent_ids": ["534f8f8e94da26c97d00002d"], 236 | "parent_ids": ["534f8f8e94da26c97d00002c"], 237 | "resource_ids": ["5347a61c5d04707f8f000059"], 238 | "data": { 239 | "isAutoName": false, 240 | "actions": { 241 | "4": null 242 | }, 243 | "world": { 244 | "args": [536, 31, 329, 329, 0], 245 | "width": 1024, 246 | "height": 768, 247 | "type": "TMWorld" 248 | }, 249 | "unitName": "seal_bear.jpg", 250 | "bindings": { 251 | "ressources": null, 252 | "properties": {} 253 | }, 254 | "indexChild": 2 255 | }, 256 | "updated_at": "2014-05-19T12:53:31Z", 257 | "_type": "Unit" 258 | }, { 259 | "created_at": "2014-05-19T12:53:31Z", 260 | "_id": "5379feb120d5ac429400004a", 261 | "stype": "group", 262 | "flat_parent_ids": ["534f8f8e94da26c97d00002d"], 263 | "data": { 264 | "stacks": { 265 | "5379feb120d5ac429400004b": { 266 | "ref": null, 267 | "units": [0, 1], 268 | "actions": {}, 269 | "thumb": null, 270 | "name": "Untitled" 271 | }, 272 | "o": ["5379feb120d5ac429400004b"] 273 | }, 274 | "world": { 275 | "args": [243, 511, 321, 188, 0], 276 | "width": 1024, 277 | "height": 768, 278 | "type": "TMWorld" 279 | }, 280 | "unitName": "Group_5774", 281 | "indexChild": 3, 282 | "internalWPid": "5379feb120d5ac429400004c" 283 | }, 284 | "flat_child_ids": ["5379feb120d5ac429400004e", "5379feb120d5ac429400004f"], 285 | "type": "container", 286 | "parent_ids": ["534f8f8e94da26c97d00002c"], 287 | "sid": "534f8f8e94da26c97d00003a", 288 | "updated_at": "2014-05-19T12:53:31Z", 289 | "_type": "Unit" 290 | }, { 291 | "created_at": "2014-05-19T12:53:31Z", 292 | "_id": "5379feb120d5ac429400004e", 293 | "stype": "ellipse", 294 | "flat_parent_ids": ["5379feb120d5ac429400004a"], 295 | "data": { 296 | "world": { 297 | "args": [46, 0, 135, 134, "20.20475968803733"], 298 | "width": 321, 299 | "height": 188, 300 | "type": "TMWorld" 301 | }, 302 | "unitName": "Ellipse_3723", 303 | "bindings": { 304 | "ressources": null, 305 | "properties": {} 
306 | }, 307 | "background": { 308 | "solid": { 309 | "alpha": 1, 310 | "color": 10412991 311 | } 312 | }, 313 | "indexChild": 1 314 | }, 315 | "type": "primitive", 316 | "parent_ids": ["534f8f8e94da26c97d00002c"], 317 | "sid": "534f8f8e94da26c97d00003a", 318 | "updated_at": "2014-05-19T12:53:31Z", 319 | "_type": "Unit" 320 | }, { 321 | "created_at": "2014-05-19T12:53:31Z", 322 | "_id": "5379feb120d5ac429400004f", 323 | "stype": "path", 324 | "flat_parent_ids": ["5379feb120d5ac429400004a"], 325 | "data": { 326 | "shapeBounds": [0, 0, 204, 102], 327 | "background": { 328 | "solid": { 329 | "alpha": 1, 330 | "color": 10412991 331 | } 332 | }, 333 | "shape": "triangle", 334 | "world": { 335 | "args": [117, 86, 204, 102, 0], 336 | "width": 321, 337 | "height": 188, 338 | "type": "TMWorld" 339 | }, 340 | "unitName": "Shape_3937", 341 | "bindings": { 342 | "ressources": null, 343 | "properties": {} 344 | }, 345 | "indexChild": 2, 346 | "path": "M0 212.132 212.132 0 424.264 212.132 0 212.132Z" 347 | }, 348 | "type": "primitive", 349 | "parent_ids": ["534f8f8e94da26c97d00002c"], 350 | "sid": "534f8f8e94da26c97d00003a", 351 | "updated_at": "2014-05-19T12:53:31Z", 352 | "_type": "Unit" 353 | }, { 354 | "created_at": "2014-04-17T08:23:42Z", 355 | "_id": "534f8f8e94da26c97d00002f", 356 | "stype": "statestack", 357 | "type": "graphicunit", 358 | "sid": "534f8f8e94da26c97d00003a", 359 | "parent_ids": ["534f8f8e94da26c97d00002e"], 360 | "resource_ids": ["5347a59d5d04707f8f000023"], 361 | "data": { 362 | "stacks": { 363 | "5347a59d5d04707f8f000024": { 364 | "masterBack": null, 365 | "actions": {}, 366 | "thumb": "5347a59d5d04707f8f000023", 367 | "ref": null, 368 | "sactions": null, 369 | "masterFront": null, 370 | "units": null, 371 | "color": { 372 | "solid": { 373 | "alpha": 0, 374 | "color": 16777215 375 | } 376 | }, 377 | "name": "Untitled" 378 | }, 379 | "o": ["5347a59d5d04707f8f000024"] 380 | }, 381 | "internalWPid": "5347a59d5d04707f8f000029", 382 | "world": { 383 | "args": 
[0, 0, 1024, 768, 0], 384 | "width": 1024, 385 | "height": 768, 386 | "type": "TMWorld" 387 | }, 388 | "unitName": "RootStateStack1447", 389 | "bindings": { 390 | "ressources": null, 391 | "properties": {} 392 | }, 393 | "indexChild": 1 394 | }, 395 | "updated_at": "2014-05-19T12:53:31Z", 396 | "_type": "Unit" 397 | }, { 398 | "created_at": "2014-04-17T08:23:42Z", 399 | "_id": "534f8f8e94da26c97d000031", 400 | "stype": "statestack", 401 | "type": "graphicunit", 402 | "sid": "534f8f8e94da26c97d00003a", 403 | "parent_ids": ["534f8f8e94da26c97d000030"], 404 | "resource_ids": ["5347a59d5d04707f8f000027"], 405 | "data": { 406 | "stacks": { 407 | "5347a59d5d04707f8f000028": { 408 | "masterBack": null, 409 | "actions": {}, 410 | "thumb": "5347a59d5d04707f8f000027", 411 | "ref": null, 412 | "sactions": null, 413 | "masterFront": null, 414 | "units": null, 415 | "color": { 416 | "solid": { 417 | "alpha": 1, 418 | "color": 16777215 419 | } 420 | }, 421 | "name": "Untitled" 422 | }, 423 | "o": ["5347a59d5d04707f8f000028"] 424 | }, 425 | "internalWPid": "5347a59d5d04707f8f00002b", 426 | "world": { 427 | "args": [0, 0, 1024, 768, 0], 428 | "width": 1024, 429 | "height": 768, 430 | "type": "TMWorld" 431 | }, 432 | "unitName": "RootStateStack1448", 433 | "bindings": { 434 | "ressources": null, 435 | "properties": {} 436 | }, 437 | "indexChild": 1 438 | }, 439 | "updated_at": "2014-05-19T12:53:31Z", 440 | "_type": "Unit" 441 | }], 442 | "resources": [{ 443 | "created_at": "2014-04-17T08:23:42Z", 444 | "font_ids": ["527a597cab134d6a3e000005"], 445 | "unit_ids": ["534f8f8e94da26c97d000038"], 446 | "content": "

Lorem ipsum dolor sit amet, consectetur adipisicing elit.

", 447 | "_id": "534f8f8e94da26c97d00002b", 448 | "sid": "534f8f8e94da26c97d00003a", 449 | "type": "ressourceTextLayout", 450 | "updated_at": "2014-04-17T08:23:42Z", 451 | "_type": "Resource" 452 | }, { 453 | "created_at": "2014-04-11T08:22:42Z", 454 | "_id": "5347a61c5d04707f8f000059", 455 | "type": "ressourceImage", 456 | "unit_ids": ["5347a61c5d04707f8f00005a", "5347a81b94da26f432000011", "534bfa8894da26f432000025", "534bfb1194da2679f0000011", "534c0a5794da263072000011", "534c0fcb94da263072000025", "534d520294da262aae000011", "534d526b94da262aae000025", "534e34e994da267660000011", "534e368d94da267660000025", "534e36a294da26c974000011", "534e434a94da266b7b000011", "534e4abe94da26c1cf000011", "534e4d3694da267beb000011", "534e572d94da26c857000011", "534e59a394da26c857000025", "534e5ade94da26c97d000025", "534f8f8e94da26c97d000039", "535116c394da26c97d00004d", "53511e2694da26c97d000061"], 457 | "data": { 458 | "creationDate": 1394547794000, 459 | "width": 1024, 460 | "ressourceSize": 119186, 461 | "modificationDate": 1394547794000, 462 | "height": 1024 463 | }, 464 | "sid": "53464ef394da26a11400000d", 465 | "name": "seal_bear.jpg", 466 | "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a61c5d04707f8f000059", 467 | "updated_at": "2014-04-18T13:38:34Z", 468 | "_type": "Resource" 469 | }, { 470 | "created_at": "2014-04-11T08:22:42Z", 471 | "_id": "5347a59d5d04707f8f000023", 472 | "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a59d5d04707f8f000023", 473 | "unit_ids": ["5347a59d5d04707f8f000022", "5347a81b94da26f432000007", "534bfa8894da26f43200001b", "534bfb1194da2679f0000007", "534c0a5794da263072000007", "534c0fcb94da26307200001b", "534d520294da262aae000007", "534d526b94da262aae00001b", "534e34e994da267660000007", "534e368d94da26766000001b", "534e36a294da26c974000007", "534e434a94da266b7b000007", "534e4abe94da26c1cf000007", "534e4d3694da267beb000007", "534e572d94da26c857000007", "534e59a394da26c85700001b", "534e5ade94da26c97d00001b", 
"534f8f8e94da26c97d00002f", "535116c394da26c97d000043", "53511e2694da26c97d000057"], 474 | "type": "ressourceImage", 475 | "sid": "53464ef394da26a11400000d", 476 | "data": { 477 | "height": 768, 478 | "width": 1024 479 | }, 480 | "updated_at": "2014-04-18T13:38:34Z", 481 | "_type": "Resource" 482 | }, { 483 | "created_at": "2014-04-11T08:22:42Z", 484 | "_id": "5347a59d5d04707f8f000027", 485 | "content": "http://52820a5994da266b19000002.data.riak.dev:80/5347a59d5d04707f8f000027", 486 | "unit_ids": ["5347a59d5d04707f8f000026", "5347a81b94da26f432000009", "534bfa8894da26f43200001d", "534bfb1194da2679f0000009", "534c0a5794da263072000009", "534c0fcb94da26307200001d", "534d520294da262aae000009", "534d526b94da262aae00001d", "534e34e994da267660000009", "534e368d94da26766000001d", "534e36a294da26c974000009", "534e434a94da266b7b000009", "534e4abe94da26c1cf000009", "534e4d3694da267beb000009", "534e572d94da26c857000009", "534e59a394da26c85700001d", "534e5ade94da26c97d00001d", "534f8f8e94da26c97d000031", "535116c394da26c97d000045", "53511e2694da26c97d000059"], 487 | "type": "ressourceImage", 488 | "sid": "53464ef394da26a11400000d", 489 | "data": { 490 | "height": 768, 491 | "width": 1024 492 | }, 493 | "updated_at": "2014-04-18T13:38:34Z", 494 | "_type": "Resource" 495 | }] 496 | }] -------------------------------------------------------------------------------- /psd_rb/ruby/factories/group_unit.rb: -------------------------------------------------------------------------------- 1 | class GroupUnit 2 | 3 | def initialize(node) 4 | @node = node 5 | end 6 | 7 | def type 8 | 'GroupUnit' 9 | end 10 | 11 | def to_json 12 | as_json 13 | end 14 | 15 | def as_json 16 | { 17 | created_at: '2014-05-19T12:53:31Z', 18 | _id: '5379feb120d5ac429400004a', 19 | stype: 'group', 20 | flat_parent_ids: ['534f8f8e94da26c97d00002d'], 21 | data: { 22 | stacks: { 23 | '5379feb120d5ac429400004b'=> { 24 | ref: null, 25 | units: [0, 1], 26 | actions: {}, 27 | thumb: nil, 28 | name: 'Untitled' 29 | }, 30 
| o: ['5379feb120d5ac429400004b'] 31 | }, 32 | world: { 33 | args: [243, 511, 321, 188, 0], 34 | width: 1024, 35 | height: 768, 36 | type: 'TMWorld' 37 | }, 38 | unitName: 'Group_5774', 39 | indexChild: 3, 40 | internalWPid: '5379feb120d5ac429400004c' 41 | }, 42 | flat_child_ids: %w(5379feb120d5ac429400004e 5379feb120d5ac429400004f), 43 | type: 'container', 44 | parent_ids: ['534f8f8e94da26c97d00002c'], 45 | sid: '534f8f8e94da26c97d00003a', 46 | updated_at: '2014-05-19T12:53:31Z', 47 | _type: 'Unit' 48 | } 49 | end 50 | end -------------------------------------------------------------------------------- /psd_rb/ruby/factories/image_unit.rb: -------------------------------------------------------------------------------- 1 | class ImageUnit 2 | 3 | def initialize(node) 4 | @node = node 5 | end 6 | 7 | def type 8 | 'ImageUnit'+( (@node.phas_crop)?(' with crop'):('') ) 9 | end 10 | 11 | def to_json 12 | as_json 13 | end 14 | 15 | def as_json 16 | { 17 | stype: 'image', 18 | type: 'media', 19 | # flat_parent_ids: ["534f8f8e94da26c97d00002d"], 20 | # parent_ids: ["534f8f8e94da26c97d00002c"], 21 | # resource_ids: ["5347a61c5d04707f8f000059"], 22 | data: { 23 | isAutoName: false, 24 | world: { 25 | args: [536, 31, 329, 329, 0], 26 | width: 1024, 27 | height: 768, 28 | type: 'TMWorld' 29 | }, 30 | unitName: @node.ppng_name, 31 | bindings: { 32 | ressources: nil, 33 | properties: {} 34 | }, 35 | # indexChild: 2 36 | }, 37 | _type: 'Unit' 38 | } 39 | end 40 | 41 | end 42 | -------------------------------------------------------------------------------- /psd_rb/ruby/factories/text_unit.rb: -------------------------------------------------------------------------------- 1 | class TextUnit 2 | 3 | def initialize(node) 4 | @node = node 5 | end 6 | 7 | def type 8 | 'TextUnit' 9 | end 10 | 11 | def to_json 12 | as_json 13 | end 14 | 15 | def as_json 16 | { 17 | stype: 'Text', 18 | type: 'media', 19 | # flat_parent_ids: ["534f8f8e94da26c97d00002d"], 20 | # parent_ids: 
["534f8f8e94da26c97d00002c"], 21 | # resource_ids: ["534f8f8e94da26c97d00002b"], 22 | data: { 23 | opacity: 0.5, 24 | world: { 25 | args: [160, 31, 200, 183, 0], 26 | width: 1024, 27 | height: 768, 28 | type: 'TMWorld' 29 | }, 30 | unitName: @node.pget_text, 31 | bindings: { 32 | ressources: nil, 33 | properties: {} 34 | }, 35 | # indexChild: 1 36 | }, 37 | _type: 'Unit' 38 | } 39 | end 40 | end 41 | -------------------------------------------------------------------------------- /psd_rb/ruby/factories/unit_factory.rb: -------------------------------------------------------------------------------- 1 | require_relative 'text_unit' 2 | require_relative 'image_unit' 3 | require_relative 'group_unit' 4 | require_relative 'unknown_unit' 5 | 6 | class UnitFactory 7 | 8 | def self.create_unit(node) 9 | if node.phas_text 10 | return TextUnit.new(node) 11 | elsif node.pis_group 12 | return GroupUnit.new(node) 13 | else 14 | return ImageUnit.new(node) 15 | end 16 | end 17 | 18 | end 19 | -------------------------------------------------------------------------------- /psd_rb/ruby/factories/unknown_unit.rb: -------------------------------------------------------------------------------- 1 | class UnknownUnit 2 | 3 | def initialize(node) 4 | @node = node 5 | end 6 | 7 | def type 8 | 'UnknownUnit' 9 | end 10 | 11 | def to_json 12 | as_json 13 | end 14 | 15 | def as_json 16 | { 17 | Unit: 'Unknown' 18 | } 19 | end 20 | end -------------------------------------------------------------------------------- /psd_rb/ruby/panda_psd.rb: -------------------------------------------------------------------------------- 1 | require 'psd' 2 | require_relative 'util' 3 | require_relative 'unit_manager' 4 | 5 | class PandaPsd 6 | 7 | def initialize(file:nil) 8 | @errors = [] 9 | @info = [] 10 | @orientation = nil 11 | @unitManager = UnitManager.new 12 | @psd = PSD.new(file) 13 | @psd.parse! 
14 | end 15 | 16 | attr_reader :info, :errors 17 | 18 | public 19 | 20 | def check_integrity 21 | check_dimensions 22 | self 23 | end 24 | 25 | def export 26 | @unitManager.export 27 | end 28 | 29 | def parse 30 | check_integrity 31 | go_through 32 | 33 | 34 | # #get an image 35 | # so = @psd.pget_layer_by_name('Smart Object (path d\'illustrator)') 36 | # so.pas_png 37 | # 38 | # #get a shape 39 | # so = @psd.pget_layer_by_name('Path rond') 40 | # so.pas_png 41 | # 42 | # #get a crop position 43 | # so = @psd.pget_layer_by_name('image croppée') 44 | # if so.phas_mask 45 | # so.pas_png # get the whole image without cropping 46 | # puts so.pget_mask_position 47 | # end 48 | # 49 | # #get a text 50 | # so = @psd.pget_layer_by_name('Texte') 51 | # if so.phas_text 52 | # pp so.pget_text_html 53 | # pp so.pget_positions 54 | # pp so.pget_dimensions 55 | # pp so.hidden? 56 | # pp so.pvisible 57 | # end 58 | # puts UnitFactory::create_unit(so).as_json 59 | 60 | self 61 | end 62 | 63 | private 64 | 65 | def go_through 66 | return nil unless @errors.empty? 67 | @psd.tree.descendants_layers.each do |layer| 68 | @unitManager.create_unit(layer) if layer.pvisible_tree? 69 | end 70 | end 71 | 72 | def check_dimensions 73 | return nil unless @errors.empty? 74 | if @psd.pget_dimensions == {width: 2048, height: 1536} 75 | @orientation = 'landscape' 76 | elsif @psd.pget_dimensions == {width: 1536, height: 2048} 77 | @orientation = 'portrait' 78 | else 79 | @errors << 'Dimensions are wrong. Only 2048x1536 or 1536x2048 are supported.' 
80 | end 81 | end 82 | 83 | end 84 | 85 | 86 | 87 | -------------------------------------------------------------------------------- /psd_rb/ruby/unit_manager.rb: -------------------------------------------------------------------------------- 1 | require_relative 'factories/unit_factory' 2 | 3 | class UnitManager 4 | 5 | def initialize 6 | @units = [] 7 | end 8 | attr_reader :units 9 | 10 | def create_unit(layer) 11 | unit = UnitFactory::create_unit(layer) 12 | @units << unit 13 | unit 14 | end 15 | 16 | def export 17 | @units.map{|unit| unit.as_json} 18 | end 19 | 20 | end 21 | -------------------------------------------------------------------------------- /psd_rb/ruby/util.rb: -------------------------------------------------------------------------------- 1 | 2 | PSD.send(:define_method, 'pget_dimensions') do 3 | {width:tree.document_width, height:tree.document_height} 4 | end 5 | 6 | PSD.send(:define_method, 'pget_layer_by_name') do |name| 7 | tree.children_layers.map { |layer| return layer if layer.name == name }.compact.first 8 | end 9 | 10 | 11 | PSD::Node::Base.send(:define_method, 'pget_layer_by_name') do |name| 12 | children_layers.map { |layer| return layer if layer.name == name }.compact.first 13 | end 14 | 15 | PSD::Node::Base.send(:define_method, 'pget_dimensions') do 16 | {width:width, height:height} 17 | end 18 | 19 | PSD::Node::Base.send(:define_method, 'ppng_name') do 20 | name.gsub(/\s/, '_')+'.png' 21 | end 22 | 23 | PSD::Node::Base.send(:define_method, 'pas_png') do 24 | image.save_as_png '../output/'+ppng_name 25 | end 26 | 27 | PSD::Node::Base.send(:define_method, 'pis_group') do 28 | group? 29 | end 30 | 31 | PSD::Node::Base.send(:define_method, 'pvisible_tree?') do 32 | see_it = true 33 | see_it = visible? unless root? 34 | # noinspection RubyScope 35 | def rec(node, see_it) 36 | see_it = node.visible? unless node.root? || (!see_it) 37 | see_it = rec(node.parent, see_it) unless node.root? 
|| (!see_it) 38 | see_it 39 | end 40 | see_it = rec(parent, see_it) unless root? || (!see_it) 41 | see_it 42 | end 43 | 44 | PSD::Node::Base.send(:define_method, 'pget_mask_position') do 45 | { 46 | top: mask.top-top, 47 | left: mask.left-left, 48 | right: right-mask.right, 49 | bottom: bottom-mask.bottom 50 | } 51 | end 52 | 53 | PSD::Node::Base.send(:define_method, 'phas_mask') do 54 | (mask.size > 0) 55 | end 56 | 57 | PSD::Node::Base.send(:define_method, 'phas_crop') do 58 | phas_mask && (pget_mask_position != {top:0, left:0, right:0, bottom:0}) 59 | end 60 | 61 | PSD::Node::Base.send(:define_method, 'phas_text') do 62 | text != nil 63 | end 64 | 65 | PSD::Node::Base.send(:define_method, 'pget_text') do 66 | text[:value] if phas_text 67 | end 68 | 69 | PSD::Node::Base.send(:define_method, 'pget_text_html') do 70 | ''+ 71 | '

'+ 72 | ""+ 73 | pget_text+ 74 | ''+ 75 | '

'+ 76 | '
' if phas_text 77 | end 78 | 79 | PSD::Node::Base.send(:define_method, 'pget_positions') do 80 | { 81 | top: top, 82 | left: left, 83 | right: right, 84 | bottom: bottom 85 | } 86 | end 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | # module Panda 99 | 100 | # module TreeUtil 101 | # 102 | # def get_layer_by_name(node, name) 103 | # return nil unless @errors.empty? 104 | # node = node.tree if node.instance_of?(PSD) 105 | # # noinspection RubyResolve 106 | # node.children_layers.map { |layer| return layer if layer.name == name }.compact.first 107 | # end 108 | 109 | # def get_dimensions(node) 110 | # return nil unless @errors.empty? 111 | # # noinspection RubyResolve 112 | # return {width: node.tree.document_width, height: node.tree.document_height} if node.instance_of?(PSD) 113 | # {width:node.width, height:node.height} 114 | # end 115 | # 116 | # end 117 | 118 | # module NodeUtil 119 | 120 | # def is_visible(node) 121 | # return nil unless @errors.empty? 122 | # node.visible && !node.hidden? 123 | # end 124 | 125 | # def as_png(node, name:nil) 126 | # return nil unless @errors.empty? 127 | # name = node.name.gsub(/\s/, '_')+'.png' if name.nil? || name.empty? 128 | # node.image.save_as_png './output/'+name 129 | # end 130 | 131 | # def has_mask(node) 132 | # return nil unless @errors.empty? 133 | # (node.mask.size > 0) && (get_mask_position(node) != {top:0, left:0, right:0, bottom:0}) 134 | # end 135 | 136 | # def get_mask_position(node) 137 | # return nil unless @errors.empty? 138 | # node.mask.instance_exec {{ 139 | # top: top-node.top, 140 | # left: left-node.left, 141 | # right: node.right-right, 142 | # bottom: node.bottom-bottom 143 | # }} 144 | # end 145 | 146 | # def has_text(node) 147 | # return nil unless @errors.empty? 148 | # node.text != nil 149 | # end 150 | 151 | # def get_text(node) 152 | # return nil unless @errors.empty? 
153 | # node.text[:value] 154 | # end 155 | 156 | # def get_text_html(node) 157 | # return nil unless @errors.empty? 158 | # ""+ 159 | # "

"+ 160 | # ""+ 161 | # get_text(node)+ 162 | # ""+ 163 | # "

"+ 164 | # "
" 165 | # end 166 | 167 | # def get_positions(node) 168 | # return nil unless @errors.empty? 169 | # node.instance_exec {{ 170 | # top: top, 171 | # left: left, 172 | # right: right, 173 | # bottom: bottom 174 | # }} 175 | # end 176 | 177 | # end 178 | 179 | # class PsdUtil 180 | # # include TreeUtil 181 | # # include NodeUtil 182 | # 183 | # def initialize(file:nil) 184 | # @foreground_page_name = 'foreground' 185 | # @background_page_name = 'background' 186 | # end 187 | # 188 | # 189 | # end 190 | # end -------------------------------------------------------------------------------- /psd_rb/test.rb: -------------------------------------------------------------------------------- 1 | require_relative 'ruby/panda_psd' 2 | require 'benchmark' 3 | require 'pp' 4 | 5 | 6 | Benchmark.bm(10) do |x| 7 | x.report('PandaPsd:') { 8 | psd = PandaPsd.new(file:'../psds/rmn_v3.psd').parse 9 | psd = PandaPsd.new(file:'psds/micka.psd').parse 10 | pp 'infos -> '+psd.info.join(', ') 11 | pp 'errors -> '+psd.errors.join(', ') 12 | } 13 | end 14 | -------------------------------------------------------------------------------- /psdtools/A2 16 column.ai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/A2 16 column.ai -------------------------------------------------------------------------------- /psdtools/Logo.psd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/Logo.psd -------------------------------------------------------------------------------- /psdtools/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | __version__ = '20130515' 3 | 4 | if __name__ == '__main__': print __version__ 5 | 
-------------------------------------------------------------------------------- /psdtools/brochure.indd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/brochure.indd -------------------------------------------------------------------------------- /psdtools/brochure_rtangfold_11x17_OUTh.indd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/brochure_rtangfold_11x17_OUTh.indd -------------------------------------------------------------------------------- /psdtools/buttons.psd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/buttons.psd -------------------------------------------------------------------------------- /psdtools/main.py: -------------------------------------------------------------------------------- 1 | from psd_tools import PSDImage 2 | from psd_tools.constants import BlendMode 3 | import psd_tools.user_api.psd_image 4 | import json 5 | import uuid 6 | 7 | """ 8 | {"pages":[{"images":[], "paragraphs":[{"size":0, "width":0, "string":"", "x":0, "y":0, "font":"", "height":0}]}]} 9 | or 10 | { 11 | "pages": [ 12 | { 13 | "images": [ 14 | "961dfcc0-c1eb-11e2-92af-040ccedc7e34_p0.png" 15 | ], 16 | "paragraphs": [ 17 | { 18 | "size": 98, 19 | "width": 587, 20 | "string": "Book Title", 21 | "y": -98, 22 | "x": -324, 23 | "font": "Georgia", 24 | "height": 705 25 | } 26 | ] 27 | } 28 | ] 29 | } 30 | """ 31 | class ImportPSD(object): 32 | 33 | """ Will Parse a PSD and store its data """ 34 | data = {'pages':[]} 35 | imagePath = './' 36 | psd = None 37 | PSDFilePath = None 38 | 39 | @classmethod 40 | def __init__(self, PSDFilePath, 
imagePath): 41 | self.PSDFilePath = PSDFilePath 42 | self.imagePath = imagePath 43 | self.psd = PSDImage.load(PSDFilePath) 44 | self.nb_layers = self.countLayers(layers=self.psd.layers) 45 | print "Layers to process: %d" % self.nb_layers 46 | 47 | @classmethod 48 | def countLayers(self, layers, layer_num=0): 49 | for sheet in layers: 50 | if isinstance(sheet, psd_tools.user_api.psd_image.Layer): 51 | layer_num += 1 52 | elif isinstance(sheet, psd_tools.user_api.psd_image.Group): 53 | if sheet.visible_global: 54 | layer_num = self.countLayers(layers=sheet.layers, layer_num=layer_num) 55 | return layer_num 56 | 57 | 58 | @classmethod 59 | def parse(self): 60 | self.data['pages'].append(self.browseSheets(sheets=self.psd.layers, parentName="root")) 61 | 62 | @classmethod 63 | def browseSheets(self, sheets, parentName, page_num=1): 64 | array = {} 65 | for sheet in sheets: 66 | if isinstance(sheet, psd_tools.user_api.psd_image.Layer): 67 | print "Processing layer: %d" % self.nb_layers 68 | self.nb_layers -= 1 69 | # This sheet is a layer; it may be an image or some text. 70 | if sheet.text_data is not None: 71 | # The layer holds text. 72 | dic = dict({ 'name': sheet.name, 73 | 'x': sheet.bbox.x1, 74 | 'y': sheet.bbox.y2, 75 | 'width': sheet.bbox.width, 76 | 'height': sheet.bbox.height, 77 | 'string': sheet.text_data.text, 78 | 'font': None 79 | }) 80 | if 'paragraphs' not in array: 81 | array['paragraphs'] = [] 82 | array['paragraphs'].append(dic) 83 | else: 84 | # The layer holds an image. 85 | imageName = str(uuid.uuid1())+'_'+sheet.name+'_'+parentName+'.png' 86 | if sheet is not None and sheet.as_PIL() is not None: 87 | sheet.as_PIL().save(self.imagePath+imageName) 88 | else: 89 | imageName = None 90 | if 'images' not in array: 91 | array['images'] = [] 92 | array['images'].append(imageName) 93 | elif isinstance(sheet, psd_tools.user_api.psd_image.Group): 94 | # This sheet is a group and may contain many layers. 95 | if sheet.visible_global: 96 | arr = 
self.browseSheets(sheets=sheet.layers, parentName=sheet.name, page_num=page_num+1) 97 | self.data['pages'].append(arr) 98 | return array 99 | 100 | @classmethod 101 | def toJson(self): 102 | """ Convert the parsed data into a JSON string """ 103 | return json.dumps(self.data) 104 | 105 | 106 | 107 | def run(pdf_file, image_folder): 108 | importedPSD = ImportPSD(PSDFilePath=pdf_file, imagePath=image_folder) 109 | importedPSD.parse() 110 | jsonString = importedPSD.toJson() 111 | return jsonString 112 | 113 | 114 | 115 | 116 | if __name__ == '__main__': 117 | import sys 118 | if len(sys.argv) == 3: 119 | print run(sys.argv[1], sys.argv[2]) 120 | else: 121 | print "usage: %s psd_file_path generated_images_path/ (eg: python %s Logo.psd './images/')" % (sys.argv[0], sys.argv[0]) 122 | -------------------------------------------------------------------------------- /psdtools/test-text.psd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Micka33/content-extractor/a5f732c068e2eb32b901bdc43a28b1962acfd910/psdtools/test-text.psd --------------------------------------------------------------------------------
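
Note on psdtools/main.py: every method of `ImportPSD`, including `__init__`, is decorated with `@classmethod`, and `data` is a class attribute. As a result `self` is the class itself, so pages parsed from one file stay in `data` when a second file is parsed in the same process. The sketch below shows one possible instance-based layout of the same traversal. It is an illustration under assumptions, not the project's code: `Layer`, `Group`, and `PageCollector` are hypothetical stand-ins (they are not the psd_tools API), image saving is reduced to building a file name, and it targets Python 3 rather than the Python 2 used in the repository.

```python
import json


class Layer(object):
    """Hypothetical stand-in for a psd_tools layer (not the real API)."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text  # None means the layer is treated as an image


class Group(object):
    """Hypothetical stand-in for a psd_tools group (not the real API)."""

    def __init__(self, name, layers, visible=True):
        self.name = name
        self.layers = layers
        self.visible = visible


class PageCollector(object):
    """Keeps parsed pages on the instance instead of on the class."""

    def __init__(self):
        # A fresh dict per instance, so two parses never share results.
        self.data = {'pages': []}

    def parse(self, layers):
        self.data['pages'].append(self.browse(layers))

    def browse(self, sheets):
        page = {}
        for sheet in sheets:
            if isinstance(sheet, Group):
                # Each visible group becomes its own page, as in browseSheets.
                if sheet.visible:
                    self.data['pages'].append(self.browse(sheet.layers))
            elif sheet.text is not None:
                page.setdefault('paragraphs', []).append(
                    {'name': sheet.name, 'string': sheet.text})
            else:
                page.setdefault('images', []).append(sheet.name + '.png')
        return page

    def to_json(self):
        return json.dumps(self.data)


tree = [Layer('title', text='Book Title'),
        Group('cover', [Layer('photo')]),
        Group('hidden', [Layer('draft')], visible=False)]

first = PageCollector()
first.parse(tree)
second = PageCollector()
second.parse(tree)
print(first.to_json())
```

With this layout, `first` and `second` each end up with their own two pages (the visible group and the root), whereas the class-level `data` in `ImportPSD` would hold four pages after the second parse.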