├── .gitattributes
├── .gitignore
├── README.md
└── tabula.jpg


/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | /New folder
2 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # tocPDF
  2 | *by Amin Yahyaabadi*
  3 | 
  4 | Generates PDF bookmarks from the table of contents printed at the beginning of a PDF file.
  5 | 
  6 |  The plan is to automate the whole procedure (https://github.com/aminya/tocPDF#automated).
  7 | 
  8 | 
  9 | Until then, here is the manual procedure:
 10 | ## Manual:
 11 | ### Step 1: Extract the table of contents pages from the PDF
 12 | Use Chrome or software that you already have to extract the pages that contain the table of contents.
 13 | 
 14 | Tutorial for extracting pages using Chrome
 15 | https://www.techadvisor.co.uk/how-to/software/how-extract-pages-from-pdf-3679232/
 16 | 
 17 | We refer to this file as tocPDF.
 18 | 
 19 | ### Step 2: Extract table of contents text
 20 | Here we extract the text from tocPDF.
 21 | 
 22 | Even if your PDF file is searchable, copying the text directly usually does not preserve its table-like layout.
 23 | 
 24 | Preferred Methods:
 25 | 
 26 | * ####  Tabula technology
 27 | For searchable PDFs only - https://tabula.technology/
 28 | 
 29 | 	* Download and run the software
 30 | 	* Select the table of contents on each page
 31 | 	* Click "Preview & Export Extracted Data"
 32 | 	* Export to CSV format
 33 | 
 34 | * #### Using OCR.space
 35 | For both scanned and searchable PDFs - https://ocr.space/
 36 | 
 37 | 	* Upload tocPDF
 38 | 	* Check the "Do receipt scanning and/or table recognition" option
 39 | 	* Use the "Just extract text and show overlay (fastest option)" option
 40 | 	* Download or copy-paste the generated text
 41 | 
 42 | 
 43 | We refer to the generated text as tocText.
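If the CSV exported from Tabula keeps titles and page numbers in separate columns, moving the page number to the front of each line can be scripted. The following is a minimal Python sketch, not part of this repository. It assumes (hypothetically) that the last non-empty cell of each row is the page number and that the remaining cells form the title; the function name `csv_to_toc_lines` is made up for illustration.

```python
import csv
import io


def csv_to_toc_lines(csv_text):
    """Turn CSV rows into 'page title' lines.

    Assumption: the last non-empty cell of a row is the page number and
    the cells before it form the title. Rows that do not fit are skipped.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row if cell.strip()]
        if len(cells) < 2 or not cells[-1].isdigit():
            continue  # no recognizable page number in this row
        title = " ".join(cells[:-1])
        lines.append(f"{cells[-1]} {title}")
    return lines


# Made-up sample data in the shape described above.
sample = 'Cover,1\nTable of Contents,2\n"Chapter 1, Basics",5\nnot a toc row\n'
print("\n".join(csv_to_toc_lines(sample)))
```

Real exports often need manual review afterwards, since OCR and table detection can split or merge cells.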
 44 | 
 45 | 
 46 | ## Instead of the following steps, you can use pdf-bookmark, which does a similar job with a GUI:
 47 | https://github.com/ifnoelse/pdf-bookmark/blob/master/README-EN.md
 48 | 
 49 | ### Step 3: Prepare the text of the table of contents
 50 | 
 51 | Open tocText (txt or csv) in a spreadsheet editor (MS Excel or Google Sheets) or a text editor.
 52 | 
 53 | Edit the text so that each line starts with the page number, followed by the title. Prefix a line with one `+` per nesting level, e.g.
 54 | ```
 55 | 1 Cover
 56 | 2 Table of Contents
 57 | 5 Chapter 1
 58 | +6 Subchapter1
 59 | ++7 Sub-Subchapter1
 60 | 25 Chapter 2
 61 | ```
 62 | Don't forget to add the offset to each page number: the page numbers of the PDF file usually differ from those printed in the document by an offset.
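Adding the offset by hand is error-prone for long tables of contents, so it can be scripted. The sketch below is a hedged Python example, not part of this repository. It assumes the line format shown above (optional `+` markers, page number, title) and a single constant offset; `apply_offset` is a hypothetical helper name.

```python
import re

# Optional '+' nesting markers, then the page number, then the title.
TOC_LINE = re.compile(r"^(\+*)(\d+)\s+(.*)$")


def apply_offset(lines, offset):
    """Add `offset` to the page number of every recognized TOC line."""
    shifted = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match is None:
            shifted.append(line)  # leave unrecognized lines untouched
            continue
        plus, page, title = match.groups()
        shifted.append(f"{plus}{int(page) + offset} {title}")
    return shifted


# Page numbers as printed in the book; the PDF is assumed to have 4 extra leading pages.
printed = ["1 Chapter 1", "+2 Subchapter1", "++3 Sub-Subchapter1", "21 Chapter 2"]
print("\n".join(apply_offset(printed, 4)))
```

With an offset of 4, the printed page numbers 1, 2, 3, and 21 become 5, 6, 7, and 25, which matches the chapter lines of the example above.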
 63 | 
 64 | ### Step 4: Download k2pdfopt:
 65 | http://willus.com/k2pdfopt/download/
 66 | 
 67 | ### Step 5: (Windows only) Disable the GUI:
 68 | 
 69 | Disable the GUI by following this tutorial:
 70 | http://willus.com/k2pdfopt/help/nogui.shtml
 71 | 
 72 | Then drag the original PDF file onto your shortcut.
 73 | 
 74 | 
 75 | ### Step 6: Run the command:
 76 | #### Windows:
 77 | Copy toc.txt and the source PDF file into the folder of your shortcut for convenience.
 78 | 
 79 | Copy and paste the following command into the terminal and press Enter.
 80 | ```
 81 | -mode copy -n -toclist toc.txt srcfile.pdf -o outfile.pdf
 82 | ```
 83 | Press Enter again to start bookmarking.
 84 | 
 85 | #### OSX or Linux:
 86 | ```
 87 | k2pdfopt -mode copy -n -toclist toc.txt srcfile.pdf -o outfile.pdf
 88 | ```
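The command above can also be launched from a script, which is handy when bookmarking several files. This is a minimal Python sketch under stated assumptions: `k2pdfopt` is on the `PATH`, and the helper names `build_command` and `add_bookmarks` are hypothetical. Sending a newline on standard input mimics the extra Enter mentioned in the Windows instructions; whether your k2pdfopt build waits for it is an assumption.

```python
import shutil
import subprocess


def build_command(toc_path, src_pdf, out_pdf, exe="k2pdfopt"):
    """Return the argument list for the bookmarking command shown above."""
    return [exe, "-mode", "copy", "-n", "-toclist", toc_path, src_pdf, "-o", out_pdf]


def add_bookmarks(toc_path, src_pdf, out_pdf):
    """Run k2pdfopt; raises if the executable is missing or exits with an error."""
    cmd = build_command(toc_path, src_pdf, out_pdf)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("k2pdfopt was not found on PATH")
    # Assumption: a single newline answers the "press Enter" prompt, if any.
    subprocess.run(cmd, input="\n", text=True, check=True)


# Only prints the command; call add_bookmarks(...) to actually run it.
print(" ".join(build_command("toc.txt", "srcfile.pdf", "outfile.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.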
 89 | 
 90 | 
 91 | ## Automated:
 92 | 
 93 | For now, I plan to start with available software (e.g. k2pdfopt) and later make the functionality Julia-native (when [PDFIO.jl](https://github.com/sambitdash/PDFIO.jl/issues/66) adds PDF writing capability).
 94 | 
 95 | Current algorithm plan:
 96 | * The user provides the page numbers that contain the table of contents.
 97 | * Julia reads those pages from the PDF.
 98 | * Julia extracts these pages (here the user can be asked to crop the borders).
 99 | * Julia sends the extracted pages to https://ocr.space/ for OCR and receives the text of the table of contents (using the [available APIs (Python, C++, etc)](https://ocr.space/ocrapi)).
100 | * Julia edits the received text into the required format (here the user can be asked to review it). The prepared text file is saved.
101 | * Julia calls an external program (e.g. k2pdfopt from the command line), which reads the original PDF file and the text file, generates the bookmarks, and saves the resulting PDF.
102 | 
103 | 
104 | * Also, if the PDF file is searchable, Julia can inspect the fonts used throughout the PDF and, for example, get the text set in bold fonts ([Infix PDF Editor](https://www.iceni.com/blog/how-to-bookmark-pages-in-a-pdf/) does this). The user could also specify the fonts manually (the expensive [Evermap AutoBookmark](https://www.evermap.com/autobookmark.asp) plugin for Adobe and [Nitro PDF](https://www.gonitro.com/) do this).
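The text-formatting step of the plan above could look roughly like the following. This is a hedged sketch in Python rather than Julia, and it is not the planned implementation. It assumes OCR lines of the form `title ..... page` and guesses the nesting level from the section number (`1.1` is one level deep, `1.1.1` two), which is a heuristic introduced here for illustration; `format_entry` is a hypothetical name.

```python
import re

# Title text, optional dot leaders or spaces, then a trailing page number.
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]*\s(?P<page>\d+)$")
# Leading section number such as "2", "2.1" or "2.1.3".
SECTION = re.compile(r"^(\d+(?:\.\d+)*)\.?(?:\s|$)")


def format_entry(line, offset=0):
    """Convert one OCR line to '<+ markers><page> <title>', or None if it does not fit."""
    entry = ENTRY.match(line.strip())
    if entry is None:
        return None
    title = entry.group("title").strip()
    section = SECTION.match(title)
    # Heuristic: each dot inside the section number adds one nesting level.
    depth = section.group(1).count(".") if section else 0
    page = int(entry.group("page")) + offset
    return f"{'+' * depth}{page} {title}"


# Made-up OCR output for illustration.
ocr_lines = ["1 Basics ........ 1", "1.1 Setup .... 2", "1.1.1 Details 3", "Index 40"]
for raw in ocr_lines:
    print(format_entry(raw, offset=4))
```

Lines that return `None` are natural candidates for the user review mentioned in the plan.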
105 | 
106 | 
107 | ## Other Manual Methods:
108 | #### Other method using JPdfBookmarks
109 | https://sourceforge.net/projects/jpdfbookmarks/
110 | 
111 | from https://ebooks.stackexchange.com/a/7763/12921
112 | 
113 |     Prepare the tocText file such that
114 | 
115 |     Chapter 1. The Beginning/23
116 |         Para 1.1 Child of The Beginning/25,FitWidth,96
117 |             Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
118 |     Chapter 2. The Continue/30,TopLeft,120,42
119 |         Para 2.1 Child of The Beginning/32,FitPage
120 | 
121 |     You can OCR the TOC and use regex to fix it.
122 | 
123 |     Load that TOC
124 | 
125 |     Expand all bookmarks (Ctrl + E), select all of them, then go to Tools > Apply Page Offset
126 | 
127 |     Enter the first pages that outmatch the page number in the TOC
128 | 
129 | You can read its manual (http://jpdfbookmarks.altervista.org/InsertBookmarks.html#1_3_1) or watch a quick video tutorial (https://youtu.be/7DUkvH7_wII?t=30). It also has a command-line mode and runs on Linux and macOS.
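If you already prepared a toc file in the `+`-prefixed format from Step 3, it can be converted to the `Title/page` layout shown above instead of being retyped. The Python sketch below is an illustration only. It assumes one tab of indentation per nesting level, which should be checked against the manual, and `to_jpdfbookmarks` is a hypothetical helper name.

```python
import re

# Optional '+' nesting markers, then the page number, then the title.
TOC_LINE = re.compile(r"^(\+*)(\d+)\s+(.*)$")


def to_jpdfbookmarks(lines, indent="\t"):
    """Convert '<+ markers><page> <title>' lines to indented 'title/page' lines."""
    converted = []
    for line in lines:
        match = TOC_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not TOC entries
        plus, page, title = match.groups()
        # Assumption: one indent unit per '+' marker expresses the hierarchy.
        converted.append(f"{indent * len(plus)}{title}/{page}")
    return converted


toc = ["5 Chapter 1", "+6 Subchapter1", "++7 Sub-Subchapter1", "25 Chapter 2"]
print("\n".join(to_jpdfbookmarks(toc)))
```

The optional view settings from the example above (such as `FitWidth,96`) are not generated here and would have to be appended by hand.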
130 | 
131 | #### Other Methods for step 2:
132 | 
133 | * Tesseract OCR:
134 | 	https://github.com/tesseract-ocr/tesseract
135 | 
136 | 	There is a good tutorial on extracting table data from PDFs with Tesseract OCR:
137 | 	https://web.archive.org/web/20141022033241/http://craiget.com/extracting-table-data-from-pdfs-with-ocr/
138 | 
139 | * Using OnlineOCR.net - Free up to a limit:
140 | https://www.onlineocr.net/
141 | 
142 | 	* Register on the website (to remove the page number limitation) and log in
143 | 	* Select the txt output option, upload tocPDF, and convert your file
144 | 
145 | * A related Stack Overflow question:
146 | https://stackoverflow.com/questions/6173439/can-ocr-software-reliably-read-values-from-a-table
147 | 
148 | #### References:
149 | 
150 | https://www.willus.com/k2pdfopt/help/k2menu.shtml
151 | 
152 | https://www.willus.com/k2pdfopt/help/options.shtml
153 | 
154 | 
155 | https://ebooks.stackexchange.com/questions/107/how-to-create-clickable-table-of-contents-in-a-pdf
156 | 


--------------------------------------------------------------------------------
/tabula.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aminya/tocPDF/abfa12b7151d6fc6703046e33125dc7035cd2b14/tabula.jpg


--------------------------------------------------------------------------------