250 |
251 | Associated British Foods plc
252 | Weston Centre
253 | 10 Grosvenor Street London W1K 4QY
254 | Tel + 44 (0) 20 7399 6500
255 | Fax + 44 (0) 20 7399 6580
256 | For an accessible version of the Annual Report and Accounts please visit www.abf.co.uk
257 |
258 |
259 | ...
260 |
261 |
262 |
263 |
264 | • Acquisition of the leading Iberian sugar producer, Azucarera Ebro
265 | • Sale of Polish sugar business
266 | • Restructuring of US packaged oils business - new joint venture, Stratas
267 | • Zambian cane sugar expansion completed - capacity doubled
268 | • Investment in Chinese beet and cane sugar
269 | • Enzyme capacity investment in Finland completed
270 | • Yeast and yeast extracts plant under construction in Harbin
271 | • New Primark stores in UK and Spain and first openings in the Netherlands, Germany and Portugal
272 | • US Private Placement secures long-term non-bank finance
273 |
274 |
275 | This report has been printed on revive 50:50 Silk paper.
276 | This paper is made from pre and post consumer waste and virgin wood fibre, independently certified in accordance with the Forest Stewardship Council (FSC).
277 | It is manufactured at a mill that is certified to ISO 14001 environmental management standards. The pulp is bleached using an elemental chlorine free process. The inks used are all vegetable oil based.
278 | Printed at St Ives Westerham Press Ltd, ISO 14001, FSC certified and CarbonNeutral®
279 |
280 | ....
281 |
282 | This report contains forward-looking statements. These have been made by the directors in good faith based on the information available to them up to the time of their approval of this report. The directors can give no assurance that these expectations will prove to have been correct. Due to the inherent uncertainties, including both economic and business risk factors underlying such forward-looking information, actual results may differ materially from those expressed or implied by these forward-looking statements. The directors undertake no obligation to update any forward-looking statements whether as a result of new information, future events or otherwise.
283 |
284 |
285 |
286 |
287 |
288 | ```
289 |
290 | ### ID Formats
291 | The ID of an element is defined by its parent structure with an incremental counter:
292 |
293 | - page = page
294 | - paragraph = p
295 |
296 | **Example**
297 |
298 | Page 1, paragraph 3 would be written as `page1p3`.
299 |
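The ID scheme above can be illustrated with a small sketch. The class `ElementId` and its methods are hypothetical names chosen for this example, not part of the PDFExtract code; the sketch only assumes the `page<N>p<M>` pattern described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFExtract): builds and parses IDs such as "page1p3". */
final class ElementId {
    // "page" + page counter, optionally followed by "p" + paragraph counter.
    private static final Pattern ID = Pattern.compile("page(\\d+)(?:p(\\d+))?");

    private ElementId() {}

    /** ID of a page: the prefix "page" plus the page counter. */
    static String pageId(int page) {
        return "page" + page;
    }

    /** ID of a paragraph: the parent page ID plus "p" and the paragraph counter. */
    static String paragraphId(int page, int paragraph) {
        return pageId(page) + "p" + paragraph;
    }

    /** Parses an ID into {page, paragraph}; paragraph is 0 when the ID names a page only. */
    static int[] parse(String id) {
        Matcher m = ID.matcher(id);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a valid element ID: " + id);
        }
        int page = Integer.parseInt(m.group(1));
        int paragraph = m.group(2) == null ? 0 : Integer.parseInt(m.group(2));
        return new int[] {page, paragraph};
    }

    public static void main(String[] args) {
        System.out.println(paragraphId(1, 3)); // page 1, paragraph 3
        int[] parts = parse("page12p7");
        System.out.println("page=" + parts[0] + " paragraph=" + parts[1]);
    }
}
```

Because each ID embeds its parent's ID as a prefix, the page a paragraph belongs to can be recovered from the paragraph ID alone.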
300 | ### FontName
301 | The name of the font used for each paragraph is defined in the paragraph header. If the font changes briefly within the paragraph, for example to highlight a color or to set part of the text in bold or italic, those changes are removed and only the primary font is represented.
302 |
303 | ### Joining Lines
304 | Lines are joined automatically when they are split in the original content. Absolute line/sentence join rules are provided in the configuration file. Automated sentence joining is provided by the Paracrawl Sentence Join tool (https://github.com/paracrawl/sentence-join).
305 |
306 | ### Repairing Object Sequences
307 | Some PDF tools export malformed PDF content in some cases. For example, instead of rendering a word as a single object, each letter is rendered as an individual object. There are many such exceptions that need to be handled. Common exceptions are handled within the Java code.
308 |
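As an illustration of the letter-per-object case described above, the following sketch merges adjacent single-letter objects back into words by looking at the horizontal gap between them. This is not the PDFExtract implementation: `GlyphMerger`, `TextObject`, and the `maxGap` threshold are hypothetical, and the sketch assumes the objects belong to one line and are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not the PDFExtract code): rebuilds words from per-letter objects. */
final class GlyphMerger {

    /** A text object with its left edge and width in page units (assumed model). */
    static final class TextObject {
        final String text;
        final double x;
        final double width;

        TextObject(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    private GlyphMerger() {}

    /**
     * Joins objects whose gap to the previous object is at most maxGap.
     * Assumes the objects are on one line and sorted by x.
     */
    static List<String> mergeIntoWords(List<TextObject> objects, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = 0;
        for (TextObject o : objects) {
            // A gap wider than maxGap is treated as a word boundary.
            if (current.length() > 0 && o.x - previousRight > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(o.text);
            previousRight = o.x + o.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<TextObject> letters = Arrays.asList(
                new TextObject("P", 0, 5),
                new TextObject("D", 5, 5),
                new TextObject("F", 10, 5),
                new TextObject("t", 20, 3),
                new TextObject("o", 23, 4));
        System.out.println(mergeIntoWords(letters, 2.0));
    }
}
```

A fixed gap threshold is the simplest possible rule; a real document would likely need a threshold that scales with font size.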
309 | ## Search and Replace Characters
310 | Some PDF creation tools transform characters so that words look correct on the screen but do not use the correct letters (in terms of actual Unicode values).
311 |
312 | For example: (A) first (B) ﬁrst
313 |
314 | Both of these look the same, but the "fi" in A is the two letters "f" and "i", while the "ﬁ" in B is the single ligature character "ﬁ" (U+FB01).
315 |
316 | A list of these characters can be found in the configuration file in the same folder as the PDFExtract.jar file. Additional search and replace characters can be added as needed. This search and replace is performed when processing words and merging them into lines.
317 |
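The effect of the search and replace step can be sketched as follows. The class `LigatureReplacer` and its hard-coded table are assumptions made for this example; PDFExtract reads its list from the configuration file, whose format is not reproduced here. The table covers only the five Latin "f" ligatures from the Unicode Alphabetic Presentation Forms block.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not the PDFExtract code): replaces ligature characters in a word. */
final class LigatureReplacer {
    // Search character -> replacement letters. A hard-coded stand-in for the configuration file.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        REPLACEMENTS.put("\uFB00", "ff");
        REPLACEMENTS.put("\uFB01", "fi");
        REPLACEMENTS.put("\uFB02", "fl");
        REPLACEMENTS.put("\uFB03", "ffi");
        REPLACEMENTS.put("\uFB04", "ffl");
    }

    private LigatureReplacer() {}

    /** Returns the word with every listed search character replaced by its plain letters. */
    static String clean(String word) {
        String result = word;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String withLigature = "\uFB01rst"; // renders like "first" but starts with U+FB01
        System.out.println(withLigature.equals("first"));        // false: different code points
        System.out.println(clean(withLigature).equals("first")); // true after replacement
    }
}
```

For this particular set of characters, `java.text.Normalizer` with form NFKC gives the same result, but an explicit table keeps the replacements limited to the characters that are listed.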
318 | ## Performance
319 | The processing has been optimized and multithreaded. Reducing a large file can still take some time. A 50 MB PDF can be extracted, cleaned and stored in as little as 10 KB, depending on the content.
320 |
321 | Previously, V1 and V2 used the Apache PDFBox library to extract the initial DOM; V2 was a more optimized version that still used PDFBox. V3 is a complete rewrite that uses the Poppler toolkit, which is considerably faster. Poppler has some limitations and issues with paragraph font merging when the fonts differ, but these have been worked around externally in the PDFExtract code.
322 |
323 | 1. Single file
324 |
325 | | Name | Size (KB) | V1 | V2 | V3 |
326 | | --- | --- | --- | --- | --- |
327 | | sample.pdf | 2.96 | 00:00:01.630 | 00:00:00.662 | 00:00:00.394 |
328 | | sample2.pdf | 34.72 | 00:00:01.698 | 00:00:00.909 | 00:00:00.454 |
329 | | sample3.pdf | 597.78 | 00:00:04.034 | 00:00:01.801 | 00:00:00.717 |
330 | | sample4.pdf | 3,462.57 | 00:01:38.810 | 00:00:32.910 | 00:00:08.949 |
331 |
332 | 2. Batch file, 10 files, 10 threads
333 |
334 | | Name | Size (KB) | V1 | V2 | V3 |
335 | | --- | --- | --- | --- | --- |
336 | | sample.pdf | 2.96 | 00:00:01.570 | 00:00:00.791 | 00:00:00.445 |
337 | | sample2.pdf | 34.72 | 00:00:02.452 | 00:00:01.377 | 00:00:00.718 |
338 | | sample3.pdf | 597.78 | 00:00:07.693 | 00:00:02.542 | 00:00:01.124 |
339 | | sample4.pdf | 3,462.57 | 00:03:35.353 | 00:00:36.036 | 00:00:12.594 |
340 |
341 |
342 | ## TODO
343 | The following features are planned for the future:
344 | - Right-to-left languages
345 |   - This code is untested on right-to-left languages and may need to be modified to support languages such as Arabic.
346 | - Vertical scripts
347 |   - This code is untested on vertical scripts and may need to be modified to support languages such as Japanese when written down the page.
348 |
349 |
350 | ----
351 | ## FAQ
352 | ### Can the HTML output be loaded into the browser to view?
353 | Yes. By default, the HTML output is not in a format that will render well in a browser, as it is formatted for optimal processing and not intended to be viewed by humans. Use the `--keepbrtags` option to output the HTML in a more visual format.
354 |
355 | ### Can this tool extract text from images embedded in PDF files?
356 | No. This tool processes only text. It is not an OCR tool; it can only extract text from a PDF if the data is already in text format.
357 |
358 |
359 |
--------------------------------------------------------------------------------
/Test/pdf-in/sample.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bitextor/pdf-extract/6a8acac2bf99526e80604e377264aa1142be5471/Test/pdf-in/sample.pdf
--------------------------------------------------------------------------------
/Test/pdf-in/sample2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bitextor/pdf-extract/6a8acac2bf99526e80604e377264aa1142be5471/Test/pdf-in/sample2.pdf
--------------------------------------------------------------------------------
/Test/pdf-in/sample3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bitextor/pdf-extract/6a8acac2bf99526e80604e377264aa1142be5471/Test/pdf-in/sample3.pdf
--------------------------------------------------------------------------------
/Test/pdf-in/sample4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bitextor/pdf-extract/6a8acac2bf99526e80604e377264aa1142be5471/Test/pdf-in/sample4.pdf
--------------------------------------------------------------------------------
/Test/run-batch.sh:
--------------------------------------------------------------------------------
1 | java -jar ../target/PDFExtract-2.0.jar -B "sample.tab" -T 10 -LANG "en"
2 |
--------------------------------------------------------------------------------
/Test/run-single.sh:
--------------------------------------------------------------------------------
1 | java -jar ../target/PDFExtract-2.0.jar -I "pdf-in/sample.pdf" -O "html-out/sample.html" -LANG "en"
2 |
--------------------------------------------------------------------------------
/Test/sample.tab:
--------------------------------------------------------------------------------
1 | pdf-in/sample.pdf html-out/sample.html
2 | pdf-in/sample2.pdf html-out/sample2.html
3 | pdf-in/sample3.pdf html-out/sample3.html
4 | pdf-in/sample4.pdf html-out/sample4.html
5 |
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 |