├── .gitignore ├── Extracting_pdf_tabular_data.ipynb ├── LICENSE ├── PDFFixup ├── __init__.py └── fixer.py ├── data ├── DH_Ministerial_gifts_hospitality_travel_and_external_meetings_Jan_to_Mar_2015.pdf ├── Ministerial_Quarterly_Transparency_information_-_April-June_2014.pdf ├── example_out.csv └── ministerial_transparency_Apr-Jun_2014.pdf └── example.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | 3 | __pycache__/ 4 | *.py[cod] 5 | *$py.class 6 | 7 | # IPython Notebook 8 | .ipynb_checkpoints 9 | 10 | # virtualenv 11 | venv/ 12 | lib/ 13 | 14 | *.bat 15 | -------------------------------------------------------------------------------- /Extracting_pdf_tabular_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "# warning: pdfminer uses python 2\n", 12 | "from __future__ import division" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "The UK government regularly releases information about the meetings that various ministers have with external organisations. You can find the releases by searching [here](https://www.gov.uk/government/publications). The hope is that by releasing information like this the public, journalists and other organisations can have some level of scrutiny over who members of parliament are meeting.\n", 20 | "\n", 21 | "Unfortunately, the information is released in a number of different formats and styles, making any sort of attempt to automatically catalogue it difficult. On the suggestion of [Transparency International](http://www.transparency.org.uk/), I have been attempting to automate the procedure. 
If you are interested in the output dataset, you can find it [here](https://github.com/ijmbarr/uk_minister_meetings); the full code I used to parse the documents can be found [here](https://github.com/ijmbarr/ti_intergrity_watch) (warning: it is a mess, currently undocumented and still in progress). \n", 22 | "\n", 23 | "To produce the output, I had to extract tabular information from a number of different formats: .csv, .doc, .pdf, .xlsx, .odt and .opd. Of these, by far the most difficult was PDF. While there are a number of tools for extracting tabular information from PDF documents, such as [tabula](https://www.gov.uk/government/publications) and [pdftables](https://pdftables.readthedocs.io/en/latest/), neither of them quite worked on the documents I was looking at, so I decided to create my own.\n", 24 | "\n", 25 | "Reading around, you find that the best advice about parsing PDFs is [don't do it unless you have to](https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167). The reason, as we will see, is that unlike HTML or other markup languages, where tables and their internal structure are well defined, PDFs only give us low-level information about the location of individual characters and lines on the page. We are going to have to use this information to infer how the table is structured.\n", 26 | "\n", 27 | "The results presented here aren't that polished yet, and I don't know how readily they will apply to other PDF formats. 
However, I hope this method might be of some use to others.\n", 28 | "\n", 29 | "## The Problem\n", 30 | "\n", 31 | "Examples of the documents we would like to parse can be found [here](https://www.gov.uk/government/publications/hmt-ministers-meetings-hospitality-gifts-and-overseas-travel-1-april-to-30-june-2014), [here](https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/362728/Ministerial_Quarterly_Transparency_information_-_January_to_March_2014.pdf) and [here](https://www.gov.uk/government/publications/department-of-culture-media-and-sport-ministerial-gifts-hospitality-travel-and-meetings-july-2014-to-march-2015). Let's start with the first one as an example. In order to access the content of the PDFs, I'm going to use [pdfminer](http://euske.github.io/pdfminer/index.html). \n", 32 | "\n", 33 | "The first job is to find out what sorts of objects exist within the PDF. pdfminer describes each page with an LTPage object. Each page can contain other objects: text, rectangles, lines, figures, etc. (the full hierarchy of objects returned by pdfminer is detailed [here](https://euske.github.io/pdfminer/programming.html#layout)). 
We can pull out the pages of the document using the following code:" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "metadata": { 40 | "collapsed": false 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "from pdfminer.pdfparser import PDFParser\n", 45 | "from pdfminer.pdfdocument import PDFDocument\n", 46 | "from pdfminer.pdfpage import PDFPage\n", 47 | "from pdfminer.pdfpage import PDFTextExtractionNotAllowed\n", 48 | "from pdfminer.pdfinterp import PDFResourceManager\n", 49 | "from pdfminer.pdfinterp import PDFPageInterpreter\n", 50 | "from pdfminer.layout import LAParams\n", 51 | "from pdfminer.converter import PDFPageAggregator\n", 52 | "\n", 53 | "\n", 54 | "def extract_layout_by_page(pdf_path):\n", 55 | " \"\"\"\n", 56 | " Extracts LTPage objects from a pdf file.\n", 57 | " \n", 58 | " Slightly modified from\n", 59 | " https://euske.github.io/pdfminer/programming.html\n", 60 | " \"\"\"\n", 61 | " laparams = LAParams()\n", 62 | "\n", 63 | " fp = open(pdf_path, 'rb')\n", 64 | " parser = PDFParser(fp)\n", 65 | " document = PDFDocument(parser)\n", 66 | "\n", 67 | " if not document.is_extractable:\n", 68 | " raise PDFTextExtractionNotAllowed\n", 69 | "\n", 70 | " rsrcmgr = PDFResourceManager()\n", 71 | " device = PDFPageAggregator(rsrcmgr, laparams=laparams)\n", 72 | " interpreter = PDFPageInterpreter(rsrcmgr, device)\n", 73 | "\n", 74 | " layouts = []\n", 75 | " for page in PDFPage.create_pages(document):\n", 76 | " interpreter.process_page(page)\n", 77 | " layouts.append(device.get_result())\n", 78 | "\n", 79 | " return layouts\n", 80 | "\n", 81 | "example_file = \"data/DH_Ministerial_gifts_hospitality_travel_and_external_meetings_Jan_to_Mar_2015.pdf\"\n", 82 | "page_layouts = extract_layout_by_page(example_file)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 3, 88 | "metadata": { 89 | "collapsed": false 90 | }, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "12" 96 | ] 97 | }, 98 | 
"execution_count": 3, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "len(page_layouts)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "We can now ask what types of object appear on a page:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 3, 117 | "metadata": { 118 | "collapsed": false 119 | }, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/plain": [ 124 | "{pdfminer.layout.LTRect, pdfminer.layout.LTTextBoxHorizontal}" 125 | ] 126 | }, 127 | "execution_count": 3, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "objects_on_page = set(type(o) for o in page_layouts[3])\n", 134 | "objects_on_page" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "So it looks like we are only dealing with text and rectangles. The text exists as text boxes. Unfortunately, these don't always match up with the table columns in the way we would like, so we recursively extract each character from the text objects:" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 4, 147 | "metadata": { 148 | "collapsed": false 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "import pdfminer\n", 153 | "\n", 154 | "TEXT_ELEMENTS = [\n", 155 | " pdfminer.layout.LTTextBox,\n", 156 | " pdfminer.layout.LTTextBoxHorizontal,\n", 157 | " pdfminer.layout.LTTextLine,\n", 158 | " pdfminer.layout.LTTextLineHorizontal\n", 159 | "]\n", 160 | "\n", 161 | "def flatten(lst):\n", 162 | " \"\"\"Flattens a list of lists\"\"\"\n", 163 | " return [subelem for elem in lst for subelem in elem]\n", 164 | "\n", 165 | "\n", 166 | "def extract_characters(element):\n", 167 | " \"\"\"\n", 168 | " Recursively extracts individual characters from \n", 169 | " text elements. 
\n", 170 | " \"\"\"\n", 171 | " if isinstance(element, pdfminer.layout.LTChar):\n", 172 | " return [element]\n", 173 | "\n", 174 | " if any(isinstance(element, i) for i in TEXT_ELEMENTS):\n", 175 | " return flatten([extract_characters(e) for e in element])\n", 176 | "\n", 177 | " if isinstance(element, list):\n", 178 | " return flatten([extract_characters(l) for l in element])\n", 179 | "\n", 180 | " return []" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 5, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "current_page = page_layouts[4]\n", 192 | "\n", 193 | "texts = []\n", 194 | "rects = []\n", 195 | "\n", 196 | "# separate text and rectangle elements\n", 197 | "for e in current_page:\n", 198 | " if isinstance(e, pdfminer.layout.LTTextBoxHorizontal):\n", 199 | " texts.append(e)\n", 200 | " elif isinstance(e, pdfminer.layout.LTRect):\n", 201 | " rects.append(e)\n", 202 | "\n", 203 | "# flatten the text boxes into individual characters\n", 204 | "characters = extract_characters(texts)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Each element of the PDF is described by its bounding box. 
We can use this to visualise how the page is arranged:" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 6, 217 | "metadata": { 218 | "collapsed": true 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "import matplotlib.pyplot as plt\n", 223 | "from matplotlib import patches\n", 224 | "%matplotlib inline\n", 225 | "\n", 226 | " \n", 227 | "def draw_rect_bbox((x0,y0,x1,y1), ax, color):\n", 228 | " \"\"\"\n", 229 | " Draws an unfilled rectangle onto ax.\n", 230 | " \"\"\"\n", 231 | " ax.add_patch( \n", 232 | " patches.Rectangle(\n", 233 | " (x0, y0),\n", 234 | " x1 - x0,\n", 235 | " y1 - y0,\n", 236 | " fill=False,\n", 237 | " color=color\n", 238 | " ) \n", 239 | " )\n", 240 | " \n", 241 | "def draw_rect(rect, ax, color=\"black\"):\n", 242 | " draw_rect_bbox(rect.bbox, ax, color)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 7, 248 | "metadata": { 249 | "collapsed": false 250 | }, 251 | "outputs": [ 252 | { 253 | "data": { 254 | "image/png": 
"[base64-encoded PNG omitted: plot of the page with rectangles outlined in black and character bounding boxes in red]\n", 255 | "text/plain": [ 256 | "" 257 | ] 258 | }, 259 | "metadata": {}, 260 | "output_type": "display_data" 261 | } 262 | ], 263 | "source": [ 264 | "xmin, ymin, xmax, ymax = current_page.bbox\n", 265 | "size = 6\n", 266 | "\n", 267 | "fig, ax = plt.subplots(figsize = (size, size * (ymax/xmax)))\n", 268 | "\n", 269 | "for rect in rects:\n", 270 | " draw_rect(rect, ax)\n", 271 | " \n", 272 | "for c in characters:\n", 273 | " draw_rect(c, ax, \"red\")\n", 274 | " \n", 275 | "\n", 276 | "plt.xlim(xmin, xmax)\n", 277 | "plt.ylim(ymin, ymax)\n", 278 | "plt.show()" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "To pull out the information in table format, we need a way to identify the cells of the table, and to identify which cell each character belongs to.\n", 286 | "\n", 287 | "It would be nice to think that the rectangles here match the edges of the cells of the tables. However, exploring the layout shows that while some rectangles refer to cells, others seem to be used as line segments, and others as pixels. 
Some investigation suggests that the table can be defined by looking only at the rectangles which are \"line-like\", which we define as any rectangle narrower than two pixels, with an area greater than one pixel:" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 8, 293 | "metadata": { 294 | "collapsed": true 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "def width(rect):\n", 299 | "    x0, y0, x1, y1 = rect.bbox\n", 300 | "    return min(x1 - x0, y1 - y0)\n", 301 | "\n", 302 | "def area(rect):\n", 303 | "    x0, y0, x1, y1 = rect.bbox\n", 304 | "    return (x1 - x0) * (y1 - y0)\n", 305 | "\n", 306 | "\n", 307 | "def cast_as_line(rect):\n", 308 | "    \"\"\"\n", 309 | "    Replaces a rectangle with a line based on its longest dimension.\n", 310 | "    \"\"\"\n", 311 | "    x0, y0, x1, y1 = rect.bbox\n", 312 | "\n", 313 | "    if x1 - x0 > y1 - y0:\n", 314 | "        return (x0, y0, x1, y0, \"H\")\n", 315 | "    else:\n", 316 | "        return (x0, y0, x0, y1, \"V\")\n", 317 | "\n", 318 | "lines = [cast_as_line(r) for r in rects\n", 319 | "         if width(r) < 2 and\n", 320 | "         area(r) > 1]" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "Plotting the page again, but only with the lines, gives:" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 9, 333 | "metadata": { 334 | "collapsed": false 335 | }, 336 | "outputs": [ 337 | { 338 | "data": { 339 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXQAAAH1CAYAAADmjwUuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+MdUd93/HP17uBQtKYX8Fu/cDehPJLSRrIH+CIViyh\nCRgkoGogRErBhLSogEJKlWJSRX6qSlVAQgGEFCeKg0yUgAn5gVsRcJDZVJXASQQGCgabBC+2g5/Q\nEFNBKhqvp3/cc56dnZ2ZM3PuuXvvnft+SY/2nDlz5sc5536f+9z93nnMOScAwOa7ZNUDAABMg4AO\nAI0goANAIwjoANAIAjoANGJ3VR2bGek1ADCCc85i5St9h+6ca/bPtddeu/IxMEfmx/xWP46p/+Tw\nkQsANIKADgCNIKAvyf7+/qqHsHStz5H5bbbW5xdjQ5/JLK1jM7eqvgFgU5mZ3Dr+UhQAMB0COgA0\ngoAOAI0goANAIwjoANAIAjoANIKADgCNIKADQCMI6ADQCAI6ADSCgA4AjSCgA0AjCOgA0AgCOgA0\noiigm9m/N7P/ZWafMbPfNrOHmNnMzD5hZneY2XvNbLer+xAze5+Z3WlmHzezxy93ChmzmWR2+s/u\nbn4/V6ekbk1ZSR+5thaZyxTjP6vrtKxrMGX/i7a5zHGnnr1VXrvc62HMvZ3NVhZq1sXgeuhm9o8l\n/U9JT3HO/T8zu1HShyS9QNIHnHO/a2a/Kuk259yvmdm/k/SDzrnXmtlPSvqXzrmXR9pd/nroZlKs\nj7A8Vi9Vp6RurqyXG1fN+BaZy5jxn9V1qilb5f3021i0zWWOe11fC6kYMMVroVFTrIe+I+k7u3fh\nD5P0V5KeI+n3uuM3SHpJt/3ibl+SPiDpuWMGDQCoMxjQnXN/Jeltkr4i6V5J35D0SUn3O+ce7Krd\nI+mKbvsKSXd35x5Jut/MHjXxuAEAgd2hCmb2CM3fde9pHsx/V9LzK/qI/tNAks6fP39xe39/fyv/\nD0AAyDk4ONDBwUFR3ZLP0H9C0vOcc/+m2//Xkn5E0k9Iutw596CZXSnpWufcVWb24W77VjPbkfRV\n59xjI+3yGXpqXHyGzmfoU497XV8LfIZebdHP0L8i6Uoz+wdmZpp/Jv45SR+T9NKuzislfbDbvqnb\nV3f8lrEDBwCUK/kM/U81/+XmpyR9WvOPUH5d0jWS3mhmd0h6lKTru1Oul/QYM7tT0s939QAASzb4\nkcvSOj6Dj1zMov8qAdAYx0cukhr/pqhL/dnZSR9L1cmdEztWWlbyp6bvkn7HnFM7pynnP9TemPu5\nzP7HXruaOrV9lM5/GWNI9VPyXBdeI5lJe3vadk2/Q1/olzaldaeq45ct2uYyfyk3RT+5NnK/OF7G\nvJZxXVJ9nMUve0vGUXt+XyZN3/+Y+ltua9+hA8A2IaADQCMI6ADQCAI6ADSi/YBudvJPWLa7Gy8L\n66bKhurE9nPj8Nv0lxDtj81mS7lMADZf01ku5KEDm8tJZLlEbG2WS1E+rDK5s2Nzh8f+qcnvrc05\nnioPeop+htqYqu2p5rBov1Pljk81jrFtLqP/TH3yyus1/Q59qTnGubamWvAolscsnW4LwNbY2nfo\nALBNCOgA0IjB/+ACK8AvcwGMwDv0deQcn5MDqEZAn8JsxrtqACvX9EcuJsUDbVhWUqe2rTF9RM7l\nfTqAUk2/Q3d7e8vLMU6Vje2j39/bk3Nu/mdvbx7c+/Wed3bIzQWQ1HYe+llh/WYAZ4Q8dADYAgR0\nAGgEAR0AGkFAB4BGtBnQ+7zwcJ3yVJm/7niqXq6OdPJnWCd17th6tefPZsPXJDXP0jGXziF3fuzY\nmHuyaPnY56BkbkPz7e8XMEIzWS6sfQ5gUywS+3JZLk19sciFy872wv2wrN/O1SupM7Q/Vb0x588v\nUPm5tWMunUPu/JJx58a66BiW+RzEzisdH5qyzDefbX7kAgBbi
IAOAI0goANAIwjoANAIAjoANKLd\ngE4+L4At01Ta4gl9+hcAbIl2A3ofzM2O1xP3+WX+euOpeiV1hvanqld7fr+Gek3ftWMunUPu/Nix\ncNy5sS46hmU+B7HzUvVZ8x4jtRvQ+YIGgC3T7mfoALBlCOgA0AgCOgA0goAOAI1oN6D7WS65taen\nXDM7LB9aH7t2Te0p8upns7o+S+otsr54Sf8l/daOYd3XIq+9T2OvYT/n2Ww568Avci9yP8Oydblv\nK8Z66ABwxlgPvUByPfRYWb8/tjzWR9hPqo1+ez7o4XGenGR88hndAxBvf2icJfXGXKvS61Tab+0Y\nSudyhha6T2OvYb/dO4v/D2Bsff9nP9Y1uG+1WA8dADCIgA4AjSCgA0AjCOgA0AgCOgA0ot2AHua1\n9mV9zm2/H9YNy/s8V7+8by+WP+u352/7ffq/5V5F/myYRxyWSfGcYb+89FrFjueuU6q9VL9hn7kx\nxOaSql+SK11aFu6X3vOS+xRez6HrEWvLF463LwuvT+qZT43Ff82U1B/6GRtXSY58+ByGx3L3PpX3\nX3NPl2wwD93MniTpRklOkkn6Pkm/JOm3uvI9SXdJeplz7hvdOe+UdJWkb0m62jl3W6Rd8tABbKVl\n5aFXfbHIzC6RdI+kZ0p6vaS/cc691czeJOmRzrlrzOwqSa93zr3QzJ4p6R3OuSsjbU0e0JM5vP5+\nH/hr83fDNmryaXP9hucswaj85n6sQ3OaIi89dQ1SfZWMoXS8/TxzfZeemyor7HN0Hvoi96Sknj/e\nsde7ZBxTjHXompXM1Z9vqn7utT3gxH0eIRfQaz9y+ReS/sI5d7ekF0u6oSu/odtX9/M9kuScu1XS\npWZ2WfWoAQBVagP6T0r6nW77MufcBUlyzt0nqQ/aV0i62zvn3q4MALBExV/9N7PvkPQiSW/qisJ/\nM1T/G+L8+fMXt/f397W/v1/bBAA07eDgQAcHB0V1iz9DN7MXSXqtc+753f7tkvadcxfM7HJJH3PO\nPdXMruu2b+zqfUHSs/t38157fIbOZ+jpMeX64jP0k/t8hl5+zUrm6s83Vb+Bz9B/StJ7vf2bJF3d\nbV8t6YNe+Su6jq+UdH8YzAEA0ysK6Gb2cM1/Ifr7XvFbJP2YmX1R0o9K+mVJcs59SNKXzexLkn5N\n0msnHTGwLcyO/wzld/f1/O1UnnYuF7z/OZstdWpYjqLP0J1zfyfpe4Kyr2se5GP1X7/40Bawt3fy\nn087O8f7e3vzn/2+fyy375fHyobaSvXbl5+V8Nr4Y/H3peE5llyrWPnQdRvqq2QMpePt55nru/Tc\nVFm4f+7c6f6GHB3Nzz06mv/p2/LLc+f4dYb2Jenw8Phe+XMae71L7n3t66rkOcqVlTx3qfrh9ph7\nugRN/QcXq5rLuuPabIaq3wPV1vP3x3yOjcmsy2foAIA1RkAHgEYQ0AGgEQR0AGgEAR0AGtFmQJ/N\nTufw9tuz2cnjJetXh+Wp9ZRz+6l+zzrfN7w2tddgzPWK1a25duF2adtD61iv8j6U8Mfn70vl65zH\n1oBPtR3WKX3mS+9dybGh52Lsa7b0dV/zjK/hM9NU2iIAbIJlpS0WL861CVjLJW70GiHhfjiHobq5\nfmqv3dC6Gbn7U9LvGdyHIdH7NDTemusY7pf2MVRXqs+FTx0baqs35jXrn7esZ7zAMt98tvmRCwBs\nIQI6ADSCgA4AjSCgA0AjCOgA0AgCOrCu/BzxMGfc/xnLNY/tx3LYU2ujh/nqsTI/WyPsc03ztFvX\nVNriRayHnla6HnrqGkj11ytWd5F1tcMxpOoPrWPt75/1fSiVWgM9PF6yH657PrRWut93rKxvoz8v\ndHi4+Pr0JWuclzyDpa/7mmd8DZ+Zpr5YxJrfcVybzVC8Hnq4X/M9gjF5+TXnperjItZDBwAMIqAD\nQCMI6ADQCAI6ADSCgA4Aj
WgqywUANgHL5xZg+dy4weVzh8abW1I01l5NWlxpuluqzZJ7kBq7f9yf\n64rS8U7dJ18uv7w0Fz2Wdx4qqVM6nr096a67judT85qpeW5ibS2SyllzfMSYl/nms6mADjQlFrB6\nDzxwvP/AAyfr9MfC4NKXh/X7vsJjseOpgBU7rw/mODN8hg4AjSCgA0AjCOgA0AgCOgA0gl+KAusq\nlw3hH0utRLm7e5yxkitHMwjowLrKZbnE0knDjJSjo9P1YuVoRpsBnfXQ02Lrofdia0/38+y3w7pj\n1z2vua65NkvuQWrssfZL1u8+q3uWy0WPvUPv6/TzODqKX1u/PDwW67PvY+za97m6tfWmXm9/bH81\n8eAMn5umvinKmt9xXJvNUPQFsPALLVLd2uZhuX8+zgTroQMABhHQAaARBHQAaAQBHQAaQUAHgEY0\nleUCAJuA9dALsB563GA6XGxusbpjrlWqvPTcLVJ8n0rvUcmxMfckti/ln6vUnErH7L9hG7kOebK/\nWJtD12ABy3zzyUcuANAIAjoANIKADgCNIKADQCMI6ADQCAI6sK7Mjv/s7h6X+cfCev2+Xz+2n2tr\ndzdeZzZbyjQxnabSFi9i+dy02PK5iyxNO3StUuWpOmd9PTZF/x9S9Mvn1tYP92Nt9WWpOoeH6fsr\n5Z+r2H6sPPfshEsHT7H0bWk82JDnsuiLRWZ2qaTfkPQDkh6U9DOS7pB0o6Q9SXdJeplz7htd/XdK\nukrStyRd7Zy7LdImy+eeEa7NZsjmodd+V2BMDniqjO8ITGodls99h6QPOeeeKumHJH1B0jWSPuqc\ne7KkWyS9uevsKklPcM49UdJrJF03euQAgGKDAd3MvlvSP3fOvVuSnHMPdO/EXyzphq7aDd2+up/v\n6ereKulSM7ts6oEDAE4qeYf+vZL+t5m928w+aWa/bmYPl3SZc+6CJDnn7pPUB+0rJN3tnX9vVwYA\nWKKSX4ruSvphSa9zzv25mf2K5h+3hB8CVX8odP78+Yvb+/v72t/fr20CAJp2cHCgg4ODorqDvxTt\nPi75uHPu+7r9f6Z5QH+CpH3n3AUzu1zSx5xzTzWz67rtG7v6X5D07P7dvNcuvxQ9I1ybzcAvRbfD\nSn8p2gXiu83sSV3RcyV9TtJNkq7uyq6W9MFu+yZJr+g6vlLS/WEwX7rZ7HQOb789m5087h/L7fvl\nsbKh/VS/Z53bG16boXHHjqXmX1t/qJ+S8tIxLNLPWd+jnj8e6XR+eJ9b7o+9P+bX8ffD+YX1cvnr\nYV8117rmGRjzmix9FoZe0+EzW/psxdpawXNTmrb4Q5qnLX6HpL+U9CpJO5LeL+lxkg41T1u8v6v/\nLknP1zxt8VXOuU9G2mQ9dABbaVnv0Jv6Dy5YDz2uaj302LhTa1+XtpWqP9TPmLWux8xnqJ9+/Es2\n+j4tMr+SetK4NfJrXzNTjHXompVct9w9XyQ+XDxl9XnoAIA1R0AHgEYQ0AGgEQR0AGgEAR0AGkFA\n30Zhbm1flspzDo9J8WP98XC/r+cfS63PvYLc3bUV5jv7ZX4Oul/eb/v3K3dueF6snt9mrH2sDdZD\n37b10HPCdbD9NbT9dbHDdbJzx/yy3PrcvXDN7dq1rv06sf0xa7ifO5e+Zmcld/1i1zG8lyXXPtVn\nv516HkqudclrpvZ1VbLeeq4sN9ZwXjVjWeFru808dJxQnN9ckyPs5+qW5iKX9JXI3d0Gxd+l8Pdr\n8qJL6pScl+sfg8hDBwAMIqADQCMI6ADQCAI6ADSCgA4AjWgzoM9mp3N4++3ZbLvXQ5dOjjGWg9yv\nB+3X9ev4Y/aPrWIuLQvvUSyHP1YvrN+XxXLTc/nqsfNi/afWDx+zPnnJGuelr9Vce+Gxmtd3yThX\ntD56U2mLALAJlpW22NQXi1gPPW7UOtv+WMM6K5xLy0Z9X2DomavJIy959lP1pLq1yFPzCts
Z89ot\naS/X1tC8h8Y50N4y33y2+ZELAGwhAjoANIKADgCNIKADQCMI6ADQiHYDeiw31//tcuxYbj92Tum5\nuX5XIcznDct6fr5tXye3RnZ4zphc3dyxWJ2SvkryisccH5rHornIqfsUXvvYuub+/QjvU01bqbXU\nw32/ndhYwuco1l4vtb5+eDxsK5xT7Dr6c4pdj9S8U+fGxtLXqbnXEyEPHQDOGHnoBchDjxuVhz4m\n37bf78+pvT61a2+X9jV0v8YeH5pHyRhONLPAfZoil3zoGufOK70Wi9yn2jkNleXG6s85115NfLh4\nyvLefLb7kQsAbBkCOgA0goAOAI0goANAIwjoANAIAvo28nNrh/J5/WMrzK/dSql8bX+/ZA3z3Lmp\nHG6/nt9mrH2sjabSFi/a2zv5sO3sHO/v7c1/9vv+sdy+Xx4rG2or1W9fvq52dqSjo/n20dF8//Aw\nf/2k+utTcmxsX0P3a+zxoXnEtsfe7/7ah/ciLPfr9/3mzi3ps9/2z/PPLbmnJa+Z2tfVUL2hstxY\nw3nVjGWFr+2mvli0qrmsu0nym3PlmET2Pi1yj8bmTo/JGcegRWNV7otFfOQCAI0goANAIwjoANAI\nAjoANIKADgCNaDOgz2anc3hL9sfUG7P2d/hnzFrZiwj7D8v6fb+8H1tfHl6DMWufx65JSTul65fX\n3u+wbDY7fpYWXdt80fsUXvN+368XW5s+3M7VSZXF1vyO9Z+6TkPHpljTvuQ1usgzNeZ5O6vXs6ep\ntEUA2ASsh16gOoc33B9Tr/bcfns+4HwbExmdhx7WiY05dmyozdx8S65j39eU9zE1zlhfS7pfC9+n\nMdc8d33DOjV1S8dTOrYpXqMlcy4dc8mYEs8H66EDAAYR0AGgEQR0AGgEAR0AGkFAB4BGNJXlAjTF\n7PQyt7u7+f0wg8LfD3PQU+ek+sbaazOg59ZDz+2f1drovbCNs1g/Obw2sbHFxppbRz62fvSYda9L\nxlKyBvrY+x3W9ee8qvWu/YAaBtiagDumbvgXRe5e+GWxZyR3bNlrpU/xTJX25Zev63roZnaXpG9I\nelDS3zvnnmFmj5R0o6Q9SXdJeplz7htd/XdKukrStyRd7Zy7LdIm66GfEa7NZojmoY/JNc/Vlab5\nvgBGW4f10B+UtO+ce7pz7hld2TWSPuqce7KkWyS9uevsKklPcM49UdJrJF03euQAgGKlAd0idV8s\n6YZu+4Zuvy9/jyQ5526VdKmZXbbgOAEAA0oDupP0ETP7MzP72a7sMufcBUlyzt0nqQ/aV0i62zv3\n3q4MALBEpb8UfZZz7qtm9j2SbjazL2oe5H18yAYAK1QU0J1zX+1+fs3M/lDSMyRdMLPLnHMXzOxy\nSX/dVb9X0uO80891ZaecP3/+4vb+/r729/drxw8ATTs4ONDBwUFR3cEsFzN7uKRLnHPfNLPvlHSz\npP8s6bmSvu6ce4uZXSPpEc65a8zsBZJe55x7oZldKentzrkrI+2yfC6ArbTK5XMvk/QHZua6+r/t\nnLvZzP5c0vvN7GckHUp6WTfQD5nZC8zsS5qnLb5q9MgrsXxuXNGyrP14huYY1juD8W+LU/ep1lCe\neiwXvaTO3p50eLjcJXRL2krV2TDLfPPZ1H9wQUCPI6Bvhqo89DHPXbhf2sdQXYmAXmEd8tABAGuO\ngA4AjSCgA0AjCOgA0AgCOgA0oqksFwDYBKvMQ98YC+XwrouzSFus5ecmD+UyT5kyl2svlfpZ207Y\n1gqXk134Pm2TMenF/rm1z8eE932Zbz6bCugnbGoe+lkYykPP1XvggePtWFDEdMhDTwdlRPEZOgA0\ngoAOAI0goANAIwjoANAIAjoANIKAvo3MTv7p7QZJT7u7p+vFtv02ZrOlDn2r+Ne5vxd+eX+t+3vQ\n3z//vvX73Jet0Gba4t7eySCzs1O2P6Ze7bm+sI29vXH
znULJutglbRwenp6Xf7z0uu3szPdrr+GY\ndvw6JWM+y/vU34fY9wAOD0/WPTpK38f+vtTeg5K60vA1Kz2WaytVVjL+/p7dc8/JY6XPx6pfn4Wa\n+qboquay7orWQx+TS7/EL19so+o1/XtjcsWxMqyHDgAYREAHgEYQ0AGgEQR0AGgEAR0AGtFUlgsA\nbALWQy+QTc2LleeWCc3VCeuFaWFrltJXvM52TS566tjennTXXeOWZg3rpOrH6pW24x+Tpl82eQHZ\n+xS7N+fOncxF7+vk6ubGP+bZ98daup58rp3UOuYl496QlFnWQ8e0hgJjX9avfT5Ub4NeTBsld61z\n69KzZv3W4jN0AGgEAR0AGkFAB4BGENABoBEEdABoBHnoAHDGyEMvUJRrvWrrmIdeuvZ5TV56zXrq\nY/LdS/r092PHpHjbJXPpy/q8+wkU36fcvGL1c8fG1Km9zrXjK3l2Un3W5rznji0pFXeZbz6beofO\nF4viiq5NeCz8ski/HdYd+sLP0LUY8yWi2i+rpMbuH/fnOuZLThOI3qfa57HkOZXGfwmo5stGJd93\nCNsOx5Ybf6qdkvGsOKCzHjoAIIuADgCNIKADQCMI6ADQCAI6ADSiqSwXANgE5KEXIA89bvR66H65\ntHh+sX8slwc9lIdck/dcOvaavOy9vZNri/dKUysTovcpNb/SfO+wfG9Puuee8rmm+q35nsHQGP3j\nUv3YYuU1ufMp4dr+Uln6ppS93+Shl7VHHnrC4LVZVo63X6dX2l5tHvSYsZc8L6kxL3IdE4rvU+k8\na3Pqxzz7/b5Ul5decq/HjKn23Fhue811GxnQyUMHAGQR0AGgEQR0AGgEAR0AGkFAB4BGENC3kdnx\nn93dk2Xh9u5uvN6ul/Ha1wnPiaVnpfr120u1E/sZq5Mbe1/m74fXJBzPbDb/47frtz+bRS7yBMIx\n9WW5sfv7/hj7/dg8ljV+nLmm8tAv2tuLB5OdnZPl/b5fXlInVk8abmNvb7F5LUMu59cvD+vVrH/d\n520PtRfmjJfkEZfkKvt9heNOzd8fz+Hh6Xb9cw4P88/R1Pd9aOxhPV8/F38esfH7cxh69vvrNFQn\ndU6ubu2Yxpxb2mesjVjdc+e0Km3moeOEbH5z6f6iecRj+u3bPzmZxXKWh/pNteuPZSgXeaSq7wuk\nxuiPs3YeOBNrkYduZpeY2SfN7KZuf2ZmnzCzO8zsvWa225U/xMzeZ2Z3mtnHzezxo0cOAChW8xn6\nGyR93tt/i6S3OeeeJOl+Sa/uyl8t6evOuSdKerukt04xUABAXlFAN7Nzkl4g6Te84h+V9Hvd9g2S\nXtJtv7jbl6QPSHru4sMEAAwpfYf+K5J+QZKTJDN7tKS/dc492B2/R9IV3fYVku6WJOfckaT7zexR\nk40YABA1mOViZi+UdME5d5uZ7fuHCvtI1jt//vzF7f39fe3v76eqAsBWOjg40MHBQVHdwSwXM/uv\nkn5a0gOSHibpH0r6Q0k/Luly59yDZnalpGudc1eZ2Ye77VvNbEfSV51zj420y3roALbSsrJcqtIW\nzezZkv6Dc+5FZnajpN93zt1oZr8q6dPOuevM7LWSfsA591oze7mklzjnXh5pi+Vzz8hC12boHOl4\nbv32UJuplMShc1L3qibFLzfOqduL7Wf6Wihtsbbf3DlD16ekbul4SsdWm/5ak6pa8mzF7vfI1/la\npC1GXCPpjWZ2h6RHSbq+K79e0mPM7E5JP9/VAwAsWdU3RZ1zfyLpT7rtL0t6ZqTOtyW9bJLRAQCK\nsZYLADSCgA4AjSCgA0Aj2lxtEXmxleb88tjqef6xcAW/cOlbf8W5kjZLx+EvFxu2sbt7elxhO7El\nelN9pNrz5x+Oo28rVTccR3++WXxFRr/91LUJx9nvr+PKnlg6Avo2iqXD+eW5NMxYuqYfUPq0rX6Z\n1libqbTF3DjMjvu
JpZYdHQ2nlYbjzPWRam8ojS28DrG6/jzCY75ciqDfll/W78faQ/PaDOish55X\nc21i58TerffbqWsy1GaqTk0bQ+f5a5yn7rG/xniqPf/n0Frgseer72Mo6Mb+BRObU9h27DmcYm3z\n2nXQS9c5LxlbzfhLX9Ml58R+9nNew9c566FvgUm/sBI7JqXfbebe2ZeMw287PCf2Ljk35pq5D5Wn\n2krNOdW2V2ZS+j6NHSfWzrp+sQgAsEYI6ADQCAI6ADSCgA4AjWgzywV5seyKXMZOeDx3/u7u6fJU\nznQsWybsN2wzlnedOxa2m8rSiI0hdQ1yueq5cmDJeIeOPD/VL6U/HqYG9ulbfipgeE5/PNWv32as\nDb+/2LGwr7CP1BjC8cXmGBt3WG52/CdMgfO3w1S3PvU2rJdrK9cetgJpi1tgobTFmpRGfz+XNlhz\nTul4hlIXU/v9OWeV2pjBM7wdSFsEAAwioANAIwjoANAIAjoANIKADgCNIKBvIz/Nzc+5ju1L87xq\ns5PreIf7fZ1+3y8Py/w2w9zu1Bhi44nllqfmkRp72He/Hxtfaj6pufc/ZzMBZ4G0xS1wKm2xRir3\nu3TfLxvK3U4dz42n9FhqrNLJvnPjL2kvrFeRvsgzvB2WmbbIN0W30djlc82kBx44Wd7vl57THwvH\nEWu7JBd+qB+/Xm7s4bHY+Grz3Wv/8gQWxEcuANAIAjoANIKADgCNIKADQCMI6ADQCLJctlEs+yIs\n6/f9tcxz9XLH+mVdw6V4/fJYn7G2Y8dSY+uPhe3HjvXbLDuLDcY7dOSFOdq1wrXKw3Zia5bn+ood\nS9UvzVn3tw8Pj9de97+k5K8/njoWq2vGXxI4M7xD30a166E/8EBZnndNLvvQ+uO5nPRYzrifU+7P\ncaifknx2YEPwDh0AGkFAB4BGENABoBEEdABoBAEdABrR1PK5ALAJWD53wErXkc6lxq2b2Wyea+3L\nretduk546TrpqZ+14/GPScP55qXj39uT7ror3haw5pp5h75SmxTQAWy03Dt0PkMHgEYQ0AGgEQR0\nAGgEAR0AGkFAnwppkwBWjIA+FbJbAKwYAX0qvEMHsGIE9KnwDh3AihHQAaARgwHdzB5qZrea2afM\n7LNmdm1XPjOzT5jZHWb2XjPb7cofYmbvM7M7zezjZvb4ZU8CAFAQ0J1z35b0HOfc0yU9TdJVZvZM\nSW+R9Dbn3JMk3S/p1d0pr5b0defcEyW9XdJblzJyAMAJRR+5OOf+rtt8qOYLejlJz5H0e135DZJe\n0m2/uNu02XhqAAAWLElEQVSXpA9Ieu4kIwUAZBUFdDO7xMw+Jek+SX8s6S8k3e+ce7Crco+kK7rt\nKyTdLUnOuSNJ95vZoyYd9ZDZ7OT/wm4m7e6e3g5/xurm9vuf0vHP2nNz9Xd3h+un5jbUV6yd2mtR\nst3vz2an78vQXGrmtuiYS+7JovdyiutZ8zyMvUbhs5erP5ud6UsbeUXL53aB++lm9t2S/kDSUyr6\nsNSB8+fPX9ze39/X/v5+RbNBJ5bsZs5fOrXfDn/G6ub2Y0u21p6bq58b85hzUvVKxlvaV+qccMne\n1Bimvh6141ykvKTfRa/nUJ+1dRbdPzycB3ZUqVlp9uDgQAcHB0V1q5fPNbNfkvR/Jf1HSZc75x40\nsyslXeucu8rMPtxt32pmO5K+6px7bKSdyf+Di4vtmZ1OI/TL+u3wZ6xubn+on5Jzc/X7F0qufmpu\nQ33F2qkZf+m2P49+Lqm2x/Q3RRthvbHlJf0uej2H+qytk9uX6q8VBp2IVePPj/4tWpLl8hgzu7Tb\nfpikH5P0eUkfk/TSrtorJX2w276p21d3/JbRIwcAFCv5yOUfSbrBzC7R/C+AG51zHzKz2yW9z8z+\ni6RPSbq+q3+9pN8yszsl/Y2kly9h3ACAwGBAd859VtIPR8q/LOmZkfJvS3rZJKMDA
BTjm6IA0AgC\nOgA0ot2ATo4sgC1TlIe+kfqUKgDYEu0G9D6Y9z93dk5vhz97pfth+Zhzc/V3do7nUNpn6Txj7dSM\nv3S73z937nguqbbH9DdFG2G9seUl/S56PYf6rK2T25eGr9XenrA+2g3ofOkBwJZp9zN0ANgyBHQA\naAQBHQAaQUAHgEa0G9D9LBez5a99HeundK3zVJ1wf4q8+tmsrs+SemOuVel1Ku13kbXK/f0prvFU\nwntVupb62P1lrq0+5l6U/FzH+7ZC1cvnTtbxEpbPBYBNsKzlc5tKWzyxHro0z5G9667jsmWufR3W\ny/Xpj7FkneqTk0zOP6VqrfhUnVy9Mdeq9DqV9ls7htK5nKFT62TXXJex19Dfl5a3tvoi9yL3sx/z\nCu9brWW++WwqoJ+wITcXAKbS7mfoALBlCOgA0AgCOgA0goAOAI1o95ei/m+S/ZXj/NXh+jrhb539\n8p0d6ejoZPnu7vHP/ljYT9hW32/Y12x2nIlzVmK/Ze/LwhUe/fnFzi25hmH7/vWMreoX6z9sv78H\nsWsdthnW7Y/396/fX8dc5tR1Da9T7ljqGsaeA/+a9Of5z3vYj1/Hv5ape5F7bcX6jL3m/J9hnfD1\nGM4l9kyH9WP7/bafObeGyEMHgDNGHnqBU3novtzfvjGp47G/9UvOK+0jN86R7w5O5aGPGdciaq91\naf2Sd1upuqn9Fb4Di+ah5+TeeY65h7nnMXfNS69t6dhSfcZ+Sifbqvm+Qph7X5LnP0G++zLffDb1\nDj355ZmxN9HfL7m5Y/o9gy9EVH2xaMyXVWq+pFPzJZOxX2ZZ9IsrK1L1xaLS+S76ZZ7SNmrq1PRZ\nUrc/Lm1MQF/WO3R+KQoAjSCgA0AjCOgA0AgCOgA0goAOrDOz+R8/L9vP5Q63Y8f83O2wPDwn/Bnm\ng8f6ieWph+emjsXyzVN9xXLR/eND7cf6CPdT1yh2Xfyfs5nWQVNpixft7Z28YTs7x/vhF3z8Y7l9\nvzxWNtRWql//i05nIbw2/lhy27n90vLYsVwfJe0MzaFk3Ku6FzWG0jRTStM2U+eV1on14+/nji0j\nDTVVN9ZHSeqnv51bynfF2kxbxAlcm82QTVsckw4Y7o9NO5TyKXtTpCwOlceOjU0znPJajkhnJG0R\nADCIgA4AjSCgA0AjCOgA0AgCOgA0goAOrLNYzncq17wkxxxNazMPHWhNLi+8Jse8Nuc//M8yYn8x\n1HynoPQcvsMwCgEdWGdDudhheWx/RK40NhMfuQBAIwjoANAIAjoANIKADgCNIKADQCOaWm0RADbB\nslZbbCptMfk/24cpXPPKw3X9/dKlRmv7PYNUshPLdQ4tS+qr/Z/YY20O7Q+1HxtHrM3apWZz4+n7\nzI09N66RosvnhkrXA8+t5d23Pfb+5q7H6UlNs8xv3/7QEr65fkvnmprLRJb55rOpgA40Jww6Dzxw\n/LMvq1nbG03jM3QAaAQBHQAaMRjQzeycmd1iZp8zs8+a2c915Y80s5vN7Itm9hEzu9Q7551mdqeZ\n3WZmT1vmBAAAcyXv0B+Q9Ebn3PdL+hFJrzOzp0i6RtJHnXNPlnSLpDdLkpldJekJzrknSnqNpOuW\nMnIAwAmDAd05d59z7rZu+5uSbpd0TtKLJd3QVbuh21f38z1d/VslXWpml008bgBAoCrLxcxmkp4m\n6ROSLnPOXZDmQd8L2ldIuts77d6u7MKigwW2TpiZ4q9v7i9tK53ejx3b9V7y/Xasfr8cbFgnXE7X\nT5UMl5ntj4VjzS3vi4UUB3Qz+y5JH5D0BufcN80sTNSsTtw8f/78xe39/X3t7+/XNhG3t5de07h/\nUMeuf5wqG2or1e9Zr6McXht/LGHucmresfNq1pyuaV8q66N0TeuSdbj7PnNj7+usYh1sf210/57F\n8tN74bFUPnvv8DB+nr+fC8phP6nzSp6nmjXZY
/u1z/K5c+l5rcDBwYEODg6K6hZ9U9TMdiX9d0l/\n5Jx7R1d2u6R959wFM7tc0secc081s+u67Ru7el+Q9Oz+3bzX5uTfFF3Vt17XHddmM0S/WDTVfuyL\nM0P1ezVt1o5lC5/LRV+PuW+KlqYt/qakz/fBvHOTpKu77aslfdArf0XX8ZWS7g+DOQBgeoPv0M3s\nWZL+h6TPav6xipP0i5L+VNL7JT1O0qGklznn7u/OeZek50v6lqRXOec+GWmXd+hnhGuzGXiHvh2W\n+Q69qcW5CFpxXJvNQEDfDuvwkQsAYM0R0AGgEU195AIAm4D10AuwHnpc8Xrotfuxcmn561bn+hjz\nuXKu7djnw6X9pMaZuH5F66H7htZEj9Xtc7pzOejSPL/+8LB8nfUa4frsQ6+n3Gt2zOu1tJ/S9iux\nHjqwrUqDVVgndcxfR71fW32ofs1fXkN1pJNjwKT4DB0AGkFAB4BGENABoBEEdABoBL8U3UaxX0aF\nZf3+0BKtZullVHd34xkRYV9+vdSSrqk+Ysf69nJjTx0L246NMXWuXyd2LSSWjcVSEdC3UU0KXixl\nrN+PpYP5x46OylI9+3r9dthfro9Y1offXm7ssWOxtqXTc4mdm6oTzovsDixJmwGd9dDzat6hS/l5\nhT/DtdXDd9qxtvt6fq506rywfqx/v9+SdbJTY/HfyYdzyZ0b1on966Tk3qee49q133PHxqx3n1uv\nfOj6+PXC9kvXqK8d89h+cu2v4nVboKlvirIAVdxCXywqOZZ7913yRaqS/krHlOurtp+pxyhl++AZ\n3g4szgUAGERAB4BGENABoBEEdABoRJtZLsjLZbnEftOfy4AJs0xSOeD9sbAvP/MhzBMPsyL8PO++\nzpr9D+3AKhHQcdLQsqo5/nlhO35Kol/HP+Yfj30hKRzT0dHxEq+16Xal54w9N5fat6Ypb9h8BPRt\nNJQ22C9vGh6L1c2tPe23U9q2v7Rq6VrlYT/AluIzdABoBAEdABpBQAeARhDQAaARBHQAaAQBHQAa\n0dRqiwCwCZa12mIzeegsOwpg2/GRCwA0goAOAI0goANAIwjoANAIAvoUZrP5QlG7u/M//fbYn+vS\nxmw2/3OWbcxmq7yTwEZrJm1xpcLV/2L/cXLNz3Vqo3fWbQCI4j+JBoAtQEAHgEYQ0AGgEXyGPgGW\nHZiG4zN0YNBWfPV/lZy0nr/Q3LRfigJYCB+5TIWABGDFCOhT8T8mIJcawAoQ0KfSv0Pf2ZEOD0+W\n1f5c5Nwp29jZkfb2Fh9H7Xj6LxmFXzyKlae2Z7PjL3z5x8IvMvXb/V/CsXNKt1PHS85PjTvX5tA1\nCee0yPUrrRMbY+w6xe7PlNd7aPyNfpGNX4pOwSz9GXZYJ9wuOT92rl8n1XbYrt9f2H4vLCsZz9BY\nU+OOXcPS88b0k7tmsfKSfsI+U/d46Hy//5J5hucNtVnynOSuX2kdf3/ovLBuybjCdqa6NhuELxYB\nwBYgoANAIwjoU9jbO/nPSABYAQL6FO66a+M+hwPQnsGAbmbXm9kFM/uMV/ZIM7vZzL5oZh8xs0u9\nY+80szvN7DYze9qyBg4AOKnkHfq7JT0vKLtG0kedc0+WdIukN0uSmV0l6QnOuSdKeo2k6yYc6/rq\n06Okkx+9+GlU/n7qWOr82LZfp6Ru2F9YL1UWq7+zk2+rdNwl16fmmvjbfcplat6xue7uxq9VzRhT\n93jo/Np5ltSV5s+m/5HglNc7Vqfk+u3szP/UXqOa611ybfqxNKQobdHM9iT9N+fcP+32vyDp2c65\nC2Z2uaSPOeeeambXdds3dvVul7TvnLsQabO9tMXZ7DgHvTb1yj82dF5vKA0r12auj9xYU3OrmWt4\n7VJzSZ3rX4Oh6zD1nMfMdZFxDT0HY1MPc+2VzGeqtNMx88v1UZOe2duwOLSMtMXH9kHaOXefpMu6\n8isk3e3Vu
7cr2w58lg5ghab6pShRDABWbOxqixfM7DLvI5e/7srvlfQ4r965rizq/PnzF7f39/e1\nv78/cjgA0KaDgwMdHBwU1S39DH2m+WfoP9jtv0XS151zbzGzayQ9wjl3jZm9QNLrnHMvNLMrJb3d\nOXdlos32PkOP7fMZOp+h8xk6n6FPaKH10M3sdyTtS3q0mX1F0rWSflnS75rZz0g6lPQySXLOfcjM\nXmBmX5L0LUmvmmYKG8Bs/hvzo6P5/u5ufNt/kPptP7sibMOv2y+UFZ4f7od9xNqMjccfp39uOJbY\nfPyshdh2ao6xvsI+/HNjwvKw/lDbR0fp8fdms+NfeMeOp+5reD2HjpdcR/+c1LMVPjthW/45/rUJ\nn0l/O9dOyfVIbcee19Kxx8aYazNsvzEszjWFMKD4D10seMXqhvVq2gjrpN7J5top7ePcuXlgc+5k\nkMuNufQa9NvS8Ln9X25+/379vb35L6lr7k2sr5LrEpPrZ+y1K72m/rHUdYjZ25uPKdZP7fOYGlPq\n2RyaR8m9GppratwbFody79AJ6FMo/afeFP+s9tV+rBO2WfORQ6q/3Lmxdqa6DrXXufYjhbC/of1F\nxlZ7HUvvZWrMsY8o/Lq5tqfeHrpXuTZycy25br0Ni0OstggAW4CADgCN4D+JBtAGs9OJA1uGgD6V\nWObCUKZGLqvBF2Y2xMpLMmliGQmxLI+wTmy8YbZM6XxyGROx/lPZKWGbQ9c51n9uTGGd1P7QuGN1\nYmWpDKnYPFLj6ttN3RN/rrnrFKufm1NJ1k/Ns5A6L5dhI53+nc8W4iOXKY3JSPCDqb8da7N/99Ev\nbuS3WZtFEpbnMiZi4z06Op11MDQfP2Mi7HPo+gxdw1j7sXnEjqfG1Iu968vVj7UZ6zM3h1R/Q3L3\nJMxYSbUZq5+bU0nGS8mzEBtPTZ/+X0hmx4Hf/9lv7+0d/2kIWS5TWCRTI/cb/KE6U2WQlGQWhO3H\nMiNK+0m1W3p9Sq5rajzhmFPZILF6U9zXmkyOVEZHaebKmPswNNexWUglcxnK3qmt3yiyXABgCxDQ\nAaARBHQAaAQBHQAaQUAHgEaQhz4F//9s7NOjYtvS6WOp+iV1/ON920N1S9qIjSVsP5UzX9JP7JrU\nXJ+S65oaTzjm2PnnzsXrTXFfS653rI2h+YTnjL0PQ3OtuQa1c8nNq7R+Y2mItUhbBIANQtoiAGwB\nAjoANIKADgCNIKADQCMI6ADQCAI6ADSCgA4AjSCgA0AjCOgA0AgCOgA0goAOAI0goANAIwjoANAI\nAjoANIKADgCNIKADQCMI6ADQCAI6ADSCgA4AjSCgA0AjCOgA0AgCOgA0goAOAI0goANAIwjoANAI\nAjoANIKADgCNIKADQCMI6ADQCAI6ADSCgA4AjSCgA0AjCOgA0AgCOgA0YikB3cyeb2ZfMLM7zOxN\ny+hj3R0cHKx6CEvX+hyZ32ZrfX4xkwd0M7tE0rskPU/S90v6KTN7ytT9rLtteJhanyPz22ytzy9m\nGe/QnyHpTufcoXPu7yW9T9KLl9APAMCzjIB+haS7vf17ujIAwBKZc27aBs3+laTnOef+bbf/05Ke\n4Zz7uaDetB0DwJZwzlmsfHcJfd0r6fHe/rmurGhAAIBxlvGRy59J+idmtmdmD5H0ckk3LaEfAIBn\n8nfozrkjM3u9pJs1/wvjeufc7VP3AwA4afLP0AEAq7GSb4q28MUjM7vezC6Y2We8skea2c1m9kUz\n+4iZXeode6eZ3Wlmt5nZ01Yz6nJmds7MbjGzz5nZZ83s57ryJuZoZg81s1vN7FPd/K7tymdm9onu\n2Xyvme125Q8xs/d18/u4mT0+38N6MLNLzOyTZnZTt9/a/O4ys0939/FPu7ImntExzjygN/TFo3dr\nPgffNZI+6px7sqRbJL1ZkszsKklPcM49UdJrJF13lgMd6QFJb3TOfb+kH5H
0uu4+NTFH59y3JT3H\nOfd0SU+TdJWZPVPSWyS9zTn3JEn3S3p1d8qrJX29m9/bJb11BcMe4w2SPu/ttza/ByXtO+ee7px7\nRlfWxDM6inPuTP9IulLSH3n710h601mPY6K57En6jLf/BUmXdduXS7q9275O0k969W7v623KH0l/\nKOlftDhHSQ+X9OeafynuryVd0pVffFYlfVjSM7vtHUlfW/W4C+Z1TtIfS9qXdFNX9rVW5teN9cuS\nHh2UNfeMlv5ZxUcuLX/x6LHOuQuS5Jy7T9JlXXk453u1QXM2s5nm72I/ofkLoIk5dh9HfErSfZoH\nvr+QdL9z7sGuiv9sXpyfc+5I0v1m9qgzHnKtX5H0C5KcJJnZoyX9bUPzk+Zz+4iZ/ZmZ/WxX1swz\nWmsZeeg4tvG/cTaz75L0AUlvcM59M/KFsI2dYxfYnm5m3y3pDyTVfPS31t+jMLMXSrrgnLvNzPb9\nQ6VNTD+qpXiWc+6rZvY9km42sy/q9DO5sc9orVW8Qy/64tGGumBml0mSmV2u+T/fpfn8HufV24g5\nd78w+4Ck33LOfbArbmqOkuSc+z+SDjT/XcEjut/zSCfncHF+ZrYj6budc18/46HWeJakF5nZX0p6\nr6QflfQOSZc2Mj9JknPuq93Pr2n+seAz1OAzWmoVAb2lLx6ZTr6TuUnS1d321ZI+6JW/QpLM7ErN\n/1l/4WyGuJDflPR559w7vLIm5mhmj+mzH8zsYZJ+TPNfHn5M0ku7aq/Uyfm9stt+qea/bFtbzrlf\ndM493jn3fZq/xm5xzv20GpmfJJnZw7t/QcrMvlPSj0v6rBp5RkdZ0S8yni/pi5LulHTNqn+RMHIO\nvyPpryR9W9JXJL1K0iMlfbSb282SHuHVf5ekL0n6tKQfXvX4C+b3LElHkm6T9ClJn+zu26NamKOk\nH+zmdJukz0j6T13590q6VdIdkm6U9B1d+UMlvb97Zj8habbqOVTM9dk6/qVoM/Pr5tI/n5/tY0kr\nz+iYP3yxCAAawX9BBwCNIKADQCMI6ADQCAI6ADSCgA4AjSCgA0AjCOgA0Ij/D9Nzpv5ZAfa3AAAA\nAElFTkSuQmCC\n", 340 | "text/plain": [ 341 | "" 342 | ] 343 | }, 344 | "metadata": {}, 345 | "output_type": "display_data" 346 | } 347 | ], 348 | "source": [ 349 | "xmin, ymin, xmax, ymax = current_page.bbox\n", 350 | "size = 6\n", 351 | "\n", 352 | "fig, ax = plt.subplots(figsize = (size, size * (ymax/xmax)))\n", 353 | "\n", 354 | "for l in lines:\n", 355 | " x0,y0,x1,y1,_ = l\n", 356 | " plt.plot([x0, x1], [y0, y1], 'k-')\n", 357 | " \n", 358 | "for c in characters:\n", 359 | " draw_rect(c, ax, \"red\")\n", 360 | "\n", 361 | "plt.xlim(xmin, xmax)\n", 362 | "plt.ylim(ymin, ymax)\n", 363 | "plt.show()" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "Now onto the main question: how do we assign characters to cells? We use a simple idea, for each character, find the line that bounds it from above, below, left and right. 
Define these four lines as the cell. We do this using the following function:" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 10, 376 | "metadata": { 377 | "collapsed": true 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "def does_it_intersect(x, bounds):\n", 382 | "    return bounds[0] <= x <= bounds[1]\n", 383 | "\n", 384 | "def find_bounding_rectangle(point, lines):\n", 385 | "    \"\"\"\n", 386 | "    Given a collection of lines, and a point, try to find the rectangle \n", 387 | "    made from the lines that bounds the point. If the point is not \n", 388 | "    bounded, return None.\n", 389 | "    \"\"\"\n", 390 | "    x, y = point\n", 391 | "    v_intersects = [l for l in lines\n", 392 | "                    if l[4] == \"V\"\n", 393 | "                    and does_it_intersect(y, (l[1], l[3]))]\n", 394 | "\n", 395 | "    h_intersects = [l for l in lines\n", 396 | "                    if l[4] == \"H\"\n", 397 | "                    and does_it_intersect(x, (l[0], l[2]))]\n", 398 | "\n", 399 | "    if len(v_intersects) < 2 or len(h_intersects) < 2:\n", 400 | "        return None\n", 401 | "\n", 402 | "    v_left = [v[0] for v in v_intersects\n", 403 | "              if v[0] < x]\n", 404 | "\n", 405 | "    v_right = [v[0] for v in v_intersects\n", 406 | "               if v[0] > x]\n", 407 | "\n", 408 | "    if len(v_left) == 0 or len(v_right) == 0:\n", 409 | "        return None\n", 410 | "\n", 411 | "    x0, x1 = max(v_left), min(v_right)\n", 412 | "\n", 413 | "    h_down = [h[1] for h in h_intersects\n", 414 | "              if h[1] < y]\n", 415 | "\n", 416 | "    h_up = [h[1] for h in h_intersects\n", 417 | "            if h[1] > y]\n", 418 | "\n", 419 | "    if len(h_down) == 0 or len(h_up) == 0:\n", 420 | "        return None\n", 421 | "\n", 422 | "    y0, y1 = max(h_down), min(h_up)\n", 423 | "\n", 424 | "    return (x0, y0, x1, y1)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "The line segments that make up the cell boundaries aren't always complete - those small pixel-sized rectangles we threw away earlier leave gaps in them. 
Combined with the bboxes of characters sometimes lying outside their cell, we have to be careful about which point we use to find a character's cell. To make things robust I use three points: the bottom-left corner, the top-right corner and the centre. The box that contains the majority of these points is the one chosen.\n", 432 | "\n", 433 | "We can now run this code over every character. " 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 11, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "from collections import defaultdict\n", 445 | "import math\n", 446 | "\n", 447 | "box_char_dict = {}\n", 448 | "\n", 449 | "for c in characters:\n", 450 | "    # choose the bounding box that occurs most often among these three points:\n", 451 | "    bboxes = defaultdict(int)\n", 452 | "    l_x, l_y = c.bbox[0], c.bbox[1]\n", 453 | "    bbox_l = find_bounding_rectangle((l_x, l_y), lines)\n", 454 | "    bboxes[bbox_l] += 1\n", 455 | "\n", 456 | "    c_x, c_y = math.floor((c.bbox[0] + c.bbox[2]) / 2), math.floor((c.bbox[1] + c.bbox[3]) / 2)\n", 457 | "    bbox_c = find_bounding_rectangle((c_x, c_y), lines)\n", 458 | "    bboxes[bbox_c] += 1\n", 459 | "\n", 460 | "    u_x, u_y = c.bbox[2], c.bbox[3]\n", 461 | "    bbox_u = find_bounding_rectangle((u_x, u_y), lines)\n", 462 | "    bboxes[bbox_u] += 1\n", 463 | "\n", 464 | "    # if all three points fall in different boxes, default to the character centre.\n", 465 | "    # otherwise choose the majority.\n", 466 | "    if max(bboxes.values()) == 1:\n", 467 | "        bbox = bbox_c\n", 468 | "    else:\n", 469 | "        bbox = max(bboxes.items(), key=lambda x: x[1])[0]\n", 470 | "\n", 471 | "    if bbox is None:\n", 472 | "        continue\n", 473 | "\n", 474 | "    if bbox in box_char_dict.keys():\n", 475 | "        box_char_dict[bbox].append(c)\n", 476 | "        continue\n", 477 | "\n", 478 | "    box_char_dict[bbox] = [c]" 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "To check that this has worked, we can plot 
the page again, focusing on one particular cell:" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 12, 491 | "metadata": { 492 | "collapsed": false 493 | }, 494 | "outputs": [ 495 | { 496 | "data": { 497 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAH1CAYAAADmjwUuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+MdUd93/HP17uBQlLMr2C3fmBvQvmlJA3kD3BEK5bQ\nBAwSpmogRErBhLSogEJClWJSRX6iSlVARQEUKU4UB5koARPyA7ci4CCzqSqBSYQNFIxtEp7FdvAT\nGuI0kIr2WU//uOc8Ozs7M2fm3HP37p37fkmrPWfOnPlxztnvc5+73ztrzjkBANbfJaseAABgGgR0\nAGgEAR0AGkFAB4BGENABoBHbq+rYzEivAYARnHMWK1/pK3TnXLNf11133crHwByZH/Nb/Tim/srh\nLRcAaAQBHQAaQUBfkt3d3VUPYelanyPzW2+tzy/Ght6TWVrHZm5VfQPAujIzudP4S1EAwHQI6ADQ\nCAI6ADSCgA4AjSCgA0AjCOgA0AgCOgA0goAOAI0goANAIwjoANAIAjoANIKADgCNIKADQCMI6ADQ\niKKAbmY/a2b/08w+a2a/bWYPM7OZmX3SzO42s/eZ2XZX92Fm9n4zu8fMPmFmT1ruFDJmM8ns+Nf2\ndn4/V6ekbk1ZSR+5thaZyxTjP6nrtKxrMGX/i7a5zHGnnr1VXrvcz8OYezubrSzUnBaD66Gb2T+W\n9D8kPd0593/N7CZJH5b0YkkfdM79rpn9qqQ7nHO/Zmb/TtL3Oedeb2Y/JulfOudeGWl3+euhm0mx\nPsLyWL1UnZK6ubJeblw141tkLmPGf1LXqaZslffTb2PRNpc57tP6s5CKAVP8LDRqivXQtyR9e/cq\n/BGS/lLS8yX9Xnf8Rkkv67av7vYl6YOSXjBm0ACAOoMB3Tn3l5LeIekrku6X9LeSPi3pQefcQ121\n+yRd0W1fIene7twDSQ+a2WMnHjcAILA9VMHMHq35q+4dzYP570p6UUUf0f8aSNLZs2cvbu/u7m7k\n3wAEgJy9vT3t7e0V1S15D/1HJb3QOfdvuv1/LekHJf2opMudcw+Z2ZWSrnPOXWVmH+m2bzOzLUlf\ndc49IdIu76GnxsV76LyHPvW4T+vPAu+hV1v0PfSvSLrSzP6BmZnm74l/XtLHJb28q/NqSR/qtm/u\n9tUdv3XswAEA5UreQ/+U5r/cvF3SZzR/C+XXJV0r6c1mdrekx0q6oTvlBkmPN7N7JP1MVw8AsGSD\nb7ksreMTeMvFLPq/EgCNcbzlIqnxT4q61NfWVvpYqk7unNix0rKSr5q+S/odc07tnKac/1B7Y+7n\nMvsfe+1q6tT2UTr/ZYwh1U/Jc114jWQm7exo0zX9Cn2hX9qU1p2qjl+2aJvL/KXcFP3k2sj94ngZ\n81rGdUn1cRK/7C0ZR+35fZk0ff9j6m+4jX2FDgCbhIAOAI0goANAIwjoANCI9gO62dGvsGx7O14W\n1k2VDdWJ7efG4bfpLyHaH5vNlnKZAKy/prNcyEMH1peTyHKJ2Ngsl6J8WGVyZ8fmDo/9qsnvrc05\nnioPeop+htqYqu2p5rBov1Pljk81jrFtLqP/TH3yyus1/Qp9qTnGubamWvAolscsHW8LwMbY2Ffo\nALBJCOgA0IjBP3CBFeCXuQBG4BX6aeQc75MDqEZAn8JsxqtqACvX9FsuJsUDbVhWUqe2rTF9RM7l\ndTqAUk2/Qnc7O8vLMU6Vje2j39/ZkXNu/rWz
Mw/u/XrPW1vk5gJIajsP/aSwfjOAE0IeOgBsAAI6\nADSCgA4AjSCgA0Aj2gzofV54uE55qsxfdzxVL1dHOvo9rJM6d2y92vNns+Frkppn6ZhL55A7P3Zs\nzD1ZtHzsc1Ayt6H59vcLGKGZLBfWPgewLhaJfbksl6Y+WOTCZWd74X5Y1m/n6pXUGdqfqt6Y8+cX\nqPzc2jGXziF3fsm4c2NddAzLfA5i55WOD01Z5ovPNt9yAYANREAHgEYQ0AGgEQR0AGgEAR0AGtFu\nQCefF8CGaSpt8Yg+/QsANkS7Ab0P5maH64n7/DJ/vfFUvZI6Q/tT1as9v19Dvabv2jGXziF3fuxY\nOO7cWBcdwzKfg9h5qfqseY+R2g3ofEADwIZp9z10ANgwBHQAaAQBHQAaQUAHgEa0G9D9LJfc2tNT\nrpkdlg+tj127pvYUefWzWV2fJfUWWV+8pP+SfmvHcNrXIq+9T2OvYT/n2Ww568Avci9y38Oy03Lf\nVoz10AHghLEeeoHkeuixsn5/bHmsj7CfVBv99nzQw+M8Osn45DO6ByDe/tA4S+qNuVal16m039ox\nlM7lBC10n8Zew367dxJ/D2Bsff97P9ZTcN9qsR46AGAQAR0AGkFAB4BGENABoBEEdABoRLsBPcxr\n7cv6nNt+P6wblvd5rn55314sf9Zvz9/2+/R/y72K/Nkwjzgsk+I5w3556bWKHc9dp1R7qX7DPnNj\niM0lVb8kV7q0LNwvvecl9ym8nkPXI9aWLxxvXxZen9QznxqL/zNTUn/oe2xcJTny4XMYHsvd+1Te\nf809XbLBPHQze6qkmyQ5SSbpuyX9gqTf6sp3JJ2T9Arn3N9257xb0lWSvinpGufcHZF2yUMHsJGW\nlYde9cEiM7tE0n2SniPpjZL+2jn3djN7i6THOOeuNbOrJL3ROfcSM3uOpHc5566MtDV5QE/m8Pr7\nfeCvzd8N26jJp831G56zBKPym/uxDs1pirz01DVI9VUyhtLx9vPM9V16bqqssM/ReeiL3JOSev54\nx17vknFMMdaha1YyV3++qfq5n+0BR+7zCLmAXvuWy7+Q9OfOuXslXS3pxq78xm5f3ff3SpJz7jZJ\nl5rZZdWjBgBUqQ3oPybpd7rty5xz5yXJOfeApD5oXyHpXu+c+7syAMASFX/038y+TdJLJb2lKwr/\nz1D9f4izZ89e3N7d3dXu7m5tEwDQtL29Pe3t7RXVLX4P3cxeKun1zrkXdft3Stp1zp03s8slfdw5\n9wwzu77bvqmr90VJz+tfzXvt8R4676Gnx5Tri/fQj+7zHnr5NSuZqz/fVP0G3kP/cUnv8/ZvlnRN\nt32NpA955a/qOr5S0oNhMAcATK8ooJvZIzX/hejve8Vvk/TDZnaXpB+S9EuS5Jz7sKQvm9mXJP2a\npNdPOmJgU5gdfg3ld/f1/O1UnnYuF7z/PpstdWpYjqL30J1zfy/pO4Oyr2se5GP137j40Baws3P0\nv09bW4f7Ozvz7/2+fyy375fHyobaSvXbl5+U8Nr4Y/H3peE5llyrWPnQdRvqq2QMpePt55nru/Tc\nVFm4f+bM8f6GHBzMzz04mH/1bfnluXP8OkP7krS/f3iv/DmNvd4l977256rkOcqVlTx3qfrh9ph7\nugRN/YGLVc3ltOParIeq3wPV1vP3x7yPjcmclvfQAQCnGAEdABpBQAeARhDQAaARBHQAaESbAX02\nO57D22/PZkePl6xfHZan1lPO7af6Pel83/Da1F6DMdcrVrfm2oXbpW0PrWO9yvtQwh+fvy+Vr3Me\nWwM+1XZYp/SZL713JceGnouxP7OlP/c1z/gpfGaaSlsEgHWwrLTF4sW51gFrucSNXiMk3A/nMFQ3\n10/ttRtaNyN3f0r6PYH7MCR6n4bGW3Mdw/3SPobqSvW58KljQ231xvzM+uct6xkvsMwXn22+5QIA\nG4iADgCN
\n", 498 | "text/plain": [ 499 | "" 500 | ] 501 | }, 502 | "metadata": {}, 503 | "output_type": "display_data" 504 | } 505 | ], 506 | "source": [ 507 | "import random\n", 508 | "xmin, ymin, xmax, ymax = current_page.bbox\n", 509 | "size = 6\n", 510 | "\n", 511 | "fig, ax = plt.subplots(figsize = (size, size * (ymax/xmax)))\n", 512 | "\n", 513 | "for l in lines:\n", 514 | " x0,y0,x1,y1,_ = l\n", 515 | " plt.plot([x0, x1], [y0, y1], 'k-')\n", 516 | " \n", 517 | "for c in characters:\n", 518 | " draw_rect(c, ax, \"red\")\n", 519 | " \n", 520 | "# plot the characters of a random 
cell as green\n", 521 | "for c in random.choice(box_char_dict.values()): \n", 522 | " draw_rect(c, ax, \"green\")\n", 523 | "\n", 524 | "plt.xlim(xmin, xmax)\n", 525 | "plt.ylim(ymin, ymax)\n", 526 | "plt.show()" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "To capture empty cells, I choose a grid of points across the page and try to assign each point to a cell. If the resulting cell isn't present in box_char_dict, it is created and left empty." 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 13, 539 | "metadata": { 540 | "collapsed": true 541 | }, 542 | "outputs": [], 543 | "source": [ 544 | "xmin, ymin, xmax, ymax = current_page.bbox\n", 545 | "\n", 546 | "for x in range(int(xmin), int(xmax), 10):\n", 547 | " for y in range(int(ymin), int(ymax), 10):\n", 548 | " bbox = find_bounding_rectangle((x, y), lines)\n", 549 | "\n", 550 | " if bbox is None:\n", 551 | " continue\n", 552 | "\n", 553 | " if bbox in box_char_dict.keys():\n", 554 | " continue\n", 555 | "\n", 556 | " box_char_dict[bbox] = []" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "All that remains is to map between the ordering of cells on the page and a Python data structure, and between the ordering of characters in a cell and a string. 
The two functions below carry this out:" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 14, 569 | "metadata": { 570 | "collapsed": true 571 | }, 572 | "outputs": [], 573 | "source": [ 574 | "def chars_to_string(chars):\n", 575 | " \"\"\"\n", 576 | " Converts a collection of characters into a string, by ordering them left to right, \n", 577 | " then top to bottom.\n", 578 | " \"\"\"\n", 579 | " if not chars:\n", 580 | " return \"\"\n", 581 | " rows = sorted(list(set(c.bbox[1] for c in chars)), reverse=True)\n", 582 | " text = \"\"\n", 583 | " for row in rows:\n", 584 | " sorted_row = sorted([c for c in chars if c.bbox[1] == row], key=lambda c: c.bbox[0])\n", 585 | " text += \"\".join(c.get_text() for c in sorted_row)\n", 586 | " return text\n", 587 | "\n", 588 | "\n", 589 | "def boxes_to_table(box_record_dict):\n", 590 | " \"\"\"\n", 591 | " Converts a dictionary of cell:characters mapping into a python list\n", 592 | " of lists of strings. Tries to split cells into rows, then for each row \n", 593 | " breaks it down into columns.\n", 594 | " \"\"\"\n", 595 | " boxes = box_record_dict.keys()\n", 596 | " rows = sorted(list(set(b[1] for b in boxes)), reverse=True)\n", 597 | " table = []\n", 598 | " for row in rows:\n", 599 | " sorted_row = sorted([b for b in boxes if b[1] == row], key=lambda b: b[0])\n", 600 | " table.append([chars_to_string(box_record_dict[b]) for b in sorted_row])\n", 601 | " return table" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "The results aren't bad:" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": 15, 614 | "metadata": { 615 | "collapsed": false 616 | }, 617 | "outputs": [ 618 | { 619 | "data": { 620 | "text/plain": [ 621 | "[[u'The Rt Hon Jeremy Hunt, Secretary of State for Health '],\n", 622 | " [u'Date of Meeting ',\n", 623 | " u'Name of Organisation ',\n", 624 | " u'Purpose of Meeting '],\n", 625 | " [u'January 2015 ',\n", 
626 | " u'College of Emergency Medicine ',\n", 627 | " u'Catch-up discussion on NHS winter pressures '],\n", 628 | " [u'January 2015 ',\n", 629 | " u'Ovarian Cancer Action ',\n", 630 | " u'Introductory discussion on ovarian cancer services '],\n", 631 | " [u'January 2015 ',\n", 632 | " u'World Health Organisation ',\n", 633 | " u'Bi-lateral with Margaret Chan, Director General '],\n", 634 | " [u'January 2015 ',\n", 635 | " u'Greene King ',\n", 636 | " u'Discussion on Dementia Friends programme '],\n", 637 | " [u'January 2015 ',\n", 638 | " u'Royal College of General Practitioners ',\n", 639 | " u'Catch-up discussion on NHS primary care services '],\n", 640 | " [u'January 2015 ', u'Unison ', u'Discussion on NHS industrial relations '],\n", 641 | " [u'January 2015 ',\n", 642 | " u'National Association of Primary Care ',\n", 643 | " u'Catch-up discussion on NHS primary care services '],\n", 644 | " [u'February 2015 ',\n", 645 | " u'Cambridge Health Network ',\n", 646 | " u'Roundtable discussion on healthcare technology '],\n", 647 | " [u'February 2015 ', u'Eli Lilly ', u'Introductory discussion '],\n", 648 | " [u'February 2015 ',\n", 649 | " u'Air Accidents Investigation Branch ',\n", 650 | " u'Introductory discussion on airline safety '],\n", 651 | " [u'March 2015 ', u'OC&C Stategy Consultants ', u'Discussion on leadership '],\n", 652 | " [u'March 2015 ',\n", 653 | " u'World Health Organisation ',\n", 654 | " u'International Summit in Geneva '],\n", 655 | " [u'March 2015 ',\n", 656 | " u'Organisation for Economic Co-operation and Development ',\n", 657 | " u'Discussion on healthcare services in England '],\n", 658 | " [u'March 2015 ',\n", 659 | " u'Macmillan, Marie Curie, National Council for Palliative Care, Cicely Saunders International, Motor Neurone Disease Association, Sue Ryder, Hospice Uk ',\n", 660 | " u'Roundtable discussion on end of life care '],\n", 661 | " [u' '],\n", 662 | " [u'Dr. 
Dan Poulter / Parliamentary Under-Secretary for State for Health '],\n", 663 | " [u'Date of Meeting ',\n", 664 | " u'Name of Organisation ',\n", 665 | " u'Purpose of Meeting '],\n", 666 | " [u'January 2015 ',\n", 667 | " u'Group B Strep Support ',\n", 668 | " u'Cross party delegation to discuss concerns of Group B Strep Support Group '],\n", 669 | " [u'January 2015 ',\n", 670 | " u'British Medical Association ',\n", 671 | " u'Discussion on General Medical Council objectives '],\n", 672 | " [u'January 2015 ',\n", 673 | " u'General Medical Council, Health Education England and Medical Schools Council ',\n", 674 | " u'Discussion on point of registration ']]" 675 | ] 676 | }, 677 | "execution_count": 15, 678 | "metadata": {}, 679 | "output_type": "execute_result" 680 | } 681 | ], 682 | "source": [ 683 | "boxes_to_table(box_char_dict)" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "## Conclusions\n", 691 | "We have achieved what we set out to do: extract tabular information from a PDF into a data structure that we can use.\n", 692 | "\n", 693 | "This approach misses certain things, such as the distinction between separate tables and between header rows and data rows, but depending on the document these can often be inferred from the structure. \n", 694 | "\n", 695 | "Hopefully this will be of use to someone. I had considered packaging the whole thing up into a module, but I am not sure how well this approach generalises to other documents. If there is sufficient interest I'm happy to consider spending the time on it.\n", 696 | "\n", 697 | "## How it was Made\n", 698 | "This post was created as a Jupyter notebook. You can download a copy [here](https://github.com/ijmbarr/parsing-pdfs)." 
699 | ] 700 | } 701 | ], 702 | "metadata": { 703 | "kernelspec": { 704 | "display_name": "Python 2", 705 | "language": "python", 706 | "name": "python2" 707 | }, 708 | "language_info": { 709 | "codemirror_mode": { 710 | "name": "ipython", 711 | "version": 2 712 | }, 713 | "file_extension": ".py", 714 | "mimetype": "text/x-python", 715 | "name": "python", 716 | "nbconvert_exporter": "python", 717 | "pygments_lexer": "ipython2", 718 | "version": "2.7.11+" 719 | } 720 | }, 721 | "nbformat": 4, 722 | "nbformat_minor": 0 723 | } 724 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Iain Barr 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /PDFFixup/__init__.py: -------------------------------------------------------------------------------- 1 | import fixer 2 | -------------------------------------------------------------------------------- /PDFFixup/fixer.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | 3 | import math 4 | import pdfminer 5 | from collections import defaultdict 6 | 7 | 8 | from pdfminer.pdfparser import PDFParser 9 | from pdfminer.pdfdocument import PDFDocument 10 | from pdfminer.pdfpage import PDFPage 11 | from pdfminer.pdfpage import PDFTextExtractionNotAllowed 12 | from pdfminer.pdfinterp import PDFResourceManager 13 | from pdfminer.pdfinterp import PDFPageInterpreter 14 | 15 | from pdfminer.layout import LAParams 16 | from pdfminer.converter import PDFPageAggregator 17 | 18 | TEXT_ELEMENTS = [ 19 | pdfminer.layout.LTTextBox, 20 | pdfminer.layout.LTTextBoxHorizontal, 21 | pdfminer.layout.LTTextLine, 22 | pdfminer.layout.LTTextLineHorizontal 23 | ] 24 | 25 | 26 | def extract_layout_by_page(pdf_path): 27 | """ 28 | Extracts the layouts of the pages of a PDF document 29 | specified by pdf_path. 30 | 31 | Uses the PDFminer library. See its documentation for 32 | details of the objects returned. 
33 | 34 | See: 35 | - https://euske.github.io/pdfminer/programming.html 36 | - http://denis.papathanasiou.org/posts/2010.08.04.post.html 37 | """ 38 | laparams = LAParams() 39 | 40 | fp = open(pdf_path, 'rb') 41 | parser = PDFParser(fp) 42 | document = PDFDocument(parser) 43 | 44 | if not document.is_extractable: 45 | raise PDFTextExtractionNotAllowed 46 | 47 | rsrcmgr = PDFResourceManager() 48 | device = PDFPageAggregator(rsrcmgr, laparams=laparams) 49 | interpreter = PDFPageInterpreter(rsrcmgr, device) 50 | 51 | layouts = [] 52 | for page in PDFPage.create_pages(document): 53 | interpreter.process_page(page) 54 | layouts.append(device.get_result()) 55 | 56 | return layouts 57 | 58 | 59 | def get_tables(pdf_path): 60 | """ 61 | Tries to extract tabular information from the document in pdf_path. 62 | :param pdf_path: pdf document path 63 | :return: List of pages, each page is a list of lists 64 | """ 65 | return [page_to_table(page_layout) for page_layout in 66 | extract_layout_by_page(pdf_path)] 67 | 68 | 69 | def page_to_table(page_layout): 70 | """ 71 | Given a pdfminer page object, tries to convert it to a table 72 | :param page_layout: pdfminer page layout object 73 | :return: list of lists 74 | """ 75 | texts = [] 76 | rects = [] 77 | other = [] 78 | 79 | for e in page_layout: 80 | if isinstance(e, pdfminer.layout.LTTextBoxHorizontal): 81 | texts.append(e) 82 | elif isinstance(e, pdfminer.layout.LTRect): 83 | rects.append(e) 84 | else: 85 | other.append(e) 86 | 87 | # convert text elements to characters 88 | # and rectangles to lines 89 | characters = extract_characters(texts) 90 | lines = [cast_as_line(r) for r in rects 91 | if width(r) < 2 and 92 | area(r) > 1] 93 | 94 | # match each character to a bounding rectangle where possible 95 | box_char_dict = {} 96 | for c in characters: 97 | # choose the bounding box found most often among the character's lower-left corner, centre and upper-right corner: 98 | bboxes = defaultdict(int) 99 | l_x, l_y = c.bbox[0], c.bbox[1] 100 | bbox_l = find_bounding_rectangle((l_x, l_y), 
lines) 101 | bboxes[bbox_l] += 1 102 | 103 | c_x, c_y = math.floor((c.bbox[0] + c.bbox[2]) / 2), math.floor((c.bbox[1] + c.bbox[3]) / 2) 104 | bbox_c = find_bounding_rectangle((c_x, c_y), lines) 105 | bboxes[bbox_c] += 1 106 | 107 | u_x, u_y = c.bbox[2], c.bbox[3] 108 | bbox_u = find_bounding_rectangle((u_x, u_y), lines) 109 | bboxes[bbox_u] += 1 110 | 111 | # if all values are in different boxes, default to character center. 112 | # otherwise choose the majority. 113 | if max(bboxes.values()) == 1: 114 | bbox = bbox_c 115 | else: 116 | bbox = max(bboxes.items(), key=lambda x: x[1])[0] 117 | 118 | if bbox is None: 119 | continue 120 | 121 | if bbox in box_char_dict.keys(): 122 | box_char_dict[bbox].append(c) 123 | continue 124 | 125 | box_char_dict[bbox] = [c] 126 | 127 | # look for empty bounding boxes by scanning 128 | # over a grid of values on the page 129 | for x in range(100, 550, 10): 130 | for y in range(50, 800, 10): 131 | bbox = find_bounding_rectangle((x, y), lines) 132 | 133 | if bbox is None: 134 | continue 135 | 136 | if bbox in box_char_dict.keys(): 137 | continue 138 | 139 | box_char_dict[bbox] = [] 140 | 141 | return boxes_to_table(box_char_dict) 142 | 143 | 144 | def flatten(lst): 145 | """ 146 | Flattens a list of lists by one level. 
147 | :param lst: list of lists 148 | :return: list 149 | """ 150 | return [subelem for elem in lst for subelem in elem] 151 | 152 | 153 | def extract_characters(element): 154 | if isinstance(element, pdfminer.layout.LTChar): 155 | return [element] 156 | 157 | if any(isinstance(element, i) for i in TEXT_ELEMENTS): 158 | elements = [] 159 | for e in element: 160 | elements += extract_characters(e) 161 | return elements 162 | 163 | if isinstance(element, list): 164 | return flatten([extract_characters(l) for l in element]) 165 | 166 | return [] 167 | 168 | 169 | def width(rect): 170 | x0, y0, x1, y1 = rect.bbox 171 | return min(x1 - x0, y1 - y0) 172 | 173 | 174 | def length(rect): 175 | x0, y0, x1, y1 = rect.bbox 176 | return max(x1 - x0, y1 - y0) 177 | 178 | 179 | def area(rect): 180 | x0, y0, x1, y1 = rect.bbox 181 | return (x1 - x0) * (y1 - y0) 182 | 183 | 184 | def cast_as_line(rect): 185 | x0, y0, x1, y1 = rect.bbox 186 | 187 | if x1 - x0 > y1 - y0: 188 | return (x0, y0, x1, y0, "H") 189 | else: 190 | return (x0, y0, x0, y1, "V") 191 | 192 | 193 | def does_it_intersect(x, (xmin, xmax)): 194 | return (x <= xmax and x >= xmin) 195 | 196 | 197 | def find_bounding_rectangle((x, y), lines): 198 | v_intersects = [l for l in lines 199 | if l[4] == "V" 200 | and does_it_intersect(y, (l[1], l[3]))] 201 | 202 | h_intersects = [l for l in lines 203 | if l[4] == "H" 204 | and does_it_intersect(x, (l[0], l[2]))] 205 | 206 | if len(v_intersects) < 2 or len(h_intersects) < 2: 207 | return None 208 | 209 | v_left = [v[0] for v in v_intersects 210 | if v[0] < x] 211 | 212 | v_right = [v[0] for v in v_intersects 213 | if v[0] > x] 214 | 215 | if len(v_left) == 0 or len(v_right) == 0: 216 | return None 217 | 218 | x0, x1 = max(v_left), min(v_right) 219 | 220 | h_down = [h[1] for h in h_intersects 221 | if h[1] < y] 222 | 223 | h_up = [h[1] for h in h_intersects 224 | if h[1] > y] 225 | 226 | if len(h_down) == 0 or len(h_up) == 0: 227 | return None 228 | 229 | y0, y1 = max(h_down), 
min(h_up) 230 | 231 | return (x0, y0, x1, y1) 232 | 233 | 234 | def chars_to_string(chars): 235 | if not chars: 236 | return "" 237 | rows = sorted(list(set(c.bbox[1] for c in chars)), reverse=True) 238 | text = "" 239 | for row in rows: 240 | sorted_row = sorted([c for c in chars if c.bbox[1] == row], key=lambda c: c.bbox[0]) 241 | text += "".join(c.get_text() for c in sorted_row) 242 | return text 243 | 244 | 245 | def boxes_to_table(box_record_dict): 246 | boxes = box_record_dict.keys() 247 | rows = sorted(list(set(b[1] for b in boxes)), reverse=True) 248 | table = [] 249 | for row in rows: 250 | sorted_row = sorted([b for b in boxes if b[1] == row], key=lambda b: b[0]) 251 | table.append([chars_to_string(box_record_dict[b]) for b in sorted_row]) 252 | return table 253 | -------------------------------------------------------------------------------- /data/DH_Ministerial_gifts_hospitality_travel_and_external_meetings_Jan_to_Mar_2015.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ijmbarr/parsing-pdfs/85eaa41430b460c18f0b7ba1239109e67b6a227e/data/DH_Ministerial_gifts_hospitality_travel_and_external_meetings_Jan_to_Mar_2015.pdf -------------------------------------------------------------------------------- /data/Ministerial_Quarterly_Transparency_information_-_April-June_2014.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ijmbarr/parsing-pdfs/85eaa41430b460c18f0b7ba1239109e67b6a227e/data/Ministerial_Quarterly_Transparency_information_-_April-June_2014.pdf -------------------------------------------------------------------------------- /data/example_out.csv: -------------------------------------------------------------------------------- 1 | The Rt Hon Jeremy Hunt, Secretary of State for Health ,,,,,, 2 | Date gift given ,From ,Gift ,Value ,Outcome ,, 3 | Nil , , ,,,, 4 | Dr. 
Dan Poulter / Parliamentary Under-Secretary for State for Health ,,,,,, 5 | Date gift given ,From ,Gift ,Value ,Outcome ,, 6 | Nil , , ,,,, 7 | Norman Lamb MP, Minister of State for Care and Support ,,,,,, 8 | Date gift given ,From ,Gift ,Value ,Outcome ,, 9 | Nil , , , , ,, 10 | Jane Ellison MP, Parliamentary Under Secretary of State for Public Health ,,,,,, 11 | Date gift given ,From ,Gift ,Value ,Outcome ,, 12 | Nil , , , , ,, 13 | Earl Howe, Parliamentary Under-Secretary of State for Quality ,,,,,, 14 | Date gift given ,From ,Gift ,Value ,Outcome ,, 15 | Nil , , , , ,, 16 | The Rt Hon Jeremy Hunt, Secretary of State for Health ,,,,,, 17 | Date gift received ,From ,Gift ,Value ,Outcome ,, 18 | Nil , , ,,,, 19 | Norman Lamb MP, Minister of State for Care and Support ,,,,,, 20 | Date gift received ,From ,Gift ,Value ,Outcome ,, 21 | Nil , , , , ,, 22 | Jane Ellison MP, Parliamentary Under Secretary of State for Public Health ,,,,,, 23 | Date gift received ,From ,Gift ,Value ,Outcome ,, 24 | Nil , , , , ,, 25 | Dr Dan Poulter MP, Parliamentary Under Secretary of State for Health ,,,,,, 26 | Date gift received ,From ,Gift ,Value ,Outcome ,, 27 | Nil , , , , ,, 28 | Earl Howe, Parliamentary Under-Secretary of State for Quality ,,,,,, 29 | Date gift received ,From ,Gift ,Value ,Outcome ,, 30 | Nil , , , , ,, 31 | The Rt Hon Jeremy Hunt, Secretary of State for Health ,,,,,, 32 | Date ,Name of Organisation ,Type of Hospitality Received ,,,, 33 | 6 January 2015 ,Royal College of Physicians ,Dinner ,,,, 34 | 4 February 2015 ,College of Emergency Medicine ,Dinner ,,,, 35 | Dr. 
Dan Poulter / Parliamentary Under-Secretary for State for Health ,,,,,, 36 | Date ,Name of Organisation ,Type of Hospitality Received ,,,, 37 | Nil , , ,,,, 38 | Norman Lamb MP, Minister of State for Care and Support ,,,,,, 39 | Date ,Name of Organisation ,Type of Hospitality Received ,,,, 40 | 12 March 2015 ,The Times ,Dinner ,,,, 41 | 17 March 2015 ,Mental Health Annual Conference ,Dinner ,,,, 42 | Earl Howe, Parliamentary-under-Secretary of State for Quality ,,,,,, 43 | Date ,Name of Organisation ,Type of Hospitality Received ,,,, 44 | 4 February 2015 ,College of Emergency Medicine ,Dinner ,,,, 45 | Jane Ellison MP, Parliamentary Under Secretary of State for Public Health,,,,,, 46 | Date ,Name of Organisation ,Type of Hospitality Received ,,,, 47 | Nil , , ,,,, 48 | The Rt Hon Jeremy Hunt, Secretary of State for Health ,,,,,, 49 | Date(s) of trip ,Destination ,Purpose of trip ,‘Scheduled’ ‘No 32 (The Royal) Squadron’ or ‘other RAF’ or ‘Chartered’ or ‘Eurostar’ ,Number of officials accompanying Minister, where non-scheduled travel is used ,Total cost including travel, and accommodation of Minister only , 50 | 16 – 17 March 2015 ,Geneva, Switzerland ,To attend a World Health Organisation summit ,Scheduled , , ,£265 51 | Dr. 
Dan Poulte / Parliamentary Under-Secretary for State for Health ,,,,,, 52 | Date(s) of trip ,Destination ,Purpose of trip ,‘Scheduled’ ‘No 32 (The Royal) Squadron’ or ‘other RAF’ or ‘Chartered’ or ‘Eurostar’ ,Number of officials accompanying Minister, where non-scheduled travel is used ,Total cost including travel, and accommodation of Minister only , 53 | Nil , , , , , , 54 | r,,,,,, 55 | ,,,,,, 56 | ,,,,,, 57 | Date(s) of trip ,Destination ,Purpose of trip ,‘No 32 (The Royal) Squadron’ or ‘other RAF’ or ‘Charter’ or ‘Eurostar’ ,Number of officials accompanying Minister, where non-scheduled travel is used ,Total cost including travel, and accommodation of Minister only , 58 | Nil , , , , , , 59 | Earl Howe, Parliamentary-under-Secretary of State for Quality ,,,,,, 60 | Date(s) of trip ,Destination , Purpose of trip ,‘No 32 (The Royal) Squadron’ or ‘other RAF’ or ‘Charter’ or ‘Eurostar’ ,Number of officials accompanying Minister, where non-scheduled travel is used ,Total cost including travel, and accommodation of Minister only , 61 | Nil , , , , , , 62 | ,,,,,, 63 | Jane Ellison MP, Parliamentary Under Secretay of State fo Public Health ,,,,,, 64 | Date(s) of trip ,Destination ,Purpose of trip ,‘No 32 (The Royal) Squadron’ or ‘other RAF’ or ‘Charter’ or ‘Eurostar’ ,Number of officials accompanying Minister, where non-scheduled travel is used ,Total cost including travel, and accommodation of Minister only , 65 | Nil , , , , , , 66 | The Rt Hon Jeremy Hunt, Secretary of State for Health ,,,,,, 67 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 68 | January 2015 ,College of Emergency Medicine ,Catch-up discussion on NHS winter pressures ,,,, 69 | January 2015 ,Ovarian Cancer Action ,Introductory discussion on ovarian cancer services ,,,, 70 | January 2015 ,World Health Organisation ,Bi-lateral with Margaret Chan, Director General ,,,, 71 | January 2015 ,Greene King ,Discussion on Dementia Friends programme ,,,, 72 | January 2015 ,Royal College of 
General Practitioners ,Catch-up discussion on NHS primary care services ,,,, 73 | January 2015 ,Unison ,Discussion on NHS industrial relations ,,,, 74 | January 2015 ,National Association of Primary Care ,Catch-up discussion on NHS primary care services ,,,, 75 | February 2015 ,Cambridge Health Network ,Roundtable discussion on healthcare technology ,,,, 76 | February 2015 ,Eli Lilly ,Introductory discussion ,,,, 77 | February 2015 ,Air Accidents Investigation Branch ,Introductory discussion on airline safety ,,,, 78 | March 2015 ,OC&C Stategy Consultants ,Discussion on leadership ,,,, 79 | March 2015 ,World Health Organisation ,International Summit in Geneva ,,,, 80 | March 2015 ,Organisation for Economic Co-operation and Development ,Discussion on healthcare services in England ,,,, 81 | March 2015 ,Macmillan, Marie Curie, National Council for Palliative Care, Cicely Saunders International, Motor Neurone Disease Association, Sue Ryder, Hospice Uk ,Roundtable discussion on end of life care ,,,, 82 | ,,,,,, 83 | Dr. Dan Poulter / Parliamentary Under-Secretary for State for Health ,,,,,, 84 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 85 | January 2015 ,Group B Strep Support ,Cross party delegation to discuss concerns of Group B Strep Support Group ,,,, 86 | January 2015 ,British Medical Association ,Discussion on General Medical Council objectives ,,,, 87 | January 2015 ,General Medical Council, Health Education England and Medical Schools Council ,Discussion on point of registration ,,,, 88 | January 2015 ,Federation of Surgical Speciality Associations ,Catch-up discussion on medical innovation ,,,, 89 | February 2015 ,Allied Health Professional Forum ,Catch-up discussion with Allied Health Professional leads ,,,, 90 | March 2015 ,Children and Young People's Forum ,Catch up discussion on children and young people ,,,, 91 | March 2015 ,Royal College of Physicians ,Discussion on workforce. 
,,,, 92 | March 2015 ,Royal College of Obstetricians and Gynaecologists ,Catch-up discussion on gynaecology ,,,, 93 | Norman Lamb MP, Minister of State for Care and Support ,,,,,, 94 | Date of meeting ,Name of organisation ,Purpose of meeting ,,,, 95 | January 2015 ,Think Ahead ,Catch-up discussion ,,,, 96 | January 2015 ,Care England ,Discussion on integrated care ,,,, 97 | January 2015 ,Centreforum ,Discussion on social care ,,,, 98 | January 2015 ,Challenging Behaviour Foundation ,Discussion on learning disability ,,,, 99 | January 2015 ,National Council for Palliative Care and Dying Matters ,Discussion on end of life care ,,,, 100 | January 2015 ,The Priory ,Discussion on addiction and mental health ,,,, 101 | January 2015 ,Voluntary, Community and Social Enterprise Steering Group ,Discussion on upcoming review of VCSE organisations ,,,, 102 | January 2015 ,Care and Support Transformation Group ,Discussion on social care ,,,, 103 | February 2015 ,Local Government Association ,Discussion on sector-led improvement ,,,, 104 | February 2015 ,UK Council of Psychotherapists ,Discussion on gay conversion therapy ,,,, 105 | February 2015 ,Carers call to action ,Discussion on carers and dementia ,,,, 106 | February 2015 ,Learning Disability Programme Board ,Discussion on learning disability ,,,, 107 | February 2015 ,Service User and Care Advisory Group ,Discussion on social care ,,,, 108 | February 2015 ,Dementia Programme Board ,Discussion on dementia ,,,, 109 | February 2015 ,Carers UK ,Social care discussion ,,,, 110 | February 2015 ,Black mental health UK ,Discussion on mental health in BME population ,,,, 111 | February 2015 ,Student Minds and Royal College of General Practitioners ,Discussion on eating disorders ,,,, 112 | March 2015 ,Association of Business Insurers ,Discussion about financial products in social care ,,,, 113 | March 2015 ,Transforming Care Assurance Board ,Discussion on learning disabilities ,,,, 114 | March 2015 ,Think Ahead ,Regular catch up 
,,,, 115 | March 2015 ,Ofcom, Samaritans, BT ,Discussion to set up free phone line for suicide prevention ,,,, 116 | March 2015 ,Kate Nash and Associates ,Discussion on learning disabilities ,,,, 117 | March 2015 ,Bishop James/Gosport Inquiry Panel ,Discussion on abuse in Gosport ,,,, 118 | March 2015 ,British Youth Council ,Discussion on young people in ,,,, 119 | Government ,,,,,, 120 | March 2015 ,Southern Health ,Discussion on standards of care ,,,, 121 | March 2015 ,The Haven ,Discussion on concerns about funding for Haven project ,,,, 122 | March 2015 ,People First England ,Discussion on learning disabilities ,,,, 123 | March 2015 ,Macmillan, Marie Curie, National Council for Palliative Care, Cicely Saunders International, Motor Neurone Disease Association, Sue Ryder, Hospice UK ,Roundtable discussion on end of life care ,,,, 124 | March 2015 ,Crisis Care Steering Group ,Discussion on crisis care ,,,, 125 | March 2015 ,Autism Programme Board ,Discussion on autism ,,,, 126 | Earl Howe, Parliamentary-under-Secretary of State for Quality ,,,,,, 127 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 128 | January 2015 ,World Health Organisation ,Bi-lateral meeting with Director General Margaret Chan ,,,, 129 | January 2015 ,Association of British Healthcare Industries ,Discussion on patient issues on industry, Healthcare Acquired Infections, Engagement with NHS England ,,,, 130 | January 2015 ,Specialised Healthcare Alliance ,Discussion on specialised commissioning ,,,, 131 | January 2015 ,British Medical Association ,Catch up discussion ,,,, 132 | 5 February 2015 ,UK Sepsis Trust ,Discussion on public health aspect of sepsis ,,,, 133 | February 2015 ,Specialised Healthcare Alliance ,Continued discussion on specialised commissioning ,,,, 134 | February 2015 ,Which? 
,Discussion on dentistry pricing transparency ,,,, 135 | February 2015 ,Wisper Public Affairs ,Discussion on undiagnosed HIV ,,,, 136 | February 2015 ,Neuro Foundation ,Discussion on rare diseases ,,,, 137 | March 2015 ,Pharmaceutical Services Negotiating Committee ,Discussion on community pharmacy ,,,, 138 | March 2015 ,Self Care Forum ,Discussion on self care ,,,, 139 | March 2015 ,Healthwatch UK ,Catch-up with Anna Bradley ,,,, 140 | March 2015 ,Health Foundation ,Presentation of work of Health Foundation ,,,, 141 | Jane Ellison MP, Parliamentary Under Secretary of State for Public Health ,,,,,, 142 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 143 | January 2015 ,Tesco ,Discussion on Tesco’s commitment to customers for making healthier choices. ,,,, 144 | January 2015 ,Youth Sport Trust ,Discussion on physical activity in schools. ,,,, 145 | March 2015 ,Yum Restaurants ,Discussion on the Responsibility Deal. ,,,, 146 | March 2015 ,Black Mental Health UK ,Introductory discussion ,,,, 147 | March 2015 ,Prostate Cancer UK ,General discussion on cancer policy. 
,,,, 148 | Name of Special Adviser – Sue Beeby,,,,,, 149 | Date of hospitality ,Name of organisation ,Type of hospitality received ,,,, 150 | Nil , , ,,,, 151 | Name of Special Adviser – Christina obinson,,,,,, 152 | Date of hospitality ,Name of organisation ,Type of hospitality received ,,,, 153 | Nil , , ,,,, 154 | R,,,,,, 155 | Name of Special Adviser – Edward Jones,,,,,, 156 | Date of hospitality ,Name of organisation ,Type of hospitality received ,,,, 157 | 21January 2015 ,ITV ,Ticket to National Television Awards ,,,, 158 | Name of Special Adviser – Paul Harrion,,,,,, 159 | Date of hospitality ,Name of organisation ,Type of hospitality received ,,,, 160 | Nil , , ,,,, 161 | s,,,,,, 162 | Name of Special Adviser – Christina Robinson ,,,,,, 163 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 164 | Nil , , ,,,, 165 | Name of Special Aviser – Edward Jones ,,,,,, 166 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 167 | Nil , , ,,,, 168 | d,,,,,, 169 | Name of Special Aviser – Paul Harrison ,,,,,, 170 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 171 | 14 January 2015 ,ITN ,Background briefing/general catch up ,,,, 172 | d,,,,,, 173 | Name of Special Aviser – Sue Beeby ,,,,,, 174 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 175 | 14 January 2015 ,ITN ,Background briefing/general catch up ,,,, 176 | d,,,,,, 177 | Permanent Secretary, Department of Health, Dame Una O’Brien DCB ,,,,,, 178 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 179 | January 2015 ,The Kennedy Trust ,Catch-up discussion ,,,, 180 | January 2015 ,Health Service Ombudsman ,Catch-up discussion ,,,, 181 | February 2015 ,Association of Chief Executives of Voluntary Organisations ,Catch-up discussion ,,,, 182 | March 2015 ,Academy of Medical Royal Colleges ,Introductory discussion with new President ,,,, 183 | March 2015 ,National Audit Office ,Introductory discussion with new non-executive Board member ,,,, 184 
| March 2015 ,United Kingdom Homecare Association ,Catch-up discussion ,,,, 185 | March 2015 ,Lilly UK ,Introductory discussion ,,,, 186 | Chief Medical Officer, Dame Sally Davies DBE ,,,,,, 187 | Date of Meeting ,Name of Organisation ,Purpose of Meeting ,,,, 188 | January 2015 ,Royal College of Nursing ,Catch-up discussion on Antimicrobial resistance and ebola ,,,, 189 | January 2015 ,Imperial College London ,Introductory discussion ,,,, 190 | January 2015 ,Sport England ,Catch-up discussion ,,,, 191 | January 2015 ,Merck Sharp and Dohme Ltd ,Catch-up discussion ,,,, 192 | January 2015 ,General Medical Council ,Catch-up discussion ,,,, 193 | January 2015 ,Academy of Medical Sciences ,Catch-up discussion ,,,, 194 | January 2015 ,Feeling Nuts Campaign ,Introductory discussion ,,,, 195 | January 2015 ,World Health Organisation ,Executive Board meeting ,,,, 196 | January 2015 ,Medecin Sans Frontieres ,Catch-up meeting on Ebola response ,,,, 197 | February 2015 ,Engineering and Physical Sciences Research Council ,Introductory discussion ,,,, 198 | February 2015 ,Wellcome Trust ,Catch-up discussion on Antimicrobial resistance ,,,, 199 | February 2015 ,Norlien Foundation ,Catch-up discussion ,,,, 200 | February 2015 ,World Innovation Summit for Health ,Catch-up discussion ,,,, 201 | February 2015 ,Alberta Innovates ,Catch-up discussion ,,,, 202 | March 2015 ,Medical Research Council ,Council meeting ,,,, 203 | March 2015 ,World Innovation Summit for Health ,Summit on international health and Antimicrobial resistance ,,,, 204 | March 2015 ,Academy of Medical Royal Colleges ,Catch-up meeting on Antimicrobial resistance ,,,, 205 | March 2015 ,The Foundation for Science and Technology ,Discussion on forward-look of foundation events ,,,, 206 | March 2015 ,Alberta innovates ,Catch-up discussion ,,,, 207 | March 2015 ,Cambridge Globalist ,Introductory discussion ,,,, 208 | March 2015 ,Japan Agency for Medical Research and Development ,Catch-up discussion ,,,, 209 | March 2015 
,GlaxoSmithKline ,Catch-up discussion on Ebola response ,,,, 210 | March 2015 ,Johnson and Johnson ,Catch-up discussion on Ebola response ,,,, 211 | -------------------------------------------------------------------------------- /data/ministerial_transparency_Apr-Jun_2014.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ijmbarr/parsing-pdfs/85eaa41430b460c18f0b7ba1239109e67b6a227e/data/ministerial_transparency_Apr-Jun_2014.pdf -------------------------------------------------------------------------------- /example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Short example on how to parse a whole file\n", 8 | "\n", 9 | "In my main blog post I walked through the steps of how I managed to extract tabular data from a PDF. I wrapped the whole thing in a few functions to make extracting from an entire file possible.\n", 10 | "\n", 11 | "First we import the relevant function:" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "from PDFFixup.fixer import get_tables" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "Next we run it over the whole file:" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "file_path = \"data/DH_Ministerial_gifts_hospitality_travel_and_external_meetings_Jan_to_Mar_2015.pdf\"\n", 41 | "extracted_table = get_tables(file_path)" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "12" 55 | ] 56 | }, 57 | "execution_count": 3, 
58 | "metadata": {}, 59 | "output_type": "execute_result" 60 | } 61 | ], 62 | "source": [ 63 | "len(extracted_table)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "The returned object is a list of pages, each page containing the tabular data:" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 4, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/plain": [ 83 | "[[u'Earl Howe, Parliamentary-under-Secretary of State for Quality '],\n", 84 | " [u'Date ', u'Name of Organisation ', u'Type of Hospitality Received '],\n", 85 | " [u'4 February 2015 ', u'College of Emergency Medicine ', u'Dinner '],\n", 86 | " [u' Jane Ellison MP, Parliamentary Under Secretary of State for Public Health'],\n", 87 | " [u'Date ', u'Name of Organisation ', u'Type of Hospitality Received '],\n", 88 | " [u'Nil ', u' ', u' '],\n", 89 | " [u'The Rt Hon Jeremy Hunt, Secretary of State for Health '],\n", 90 | " [u'Date(s) of trip ',\n", 91 | " u'Destination ',\n", 92 | " u'Purpose of trip ',\n", 93 | " u'\\u2018Scheduled\\u2019 \\u2018No 32 (The Royal) Squadron\\u2019 or \\u2018other RAF\\u2019 or \\u2018Chartered\\u2019 or \\u2018Eurostar\\u2019 ',\n", 94 | " u'Number of officials accompanying Minister, where non-scheduled travel is used ',\n", 95 | " u'Total cost including travel, and accommodation of Minister only '],\n", 96 | " [u'16 \\u2013 17 March 2015 ',\n", 97 | " u'Geneva, Switzerland ',\n", 98 | " u'To attend a World Health Organisation summit ',\n", 99 | " u'Scheduled ',\n", 100 | " u' ',\n", 101 | " u' ',\n", 102 | " u'\\xa3265 '],\n", 103 | " [u'Dr. 
Dan Poulte / Parliamentary Under-Secretary for State for Health '],\n", 104 | " [u'Date(s) of trip ',\n", 105 | " u'Destination ',\n", 106 | " u'Purpose of trip ',\n", 107 | " u'\\u2018Scheduled\\u2019 \\u2018No 32 (The Royal) Squadron\\u2019 or \\u2018other RAF\\u2019 or \\u2018Chartered\\u2019 or \\u2018Eurostar\\u2019 ',\n", 108 | " u'Number of officials accompanying Minister, where non-scheduled travel is used ',\n", 109 | " u'Total cost including travel, and accommodation of Minister only '],\n", 110 | " [u'Nil ', u' ', u' ', u' ', u' ', u' '],\n", 111 | " [u'r'],\n", 112 | " ['']]" 113 | ] 114 | }, 115 | "execution_count": 4, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "extracted_table[2]" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "To get things into a format that can be dumped into csv, we need to do a bit more work. The lists returned for each row can be different lengths. This reflects different sizes of the column widths in the original tables. To get around this we simply pad each row to the same length. 
The code below will do this, concatenate the pages and save the whole thing as a csv file:" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 5, 134 | "metadata": { 135 | "collapsed": false 136 | }, 137 | "outputs": [], 138 | "source": [ 139 | "def table_to_csv(extracted_table):\n", 140 | " max_length = 0\n", 141 | " \n", 142 | " #concatenate the pages\n", 143 | " concatenated_table = [row for page in extracted_table for row in page]\n", 144 | " \n", 145 | " #find the maximum length\n", 146 | " for row in concatenated_table:\n", 147 | " if len(row) > max_length:\n", 148 | " max_length = len(row)\n", 149 | " \n", 150 | " # convert to string\n", 151 | " out = \"\"\n", 152 | " for row in concatenated_table:\n", 153 | " # pad the row \n", 154 | " if len(row) < max_length:\n", 155 | " row += [\"\"] * (max_length - len(row))\n", 156 | " \n", 157 | " out += \",\".join(row) + \"\\n\"\n", 158 | " \n", 159 | " return out" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 6, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [], 169 | "source": [ 170 | "csved = table_to_csv(extracted_table)\n", 171 | "\n", 172 | "# Note: you might want to change the encoding, depending on what format your document is\n", 173 | "open(\"data/example_out.csv\", \"wb\").write(csved.encode(\"utf-8\"))" 174 | ] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Python 2", 180 | "language": "python", 181 | "name": "python2" 182 | }, 183 | "language_info": { 184 | "codemirror_mode": { 185 | "name": "ipython", 186 | "version": 2 187 | }, 188 | "file_extension": ".py", 189 | "mimetype": "text/x-python", 190 | "name": "python", 191 | "nbconvert_exporter": "python", 192 | "pygments_lexer": "ipython2", 193 | "version": "2.7.11+" 194 | } 195 | }, 196 | "nbformat": 4, 197 | "nbformat_minor": 0 198 | } 199 | --------------------------------------------------------------------------------
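The `table_to_csv` helper in `example.ipynb` joins fields with a bare comma, so a cell that itself contains a comma (for example `Geneva, Switzerland`) is split across two columns in `data/example_out.csv`. It also pads rows in place, which modifies the lists in `extracted_table`. The sketch below shows one possible alternative, assuming Python 3 and the standard `csv` module. The name `table_to_csv_quoted` and the sample rows are hypothetical and are not part of the repository.

```python
import csv
import io


def table_to_csv_quoted(pages):
    """Pad rows to a common width and serialise them with the csv module.

    Hypothetical helper, not part of PDFFixup: `pages` is assumed to be a
    list of pages, each page being a list of rows (lists of strings), like
    the object returned by get_tables in the notebook.
    """
    # Flatten the pages into one list of rows; copy each row so the
    # caller's lists are not modified by the padding step.
    rows = [list(row) for page in pages for row in page]

    # The widest row decides how many columns every row is padded to.
    width = max((len(row) for row in rows), default=0)

    buffer = io.StringIO()
    # csv.writer quotes any field containing a comma, quote or newline.
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [""] * (width - len(row)))
    return buffer.getvalue()


# Made-up sample in the shape of the extracted tables.
sample_pages = [
    [["Date(s) of trip ", "Destination "],
     ["16 - 17 March 2015 ", "Geneva, Switzerland "]],
    [["Nil "]],
]
csv_text = table_to_csv_quoted(sample_pages)
```

Because the comma-bearing cell is quoted on output, reading `csv_text` back with `csv.reader` should give the same number of columns in every row. In Python 3 the string can be written with `open(path, "w", encoding="utf-8", newline="")` instead of encoding it by hand.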