├── README.md ├── .gitignore └── CityBudgetExtractor.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # city_budget_explorer 2 | A repo to explore city budgets 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /CityBudgetExtractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# City Budget Extractor" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 5, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import PyPDF2\n", 17 | "import pandas as pd\n", 18 | "from pprint import pprint" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "After struggling for 20 minutes or so, I realized that the page numbering in the PDF document doesn't match what PyPDF2 pulls, because the budget PDF includes some \"intro\" pages that aren't counted by the GUI PDF reader but are counted by PyPDF2." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 4, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# This PDF has some special formatting that offsets the page numbers\n", 35 | "page_offset = 11\n", 36 | 
"filename = \"FY-19-20-Adopted-Budget.pdf\"\n", 37 | "\n", 38 | "reader = PyPDF2.PdfFileReader(filename)\n", 39 | "page = reader.getPage(279 + page_offset)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Let's take a look at what the extracted text looks like. Because it's a PDF I'm already expecting something terrible, and that's what it looks like we got. We get back one large string that sort of goes across the page row by row. We'll need to split this apart using some custom logic." 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "For the headers I'm just going to write them down manually. While I could write some clever Python, it's really not worth it because there are only 6 headers, they're the same from page to page, and I don't want the actual string from the text anyway." 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 6, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "HEADERS = [\"FY2015/16_Actual\", \"FY2016/17_Actual\", \"FY2017/18_Actual\", \"FY2018/19_Actual\", \"FY2018/19_Revised\", \"FY2019/20_Adopted\"]" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "With the headers done, let's extract each row. The pattern I see here is:\n", 77 | "\n", 78 | "1. A string that tells us the fund type, such as \"All Funds\" or \"General Fund\"\n", 79 | "2. An all-caps line that signifies the start of the block of expenses\n", 80 | "3. The line item of expenses\n", 81 | "4. 
6 rows that are the actual budget expenses for my city\n", 82 | "\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 16, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "class Parser:\n", 92 | " \"\"\"Parses the PDF file to extract the city expenditures\n", 93 | " Performs the work in three passes\n", 94 | " \n", 95 | " 1. Identifying the block headers and closing lines\n", 96 | " 2. Identifying the row titles\n", 97 | " 3. Parsing the row values and determining which ones can be parsed to a valid row\n", 98 | " \n", 99 | " \"\"\"\n", 100 | " \n", 101 | " HEADERS = [\"FY2015/16_Actual\", \"FY2016/17_Actual\",\n", 102 | " \"FY2017/18_Actual\", \"FY2018/19_Actual\",\n", 103 | " \"FY2018/19_Revised\", \"FY2019/20_Adopted\"]\n", 104 | " \n", 105 | " def __init__(self, page=279, page_offset=11, filename=\"FY-19-20-Adopted-Budget.pdf\"):\n", 106 | " \"\"\"Gets page from pdf and page text values\n", 107 | " \n", 108 | " Notes\n", 109 | " -----\n", 110 | " This PDF has some special formatting that offsets the page numbers\n", 111 | "\n", 112 | " filename = \"FY-19-20-Adopted-Budget.pdf\"\n", 113 | " \"\"\"\n", 114 | " self.page = PyPDF2.PdfFileReader(filename).getPage(page + page_offset)\n", 115 | " \n", 116 | " # Split the text into lines and strip surrounding whitespace\n", 117 | " \n", 118 | " # List Comprehension\n", 119 | " self.text = tuple([line.strip() for line in self.page.extractText().split(\"\\n\")])\n", 120 | " \n", 121 | " \n", 122 | " self.text = pd.Series(self.text)\n", 123 | "\n", 124 | " # Convert the numeric strings into ints\n", 125 | " self.text = self.text.apply(self.coerce_numbers)\n", 126 | " \n", 127 | " \n", 128 | " def parse(self):\n", 129 | " raise NotImplementedError\n", 130 | " \n", 131 | " @staticmethod\n", 132 | " def is_header(line):\n", 133 | " \n", 134 | " if isinstance(line, int):\n", 135 | " return False\n", 136 | " else:\n", 137 | " return line.isupper() and \"\".join(line.split(\" \")).isalpha()\n", 138 | " \n", 139 | " 
@staticmethod\n", 140 | " def coerce_numbers(line):\n", 141 | " \"\"\"Tries parsing number strings into numbers, else return string\"\"\"\n", 142 | " \n", 143 | " # Parenthesized values are accounting-style negatives, e.g. (1,234)\n", 144 | " try:\n", 145 | " if line[0] == \"(\" and line[-1] ==\")\":\n", 146 | " neg_number = -int(\"\".join(line[1:-1].split(\",\")))\n", 147 | " return neg_number\n", 148 | " except (IndexError, ValueError):\n", 149 | " pass\n", 150 | "\n", 151 | " # Otherwise try plain logic\n", 152 | " try:\n", 153 | " return int(\"\".join(line.split(\",\")))\n", 154 | " except ValueError:\n", 155 | " return line\n", 156 | " \n", 157 | " def parse_block_headers(self):\n", 158 | " \"\"\"Identify the headers from the page as well as ending line\"\"\"\n", 159 | " current_header = {}\n", 160 | " headers = []\n", 161 | " \n", 162 | " for i, line in self.text.items():\n", 163 | " if self.is_header(line):\n", 164 | " if line != current_header.get(\"header\"):\n", 165 | " current_header = {\"header\":line, \"start\":i}\n", 166 | " else:\n", 167 | " assert line == current_header.get(\"header\")\n", 168 | " current_header[\"end\"] = i\n", 169 | " headers.append(current_header)\n", 170 | " current_header = {}\n", 171 | " \n", 172 | " self.headers = pd.DataFrame(headers)\n", 173 | " return self.headers\n", 174 | " \n", 175 | " def parse_row_labels(self):\n", 176 | " \"\"\"Identify the row labels from the page \n", 177 | " \n", 178 | " Notes\n", 179 | " ----\n", 180 | " Row labels must appear after first header and before last header\n", 181 | " \n", 182 | " \"\"\"\n", 183 | " \n", 184 | " rows = []\n", 185 | " for i, (header, start, end) in self.headers.iterrows():\n", 186 | " \n", 187 | " row = {\"header\":header, \"values\":[]}\n", 188 | " \n", 189 | " # Get text block for this budget item block\n", 190 | " text_block = self.text[start+1:end]\n", 191 | " \n", 192 | " \n", 193 | " for line in text_block:\n", 194 | " \n", 195 | " # All rows should end with a % sign\n", 196 | " # TODO: There's still an 
issue where the line is not always bookended\n", 197 | " if \"%\" in str(line):\n", 198 | " if len(row[\"values\"]) == 6:\n", 199 | " row[\"complete\"] = True\n", 200 | " else:\n", 201 | " row[\"complete\"] = False\n", 202 | "\n", 203 | " # Explode numerical values and pair headers\n", 204 | " numbers = row.pop(\"values\")\n", 205 | " for key, val in zip(self.HEADERS, numbers):\n", 206 | " row[key] = val\n", 207 | "\n", 208 | " rows.append(row)\n", 209 | " row = {\"header\":header, \"values\":[]}\n", 210 | " \n", 211 | " # If line is a string in this block it's a row label\n", 212 | " elif isinstance(line, str):\n", 213 | " row[\"line_item\"] = line\n", 214 | " \n", 215 | " # Otherwise it's a numeric budget value\n", 216 | " else:\n", 217 | " assert isinstance(line, int)\n", 218 | " row[\"values\"].append(line)\n", 219 | " \n", 220 | " self.budget = pd.DataFrame(rows)\n", 221 | " return self.budget\n", 222 | "\n", 223 | " \n", 224 | " def parse_numbers(self):\n", 225 | " \"\"\"Identify indices of valid numbers from the page\"\"\"" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 17, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/html": [ 236 | "
| | header | line_item | complete | FY2015/16_Actual | FY2016/17_Actual | FY2017/18_Actual | FY2018/19_Actual | FY2018/19_Revised | FY2019/20_Adopted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PERSONNEL SERVICES | Salaries, Permanent | True | 34323749 | 34654406 | 25765375 | 37010295 | 37033765 | 36135762 |
| 1 | PERSONNEL SERVICES | Salaries, Temporary | True | 499772 | 420908 | 348015 | 367098 | 538702 | 367948 |
| 2 | PERSONNEL SERVICES | Salaries, Overtime | True | 5007346 | 5043233 | 4093771 | 3953950 | 4372335 | 4049950 |
| 3 | PERSONNEL SERVICES | Benefits | False | 1466088 | 1550479 | 1079615 | 25343062 | 26926178 | 21666643 |
| 4 | OPERATING EXPENSES | Utilities | True | 17654 | 31687 | 30413 | 19500 | 19500 | 19500 |
| 5 | OPERATING EXPENSES | Equipment and Supplies | True | 1105697 | 1575090 | 935471 | 985254 | 1489857 | 1328684 |
| 6 | OPERATING EXPENSES | Repairs and Maintenance | True | 1106671 | 939054 | 752048 | 964510 | 986248 | 964510 |
| 7 | OPERATING EXPENSES | Conferences and Training | True | 344329 | 337535 | 308983 | 334105 | 335654 | 225767 |
| 8 | OPERATING EXPENSES | Professional Services | True | 503872 | 458393 | 391996 | 335825 | 735552 | 335825 |
| 9 | OPERATING EXPENSES | Other Contract Services | True | 1727604 | 1790163 | 1569292 | 2279087 | 2355534 | 2189087 |
| 10 | OPERATING EXPENSES | Rental Expense | True | 11420 | 13111 | 7148 | 10884 | 10884 | 10884 |
| 11 | OPERATING EXPENSES | Payments to Other Governments | True | 962714 | 790602 | 592863 | 928540 | 928540 | 928540 |
| 12 | OPERATING EXPENSES | Expense Allowances | True | 331430 | 346883 | 330933 | 367000 | 367000 | 367000 |
| 13 | OPERATING EXPENSES | Other Expenses | True | 3736 | 10147 | 132 | 4973 | 4973 | 4973 |
| 14 | CAPITAL EXPENDITURES | Equipment | True | 24028 | 342171 | 88629 | 56895 | 156000 | 295922 |