├── README.md ├── .gitignore └── CityBudgetExtractor.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # city_budget_explorer 2 | A repo to explore city budgets 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /CityBudgetExtractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# City Budget Extractor" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 5, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import PyPDF2\n", 17 | "import pandas as pd\n", 18 | "from pprint import pprint" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "After struggling for 20 minutes or so, I realized that the page numbering in the PDF document doesn't match what PyPDF2 pulls, because the budget PDF includes some \"intro\" pages that the GUI PDF reader doesn't count but PyPDF2 does." 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 4, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# This PDF has some special formatting that offsets the page numbers\n", 35 | "page_offset = 11\n", 36 | 
"filename = \"FY-19-20-Adopted-Budget.pdf\"\n", 37 | "\n", 38 | "reader = PyPDF2.PdfFileReader(filename)\n", 39 | "page = reader.getPage(279 + page_offset)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Let's take a look at what the extracted text looks like. Because it's a PDF, I'm already expecting something terrible, and that's what it looks like we got. We get back one large string that sort of goes across the page row by row. We'll need to split this apart using some custom logic." 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "For the headers I'm just going to write them down manually. While I could write some clever Python, it's really not worth it because there are only 6 headers, they're the same from page to page, and I don't want the actual string from the text anyway." 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 6, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "HEADERS = [\"FY2015/16_Actual\", \"FY2016/17_Actual\", \"FY2017/18_Actual\", \"FY2018/19_Actual\", \"FY2018/19_Revised\", \"FY2019/20_Adopted\"]" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "With the headers done, let's extract each row. The pattern I see here is:\n", 77 | "\n", 78 | "1. A string that tells us the fund type, such as \"All Funds\" or \"General Fund\"\n", 79 | "2. An all-caps line that signifies the start of the block of expenses\n", 80 | "3. The line item of expenses\n", 81 | "4. 
6 rows that are the actual budget expenses for my city\n", 82 | "\n" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 16, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "class Parser:\n", 92 | "    \"\"\"Parses the PDF file to extract the city expenditures\n", 93 | "    Performs the work in three passes\n", 94 | "    \n", 95 | "    1. Identifying the block headers and closing lines\n", 96 | "    2. Identifying the row titles\n", 97 | "    3. Parsing the row values and determining which ones can be parsed into a valid row\n", 98 | "    \n", 99 | "    \"\"\"\n", 100 | "    \n", 101 | "    HEADERS = [\"FY2015/16_Actual\", \"FY2016/17_Actual\",\n", 102 | "               \"FY2017/18_Actual\", \"FY2018/19_Actual\",\n", 103 | "               \"FY2018/19_Revised\", \"FY2019/20_Adopted\"]\n", 104 | "    \n", 105 | "    def __init__(self, page=279, page_offset=11, filename=\"FY-19-20-Adopted-Budget.pdf\"):\n", 106 | "        \"\"\"Gets page from pdf and page text values\n", 107 | "        \n", 108 | "        Notes\n", 109 | "        -----\n", 110 | "        This PDF has some special formatting that offsets the page numbers\n", 111 | "\n", 112 | "        filename = \"FY-19-20-Adopted-Budget.pdf\"\n", 113 | "        \"\"\"\n", 114 | "        self.page = PyPDF2.PdfFileReader(filename).getPage(page + page_offset)\n", 115 | "        \n", 116 | "        # Split the text into discrete lines and clean up spacing\n", 117 | "        \n", 118 | "        # List comprehension over the newline-separated page text\n", 119 | "        self.text = tuple([line.strip() for line in self.page.extractText().split(\"\\n\")])\n", 120 | "        \n", 121 | "        \n", 122 | "        self.text = pd.Series(self.text)\n", 123 | "\n", 124 | "        # Parse the budget number strings into integers\n", 125 | "        self.text = self.text.apply(self.coerce_numbers)\n", 126 | "        \n", 127 | "        \n", 128 | "    def parse(self):\n", 129 | "        raise NotImplementedError\n", 130 | "    \n", 131 | "    @staticmethod\n", 132 | "    def is_header(line):\n", 133 | "        \n", 134 | "        if isinstance(line, int):\n", 135 | "            return False\n", 136 | "        else:\n", 137 | "            return line.isupper() and \"\".join(line.split(\" \")).isalpha()\n", 138 | "    \n", 139 | "
    @staticmethod\n", 140 | "    def coerce_numbers(line):\n", 141 | "        \"\"\"Tries parsing number strings into numbers, else returns the string\"\"\"\n", 142 | "        \n", 143 | "        # Check if number is negative, written with parentheses\n", 144 | "        try:\n", 145 | "            if line[0] == \"(\" and line[-1] == \")\":\n", 146 | "                neg_number = -int(\"\".join(line[1:-1].split(\",\")))\n", 147 | "                return neg_number\n", 148 | "        except (IndexError, ValueError):\n", 149 | "            pass\n", 150 | "\n", 151 | "        # Otherwise try plain logic\n", 152 | "        try:\n", 153 | "            return int(\"\".join(line.split(\",\")))\n", 154 | "        except ValueError:\n", 155 | "            return line\n", 156 | "    \n", 157 | "    def parse_block_headers(self):\n", 158 | "        \"\"\"Identify the headers from the page as well as ending line\"\"\"\n", 159 | "        current_header = {}\n", 160 | "        headers = []\n", 161 | "        \n", 162 | "        for i, line in self.text.items():\n", 163 | "            if self.is_header(line):\n", 164 | "                if line != current_header.get(\"header\"):\n", 165 | "                    current_header = {\"header\":line, \"start\":i}\n", 166 | "                else:\n", 167 | "                    assert line == current_header.get(\"header\")\n", 168 | "                    current_header[\"end\"] = i\n", 169 | "                    headers.append(current_header)\n", 170 | "                    current_header = {}\n", 171 | "        \n", 172 | "        self.headers = pd.DataFrame(headers)\n", 173 | "        return self.headers\n", 174 | "    \n", 175 | "    def parse_row_labels(self):\n", 176 | "        \"\"\"Identify the row labels from the page \n", 177 | "        \n", 178 | "        Notes\n", 179 | "        -----\n", 180 | "        Row labels must appear after first header and before last header\n", 181 | "        \n", 182 | "        \"\"\"\n", 183 | "        \n", 184 | "        rows = []\n", 185 | "        for i, (header, start, end) in self.headers.iterrows():\n", 186 | "            \n", 187 | "            row = {\"header\":header, \"values\":[]}\n", 188 | "            \n", 189 | "            # Get text block for this budget item block\n", 190 | "            text_block = self.text[start+1:end]\n", 191 | "            \n", 192 | "            \n", 193 | "            for line in text_block:\n", 194 | "                \n", 195 | "                # All rows should end with a % sign\n", 196 | "                # TODO: There's still an 
issue where the line is not always bookended\n", 197 | "                if \"%\" in str(line):\n", 198 | "                    if len(row[\"values\"]) == 6:\n", 199 | "                        row[\"complete\"] = True\n", 200 | "                    else:\n", 201 | "                        row[\"complete\"] = False\n", 202 | "\n", 203 | "                    # Explode numerical values and pair them with headers\n", 204 | "                    numbers = row.pop(\"values\")\n", 205 | "                    for key, val in zip(self.HEADERS, numbers):\n", 206 | "                        row[key] = val\n", 207 | "\n", 208 | "                    rows.append(row)\n", 209 | "                    row = {\"header\":header, \"values\":[]}\n", 210 | "                \n", 211 | "                # If line is a string in this block it's a row label\n", 212 | "                elif isinstance(line, str):\n", 213 | "                    row[\"line_item\"] = line\n", 214 | "                \n", 215 | "                # Otherwise it's a numeric budget value\n", 216 | "                else:\n", 217 | "                    assert isinstance(line, int)\n", 218 | "                    row[\"values\"].append(line)\n", 219 | "        \n", 220 | "        self.budget = pd.DataFrame(rows)\n", 221 | "        return self.budget\n", 222 | "\n", 223 | "    \n", 224 | "    def parse_numbers(self):\n", 225 | "        \"\"\"Identify indices of valid numbers from the page\"\"\"" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 17, 231 | "metadata": {}, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/html": [ 236 | "
" 449 | ], 450 | "text/plain": [ 451 | " header line_item complete \\\n", 452 | "0 PERSONNEL SERVICES Salaries, Permanent True \n", 453 | "1 PERSONNEL SERVICES Salaries, Temporary True \n", 454 | "2 PERSONNEL SERVICES Salaries, Overtime True \n", 455 | "3 PERSONNEL SERVICES Benefits False \n", 456 | "4 OPERATING EXPENSES Utilities True \n", 457 | "5 OPERATING EXPENSES Equipment and Supplies True \n", 458 | "6 OPERATING EXPENSES Repairs and Maintenance True \n", 459 | "7 OPERATING EXPENSES Conferences and Training True \n", 460 | "8 OPERATING EXPENSES Professional Services True \n", 461 | "9 OPERATING EXPENSES Other Contract Services True \n", 462 | "10 OPERATING EXPENSES Rental Expense True \n", 463 | "11 OPERATING EXPENSES Payments to Other Governments True \n", 464 | "12 OPERATING EXPENSES Expense Allowances True \n", 465 | "13 OPERATING EXPENSES Other Expenses True \n", 466 | "14 CAPITAL EXPENDITURES Equipment True \n", 467 | "\n", 468 | " FY2015/16_Actual FY2016/17_Actual FY2017/18_Actual FY2018/19_Actual \\\n", 469 | "0 34323749 34654406 25765375 37010295 \n", 470 | "1 499772 420908 348015 367098 \n", 471 | "2 5007346 5043233 4093771 3953950 \n", 472 | "3 1466088 1550479 1079615 25343062 \n", 473 | "4 17654 31687 30413 19500 \n", 474 | "5 1105697 1575090 935471 985254 \n", 475 | "6 1106671 939054 752048 964510 \n", 476 | "7 344329 337535 308983 334105 \n", 477 | "8 503872 458393 391996 335825 \n", 478 | "9 1727604 1790163 1569292 2279087 \n", 479 | "10 11420 13111 7148 10884 \n", 480 | "11 962714 790602 592863 928540 \n", 481 | "12 331430 346883 330933 367000 \n", 482 | "13 3736 10147 132 4973 \n", 483 | "14 24028 342171 88629 56895 \n", 484 | "\n", 485 | " FY2018/19_Revised FY2019/20_Adopted \n", 486 | "0 37033765 36135762 \n", 487 | "1 538702 367948 \n", 488 | "2 4372335 4049950 \n", 489 | "3 26926178 21666643 \n", 490 | "4 19500 19500 \n", 491 | "5 1489857 1328684 \n", 492 | "6 986248 964510 \n", 493 | "7 335654 225767 \n", 494 | "8 735552 335825 \n", 495 | 
"9 2355534 2189087 \n", 496 | "10 10884 10884 \n", 497 | "11 928540 928540 \n", 498 | "12 367000 367000 \n", 499 | "13 4973 4973 \n", 500 | "14 156000 295922 " 501 | ] 502 | }, 503 | "execution_count": 17, 504 | "metadata": {}, 505 | "output_type": "execute_result" 506 | } 507 | ], 508 | "source": [ 509 | "p = Parser()\n", 510 | "df = p.parse_block_headers()\n", 511 | "p.parse_row_labels()" 512 | ] 513 | } 514 | ], 515 | "metadata": { 516 | "kernelspec": { 517 | "display_name": "Python 3", 518 | "language": "python", 519 | "name": "python3" 520 | }, 521 | "language_info": { 522 | "codemirror_mode": { 523 | "name": "ipython", 524 | "version": 3 525 | }, 526 | "file_extension": ".py", 527 | "mimetype": "text/x-python", 528 | "name": "python", 529 | "nbconvert_exporter": "python", 530 | "pygments_lexer": "ipython3", 531 | "version": "3.7.6" 532 | } 533 | }, 534 | "nbformat": 4, 535 | "nbformat_minor": 4 536 | } 537 | --------------------------------------------------------------------------------