├── .coveragerc ├── .gitignore ├── .gitlab-ci.yml ├── .pylintrc ├── CONTRIBUTORS.txt ├── Dockerfile ├── LICENSE.txt ├── Makefile ├── Pipfile ├── Pipfile.lock ├── README.md ├── data_prep ├── README.md ├── bert_data_prep │ ├── README.md │ ├── evaluate_test_set_predictions.py │ ├── finetune_on_tpu.sh │ ├── gen_bert_data.py │ └── prep_all_bert.sh ├── ft_data_prep │ ├── README.md │ ├── pdf_to_xml.sh │ ├── pdf_to_xml_notext.sh │ ├── pdfalto │ ├── prep_all_fasttext.sh │ ├── prep_all_url_fasttext.sh │ ├── prep_fasttext.sh │ ├── preprocess_tsv.py │ └── train.sh ├── image_data_prep │ ├── README.md │ └── pdf_image.sh ├── pdf_one_text.sh └── pdf_text.sh ├── docker-compose.yml ├── example.env ├── example_calls.py ├── fetch_models.sh ├── finetune ├── 0_urls.tsv ├── 1_urls.tsv ├── 2_urls.tsv └── README.md ├── pdf_trio ├── __init__.py ├── api_routes.py ├── pdf_classifier.py ├── pdf_util.py ├── text_prep.py └── url_classifier.py ├── pytest.ini ├── requirements-dev.txt ├── requirements.txt ├── rest_demo.sh ├── start_api_service.sh ├── start_bert_serving.sh ├── start_image_classifier_serving.sh ├── tests ├── files │ ├── other │ │ └── ia_frontpage.pdf │ └── research │ │ ├── fea48178ffac3a42035ed27d6e2b897cb570cf13.pdf │ │ ├── hidden-technical-debt-in-machine-learning-systems__2015.pdf │ │ └── submission_363.pdf ├── fixtures.py ├── test_flask_api.py ├── test_live_backend.py ├── test_pdf_classifier.py ├── test_pdf_util.py └── test_url_classifier.py ├── tf-serving_rest_demo.sh ├── tfserving_models_docker.config └── uwsgi.ini /.coveragerc: -------------------------------------------------------------------------------- 1 | [run] 2 | source = 3 | pdf_trio 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | model_snapshots/ 2 | htmlcov 3 | .idea/ 4 | __pycache__/ 5 | .* 6 | *.pyc 7 | #*# 8 | *~ 9 | *.swp 10 | [Tt]humbs.db 11 | *.DS_Store 12 | *.log 13 | # intellij PyCharm local files 14 | misc.xml 15 | modules.xml 16 | pdf_trio.iml 17 | vcs.xml 18 | workspace.xml 19 | 20 | !.gitlab-ci.yml 21 | !.travis.yml 22 | !.coveragerc 23 | !.pylintrc 24 | 25 | # Don't ignore this file itself 26 | !.gitignore 27 | -------------------------------------------------------------------------------- /.gitlab-ci.yml: -------------------------------------------------------------------------------- 1 | 2 | variables: 3 | PIPENV_VENV_IN_PROJECT: "true" 4 | 5 | image: "python:3.7-buster" 6 | 7 | basic_tests: 8 | before_script: 9 | - apt update -qy 10 | - apt install -y poppler-utils imagemagick libmagickcore-6.q16-6-extra ghostscript netpbm gsfonts wget 11 | - pip3 install pipenv 12 | - pipenv --version 13 | - ./fetch_models.sh 14 | script: 15 | - cp example.env .env 16 | - pipenv install --dev --deploy 17 | - pipenv run pytest --cov 18 | # Just errors 19 | - pipenv run pylint -E pdf_trio tests/*.py 20 | -------------------------------------------------------------------------------- /.pylintrc: -------------------------------------------------------------------------------- 1 | [MESSAGES CONTROL] 2 | # TODO: should re-enable some of these 3 | disable=C0323,W0142,C0301,C0103,C0111,E0213,C0302,C0203,W0703,R0201,W0223,bad-continuation,arguments-differ,unidiomatic-typecheck,unused-wildcard-import,no-member,cyclic-import,too-few-public-methods,wildcard-import,too-many-locals,too-many-ancestors,unused-import 4 | 5 | [REPORTS] 6 | output-format=colorized 7 | include-ids=yes 8 | 9 | [MISCELLANEOUS] 10 | # List of 
note tags to take in consideration, separated by a comma. 11 | notes=FIXME,XXX,DELETEME 12 | 13 | [TYPECHECK] 14 | ignored-modules=responses 15 | -------------------------------------------------------------------------------- /CONTRIBUTORS.txt: -------------------------------------------------------------------------------- 1 | 2 | Paul Baclace 3 | - original author 4 | 5 | Bryan Newbold 6 | - changes for use in production at Internet Archive 7 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | from python:3.7-buster 2 | 3 | RUN apt update 4 | RUN apt install -y poppler-utils imagemagick libmagickcore-6.q16-6-extra ghostscript netpbm gsfonts 5 | 6 | COPY . /app 7 | WORKDIR /app 8 | 9 | RUN pip install pipenv 10 | 11 | RUN pipenv install --system --deploy 12 | 13 | CMD ["flask", "run", "--host", "0.0.0.0"] 14 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | 2 | SHELL = /bin/bash 3 | .SHELLFLAGS = -o pipefail -c 4 | 5 | .PHONY: help 6 | help: ## Print info about all commands 7 | @echo "Commands:" 8 | @echo 9 | @grep -E '^[a-zA-Z0-9_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf " \033[01;32m%-20s\033[0m %s\n", $$1, $$2}' 10 | 11 | .PHONY: deps 12 | deps: ## Install dependencies (eg, create virtualenv using pipenv) 13 | pipenv install --dev --deploy 14 | 15 | .PHONY: test 16 | test: ## Run all tests and lints 17 | pipenv run python -m pytest --ignore extra/ 18 | pipenv run pylint -j 0 -E pdf_trio tests/*.py 19 | #pipenv run flake8 tests/ pdf_trio/ --count --select=E9,F63,F7,F82 --show-source --statistics 20 | 21 | .PHONY: coverage 22 | coverage: ## Run tests, collecting code coverage 23 | pipenv run pytest --cov --cov-report html --ignore extra/ 24 | 25 | .PHONY: serve 26 | serve: ## Run web service locally 27 | pipenv run flask run -h localhost 28 | 29 | .PHONY: fetch-models 30 | fetch-models: ## Download model files from archive.org 31 | ./fetch_models.sh 32 | -------------------------------------------------------------------------------- /Pipfile: -------------------------------------------------------------------------------- 1 | # This file is *not* used as part of bundling or distributing the python client 2 | # library (fatcat-openapi-client). It *is* shared by the web interface (flask app), 3 | # workers, and import scripts. 4 | 5 | [[source]] 6 | url = "https://pypi.python.org/simple" 7 | verify_ssl = true 8 | name = "pypi" 9 | 10 | [dev-packages] 11 | pytest = "*" 12 | pytest-pylint = "*" 13 | pytest-cov = "*" 14 | pylint = "*" 15 | pytest-mock = "*" 16 | responses = "*" 17 | 18 | [packages] 19 | # API/HTTP 20 | Flask = ">=1.1" 21 | raven = {extras = ['flask'],version = "*"} 22 | requests = ">=2" 23 | Werkzeug = "<1.0.0" 24 | python-dotenv = "*" 25 | 26 | # ML/backend things 27 | bert-serving-client = ">=1.9" 28 | fasttext = ">=0.9" 29 | numpy = ">=1" 30 | opencv-python-headless = ">=4" 31 | 32 | [requires] 33 | python_version = "3.7" 34 | -------------------------------------------------------------------------------- /Pipfile.lock: -------------------------------------------------------------------------------- 1 | { 2 | "_meta": { 3 | "hash": { 4 | "sha256": "9a2886266643a68e77719b292be843f5b7eac871b8854762830cc73857fd361f" 5 | }, 6 | "pipfile-spec": 6, 7 | "requires": { 8 | "python_version": "3.7" 9 | }, 10 | "sources": [ 11 | { 12 | "name": "pypi", 13 | "url": "https://pypi.python.org/simple", 14 | "verify_ssl": true 15 | } 16 | ] 17 | }, 18 | "default": { 19 | "bert-serving-client": { 20 | "hashes": [ 21 | "sha256:09ffea7f633dca83aab05ab93b51356623e0de84bc5fed070fd1d3df949b63da", 22 | "sha256:cd37bbe06e3a89812bff70cf0da6a41c8850e780272026653b2cbde1c2b863ad" 23 | ], 24 | "index": "pypi", 25 | "version": "==1.10.0" 26 | }, 27 | "blinker": { 28 | "hashes": [ 29 | "sha256:471aee25f3992bd325afa3772f1063dbdbbca947a041b8b89466dc00d606f8b6" 30 | ], 31 | "version": "==1.4" 32 | }, 33 | "certifi": { 34 | "hashes": [ 35 | "sha256:017c25db2a153ce562900032d5bc68e9f191e44e9a0f762f373977de9df1fbb3", 36 | "sha256:25b64c7da4cd7479594d035c08c2d809eb4aab3a26e5a990ea98cc450c320f1f" 37 | ], 38 | "version": "==2019.11.28" 39 | }, 40 | "chardet": { 41 | "hashes": [ 42 | "sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae", 43 
| "sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691" 44 | ], 45 | "version": "==3.0.4" 46 | }, 47 | "click": { 48 | "hashes": [ 49 | "sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13", 50 | "sha256:5b94b49521f6456670fdb30cd82a4eca9412788a93fa6dd6df72c94d5a8ff2d7" 51 | ], 52 | "version": "==7.0" 53 | }, 54 | "fasttext": { 55 | "hashes": [ 56 | "sha256:6ead9c6aafe985472066e27c43e33f581b192befd136a84c3c2e8197e7e05be6", 57 | "sha256:f79b3447b1612b9b1ad791d4be97db3fecb97762dba01e6fbec389d0d0874772" 58 | ], 59 | "index": "pypi", 60 | "version": "==0.9.1" 61 | }, 62 | "flask": { 63 | "hashes": [ 64 | "sha256:13f9f196f330c7c2c5d7a5cf91af894110ca0215ac051b5844701f2bfd934d52", 65 | "sha256:45eb5a6fd193d6cf7e0cf5d8a5b31f83d5faae0293695626f539a823e93b13f6" 66 | ], 67 | "index": "pypi", 68 | "version": "==1.1.1" 69 | }, 70 | "idna": { 71 | "hashes": [ 72 | "sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407", 73 | "sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c" 74 | ], 75 | "version": "==2.8" 76 | }, 77 | "itsdangerous": { 78 | "hashes": [ 79 | "sha256:321b033d07f2a4136d3ec762eac9f16a10ccd60f53c0c91af90217ace7ba1f19", 80 | "sha256:b12271b2047cb23eeb98c8b5622e2e5c5e9abd9784a153e9d8ef9cb4dd09d749" 81 | ], 82 | "version": "==1.1.0" 83 | }, 84 | "jinja2": { 85 | "hashes": [ 86 | "sha256:93187ffbc7808079673ef52771baa950426fd664d3aad1d0fa3e95644360e250", 87 | "sha256:b0eaf100007721b5c16c1fc1eecb87409464edc10469ddc9a22a27a99123be49" 88 | ], 89 | "version": "==2.11.1" 90 | }, 91 | "markupsafe": { 92 | "hashes": [ 93 | "sha256:00bc623926325b26bb9605ae9eae8a215691f33cae5df11ca5424f06f2d1f473", 94 | "sha256:09027a7803a62ca78792ad89403b1b7a73a01c8cb65909cd876f7fcebd79b161", 95 | "sha256:09c4b7f37d6c648cb13f9230d847adf22f8171b1ccc4d5682398e77f40309235", 96 | "sha256:1027c282dad077d0bae18be6794e6b6b8c91d58ed8a8d89a89d59693b9131db5", 97 | "sha256:13d3144e1e340870b25e7b10b98d779608c02016d5184cfb9927a9f10c689f42", 98 | "sha256:24982cc2533820871eba85ba648cd53d8623687ff11cbb805be4ff7b4c971aff", 99 | "sha256:29872e92839765e546828bb7754a68c418d927cd064fd4708fab9fe9c8bb116b", 100 | "sha256:43a55c2930bbc139570ac2452adf3d70cdbb3cfe5912c71cdce1c2c6bbd9c5d1", 101 | "sha256:46c99d2de99945ec5cb54f23c8cd5689f6d7177305ebff350a58ce5f8de1669e", 102 | "sha256:500d4957e52ddc3351cabf489e79c91c17f6e0899158447047588650b5e69183", 103 | "sha256:535f6fc4d397c1563d08b88e485c3496cf5784e927af890fb3c3aac7f933ec66", 104 | "sha256:596510de112c685489095da617b5bcbbac7dd6384aeebeda4df6025d0256a81b", 105 | "sha256:62fe6c95e3ec8a7fad637b7f3d372c15ec1caa01ab47926cfdf7a75b40e0eac1", 106 | "sha256:6788b695d50a51edb699cb55e35487e430fa21f1ed838122d722e0ff0ac5ba15", 107 | "sha256:6dd73240d2af64df90aa7c4e7481e23825ea70af4b4922f8ede5b9e35f78a3b1", 108 | "sha256:717ba8fe3ae9cc0006d7c451f0bb265ee07739daf76355d06366154ee68d221e", 109 | "sha256:79855e1c5b8da654cf486b830bd42c06e8780cea587384cf6545b7d9ac013a0b", 110 | "sha256:7c1699dfe0cf8ff607dbdcc1e9b9af1755371f92a68f706051cc8c37d447c905", 111 | "sha256:88e5fcfb52ee7b911e8bb6d6aa2fd21fbecc674eadd44118a9cc3863f938e735", 112 | "sha256:8defac2f2ccd6805ebf65f5eeb132adcf2ab57aa11fdf4c0dd5169a004710e7d", 113 | "sha256:98c7086708b163d425c67c7a91bad6e466bb99d797aa64f965e9d25c12111a5e", 114 | "sha256:9add70b36c5666a2ed02b43b335fe19002ee5235efd4b8a89bfcf9005bebac0d", 115 | "sha256:9bf40443012702a1d2070043cb6291650a0841ece432556f784f004937f0f32c", 116 | "sha256:ade5e387d2ad0d7ebf59146cc00c8044acbd863725f887353a10df825fc8ae21", 117 | 
"sha256:b00c1de48212e4cc9603895652c5c410df699856a2853135b3967591e4beebc2", 118 | "sha256:b1282f8c00509d99fef04d8ba936b156d419be841854fe901d8ae224c59f0be5", 119 | "sha256:b2051432115498d3562c084a49bba65d97cf251f5a331c64a12ee7e04dacc51b", 120 | "sha256:ba59edeaa2fc6114428f1637ffff42da1e311e29382d81b339c1817d37ec93c6", 121 | "sha256:c8716a48d94b06bb3b2524c2b77e055fb313aeb4ea620c8dd03a105574ba704f", 122 | "sha256:cd5df75523866410809ca100dc9681e301e3c27567cf498077e8551b6d20e42f", 123 | "sha256:cdb132fc825c38e1aeec2c8aa9338310d29d337bebbd7baa06889d09a60a1fa2", 124 | "sha256:e249096428b3ae81b08327a63a485ad0878de3fb939049038579ac0ef61e17e7", 125 | "sha256:e8313f01ba26fbbe36c7be1966a7b7424942f670f38e666995b88d012765b9be" 126 | ], 127 | "version": "==1.1.1" 128 | }, 129 | "numpy": { 130 | "hashes": [ 131 | "sha256:1786a08236f2c92ae0e70423c45e1e62788ed33028f94ca99c4df03f5be6b3c6", 132 | "sha256:17aa7a81fe7599a10f2b7d95856dc5cf84a4eefa45bc96123cbbc3ebc568994e", 133 | "sha256:20b26aaa5b3da029942cdcce719b363dbe58696ad182aff0e5dcb1687ec946dc", 134 | "sha256:2d75908ab3ced4223ccba595b48e538afa5ecc37405923d1fea6906d7c3a50bc", 135 | "sha256:39d2c685af15d3ce682c99ce5925cc66efc824652e10990d2462dfe9b8918c6a", 136 | "sha256:56bc8ded6fcd9adea90f65377438f9fea8c05fcf7c5ba766bef258d0da1554aa", 137 | "sha256:590355aeade1a2eaba17617c19edccb7db8d78760175256e3cf94590a1a964f3", 138 | "sha256:70a840a26f4e61defa7bdf811d7498a284ced303dfbc35acb7be12a39b2aa121", 139 | "sha256:77c3bfe65d8560487052ad55c6998a04b654c2fbc36d546aef2b2e511e760971", 140 | "sha256:9537eecf179f566fd1c160a2e912ca0b8e02d773af0a7a1120ad4f7507cd0d26", 141 | "sha256:9acdf933c1fd263c513a2df3dceecea6f3ff4419d80bf238510976bf9bcb26cd", 142 | "sha256:ae0975f42ab1f28364dcda3dde3cf6c1ddab3e1d4b2909da0cb0191fa9ca0480", 143 | "sha256:b3af02ecc999c8003e538e60c89a2b37646b39b688d4e44d7373e11c2debabec", 144 | "sha256:b6ff59cee96b454516e47e7721098e6ceebef435e3e21ac2d6c3b8b02628eb77", 145 | "sha256:b765ed3930b92812aa698a455847141869ef755a87e099fddd4ccf9d81fffb57", 146 | "sha256:c98c5ffd7d41611407a1103ae11c8b634ad6a43606eca3e2a5a269e5d6e8eb07", 147 | "sha256:cf7eb6b1025d3e169989416b1adcd676624c2dbed9e3bcb7137f51bfc8cc2572", 148 | "sha256:d92350c22b150c1cae7ebb0ee8b5670cc84848f6359cf6b5d8f86617098a9b73", 149 | "sha256:e422c3152921cece8b6a2fb6b0b4d73b6579bd20ae075e7d15143e711f3ca2ca", 150 | "sha256:e840f552a509e3380b0f0ec977e8124d0dc34dc0e68289ca28f4d7c1d0d79474", 151 | "sha256:f3d0a94ad151870978fb93538e95411c83899c9dc63e6fb65542f769568ecfa5" 152 | ], 153 | "index": "pypi", 154 | "version": "==1.18.1" 155 | }, 156 | "opencv-python-headless": { 157 | "hashes": [ 158 | "sha256:105fe1dc648047c23b1a88c168a42e2760d3bb2c8294b36dd374f1a904d6e8a3", 159 | "sha256:107331e70a352464e9e68ba917cec636b857c8c1332a6444d8df39f1e06c3c4f", 160 | "sha256:134b7184edb06693d2d8987ff6d0216b76ac0ed066ae167cd1897e1b4377fdc2", 161 | "sha256:1493eba327ea9943f93b53393afd335ee23ea898d607873adc6b1ecd96db17d4", 162 | "sha256:20cfc4bdf5aabb2a5060101d057e3550f64df7cf3e7d26551fa2f54fbdcdd060", 163 | "sha256:29b560628b1a166cac88dc277e938b6791f8650c7e405b95fd4de8b9abff31e6", 164 | "sha256:2c26456369e127aa507ed5af8cb0f08e2de4c2a60b231d1bea03f0c36094ec56", 165 | "sha256:323a49938fdc878c85371c546127e973757b03fcf2acd62a103709085752315d", 166 | "sha256:37f9b704338ab99ed1cd4179da12d6b39e7ca60aa197dd330f71ed1454ba0ea3", 167 | "sha256:3c8a26dcaaaf15f36c170639bd539ffebc957f95995705537e69bfd92765c912", 168 | "sha256:59a259edddf765cd0af91aaa9e9d7fe7648570d7923539b3a85c703026615236", 169 | 
"sha256:60afa5e201324d0313ba4425ce812789e9889abfc35d85c6b09426093fbc5a09", 170 | "sha256:76cffdc6cb530fe244c89e9bd7b23a8d2761bd54a3eb44713df9fc3dfee8f544", 171 | "sha256:7f1c0d3f8bb560c23446fee81421435904610133d9cfe8c8e245392a80139812", 172 | "sha256:84d247193d977bc059ac1524113c44f197c51122e0a4efba3b85380c51533040", 173 | "sha256:8f817b4ec0399e3ee82db8707e6b6a31cbdfe84bbbe669f946c75aed7c678acc", 174 | "sha256:95b06780225eed3a00a59ddce05c8c7d16aa3d8d0b0d8484a3a6089fc84af5a8", 175 | "sha256:9e80919dd00805ba4bffe175107391da1024fe303c82c271704a77b793cd654e", 176 | "sha256:b44fd3d66a3d1adaf9f49cc99a05f012df98ef79e42e5971bb960b2b81e2742e", 177 | "sha256:b4549c71277e2cc427cc32fbec6cb2dd92f0f839467cdb22381b4e5b0e53f35c", 178 | "sha256:b64449744aed7c3072235bdaf06289b36e0e65bcde7834c951a912f79a44e564", 179 | "sha256:c2b6c26b3e44e1890c5517d58e8b953e827ee10048bcdfd5dba94b0b74105fa1", 180 | "sha256:c8f6fe6e81b0c0fd058dcf24b12d3632152de334afa290fd63f30afe30c8713a", 181 | "sha256:d5cbab84f9ae346cbd1d16b5906e34e5bdd17e4a8d9e213599a24a510a82de94", 182 | "sha256:d86e654b6dd99d5e2f9e3b505a2252161a0b44e7e60a62cbe364f9d8de32ab36", 183 | "sha256:de4fc34bc4f3a93f079005cdd51db4f59fbecf72463514a6a73fdfd29bc8e777", 184 | "sha256:f3e418b9480f2e167ea5a800a0375e424f15d345a0a15c638f66b6f3bb6a7796" 185 | ], 186 | "index": "pypi", 187 | "version": "==4.2.0.32" 188 | }, 189 | "pybind11": { 190 | "hashes": [ 191 | "sha256:06398d054acd33d3b89d4b12000fadc36e946001438425a96c9e30048655ab96", 192 | "sha256:72e6def53fb491f7f4e92692029d2e7bb5a0783314f20d80222735ff10a75758" 193 | ], 194 | "version": "==2.4.3" 195 | }, 196 | "python-dotenv": { 197 | "hashes": [ 198 | "sha256:8429f459fc041237d98c9ff32e1938e7e5535b5ff24388876315a098027c3a57", 199 | "sha256:ca9f3debf2262170d6f46571ce4d6ca1add60bb93b69c3a29dcb3d1a00a65c93" 200 | ], 201 | "index": "pypi", 202 | "version": "==0.11.0" 203 | }, 204 | "pyzmq": { 205 | "hashes": [ 206 | "sha256:01b588911714a6696283de3904f564c550c9e12e8b4995e173f1011755e01086", 207 | "sha256:0573b9790aa26faff33fba40f25763657271d26f64bffb55a957a3d4165d6098", 208 | "sha256:0fa82b9fc3334478be95a5566f35f23109f763d1669bb762e3871a8fa2a4a037", 209 | "sha256:1e59b7b19396f26e360f41411a5d4603356d18871049cd7790f1a7d18f65fb2c", 210 | "sha256:2a294b4f44201bb21acc2c1a17ff87fbe57b82060b10ddb00ac03e57f3d7fcfa", 211 | "sha256:355b38d7dd6f884b8ee9771f59036bcd178d98539680c4f87e7ceb2c6fd057b6", 212 | "sha256:4b73d20aec63933bbda7957e30add233289d86d92a0bb9feb3f4746376f33527", 213 | "sha256:4ec47f2b50bdb97df58f1697470e5c58c3c5109289a623e30baf293481ff0166", 214 | "sha256:5541dc8cad3a8486d58bbed076cb113b65b5dd6b91eb94fb3e38a3d1d3022f20", 215 | "sha256:6fca7d11310430e751f9832257866a122edf9d7b635305c5d8c51f74a5174d3d", 216 | "sha256:7369656f89878455a5bcd5d56ca961884f5d096268f71c0750fc33d6732a25e5", 217 | "sha256:75d73ee7ca4b289a2a2dfe0e6bd8f854979fc13b3fe4ebc19381be3b04e37a4a", 218 | "sha256:80c928d5adcfa12346b08d31360988d843b54b94154575cccd628f1fe91446bc", 219 | "sha256:83ce18b133dc7e6789f64cb994e7376c5aa6b4aeced993048bf1d7f9a0fe6d3a", 220 | "sha256:8b8498ceee33a7023deb2f3db907ca41d6940321e282297327a9be41e3983792", 221 | "sha256:8c69a6cbfa94da29a34f6b16193e7c15f5d3220cb772d6d17425ff3faa063a6d", 222 | "sha256:8ff946b20d13a99dc5c21cb76f4b8b253eeddf3eceab4218df8825b0c65ab23d", 223 | "sha256:972d723a36ab6a60b7806faa5c18aa3c080b7d046c407e816a1d8673989e2485", 224 | "sha256:a6c9c42bbdba3f9c73aedbb7671815af1943ae8073e532c2b66efb72f39f4165", 225 | "sha256:aa3872f2ebfc5f9692ef8957fe69abe92d905a029c0608e45ebfcd451ad30ab5", 226 | 
"sha256:cf08435b14684f7f2ca2df32c9df38a79cdc17c20dc461927789216cb43d8363", 227 | "sha256:d30db4566177a6205ed1badb8dbbac3c043e91b12a2db5ef9171b318c5641b75", 228 | "sha256:d5ac84f38575a601ab20c1878818ffe0d09eb51d6cb8511b636da46d0fd8949a", 229 | "sha256:e37f22eb4bfbf69cd462c7000616e03b0cdc1b65f2d99334acad36ea0e4ddf6b", 230 | "sha256:e6549dd80de7b23b637f586217a4280facd14ac01e9410a037a13854a6977299", 231 | "sha256:ed6205ca0de035f252baa0fd26fdd2bc8a8f633f92f89ca866fd423ff26c6f25", 232 | "sha256:efdde21febb9b5d7a8e0b87ea2549d7e00fda1936459cfb27fb6fca0c36af6c1", 233 | "sha256:f4e72646bfe79ff3adbf1314906bbd2d67ef9ccc71a3a98b8b2ccbcca0ab7bec" 234 | ], 235 | "version": "==18.1.1" 236 | }, 237 | "raven": { 238 | "hashes": [ 239 | "sha256:3fa6de6efa2493a7c827472e984ce9b020797d0da16f1db67197bcc23c8fae54", 240 | "sha256:44a13f87670836e153951af9a3c80405d36b43097db869a36e92809673692ce4" 241 | ], 242 | "index": "pypi", 243 | "version": "==6.10.0" 244 | }, 245 | "requests": { 246 | "hashes": [ 247 | "sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4", 248 | "sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31" 249 | ], 250 | "index": "pypi", 251 | "version": "==2.22.0" 252 | }, 253 | "urllib3": { 254 | "hashes": [ 255 | "sha256:2f3db8b19923a873b3e5256dc9c2dedfa883e33d87c690d9c7913e1f40673cdc", 256 | "sha256:87716c2d2a7121198ebcb7ce7cccf6ce5e9ba539041cfbaeecfb641dc0bf6acc" 257 | ], 258 | "version": "==1.25.8" 259 | }, 260 | "werkzeug": { 261 | "hashes": [ 262 | "sha256:1e0dedc2acb1f46827daa2e399c1485c8fa17c0d8e70b6b875b4e7f54bf408d2", 263 | "sha256:b353856d37dec59d6511359f97f6a4b2468442e454bd1c98298ddce53cac1f04" 264 | ], 265 | "index": "pypi", 266 | "version": "==0.16.1" 267 | } 268 | }, 269 | "develop": { 270 | "astroid": { 271 | "hashes": [ 272 | "sha256:71ea07f44df9568a75d0f354c49143a4575d90645e9fead6dfb52c26a85ed13a", 273 | "sha256:840947ebfa8b58f318d42301cf8c0a20fd794a33b61cc4638e28e9e61ba32f42" 274 | ], 275 | "version": "==2.3.3" 276 | }, 277 | "attrs": { 278 | "hashes": [ 279 | "sha256:08a96c641c3a74e44eb59afb61a24f2cb9f4d7188748e76ba4bb5edfa3cb7d1c", 280 | "sha256:f7b7ce16570fe9965acd6d30101a28f62fb4a7f9e926b3bbc9b61f8b04247e72" 281 | ], 282 | "version": "==19.3.0" 283 | }, 284 | "certifi": { 285 | "hashes": [ 286 | "sha256:017c25db2a153ce562900032d5bc68e9f191e44e9a0f762f373977de9df1fbb3", 287 | "sha256:25b64c7da4cd7479594d035c08c2d809eb4aab3a26e5a990ea98cc450c320f1f" 288 | ], 289 | "version": "==2019.11.28" 290 | }, 291 | "chardet": { 292 | "hashes": [ 293 | "sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae", 294 | "sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691" 295 | ], 296 | "version": "==3.0.4" 297 | }, 298 | "coverage": { 299 | "hashes": [ 300 | "sha256:15cf13a6896048d6d947bf7d222f36e4809ab926894beb748fc9caa14605d9c3", 301 | "sha256:1daa3eceed220f9fdb80d5ff950dd95112cd27f70d004c7918ca6dfc6c47054c", 302 | "sha256:1e44a022500d944d42f94df76727ba3fc0a5c0b672c358b61067abb88caee7a0", 303 | "sha256:25dbf1110d70bab68a74b4b9d74f30e99b177cde3388e07cc7272f2168bd1477", 304 | "sha256:3230d1003eec018ad4a472d254991e34241e0bbd513e97a29727c7c2f637bd2a", 305 | "sha256:3dbb72eaeea5763676a1a1efd9b427a048c97c39ed92e13336e726117d0b72bf", 306 | "sha256:5012d3b8d5a500834783689a5d2292fe06ec75dc86ee1ccdad04b6f5bf231691", 307 | "sha256:51bc7710b13a2ae0c726f69756cf7ffd4362f4ac36546e243136187cfcc8aa73", 308 | "sha256:527b4f316e6bf7755082a783726da20671a0cc388b786a64417780b90565b987", 309 | 
"sha256:722e4557c8039aad9592c6a4213db75da08c2cd9945320220634f637251c3894", 310 | "sha256:76e2057e8ffba5472fd28a3a010431fd9e928885ff480cb278877c6e9943cc2e", 311 | "sha256:77afca04240c40450c331fa796b3eab6f1e15c5ecf8bf2b8bee9706cd5452fef", 312 | "sha256:7afad9835e7a651d3551eab18cbc0fdb888f0a6136169fbef0662d9cdc9987cf", 313 | "sha256:9bea19ac2f08672636350f203db89382121c9c2ade85d945953ef3c8cf9d2a68", 314 | "sha256:a8b8ac7876bc3598e43e2603f772d2353d9931709345ad6c1149009fd1bc81b8", 315 | "sha256:b0840b45187699affd4c6588286d429cd79a99d509fe3de0f209594669bb0954", 316 | "sha256:b26aaf69713e5674efbde4d728fb7124e429c9466aeaf5f4a7e9e699b12c9fe2", 317 | "sha256:b63dd43f455ba878e5e9f80ba4f748c0a2156dde6e0e6e690310e24d6e8caf40", 318 | "sha256:be18f4ae5a9e46edae3f329de2191747966a34a3d93046dbdf897319923923bc", 319 | "sha256:c312e57847db2526bc92b9bfa78266bfbaabac3fdcd751df4d062cd4c23e46dc", 320 | "sha256:c60097190fe9dc2b329a0eb03393e2e0829156a589bd732e70794c0dd804258e", 321 | "sha256:c62a2143e1313944bf4a5ab34fd3b4be15367a02e9478b0ce800cb510e3bbb9d", 322 | "sha256:cc1109f54a14d940b8512ee9f1c3975c181bbb200306c6d8b87d93376538782f", 323 | "sha256:cd60f507c125ac0ad83f05803063bed27e50fa903b9c2cfee3f8a6867ca600fc", 324 | "sha256:d513cc3db248e566e07a0da99c230aca3556d9b09ed02f420664e2da97eac301", 325 | "sha256:d649dc0bcace6fcdb446ae02b98798a856593b19b637c1b9af8edadf2b150bea", 326 | "sha256:d7008a6796095a79544f4da1ee49418901961c97ca9e9d44904205ff7d6aa8cb", 327 | "sha256:da93027835164b8223e8e5af2cf902a4c80ed93cb0909417234f4a9df3bcd9af", 328 | "sha256:e69215621707119c6baf99bda014a45b999d37602cb7043d943c76a59b05bf52", 329 | "sha256:ea9525e0fef2de9208250d6c5aeeee0138921057cd67fcef90fbed49c4d62d37", 330 | "sha256:fca1669d464f0c9831fd10be2eef6b86f5ebd76c724d1e0706ebdff86bb4adf0" 331 | ], 332 | "version": "==5.0.3" 333 | }, 334 | "idna": { 335 | "hashes": [ 336 | "sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407", 337 | "sha256:ea8b7f6188e6fa117537c3df7da9fc686d485087abf6ac197f9c46432f7e4a3c" 338 | ], 339 | "version": "==2.8" 340 | }, 341 | "importlib-metadata": { 342 | "hashes": [ 343 | "sha256:06f5b3a99029c7134207dd882428a66992a9de2bef7c2b699b5641f9886c3302", 344 | "sha256:b97607a1a18a5100839aec1dc26a1ea17ee0d93b20b0f008d80a5a050afb200b" 345 | ], 346 | "markers": "python_version < '3.8'", 347 | "version": "==1.5.0" 348 | }, 349 | "isort": { 350 | "hashes": [ 351 | "sha256:54da7e92468955c4fceacd0c86bd0ec997b0e1ee80d97f67c35a78b719dccab1", 352 | "sha256:6e811fcb295968434526407adb8796944f1988c5b65e8139058f2014cbe100fd" 353 | ], 354 | "version": "==4.3.21" 355 | }, 356 | "lazy-object-proxy": { 357 | "hashes": [ 358 | "sha256:0c4b206227a8097f05c4dbdd323c50edf81f15db3b8dc064d08c62d37e1a504d", 359 | "sha256:194d092e6f246b906e8f70884e620e459fc54db3259e60cf69a4d66c3fda3449", 360 | "sha256:1be7e4c9f96948003609aa6c974ae59830a6baecc5376c25c92d7d697e684c08", 361 | "sha256:4677f594e474c91da97f489fea5b7daa17b5517190899cf213697e48d3902f5a", 362 | "sha256:48dab84ebd4831077b150572aec802f303117c8cc5c871e182447281ebf3ac50", 363 | "sha256:5541cada25cd173702dbd99f8e22434105456314462326f06dba3e180f203dfd", 364 | "sha256:59f79fef100b09564bc2df42ea2d8d21a64fdcda64979c0fa3db7bdaabaf6239", 365 | "sha256:8d859b89baf8ef7f8bc6b00aa20316483d67f0b1cbf422f5b4dc56701c8f2ffb", 366 | "sha256:9254f4358b9b541e3441b007a0ea0764b9d056afdeafc1a5569eee1cc6c1b9ea", 367 | "sha256:9651375199045a358eb6741df3e02a651e0330be090b3bc79f6d0de31a80ec3e", 368 | "sha256:97bb5884f6f1cdce0099f86b907aa41c970c3c672ac8b9c8352789e103cf3156", 369 | 
"sha256:9b15f3f4c0f35727d3a0fba4b770b3c4ebbb1fa907dbcc046a1d2799f3edd142", 370 | "sha256:a2238e9d1bb71a56cd710611a1614d1194dc10a175c1e08d75e1a7bcc250d442", 371 | "sha256:a6ae12d08c0bf9909ce12385803a543bfe99b95fe01e752536a60af2b7797c62", 372 | "sha256:ca0a928a3ddbc5725be2dd1cf895ec0a254798915fb3a36af0964a0a4149e3db", 373 | "sha256:cb2c7c57005a6804ab66f106ceb8482da55f5314b7fcb06551db1edae4ad1531", 374 | "sha256:d74bb8693bf9cf75ac3b47a54d716bbb1a92648d5f781fc799347cfc95952383", 375 | "sha256:d945239a5639b3ff35b70a88c5f2f491913eb94871780ebfabb2568bd58afc5a", 376 | "sha256:eba7011090323c1dadf18b3b689845fd96a61ba0a1dfbd7f24b921398affc357", 377 | "sha256:efa1909120ce98bbb3777e8b6f92237f5d5c8ea6758efea36a473e1d38f7d3e4", 378 | "sha256:f3900e8a5de27447acbf900b4750b0ddfd7ec1ea7fbaf11dfa911141bc522af0" 379 | ], 380 | "version": "==1.4.3" 381 | }, 382 | "mccabe": { 383 | "hashes": [ 384 | "sha256:ab8a6258860da4b6677da4bd2fe5dc2c659cff31b3ee4f7f5d64e79735b80d42", 385 | "sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f" 386 | ], 387 | "version": "==0.6.1" 388 | }, 389 | "more-itertools": { 390 | "hashes": [ 391 | "sha256:5dd8bcf33e5f9513ffa06d5ad33d78f31e1931ac9a18f33d37e77a180d393a7c", 392 | "sha256:b1ddb932186d8a6ac451e1d95844b382f55e12686d51ca0c68b6f61f2ab7a507" 393 | ], 394 | "version": "==8.2.0" 395 | }, 396 | "packaging": { 397 | "hashes": [ 398 | "sha256:170748228214b70b672c581a3dd610ee51f733018650740e98c7df862a583f73", 399 | "sha256:e665345f9eef0c621aa0bf2f8d78cf6d21904eef16a93f020240b704a57f1334" 400 | ], 401 | "version": "==20.1" 402 | }, 403 | "pluggy": { 404 | "hashes": [ 405 | "sha256:15b2acde666561e1298d71b523007ed7364de07029219b604cf808bfa1c765b0", 406 | "sha256:966c145cd83c96502c3c3868f50408687b38434af77734af1e9ca461a4081d2d" 407 | ], 408 | "version": "==0.13.1" 409 | }, 410 | "py": { 411 | "hashes": [ 412 | "sha256:5e27081401262157467ad6e7f851b7aa402c5852dbcb3dae06768434de5752aa", 413 | "sha256:c20fdd83a5dbc0af9efd622bee9a5564e278f6380fffcacc43ba6f43db2813b0" 414 | ], 415 | "version": "==1.8.1" 416 | }, 417 | "pylint": { 418 | "hashes": [ 419 | "sha256:3db5468ad013380e987410a8d6956226963aed94ecb5f9d3a28acca6d9ac36cd", 420 | "sha256:886e6afc935ea2590b462664b161ca9a5e40168ea99e5300935f6591ad467df4" 421 | ], 422 | "index": "pypi", 423 | "version": "==2.4.4" 424 | }, 425 | "pyparsing": { 426 | "hashes": [ 427 | "sha256:4c830582a84fb022400b85429791bc551f1f4871c33f23e44f353119e92f969f", 428 | "sha256:c342dccb5250c08d45fd6f8b4a559613ca603b57498511740e65cd11a2e7dcec" 429 | ], 430 | "version": "==2.4.6" 431 | }, 432 | "pytest": { 433 | "hashes": [ 434 | "sha256:0d5fe9189a148acc3c3eb2ac8e1ac0742cb7618c084f3d228baaec0c254b318d", 435 | "sha256:ff615c761e25eb25df19edddc0b970302d2a9091fbce0e7213298d85fb61fef6" 436 | ], 437 | "index": "pypi", 438 | "version": "==5.3.5" 439 | }, 440 | "pytest-cov": { 441 | "hashes": [ 442 | "sha256:cc6742d8bac45070217169f5f72ceee1e0e55b0221f54bcf24845972d3a47f2b", 443 | "sha256:cdbdef4f870408ebdbfeb44e63e07eb18bb4619fae852f6e760645fa36172626" 444 | ], 445 | "index": "pypi", 446 | "version": "==2.8.1" 447 | }, 448 | "pytest-mock": { 449 | "hashes": [ 450 | "sha256:b35eb281e93aafed138db25c8772b95d3756108b601947f89af503f8c629413f", 451 | "sha256:cb67402d87d5f53c579263d37971a164743dc33c159dfb4fb4a86f37c5552307" 452 | ], 453 | "index": "pypi", 454 | "version": "==2.0.0" 455 | }, 456 | "pytest-pylint": { 457 | "hashes": [ 458 | "sha256:3996a55ba66ce8ddf150754d8549567a4b067d63fa4513fdfd3325c7553c8075", 459 | 
"sha256:b3f83f4525b2cbd019e9e46b4ee9c4ccee82bde66edf9872690ccfdc75456703" 460 | ], 461 | "index": "pypi", 462 | "version": "==0.15.0" 463 | }, 464 | "requests": { 465 | "hashes": [ 466 | "sha256:11e007a8a2aa0323f5a921e9e6a2d7e4e67d9877e85773fba9ba6419025cbeb4", 467 | "sha256:9cf5292fcd0f598c671cfc1e0d7d1a7f13bb8085e9a590f48c010551dc6c4b31" 468 | ], 469 | "index": "pypi", 470 | "version": "==2.22.0" 471 | }, 472 | "responses": { 473 | "hashes": [ 474 | "sha256:515fd7c024097e5da76e9c4cf719083d181f1c3ddc09c2e0e49284ce863dd263", 475 | "sha256:8ce8cb4e7e1ad89336f8865af152e0563d2e7f0e0b86d2cf75f015f819409243" 476 | ], 477 | "index": "pypi", 478 | "version": "==0.10.9" 479 | }, 480 | "six": { 481 | "hashes": [ 482 | "sha256:236bdbdce46e6e6a3d61a337c0f8b763ca1e8717c03b369e87a7ec7ce1319c0a", 483 | "sha256:8f3cd2e254d8f793e7f3d6d9df77b92252b52637291d0f0da013c76ea2724b6c" 484 | ], 485 | "version": "==1.14.0" 486 | }, 487 | "typed-ast": { 488 | "hashes": [ 489 | "sha256:0666aa36131496aed8f7be0410ff974562ab7eeac11ef351def9ea6fa28f6355", 490 | "sha256:0c2c07682d61a629b68433afb159376e24e5b2fd4641d35424e462169c0a7919", 491 | "sha256:249862707802d40f7f29f6e1aad8d84b5aa9e44552d2cc17384b209f091276aa", 492 | "sha256:24995c843eb0ad11a4527b026b4dde3da70e1f2d8806c99b7b4a7cf491612652", 493 | "sha256:269151951236b0f9a6f04015a9004084a5ab0d5f19b57de779f908621e7d8b75", 494 | "sha256:4083861b0aa07990b619bd7ddc365eb7fa4b817e99cf5f8d9cf21a42780f6e01", 495 | "sha256:498b0f36cc7054c1fead3d7fc59d2150f4d5c6c56ba7fb150c013fbc683a8d2d", 496 | "sha256:4e3e5da80ccbebfff202a67bf900d081906c358ccc3d5e3c8aea42fdfdfd51c1", 497 | "sha256:6daac9731f172c2a22ade6ed0c00197ee7cc1221aa84cfdf9c31defeb059a907", 498 | "sha256:715ff2f2df46121071622063fc7543d9b1fd19ebfc4f5c8895af64a77a8c852c", 499 | "sha256:73d785a950fc82dd2a25897d525d003f6378d1cb23ab305578394694202a58c3", 500 | "sha256:8c8aaad94455178e3187ab22c8b01a3837f8ee50e09cf31f1ba129eb293ec30b", 501 | "sha256:8ce678dbaf790dbdb3eba24056d5364fb45944f33553dd5869b7580cdbb83614", 502 | "sha256:aaee9905aee35ba5905cfb3c62f3e83b3bec7b39413f0a7f19be4e547ea01ebb", 503 | "sha256:bcd3b13b56ea479b3650b82cabd6b5343a625b0ced5429e4ccad28a8973f301b", 504 | "sha256:c9e348e02e4d2b4a8b2eedb48210430658df6951fa484e59de33ff773fbd4b41", 505 | "sha256:d205b1b46085271b4e15f670058ce182bd1199e56b317bf2ec004b6a44f911f6", 506 | "sha256:d43943ef777f9a1c42bf4e552ba23ac77a6351de620aa9acf64ad54933ad4d34", 507 | "sha256:d5d33e9e7af3b34a40dc05f498939f0ebf187f07c385fd58d591c533ad8562fe", 508 | "sha256:fc0fea399acb12edbf8a628ba8d2312f583bdbdb3335635db062fa98cf71fca4", 509 | "sha256:fe460b922ec15dd205595c9b5b99e2f056fd98ae8f9f56b888e7a17dc2b757e7" 510 | ], 511 | "markers": "implementation_name == 'cpython' and python_version < '3.8'", 512 | "version": "==1.4.1" 513 | }, 514 | "urllib3": { 515 | "hashes": [ 516 | "sha256:2f3db8b19923a873b3e5256dc9c2dedfa883e33d87c690d9c7913e1f40673cdc", 517 | "sha256:87716c2d2a7121198ebcb7ce7cccf6ce5e9ba539041cfbaeecfb641dc0bf6acc" 518 | ], 519 | "version": "==1.25.8" 520 | }, 521 | "wcwidth": { 522 | "hashes": [ 523 | "sha256:8fd29383f539be45b20bd4df0dc29c20ba48654a41e661925e612311e9f3c603", 524 | "sha256:f28b3e8a6483e5d49e7f8949ac1a78314e740333ae305b4ba5defd3e74fb37a8" 525 | ], 526 | "version": "==0.1.8" 527 | }, 528 | "wrapt": { 529 | "hashes": [ 530 | "sha256:565a021fd19419476b9362b05eeaa094178de64f8361e44468f9e9d7843901e1" 531 | ], 532 | "version": "==1.11.2" 533 | }, 534 | "zipp": { 535 | "hashes": [ 536 | "sha256:5c56e330306215cd3553342cfafc73dda2c60792384117893f3a83f8a1209f50", 537 | 
"sha256:d65287feb793213ffe11c0f31b81602be31448f38aeb8ffc2eb286c4f6f6657e" 538 | ], 539 | "version": "==2.2.0" 540 | } 541 | } 542 | } 543 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | PDF Trio 3 | ============ 4 | 5 | **Git Repo:** 6 | 7 | **Blog Post:** 8 | 9 | **License:** Apache-2.0 10 | 11 | `pdf_trio` is a Machine Learning (ML) service which combines three distinct 12 | classifiers to predict whether a PDF document is a "research publication". It 13 | exposes an HTTP API which receives files (POST) and returns classification 14 | confidence scores. There is also an experimental endpoint for fast 15 | classification based on URL strings alone. 16 | 17 | This project was developed at the [Internet Archive](https://archive.org) as 18 | part of efforts to collect and provide perpetual access to research documents 19 | on the world wide web. Initial funding was provided by the Andrew W. Mellon 20 | Foundation under the "[Ensuring the Persistent Access of Long Tail Open Access 21 | Journal Literature][mellon-blog]" project. 22 | 23 | This system was originally designed, trained, and implemented by [Paul 24 | Baclace](http://baclace.net/). See `CONTRIBUTORS.txt` for details. 25 | 26 | [mellon-blog]: https://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/ 27 | 28 | 29 | ## Quickstart with Docker Compose 30 | 31 | These instructions describe how to run the `pdf_trio` API service locally using 32 | docker and pre-trained Tensorflow and fastText machine learning models. They 33 | assume you have docker installed locally, as well as basic command line tools 34 | (bash, wget, etc). 35 | 36 | **Download trained model files:** about 1.6 GBytes of data to download from 37 | archive.org. 38 | 39 | ./fetch_models.sh 40 | 41 | **Run docker-compose:** this command will build a docker image for the API from 42 | scratch and run it. It will also fetch and run two tensorflow-serving back-end 43 | daemons. Requires several GByte of RAM. 44 | 45 | docker-compose up 46 | 47 | You can try submitting a PDF for classification like: 48 | 49 | curl localhost:3939/classify/research-pub/all -F pdf_content=@tests/files/research/hidden-technical-debt-in-machine-learning-systems__2015.pdf 50 | 51 | To re-build the API docker image (eg, if you make local code changes): 52 | 53 | docker-compose up --build --force-recreate --no-deps api 54 | 55 | ## Development 56 | 57 | The default python dependency management system for this project is `pipenv`, 58 | though it is also possible to use `conda` (see directions later in this 59 | document). 60 | 61 | To install dependencies on a Debian buster Linux computer: 62 | 63 | sudo apt install -y poppler-utils imagemagick libmagickcore-6.q16-6-extra ghostscript netpbm gsfonts wget 64 | pip3 install pipenv 65 | pipenv install --dev --deploy 66 | 67 | Download model files: 68 | 69 | ./fetch_models.sh 70 | 71 | Use the default local configuration: 72 | 73 | cp example.env .env 74 | 75 | Run just the tensorflow-serving back-end daemons using docker-compose like: 76 | 77 | docker-compose up tfserving 78 | 79 | Unit tests partially mock the back-end tensorflow-serving daemons, and any 80 | tests which do call these daemons will automatically skip if they are not 81 | available locally. 
To run the tests: 82 | 83 | pipenv run python -m pytest 84 | pipenv run pylint -E pdf_trio tests/*.py 85 | 86 | # with coverage: 87 | pipenv run pytest --cov --cov-report html 88 | 89 | ## Background 90 | 91 | The purpose of this project is to identify research works for richer cataloging 92 | in production at [Internet Archive](https://archive.org). Research works are 93 | not always found in well-curated locations with good metadata that can be 94 | utilized to enrich indexing and search. [Ongoing 95 | work](https://blog.dshr.org/2015/04/preserving-long-form-digital-humanities.html) 96 | at the Internet Archive will use this classifier ensemble to curate "long tail" 97 | research articles in multiple languages published by small publishers. [Low 98 | volume 99 | publishing](https://blog.dshr.org/2017/01/the-long-tail-of-non-english-science.html) 100 | is inversely correlated to longevity, so the goal is to preserve the research 101 | works from these sites to ensure they are not lost. 102 | 103 | The performance target is to classify a PDF in about 1 second or less on 104 | average and this implementation achieves that goal when multiple, parallel 105 | requests are made on a multi-core machine without a GPU. 106 | 107 | The URL classifier is best used as a "true vs. unknown" choice, that is, if the 108 | classification is non-research ('other'), then find out more, do not assume it 109 | is not a research work. Our motivation is to have a quick check that can be 110 | used during crawling. A high confidence is used to avoid false positives. 111 | 112 | ## Design 113 | 114 | * REST API based on python Flask 115 | * Deep Learning neural networks 116 | * Run with tensorflow serving for high throughput 117 | * CNN for image classification 118 | * BERT for text classification using a multilingual model 119 | * FastText linear classifier 120 | * Full text 'bag of words' classification for high-throughput 121 | * URL-based classification 122 | * PDF training data [preparation scripts](data_prep/README.md) for each kind of sub-classifier 123 | 124 | Two other repos are relied upon and not copied into this repo because they are 125 | useful standalone. This repo can be considered the 'roll up' that integrates 126 | the ensemble. 127 | 128 | This PDF classifier can be re-purposed for other binary cases simply by using 129 | different training data. 130 | 131 | ### PDF Challenges 132 | 133 | PDFs have challenges, because they can: 134 | * be pure images of text with no easily extractable text at all 135 | * range from 1 page position papers to 100+ page theses 136 | * have citations either at the end of the document, or interspersed as footnotes 137 | 138 | We decided to avoid using OCR for text extraction for speed reasons and because 139 | of the challenge of multiple languages. 140 | 141 | We address these challenges with an ensemble of classifiers that use confidence 142 | values to cover all the cases. There are still some edge cases, but incidence 143 | rate is at most a few percent. 
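To make the combination of confidence values concrete, here is an illustrative sketch only; the actual combination logic lives in `pdf_trio/pdf_classifier.py` and may weight the three classifiers differently.

```
# Illustrative only: a simple average of the three classifiers' "research"
# confidences (each in [0, 1]). pdf_trio/pdf_classifier.py is authoritative.
def combine_confidences(linear: float, bert: float, image: float) -> float:
    return (linear + bert + image) / 3.0

# e.g. strong text signals can outvote a weak image signal
assert combine_confidences(0.95, 0.90, 0.40) > 0.7
```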
144 | 145 | ## API Configuration and Deployment 146 | 147 | The following env vars must be defined to run this API service: 148 | 149 | - `FT_MODEL` full path to the FastText model for linear classifier 150 | - `FT_URL_MODEL` path to FastText model for URL classifier 151 | - `TF_IMAGE_SERVER_URL` base API URL for image tensorflow-serving process 152 | - `TF_BERT_SERVER_URL` base API URL for BERT tensorflow-serving process 153 | 154 | ### Backend Service Dependency Setup 155 | 156 | These directions assume you are running in an Ubuntu Xenial (16.04 LTS) virtual 157 | machine. 158 | 159 | sudo apt-get install -y poppler-utils imagemagick libmagickcore-6.q16-2-extra ghostscript netpbm gsfonts-other 160 | conda create --name pdf_trio python=3.7 --file requirements.txt 161 | conda activate pdf_trio 162 | 163 | Edit `/etc/ImageMagick/policy.xml` to change: 164 | 165 | <policy domain="coder" rights="none" pattern="PDF" /> 166 | 167 | To: 168 | 169 | <policy domain="coder" rights="read|write" pattern="PDF" /> 170 | 171 | We expect imagemagick 6.x; when 7.x is used, the binary will not be called 172 | `convert` anymore. 173 | 174 | ### Backend Components for Serving 175 | 176 | - fastText-python is used for hosting fastText models 177 | - tensorflow-serving for image and BERT inference 178 | 179 | #### Tensorflow-Serving via Docker 180 | 181 | Tensorflow serving is used to host the NN models, and we prefer to use its REST API because that significantly 182 | reduces the complexity on the client side. 183 | 184 | - install on ubuntu (docker.io distinguishes Docker from some preexisting 185 | package 'docker', a window manager extension): 186 | - `sudo apt-get install docker.io` 187 | - get a docker image from the docker hub (might need to be careful about v1.x to v2.x transition) 188 | - `sudo docker pull tensorflow/serving:latest` 189 | - NOTE: I am running docker at system level (not user), so sudo is needed for operations, YMMV 190 | - see what is running: 191 | - `sudo docker ps` 192 | - stop a running container, using the id shown in the docker ps output: 193 | - `sudo docker stop 955d531040b2` 194 | - to start tensorflow-serving: 195 | - `./start_bert_serving.sh` 196 | - `./start_image_classifier_serving.sh` 197 | 198 | ## Training and Models 199 | 200 | These are covered in detail under [data_prep](data_prep/README.md): 201 | - fastText (needed for training) 202 | - bert variant repo for training 203 | - tf_image_classifier repo 204 | 205 | ### Models 206 | 207 | Sample models for research-pub classification are available at Internet Archive 208 | under . 209 | 210 | A handy Python 211 | [package](https://archive.org/services/docs/api/internetarchive/quickstart.html#downloading) 212 | will fetch the files and directory structure (necessary for 213 | tensorflow-serving). You can use curl and carefully recreate the directory 214 | structure, of course. The full set of models is 1.6GB.
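As a sketch of that approach, the `internetarchive` package (`pip install internetarchive`) can mirror an item into a local directory; the item identifier and destination directory below are placeholders, not the published item name.

```
# Hypothetical sketch: fetch the model item from archive.org.
# Replace the identifier with the archive.org item that hosts the models.
from internetarchive import download

download(
    "pdf_trio_models",           # placeholder item identifier
    destdir="model_snapshots",   # local destination directory (assumption)
    verbose=True,
)
```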
215 | 216 | Here is a summary of the model files, directories, and env vars to specify the paths: 217 | 218 | | env var | size | path | Used By | 219 | | ------- | ---- | ------- | ------- | 220 | | BERT_MODEL_PATH | 1.3GB | pdf_trio_models/bert_models | start_bert_serving.sh | 221 | | IMAGE_MODEL_PATH | 87MB | pdf_trio_models/pdf_image_classifier_model | start_image_classifier_serving.sh | 222 | | FT_MODEL | 600MB | pdf_trio_models/fastText_models/dataset20000_20190818.bin | start_api_service.sh | 223 | | FT_URL_MODEL | 202MB | pdf_trio_models/fastText_models/url_dataset20000_20190817.bin | start_api_service.sh | 224 | 225 | 226 | See [Data Prep](data_prep/README.md) for details on preparing training data and how to train each classifier. 227 | 228 | -------------------------------------------------------------------------------- /data_prep/README.md: -------------------------------------------------------------------------------- 1 | # Data Prep 2 | 3 | There are three kinds of classifiers used in this ensemble. Each is detailed in a subdirectory describing 4 | how the data is prepared and how to train. 5 | 6 | * [FastText](ft_data_prep/README.md) a fast linear classifier 7 | * [CNN](image_data_prep/README.md) convolutional neural network classifier 8 | * [BERT](bert_data_prep/README.md) 9 | 10 | ## Data Prep Tools 11 | 12 | External programs such as imagemagick and pdftotext are used here, and they are also used by the REST API to 13 | prepare raw PDF content for inference. 14 | 15 | Processing individual PDFs using the scripts here is straightforward. Some unavoidable idiosyncrasies creep in when 16 | we process URL collections, which arrive in an arbitrary .tsv file format. Inference by URL is completely separate 17 | from classifying PDF content, however. 18 | 19 | ### Convert PDF to JPG 20 | Run this in a directory containing .pdf files while specifying the destination directory for 21 | the corresponding .jpg files. The files will have the same basename as the 22 | corresponding PDF. 23 | ``` 24 | ./pdf_image.sh dest_jpg_dir 25 | ``` 26 | Images are 224x224 with a white background. The page image is horizontally centered and flush against the "north". 27 | The regularization used for natural image training is not used because PDFs are not natural images. 28 | 29 | It was found empirically that a jpg with a file size of 3000 bytes or 30 | smaller is always blank. It would be especially detrimental to have both positive and negative examples that 31 | were blank, so these are removed. 32 | 33 | ### Extract Text from PDFs 34 | Running this will extract text from each .pdf into a file with the same basename plus .txt. 35 | Text is in human reading order. It relies on pdftotext; installation instructions are 36 | in the top-level README file. 37 | ``` 38 | ./pdf_text.sh pdf_src_dir text_dest_dir 39 | ``` 40 | If the text file size is too small (less than 500 bytes), it is removed to prevent training with 41 | insufficient features. If the txt file already exists, then that conversion is skipped. 42 | 43 | We found that pdftotext worked better than native Python libraries for text extraction.
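For orientation, here is a minimal Python sketch of the same per-directory conversion that `pdf_text.sh` performs, assuming the poppler `pdftotext` binary is on the PATH; the shell script remains the canonical implementation.

```
# Sketch only: mirrors the pdf_text.sh behavior described above.
import subprocess
from pathlib import Path

MIN_TEXT_BYTES = 500  # smaller extractions carry too few features to train on

def pdfs_to_text(pdf_src_dir: str, text_dest_dir: str) -> None:
    dest = Path(text_dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for pdf in Path(pdf_src_dir).glob("*.pdf"):
        txt = dest / (pdf.stem + ".txt")
        if txt.exists():
            continue  # conversion already done, skip
        subprocess.run(["pdftotext", str(pdf), str(txt)], check=False)
        if txt.exists() and txt.stat().st_size < MIN_TEXT_BYTES:
            txt.unlink()  # drop near-empty extractions
```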
44 | 45 | 46 | -------------------------------------------------------------------------------- /data_prep/bert_data_prep/README.md: -------------------------------------------------------------------------------- 1 | # Data Prep and Training (fine-tuning) for BERT 2 | 3 | ## Overview 4 | There are two parts to obtaining a BERT model for classification: Preparing the dataset and 5 | training/fine-tuning a pre-trained, multilingual BERT. 6 | 7 | The training dataset is created by extracting text from PDFs and forming 512 tokens per document in each 8 | row of a TSV file required by `run_classifier.py`. A [modified BERT](https://github.com/tralfamadude/bert) 9 | is used to create a SavedModel for fast serving via tensorflow serving 10 | (only `run_classifier.py` was modified). 11 | 12 | The training/fine-tuning is best done on Google Cloud using TPUs for quick performance (15min for 20k 13 | samples), but can also be done using just CPU (30hr using 48GB RAM x 30 cores x 2GHz clock). 14 | 15 | ## Prepare BERT Training Dataset 16 | Use `gen_bert_data.py` to prepare training data for BERT by ingesting .pdf files, producing 17 | intermediate .txt files using pdftotext, and then processing those into the tsv file that 18 | BERT requires. The script `prep_all_bert.sh` shows how to run it: 19 | ``` 20 | ./prep_all_bert.sh 21 | Usage: basename_for_dataset dest_dir kill_list other_pdfs_dir research_pdfs_dir_1 [research_pdfs_dir_2] 22 | 23 | # Example usage 24 | TS=$(date '+%Y%m%dT%H') 25 | ./prep_all_bert.sh bert$TS mydest/ my_kill_list my_other_pdfs_dir/ my_research_pdfs_dir/ 26 | ``` 27 | where the basename is given a timestamp to record when the data was gathered, my_kill_list is a file with 28 | pdf file basenames (one per line, no .pdf extension, no path) which is used to ignore certain PDF files (the file can be 29 | empty, like /dev/null), and two directories of PDF files, one for each class. 30 | 31 | Note that the kill_list is only applied to the 'other' category. This is because the Internet Archive collection 32 | of 'research' PDFs is taken as a given since it is derived from authoritative sources. 33 | By contrast, the 'other' collection is derived from random PDFs at large, so it can contain docs that 34 | actually are 'research'. In practice, about 6% of random PDFs on the Internet are research docs. Our process has 35 | been to analyze misclassified docs by viewing them, and if they are labeled 'other' but should be 36 | 'research', they are added to the kill list. This makes it possible to decouple the exceptions from 37 | the gathering of random PDFs for 'other'. 38 | 39 | For internal provenance reasons, an optional, additional research PDF directory is supported on the 40 | command line. 41 | 42 | ## Train (Fine-Tune) BERT 43 | We are using run_classifier.py from BERT to train (fine-tune, actually) the classifier. 44 | 45 | Training/fine-tuning is best done on Google Cloud using TPUs for quick performance (15min), but can also 46 | be done using just CPU (30hr). 47 | 48 | ## Prepare Environment 49 | Spin up a linux node in Google Cloud. The following was tested against Ubuntu 16, but should only need 50 | minor tweaks on other versions or flavors. The TPU was a v2-8 TPU (named tpu-node-1), which is basically 51 | the least-powerful TPU available, but totally sufficient for the present purpose. 52 | 53 | The TPU is the most expensive part to rent (about $100/day), so do not start it until ready to train/fine-tune and double-check 54 | when done that you stopped the TPU. 
Note that defining the TPU instance will auto-start it. 55 | ``` 56 | # install miniconda 57 | curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh >Miniconda3-latest-Linux-x86_64.sh 58 | chmod +x *.sh 59 | ./Miniconda3-latest-Linux-x86_64.sh 60 | 61 | # update env 62 | . ~/.bashrc 63 | # should be in (base) conda env now 64 | 65 | # create tf_hub env 66 | conda create --name tf_hub python=3.7 scipy tensorflow=1.14.0 tensorboard=1.14.0 numpy tensorflow-hub -c conda-forge 67 | 68 | conda activate tf_hub 69 | 70 | pip install --upgrade google-api-python-client 71 | pip install --upgrade oauth2client 72 | 73 | ``` 74 | ## A BERT repo for SavedModel 75 | The run_classifier.py module of BERT is used to train/fine-tune and test/validate the model. 76 | 77 | Use this particular BERT repo because it has been modified to produce a SavedModel that works with 78 | the REST API of TensorFlow Serving; the REST definition matches the usage in app.py in the top-level directory: 79 | ``` 80 | git clone https://github.com/tralfamadude/bert.git 81 | 82 | ``` 83 | 84 | ## Cloud Storage 85 | A bucket must be defined, called BUCKET below. 86 | When using a cloud TPU, cloud storage (gs://) must be used for the pretrained model 87 | and the output directory, as indicated by the BERT repo README. 88 | Also put the inputs into gs://${BUCKET}/$BASE, where BASE is the basename used when creating the dataset above. 89 | 90 | Note that the bucket, TPU, and machine instance need to be in the same region. 91 | 92 | In the next commands, we assume ~/$BASE corresponds to the directory generated by the dataset creation above, 93 | so scp your dataset to the cloud instance. 94 | 95 | ``` 96 | BASE=?fill-in? 97 | BUCKET=mybucket 98 | curl https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip > multi_cased_L-12_H-768_A-12.zip 99 | unzip multi_cased_L-12_H-768_A-12.zip 100 | export BERT_BASE_DIR=gs://$BUCKET/multi_cased_L-12_H-768_A-12 101 | gsutil cp -r multi_cased_L-12_H-768_A-12 gs://$BUCKET/ 102 | gsutil cp -r ~/$BASE gs://$BUCKET/ 103 | 104 | ``` 105 | ## Training 106 | Start your TPU; it is assumed to be called tpu-node-1 below. 107 | ``` 108 | TS=$(date '+%Y%m%dT%H%M') 109 | BOUT=bert_output_$TS 110 | 111 | nohup python ./run_classifier.py \ 112 | --task_name=cola \ 113 | --do_train=true \ 114 | --do_eval=true \ 115 | --vocab_file=$BERT_BASE_DIR/vocab.txt \ 116 | --bert_config_file=$BERT_BASE_DIR/bert_config.json \ 117 | --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \ 118 | --max_seq_length=512 \ 119 | --train_batch_size=64 \ 120 | --learning_rate=2e-5 \ 121 | --num_train_epochs=3 \ 122 | --do_lower_case=False \ 123 | --data_dir=gs://${BUCKET}/$BASE \ 124 | --output_dir=gs://${BUCKET}/$BOUT \ 125 | --use_tpu=true \ 126 | --tpu_name=tpu-node-1 \ 127 | --tpu_zone=us-central1-b \ 128 | --num_tpu_cores=8 \ 129 | >run.out 2>&1 & 130 | 131 | wait 132 | ``` 133 | Wait for it to finish before proceeding. It should take about 15min for 20k samples. 134 | If it finishes in a minute, then 135 | something probably went wrong. Look at run.out to see what happened. 136 | 137 | Using the Google Cloud console, look at gs://${BUCKET}/$BOUT to confirm the output was written. 138 | 139 | ## Testing 140 | This step measures accuracy against the withheld dataset. 141 | The model checkpoint name and bert_output dir must be manually substituted below. 142 | (ToDo: need a gsutil command to find the max ckpt file; one possible approach is sketched below.)
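One possible way to find the highest-numbered checkpoint with gsutil, assuming the `BUCKET` and `BOUT` variables from the training step (an untested sketch, not part of the original scripts):
```
# List the checkpoint index files and keep the largest step number.
gsutil ls "gs://${BUCKET}/${BOUT}/model.ckpt-*.index" \
  | sed 's/.*model\.ckpt-\([0-9]*\)\.index$/\1/' \
  | sort -n | tail -1
```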
143 | ``` 144 | SAVED_MODEL=gs://${BUCKET}/bert_finetuned_${TS} 145 | 146 | python ./run_classifier.py \ 147 | --task_name=cola \ 148 | --do_predict=true \ 149 | --vocab_file=$BERT_BASE_DIR/vocab.txt \ 150 | --bert_config_file=$BERT_BASE_DIR/bert_config.json \ 151 | --init_checkpoint=gs://${BUCKET}/${BOUT}/model.ckpt-1136 \ 152 | --max_seq_length=512 \ 153 | --data_dir=gs://${BUCKET}/$BASE \ 154 | --use_tpu=true \ 155 | --tpu_name=tpu-node-1 \ 156 | --tpu_zone=us-central1-b \ 157 | --num_tpu_cores=8 \ 158 | --output_dir=gs://${BUCKET}/${BOUT} 2>&1 | tee measure.out 159 | 160 | # 161 | # measure withheld samples: 162 | gsutil cp gs://${BUCKET}/${BOUT}/test_results.tsv . 163 | gsutil cp gs://${BUCKET}/$BASE/test_original.tsv . 164 | python ./evaluate_test_set_predictions.py --tsv ./test_original.tsv --results ./test_results.tsv > test_tally 165 | 166 | # see test_tally for details on results on withheld samples 167 | 168 | ``` 169 | Back in research-pub/data_prep/bert_data_prep/ (not necessarily in Google Cloud), measure the withheld samples: 170 | ``` 171 | gsutil cp gs://${BUCKET}/${BOUT}/test_results.tsv . 172 | gsutil cp gs://${BUCKET}/$BASE/test_original.tsv . 173 | python ./evaluate_test_set_predictions.py --tsv ./test_original.tsv --results ./test_results.tsv > test_tally 174 | ``` 175 | Look at the file test_tally for details on results on the withheld samples. The last line shows the accuracy like this: `n 1818 correct 1777 0.977448` 176 | 177 | 178 | -------------------------------------------------------------------------------- /data_prep/bert_data_prep/evaluate_test_set_predictions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2019 Robert Jones jones@craic.com Craic Computing LLC 4 | 5 | # This software is made freely available under the terms of the MIT license 6 | 7 | # given a test file and a file of prediction results, report the record ID, the label and the prediction 8 | 9 | # TSV input format is 10 | # 11 | #