├── CONTRIBUTING.md ├── LICENSE.txt ├── README.md ├── analyze_form.py ├── analyze_invoice.py ├── document_pipeline.png ├── env_template.sh ├── main.py ├── requirements.txt ├── sort_documents.py ├── tag_article.py ├── test.py ├── tif_to_pdf.py └── utils.py /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We'd love to accept your patches and contributions to this project. There are 4 | just a few small guidelines you need to follow. 5 | 6 | ## Contributor License Agreement 7 | 8 | Contributions to this project must be accompanied by a Contributor License 9 | Agreement. You (or your employer) retain the copyright to your contribution; 10 | this simply gives us permission to use and redistribute your contributions as 11 | part of the project. Head over to <https://cla.developers.google.com/> to see 12 | your current agreements on file or to sign a new one. 13 | 14 | You generally only need to submit a CLA once, so if you've already submitted one 15 | (even if it was for a different project), you probably don't need to do it 16 | again. 17 | 18 | ## Code reviews 19 | 20 | All submissions, including submissions by project members, require review. We 21 | use GitHub pull requests for this purpose. Consult 22 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more 23 | information on using pull requests. 24 | 25 | ## Community Guidelines 26 | 27 | This project follows 28 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/). -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Document Parsing Pipeline 2 | 3 | This repo provides code for building a document processing pipeline on GCP. It works like this: 4 | 5 | 1. Upload a document as text, as a pdf, or as an image to the cloud (in this case, a Google Cloud Storage bucket). 6 | 2. The document gets tagged by type (e.g. "invoice," "article," etc.) and then put into a new folder/storage bucket based on that type. 7 | 3. 
Each document type is processed differently. A file that's moved to the "article" bucket gets tagged by topic (e.g. Sports, Politics, Technology). A file that's moved to the "invoices" bucket gets analyzed for prices, phone numbers, and other entities. 8 | 4. Extracted data then gets put into a BigQuery table. 9 | 10 | ## Step 0: Enable APIs 11 | 12 | For this project, you'll need to enable these GCP services: 13 | - AutoML Natural Language 14 | - Natural Language API 15 | - Vision API 16 | - Places API 17 | 18 | ![document pipeline architecture](https://github.com/dalequark/document-pipeline/blob/master/document_pipeline.png) 19 | 20 | ## Step 1: Create Storage Buckets 21 | 22 | First, you'll want to create a couple of [Storage Buckets](https://cloud.google.com/storage/docs/creating-buckets). 23 | 24 | Create one bucket, something like `gs://input-documents`, where you'll initially upload your unsorted documents. 25 | 26 | Next, create buckets for each of the document types you want to sort files into. In this project, I have code to analyze invoices and articles, so I created three buckets: 27 | 28 | - `gs://articles` 29 | - `gs://invoices` 30 | - `gs://unsorted-docs` 31 | 32 | ## Step 2: Create an AutoML Document Classification Model 33 | 34 | Next, you'll want to build a model that sorts documents by type, labeling them as invoices, articles, emails, and so on. To do this, I used the [RVL-CDIP dataset](https://www.cs.cmu.edu/~aharley/rvl-cdip/): 35 | 36 | ![rvl-cdip-1](https://www.cs.cmu.edu/~aharley/rvl-cdip/images/sample1.png) 37 | ![rvl-cdip-2](https://www.cs.cmu.edu/~aharley/rvl-cdip/images/sample2.png) 38 | 39 | A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015 40 | 41 | This huge dataset contains 400,000 grayscale document images in 16 classes. 42 | 43 | I also added some scanned receipt data from the [Scanned Receipts OCR and Information Extraction Dataset](https://rrc.cvc.uab.es/?ch=13). 44 | 45 | Because the combined dataset was so large and unwieldy, I used just a sample of documents, with this breakdown by document type: 46 | 47 | |label | count | 48 | |---------------|-------| 49 | |advertisement | 902 | 50 | |budget | 900 | 51 | |email | 900 | 52 | |form | 900 | 53 | |handwritten | 900 | 54 | |invoice | 900 | 55 | |letter | 900 | 56 | |news | 188 | 57 | |receipt | 626 | 58 | 59 | I've hosted all this data at `https://console.cloud.google.com/storage/browser/doc-classification-training-data` (or `gs://doc-classification-training-data`). The training files, as `pdf` and `txt` data, are in the `processed_files` folder. You'll need to copy those files to your own GCP project in order to train your own model (see the `gsutil` sketch below). You'll also see a file `automl_input.csv` in the bucket. This is the `csv` file you'll use to import data into AutoML Natural Language. 60 | 61 | AutoML Natural Language is a tool that builds custom deep learning models from user-provided training data. It works with text and pdf files. This is what I used to build my document type classifier. [The docs](https://cloud.google.com/natural-language/automl/docs/beginners-guide) will show you how to train your own document classification model. When you're done training your model, Google Cloud will host the model for you. 
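If you'd rather not assemble the dataset yourself, you can copy the hosted sample data into your own project. Here's a minimal `gsutil` sketch (the destination bucket name is a placeholder, and you'll likely need to rewrite the `gs://` paths inside `automl_input.csv` to point at your own bucket before importing):

```bash
# Copy the processed training files and the AutoML import CSV
# into a bucket you own (placeholder name below).
gsutil -m cp -r gs://doc-classification-training-data/processed_files gs://YOUR_TRAINING_BUCKET/
gsutil cp gs://doc-classification-training-data/automl_input.csv gs://YOUR_TRAINING_BUCKET/
```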
62 | 63 | To use your AutoML model in your app, you'll need to find your model name: 64 | 65 | `projects/YOUR LONG NUMBER/locations/us-central1/models/YOUR LONG MODEL ID NUMBER` 66 | 67 | We'll use this ID in the next step. 68 | 69 | ## Step 3: Set up Cloud Functions 70 | 71 | This document pipeline is designed so that when you upload files to an input bucket (e.g. `gs://input-documents`), they're sorted by type by the AutoML model we built in Step 2. 72 | 73 | After you've created this storage bucket and trained your AutoML model, you'll need to create a new [cloud function](https://cloud.google.com/functions/docs/quickstart-python) that runs every time a new file is uploaded to `gs://input-documents`. 74 | 75 | Create a new cloud function with the code in `sort_documents.py`. For this to work, you'll need to set several environment variables in the Cloud Functions console, like `INVOICES_BUCKET`, `UNSORTED_BUCKET`, and `ARTICLES_BUCKET`. These should be the names of the corresponding buckets you created, without the preceding `gs://` (e.g. `gs://invoices` -> `invoices`). You'll also need to set the environment variable `SORT_MODEL_NAME` to the model name we found in the last step (that entire long path that ends in the model ID number). 76 | 77 | Once you've set up this function, documents uploaded to `gs://input-documents` (or whatever you've called your input document bucket) will be classified and moved into the invoices, articles, or unsorted buckets respectively. 78 | 79 | There are two more cloud functions defined in this repository: 80 | 81 | - analyze_invoice.py 82 | - tag_article.py 83 | 84 | You'll want to create two new cloud functions that connect these scripts to your invoices and articles buckets. 85 | 86 | But first, you'll need to create BigQuery tables to store the data extracted by those functions. 87 | 88 | ## Step 4: Create BigQuery Tables 89 | 90 | [Create two BigQuery tables](https://cloud.google.com/bigquery/docs/tables). One, which you can call something like `article_tags`, will store the tags extracted by your `tag_article` cloud function. Its schema should be: 91 | 92 | | column name | type | 93 | |-------------|------| 94 | | filename |string| 95 | | tag |string| 96 | 97 | Create a second table called something like `invoice_data` with the schema: 98 | 99 | | column name | type | 100 | |-------------|------| 101 | | filename |string| 102 | | address |string| 103 | |phone_number |string| 104 | |name |string| 105 | |total |float | 106 | |date_str |string| 107 | 108 | For each of these tables, note the Table ID, which should be of the form: 109 | 110 | `your-project-name.your-dataset-name.your-table-name` 111 | 112 | ## Step 5: Create Remaining Cloud Functions 113 | 114 | Now that you've created those BigQuery tables, you can deploy the cloud functions: 115 | 116 | - tag_article.py 117 | - analyze_invoice.py 118 | 119 | As you deploy, you'll need to set the environment variables `ARTICLE_TAGS_TABLE` and `INVOICES_TABLE` respectively to their BigQuery table IDs. For the `analyze_invoice` cloud function, you'll also need to [create an API key](https://cloud.google.com/docs/authentication/api-keys) and set the environment variable `GOOGLE_API_KEY` to that key (this script uses the Google Places API to find information about businesses from their phone numbers). 120 | 121 | [This documentation](https://cloud.google.com/functions/docs/quickstart-python) shows you how to create a Python cloud function on Google Cloud. 
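If you prefer the CLI to the console, here's a rough sketch of what deploying one of these functions might look like (the runtime version, bucket name, and table ID are placeholders — adjust them to your project):

```bash
# Deploy the article tagger so it runs whenever a file lands in the
# articles bucket. Run from the repo root, where main.py and
# requirements.txt live.
gcloud functions deploy tag_article \
    --runtime python37 \
    --entry-point tag_article_entry \
    --trigger-resource YOUR_ARTICLES_BUCKET \
    --trigger-event google.storage.object.finalize \
    --set-env-vars ARTICLE_TAGS_TABLE=your-project.your-dataset.article_tags
```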
122 | 123 | ## Step 6: Analyze 124 | 125 | Voilà! You're done. Upload a file to your `gs://input-documents` bucket and watch as it's sorted and analyzed. 126 | 127 | **This is not an officially supported Google product** 128 | -------------------------------------------------------------------------------- /analyze_form.py: -------------------------------------------------------------------------------- 1 | from google.cloud import documentai_v1beta2 as documentai 2 | from google.cloud import bigquery 3 | import os 4 | 5 | def _insert_tags_bigquery(rows): 6 | client = bigquery.Client() 7 | table_id = os.environ["FORMS_TABLE"] 8 | table = client.get_table(table_id) 9 | errors = client.insert_rows(table, rows) 10 | if errors: 11 | print("Got errors " + str(errors)) 12 | 13 | def get_form_fields(bucket, filename): 14 | """Parse a form""" 15 | 16 | client = documentai.DocumentUnderstandingServiceClient() 17 | 18 | gcs_source = documentai.types.GcsSource(uri=f"gs://{bucket}/{filename}") 19 | 20 | # mime_type can be application/pdf, image/tiff, 21 | # image/gif, or application/json 22 | input_config = documentai.types.InputConfig( 23 | gcs_source=gcs_source, mime_type='application/pdf') 24 | 25 | # Improve form parsing results by providing key-value pair hints. 26 | # For each key hint, key is text that is likely to appear in the 27 | # document as a form field name (e.g. "DOB"). 28 | # Value types are optional, but can be one or more of: 29 | # ADDRESS, LOCATION, ORGANIZATION, PERSON, PHONE_NUMBER, ID, 30 | # NUMBER, EMAIL, PRICE, TERMS, DATE, NAME 31 | key_value_pair_hints = [ 32 | documentai.types.KeyValuePairHint(key='Emergency Contact', 33 | value_types=['NAME']), 34 | documentai.types.KeyValuePairHint( 35 | key='Referred By') 36 | ] 37 | 38 | # Setting enabled=True enables form extraction 39 | form_extraction_params = documentai.types.FormExtractionParams( 40 | enabled=True, key_value_pair_hints=key_value_pair_hints) 41 | 42 | # Location can be 'us' or 'eu' 43 | parent = 'projects/{}/locations/us'.format(os.environ["PROJECT_ID"]) 44 | request = documentai.types.ProcessDocumentRequest( 45 | parent=parent, 46 | input_config=input_config, 47 | form_extraction_params=form_extraction_params) 48 | 49 | document = client.process_document(request=request) 50 | 51 | def _get_text(el): 52 | """Doc AI identifies form fields by their offsets 53 | in document text. This function converts offsets 54 | to text snippets. 55 | """ 56 | response = '' 57 | # If a text segment spans several lines, it will 58 | # be stored in different text segments. 
59 | for segment in el.text_anchor.text_segments: 60 | start_index = segment.start_index 61 | end_index = segment.end_index 62 | response += document.text[start_index:end_index] 63 | return response 64 | 65 | # Return an array of form fields 66 | return [ 67 | { 68 | "filename": filename, 69 | "page": page.page_number, 70 | "form_field_name": _get_text(form_field.field_name), 71 | "form_field_value": _get_text(form_field.field_value) 72 | } 73 | for page in document.pages for form_field in page.form_fields 74 | ] 75 | 76 | def analyze_form(data, context): 77 | bucket = data["bucket"] 78 | name = data["name"] 79 | rows = get_form_fields(bucket, name) 80 | if rows: 81 | print("Inserting form gs://%s/%s into bigquery" % (bucket, name)) 82 | _insert_tags_bigquery(rows) -------------------------------------------------------------------------------- 
/analyze_invoice.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | # TODO: Convert to the new Invoice API when it's available in Python 16 | 17 | from google.cloud import vision 18 | from google.cloud import language 19 | from google.cloud.language import enums 20 | from google.cloud.language import types 21 | from google.cloud import bigquery 22 | import os 23 | import json 24 | import requests 25 | 26 | def _get_name_from_phone(phone_number): 27 | """ Given a phone number as a string, returns Google's 28 | guess at the place name. 29 | """ 30 | api_key = os.environ["GOOGLE_API_KEY"] 31 | digit_phone = phone_number.replace('-', '').replace('+', '') 32 | places_endpoint = f"https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input=%2B1{digit_phone}&inputtype=phonenumber&fields=name,formatted_address,geometry,photos&key={api_key}" # assumes a US (+1) number 33 | body = json.loads(requests.get(places_endpoint).text) 34 | response = {} 35 | if 'candidates' in body and len(body['candidates']) > 0: 36 | place_info = body['candidates'][0] 37 | return place_info["name"] if "name" in place_info else "" 38 | return "" 39 | 40 | def _extract_entities(text): 41 | """ Given a string of text, returns the extracted entities.""" 42 | 43 | client = language.LanguageServiceClient() 44 | 45 | document = types.Document( 46 | content=text, 47 | type=enums.Document.Type.PLAIN_TEXT) 48 | 49 | response = client.analyze_entities(document=document) 50 | 51 | return [{"name": entity.name, "type": enums.Entity.Type(entity.type).name, "metadata": entity.metadata} for entity in response.entities] 52 | 53 | def _extract_text(bucket_name, filename): 54 | uri = f"gs://{bucket_name}/{filename}" 55 | client = vision.ImageAnnotatorClient() 56 | res = client.document_text_detection({'source': {'image_uri': uri}}) 57 | text = res.full_text_annotation.text 58 | if not text: 59 | print("OCR error " + str(res)) 60 | return text 61 | 62 | 63 | def _analyze_invoice(bucket_name, filename): 64 | """ Given a GCS location of an invoice file, extracts (when found) 65 | the name, address, phone number, bill total (greatest 66 | price listed on the bill), and date. 67 | """ 68 | 69 | text = _extract_text(bucket_name, filename) 70 | if not text: 71 | print(f"Couldn't extract text from gs://{bucket_name}/{filename}") 72 | return 73 | 74 | entities = _extract_entities(text) 75 | if not entities: 76 | print(f"Couldn't extract entities from gs://{bucket_name}/{filename}") 77 | return 78 | 79 | result = {} 80 | 81 | # Extract the address. 82 | addrs = [x for x in entities if x['type'] == 'ADDRESS'] 83 | try: 84 | result['address'] = addrs[0]['name'] 85 | except IndexError: 86 | result['address'] = "" 87 | 88 | # Extract prices/total. 89 | prices = [x for x in entities if x['type'] == 'PRICE'] 90 | 91 | # Assume the highest listed price is the total. This may not always be true. 92 | try: 93 | total = max([float(x['metadata']['value']) for x in prices]) 94 | result['total'] = total 95 | except (ValueError, KeyError): 96 | result['total'] = None 97 | 98 | # Extract phone number. 99 | phone_numbers = [x for x in entities if x['type'] == 'PHONE_NUMBER'] 100 | 101 | try: 102 | result['phone_number'] = phone_numbers[0]['name'] 103 | name = _get_name_from_phone(result['phone_number']) 104 | if name: 105 | result["name"] = name 106 | else: 107 | result["name"] = "" 108 | except Exception: 109 | result['phone_number'] = "" 110 | result["name"] = "" 111 | 112 | # Extract transaction date. 
113 | dates = [x for x in entities if x['type'] == 'DATE'] 114 | try: 115 | result['date'] = dates[0]['name'] 116 | except IndexError: 117 | result['date'] = None 118 | 119 | return result 120 | 121 | def _insert_invoice_bigquery(filename, name, date, total, address, phone_number): 122 | client = bigquery.Client() 123 | table_id = os.environ["INVOICES_TABLE"] 124 | table = client.get_table(table_id) 125 | rows = [{"filename": filename, "name": name, "date_str": date, 126 | "total": total, "address": address, "phone_number": phone_number}] 127 | errors = client.insert_rows(table, rows) 128 | if errors: 129 | print("Got errors " + str(errors)) 130 | 131 | def analyze_invoice(data, context): 132 | bucket = data["bucket"] 133 | name = data["name"] 134 | result = _analyze_invoice(bucket, name) 135 | print(result) 136 | if result: 137 | print("Inserting invoice gs://%s/%s into bigquery" % (bucket, name)) 138 | _insert_invoice_bigquery(name, result["name"], result["date"], result["total"], result["address"], result["phone_number"]) -------------------------------------------------------------------------------- /document_pipeline.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dalequark/document-pipeline/4dd7da1d6f9561dacd5eb51a6240543013539227/document_pipeline.png -------------------------------------------------------------------------------- /env_template.sh: -------------------------------------------------------------------------------- 1 | export UNSORTED_BUCKET="YOUR_BUCKET_ID" 2 | export INVOICES_BUCKET="YOUR_BUCKET_ID" 3 | export ARTICLES_BUCKET="YOUR_BUCKET_ID" 4 | export FORMS_BUCKET="YOUR_BUCKET_ID" 5 | export SORT_MODEL_NAME="YOUR_MODEL_NAME" 6 | export SORT_MODEL_THRESHOLD="0.7" 7 | export ARTICLE_TAGS_TABLE="YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE" 8 | export INVOICES_TABLE="YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE" 9 | export FORMS_TABLE="YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE" 10 | export GOOGLE_API_KEY="YOUR_API_KEY" 11 | export PROJECT_ID="YOUR_PROJECT_ID" -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import sort_documents 2 | import tag_article 3 | import analyze_invoice 4 | import analyze_form 5 | 6 | def sort_documents_entry(data, context): 7 | sort_documents.sort_documents(data, context) 8 | 9 | def tag_article_entry(data, context): 10 | tag_article.tag_article(data, context) 11 | 12 | def analyze_invoice_entry(data, context): 13 | analyze_invoice.analyze_invoice(data, context) 14 | 15 | def analyze_form_entry(data, context): 16 | analyze_form.analyze_form(data, context) 17 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | astroid==2.4.1 2 | autopep8==1.5.2 3 | cachetools==4.0.0 4 | certifi==2019.11.28 5 | chardet==3.0.4 6 | google-api-core==1.16.0 7 | google-auth==1.10.2 8 | google-cloud-automl==0.9.0 9 | google-cloud-bigquery==1.23.1 10 | google-cloud-core==1.2.0 11 | google-cloud-documentai==0.1.0 12 | google-cloud-language==1.3.0 13 | google-cloud-storage==1.25.0 14 | google-cloud-vision==0.41.0 15 | google-resumable-media==0.5.0 16 | googleapis-common-protos==1.51.0 17 | grpcio==1.26.0 18 | idna==2.8 19 | img2pdf==0.3.3 20 | isort==4.3.21 21 | lazy-object-proxy==1.4.3 22 | mccabe==0.6.1 23 | Pillow==7.0.0 24 | proto-plus==0.4.0 25 | protobuf==3.11.2 26 | pyasn1==0.4.8 27 | pyasn1-modules==0.2.8 28 | pycodestyle==2.6.0 29 | pylint==2.5.2 30 | pytz==2019.3 31 | requests==2.22.0 32 | rsa==4.0 33 | six==1.14.0 34 | toml==0.10.1 35 | typed-ast==1.4.1 36 | urllib3==1.25.8 37 | 
wrapt==1.12.1 38 | -------------------------------------------------------------------------------- /sort_documents.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | import os 16 | from google.api_core.client_options import ClientOptions 17 | from google.cloud import automl_v1 18 | from google.cloud.automl_v1.proto import service_pb2 19 | from google.cloud import storage 20 | from google.cloud import vision 21 | import utils 22 | 23 | 24 | def _gcs_payload(bucket, filename): 25 | uri = f"gs://{bucket}/{filename}" 26 | return {'document': {'input_config': {'gcs_source': {'input_uris': [uri]}}}} 27 | 28 | def _img_payload(bucket, filename): 29 | print(f"Converting file gs://{bucket}/{filename} to text") 30 | text = utils.extract_text(bucket, filename) 31 | if not text: 32 | return None 33 | return {'text_snippet': {'content': text, 'mime_type': 'text/plain'}} 34 | 35 | 36 | def classify_doc(bucket, filename): 37 | options = ClientOptions(api_endpoint='automl.googleapis.com') 38 | prediction_client = automl_v1.PredictionServiceClient( 39 | client_options=options) 40 | 41 | _, ext = os.path.splitext(filename) 42 | if ext in [".pdf", ".txt", ".html"]: 43 | payload = _gcs_payload(bucket, filename) 44 | elif ext in ['.tif', '.tiff', '.png', '.jpeg', '.jpg']: 45 | payload = _img_payload(bucket, filename) 46 | else: 47 | print( 48 | f"Could not sort document gs://{bucket}/{filename}, unsupported file type {ext}") 49 | return None 50 | if not payload: 51 | print( 52 | f"Missing document gs://{bucket}/{filename} payload, cannot sort") 53 | return None 54 | response = prediction_client.predict( 55 | os.environ["SORT_MODEL_NAME"], payload, {}) 56 | label = max(response.payload, key=lambda x: x.classification.score) 57 | threshold = float(os.environ.get('SORT_MODEL_THRESHOLD') or 0.7) 58 | display_name = label.display_name if label.classification.score > threshold else None 59 | print(f"Labeled document gs://{bucket}/{filename} as {display_name}") 60 | return display_name 61 | 62 | 63 | def sort_documents(data, context): 64 | print("Hello from sort_documents") 65 | bucket = data["bucket"] 66 | name = data["name"] 67 | print("Classifying doc") 68 | doc_type = classify_doc(bucket, name) 69 | print(f"Labeled document gs://{bucket}/{name} as {doc_type}") 70 | storage_client = storage.Client() 71 | source_bucket = storage_client.bucket(bucket) 72 | source_blob = source_bucket.blob(name) 73 | if doc_type in ["invoice", "receipt", "budget"]: 74 | dest_bucket_name = os.environ["INVOICES_BUCKET"] 75 | elif doc_type == "article": 76 | dest_bucket_name = os.environ["ARTICLES_BUCKET"] 77 | elif doc_type == "form": 78 | dest_bucket_name = os.environ["FORMS_BUCKET"] 79 | else: 80 | dest_bucket_name = os.environ["UNSORTED_BUCKET"] 81 | dest_bucket = storage_client.bucket(dest_bucket_name) 82 | 83 | blob_copy = source_bucket.copy_blob(source_blob, dest_bucket, name) 84 | 
source_blob.delete() 85 | print( 86 | f"Moved file gs://{bucket}/{name} to gs://{dest_bucket_name}/{blob_copy.name}") 87 | -------------------------------------------------------------------------------- /tag_article.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | from google.cloud import language 16 | from google.cloud import storage 17 | from google.cloud.language import enums 18 | from google.cloud.language import types 19 | from google.cloud import bigquery 20 | import os 21 | import utils 22 | 23 | def get_tags(text, confidence_thresh=0.69): 24 | # Instantiates a client 25 | client = language.LanguageServiceClient() 26 | 27 | document = types.Document( 28 | content=text, 29 | type=enums.Document.Type.PLAIN_TEXT) 30 | try: 31 | res = client.classify_text(document) 32 | except Exception as err: 33 | print(err) 34 | return [] 35 | return [tag.name for tag in res.categories if tag.confidence >= confidence_thresh] 36 | 37 | def _insert_tags_bigquery(filename, tags): 38 | client = bigquery.Client() 39 | table_id = os.environ["ARTICLE_TAGS_TABLE"] 40 | table = client.get_table(table_id) 41 | rows = [{"filename": filename, "tag": tag} for tag in tags] 42 | errors = client.insert_rows(table, rows) 43 | if errors: 44 | print("Got errors " + str(errors)) 45 | 46 | def tag_article(data, context): 47 | bucket = data["bucket"] 48 | name = data["name"] 49 | ext = os.path.splitext(name)[1] if len(os.path.splitext(name)[1]) > 1 else None 50 | text = None 51 | if ext in ['.tif', '.tiff', '.png', '.jpeg', '.jpg']: 52 | print("Extracting text from image file") 53 | text = utils.extract_text(bucket, name) 54 | if not text: 55 | print("Couldn't extract text from gs://%s/%s" % (bucket, name)) 56 | elif ext in ['.txt']: 57 | print("Downloading text file from cloud") 58 | storage_client = storage.Client() 59 | source_bucket = storage_client.bucket(bucket) 60 | blob = source_bucket.blob(name) 61 | text = blob.download_as_string().decode("utf-8") 62 | else: 63 | print(f'Unsupported file type {ext}') 64 | if text: 65 | tags = get_tags(text) 66 | print("Found %d tags for article %s" % (len(tags), name)) 67 | _insert_tags_bigquery(name, tags) 68 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import sort_documents 3 | import utils 4 | import tag_article 5 | import analyze_form 6 | from unittest.mock import patch 7 | from io import StringIO 8 | 9 | class SortDocuments(unittest.TestCase): 10 | def test_classify_doc(self): 11 | bucket = "cloud-samples-data" 12 | filename = "documentai/invoice.pdf" 13 | label = sort_documents.classify_doc(bucket, filename) 14 | self.assertEqual(label, "invoice", "Should be invoice") 15 | 16 | class Utils(unittest.TestCase): 17 | def test_extract_text(self): 18 | bucket = "cloud-samples-data" 19 | filename = "vision/text/screen.jpg" 20 | text = 
utils.extract_text(bucket, filename) 21 | self.assertIsNotNone(text) 22 | self.assertIsInstance(text, str) 23 | 24 | class TagArticle(unittest.TestCase): 25 | 26 | TEST_TEXT = """Google, headquartered in Mountain View 27 | (1600 Amphitheatre Pkwy, Mountain View, CA 940430), unveiled 28 | the new Android phone for $799 at the Consumer Electronics Show. 29 | Sundar Pichai said in his keynote that users love their new Android phones.""" 30 | 31 | def test_get_tags(self): 32 | tags = tag_article.get_tags(self.TEST_TEXT, 0.5) 33 | self.assertIsNotNone(tags) 34 | self.assertIsInstance(tags, list) 35 | self.assertIsInstance(tags[0], str) 36 | self.assertGreater(len(tags), 0) 37 | self.assertGreater(len(tags[0]), 0) 38 | 39 | # TODO: Don't include this stateful function 40 | # def test_add_bq(self): 41 | # with patch('sys.stdout', new = StringIO()) as fake_out: 42 | # tag_article._insert_tags_bigquery("myfakefile", ["some", "fake", "keys"]) 43 | # self.assertTrue('error' not in fake_out.getvalue().lower()) 44 | 45 | class AnalyzeForm(unittest.TestCase): 46 | def test_get_form_fields(self): 47 | bucket = "cloud-samples-data" 48 | filename = "documentai/form.pdf" 49 | fields = analyze_form.get_form_fields(bucket, filename) 50 | self.assertIsInstance(fields, list) 51 | self.assertGreater(len(fields), 0) 52 | for field in fields: 53 | self.assertSetEqual(set(field.keys()), set(["filename", "page", "form_field_name", "form_field_value"])) 54 | self.assertIsInstance(field["filename"], str) 55 | self.assertGreater(len(field["filename"]), 0) 56 | self.assertIsInstance(field["page"], int) 57 | self.assertIsInstance(field["form_field_name"], str) 58 | self.assertGreater(len(field["form_field_name"]), 0) 59 | self.assertIsInstance(field["form_field_value"], str) 60 | self.assertGreater(len(field["form_field_value"]), 0) 61 | 62 | # TODO: Don't include this stateful function 63 | # def test_analyze_form(self): 64 | # analyze_form.analyze_form({"bucket": "cloud-samples-data", "name": "documentai/form.pdf"}, None) 65 | 66 | if __name__ == '__main__': 67 | unittest.main() -------------------------------------------------------------------------------- /tif_to_pdf.py: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | import img2pdf 16 | from google.cloud import storage 17 | import os 18 | 19 | def convert_to_pdf(data, context): 20 | assert(os.environ["PDF_DIR"]) 21 | # Download the file from cloud storage 22 | storage_client = storage.Client() 23 | bucket = storage_client.bucket(data["bucket"]) 24 | blob = bucket.blob(data["name"]) 25 | print(f"Got file from bucket {data['bucket']} with name {data['name']}") 26 | img_data = blob.download_as_string() 27 | pdf = img2pdf.convert(img_data) 28 | print("Converted to pdf") 29 | pdf_bucket = storage_client.bucket(os.environ["PDF_DIR"]) 30 | pdf_name = ".".join(data["name"].split(".")[:-1]) + ".pdf" 31 | print(f"Uploading file with name {pdf_name} to bucket {os.environ['PDF_DIR']}") 32 | pdf_blob = pdf_bucket.blob(pdf_name) 33 | pdf_blob.upload_from_string(pdf, content_type="application/pdf") -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | from google.cloud import vision 2 | 3 | def extract_text(bucket_name, filename): 4 | uri = f"gs://{bucket_name}/{filename}" 5 | client = vision.ImageAnnotatorClient() 6 | res = client.document_text_detection({'source': {'image_uri': uri}}) 7 | text = res.full_text_annotation.text 8 | if not text: 9 | print("OCR error " + str(res)) 10 | return text --------------------------------------------------------------------------------
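To sanity-check a setup before wiring up the cloud functions, a minimal local smoke test might look like this (assuming you've filled in `env_template.sh`, authenticated with `gcloud auth application-default login`, and created the buckets and tables it references):

```bash
# Load the pipeline's environment variables, install dependencies,
# and run the unit tests against the live APIs.
source env_template.sh
pip install -r requirements.txt
python test.py
```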