├── .github
│   └── workflows
│       └── python-publish.yml
├── LICENSE
├── README.md
├── bill7.png
├── doc_transformers
│   ├── __init__.py
│   └── parser.py
├── output.png
├── result.csv
├── setup.cfg
└── setup.py
/.github/workflows/python-publish.yml:
--------------------------------------------------------------------------------
1 | # This workflow will upload a Python Package using Twine when a release is created
2 | # For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
3 |
4 | name: Upload Python Package
5 |
6 | on:
7 |   release:
8 |     types: [created]
9 |
10 | jobs:
11 |   deploy:
12 |
13 |     runs-on: ubuntu-latest
14 |
15 |     steps:
16 |     - uses: actions/checkout@v2
17 |     - name: Set up Python
18 |       uses: actions/setup-python@v2
19 |       with:
20 |         python-version: '3.x'
21 |     - name: Install dependencies
22 |       run: |
23 |         python -m pip install --upgrade pip
24 |         pip install setuptools wheel twine
25 |     - name: Build and publish
26 |       env:
27 |         TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
28 |         TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
29 |       run: |
30 |         python setup.py sdist
31 |         twine upload dist/*
32 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Vishnu Nandakumar
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Doc Transformers
2 | Document processing using transformers. The project is still in development and currently supports only the extraction of form data, i.e. key-value pairs.
3 |
4 | ```bash
5 | pip install -q doc-transformers
6 | ```
7 |
8 | ## Pre-requisites
9 |
10 | Please install the following separately:
11 | ```
12 | pip install pip --upgrade
13 | pip install -q git+https://github.com/huggingface/transformers.git
14 |
15 | pip install pyyaml==5.1
16 |
17 | # workaround: install old version of pytorch since detectron2 hasn't released packages for pytorch 1.9 (issue: https://github.com/facebookresearch/detectron2/issues/3158)
18 | pip install torch==1.8.0+cu101 torchvision==0.9.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
19 |
20 | # install detectron2 that matches pytorch 1.8
21 | # See https://detectron2.readthedocs.io/tutorials/install.html for instructions
22 | pip install -q detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
23 | ```
24 |
25 | ## Implementation
26 |
27 | ```python
28 | # import the parser module
29 | from doc_transformers import parser
30 |
31 | # load the image (input_path_image is the path to your document image) and the FUNSD tag names
32 | image = parser.load_image(input_path_image)
33 | labels = parser.load_tags()
34 |
35 | # load the feature extractor, processor and fine-tuned model
36 | feature_extractor, processor, model = parser.load_models()
37 |
38 | # returns the labelled words (key_pairs) and their bounding boxes (bboxes)
39 | key_pairs, bboxes = parser.process_image(image, feature_extractor, processor, model, labels)
40 | ```
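
`process_image` returns one entry per OCR word, so a multi-word key such as `CARD TYPE:` arrives as two separate `KEY` entries. The following is a minimal post-processing sketch, not part of the package: `merge_adjacent` is a hypothetical helper name, and it assumes each entry is a dict with `label` and `value` keys, as built in `parser.py`. Note that it also joins two distinct neighbouring values if nothing separates them.

```python
from itertools import groupby


def merge_adjacent(key_pairs):
    """Join consecutive entries that share a label into a single entry."""
    merged = []
    for label, group in groupby(key_pairs, key=lambda kp: kp["label"]):
        text = " ".join(kp["value"] for kp in group)
        merged.append({"label": label, "value": text})
    return merged


# Hypothetical sample shaped like the first return value of process_image
sample = [
    {"label": "KEY", "value": "CARD"},
    {"label": "KEY", "value": "TYPE:"},
    {"label": "VALUE", "value": "MC"},
]
print(merge_adjacent(sample))
```

Grouping only adjacent entries keeps the reading order produced by the OCR step intact.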
41 |
42 | ## Results
43 |
44 | **Input & Output**
45 |
46 |
47 |
48 |
49 |
50 |
51 | **Table**
52 |
53 | - After saving to CSV, the result looks like the following:
54 |
55 | | LABEL | TEXT |
56 | | ----- | ---------------------------------- |
57 | | title | CREDIT CARD VOUCHER ANY RESTAURANT |
58 | | title | ANYWHERE |
59 | | key | DATE: |
60 | | value | 02/02/2014 |
61 | | key | TIME: |
62 | | value | 11:11 |
63 | | key | CARD |
64 | | key | TYPE: |
65 | | value | MC |
66 | | key | ACCT: |
67 | | value | XXXX XXXX XXXX |
68 | | value | 1111 |
69 | | key | TRANS |
70 | | key | KEY: |
71 | | value | HYU8789798234 |
72 | | key | AUTH |
73 | | key | CODE: |
74 | | value | 12345 |
75 | | key | EXP |
76 | | key | DATE: |
77 | | value | XX/XX |
78 | | key | CHECK: |
79 | | value | 1111 |
80 | | key | TABLE: |
81 | | value | 11/11 |
82 | | key | SERVER: |
83 | | value | 34 |
84 | | value | MONIKA |
85 | | key | Subtotal: |
86 | | value | $1969 |
87 | | value | .69 |
88 | | key | Gratuity: Total: |
89 |
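
The package does not ship a CSV export helper, so the table above has to be written by the caller. One possible way to do it with the standard library is sketched below; `key_pairs_to_csv` is a hypothetical name, the `LABEL`/`TEXT` header and lower-case labels are taken from `result.csv`, and the sample rows are copied from the table rather than produced by running the model.

```python
import csv
import io


def key_pairs_to_csv(key_pairs, stream):
    """Write label/value entries as LABEL,TEXT rows to a text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["LABEL", "TEXT"])
    for kp in key_pairs:
        # result.csv uses lower-case labels, so normalise here
        writer.writerow([kp["label"].lower(), kp["value"]])


# Hypothetical sample shaped like the first return value of process_image
sample = [
    {"label": "TITLE", "value": "ANYWHERE"},
    {"label": "KEY", "value": "DATE:"},
    {"label": "VALUE", "value": "02/02/2014"},
]

buffer = io.StringIO()
key_pairs_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass a handle opened with `open("result.csv", "w", newline="")`.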
90 | ## Code credits
91 |
92 | [@HuggingFace](https://huggingface.co/)
93 |
94 | - Please note that this is still in the development phase and will be improved in the near future.
95 |
--------------------------------------------------------------------------------
/bill7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vishnunkumar/doc_transformers/ea52976d5232caff4ce13758309b0bfac57664c0/bill7.png
--------------------------------------------------------------------------------
/doc_transformers/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/doc_transformers/parser.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import pytesseract
4 | import torch
5 | from itertools import groupby
6 | from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification, LayoutLMv2FeatureExtractor
7 | from datasets import load_dataset
8 | from PIL import Image, ImageDraw, ImageFont
9 |
10 | def load_tags():
11 |
12 |     datasets = load_dataset("nielsr/funsd")
13 |     labels = datasets['train'].features['ner_tags'].feature.names
14 |
15 |     return labels
16 |
17 | def load_models():
18 |
19 |     feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=True)  # the constructor takes options, not a checkpoint name
20 |     processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
21 |     model = LayoutLMv3ForTokenClassification.from_pretrained("nielsr/layoutlmv3-finetuned-funsd")
22 |
23 |     return feature_extractor, processor, model
24 |
25 | def load_image(path):
26 |
27 |     image = Image.open(path).convert("RGB")
28 |     return image
29 |
30 | def unnormalize_box(bbox, width, height):
31 |
32 |     return [
33 |         width * (bbox[0] / 1000),
34 |         height * (bbox[1] / 1000),
35 |         width * (bbox[2] / 1000),
36 |         height * (bbox[3] / 1000),
37 |     ]
38 |
39 | def iob_to_label(label):
40 |
41 |     label = label[2:]
42 |     if not label:
43 |         return 'other'
44 |     return label
45 |
46 |
47 | def process_image(image, feature_extractor, processor, model, labels):
48 |
49 |     id2label = {idx: label for idx, label in enumerate(labels)}
50 |     label2id = {label: idx for idx, label in enumerate(labels)}
51 |     width, height = image.size
52 |
53 |     encods = feature_extractor(image, return_tensors="pt")
54 |     encoding = processor(image, truncation=True, return_offsets_mapping=True, return_tensors="pt")
55 |     offset_mapping = encoding.pop("offset_mapping")
56 |     with torch.no_grad():  # inference only, no gradients needed
57 |         outputs = model(**encoding)
58 |     predictions = outputs.logits.argmax(-1).squeeze().tolist()
59 |     token_boxes = encoding.bbox.squeeze().tolist()
60 |
61 |     is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0
62 |     true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]
63 |     true_boxes = [unnormalize_box(box, width, height) for idx, box in enumerate(token_boxes) if not is_subword[idx]]
64 |
65 |     words = encods.words
66 |     n = len(true_predictions) - len(words[0])
67 |     k = int(n/2)
68 |     true_predictions = true_predictions[k:len(true_predictions) - k]  # explicit end index also works when k == 0
69 |     true_boxes = true_boxes[k:len(true_boxes) - k]
70 |
71 |     l_words = []
72 |     preds = []
73 |     bboxes = []
74 |     key_pairs = []
75 |
76 |     for i in range(0, len(words[0])):
77 |         json_dict = {}
78 |         if true_predictions[i] not in ["O"]:
79 |             if true_predictions[i] in ["B-HEADER", "I-HEADER"]:
80 |                 json_dict["label"] = "TITLE"
81 |             elif true_predictions[i] in ["B-QUESTION", "I-QUESTION"]:
82 |                 json_dict["label"] = "KEY"
83 |             else:
84 |                 json_dict["label"] = "VALUE"
85 |             json_dict["value"] = words[0][i]
86 |             key_pairs.append(json_dict)
87 |             bboxes.append(true_boxes[i])
88 |
89 |     return key_pairs, bboxes
90 |
91 | def visualize_image(image, key_pairs, bboxes):
92 |
93 |     draw = ImageDraw.Draw(image)
94 |     font = ImageFont.load_default()
95 |     label2color = {'KEY':'blue', 'VALUE':'green', 'TITLE':'orange'}
96 |
97 |     for kp, box in zip(key_pairs, bboxes):
98 |         draw.rectangle(box, outline=label2color[kp['label']])
99 |         draw.text((box[0] + 10, box[1] - 10), text=kp['label'], fill=label2color[kp['label']], font=font)
100 |
101 |     return image
102 |
--------------------------------------------------------------------------------
/output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Vishnunkumar/doc_transformers/ea52976d5232caff4ce13758309b0bfac57664c0/output.png
--------------------------------------------------------------------------------
/result.csv:
--------------------------------------------------------------------------------
1 | LABEL,TEXT
2 | title,CREDIT CARD VOUCHER ANY RESTAURANT
3 | title,ANYWHERE
4 | key,DATE:
5 | value,02/02/2014
6 | key,TIME:
7 | value,11:11
8 | key,CARD
9 | key,TYPE:
10 | value,MC
11 | key,ACCT:
12 | value,XXXX XXXX XXXX
13 | value,1111
14 | key,TRANS
15 | key,KEY:
16 | value,HYU8789798234
17 | key,AUTH
18 | key,CODE:
19 | value,12345
20 | key,EXP
21 | key,DATE:
22 | value,XX/XX
23 | key,CHECK:
24 | value,1111
25 | key,TABLE:
26 | value,11/11
27 | key,SERVER:
28 | value,34
29 | value,MONIKA
30 | key,Subtotal:
31 | value,$1969
32 | value,.69
33 | key,Gratuity: Total:
34 |
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | # Inside of setup.cfg
2 | [metadata]
3 | description-file = README.md
4 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 |
3 | with open("README.md", "r") as fh:
4 |     long_description = fh.read()
5 |
6 | requirements = [
7 |     'transformers',
8 |     'datasets',
9 |     'pytesseract',
10 |     'pandas',
11 |     'numpy'
12 | ]
13 |
14 |
15 | setuptools.setup(
16 |     name="doc_transformers",
17 |     version="1.0.2",
18 |     author="Vishnu Nandakumar",
19 |     author_email="nkumarvishnu25@gmail.com",
20 |     description="Deep learning for document processing",
21 |     long_description=long_description,
22 |     long_description_content_type="text/markdown",
23 |     url='https://github.com/Vishnunkumar/doc_transformers/',
24 |     packages=[
25 |         'doc_transformers',
26 |     ],
27 |     package_dir={'doc_transformers': 'doc_transformers'},
28 |     package_data={
29 |         'doc_transformers': ['doc_transformers/*.py']
30 |     },
31 |     install_requires=requirements,
32 |     license="MIT license",
33 |     zip_safe=False,
34 |     keywords='doc_transformers',
35 |     classifiers=[
36 |         'Development Status :: 3 - Alpha',  # Choose either "3 - Alpha", "4 - Beta" or "5 - Production/Stable" as the current state of your package
37 |         'Intended Audience :: Developers',  # Define that your audience is developers
38 |         'Topic :: Software Development :: Build Tools',
39 |         'License :: OSI Approved :: MIT License',  # Again, pick a license
40 |         'Programming Language :: Python :: 3',  # Specify which Python versions you want to support
41 |         'Programming Language :: Python :: 3.4',
42 |         'Programming Language :: Python :: 3.5',
43 |         'Programming Language :: Python :: 3.6',
44 |     ],
45 | )
46 |
--------------------------------------------------------------------------------