├── requirements.txt
├── data
    ├── pnlsheet.pdf
    ├── sample_file.pdf
    ├── AmazonWebServices.pdf
    ├── invoice2data_simple.csv
    └── sample_csv.csv
├── images
    ├── result.jpg
    ├── ITR_final.jpg
    ├── Uk_licence.jpg
    ├── itr_output.png
    ├── sample_csv.PNG
    ├── sample_pdf.PNG
    ├── jaccard_demo.png
    ├── FFT_image_blur.png
    ├── edit_distance.png
    ├── jaccard_formula.png
    ├── metric_learning.png
    ├── stock_researh.png
    ├── pdf-sample-page-001.jpg
    ├── invoice2data_csv_result.PNG
    ├── tesseract_sample_result.PNG
    └── AmazonWebService_PDF_Image.jpg
├── code
    ├── pdftotext_sample.py
    ├── invoice2data.py
    └── sample_code_teseract.py
├── template
    └── temp.yml
└── README.md


/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas==0.24.2
2 | pdftotext==2.1.2
3 | 


--------------------------------------------------------------------------------
/data/pnlsheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/data/pnlsheet.pdf


--------------------------------------------------------------------------------
/images/result.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/result.jpg


--------------------------------------------------------------------------------
/data/sample_file.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/data/sample_file.pdf


--------------------------------------------------------------------------------
/images/ITR_final.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/ITR_final.jpg


--------------------------------------------------------------------------------
/images/Uk_licence.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/Uk_licence.jpg


--------------------------------------------------------------------------------
/images/itr_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/itr_output.png


--------------------------------------------------------------------------------
/images/sample_csv.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/sample_csv.PNG


--------------------------------------------------------------------------------
/images/sample_pdf.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/sample_pdf.PNG


--------------------------------------------------------------------------------
/images/jaccard_demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/jaccard_demo.png


--------------------------------------------------------------------------------
/data/AmazonWebServices.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/data/AmazonWebServices.pdf


--------------------------------------------------------------------------------
/images/FFT_image_blur.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/FFT_image_blur.png


--------------------------------------------------------------------------------
/images/edit_distance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/edit_distance.png


--------------------------------------------------------------------------------
/images/jaccard_formula.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/jaccard_formula.png


--------------------------------------------------------------------------------
/images/metric_learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/metric_learning.png


--------------------------------------------------------------------------------
/images/stock_researh.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/stock_researh.png


--------------------------------------------------------------------------------
/images/pdf-sample-page-001.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/pdf-sample-page-001.jpg


--------------------------------------------------------------------------------
/images/invoice2data_csv_result.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/invoice2data_csv_result.PNG


--------------------------------------------------------------------------------
/images/tesseract_sample_result.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/tesseract_sample_result.PNG


--------------------------------------------------------------------------------
/images/AmazonWebService_PDF_Image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/AmazonWebService_PDF_Image.jpg


--------------------------------------------------------------------------------
/code/pdftotext_sample.py:
--------------------------------------------------------------------------------
 1 | import pdftotext
 2 | import pandas as pd
 3 | 
 4 | # Load your PDF
 5 | with open("./data/sample_file.pdf", "rb") as f:
 6 |     pdf = pdftotext.PDF(f)
 7 | 
 8 | sentences = []
 9 | for page in pdf:
10 |     lines = page.splitlines()
11 |     for line in lines:
12 |         sentences.append(line)
13 | 
14 | df = pd.DataFrame(data=sentences, columns=['data'])
15 | # print(df.head())
16 | df.to_csv("./data/sample_csv.csv", index=False)
17 | 


--------------------------------------------------------------------------------
/data/invoice2data_simple.csv:
--------------------------------------------------------------------------------
1 | ,issuer,amount,amount_untaxed,date,invoice_number,partner_name,partner_website,currency,lines,desc
2 | 0,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'AWS Data Transfer', 'price_unit': '0.01'}",Invoice from Amazon Web Services
3 | 1,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'Amazon Elastic Compute Cloud', 'price_unit': '1.87'}",Invoice from Amazon Web Services
4 | 2,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'Amazon Glacier', 'price_unit': '2.22'}",Invoice from Amazon Web Services
5 | 3,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'Amazon Simple Storage Service', 'price_unit': '0.01'}",Invoice from Amazon Web Services
6 | 


--------------------------------------------------------------------------------
/code/invoice2data.py:
--------------------------------------------------------------------------------
 1 | # Importing all the required libraries
 2 | 
 3 | from invoice2data import extract_data
 4 | from invoice2data.extract.loader import read_templates
 5 | from invoice2data.input import pdftotext
 6 | import pandas as pd
 7 | 
 8 | # Importing custom template
 9 | templates = read_templates('./template/')
10 | 
11 | #print(templates)
12 | 
13 | # Extract data from PDF
14 | result = extract_data('./data/pnlsheet.pdf', templates = templates, input_module = pdftotext)
15 | 
16 | # Store the extracted data to a Data-frame
17 | df = pd.DataFrame(data = result)
18 | 
19 | # Export Data-frame to a csv file
20 | df.to_csv('./data/invoice2data_simple.csv')
21 | 
22 | 
23 | ''' 
24 | The input_module parameter is optional. It can be any of the supported modules, such as pdftotext,
25 | pdfminer or tesseract. If it is not specified, pdftotext is used by default.
26 | 
27 | The custom template temp.yml is placed in the template folder. If you remove the templates parameter from
28 | extract_data(), the default templates are used instead.
29 | 
30 | '''


--------------------------------------------------------------------------------
/template/temp.yml:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | issuer: Arigo Private Limited
 3 | fields:
 4 |   static_invoice_number:
 5 |   static_amount:
 6 |   static_date: 
 7 |   current_assets: (?i)\s*current\s*assets\s*\$(\d+,*\d+)
 8 |   cash: (?i)(?<!\w)cash\s*\$(\d+,*\d+)
 9 |   Petty cash: (?i)\s*petty\s*cash\s*\$(\d+,*\d+)
10 |   Inventory: (?i)\s*inventory\s*\$(\d+,*\d+)
11 |   Fixed assets: (?i)\s*fixed\s*assets\s*\$(\d+,*\d+)
12 |   Leasehold: (?i)\s*leasehold\s*\$(\d+,*\d+)
13 |   long-term liabilities: (?i)\s*long-term\s*liabilities\s*\$(\d+,*\d+)
14 |   Total assets: (?i)\s*total\s*assets\s*\$(\d+,*\d+)
15 |   Total liabilities: (?i)\s*total\s*liabilities\s*\$(\d+,*\d+)
16 |   Net assets: (?i)\s*net\s*assets\s*(?:\W*\w*\s*\w*\W*)\s*\$(\d+,*\d+)
17 |   Working Capital: (?i)\s*working\s*capital\s*\$(\d+,*\d+)
18 | lines:
19 |   start: Equity Share Capital\s+6,339.00\s+6,335.00
20 |   end: (?i)\s+reserves\s*and\s*surplus
21 |   line: (?P<type>\w+)\s+(?P<march_19>\d+,*\d+\.*)\s+(?P<march_18>\d+,*\d+\.*)
22 | keywords:
23 |   - 'Balance'
24 | 


--------------------------------------------------------------------------------
/code/sample_code_teseract.py:
--------------------------------------------------------------------------------
 1 | # A simple code snippet that extracts the data from the driving licence shown in the use case
 2 | 
 3 | import pytesseract
 4 | from pytesseract import Output
 5 | import cv2
 6 | import pandas as pd
 7 | 
 8 | poppler_path = ''  # path to poppler (not used in this script)
 9 | pytesseract.pytesseract.tesseract_cmd = ''  # path to the tesseract executable
10 | 
11 | if __name__ == '__main__':
12 | 
13 |     image = cv2.imread('./images/Uk_licence.jpg')  # path relative to the repository root
14 |     image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
15 | 
16 |     rows = []
17 |     custom_config = r'-c preserve_interword_spaces=1 --oem 3 --psm 3 -l eng'
18 |     d = pytesseract.image_to_data(image, config=custom_config, output_type=Output.DICT)
19 |     df = pd.DataFrame(data=d)
20 |     # clean up blanks
21 |     df1 = df[(df.conf.astype(float) != -1) & (df.text.str.strip() != '')]
22 |     # sort blocks vertically
23 |     sorted_blocks = df1.groupby('block_num').first(
24 |     ).sort_values('top').index.tolist()
25 |     for block in sorted_blocks:
26 |         curr = df1[df1['block_num'] == block]
27 |         sel = curr[curr.text.str.len() > 3]
28 |         char_w = (sel.width/sel.text.str.len()).mean()
29 |         prev_par, prev_line, prev_left = 0, 0, 0
30 |         text = ''
31 |         for ix, ln in curr.iterrows():
32 |             # add new line when necessary
33 |             if prev_par != ln['par_num']:
34 |                 text += '\n'
35 |                 prev_par = ln['par_num']
36 |                 prev_line = ln['line_num']
37 |                 prev_left = 0
38 |             elif prev_line != ln['line_num']:
39 |                 text += '\n'
40 |                 prev_line = ln['line_num']
41 |                 prev_left = 0
42 | 
43 |             added = 0  # num of spaces that should be added
44 |             if ln['left']/char_w > prev_left + 1:
45 |                 added = int((ln['left'])/char_w) - prev_left
46 |                 text += ' ' * added
47 |             text += ln['text'] + ' '
48 |             prev_left += len(ln['text']) + added + 1
49 |         text += '\n'
50 |         for row in text.split("\n"):
51 |             rows.append(row)
52 | 
53 |     # Print the data that we have extracted
54 |     for i in rows:
55 |         print(i)
56 | 


--------------------------------------------------------------------------------
/data/sample_csv.csv:
--------------------------------------------------------------------------------
 1 | data
 2 | SAMPLE PROFIT & LOSS STATEMENT
 3 |    Any borrower(s) who is/are self-employed or an independent contractor should
 4 |    complete this form if they do not already have their own profit and loss form.             		
 5 |    Company Name:_ _________________________________________________ Percent of Ownership_%
 6 |    Company Address:_
 7 |    Type of Business:________________________________________________________________________________
 8 |    Borrower Name(s):_ _____________________________________________________________________________
 9 |    Loan Number: _
10 |    Dates Reported (MM/DD/YY - MM/DD/YY)_ _______________________________________________________
11 |    (Must be minimum of 3 full months)
12 |    Please fill in the fields that apply to your business
13 |      GROSS INCOME
14 |        Gross Sales                                                                $
15 |        (Total amount of income from sales or service before subtracting expenses)
16 |        Other Income                                                               $
17 |        (Any other additional funds earned through the company such as payments
18 |        from people leasing space or payments from investors)
19 |      Total GROSS INCOME BEFORE TAXES                                              $
20 |      EXPENSES
21 |        Cost of Goods Sold                                                         $
22 |        (Direct costs to produce or obtain the goods sold by the company)
23 |        Accounting and Legal Fees                                                  $
24 |        Advertising                                                                $
25 |        Insurance                                                                  $
26 |        (Do not include homeowner insurance)
27 |        Maintenance and Repairs                                                    $
28 |        Supplies                                                                   $
29 |        Payroll Expenses                                                           $
30 |        (Salaries and wages for borrower(s) on the mortgage loan)
31 |        Payroll Expenses                                                           $
32 |        (Salaries and wages for employees who are not borrower(s)
33 |        on the mortgage loan)
34 |        Postage                                                                    $
35 | "                                                                                                   (Over, please)"
36 | SAMPLE PROFIT & LOSS STATEMENT
37 |    Please fill in the fields that apply to your business
38 |         Rent                                                                                  $
39 |         Licenses                                                                              $
40 |         Taxes                                                                                 $
41 |         (Do not include Real Estate taxes on the property; do not include Income
42 |         Taxes on the business - include the total of any other taxes that you have to
43 |         pay for the business)
44 |         Telephone                                                                             $
45 |         Travel/Transportation                                                                 $
46 |         Utilities                                                                             $
47 |         Other                                                                                 $
48 |         (Total and explanation of any other expenses not already listed)
49 |         Total EXPENSES                                                                        $
50 |       NET INCOME
51 |         Net Income Before Taxes                                                               $
52 |         Taxes                                                                                 $
53 |         (Paid on Business Income)
54 |       Total NET INCOME AFTER TAXES                                                            $
55 | "   By signing this document, I/we certify that all the information is truthful. I/we understand that knowingly submitting false"
56 |    information may constitute fraud.
57 |    Borrower Name(s)_
58 |    Signature____________________________________________________ Date_
59 |    Signature_Date_
60 |                                                                                                                     39932 PL 1217
61 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Financial document parsing examples with various libraries
  2 | 
  3 | A great deal of data is stored in PDFs, but it cannot be used directly for other purposes such as training a machine learning model. This is a real issue when working with financial data, as most companies use PDFs to store their P&L statements, tax returns, general meeting notes and much more. Even company brochures are PDFs. To study companies in more depth, we need a way to extract this data efficiently and precisely. Here we show our research and work on getting data out of PDFs.
  4 | 
  5 | Several open-source libraries already exist for this kind of extraction. Let's look at some of them in depth.
  6 | 
  7 | ### 1. PDFToText:
  8 | 
  9 | If the PDF is not made up of images, this library is one of the best options. It is easy to use and converts the PDF content into raw text very precisely. Let's look at a sample.
 10 | ![samples_pdf](images/sample_pdf.PNG)
 11 | 
 12 | ![sample_csv](images/sample_csv.PNG)
 13 | 
 14 | 
 15 | Of the two pictures above, the first is an image of the PDF file and the second is the `csv` file generated by the `pdftotext` library. The example shows that the library can extract tables and text together in a meaningful way. After that, regular expressions can be applied to the csv file to pull out the required values. We have added some files to the `data` folder as examples.
 16 | 
 17 | You can use our sample code from the `code` folder. To use it:
 18 | 
 19 | 1. Change the `pdf` file path at line 5 from `./data/sample_file.pdf` to your desired file path.
 20 | 2. Change the output file path at line 16 from `./data/sample_csv.csv` to your desired file path.
 21 | 3. Run the script from the repository root (the default paths are relative to it) using the `python code/pdftotext_sample.py` command.
 22 | 
 23 |     Requirements to use `pdftotext` are:
 24 |      1. OS - Linux
 25 |      2. Python >= 3.0
 26 |     Here is the package link on pypi https://pypi.org/project/pdftotext/.
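
As a rough illustration of applying regular expressions to the extracted lines, the sketch below splits rows of the form "label, spaces, `$`, amount" into pairs. It is only a sketch under assumptions: `ROW_PATTERN` and `parse_amount_rows` are hypothetical names, and the amounts in `sample_lines` are made up, since the sample form in `data/sample_csv.csv` is blank.

```python
import re

# Matches a row such as "   Gross Sales        $ 12,500.00":
# a label, a run of two or more spaces, a dollar sign, then an optional amount.
ROW_PATTERN = re.compile(
    r"^\s*(?P<label>\S.*?)\s{2,}\$\s*(?P<amount>[\d,]+(?:\.\d+)?)?\s*$"
)


def parse_amount_rows(lines):
    """Return (label, amount) pairs; amount is None when the field is blank."""
    rows = []
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match is None:
            continue  # headings, notes and other free text are skipped
        raw = match.group("amount")
        amount = float(raw.replace(",", "")) if raw else None
        rows.append((match.group("label"), amount))
    return rows


# Hypothetical filled-in lines in the same layout as data/sample_csv.csv.
sample_lines = [
    "     GROSS INCOME",
    "       Gross Sales                                  $ 12,500.00",
    "       (Total amount of income from sales or service before subtracting expenses)",
    "       Other Income                                 $",
]
print(parse_amount_rows(sample_lines))
```

Rows without a `$` column are ignored, so headings and explanatory notes do not end up in the result.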
 27 | 
 28 | ### 2. Tesseract
 29 | 
 30 | `pdftotext` can extract data from a text PDF, whereas it fails on an image PDF. In real-world scenarios one can receive any kind of PDF, so `Optical character recognition` (OCR) libraries are needed for the image case. Tesseract is one of the best examples. Let us understand it with this sample:
 31 | 
 32 | ![tesseract Example](images/tesseract_sample_result.PNG)
 33 | 
 34 | The picture above shows how information is extracted. On the left is a receipt and on the right is the output obtained after applying Tesseract to it.
 35 | 
 36 |     Requirements to use `tesseract` are:
 37 |      1. OS - Linux, Mac OSX and Windows
 38 |      2. Python 2.7 or 3.5+
 39 |     Here is the package link on pypi https://pypi.org/project/pytesseract/
 40 | 
 41 | ### 3. Invoice2Data
 42 | 
 43 | The Invoice2Data library can be used not only to extract data from a PDF but also to get structured information from that extracted data. Both of the above libraries serve their specific purposes, whereas `Invoice2Data` can extract data using either of them (and other libraries as well). It extracts the text from the PDF and then uses templates to pull the desired information out of it. A sample of how it works is shown below:
 44 | 
 45 | ![AWS PDF](images/AmazonWebService_PDF_Image.jpg)
 46 | 
 47 | ![Invoice2Data CSV](images/invoice2data_csv_result.PNG)
 48 | 
 49 | 
 50 | The first image is the PDF of an AWS receipt, and the second is the extracted information in the form of CSV data. For more information on how to use Invoice2Data and how templates work, review its [GitHub repository](https://github.com/invoice-x/invoice2data).
 51 | 
 52 | 
 53 |     Requirements to use `invoice2data` are:
 54 |      1. OS - Linux
 55 |      2. Python >= 3.0
 56 |     Here is the package link on pypi https://pypi.org/project/invoice2data/0.0.1/
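
To make the template idea more concrete, the sketch below applies a few of the field patterns from `template/temp.yml` to a piece of text using Python's `re` module. It only illustrates the general idea of regex-based fields and is not Invoice2Data's own implementation; `FIELD_PATTERNS`, `extract_fields` and the sample text are assumptions made for this example.

```python
import re

# Field patterns copied from template/temp.yml; the dictionary keys are illustrative.
FIELD_PATTERNS = {
    "total_assets": r"(?i)\s*total\s*assets\s*\$(\d+,*\d+)",
    "total_liabilities": r"(?i)\s*total\s*liabilities\s*\$(\d+,*\d+)",
    "inventory": r"(?i)\s*inventory\s*\$(\d+,*\d+)",
}


def extract_fields(text, patterns):
    """Return the first captured group for each pattern, or None if absent."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


# Made-up text standing in for what an input module would return.
sample_text = """Balance Sheet
Inventory              $4,200
Total assets           $58,750
"""
print(extract_fields(sample_text, FIELD_PATTERNS))
```

Fields whose pattern does not match are reported as `None`, which makes missing values easy to spot before the result is written to a CSV file.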
 57 | 
 58 | ### 4. Pdfminer.six
 59 | 
 60 | Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data. Pdfminer.six extracts the text from a page directly from the source code of the PDF. It can also be used to get the exact location, font or color of the text.
 61 | 
 62 | It is built in a modular way such that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device to use the power of pdfminer.six for purposes other than text analysis.
 63 | 
 64 | ![SAMPLE PDF](images/pdf-sample-page-001.jpg)
 65 | Result:
 66 | ```
 67 | Adobe Acrobat PDF Files
 68 | Adobe® Portable Document Format (PDF) is a universal file format that preserves all
 69 | of the fonts, formatting, colours and graphics  of any  source document,  regardless of
 70 | the application and platform used to create it.
 71 | Adobe PDF is an ideal format for electronic document distribution as it overcomes the
 72 | problems commonly encountered with electronic file sharing.
 73 | •  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
 74 | Reader.  Recipients  of  other  file  formats  sometimes  can't  open  files  because  they
 75 | don't have the applications used to create the documents.
 76 | •  PDF files always print correctly on any printing device.
 77 | •  PDF  files  always  display  exactly  as  created,  regardless  of  fonts,  software,  and
 78 | operating systems. Fonts, and graphics are not lost due to platform, software, and
 79 | version incompatibilities.
 80 | •  The  free  Acrobat  Reader  is  easy  to  download  and  can  be  freely  distributed  by
 81 | anyone.
 82 | •  Compact  PDF  files  are  smaller  than  their  source  files  and  download  a
 83 | page at a time for fast display on the Web.
 84 | ```
 85 | 
 86 | 
 87 |     Requirements to use `pdfminer.six` are:
 88 |      1. Python >= 3.4
 89 | 
 90 |     Install:
 91 |      1. pip install pdfminer.six
 92 | 
 93 |  ### 5. Pytesseract - Extracting information from KYC document
 94 | 
 95 |  Pytesseract is trained on different languages. If a document is in a language it does not cover, we need to train it for that language, and if the image is not clear we need image processing techniques such as thresholding, opening and border detection.
 96 | 
 97 |  ### Driving Licence:
 98 |  Here we have fetched data from UK sample licence using this library.
 99 | 
100 |  ![LICENCE IMAGE](images/Uk_licence.jpg)
101 | 
102 | Result:
103 | 
104 | ```
105 | DRIVING LICENCE
106 | 1, MORGAN
107 | 2, SARAH
108 | MEREDYTH
109 | 3. 11.03.1976 UNITED KINGDOM
110 | 4a. 19.01.2013 4c. DVLA
111 | 4b. 18.01.2023
112 | 5. MORGA753116SM9lJ 35
113 | 8. 122 BURNS CRESCENT
114 | EDINBURGH
115 | EH1 9GP.
116 | 9. AM/A/B1/B/f/kK/I/n/p/q
117 | DVLA INTERNAL USE
118 | ```
119 |  ### Marksheet:
120 | 
121 | Below is one more example for Pytesseract. Here we extract data from the marksheet of an engineering student. The original image is shown below:
122 | 
123 | ![Marksheet](images/result.jpg)
124 | 
125 | From the above image, the extracted data is as below
126 | ```
127 | Academic Year             Month & Year of Examination      Satement No.
128 | 2016-2017                      MAY-2017                      1148827843
129 | 
130 | Course Code &        040    SANKALCHAND PATEL COLLEGE OF ENGINEERING, VISNAGAR
131 | Name
132 | 
133 |        Course                           Branch Name
134 | BACHELOR OF ENGINEERING           MECHANICAL ENGINEERING
135 | 
136 | Student's Name
137 |           PATEL POOJANKUMAR MUKESHBHAI
138 | 
139 | Enrolment No(NEW) 150400119084   Enrolment No(OLD) 150400119084    Seat No. E319478
140 | 
141 | Subject
142 | Code               Subject Name         Course  Theory Component Practical Component Theory Practical Subject
143 |                                                 ESE PA            ESE PA              Grade   Grade   Grade
144 |                                         Credit
145 | 2131903     Manufacturing Process-1      5      DD  BB     BB BB     CD    BB    CC
146 | ```
147 | 
148 | Some of the information is still missing because of the color scheme of the image. With proper image processing, such as converting the image to black and white, such issues can be reduced and the results improved.
149 | 
150 |  ### Income Tax Return (ITR)
151 | 
152 |  Below is a small example on an ITR. Due to the size of the image we could not show the full ITR, but the excerpt below shows how the model is able to extract the data from the image as it is.
153 | 
154 |  ![ITR](images/ITR_final.jpg)
155 | 
156 | ```
157 | Name
158 | PAN                           |        AY:   2020-21          |         DIN  : CPC/2021/A1/105231128            |    Ack. No.  : 439435200030820
159 | 
160 | 
161 |  SI.No.      Particulars             Reporting Heads                                                                                              Amount  in =
162 |                                                                                                             As provided by Taxpayer As Computed u/s 143(1)
163 | 
164 |  01           SALARY
165 |                                     (i) Gross  salary (iatibtic)                                         7,83,867                7,83,867
166 |                                             (a) Salary as per section  17(1)                             7,83,867                7,83,867
167 |                                             (b) Value  of perquisites as per section 17(2)               0                       0
168 |                                             (c) Profits in lieu of salary as per sec 17(3)               0                       0
169 |                                     (ii) Less : Allowances  to the extent exempt  u/s 10                 42,037                  42,037
170 |                                     (iii) Net salary (i-ii)                                              7,41,830               7,41,830
171 |                                     (iv) Deduction  u/s 16 (ivativbtive)                                 52,400                  52,400
172 |                                             (a) Standard  deduction  u/s 16  (ia)                        50,000                  50,000
173 |                                             (b) Entertainment  allowance   u/s 16 (ii)                   0                       0
174 |                                             (c) Professional  tax 16(iii)                                2,400                   2,400
175 |                                     (v) Income  chargeable  under the head ‘Salaries’ (iii-iv)	         6,89,430               6,89,430
176 | ```
177 | 
178 |  As you can see, the data and its layout are preserved well enough for further processing, and logic can be applied as the system requires.
179 | 
180 | ### Stock Research Reports
181 | 
182 | Many institutes publish research reports on stocks every quarter. Many hedge fund managers read those reports to decide whether or not to invest in those stocks. This gave us a very interesting problem: we parse stock reports published by different institutes for the same company and analyze them to produce one generalized report for managers. We already have a working prototype that can extract information such as Target Price, Published Date, Action and so on. We can also extract the different tables present in such reports to show the user how the stock is performing on average according to different institutes. Below is a small snippet from such a report, from which we extract details such as Stock Name, Date, Action and Target Price. There is complex logic and a modeling workflow behind it which we cannot show here.
183 | 
184 | ![StockReport](images/stock_researh.png)
185 | 
186 | Here is the output for the above image from our system:
187 | 
188 | ```
189 | file_name,companyName,action,date,targetPrice,currentPrice
190 | HUVR-23-7-19-PL.pdf,hindustan unilever,accumulate,July 23 2019,accumulate,1816,1690
191 | ```
192 | 
193 | As you can see, our system extracts the most important data from stock research reports out of the box.
194 | 
195 | We have applied OCR techniques to many other financial use cases and documents and have achieved state-of-the-art results.
196 | 
197 | ## Best Practices around - Name/address matching, quality of documents, Deduplication of photos
198 | 
199 | There are many things we need to do after we get the data out of the documents. Some of the challenges are listed below, along with some best practices for each.
200 | 
201 |     1) Name matching - e.g. a)  Urvish Patel vs Urvish P.  b) Urvishkumar Patel vs Urvish Patel
202 |     2) Address matching across documents where the address entered might be slightly different
203 |     3) Deduplication of photos/documents
204 |     4) How to check the quality of the documents, i.e. borders, lighting etc.
205 | 
206 | The following approaches can help with the above issues.
207 | 
208 | 1. Name matching - e.g. a)  Urvish Patel vs Urvish P.  b) Urvishkumar Patel vs Urvish Patel
209 | 
210 |     There are algorithms specifically designed to solve this problem. They measure how similar two strings are and hence give us an idea of whether the texts are the same or different.
211 | 
212 |     Two such algorithms are
213 | 
214 |     A.) Levenshtein Distance
215 | 
216 |     Minimum number of single-character edits required to change one word into the other
217 | 
218 |         Insertions
219 |         Deletions
220 |         Substitutions
221 | 
222 |         Ex:
223 |         “kitten” and “sitting” have edit distance = 3
224 |         kitten → sitten → sittin → sitting
225 | 
226 |     This is one of the most famous algorithms used for string matching problems.
227 | 
228 | 
229 |     B.) Jaccard Method
230 | 
231 |     ![formula_jaccard](images/jaccard_formula.png)
232 | 
233 |     ![jaccard_demo](images/jaccard_demo.png)
234 | 
235 |     The above two are the most used methods when it comes to string matching use cases.
236 | 
237 | 2. Address matching - This is more or less the same case as above, where we try to match two strings. For this problem as well, we can use fuzzy matching algorithms such as the Jaccard method to compute the similarity.
238 | 
239 | 3. Deduplication of photos/documents - Various deep learning algorithms can now help us identify duplicate documents, i.e. decide whether two documents are the same or not.
240 | 
241 |     A.) Metric Learning:- This is considered the state-of-the-art method, where we learn a representation of the image much as facial recognition models do.
242 |     ![metric_learning](images/metric_learning.png)
243 | 
244 |     B.) Embedding Learning:- Embeddings are best known as representations of words in NLP. However, we can also learn embeddings of images and compute the similarity between an original and a candidate duplicate from their vector representations.
245 | 
246 | 4. How to check the quality of the documents, i.e. borders, lighting etc.
247 | 
248 |     A.) Fast Fourier Transform: -
249 |     There are several ways to check the quality of an image. We can use algorithms like FFT (Fast Fourier transform), which gives an idea of how blurry the image is. Once we know the amount of blur, we can discard the images that are too blurry and ask the user to upload them again.
250 |     ![FFT](images/FFT_image_blur.png)
251 | 
252 |     B.) Another idea is to determine the resolution of images based on the pixel density (DPI). This can be done with libraries such as ImageMagick, which has Python bindings. For JPEG images the compression quality can be checked as well: a quality of 95 is treated as high, and anything lower as reduced quality. Based on this, the user can be asked to upload the image again if needed.
253 | 
254 |     [Here](https://www.kaggle.com/pokekarat/classify-jpg-data-based-on-its-quality-75-90-95)  is a notebook on Kaggle created for the same purpose. The idea there was to split the data based on quality to train different models, but the same idea can be applied to detect the quality of an image.
255 | 
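The two string measures described under name matching can be sketched in plain Python. This is a minimal illustration, not the implementation behind the results shown in this repository: `levenshtein` and `jaccard` are hypothetical helper names, and computing Jaccard over lowercase word sets is an assumption (character n-grams are another common choice).

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(levenshtein("kitten", "sitting"))  # 3, matching the worked example
print(levenshtein("Urvish Patel", "Urvish P."))
print(jaccard("Urvish Patel", "Urvishkumar Patel"))
```

A low edit distance or a high Jaccard score suggests that two names or addresses refer to the same entity. The acceptance threshold is application specific and has to be tuned on real data.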
256 | 
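For the deduplication approaches, once each image has a vector representation, duplicates can be flagged by comparing vectors. The sketch below assumes the embeddings already exist; the vectors, file names, the `0.95` threshold and the helper names `cosine_similarity` and `find_duplicates` are all made up for illustration, and a real system would obtain the embeddings from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_duplicates(embeddings, threshold=0.95):
    """Return pairs of keys whose embeddings are at least `threshold` similar."""
    names = sorted(embeddings)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if cosine_similarity(embeddings[first], embeddings[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Made-up 4-dimensional vectors; real embeddings would come from a trained model.
embeddings = {
    "licence_scan_1.jpg": [0.90, 0.10, 0.30, 0.05],
    "licence_scan_2.jpg": [0.88, 0.12, 0.31, 0.04],
    "marksheet.jpg": [0.05, 0.80, 0.10, 0.60],
}
print(find_duplicates(embeddings))
```

The pairwise loop is quadratic in the number of documents, which is fine for a sketch; larger collections would typically need an approximate nearest-neighbour index instead.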
257 | 
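The FFT-based blur check can be sketched as follows. This is one common way to do it, not necessarily the method used for the figure in this README: the image is transformed, the low frequencies are removed, and the remaining energy is used as a sharpness score. NumPy is an assumption here (it is not listed in `requirements.txt`), `fft_sharpness` and `box_blur` are hypothetical names, the `radius` cut-off is illustrative, and the input is a synthetic array rather than a real scan.

```python
import numpy as np


def fft_sharpness(gray, radius=4):
    """Mean log-magnitude of the image after removing low frequencies.

    `gray` is a 2-D array of pixel intensities. Lower scores suggest a
    blurrier image; `radius` is an illustrative cut-off, not a tuned value.
    """
    height, width = gray.shape
    center_y, center_x = height // 2, width // 2
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    # Zero out the low-frequency block around the centre of the shifted spectrum.
    spectrum[center_y - radius:center_y + radius + 1,
             center_x - radius:center_x + radius + 1] = 0
    high_pass = np.fft.ifft2(np.fft.ifftshift(spectrum))
    return float(np.mean(20 * np.log10(np.abs(high_pass) + 1e-8)))


def box_blur(gray, size=5):
    """Simple box blur, used here only to create a blurry test image."""
    padded = np.pad(gray, size // 2, mode="edge")
    out = np.zeros_like(gray, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return out / (size * size)


rng = np.random.default_rng(0)
sharp = rng.random((64, 64))  # synthetic stand-in for a sharp scan
blurry = box_blur(sharp)
print(fft_sharpness(sharp) > fft_sharpness(blurry))
```

In practice the score would be compared against a threshold chosen from sample documents, and uploads scoring below it would be rejected with a request to upload again.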
258 | As we can see, extracting a document's data is important, but so is processing it afterwards to get clean data. Other problems, such as duplicate documents, also need to be tackled. With the rise of machine learning, these problems are becoming increasingly tractable.


--------------------------------------------------------------------------------