├── requirements.txt
├── data
    ├── pnlsheet.pdf
    ├── sample_file.pdf
    ├── AmazonWebServices.pdf
    ├── invoice2data_simple.csv
    └── sample_csv.csv
├── images
    ├── result.jpg
    ├── ITR_final.jpg
    ├── Uk_licence.jpg
    ├── itr_output.png
    ├── sample_csv.PNG
    ├── sample_pdf.PNG
    ├── jaccard_demo.png
    ├── FFT_image_blur.png
    ├── edit_distance.png
    ├── jaccard_formula.png
    ├── metric_learning.png
    ├── stock_researh.png
    ├── pdf-sample-page-001.jpg
    ├── invoice2data_csv_result.PNG
    ├── tesseract_sample_result.PNG
    └── AmazonWebService_PDF_Image.jpg
├── code
    ├── pdftotext_sample.py
    ├── invoice2data.py
    └── sample_code_teseract.py
├── template
    └── temp.yml
└── README.md

/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas==0.24.2
2 | pdftotext==2.1.2
3 |
--------------------------------------------------------------------------------
/data/pnlsheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/data/pnlsheet.pdf
--------------------------------------------------------------------------------
/images/result.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/result.jpg
--------------------------------------------------------------------------------
/data/sample_file.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/data/sample_file.pdf
--------------------------------------------------------------------------------
/images/ITR_final.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/ITR_final.jpg
--------------------------------------------------------------------------------
/images/Uk_licence.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/Uk_licence.jpg
--------------------------------------------------------------------------------
/images/itr_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/itr_output.png
--------------------------------------------------------------------------------
/images/sample_csv.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/sample_csv.PNG
--------------------------------------------------------------------------------
/images/sample_pdf.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/sample_pdf.PNG
--------------------------------------------------------------------------------
/images/jaccard_demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/jaccard_demo.png
--------------------------------------------------------------------------------
/data/AmazonWebServices.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/data/AmazonWebServices.pdf
--------------------------------------------------------------------------------
/images/FFT_image_blur.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/FFT_image_blur.png
--------------------------------------------------------------------------------
/images/edit_distance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/edit_distance.png
--------------------------------------------------------------------------------
/images/jaccard_formula.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/jaccard_formula.png
--------------------------------------------------------------------------------
/images/metric_learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/metric_learning.png
--------------------------------------------------------------------------------
/images/stock_researh.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/stock_researh.png
--------------------------------------------------------------------------------
/images/pdf-sample-page-001.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/pdf-sample-page-001.jpg
--------------------------------------------------------------------------------
/images/invoice2data_csv_result.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/invoice2data_csv_result.PNG
--------------------------------------------------------------------------------
/images/tesseract_sample_result.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/tesseract_sample_result.PNG
--------------------------------------------------------------------------------
/images/AmazonWebService_PDF_Image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Pirimid/financial-documents-ocr-deep-learning/HEAD/images/AmazonWebService_PDF_Image.jpg
--------------------------------------------------------------------------------
/code/pdftotext_sample.py:
--------------------------------------------------------------------------------
1 | import pdftotext
2 | import pandas as pd
3 |
4 | # Load your PDF
5 | with open("./data/sample_file.pdf", "rb") as f:
6 |     pdf = pdftotext.PDF(f)
7 |
8 | sentences = []
9 | for page in pdf:
10 |     lines = page.splitlines()
11 |     for line in lines:
12 |         sentences.append(line)
13 |
14 | df = pd.DataFrame(data=sentences, columns=['data'])
15 | # print(df.head())
16 | df.to_csv("./data/sample_csv.csv", index=False)
17 |
--------------------------------------------------------------------------------
/data/invoice2data_simple.csv:
--------------------------------------------------------------------------------
1 | ,issuer,amount,amount_untaxed,date,invoice_number,partner_name,partner_website,currency,lines,desc
2 | 0,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'AWS Data Transfer', 'price_unit': '0.01'}",Invoice from Amazon Web Services
3 | 1,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'Amazon Elastic Compute Cloud', 'price_unit': '1.87'}",Invoice from Amazon Web Services
4 | 2,Amazon Web
Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'Amazon Glacier', 'price_unit': '2.22'}",Invoice from Amazon Web Services
5 | 3,Amazon Web Services,4.11,4.11,2014-08-03,42183017,"Amazon Web Services, Inc.",aws.amazon.com,USD,"{'description': 'Amazon Simple Storage Service', 'price_unit': '0.01'}",Invoice from Amazon Web Services
6 |
--------------------------------------------------------------------------------
/code/invoice2data.py:
--------------------------------------------------------------------------------
1 | # Importing all the required libraries
2 |
3 | from invoice2data import extract_data
4 | from invoice2data.extract.loader import read_templates
5 | from invoice2data.input import pdftotext
6 | import pandas as pd
7 |
8 | # Importing custom template
9 | templates = read_templates('./template/')
10 |
11 | #print(templates)
12 |
13 | # Extract data from PDF
14 | result = extract_data('./data/pnlsheet.pdf', templates = templates, input_module = pdftotext)
15 |
16 | # Store the extracted data in a DataFrame
17 | df = pd.DataFrame(data = result)
18 |
19 | # Export the DataFrame to a csv file
20 | df.to_csv('./data/invoice2data_simple.csv')
21 |
22 |
23 | '''
24 | You can use any of the supported input modules (pdftotext, pdfminer, tesseract) to extract the text. The
25 | input_module argument is optional; pdftotext is used by default if it is not specified.
26 |
27 | The custom template named temp.yml is placed in the template folder. You can remove the templates parameter in
28 | extract_data(). 
The default templates will then be used.
29 |
30 | '''
--------------------------------------------------------------------------------
/template/temp.yml:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | issuer: Arigo Private Limited
3 | fields:
4 |   static_invoice_number:
5 |   static_amount:
6 |   static_date:
7 |   current_assets: (?i)\s*current\s*assets\s*\$(\d+,*\d+)
8 |   cash: (?i)(?\w+)\s+(?P\d+,*\d+\.*)\s+(?P\d+,*\d+\.*)
22 | keywords:
23 |   - 'Balance'
24 |
--------------------------------------------------------------------------------
/code/sample_code_teseract.py:
--------------------------------------------------------------------------------
1 | # This is a simple code snippet to get the data out of the driving licence as shown in the use case
2 |
3 | import pytesseract
4 | from pytesseract import Output
5 | import cv2
6 | import pandas as pd
7 |
8 | poppler_path = ''  # path to poppler
9 | pytesseract.pytesseract.tesseract_cmd = ''  # path to the tesseract executable
10 |
11 | if __name__ == '__main__':
12 |
13 |     image = cv2.imread('Uk_licence.jpg')
14 |     image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
15 |
16 |     rows = []
17 |     custom_config = r'-c preserve_interword_spaces=1 --oem 3 --psm 3 -l eng'
18 |     d = pytesseract.image_to_data(image, config=custom_config, output_type=Output.DICT)
19 |     df = pd.DataFrame(data=d)
20 |     # clean up blanks
21 |     df1 = df[(df.conf != '-1') & (df.text != ' ') & (df.text != '')]
22 |     # sort blocks vertically
23 |     sorted_blocks = df1.groupby('block_num').first(
24 |     ).sort_values('top').index.tolist()
25 |     for block in sorted_blocks:
26 |         curr = df1[df1['block_num'] == block]
27 |         sel = curr[curr.text.str.len() > 3]
28 |         char_w = (sel.width/sel.text.str.len()).mean()
29 |         prev_par, prev_line, prev_left = 0, 0, 0
30 |         text = ''
31 |         for ix, ln in curr.iterrows():
32 |             # add new line when necessary
33 |             if prev_par != ln['par_num']:
34 |                 text += '\n'
35 |                 prev_par = ln['par_num']
36 |                 prev_line = 
ln['line_num']
37 |                 prev_left = 0
38 |             elif prev_line != ln['line_num']:
39 |                 text += '\n'
40 |                 prev_line = ln['line_num']
41 |                 prev_left = 0
42 |
43 |             added = 0  # num of spaces that should be added
44 |             if ln['left']/char_w > prev_left + 1:
45 |                 added = int((ln['left'])/char_w) - prev_left
46 |                 text += ' ' * added
47 |             text += ln['text'] + ' '
48 |             prev_left += len(ln['text']) + added + 1
49 |         text += '\n'
50 |         for row in text.split("\n"):
51 |             rows.append(row)
52 |
53 |     # Print the data that we have extracted
54 |     for i in rows:
55 |         print(i)
56 |
--------------------------------------------------------------------------------
/data/sample_csv.csv:
--------------------------------------------------------------------------------
1 | data
2 | SAMPLE PROFIT & LOSS STATEMENT
3 | Any borrower(s) who is/are self-employed or an independent contractor should
4 | complete this form if they do not already have their own profit and loss form.
5 | Company Name:_ _________________________________________________ Percent of Ownership_%
6 | Company Address:_
7 | Type of Business:________________________________________________________________________________
8 | Borrower Name(s):_ _____________________________________________________________________________
9 | Loan Number: _
10 | Dates Reported (MM/DD/YY - MM/DD/YY)_ _______________________________________________________
11 | (Must be minimum of 3 full months)
12 | Please fill in the fields that apply to your business
13 | GROSS INCOME
14 | Gross Sales $
15 | (Total amount of income from sales or service before subtracting expenses)
16 | Other Income $
17 | (Any other additional funds earned through the company such as payments
18 | from people leasing space or payments from investors)
19 | Total GROSS INCOME BEFORE TAXES $
20 | EXPENSES
21 | Cost of Goods Sold $
22 | (Direct costs to produce or obtain the goods sold by the company)
23 | Accounting and Legal Fees $
24 | Advertising $
25 | Insurance $
26 | (Do not include 
homeowner insurance)
27 | Maintenance and Repairs $
28 | Supplies $
29 | Payroll Expenses $
30 | (Salaries and wages for borrower(s) on the mortgage loan)
31 | Payroll Expenses $
32 | (Salaries and wages for employees who are not borrower(s)
33 | on the mortgage loan)
34 | Postage $
35 | " (Over, please)"
36 | SAMPLE PROFIT & LOSS STATEMENT
37 | Please fill in the fields that apply to your business
38 | Rent $
39 | Licenses $
40 | Taxes $
41 | (Do not include Real Estate taxes on the property; do not include Income
42 | Taxes on the business - include the total of any other taxes that you have to
43 | pay for the business)
44 | Telephone $
45 | Travel/Transportation $
46 | Utilities $
47 | Other $
48 | (Total and explanation of any other expenses not already listed)
49 | Total EXPENSES $
50 | NET INCOME
51 | Net Income Before Taxes $
52 | Taxes $
53 | (Paid on Business Income)
54 | Total NET INCOME AFTER TAXES $
55 | " By signing this document, I/we certify that all the information is truthful. I/we understand that knowingly submitting false"
56 | information may constitute fraud.
57 | Borrower Name(s)_
58 | Signature____________________________________________________ Date_
59 | Signature_Date_
60 | 39932 PL 1217
61 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Financial document parsing examples with various libraries
2 |
3 | There is a great deal of data stored in `pdfs`, but we cannot use that data directly for other purposes, such as training a machine learning model. This is a real issue when working with financial data, as most companies use `pdfs` to store their P&L statements, tax returns, general meeting notes and much more. Even company brochures are published as `pdfs`. 
In order to study companies in more depth, we need a way to extract this data efficiently and precisely. Here we show our research and work on getting the data out of `pdfs`.
4 |
5 | Several open-source libraries already exist for this kind of extraction. Let's look at some of them in depth.
6 |
7 | ### 1. PDFToText:
8 |
9 | If the PDF is not made up of images, this library is one of the best options. It is easy to use and converts the PDF content into plain raw text very precisely. Let's look at some samples.
10 | ![samples_pdf](images/sample_pdf.PNG)
11 |
12 | ![sample_csv](images/sample_csv.PNG)
13 |
14 |
15 | In the two pictures above, the first is an image of the PDF file and the second is the `csv` file generated by the `pdftotext` library. The example shows the library's ability to extract tables and text at the same time in a meaningful way. After that, anyone can apply regular expressions to get the most out of the data in the csv file. We have added some files to the `data` folder as examples.
16 |
17 | You can use our sample code from the `code` folder. To use it, do the following:
18 |
19 | 1. Change the `pdf` file path at line 5 from `./data/sample_file.pdf` to your desired file path.
20 | 2. Change the output file path at line 16 from `./data/sample_csv.csv` to your desired file path.
21 | 3. Run `pdftotext_sample.py` from the terminal with the `python pdftotext_sample.py` command.
22 |
23 | Requirements to use `pdftotext` are:
24 | 1. OS - Linux
25 | 2. Python >= 3.0
26 | Here is the package link on PyPI: https://pypi.org/project/pdftotext/
27 |
28 | ### 2. Tesseract
29 |
30 | PDFToText can extract data from a `text PDF`, whereas it will fail to extract data from an `image PDF`. In a real-world scenario one can receive any kind of PDF, so one needs `Optical character recognition` (OCR) libraries, which are meant for this. 
Tesseract is one of the best examples. Let us understand it with this sample:
31 |
32 | ![tesseract Example](images/tesseract_sample_result.PNG)
33 |
34 | In the picture above we can see how the information is extracted. The leftmost image is a receipt and the rightmost is the output obtained after applying Tesseract to it.
35 |
36 | Requirements to use `tesseract` are:
37 | 1. OS - Linux, Mac OSX and Windows
38 | 2. Python 2.7 or 3.5+
39 | Here is the package link on PyPI: https://pypi.org/project/pytesseract/
40 |
41 | ### 3. Invoice2Data
42 |
43 | The Invoice2Data library can be used not only to extract data from a PDF but also to get information from the extracted data. Both of the above libraries have their specific uses, whereas `Invoice2Data` can extract data with either of them (and with other libraries as well). It extracts the data from the PDF and then, using templates, one can get the desired information out of it. A sample of how it works is shown below:
44 |
45 | ![AWS PDF](images/AmazonWebService_PDF_Image.jpg)
46 |
47 | ![Invoice2Data CSV](images/invoice2data_csv_result.PNG)
48 |
49 |
50 | The first image is the PDF of an AWS receipt, and the second is the extracted information in the form of CSV data. For more information on how to use Invoice2Data and how templates work, see its [GitHub repository](https://github.com/invoice-x/invoice2data).
51 |
52 |
53 | Requirements to use `invoice2data` are:
54 | 1. OS - Linux
55 | 2. Python >= 3.0
56 | Here is the package link on PyPI: https://pypi.org/project/invoice2data/0.0.1/
57 |
58 | ### 4. Pdfminer.six
59 |
60 | Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, with a focus on getting and analyzing text data. Pdfminer.six extracts the text of a page directly from the source code of the PDF. It can also be used to get the exact location, font or color of the text.
61 |
62 | It is built in a modular way, so that each component of pdfminer.six can be replaced easily. You can implement your own interpreter or rendering device to use the power of pdfminer.six for purposes other than text analysis.
63 |
64 | ![SAMPLE PDF](images/pdf-sample-page-001.jpg)
65 | Result:
66 | ```
67 | Adobe Acrobat PDF Files
68 | Adobe® Portable Document Format (PDF) is a universal file format that preserves all
69 | of the fonts, formatting, colours and graphics of any source document, regardless of
70 | the application and platform used to create it.
71 | Adobe PDF is an ideal format for electronic document distribution as it overcomes the
72 | problems commonly encountered with electronic file sharing.
73 | • Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
74 | Reader. Recipients of other file formats sometimes can't open files because they
75 | don't have the applications used to create the documents.
76 | • PDF files always print correctly on any printing device.
77 | • PDF files always display exactly as created, regardless of fonts, software, and
78 | operating systems. Fonts, and graphics are not lost due to platform, software, and
79 | version incompatibilities.
80 | • The free Acrobat Reader is easy to download and can be freely distributed by
81 | anyone.
82 | • Compact PDF files are smaller than their source files and download a
83 | page at a time for fast display on the Web.
84 | ```
85 |
86 |
87 | Requirements to use `pdfminer.six` are:
88 | 1. Python >= 3.4
89 |
90 | Install:
91 | 1. pip install pdfminer.six
92 |
93 | ### 5. Pytesseract - Extracting information from KYC documents
94 |
95 | Pytesseract is trained on several languages. If the document is in a language it has not been trained on, we need to train it first, and if the image is not clear or legible, we need image processing techniques such as thresholding, opening and border detection.
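
As a rough illustration of the thresholding step mentioned above, the sketch below binarizes a tiny grayscale image with Otsu's method in plain Python. The function names `otsu_threshold` and `binarize` and the toy pixel values are assumptions made for this example, not code from this repository; on real scans an image library such as OpenCV would normally do this work.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


# Hypothetical 4x4 "scan": dark text pixels on a light background
image = [
    [210, 205, 40, 200],
    [215, 35, 45, 210],
    [205, 30, 50, 220],
    [200, 210, 215, 205],
]
threshold = otsu_threshold(image)
binary = binarize(image, threshold)
for row in binary:
    print(row)
```

The binarized image can then be passed to the OCR step; whether Otsu's method or a fixed or adaptive threshold works best depends on the document.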
96 |
97 | ### Driving Licence:
98 | Here we have extracted data from a sample UK licence using this library.
99 |
100 | ![LICENCE IMAGE](images/Uk_licence.jpg)
101 |
102 | Result:
103 |
104 | ```
105 | DRIVING LICENCE
106 | 1, MORGAN
107 | 2, SARAH
108 | MEREDYTH
109 | 3. 11.03.1976 UNITED KINGDOM
110 | 4a. 19.01.2013 4c. DVLA
111 | 4b. 18.01.2023
112 | 5. MORGA753116SM9lJ 35
113 | 8. 122 BURNS CRESCENT
114 | EDINBURGH
115 | EH1 9GP.
116 | 9. AM/A/B1/B/f/kK/I/n/p/q
117 | DVLA INTERNAL USE
118 | ```
119 | ### Marksheet:
120 |
121 | Here is one more Pytesseract example, in which we extract data from an engineering student's marksheet. The original image is shown below.
122 |
123 | ![Marksheet](images/result.jpg)
124 |
125 | The data extracted from the above image is shown below.
126 | ```
127 | Academic Year Month & Year of Examination Satement No.
128 | 2016-2017 MAY-2017 1148827843
129 |
130 | Course Code & 040 SANKALCHAND PATEL COLLEGE OF ENGINEERING, VISNAGAR
131 | Name
132 |
133 | Course Branch Name
134 | BACHELOR OF ENGINEERING MECHANICAL ENGINEERING
135 |
136 | Student's Name
137 | PATEL POOJANKUMAR MUKESHBHAI
138 |
139 | Enrolment No(NEW) 150400119084 Enrolment No(OLD) 150400119084 Seat No. E319478
140 |
141 | Subject
142 | Code Subject Name Course Theory Component Practical Component Theory Practical Subject
143 | ESE PA ESE PA Grade Grade Grade
144 | Credit
145 | 2131903 Manufacturing Process-1 5 DD BB BB BB CD BB CC
146 | ```
147 |
148 | We can see that some of the information is still missing, due to the color scheme of the image. However, with proper image processing, such as converting the image to black and white, it is possible to solve such issues and get better results.
149 |
150 | ### Income Tax Return (ITR)
151 |
152 | Below is a small example on an ITR. Due to the size of the image we cannot show the full ITR, but the small excerpt below shows how the model is able to get the data from the image as it is.
153 |
154 | ![ITR](images/ITR_final.jpg)
155 |
156 | ```
157 | Name
158 | PAN | AY: 2020-21 | DIN : CPC/2021/A1/105231128 | Ack. No. : 439435200030820
159 |
160 |
161 | SI.No. Particulars Reporting Heads Amount in =
162 | As provided by Taxpayer As Computed u/s 143(1)
163 |
164 | 01 SALARY
165 | (i) Gross salary (iatibtic) 7,83,867 7,83,867
166 | (a) Salary as per section 17(1) 7,83,867 7,83,867
167 | (b) Value of perquisites as per section 17(2) 0 0
168 | (c) Profits in lieu of salary as per sec 17(3) 0 0
169 | (ii) Less : Allowances to the extent exempt u/s 10 42,037 42,037
170 | (iii) Net salary (i-ii) 7,41,830 7,41,830
171 | (iv) Deduction u/s 16 (ivativbtive) 52,400 52,400
172 | (a) Standard deduction u/s 16 (ia) 50,000 50,000
173 | (b) Entertainment allowance u/s 16 (ii) 0 0
174 | (c) Professional tax 16(iii) 2,400 2,400
175 | (v) Income chargeable under the head ‘Salaries’ (iii-iv) 6,89,430 6,89,430
176 | ```
177 |
178 | As you can see, the data is preserved properly, in a form that allows further processing and the application of logic as the system requires.
179 |
180 | ### Stock Research Reports
181 |
182 | Many institutes publish research reports on stocks every quarter. Many hedge fund managers read those reports in order to decide whether or not to invest in those stocks. This gave us a very interesting problem: we parsed many stock reports published by different institutes for the same company and analyzed them to produce one generalized report for managers. We already have a working prototype that can extract information such as Target Price, Published Date, Action and so on. We are also able to extract the different tables present in such reports, to show the user how the stock is performing on average according to the different institutes. Below is a small snippet from such a report, showing that we are able to get details such as Stock Name, Date, Action, Target Price and so on. 
There is complex logic and a modeling workflow behind it that we cannot show here.
183 |
184 | ![StockReport](images/stock_researh.png)
185 |
186 | Here is the output of our system for the above image:
187 |
188 | ```
189 | file_name,companyName,action,date,targetPrice,currentPrice
190 | HUVR-23-7-19-PL.pdf,hindustan unilever,accumulate,July 23 2019,accumulate,1816,1690
191 | ```
192 |
193 | As you can see, our system accurately extracts the most important data from stock research reports out of the box.
194 |
195 | We have applied OCR techniques to many other financial use cases and documents and have achieved state-of-the-art results.
196 |
197 | ## Best Practices around - Name/address matching, quality of documents, Deduplication of photos
198 |
199 | There are many things we need to do after we get the data out of the documents. Some of the challenges are listed below, followed by some best practices for handling them.
200 |
201 | 1) Name matching - e.g. a) Urvish Patel vs Urvish P. b) Urvishkumar Patel vs Urvish Patel
202 | 2) Address matching across documents, where the address entered might be slightly different
203 | 3) Deduplication of photos/documents
204 | 4) How to check the quality of the documents, i.e. borders, lighting etc.
205 |
206 | To solve the above issues, we can use the following solutions.
207 |
208 | 1. Name matching - e.g. a) Urvish Patel vs Urvish P. b) Urvishkumar Patel vs Urvish Patel
209 |
210 | There are algorithms specifically designed to solve this problem. They measure how similar two words are and hence give us an idea of whether two texts are the same or different.
211 |
212 | Two such algorithms are:
213 |
214 | A.) 
Levenshtein Distance
215 |
216 | The minimum number of single-character edits required to change one word into the other, where an edit is one of:
217 |
218 | Insertions
219 | Deletions
220 | Substitutions
221 |
222 | Ex:
223 | “kitten” and “sitting” have edit distance = 3
224 | kitten → sitten → sittin → sitting
225 |
226 | This is one of the best-known algorithms for string matching problems.
227 |
228 |
229 | B.) Jaccard Method
230 |
231 | ![formula_jaccard](images/jaccard_formula.png)
232 |
233 | ![jaccard_demo](images/jaccard_demo.png)
234 |
235 | The above two are the most widely used methods for string matching use cases.
236 |
237 | 2. Address matching - This is more or less the same case as above, where we try to match two strings. So for this problem as well, we can use fuzzy matching algorithms such as the Jaccard Method to compute the similarity.
238 |
239 | 3. Deduplication of photos/documents - There are now various deep learning algorithms that can help us identify duplicate documents, i.e. whether two documents are the same or not.
240 |
241 | A.) Metric Learning:- This is considered the state-of-the-art method, where we learn a representation of the image, just as in facial recognition models.
242 | ![metric_learning](images/metric_learning.png)
243 |
244 | B.) Embedding Learning:- In NLP, embeddings are representations of words; however, we can also use them to learn representations of images and compute the similarity between images based on the vector representations of the original and duplicate images.
245 |
246 | 4. How to check the quality of the documents, i.e. borders, lighting etc.
247 |
248 | A.) Fast Fourier Transform: -
249 | There are several ways to check the quality of an image. We can use algorithms like the FFT (Fast Fourier Transform), which gives an idea of how blurry the image is. Once we know the blur value, we can discard the images that are too blurry and ask the user to upload them again.
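
One possible way to turn the FFT idea above into a blur score is sketched below: a sharp image keeps a larger share of its spectral energy in the high frequencies than a blurred copy of the same image. This is an illustrative sketch only. It uses a naive DFT in plain Python so that it runs without dependencies (a real pipeline would more likely use an FFT routine such as `numpy.fft.fft2`), and the function names, the `cutoff` parameter and the toy 8x8 images are all assumptions rather than the method used in this project.

```python
import cmath


def dft_1d(values):
    """Naive discrete Fourier transform of a 1-D sequence (O(n^2))."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft_2d(image):
    """2-D DFT computed as 1-D DFTs over the rows, then over the columns."""
    rows = [dft_1d(row) for row in image]
    cols = [dft_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def high_frequency_ratio(image, cutoff=1):
    """Share of spectral energy (DC term excluded) above the cutoff frequency."""
    height, width = len(image), len(image[0])
    spectrum = dft_2d(image)
    high = total = 0.0
    for u in range(height):
        for v in range(width):
            if u == 0 and v == 0:
                continue  # skip the DC term (average brightness)
            energy = abs(spectrum[u][v]) ** 2
            # Distance from zero frequency, accounting for wrap-around
            fu, fv = min(u, height - u), min(v, width - v)
            total += energy
            if max(fu, fv) > cutoff:
                high += energy
    return high / total if total else 0.0


def box_blur_rows(image):
    """Blur each row with a 3-pixel moving average (wrapping at the edges)."""
    blurred = []
    for row in image:
        n = len(row)
        blurred.append([(row[i - 1] + row[i] + row[(i + 1) % n]) / 3 for i in range(n)])
    return blurred


# Hypothetical 8x8 image with a hard vertical edge, and a blurred copy of it
sharp = [[0, 0, 0, 0, 255, 255, 255, 255] for _ in range(8)]
blurred = box_blur_rows(sharp)

sharp_score = high_frequency_ratio(sharp)
blurred_score = high_frequency_ratio(blurred)
print(f"sharp: {sharp_score:.3f}, blurred: {blurred_score:.3f}")
```

The score at which an upload should be rejected is not specified here; it would have to be calibrated on real documents.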
250 | ![FFT](images/FFT_image_blur.png)
251 |
252 | B.) Another idea is to determine the resolution of images based on the pixel density (DPI). This can be done with many libraries, such as ImageMagick, which can be used from Python. If the quality of an image is 95, it is considered to be of the highest quality; otherwise it is not. Based on this, the user can be asked to upload the image again if needed.
253 |
254 | [Here](https://www.kaggle.com/pokekarat/classify-jpg-data-based-on-its-quality-75-90-95) is a Kaggle notebook created for the same purpose. The idea there was to split the data by quality in order to train different models, but the same idea can be applied to detect the quality of an image.
255 |
256 |
257 |
258 | As we can see, it is important not only to get the data out of a document but also to process it in order to obtain proper data. Documents also bring many other problems, such as duplication, that need to be tackled. But with the rise of Machine Learning, these problems are being solved day by day.
--------------------------------------------------------------------------------
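
To make the two string-similarity measures from the name and address matching best practices in the README more concrete, here is a minimal sketch of both in plain Python. The function names are hypothetical, and computing the Jaccard similarity over character bigrams is an assumption for this example (the README's formula image does not say whether sets of words or characters are intended).

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def jaccard(a, b, n=2):
    """Jaccard similarity |A & B| / |A | B| over lower-cased character n-grams."""
    def ngrams(text):
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    set_a, set_b = ngrams(a), ngrams(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


print(levenshtein("kitten", "sitting"))  # 3, matching the example in the README
print(jaccard("Urvish Patel", "Urvish P."))
print(jaccard("Urvishkumar Patel", "Urvish Patel"))
```

In practice the two names would be treated as a match when the similarity exceeds a threshold chosen for the data at hand; no particular threshold is implied here.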