├── requirements.txt
├── README.md
├── OCR_APP.py
├── Text_Extract.py
└── BIZ_CARDS_DASHBOARD.py


/requirements.txt:
--------------------------------------------------------------------------------
streamlit
pymysql
easyocr
numpy
Pillow==8.4.0
spacy>=3.0.0,<4.0.0
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
opencv-python==4.5.5.64
python-Levenshtein==0.12.2


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Text-Extraction-From-Business-Card-Using-OCR

Application Link: https://tulasinnd-text-extraction-from-business-card-usi-ocr-app-c78inw.streamlit.app/

Application Demo Video Link: https://www.linkedin.com/posts/tulasi-n-49b6111b0_unlocking-data-from-business-cards-using-activity-7039308472575815680-Zwd9?utm_source=share&utm_medium=member_desktop

This is an OCR application that uses the EasyOCR library to extract text from images uploaded by users.
The extracted text is then parsed for details such as email, phone number, pin code, address,
and website URL, which are displayed in a Streamlit web app. Users can upload this information to a
database and delete it at any time.

Skills required:

Python, AWS-RDS-MYSQL, Streamlit, Regular Expressions, OCR, CSS, HTML

Features

Extracts Website URL, Email, Pin Code, Phone Number(s), and Address from the uploaded image.
Displays the uploaded image along with the extracted text.
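
As a rough illustration of how some of the fields listed above can be picked out of OCR output, here is a minimal sketch based on the regular expressions used in OCR_APP.py. The `extract_fields` helper and the sample strings are assumptions made for this example only; the app runs similar checks inline while looping over the EasyOCR results, and it also handles phone numbers and addresses, which are left out here.

```python
import re


def extract_fields(lines):
    """Pick out email, pin code, and website from a list of OCR text lines.

    Hypothetical helper for illustration; not part of the app's code.
    """
    fields = {"email": "", "pin": "", "website": ""}
    for line in lines:
        text = line.lower()
        # Any line containing "@" is treated as the email address.
        if "@" in text:
            fields["email"] = text
        # A run of six or seven digits is treated as the pin code.
        pin_match = re.search(r"\d{6,7}", text)
        if pin_match:
            fields["pin"] = pin_match.group()
        # A line without "@" that starts with "www" or ends in "com"
        # is treated as the website URL.
        if re.match(r"(?!.*@)(www|.*com$)", text):
            fields["website"] = text
    return fields


# Made-up OCR lines, standing in for the text read from a card image.
sample = ["John Doe", "john@example.com", "www.example.com", "Chennai 600001"]
print(extract_fields(sample))
```

Because the pin-code pattern accepts any run of six or seven digits, a long phone number on the card can also be picked up as a pin code. This is one of the cases where the output may need a manual check.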

Installation

Clone the repository using the following command:

git clone https://github.com/shubham5351/OCR-for-Text-Extraction.git

Navigate to the directory using the following command:

cd OCR-for-Text-Extraction

Install the required libraries using the following command:

pip install -r requirements.txt

Run the application using the following command:

streamlit run OCR_APP.py

Usage

Once the application is running, upload an image using the “Upload Image” button.

The application will extract the text from the image and display it in the “Extracted Text” section.

The extracted text will include the Website URL, Email, Pin Code, Phone Number(s), and Address.

Advantages

Time-saving: OCR technology enables the automated extraction of data from business cards, saving time and effort
that would otherwise be spent on manual data entry.

Increased accuracy: OCR technology has the potential to reduce errors and improve accuracy compared to manual data entry.

Scalability: OCR technology can handle large volumes of business cards, making it a good fit for businesses
with high volumes of contacts.

Easy integration: OCR technology can be integrated into existing systems and applications, so it fits into
existing workflows.

Cost-effective: OCR technology can be a cost-effective alternative to hiring additional staff to handle manual
data entry tasks.

Limitations

While the app has been designed to extract fields accurately, occasional incorrect outputs may occur.

This can happen due to factors such as low-quality input images or unexpected card layouts.

Note

The application can extract English text only.

It will extract information only from BUSINESS CARDS.

--------------------------------------------------------------------------------
/OCR_APP.py:
--------------------------------------------------------------------------------
import easyocr as ocr  #OCR
import streamlit as st  #Web App
from PIL import Image #Image Processing
import numpy as np #Image Processing
st.set_page_config(layout="wide")
import re
import pandas as pd

#title
st.title(":orange[UNLOCKING DATA FROM BUSINESS CARDS USING OCR]")
st.write(" ")
col1, col2,col3= st.columns([3,0.5,4.5])
with col1:
    #image uploader
    st.write("## UPLOAD IMAGE")
    image = st.file_uploader(label = "",type=['png','jpg','jpeg'])

# st.cache is deprecated; st.cache_resource is the replacement for shared objects such as models
@st.cache_resource
def load_model():
    reader = ocr.Reader(['en'])#,model_storage_directory='.')
    return reader

reader = load_model() #load model

if image is not None:
    input_image = Image.open(image) #read image
    with col1:
        #st.write("## YOUR IMAGE")
        st.image(input_image) #display image

    result = reader.readtext(np.array(input_image))
    result_text = [] #empty list for results
    for text in result:
        result_text.append(text[1])

    PH=[]
    PHID=[]
    ADD=set()
    AID=[]
    EMAIL=''
    EID=''
    PIN=''
    PID=''
    WEB=''
    WID=''

    for i, string in enumerate(result_text):
        #st.write(string.lower())

        # TO FIND EMAIL
        if re.search(r'@', string.lower()):
            EMAIL=string.lower()
            EID=i

        # TO FIND PINCODE
        match = re.search(r'\d{6,7}', string.lower())
        if match:
            PIN=match.group()
            PID=i

        # TO FIND PHONE NUMBER
        # match = re.search(r'(?:ph|phone|phno)?(?:[+-]?\d*){7,}', string)
        #match = re.search(r'(?:ph|phone|phno)?\s*(?:[+-]?\d\s*){7,}', string)
        match = re.search(r'(?:ph|phone|phno)?\s*(?:[+-]?\d\s*[\(\)]*){7,}', string)
        if match and len(re.findall(r'\d', string)) > 7:
            PH.append(string)
            PHID.append(i)



        # TO FIND ADDRESS
        keywords = ['road', 'floor', ' st ', 'st,', 'street', ' dt ', 'district',
                    'near', 'beside', 'opposite', ' at ', ' in ', 'center', 'main road',
                    'state','country', 'post','zip','city','zone','mandal','town','rural',
                    'circle','next to','across from','area','building','towers','village',
                    ' ST ',' VA ',' VA,',' EAST ',' WEST ',' NORTH ',' SOUTH ']
        # Define the regular expression pattern to match six or seven continuous digits
        digit_pattern = r'\d{6,7}'
        # Check if the string contains any of the keywords or a sequence of six or seven digits
        if any(keyword in string.lower() for keyword in keywords) or re.search(digit_pattern, string):
            ADD.add(string)
            AID.append(i)

        # TO FIND STATE (USING SIMILARITY SCORE)
        states = ['Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar', 'Chhattisgarh', 'Goa', 'Gujarat',
                  'Haryana','Hyderabad', 'Himachal Pradesh', 'Jharkhand', 'Karnataka', 'Kerala', 'Madhya Pradesh',
                  'Maharashtra', 'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha', 'Punjab',
                  'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana', 'Tripura', 'Uttar Pradesh', 'Uttarakhand', 'West Bengal',
                  "United States", "China", "Japan", "Germany", "United Kingdom", "France", "India",
                  "Canada", "Italy", "South Korea", "Russia", "Australia", "Brazil", "Spain", "Mexico", 'USA','UK']

        import Levenshtein
        def string_similarity(s1, s2):
            distance = Levenshtein.distance(s1, s2)
            similarity = 1 - (distance / max(len(s1), len(s2)))
            return similarity * 100

        for x in states:
            similarity = string_similarity(x.lower(), string.lower())
            if similarity > 50:
                ADD.add(string)
                AID.append(i)

        # WEBSITE URL
        if re.match(r"(?!.*@)(www|.*com$)", string):
            WEB=string.lower()
            WID=i
    with col3:
        # DISPLAY ALL THE ELEMENTS OF BUSINESS CARD
        st.write("## EXTRACTED TEXT")
        st.write('##### :red[WEBSITE URL: ] '+ str(WEB))
        st.write('##### :red[EMAIL: ] '+ str(EMAIL))
        st.write('##### :red[PIN CODE: ] '+ str(PIN))
        ph_str = ', '.join(PH)
        st.write('##### :red[PHONE NUMBER(S): ] '+ph_str)
        add_str = ' '.join([str(elem) for elem in ADD])
        st.write('##### :red[ADDRESS: ] ', add_str)

        IDS= [EID,PID,WID]
        IDS.extend(AID)
        IDS.extend(PHID)
        # st.write(IDS)
        oth=''
        fin=[]
        for i, string in enumerate(result_text):
            if i not in IDS:
                if len(string) >= 4 and ',' not in string and '.' not in string and 'www.' not in string:
                    if not re.match("^[0-9]{0,3}$", string) and not re.match("^[^a-zA-Z0-9]+$", string):
                        numbers = re.findall(r'\d+', string)
                        if len(numbers) == 0 or all(len(num) < 3 for num in numbers) and not any(num in string for num in ['0','1','2','3','4','5','6','7','8','9']*3):
                            fin.append(string)
        st.write('##### :red[CARD HOLDER & COMPANY DETAILS: ] ')
        for i in fin:
            st.write('##### '+i)

        # st.write(result_text)
        # st.write(PH)

--------------------------------------------------------------------------------
/Text_Extract.py:
--------------------------------------------------------------------------------
import easyocr as ocr  #OCR
import streamlit as st  #Web App
from PIL import Image #Image Processing
import numpy as np #Image Processing
st.set_page_config(layout="wide")
import re
import pandas as pd

#title
st.title(":orange[UNLOCKING DATA FROM BUSINESS CARDS USING OCR]")
st.write(" ")
col1, col2,col3= st.columns([3,0.5,4.5])
with col1:
    #image uploader
    st.write("## UPLOAD IMAGE")
    image = st.file_uploader(label = "",type=['png','jpg','jpeg'])

# st.cache is deprecated; st.cache_resource is the replacement for shared objects such as models
@st.cache_resource
def load_model():
    reader = ocr.Reader(['en'])#,model_storage_directory='.')
    return reader

reader = load_model() #load model

if image is not None:
    input_image = Image.open(image) #read image
    with col1:
        #st.write("## YOUR IMAGE")
        # st.image(input_image) #display image
        st.write('EXTRACTED TEXT')

    result = reader.readtext(np.array(input_image))
    result_text = [] #empty list for results
    for text in result:
        result_text.append(text[1])
        st.write(text[1])

#     PH=[]
#     PHID=[]
#     ADD=set()
#     AID=[]
#     EMAIL=''
#     EID=''
#     PIN=''
#     PID=''
#     WEB=''
#     WID=''

#     for i, string in enumerate(result_text):
#         #st.write(string.lower())

#         # TO FIND EMAIL
#         if re.search(r'@', string.lower()):
#             EMAIL=string.lower()
#             EID=i

#         # TO FIND PINCODE
#         match = re.search(r'\d{6,7}', string.lower())
#         if match:
#             PIN=match.group()
#             PID=i

#         # TO FIND PHONE NUMBER
#         # match = re.search(r'(?:ph|phone|phno)?(?:[+-]?\d*){7,}', string)
#         #match = re.search(r'(?:ph|phone|phno)?\s*(?:[+-]?\d\s*){7,}', string)
#         match = re.search(r'(?:ph|phone|phno)?\s*(?:[+-]?\d\s*[\(\)]*){7,}', string)
#         if match and len(re.findall(r'\d', string)) > 7:
#             PH.append(string)
#             PHID.append(i)



#         # TO FIND ADDRESS
#         keywords = ['road', 'floor', ' st ', 'st,', 'street', ' dt ', 'district',
#                     'near', 'beside', 'opposite', ' at ', ' in ', 'center', 'main road',
#                     'state','country', 'post','zip','city','zone','mandal','town','rural',
#                     'circle','next to','across from','area','building','towers','village',
#                     ' ST ',' VA ',' VA,',' EAST ',' WEST ',' NORTH ',' SOUTH ']
#         # Define the regular expression pattern to match six or seven continuous digits
#         digit_pattern = r'\d{6,7}'
#         # Check if the string contains any of the keywords or a sequence of six or seven digits
#         if any(keyword in string.lower() for keyword in keywords) or re.search(digit_pattern, string):
#             ADD.add(string)
#             AID.append(i)

#         # TO FIND STATE (USING SIMILARITY SCORE)
#         states = ['Andhra Pradesh', 'Arunachal Pradesh', 'Assam', 'Bihar', 'Chhattisgarh', 'Goa', 'Gujarat',
#                   'Haryana','Hyderabad', 'Himachal Pradesh', 'Jharkhand', 'Karnataka', 'Kerala', 'Madhya Pradesh',
#                   'Maharashtra', 'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Odisha', 'Punjab',
#                   'Rajasthan', 'Sikkim', 'Tamil Nadu', 'Telangana', 'Tripura', 'Uttar Pradesh', 'Uttarakhand', 'West Bengal',
#                   "United States", "China", "Japan", "Germany", "United Kingdom", "France", "India",
#                   "Canada", "Italy", "South Korea", "Russia", "Australia", "Brazil", "Spain", "Mexico", 'USA','UK']

#         import Levenshtein
#         def string_similarity(s1, s2):
#             distance = Levenshtein.distance(s1, s2)
#             similarity = 1 - (distance / max(len(s1), len(s2)))
#             return similarity * 100

#         for x in states:
#             similarity = string_similarity(x.lower(), string.lower())
#             if similarity > 50:
#                 ADD.add(string)
#                 AID.append(i)

#         # WEBSITE URL
#         if re.match(r"(?!.*@)(www|.*com$)", string):
#             WEB=string.lower()
#             WID=i
#     with col3:
#         # DISPLAY ALL THE ELEMENTS OF BUSINESS CARD
#         st.write("## EXTRACTED TEXT")
#         st.write('##### :red[WEBSITE URL: ] '+ str(WEB))
#         st.write('##### :red[EMAIL: ] '+ str(EMAIL))
#         st.write('##### :red[PIN CODE: ] '+ str(PIN))
#         ph_str = ', '.join(PH)
#         st.write('##### :red[PHONE NUMBER(S): ] '+ph_str)
#         add_str = ' '.join([str(elem) for elem in ADD])
#         st.write('##### :red[ADDRESS: ] ', add_str)

#         IDS= [EID,PID,WID]
#         IDS.extend(AID)
#         IDS.extend(PHID)
#         # st.write(IDS)
#         oth=''
#         fin=[]
#         for i, string in enumerate(result_text):
#             if i not in IDS:
#                 if len(string) >= 4 and ',' not in string and '.' not in string and 'www.' not in string:
#                     if not re.match("^[0-9]{0,3}$", string) and not re.match("^[^a-zA-Z0-9]+$", string):
#                         numbers = re.findall('\d+', string)
#                         if len(numbers) == 0 or all(len(num) < 3 for num in numbers) and not any(num in string for num in ['0','1','2','3','4','5','6','7','8','9']*3):
#                             fin.append(string)
#         st.write('##### :red[CARD HOLDER & COMPANY DETAILS: ] ')
#         for i in fin:
#             st.write('##### '+i)

#         # st.write(result_text)
#         # st.write(PH)

--------------------------------------------------------------------------------
/BIZ_CARDS_DASHBOARD.py:
--------------------------------------------------------------------------------
import easyocr as ocr  #OCR
import streamlit as st  #Web App
from PIL import Image #Image Processing
import numpy as np #Image Processing
st.set_page_config(layout="wide")
import re
import pymysql
import io

# Getting Secrets from Streamlit Secret File
username=st.secrets['AWS_RDS_username']
password=st.secrets['AWS_RDS_password']
Endpoint=st.secrets['Endpoint']
Dbase=st.secrets['DATABASE']

# Connect to AWS-RDS-MYSQL
connection = pymysql.connect(
    host=Endpoint,
    user=username,
    password=password,
    database=Dbase
)
cursor = connection.cursor()

#title
def format_title(title: str):
    """
    Formats the given title with a colored box and padding
    """
    formatted_title = f"