├── .gitignore
├── README.md
├── cnn_ocr1.py
├── focus_border_Images
    ├── f_16544460977400001-03.jpg
    ├── g_16544460977400001-03.jpg
    ├── h.jpg
    ├── h_16544460977400001-03.jpg
    ├── v.jpg
    └── v_16544460977400001-03.jpg
├── focus_border_images.png
├── focus_border_img.jpg
├── focus_text_Images
    └── result_16540011586530001-03.jpg
├── new_thread_7_paddle_opti.py
├── output_img
    ├── 16544460977400001-03_1655120866607_1.png
    ├── 16544460977400001-03_1655120866609_2.png
    ├── 16544460977400001-03_1655120866609_3.png
    ├── 16544460977400001-03_1655120866610_4.png
    ├── 16544460977400001-03_1655120866611_5.png
    ├── 16544460977400001-03_1655120866612_6.png
    ├── 16544460977400001-03_1655120866612_7.png
    ├── 16544460977400001-03_1655120866613_8.png
    ├── 16544460977400001-03_1655120866939_11.png
    ├── 16544460977400001-03_1655120866940_12.png
    ├── 16544460977400001-03_1655120867285_15.png
    ├── 16544460977400001-03_1655120867593_18.png
    ├── 16544460977400001-03_1655120867593_19.png
    ├── 16544460977400001-03_1655120867886_22.png
    ├── 16544460977400001-03_1655120868209_25.png
    ├── 16544460977400001-03_1655120868210_26.png
    ├── 16544460977400001-03_1655120868507_29.png
    ├── 16544460977400001-03_1655120868810_32.png
    ├── 16544460977400001-03_1655120868810_33.png
    ├── 16544460977400001-03_1655120869094_36.png
    ├── 16544460977400001-03_1655120869422_39.png
    ├── 16544460977400001-03_1655120869423_40.png
    ├── 16544460977400001-03_1655120869713_43.png
    ├── 16544460977400001-03_1655120870040_46.png
    ├── 16544460977400001-03_1655120870041_47.png
    ├── 16544460977400001-03_1655120870348_50.png
    ├── 16544460977400001-03_1655120870654_53.png
    ├── 16544460977400001-03_1655120870655_54.png
    ├── 16544460977400001-03_1655120870955_57.png
    ├── 16544460977400001-03_1655120871261_60.png
    ├── 16544460977400001-03_1655120871262_61.png
    ├── 16544460977400001-03_1655120871555_64.png
    ├── 16544460977400001-03_1655120871887_67.png
    ├── 16544460977400001-03_1655120871888_68.png
    ├── 16544460977400001-03_1655120872207_71.png
    ├── 16544460977400001-03_1655120872548_74.png
    ├── 16544460977400001-03_1655120872549_75.png
    ├── 16544460977400001-03_1655120872851_78.png
    ├── 16544460977400001-03_1655120873167_81.png
    ├── 16544460977400001-03_1655120873168_82.png
    ├── 16544460977400001-03_1655120873796_85.png
    ├── 16544460977400001-03_1655120874124_88.png
    ├── 16544460977400001-03_1655120874125_89.png
    ├── 16544460977400001-03_1655120874563_92.png
    ├── 16544460977400001-03_1655120874891_95.png
    ├── 16544460977400001-03_1655120874892_96.png
    ├── 16544460977400001-03_1655120875187_99.png
    ├── 16544460977400001-03_1655120875586_102.png
    ├── 16544460977400001-03_1655120875587_103.png
    ├── 16544460977400001-03_1655120875919_106.png
    ├── 16544460977400001-03_1655120876292_109.png
    ├── 16544460977400001-03_1655120876293_110.png
    ├── 16544460977400001-03_1655120876586_113.png
    ├── 16544460977400001-03_1655120876921_116.png
    ├── 16544460977400001-03_1655120876922_117.png
    ├── 16544460977400001-03_1655120877270_120.png
    ├── 16544460977400001-03_1655120877615_123.png
    ├── 16544460977400001-03_1655120877616_124.png
    ├── 16544460977400001-03_1655120877910_127.png
    ├── 16544460977400001-03_1655120878219_130.png
    ├── 16544460977400001-03_1655120878221_131.png
    ├── 16544460977400001-03_1655120878619_134.png
    ├── 16544460977400001-03_1655120878965_137.png
    ├── 16544460977400001-03_1655120878965_138.png
    ├── 16544460977400001-03_1655120879303_141.png
    ├── 16544460977400001-03_1655120879677_144.png
    ├── 16544460977400001-03_1655120879678_145.png
    ├── 16544460977400001-03_1655120879968_148.png
    ├── 16544460977400001-03_1655120880337_151.png
    ├── 16544460977400001-03_1655120880338_152.png
    ├── 16544460977400001-03_1655120880651_155.png
    ├── 16544460977400001-03_1655120881022_158.png
    ├── 16544460977400001-03_1655120881023_159.png
    ├── 16544460977400001-03_1655120881357_162.png
    ├── 16544460977400001-03_1655120881749_165.png
    └── 16544460977400001-03_1655120881749_166.png
├── pdf_img
    └── 16544460977400001-03.jpg
├── pdf_to_img.py
├── requirements.txt
├── start_ocr1.py
├── table_1.csv
└── table_1.jpg


/.gitignore:
--------------------------------------------------------------------------------
1 | readme.txt
2 | models/


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Ocr_pdf_table_csv
 2 | This project extracts table data from PDF page snapshots and writes it to a CSV file.
 3 | It uses 2 methods to detect the boxes of a table.
 4 | The first method depends on the border of the table.
 5 | A custom CNN model is used in this project.
 6 | A version of this project that uses a pretrained Tesseract model is available at https://github.com/Supernova1024/OCR-Pdf-to-CSV-Table  
 7 | Please give the project a star if it was helpful to your startup project. :)
 8 | 
 9 | The result is stored in the "focus_border_Images" folder.
10 | ![](https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN/blob/main/focus_border_img.jpg)
11 | ![](https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN/blob/main/focus_border_images.png)
12 |   
13 | # Installing
14 | - Download this repository
15 | - Install the dependencies listed in requirements.txt from the project root directory
16 | 
17 | # How to Run
18 | - Convert pdf to images
19 |   Open pdf_to_img.py and define the parameters.
20 |   The available parameters are documented here:
21 |   https://pdf2image.readthedocs.io/en/latest/reference.html
22 | - Prepare the datasets and train the model.
23 |   Please use this project:
24 |   https://github.com/Supernova1024/train-CNN-model-for-number-classification-OCR
25 | - After training a model with the above project, copy the model into the "models" folder.
26 | - Start
27 |   ```
28 |   python start_ocr1.py
29 |   ```
30 |  
31 | # Result Description
32 |   ![](https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN/blob/main/table_1.jpg)
33 |   In this project, I used a table that has 7 columns.
34 |   Columns 1, 4, and 5 can't be recognized by the model.
35 |   The boxes of these columns are stored in the "output_img" folder as PNG images, and their file names are added to the csv file.
36 |   You can check an example of the "output_img" folder here.
37 |   https://drive.google.com/drive/folders/1nrns5zZkzfVP9o8aCyjkj9O4O-TwJAvP?usp=sharing
38 |   The other columns can be recognized by the model, and their results are stored in the csv directly.
39 |   The csv keeps the table structure of the original pdf.
40 |   I attached an example "output_img" folder and "table_1.csv".
41 | 
42 | 
43 | 
44 | 
45 | 
46 |   
47 | 

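The "Convert pdf to images" step of the README can be sketched as follows. This is a minimal sketch, not the contents of pdf_to_img.py: the function names `page_image_name` and `pdf_to_images`, the 300 dpi default, and the `<pdf name>-<page>.jpg` naming are assumptions chosen to resemble the sample file in pdf_img/. It relies on pdf2image's `convert_from_path`, which needs Poppler to be installed.

```python
import os


def page_image_name(pdf_path, page_number):
    """Return a hypothetical output name such as 'scan-03.jpg' for page 3 of scan.pdf."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{stem}-{page_number:02d}.jpg"


def pdf_to_images(pdf_path, output_dir="pdf_img", dpi=300):
    """Render every page of a PDF to a JPEG file and return the written paths."""
    # Imported lazily so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    os.makedirs(output_dir, exist_ok=True)
    written = []
    # convert_from_path returns one PIL image per page
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = os.path.join(output_dir, page_image_name(pdf_path, number))
        page.save(target, "JPEG")
        written.append(target)
    return written
```

Calling `pdf_to_images("some_document.pdf")` would write one JPEG per page into pdf_img/; the dpi value is worth tuning, because the box filters in the OCR script compare against fixed pixel sizes.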

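The "Result Description" section of the README says that unreadable columns are saved as image crops and referenced by file name in the csv, while the other columns are written as text. A minimal sketch of that row layout is shown below. It assumes the README's column numbers are 1-based, and the helper names (`crop_name`, `build_row`, `write_table`) and the `<page id>_<timestamp>_<counter>.png` pattern are hypothetical, modeled on the sample names in output_img/.

```python
import csv
import io

# Zero-based indexes of the columns the README says the model cannot read
# (columns 1, 4, and 5 of the 7-column table, assuming 1-based numbering).
IMAGE_COLUMNS = {0, 3, 4}


def crop_name(page_id, timestamp_ms, counter):
    """Hypothetical crop file name: '<page id>_<timestamp>_<counter>.png'."""
    return f"{page_id}_{timestamp_ms}_{counter}.png"


def build_row(recognized, page_id, timestamp_ms, first_counter):
    """Replace the cells of IMAGE_COLUMNS with crop file names; keep the rest as text."""
    row = []
    counter = first_counter
    for index, text in enumerate(recognized):
        if index in IMAGE_COLUMNS:
            row.append(crop_name(page_id, timestamp_ms, counter))
            counter += 1
        else:
            row.append(text)
    return row


def write_table(rows, stream):
    """Write the rows with the csv module, one table row per line."""
    csv.writer(stream).writerows(rows)


# Example: one table row with recognized text in columns 2, 3, 6, and 7.
row = build_row(["", "12", "345", "", "", "6", "78"], "page-03", 1655120866607, 1)
buffer = io.StringIO()
write_table([row], buffer)
```

Keeping one csv row per table row is what preserves the structure of the original table; the image file names act as placeholders that can be reviewed by hand later.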
--------------------------------------------------------------------------------
/cnn_ocr1.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import os
 3 | import cv2
 4 | import time
 5 | # Use the TF1 compatibility API so the Session-based checkpoint can be restored under TensorFlow 2
 6 | import tensorflow.compat.v1 as tf
 7 | tf.disable_v2_behavior()
 8 | 
 9 | image_size=32
10 | num_channels=3
11 | 
12 | # Let us restore the saved model 
13 | sess = tf.Session()
14 | # Step-1: Recreate the network graph. At this step only graph is created.
15 | saver = tf.train.import_meta_graph('models/trained_model.meta')
16 | # Step-2: Now let's load the weights saved using the restore method.
17 | saver.restore(sess, tf.train.latest_checkpoint('./models/'))
18 | 
19 | # Accessing the default graph which we have restored
20 | graph = tf.get_default_graph()
21 | # Now, let's get hold of the op that can be run to get the output.
22 | # In the original network y_pred is the tensor that is the prediction of the network
23 | y_pred = graph.get_tensor_by_name("y_pred:0")
24 | 
25 | ## Let's feed the images to the input placeholders
26 | x= graph.get_tensor_by_name("x:0") 
27 | y_true = graph.get_tensor_by_name("y_true:0") 
28 | y_test_images = np.zeros((1, 12)) 
29 | predicted_class = None
30 | dirs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "n", "m"]
31 | 
32 | def predict(img):
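33 |     """Classify one grayscale cell image; return a label from `dirs`, or None if the prediction is ambiguous."""
34 |     image = cv2.resize(img, (image_size, image_size),0,0, cv2.INTER_LINEAR)
35 |     image = np.stack((image,)*3, axis=-1)  # replicate the single channel to get 3 channels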
36 |     images = []
37 |     images.append(image)
38 |     images = np.array(images, dtype=np.uint8)
39 |     images = images.astype('float32')
40 |     images = np.multiply(images, 1.0/255.0) 
41 |     x_batch = images.reshape(1, image_size,image_size,num_channels)
42 |     
43 |     # Creating the feed_dict that is required to be fed to calculate y_pred 
44 |     feed_dict_testing = {x: x_batch, y_true: y_test_images}
45 |     # print("--feed_dict_testing---", feed_dict_testing)
46 |     result=sess.run(y_pred, feed_dict=feed_dict_testing)
47 |     # Result is of this format [[probability_of_classA probability_of_classB ....]]
48 | 
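49 |     a = result[0].tolist()
50 |     max1 = max(a)
51 |     index1 = a.index(max1)
52 | 
53 |     # Map the index of the highest probability to its class label
54 |     predicted_class = dirs[index1]
55 | 
56 |     # Treat the prediction as ambiguous when any other class has a
57 |     # probability greater than half of the maximum (max1 - i < i)
58 |     ambiguous = False
59 |     for i in a:
60 |         if i != max1 and max1 - i < i:
61 |             ambiguous = True
62 | 
63 |     # Return the label only for an unambiguous prediction
64 |     if ambiguous:
65 |         return None
66 |     return predicted_class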
67 | 
68 | 
69 | 
70 | 
71 | 
72 | 
73 | 
74 | 
75 | 
76 | 
77 | 
78 | 
79 | 
80 | 

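The acceptance rule at the end of `predict` can be illustrated without TensorFlow. The sketch below reproduces only that decision step on a plain list of probabilities; the names `pick_class` and `LABELS` are hypothetical (`LABELS` mirrors `dirs` from cnn_ocr1.py), and the probability values are made up for illustration.

```python
def pick_class(probabilities, labels):
    """Return the top label, or None when a runner-up exceeds half the top probability."""
    top = max(probabilities)
    for p in probabilities:
        # Same test as in predict(): max1 - i < i is equivalent to i > max1 / 2
        if p != top and top - p < p:
            return None
    return labels[probabilities.index(top)]


LABELS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "n", "m"]

# A confident prediction: class 7 dominates every other class.
confident = [0.01] * 12
confident[7] = 0.89

# An ambiguous prediction: class 7 (0.4) is more than half of class 1 (0.5).
ambiguous = [0.0] * 12
ambiguous[1] = 0.5
ambiguous[7] = 0.4

confident_label = pick_class(confident, LABELS)
ambiguous_label = pick_class(ambiguous, LABELS)
```

Returning None for ambiguous cells lets the caller fall back to saving the cell as an image instead of writing an unreliable value.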

--------------------------------------------------------------------------------
/focus_border_Images/f_16544460977400001-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_Images/f_16544460977400001-03.jpg


--------------------------------------------------------------------------------
/focus_border_Images/g_16544460977400001-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_Images/g_16544460977400001-03.jpg


--------------------------------------------------------------------------------
/focus_border_Images/h.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_Images/h.jpg


--------------------------------------------------------------------------------
/focus_border_Images/h_16544460977400001-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_Images/h_16544460977400001-03.jpg


--------------------------------------------------------------------------------
/focus_border_Images/v.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_Images/v.jpg


--------------------------------------------------------------------------------
/focus_border_Images/v_16544460977400001-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_Images/v_16544460977400001-03.jpg


--------------------------------------------------------------------------------
/focus_border_images.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_images.png


--------------------------------------------------------------------------------
/focus_border_img.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_border_img.jpg


--------------------------------------------------------------------------------
/focus_text_Images/result_16540011586530001-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/focus_text_Images/result_16540011586530001-03.jpg


--------------------------------------------------------------------------------
/new_thread_7_paddle_opti.py:
--------------------------------------------------------------------------------
  1 | import cv2
  2 | import numpy as np
  3 | import time
  4 | import csv
  5 | import os
  6 | from multiprocessing import Pool
  7 | import statistics
  8 | from paddleocr import PaddleOCR,draw_ocr
  9 | from scipy import ndimage
 10 | # Allow duplicate OpenMP runtimes to be loaded (works around the KMP duplicate-library error)
 11 | os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
 12 | 
 13 | input_folder = "pdf_img/"
 14 | output_folder = "output"
 15 | out_csv = "table_1.csv"
 16 | ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=True, show_log=False)
 17 | 
 18 | 
 19 | def uuid(filename):
 20 |     # Build an id of the form "<pdf name>&&&<timestamp in ms>"
 21 |     time_str = str(int(round(time.time() * 1000)))
 22 |     file = filename.split(".pdf")[0]
 23 |     return file + "&&&" + time_str
 24 | 
 25 | 
 26 | def box_processing(img, lang):
 27 |     if lang == 'arabic':
 28 |         img = img[0:img.shape[0], int(img.shape[1]/2):img.shape[1]]
 29 | 
 30 |     # kernel = np.ones((1, 1), np.uint8)
 31 |     # dilateimg = cv2.dilate(img, kernel, iterations=30)
 32 | 
 33 |     # kernel1 = np.ones((3, 3), np.uint8)
 34 |     # erodeimg = cv2.erode(dilateimg, kernel1, iterations=1)
 35 | 
 36 |     # kernel = np.ones((3, 3), np.uint8)
 37 |     # dilateimg = cv2.dilate(erodeimg, kernel, iterations=1)
 38 |     return img
 39 | 
 40 | def sort_contours(cnts, method="left-to-right"):
 41 |     try:
 42 |         # initialize the reverse flag and sort index
 43 |         reverse = False
 44 |         i = 0
 45 | 
 46 |         # handle if we need to sort in reverse
 47 |         if method in ["right-to-left", "bottom-to-top"]:
 48 |             reverse = True
 49 | 
 50 |         # handle if we are sorting against the y-coordinate rather than
 51 |         # the x-coordinate of the bounding box
 52 |         if method in ["top-to-bottom", "bottom-to-top"]:
 53 |             i = 1
 54 | 
 55 |         # construct the list of bounding boxes and sort them along the
 56 |         # selected axis (x or y), in the selected direction
 57 |         boundingBoxes = [cv2.boundingRect(c) for c in cnts]
 58 |         cnts, boundingBoxes = zip(*sorted(zip(cnts, boundingBoxes),
 59 |                                           key=lambda b: b[1][i], reverse=reverse))
 60 | 
 61 |         # return the list of sorted contours and bounding boxes
 62 |         return cnts, boundingBoxes
 63 |     except Exception as e:
 64 |         print(f"Error: {e}")
 65 |         return cnts, []  # keep the (contours, boxes) shape so callers can still unpack
 66 | 
 67 | def draw_blue_line(img, rect_barcode):
 68 |     try:
 69 |         img_gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
 70 |         img_cny = cv2.Canny(img_gry, 150, 200)
 71 |         lns = cv2.ximgproc.createFastLineDetector().detect(img_cny)
 72 |         img_cpy = img.copy()
 73 | 
 74 |         if len(rect_barcode) > 0:
 75 |             if lns is not None:
 76 |                 for ln in lns:
 77 |                     x1 = int(ln[0][0])
 78 |                     y1 = int(ln[0][1])
 79 |                     x2 = int(ln[0][2])
 80 |                     y2 = int(ln[0][3])
 81 |                     cv2.line(img_cpy, pt1=(x1, y1), pt2=(x2, y2), color=(255, 0, 0), thickness=5)
 82 |                 return (1, img_cpy)
 83 | 
 84 |         elif lns is not None:
 85 |             for ln in lns:
 86 |                 x1 = int(ln[0][0])
 87 |                 y1 = int(ln[0][1])
 88 |                 x2 = int(ln[0][2])
 89 |                 y2 = int(ln[0][3])
 90 |                 cv2.line(img_cpy, pt1=(x1, y1), pt2=(x2, y2), color=(255, 0, 0), thickness=5)
 91 |             return (1, img_cpy)
 92 | 
 93 |         return (0, img_cpy)
 94 | 
 95 |     except Exception as e:
 96 |         print(f"Error: {e}")
 97 |         return (0, img)  # keep the (flag, image) shape so callers can still unpack
 98 | 
 99 | def v_remove_cnts(image, rect_barcode):
100 |     try:
101 |         # Initialize variables
102 |         limit_distance = 200
103 |         mask = np.ones(image.shape[:2], dtype="uint8") * 255
104 | 
105 |         # Find contours and sort by height
106 |         contours, hierarchy = cv2.findContours(
107 |             image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
108 |         h_arr = [cv2.boundingRect(cnt)[3] for cnt in contours]
109 |         h_arr.sort()
110 | 
111 |         # Cluster contours based on height
112 |         current_cluster = []
113 |         clusters = []
114 |         for i in range(len(h_arr)):
115 |             if not current_cluster or h_arr[i] - current_cluster[-1] <= limit_distance:
116 |                 current_cluster.append(h_arr[i])
117 |             else:
118 |                 clusters.append(current_cluster)
119 |                 current_cluster = [h_arr[i]]
120 |         clusters.append(current_cluster)
121 | 
122 |         # Find the largest cluster and average height
123 |         largest_cluster = max(clusters, key=len)
124 |         avg_value = sum(largest_cluster) / len(largest_cluster)
125 | 
126 |         # Remove contours outside of the largest cluster
127 |         for cnt in contours:
128 |             h = cv2.boundingRect(cnt)[3]
129 |             if h < avg_value - limit_distance or h > avg_value + limit_distance:
130 |                 cv2.drawContours(mask, [cnt], -1, 0, -1)
131 |         image = cv2.bitwise_and(image, image, mask=mask)
132 | 
133 |         # Draw first vertical line if barcode exists
134 |         if len(rect_barcode) > 0:
135 |             x_barcode_center = rect_barcode[0][0]
136 |             y_barcode_center = int(rect_barcode[0][1])
137 |             w_barcode = rect_barcode[1][0]
138 |             x_first_vertical = int(x_barcode_center - 350/2 - 8)
139 |             cv2.line(image, pt1=(x_first_vertical, y_barcode_center), pt2=(x_first_vertical, image.shape[0]), color=(255, 0, 0), thickness=10)
140 | 
141 |         return (1, image)
142 | 
143 |     except Exception as e:
144 |         print(f"Error in v_remove_cnts: {str(e)}")
145 |         return (0, image)
146 | 
147 | 
148 | def h_remove_cnts(image):
149 |     try:
150 |         mask = np.ones(image.shape[:2], dtype="uint8") * 255
151 |         contours, hierarchy = cv2.findContours(image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
152 |         for cnt in contours:
153 |             x, y, w, h = cv2.boundingRect(cnt)
154 |             if w < image.shape[1]:
155 |                 cv2.drawContours(mask, [cnt], -1, 0, -1)
156 |         image = cv2.bitwise_and(image, image, mask=mask)
157 |         return image
158 |     except Exception as e:
159 |         print(f"Error occurred in h_remove_cnts: {e}")
160 |         return None
161 | 
162 | 
163 | def find_0(s):
164 |     ch = "0"
165 |     return [i for i, ltr in enumerate(s) if ltr == ch]
166 | 
167 | def img_rotate(img, rect_barcode):
168 |     try:
169 |         angle = rect_barcode[2]
170 |     except IndexError:
171 |         return img
172 |     
173 |     # Rotate the original image based on the angle of the polygon
174 |     h, w = img.shape[:2]
175 |     center = (w // 2, h // 2)
176 |     if angle > 45:
177 |         angle1 = angle - 90 - 0.8
178 |     else:
179 |         angle1 = angle
180 |     M = cv2.getRotationMatrix2D(center, angle1, 1.0)
181 |     rotated = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
182 | 
183 |     return rotated
184 | 
185 | 
186 | def find_barcode(img, file_path):
187 |     rect_barcode = []  # defined before the try block so the except branch can always return it
188 |     try:
189 |         resizeimg = img[100:600, 20:800]
190 |         kernel = np.ones((1, 1), np.uint8)
191 |         dilateimg = cv2.dilate(resizeimg, kernel, iterations=3)
192 | 
193 |         kernel1 = np.ones((1, 7), np.uint8)
194 |         erodeimg = cv2.erode(dilateimg, kernel1, iterations=5)
195 | 
196 |         kernel2 = np.ones((7, 7), np.uint8)
197 |         dilateimg = cv2.dilate(erodeimg, kernel2, iterations=7)
198 | 
199 |         kernel1 = np.ones((7, 7), np.uint8)
200 |         erodeimg = cv2.erode(dilateimg, kernel1, iterations=7)
201 | 
202 |         gray = cv2.cvtColor(erodeimg, cv2.COLOR_BGR2GRAY)
203 | 
204 |         # Apply a threshold to the grayscale image to get a binary image
205 |         _, thresh = cv2.threshold(gray, 254, 255, cv2.THRESH_BINARY_INV)
206 | 
207 |         # Find the contours in the binary image
208 |         contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
209 |         rect_barcode = []
210 |         if len(contours) > 0:
211 |             # Filter the contours to find the box contour based on its shape and size
212 |             for contour in contours:
213 |                 # Approximate the contour to a polygon
214 |                 approx = cv2.approxPolyDP(contour, 0.01 * cv2.arcLength(contour, True), True)
215 | 
216 |                 # Check if the polygon has 4 vertices (i.e., is a quadrilateral) and has a certain minimum area
217 |                 if len(approx) == 4 and cv2.contourArea(approx) < 25000 and cv2.contourArea(approx) > 8000:
218 |                     # Draw the contour on the original image
219 |                     cv2.drawContours(erodeimg, [approx], 0, (0, 0, 255), 2)
220 |                     # Calculate the angle of the polygon
221 |                     rect_barcode = cv2.minAreaRect(approx)
222 |                     break
223 |             if not rect_barcode:
224 |                 print("Barcode not found. ", file_path)
225 |         else:
226 |             print("Barcode not found. ", file_path)
227 |         return rect_barcode
228 |     except Exception as e:
229 |         print(f"Error: {str(e)}")
230 |         return rect_barcode
231 | 
232 | 
233 | def reorganize_array(arr):
234 |     # Group the boxes into rows: start a new row when the Y gap to the previous box exceeds 40
235 |     # Sort each row and keep at most 7 boxes in it
236 |     group = []
237 |     subarray = [arr[0]]
238 | 
239 |     try:
240 |         for i in range(1, len(arr)):
241 |             if arr[i][1] - arr[i-1][1] > 40:
242 |                 subarray.sort()
243 |                 if len(subarray) >= 7:
244 |                     subarray = subarray[:7]
245 |                 group.append(subarray)
246 |                 subarray = []
247 |             subarray.append(arr[i])
248 | 
249 |         subarray.sort()
250 |         if len(subarray) >= 7:
251 |             subarray = subarray[:7]
252 |         group.append(subarray)
253 |     except Exception as e:
254 |         print(f"Error in reorganize_array: {e}")
255 |         return []
256 | 
257 |     return group
258 | 
259 | 
260 | # Table rotation
261 | def correct_skew(image, delta=2, limit=7):
262 |     def determine_score(arr, angle):
263 |         data = ndimage.rotate(arr, angle, reshape=False, order=0)
264 |         histogram = np.sum(data, axis=1, dtype=float)
265 |         score = np.sum((histogram[1:] - histogram[:-1]) ** 2, dtype=float)
266 |         return histogram, score
267 | 
268 |     try:
269 |         gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
270 |         thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1] 
271 | 
272 |         scores = []
273 |         angles = np.arange(-limit, limit + delta, delta)
274 |         for angle in angles:
275 |             histogram, score = determine_score(thresh, angle)
276 |             scores.append(score)
277 | 
278 |         best_angle = angles[scores.index(max(scores))]
279 | 
280 |         (h, w) = image.shape[:2]
281 |         center = (w // 2, h // 2)
282 |         M = cv2.getRotationMatrix2D(center, best_angle, 1.0)
283 |         corrected = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, \
284 |                 borderMode=cv2.BORDER_REPLICATE)
285 | 
286 |         return best_angle, corrected
287 |     except Exception as e:
288 |         print("Error: ", e)
289 |         return None, None
290 | 
291 | 
292 | 
293 | # Function for extracting the boxes
294 | def box_extraction(img_for_box_extraction_path):
295 |     print("==filename=====", img_for_box_extraction_path)
296 |     try:
297 |         # Split filename and extension
298 |         filename_no_xtensn, xtensn = os.path.splitext(img_for_box_extraction_path)
299 |         
300 |         # Check if extension is valid
301 |         if xtensn.lower() not in ['.jpg', '.jpeg']:
302 |             raise ValueError("Invalid image file extension")
303 |             
304 |         # Get PDF name from filename
305 |         filename_no_folder = filename_no_xtensn.split(input_folder)[1]
306 |         pdf_name = filename_no_folder.split("&&&")[0]
307 |         
308 |         # Create output directory if it doesn't exist
309 |         output_dir = output_folder + "_" + pdf_name
310 |         if not os.path.exists(output_dir):
311 |             os.makedirs(output_dir)
312 |         
313 |         # Load image and find barcode
314 |         output_img_path = output_folder + "_" + pdf_name + "/"
315 |         img_origin = cv2.imread(img_for_box_extraction_path)
316 |         rect_barcode = find_barcode(img_origin, img_for_box_extraction_path)
317 |         
318 |         # Rotate image if barcode is found
319 |         if len(rect_barcode) > 0:
320 |             print("=========found barcode=====", img_for_box_extraction_path)
321 |             img = img_rotate(img_origin, rect_barcode)
322 |             rect_barcode = find_barcode(img, img_for_box_extraction_path)
323 |             
324 |             # Draw the detected line segments in blue
325 |             green_flag, img1 = draw_blue_line(img, rect_barcode)
326 |             
327 |             if green_flag == 1:
328 |                 print("===1======green_flag == 1=====", img_for_box_extraction_path)
329 |                 # Threshold and binarize image
330 |                 Cimg_gray_para = [3, 3, 0]
331 |                 Cimg_blur_para = [150, 255]
332 |                 gray_img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
333 |                 blurred_img = cv2.GaussianBlur(gray_img, (Cimg_gray_para[0], Cimg_gray_para[1]), Cimg_gray_para[2])
334 |                 thresh, img_bin = cv2.threshold(blurred_img, 200, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
335 |                 img_bin = 255 - img_bin
336 |                 
337 |                 # Detect vertical and horizontal lines
338 |                 kernel_length = np.array(img).shape[1] // 40
339 |                 verticle_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length))
340 |                 hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
341 |                 kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
342 |                 
343 |                 # Morphological operations to detect lines
344 |                 img_temp1 = cv2.erode(img_bin, verticle_kernel, iterations=2)
345 |                 verticle_lines_img = cv2.dilate(img_temp1, verticle_kernel, iterations=20)
346 |                 verticle_lines_img = cv2.erode(verticle_lines_img, verticle_kernel, iterations=2)
347 |                 cnts_flag, v_cnts_img = v_remove_cnts(verticle_lines_img, rect_barcode)
348 |                 
349 |                 if cnts_flag == 1:
350 |                     print("====cnts_flag == 1====", img_for_box_extraction_path)
351 |                     # Morphological operation to detect horizontal lines from an image
352 |                     img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=2)
353 |                     horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=17)
354 |                     horizontal_lines_img = cv2.erode(horizontal_lines_img, hori_kernel, iterations=2)
355 |                     
356 |                     # Find valid cnts
357 |                     h_cnts_img = h_remove_cnts(horizontal_lines_img)
358 | 
359 |                     # Weighting parameters: these decide how much each image contributes to the blended image.
360 |                     alpha = 0.5
361 |                     beta = 1.0 - alpha
362 | 
363 |                     # Blend the vertical and horizontal line images with the weights above to get a single image.
364 |                     img_final_bin = cv2.addWeighted(v_cnts_img, alpha, h_cnts_img, beta, 0.0)
365 |                     img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2)
366 |                     (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
367 |                     cv2.waitKey(0)
368 | 
369 |                     # Find contours for image, which will detect all the boxes
370 |                     contours, hierarchy = cv2.findContours(
371 |                         img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
372 |                     # Sort all the contours by top to bottom.
373 |                     (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom")
374 | 
375 |                     boxes = []
376 |                     xx = []
377 |                     yy = []
378 |                     ww = []
379 |                     hh = []
380 |                     areaa = []
381 | 
382 |                     for c in contours:
383 |                         # Returns the location and width,height for every contour
384 |                         x, y, w, h = cv2.boundingRect(c)
385 |                         if x > 3 and w > 800 and (x + w) < 2000 and y > 50 and y+h < 3650 and h > 55 and w < 1000 and h < 100:
386 |                             xx.append(x)
387 | 
388 |                     first_column_x = 131
389 |                     if len(xx) > 0:
390 |                         first_column_x = statistics.mode(xx)
391 |                     if first_column_x > 130:
392 |                         for c in contours:
393 |                             # Returns the location and width,height for every contour
394 |                             x, y, w, h = cv2.boundingRect(c)
395 |                             area = w * h
396 |                             box_info = [x, y, w, h, area]
397 |                             # print(box_info)
398 |                             image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5)
399 |                             if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 10) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200:
400 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
401 |                                 boxes.append(box_info)
402 |                             elif x > 3 and w > 800 and (x + w) < 2050 and y > 40 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
403 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
404 |                                 boxes.append(box_info)
405 |                             elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
406 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
407 |                                 boxes.append(box_info)
408 |                     else:
409 |                         for c in contours:
410 |                             # Returns the location and width,height for every contour
411 |                             x, y, w, h = cv2.boundingRect(c)
412 |                             area = w * h
413 |                             box_info = [x, y, w, h, area]
414 |                             # print(box_info)
415 |                             image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5)
416 |                             if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 100) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200:
417 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
418 |                                 boxes.append(box_info)
419 |                             elif x > 3 and w > 800 and (x + w) < 2050 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
420 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
421 |                                 boxes.append(box_info)
422 |                             elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
423 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
424 |                                 boxes.append(box_info)
425 | 
426 |                     boxes_sorted_y = sorted(boxes, key=lambda x: x[1])
427 |                     box_array = boxes_sorted_y
428 | 
429 |                     if len(box_array) > 0:
430 |                         row_columns = reorganize_array(box_array)
431 |                     else:
432 |                         i = 1
433 |                         columns = []
434 |                         row_columns = []
435 |                         for box in boxes_sorted_y:
436 |                             columns.append(box)
437 |                             if i % 7 == 0:
438 |                                 boxes_sorted_x = sorted(columns, key=lambda x: x[0])
439 |                                 row_columns.append(boxes_sorted_x)
440 |                                 columns = []
441 |                             i += 1
442 | 
443 |                     for sub_boxes  in row_columns:
444 |                         for box  in sub_boxes:
445 |                             x = box[0]
446 |                             y = box[1]
447 |                             w = box[2]
448 |                             h = box[3]
449 |                             image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
450 | 
451 |                     ## Write red rect image
452 |                     time_str = str(int(round(time.time() * 1000)))
453 |                     w_filename = 'cnts/' +filename_no_folder+ '_' + '.png'
454 |                     cv2.imwrite(w_filename, image)
455 |                     
456 |                     idx = 0
457 |                     csv_row_col = []
458 |                     col = 0
459 |                     w_filename_base = output_img_path + filename_no_folder+ '_'
460 |                     for columns in row_columns:
461 |                         csv_cols = []
462 |                         if col == 0:
463 |                             row = 0
464 |                             for box in columns:
465 | 
466 |                                 idx += 1
467 |                                 new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
468 |                                 time_str = str(int(round(time.time() * 1000)))
469 |                                 # Unlabeled cells get a timestamped name so they do not overwrite the previous crop.
470 |                                 label = {0: 'Address', 3: 'Guardian', 4: 'Name'}.get(row)
471 |                                 if label:
472 |                                     w_filename = w_filename_base + str(idx) + '_' + label + '.png'
473 |                                 else:
474 |                                     w_filename = w_filename_base + time_str + '_' + str(idx) + '.png'
475 |                                 cv2.imwrite(w_filename, new_img)
476 |                                 csv_cols.append(filename_no_xtensn+ '_' +time_str+ '_' +str(idx) + '.png')
477 |                                 row += 1
478 |                         else:
479 |                             row = 0
480 |                             for box in columns:
481 |                                 if row  == 0 or row == 3 or row == 4:
482 |                                     idx += 1
483 |                                     new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
484 |                                     if row == 0:
485 |                                         w_filename = w_filename_base +str(idx) +'_Address.png'
486 |                                     if row == 3:
487 |                                         w_filename = w_filename_base +str(idx) +'_Guardian.png'
488 |                                     if row == 4:
489 |                                         w_filename = w_filename_base +str(idx) +'_Name.png'
490 |                                     cv2.imwrite(w_filename, new_img)
491 |                                     csv_cols.append(w_filename.split(output_img_path)[1])
492 |                                 else:
493 |                                     idx += 1
494 |                                     new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
495 | 
496 |                                     if row  == 1:
497 |                                         processed_box = box_processing(new_img, "arabic")
498 |                                     else:
499 |                                         processed_box = box_processing(new_img, "number")
500 |                                     
501 |                                     result = ocr.ocr(processed_box, cls=False)
502 |                                     txts = [line[1][0] for line in (result[0] or [])]  # result[0] may be None when no text is detected
503 |                                     if len(txts) == 0:
504 |                                         data = ''
505 |                                     else:
506 |                                         data = txts[0]
507 |                                     csv_cols.append(data)
508 |                                     font = cv2.FONT_HERSHEY_SIMPLEX
509 |                                     cv2.putText(img, data, (box[0],box[1]), font, 1, (255, 0, 0), 2, cv2.LINE_AA)
510 |                                 row += 1
511 |                             # Add page number to last column
512 |                             csv_cols.append(filename_no_folder.split("-")[1].split(".")[0])
513 |                         csv_row_col.append(csv_cols)
514 |                         col += 1
515 |                     with open(pdf_name + ".csv", 'a', newline='') as f:
516 |                         writer = csv.writer(f, delimiter=',')
517 |                         writer.writerows(csv_row_col)  # csv_row_col is a list of rows, each a list of cell values
518 |                 else:
519 |                     print("no table in ", output_img_path+filename_no_folder)
520 | 
521 |                 ## Write ocr result to the image
522 |                 w_filename = 'text/' +filename_no_folder+ '_' + '.png'
523 |                 cv2.imwrite(w_filename, img)
524 |             
525 |         else:
526 |             print("=========no barcode found=====")
527 |             angle, img = correct_skew(img_origin)
528 |             img1 = img  # Read the image
529 |             rect_barcode = []
530 |             (green_flag, img1) = draw_blue_line(img1, rect_barcode)
531 | 
532 |             if green_flag == 1:
533 |                 print("=========green_flag == 1=====")
534 |                 Cimg_gray_para = [3, 3, 0]
535 |                 Cimg_blur_para = [150, 255]
536 | 
537 |                 gray_img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
538 |                 blurred_img = cv2.GaussianBlur(gray_img, (Cimg_gray_para[0], Cimg_gray_para[1]), Cimg_gray_para[2])
539 |                 (thresh, img_bin) = cv2.threshold(blurred_img, 200, 255,
540 |                                                 cv2.THRESH_BINARY | cv2.THRESH_OTSU)  # Thresholding the image
541 |                 img_bin = 255-img_bin  # Invert the image
542 | 
543 |                 # Defining a kernel length
544 |                 kernel_length = np.array(img).shape[1]//40
545 |                 
546 |                 # A vertical kernel of (1 x kernel_length), which detects the vertical lines in the image.
547 |                 verticle_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length))
548 |                 # A horizontal kernel of (kernel_length x 1), which detects the horizontal lines in the image.
549 |                 hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
550 |                 # A rectangular kernel of (3 x 3) ones.
551 |                 kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
552 | 
553 |                 # Morphological operations to detect vertical lines in the image
554 |                 img_temp1 = cv2.erode(img_bin, verticle_kernel, iterations=2)
555 |                 verticle_lines_img = cv2.dilate(img_temp1, verticle_kernel, iterations=20)
556 |                 verticle_lines_img = cv2.erode(verticle_lines_img, verticle_kernel, iterations=2)
557 | 
558 |                 # Find valid cnts
559 |                 (cnts_flag, v_cnts_img) = v_remove_cnts(verticle_lines_img, rect_barcode)
560 | 
561 |                 if cnts_flag == 1:
562 |                     # Morphological operation to detect horizontal lines from an image
563 |                     img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=2)
564 |                     horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=17)
565 |                     horizontal_lines_img = cv2.erode(horizontal_lines_img, hori_kernel, iterations=2)
566 |                     
567 |                     # Find valid cnts
568 |                     h_cnts_img = h_remove_cnts(horizontal_lines_img)
569 | 
570 |                     # Blend weights: alpha applies to the vertical-line image, beta to the horizontal-line image.
571 |                     alpha = 0.5
572 |                     beta = 1.0 - alpha
573 | 
574 |                     # cv2.addWeighted computes v_cnts_img*alpha + h_cnts_img*beta + 0.0, merging both line images into one grid image.
575 |                     img_final_bin = cv2.addWeighted(v_cnts_img, alpha, h_cnts_img, beta, 0.0)
576 |                     img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2)
577 |                     (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
578 | 
579 |                     # Find contours for image, which will detect all the boxes
580 |                     contours, hierarchy = cv2.findContours(
581 |                         img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
582 |                     # Sort all the contours by top to bottom.
583 |                     (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom")
584 | 
585 |                     boxes = []
586 |                     xx = []
587 |                     yy = []
588 |                     ww = []
589 |                     hh = []
590 |                     areaa = []
591 | 
592 |                     for c in contours:
593 |                         # Returns the location and width,height for every contour
594 |                         x, y, w, h = cv2.boundingRect(c)
595 |                         if x > 3 and w > 800 and (x + w) < 2000 and y > 50 and y+h < 3650 and h > 55 and w < 1000 and h < 100:
596 |                             xx.append(x)
597 | 
598 |                     first_column_x = 131
599 |                     if len(xx) > 0:
600 |                         first_column_x = statistics.mode(xx)
601 |                     if first_column_x > 130:
602 |                         for c in contours:
603 |                             # Returns the location and width,height for every contour
604 |                             x, y, w, h = cv2.boundingRect(c)
605 |                             area = w * h
606 |                             box_info = [x, y, w, h, area]
607 |                             # print(box_info)
608 |                             image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5)
609 |                             if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 10) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200:
610 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
611 |                                 boxes.append(box_info)
612 |                             elif x > 3 and w > 800 and (x + w) < 2050 and y > 40 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
613 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
614 |                                 boxes.append(box_info)
615 |                             elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
616 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
617 |                                 boxes.append(box_info)
618 |                     else:
619 |                         for c in contours:
620 |                             # Returns the location and width,height for every contour
621 |                             x, y, w, h = cv2.boundingRect(c)
622 |                             area = w * h
623 |                             box_info = [x, y, w, h, area]
624 |                             # print(box_info)
625 |                             image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5)
626 |                             if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 100) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200:
627 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
628 |                                 boxes.append(box_info)
629 |                             elif x > 3 and w > 800 and (x + w) < 2050 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
630 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
631 |                                 boxes.append(box_info)
632 |                             elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
633 |                                 # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
634 |                                 boxes.append(box_info)
635 | 
636 |                     boxes_sorted_y = sorted(boxes, key=lambda x: x[1])
637 |                     box_array = boxes_sorted_y
638 | 
639 |                     ## Sort boxes by x and make rows
640 |                     i = 1
641 |                     columns = []
642 |                     row_columns = []
643 |                     for box in boxes_sorted_y:
644 |                         columns.append(box)
645 |                         if i % 7 == 0:
646 |                             boxes_sorted_x = sorted(columns, key=lambda x: x[0])
647 |                             row_columns.append(boxes_sorted_x)
648 |                             columns = []
649 |                         i += 1
650 | 
651 |                     for sub_boxes  in row_columns:
652 |                         for box  in sub_boxes:
653 |                             x = box[0]
654 |                             y = box[1]
655 |                             w = box[2]
656 |                             h = box[3]
657 |                             image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
658 | 
659 |                     ## Write red rect image
660 |                     time_str = str(int(round(time.time() * 1000)))
661 |                     w_filename = 'cnts/' +filename_no_folder+ '_' + '.png'
662 |                     # cv2.imwrite(w_filename, image)
663 | 
664 |                     idx = 0
665 |                     csv_row_col = []
666 |                     col = 0
667 |                     w_filename_base = output_img_path + filename_no_folder+ '_'
668 |                     for columns in row_columns:
669 |                         csv_cols = []
670 |                         if col == 0:
671 |                             row = 0
672 |                             for box in columns:
673 | 
674 |                                 idx += 1
675 |                                 new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
676 |                                 time_str = str(int(round(time.time() * 1000)))
677 |                                 # Label the Address/Guardian/Name cells; every other cell gets a
678 |                                 # timestamped name so it does not overwrite the previous crop.
679 |                                 label = {0: 'Address', 3: 'Guardian', 4: 'Name'}.get(row)
680 |                                 if label:
681 |                                     w_filename = w_filename_base + str(idx) + '_' + label + '.png'
682 |                                 else:
683 |                                     w_filename = w_filename_base + time_str + '_' + str(idx) + '.png'
684 |                                 cv2.imwrite(w_filename, new_img)
685 |                                 csv_cols.append(filename_no_xtensn+ '_' +time_str+ '_' +str(idx) + '.png')
686 |                                 row += 1
687 |                         else:
688 |                             row = 0
689 |                             for box in columns:
690 |                                 if row  == 0 or row == 3 or row == 4:
691 |                                     idx += 1
692 |                                     new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
693 |                                     if row == 0:
694 |                                         w_filename = w_filename_base +str(idx) +'_Address.png'
695 |                                     if row == 3:
696 |                                         w_filename = w_filename_base +str(idx) +'_Guardian.png'
697 |                                     if row == 4:
698 |                                         w_filename = w_filename_base +str(idx) +'_Name.png'
699 |                                     cv2.imwrite(w_filename, new_img)
700 |                                     csv_cols.append(w_filename.split(output_img_path)[1])
701 |                                 else:
702 |                                     idx += 1
703 |                                     new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
704 | 
705 |                                     if row  == 1:
706 |                                         processed_box = box_processing(new_img, "arabic")
707 |                                     else:
708 |                                         processed_box = box_processing(new_img, "number")
709 |                                     
710 |                                     result = ocr.ocr(processed_box, cls=False)
711 |                                     txts = [line[1][0] for line in (result[0] or [])]  # result[0] may be None when no text is detected
712 |                                     if len(txts) == 0:
713 |                                         data = ''
714 |                                     else:
715 |                                         data = txts[0]
716 |                                     csv_cols.append(data)
717 |                                     font = cv2.FONT_HERSHEY_SIMPLEX
718 |                                     cv2.putText(img, data, (box[0],box[1]), font, 1, (255, 0, 0), 2, cv2.LINE_AA)
719 |                                 row += 1
720 |                             # Add page number to last column
721 |                             csv_cols.append(filename_no_folder.split("-")[1].split(".")[0])
722 |                         csv_row_col.append(csv_cols)
723 |                         col += 1
724 |                     with open(pdf_name + ".csv", 'a', newline='') as f:
725 |                         writer = csv.writer(f, delimiter=',')
726 |                         writer.writerows(csv_row_col)  # csv_row_col is a list of rows, each a list of cell values
727 |                 else:
728 |                     print("no table in ", output_img_path+filename_no_folder)
729 |             else:
730 |                 print("no table in ", output_img_path+filename_no_folder)
731 | 
732 |             ## Write ocr result to the image
733 |             w_filename = 'text/' +filename_no_folder+ '_' + '.png'
734 |             cv2.imwrite(w_filename, img)
735 | 
736 |     except (FileNotFoundError, ValueError) as e:
737 |         print(f"Error: {e}")
738 |     except Exception as e:
739 |         print("An unexpected error occurred:", e)
740 |     finally:
741 |         cv2.destroyAllWindows()
742 | 
743 | 
744 | def process_files(files):
745 |     for filename in files:
746 |         file_path = os.path.join(input_folder, filename)
747 |         try:
748 |             box_extraction(file_path)
749 |         except Exception as e:
750 |             print(f"Error processing {file_path}: {e}")
751 | 
752 | 
753 | def main_threading():
754 |     filenames = os.listdir(input_folder)
755 |     if len(filenames) < 5:
756 |         for filename in filenames:
757 |             file_path = os.path.join(input_folder, filename)
758 |             try:
759 |                 box_extraction(file_path)
760 |             except Exception as e:
761 |                 print(f"Error processing {file_path}: {e}")
762 |     else:
763 |         count_processes = 5
764 |         step = len(filenames) // count_processes
765 |         file_groups = [filenames[i*step:(i+1)*step] for i in range(count_processes)]
766 |         if len(filenames) % count_processes != 0:
767 |             file_groups[-1].extend(filenames[count_processes*step:])
768 |         with Pool(count_processes) as p:
769 |             p.map(process_files, file_groups)
770 | 
771 | 
772 | def main_single():
773 |     # Sequential fallback: process every file in this process, one at a time.
774 |     for filename in os.listdir(input_folder):
775 |         file_path = os.path.join(input_folder, filename)
776 |         box_extraction(file_path)
777 | 
778 | 
779 | if __name__ == '__main__':
780 |     print("Reading image..")
781 |     main_threading()
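
Note on the two list-partitioning steps in the script above: the grouping of detected boxes into table rows of 7 cells, and the splitting of the input filenames across 5 worker processes, are both written inline. The sketch below restates them as two small pure-Python helpers. The names `group_rows` and `split_into_groups` are hypothetical and do not exist in the repository; the behavior is intended to mirror the inline loops (incomplete trailing rows are dropped, and leftover filenames go to the last group), but it is an illustration under those assumptions, not the author's implementation.

```python
from typing import List, Sequence


def group_rows(boxes: Sequence[Sequence[int]], per_row: int = 7) -> List[List[Sequence[int]]]:
    """Sort boxes top to bottom, cut them into rows of `per_row`,
    then sort each row left to right.

    Each box is [x, y, w, h, area], as in the script. Boxes left over
    after the last complete row are dropped, as in the inline loop.
    """
    by_y = sorted(boxes, key=lambda b: b[1])
    rows = []
    for start in range(0, len(by_y) - per_row + 1, per_row):
        row = by_y[start:start + per_row]
        rows.append(sorted(row, key=lambda b: b[0]))
    return rows


def split_into_groups(items: Sequence[str], n_groups: int) -> List[List[str]]:
    """Split items into n_groups consecutive chunks of equal size;
    any remainder is appended to the last chunk."""
    step = len(items) // n_groups
    groups = [list(items[i * step:(i + 1) * step]) for i in range(n_groups)]
    groups[-1].extend(items[n_groups * step:])
    return groups


if __name__ == "__main__":
    # 12 hypothetical filenames split for 5 workers.
    names = [f"page-{i:02d}.jpg" for i in range(12)]
    for group in split_into_groups(names, 5):
        print(group)
```

Keeping these steps in separate functions would let both the barcode and no-barcode branches share one implementation, and makes the partitioning testable without OpenCV or PaddleOCR installed.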


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866607_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866607_1.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866609_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866609_2.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866609_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866609_3.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866610_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866610_4.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866611_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866611_5.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866612_6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866612_6.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866612_7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866612_7.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866613_8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866613_8.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866939_11.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866939_11.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866940_12.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866940_12.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120867285_15.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867285_15.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120867593_18.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867593_18.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120867593_19.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867593_19.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120867886_22.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867886_22.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120868209_25.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868209_25.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120868210_26.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868210_26.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120868507_29.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868507_29.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120868810_32.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868810_32.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120868810_33.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868810_33.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120869094_36.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869094_36.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120869422_39.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869422_39.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120869423_40.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869423_40.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120869713_43.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869713_43.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120870040_46.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870040_46.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120870041_47.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870041_47.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120870348_50.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870348_50.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120870654_53.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870654_53.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120870655_54.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870655_54.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120870955_57.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870955_57.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120871261_60.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871261_60.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120871262_61.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871262_61.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120871555_64.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871555_64.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120871887_67.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871887_67.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120871888_68.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871888_68.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120872207_71.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872207_71.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120872548_74.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872548_74.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120872549_75.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872549_75.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120872851_78.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872851_78.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120873167_81.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120873167_81.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120873168_82.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120873168_82.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120873796_85.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120873796_85.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120874124_88.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874124_88.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120874125_89.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874125_89.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120874563_92.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874563_92.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120874891_95.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874891_95.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120874892_96.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874892_96.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120875187_99.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875187_99.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120875586_102.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875586_102.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120875587_103.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875587_103.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120875919_106.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875919_106.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120876292_109.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876292_109.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120876293_110.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876293_110.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120876586_113.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876586_113.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120876921_116.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876921_116.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120876922_117.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876922_117.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120877270_120.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877270_120.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120877615_123.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877615_123.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120877616_124.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877616_124.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120877910_127.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877910_127.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120878219_130.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878219_130.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120878221_131.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878221_131.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120878619_134.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878619_134.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120878965_137.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878965_137.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120878965_138.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878965_138.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120879303_141.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879303_141.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120879677_144.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879677_144.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120879678_145.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879678_145.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120879968_148.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879968_148.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120880337_151.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120880337_151.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120880338_152.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120880338_152.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120880651_155.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120880651_155.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120881022_158.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881022_158.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120881023_159.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881023_159.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120881357_162.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881357_162.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120881749_165.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881749_165.png


--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120881749_166.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881749_166.png


--------------------------------------------------------------------------------
/pdf_img/16544460977400001-03.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/pdf_img/16544460977400001-03.jpg


--------------------------------------------------------------------------------
/pdf_to_img.py:
--------------------------------------------------------------------------------
 1 | from pdf2image import convert_from_path
 2 | import time
 3 | 
 4 | jpegopt={
 5 |     "quality": 100,
 6 |     "progressive": True,
 7 |     "optimize": True
 8 | }
 9 | 
10 | time_str = str(int(round(time.time() * 1000)))
11 | 
12 | print("Loading PDF file...")
13 | convert_from_path("shamaryati code 259200305.pdf",
14 |     dpi=320,
15 |     output_folder="pdf_img/",
16 |     first_page=3,
17 |     last_page=56,
18 |     fmt="jpeg",
19 |     jpegopt=jpegopt,
20 |     thread_count=5,
21 |     userpw=None,
22 |     use_cropbox=False,
23 |     strict=False,
24 |     transparent=False,
25 |     single_file=False,
26 |     output_file=time_str,
27 |     poppler_path=r'C:\Program Files\poppler-0.68.0\bin',
28 |     grayscale=False,
29 |     size=None,
30 |     paths_only=False,
31 |     hide_annotations=False,)
32 | 
33 | print("Done!")
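
A possible, more portable variant of the conversion call above, given only as a sketch. The names `build_convert_kwargs` and `convert`, and the `POPPLER_PATH` environment variable, are hypothetical and not part of this repository. The keyword arguments mirror the ones the script passes to `convert_from_path`. The only change is that the Poppler location is read from the environment instead of a hard-coded Windows path.

```python
import os
import time


def build_convert_kwargs(out_dir="pdf_img/", first_page=3, last_page=56, env=None):
    """Collect keyword arguments for pdf2image.convert_from_path.

    POPPLER_PATH is a hypothetical environment variable introduced for this
    sketch; when it is unset, poppler_path stays None and pdf2image looks for
    the Poppler binaries on PATH.
    """
    env = os.environ if env is None else env
    return {
        "dpi": 320,
        "output_folder": out_dir,
        "first_page": first_page,
        "last_page": last_page,
        "fmt": "jpeg",
        "jpegopt": {"quality": 100, "progressive": True, "optimize": True},
        "thread_count": 5,
        # Millisecond timestamp used as the output file prefix, as in the script.
        "output_file": str(int(round(time.time() * 1000))),
        "poppler_path": env.get("POPPLER_PATH") or None,
    }


def convert(pdf_path, **overrides):
    # Imported lazily so the helper above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    kwargs = build_convert_kwargs()
    kwargs.update(overrides)
    # Make sure the output folder exists before pdf2image writes into it.
    os.makedirs(kwargs["output_folder"], exist_ok=True)
    return convert_from_path(pdf_path, **kwargs)
```

With this shape, a call such as `convert("input.pdf", last_page=10)` would override only the page range, and the same script could run on Windows (with `POPPLER_PATH` set) or on a system where Poppler is already on `PATH`.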


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.22.4
2 | opencv-contrib-python==4.5.5.64
3 | packaging==21.3
4 | pdf2image==1.16.0
5 | Pillow==9.1.1
6 | pyparsing==3.0.9
7 | pytesseract==0.3.9
8 | imutils


--------------------------------------------------------------------------------
/start_ocr1.py:
--------------------------------------------------------------------------------
  1 | import cv2
  2 | import numpy as np
  3 | import time
  4 | import csv
  5 | import pytesseract
  6 | import os
  7 | import threading
  8 | from PIL import Image, ImageOps, ImageFilter 
  9 | import imutils
 10 | from cnn_ocr1 import predict
 11 | ## Init
 12 | jpegopt={
 13 |     "quality": 100,
 14 |     "progressive": True,
 15 |     "optimize": True
 16 | }
 17 | 
 18 | global thread_kill_flags
 19 | input_folder = "pdf_img/"
 20 | output_folder = "output_img/"
 21 | out_csv = "table_1.csv"
 22 | # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
 23 | 
 24 | 
 25 | def uuid(filename):
 26 |     time_str = str(int(round(time.time() * 1000)))
 27 |     file = filename.split(".pdf")[0]
 28 |     id = file + "&&&" + time_str
 29 |     return id
 30 | 
 31 | 
 32 | def img_preprocessing(img):
 33 |     gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
 34 |     ret, thresh = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)
 35 |     # noise removal
 36 |     kernel = np.ones((1,1),np.uint8)
 37 |     opening = cv2.morphologyEx(thresh,cv2.MORPH_OPEN,kernel, iterations = 1)
 38 |     # sure background area
 39 |     sure_bg = cv2.dilate(opening,kernel,iterations=4)
 40 |     # Finding sure foreground area
 41 |     dist_transform = cv2.distanceTransform(opening,cv2.DIST_L2,5)
 42 |     ret, sure_fg = cv2.threshold(dist_transform,0.2*dist_transform.max(),255,0)
 43 |     # Finding unknown region
 44 |     sure_fg = np.uint8(sure_fg)
 45 |     sure_fg = 255 - sure_fg
 46 | 
 47 |     return sure_fg
 48 | 
 49 | 
 50 | def caculate_time_difference(start_milliseconds, end_milliseconds, filename):
 51 |     if filename == 'total':
 52 |         diff_milliseconds = int(end_milliseconds) - int(start_milliseconds)
 53 |         seconds = (diff_milliseconds / 1000) % 60
 54 |         minutes = (diff_milliseconds // (1000 * 60)) % 60  # whole minutes
 55 |         hours = (diff_milliseconds // (1000 * 60 * 60)) % 24  # whole hours
 56 |         print("Total run time", hours,":",minutes,":",seconds)
 57 |     else:
 58 |         diff_milliseconds = int(end_milliseconds) - int(start_milliseconds)
 59 |         seconds=(diff_milliseconds / 1000) % 60
 60 |         print(seconds, "s", filename)
 61 | 
 62 | 
 63 | def sort_contours(cnts, method="left-to-right"):
 64 |     # initialize the reverse flag and sort index
 65 |     reverse = False
 66 |     i = 0
 67 | 
 68 |     # handle if we need to sort in reverse
 69 |     if method == "right-to-left" or method == "bottom-to-top":
 70 |         reverse = True
 71 | 
 72 |     # handle if we are sorting against the y-coordinate rather than
 73 |     # the x-coordinate of the bounding box
 74 |     if method == "top-to-bottom" or method == "bottom-to-top":
 75 |         i = 1
 76 | 
 77 |     # construct the list of bounding boxes and sort them along the
 78 |     # chosen axis (index i: 0 sorts by x, 1 sorts by y)
 79 |     boundingBoxes = [cv2.boundingRect(c) for c in cnts]
 80 |     (cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes),
 81 |                                         key=lambda b: b[1][i], reverse=reverse))
 82 | 
 83 |     # return the list of sorted contours and bounding boxes
 84 |     return (cnts, boundingBoxes)
 85 | 
 86 | 
 87 | def draw_green_line(img):
 88 |     img_gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
 89 |     img_cny = cv2.Canny(img_gry, 50, 200)
 90 |     lns = cv2.ximgproc.createFastLineDetector().detect(img_cny)
 91 |     img_cpy = img.copy()
 92 |     if lns is not None:
 93 |         for ln in lns:
 94 |             x1 = int(ln[0][0])
 95 |             y1 = int(ln[0][1])
 96 |             x2 = int(ln[0][2])
 97 |             y2 = int(ln[0][3])
 98 |             cv2.line(img_cpy, pt1=(x1, y1), pt2=(x2, y2), color=(255, 0, 0), thickness=5)  # BGR order: (255, 0, 0) is blue, despite the function name
 99 |         return (1, img_cpy)
100 |     else:
101 |         return (0, img_cpy)
102 | 
103 | 
104 | def v_remove_cnts(image):
105 |     mask = np.ones(image.shape[:2], dtype="uint8") * 255
106 |     contours, hierarchy = cv2.findContours(
107 |         image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
108 |     if len(contours) != 0:
109 |         h_arr = []
110 |         for cnt in contours:
111 |             x, y, w, h = cv2.boundingRect(cnt)
112 |             h_arr.append(h)
113 |         mode_h = (max(set(h_arr), key = h_arr.count))
114 |         for cnt in contours:
115 |             x, y, w, h = cv2.boundingRect(cnt)
116 |             # Mask out contours much shorter than the most common height.
117 |             if h < mode_h - 50:
118 |                 cv2.drawContours(mask, [cnt], -1, 0, -1)
119 |         image = cv2.bitwise_and(image, image, mask=mask)
120 |         return (1, image)
121 |     else:
122 |         return (0, image)
123 | 
124 | 
125 | def h_remove_cnts(image):
126 |     mask = np.ones(image.shape[:2], dtype="uint8") * 255
127 |     contours, hierarchy = cv2.findContours(
128 |         image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
129 |     for cnt in contours:
130 |         x, y, w, h = cv2.boundingRect(cnt)
131 |         # Keep only lines at least 2640 px wide; narrower contours are masked out.
132 |         if w < 2640:
133 |             cv2.drawContours(mask, [cnt], -1, 0, -1)
134 |     image = cv2.bitwise_and(image, image, mask=mask)
135 |     return image
136 | 
137 | 
138 | def find_0(s):
139 |     ch = "0"
140 |     return [i for i, ltr in enumerate(s) if ltr == ch]
141 | 
142 | 
143 | def recognition_num(img, i, data):
144 |     dim = (32, 32)
145 |     resized = cv2.resize(img, dim, interpolation = cv2.INTER_AREA)
146 |     
147 |     value = predict(resized)
148 |     if value != "n" and value != "m":
149 |         if value is None:
150 |             time_str = str(int(round(time.time() * 1000)))
151 |             name = "numbers/" + str(time_str) + str(i) + ".jpg"
152 |             cv2.imwrite(name, resized)
153 |             value = "_"
154 |         data += str(value)
155 |     # else:
156 |     #     time_str = str(int(round(time.time() * 1000)))
157 |     #     name = "n_m/" + str(time_str) + str(i) + ".jpg"
158 |     #     cv2.imwrite(name, resized) 
159 |     
160 |     if value is None:  # unreachable: a None value was already replaced by "_" above
161 |         time_str = str(int(round(time.time() * 1000)))
162 |         name = "numbers/" + str(time_str) + str(i) + ".jpg"
163 |         cv2.imwrite(name, resized)
164 |     return data
165 | 
166 | 
167 | def split_number(img):
168 |     kernel = np.ones((1, 1), np.uint8)
169 |     img = cv2.dilate(img, kernel, iterations=2)
170 |     img = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)[1]
171 |     cnts = cv2.findContours(img.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
172 |     cnts = imutils.grab_contours(cnts)
173 |     data = ''
174 |     if len(cnts) != 0:
175 |         cnts, boundingBoxes = sort_contours(cnts, method="left-to-right")
176 |         
177 |         if len(cnts) == 1:
178 |             return 0
179 |         else:
180 |             i = 1
181 |             for c in cnts:
182 |                 x, y, w, h = cv2.boundingRect(c)
183 |                 if w > 3 and h > 15 and w < 80 and h < 60:
184 |                     
185 |                     if w > 3 and w < 24:
186 |                         new_img = img[y:y+h, x:x+w]
187 |                         data = recognition_num(new_img, i, data)
188 | 
189 |                     elif w > 23 and w < 43:
190 |                         new_img1 = img[y:y+h, x:x+int(w/2)+2]
191 |                         data = recognition_num(new_img1, i, data)
192 |                         new_img2 = img[y:y+h, x+int(w/2)-1:x+w]
193 |                         data = recognition_num(new_img2, i + 1000, data)
194 | 
195 |                     elif w > 42 and w < 65:
196 |                         new_img1 = img[y:y+h, x:x+int(w/3)+2]
197 |                         data = recognition_num(new_img1, i, data)
198 |                         new_img2 = img[y:y+h, x+int(w/3)-1:x+int(w*2/3)+2]
199 |                         data = recognition_num(new_img2, i + 1000, data)
200 |                         new_img3 = img[y:y+h, x+int(w*2/3)-1:x+w]
201 |                         data = recognition_num(new_img3, i + 2000, data)
202 |                     
203 |                     elif w > 64 and w < 80:
204 |                         new_img1 = img[y:y+h, x:x+int(w/4)+2]
205 |                         data = recognition_num(new_img1, i, data)
206 |                         new_img2 = img[y:y+h, x+int(w/4)-1:x+int(w*2/4)+2]
207 |                         data = recognition_num(new_img2, i + 1000, data)
208 |                         new_img3 = img[y:y+h, x+int(w*2/4)-1:x+int(w*3/4)]
209 |                         data = recognition_num(new_img3, i + 2000, data)
210 |                         new_img4 = img[y:y+h, x+int(w*3/4)-1:x+w]
211 |                         data = recognition_num(new_img4, i + 3000, data)
212 |                     
213 |                     i += 1
214 |     return data
215 | 
216 | # Function for extracting the box
217 | def box_extraction(img_for_box_extraction_path, cropped_dir_path):
218 |     filename_no_xtensn, dot_xtensn = os.path.splitext(img_for_box_extraction_path)
219 |     xtensn = dot_xtensn.lstrip(".")  # extension without the leading dot
220 |     if xtensn.lower() in ("jpg", "jpeg"):
221 |         filename_no_folder = filename_no_xtensn.split(input_folder)[1]
222 |         img = cv2.imread(img_for_box_extraction_path)  # Read the image
223 |         img1 = cv2.imread(img_for_box_extraction_path)  # Read the image
224 |         (green_flag, img1) = draw_green_line(img1)
225 | 
226 |         if green_flag == 1:
227 |             Cimg_gray_para = [3, 3, 0]
228 |             Cimg_blur_para = [150, 255]
229 | 
230 |             gray_img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
231 |             blurred_img = cv2.GaussianBlur(gray_img, (Cimg_gray_para[0], Cimg_gray_para[1]), Cimg_gray_para[2])
232 |             (thresh, img_bin) = cv2.threshold(blurred_img, 200, 255,
233 |                                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)  # Otsu picks the threshold, so the 200 above is ignored
234 |             img_bin = 255-img_bin  # Invert the image
235 | 
236 |             # Defining a kernel length
237 |             kernel_length = np.array(img).shape[1]//40
238 |             
239 |             # A vertical kernel of (1 x kernel_length), which detects the vertical lines in the image.
240 |             verticle_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length))
241 |             # A horizontal kernel of (kernel_length x 1), which detects the horizontal lines in the image.
242 |             hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
243 |             # A kernel of (3 x 3) ones.
244 |             kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
245 | 
246 |             # Morphological operation to detect vertical lines in the image
247 |             img_temp1 = cv2.erode(img_bin, verticle_kernel, iterations=2)
248 |             verticle_lines_img = cv2.dilate(img_temp1, verticle_kernel, iterations=20)
249 |             verticle_lines_img = cv2.erode(verticle_lines_img, verticle_kernel, iterations=2)
250 | 
251 |             # Find valid cnts
252 |             (cnts_flag, v_cnts_img) = v_remove_cnts(verticle_lines_img)
253 | 
254 |             if cnts_flag == 1:
255 |                 # Morphological operation to detect horizontal lines from an image
256 |                 img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=2)
257 |                 horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=10)
258 |                 horizontal_lines_img = cv2.erode(horizontal_lines_img, hori_kernel, iterations=2)
259 |                 
260 |                 # Find valid cnts
261 |                 h_cnts_img = h_remove_cnts(horizontal_lines_img)
262 | 
263 |                 # Blend weights: how much each line image contributes to the combined image.
264 |                 alpha = 0.5
265 |                 beta = 1.0 - alpha
266 | 
267 |                 # addWeighted computes alpha * v_cnts_img + beta * h_cnts_img to merge the two line images into one.
268 |                 img_final_bin = cv2.addWeighted(v_cnts_img, alpha, h_cnts_img, beta, 0.0)
269 |                 img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2)
270 |                 (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
271 | 
272 |                 # Find contours for image, which will detect all the boxes
273 |                 contours, hierarchy = cv2.findContours(
274 |                     img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
275 |                 # Sort all the contours from top to bottom.
276 |                 (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom")
277 | 
278 |                 ## Find suitable boxes
279 |                 boxes = []
280 |                 xx = []
281 |                 yy = []
282 |                 ww = []
283 |                 hh = []
284 |                 areaa = []
285 |                 for c in contours:
286 |                     # Bounding-box position (x, y) and size (w, h) of the contour
287 |                     x, y, w, h = cv2.boundingRect(c)
288 |                     area = w * h
289 |                     box_info = [x, y, w, h, area]
290 |                     # print(box_info)
291 |                     image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5)
292 |                     if x > 50 and x < 2500 and (x + w) < 2640 and y > 50 and w > 70 and h > 55 and w < 1000 and h < 200:
293 |                         image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
294 |                         boxes.append(box_info)
295 |                     elif x > 3 and w > 800 and (x + w) < 2640 and y > 50 and h > 55 and w < 1000 and h < 200:
296 |                         image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
297 |                         boxes.append(box_info)
298 |                 boxes_sorted_y = sorted(boxes, key=lambda x: x[1])
299 | 
300 |                 ## Sort boxes by x and make rows
301 |                 i = 1
302 |                 columns = []
303 |                 row_columns = []
304 |                 for box in boxes_sorted_y:
305 |                     columns.append(box)
306 |                     if i % 7 == 0:
307 |                         boxes_sorted_x = sorted(columns, key=lambda x: x[0])
308 |                         row_columns.append(boxes_sorted_x)
309 |                         columns = []
310 |                     i += 1
311 | 
312 |                 idx = 0
313 |                 csv_row_col = []
314 |                 col = 0
315 |                 for columns in row_columns:
316 |                     csv_cols = []
317 |                     if col == 0:
318 |                         row = 0
319 |                         for box in columns:
320 |                             idx += 1
321 |                             new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
322 |                             time_str = str(int(round(time.time() * 1000)))
323 |                             w_filename = cropped_dir_path+filename_no_folder+ '_' +time_str+ '_' +str(idx) + '.png'  # default name; rows 0, 3 and 4 override it below, so other rows no longer overwrite those files
324 |                             if row == 0:
325 |                                 w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Address.png'
326 |                             if row == 3:
327 |                                 w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Guardian.png'
328 |                             if row == 4:
329 |                                 w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Name.png'
330 |                             cv2.imwrite(w_filename, new_img)
331 |                             # csv_cols.append(filename_no_xtensn+ '_' +time_str+ '_' +str(idx) + '.png')
332 |                             row += 1
333 |                     else:
334 |                         row = 0
335 |                         for box in columns:
336 |                             if row  == 0 or row == 3 or row == 4:
337 |                                 idx += 1
338 |                                 new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
339 |                                 if row == 0:
340 |                                     w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Address.png'
341 |                                 if row == 3:
342 |                                     w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Guardian.png'
343 |                                 if row == 4:
344 |                                     w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Name.png'
345 |                                 cv2.imwrite(w_filename, new_img)
346 |                                 csv_cols.append(w_filename.split(cropped_dir_path)[1])
347 |                             else:
348 |                                 idx += 1
349 |                                 new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
350 |                                 new_img = img_preprocessing(new_img)
351 |                                 data = "&&  " + str(split_number(new_img))
352 |                                 csv_cols.append(data)
353 |                                 font = cv2.FONT_HERSHEY_SIMPLEX
354 |                                 cv2.putText(img, data, (box[0],box[1]), font, 2, (0, 255, 0), 2, cv2.LINE_AA)
355 |                             row += 1
356 |                         # Add page number to last column
357 |                         csv_cols.append(filename_no_folder.split("-")[1].split(".")[0])
358 |                     csv_row_col.append(csv_cols)
359 |                     col += 1
360 |                 with open(out_csv, 'a', newline='') as f:
361 |                     writer = csv.writer(f, delimiter=',')
362 |                     writer.writerows(csv_row_col)  # csv_row_col is a list of rows (a list of lists)
363 | 
364 |             else:
365 |                 print("no table in ", cropped_dir_path+filename_no_folder)
366 |         else:
367 |             print("no table in ", cropped_dir_path+filename_no_folder)
368 | 
369 | 
370 | def main(start_time):
371 |     start_total = str(int(round(time.time() * 1000)))
372 |     for filename in os.listdir(input_folder):
373 |         print(filename)
374 |         start_milliseconds = str(int(round(time.time() * 1000)))
375 |         file_path = input_folder + filename
376 |         box_extraction(file_path, output_folder)
377 |         end_milliseconds = str(int(round(time.time() * 1000)))
378 |     end_total = str(int(round(time.time() * 1000)))
379 |     caculate_time_difference(start_total, end_total, "total")    
380 |     
381 | 
382 | if __name__ == '__main__':
383 |     print("Reading image..")
384 |     start_time = str(int(round(time.time() * 1000)))
385 |     main(start_time)
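
Note on the row-building loop in `box_extraction` (sort the boxes by y, cut them into groups of seven, sort each group by x): the same logic can be isolated as a small pure function, which makes it easier to test without images. The sketch below is an illustration only and is not part of the repository. `group_boxes_into_rows` and its `per_row` parameter are hypothetical names, and it assumes, as the original loop does, that every table row yields exactly `per_row` detected boxes. An incomplete trailing group is dropped, matching the original behavior.

```python
def group_boxes_into_rows(boxes, per_row=7):
    """Group [x, y, w, h, area] boxes into rows, each sorted left to right.

    Hypothetical helper: mirrors the "sort by y, chunk, sort by x" loop
    and drops a trailing group that has fewer than `per_row` boxes.
    """
    by_y = sorted(boxes, key=lambda b: b[1])  # top to bottom
    rows = []
    for start in range(0, len(by_y) - per_row + 1, per_row):
        chunk = by_y[start:start + per_row]
        rows.append(sorted(chunk, key=lambda b: b[0]))  # left to right
    return rows


if __name__ == "__main__":
    demo = [
        [30, 100, 10, 10, 100], [10, 101, 10, 10, 100], [20, 99, 10, 10, 100],
        [5, 10, 10, 10, 100], [25, 12, 10, 10, 100], [15, 11, 10, 10, 100],
    ]
    for row in group_boxes_into_rows(demo, per_row=3):
        print([box[0] for box in row])
```

This grouping is only as reliable as the box detection: one missed or extra box shifts every later row, so a check that `len(boxes)` is a multiple of `per_row` before grouping may be worth adding.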

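Note on the width-based slicing (for example the `elif w > 64 and w < 80` branch, which cuts a wide contour into four digit crops with slightly overlapping edges): the column ranges can be computed by one helper instead of being written out per branch. This is a sketch under assumptions, not the repository's code. `slice_bounds`, `left_pad` and `right_pad` are hypothetical names, and the padding is applied uniformly here, whereas the original branches vary it slightly from slice to slice.

```python
def slice_bounds(x, w, parts, left_pad=1, right_pad=2):
    """Return (start, end) column ranges cutting [x, x + w) into `parts` slices.

    Hypothetical helper: inner edges are widened by `left_pad` / `right_pad`
    pixels so a digit sitting on a cut line is not clipped.
    """
    bounds = []
    for k in range(parts):
        start = x + int(w * k / parts)
        end = x + int(w * (k + 1) / parts)
        if k > 0:
            start -= left_pad   # overlap into the previous slice
        if k < parts - 1:
            end += right_pad    # overlap into the next slice
        bounds.append((start, end))
    return bounds


if __name__ == "__main__":
    # Each (start, end) pair would be used as img[y:y+h, start:end].
    print(slice_bounds(100, 72, 4))
```

With such a helper, the per-slice index offset could follow the same pattern as the existing code (`i`, `i + 1000`, `i + 2000`, ...) by using `i + 1000 * k` for slice `k`, assuming the offset is only meant to keep the crops distinct and ordered.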

--------------------------------------------------------------------------------
/table_1.csv:
--------------------------------------------------------------------------------
 1 | 
 2 | 16544460977400001-03_1655120228532_8.png,62,35201-4785657-9,16544460977400001-03_1655120230458_11.png,16544460977400001-03_1655120230477_12.png,1,1
 3 | 16544460977400001-03_1655120230821_15.png,37,35202-2948435-9,16544460977400001-03_1655120231182_18.png,16544460977400001-03_1655120231183_19.png,1,2
 4 | 16544460977400001-03_1655120231530_22.png,36,36202-8131266-7,16544460977400001-03_1655120231881_25.png,16544460977400001-03_1655120231882_26.png,4,3
 5 | 16544460977400001-03_1655120232214_29.png,35,35202-8256565-9,16544460977400001-03_1655120232566_32.png,16544460977400001-03_1655120232567_33.png,1,4
 6 | 16544460977400001-03_1655120232889_36.png,32,35201-6358583-5,16544460977400001-03_1655120233245_39.png,16544460977400001-03_1655120233246_40.png,1,5
 7 | 16544460977400001-03_1655120233581_43.png,30,35202-0886762-7,16544460977400001-03_1655120233950_46.png,16544460977400001-03_1655120233951_47.png,1,6
 8 | 16544460977400001-03_1655120234366_50.png,91,39202-4566769-5,16544460977400001-03_1655120234776_53.png,16544460977400001-03_1655120234777_54.png,2,1
 9 | 16544460977400001-03_1655120235138_57.png,58,35202-4386216-5,16544460977400001-03_1655120235500_60.png,16544460977400001-03_1655120235500_61.png,2,8
10 | 16544460977400001-03_1655120235834_64.png,52,35202-2789094-9,16544460977400001-03_1655120236215_67.png,16544460977400001-03_1655120236216_68.png,2,9
11 | 16544460977400001-03_1655120236669_71.png,49,35202-3433250-9,16544460977400001-03_1655120237739_74.png,16544460977400001-03_1655120239279_75.png,2,0
12 | 16544460977400001-03_1655120239968_78.png,49,30202-6576483-7,16544460977400001-03_1655120240607_81.png,16544460977400001-03_1655120240608_82.png,2,41
13 | 16544460977400001-03_1655120241169_85.png,22,39202-5662367-9,16544460977400001-03_1655120241779_88.png,16544460977400001-03_1655120241799_89.png,2,42
14 | 16544460977400001-03_1655120242208_92.png,28,32202-6306618-7,16544460977400001-03_1655120242988_95.png,16544460977400001-03_1655120242989_96.png,2,413
15 | 16544460977400001-03_1655120243660_99.png,28,-098,16544460977400001-03_1655120244033_102.png,16544460977400001-03_1655120244034_103.png,2,14
16 | 16544460977400001-03_1655120244412_106.png,,39202-5974213-7,16544460977400001-03_1655120244878_109.png,16544460977400001-03_1655120244878_110.png,2,15
17 | 16544460977400001-03_1655120245234_113.png,2,35202-4943846-5,16544460977400001-03_1655120245667_116.png,16544460977400001-03_1655120245667_117.png,2,18
18 | 16544460977400001-03_1655120246082_120.png,65,35202-5052350-9,16544460977400001-03_1655120246518_123.png,16544460977400001-03_1655120246519_124.png,3,17
19 | 16544460977400001-03_1655120247000_127.png,54,35202-7949955-1,16544460977400001-03_1655120247433_130.png,16544460977400001-03_1655120247434_131.png,3,18
20 | 16544460977400001-03_1655120247977_134.png,46,35202-4918828-9,16544460977400001-03_1655120248349_137.png,16544460977400001-03_1655120248350_138.png,3,19
21 | 16544460977400001-03_1655120248800_141.png,38,35201-7525691-3,16544460977400001-03_1655120249187_144.png,16544460977400001-03_1655120249188_145.png,3,20
22 | 16544460977400001-03_1655120249701_148.png,31,35202-4588576-3,16544460977400001-03_1655120250104_151.png,16544460977400001-03_1655120250105_152.png,3,21
23 | 16544460977400001-03_1655120250661_155.png,30,36202-1916332-5,16544460977400001-03_1655120251089_158.png,16544460977400001-03_1655120251090_159.png,3,22
24 | 16544460977400001-03_1655120251499_162.png,25,35202-2034663-9,16544460977400001-03_1655120251978_165.png,16544460977400001-03_1655120251979_166.png,3,23
25 | 16544460977400001-03_1655120252382_169.png,56,35202-1953178-7,16544460977400001-03_1655120253001_172.png,16544460977400001-03_1655120253002_173.png,4,24
26 | 


--------------------------------------------------------------------------------
/table_1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/table_1.jpg


--------------------------------------------------------------------------------