├── .gitignore ├── README.md ├── cnn_ocr1.py ├── focus_border_Images ├── f_16544460977400001-03.jpg ├── g_16544460977400001-03.jpg ├── h.jpg ├── h_16544460977400001-03.jpg ├── v.jpg └── v_16544460977400001-03.jpg ├── focus_border_images.png ├── focus_border_img.jpg ├── focus_text_Images └── result_16540011586530001-03.jpg ├── new_thread_7_paddle_opti.py ├── output_img ├── 16544460977400001-03_1655120866607_1.png ├── 16544460977400001-03_1655120866609_2.png ├── 16544460977400001-03_1655120866609_3.png ├── 16544460977400001-03_1655120866610_4.png ├── 16544460977400001-03_1655120866611_5.png ├── 16544460977400001-03_1655120866612_6.png ├── 16544460977400001-03_1655120866612_7.png ├── 16544460977400001-03_1655120866613_8.png ├── 16544460977400001-03_1655120866939_11.png ├── 16544460977400001-03_1655120866940_12.png ├── 16544460977400001-03_1655120867285_15.png ├── 16544460977400001-03_1655120867593_18.png ├── 16544460977400001-03_1655120867593_19.png ├── 16544460977400001-03_1655120867886_22.png ├── 16544460977400001-03_1655120868209_25.png ├── 16544460977400001-03_1655120868210_26.png ├── 16544460977400001-03_1655120868507_29.png ├── 16544460977400001-03_1655120868810_32.png ├── 16544460977400001-03_1655120868810_33.png ├── 16544460977400001-03_1655120869094_36.png ├── 16544460977400001-03_1655120869422_39.png ├── 16544460977400001-03_1655120869423_40.png ├── 16544460977400001-03_1655120869713_43.png ├── 16544460977400001-03_1655120870040_46.png ├── 16544460977400001-03_1655120870041_47.png ├── 16544460977400001-03_1655120870348_50.png ├── 16544460977400001-03_1655120870654_53.png ├── 16544460977400001-03_1655120870655_54.png ├── 16544460977400001-03_1655120870955_57.png ├── 16544460977400001-03_1655120871261_60.png ├── 16544460977400001-03_1655120871262_61.png ├── 16544460977400001-03_1655120871555_64.png ├── 16544460977400001-03_1655120871887_67.png ├── 16544460977400001-03_1655120871888_68.png ├── 16544460977400001-03_1655120872207_71.png ├── 
16544460977400001-03_1655120872548_74.png ├── 16544460977400001-03_1655120872549_75.png ├── 16544460977400001-03_1655120872851_78.png ├── 16544460977400001-03_1655120873167_81.png ├── 16544460977400001-03_1655120873168_82.png ├── 16544460977400001-03_1655120873796_85.png ├── 16544460977400001-03_1655120874124_88.png ├── 16544460977400001-03_1655120874125_89.png ├── 16544460977400001-03_1655120874563_92.png ├── 16544460977400001-03_1655120874891_95.png ├── 16544460977400001-03_1655120874892_96.png ├── 16544460977400001-03_1655120875187_99.png ├── 16544460977400001-03_1655120875586_102.png ├── 16544460977400001-03_1655120875587_103.png ├── 16544460977400001-03_1655120875919_106.png ├── 16544460977400001-03_1655120876292_109.png ├── 16544460977400001-03_1655120876293_110.png ├── 16544460977400001-03_1655120876586_113.png ├── 16544460977400001-03_1655120876921_116.png ├── 16544460977400001-03_1655120876922_117.png ├── 16544460977400001-03_1655120877270_120.png ├── 16544460977400001-03_1655120877615_123.png ├── 16544460977400001-03_1655120877616_124.png ├── 16544460977400001-03_1655120877910_127.png ├── 16544460977400001-03_1655120878219_130.png ├── 16544460977400001-03_1655120878221_131.png ├── 16544460977400001-03_1655120878619_134.png ├── 16544460977400001-03_1655120878965_137.png ├── 16544460977400001-03_1655120878965_138.png ├── 16544460977400001-03_1655120879303_141.png ├── 16544460977400001-03_1655120879677_144.png ├── 16544460977400001-03_1655120879678_145.png ├── 16544460977400001-03_1655120879968_148.png ├── 16544460977400001-03_1655120880337_151.png ├── 16544460977400001-03_1655120880338_152.png ├── 16544460977400001-03_1655120880651_155.png ├── 16544460977400001-03_1655120881022_158.png ├── 16544460977400001-03_1655120881023_159.png ├── 16544460977400001-03_1655120881357_162.png ├── 16544460977400001-03_1655120881749_165.png └── 16544460977400001-03_1655120881749_166.png ├── pdf_img └── 16544460977400001-03.jpg ├── pdf_to_img.py ├── requirements.txt ├── 
start_ocr1.py ├── table_1.csv └── table_1.jpg

/.gitignore:
--------------------------------------------------------------------------------
1 | readme.txt
2 | models/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Ocr_pdf_table_csv
2 | This project extracts data from a PDF snapshot and writes it to a CSV file.
3 | It uses two methods to detect the boxes (cells) of a table.
4 | The first method relies on the borders of the table.
5 | A custom CNN model is used in this project.
6 | A version of this project that uses a pretrained Tesseract model is available at https://github.com/Supernova1024/OCR-Pdf-to-CSV-Table
7 | Please give me a star if this project was helpful to your startup project. :)
8 | 
9 | The result is stored in the "focus_border_Images" folder.
10 | ![](https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN/blob/main/focus_border_img.jpg)
11 | ![](https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN/blob/main/focus_border_images.png)
12 | 
13 | # Installing
14 | - Download this repository
15 | - Install the packages listed in requirements.txt from the project root directory
16 | 
17 | # How to Run
18 | - Convert the PDF to images
19 | Open pdf_to_img.py and define the parameters.
20 | You can check the parameters here.
21 | https://pdf2image.readthedocs.io/en/latest/reference.html
22 | - Prepare the datasets and train the model.
23 | Please use this project.
24 | https://github.com/Supernova1024/train-CNN-model-for-number-classification-OCR
25 | - After training a model with the project above, copy it into the "models" folder.
26 | - Start
27 | ```
28 | python start_ocr1.py
29 | ```
30 | 
31 | # Result Description
32 | ![](https://github.com/Supernova1024/OCR-PDF-to-CSV-Table-by-CNN/blob/main/table_1.jpg)
33 | In this project, I used a table that has 7 columns.
34 | Columns 1, 4, and 5 can't be recognized by the model.
35 | The boxes of these columns are saved in the "output_img" folder as PNG images, and their file names are added to the CSV file.
36 | You can check an example of the "output_img" folder here.
37 | https://drive.google.com/drive/folders/1nrns5zZkzfVP9o8aCyjkj9O4O-TwJAvP?usp=sharing
38 | The other columns can be recognized by the model, and their results are stored in the CSV directly.
39 | The CSV keeps the table structure of the original PDF.
40 | I attached an example "output_img" folder and "table_1.csv".
41 | 
42 | Please give me a star if this project was helpful to your startup project. :)
43 | 
44 | 
45 | 
46 | 
47 | 

--------------------------------------------------------------------------------
/cnn_ocr1.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import os
4 | import cv2
5 | import time
6 | import tensorflow.compat.v1 as tf
7 | tf.disable_v2_behavior()
8 | 
9 | image_size=32
10 | num_channels=3
11 | 
12 | # Let us restore the saved model
13 | sess = tf.Session()
14 | # Step-1: Recreate the network graph. At this step only the graph is created.
15 | saver = tf.train.import_meta_graph('models/trained_model.meta')
16 | # Step-2: Now let's load the saved weights using the restore method.
17 | saver.restore(sess, tf.train.latest_checkpoint('./models/'))
18 | 
19 | # Accessing the default graph which we have restored
20 | graph = tf.get_default_graph()
21 | # Now, let's get hold of the op that can be run to get the output.
22 | # In the original network y_pred is the tensor that is the prediction of the network 23 | y_pred = graph.get_tensor_by_name("y_pred:0") 24 | 25 | ## Let's feed the images to the input placeholders 26 | x= graph.get_tensor_by_name("x:0") 27 | y_true = graph.get_tensor_by_name("y_true:0") 28 | y_test_images = np.zeros((1, 12)) 29 | predicted_class = None 30 | dirs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, "n", "m"] 31 | 32 | def predict(img): 33 | 34 | image = cv2.resize(img, (image_size, image_size),0,0, cv2.INTER_LINEAR) 35 | image = np.stack((image,)*3, axis=-1) 36 | images = [] 37 | images.append(image) 38 | images = np.array(images, dtype=np.uint8) 39 | images = images.astype('float32') 40 | images = np.multiply(images, 1.0/255.0) 41 | x_batch = images.reshape(1, image_size,image_size,num_channels) 42 | 43 | # Creating the feed_dict that is required to be fed to calculate y_pred 44 | feed_dict_testing = {x: x_batch, y_true: y_test_images} 45 | # print("--feed_dict_testing---", feed_dict_testing) 46 | result=sess.run(y_pred, feed_dict=feed_dict_testing) 47 | # Result is of this format [[probabiliy_of_classA probability_of_classB ....]] 48 | 49 | a = result[0].tolist() 50 | r=0 51 | max1 = max(a) 52 | index1 = a.index(max1) 53 | count = 0 54 | 55 | for name in dirs: 56 | if count==index1: 57 | predicted_class = name 58 | count+=1 59 | 60 | for i in a: 61 | if i!=max1: 62 | if max1-i 0: 75 | if lns is not None: 76 | for ln in lns: 77 | x1 = int(ln[0][0]) 78 | y1 = int(ln[0][1]) 79 | x2 = int(ln[0][2]) 80 | y2 = int(ln[0][3]) 81 | cv2.line(img_cpy, pt1=(x1, y1), pt2=(x2, y2), color=(255, 0, 0), thickness=5) 82 | return (1, img_cpy) 83 | 84 | elif lns is not None: 85 | for ln in lns: 86 | x1 = int(ln[0][0]) 87 | y1 = int(ln[0][1]) 88 | x2 = int(ln[0][2]) 89 | y2 = int(ln[0][3]) 90 | cv2.line(img_cpy, pt1=(x1, y1), pt2=(x2, y2), color=(255, 0, 0), thickness=5) 91 | return (1, img_cpy) 92 | 93 | return (0, img_cpy) 94 | 95 | except Exception as e: 96 | print(f"Error: {e}") 
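The drawing loop in the function above reads every detected line as `ln[0][0]` through `ln[0][3]`. That indexing matches the `(N, 1, 4)` layout returned by `cv2.HoughLinesP`, which is assumed here to be the source of `lns`. The same unpacking can be sketched as a small helper; `segment_endpoints` is a hypothetical name used only for illustration and is not part of the repository.

```python
def segment_endpoints(lns):
    """Return [((x1, y1), (x2, y2)), ...] for HoughLinesP-style output.

    `lns` is assumed to be None (no lines found) or a sequence of
    one-element rows, each holding [x1, y1, x2, y2].
    """
    if lns is None:
        return []
    endpoints = []
    for ln in lns:
        # Each row wraps a single 4-value segment, hence ln[0].
        x1, y1, x2, y2 = (int(v) for v in ln[0])
        endpoints.append(((x1, y1), (x2, y2)))
    return endpoints


if __name__ == "__main__":
    # Plain nested lists stand in for the NumPy array OpenCV would return.
    demo = [[[10, 20, 110, 20]], [[30.0, 5.0, 30.0, 95.0]]]
    for pt1, pt2 in segment_endpoints(demo):
        print(pt1, pt2)
```

Each returned pair can be passed directly as `pt1` and `pt2` to `cv2.line`, as the original loop does, and handling `None` in one place avoids repeating the `lns is not None` check in every branch.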
97 | 98 | 99 | def v_remove_cnts(image, rect_barcode): 100 | try: 101 | # Initialize variables 102 | limit_distance = 200 103 | mask = np.ones(image.shape[:2], dtype="uint8") * 255 104 | 105 | # Find contours and sort by height 106 | contours, hierarchy = cv2.findContours( 107 | image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 108 | h_arr = [cv2.boundingRect(cnt)[3] for cnt in contours] 109 | h_arr.sort() 110 | 111 | # Cluster contours based on height 112 | current_cluster = [] 113 | clusters = [] 114 | for i in range(len(h_arr)): 115 | if not current_cluster or h_arr[i] - current_cluster[-1] <= limit_distance: 116 | current_cluster.append(h_arr[i]) 117 | else: 118 | clusters.append(current_cluster) 119 | current_cluster = [h_arr[i]] 120 | clusters.append(current_cluster) 121 | 122 | # Find the largest cluster and average height 123 | largest_cluster = max(clusters, key=len) 124 | avg_value = sum(largest_cluster) / len(largest_cluster) 125 | 126 | # Remove contours outside of the largest cluster 127 | for cnt in contours: 128 | h = cv2.boundingRect(cnt)[3] 129 | if h < avg_value - limit_distance or h > avg_value + limit_distance: 130 | cv2.drawContours(mask, [cnt], -1, 0, -1) 131 | image = cv2.bitwise_and(image, image, mask=mask) 132 | 133 | # Draw first vertical line if barcode exists 134 | if len(rect_barcode) > 0: 135 | x_barcode_center = rect_barcode[0][0] 136 | y_barcode_center = int(rect_barcode[0][1]) 137 | w_barcode = rect_barcode[1][0] 138 | x_first_vertical = int(x_barcode_center - 350/2 - 8) 139 | cv2.line(image, pt1=(x_first_vertical, y_barcode_center), pt2=(x_first_vertical, image.shape[0]), color=(255, 0, 0), thickness=10) 140 | 141 | return (1, image) 142 | 143 | except Exception as e: 144 | print(f"Error in v_remove_cnts: {str(e)}") 145 | return (0, image) 146 | 147 | 148 | def h_remove_cnts(image): 149 | try: 150 | mask = np.ones(image.shape[:2], dtype="uint8") * 255 151 | contours, hierarchy = cv2.findContours(image, cv2.RETR_TREE, 
cv2.CHAIN_APPROX_SIMPLE) 152 | for cnt in contours: 153 | x, y, w, h = cv2.boundingRect(cnt) 154 | if w < image.shape[1]: 155 | cv2.drawContours(mask, [cnt], -1, 0, -1) 156 | image = cv2.bitwise_and(image, image, mask=mask) 157 | return image 158 | except Exception as e: 159 | print(f"Error occurred in h_remove_cnts: {e}") 160 | return None 161 | 162 | 163 | def find_0(s): 164 | ch = "0" 165 | return [i for i, ltr in enumerate(s) if ltr == ch] 166 | 167 | def img_rotate(img, rect_barcode): 168 | try: 169 | angle = rect_barcode[2] 170 | except IndexError: 171 | return img 172 | 173 | # Rotate the original image based on the angle of the polygon 174 | h, w = img.shape[:2] 175 | center = (w // 2, h // 2) 176 | if angle > 45: 177 | angle1 = angle - 90 - 0.8 178 | else: 179 | angle1 = angle 180 | M = cv2.getRotationMatrix2D(center, angle1, 1.0) 181 | rotated = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE) 182 | 183 | return rotated 184 | 185 | 186 | def find_barcode(img, file_path): 187 | try: 188 | time_str = str(int(round(time.time() * 1000))) 189 | resizeimg = img[100:600, 20:800] 190 | kernel = np.ones((1, 1), np.uint8) 191 | dilateimg = cv2.dilate(resizeimg, kernel, iterations=3) 192 | 193 | kernel1 = np.ones((1, 7), np.uint8) 194 | erodeimg = cv2.erode(dilateimg, kernel1, iterations=5) 195 | 196 | kernel2 = np.ones((7, 7), np.uint8) 197 | dilateimg = cv2.dilate(erodeimg, kernel2, iterations=7) 198 | 199 | kernel1 = np.ones((7, 7), np.uint8) 200 | erodeimg = cv2.erode(dilateimg, kernel1, iterations=7) 201 | 202 | gray = cv2.cvtColor(erodeimg, cv2.COLOR_BGR2GRAY) 203 | 204 | # Apply a threshold to the grayscale image to get a binary image 205 | _, thresh = cv2.threshold(gray, 254, 255, cv2.THRESH_BINARY_INV) 206 | 207 | # Find the contours in the binary image 208 | contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 209 | rect_barcode = [] 210 | if len(contours) > 0: 211 | # Filter the contours to 
find the box contour based on its shape and size 212 | for contour in contours: 213 | # Approximate the contour to a polygon 214 | approx = cv2.approxPolyDP(contour, 0.01 * cv2.arcLength(contour, True), True) 215 | 216 | # Check if the polygon has 4 vertices (i.e., is a quadrilateral) and has a certain minimum area 217 | if len(approx) == 4 and cv2.contourArea(approx) < 25000 and cv2.contourArea(approx) > 8000: 218 | # Draw the contour on the original image 219 | cv2.drawContours(erodeimg, [approx], 0, (0, 0, 255), 2) 220 | # Calculate the angle of the polygon 221 | rect_barcode = cv2.minAreaRect(approx) 222 | break 223 | if not rect_barcode: 224 | print("Barcode not found. ", file_path) 225 | else: 226 | print("Barcode not found. ", file_path) 227 | return rect_barcode 228 | except Exception as e: 229 | print(f"Error: {str(e)}") 230 | return rect_barcode 231 | 232 | 233 | def reorganize_array(arr): 234 | # Making 3 dimensional array based on big Y difference 235 | # Making the length of sub array less than 7 236 | group = [] 237 | subarray = [arr[0]] 238 | 239 | try: 240 | for i in range(1, len(arr)): 241 | if arr[i][1] - arr[i-1][1] > 40: 242 | subarray.sort() 243 | if len(subarray) >= 7: 244 | subarray = subarray[:7] 245 | group.append(subarray) 246 | subarray = [] 247 | subarray.append(arr[i]) 248 | 249 | subarray.sort() 250 | if len(subarray) >= 7: 251 | subarray = subarray[:7] 252 | group.append(subarray) 253 | except Exception as e: 254 | print(f"Error in reorganize_array: {e}") 255 | return [] 256 | 257 | return group 258 | 259 | 260 | # Table rotation 261 | def correct_skew(image, delta=2, limit=7): 262 | def determine_score(arr, angle): 263 | data = ndimage.rotate(arr, angle, reshape=False, order=0) 264 | histogram = np.sum(data, axis=1, dtype=float) 265 | score = np.sum((histogram[1:] - histogram[:-1]) ** 2, dtype=float) 266 | return histogram, score 267 | 268 | try: 269 | gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) 270 | thresh = cv2.threshold(gray, 
0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1] 271 | 272 | scores = [] 273 | angles = np.arange(-limit, limit + delta, delta) 274 | for angle in angles: 275 | histogram, score = determine_score(thresh, angle) 276 | scores.append(score) 277 | 278 | best_angle = angles[scores.index(max(scores))] 279 | 280 | (h, w) = image.shape[:2] 281 | center = (w // 2, h // 2) 282 | M = cv2.getRotationMatrix2D(center, best_angle, 1.0) 283 | corrected = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, \ 284 | borderMode=cv2.BORDER_REPLICATE) 285 | 286 | return best_angle, corrected 287 | except Exception as e: 288 | print("Error: ", e) 289 | return None, None 290 | 291 | 292 | 293 | #Functon for extracting the box 294 | def box_extraction(img_for_box_extraction_path): 295 | print("==filename=====", img_for_box_extraction_path) 296 | try: 297 | # Split filename and extension 298 | filename_no_xtensn, xtensn = os.path.splitext(img_for_box_extraction_path) 299 | 300 | # Check if extension is valid 301 | if xtensn.lower() not in ['.jpg', '.jpeg']: 302 | raise ValueError("Invalid image file extension") 303 | 304 | # Get PDF name from filename 305 | filename_no_folder = filename_no_xtensn.split(input_folder)[1] 306 | pdf_name = filename_no_folder.split("&&&")[0] 307 | 308 | # Create output directory if it doesn't exist 309 | output_dir = output_folder + "_" + pdf_name 310 | if not os.path.exists(output_dir): 311 | os.makedirs(output_dir) 312 | 313 | # Load image and find barcode 314 | output_img_path = output_folder + "_" + pdf_name + "/" 315 | img_origin = cv2.imread(img_for_box_extraction_path) 316 | rect_barcode = find_barcode(img_origin, img_for_box_extraction_path) 317 | 318 | # Rotate image if barcode is found 319 | if len(rect_barcode) > 0: 320 | print("=========found barcode=====", img_for_box_extraction_path) 321 | img = img_rotate(img_origin, rect_barcode) 322 | rect_barcode = find_barcode(img, img_for_box_extraction_path) 323 | 324 | # Draw blue line on barcode 325 
| green_flag, img1 = draw_blue_line(img, rect_barcode) 326 | 327 | if green_flag == 1: 328 | print("===1======green_flag == 1=====", img_for_box_extraction_path) 329 | # Threshold and binarize image 330 | Cimg_gray_para = [3, 3, 0] 331 | Cimg_blur_para = [150, 255] 332 | gray_img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY) 333 | blurred_img = cv2.GaussianBlur(gray_img, (Cimg_gray_para[0], Cimg_gray_para[1]), Cimg_gray_para[2]) 334 | thresh, img_bin = cv2.threshold(blurred_img, 200, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU) 335 | img_bin = 255 - img_bin 336 | 337 | # Detect vertical and horizontal lines 338 | kernel_length = np.array(img).shape[1] // 40 339 | verticle_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length)) 340 | hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1)) 341 | kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)) 342 | 343 | # Morphological operations to detect lines 344 | img_temp1 = cv2.erode(img_bin, verticle_kernel, iterations=2) 345 | verticle_lines_img = cv2.dilate(img_temp1, verticle_kernel, iterations=20) 346 | verticle_lines_img = cv2.erode(verticle_lines_img, verticle_kernel, iterations=2) 347 | cnts_flag, v_cnts_img = v_remove_cnts(verticle_lines_img, rect_barcode) 348 | 349 | if cnts_flag == 1: 350 | print("====cnts_flag == 1====", img_for_box_extraction_path) 351 | # Morphological operation to detect horizontal lines from an image 352 | img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=2) 353 | horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=17) 354 | horizontal_lines_img = cv2.erode(horizontal_lines_img, hori_kernel, iterations=2) 355 | 356 | # Find valid cnts 357 | h_cnts_img = h_remove_cnts(horizontal_lines_img) 358 | 359 | # Weighting parameters, this will decide the quantity of an image to be added to make a new image. 
360 | alpha = 0.5 361 | beta = 1.0 - alpha 362 | 363 | # This function helps to add two image with specific weight parameter to get a third image as summation of two image. 364 | img_final_bin = cv2.addWeighted(v_cnts_img, alpha, h_cnts_img, beta, 0.0) 365 | img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2) 366 | (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU) 367 | cv2.waitKey(0) 368 | 369 | # Find contours for image, which will detect all the boxes 370 | contours, hierarchy = cv2.findContours( 371 | img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 372 | # Sort all the contours by top to bottom. 373 | (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom") 374 | 375 | boxes = [] 376 | xx = [] 377 | yy = [] 378 | ww = [] 379 | hh = [] 380 | areaa = [] 381 | 382 | for c in contours: 383 | # Returns the location and width,height for every contour 384 | x, y, w, h = cv2.boundingRect(c) 385 | if x > 3 and w > 800 and (x + w) < 2000 and y > 50 and y+h < 3650 and h > 55 and w < 1000 and h < 100: 386 | xx.append(x) 387 | 388 | first_column_x = 131 389 | if len(xx) > 0: 390 | first_column_x = statistics.mode(xx) 391 | if first_column_x > 130: 392 | for c in contours: 393 | # Returns the location and width,height for every contour 394 | x, y, w, h = cv2.boundingRect(c) 395 | area = w * h 396 | box_info = [x, y, w, h, area] 397 | # print(box_info) 398 | image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5) 399 | if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 10) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200: 400 | # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 401 | boxes.append(box_info) 402 | elif x > 3 and w > 800 and (x + w) < 2050 and y > 40 and y+h < 3650 and h > 40 and w < 1000 and h < 200: 403 | # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 404 | boxes.append(box_info) 405 | 
elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200: 406 | # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 407 | boxes.append(box_info) 408 | else: 409 | for c in contours: 410 | # Returns the location and width,height for every contour 411 | x, y, w, h = cv2.boundingRect(c) 412 | area = w * h 413 | box_info = [x, y, w, h, area] 414 | # print(box_info) 415 | image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5) 416 | if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 100) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200: 417 | # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 418 | boxes.append(box_info) 419 | elif x > 3 and w > 800 and (x + w) < 2050 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200: 420 | # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 421 | boxes.append(box_info) 422 | elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200: 423 | # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 424 | boxes.append(box_info) 425 | 426 | boxes_sorted_y = sorted(boxes, key=lambda x: x[1]) 427 | box_array = boxes_sorted_y 428 | 429 | if len(box_array) > 0: 430 | row_columns = reorganize_array(box_array) 431 | else: 432 | i = 1 433 | columns = [] 434 | row_columns = [] 435 | for box in boxes_sorted_y: 436 | columns.append(box) 437 | if i % 7 == 0: 438 | boxes_sorted_x = sorted(columns, key=lambda x: x[0]) 439 | row_columns.append(boxes_sorted_x) 440 | columns = [] 441 | i += 1 442 | 443 | for sub_boxes in row_columns: 444 | for box in sub_boxes: 445 | x = box[0] 446 | y = box[1] 447 | w = box[2] 448 | h = box[3] 449 | image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 450 | 451 | ## Write red rect image 452 | time_str = str(int(round(time.time() * 1000))) 453 | w_filename = 'cnts/' +filename_no_folder+ '_' + '.png' 454 | 
cv2.imwrite(w_filename, image) 455 | 456 | idx = 0 457 | csv_row_col = [] 458 | col = 0 459 | w_filename_base = output_img_path + filename_no_folder+ '_' 460 | for columns in row_columns: 461 | csv_cols = [] 462 | if col == 0: 463 | row = 0 464 | for box in columns: 465 | 466 | idx += 1 467 | new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]] 468 | time_str = str(int(round(time.time() * 1000))) 469 | if row == 0: 470 | w_filename = w_filename_base +str(idx) +'_Address.png' 471 | if row == 3: 472 | w_filename = w_filename_base +str(idx) +'_Guardian.png' 473 | if row == 4: 474 | w_filename = w_filename_base +str(idx) +'_Name.png' 475 | cv2.imwrite(w_filename, new_img) 476 | csv_cols.append(filename_no_xtensn+ '_' +time_str+ '_' +str(idx) + '.png') 477 | row += 1 478 | else: 479 | row = 0 480 | for box in columns: 481 | if row == 0 or row == 3 or row == 4: 482 | idx += 1 483 | new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]] 484 | if row == 0: 485 | w_filename = w_filename_base +str(idx) +'_Address.png' 486 | if row == 3: 487 | w_filename = w_filename_base +str(idx) +'_Guardian.png' 488 | if row == 4: 489 | w_filename = w_filename_base +str(idx) +'_Name.png' 490 | cv2.imwrite(w_filename, new_img) 491 | csv_cols.append(w_filename.split(output_img_path)[1]) 492 | else: 493 | idx += 1 494 | new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]] 495 | 496 | if row == 1: 497 | processed_box = box_processing(new_img, "arabic") 498 | else: 499 | processed_box = box_processing(new_img, "number") 500 | 501 | result = ocr.ocr(processed_box, cls=False) 502 | txts = [line[1][0] for line in result[0]] 503 | if len(txts) == 0: 504 | data = '' 505 | else: 506 | data = txts[0] 507 | csv_cols.append(data) 508 | font = cv2.FONT_HERSHEY_SIMPLEX 509 | cv2.putText(img, data, (box[0],box[1]), font, 1, (255, 0, 0), 2, cv2.LINE_AA) 510 | row += 1 511 | # Add page number to last column 512 | csv_cols.append(filename_no_folder.split("-")[1].split(".")[0]) 513 | 
csv_row_col.append(csv_cols) 514 | col += 1 515 | with open(pdf_name + ".csv", 'a', newline='') as f: 516 | writer = csv.writer(f, delimiter=',') 517 | writer.writerows(csv_row_col) #considering my_list is a list of lists. 518 | else: 519 | print("no table in ", output_img_path+filename_no_folder) 520 | 521 | ## Write ocr result to the image 522 | w_filename = 'text/' +filename_no_folder+ '_' + '.png' 523 | cv2.imwrite(w_filename, img) 524 | 525 | else: 526 | print("=========no found barcode=====") 527 | angle, img = correct_skew(img_origin) 528 | img1 = img # Read the image 529 | rect_barcode = [] 530 | (green_flag, img1) = draw_blue_line(img1, rect_barcode) 531 | 532 | if green_flag == 1: 533 | print("=========green_flag == 1=====") 534 | Cimg_gray_para = [3, 3, 0] 535 | Cimg_blur_para = [150, 255] 536 | 537 | gray_img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY) 538 | blurred_img = cv2.GaussianBlur(gray_img, (Cimg_gray_para[0], Cimg_gray_para[1]), Cimg_gray_para[2]) 539 | (thresh, img_bin) = cv2.threshold(blurred_img, 200, 255, 540 | cv2.THRESH_BINARY | cv2.THRESH_OTSU) # Thresholding the image 541 | img_bin = 255-img_bin # Invert the image 542 | 543 | # Defining a kernel length 544 | kernel_length = np.array(img).shape[1]//40 545 | 546 | # A verticle kernel of (1 X kernel_length), which will detect all the verticle lines from the image. 547 | verticle_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length)) 548 | # A horizontal kernel of (kernel_length X 1), which will help to detect all the horizontal line from the image. 549 | hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1)) 550 | # A kernel of (3 X 3) ones. 
551 | kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)) 552 | 553 | # Morphological operation to detect verticle lines from an image 554 | img_temp1 = cv2.erode(img_bin, verticle_kernel, iterations=2) 555 | verticle_lines_img = cv2.dilate(img_temp1, verticle_kernel, iterations=20) 556 | verticle_lines_img = cv2.erode(verticle_lines_img, verticle_kernel, iterations=2) 557 | 558 | # Find valid cnts 559 | (cnts_flag, v_cnts_img) = v_remove_cnts(verticle_lines_img, rect_barcode) 560 | 561 | if cnts_flag == 1: 562 | # Morphological operation to detect horizontal lines from an image 563 | img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=2) 564 | horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=17) 565 | horizontal_lines_img = cv2.erode(horizontal_lines_img, hori_kernel, iterations=2) 566 | 567 | # Find valid cnts 568 | h_cnts_img = h_remove_cnts(horizontal_lines_img) 569 | 570 | # Weighting parameters, this will decide the quantity of an image to be added to make a new image. 571 | alpha = 0.5 572 | beta = 1.0 - alpha 573 | 574 | # This function helps to add two image with specific weight parameter to get a third image as summation of two image. 575 | img_final_bin = cv2.addWeighted(v_cnts_img, alpha, h_cnts_img, beta, 0.0) 576 | img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2) 577 | (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU) 578 | 579 | # Find contours for image, which will detect all the boxes 580 | contours, hierarchy = cv2.findContours( 581 | img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 582 | # Sort all the contours by top to bottom. 
                (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom")

                boxes = []
                xx = []
                yy = []
                ww = []
                hh = []
                areaa = []

                for c in contours:
                    # Bounding box (x, y, width, height) of each contour
                    x, y, w, h = cv2.boundingRect(c)
                    if x > 3 and w > 800 and (x + w) < 2000 and y > 50 and y+h < 3650 and h > 55 and w < 1000 and h < 100:
                        xx.append(x)

                # Most common x of the wide cells; 131 is the fallback when none are found
                first_column_x = 131
                if len(xx) > 0:
                    first_column_x = statistics.mode(xx)
                if first_column_x > 130:
                    for c in contours:
                        # Bounding box (x, y, width, height) of each contour
                        x, y, w, h = cv2.boundingRect(c)
                        area = w * h
                        box_info = [x, y, w, h, area]
                        # print(box_info)
                        image = cv2.rectangle(img1, (x, y), (x+w, y+h), (0, 255, 0), 5)
                        if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 10) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200:
                            # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
                            boxes.append(box_info)
                        elif x > 3 and w > 800 and (x + w) < 2050 and y > 40 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
                            # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
                            boxes.append(box_info)
                        elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
                            # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
                            boxes.append(box_info)
                else:
                    # Same filters as above with a wider right margin (100 px) and y > 50 in the second test
                    for c in contours:
                        # Bounding box (x, y, width, height) of each contour
                        x, y, w, h = cv2.boundingRect(c)
                        area = w * h
                        box_info = [x, y, w, h, area]
                        # print(box_info)
                        image = cv2.rectangle(img1, (x, y), (x+w, y+h), (0, 255, 0), 5)
                        if x > 100 and x < (img1.shape[0] - 70) and (x + w) < (img1.shape[1] - 100) and y > 50 and y+h < 3650 and w > 100 and h > 55 and w < 1000 and h < 200:
                            # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
                            boxes.append(box_info)
                        elif x > 3 and w > 800 and (x + w) < 2050 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
                            # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
                            boxes.append(box_info)
                        elif x > 500 and x < 1500 and w > 40 and y > 50 and y+h < 3650 and h > 40 and w < 1000 and h < 200:
                            # image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25)
                            boxes.append(box_info)

                boxes_sorted_y = sorted(boxes, key=lambda x: x[1])
                box_array = boxes_sorted_y

                ## Group every 7 boxes (in y order) into one row, sorted left to right
                i = 1
                columns = []
                row_columns = []
                for box in boxes_sorted_y:
                    columns.append(box)
                    if i % 7 == 0:
                        boxes_sorted_x = sorted(columns, key=lambda x: x[0])
                        row_columns.append(boxes_sorted_x)
                        columns = []
                    i += 1

                for sub_boxes in row_columns:
                    for box in sub_boxes:
                        x = box[0]
                        y = box[1]
                        w = box[2]
                        h = box[3]
                        image = cv2.rectangle(img1, (x, y), (x+w, y+h), (0, 0, 255), 25)

                ## Write red rect image
                time_str = str(int(round(time.time() * 1000)))
                w_filename = 'cnts/' + filename_no_folder + '_' + '.png'
                # cv2.imwrite(w_filename, image)

                idx = 0
                csv_row_col = []
                col = 0
                w_filename_base = output_img_path + filename_no_folder + '_'
                for columns in row_columns:
                    csv_cols = []
                    if col == 0:
                        # First group of boxes: every cell is saved as an image
                        row = 0
                        for box in columns:

                            idx += 1
                            new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
                            time_str = str(int(round(time.time() * 1000)))
                            # w_filename = w_filename_base +time_str+ '_' +str(idx) + '.png'
                            if row == 0:
                                w_filename = w_filename_base + str(idx) + '_Address.png'
                            if row == 3:
                                w_filename = w_filename_base + str(idx) + '_Guardian.png'
                            if row == 4:
                                w_filename = w_filename_base + str(idx) + '_Name.png'
                            cv2.imwrite(w_filename, new_img)
                            csv_cols.append(filename_no_xtensn + '_' + time_str + '_' + str(idx) + '.png')
                            row += 1
                    else:
                        row = 0
                        for box in columns:
                            if row == 0 or row == 3 or row == 4:
                                # Address, Guardian and Name cells are stored as cropped images
                                idx += 1
                                new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]
                                if row == 0:
                                    w_filename = w_filename_base + str(idx) + '_Address.png'
                                if row == 3:
                                    w_filename = w_filename_base + str(idx) + '_Guardian.png'
                                if row == 4:
                                    w_filename = w_filename_base + str(idx) + '_Name.png'
                                cv2.imwrite(w_filename, new_img)
                                csv_cols.append(w_filename.split(output_img_path)[1])
                            else:
                                # Remaining cells are read with OCR
                                idx += 1
                                new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]]

                                if row == 1:
                                    processed_box = box_processing(new_img, "arabic")
                                else:
                                    processed_box = box_processing(new_img, "number")

                                result = ocr.ocr(processed_box, cls=False)
                                # Guard against an empty result: result[0] can be None
                                # when no text is detected in the cell
                                lines = result[0] if result and result[0] else []
                                txts = [line[1][0] for line in lines]
                                if len(txts) == 0:
                                    data = ''
                                else:
                                    data = txts[0]
                                csv_cols.append(data)
                                font = cv2.FONT_HERSHEY_SIMPLEX
                                cv2.putText(img, data, (box[0], box[1]), font, 1, (255, 0, 0), 2, cv2.LINE_AA)
                            row += 1
                    # Add page number to last column
                    csv_cols.append(filename_no_folder.split("-")[1].split(".")[0])
                    csv_row_col.append(csv_cols)
                    col += 1
                with open(pdf_name + ".csv", 'a', newline='') as f:
                    writer = csv.writer(f, delimiter=',')
                    writer.writerows(csv_row_col)  # csv_row_col is a list of rows (list of lists)
            else:
                print("no table in ", output_img_path+filename_no_folder)
        else:
            print("no table in ", output_img_path+filename_no_folder)

        ## Write ocr result to the image
        w_filename = 'text/' + filename_no_folder + '_' + '.png'
        cv2.imwrite(w_filename, img)

    except (FileNotFoundError, ValueError) as e:
        print(f"Error: {e}")
    except Exception as e:
        print("An unexpected error occurred:", e)
    finally:
        cv2.destroyAllWindows()


def process_files(files):
    # Process one group of files; a failure on one file does not stop the rest
    for filename in files:
        file_path = os.path.join(input_folder, filename)
        try:
            box_extraction(file_path)
        except Exception as e:
            print(f"Error processing {file_path}: {e}")


def main_threading():
    # Despite the name, the work is distributed through Pool rather than threads
    filenames = os.listdir(input_folder)
    if len(filenames) < 5:
        for filename in filenames:
            file_path = os.path.join(input_folder, filename)
            try:
                box_extraction(file_path)
            except Exception as e:
                print(f"Error processing {file_path}: {e}")
    else:
        count_processes = 5
        step = len(filenames) // count_processes
        file_groups = [filenames[i*step:(i+1)*step] for i in range(count_processes)]
        # Leftover files that do not divide evenly go to the last group
        if len(filenames) % count_processes != 0:
            file_groups[-1].extend(filenames[count_processes*step:])
        with Pool(count_processes) as p:
            p.map(process_files, file_groups)


def main_single():
    # Sequential variant: process every file in input_folder one by one
    for filename in os.listdir(input_folder):
        file_path = os.path.join(input_folder, filename)
        box_extraction(file_path)


if __name__ == '__main__':
    print("Reading image..")
    main_threading()

--------------------------------------------------------------------------------
/output_img/16544460977400001-03_1655120866607_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866607_1.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866609_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866609_2.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866609_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866609_3.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866610_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866610_4.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866611_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866611_5.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866612_6.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866612_6.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866612_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866612_7.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866613_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866613_8.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866939_11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866939_11.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120866940_12.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120866940_12.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120867285_15.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867285_15.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120867593_18.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867593_18.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120867593_19.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867593_19.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120867886_22.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120867886_22.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120868209_25.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868209_25.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120868210_26.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868210_26.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120868507_29.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868507_29.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120868810_32.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868810_32.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120868810_33.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120868810_33.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120869094_36.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869094_36.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120869422_39.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869422_39.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120869423_40.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869423_40.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120869713_43.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120869713_43.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120870040_46.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870040_46.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120870041_47.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870041_47.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120870348_50.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870348_50.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120870654_53.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870654_53.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120870655_54.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870655_54.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120870955_57.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120870955_57.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120871261_60.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871261_60.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120871262_61.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871262_61.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120871555_64.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871555_64.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120871887_67.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871887_67.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120871888_68.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120871888_68.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120872207_71.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872207_71.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120872548_74.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872548_74.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120872549_75.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872549_75.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120872851_78.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120872851_78.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120873167_81.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120873167_81.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120873168_82.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120873168_82.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120873796_85.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120873796_85.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120874124_88.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874124_88.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120874125_89.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874125_89.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120874563_92.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874563_92.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120874891_95.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874891_95.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120874892_96.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120874892_96.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120875187_99.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875187_99.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120875586_102.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875586_102.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120875587_103.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875587_103.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120875919_106.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120875919_106.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120876292_109.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876292_109.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120876293_110.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876293_110.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120876586_113.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876586_113.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120876921_116.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876921_116.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120876922_117.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120876922_117.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120877270_120.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877270_120.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120877615_123.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877615_123.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120877616_124.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877616_124.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120877910_127.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120877910_127.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120878219_130.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878219_130.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120878221_131.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878221_131.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120878619_134.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878619_134.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120878965_137.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878965_137.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120878965_138.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120878965_138.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120879303_141.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879303_141.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120879677_144.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879677_144.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120879678_145.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879678_145.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120879968_148.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120879968_148.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120880337_151.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120880337_151.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120880338_152.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120880338_152.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120880651_155.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120880651_155.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120881022_158.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881022_158.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120881023_159.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881023_159.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120881357_162.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881357_162.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120881749_165.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881749_165.png -------------------------------------------------------------------------------- /output_img/16544460977400001-03_1655120881749_166.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/output_img/16544460977400001-03_1655120881749_166.png -------------------------------------------------------------------------------- /pdf_img/16544460977400001-03.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/pdf_img/16544460977400001-03.jpg -------------------------------------------------------------------------------- /pdf_to_img.py: -------------------------------------------------------------------------------- 1 | from pdf2image import convert_from_path 2 | import time 3 | 4 | jpegopt={ 5 | "quality": 100, 6 | "progressive": True, 7 | "optimize": True 8 | } 9 | 10 | time_str = str(int(round(time.time() * 1000))) 11 | 12 | print("Loading PDF file...") 13 | convert_from_path("shamaryati code 259200305.pdf", 14 | dpi=320, 15 | output_folder="pdf_img/", 16 | first_page=3, 17 | last_page=56, 18 | fmt="jpeg", 19 | jpegopt=jpegopt, 20 | thread_count=5, 21 | userpw=None, 22 | use_cropbox=False, 23 | strict=False, 24 | transparent=False, 25 | single_file=False, 26 | output_file=time_str, 27 | poppler_path=r'C:\Program Files\poppler-0.68.0\bin', 28 | grayscale=False, 29 | size=None, 30 | paths_only=False, 31 | hide_annotations=False,) 32 | 33 | print("Done!") -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.22.4 2 | opencv-contrib-python==4.5.5.64 3 | packaging==21.3 4 | pdf2image==1.16.0 5 | Pillow==9.1.1 6 | pyparsing==3.0.9 7 | pytesseract==0.3.9 8 | -------------------------------------------------------------------------------- /start_ocr1.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | 
import numpy as np 3 | import time 4 | import csv 5 | import pytesseract 6 | import os 7 | import threading 8 | from PIL import Image, ImageOps, ImageFilter 9 | import imutils 10 | from cnn_ocr1 import predict 11 | ## Init 12 | jpegopt={ 13 | "quality": 100, 14 | "progressive": True, 15 | "optimize": True 16 | } 17 | 18 | global thread_kill_flags 19 | input_folder = "pdf_img/" 20 | output_folder = "output_img/" 21 | out_csv = "table_1.csv" 22 | # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" 23 | 24 | 25 | def uuid(filename): 26 | time_str = str(int(round(time.time() * 1000))) 27 | file = filename.split(".pdf")[0] 28 | id = file + "&&&" + time_str 29 | return id 30 | 31 | 32 | def img_preprocessing(img): 33 | gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY) 34 | ret, thresh = cv2.threshold(gray,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU) 35 | # noise removal 36 | kernel = np.ones((1,1),np.uint8) 37 | opening = cv2.morphologyEx(thresh,cv2.MORPH_OPEN,kernel, iterations = 1) 38 | # sure background area 39 | sure_bg = cv2.dilate(opening,kernel,iterations=4) 40 | # Finding sure foreground area 41 | dist_transform = cv2.distanceTransform(opening,cv2.DIST_L2,5) 42 | ret, sure_fg = cv2.threshold(dist_transform,0.2*dist_transform.max(),255,0) 43 | # Convert to uint8 and invert so the digits are dark on a white background 44 | sure_fg = np.uint8(sure_fg) 45 | sure_fg = 255 - sure_fg 46 | 47 | return sure_fg 48 | 49 | 50 | def caculate_time_difference(start_milliseconds, end_milliseconds, filename): 51 | if filename == 'total': 52 | diff_milliseconds = int(end_milliseconds) - int(start_milliseconds) 53 | seconds=(diff_milliseconds // 1000) % 60 54 | minutes=(diff_milliseconds//(1000*60))%60 55 | hours=(diff_milliseconds//(1000*60*60))%24 56 | print("Total run time", hours,":",minutes,":",seconds) 57 | else: 58 | diff_milliseconds = int(end_milliseconds) - int(start_milliseconds) 59 | seconds=(diff_milliseconds / 1000) % 60 60 | print(seconds, "s", filename) 61 | 62 | 63 | def
sort_contours(cnts, method="left-to-right"): 64 | # initialize the reverse flag and sort index 65 | reverse = False 66 | i = 0 67 | 68 | # handle if we need to sort in reverse 69 | if method == "right-to-left" or method == "bottom-to-top": 70 | reverse = True 71 | 72 | # handle if we are sorting against the y-coordinate rather than 73 | # the x-coordinate of the bounding box 74 | if method == "top-to-bottom" or method == "bottom-to-top": 75 | i = 1 76 | 77 | # construct the list of bounding boxes and sort them from top to 78 | # bottom 79 | boundingBoxes = [cv2.boundingRect(c) for c in cnts] 80 | (cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes), 81 | key=lambda b: b[1][i], reverse=reverse)) 82 | 83 | # return the list of sorted contours and bounding boxes 84 | return (cnts, boundingBoxes) 85 | 86 | 87 | def draw_green_line(img): 88 | img_gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) 89 | img_cny = cv2.Canny(img_gry, 50, 200) 90 | lns = cv2.ximgproc.createFastLineDetector().detect(img_cny) 91 | img_cpy = img.copy() 92 | if lns is not None: 93 | for ln in lns: 94 | x1 = int(ln[0][0]) 95 | y1 = int(ln[0][1]) 96 | x2 = int(ln[0][2]) 97 | y2 = int(ln[0][3]) 98 | cv2.line(img_cpy, pt1=(x1, y1), pt2=(x2, y2), color=(255, 0, 0), thickness=5) 99 | return (1, img_cpy) 100 | else: 101 | return (0, img_cpy) 102 | 103 | 104 | def v_remove_cnts(image): 105 | mask = np.ones(image.shape[:2], dtype="uint8") * 255 106 | contours, hierarchy = cv2.findContours( 107 | image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 108 | if len(contours) != 0: 109 | h_arr = [] 110 | for cnt in contours: 111 | x, y, w, h = cv2.boundingRect(cnt) 112 | h_arr.append(h) 113 | mode_h = (max(set(h_arr), key = h_arr.count)) 114 | for cnt in contours: 115 | x, y, w, h = cv2.boundingRect(cnt) 116 | h_arr.append(h) 117 | if h < mode_h - 50: 118 | cv2.drawContours(mask, [cnt], -1, 0, -1) 119 | image = cv2.bitwise_and(image, image, mask=mask) 120 | return (1, image) 121 | else: 122 | return (0, image) 123 
| 124 | 125 | def h_remove_cnts(image): 126 | mask = np.ones(image.shape[:2], dtype="uint8") * 255 127 | contours, hierarchy = cv2.findContours( 128 | image, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 129 | for cnt in contours: 130 | x, y, w, h = cv2.boundingRect(cnt) 131 | box_info = [x, y, w, h] 132 | if w < 2640: 133 | cv2.drawContours(mask, [cnt], -1, 0, -1) 134 | image = cv2.bitwise_and(image, image, mask=mask) 135 | return image 136 | 137 | 138 | def find_0(s): 139 | ch = "0" 140 | return [i for i, ltr in enumerate(s) if ltr == ch] 141 | 142 | 143 | def recognition_num(img, i, data): 144 | dim = (32, 32) 145 | resized = cv2.resize(img, dim, interpolation = cv2.INTER_AREA) 146 | 147 | value = predict(resized) 148 | if value != "n" and value != "m": 149 | if value == None: 150 | time_str = str(int(round(time.time() * 1000))) 151 | name = "numbers/" + str(time_str) + str(i) + ".jpg" 152 | cv2.imwrite(name, resized) 153 | value = "_" 154 | data += str(value) 155 | # else: 156 | # time_str = str(int(round(time.time() * 1000))) 157 | # name = "n_m/" + str(time_str) + str(i) + ".jpg" 158 | # cv2.imwrite(name, resized) 159 | 160 | if value == None: 161 | time_str = str(int(round(time.time() * 1000))) 162 | name = "numbers/" + str(time_str) + str(i) + ".jpg" 163 | cv2.imwrite(name, resized) 164 | return data 165 | 166 | 167 | def split_number(img): 168 | kernel = np.ones((1, 1), np.uint8) 169 | img = cv2.dilate(img, kernel, iterations=2) 170 | img = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)[1] 171 | cnts = cv2.findContours(img.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 172 | cnts = imutils.grab_contours(cnts) 173 | data = '' 174 | if len(cnts) != 0: 175 | cnts, boundingBoxes = sort_contours(cnts, method="left-to-right") 176 | 177 | if len(cnts) == 1: 178 | return 0 179 | else: 180 | i = 1 181 | for c in cnts: 182 | x, y, w, h = cv2.boundingRect(c) 183 | if w > 3 and h > 15 and w < 80 and h < 60: 184 | 185 | if w > 3 and w < 24: 186 | new_img = img[y:y+h, 
x:x+w] 187 | data = recognition_num(new_img, i, data) 188 | 189 | elif w > 23 and w < 43: 190 | new_img1 = img[y:y+h, x:x+int(w/2)+2] 191 | data = recognition_num(new_img1, i, data) 192 | new_img2 = img[y:y+h, x+int(w/2)-1:x+w] 193 | data = recognition_num(new_img2, i + 1000, data) 194 | 195 | elif w > 42 and w < 65: 196 | new_img1 = img[y:y+h, x:x+int(w/3)+2] 197 | data = recognition_num(new_img1, i, data) 198 | new_img2 = img[y:y+h, x+int(w/3)-1:x+int(w*2/3)+2] 199 | data = recognition_num(new_img2, i + 1000, data) 200 | new_img3 = img[y:y+h, x+int(w*2/3)-1:x+w] 201 | data = recognition_num(new_img3, i + 2000, data) 202 | 203 | elif w > 64 and w < 80: 204 | new_img1 = img[y:y+h, x:x+int(w/4)+2] 205 | data = recognition_num(new_img1, i, data) 206 | new_img2 = img[y:y+h, x+int(w/4)-1:x+int(w*2/4)+2] 207 | data = recognition_num(new_img2, i + 1000, data) 208 | new_img3 = img[y:y+h, x+int(w*2/4)-1:x+int(w*3/4)] 209 | data = recognition_num(new_img3, i + 2000, data) 210 | new_img4 = img[y:y+h, x+int(w*3/4)-1:x+w] 211 | data = recognition_num(new_img4, i + 3000, data) 212 | 213 | i += 1 214 | return data 215 | 216 | # Function for extracting the table boxes 217 | def box_extraction(img_for_box_extraction_path, cropped_dir_path): 218 | filename_no_xtensn = img_for_box_extraction_path.split(".")[0] 219 | xtensn = img_for_box_extraction_path.split(".")[1] 220 | if xtensn == "jpg" or xtensn == "jpeg" or xtensn == "JPG" or xtensn == "JPEG": 221 | filename_no_folder = filename_no_xtensn.split(input_folder)[1] 222 | img = cv2.imread(img_for_box_extraction_path) # Read the image 223 | img1 = cv2.imread(img_for_box_extraction_path) # Second copy, used for line detection and annotation 224 | (green_flag, img1) = draw_green_line(img1) 225 | 226 | if green_flag == 1: 227 | Cimg_gray_para = [3, 3, 0] 228 | Cimg_blur_para = [150, 255] 229 | 230 | gray_img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY) 231 | blurred_img = cv2.GaussianBlur(gray_img, (Cimg_gray_para[0], Cimg_gray_para[1]), Cimg_gray_para[2]) 232 | (thresh, img_bin) =
cv2.threshold(blurred_img, 200, 255, 233 | cv2.THRESH_BINARY | cv2.THRESH_OTSU) # Threshold the image (with THRESH_OTSU the threshold is chosen automatically and the 200 is ignored) 234 | img_bin = 255-img_bin # Invert the image 235 | 236 | # Kernel length: 1/40 of the image width 237 | kernel_length = np.array(img).shape[1]//40 238 | 239 | # A vertical kernel of (1 x kernel_length), which detects the vertical lines in the image. 240 | verticle_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length)) 241 | # A horizontal kernel of (kernel_length x 1), which detects the horizontal lines in the image. 242 | hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1)) 243 | # A kernel of (3 x 3) ones. 244 | kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)) 245 | 246 | # Morphological operations to isolate the vertical lines 247 | img_temp1 = cv2.erode(img_bin, verticle_kernel, iterations=2) 248 | verticle_lines_img = cv2.dilate(img_temp1, verticle_kernel, iterations=20) 249 | verticle_lines_img = cv2.erode(verticle_lines_img, verticle_kernel, iterations=2) 250 | 251 | # Drop vertical contours that are too short to be table borders 252 | (cnts_flag, v_cnts_img) = v_remove_cnts(verticle_lines_img) 253 | 254 | if cnts_flag == 1: 255 | # Morphological operations to isolate the horizontal lines 256 | img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=2) 257 | horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=10) 258 | horizontal_lines_img = cv2.erode(horizontal_lines_img, hori_kernel, iterations=2) 259 | 260 | # Drop horizontal contours narrower than the table width 261 | h_cnts_img = h_remove_cnts(horizontal_lines_img) 262 | 263 | # Blend weights: each line image contributes half of the combined image. 264 | alpha = 0.5 265 | beta = 1.0 - alpha 266 | 267 | # cv2.addWeighted blends the two line images with these weights into a single image.
268 | img_final_bin = cv2.addWeighted(v_cnts_img, alpha, h_cnts_img, beta, 0.0) 269 | img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2) 270 | (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU) 271 | 272 | # Find contours for image, which will detect all the boxes 273 | contours, hierarchy = cv2.findContours( 274 | img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) 275 | # Sort all the contours by top to bottom. 276 | (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom") 277 | 278 | ## Find suitable boxes 279 | boxes = [] 280 | xx = [] 281 | yy = [] 282 | ww = [] 283 | hh = [] 284 | areaa = [] 285 | for c in contours: 286 | # Returns the location and width,height for every contour 287 | x, y, w, h = cv2.boundingRect(c) 288 | area = w * h 289 | box_info = [x, y, w, h, area] 290 | # print(box_info) 291 | image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 255, 0), 5) 292 | if x > 50 and x < 2500 and (x + w) < 2640 and y > 50 and w > 70 and h > 55 and w < 1000 and h < 200: 293 | image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 294 | boxes.append(box_info) 295 | elif x > 3 and w > 800 and (x + w) < 2640 and y > 50 and h > 55 and w < 1000 and h < 200: 296 | image = cv2.rectangle(img1, (x, y), (x+w, y+h),(0, 0, 255), 25) 297 | boxes.append(box_info) 298 | boxes_sorted_y = sorted(boxes, key=lambda x: x[1]) 299 | 300 | ## Sort boxes by x and make rows 301 | i = 1 302 | columns = [] 303 | row_columns = [] 304 | for box in boxes_sorted_y: 305 | columns.append(box) 306 | if i % 7 == 0: 307 | boxes_sorted_x = sorted(columns, key=lambda x: x[0]) 308 | row_columns.append(boxes_sorted_x) 309 | columns = [] 310 | i += 1 311 | 312 | idx = 0 313 | csv_row_col = [] 314 | col = 0 315 | for columns in row_columns: 316 | csv_cols = [] 317 | if col == 0: 318 | row = 0 319 | for box in columns: 320 | idx += 1 321 | new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]] 322 | time_str 
= str(int(round(time.time() * 1000))) 323 | # w_filename = cropped_dir_path+filename_no_folder+ '_' +time_str+ '_' +str(idx) + '.png' 324 | if row == 0: 325 | w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Address.png' 326 | if row == 3: 327 | w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Guardian.png' 328 | if row == 4: 329 | w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Name.png' 330 | cv2.imwrite(w_filename, new_img) 331 | # csv_cols.append(filename_no_xtensn+ '_' +time_str+ '_' +str(idx) + '.png') 332 | row += 1 333 | else: 334 | row = 0 335 | for box in columns: 336 | if row == 0 or row == 3 or row == 4: 337 | idx += 1 338 | new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]] 339 | if row == 0: 340 | w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Address.png' 341 | if row == 3: 342 | w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Guardian.png' 343 | if row == 4: 344 | w_filename = cropped_dir_path+filename_no_folder+ '_' +str(idx) +'_Name.png' 345 | cv2.imwrite(w_filename, new_img) 346 | csv_cols.append(w_filename.split(cropped_dir_path)[1]) 347 | else: 348 | idx += 1 349 | new_img = img[box[1]:box[1]+box[3], box[0]:box[0]+box[2]] 350 | new_img = img_preprocessing(new_img) 351 | data = "&& " + str(split_number(new_img)) 352 | csv_cols.append(data) 353 | font = cv2.FONT_HERSHEY_SIMPLEX 354 | cv2.putText(img, data, (box[0],box[1]), font, 2, (0, 255, 0), 2, cv2.LINE_AA) 355 | row += 1 356 | # Add page number to last column 357 | csv_cols.append(filename_no_folder.split("-")[1].split(".")[0]) 358 | csv_row_col.append(csv_cols) 359 | col += 1 360 | with open(out_csv, 'a', newline='') as f: 361 | writer = csv.writer(f, delimiter=',') 362 | writer.writerows(csv_row_col) #considering my_list is a list of lists. 
363 | 364 | else: 365 | print("no table in ", cropped_dir_path+filename_no_folder) 366 | else: 367 | print("no table in ", cropped_dir_path+filename_no_folder) 368 | 369 | 370 | def main(start_time): 371 | start_total = str(int(round(time.time() * 1000))) 372 | for filename in os.listdir(input_folder): 373 | print(filename) 374 | start_milliseconds = str(int(round(time.time() * 1000))) 375 | file_path = input_folder + filename 376 | box_extraction(file_path, output_folder) 377 | end_milliseconds = str(int(round(time.time() * 1000))) 378 | end_total = str(int(round(time.time() * 1000))) 379 | caculate_time_difference(start_total, end_total, "total") 380 | 381 | 382 | if __name__ == '__main__': 383 | print("Reading image..") 384 | start_time = str(int(round(time.time() * 1000))) 385 | main(start_time) -------------------------------------------------------------------------------- /table_1.csv: -------------------------------------------------------------------------------- 1 | 2 | 16544460977400001-03_1655120228532_8.png,62,35201-4785657-9,16544460977400001-03_1655120230458_11.png,16544460977400001-03_1655120230477_12.png,1,1 3 | 16544460977400001-03_1655120230821_15.png,37,35202-2948435-9,16544460977400001-03_1655120231182_18.png,16544460977400001-03_1655120231183_19.png,1,2 4 | 16544460977400001-03_1655120231530_22.png,36,36202-8131266-7,16544460977400001-03_1655120231881_25.png,16544460977400001-03_1655120231882_26.png,4,3 5 | 16544460977400001-03_1655120232214_29.png,35,35202-8256565-9,16544460977400001-03_1655120232566_32.png,16544460977400001-03_1655120232567_33.png,1,4 6 | 16544460977400001-03_1655120232889_36.png,32,35201-6358583-5,16544460977400001-03_1655120233245_39.png,16544460977400001-03_1655120233246_40.png,1,5 7 | 16544460977400001-03_1655120233581_43.png,30,35202-0886762-7,16544460977400001-03_1655120233950_46.png,16544460977400001-03_1655120233951_47.png,1,6 8 | 
16544460977400001-03_1655120234366_50.png,91,39202-4566769-5,16544460977400001-03_1655120234776_53.png,16544460977400001-03_1655120234777_54.png,2,1 9 | 16544460977400001-03_1655120235138_57.png,58,35202-4386216-5,16544460977400001-03_1655120235500_60.png,16544460977400001-03_1655120235500_61.png,2,8 10 | 16544460977400001-03_1655120235834_64.png,52,35202-2789094-9,16544460977400001-03_1655120236215_67.png,16544460977400001-03_1655120236216_68.png,2,9 11 | 16544460977400001-03_1655120236669_71.png,49,35202-3433250-9,16544460977400001-03_1655120237739_74.png,16544460977400001-03_1655120239279_75.png,2,0 12 | 16544460977400001-03_1655120239968_78.png,49,30202-6576483-7,16544460977400001-03_1655120240607_81.png,16544460977400001-03_1655120240608_82.png,2,41 13 | 16544460977400001-03_1655120241169_85.png,22,39202-5662367-9,16544460977400001-03_1655120241779_88.png,16544460977400001-03_1655120241799_89.png,2,42 14 | 16544460977400001-03_1655120242208_92.png,28,32202-6306618-7,16544460977400001-03_1655120242988_95.png,16544460977400001-03_1655120242989_96.png,2,413 15 | 16544460977400001-03_1655120243660_99.png,28,-098,16544460977400001-03_1655120244033_102.png,16544460977400001-03_1655120244034_103.png,2,14 16 | 16544460977400001-03_1655120244412_106.png,,39202-5974213-7,16544460977400001-03_1655120244878_109.png,16544460977400001-03_1655120244878_110.png,2,15 17 | 16544460977400001-03_1655120245234_113.png,2,35202-4943846-5,16544460977400001-03_1655120245667_116.png,16544460977400001-03_1655120245667_117.png,2,18 18 | 16544460977400001-03_1655120246082_120.png,65,35202-5052350-9,16544460977400001-03_1655120246518_123.png,16544460977400001-03_1655120246519_124.png,3,17 19 | 16544460977400001-03_1655120247000_127.png,54,35202-7949955-1,16544460977400001-03_1655120247433_130.png,16544460977400001-03_1655120247434_131.png,3,18 20 | 
16544460977400001-03_1655120247977_134.png,46,35202-4918828-9,16544460977400001-03_1655120248349_137.png,16544460977400001-03_1655120248350_138.png,3,19 21 | 16544460977400001-03_1655120248800_141.png,38,35201-7525691-3,16544460977400001-03_1655120249187_144.png,16544460977400001-03_1655120249188_145.png,3,20 22 | 16544460977400001-03_1655120249701_148.png,31,35202-4588576-3,16544460977400001-03_1655120250104_151.png,16544460977400001-03_1655120250105_152.png,3,21 23 | 16544460977400001-03_1655120250661_155.png,30,36202-1916332-5,16544460977400001-03_1655120251089_158.png,16544460977400001-03_1655120251090_159.png,3,22 24 | 16544460977400001-03_1655120251499_162.png,25,35202-2034663-9,16544460977400001-03_1655120251978_165.png,16544460977400001-03_1655120251979_166.png,3,23 25 | 16544460977400001-03_1655120252382_169.png,56,35202-1953178-7,16544460977400001-03_1655120253001_172.png,16544460977400001-03_1655120253002_173.png,4,24 26 | -------------------------------------------------------------------------------- /table_1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Supernova1024/29OCR-PDF-to-CSV-Table-by-CNN/97d205dfd37d65b22284ba644ebdfad2536a190c/table_1.jpg --------------------------------------------------------------------------------
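A minimal sketch of the width-based digit splitting that `split_number` in start_ocr1.py performs, written as a standalone helper so the slicing arithmetic can be checked without OpenCV. The function name `digit_slices` and its `overlap_left` / `overlap_right` parameters are hypothetical and not part of the repository. The width thresholds (3, 24, 43, 65, 80) and the overlaps of 1 and 2 pixels are taken from the source; applying the same overlap to every interior boundary is a simplifying assumption, since the original omits the right-hand overlap on the third slice of a four-digit blob.

```python
def digit_slices(x, w, overlap_left=1, overlap_right=2):
    """Return (start, end) column ranges for the digits in a blob.

    x is the left edge of the contour's bounding box and w its width.
    The digit count is inferred from the width alone, mirroring the
    thresholds in split_number; blobs outside 3 < w < 80 are skipped.
    """
    if 3 < w < 24:
        count = 1
    elif 23 < w < 43:
        count = 2
    elif 42 < w < 65:
        count = 3
    elif 64 < w < 80:
        count = 4
    else:
        return []  # too narrow or too wide to be digits

    slices = []
    for k in range(count):
        # The first slice starts at the box edge; later slices start
        # slightly before the nominal boundary so strokes are not cut.
        start = x if k == 0 else x + int(w * k / count) - overlap_left
        # The last slice ends at the box edge; earlier slices run
        # slightly past the nominal boundary.
        end = x + w if k == count - 1 else x + int(w * (k + 1) / count) + overlap_right
        slices.append((start, end))
    return slices


if __name__ == "__main__":
    # A 30-pixel-wide blob is treated as two touching digits.
    print(digit_slices(0, 30))
```

Each returned pair could be used as `img[y:y+h, start:end]` before resizing the crop to 32x32 for the classifier, which is how the source feeds crops to `recognition_num`.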