├── LICENSE
├── README.md
├── auxiliar_scripts.py
├── demos
    ├── README.md
    ├── Usage demo.gif
    ├── demo_final_example.pdf
    ├── demo_final_example
    │   ├── demo_final_example0001-1.png
    │   ├── demo_final_example0001-2.png
    │   ├── demo_final_example0002-3.png
    │   ├── demo_final_example0002-4.png
    │   ├── demo_final_example0003-5.png
    │   └── demo_final_example0004-6.png
    ├── demo_jumps_interesting_pages_pattern_2.pdf
    ├── demo_jumps_interesting_pages_pattern_2
    │   ├── demo_jumps_interesting_pages_pattern_20001-1.png
    │   ├── demo_jumps_interesting_pages_pattern_20001-2.png
    │   ├── demo_jumps_interesting_pages_pattern_20002-3.png
    │   ├── demo_jumps_interesting_pages_pattern_20002-4.png
    │   ├── demo_jumps_interesting_pages_pattern_20003-5.png
    │   └── demo_jumps_interesting_pages_pattern_20004-6.png
    ├── demo_jumps_not_interesting_pages.pdf
    ├── demo_jumps_not_interesting_pages
    │   ├── demo_jumps_not_interesting_pages0001-1.png
    │   ├── demo_jumps_not_interesting_pages0002-2.png
    │   └── demo_jumps_not_interesting_pages0003-3.png
    ├── demo_no_jumps.pdf
    ├── demo_no_jumps
    │   ├── DEMO_no_jumps0001-1.png
    │   ├── DEMO_no_jumps0002-2.png
    │   └── DEMO_no_jumps0003-3.png
    └── step_by_step_process.png
└── pdf-scraper-with-ocr.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2021 Jacobo José Guijarro Villalba
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # pdf-scraper-with-ocr
 2 | With this tool I aim to make life easier for anyone who needs to scrape PDFs, either by hand or with tools that do not implement any kind of character recognition.
 3 | 
 4 | [![Screencast](https://github.com/JacoboGuijar/pdf-scraper-with-ocr/blob/main/demos/Usage%20demo.gif)](https://github.com/JacoboGuijar/pdf-scraper-with-ocr/blob/main/demos/Usage%20demo.gif)
 5 | 
 6 | 
 7 | ## How it works
 8 | When you run the program a GUI will open with four buttons. Only two of them are available at the beginning: "Choose a PDF" and "Extract Information". We will start by choosing our PDF. Clicking the button opens a new window where we can navigate through our folders and select the PDF we want.
 9 | 
10 | Once we have selected the PDF the "Delete Pages" button will activate. Here we can select the pages we want to delete from our PDF because they do not contain information we want to scrape. Do not worry: the program creates a copy of your PDF and modifies the copy; the original is only read. If you do not want to delete any pages, just leave the field blank. However, if the PDF contains a cover, an index or other one-off pages, you can delete them by listing each page separated by a semicolon: `1;2;10;` will delete pages 1, 2 and 10. To delete a range of pages, give the first and last page separated by a hyphen: `5-10` will delete pages 5, 6, 7, 8, 9 and 10. See below for other commands.
11 | 
12 | Now that we have deleted the pages we did not need, the "PDF to images" button will activate. Pressing it opens a window where we are asked to select the folder where the pages of the PDF will be saved as images. If the PDF has over 100 pages this might take a while (around 25 minutes for 456 pages in my case). The window might look frozen, but do not worry, the program is still running.
13 | 
14 | Finally, once all the pages have been converted to images we can start scraping the PDF. Clicking on "Extract Information" changes the window and presents four new buttons: "Load images", "Undo", "Show image" and "Extract text". Clicking on "Load images" opens a window where we can select the folder where our images were saved. Once we have selected the folder we will be asked whether our PDF follows any pattern. A pattern is used whenever the information we want to obtain is split across several pages. Maybe the phone number of a client is on one page and the email on the next one; in that case we must be sure that every client follows this pattern and has the phone number and email in the same place. If our information is not split across different pages we can write 1, as the pattern repeats every page. We will also need to choose whether we want to see random images or not. We will select not randomized for now; see below for more information.
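
The idea of a pattern can be sketched in a few lines of Python. This is only an illustration and not code from this repository: `group_by_pattern` and the file names are made up for the example. Each group of consecutive pages would end up as one row (one client) in the output file.

```python
from typing import List


def group_by_pattern(pages: List[str], pattern: int) -> List[List[str]]:
    """Split an ordered list of page images into consecutive groups of `pattern` pages."""
    if pattern < 1:
        raise ValueError("pattern must be at least 1")
    return [pages[i:i + pattern] for i in range(0, len(pages), pattern)]


# Hypothetical file names, in page order.
pages = ["page_1.png", "page_2.png", "page_3.png", "page_4.png"]

# Pattern 2: phone number on one page, email on the next.
print(group_by_pattern(pages, 2))
# [['page_1.png', 'page_2.png'], ['page_3.png', 'page_4.png']]

# Pattern 1: every page is a complete record on its own.
print(group_by_pattern(pages, 1))
# [['page_1.png'], ['page_2.png'], ['page_3.png'], ['page_4.png']]
```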
15 | 
16 | When we click on "ok" the program will load a series of preview images where we can select the information we want to keep by clicking and dragging. While the mouse button is held down a red rectangle follows the mouse until the click is released. After releasing the mouse we will be asked for the name of the field we just selected. This name will be the name of the column where this information is stored. After creating as many selections as we want we can click on "Extract text". Go grab a coffee, as this might take a long time, but when it finishes a new file will appear in the folder where you are running this script: an Excel file with all the information you wanted.
17 | 
18 | You can find a series of demos and step by step tutorials in different formats in the 'demos' folder.
19 | 
20 | ## Language configuration and field naming
21 | There are multiple types of text that can be extracted. Here I will explain the different options to improve your text extraction. All of these are suffixes added to the selection name and all work in the same way as in the email example, just changing the ending after the '\_'.
22 | ### Emails
23 | If your main language is not English please change the value of the 'MY_LANG' variable at the beginning of the 'pdf-scraper-with-ocr.py' file to the language you need. You can find the different languages in the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/Data-Files.html).
24 | 
25 | Note that if you want to extract an email the '@' symbol will sometimes not be detected. To improve the accuracy of the email detection you can add '\_email' at the end of the selection name. See:
26 | 
27 | ![](https://i.imgur.com/gughskB.png)
28 | 
29 | This will change the language to English only for this selection, something that seems to help a lot in the email detection.
30 | ### Multiple lines
31 | This program is configured to analyze only one line, as you can see in the demo files. If you need to analyze a field of text that spans multiple lines, add '\_ML' at the end of the selection name. This tells the program that this specific field has multiple lines.
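
One way the two suffixes could translate into OCR settings is sketched below. `ocr_options` is a hypothetical helper, not a function from this repository, and the Tesseract page segmentation modes (7 for a single text line, 6 for a uniform block of text) are my assumption about a reasonable setup; the script itself may configure this differently.

```python
from typing import Tuple


def ocr_options(field_name: str, default_lang: str = "eng") -> Tuple[str, str]:
    """Return (lang, tesseract_config) for a selection name (hypothetical helper)."""
    lang = default_lang
    psm = 7  # assumed default: treat the crop as a single text line
    if field_name.endswith("_email"):
        lang = "eng"  # emails are read with the English model
    elif field_name.endswith("_ML"):
        psm = 6  # assumed: treat the crop as a single uniform block of text
    return lang, f"--psm {psm}"


print(ocr_options("contact_email", default_lang="spa"))  # ('eng', '--psm 7')
print(ocr_options("address_ML", default_lang="spa"))     # ('spa', '--psm 6')
print(ocr_options("name", default_lang="spa"))           # ('spa', '--psm 7')
```

The returned pair could then be passed to `pytesseract.image_to_string(cropped_img, lang=lang, config=config)`.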
32 | 
33 | Different features for different types of text will be added in the future.
34 | 
35 | ## Deleting pages
36 | Every PDF is different. They can be organized in a lot of different ways, which makes automating page deletion kind of a pain. Currently these are the commands supported for deleting pages:
37 | ### Single page deletion
38 | This will delete the pages that correspond to the written indexes: `1;2;10;` will delete pages 1, 2 and 10.
39 | ### Delete pages in a range
40 | This will delete the pages from the first to the last index, separated by a hyphen: `5-10` will delete pages 5, 6, 7, 8, 9 and 10.
41 | ### Delete every Nx pages
42 | If every third page of our PDF holds no interesting information, we can use `Nx` to delete every page whose index is a multiple of N. `3x` will delete pages 3, 6, 9, 12, 15...
43 | ### Delete every Nx + C pages
44 | Maybe the pattern our PDF follows goes like this: page 1 (useful), page 2 (useless), page 3 (useful), (the pattern begins again here) page 4 (useful)... We will need to delete pages 2, 5, 8, 11... Using `3x+1` does this: it splits the PDF into blocks of three pages and deletes the second page of each block (the first page of the block, shifted by 1).
45 | ### Delete everything after or before N
46 | To delete all pages after page N, use `N-`. In the same way, `-N` will delete all pages before N.
47 | ### Combinations
48 | You can combine different methods to delete pages by separating them with semicolons: `4x; 100-; 45;` will delete every fourth page, all pages after index 100 and page 45.
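
The commands above can be summarized as a small parser. The sketch below follows the descriptions in this section rather than the code in auxiliar_scripts.py; `pages_to_delete` is a hypothetical name and the regular expressions are my own choice, so treat it as an illustration of the syntax, not as the program's implementation.

```python
import re
from typing import List


def pages_to_delete(spec: str, total_pages: int) -> List[int]:
    """Expand a deletion string such as '4x; 100-; 45;' into sorted 1-based page numbers."""
    selected = set()
    for raw in spec.split(";"):
        cmd = raw.strip()
        if not cmd:
            continue  # tolerate the trailing semicolon and empty entries
        match = re.fullmatch(r"(\d+)x(?:\+(\d+))?", cmd)
        if match:
            n = int(match.group(1))
            if match.group(2) is None:
                # Nx: every page whose index is a multiple of N
                selected.update(range(n, total_pages + 1, n))
            else:
                # Nx+C: page C + 1 of every block of N pages (3x+1 -> 2, 5, 8, ...)
                selected.update(range(int(match.group(2)) + 1, total_pages + 1, n))
            continue
        match = re.fullmatch(r"(\d*)-(\d*)", cmd)
        if match and (match.group(1) or match.group(2)):
            first, last = match.group(1), match.group(2)
            if first and last:
                selected.update(range(int(first), int(last) + 1))  # A-B, both included
            elif first:
                selected.update(range(int(first) + 1, total_pages + 1))  # N-: after N
            else:
                selected.update(range(1, int(last)))  # -N: before N, as worded above
            continue
        if cmd.isdigit():
            selected.add(int(cmd))  # single page
            continue
        raise ValueError(f"unrecognised command: {cmd!r}")
    # Ignore anything that falls outside the document.
    return sorted(p for p in selected if 1 <= p <= total_pages)


print(pages_to_delete("1;2;10;", 12))      # [1, 2, 10]
print(pages_to_delete("3x+1", 12))         # [2, 5, 8, 11]
print(pages_to_delete("4x; 10-; 5;", 12))  # [4, 5, 8, 11, 12]
```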
49 | 
50 | ## The Show image button
51 | It is important to make sure your selections capture all the information on every page. To help you create better selections you can click on the "Show image" button to navigate across different pages. If you have a pattern of 1 you will see that every time you click on the button the image changes but the rectangles stay in place. If you want to delete any of them you can use the "Undo" button (explained below). If you have a pattern greater than 1, clicking on "Show image" will make your selections disappear. This is because the program keeps track of which selections you have made on which page of the pattern. You can also create selections here, and they will be analyzed alongside the ones on the previous page.
52 | 
53 | ## Randomized preview
54 | Selecting to randomize the preview images can be quite helpful. Often every section in a PDF seems to follow the same pattern and fill the same space, but every now and then some fields might not be where they should, or some piece of text might be bigger than the rectangle you created before. This is where the randomized preview can save your output file. Keep in mind that the random preview still shows images in order according to the pattern you selected; you will just see different repetitions of the pattern instead of the first three that the non-randomized option offers.
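
A minimal sketch of this behaviour, assuming the preview always shows three repetitions of the pattern: `preview_pages` and its parameters are hypothetical names for illustration, not the program's own code. Pages inside a repetition always stay together and in order; only the choice of repetitions is random.

```python
import random
from typing import List, Optional


def preview_pages(pages: List[int], pattern: int, count: int = 3,
                  randomize: bool = False, seed: Optional[int] = None) -> List[int]:
    """Pick `count` repetitions of the pattern to preview, keeping each one intact."""
    groups = [pages[i:i + pattern] for i in range(0, len(pages), pattern)]
    if randomize:
        rng = random.Random(seed)
        groups = rng.sample(groups, min(count, len(groups)))
    else:
        groups = groups[:count]
    # Flatten the chosen groups back into a list of pages.
    return [page for group in groups for page in group]


pages = list(range(1, 11))  # a 10-page PDF with a pattern of 2
print(preview_pages(pages, pattern=2))                  # [1, 2, 3, 4, 5, 6]
print(preview_pages(pages, pattern=2, randomize=True))  # three random pairs, each still in order
```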
55 | 
56 | ## The Undo button
57 | In case you clicked something by mistake, mistyped the name you wanted for a field, or created a rectangle that you later discovered will not capture all the info you wanted, there is an Undo button. The Undo button removes the last rectangle created. If your PDF follows a pattern greater than 1, the Undo button deletes the last rectangle created on the page you are on. For example, if your PDF has a pattern of 3 and you have created two rectangles on page 1, then click on "Show image" to see the next image in your pattern (page 2), create a rectangle there and go back to page 1 (by clicking twice on "Show image"), clicking the Undo button will not delete the selection from page 2; it will delete the last selection created on the page you are on at the moment of clicking.
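
The per-page behaviour of Undo can be pictured as one stack of selections per page of the pattern. The class below is a simplified sketch of that idea; `SelectionStore` and its methods are hypothetical and do not come from the program, and the preview is reduced to cycling through the pages of a single pattern.

```python
from collections import defaultdict


class SelectionStore:
    """Keeps one stack of (name, box) selections per page of the pattern."""

    def __init__(self, pattern: int):
        self.pattern = pattern
        self.current = 0  # index of the pattern page currently on screen
        self.by_page = defaultdict(list)

    def add(self, name, box):
        self.by_page[self.current].append((name, box))

    def show_next(self):
        # "Show image": move to the next page of the pattern, wrapping around.
        self.current = (self.current + 1) % self.pattern

    def undo(self):
        # Remove the last selection made on the page currently shown, if any.
        stack = self.by_page[self.current]
        return stack.pop() if stack else None


store = SelectionStore(pattern=3)
store.add("name", (10, 10, 200, 40))     # page 1
store.add("phone", (10, 60, 200, 90))    # page 1
store.show_next()
store.add("email", (10, 10, 300, 40))    # page 2
store.show_next()
store.show_next()                        # back on page 1
print(store.undo())                      # ('phone', (10, 60, 200, 90))
print(store.by_page[1])                  # [('email', (10, 10, 300, 40))]
```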
58 | 
59 | ## Increase accuracy
60 | This program is configured to create images at 400 DPI, which is above the recommended minimum according to the Tesseract documentation. However, if you want to increase the accuracy, at the cost of a longer execution time, you can change the DPI variable at the beginning of the auxiliar_scripts.py file.
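
To get a feeling for the cost of a higher DPI, the short calculation below shows how the image size grows. It assumes A4 pages (8.27 x 11.69 inches), which may not match your PDF, and `pixel_size` is just a helper written for this example. The number of pixels to render and analyze grows with the square of the DPI.

```python
def pixel_size(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page of the given size in inches at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


A4_INCHES = (8.27, 11.69)  # assumed page size

for dpi in (300, 400, 600):
    width, height = pixel_size(*A4_INCHES, dpi)
    megapixels = width * height / 1_000_000
    print(f"{dpi} DPI -> {width} x {height} px ({megapixels:.1f} MP per page)")
```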
61 | 
62 | ## Final note
63 | If you think this tool might help you and you want to thank me for my work, please consider using Paypal to help me pay my loans: [https://www.paypal.com/donate?hosted_button_id=4TGWFN2Y6BTZE](https://www.paypal.com/donate?hosted_button_id=4TGWFN2Y6BTZE)
64 | 


--------------------------------------------------------------------------------
/auxiliar_scripts.py:
--------------------------------------------------------------------------------
  1 | import pdf2image
  2 | from PyPDF2 import PdfFileWriter, PdfFileReader
  3 | from PIL import Image
  4 | import pytesseract
  5 | import time
  6 | import os
  7 | import inspect
  8 | IMG_FOLDER = 'C:\\Users\\jacob\\Desktop\\Escritorio\\Proyectos python\\PDF-OCR\\temp_img\\'
  9 | #Tesseract is required to use Pytesseract https://github.com/UB-Mannheim/tesseract/wiki
 10 | #In my case Tesseract is installed at this path: 
 11 | pytesseract.pytesseract.tesseract_cmd = r'C:\\Users\\jacob\\AppData\\Local\\Tesseract-OCR\\tesseract.exe'
 12 | #If it is installed somewhere else, replace the path above with yours.
 13 | 
 14 | #POPPLER LATEST VERSION https://github.com/oschwartz10612/poppler-windows/releases/
 15 | PDF_PATH = 'TEST.pdf'
 16 | DPI = 400
 17 | FIRST_PAGE = None
 18 | LAST_PAGE = None
 19 | FORMAT = 'png'
 20 | THREAD_COUNT = 4
 21 | USERPWD = None
 22 | USE_CROPBOX = False
 23 | STRICT = False
 24 | 
 25 | def pdf_to_pil(file, folder, output_file):
 26 | 	start_time = time.time()
 27 | 
 28 | 	pil_images = pdf2image.convert_from_path(file, dpi = DPI, output_folder = folder, output_file = output_file,first_page = FIRST_PAGE,
 29 | 											last_page = LAST_PAGE, fmt = FORMAT, thread_count = THREAD_COUNT, userpw = USERPWD, 
 30 | 											use_cropbox = USE_CROPBOX, strict = STRICT, grayscale = True)
 31 | 	print('Time taken: ' + str(time.time() - start_time))
 32 | 
 33 | def delete_pages(array, file):
 34 | 	#I think this could be done waaay better with regex. Think about it.
 35 | 	clean_array = []
 36 | 	pages = PdfFileReader(file, 'rb').getNumPages()
 37 | 
 38 | 	for num in array:
 39 | 		if 'x' in num and '+' in num:
 40 | 
 41 | 			for x in list(range(0, pages)):
 42 | 				polin = [int(num.split('x')[0]), int(num.split('x')[1].split('+')[1])]
 43 | 
 44 | 				if x % int(polin[0]) == 0:
 45 | 
 46 | 					clean_array.append((x + 1) + polin[1])
 47 | 		if 'x' in num and '+' not in num:
 48 | 			for x in list(range(1, pages + 1)):
 49 | 				if x % int(num.split('x')[0]) == 0:
 50 | 					clean_array.append(x)
 51 | 
 52 | 		if '-' in num:
 53 | 			rang = num.split('-')
 54 | 
 55 | 			if rang[0] != '' and rang[1] != '':	
 56 | 
 57 | 				for i in range(int(rang[1]) - int(rang[0])):
 58 | 					clean_array.append(int(rang[0]) + i)
 59 | 				clean_array.append(int(rang[1]))
 60 | 			elif rang[1] == '' and rang[0] != '':
 61 | 
 62 | 				#'N-': add every page after N without discarding earlier commands
 63 | 				clean_array.extend(range(int(rang[0]) + 1, pages + 1))
 64 | 
 65 | 			else:
 66 | 
 67 | 				#'-N': add pages 1 through N without discarding earlier commands
 68 | 				clean_array.extend(range(1, int(rang[1]) + 1))
 69 | 		elif str.isdigit(num) == True:
 70 | 			clean_array.append(int(num))
 71 | 
 72 | 	return clean_array
 73 | 
 74 | def create_pdf(file, array):
 75 | 	pages_to_keep = []
 76 | 	pdf = PdfFileReader(file, 'rb')
 77 | 	output = PdfFileWriter()
 78 | 	for i in list(range(1,pdf.getNumPages() + 1)):
 79 | 		if i not in array:
 80 | 			pages_to_keep.append(i)
 81 | 			page = pdf.getPage(i - 1)
 82 | 			output.addPage(page)
 83 | 
 84 | 	with open('temp.pdf', 'wb') as temp:
 85 | 		output.write(temp)
 86 | 
 87 | def save_images(pil_images):
 88 | 	index = 1
 89 | 	for image in pil_images:
 90 | 		image.save(IMG_FOLDER + 'page_' + str(index) + '.png')
 91 | 		index += 1
 92 | 
 93 | def get_image_w_h(image_path):
 94 | 	im = Image.open(image_path)
 95 | 	width, height = im.size
 96 | 
 97 | 	return width, height
 98 | 
 99 | def extract_from_crop(image, *args):
100 | 	for arg in args:
101 | 		cropped_img = image.crop(arg)
102 | 		cropped_img.show()
103 | 		img_text = pytesseract.image_to_string(cropped_img, lang = 'spa')
104 | 
105 | 


--------------------------------------------------------------------------------
/demos/README.md:
--------------------------------------------------------------------------------
 1 | # The demos folder
 2 | Here you will find examples in GIF and image formats, as well as the PDFs I used for creating the tutorials and doing some testing.
 3 | 
 4 | This file is also a step-by-step guide to analyzing the demos you can find in this folder, to help you get used to this application.
 5 | 
 6 | ## Analyzing 'demo_no_jumps.pdf'
 7 | This file is the simplest case to analyze, as it does not require deleting any pages and it has a pattern of 1: every page has different content but the same layout.
 8 | 
 9 | Click on 'Choose a PDF' to select the file, then on 'Delete Pages'. As said before, we want to keep all the pages in this PDF, so we will leave the field for the page indexes blank and just click on 'ok'. Clicking on 'PDF to images' will ask for the name of a folder where the pages will be saved as images; for this tutorial it is named after the file, 'demo_no_jumps' (which you can also find in this folder). If done correctly, 'demo_no_jumps' should contain three images that correspond to the three pages of the PDF.
10 | 
11 | Now that we have the images we can click on 'Extract Information' to start the selection process. First select the folder where the images were saved by clicking on 'Load images'. Once the folder has been selected a new window will appear. Here we need to enter the pattern that the PDF follows and whether we want to see a randomized selection of the pages or see them in order. Because every page of our PDF has the same layout and each page corresponds to the information of a different client, we know that the pattern is 1. The randomized option must be filled in, but whatever we choose will not affect the PDF or the order in which the information is extracted, just the images we see.
12 | 
13 | After clicking on 'ok' you are ready to start selecting everything you need and whenever you finish you can click on 'Extract text' to export each field into an xlsx file.
14 | 
15 | ## Analyzing 'demo_jumps_not_interesting_pages.pdf'
16 | Start by taking a look at the PDF. In this case some pages have useless information. We can fix this by using the deleting function. It is easy to spot that the pages we want to delete appear every two pages.
17 | 
18 | Open the PDF and click on 'Delete Pages'. If you have read the README on the main page of this project you might have seen the Deleting pages section. There you can find one of the commands that works perfectly for this kind of file: `Nx`. This deletes one page every N pages. In our case, writing `2x` in the deleting field will delete all the pages we do not want. After clicking on 'ok' you can click on 'PDF to images' to see the resulting images, or simply go to the folder where you are running the program to see the 'temp.pdf' created by the deleting process. Remember that the folder containing the images is also available in this demos folder.
19 | 
20 | Finally go to 'Extract Information' and 'Load images' to select the images folder and select the pattern. In this case once all the unwanted pages have been deleted the PDF will have a pattern of 1, as in the previous example. Select if you want to randomize the preview images or not and start selecting the sections you need.
21 | 
22 | ## Analyzing 'demo_jumps_interesting_pages_pattern_2.pdf'
23 | Opening this file will display a PDF where the information we want to obtain is split across two pages: the main page that we have seen in the previous examples and an extra one that contains additional info about the clients. All of the pages contain important information, so we will not delete any of them but rather change how the images are displayed and saved to obtain everything we need.
24 | 
25 | Start by opening the PDF and clicking on 'Delete Pages'. Leave the field blank, click on 'PDF to images' and then on 'Extract Information'. Once the images folder has been selected we will be asked what pattern our PDF follows. As the information we want is split across two pages, we want to save every two pages as one client in the output file. To accomplish that, enter a pattern of 2. Select whether you want the preview randomized or not and click on 'ok'.
26 | 
27 | Once the images are loaded you can navigate through them by clicking on 'Show image'. We will first see the main page from the other examples and then the new extra page. Now you have two different pages on which to select the information you want to extract. I encourage you to play around creating selections across the different fields.
28 | 
29 | One thing to notice is that the 'Undo' button does not necessarily delete the last rectangle you made. It deletes the last rectangle you made on the page you are on when you click the button. Let's see a couple of examples. First, you create two selections on the main page, then change to the extra page, where you have not made any selections yet, and click the 'Undo' button. It will not delete anything, as there are no selections on the extra page. Second, you create two selections on page one, change to page two and create other selections, and change back to page one. Now you decide that one of your selections does not capture all the info on every page, so you click the 'Undo' button. It will delete the last selection on page one: even though the most recent selection is on page two, the 'Undo' button deletes the last selection on the page you are on when clicking it.
30 | 
31 | ## Analyzing 'demo_final_example.pdf'
32 | In this file we will use both the page deletion utility and the pattern selection. Looking at the file, it might take a couple of minutes to discover that it has a page we want to delete right after every multiple of three. The deleting command used for this has the form `Nx+C`, and for this particular case it is `3x+1`.
33 | 
34 | If you now open temp.pdf or look at the folder where the images are saved, you can see that we now have a file like the one analyzed in the previous example. Following the same logic, enter a pattern of 2 and begin selecting the areas you need.
35 | 


--------------------------------------------------------------------------------
/demos/Usage demo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/Usage demo.gif


--------------------------------------------------------------------------------
/demos/demo_final_example.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example.pdf


--------------------------------------------------------------------------------
/demos/demo_final_example/demo_final_example0001-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example/demo_final_example0001-1.png


--------------------------------------------------------------------------------
/demos/demo_final_example/demo_final_example0001-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example/demo_final_example0001-2.png


--------------------------------------------------------------------------------
/demos/demo_final_example/demo_final_example0002-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example/demo_final_example0002-3.png


--------------------------------------------------------------------------------
/demos/demo_final_example/demo_final_example0002-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example/demo_final_example0002-4.png


--------------------------------------------------------------------------------
/demos/demo_final_example/demo_final_example0003-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example/demo_final_example0003-5.png


--------------------------------------------------------------------------------
/demos/demo_final_example/demo_final_example0004-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_final_example/demo_final_example0004-6.png


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2.pdf


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20001-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20001-1.png


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20001-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20001-2.png


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20002-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20002-3.png


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20002-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20002-4.png


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20003-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20003-5.png


--------------------------------------------------------------------------------
/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20004-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_interesting_pages_pattern_2/demo_jumps_interesting_pages_pattern_20004-6.png


--------------------------------------------------------------------------------
/demos/demo_jumps_not_interesting_pages.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_not_interesting_pages.pdf


--------------------------------------------------------------------------------
/demos/demo_jumps_not_interesting_pages/demo_jumps_not_interesting_pages0001-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_not_interesting_pages/demo_jumps_not_interesting_pages0001-1.png


--------------------------------------------------------------------------------
/demos/demo_jumps_not_interesting_pages/demo_jumps_not_interesting_pages0002-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_not_interesting_pages/demo_jumps_not_interesting_pages0002-2.png


--------------------------------------------------------------------------------
/demos/demo_jumps_not_interesting_pages/demo_jumps_not_interesting_pages0003-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_jumps_not_interesting_pages/demo_jumps_not_interesting_pages0003-3.png


--------------------------------------------------------------------------------
/demos/demo_no_jumps.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_no_jumps.pdf


--------------------------------------------------------------------------------
/demos/demo_no_jumps/DEMO_no_jumps0001-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_no_jumps/DEMO_no_jumps0001-1.png


--------------------------------------------------------------------------------
/demos/demo_no_jumps/DEMO_no_jumps0002-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_no_jumps/DEMO_no_jumps0002-2.png


--------------------------------------------------------------------------------
/demos/demo_no_jumps/DEMO_no_jumps0003-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/demo_no_jumps/DEMO_no_jumps0003-3.png


--------------------------------------------------------------------------------
/demos/step_by_step_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JacoboGuijar/pdf-scraper-with-ocr/3814dc21f32b6dec1122328f00b0afd4e0051d0c/demos/step_by_step_process.png


--------------------------------------------------------------------------------
/pdf-scraper-with-ocr.py:
--------------------------------------------------------------------------------
  1 | 
  2 | import tkinter as tk
  3 | import tkinter.filedialog as filedialog
  4 | from tkinter import ttk
  5 | import os 
  6 | import random
  7 | from PIL import Image, ImageTk
  8 | import pandas as pd
  9 | import auxiliar_scripts
 10 | 
 11 | LARGE_FONT = ('verdana', 12)
 12 | SMALL_FONT = ('verdana', 8)
 13 | MAX_IMG_SIZE = 1080
 14 | MY_LANG = 'eng'
 15 | 
 16 | class ScrollFrame(tk.Frame):
 17 |     def __init__(self, parent):
 18 |         super().__init__(parent)
 19 | 
 20 |         self.canvas = tk.Canvas(self, borderwidth=0, background="#ffffff")          
 21 |         self.viewPort = tk.Frame(self.canvas, background="#ffffff")                     
 22 |         self.vsb = tk.Scrollbar(self, orient="vertical", command=self.canvas.yview)  
 23 |         self.canvas.configure(yscrollcommand=self.vsb.set)                          
 24 | 
 25 |         self.vsb.pack(side="right", fill="y")                                       
 26 |         self.canvas.pack(side="left", fill="both", expand=True)                     
 27 |         self.canvas_window = self.canvas.create_window((4,4), window=self.viewPort, anchor="nw", tags="self.viewPort")
 28 | 
 29 |         self.viewPort.bind("<Configure>", self.onFrameConfigure)                    
 30 |         self.canvas.bind("<Configure>", self.onCanvasConfigure)                     
 31 | 
 32 |         self.onFrameConfigure(None)                                                 
 33 | 
 34 |     def onFrameConfigure(self, event):                                              
 35 |         self.canvas.configure(scrollregion=self.canvas.bbox("all"))                 
 36 | 
 37 |     def onCanvasConfigure(self, event):
 38 |         canvas_width = event.width
 39 |         self.canvas.itemconfig(self.canvas_window, width = canvas_width)           
 40 | 
 41 | 
 42 | class Example(tk.Frame):
 43 |     def __init__(self, root):
 44 | 
 45 |         tk.Frame.__init__(self, root)
 46 |         self.scrollFrame = ScrollFrame(self)
 47 |         self.images_array = []
 48 |         self.image_num = 0
 49 |         self.pattern_count = 0
 50 |         self.x = 0
 51 |         self.y = 0
 52 |         self.rect = None
 53 |         self.rect_count = None
 54 | 
 55 |         self.button1 = tk.Button(self.scrollFrame.viewPort, text = 'Choose a PDF', command = self.choose_pdf)
 56 |         self.button2 = tk.Button(self.scrollFrame.viewPort, text = 'Delete Pages', command = self.process_pdf, state = 'disabled')
 57 |         self.button3 = tk.Button(self.scrollFrame.viewPort, text = 'PDF to images', command = self.pdf_to_images, state = 'disabled')
 58 |         self.button4 = tk.Button(self.scrollFrame.viewPort, text = 'Extract Information', command = self.change_screen)
 59 |         self.button5 = tk.Button(self.scrollFrame.viewPort, text = 'Load images', command = self.load_images, state = 'disabled')
 60 |         self.button6 = tk.Button(self.scrollFrame.viewPort, text = 'Undo', command = self.undo_rectangle, state = 'disabled')
 61 |         self.button7 = tk.Button(self.scrollFrame.viewPort, text = 'Show image', command = self.next_image, state = 'disabled')
 62 |         self.button8 = tk.Button(self.scrollFrame.viewPort, text = 'Extract text', command = self.process_text, state = 'disabled')
 63 | 
 64 |         self.button1.grid(row = 1, column = 1, sticky = 'nesw')
 65 |         self.button2.grid(row = 2, column = 1, sticky = 'nesw')
 66 |         self.button3.grid(row = 3, column = 1, sticky = 'nesw')
 67 |         self.button4.grid(row = 4, column = 1, sticky = 'nesw')
 68 | 
 69 |         self.scrollFrame.pack(side="top", fill="both", expand=True)
 70 |     
 71 |     def choose_pdf(self):
 72 |         ftype = [('PDF document files', '*.pdf')]
 73 | 
 74 |         self.file = filedialog.askopenfile(mode = 'rt', filetypes = ftype)
 75 |         self.button2['state'] = 'normal'
 76 | 
 77 |     def process_pdf(self):
 78 |         self.button1['state'] = 'disabled'
 79 |         self.button2['state'] = 'disabled'
 80 |         self.button3['state'] = 'disabled'
 81 | 
 82 |         self.popup = popupWindow(self)
 83 |         self.popup.validation = True
 84 |         self.popup.button['state'] = 'normal'
 85 |         self.popup.label['text'] = 'Insert the pages you want to delete'
 86 |         self.popup.warning_label['text'] = 'To delete individual pages, write them separated by semicolons: 1;4;12.\nTo delete a range of pages, write the first and last page separated by a hyphen: 3-15.\nTo delete one page out of every N, write Nx: for example, 3x deletes one page every three pages.\nLeave the field blank if you do not want to delete any pages.'
 87 |         self.master.wait_window(self.popup.top)
 88 | 
 89 |         try:
 90 |             to_delete = auxiliar_scripts.delete_pages(self.popup.value.split(';'), self.file.name)
 91 |         except AttributeError:
 92 |             to_delete = auxiliar_scripts.delete_pages([], self.file.name)
 93 |         auxiliar_scripts.create_pdf(self.file.name, to_delete)
 94 | 
 95 |         self.button1['state'] = 'normal'
 96 |         self.button2['state'] = 'normal'
 97 |         self.button3['state'] = 'normal'
 98 | 
 99 |     def pdf_to_images(self):
100 |         self.button1['state'] = 'disabled'
101 |         self.button2['state'] = 'disabled'
102 |         self.button3['state'] = 'disabled'
103 |         self.button4['state'] = 'disabled'
104 | 
105 |         self.popup = popupWindow(self)
106 |         self.popup.validation = True
107 |         self.popup.label['text'] = 'Folder name'
108 |         self.popup.warning_label['text'] = 'Write the name of the folder where each pdf page will be saved as an image file.\nPlease make sure there are no folders with that name.'
109 |         self.popup.entry.focus_set()
110 |         self.master.wait_window(self.popup.top)
111 | 
112 |         progress_bar = ttk.Progressbar(self.scrollFrame.viewPort, orient = 'horizontal', mode = 'indeterminate')
113 |         progress_bar.grid(row = 5, column = 1, sticky = 'nesw')
114 |         progress_bar.start()
115 |         try:
116 |             folder = os.path.join(os.path.dirname(os.path.abspath(__file__)), self.popup.value) #os.path.join keeps the path portable across operating systems
117 |             os.makedirs(folder, exist_ok = True)
118 |             print(self.file.name.split('/')[-1].split('.')[0])
119 |             auxiliar_scripts.pdf_to_pil('temp.pdf', folder, self.file.name.split('/')[-1].split('.')[0])
120 |         except TypeError:
121 |             folder = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'temporal_directory') #Fallback folder when no name was entered
122 |             os.makedirs(folder, exist_ok = True)
123 |             print(self.file.name.split('/')[-1].split('.')[0])
124 |             auxiliar_scripts.pdf_to_pil('temp.pdf', folder, self.file.name.split('/')[-1].split('.')[0])
125 |         progress_bar.grid_remove()
126 | 
127 |         self.button1['state'] = 'normal'
128 |         self.button2['state'] = 'normal'
129 |         self.button3['state'] = 'normal'
130 |         self.button4['state'] = 'normal'
131 | 
132 |     def change_screen(self):
133 |         self.button1['state'] = 'disabled'
134 |         self.button2['state'] = 'disabled'
135 |         self.button3['state'] = 'disabled'
136 |         self.button4['state'] = 'disabled'
137 |         self.button5['state'] = 'normal'
138 | 
139 |         self.button1.grid_remove()
140 |         self.button2.grid_remove()
141 |         self.button3.grid_remove()
142 |         self.button4.grid_remove()
143 | 
144 |         self.button5.grid(row = 0, column = 0, sticky = 'ne')
145 |         self.button6.grid(row = 0, column = 1, sticky = 'ne')
146 |         self.button7.grid(row = 0, column = 2, sticky = 'ne')
147 |         self.button8.grid(row = 0, column = 3, sticky = 'ne')
148 | 
149 |     def load_images(self):
150 |         self.button5['state'] = 'disabled'
151 | 
152 |         self.directory = filedialog.askdirectory()
153 | 
154 |         self.popup = popupWindow(self)
155 |         self.popup.label['text'] = 'How often does the pdf repeat?'
156 |         self.popup.warning_label['text'] = 'If your pdf repeats every 3 pages write: \'3\'. If your PDF follows a pattern where every page is the same write: \'1\'.'
157 |         self.popup.optMenu.grid()
158 |         self.popup.warning_label_2.grid()
159 |         self.popup.entry.focus_set()
160 |         self.master.wait_window(self.popup.top)
161 | 
162 |         self.load = [] #In case we change directory the images shown won't mix
163 |         sample_images = os.listdir(self.directory)
164 |         sample_images.sort()
165 | 
166 |         try:
167 |             self.pattern = int(self.popup.value)
168 |             if self.pattern < 1: #A pattern of 0 would break the divisions and modulo operations below
169 |                 self.pattern = 1
170 |         except Exception:
171 |             self.pattern = 1
172 | 
173 |         if self.popup.random_state == 'Randomized':
174 |             try:
175 |                 if self.pattern < len(sample_images):
176 |                     debug_array = []
177 |                     sample_img_index = random.sample(list(range(int(len(sample_images)/self.pattern))), 3)
178 |                     print(sample_img_index)
179 |                     for img_index in sample_img_index:
180 |                         for index in range(self.pattern):
181 |                             sample_images.append(sample_images[img_index*self.pattern + index])
182 |                             debug_array.append(img_index*self.pattern + index + 1)
183 |                     
184 |                     sample_images = sample_images[-self.pattern*3:]
185 | 
186 |             except ValueError:
187 |                 sample_images = sample_images[0:self.pattern * 3] #Fewer than 3 full sets available: fall back to the first images
188 |         else:
189 |             try:
190 |                 if self.pattern < len(sample_images):
191 |                     sample_images = sample_images[0:self.pattern * 3]
192 |             except ValueError:
193 |                 sample_images = sample_images[0:6]
194 | 
195 |         for image in sample_images:
196 |             img = Image.open(os.path.join(self.directory, image))
197 |             factor = 1 #resize images (factor stays 1 when the image already fits)
198 |             if img.height > MAX_IMG_SIZE or img.width > MAX_IMG_SIZE:
199 |                 if img.height > img.width:
200 |                     factor = MAX_IMG_SIZE / img.height
201 |                 else:
202 |                     factor = MAX_IMG_SIZE / img.width
203 |             resized_img = img.resize((int(img.width * factor), int(img.height * factor)))
204 |             img_dict = {'img': resized_img, 'factor': factor}
205 |             self.images_array.append(img_dict)
206 | 
207 |         print(sample_images)
208 |         self.load = self.images_array[self.image_num]['img']
209 |         self.render = ImageTk.PhotoImage(self.load)
210 |         self.scrollFrame.canvas['width'] = self.render.width()
211 |         self.scrollFrame.canvas['height'] = self.render.height()
212 |         self.scrollFrame.scrollregion = (0,0,self.images_array[self.image_num]['img'].width, self.images_array[self.image_num]['img'].height)
213 |         self.image_on_canvas = self.scrollFrame.canvas.create_image(0,0,anchor = 'nw', image = self.render)
214 | 
215 |         self.rect_count = [([]) for i in range(self.pattern)]
216 |         self.array_crop_coords = [([]) for i in range(self.pattern)]
217 |         self.array_crop_coords_dict = [([]) for i in range(self.pattern)]
218 | 
219 |         self.button5['state'] = 'normal'
220 |         self.button6['state'] = 'normal'
221 |         self.button7['state'] = 'normal'
222 |         self.button8['state'] = 'normal'
223 | 
224 |         self.mouse_position_label = tk.Label(self.scrollFrame.viewPort, text = 'x: y:')
225 | 
226 |         self.scrollFrame.canvas.bind('<ButtonPress-1>', self.on_button_press)
227 |         self.scrollFrame.canvas.bind('<B1-Motion>', self.on_move_press)
228 |         self.scrollFrame.canvas.bind('<Motion>', self.on_move)
229 |         self.scrollFrame.canvas.bind('<ButtonRelease-1>', self.on_button_release)
230 | 
231 |     def process_text(self):
232 | 
233 |         self.button5['state'] = 'disabled'
234 |         self.button6['state'] = 'disabled'
235 |         self.button7['state'] = 'disabled'
236 |         self.button8['state'] = 'disabled'
237 | 
238 |         list_of_lists = []
239 |         keys = ['image']
240 |         for i in range(len(self.array_crop_coords)):
241 |             for j in range(len(self.array_crop_coords[i])):
242 |                 list_of_lists.append([])
243 |                 keys.append(self.array_crop_coords_dict[i][j]['name'])
244 |         images_to_process = os.listdir(self.directory)
245 |         images_to_process.sort()
246 |         print(images_to_process)
247 |         progress_window = tk.Toplevel()
248 |         tk.Label(progress_window, text = 'Please wait while the text is extracted').grid() #The label must be gridded to be displayed
249 | 
250 |         progress = 0
251 |         progress_var = tk.DoubleVar()
252 |         progress_bar = ttk.Progressbar(progress_window, variable = progress_var, maximum = 100)
253 |         progress_bar.grid()
254 |         progress_window.pack_slaves()
255 |         progress_step = float(100 / len(images_to_process))
256 | 
257 |         lst = []
258 |         for k in range(len(images_to_process)):
259 |             progress_window.update()
260 |             img = Image.open(os.path.join(self.directory, images_to_process[k]))
261 |             i = k % self.pattern
262 |             if i == 0:
263 |                 lst = []
264 |                 lst.append(images_to_process[k])
265 |             if len(self.array_crop_coords[i]) > 0:
266 | 
267 |                 for j in range(len(self.array_crop_coords[i])):
268 |                     
269 |                     #This is to make sure the rectangles can be created with any starting point
270 |                     if self.array_crop_coords[i][j][0] > self.array_crop_coords[i][j][2] and self.array_crop_coords[i][j][1] > self.array_crop_coords[i][j][3]:
271 |                         self.array_crop_coords[i][j] = (self.array_crop_coords[i][j][2],self.array_crop_coords[i][j][3], self.array_crop_coords[i][j][0], self.array_crop_coords[i][j][1])
272 |                     
273 |                     elif self.array_crop_coords[i][j][0] < self.array_crop_coords[i][j][2] and self.array_crop_coords[i][j][1] > self.array_crop_coords[i][j][3]:
274 |                         self.array_crop_coords[i][j] = (self.array_crop_coords[i][j][0],self.array_crop_coords[i][j][3], self.array_crop_coords[i][j][2], self.array_crop_coords[i][j][1])
275 |                     
276 |                     elif self.array_crop_coords[i][j][0] > self.array_crop_coords[i][j][2] and self.array_crop_coords[i][j][1] < self.array_crop_coords[i][j][3]:
277 |                         self.array_crop_coords[i][j] = (self.array_crop_coords[i][j][2],self.array_crop_coords[i][j][1], self.array_crop_coords[i][j][0], self.array_crop_coords[i][j][3])
278 |                     cropped_image = img.crop(self.array_crop_coords[i][j])
279 |                     #cropped_image.show()
280 |                     if self.array_crop_coords_dict[i][j]['name'].endswith('_email'):
281 |                         texto = auxiliar_scripts.pytesseract.image_to_string(cropped_image, lang = 'eng', config = '--psm 7')
282 |                     elif self.array_crop_coords_dict[i][j]['name'].endswith('_ML'):
283 |                         texto = auxiliar_scripts.pytesseract.image_to_string(cropped_image, lang = MY_LANG, config = '--psm 6') #This one seems to be the one that works best for multiple lines
284 |                     else:
285 |                         texto = auxiliar_scripts.pytesseract.image_to_string(cropped_image, lang = MY_LANG, config = '--psm 7')
286 | 
287 |                     clean_text = texto.strip('\n\x0c')
288 | 
289 |                     lst.append(clean_text)
290 | 
291 |             list_of_lists.append(lst)
292 |             img.close()
293 |             progress += progress_step
294 |             progress_var.set(progress)
295 | 
296 | 
297 |         #Here we will create a method that writes the Excel file better.
298 |         #For now this seems to work.
299 |         dict_array = []
300 |         for count, lst in enumerate(list_of_lists):
301 |             if self.pattern != 1 and count%self.pattern==1 and len(lst) > 0:
302 |                 dic = dict(zip(keys, lst))
303 |                 dict_array.append(dic)
304 |             elif self.pattern == 1 and len(lst) > 0:
305 |                 dic = dict(zip(keys, lst))
306 |                 dict_array.append(dic)
307 |         print(dict_array)
308 |         
309 |         df = pd.DataFrame(data = dict_array, columns = keys)
310 |         df.to_excel('output.xlsx')
311 |         progress_window.destroy()
312 | 
313 |         self.button5['state'] = 'normal'
314 |         self.button6['state'] = 'normal'
315 |         self.button7['state'] = 'normal'
316 |         self.button8['state'] = 'normal'
317 | 
318 |     def next_image(self):
319 |         self.image_num += 1
320 |         self.pattern_count += 1
321 |         if self.image_num == len(self.images_array):
322 |             self.image_num = 0
323 |         if self.pattern_count == self.pattern:
324 |             self.pattern_count = 0 
325 | 
326 |         for i in range(self.pattern):
327 |             for j in self.rect_count[i]:
328 |                 print('i: {} j: {}'.format(i, j))
329 |                 if i == self.pattern_count:
330 |                     self.scrollFrame.canvas.itemconfigure(j, state = 'normal')
331 |                 else:
332 |                     self.scrollFrame.canvas.itemconfigure(j, state = 'hidden')
333 | 
334 |         load = self.images_array[self.image_num]['img']
335 |         render = ImageTk.PhotoImage(load)
336 |         self.scrollFrame.canvas.itemconfig(self.image_on_canvas, image = render)
337 |         self.scrollFrame.canvas.image = render
338 |         self.scrollFrame.canvas.place()
339 |     def on_button_press(self, event):
340 |         if self.scrollFrame.vsb.get()[0] == 0:
341 |             self.start_x = event.x 
342 |             self.start_y = event.y
343 |         else:
344 |             self.start_x = event.x 
345 |             self.start_y = event.y + self.scrollFrame.vsb.get()[0] * self.images_array[self.image_num]['img'].height
346 | 
347 |         self.rect = self.scrollFrame.canvas.create_rectangle(self.x, self.y, 1,1, outline = 'red', tags = str(self.rect_count))
348 |         self.rect_count[self.pattern_count].append(self.rect)
349 |     def on_move_press(self, event):
350 |         if event.y >= event.y + self.scrollFrame.vsb.get()[0] * self.images_array[self.image_num]['img'].height:
351 |             curX = event.x
352 |             curY = event.y
353 |         else:
354 |             curX = event.x 
355 |             curY = event.y + self.scrollFrame.vsb.get()[0] * self.images_array[self.image_num]['img'].height
356 | 
357 |         text = 'x: {x} y: {y}\nx: {x} y: {y_test}'.format(x = event.x, y = event.y, y_test = event.y)
358 | 
359 |         self.scrollFrame.canvas.coords(self.rect, self.start_x, self.start_y, curX, curY)
360 |         self.mouse_position_label.config(text = text)
361 |     def on_move(self, event):
362 |         text = 'x: {x} y: {y}\nx: {x} y: {y_test}'.format(x = event.x, y = event.y, y_test = event.y)
363 |         self.mouse_position_label.config(text = text)
364 |         #self.mouse_position_label.grid(column = 4, row = 0)
365 |     def on_button_release(self, event):
366 |         end_x = event.x 
367 |         end_y = event.y 
368 |         if event.y >= event.y + self.scrollFrame.vsb.get()[0] * self.images_array[self.image_num]['img'].height:
369 |             end_x = event.x
370 |             end_y = event.y
371 |         else:
372 |             end_x = event.x 
373 |             end_y = event.y + self.scrollFrame.vsb.get()[0] * self.images_array[self.image_num]['img'].height
374 |         start_x_factor = int(self.start_x / self.images_array[self.image_num]['factor'])
375 |         start_y_factor = int(self.start_y / self.images_array[self.image_num]['factor'])
376 |         end_x_factor = int(end_x / self.images_array[self.image_num]['factor'])
377 |         end_y_factor = int(end_y / self.images_array[self.image_num]['factor'])
378 |         coords = 'x1: {start_x} y1: {start_y} x2: {end_x} y2: {end_y}'.format(start_x = self.start_x, start_y = self.start_y, end_x = end_x, end_y = end_y)
379 |         coords_factor = 'x1: {start_x} y1: {start_y} x2: {end_x} y2: {end_y}'.format(start_x = start_x_factor, start_y = start_y_factor, end_x = end_x_factor, end_y = end_y_factor)
380 |         crop_coords = (int(self.start_x / self.images_array[self.image_num]['factor']), 
381 |                        int(self.start_y / self.images_array[self.image_num]['factor']),
382 |                        int(end_x / self.images_array[self.image_num]['factor']), 
383 |                        int(end_y / self.images_array[self.image_num]['factor']))
384 | 
385 | 
386 |         self.array_crop_coords[self.pattern_count].append(crop_coords)
387 |         self.popup = popupWindow(self)
388 |         self.popup.validation = True
389 |         self.popup.entry.focus_set()
390 |         self.master.wait_window(self.popup.top)
391 |         self.array_crop_coords_dict[self.pattern_count].append({'name': self.popup.value, 'coords': self.array_crop_coords[self.pattern_count][-1]})
392 |         print(self.array_crop_coords_dict)        
393 |     def undo_rectangle(self):
394 |         if len(self.rect_count[self.pattern_count]) > 0:
395 |             self.rect = self.rect_count[self.pattern_count].pop()
396 |             self.scrollFrame.canvas.delete(self.rect)
397 |             self.array_crop_coords[self.pattern_count].pop()
398 |             self.array_crop_coords_dict[self.pattern_count].pop()
399 |         else:
400 |             print('There are no more areas to delete.')
401 |             self.button8['state'] = 'disabled'
402 | 
403 | class popupWindow(object):
404 |     def __init__(self, master):
405 |         top = self.top = tk.Toplevel(master)
406 |         self.value = tk.StringVar()
407 |         self.random_state = tk.StringVar()
408 |         self.validation = False
409 |         self.label = tk.Label(top, text = 'Write a name for your selection')
410 |         self.label.grid()
411 |         self.entry = tk.Entry(top, textvariable=self.value)
412 |         self.entry.grid()
413 |         self.button = tk.Button(top, text = 'Ok', command = self.cleanup, state = 'disabled')
414 |         self.button.grid()
415 |         self.warning_label = tk.Label(top, text = 'This will be the name for the column where this field is stored.')
416 |         self.warning_label.grid()
417 |         self.optMenu = ttk.Combobox(top,values = ['Not randomized', 'Randomized'], state = "readonly", textvariable=self.random_state)
418 |         #self.optMenu.grid()
419 |         self.warning_label_2 = tk.Label(top, text = 'Randomizing the preview images can help you detect small differences in your selection boxes that you might otherwise miss.\nRandom sets of images will be chosen, but each set will appear in order.\nA set is a succession of images that belong to one repetition of the PDF.')
420 |         #self.warning_label_2.grid()
421 | 
422 |         self.value.trace_add('write', self.validate) #trace_add replaces the deprecated trace('w', ...)
423 |         self.random_state.trace_add('write', self.validate)
424 | 
425 |     def validate(self, *args):
426 |         if self.value.get() and self.random_state.get():
427 |             self.button.config(state = 'normal')
428 |             print(self.optMenu.get())
429 | 
430 |         elif self.validation:
431 |             self.button.config(state = 'normal')
432 |         else:
433 |             self.button.config(state = 'disabled')
434 |     def cleanup(self):
435 |         self.random_state = self.optMenu.get()
436 |         self.value = self.entry.get()
437 |         self.top.destroy()
438 | 
439 | if __name__ == "__main__":
440 |     root=tk.Tk()
441 |     Example(root).pack(side="top", fill="both", expand=True)
442 |     root.mainloop()
443 | 
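
The chained comparisons in `process_text` reorder each stored rectangle so it can be cropped no matter which corner the drag started from, and `on_button_release` divides the canvas coordinates by the preview's `factor` to get back to source pixels. A minimal sketch of the same idea follows; `normalize_box` and `to_original_pixels` are hypothetical helper names that do not exist in this repository.

```python
def normalize_box(box):
    """Return (left, upper, right, lower) regardless of drag direction."""
    x1, y1, x2, y2 = box
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def to_original_pixels(box, factor):
    """Map a box drawn on a preview scaled by `factor` back to source pixels.

    `factor` is assumed to be preview_size / original_size, as stored in
    images_array[...]['factor'] by load_images.
    """
    return tuple(int(value / factor) for value in normalize_box(box))


if __name__ == "__main__":
    # A rectangle dragged from bottom-right to top-left on a half-size preview.
    dragged = (300, 200, 100, 50)
    print(normalize_box(dragged))
    print(to_original_pixels(dragged, 0.5))
```

Taking `min` and `max` per axis covers all four drag directions in one expression, so the result can be passed straight to `Image.crop`, which expects a `(left, upper, right, lower)` box.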

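
The popup shown by `process_pdf` describes three ways to list pages for deletion: single pages (`1;4;12`), ranges (`3-15`), and a periodic form (`3x`). The real parsing lives in `auxiliar_scripts.delete_pages`, which is not shown in this file, so the following is only an illustrative sketch. `parse_page_spec` is a hypothetical name, and reading `3x` as pages 3, 6, 9, ... is an assumption rather than confirmed behavior.

```python
def parse_page_spec(spec, total_pages):
    """Expand a spec such as '1;4;12', '3-15' or '3x' into sorted page numbers.

    Pages are 1-based. The 'Nx' form is ASSUMED to mean pages N, 2N, 3N, ...;
    the actual rule used by auxiliar_scripts.delete_pages may differ.
    """
    pages = set()
    for token in spec.split(';'):
        token = token.strip()
        if not token:
            continue  # A blank field means "delete nothing"
        if token.endswith('x'):
            step = int(token[:-1])
            if step < 1:
                raise ValueError('step must be at least 1')
            pages.update(range(step, total_pages + 1, step))
        elif '-' in token:
            first, last = (int(part) for part in token.split('-', 1))
            pages.update(range(first, last + 1))
        else:
            pages.add(int(token))
    # Drop anything outside the document.
    return sorted(page for page in pages if 1 <= page <= total_pages)


if __name__ == "__main__":
    print(parse_page_spec('1;4;12', 20))
    print(parse_page_spec('3-6', 20))
    print(parse_page_spec('3x', 10))
```

Collecting the pages in a set lets the forms be mixed in one field (for example `1;3-5;4x`) without producing duplicates.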

--------------------------------------------------------------------------------