├── .gitignore
├── Readme.md
├── Split.py
├── long-sample.pdf
├── main.py
└── sample.pdf


/.gitignore:
--------------------------------------------------------------------------------
__pycache__/*
splitted/*
*.txt
POLI*


--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
# PDF to Text with Python

## Introduction

This program will:

1. Split your PDF into pages,
2. Extract the text from each page, and
3. Save the text of each page in its own `.txt` file.

## Required
- [PDFtk](https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) ([Why use this?](#why-using-pdftk))
- [PyPDF2](https://github.com/mstamy2/PyPDF2)
- The `pdftotext` command on your `PATH` (`main.py` calls it; it is distributed with Poppler and Xpdf)

## Run
```
$ python main.py your-pdf-file
```

## Why Using PDFtk?

Because PyPDF2's extract function doesn't work on some files.


--------------------------------------------------------------------------------
/Split.py:
--------------------------------------------------------------------------------
from PyPDF2 import PdfFileWriter, PdfFileReader
import os, errno

filename = "long-sample.pdf"
directory = "splitted/" + filename

def split(directory, filename):
    """Write each page of `filename` to `directory` as <page index>.pdf (0-based)."""
    # Create the output directory; ignore the error if it already exists.
    try:
        os.makedirs(directory)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise

    # Keep the source file open while its pages are copied, then close it.
    with open(filename, "rb") as inputStream:
        inputpdf = PdfFileReader(inputStream)
        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            with open(directory + "/%s.pdf" % i, "wb") as outputStream:
                output.write(outputStream)

if __name__ == "__main__":
    split(directory, filename)


--------------------------------------------------------------------------------
/long-sample.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/asepmaulanaismail/pdf-to-txt-python/6229b9274b47bcf104a478958227a89e37374ca2/long-sample.pdf


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import PyPDF2
import Split
from subprocess import call
import sys

if len(sys.argv) < 2:
    print("Error\nFormat: \n\tpython main.py your-pdf-file")
else:
    filename = sys.argv[1]
    directory = "splitted/" + filename

    # Write every page to its own PDF under splitted/<filename>/.
    Split.split(directory, filename)

    # Reopen the source only to count its pages; the file is closed afterwards.
    with open(filename, 'rb') as pdfFileObj:
        numPages = PyPDF2.PdfFileReader(pdfFileObj).numPages

    for i in range(numPages):
        splitted_file_name = directory + "/" + str(i)
        # With no output argument, pdftotext writes <i>.txt next to <i>.pdf.
        call(["pdftotext", splitted_file_name + ".pdf"])
        # f = open(splitted_file_name + '.txt', 'r')
        # print("Page %s" % repr(i+1))
        # print(f.read())
        # print("====================")


--------------------------------------------------------------------------------
/sample.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/asepmaulanaismail/pdf-to-txt-python/6229b9274b47bcf104a478958227a89e37374ca2/sample.pdf


--------------------------------------------------------------------------------
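The extraction loop in `main.py` ignores the exit status of `pdftotext` and does not check that the tool is installed. The following is a minimal sketch of a stricter version of that step, not the repository's own code. The helper names `page_pdf_paths`, `text_path_for`, and `extract_pages` are hypothetical, and the sketch assumes the pages were already written by `Split.split` using the `splitted/<filename>/<index>.pdf` layout and that `pdftotext` is the Poppler or Xpdf command-line tool.

```python
import os
import shutil
import subprocess


def page_pdf_paths(pdf_name, num_pages, root="splitted"):
    """Return the per-page PDF paths in the layout Split.split produces."""
    directory = root + "/" + pdf_name
    return [directory + "/%d.pdf" % i for i in range(num_pages)]


def text_path_for(page_pdf_path):
    """Return the .txt path pdftotext uses when no output file is given."""
    return os.path.splitext(page_pdf_path)[0] + ".txt"


def extract_pages(pdf_name, num_pages, root="splitted"):
    """Run pdftotext on every page PDF and return the resulting .txt paths.

    Hypothetical helper: raises RuntimeError if pdftotext is not on PATH,
    and subprocess.CalledProcessError if a conversion fails.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")

    text_paths = []
    for page_path in page_pdf_paths(pdf_name, num_pages, root):
        # check=True turns a non-zero exit status into an exception.
        subprocess.run(["pdftotext", page_path], check=True)
        text_paths.append(text_path_for(page_path))
    return text_paths
```

After splitting `sample.pdf`, a call such as `extract_pages("sample.pdf", num_pages)` would return the list of `.txt` files to read, with `num_pages` taken from the PDF reader as in `main.py`.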