├── .gitignore
├── LICENSE
├── README.md
├── config.cfg
├── main.py
├── requirements.txt
└── requirements_3_10.txt

/.gitignore:
--------------------------------------------------------------------------------
**/__pycache__
*.pyc
*.egg-info

.DS_Store
.vscode
.gitignore
pdf
word
venv
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2018 simpleapples https://www.simpleapples.com

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished
to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# pdf2word

Multi-process PDF-to-Word conversion in ~~60~~ 40 lines of code

> The new version is built on [https://github.com/dothinking/pdf2docx](https://github.com/dothinking/pdf2docx)

## Usage

* Clone or download the project
```bash
git clone git@github.com:simpleapples/pdf2word.git
```

* Enter the project directory, create a virtual environment, and install the dependencies

```bash
cd pdf2word
python3 -m venv venv

# Linux / macOS
source venv/bin/activate

# Windows
venv\Scripts\activate

# Python < 3.10
pip install -r requirements.txt

# Python 3.10 or later
pip install -r requirements_3_10.txt
```

* Edit `config.cfg` to set the folders that hold the PDF and Word files and the number of worker processes that run at the same time
* Run `python main.py`

## Fixing ModuleNotFoundError: No module named '_tkinter'

### macOS

1. Install Homebrew
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

2. Install tkinter with Homebrew
```bash
brew install python-tk
```

### Linux

Using Ubuntu as an example:

```bash
sudo apt install python3-tk
```

**Stars are welcome**

## Python私房菜

![](http://ww1.sinaimg.cn/large/6ae0adaely1foxc0cfkjsj2076076aac.jpg)

## License

Released under the MIT License
--------------------------------------------------------------------------------
/config.cfg:
--------------------------------------------------------------------------------
[default]
pdf_folder=pdf
word_folder=word
max_worker=10
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import os
import sys
import logging
from configparser import ConfigParser
from concurrent.futures import ProcessPoolExecutor, as_completed

from pdf2docx import Converter


def pdf_to_word(pdf_file_path, word_file_path):
    cv = Converter(pdf_file_path)
    try:
        cv.convert(word_file_path)
    finally:
        # Release the PDF handle even if the conversion fails.
        cv.close()


def main():
    logging.getLogger().setLevel(logging.ERROR)

    config_parser = ConfigParser()
    config_parser.read("config.cfg")
    config = config_parser["default"]

    # Make sure the output folder exists before the workers write to it.
    os.makedirs(config["word_folder"], exist_ok=True)

    tasks = {}
    failed = 0
    with ProcessPoolExecutor(max_workers=int(config["max_worker"])) as executor:
        for file in os.listdir(config["pdf_folder"]):
            file_name, extension_name = os.path.splitext(file)
            if extension_name != ".pdf":
                continue
            pdf_file = os.path.join(config["pdf_folder"], file)
            word_file = os.path.join(config["word_folder"], file_name + ".docx")
            print("Processing:", file)
            tasks[executor.submit(pdf_to_word, pdf_file, word_file)] = file
        # Block until each task finishes instead of polling in a busy loop,
        # and report exceptions raised inside the worker processes.
        for task in as_completed(tasks):
            error = task.exception()
            if error is not None:
                failed += 1
                print("Failed:", tasks[task], error)
    print("Done")
    sys.exit(1 if failed else 0)


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
fire==0.4.0
fonttools==4.28.3
lxml==4.7.1
numpy==1.21.4
opencv-python==4.5.4.60
pdf2docx==0.5.2
PyMuPDF==1.19.3
python-docx==0.8.11
six==1.16.0
termcolor==1.1.0
--------------------------------------------------------------------------------
/requirements_3_10.txt:
--------------------------------------------------------------------------------
fire==0.4.0
fonttools==4.28.3
lxml==4.7.1
numpy==1.21.4
opencv-python==4.10.0.84
pdf2docx==0.5.8
PyMuPDF==1.19.3
python-docx==0.8.11
six==1.16.0
termcolor==1.1.0
--------------------------------------------------------------------------------
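Note on config.cfg: the README asks the user to edit `config.cfg` to set the PDF folder, the Word folder, and the worker count, and `main.py` reads those values without validating them. The sketch below shows one way the same three keys could be parsed and checked before any worker starts. `Settings`, `load_settings`, and `plan_conversions` are hypothetical names that do not exist in this repo, and the fallback values are assumptions taken from the shipped `config.cfg`. Unlike `main.py`, this sketch matches the `.pdf` extension case-insensitively.

```python
from configparser import ConfigParser
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Typed view of the [default] section (hypothetical helper, not in the repo)."""
    pdf_folder: Path
    word_folder: Path
    max_worker: int


def load_settings(text: str) -> Settings:
    """Parse config.cfg-style text into validated settings."""
    parser = ConfigParser()
    parser.read_string(text)
    section = parser["default"]  # raises KeyError if the section is missing
    # getint raises ValueError when max_worker is not an integer.
    max_worker = section.getint("max_worker", fallback=1)
    if max_worker < 1:
        raise ValueError("max_worker must be at least 1")
    return Settings(
        pdf_folder=Path(section.get("pdf_folder", fallback="pdf")),
        word_folder=Path(section.get("word_folder", fallback="word")),
        max_worker=max_worker,
    )


def plan_conversions(file_names, settings):
    """Map file names to (pdf_path, docx_path) pairs, skipping non-PDF files."""
    pairs = []
    for name in sorted(file_names):
        path = Path(name)
        # Assumption: accept .PDF as well as .pdf (main.py only accepts .pdf).
        if path.suffix.lower() != ".pdf":
            continue
        target = settings.word_folder / (path.stem + ".docx")
        pairs.append((settings.pdf_folder / name, target))
    return pairs
```

A caller could pass the text of `config.cfg` to `load_settings` and the result of `os.listdir` to `plan_conversions`, then submit each pair to the executor. Keeping the planning step free of I/O makes it easy to check which files would be converted without running pdf2docx.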