├── .github
    └── workflows
    │   └── python-package.yml
├── LICENSE
├── README.md
├── config.yml
└── main.py


/.github/workflows/python-package.yml:
--------------------------------------------------------------------------------
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Python package

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11"]

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 salikx

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Thanks to the following project
[![Readme Card](https://github-readme-stats.vercel.app/api/pin/?username=hect0x7&repo=JMComic-Crawler-Python)](https://github.com/hect0x7/JMComic-Crawler-Python)


# 📄 Batch Image-to-PDF Conversion Script

This project merges images stored by chapter into a single PDF file. It automatically walks every folder under a specified directory and is designed to avoid high memory usage.

---

## 📌 Features

* Walks the specified root directory, processes the images in each subfolder, and generates a corresponding PDF.
* Supported image formats: `JPG` / `JPEG` / `PNG` / `WEBP` / `BMP`.
* Sorts subdirectories and images numerically so that pages appear in the correct order.
* Detects PDFs that have already been generated to avoid converting them again.
* Error handling and log messages; unreadable images and empty subdirectories are skipped automatically.
* Memory optimization: a generator processes images one at a time instead of loading them all at once.

---

## 📂 Directory Structure Example

```
root_directory/
├── 001/
│   ├── 1.jpg
│   ├── 2.jpg
│   └── …
├── 002/
│   ├── 1.png
│   ├── 2.png
│   └── …
├── Chapter3/
│   ├── 01.webp
│   ├── 02.webp
│   └── …
└── script.py
```

The generated PDFs are saved under `root_directory`:

```
root_directory/
├── 001.pdf
├── 002.pdf
├── Chapter3.pdf
└── …
```

---

## ⚙️ Dependencies

Make sure the following dependencies are installed:

```bash
pip install pillow pyyaml jmcomic
```

> **Note:** `jmcomic` is used to load the configuration file. Make sure it is installed before use, or replace it with your own configuration-loading logic.

---

## 📄 Usage

1. **Clone or download the code.**

2. **Make sure the configuration file exists and is set up correctly:**

   The configuration file path must be specified in the script:

   ```python
   config_path = "D:/18comic_down/code/config.yml"
   ```

   The configuration file must contain the root directory setting:

   ```yaml
   dir_rule:
     base_dir: "D:/your_base_directory"
   ```

3. **Run the script:**

   ```bash
   python script.py
   ```

> To customize parameters or paths, edit `config_path` and the related settings.

---

## 🔧 Parameters

| Parameter     | Description                                  | Notes       |
| ------------- | -------------------------------------------- | ----------- |
| `config_path` | Path to the configuration file               | YAML format |
| `base_dir`    | Root directory containing the image folders  | Must exist  |

---

## 🚩 Feature Details

* **Memory optimization:**

  * Uses a generator to load images one at a time, so each image occupies memory only while it is being processed, which helps keep memory usage down.
* **Sorting rules:**

  * Subfolders are sorted by their numeric names; non-numeric folders come last.
  * Image file names are sorted by their numeric part so that pages are merged into the PDF in the correct order.
* **Error handling:**

  * Skips non-numeric subdirectories.
  * Skips image files that cannot be read or are corrupted.
  * Checks whether the target PDF already exists to avoid converting it again.

---

## 📑 Log Output Example

```
📄 Converting: 001
Generating PDF: D:\your_base_directory\001.pdf
✅ PDF generated: D:\your_base_directory\001.pdf
Done in 5.23 seconds

Skipping existing PDF: 002.pdf

📄 Converting: Chapter3
Generating PDF: D:\your_base_directory\Chapter3.pdf
✅ PDF generated: D:\your_base_directory\Chapter3.pdf
Done in 8.47 seconds
```

---

## ❓ FAQ

1. **The configuration file fails to load:**

   * Check that the `config_path` path is correct.
   * Make sure `config.yml` exists and is correctly formatted.

2. **No image files found:**

   * Make sure the subdirectories contain images of a supported type.
   * Check that the subdirectory names are purely numeric, such as `001` or `002`.

3. **High memory usage:**

   * The script already uses a generator as an optimization. If problems persist, check the image resolution or try splitting the image folders.

---

## 📬 Contact

For questions or suggestions, please open an issue in the repository.

--------------------------------------------------------------------------------
/config.yml:
--------------------------------------------------------------------------------
# Download script configuration for GitHub Actions
version: '2.0'

dir_rule:
  base_dir: D:/18comic_down/books
  rule: Bd_Atitle_Pindex

client:
  domain:
    - 18comic.vip
    - 18comic.org

download:
  cache: true # If a file to download already exists on disk, do not download it again
  image:
    decode: true # JM's original images are scrambled; whether to restore them
    suffix: .jpg # Convert all images to .jpg
  threading:
    # batch_count: number of threads used to batch-download the images of a chapter
    # Larger value: faster downloads, higher hardware requirements, more load on JM
    # Smaller value: slower downloads, lower hardware requirements, less load on JM
    # PS: the JM web page usually requests 50 images at a time
    batch_count: 45
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import os
import time
import yaml
from PIL import Image
import jmcomic

def sorted_numeric_filenames(file_list):
    """Sort file names by the digits they contain."""
    def extract_number(s):
        name, _ = os.path.splitext(s)
        # Join all digits in the stem; names without digits are keyed as 0
        return int(''.join(filter(str.isdigit, name)) or 0)
    return sorted(file_list, key=extract_number)

def convert_images_to_pdf(input_folder, output_path, pdf_name):
    start_time = time.time()
    allowed_extensions = {'.jpg', '.jpeg', '.png', '.webp', '.bmp'}
    output_path = os.path.normpath(output_path)
    os.makedirs(output_path, exist_ok=True)
    pdf_full_path = os.path.join(output_path, f"{os.path.splitext(pdf_name)[0]}.pdf")

    # Ordered list of image paths (not opened images)
    image_iterator = []

    # Collect the subdirectories and sort them: numeric names first, others last
    try:
        subdirs = sorted(
            [d for d in os.listdir(input_folder) if os.path.isdir(os.path.join(input_folder, d))],
            key=lambda x: int(x) if x.isdigit() else float('inf')
        )
    except Exception as e:
        print(f"Error: cannot read directory {input_folder}: {e}")
        return

    for subdir in subdirs:
        subdir_path = os.path.join(input_folder, subdir)
        try:
            files = [f for f in os.listdir(subdir_path)
                     if os.path.isfile(os.path.join(subdir_path, f)) and os.path.splitext(f)[1].lower() in allowed_extensions]
            files = sorted_numeric_filenames(files)
            for f in files:
                image_iterator.append(os.path.join(subdir_path, f))
        except Exception as e:
            print(f"Warning: failed to read subdirectory {subdir_path}: {e}")

    if not image_iterator:
        print("Error: no image files found")
        return

    try:
        def open_image(path):
            img = Image.open(path)
            if img.mode != 'RGB':
                img = img.convert('RGB')
            return img

        # Open images lazily with a generator; the first image is the base image of the PDF
        image_iter = (open_image(p) for p in image_iterator)
        first_image = next(image_iter, None)

        if first_image is None:
            print("Error: no valid images to build a PDF from")
            return

        print(f"Generating PDF: {pdf_full_path}")
        first_image.save(
            pdf_full_path,
            "PDF",
            save_all=True,
            # list() consumes the generator, so the remaining images are all opened before saving
            append_images=list(image_iter),
            optimize=True
        )
        print(f"✅ PDF generated: {pdf_full_path}")

    except Exception as e:
        print(f"❌ Failed to generate PDF: {e}")

    print(f"Done in {time.time() - start_time:.2f} seconds")

def main():
    config_path = "D:/18comic_down/code/config.yml"
    try:
        # Load the option through jmcomic (the returned object is not used further below)
        option = jmcomic.JmOption.from_file(config_path)
        with open(config_path, "r", encoding="utf-8") as f:
            config = yaml.safe_load(f)
            base_dir = config["dir_rule"]["base_dir"]
    except Exception as e:
        print(f"Failed to load configuration: {e}")
        return

    if not os.path.exists(base_dir):
        print(f"Error: root directory does not exist: {base_dir}")
        return

    for entry in os.scandir(base_dir):
        if entry.is_dir():
            pdf_name = f"{entry.name}.pdf"
            pdf_path = os.path.join(base_dir, pdf_name)
            # Skip folders whose PDF has already been generated
            if os.path.exists(pdf_path):
                print(f"Skipping existing PDF: {pdf_name}")
                continue

            print(f"\n📄 Converting: {entry.name}")
            convert_images_to_pdf(
                input_folder=entry.path,
                output_path=base_dir,
                pdf_name=entry.name
            )

if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
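
The ordering rules described in the README (numeric folder names first in numeric order, non-numeric names last; image files ordered by the digits in their names) can be tried out with the standard library alone. The following is a minimal sketch that mirrors the sort keys in `main.py` rather than replacing them; `folder_sort_key` and `digits_in_name` are hypothetical helper names, and the sample folder and file names are made up for illustration.

```python
import os


def folder_sort_key(name):
    """Numeric folder names sort by value; any other name sorts after them."""
    return int(name) if name.isdigit() else float("inf")


def digits_in_name(filename):
    """Join every digit in the file stem into one integer (0 if there are none)."""
    stem, _ = os.path.splitext(filename)
    return int("".join(ch for ch in stem if ch.isdigit()) or 0)


# Hypothetical sample data, not taken from a real download directory
folders = sorted(["10", "Chapter3", "2", "001"], key=folder_sort_key)
pages = sorted(["10.jpg", "2.jpg", "1.jpg", "cover.png"], key=digits_in_name)

print(folders)  # ['001', '2', '10', 'Chapter3']
print(pages)    # ['cover.png', '1.jpg', '2.jpg', '10.jpg']
```

Because all digits in a name are concatenated, a file such as `p1_v2.jpg` is keyed as 12, and a file without any digits is keyed as 0 and therefore sorts first.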