├── .dockerignore
├── .gitignore
├── requirements.txt
├── Dockerfile
├── README.md
├── parse_pageviews.py
└── zim_converter.py


/.dockerignore:
--------------------------------------------------------------------------------
1 | *db*
2 | *zim
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *db*
2 | *zim
3 | venv
4 | top*articles
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | libzim==2.0.0
2 | zstd==1.5.2.6
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | # syntax=docker/dockerfile:1
2 | 
3 | FROM python:3.8-slim-buster
4 | 
5 | WORKDIR /app
6 | 
7 | COPY requirements.txt requirements.txt
8 | RUN pip3 install -r requirements.txt
9 | 
10 | COPY . .
11 | 
12 | ENTRYPOINT [ "python3", "/app/zim_converter.py" ]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ZIM-converter
2 | 
3 | This Python project converts ZIM files (Kiwix format) to optimized SQLite databases for the WikiReader plugin in KOReader.
4 | 
5 | **Now supports both Wikipedia AND non-Wikipedia content** (iFixit, Wiktionary, etc.) with enhanced error handling and image processing options.
6 | ## WikiReader
7 | 
8 | I created this plugin for KOReader during some time off: https://github.com/koreader/koreader/pull/9534
9 | It has not been merged yet and probably never will be, but I am using it myself and it is fairly stable and works for me.
10 | 
11 | ## Just give me a database
12 | 
13 | If you have no programming experience or simply want a preconverted database file, one [can be found here](https://mega.nz/file/9zZlQIKC#ZDPEAQvo_jktEdaDn20AplywxXScJW5yOGB8BMfd1qA).
14 | This database contains the top 60k most popular articles of English Wikipedia as of May 2025, with images included for the top 10k articles. Note that the maximum file size on the FAT32 filesystems of common e-readers is 4 GB, so this is about as much information as you can pack into a single DB. If you have an external SD card with NTFS or EXT4, you could convert the full Wikipedia with images (~100 GiB). In theory it should work, but I have not tested it myself.
15 | 
16 | [Old database](https://mega.nz/file/06AX2DrC#1WYLi9GsF2DV7VplMaMoK7bKGWna2ItIeiW92OekALg). This database contains 114,303 popular articles of English Wikipedia as of September 2022.
17 | 
18 | ## Converting ZIM files yourself
19 | 
20 | **Supported Content Types:**
21 | - ✅ **Wikipedia** (all languages and variants)
22 | - ✅ **iFixit** (repair guides with images)
23 | - ✅ **Wiktionary** (dictionaries)
24 | - ✅ **Any ZIM file** (universal namespace support)
25 | 
26 | Conversion is optimized and includes comprehensive error handling and debug logging.
27 | 
28 | ### How to get ZIM file
29 | 
30 | You can download a dump of Wikipedia's most popular articles from their servers, or use a mirror [like this one](http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/). I recommend using a dump starting with `wikipedia_en_top_nopic`.
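
If you want to check a dump's size before committing to the download (useful given the 4 GB FAT32 limit mentioned above), a minimal sketch with `curl` (the URL is just the example dump used in the next step):

```bash
# Print the download size in bytes without fetching the whole file
curl -sIL http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_top_nopic_2022-09.zim | grep -i content-length
```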
31 | 
32 | On the command line you could, for example, do this:
33 | 
34 | ```bash
35 | wget -O wikipedia.zim http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_top_nopic_2022-09.zim
36 | ```
37 | 
38 | ### Running the CLI
39 | 
40 | ```bash
41 | # First install the dependencies with pip:
42 | pip install -r requirements.txt
43 | 
44 | # Basic conversion (text-only, smallest database):
45 | python3 zim_converter.py --zim-file ./wikipedia.zim --output-db ./zim_articles.db
46 | 
47 | # Include images (larger database):
48 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --include-images
49 | 
50 | # EXTREME e-ink compression (requires ImageMagick):
51 | # 16-level grayscale, dithered, max 600x400, <15KB per image
52 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --compress-images
53 | 
54 | # Enable debug logging for troubleshooting:
55 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --debug
56 | ```
57 | 
58 | **Options:**
59 | - `--include-images`: Include original images (significantly increases size)
60 | - `--compress-images`: EXTREME e-ink optimization - 16-level grayscale, dithered, max 600x400, <15KB per image (requires ImageMagick)
61 | - `--debug`: Enable detailed debug logging to file and console (for troubleshooting)
62 | 
63 | **Install ImageMagick for image compression:**
64 | ```bash
65 | # Ubuntu/Debian:
66 | sudo apt install imagemagick
67 | 
68 | # macOS:
69 | brew install imagemagick
70 | 
71 | # Windows: Download from https://imagemagick.org/
72 | ```
73 | 
74 | Then transfer the `.db` file to your e-reader and set it as the database in the WikiReader plugin.
75 | 
76 | ### Docker
77 | 
78 | You can install the two dependencies manually and just run the Python file with the appropriate arguments. But if preferred,
79 | you can also build and run the Docker image, for example when the ZIM file is called `wikipedia.zim` in the current directory:
80 | 
81 | ```bash
82 | docker build --tag zim-converter .
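# (the -v $(pwd):/project bind mount on the next line exposes the current host
#  directory inside the container, so the ZIM file is readable and the output
#  DB lands back on the host; adjust the host path if your files live elsewhere)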
83 | docker run --rm -it -v $(pwd):/project zim-converter --zim-file /project/wikipedia.zim --output-db /project/zim_articles.db
84 | ```
85 | 
--------------------------------------------------------------------------------
/parse_pageviews.py:
--------------------------------------------------------------------------------
1 | import time
2 | import sqlite3
3 | import subprocess
4 | import argparse
5 | 
6 | 
7 | MIN_MONTHLY_PAGE_COUNT = 500
8 | 
9 | 
10 | def setup_db(con):
11 |     cursor = con.cursor()
12 | 
13 |     cursor.executescript("""
14 |     PRAGMA journal_mode=WAL;
15 | 
16 |     CREATE TABLE IF NOT EXISTS pageviews (
17 |         wiki_domain_id TEXT PRIMARY KEY,
18 |         wiki_id INTEGER NOT NULL,
19 |         domain TEXT NOT NULL,
20 |         view_count INTEGER NOT NULL
21 |     );
22 | 
23 |     CREATE TABLE IF NOT EXISTS title_2_wiki_domain_id (
24 |         wiki_domain_id TEXT NOT NULL,
25 |         title TEXT PRIMARY KEY
26 |     );
27 | 
28 | 
29 |     """)
30 |     con.commit()
31 | 
32 | 
33 | def getbz2_multithreaded(url: str, silent=False):  # needs curl, lbzip2 and pv on PATH
34 |     curl_proc = subprocess.Popen(('curl', '--silent', '-L', url), stdout=subprocess.PIPE)
35 | 
36 |     if silent:
37 |         raw_proc = subprocess.Popen(('lbzip2', '-d'), stdin=curl_proc.stdout, stdout=subprocess.PIPE)
38 |     else:
39 |         network_proc = subprocess.Popen(('pv', '-cN', 'network'), stdin=curl_proc.stdout, stdout=subprocess.PIPE)
40 |         lbzip_proc = subprocess.Popen(('lbzip2', '-d'), stdin=network_proc.stdout, stdout=subprocess.PIPE)
41 |         raw_proc = subprocess.Popen(('pv', '-cN', 'raw'), stdin=lbzip_proc.stdout, stdout=subprocess.PIPE)
42 | 
43 |     return raw_proc.stdout
44 | 
45 | 
46 | def parse_pageviews_xml_url(url: str, con: sqlite3.Connection, pageview_stats: dict, valid_domains: list):
47 |     cursor = con.cursor()
48 |     num_bytes_read = 0
49 |     num_lines_parsed = 0
50 |     start_time = time.time()
51 | 
52 |     input_fh = getbz2_multithreaded(url)
53 |     for line in input_fh:
54 |         # Convert to string if needed
55 |         if type(line) is bytes:
56 |             line = line.decode("utf-8")[:-1]
57 |         num_bytes_read += len(line)
58 |         num_lines_parsed += 1
59 | 
60 |         # Report progress every once in a while
61 |         if num_lines_parsed % 400000 == 0 and False:
62 |             # Currently done by pipeviewer
63 |             time_passed = time.time() - start_time
64 |             print(f"Total speed: {(num_bytes_read / 1048576):.1f} MiB / {time_passed:.0f} seconds")
65 | 
66 |         # Break up the line
67 |         line_parts = line.split(' ')
68 |         if len(line_parts) < 4:
69 |             continue
70 | 
71 |         # Extract relevant page information
72 |         domain = line_parts[0]
73 |         page_name = line_parts[1]
74 |         page_wiki_id = line_parts[2]
75 |         # page_count_str = line_parts[-1]
76 |         page_count = int(line_parts[-2])
77 |         if page_wiki_id == "null":
78 |             continue
79 |         # if domain != "en.wikipedia":
80 |         #     continue
81 |         # if page_name.startswith("Category") or page_name.startswith("Talk"):
82 |         #     continue
83 | 
84 |         if page_count < MIN_MONTHLY_PAGE_COUNT:
85 |             continue
86 |         if domain not in valid_domains:
87 |             continue
88 | 
89 |         # Add new page entries into the stats dict if they do not exist
90 |         if domain not in pageview_stats:
91 |             pageview_stats[domain] = {}
92 |         if page_wiki_id not in pageview_stats[domain]:
93 |             pageview_stats[domain][page_wiki_id] = 0
94 | 
95 |         # Increment the counter for this page with the page_count
96 |         pageview_stats[domain][page_wiki_id] += page_count
97 | 
98 |         total_page_count = pageview_stats[domain][page_wiki_id]
99 |         wiki_domain_id = f'{domain}-{page_wiki_id}'
100 | 
101 |         cursor.execute("INSERT OR REPLACE INTO pageviews VALUES(?, ?, ?, ?)", (
102 |             wiki_domain_id, page_wiki_id, domain, total_page_count
103 |         )
104 |         )
105 | 
106 |         cursor.execute("INSERT OR REPLACE INTO title_2_wiki_domain_id VALUES(?, ?)", (
107 |             wiki_domain_id, page_name
108 |         )
109 |         )
110 |         if num_lines_parsed % 10000 == 0:
111 |             con.commit()
112 |     print('Done with bz2 file')
113 |     con.commit()
114 | 
115 | 
116 | if __name__ == "__main__":
117 |     parser = argparse.ArgumentParser(description='Process Wikipedia pageviews dump.')
118 |     parser.add_argument(
119 |         '--pageview-url', help='URL of pageview dump .bz2',
120 |         default="http://ftp.acc.umu.se/mirror/wikimedia.org/other/pageview_complete/monthly/2022/2022-08/pageviews-202208-user.bz2"
121 |     )
122 |     parser.add_argument(
123 |         '--database-path', help='Path where the database will be stored',
124 |         default="pageviews.db"
125 |     )
126 |     args = parser.parse_args()
127 | 
128 |     con = sqlite3.connect(args.database_path)
129 |     setup_db(con)
130 |     pageview_stats = {}
131 | 
132 |     valid_domains = ["en.wikipedia"]
133 |     parse_pageviews_xml_url(args.pageview_url, con, pageview_stats, valid_domains)
134 | 
135 |     print("done")
136 | 
--------------------------------------------------------------------------------
/zim_converter.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sqlite3
3 | import zstd
4 | import argparse
5 | from libzim import Archive
6 | from multiprocessing import Pool
7 | import re
8 | import base64
9 | import time
10 | import logging
11 | import subprocess
12 | import tempfile
13 | 
14 | # Setup logging (will be configured later based on --debug flag)
15 | logger = logging.getLogger(__name__)
16 | 
17 | img_src_pattern = r'<img[^>]*src=["\']([^"\']+)["\']'
18 | css_link_pattern = r'<link[^>]*href=["\']([^"\']+)["\'][^>]*rel=["\']stylesheet["\'][^>]*>'
19 | MAX_IMAGE_SIZE = 300 * 1024  # 300 KB in bytes
20 | MAX_COMPRESSED_IMAGE_SIZE = 15 * 1024  # 15 KB for e-ink optimized images
21 | 
22 | def compress_image_with_imagemagick(image_data, mime_type):
23 |     """
24 |     EXTREME e-ink optimization using ImageMagick:
25 |     - Convert to 16-level grayscale with Floyd-Steinberg dithering
26 |     - Resize to max 600x400 for smaller files
27 |     - Enhance contrast for better e-ink visibility
28 |     - Very aggressive compression (quality 35)
29 |     - Target: <15KB per image for optimal e-reader performance
30 |     Returns the compressed image data and its MIME subtype, or (None, None) if compression fails
31 |     """
32 |     try:
33 |         # Create temporary files
34 |         with tempfile.NamedTemporaryFile(suffix=f'.{mime_type}', delete=False) as input_file:
35 |             input_file.write(image_data)
36 |             input_path = input_file.name
37 | 
38 |         with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as output_file:
39 |             output_path = output_file.name
40 | 
41 |         # ImageMagick command for EXTREME e-ink optimization:
42 |         # - Convert to pure black/white with dithering for better e-ink rendering
43 |         # - Aggressive resizing for small file sizes
44 |         # - Minimal quality for maximum compression
45 |         cmd = [
46 |             'convert',
47 |             input_path,
48 |             '-colorspace', 'Gray',  # Convert to grayscale first
49 |             '-resize', '600x400>',  # Smaller max size for e-ink
50 |             '-contrast-stretch', '0.15x0.05%',  # Enhance contrast for e-ink
51 |             '-dither', 'FloydSteinberg',  # Add dithering for better e-ink display
52 |             '-colors', '16',  # Reduce to 16 gray levels (e-ink friendly)
53 |             '-quality', '35',  # Very aggressive compression
54 |             '-strip',  # Remove all metadata
55 |             '-interlace', 'none',  # Remove progressive encoding
56 |             '-sampling-factor', '4:2:0',  # Aggressive chroma subsampling
57 |             output_path
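            # (the .jpg suffix on output_path above is what makes ImageMagick
            #  encode the result as JPEG; callers therefore report the
            #  compressed image's MIME subtype as 'jpeg')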
58 |         ]
59 | 
60 |         result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
61 | 
62 |         if result.returncode == 0:
63 |             # Read compressed image
64 |             with open(output_path, 'rb') as f:
65 |                 compressed_data = f.read()
66 | 
67 |             # Clean up temp files
68 |             os.unlink(input_path)
69 |             os.unlink(output_path)
70 | 
71 |             if len(compressed_data) < MAX_COMPRESSED_IMAGE_SIZE:
72 |                 logger.debug(f"Image compressed: {len(image_data)} -> {len(compressed_data)} bytes ({len(compressed_data)/len(image_data)*100:.1f}%)")
73 |                 return compressed_data, 'jpeg'
74 |             else:
75 |                 logger.debug(f"Compressed image still too large ({len(compressed_data)} bytes), skipping")
76 |                 return None, None
77 |         else:
78 |             logger.warning(f"ImageMagick conversion failed: {result.stderr}")
79 |             # Clean up temp files
80 |             try:
81 |                 os.unlink(input_path)
82 |                 os.unlink(output_path)
83 |             except OSError:
84 |                 pass
85 |             return None, None
86 | 
87 |     except subprocess.TimeoutExpired:
88 |         logger.warning("ImageMagick conversion timed out")
89 |         return None, None
90 |     except Exception as e:
91 |         logger.error(f"Image compression failed: {e}")
92 |         return None, None
93 | 
94 | def is_binary_content(path):
95 |     """
96 |     Check if a path points to binary/media content that shouldn't be processed as text
97 |     """
98 |     path_lower = path.lower()
99 | 
100 |     # Image file extensions
101 |     image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.bmp', '.ico', '.tiff']
102 |     # Media file extensions
103 |     media_extensions = ['.mp4', '.webm', '.ogg', '.mp3', '.wav', '.pdf']
104 |     # Font and style extensions
105 |     font_extensions = ['.woff', '.woff2', '.ttf', '.otf', '.eot']
106 |     # Other binary extensions
107 |     other_extensions = ['.zip', '.gz', '.tar', '.exe', '.bin']
108 | 
109 |     # Check file extension
110 |     for ext in image_extensions + media_extensions + font_extensions + other_extensions:
111 |         if path_lower.endswith(ext):
112 |             return True
113 | 
114 |     # Check for known image hosting patterns
115 |     if any(pattern in path_lower for pattern in [
116 |         'images/', 'img/', 'static/', 'assets/', 'media/',
117 |         '.amazonaws.com/', 'upload', 'thumbnail', 'logo'
118 |     ]):
119 |         return True
120 | 
121 |     return False
122 | 
123 | def setup_db(con):
124 |     """Setup a SQLite database in the format expected by WikiReader
125 |     """
126 |     cursor = con.cursor()
127 | 
128 |     cursor.executescript("""
129 |     PRAGMA journal_mode=WAL;
130 | 
131 |     CREATE TABLE IF NOT EXISTS articles (
132 |         id INTEGER PRIMARY KEY,
133 |         title TEXT NOT NULL UNIQUE,
134 |         page_content_zstd BLOB NOT NULL
135 |     );
136 | 
137 |     CREATE TABLE IF NOT EXISTS title_2_id (
138 |         id INTEGER NOT NULL,
139 |         title_lower_case TEXT PRIMARY KEY
140 |     );
141 | 
142 |     DROP TABLE IF EXISTS css;
143 |     CREATE TABLE IF NOT EXISTS css (
144 |         content_zstd BLOB NOT NULL
145 |     );
146 | 
147 |     -- Add indexes for better WikiReader compatibility
148 |     CREATE INDEX IF NOT EXISTS idx_title_lower ON title_2_id(title_lower_case);
149 |     CREATE INDEX IF NOT EXISTS idx_articles_title ON articles(title);
150 | 
151 |     """)
152 |     con.commit()
153 | 
154 | def get_mime_type(path):
155 |     ext = path.split('.')[-1].lower()
156 |     if ext == 'jpg' or ext == 'jpeg':
157 |         return 'jpeg'
158 |     elif ext == 'png':
159 |         return 'png'
160 |     elif ext == 'svg':
161 |         return 'svg+xml'  # correct MIME subtype for SVG data URIs
162 |     elif ext == 'gif':
163 |         return 'gif'
164 |     else:
165 |         return 'jpeg'
166 | 
167 | def get_image_link(link: str, zim: Archive, compress_images=False):
168 |     link = link.replace('../', '')
169 |     try:
170 |         entry = zim.get_entry_by_path(link)
171 |         item = entry.get_item()
172 | 
173 |         original_size = len(item.content)
174 | 
175 |         # Check if we should compress the image
176 |         if compress_images and hasattr(get_image_link, '_imagemagick_available'):
177 |             if original_size > MAX_COMPRESSED_IMAGE_SIZE:  # Only compress if larger than target
178 |                 mime_type = get_mime_type(item.path)
179 |                 compressed_data, new_mime = compress_image_with_imagemagick(item.content.tobytes(), mime_type)
180 | 
181 |                 if compressed_data:
182 |                     # Use compressed image
183 |                     base64_bytes = base64.b64encode(compressed_data)
184 |                     base64_string = base64_bytes.decode('utf-8')
185 |                     size = len(base64_bytes)
186 |                     logger.debug(f"Using compressed image for {link}: {original_size} -> {len(compressed_data)} bytes")
187 |                     return f"data:image/{new_mime};base64,{base64_string}", size
188 |                 else:
189 |                     logger.debug(f"Compression failed for {link}, using original or skipping")
190 | 
191 |         # Use original image logic (with size check)
192 |         if original_size > MAX_IMAGE_SIZE:
193 |             logger.debug(f"Skipped image {link} ({original_size} bytes > max {MAX_IMAGE_SIZE})")
194 |             return None, 0
195 | 
196 |         base64_bytes = base64.b64encode(item.content)
197 |         base64_string = base64_bytes.decode('utf-8')
198 |         mime_type = get_mime_type(item.path)
199 |         size = len(base64_bytes)
200 |         return f"data:image/{mime_type};base64,{base64_string}", size
201 |     except KeyError:
202 |         return None, 0
203 | 
204 | def get_css_content(link: str, zim: Archive):
205 |     link = link.replace('../', '')
206 |     try:
207 |         entry = zim.get_entry_by_path(link)
208 |         item = entry.get_item()
209 |         content = item.content.tobytes().decode('utf-8')
210 |         return content, len(content.encode('utf-8'))
211 |     except Exception as e:
212 |         # print(f"Failed to load CSS {link}: {e}")
213 |         return None, 0
214 | 
215 | def replace_img_and_css_html(html: str, zim: Archive, compress_images=False):
216 |     # Handle image sources
217 |     img_sources = re.findall(img_src_pattern, html)
218 |     for src in img_sources:
219 |         image_link, size = get_image_link(src, zim, compress_images)
220 |         if image_link:
221 |             html = html.replace(src, image_link)
222 |         else:
223 |             pass
224 |             # print('Failed for image', src)
225 | 
226 |     # Handle CSS links (extraction is currently disabled)
227 |     css_links = []  # re.findall(css_link_pattern, html)
228 |     # print(css_links)
229 |     css_links = [l for l in css_links if 'inserted_style' not in l]
230 |     for href in css_links:
231 |         css_content, size = get_css_content(href, zim)
232 |         if css_content:
233 |             style_tag = f"<style>{css_content}</style>"
234 |             html = re.sub(
235 |                 rf'<link[^>]*href=["\']{re.escape(href)}["\'][^>]*>',
236 |                 style_tag,
237 |                 html
238 |             )
239 |         else:
240 |             # print('Failed for CSS', href)
241 |             pass
242 |     return html
243 | 
244 | 
245 | def convert_zim(zim_path, db_path, article_list=None):
246 |     """Convert a ZIM file (or a selected list of its articles) into a separate SQLite database"""
247 |     logger.info(f"Starting ZIM conversion: {zim_path} -> {db_path}")
248 | 
249 |     con = sqlite3.connect(db_path)
250 |     cursor = con.cursor()
251 |     setup_db(con)
252 | 
253 |     try:
254 |         zim = Archive(zim_path)
255 |         logger.info(f"ZIM file loaded successfully. Entry count: {zim.entry_count}")
256 |     except Exception as e:
257 |         logger.error(f"Failed to load ZIM file {zim_path}: {e}")
258 |         raise
259 | 
260 |     # Track processing statistics
261 |     stats = {
262 |         'total_entries': 0,
263 |         'special_files_skipped': 0,
264 |         'redirects_processed': 0,
265 |         'articles_processed': 0,
266 |         'binary_files_skipped': 0,
267 |         'other_entries_skipped': 0,
268 |         'processing_errors': 0
269 |     }
270 | 
271 |     def all_entry_gen():
272 |         logger.info("Using all_entry_gen - processing all entries")
273 |         for entry_id in range(zim.entry_count):
274 |             try:
275 |                 # Try to use public API first, fall back to private if needed
276 |                 try:
277 |                     zim_entry = zim.get_entry_by_id(entry_id)
278 |                 except AttributeError:
279 |                     logger.debug(f"Public get_entry_by_id not available, using private _get_entry_by_id for entry {entry_id}")
280 |                     zim_entry = zim._get_entry_by_id(entry_id)
281 |                 yield zim_entry
282 |             except Exception as e:
283 |                 logger.error(f"Failed to get entry {entry_id}: {e}")
284 |                 stats['processing_errors'] += 1
285 | 
286 |     def selected_entry_gen():
287 |         logger.info(f"Using selected_entry_gen - processing {len(article_list)} specific articles")
288 |         for article_title in article_list:
289 |             try:
290 |                 yield zim.get_entry_by_path('/A/' + article_title)
291 |             except Exception as e:
292 |                 logger.error(f'Failed to get article {article_title}: {e}')
293 |                 stats['processing_errors'] += 1
294 | 
295 |     entry_gen = selected_entry_gen if article_list else all_entry_gen
296 |     num_total = len(article_list) if article_list else zim.entry_count
297 |     num_done = 0
298 |     t0 = time.time()
299 | 
300 |     logger.info(f"Starting processing {num_total} entries")
301 | 
302 |     for zim_entry in entry_gen():
303 |         stats['total_entries'] += 1
304 | 
305 |         # Log first 10 entries to understand structure
306 |         if num_done < 10:
307 |             logger.info(f"Entry {num_done}: path='{zim_entry.path}', title='{zim_entry.title}', is_redirect={zim_entry.is_redirect}")
308 | 
309 |         # Detect special files
310 |         if zim_entry.path.startswith('-'):
311 |             logger.debug(f"Skipping special file: {zim_entry.path}")
312 |             stats['special_files_skipped'] += 1
313 |             continue
314 | 
315 |         # Deal with normal files
316 |         if zim_entry.is_redirect:
317 |             logger.debug(f"Processing redirect: {zim_entry.title} -> {zim_entry.path}")
318 |             try:
319 |                 destination_entry = zim_entry.get_redirect_entry()
320 |                 # Try to use public API for index, fall back to private
321 |                 try:
322 |                     dest_index = destination_entry.index
323 |                 except AttributeError:
324 |                     logger.debug("Using private _index attribute for redirect destination")
325 |                     dest_index = destination_entry._index
326 | 
327 |                 cursor.execute("INSERT OR REPLACE INTO title_2_id VALUES(?, ?)", [
328 |                     dest_index, zim_entry.title.lower()
329 |                 ])
330 |                 stats['redirects_processed'] += 1
331 |             except Exception as e:
332 |                 logger.error(f"Failed to process redirect {zim_entry.title}: {e}")
333 |                 stats['processing_errors'] += 1
334 | 
335 |         elif zim_entry.path.startswith('A/'):  # Wikipedia articles
336 |             logger.debug(f"Processing Wikipedia article: {zim_entry.title}")
337 |             try:
338 |                 process_article_entry(zim_entry, cursor, zim, stats)
339 |             except Exception as e:
340 |                 logger.error(f"Failed to process Wikipedia article {zim_entry.title}: {e}")
341 |                 stats['processing_errors'] += 1
342 | 
343 |         elif len(zim_entry.path) > 2 and '/' in zim_entry.path:  # non-A/ paths only; the A/ case was handled above
344 |             # This might be a non-Wikipedia article (like iFixit)
345 |             namespace = zim_entry.path.split('/')[0]
346 | 
347 |             # Skip known binary/media files
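            # (is_binary_content() looks at extensions and URL patterns only;
            #  file bytes are never sniffed, so an extensionless binary can
            #  still fall through to the text-processing path below)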
348 |             if is_binary_content(zim_entry.path):
349 |                 logger.debug(f"Skipping binary/media file: {zim_entry.path}")
350 |                 stats['binary_files_skipped'] += 1
351 |                 continue
352 | 
353 |             if num_done < 50:  # Log first 50 non-A/ entries to understand structure
354 |                 logger.info(f"Non-Wikipedia entry found: namespace='{namespace}', path='{zim_entry.path}', title='{zim_entry.title}'")
355 | 
356 |             # Try to process as article regardless of namespace
357 |             try:
358 |                 process_article_entry(zim_entry, cursor, zim, stats)
359 |                 logger.debug(f"Successfully processed non-Wikipedia article: {zim_entry.title}")
360 |             except Exception as e:
361 |                 logger.debug(f"Failed to process non-Wikipedia article {zim_entry.title}: {e}")
362 |                 stats['processing_errors'] += 1
363 |         else:
364 |             stats['other_entries_skipped'] += 1
365 |             if num_done < 20:  # Log first 20 skipped entries
366 |                 logger.debug(f"Skipping other entry: path='{zim_entry.path}', title='{zim_entry.title}'")
367 | 
368 |         num_done += 1
369 |         # Commit to db on disk every once in a while
370 |         if num_done % 500 == 0:
371 |             elapsed = time.time() - t0
372 |             logger.info(f'{elapsed:.1f}s Committing batch to db, at entry {num_done} of {num_total}')
373 |             logger.info(f"Stats so far: {stats}")
374 |             con.commit()
375 | 
376 |     elapsed = time.time() - t0
377 |     logger.info(f"Processing completed in {elapsed:.1f}s")
378 |     logger.info(f"Final statistics: {stats}")
379 | 
380 |     # Add a default "Ebook" entry for WikiReader plugin compatibility
381 |     try:
382 |         cursor.execute("SELECT COUNT(*) FROM articles WHERE title = 'Ebook'")
383 |         if cursor.fetchone()[0] == 0:
384 |             default_content = """<html>
385 |             <head>
386 |             <title>Welcome to WikiReader</title>
387 |             </head><body>
388 |             <h1>Welcome to WikiReader</h1>
389 |             <p>This database contains articles converted from a ZIM file.</p>
390 |             <p>Use the search function to find articles.</p>
391 |             <p>Database statistics:</p>
392 |             <ul>
393 |                 <li>Total articles: {articles_processed}</li>
394 |                 <li>Total entries processed: {total_entries}</li>
395 |                 <li>Binary files skipped: {binary_files_skipped}</li>
396 |             </ul>
397 |             </body>
398 |             </html>
399 |             """.format(**stats)
400 | 
401 |             compressed_content = zstd.compress(default_content.encode(), 9, 4)
402 | 
403 |             # Use a high ID that won't conflict
404 |             default_id = 999999
405 |             cursor.execute("INSERT OR REPLACE INTO articles VALUES(?, ?, ?)", [
406 |                 default_id, "Ebook", compressed_content
407 |             ])
408 |             cursor.execute("INSERT OR REPLACE INTO title_2_id VALUES(?, ?)", [
409 |                 default_id, "ebook"
410 |             ])
411 |             logger.info("Added default 'Ebook' page for WikiReader compatibility")
412 |     except Exception as e:
413 |         logger.warning(f"Failed to add default Ebook page: {e}")
414 | 
415 |     con.commit()
416 |     con.close()
417 |     return stats
418 | 
419 | def process_article_entry(zim_entry, cursor, zim, stats):
420 |     """Process a single article entry (works for any namespace)"""
421 |     # First make it findable
422 |     try:
423 |         # Try to use public API for index, fall back to private
424 |         try:
425 |             entry_index = zim_entry.index
426 |         except AttributeError:
427 |             logger.debug(f"Using private _index attribute for entry {zim_entry.title}")
428 |             entry_index = zim_entry._index
429 | 
430 |         cursor.execute("INSERT INTO title_2_id VALUES(?, ?)", [
431 |             entry_index, zim_entry.title.lower()
432 |         ])
433 |     except sqlite3.IntegrityError as e:
434 |         if 'UNIQUE constraint' not in str(e):
435 |             logger.error(f"Unexpected integrity error for {zim_entry.title}: {e}")
436 |             raise e
437 |         cursor.execute("SELECT id FROM title_2_id WHERE title_lower_case = ?", [zim_entry.title.lower()])
438 |         result = cursor.fetchone()
439 |         if result:
440 |             cur_id = result[0]
441 |             if cur_id != entry_index:
442 |                 cursor.execute("UPDATE title_2_id SET id = ? WHERE id = ?", [
443 |                     entry_index, cur_id
444 |                 ])
445 |                 logger.debug(f"Updated duplicate title mapping for {zim_entry.title}")
446 | 
447 |     try:
448 |         # Validate entry has meaningful content
449 |         if not zim_entry.title or len(zim_entry.title.strip()) < 2:
450 |             logger.debug(f"Skipping entry with invalid title: '{zim_entry.title}'")
451 |             return
452 | 
453 |         page_content = bytes(zim_entry.get_item().content).decode()
454 | 
455 |         # Validate content is not empty/too short
456 |         if len(page_content.strip()) < 100:
457 |             logger.debug(f"Skipping entry with minimal content: {zim_entry.title} ({len(page_content)} chars)")
458 |             return
459 | 
460 |         logger.debug(f"Extracted content for {zim_entry.title}: {len(page_content)} characters")
461 | 
462 |         # Process images/CSS only if requested (WARNING: can make DB much larger)
463 |         if hasattr(process_article_entry, '_include_images') and process_article_entry._include_images:
464 |             compress_images = hasattr(process_article_entry, '_compress_images') and process_article_entry._compress_images
465 |             new_page_content = replace_img_and_css_html(page_content, zim, compress_images)
466 |             logger.debug(f"Processed images/CSS for {zim_entry.title} (compression: {compress_images})")
467 |         else:
468 |             new_page_content = page_content
469 | 
470 |         zstd_page_content = zstd.compress(new_page_content.encode(), 9, 4)
471 |         logger.debug(f"Compressed content for {zim_entry.title}: {len(zstd_page_content)} bytes")
472 | 
473 |         cursor.execute("INSERT OR REPLACE INTO articles VALUES(?, ?, ?)", [
474 |             entry_index, zim_entry.title.replace("_", " "), zstd_page_content
475 |         ])
476 |         stats['articles_processed'] += 1
477 | 
478 |     except Exception as e:
479 |         logger.error(f"Failed to process content for {zim_entry.title}: {e}")
480 |         raise
481 | 
482 | 
483 | 
484 | if __name__ == "__main__":
485 |     parser = argparse.ArgumentParser(description='Extract articles from ZIM files to SQLite database.')
486 |     parser.add_argument(
487 |         '--zim-file', help='Path of ZIM file',
488 |         default="./wikipedia.zim"
489 |     )
490 |     parser.add_argument(
491 |         '--output-db', help='Path where the SQLite database will be stored',
492 |         default="./zim_articles.db"
493 |     )
494 |     parser.add_argument(
495 |         '--article-list', help='Path of a newline-separated list of articles to extract from the ZIM, if you do not want to convert all entries',
496 |         default=None, required=False
497 |     )
498 |     parser.add_argument(
499 |         '--include-images', action='store_true',
500 |         help='Include images in the conversion (WARNING: significantly increases database size)'
501 |     )
502 |     parser.add_argument(
503 |         '--compress-images', action='store_true',
504 |         help='EXTREME e-ink optimization: 16-level grayscale, dithered, <15KB per image (requires ImageMagick). Implies --include-images.'
505 |     )
506 |     parser.add_argument(
507 |         '--debug', action='store_true',
508 |         help='Enable debug logging to file and console'
509 |     )
510 |     args = parser.parse_args()
511 | 
512 |     # Configure logging based on debug flag
513 |     if args.debug:
514 |         logging.basicConfig(
515 |             level=logging.DEBUG,
516 |             format='%(asctime)s - %(levelname)s - %(message)s',
517 |             handlers=[
518 |                 logging.FileHandler('zim_converter_debug.log'),
519 |                 logging.StreamHandler()
520 |             ]
521 |         )
522 |         logger.info("Debug logging enabled - detailed output will be saved to zim_converter_debug.log")
523 |     else:
524 |         logging.basicConfig(
525 |             level=logging.INFO,
526 |             format='%(levelname)s: %(message)s',
527 |             handlers=[logging.StreamHandler()]
528 |         )
529 | 
530 |     logger.info("Starting ZIM conversion")
531 |     logger.info(f"ZIM file: {args.zim_file}")
532 |     logger.info(f"Output DB: {args.output_db}")
533 | 
534 |     # Check if ZIM file exists
535 |     if not os.path.exists(args.zim_file):
536 |         logger.error(f"ZIM file not found: {args.zim_file}")
537 |         exit(1)
538 | 
539 |     # NOTE: Removed duplicate database connection setup here
540 |     # convert_zim() handles its own connection to avoid conflicts
541 | 
542 |     if args.article_list:
543 |         logger.info(f"Loading article list from: {args.article_list}")
544 |         articles = open(args.article_list).read().splitlines()
545 |         logger.info(f"Loaded {len(articles)} articles from list")
546 |     else:
547 |         articles = None
548 |         logger.info("Processing all entries in ZIM file")
549 | 
550 |     # Check for ImageMagick if compression is requested
551 |     if args.compress_images:
552 |         try:
553 |             result = subprocess.run(['convert', '-version'], capture_output=True, text=True, timeout=5)
554 |             if result.returncode == 0:
555 |                 get_image_link._imagemagick_available = True
556 |                 logger.info("ImageMagick detected - image compression enabled")
557 |                 args.include_images = True  # Compress images implies include images
558 |             else:
559 |                 logger.error("ImageMagick not found! Please install ImageMagick to use --compress-images")
560 |                 exit(1)
561 |         except Exception as e:
562 |             logger.error(f"Failed to check ImageMagick: {e}")
563 |             logger.error("Please install ImageMagick to use --compress-images")
564 |             exit(1)
565 | 
566 |     # Set image processing flags
567 |     process_article_entry._include_images = args.include_images
568 |     process_article_entry._compress_images = args.compress_images
569 | 
570 |     if args.compress_images:
571 |         logger.warning("EXTREME e-ink image compression enabled!")
572 |         logger.info(f"Images: 16-level grayscale, dithered, max 600x400, <{MAX_COMPRESSED_IMAGE_SIZE} bytes each")
573 |     elif args.include_images:
574 |         logger.warning("Image processing enabled - database size will be significantly larger!")
575 |         logger.info(f"Max image size: {MAX_IMAGE_SIZE} bytes")
576 |     else:
577 |         logger.info("Image processing disabled - text-only conversion for smaller database")
578 | 
579 |     try:
580 |         stats = convert_zim(args.zim_file, args.output_db, articles)
581 |         logger.info("Conversion completed successfully!")
582 |         logger.info(f"Final statistics: {stats}")
583 |     except Exception as e:
584 |         logger.error(f"Conversion failed: {e}")
585 |         raise
586 | 
587 |     print('Done')
588 | 
--------------------------------------------------------------------------------