├── .dockerignore
├── .gitignore
├── requirements.txt
├── Dockerfile
├── README.md
├── parse_pageviews.py
└── zim_converter.py


/.dockerignore:
--------------------------------------------------------------------------------
1 | *db*
2 | *zim
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *db*
2 | *zim
3 | venv
4 | top*articles
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | libzim==2.0.0
2 | zstd==1.5.2.6
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | # syntax=docker/dockerfile:1
2 | 
3 | FROM python:3.8-slim-buster
4 | 
5 | WORKDIR /app
6 | 
7 | COPY requirements.txt requirements.txt
8 | RUN pip3 install -r requirements.txt
9 | 
10 | COPY . .
11 | 
12 | ENTRYPOINT [ "python3", "/app/zim_converter.py" ]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ZIM-converter
2 | 
3 | This Python project converts ZIM files (Kiwix format) to optimized SQLite databases for the WikiReader plugin in KOReader.
4 | 
5 | **Now supports both Wikipedia AND non-Wikipedia content** (iFixit, Wiktionary, etc.) with enhanced error handling and image processing options.
6 | ## WikiReader
7 | 
8 | I created this plugin for KOReader during some time off: https://github.com/koreader/koreader/pull/9534
9 | It has not been merged yet and probably never will be, but I am using it myself and it is fairly stable and works for me.
10 | 
11 | ## Just give me a database
12 | 
13 | If you have no programming experience or simply want a preconverted database file, one [can be found here](https://mega.nz/file/9zZlQIKC#ZDPEAQvo_jktEdaDn20AplywxXScJW5yOGB8BMfd1qA).
14 | This database contains the top 60k most popular articles of English Wikipedia as of May 2025, with images included for the top 10k articles. Note that the maximum file size on the FAT32 filesystems of common e-readers is 4 GB, so this is about as much information as you can pack into a single DB. If you have an external SD card with NTFS or EXT4, you could convert the full Wikipedia with images (~100 GiB). In theory it should work, but I have not tested it myself.
15 | 
16 | [Old database](https://mega.nz/file/06AX2DrC#1WYLi9GsF2DV7VplMaMoK7bKGWna2ItIeiW92OekALg). This database contains 114,303 popular articles of English Wikipedia as of September 2022.
17 | 
18 | ## Converting ZIM files yourself
19 | 
20 | **Supported Content Types:**
21 | - ✅ **Wikipedia** (all languages and variants)
22 | - ✅ **iFixit** (repair guides with images)
23 | - ✅ **Wiktionary** (dictionaries)
24 | - ✅ **Any ZIM file** (universal namespace support)
25 | 
26 | Conversion is optimized and includes comprehensive error handling and debug logging.
27 | 
28 | ### How to get ZIM file
29 | 
30 | You can download a dump of Wikipedia's most popular articles from their servers, or use a mirror [like this one](http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/). I recommend using a dump starting with `wikipedia_en_top_nopic`.
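
If you want to check a dump's size before committing to the download (useful given the 4 GB FAT32 limit mentioned above), a minimal sketch with `curl` (the URL is just the example dump used in the next step):

```bash
# Print the download size in bytes without fetching the whole file
curl -sIL http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_top_nopic_2022-09.zim | grep -i content-length
```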
31 | 
32 | On the command line you could, for example, do this:
33 | 
34 | ```bash
35 | wget -O wikipedia.zim http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_top_nopic_2022-09.zim
36 | ```
37 | 
38 | ### Running the CLI
39 | 
40 | ```bash
41 | # First install the dependencies with pip:
42 | pip install -r requirements.txt
43 | 
44 | # Basic conversion (text-only, smallest database):
45 | python3 zim_converter.py --zim-file ./wikipedia.zim --output-db ./zim_articles.db
46 | 
47 | # Include images (larger database):
48 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --include-images
49 | 
50 | # EXTREME e-ink compression (requires ImageMagick):
51 | # 16-level grayscale, dithered, max 600x400, <15KB per image
52 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --compress-images
53 | 
54 | # Enable debug logging for troubleshooting:
55 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --debug
56 | ```
57 | 
58 | **Options:**
59 | - `--include-images`: Include original images (significantly increases size)
60 | - `--compress-images`: EXTREME e-ink optimization - 16-level grayscale, dithered, max 600x400, <15KB per image (requires ImageMagick)
61 | - `--debug`: Enable detailed debug logging to file and console (for troubleshooting)
62 | 
63 | **Install ImageMagick for image compression:**
64 | ```bash
65 | # Ubuntu/Debian:
66 | sudo apt install imagemagick
67 | 
68 | # macOS:
69 | brew install imagemagick
70 | 
71 | # Windows: Download from https://imagemagick.org/
72 | ```
73 | 
74 | Then transfer the `.db` file to your e-reader and set it as the database in the WikiReader plugin.
75 | 
76 | ### Docker
77 | 
78 | You can install the two dependencies manually and just run the Python file with the appropriate arguments. But if preferred,
79 | you can also build and run the Docker image, for example when the ZIM file is called `wikipedia.zim` in the current directory:
80 | 
81 | ```bash
82 | docker build --tag zim-converter .
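# (the -v $(pwd):/project bind mount on the next line exposes the current host
#  directory inside the container, so the ZIM file is readable and the output
#  DB lands back on the host; adjust the host path if your files live elsewhere)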
83 | docker run --rm -it -v $(pwd):/project zim-converter --zim-file /project/wikipedia.zim --output-db /project/zim_articles.db
84 | ```
85 | 
--------------------------------------------------------------------------------
/parse_pageviews.py:
--------------------------------------------------------------------------------
1 | import time
2 | import sqlite3
3 | import subprocess
4 | import argparse
5 | 
6 | 
7 | MIN_MONTHLY_PAGE_COUNT = 500
8 | 
9 | 
10 | def setup_db(con):
11 |     cursor = con.cursor()
12 | 
13 |     cursor.executescript("""
14 |     PRAGMA journal_mode=WAL;
15 | 
16 |     CREATE TABLE IF NOT EXISTS pageviews (
17 |         wiki_domain_id TEXT PRIMARY KEY,
18 |         wiki_id INTEGER NOT NULL,
19 |         domain TEXT NOT NULL,
20 |         view_count INTEGER NOT NULL
21 |     );
22 | 
23 |     CREATE TABLE IF NOT EXISTS title_2_wiki_domain_id (
24 |         wiki_domain_id TEXT NOT NULL,
25 |         title TEXT PRIMARY KEY
26 |     );
27 | 
28 | 
29 |     """)
30 |     con.commit()
31 | 
32 | 
33 | def getbz2_multithreaded(url: str, silent=False):  # needs curl, lbzip2 and pv on PATH
34 |     curl_proc = subprocess.Popen(('curl', '--silent', '-L', url), stdout=subprocess.PIPE)
35 | 
36 |     if silent:
37 |         raw_proc = subprocess.Popen(('lbzip2', '-d'), stdin=curl_proc.stdout, stdout=subprocess.PIPE)
38 |     else:
39 |         network_proc = subprocess.Popen(('pv', '-cN', 'network'), stdin=curl_proc.stdout, stdout=subprocess.PIPE)
40 |         lbzip_proc = subprocess.Popen(('lbzip2', '-d'), stdin=network_proc.stdout, stdout=subprocess.PIPE)
41 |         raw_proc = subprocess.Popen(('pv', '-cN', 'raw'), stdin=lbzip_proc.stdout, stdout=subprocess.PIPE)
42 | 
43 |     return raw_proc.stdout
44 | 
45 | 
46 | def parse_pageviews_xml_url(url: str, con: sqlite3.Connection, pageview_stats: dict, valid_domains: list):
47 |     cursor = con.cursor()
48 |     num_bytes_read = 0
49 |     num_lines_parsed = 0
50 |     start_time = time.time()
51 | 
52 |     input_fh = getbz2_multithreaded(url)
53 |     for line in input_fh:
54 |         # Convert to string if needed
55 |         if type(line) is bytes:
56 |             line = line.decode("utf-8")[:-1]
57 |         num_bytes_read += len(line)
58 |         num_lines_parsed += 1
59 | 
60 |         # Report progress every once in a while
61 |         if num_lines_parsed % 400000 == 0 and False:
62 |             # Currently done by pipeviewer
63 |             time_passed = time.time() - start_time
64 |             print(f"Total speed: {(num_bytes_read / 1048576):.1f} MiB / {time_passed:.0f} seconds")
65 | 
66 |         # Break up the line
67 |         line_parts = line.split(' ')
68 |         if len(line_parts) < 4:
69 |             continue
70 | 
71 |         # Extract relevant page information
72 |         domain = line_parts[0]
73 |         page_name = line_parts[1]
74 |         page_wiki_id = line_parts[2]
75 |         # page_count_str = line_parts[-1]
76 |         page_count = int(line_parts[-2])
77 |         if page_wiki_id == "null":
78 |             continue
79 |         # if domain != "en.wikipedia":
80 |         #     continue
81 |         # if page_name.startswith("Category") or page_name.startswith("Talk"):
82 |         #     continue
83 | 
84 |         if page_count < MIN_MONTHLY_PAGE_COUNT:
85 |             continue
86 |         if domain not in valid_domains:
87 |             continue
88 | 
89 |         # Add new page entries into the stats dict if they do not exist
90 |         if domain not in pageview_stats:
91 |             pageview_stats[domain] = {}
92 |         if page_wiki_id not in pageview_stats[domain]:
93 |             pageview_stats[domain][page_wiki_id] = 0
94 | 
95 |         # Increment the counter for this page with the page_count
96 |         pageview_stats[domain][page_wiki_id] += page_count
97 | 
98 |         total_page_count = pageview_stats[domain][page_wiki_id]
99 |         wiki_domain_id = f'{domain}-{page_wiki_id}'
100 | 
101 |         cursor.execute("INSERT OR REPLACE INTO pageviews VALUES(?, ?, ?, ?)", (
102 |             wiki_domain_id, page_wiki_id, domain, total_page_count
103 |         )
104 |         )
105 | 
106 |         cursor.execute("INSERT OR REPLACE INTO title_2_wiki_domain_id VALUES(?, ?)", (
107 |             wiki_domain_id, page_name
108 |         )
109 |         )
110 |         if num_lines_parsed % 10000 == 0:
111 |             con.commit()
112 |     print('Done with bz2 file')
113 |     con.commit()
114 | 
115 | 
116 | if __name__ == "__main__":
117 |     parser = argparse.ArgumentParser(description='Process Wikipedia pageviews dump.')
118 |     parser.add_argument(
119 |         '--pageview-url', help='URL of pageview dump .bz2',
120 |         default="http://ftp.acc.umu.se/mirror/wikimedia.org/other/pageview_complete/monthly/2022/2022-08/pageviews-202208-user.bz2"
121 |     )
122 |     parser.add_argument(
123 |         '--database-path', help='Path where the database will be stored',
124 |         default="pageviews.db"
125 |     )
126 |     args = parser.parse_args()
127 | 
128 |     con = sqlite3.connect(args.database_path)
129 |     setup_db(con)
130 |     pageview_stats = {}
131 | 
132 |     valid_domains = ["en.wikipedia"]
133 |     parse_pageviews_xml_url(args.pageview_url, con, pageview_stats, valid_domains)
134 | 
135 |     print("done")
136 | 
--------------------------------------------------------------------------------
/zim_converter.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sqlite3
3 | import zstd
4 | import argparse
5 | from libzim import Archive
6 | from multiprocessing import Pool
7 | import re
8 | import base64
9 | import time
10 | import logging
11 | import subprocess
12 | import tempfile
13 | 
14 | # Setup logging (will be configured later based on --debug flag)
15 | logger = logging.getLogger(__name__)
16 | 
17 | img_src_pattern = r'<img[^>]*src=["\']([^"\']+)["\']'
18 | css_link_pattern = r'<link[^>]*href=["\']([^"\']+)["\'][^>]*rel=["\']stylesheet["\'][^>]*>'
19 | MAX_IMAGE_SIZE = 300 * 1024  # 300 KB in bytes
20 | MAX_COMPRESSED_IMAGE_SIZE = 15 * 1024  # 15 KB for e-ink optimized images
21 | 
22 | def compress_image_with_imagemagick(image_data, mime_type):
23 |     """
24 |     EXTREME e-ink optimization using ImageMagick:
25 |     - Convert to 16-level grayscale with Floyd-Steinberg dithering
26 |     - Resize to max 600x400 for smaller files
27 |     - Enhance contrast for better e-ink visibility
28 |     - Very aggressive compression (quality 35)
29 |     - Target: <15KB per image for optimal e-reader performance
30 |     Returns the compressed image data and its MIME subtype, or (None, None) if compression fails
31 |     """
32 |     try:
33 |         # Create temporary files
34 |         with tempfile.NamedTemporaryFile(suffix=f'.{mime_type}', delete=False) as input_file:
35 |             input_file.write(image_data)
36 |             input_path = input_file.name
37 | 
38 |         with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as output_file:
39 |             output_path = output_file.name
40 | 
41 |         # ImageMagick command for EXTREME e-ink optimization:
42 |         # - Convert to pure black/white with dithering for better e-ink rendering
43 |         # - Aggressive resizing for small file sizes
44 |         # - Minimal quality for maximum compression
45 |         cmd = [
46 |             'convert',
47 |             input_path,
48 |             '-colorspace', 'Gray',  # Convert to grayscale first
49 |             '-resize', '600x400>',  # Smaller max size for e-ink
50 |             '-contrast-stretch', '0.15x0.05%',  # Enhance contrast for e-ink
51 |             '-dither', 'FloydSteinberg',  # Add dithering for better e-ink display
52 |             '-colors', '16',  # Reduce to 16 gray levels (e-ink friendly)
53 |             '-quality', '35',  # Very aggressive compression
54 |             '-strip',  # Remove all metadata
55 |             '-interlace', 'none',  # Remove progressive encoding
56 |             '-sampling-factor', '4:2:0',  # Aggressive chroma subsampling
57 |             output_path
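            # (the .jpg suffix on output_path above is what makes ImageMagick
            #  encode the result as JPEG; callers therefore report the
            #  compressed image's MIME subtype as 'jpeg')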
58 |         ]
59 | 
60 |         result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
61 | 
62 |         if result.returncode == 0:
63 |             # Read compressed image
64 |             with open(output_path, 'rb') as f:
65 |                 compressed_data = f.read()
66 | 
67 |             # Clean up temp files
68 |             os.unlink(input_path)
69 |             os.unlink(output_path)
70 | 
71 |             if len(compressed_data) < MAX_COMPRESSED_IMAGE_SIZE:
72 |                 logger.debug(f"Image compressed: {len(image_data)} -> {len(compressed_data)} bytes ({len(compressed_data)/len(image_data)*100:.1f}%)")
73 |                 return compressed_data, 'jpeg'
74 |             else:
75 |                 logger.debug(f"Compressed image still too large ({len(compressed_data)} bytes), skipping")
76 |                 return None, None
77 |         else:
78 |             logger.warning(f"ImageMagick conversion failed: {result.stderr}")
79 |             # Clean up temp files
80 |             try:
81 |                 os.unlink(input_path)
82 |                 os.unlink(output_path)
83 |             except OSError:
84 |                 pass
85 |             return None, None
86 | 
87 |     except subprocess.TimeoutExpired:
88 |         logger.warning("ImageMagick conversion timed out")
89 |         return None, None
90 |     except Exception as e:
91 |         logger.error(f"Image compression failed: {e}")
92 |         return None, None
93 | 
94 | def is_binary_content(path):
95 |     """
96 |     Check if a path points to binary/media content that shouldn't be processed as text
97 |     """
98 |     path_lower = path.lower()
99 | 
100 |     # Image file extensions
101 |     image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.bmp', '.ico', '.tiff']
102 |     # Media file extensions
103 |     media_extensions = ['.mp4', '.webm', '.ogg', '.mp3', '.wav', '.pdf']
104 |     # Font and style extensions
105 |     font_extensions = ['.woff', '.woff2', '.ttf', '.otf', '.eot']
106 |     # Other binary extensions
107 |     other_extensions = ['.zip', '.gz', '.tar', '.exe', '.bin']
108 | 
109 |     # Check file extension
110 |     for ext in image_extensions + media_extensions + font_extensions + other_extensions:
111 |         if path_lower.endswith(ext):
112 |             return True
113 | 
114 |     # Check for known image hosting patterns
115 |     if any(pattern in path_lower for pattern in [
116 |         'images/', 'img/', 'static/', 'assets/', 'media/',
117 |         '.amazonaws.com/', 'upload', 'thumbnail', 'logo'
118 |     ]):
119 |         return True
120 | 
121 |     return False
122 | 
123 | def setup_db(con):
124 |     """Setup a SQLite database in the format expected by WikiReader
125 |     """
126 |     cursor = con.cursor()
127 | 
128 |     cursor.executescript("""
129 |     PRAGMA journal_mode=WAL;
130 | 
131 |     CREATE TABLE IF NOT EXISTS articles (
132 |         id INTEGER PRIMARY KEY,
133 |         title TEXT NOT NULL UNIQUE,
134 |         page_content_zstd BLOB NOT NULL
135 |     );
136 | 
137 |     CREATE TABLE IF NOT EXISTS title_2_id (
138 |         id INTEGER NOT NULL,
139 |         title_lower_case TEXT PRIMARY KEY
140 |     );
141 | 
142 |     DROP TABLE IF EXISTS css;
143 |     CREATE TABLE IF NOT EXISTS css (
144 |         content_zstd BLOB NOT NULL
145 |     );
146 | 
147 |     -- Add indexes for better WikiReader compatibility
148 |     CREATE INDEX IF NOT EXISTS idx_title_lower ON title_2_id(title_lower_case);
149 |     CREATE INDEX IF NOT EXISTS idx_articles_title ON articles(title);
150 | 
151 |     """)
152 |     con.commit()
153 | 
154 | def get_mime_type(path):
155 |     ext = path.split('.')[-1].lower()
156 |     if ext == 'jpg' or ext == 'jpeg':
157 |         return 'jpeg'
158 |     elif ext == 'png':
159 |         return 'png'
160 |     elif ext == 'svg':
161 |         return 'svg+xml'  # correct MIME subtype for SVG data URIs
162 |     elif ext == 'gif':
163 |         return 'gif'
164 |     else:
165 |         return 'jpeg'
166 | 
167 | def get_image_link(link: str, zim: Archive, compress_images=False):
168 |     link = link.replace('../', '')
169 |     try:
170 |         entry = zim.get_entry_by_path(link)
171 |         item = entry.get_item()
172 | 
173 |         original_size = len(item.content)
174 | 
175 |         # Check if we should compress the image
176 |         if compress_images and hasattr(get_image_link, '_imagemagick_available'):
177 |             if original_size > MAX_COMPRESSED_IMAGE_SIZE:  # Only compress if larger than target
178 |                 mime_type = get_mime_type(item.path)
179 |                 compressed_data, new_mime = compress_image_with_imagemagick(item.content.tobytes(), mime_type)
180 | 
181 |                 if compressed_data:
182 |                     # Use compressed image
183 |                     base64_bytes = base64.b64encode(compressed_data)
184 |                     base64_string = base64_bytes.decode('utf-8')
185 |                     size = len(base64_bytes)
186 |                     logger.debug(f"Using compressed image for {link}: {original_size} -> {len(compressed_data)} bytes")
187 |                     return f"data:image/{new_mime};base64,{base64_string}", size
188 |                 else:
189 |                     logger.debug(f"Compression failed for {link}, using original or skipping")
190 | 
191 |         # Use original image logic (with size check)
192 |         if original_size > MAX_IMAGE_SIZE:
193 |             logger.debug(f"Skipped image {link} ({original_size} bytes > max {MAX_IMAGE_SIZE})")
194 |             return None, 0
195 | 
196 |         base64_bytes = base64.b64encode(item.content)
197 |         base64_string = base64_bytes.decode('utf-8')
198 |         mime_type = get_mime_type(item.path)
199 |         size = len(base64_bytes)
200 |         return f"data:image/{mime_type};base64,{base64_string}", size
201 |     except KeyError:
202 |         return None, 0
203 | 
204 | def get_css_content(link: str, zim: Archive):
205 |     link = link.replace('../', '')
206 |     try:
207 |         entry = zim.get_entry_by_path(link)
208 |         item = entry.get_item()
209 |         content = item.content.tobytes().decode('utf-8')
210 |         return content, len(content.encode('utf-8'))
211 |     except Exception as e:
212 |         # print(f"Failed to load CSS {link}: {e}")
213 |         return None, 0
214 | 
215 | def replace_img_and_css_html(html: str, zim: Archive, compress_images=False):
216 |     # Handle image sources
217 |     img_sources = re.findall(img_src_pattern, html)
218 |     for src in img_sources:
219 |         image_link, size = get_image_link(src, zim, compress_images)
220 |         if image_link:
221 |             html = html.replace(src, image_link)
222 |         else:
223 |             pass
224 |             # print('Failed for image', src)
225 | 
226 |     # Handle CSS links (extraction is currently disabled)
227 |     css_links = []  # re.findall(css_link_pattern, html)
228 |     # print(css_links)
229 |     css_links = [l for l in css_links if 'inserted_style' not in l]
230 |     for href in css_links:
231 |         css_content, size = get_css_content(href, zim)
232 |         if css_content:
233 |             style_tag = f"<style>{css_content}</style>"
234 |             html = re.sub(
235 |                 rf'<link[^>]*href=["\']{re.escape(href)}["\'][^>]*>',
236 |                 style_tag,
237 |                 html
238 |             )
239 |         else:
240 |             # print('Failed for CSS', href)
241 |             pass
242 |     return html
243 | 
244 | 
245 | def convert_zim(zim_path, db_path, article_list=None):
246 |     """Convert a ZIM file (or a selected list of its articles) into a separate SQLite database"""
247 |     logger.info(f"Starting ZIM conversion: {zim_path} -> {db_path}")
248 | 
249 |     con = sqlite3.connect(db_path)
250 |     cursor = con.cursor()
251 |     setup_db(con)
252 | 
253 |     try:
254 |         zim = Archive(zim_path)
255 |         logger.info(f"ZIM file loaded successfully. Entry count: {zim.entry_count}")
256 |     except Exception as e:
257 |         logger.error(f"Failed to load ZIM file {zim_path}: {e}")
258 |         raise
259 | 
260 |     # Track processing statistics
261 |     stats = {
262 |         'total_entries': 0,
263 |         'special_files_skipped': 0,
264 |         'redirects_processed': 0,
265 |         'articles_processed': 0,
266 |         'binary_files_skipped': 0,
267 |         'other_entries_skipped': 0,
268 |         'processing_errors': 0
269 |     }
270 | 
271 |     def all_entry_gen():
272 |         logger.info("Using all_entry_gen - processing all entries")
273 |         for entry_id in range(zim.entry_count):
274 |             try:
275 |                 # Try to use public API first, fall back to private if needed
276 |                 try:
277 |                     zim_entry = zim.get_entry_by_id(entry_id)
278 |                 except AttributeError:
279 |                     logger.debug(f"Public get_entry_by_id not available, using private _get_entry_by_id for entry {entry_id}")
280 |                     zim_entry = zim._get_entry_by_id(entry_id)
281 |                 yield zim_entry
282 |             except Exception as e:
283 |                 logger.error(f"Failed to get entry {entry_id}: {e}")
284 |                 stats['processing_errors'] += 1
285 | 
286 |     def selected_entry_gen():
287 |         logger.info(f"Using selected_entry_gen - processing {len(article_list)} specific articles")
288 |         for article_title in article_list:
289 |             try:
290 |                 yield zim.get_entry_by_path('/A/' + article_title)
291 |             except Exception as e:
292 |                 logger.error(f'Failed to get article {article_title}: {e}')
293 |                 stats['processing_errors'] += 1
294 | 
295 |     entry_gen = selected_entry_gen if article_list else all_entry_gen
296 |     num_total = len(article_list) if article_list else zim.entry_count
297 |     num_done = 0
298 |     t0 = time.time()
299 | 
300 |     logger.info(f"Starting processing {num_total} entries")
301 | 
302 |     for zim_entry in entry_gen():
303 |         stats['total_entries'] += 1
304 | 
305 |         # Log first 10 entries to understand structure
306 |         if num_done < 10:
307 |             logger.info(f"Entry {num_done}: path='{zim_entry.path}', title='{zim_entry.title}', is_redirect={zim_entry.is_redirect}")
308 | 
309 |         # Detect special files
310 |         if zim_entry.path.startswith('-'):
311 |             logger.debug(f"Skipping special file: {zim_entry.path}")
312 |             stats['special_files_skipped'] += 1
313 |             continue
314 | 
315 |         # Deal with normal files
316 |         if zim_entry.is_redirect:
317 |             logger.debug(f"Processing redirect: {zim_entry.title} -> {zim_entry.path}")
318 |             try:
319 |                 destination_entry = zim_entry.get_redirect_entry()
320 |                 # Try to use public API for index, fall back to private
321 |                 try:
322 |                     dest_index = destination_entry.index
323 |                 except AttributeError:
324 |                     logger.debug("Using private _index attribute for redirect destination")
325 |                     dest_index = destination_entry._index
326 | 
327 |                 cursor.execute("INSERT OR REPLACE INTO title_2_id VALUES(?, ?)", [
328 |                     dest_index, zim_entry.title.lower()
329 |                 ])
330 |                 stats['redirects_processed'] += 1
331 |             except Exception as e:
332 |                 logger.error(f"Failed to process redirect {zim_entry.title}: {e}")
333 |                 stats['processing_errors'] += 1
334 | 
335 |         elif zim_entry.path.startswith('A/'):  # Wikipedia articles
336 |             logger.debug(f"Processing Wikipedia article: {zim_entry.title}")
337 |             try:
338 |                 process_article_entry(zim_entry, cursor, zim, stats)
339 |             except Exception as e:
340 |                 logger.error(f"Failed to process Wikipedia article {zim_entry.title}: {e}")
341 |                 stats['processing_errors'] += 1
342 | 
343 |         elif len(zim_entry.path) > 2 and '/' in zim_entry.path:  # non-A/ paths only; the A/ case was handled above
344 |             # This might be a non-Wikipedia article (like iFixit)
345 |             namespace = zim_entry.path.split('/')[0]
346 | 
347 |             # Skip known binary/media files
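            # (is_binary_content() looks at extensions and URL patterns only;
            #  file bytes are never sniffed, so an extensionless binary can
            #  still fall through to the text-processing path below)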
348 |             if is_binary_content(zim_entry.path):
349 |                 logger.debug(f"Skipping binary/media file: {zim_entry.path}")
350 |                 stats['binary_files_skipped'] += 1
351 |                 continue
352 | 
353 |             if num_done < 50:  # Log first 50 non-A/ entries to understand structure
354 |                 logger.info(f"Non-Wikipedia entry found: namespace='{namespace}', path='{zim_entry.path}', title='{zim_entry.title}'")
355 | 
356 |             # Try to process as article regardless of namespace
357 |             try:
358 |                 process_article_entry(zim_entry, cursor, zim, stats)
359 |                 logger.debug(f"Successfully processed non-Wikipedia article: {zim_entry.title}")
360 |             except Exception as e:
361 |                 logger.debug(f"Failed to process non-Wikipedia article {zim_entry.title}: {e}")
362 |                 stats['processing_errors'] += 1
363 |         else:
364 |             stats['other_entries_skipped'] += 1
365 |             if num_done < 20:  # Log first 20 skipped entries
366 |                 logger.debug(f"Skipping other entry: path='{zim_entry.path}', title='{zim_entry.title}'")
367 | 
368 |         num_done += 1
369 |         # Commit to db on disk every once in a while
370 |         if num_done % 500 == 0:
371 |             elapsed = time.time() - t0
372 |             logger.info(f'{elapsed:.1f}s Committing batch to db, at entry {num_done} of {num_total}')
373 |             logger.info(f"Stats so far: {stats}")
374 |             con.commit()
375 | 
376 |     elapsed = time.time() - t0
377 |     logger.info(f"Processing completed in {elapsed:.1f}s")
378 |     logger.info(f"Final statistics: {stats}")
379 | 
380 |     # Add a default "Ebook" entry for WikiReader plugin compatibility
381 |     try:
382 |         cursor.execute("SELECT COUNT(*) FROM articles WHERE title = 'Ebook'")
383 |         if cursor.fetchone()[0] == 0:
384 |             default_content = """<html>
385 |             <head>
386 |             <title>Welcome to WikiReader</title>
387 |             </head><body>
388 |             <h1>Welcome to WikiReader</h1>
389 |             <p>This database contains articles converted from a ZIM file.</p>
390 |             <p>Use the search function to find articles.</p>
391 |             <p>Database statistics:</p>
392 |             <ul>
393 |                 <li>Total articles: {articles_processed}</li>
394 |                 <li>Total entries processed: {total_entries}</li>
395 |                 <li>Binary files skipped: {binary_files_skipped}</li>
396 |             </ul>
397 |             </body>
398 |             </html>
399 |             """.format(**stats)
400 | 
401 |             compressed_content = zstd.compress(default_content.encode(), 9, 4)
402 | 
403 |             # Use a high ID that won't conflict
404 |             default_id = 999999
405 |             cursor.execute("INSERT OR REPLACE INTO articles VALUES(?, ?, ?)", [
406 |                 default_id, "Ebook", compressed_content
407 |             ])
408 |             cursor.execute("INSERT OR REPLACE INTO title_2_id VALUES(?, ?)", [
409 |                 default_id, "ebook"
410 |             ])
411 |             logger.info("Added default 'Ebook' page for WikiReader compatibility")
412 |     except Exception as e:
413 |         logger.warning(f"Failed to add default Ebook page: {e}")
414 | 
415 |     con.commit()
416 |     con.close()
417 |     return stats
418 | 
419 | def process_article_entry(zim_entry, cursor, zim, stats):
420 |     """Process a single article entry (works for any namespace)"""
421 |     # First make it findable
422 |     try:
423 |         # Try to use public API for index, fall back to private
424 |         try:
425 |             entry_index = zim_entry.index
426 |         except AttributeError:
427 |             logger.debug(f"Using private _index attribute for entry {zim_entry.title}")
428 |             entry_index = zim_entry._index
429 | 
430 |         cursor.execute("INSERT INTO title_2_id VALUES(?, ?)", [
431 |             entry_index, zim_entry.title.lower()
432 |         ])
433 |     except sqlite3.IntegrityError as e:
434 |         if 'UNIQUE constraint' not in str(e):
435 |             logger.error(f"Unexpected integrity error for {zim_entry.title}: {e}")
436 |             raise e
437 |         cursor.execute("SELECT id FROM title_2_id WHERE title_lower_case = ?", [zim_entry.title.lower()])
438 |         result = cursor.fetchone()
439 |         if result:
440 |             cur_id = result[0]
441 |             if cur_id != entry_index:
442 |                 cursor.execute("UPDATE title_2_id SET id = ? WHERE id = ?", [
443 |                     entry_index, cur_id
444 |                 ])
445 |                 logger.debug(f"Updated duplicate title mapping for {zim_entry.title}")
446 | 
447 |     try:
448 |         # Validate entry has meaningful content
449 |         if not zim_entry.title or len(zim_entry.title.strip()) < 2:
450 |             logger.debug(f"Skipping entry with invalid title: '{zim_entry.title}'")
451 |             return
452 | 
453 |         page_content = bytes(zim_entry.get_item().content).decode()
454 | 
455 |         # Validate content is not empty/too short
456 |         if len(page_content.strip()) < 100:
457 |             logger.debug(f"Skipping entry with minimal content: {zim_entry.title} ({len(page_content)} chars)")
458 |             return
459 | 
460 |         logger.debug(f"Extracted content for {zim_entry.title}: {len(page_content)} characters")
461 | 
462 |         # Process images/CSS only if requested (WARNING: can make DB much larger)
463 |         if hasattr(process_article_entry, '_include_images') and process_article_entry._include_images:
464 |             compress_images = hasattr(process_article_entry, '_compress_images') and process_article_entry._compress_images
465 |             new_page_content = replace_img_and_css_html(page_content, zim, compress_images)
466 |             logger.debug(f"Processed images/CSS for {zim_entry.title} (compression: {compress_images})")
467 |         else:
468 |             new_page_content = page_content
469 | 
470 |         zstd_page_content = zstd.compress(new_page_content.encode(), 9, 4)
471 |         logger.debug(f"Compressed content for {zim_entry.title}: {len(zstd_page_content)} bytes")
472 | 
473 |         cursor.execute("INSERT OR REPLACE INTO articles VALUES(?, ?, ?)", [
474 |             entry_index, zim_entry.title.replace("_", " "), zstd_page_content
475 |         ])
476 |         stats['articles_processed'] += 1
477 | 
478 |     except Exception as e:
479 |         logger.error(f"Failed to process content for {zim_entry.title}: {e}")
480 |         raise
481 | 
482 | 
483 | 
484 | if __name__ == "__main__":
485 |     parser = argparse.ArgumentParser(description='Extract articles from ZIM files to SQLite database.')
486 |     parser.add_argument(
487 |         '--zim-file', help='Path of ZIM file',
488 |         default="./wikipedia.zim"
489 |     )
490 |     parser.add_argument(
491 |         '--output-db', help='Path where the SQLite database will be stored',
492 |         default="./zim_articles.db"
493 |     )
494 |     parser.add_argument(
495 |         '--article-list', help='Path of a newline-separated list of articles to extract from the ZIM, if you do not want to convert all entries',
496 |         default=None, required=False
497 |     )
498 |     parser.add_argument(
499 |         '--include-images', action='store_true',
500 |         help='Include images in the conversion (WARNING: significantly increases database size)'
501 |     )
502 |     parser.add_argument(
503 |         '--compress-images', action='store_true',
504 |         help='EXTREME e-ink optimization: 16-level grayscale, dithered, <15KB per image (requires ImageMagick). Implies --include-images.'
505 |     )
506 |     parser.add_argument(
507 |         '--debug', action='store_true',
508 |         help='Enable debug logging to file and console'
509 |     )
510 |     args = parser.parse_args()
511 | 
512 |     # Configure logging based on debug flag
513 |     if args.debug:
514 |         logging.basicConfig(
515 |             level=logging.DEBUG,
516 |             format='%(asctime)s - %(levelname)s - %(message)s',
517 |             handlers=[
518 |                 logging.FileHandler('zim_converter_debug.log'),
519 |                 logging.StreamHandler()
520 |             ]
521 |         )
522 |         logger.info("Debug logging enabled - detailed output will be saved to zim_converter_debug.log")
523 |     else:
524 |         logging.basicConfig(
525 |             level=logging.INFO,
526 |             format='%(levelname)s: %(message)s',
527 |             handlers=[logging.StreamHandler()]
528 |         )
529 | 
530 |     logger.info("Starting ZIM conversion")
531 |     logger.info(f"ZIM file: {args.zim_file}")
532 |     logger.info(f"Output DB: {args.output_db}")
533 | 
534 |     # Check if ZIM file exists
535 |     if not os.path.exists(args.zim_file):
536 |         logger.error(f"ZIM file not found: {args.zim_file}")
537 |         exit(1)
538 | 
539 |     # NOTE: Removed duplicate database connection setup here
540 |     # convert_zim() handles its own connection to avoid conflicts
541 | 
542 |     if args.article_list:
543 |         logger.info(f"Loading article list from: {args.article_list}")
544 |         articles = open(args.article_list).read().splitlines()
545 |         logger.info(f"Loaded {len(articles)} articles from list")
546 |     else:
547 |         articles = None
548 |         logger.info("Processing all entries in ZIM file")
549 | 
550 |     # Check for ImageMagick if compression is requested
551 |     if args.compress_images:
552 |         try:
553 |             result = subprocess.run(['convert', '-version'], capture_output=True, text=True, timeout=5)
554 |             if result.returncode == 0:
555 |                 get_image_link._imagemagick_available = True
556 |                 logger.info("ImageMagick detected - image compression enabled")
557 |                 args.include_images = True  # Compress images implies include images
558 |             else:
559 |                 logger.error("ImageMagick not found! Please install ImageMagick to use --compress-images")
560 |                 exit(1)
561 |         except Exception as e:
562 |             logger.error(f"Failed to check ImageMagick: {e}")
563 |             logger.error("Please install ImageMagick to use --compress-images")
564 |             exit(1)
565 | 
566 |     # Set image processing flags
567 |     process_article_entry._include_images = args.include_images
568 |     process_article_entry._compress_images = args.compress_images
569 | 
570 |     if args.compress_images:
571 |         logger.warning("EXTREME e-ink image compression enabled!")
572 |         logger.info(f"Images: 16-level grayscale, dithered, max 600x400, <{MAX_COMPRESSED_IMAGE_SIZE} bytes each")
573 |     elif args.include_images:
574 |         logger.warning("Image processing enabled - database size will be significantly larger!")
575 |         logger.info(f"Max image size: {MAX_IMAGE_SIZE} bytes")
576 |     else:
577 |         logger.info("Image processing disabled - text-only conversion for smaller database")
578 | 
579 |     try:
580 |         stats = convert_zim(args.zim_file, args.output_db, articles)
581 |         logger.info("Conversion completed successfully!")
582 |         logger.info(f"Final statistics: {stats}")
583 |     except Exception as e:
584 |         logger.error(f"Conversion failed: {e}")
585 |         raise
586 | 
587 |     print('Done')
588 | 
--------------------------------------------------------------------------------