├── .dockerignore
├── .gitignore
├── requirements.txt
├── Dockerfile
├── README.md
├── parse_pageviews.py
└── zim_converter.py
/.dockerignore:
--------------------------------------------------------------------------------
1 | *db*
2 | *zim
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *db*
2 | *zim
3 | venv
4 | top*articles
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | libzim==2.0.0
2 | zstd==1.5.2.6
3 |
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | # syntax=docker/dockerfile:1
2 |
3 | FROM python:3.8-slim-buster
4 |
5 | WORKDIR /app
6 |
7 | COPY requirements.txt requirements.txt
8 | RUN pip3 install -r requirements.txt
9 |
10 | COPY . .
11 |
12 | ENTRYPOINT [ "python3", "/app/zim_converter.py" ]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ZIM-converter
2 |
3 | This Python project converts ZIM files (Kiwix format) to optimized SQLite databases for the WikiReader plugin in KOReader.
4 |
5 | **Now supports both Wikipedia AND non-Wikipedia content** (iFixit, Wiktionary, etc.) with enhanced error handling and image processing options.
6 | ## WikiReader
7 |
8 | I created this plugin for KOReader during some time off: https://github.com/koreader/koreader/pull/9534
9 | It has not been merged yet and probably never will be, but I am using it myself; it is fairly stable and works for me.
10 |
11 | ## Just give me a database
12 |
13 | If you have no programming experience or simply want a preconverted database file, one [can be found here](https://mega.nz/file/9zZlQIKC#ZDPEAQvo_jktEdaDn20AplywxXScJW5yOGB8BMfd1qA).
14 | This database contains the top 60k popular articles of English Wikipedia as of May 2025, with the top 10k articles also containing images. Note that the maximum file size on the FAT32 filesystems of common e-readers is 4 GB, so this is about as much information as you can pack into a single DB. If you have an external SD card formatted as NTFS or ext4, you could convert the full Wikipedia with images (~100 GiB). In theory it should work, but I have not tested it myself.
15 |
16 | [Old database](https://mega.nz/file/06AX2DrC#1WYLi9GsF2DV7VplMaMoK7bKGWna2ItIeiW92OekALg). This database contains 114,303 popular articles of English Wikipedia as of September 2022.
17 |
18 | ## Converting ZIM files yourself
19 |
20 | **Supported Content Types:**
21 | - ✅ **Wikipedia** (all languages and variants)
22 | - ✅ **iFixit** (repair guides with images)
23 | - ✅ **Wiktionary** (dictionaries)
24 | - ✅ **Any ZIM file** (universal namespace support)
25 |
26 | Conversion is optimized and includes comprehensive error handling and debug logging.
27 |
28 | ### How to get ZIM file
29 |
30 | You can download a dump of Wikipedia's most popular articles from their servers, or use a mirror [like this one](http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/). I recommend using a dump whose name starts with `wikipedia_en_top_nopic`.
31 |
32 | On the command line you could for example do this:
33 |
34 | ```bash
35 | wget -O wikipedia.zim http://ftp.acc.umu.se/mirror/wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_top_nopic_2022-09.zim
36 | ```
37 |
38 | ### Running the CLI
39 |
40 | ```bash
41 | # First install the dependencies with pip:
42 | pip install -r requirements.txt
43 |
44 | # Basic conversion (text-only, smallest database):
45 | python3 zim_converter.py --zim-file ./wikipedia.zim --output-db ./zim_articles.db
46 |
47 | # Include images (larger database):
48 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --include-images
49 |
50 | # EXTREME e-ink compression (requires ImageMagick):
51 | # 16-level grayscale, dithered, max 600x400, <15KB per image
52 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --compress-images
53 |
54 | # Enable debug logging for troubleshooting:
55 | python3 zim_converter.py --zim-file ./ifixit.zim --output-db ./ifixit.db --debug
56 | ```
57 |
58 | **Options:**
59 | - `--include-images`: Include original images (significantly increases size)
60 | - `--compress-images`: EXTREME e-ink optimization - 16-level grayscale, dithered, max 600x400, <15KB per image (requires ImageMagick; see the example command below)
61 | - `--debug`: Enable detailed debug logging to file and console (for troubleshooting)
62 |
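Under the hood, `--compress-images` shells out to ImageMagick's `convert` tool. The sketch below shows roughly the per-image pipeline used by `zim_converter.py`, with `input.png` and `output.jpg` as placeholder file names, in case you want to preview the effect on a single image first:

```bash
# Approximate ImageMagick pipeline behind --compress-images:
# grayscale -> shrink to at most 600x400 -> contrast stretch ->
# Floyd-Steinberg dithering to 16 gray levels -> quality-35 JPEG with metadata stripped
convert input.png \
  -colorspace Gray \
  -resize '600x400>' \
  -contrast-stretch 0.15x0.05% \
  -dither FloydSteinberg \
  -colors 16 \
  -quality 35 \
  -strip \
  -interlace none \
  -sampling-factor 4:2:0 \
  output.jpg
```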
63 | **Install ImageMagick for image compression:**
64 | ```bash
65 | # Ubuntu/Debian:
66 | sudo apt install imagemagick
67 |
68 | # macOS:
69 | brew install imagemagick
70 |
71 | # Windows: Download from https://imagemagick.org/
72 | ```
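You can check that ImageMagick is available on your `PATH` before running a conversion (ImageMagick 7 installs also provide a `magick` binary that wraps `convert`):

```bash
convert -version
```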
73 |
74 | Then transfer the `.db` file to your e-reader and set it as the database in the WikiReader plugin.
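If you want to sanity-check the result before copying it over, you can query it with the `sqlite3` command-line tool. The table and column names below match the schema created by `zim_converter.py`; the title `'earth'` is only a placeholder:

```bash
# Count converted articles and try a case-insensitive title lookup
sqlite3 zim_articles.db "SELECT COUNT(*) FROM articles;"
sqlite3 zim_articles.db "SELECT id FROM title_2_id WHERE title_lower_case = 'earth';"
```

Article bodies are stored zstd-compressed in `articles.page_content_zstd`, so they are not directly readable from the shell.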
75 |
76 | ### Docker
77 |
78 | You can install the two dependencies manually and run the Python file with the appropriate arguments, but you can also
79 | build and run the Docker image if preferred. For example, when the ZIM file is called `wikipedia.zim` in the current directory:
80 |
81 | ```bash
82 | docker build --tag zim-converter .
83 | docker run --rm -it -v $(pwd):/project zim-converter --zim-file /project/wikipedia.zim --output-db /project/zim_articles.db
84 | ```
85 |
--------------------------------------------------------------------------------
/parse_pageviews.py:
--------------------------------------------------------------------------------
1 | import time
2 | import sqlite3
3 | import subprocess
4 | import argparse
5 |
6 |
7 | MIN_MONTHLY_PAGE_COUNT = 500
8 |
9 |
10 | def setup_db(con):
11 | cursor = con.cursor()
12 |
13 | cursor.executescript("""
14 | PRAGMA journal_mode=WAL;
15 |
16 | CREATE TABLE IF NOT EXISTS pageviews (
17 | wiki_domain_id TEXT PRIMARY KEY,
18 | wiki_id INTEGER NOT NULL,
19 | domain TEXT NOT NULL,
20 | view_count INTEGER NOT NULL
21 | );
22 |
23 | CREATE TABLE IF NOT EXISTS title_2_wiki_domain_id (
24 | wiki_domain_id TEXT NOT NULL,
25 | title TEXT PRIMARY KEY
26 | );
27 |
28 |
29 | """)
30 | con.commit()
31 |
32 |
33 | def getbz2_multithreaded(url: str, silent=False):
34 | curlOutput = subprocess.Popen(('curl', '--silent', '-L', url), stdout=subprocess.PIPE)
35 |
36 | if silent:
37 | rawOutput = subprocess.Popen(('lbzip2', '-d'), stdin=curlOutput.stdout, stdout=subprocess.PIPE)
38 | else:
39 | networkOutput = subprocess.Popen(('pv', '-cN', 'network'), stdin=curlOutput.stdout, stdout=subprocess.PIPE)
40 | lbzipOutput = subprocess.Popen(('lbzip2', '-d'), stdin=networkOutput.stdout, stdout=subprocess.PIPE)
41 | rawOutput = subprocess.Popen(('pv', '-cN', 'raw'), stdin=lbzipOutput.stdout, stdout=subprocess.PIPE)
42 |
43 | return rawOutput.stdout
44 |
45 |
46 | def parse_pageviews_xml_url(url: str, con: sqlite3.Connection, pageview_stats: dict, valid_domains: list):
47 | cursor = con.cursor()
48 | num_bytes_read = 0
49 | num_lines_parsed = 0
50 | start_time = time.time()
51 |
52 | input_fh = getbz2_multithreaded(url)
53 | for line in input_fh:
54 | # Convert to string if needed
55 | if type(line) is bytes:
56 | line = line.decode("utf-8")[:-1]
57 | num_bytes_read += len(line)
58 | num_lines_parsed += 1
59 |
60 | # Report progress every once in a while
61 | if num_lines_parsed % 400000 == 0 and False:
62 | # Currently done by pipeviewer
63 | time_passed = time.time() - start_time
64 | print(f"Total speed: {(num_bytes_read / 1048576):.1f} MiB / {time_passed:.0f} seconds")
65 |
66 | # Break up the line
67 | line_parts = line.split(' ')
68 | if (len(line_parts) < 4):
69 | continue
70 |
71 | # Extract relevant page information
72 | domain = line_parts[0]
73 | page_name = line_parts[1]
74 | page_wiki_id = line_parts[2]
75 | # page_count_str = line_parts[-1]
76 | page_count = int(line_parts[-2])
77 | if page_wiki_id == "null":
78 | continue
79 | # if domain != "en.wikipedia":
80 | # continue
81 | # if page_name.startswith("Category") or page_name.startswith("Talk"):
82 | # continue
83 |
84 | if page_count < MIN_MONTHLY_PAGE_COUNT:
85 | continue
86 | if domain not in valid_domains:
87 | continue
88 |
89 | # Add new page entries into the stats dict if they do not exist
90 | if domain not in pageview_stats:
91 | pageview_stats[domain] = {}
92 | if page_wiki_id not in pageview_stats[domain]:
93 | pageview_stats[domain][page_wiki_id] = 0
94 |
95 | # Increment the counter for this page with the page_count
96 | pageview_stats[domain][page_wiki_id] += page_count
97 |
98 | total_page_count = pageview_stats[domain][page_wiki_id]
99 | wiki_domain_id = f'{domain}-{page_wiki_id}'
100 |
101 | cursor.execute("INSERT OR REPLACE INTO pageviews VALUES(?, ?, ?, ?)", (
102 | wiki_domain_id, page_wiki_id, domain, total_page_count
103 | )
104 | )
105 |
106 | cursor.execute("INSERT OR REPLACE INTO title_2_wiki_domain_id VALUES(?, ?)", (
107 | wiki_domain_id, page_name
108 | )
109 | )
110 | if num_lines_parsed % 10000 == 0:
111 | con.commit()
112 | print('Done with bz2 file')
113 | con.commit()
114 |
115 |
116 | if __name__ == "__main__":
117 | parser = argparse.ArgumentParser(description='Process Wikipedia pageviews dump.')
118 | parser.add_argument(
119 | '--pageview-url', help='URL of pageview dump .bz2',
120 | default="http://ftp.acc.umu.se/mirror/wikimedia.org/other/pageview_complete/monthly/2022/2022-08/pageviews-202208-user.bz2"
121 | )
122 | parser.add_argument(
123 | '--database-path', help='Path where the database will be stored',
124 | default="pageviews.db"
125 | )
126 | args = parser.parse_args()
127 |
128 | con = sqlite3.connect(args.database_path)
129 | setup_db(con)
130 | pageview_stats = {}
131 |
132 | valid_domains = ["en.wikipedia"]
133 | parse_pageviews_xml_url(args.pageview_url, con, pageview_stats, valid_domains)
134 |
135 | print("done")
136 |
--------------------------------------------------------------------------------
/zim_converter.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sqlite3
3 | import zstd
4 | import argparse
5 | from libzim import Archive
6 | from multiprocessing import Pool
7 | import re
8 | import base64
9 | import time
10 | import logging
11 | import subprocess
12 | import tempfile
13 |
14 | # Setup logging (will be configured later based on --debug flag)
15 | logger = logging.getLogger(__name__)
16 |
17 | img_src_pattern = r'<img[^>]*src=["\']([^"\']+)["\']'
18 | css_link_pattern = r'<link[^>]*href=["\']([^"\']+)["\'][^>]*rel=["\']stylesheet["\'][^>]*>'
19 | MAX_IMAGE_SIZE = 300 * 1024 # 300 KB in bytes
20 | MAX_COMPRESSED_IMAGE_SIZE = 15 * 1024 # 15 KB for e-ink optimized images
21 |
22 | def compress_image_with_imagemagick(image_data, mime_type):
23 | """
24 | EXTREME e-ink optimization using ImageMagick:
25 | - Convert to 16-level grayscale with Floyd-Steinberg dithering
26 | - Resize to max 600x400 for smaller files
27 | - Enhance contrast for better e-ink visibility
28 | - Very aggressive compression (quality 35)
29 | - Target: <15KB per image for optimal e-reader performance
30 | Returns compressed image data and new size, or None if compression fails
31 | """
32 | try:
33 | # Create temporary files
34 | with tempfile.NamedTemporaryFile(suffix=f'.{mime_type}', delete=False) as input_file:
35 | input_file.write(image_data)
36 | input_path = input_file.name
37 |
38 | with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as output_file:
39 | output_path = output_file.name
40 |
41 | # ImageMagick command for EXTREME e-ink optimization:
42 | # - Convert to pure black/white with dithering for better e-ink rendering
43 | # - Aggressive resizing for small file sizes
44 | # - Minimal quality for maximum compression
45 | cmd = [
46 | 'convert',
47 | input_path,
48 | '-colorspace', 'Gray', # Convert to grayscale first
49 | '-resize', '600x400>', # Smaller max size for e-ink
50 | '-contrast-stretch', '0.15x0.05%', # Enhance contrast for e-ink
51 | '-dither', 'FloydSteinberg', # Add dithering for better e-ink display
52 | '-colors', '16', # Reduce to 16 gray levels (e-ink friendly)
53 | '-quality', '35', # Very aggressive compression
54 | '-strip', # Remove all metadata
55 | '-interlace', 'none', # Remove progressive encoding
56 | '-sampling-factor', '4:2:0', # Aggressive chroma subsampling
57 | output_path
58 | ]
59 |
60 | result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
61 |
62 | if result.returncode == 0:
63 | # Read compressed image
64 | with open(output_path, 'rb') as f:
65 | compressed_data = f.read()
66 |
67 | # Clean up temp files
68 | os.unlink(input_path)
69 | os.unlink(output_path)
70 |
71 | if len(compressed_data) < MAX_COMPRESSED_IMAGE_SIZE:
72 | logger.debug(f"Image compressed: {len(image_data)} -> {len(compressed_data)} bytes ({len(compressed_data)/len(image_data)*100:.1f}%)")
73 | return compressed_data, 'jpeg'
74 | else:
75 | logger.debug(f"Compressed image still too large ({len(compressed_data)} bytes), skipping")
76 | return None, None
77 | else:
78 | logger.warning(f"ImageMagick conversion failed: {result.stderr}")
79 | # Clean up temp files
80 | try:
81 | os.unlink(input_path)
82 | os.unlink(output_path)
83 | except OSError:
84 | pass
85 | return None, None
86 |
87 | except subprocess.TimeoutExpired:
88 | logger.warning("ImageMagick conversion timed out")
89 | return None, None
90 | except Exception as e:
91 | logger.error(f"Image compression failed: {e}")
92 | return None, None
93 |
94 | def is_binary_content(path):
95 | """
96 | Check if a path points to binary/media content that shouldn't be processed as text
97 | """
98 | path_lower = path.lower()
99 |
100 | # Image file extensions
101 | image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.bmp', '.ico', '.tiff']
102 | # Media file extensions
103 | media_extensions = ['.mp4', '.webm', '.ogg', '.mp3', '.wav', '.pdf']
104 | # Font and style extensions
105 | font_extensions = ['.woff', '.woff2', '.ttf', '.otf', '.eot']
106 | # Other binary extensions
107 | other_extensions = ['.zip', '.gz', '.tar', '.exe', '.bin']
108 |
109 | # Check file extension
110 | for ext in image_extensions + media_extensions + font_extensions + other_extensions:
111 | if path_lower.endswith(ext):
112 | return True
113 |
114 | # Check for known image hosting patterns
115 | if any(pattern in path_lower for pattern in [
116 | 'images/', 'img/', 'static/', 'assets/', 'media/',
117 | '.amazonaws.com/', 'upload', 'thumbnail', 'logo'
118 | ]):
119 | return True
120 |
121 | return False
122 |
123 | def setup_db(con):
124 | """Setup a SQLite database in the format expected by WikiReader
125 | """
126 | cursor = con.cursor()
127 |
128 | cursor.executescript("""
129 | PRAGMA journal_mode=WAL;
130 |
131 | CREATE TABLE IF NOT EXISTS articles (
132 | id INTEGER PRIMARY KEY,
133 | title TEXT NOT NULL UNIQUE,
134 | page_content_zstd BLOB NOT NULL
135 | );
136 |
137 | CREATE TABLE IF NOT EXISTS title_2_id (
138 | id INTEGER NOT NULL,
139 | title_lower_case TEXT PRIMARY KEY
140 | );
141 |
142 | DROP TABLE IF EXISTS css;
143 | CREATE TABLE IF NOT EXISTS css (
144 | content_zstd BLOB NOT NULL
145 | );
146 |
147 | -- Add indexes for better WikiReader compatibility
148 | CREATE INDEX IF NOT EXISTS idx_title_lower ON title_2_id(title_lower_case);
149 | CREATE INDEX IF NOT EXISTS idx_articles_title ON articles(title);
150 |
151 | """)
152 | con.commit()
153 |
154 | def get_mime_type(path):
155 | ext = path.split('.')[-1].lower()
156 | if ext == 'jpg' or ext == 'jpeg':
157 | return 'jpeg'
158 | elif ext == 'png':
159 | return 'png'
160 | elif ext == 'svg':
161 | return 'svg+xml'  # data URI MIME type for SVG is image/svg+xml
162 | elif ext == 'gif':
163 | return 'gif'
164 | else:
165 | return 'jpeg'
166 |
167 | def get_image_link(link: str, zim: Archive, compress_images=False):
168 | link = link.replace('../', '')
169 | try:
170 | entry = zim.get_entry_by_path(link)
171 | item = entry.get_item()
172 |
173 | original_size = len(item.content)
174 |
175 | # Check if we should compress the image
176 | if compress_images and hasattr(get_image_link, '_imagemagick_available'):
177 | if original_size > MAX_COMPRESSED_IMAGE_SIZE: # Only compress if larger than target
178 | mime_type = get_mime_type(item.path)
179 | compressed_data, new_mime = compress_image_with_imagemagick(item.content.tobytes(), mime_type)
180 |
181 | if compressed_data:
182 | # Use compressed image
183 | base64_bytes = base64.b64encode(compressed_data)
184 | base64_string = base64_bytes.decode('utf-8')
185 | size = len(base64_bytes)
186 | logger.debug(f"Using compressed image for {link}: {original_size} -> {len(compressed_data)} bytes")
187 | return f"data:image/{new_mime};base64,{base64_string}", size
188 | else:
189 | logger.debug(f"Compression failed for {link}, using original or skipping")
190 |
191 | # Use original image logic (with size check)
192 | if original_size > MAX_IMAGE_SIZE:
193 | logger.debug(f"Skipped image {link} ({original_size} bytes > max {MAX_IMAGE_SIZE})")
194 | return None, 0
195 |
196 | base64_bytes = base64.b64encode(item.content)
197 | base64_string = base64_bytes.decode('utf-8')
198 | mime_type = get_mime_type(item.path)
199 | size = len(base64_bytes)
200 | return f"data:image/{mime_type};base64,{base64_string}", size
201 | except KeyError:
202 | return None, 0
203 |
204 | def get_css_content(link: str, zim: Archive):
205 | link = link.replace('../', '')
206 | try:
207 | entry = zim.get_entry_by_path(link)
208 | item = entry.get_item()
209 | content = item.content.tobytes().decode('utf-8')
210 | return content, len(content.encode('utf-8'))
211 | except Exception as e:
212 | # print(f"Failed to load CSS {link}: {e}")
213 | return None, 0
214 |
215 | def replace_img_and_css_html(html: str, zim: Archive, compress_images=False):
216 | # Handle image sources
217 | img_sources = re.findall(img_src_pattern, html)
218 | for src in img_sources:
219 | image_link, size = get_image_link(src, zim, compress_images)
220 | if image_link:
221 | html = html.replace(src, image_link)
222 | else:
223 | pass
224 | # print('Failed for image', src)
225 |
226 | # Handle CSS links
227 | css_links = [] # re.findall(css_link_pattern, html)
228 | # print(css_links)
229 | css_links = [l for l in css_links if 'inserted_style' not in l]
230 | for href in css_links:
231 | css_content, size = get_css_content(href, zim)
232 | if css_content:
233 | style_tag = f"<style>{css_content}</style>"
234 | html = re.sub(
235 | rf'<link[^>]*href=["\']{re.escape(href)}["\'][^>]*>',
236 | style_tag,
237 | html
238 | )
239 | else:
240 | # print('Failed for CSS', href)
241 | pass
242 | return html
243 |
244 |
245 | def convert_zim(zim_path, db_path, article_list=None):
246 | """Process a range of a ZIM file into a seperate SQLite database"""
247 | logger.info(f"Starting ZIM conversion: {zim_path} -> {db_path}")
248 |
249 | con = sqlite3.connect(db_path)
250 | cursor = con.cursor()
251 | setup_db(con)
252 |
253 | try:
254 | zim = Archive(zim_path)
255 | logger.info(f"ZIM file loaded successfully. Entry count: {zim.entry_count}")
256 | except Exception as e:
257 | logger.error(f"Failed to load ZIM file {zim_path}: {e}")
258 | raise
259 |
260 | # Track processing statistics
261 | stats = {
262 | 'total_entries': 0,
263 | 'special_files_skipped': 0,
264 | 'redirects_processed': 0,
265 | 'articles_processed': 0,
266 | 'binary_files_skipped': 0,
267 | 'other_entries_skipped': 0,
268 | 'processing_errors': 0
269 | }
270 |
271 | def all_entry_gen():
272 | logger.info("Using all_entry_gen - processing all entries")
273 | for id in range(0, zim.entry_count):
274 | try:
275 | # Try to use public API first, fall back to private if needed
276 | try:
277 | zim_entry = zim.get_entry_by_id(id)
278 | except AttributeError:
279 | logger.debug(f"Public get_entry_by_id not available, using private _get_entry_by_id for entry {id}")
280 | zim_entry = zim._get_entry_by_id(id)
281 | yield zim_entry
282 | except Exception as e:
283 | logger.error(f"Failed to get entry {id}: {e}")
284 | stats['processing_errors'] += 1
285 |
286 | def selected_entry_gen():
287 | logger.info(f"Using selected_entry_gen - processing {len(article_list)} specific articles")
288 | for article_title in article_list:
289 | try:
290 | yield zim.get_entry_by_path('/A/' + article_title)
291 | except Exception as e:
292 | logger.error(f'Failed to get article {article_title}: {e}')
293 | stats['processing_errors'] += 1
294 |
295 | entry_gen = selected_entry_gen if article_list else all_entry_gen
296 | num_total = len(article_list) if article_list else zim.entry_count
297 | num_done = 0
298 | t0 = time.time()
299 |
300 | logger.info(f"Starting processing {num_total} entries")
301 |
302 | for zim_entry in entry_gen():
303 | stats['total_entries'] += 1
304 |
305 | # Log first 10 entries to understand structure
306 | if num_done < 10:
307 | logger.info(f"Entry {num_done}: path='{zim_entry.path}', title='{zim_entry.title}', is_redirect={zim_entry.is_redirect}")
308 |
309 | # Detect special files
310 | if zim_entry.path.startswith('-'):
311 | logger.debug(f"Skipping special file: {zim_entry.path}")
312 | stats['special_files_skipped'] += 1
313 | continue
314 |
315 | # deal with normal files
316 | if zim_entry.is_redirect:
317 | logger.debug(f"Processing redirect: {zim_entry.title} -> {zim_entry.path}")
318 | try:
319 | destination_entry = zim_entry.get_redirect_entry()
320 | # Try to use public API for index, fall back to private
321 | try:
322 | dest_index = destination_entry.index
323 | except AttributeError:
324 | logger.debug("Using private _index attribute for redirect destination")
325 | dest_index = destination_entry._index
326 |
327 | cursor.execute("INSERT OR REPLACE INTO title_2_id VALUES(?, ?)", [
328 | dest_index, zim_entry.title.lower()
329 | ])
330 | stats['redirects_processed'] += 1
331 | except Exception as e:
332 | logger.error(f"Failed to process redirect {zim_entry.title}: {e}")
333 | stats['processing_errors'] += 1
334 |
335 | elif zim_entry.path.startswith('A/'): # Wikipedia articles
336 | logger.debug(f"Processing Wikipedia article: {zim_entry.title}")
337 | try:
338 | process_article_entry(zim_entry, cursor, zim, stats)
339 | except Exception as e:
340 | logger.error(f"Failed to process Wikipedia article {zim_entry.title}: {e}")
341 | stats['processing_errors'] += 1
342 |
343 | elif not zim_entry.path.startswith('A/') and len(zim_entry.path) > 2 and '/' in zim_entry.path:
344 | # This might be a non-Wikipedia article (like iFixit)
345 | namespace = zim_entry.path.split('/')[0]
346 |
347 | # Skip known binary/media files
348 | if is_binary_content(zim_entry.path):
349 | logger.debug(f"Skipping binary/media file: {zim_entry.path}")
350 | stats['binary_files_skipped'] += 1
351 | continue
352 |
353 | if num_done < 50: # Log first 50 non-A/ entries to understand structure
354 | logger.info(f"Non-Wikipedia entry found: namespace='{namespace}', path='{zim_entry.path}', title='{zim_entry.title}'")
355 |
356 | # Try to process as article regardless of namespace
357 | try:
358 | process_article_entry(zim_entry, cursor, zim, stats)
359 | logger.debug(f"Successfully processed non-Wikipedia article: {zim_entry.title}")
360 | except Exception as e:
361 | logger.debug(f"Failed to process non-Wikipedia article {zim_entry.title}: {e}")
362 | stats['processing_errors'] += 1
363 | else:
364 | stats['other_entries_skipped'] += 1
365 | if num_done < 20: # Log first 20 skipped entries
366 | logger.debug(f"Skipping other entry: path='{zim_entry.path}', title='{zim_entry.title}'")
367 |
368 | num_done += 1
369 | # Commit to db on disk every once in a while
370 | if num_done % 500 == 0:
371 | elapsed = time.time() - t0
372 | logger.info(f'{elapsed:.1f}s Committing batch to db, at entry {num_done} of {num_total}')
373 | logger.info(f"Stats so far: {stats}")
374 | con.commit()
375 |
376 | elapsed = time.time() - t0
377 | logger.info(f"Processing completed in {elapsed:.1f}s")
378 | logger.info(f"Final statistics: {stats}")
379 |
380 | # Add a default "Ebook" entry for WikiReader plugin compatibility
381 | try:
382 | cursor.execute("SELECT COUNT(*) FROM articles WHERE title = 'Ebook'")
383 | if cursor.fetchone()[0] == 0:
384 | default_content = """
385 |
386 |
This database contains articles converted from a ZIM file.
390 |Use the search function to find articles.
391 |Database statistics:
392 |