├── smart-ingest-kit
│   ├── requirements.txt
│   ├── demo.py
│   ├── README.md
│   └── smart_ingestor.py
├── LICENSE
└── README.md

/smart-ingest-kit/requirements.txt:
--------------------------------------------------------------------------------
docling
pydantic
loguru
llama-index-core  # Optional, but good for typing
--------------------------------------------------------------------------------

/smart-ingest-kit/demo.py:
--------------------------------------------------------------------------------
import sys

from smart_ingestor import ingest_file


def main():
    if len(sys.argv) < 2:
        print("Usage: python demo.py <file_path>")
        print("Supported: .pdf, .docx, .pptx, .md, .html")
        return

    file_path = sys.argv[1]
    print(f"--- Smart Ingest Demo: {file_path} ---")

    try:
        docs = ingest_file(file_path)
        doc = docs[0]

        print("\n✅ Success!")
        print(f"📄 Extracted Length: {len(doc.text)} chars")
        print("🧠 Metadata & Heuristics:")
        for k, v in doc.metadata.items():
            print(f"  - {k}: {v}")

        print("\n--- Content Preview (Markdown) ---")
        print(doc.text[:500] + "...")

    except Exception as e:
        print(f"❌ Error: {e}")


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 2 Dogs a Nerd

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/smart-ingest-kit/README.md:
--------------------------------------------------------------------------------
# 🧠 Smart Ingest Kit

> **"Stop using static chunk sizes."**

This is a lightweight extraction from a production RAG platform. It solves the "Garbage In, Garbage Out" problem by applying intelligent parsing and heuristics *before* your data hits the vector database.

## 🚀 Features

1. **Layout-Aware Parsing:** Uses [Docling](https://github.com/DS4SD/docling) to preserve document structure (headers, tables, lists) as Markdown. No more "soup of text".
2. **Smart Heuristics:** Automatically selects the optimal chunking settings for each file type.
   * **PDFs:** Larger chunks (800 chars) with semantic splitting.
   * **Code:** Small chunks (256 chars) to preserve logic.
   * **Markdown:** Medium chunks (400 chars).
3. **Metadata Enrichment:** Tags your documents with the optimal settings, so your RAG pipeline knows how to handle them.

## 📦 Installation

```bash
pip install -r requirements.txt
```

## ⚡ Usage

```python
from smart_ingestor import ingest_file

# Just pass a file. The system handles the rest.
docs = ingest_file("my_complex_contract.pdf")

print(docs[0].text)                            # Clean Markdown
print(docs[0].metadata['optimal_chunk_size'])  # e.g., 800
```
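
If you are already on LlamaIndex, the tagged metadata can drive your splitter directly. A minimal sketch, assuming `llama-index-core` is installed (so `ingest_file` returns real LlamaIndex `Document` objects); the splitter below is plain LlamaIndex, not part of this kit, and `SentenceSplitter` counts tokens while the heuristics above are character counts, so treat the values as a starting point rather than an exact mapping:

```python
from llama_index.core.node_parser import SentenceSplitter

from smart_ingestor import ingest_file

docs = ingest_file("my_complex_contract.pdf")
meta = docs[0].metadata

# Reuse the tagged heuristics instead of a hard-coded chunk_size=1000.
splitter = SentenceSplitter(
    chunk_size=meta["optimal_chunk_size"],
    chunk_overlap=meta["optimal_overlap"],
)
nodes = splitter.get_nodes_from_documents(docs)
print(f"{len(nodes)} chunks, target size {meta['optimal_chunk_size']}")
```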

## ❓ Why does this exist?

Most RAG tutorials tell you to use `RecursiveCharacterTextSplitter(chunk_size=1000)`.
That's fine for demos, but bad for production.
* A 1000-char chunk might cut a Python function in half.
* A PDF table needs to be understood as a table, not as a stream of words.

This kit gives you the **"ingestion logic"** of a professional system, without the bloat.

## 🤝 Credits

Extracted from the **Mail Modul Alpha** RAG Platform.
Powered by [Docling](https://github.com/DS4SD/docling).
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Smart Ingest Kit

**Stop using static chunk sizes.** A lightweight, production-ready RAG ingestion toolkit that uses smart heuristics for optimal, layout-aware chunking.

*Extracted from a battle-tested, production RAG platform.*

---

### ✨ Love this tool? This is just the beginning.

This toolkit is a core component of a much larger, **private-by-design AI platform** I'm building. It's designed to be the central, searchable brain for all your data, running entirely on your own hardware.

If you're tired of generic AI solutions and believe in the power of data privacy, follow the journey.

See also my [smart-router-kit](https://github.com/2dogsandanerd/smart-router-kit): stop sending every query to every data source. It's a lightweight, production-ready RAG routing toolkit that uses an LLM to intelligently route user queries to the right tool or data source.

You might also be interested in my **[Knowledge-Base Self-Hosting Kit](https://www.reddit.com/r/docling/comments/1p6koa0/knowledgebase_selfhosting_kit_a_productionready/)**, a production-ready starter that glues Smart-Ingest-Kit and Smart-Router-Kit together.

➡️ **[Get early access and join the Private AI Lab here](https://mailchi.mp/38a074f598a3/github_catcher)** ⬅️

---

## 🤔 Why Smart Ingest Kit?

Standard RAG pipelines use a "one-size-fits-all" approach with static chunk sizes. This works okay for simple text, but it fails miserably with complex documents like PDFs with tables, source code, or structured Markdown. The result: poor context and bad answers.

This kit fixes that by being smart about the ingestion process.

## ✅ Features

* **Layout-Aware Parsing:** Uses `Docling` to understand the structure of your documents. Tables, titles, and lists are treated as what they are.
* **Smart Chunking Heuristics:** Applies different chunking strategies for different file types. Code is chunked differently than a research paper.
* **Production-Ready & Lightweight:** No complex dependencies. Just a simple, effective toolkit to improve your RAG pipeline.
* **Preserves Table Structure:** Solves the nightmare of tables in PDFs by converting them to Markdown before chunking, keeping the relational data intact.

## 🚀 Quick Start

(Coming soon - I'm working on making this a pip-installable package! Until then, you can run it straight from a checkout, as shown below.)
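
A minimal sketch of running the kit from a clone of this repo. The `sys.path` tweak is needed because the module lives in `smart-ingest-kit/`, which is not an importable package name, and `my_document.pdf` is just a placeholder for your own file:

```python
# Run from the repository root after: pip install -r smart-ingest-kit/requirements.txt
import sys

sys.path.append("smart-ingest-kit")  # makes smart_ingestor.py importable

from smart_ingestor import ingest_file

docs = ingest_file("my_document.pdf")
print(docs[0].metadata)  # includes optimal_chunk_size / optimal_overlap
```

The bundled demo does the same thing from the command line: `python smart-ingest-kit/demo.py my_document.pdf`.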

## 🤝 Contributing

This is a new open-source project and I'm open to any ideas or contributions. Feel free to open an issue or a pull request.
--------------------------------------------------------------------------------

/smart-ingest-kit/smart_ingestor.py:
--------------------------------------------------------------------------------
from pathlib import Path
from typing import List

from pydantic import BaseModel
from loguru import logger

try:
    from llama_index.core.schema import Document
except ImportError:
    # Fallback for non-LlamaIndex users
    class Document:
        def __init__(self, text: str, metadata: dict):
            self.text = text
            self.metadata = metadata

        def __repr__(self):
            return f"Document(text={self.text[:50]}..., metadata={self.metadata})"

# --- Configuration & Heuristics ---

class ChunkConfig(BaseModel):
    """Heuristic defaults for chunking per document type."""
    chunk_size: int     # Size in characters
    overlap: int        # Overlap in characters
    splitter_type: str  # "semantic", "fixed", "code", "row_based"

class IngestHeuristics(BaseModel):
    """Document-type-specific heuristics - the 'secret sauce'."""
    pdf: ChunkConfig = ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")
    docx: ChunkConfig = ChunkConfig(chunk_size=600, overlap=100, splitter_type="semantic")
    html: ChunkConfig = ChunkConfig(chunk_size=500, overlap=80, splitter_type="semantic")
    markdown: ChunkConfig = ChunkConfig(chunk_size=400, overlap=60, splitter_type="semantic")
    csv: ChunkConfig = ChunkConfig(chunk_size=500, overlap=50, splitter_type="row_based")
    email: ChunkConfig = ChunkConfig(chunk_size=512, overlap=80, splitter_type="semantic")
    code: ChunkConfig = ChunkConfig(chunk_size=256, overlap=40, splitter_type="code")
    default: ChunkConfig = ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")

    @classmethod
    def get_config_for_file(cls, filename: str) -> ChunkConfig:
        ext = Path(filename).suffix.lower().lstrip('.')
        # Map common extensions onto the field names above ('md' would otherwise
        # miss the 'markdown' config and silently fall back to 'default').
        aliases = {'md': 'markdown', 'htm': 'html', 'eml': 'email',
                   'py': 'code', 'js': 'code', 'ts': 'code'}
        ext = aliases.get(ext, ext)
        heuristics = cls()
        config = getattr(heuristics, ext, None)
        if isinstance(config, ChunkConfig):
            return config
        return heuristics.default
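
# A quick illustration of the heuristics above (the values mirror the defaults
# and are only a starting point - tune them for your own corpus):
#
#   IngestHeuristics.get_config_for_file("contract.pdf")
#   -> ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")
#
#   IngestHeuristics.get_config_for_file("notes.md")
#   -> ChunkConfig(chunk_size=400, overlap=60, splitter_type="semantic")
#
#   IngestHeuristics.get_config_for_file("archive.xyz")
#   -> ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")  # 'default'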

# --- The Smart Loader ---

class SmartDoclingLoader:
    """
    Smart document loader using Docling.

    Features:
    - Layout-aware parsing (tables, headers)
    - Auto-format detection
    - Returns Markdown-formatted text (preserving structure)
    """

    SUPPORTED_EXTENSIONS = {'.pdf', '.docx', '.pptx', '.xlsx', '.html', '.md'}

    def __init__(self, file_path: str):
        self.file_path = Path(file_path)
        if not self.file_path.exists():
            raise FileNotFoundError(f"Document not found: {file_path}")

    def load(self) -> List[Document]:
        """Load and parse the document using Docling."""
        try:
            from docling.document_converter import DocumentConverter

            logger.info(f"🚀 Processing with Docling: {self.file_path.name}")

            # 1. Convert
            converter = DocumentConverter()
            result = converter.convert(str(self.file_path))

            # 2. Export to Markdown (the key to preserving layout!)
            markdown_content = result.document.export_to_markdown()

            # 3. Get optimal settings (heuristics)
            config = IngestHeuristics.get_config_for_file(self.file_path.name)
            logger.info(
                f"🧠 Applied heuristics for {self.file_path.suffix}: "
                f"Size={config.chunk_size}, Overlap={config.overlap}"
            )

            # 4. Create the Document
            doc = Document(
                text=markdown_content,
                metadata={
                    'source': str(self.file_path),
                    'file_name': self.file_path.name,
                    'file_type': self.file_path.suffix.lower(),
                    'loader': 'smart_docling',
                    'optimal_chunk_size': config.chunk_size,
                    'optimal_overlap': config.overlap
                }
            )

            return [doc]

        except ImportError:
            logger.error("Docling not installed. Run: pip install docling")
            raise
        except Exception as e:
            logger.error(f"Failed to process {self.file_path.name}: {e}")
            raise

# --- Demo Function ---

def ingest_file(file_path: str) -> List[Document]:
    """Convenience wrapper: load a single file and return its parsed documents."""
    loader = SmartDoclingLoader(file_path)
    docs = loader.load()
    return docs
--------------------------------------------------------------------------------