├── smart-ingest-kit
│   ├── requirements.txt
│   ├── demo.py
│   ├── README.md
│   └── smart_ingestor.py
├── LICENSE
└── README.md

/smart-ingest-kit/requirements.txt:
--------------------------------------------------------------------------------
docling
pydantic
loguru
llama-index-core  # Optional, but good for typing
--------------------------------------------------------------------------------

/smart-ingest-kit/demo.py:
--------------------------------------------------------------------------------
import sys

from smart_ingestor import ingest_file


def main():
    if len(sys.argv) < 2:
        print("Usage: python demo.py <file_path>")
        print("Supported: .pdf, .docx, .pptx, .md, .html")
        return

    file_path = sys.argv[1]
    print(f"--- Smart Ingest Demo: {file_path} ---")

    try:
        docs = ingest_file(file_path)
        doc = docs[0]

        print("\n✅ Success!")
        print(f"📄 Extracted Length: {len(doc.text)} chars")
        print("🧠 Metadata & Heuristics:")
        for k, v in doc.metadata.items():
            print(f"  - {k}: {v}")

        print("\n--- Content Preview (Markdown) ---")
        print(doc.text[:500] + "...")

    except Exception as e:
        print(f"❌ Error: {e}")


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 2 Dogs a Nerd

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/smart-ingest-kit/README.md:
--------------------------------------------------------------------------------
# 🧠 Smart Ingest Kit

> **"Stop using static chunk sizes."**

This is a lightweight extraction from a production RAG platform. It solves the "Garbage In, Garbage Out" problem by applying intelligent parsing and heuristics *before* your data hits the vector database.

## 🚀 Features

1. **Layout-Aware Parsing:** Uses [Docling](https://github.com/DS4SD/docling) to preserve document structure (headers, tables, lists) as Markdown. No more "soup of text".
2. **Smart Heuristics:** Automatically selects the optimal chunking settings for each file type.
   * **PDFs:** Larger chunks (800 chars) with semantic splitting.
   * **Code:** Small chunks (256 chars) to preserve logic.
   * **Markdown:** Medium chunks (400 chars).
3. **Metadata Enrichment:** Tags your documents with the optimal settings, so your RAG pipeline knows how to handle them.

## 📦 Installation

```bash
pip install -r requirements.txt
```

## ⚡ Usage

```python
from smart_ingestor import ingest_file

# Just pass a file. The system handles the rest.
docs = ingest_file("my_complex_contract.pdf")

print(docs[0].text)                            # Clean Markdown
print(docs[0].metadata['optimal_chunk_size'])  # e.g., 800
```
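
If you are already on LlamaIndex, the tagged metadata can drive your splitter directly. A minimal sketch, assuming `llama-index-core` is installed (so `ingest_file` returns real LlamaIndex `Document` objects); the splitter below is plain LlamaIndex, not part of this kit, and `SentenceSplitter` counts tokens while the heuristics above are character counts, so treat the values as a starting point rather than an exact mapping:

```python
from llama_index.core.node_parser import SentenceSplitter

from smart_ingestor import ingest_file

docs = ingest_file("my_complex_contract.pdf")
meta = docs[0].metadata

# Reuse the tagged heuristics instead of a hard-coded chunk_size=1000.
splitter = SentenceSplitter(
    chunk_size=meta["optimal_chunk_size"],
    chunk_overlap=meta["optimal_overlap"],
)
nodes = splitter.get_nodes_from_documents(docs)
print(f"{len(nodes)} chunks, target size {meta['optimal_chunk_size']}")
```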

## ❓ Why does this exist?

Most RAG tutorials tell you to use `RecursiveCharacterTextSplitter(chunk_size=1000)`.
That's fine for demos, but bad for production.
* A 1000-char chunk might cut a Python function in half.
* A PDF table needs to be understood as a table, not as a stream of words.

This kit gives you the **"ingestion logic"** of a professional system, without the bloat.

## 🤝 Credits

Extracted from the **Mail Modul Alpha** RAG Platform.
Powered by [Docling](https://github.com/DS4SD/docling).
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Smart Ingest Kit

**Stop using static chunk sizes.** A lightweight, production-ready RAG ingestion toolkit that uses smart heuristics for optimal, layout-aware chunking.

*Extracted from a battle-tested, production RAG platform.*

---

### ✨ Love this tool? This is just the beginning.

This toolkit is a core component of a much larger, **private-by-design AI platform** I'm building. It's designed to be the central, searchable brain for all your data, running entirely on your own hardware.

If you're tired of generic AI solutions and believe in the power of data privacy, follow the journey.

See also my [smart-router-kit](https://github.com/2dogsandanerd/smart-router-kit): stop sending every query to every data source. It's a lightweight, production-ready RAG routing toolkit that uses an LLM to intelligently route user queries to the right tool or data source.

You might also be interested in my **[Knowledge-Base Self-Hosting Kit](https://www.reddit.com/r/docling/comments/1p6koa0/knowledgebase_selfhosting_kit_a_productionready/)**, a production-ready starter that glues Smart-Ingest-Kit and Smart-Router-Kit together.

➡️ **[Get early access and join the Private AI Lab here](https://mailchi.mp/38a074f598a3/github_catcher)** ⬅️

---

## 🤔 Why Smart Ingest Kit?

Standard RAG pipelines use a "one-size-fits-all" approach with static chunk sizes. This works okay for simple text, but it fails miserably with complex documents like PDFs with tables, source code, or structured Markdown. The result: poor context and bad answers.

This kit fixes that by being smart about the ingestion process.

## ✅ Features

* **Layout-Aware Parsing:** Uses `Docling` to understand the structure of your documents. Tables, titles, and lists are treated as what they are.
* **Smart Chunking Heuristics:** Applies different chunking strategies for different file types. Code is chunked differently than a research paper.
* **Production-Ready & Lightweight:** No complex dependencies. Just a simple, effective toolkit to improve your RAG pipeline.
* **Preserves Table Structure:** Solves the nightmare of tables in PDFs by converting them to Markdown before chunking, keeping the relational data intact.

## 🚀 Quick Start

(Coming soon - I'm working on making this a pip-installable package! Until then, you can run it straight from a checkout, as shown below.)
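
A minimal sketch of running the kit from a clone of this repo. The `sys.path` tweak is needed because the module lives in `smart-ingest-kit/`, which is not an importable package name, and `my_document.pdf` is just a placeholder for your own file:

```python
# Run from the repository root after: pip install -r smart-ingest-kit/requirements.txt
import sys

sys.path.append("smart-ingest-kit")  # makes smart_ingestor.py importable

from smart_ingestor import ingest_file

docs = ingest_file("my_document.pdf")
print(docs[0].metadata)  # includes optimal_chunk_size / optimal_overlap
```

The bundled demo does the same thing from the command line: `python smart-ingest-kit/demo.py my_document.pdf`.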

## 🤝 Contributing

This is a new open-source project and I'm open to any ideas or contributions. Feel free to open an issue or a pull request.
--------------------------------------------------------------------------------

/smart-ingest-kit/smart_ingestor.py:
--------------------------------------------------------------------------------
from pathlib import Path
from typing import List

from pydantic import BaseModel
from loguru import logger

try:
    from llama_index.core.schema import Document
except ImportError:
    # Fallback for non-LlamaIndex users
    class Document:
        def __init__(self, text: str, metadata: dict):
            self.text = text
            self.metadata = metadata

        def __repr__(self):
            return f"Document(text={self.text[:50]}..., metadata={self.metadata})"

# --- Configuration & Heuristics ---

class ChunkConfig(BaseModel):
    """Heuristic defaults for chunking per document type."""
    chunk_size: int     # Size in characters
    overlap: int        # Overlap in characters
    splitter_type: str  # "semantic", "fixed", "code", "row_based"

class IngestHeuristics(BaseModel):
    """Document-type-specific heuristics - the 'secret sauce'."""
    pdf: ChunkConfig = ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")
    docx: ChunkConfig = ChunkConfig(chunk_size=600, overlap=100, splitter_type="semantic")
    html: ChunkConfig = ChunkConfig(chunk_size=500, overlap=80, splitter_type="semantic")
    markdown: ChunkConfig = ChunkConfig(chunk_size=400, overlap=60, splitter_type="semantic")
    csv: ChunkConfig = ChunkConfig(chunk_size=500, overlap=50, splitter_type="row_based")
    email: ChunkConfig = ChunkConfig(chunk_size=512, overlap=80, splitter_type="semantic")
    code: ChunkConfig = ChunkConfig(chunk_size=256, overlap=40, splitter_type="code")
    default: ChunkConfig = ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")

    @classmethod
    def get_config_for_file(cls, filename: str) -> ChunkConfig:
        ext = Path(filename).suffix.lower().lstrip('.')
        # Map common extensions onto the field names above ('md' would otherwise
        # miss the 'markdown' config and silently fall back to 'default').
        aliases = {'md': 'markdown', 'htm': 'html', 'eml': 'email',
                   'py': 'code', 'js': 'code', 'ts': 'code'}
        ext = aliases.get(ext, ext)
        heuristics = cls()
        config = getattr(heuristics, ext, None)
        if isinstance(config, ChunkConfig):
            return config
        return heuristics.default
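
# A quick illustration of the heuristics above (the values mirror the defaults
# and are only a starting point - tune them for your own corpus):
#
#   IngestHeuristics.get_config_for_file("contract.pdf")
#   -> ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")
#
#   IngestHeuristics.get_config_for_file("notes.md")
#   -> ChunkConfig(chunk_size=400, overlap=60, splitter_type="semantic")
#
#   IngestHeuristics.get_config_for_file("archive.xyz")
#   -> ChunkConfig(chunk_size=800, overlap=120, splitter_type="semantic")  # 'default'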

# --- The Smart Loader ---

class SmartDoclingLoader:
    """
    Smart document loader using Docling.

    Features:
    - Layout-aware parsing (tables, headers)
    - Auto-format detection
    - Returns Markdown-formatted text (preserving structure)
    """

    SUPPORTED_EXTENSIONS = {'.pdf', '.docx', '.pptx', '.xlsx', '.html', '.md'}

    def __init__(self, file_path: str):
        self.file_path = Path(file_path)
        if not self.file_path.exists():
            raise FileNotFoundError(f"Document not found: {file_path}")

    def load(self) -> List[Document]:
        """Load and parse the document using Docling."""
        try:
            from docling.document_converter import DocumentConverter

            logger.info(f"🚀 Processing with Docling: {self.file_path.name}")

            # 1. Convert
            converter = DocumentConverter()
            result = converter.convert(str(self.file_path))

            # 2. Export to Markdown (the key to preserving layout!)
            markdown_content = result.document.export_to_markdown()

            # 3. Get optimal settings (heuristics)
            config = IngestHeuristics.get_config_for_file(self.file_path.name)
            logger.info(
                f"🧠 Applied heuristics for {self.file_path.suffix}: "
                f"Size={config.chunk_size}, Overlap={config.overlap}"
            )

            # 4. Create the Document
            doc = Document(
                text=markdown_content,
                metadata={
                    'source': str(self.file_path),
                    'file_name': self.file_path.name,
                    'file_type': self.file_path.suffix.lower(),
                    'loader': 'smart_docling',
                    'optimal_chunk_size': config.chunk_size,
                    'optimal_overlap': config.overlap
                }
            )

            return [doc]

        except ImportError:
            logger.error("Docling not installed. Run: pip install docling")
            raise
        except Exception as e:
            logger.error(f"Failed to process {self.file_path.name}: {e}")
            raise

# --- Demo Function ---

def ingest_file(file_path: str) -> List[Document]:
    """Convenience wrapper: load a single file and return its parsed documents."""
    loader = SmartDoclingLoader(file_path)
    docs = loader.load()
    return docs
--------------------------------------------------------------------------------