├── .gitignore
├── LICENSE
├── README.md
├── chunker_regex.py
├── chunkify-gui.py
├── chunkify-no-kobold.bat
├── chunkify-run.bat
├── chunkify-run.sh
├── chunkify.kcppt
├── chunkify.py
├── chunkify_littleprince_translate.gif
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
1 | /chunkify_env
2 | /ignore
3 | /__pycache__
4 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 jabberjabberjabber
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Chunkify: A Python Script for Text Processing with Large Language Models
  2 | 
  3 | ## Overview
  4 | 
  5 | Chunkify was made as a proof-of-concept for a chunking method that doesn't rely on a tokenizer. The text processing features were added because they are commonly used and useful. Aya Expanse is particularly good at these tasks and is downloaded automatically the first time you run it if you use the batch file.
  6 | 
  7 | ## Key Features
  8 | 
  9 | - **Document Chunking:** Divides large documents into manageable chunks, intelligently identifying breaks based on chapters, headings, paragraphs, or sentences.
 10 | - **Automatic Template Selection:** Adapts to the loaded model, ensuring the correct instruction template is used.
 11 | - **Real-time Monitoring:** Provides continuous feedback on the generation process, allowing users to track progress.
 12 | - **Compatible with multiple document formats, including PDF and HTML**
 13 | - **Multiple Processing Modes:**
 14 |   - **Summary:** Generates concise summaries of the content.
 15 |   - **Translation:** Translates text into a language you can specify (default is English).
 16 |   - **Distillation:** Rewrites content for conciseness while retaining key information.
 17 |   - **Correction:** Fixes grammar, spelling, and style issues.
 18 | - **File Output Support:** Saves results to specified output files.
 19 | 
 20 | ## Requirements
 21 | 
 22 | - Python 3.8 or later
 23 | - KoboldCpp executable in the script directory
 24 | - Essential Python packages:
 25 |   - `requests`
 26 |   - `extractous` (for text extraction)
 27 |   - `PyQt6` (for GUI)
 28 | 
 29 | ## Installation
 30 | 
 31 | #### Windows Installation:
 32 | 
 33 | 1. Clone the repository or download the ZIP file from GitHub.
 34 | 2. Install Python 3.8 or later if not already present.
 35 | 3. Download KoboldCPP.exe from the [KoboldCPP releases](https://github.com/LostRuins/koboldcpp/releases) page and place it in the project folder.
 36 | 4. Run `chunkify-run.bat`. This script will install necessary dependencies and download the Aya Expanse 8b Q6_K model.
 37 | 5. Upon completion, you should see a message: "Please connect to custom endpoint at http://localhost:5001".
 38 | 
 39 | #### macOS Installation:
 40 | 
 41 | 1. Follow the Windows installation steps, ensuring you use the appropriate KoboldCPP binary for macOS.
 42 | 
 43 | #### Linux Installation:
 44 | 
 45 | 1. Similar to Windows, clone the repository, install Python 3.8 or later, and download the Linux KoboldCPP binary from the releases page.
 46 | 2. Run the script using: `./chunkify-run.sh`.
 47 | 
 48 | ## Usage
 49 | 
 50 | 1. **GUI Launch:**
 51 |    - Windows: Run `chunkify-run.bat`.
 52 |    - macOS/Linux: Execute `python3 chunkify-gui.py`.
 53 | 
 54 | 2. Ensure KoboldCPP is running and displaying the message: "Please connect to custom endpoint at http://localhost:5001".
 55 | 
 56 | 3. Configure settings and API details through the GUI or a configuration JSON file.
 57 | 
 58 | 4. Click "Process" to initiate the text processing task.
 59 | 
 60 | 5. Monitor progress in the GUI's output area.
 61 | 
 62 | ## Configuration
 63 | 
 64 | Configuration can be managed through:
 65 | 
 66 | - Command-line arguments
 67 | - `chunkify_config.json` file
 68 | - GUI settings
 69 | 
 70 | ## Command-Line Usage
 71 | 
 72 | ```bash
 73 | python chunkify.py --content input.txt --task summary
 74 | ```
 75 | 
 76 | or with a configuration file:
 77 | 
 78 | ```bash
 79 | python chunkify.py --config config.json --content input.txt --task translate
 80 | ```
 81 | 
 82 | ## Output Format
 83 | 
 84 | When using the `--file` option, the script generates a Markdown-formatted output file containing:
 85 | 
 86 | - Document metadata
 87 | - Task-specific results
 88 | 
 89 | The default output file is `output.txt` in the script directory, or the GUI will save files with an added '_processed' suffix.
 90 | 
 91 | 
 92 | ## Limitations
 93 | 
 94 | - Context length is model-dependent.
 95 | - Chunking and generation length are set to half the context size.
 96 | - Speed varies based on API response time.
 97 | - Consider a GPU with 8GB VRAM or a powerful CPU with 16GB RAM for optimal performance.
 98 | 
 99 | ## Contribution and License
100 | 
101 | Feel free to contribute and submit issues or pull requests. The script is licensed under the MIT license.
102 | 
103 | 


--------------------------------------------------------------------------------
/chunker_regex.py:
--------------------------------------------------------------------------------
 1 | # code source: https://gist.github.com/hanxiao/3f60354cf6dc5ac698bc9154163b4e6a
 2 | # link: https://jina.ai/tokenizer/
 3 | 
 4 | import regex as re
 5 | 
 6 | MAX_HEADING_LENGTH = 7
 7 | MAX_HEADING_CONTENT_LENGTH = 200
 8 | MAX_HEADING_UNDERLINE_LENGTH = 200
 9 | MAX_HTML_HEADING_ATTRIBUTES_LENGTH = 100
10 | MAX_LIST_ITEM_LENGTH = 200
11 | MAX_NESTED_LIST_ITEMS = 6
12 | MAX_LIST_INDENT_SPACES = 7
13 | MAX_BLOCKQUOTE_LINE_LENGTH = 200
14 | MAX_BLOCKQUOTE_LINES = 15
15 | MAX_CODE_BLOCK_LENGTH = 1500
16 | MAX_CODE_LANGUAGE_LENGTH = 20
17 | MAX_INDENTED_CODE_LINES = 20
18 | MAX_TABLE_CELL_LENGTH = 200
19 | MAX_TABLE_ROWS = 20
20 | MAX_HTML_TABLE_LENGTH = 2000
21 | MIN_HORIZONTAL_RULE_LENGTH = 3
22 | MAX_SENTENCE_LENGTH = 400
23 | MAX_QUOTED_TEXT_LENGTH = 600
24 | MAX_PARENTHETICAL_CONTENT_LENGTH = 400
25 | MAX_NESTED_PARENTHESES = 5
26 | MAX_MATH_INLINE_LENGTH = 100
27 | MAX_MATH_BLOCK_LENGTH = 500
28 | MAX_PARAGRAPH_LENGTH = 1000
29 | MAX_STANDALONE_LINE_LENGTH = 800
30 | MAX_HTML_TAG_ATTRIBUTES_LENGTH = 100
31 | MAX_HTML_TAG_CONTENT_LENGTH = 1000
32 | LOOKAHEAD_RANGE = 200; 
33 | 
34 | chunk_regex = re.compile(
35 |     r"(" +
36 |     # 1. Headings (Setext-style, Markdown, and HTML-style)
37 |     rf"(?:^(?:[#*=-]{{1,{MAX_HEADING_LENGTH}}}|\w[^\r\n]{{0,{MAX_HEADING_CONTENT_LENGTH}}}\r?\n[-=]{{2,{MAX_HEADING_UNDERLINE_LENGTH}}}|<h[1-6][^>]{{0,{MAX_HTML_HEADING_ATTRIBUTES_LENGTH}}}>)[^\r\n]{{1,{MAX_HEADING_CONTENT_LENGTH}}}(?:</h[1-6]>)?(?:\r?\n|$))" +
38 |     "|" +
39 |     # 2. Citations
40 |     rf"(?:\[[0-9]+\][^\r\n]{{1,{MAX_STANDALONE_LINE_LENGTH}}})" +
41 |     "|" +
42 |     # 3. List items (Adjusted to handle indentation correctly)
43 |     rf"(?:(?:^|\r?\n)[ \t]{{0,3}}(?:[-*+•]|\d{{1,3}}\.\w\.|\[[ xX]\])[ \t]+(?:[^\r\n]{{1,{MAX_LIST_ITEM_LENGTH}}})(?:\r?\n[ \t]{{2,}}(?:[^\r\n]{{1,{MAX_LIST_ITEM_LENGTH}}}))*)" +
44 |     "|" +
45 |     # 4. Block quotes (Handles nested quotes without chunking)
46 |     rf"(?:(?:^>(?:>|\\s{{2,}}){{0,2}}(?:[^\r\n]{{0,{MAX_BLOCKQUOTE_LINE_LENGTH}}})(?:\r?\n[ \t]+[^\r\n]{{0,{MAX_BLOCKQUOTE_LINE_LENGTH}}})*?\r?\n?))" +
47 |     "|" +
48 |     # 5. Code blocks
49 |     rf"(?:(?:^|\r?\n)(?:```|~~~)(?:\w{{0,{MAX_CODE_LANGUAGE_LENGTH}}})?\r?\n[\s\S]{{0,{MAX_CODE_BLOCK_LENGTH}}}?(?:```|~~~)\r?\n?)" +
50 |     rf"|(?:(?:^|\r?\n)(?: {{4}}|\t)[^\r\n]{{0,{MAX_LIST_ITEM_LENGTH}}}(?:\r?\n(?: {{4}}|\t)[^\r\n]{{0,{MAX_LIST_ITEM_LENGTH}}}){{0,{MAX_INDENTED_CODE_LINES}}}\r?\n?)" +
51 |     rf"|(?:<pre>(?:<code>)[\s\S]{{0,{MAX_CODE_BLOCK_LENGTH}}}?(?:</code>)?</pre>)" +
52 |     "|" +
53 |     # 6. Tables
54 |     rf"(?:(?:^|\r?\n)\|[^\r\n]{{0,{MAX_TABLE_CELL_LENGTH}}}\|(?:\r?\n\|[-:]{{1,{MAX_TABLE_CELL_LENGTH}}}\|)?(?:\r?\n\|[^\r\n]{{0,{MAX_TABLE_CELL_LENGTH}}}\|){{0,{MAX_TABLE_ROWS}}})" +
55 |     rf"|<table>[\s\S]{{0,{MAX_HTML_TABLE_LENGTH}}}?</table>" +
56 |     "|" +
57 |     # 7. Horizontal rules
58 |     rf"(?:^(?:[-*_]){{{MIN_HORIZONTAL_RULE_LENGTH},}}\s*$|<hr\s*/?>)" +
59 |     "|" +
60 |     # 8. Standalone lines or phrases (Prevent chunking by treating indented lines as part of the same block)
61 |     rf"(?:^(?:<[a-zA-Z][^>]{{0,{MAX_HTML_TAG_ATTRIBUTES_LENGTH}}}>[^\r\n]{{1,{MAX_STANDALONE_LINE_LENGTH}}}(?:[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}})?(?:</[a-zA-Z]+>)?(?:\r?\n|$))" +
62 |     rf"(?:\r?\n[ \t]+[^\r\n]*)*)" +
63 |     "|" +
64 |     # 9. Sentences (Allow sentences to include multiple lines if they are indented)
65 |     rf"(?:[^\r\n]{{1,{MAX_SENTENCE_LENGTH}}}(?:[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}})?(?=\s|$)(?:\r?\n[ \t]+[^\r\n]*)*)" +
66 |     "|" +
67 |     # 10. Quoted text, parentheticals, or bracketed content
68 |     rf"(?<!\w)\"\"\"[^\"]{{0,{MAX_QUOTED_TEXT_LENGTH}}}\"\"\"(?!\w)" +
69 |     rf"|(?<!\w)(?:['\"\`])[^\r\n]{{0,{MAX_QUOTED_TEXT_LENGTH}}}\g<1>(?!\w)" +
70 |     rf"|\([^\r\n()]{0,{MAX_PARENTHETICAL_CONTENT_LENGTH}}(?:\([^\r\n()]{0,{MAX_PARENTHETICAL_CONTENT_LENGTH}}\)[^\r\n()]{0,{MAX_PARENTHETICAL_CONTENT_LENGTH}}){{0,{MAX_NESTED_PARENTHESES}}}\)" +
71 |     rf"|\[[^\r\n\[\]]{{0,{MAX_PARENTHETICAL_CONTENT_LENGTH}}}(?:\[[^\r\n\[\]]{{0,{MAX_PARENTHETICAL_CONTENT_LENGTH}}}\][^\r\n\[\]]{{0,{MAX_PARENTHETICAL_CONTENT_LENGTH}}}){{0,{MAX_NESTED_PARENTHESES}}}\]" +
72 |     rf"|\$[^\r\n$]{{0,{MAX_MATH_INLINE_LENGTH}}}\$" +
73 |     rf"|`[^\r\n`]{{0,{MAX_MATH_INLINE_LENGTH}}}`" +
74 |     "|" +
75 |    
76 |     # 11. Paragraphs (Treats indented lines as part of the same paragraph)
77 |     rf"(?:(?:^|\r?\n\r?\n)(?:<p>)?(?:(?:[^\r\n]{{1,{MAX_PARAGRAPH_LENGTH}}}(?:[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}})?(?=\s|$))|(?:[^\r\n]{{1,{MAX_PARAGRAPH_LENGTH}}}(?=[\r\n]|$))|(?:[^\r\n]{{1,{MAX_PARAGRAPH_LENGTH}}}(?=[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}}])(?:.{{1,{LOOKAHEAD_RANGE}}}(?:[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}}])(?=\s|$))?))(?:</p>)?(?:\r?\n[ \t]+[^\r\n]*)*)" +
78 |     "|" +
79 |     
80 |     # 12. HTML-like tags and their content
81 |     rf"(?:<[a-zA-Z][^>]{{0,{MAX_HTML_TAG_ATTRIBUTES_LENGTH}}}(?:>[\s\S]{{0,{MAX_HTML_TAG_CONTENT_LENGTH}}}</[a-zA-Z]+>|\s*/>))" +
82 |     "|" +
83 |     # 13. LaTeX-style math expressions
84 |     rf"(?:(?:\$\$[\s\S]{{0,{MAX_MATH_BLOCK_LENGTH}}}?\$\$)|(?:\$[^\$\r\n]{{0,{MAX_MATH_INLINE_LENGTH}}}\$))" +
85 |     "|" +
86 |     # 14. Fallback for any remaining content (Keep content together if it's indented)
87 |     rf"(?:(?:[^\r\n]{{1,{MAX_STANDALONE_LINE_LENGTH}}}(?:[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}})?(?=\s|$))|(?:[^\r\n]{{1,{MAX_STANDALONE_LINE_LENGTH}}}(?=[\r\n]|$))|(?:[^\r\n]{{1,{MAX_STANDALONE_LINE_LENGTH}}}(?=[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}}])(?:.{{1,{LOOKAHEAD_RANGE}}}(?:[.!?…]|\.\.\.|[\u2026\u2047-\u2049]|\p{{Emoji_Presentation}}\p{{Extended_Pictographic}}])(?=\s|$))(?:\r?\n[ \t]+[^\r\n]*)?))" +
88 |     r")",
89 |     re.MULTILINE | re.UNICODE
90 | )
91 | 
92 | 
93 | 
94 | 


--------------------------------------------------------------------------------
/chunkify-gui.py:
--------------------------------------------------------------------------------
  1 | import sys
  2 | import os
  3 | import asyncio
  4 | from pathlib import Path
  5 | from PyQt6.QtWidgets import (
  6 |     QApplication, QMainWindow, QWidget, QVBoxLayout, QHBoxLayout,
  7 |     QPushButton, QRadioButton, QLabel, QTextEdit, QFileDialog,
  8 |     QDialog, QFormLayout, QLineEdit, QSpinBox, QDialogButtonBox,
  9 |     QGroupBox, QScrollArea, QMessageBox, QProgressBar
 10 | )
 11 | from PyQt6.QtCore import Qt, QObject, pyqtSignal, pyqtSlot, QThread
 12 | 
 13 | import chunkify
 14 | 
 15 | 
 16 | class OutputRedirector(QObject):
 17 |     """ Redirects print output to a text widget """
 18 |     text_output = pyqtSignal(str)
 19 | 
 20 |     def __init__(self):
 21 |         super().__init__()
 22 |         self.buffer = ""
 23 | 
 24 |     def write(self, text):
 25 |         self.buffer += text
 26 |         if '\n' in text:
 27 |             self.text_output.emit(self.buffer)
 28 |             self.buffer = ""
 29 |         return len(text)
 30 | 
 31 |     def flush(self):
 32 |         if self.buffer:
 33 |             self.text_output.emit(self.buffer)
 34 |             self.buffer = ""
 35 | 
 36 | 
 37 | class WorkerThread(QThread):
 38 |     """ Thread for running the text processing operations """
 39 |     finished = pyqtSignal(int)
 40 |     progress = pyqtSignal(int, int)  # (current_chunk, total_chunks)
 41 | 
 42 |     def __init__(self, api_url, input_path, task, output_path, language, max_chunk_size, api_password):
 43 |         super().__init__()
 44 |         self.api_url = api_url
 45 |         self.input_path = input_path
 46 |         self.task = task
 47 |         self.output_path = output_path
 48 |         self.language = language
 49 |         self.max_chunk_size = max_chunk_size
 50 |         self.api_password = api_password
 51 | 
 52 |     def run(self):
 53 |         try:
 54 |             loop = asyncio.new_event_loop()
 55 |             asyncio.set_event_loop(loop)
 56 |             exit_code = loop.run_until_complete(chunkify.process_file(
 57 |                 api_url=self.api_url,
 58 |                 input_path=self.input_path,
 59 |                 task=self.task,
 60 |                 output_path=self.output_path,
 61 |                 language=self.language,
 62 |                 max_chunk_size=self.max_chunk_size,
 63 |                 api_password=self.api_password
 64 |             ))
 65 |             self.finished.emit(exit_code)
 66 |         except Exception as e:
 67 |             print(f"Error in worker thread: {str(e)}")
 68 |             self.finished.emit(1)
 69 | 
 70 | 
 71 | class ConfigDialog(QDialog):
 72 |     """ Dialog for configuring the API settings """
 73 |     
 74 |     def __init__(self, parent=None, settings=None):
 75 |         super().__init__(parent)
 76 |         
 77 |         self.settings = settings or {
 78 |             "api_url": "http://localhost:5001",
 79 |             "api_password": "",
 80 |             "language": "English",
 81 |             "max_chunk_size": 4096
 82 |         }
 83 |         
 84 |         self.setWindowTitle("Configuration")
 85 |         self.resize(400, 200)
 86 |         
 87 |         # Create form layout for settings
 88 |         layout = QFormLayout(self)
 89 |         
 90 |         # API URL
 91 |         self.api_url_edit = QLineEdit(self.settings["api_url"])
 92 |         layout.addRow("API URL:", self.api_url_edit)
 93 |         
 94 |         # API Password/Key
 95 |         self.api_password_edit = QLineEdit(self.settings["api_password"])
 96 |         layout.addRow("API Key:", self.api_password_edit)
 97 |         
 98 |         # Default language
 99 |         self.language_edit = QLineEdit(self.settings["language"])
100 |         layout.addRow("Default Language:", self.language_edit)
101 |         
102 |         # Max chunk size
103 |         self.chunk_size_spin = QSpinBox()
104 |         self.chunk_size_spin.setRange(256, 8192)
105 |         self.chunk_size_spin.setSingleStep(128)
106 |         self.chunk_size_spin.setValue(self.settings["max_chunk_size"])
107 |         layout.addRow("Max Chunk Size:", self.chunk_size_spin)
108 |         
109 |         # Dialog buttons
110 |         self.button_box = QDialogButtonBox(QDialogButtonBox.StandardButton.Ok | 
111 |                                           QDialogButtonBox.StandardButton.Cancel)
112 |         self.button_box.accepted.connect(self.accept)
113 |         self.button_box.rejected.connect(self.reject)
114 |         layout.addRow(self.button_box)
115 |     
116 |     def get_settings(self):
117 |         """ Return the current settings """
118 |         return {
119 |             "api_url": self.api_url_edit.text(),
120 |             "api_password": self.api_password_edit.text(),
121 |             "language": self.language_edit.text(),
122 |             "max_chunk_size": self.chunk_size_spin.value()
123 |         }
124 | 
125 | 
126 | class ChunkifyGUI(QMainWindow):
127 |     """ Main application window for Chunkify GUI """
128 |     
129 |     def __init__(self):
130 |         super().__init__()
131 |         
132 |         # Initialize settings
133 |         self.settings = {
134 |             "api_url": "http://localhost:5001",
135 |             "api_password": "",
136 |             "language": "English",
137 |             "max_chunk_size": 4096
138 |         }
139 |         
140 |         self.input_files = []
141 |         self.selected_task = "summary"
142 |         
143 |         # Main window setup
144 |         self.setWindowTitle("Chunkify Text Processor")
145 |         self.resize(800, 600)
146 |         
147 |         # Central widget and main layout
148 |         central_widget = QWidget()
149 |         self.setCentralWidget(central_widget)
150 |         main_layout = QVBoxLayout(central_widget)
151 |         
152 |         # Input file selection area
153 |         file_layout = QHBoxLayout()
154 |         self.file_label = QLabel("No files selected")
155 |         file_layout.addWidget(self.file_label)
156 |         
157 |         self.choose_file_btn = QPushButton("Choose Files")
158 |         self.choose_file_btn.clicked.connect(self.select_input_files)
159 |         file_layout.addWidget(self.choose_file_btn)
160 |         
161 |         main_layout.addLayout(file_layout)
162 |         
163 |         # Task selection group
164 |         task_group = QGroupBox("Task")
165 |         task_layout = QVBoxLayout(task_group)
166 |         
167 |         self.task_buttons = {}
168 |         for task in ["summary", "translate", "correct", "distill"]:
169 |             self.task_buttons[task] = QRadioButton(task.capitalize())
170 |             self.task_buttons[task].clicked.connect(self.update_selected_task)
171 |             task_layout.addWidget(self.task_buttons[task])
172 |         
173 |         # Set default task
174 |         self.task_buttons["summary"].setChecked(True)
175 |         
176 |         main_layout.addWidget(task_group)
177 |         
178 |         # Control buttons
179 |         controls_layout = QHBoxLayout()
180 |         
181 |         self.config_btn = QPushButton("Configuration")
182 |         self.config_btn.clicked.connect(self.open_config_dialog)
183 |         controls_layout.addWidget(self.config_btn)
184 |         
185 |         self.process_btn = QPushButton("Process Files")
186 |         self.process_btn.clicked.connect(self.process_files)
187 |         self.process_btn.setEnabled(False)  # Disabled until files are selected
188 |         controls_layout.addWidget(self.process_btn)
189 |         
190 |         main_layout.addLayout(controls_layout)
191 |         
192 |         # Progress bar
193 |         self.progress_bar = QProgressBar()
194 |         self.progress_bar.setRange(0, 100)
195 |         self.progress_bar.setValue(0)
196 |         self.progress_bar.setVisible(False)
197 |         main_layout.addWidget(self.progress_bar)
198 |         
199 |         # Output window
200 |         output_group = QGroupBox("Output")
201 |         output_layout = QVBoxLayout(output_group)
202 |         
203 |         self.output_text = QTextEdit()
204 |         self.output_text.setReadOnly(True)
205 |         output_layout.addWidget(self.output_text)
206 |         
207 |         # Create a scroll area for the output
208 |         scroll_area = QScrollArea()
209 |         scroll_area.setWidget(output_group)
210 |         scroll_area.setWidgetResizable(True)
211 |         
212 |         main_layout.addWidget(scroll_area)
213 |         
214 |         # Setup output redirection
215 |         self.redirector = OutputRedirector()
216 |         self.redirector.text_output.connect(self.update_output)
217 |         sys.stdout = self.redirector
218 |     
219 |     def update_selected_task(self):
220 |         """ Update the selected task based on radio button selection """
221 |         for task, button in self.task_buttons.items():
222 |             if button.isChecked():
223 |                 self.selected_task = task
224 |                 break
225 |     
226 |     def select_input_files(self):
227 |         """ Open file dialog to select input files """
228 |         files, _ = QFileDialog.getOpenFileNames(
229 |             self,
230 |             "Select Input Files",
231 |             "",
232 |             "All Files (*.*)"
233 |         )
234 |         
235 |         if files:
236 |             self.input_files = files
237 |             if len(files) == 1:
238 |                 self.file_label.setText(Path(files[0]).name)
239 |             else:
240 |                 self.file_label.setText(f"{len(files)} files selected")
241 |             
242 |             self.process_btn.setEnabled(True)
243 |     
244 |     def open_config_dialog(self):
245 |         """ Open the configuration dialog """
246 |         dialog = ConfigDialog(self, self.settings)
247 |         if dialog.exec():
248 |             self.settings = dialog.get_settings()
249 |     
250 |     @pyqtSlot(str)
251 |     def update_output(self, text):
252 |         """ Update the output text widget """
253 |         self.output_text.append(text)
254 |         # Scroll to the bottom
255 |         cursor = self.output_text.textCursor()
256 |         cursor.movePosition(cursor.MoveOperation.End)
257 |         self.output_text.setTextCursor(cursor)
258 |     
259 |     def process_files(self):
260 |         """ Start processing the selected files """
261 |         if not self.input_files:
262 |             QMessageBox.warning(self, "Warning", "No input files selected")
263 |             return
264 |         
265 |         # Clear output
266 |         self.output_text.clear()
267 |         
268 |         # Disable the process button during processing
269 |         self.process_btn.setEnabled(False)
270 |         self.choose_file_btn.setEnabled(False)
271 |         self.config_btn.setEnabled(False)
272 |         
273 |         # Set up progress tracking
274 |         self.progress_bar.setValue(0)
275 |         self.progress_bar.setVisible(True)
276 |         
277 |         # Process each file
278 |         self.current_file_index = 0
279 |         self.process_next_file()
280 |     
281 |     def process_next_file(self):
282 |         """ Process the next file in the queue """
283 |         if self.current_file_index >= len(self.input_files):
284 |             # All files processed
285 |             self.process_btn.setEnabled(True)
286 |             self.choose_file_btn.setEnabled(True)
287 |             self.config_btn.setEnabled(True)
288 |             self.progress_bar.setVisible(False)
289 |             return
290 |         
291 |         input_path = self.input_files[self.current_file_index]
292 |         input_stem = Path(input_path).stem
293 |         output_path = f"{input_stem}_{self.selected_task}.txt"
294 |         
295 |         # Update progress
296 |         progress_value = int((self.current_file_index / len(self.input_files)) * 100)
297 |         self.progress_bar.setValue(progress_value)
298 |         
299 |         # Add file header to output
300 |         self.update_output(f"\n\n--- Processing {Path(input_path).name} ({self.current_file_index + 1}/{len(self.input_files)}) ---\n")
301 |         
302 |         # Start processing in a separate thread
303 |         self.worker = WorkerThread(
304 |             api_url=self.settings["api_url"],
305 |             input_path=input_path,
306 |             task=self.selected_task,
307 |             output_path=output_path,
308 |             language=self.settings["language"],
309 |             max_chunk_size=self.settings["max_chunk_size"],
310 |             api_password=self.settings["api_password"]
311 |         )
312 |         
313 |         self.worker.finished.connect(self.on_file_processed)
314 |         self.worker.start()
315 |     
316 |     def on_file_processed(self, exit_code):
317 |         """ Handle completion of file processing """
318 |         if exit_code != 0:
319 |             self.update_output(f"\nError processing file {Path(self.input_files[self.current_file_index]).name}")
320 |         
321 |         # Move to next file
322 |         self.current_file_index += 1
323 |         self.process_next_file()
324 |     
325 |     def closeEvent(self, event):
326 |         """ Handle window close event """
327 |         # Restore stdout
328 |         sys.stdout = sys.__stdout__
329 |         event.accept()
330 | 
331 | 
332 | if __name__ == "__main__":
333 |     app = QApplication(sys.argv)
334 |     window = ChunkifyGUI()
335 |     window.show()
336 |     sys.exit(app.exec())
337 | 


--------------------------------------------------------------------------------
/chunkify-no-kobold.bat:
--------------------------------------------------------------------------------
 1 | @echo off
 2 | setlocal enabledelayedexpansion
 3 | 
 4 | REM Set the name of your virtual environment
 5 | set "VENV_NAME=chunkify_env"
 6 | 
 7 | REM Set the path to your Python installation (update this if needed)
 8 | set "PYTHON_PATH=python"
 9 | 
10 | REM Check if Python is installed and in PATH
11 | %PYTHON_PATH% --version >nul 2>&1
12 | if errorlevel 1 (
13 |     echo Python is not found. Please ensure Python is installed and added to your PATH.
14 |     pause
15 |     exit /b 1
16 | )
17 | 
18 | 
19 | REM Check if the virtual environment exists, create if it doesn't
20 | if not exist "%VENV_NAME%\Scripts\activate.bat" (
21 |     echo Creating new virtual environment: %VENV_NAME%
22 |     %PYTHON_PATH% -m venv %VENV_NAME%
23 |     if errorlevel 1 (
24 |         echo Failed to create virtual environment. Please check your Python installation.
25 |         pause
26 |         exit /b 1
27 |     )
28 | ) else (
29 |     echo Virtual environment %VENV_NAME% already exists.
30 | )
31 | 
32 | REM Activate the virtual environment
33 | call "%VENV_NAME%\Scripts\activate.bat"
34 | 
35 | REM Check if requirements.txt exists
36 | if not exist requirements.txt (
37 |     echo requirements.txt not found. Please create a requirements.txt file in the same directory as this script.
38 |     pause
39 |     exit /b 1
40 | )
41 | 
42 | REM Upgrade pip to the latest version
43 | python -m pip install --upgrade pip
44 | 
45 | REM Install packages from requirements.txt
46 | echo Installing packages from requirements.txt...
47 | pip install -r requirements.txt
48 | if errorlevel 1 (
49 |     echo Failed to install some packages. Please check your internet connection and requirements.txt file.
50 |     pause
51 |     exit /b 1
52 | )
53 | 
54 | 
55 | 
56 | 
57 | python chunkify-gui.py
58 | pause
59 | 
60 | REM Deactivate the virtual environment
61 | deactivate
62 | 
63 | pause


--------------------------------------------------------------------------------
/chunkify-run.bat:
--------------------------------------------------------------------------------
 1 | @echo off
 2 | setlocal enabledelayedexpansion
 3 | 
 4 | REM Set the name of your virtual environment
 5 | set "VENV_NAME=chunkify_env"
 6 | 
 7 | REM Set the path to your Python installation (update this if needed)
 8 | set "PYTHON_PATH=python"
 9 | 
10 | REM Check if Python is installed and in PATH
11 | %PYTHON_PATH% --version >nul 2>&1
12 | if errorlevel 1 (
13 |     echo Python is not found. Please ensure Python is installed and added to your PATH.
14 |     pause
15 |     exit /b 1
16 | )
17 | 
18 | 
19 | REM Check if the virtual environment exists, create if it doesn't
20 | if not exist "%VENV_NAME%\Scripts\activate.bat" (
21 |     echo Creating new virtual environment: %VENV_NAME%
22 |     %PYTHON_PATH% -m venv %VENV_NAME%
23 |     if errorlevel 1 (
24 |         echo Failed to create virtual environment. Please check your Python installation.
25 |         pause
26 |         exit /b 1
27 |     )
28 | ) else (
29 |     echo Virtual environment %VENV_NAME% already exists.
30 | )
31 | 
32 | REM Activate the virtual environment
33 | call "%VENV_NAME%\Scripts\activate.bat"
34 | 
35 | REM Check if requirements.txt exists
36 | if not exist requirements.txt (
37 |     echo requirements.txt not found. Please create a requirements.txt file in the same directory as this script.
38 |     pause
39 |     exit /b 1
40 | )
41 | 
42 | REM Upgrade pip to the latest version
43 | python -m pip install --upgrade pip
44 | 
45 | REM Install packages from requirements.txt
46 | echo Installing packages from requirements.txt...
47 | pip install -r requirements.txt
48 | if errorlevel 1 (
49 |     echo Failed to install some packages. Please check your internet connection and requirements.txt file.
50 |     pause
51 |     exit /b 1
52 | )
53 | 
54 | 
55 | REM Check if koboldcpp.exe exists, if not, check for koboldcpp_cu12.exe
56 | if exist koboldcpp.exe (
57 |     set "KOBOLD_EXE=koboldcpp.exe"
58 | ) else if exist koboldcpp_cu12.exe (
59 |     set "KOBOLD_EXE=koboldcpp_cu12.exe"
60 | ) else (
61 |     echo Neither koboldcpp.exe nor koboldcpp_cu12.exe found. Please ensure one of these files exists.
62 |     pause
63 |     exit /b 1
64 | )
65 | 
66 | REM Launch your Python script
67 | start %KOBOLD_EXE% --config chunkify.kcppt
68 | 
69 | python chunkify-gui.py
70 | pause
71 | 
72 | REM Deactivate the virtual environment
73 | deactivate
74 | 
75 | pause


--------------------------------------------------------------------------------
/chunkify-run.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Set the name of your virtual environment
 4 | VENV_NAME="chunkify_env"
 5 | 
 6 | # Function to check if a command exists
 7 | command_exists() {
 8 |     command -v "$1" >/dev/null 2>&1
 9 | }
10 | 
11 | # Check if Python is installed
12 | if ! command_exists python3; then
13 |     echo "Python 3 is not found. Please ensure Python 3 is installed and added to your PATH."
14 |     exit 1
15 | fi
16 | 
17 | # Check if the virtual environment exists, create if it doesn't
18 | if [ ! -d "$VENV_NAME" ]; then
19 |     echo "Creating new virtual environment: $VENV_NAME"
20 |     python3 -m venv "$VENV_NAME"
21 |     if [ $? -ne 0 ]; then
22 |         echo "Failed to create virtual environment. Please check your Python installation."
23 |         exit 1
24 |     fi
25 | else
26 |     echo "Virtual environment $VENV_NAME already exists."
27 | fi
28 | 
29 | # Activate the virtual environment
30 | source "$VENV_NAME/bin/activate"
31 | 
32 | # Check if requirements.txt exists
33 | if [ ! -f "requirements.txt" ]; then
34 |     echo "requirements.txt not found. Please create a requirements.txt file in the same directory as this script."
35 |     exit 1
36 | fi
37 | 
38 | # Upgrade pip to the latest version
39 | python3 -m pip install --upgrade pip
40 | 
41 | # Install packages from requirements.txt
42 | echo "Installing packages from requirements.txt..."
43 | pip install -r requirements.txt
44 | if [ $? -ne 0 ]; then
45 |     echo "Failed to install some packages. Please check your internet connection and requirements.txt file."
46 |     exit 1
47 | fi
48 | 
49 | # Determine the correct KoboldCPP binary based on the system
50 | if [[ "$(uname)" == "Darwin" ]]; then
51 |     if [[ "$(uname -m)" == "arm64" ]]; then
52 |         KOBOLDCPP_BINARY="./koboldcpp-mac-arm64"
53 |     else
54 |         KOBOLDCPP_BINARY="./koboldcpp-mac-x64"
55 |     fi
56 | elif [[ "$(uname)" == "Linux" ]]; then
57 |     KOBOLDCPP_BINARY="./koboldcpp-linux-x64"
58 | else
59 |     echo "Unsupported operating system. Please run on macOS or Linux."
60 |     exit 1
61 | fi
62 | 
63 | # Check if the KoboldCPP binary exists and is executable
64 | if [ ! -x "$KOBOLDCPP_BINARY" ]; then
65 |     echo "KoboldCPP binary not found or not executable. Please check the binary and its permissions."
66 |     exit 1
67 | fi
68 | 
69 | # Launch KoboldCPP
70 | "$KOBOLDCPP_BINARY" --config chunkify.kcppt &
71 | 
72 | # Wait for KoboldCPP to start
73 | echo "Waiting for KoboldCPP to start..."
74 | while ! nc -z localhost 5001; do   
75 |   sleep 1
76 | done
77 | 
78 | # Launch the Python GUI script
79 | python3 chunkify-gui.py
80 | 
81 | # Deactivate the virtual environment when the GUI is closed
82 | deactivate
83 | 
84 | # Wait for user input before closing
85 | read -p "Press Enter to exit..."
86 | 


--------------------------------------------------------------------------------
/chunkify.kcppt:
--------------------------------------------------------------------------------
  1 | {
  2 |     "benchmark": null,
  3 |     "blasbatchsize": 512,
  4 |     "blasthreads": null,
  5 |     "config": null,
  6 |     "contextsize": 8192,
  7 |     "debugmode": 0,
  8 |     "flashattention": true,
  9 |     "forceversion": 0,
 10 |     "foreground": false,
 11 |     "gpulayers": -1,
 12 |     "highpriority": false,
 13 |     "hordeconfig": null,
 14 |     "hordegenlen": 0,
 15 |     "hordekey": "",
 16 |     "hordemaxctx": 0,
 17 |     "hordemodelname": "",
 18 |     "hordeworkername": "",
 19 |     "host": "",
 20 |     "ignoremissing": false,
 21 |     "istemplate": true,
 22 |     "launch": false,
 23 |     "lora": null,
 24 |     "mmproj": "",
 25 |     "model": "",
 26 |     "model_param": "https://huggingface.co/bartowski/aya-expanse-8b-GGUF/resolve/main/aya-expanse-8b-Q6_K.gguf",
 27 |     "multiuser": 8,
 28 |     "noavx2": false,
 29 |     "noblas": false,
 30 |     "nocertify": false,
 31 |     "nommap": false,
 32 |     "noshift": true,
 33 |     "onready": "",
 34 |     "password": null,
 35 |     "port": 5001,
 36 |     "port_param": 5001,
 37 |     "preloadstory": {
 38 |         "actions": [
 39 |         ],
 40 |         "actions_metadata": {
 41 |         },
 42 |         "anotestr": 320,
 43 |         "anotetemplate": "[Author's note: <|>]",
 44 |         "authorsnote": "",
 45 |         "completed_imgs_meta": {
 46 |         },
 47 |         "extrastopseq": "",
 48 |         "gamestarted": true,
 49 |         "logitbiasdict": {
 50 |         },
 51 |         "memory": "",
 52 |         "personal_notes": "",
 53 |         "placeholder_tags_data": [
 54 |         ],
 55 |         "prompt": "",
 56 |         "regexreplace_data": [
 57 |         ],
 58 |         "savedsettings": {
 59 |             "adventure_context_mod": true,
 60 |             "adventure_is_action": false,
 61 |             "chat_context_mod": true,
 62 |             "chatname": "User",
 63 |             "chatopponent": "KoboldAI",
 64 |             "compressnewlines": false,
 65 |             "eos_ban_mode": "0",
 66 |             "gui_type_instruct": "3",
 67 |             "instruct_endtag": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n",
 68 |             "instruct_has_markdown": true,
 69 |             "instruct_starttag": "<|eot_id|><|start_header_id|>user<|end_header_id|>\\n\\n",
 70 |             "instruct_sysprompt": "",
 71 |             "instruct_systag": "<|start_header_id|>system<|end_header_id|>\\n\\n",
 72 |             "opmode": "4",
 73 |             "persist_session": true,
 74 |             "placeholder_tags": true,
 75 |             "render_special_tags": false,
 76 |             "trimsentences": true,
 77 |             "trimwhitespace": false
 78 |         },
 79 |         "tokenbans": "",
 80 |         "wifolders_d": {
 81 |         },
 82 |         "wifolders_l": [
 83 |         ],
 84 |         "wiinsertlocation": 0,
 85 |         "wisearchdepth": 0,
 86 |         "worldinfo": [
 87 |         ]
 88 |     },
 89 |     "quantkv": 0,
 90 |     "quiet": false,
 91 |     "remotetunnel": false,
 92 |     "ropeconfig": [
 93 |         0.0,
 94 |         10000.0
 95 |     ],
 96 |     "sdclamped": 0,
 97 |     "sdconfig": null,
 98 |     "sdlora": "",
 99 |     "sdloramult": 1.0,
100 |     "sdmodel": "",
101 |     "sdquant": false,
102 |     "sdthreads": 0,
103 |     "sdvae": "",
104 |     "sdvaeauto": false,
105 |     "skiplauncher": true,
106 |     "smartcontext": false,
107 |     "ssl": null,
108 |     "tensor_split": null,
109 |     "threads": -1,
110 |     "unpack": "",
111 |     "useclblast": null,
112 |     "usecublas": null,
113 |     "usemlock": false,
114 |     "usevulkan": null,
115 |     "whispermodel": "",
116 | 	"nofastforward": true
117 | }


--------------------------------------------------------------------------------
/chunkify.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import asyncio
  3 | import json
  4 | import os
  5 | import random
  6 | import re
  7 | import sys
  8 | import time
  9 | from datetime import datetime
 10 | from pathlib import Path
 11 | from typing import Dict, List, Optional, Tuple, Iterator, Union, Any
 12 | import requests
 13 | from requests.exceptions import RequestException
 14 | 
 15 | from extractous import Extractor
 16 | from chunker_regex import chunk_regex
 17 | 
 18 | 
 19 | class ChunkingProcessor:
 20 |     """ Handles splitting content into manageable chunks using natural breaks """
 21 | 
 22 |     def __init__(self, api_url: str, 
 23 |                  max_chunk_length: int,
 24 |                  api_password: Optional[str] = None,
 25 |                  max_total_chunks: int = 1000):
 26 |         """ Initialize the chunking processor
 27 |         
 28 |         Args:
 29 |             api_url: URL to the KoboldAPI server
 30 |             max_chunk_length: Maximum token length for a single chunk
 31 |             api_password: Optional API password/key
 32 |             max_total_chunks: Maximum number of chunks to process
 33 |         """
 34 |         if max_chunk_length <= 0:
 35 |             raise ValueError("max_chunk_length must be positive")
 36 |             
 37 |         self.api_url = api_url
 38 |         self.max_chunk = max_chunk_length
 39 |         self.max_total_chunks = max_total_chunks
 40 |         self.api_password = api_password
 41 |         self.headers = {
 42 |             "Content-Type": "application/json",
 43 |         }
 44 |         
 45 |         if api_password:
 46 |             self.headers["Authorization"] = f"Bearer {api_password}"
 47 |             
 48 |         # Generate a unique key for this processing session
 49 |         self.genkey = self._create_genkey()
 50 |         
 51 |         # Verify API and get max context length if needed
 52 |         self.api_max_context = self._get_max_context_length()
 53 |         if self.max_chunk > self.api_max_context // 2:
 54 |             print(f"Warning: Reducing chunk size to fit model context window")
 55 |             self.max_chunk = self.api_max_context // 2
 56 |     
 57 |     def _create_genkey(self) -> str:
 58 |         """ Create a unique generation key to prevent cross-request contamination """
 59 |         return f"KCPP{''.join(str(random.randint(0, 9)) for _ in range(4))}"
 60 |         
 61 |     def _get_max_context_length(self) -> int:
 62 |         """ Get the maximum context length from the KoboldAPI """
 63 |         try:
 64 |             response = requests.get(f"{self.api_url}/api/extra/true_max_context_length")
 65 |             if response.status_code == 200:
 66 |                 max_context = int(response.json().get("value", 8192))
 67 |                 print(f"Model has maximum context length of: {max_context}")
 68 |                 return max_context
 69 |             else:
 70 |                 print(f"Warning: Could not get max context length. Defaulting to 8192")
 71 |                 return 8192
 72 |         except Exception as e:
 73 |             print(f"Error getting max context length: {str(e)}. Defaulting to 8192")
 74 |             return 8192
 75 |             
 76 |     def count_tokens(self, text: str) -> int:
 77 |         """ Count tokens in the provided text using KoboldAPI """
 78 |         try:
 79 |             payload = {"prompt": text, "genkey": self.genkey}
 80 |             response = requests.post(
 81 |                 f"{self.api_url}/api/extra/tokencount",
 82 |                 json=payload,
 83 |                 headers=self.headers
 84 |             )
 85 |             if response.status_code == 200:
 86 |                 return int(response.json().get("value", 0))
 87 |             else:
 88 |                 # Fallback estimation
 89 |                 return len(text.split()) 
 90 |         except Exception as e:
 91 |             print(f"Error counting tokens: {str(e)}. Using word count as estimate.")
 92 |             return len(text.split())
 93 |     
 94 |     def chunk_text(self, content: str) -> List[Tuple[str, int]]:
 95 |         """ Split content into chunks using natural breakpoints
 96 |         
 97 |         Args:
 98 |             content: The text content to chunk
 99 |             
100 |         Returns:
101 |             List of (chunk_text, token_count) tuples
102 |         """
103 |         if not content:
104 |             return []
105 |             
106 |         chunks = []
107 |         remaining = content
108 |         chunk_num = 0
109 |         
110 |         while remaining and chunk_num < self.max_total_chunks:
111 |             # KoboldCPP has max char limit of 50k
112 |             current_section = remaining[:45000]
113 |             remaining = remaining[45000:]
114 |             
115 |             chunk = self._get_chunk(current_section)
116 |             chunk_len = len(chunk)
117 |             
118 |             if chunk_len == 0:
119 |                 continue
120 |                 
121 |             chunk_tokens = self.count_tokens(chunk)
122 |             chunks.append((chunk, chunk_tokens))
123 |             
124 |             # Update remaining with what wasn't included in this chunk
125 |             remaining = current_section[len(chunk):].strip() + remaining
126 | 
127 |             chunk_num += 1
128 |             print(f"Created chunk {chunk_num}: {chunk_tokens} tokens")
129 |             
130 |         if remaining and chunk_num >= self.max_total_chunks:
131 |             raise ValueError(f"Text exceeded maximum of {self.max_total_chunks} chunks")
132 |             
133 |         return chunks
134 | 
135 |     def _get_chunk(self, content: str) -> str:
136 |         """ Get appropriately sized chunk using natural breaks
137 |         
138 |         Args:
139 |             content: Text content to chunk
140 |             
141 |         Returns:
142 |             A chunk of text within token limits
143 |         """
144 |         total_tokens = self.count_tokens(content)
145 |         if total_tokens < self.max_chunk:
146 |             return content
147 | 
148 |         # chunk_regex is designed to break at natural language points
149 |         # to preserve context and readability
150 |         matches = chunk_regex.finditer(content)
151 |         current_size = 0
152 |         chunks = []
153 |         
154 |         for match in matches:
155 |             chunk = match.group(0)
156 |             chunk_size = self.count_tokens(chunk)
157 |             if current_size + chunk_size > self.max_chunk:
158 |                 if not chunks:
159 |                     chunks.append(chunk)
160 |                 break
161 |             chunks.append(chunk)
162 |             current_size += chunk_size
163 |         
164 |         return ''.join(chunks)
165 | 
166 |     def chunk_file(self, file_path) -> Tuple[List[Tuple[str, int]], Dict]:
167 |         """ Chunk text from file
168 |         
169 |         Args:
170 |             file_path: Path to text file (str or Path object)
171 |             
172 |         Returns:
173 |             Tuple of (chunks with token counts, file metadata)
174 |         """
175 |         extractor = Extractor()
176 |         extractor = extractor.set_extract_string_max_length(100000000)
177 |         
178 |         try:
179 |             content, metadata = extractor.extract_file_to_string(str(file_path))
180 |             chunks = self.chunk_text(content)
181 |             return chunks, metadata
182 |         except Exception as e:
183 |             print(f"Error extracting file: {str(e)}")
184 |             return [], {"error": str(e)}
185 | 
186 | 
187 | class SSEProcessingClient:
188 |     """ Client for processing chunks with OpenAI-compatible endpoints via SSE streaming """
189 |     
190 |     def __init__(self, api_url: str, api_password: Optional[str] = None):
191 |         """ Initialize the processing client
192 |         
193 |         Args:
194 |             api_url: URL to the OpenAI-compatible API
195 |             api_password: Optional API key/password
196 |         """
197 |         self.api_url = api_url
198 |         self.api_password = api_password
199 |         
200 |         # Ensure API URL ends with correct endpoint for OpenAI compatibility
201 |         if not self.api_url.endswith('/v1/chat/completions'):
202 |             self.api_url = f"{self.api_url.rstrip('/')}/v1/chat/completions"
203 |             
204 |         self.headers = {
205 |             "Content-Type": "application/json",
206 |             "Accept": "text/event-stream",
207 |         }
208 |         
209 |         if api_password:
210 |             self.headers["Authorization"] = f"Bearer {api_password}"
211 |     
212 |     def _create_payload(self, instruction: str, content: str, 
213 |                        max_tokens: int = 2048,
214 |                        temperature: float = 0.2, 
215 |                        top_p: float = 1.0,
216 |                        top_k: int = 0,
217 |                        rep_pen: float = 1.0,
218 |                        min_p: float = 0.05) -> Dict:
219 |         """ Create the API payload with standard parameters
220 |         
221 |         Args:
222 |             instruction: Instruction text for the model
223 |             content: Content text to process
224 |             max_tokens: Maximum tokens to generate
225 |             temperature: Temperature parameter
226 |             top_p: Top-p parameter
227 |             top_k: Top-k parameter
228 |             rep_pen: Repetition penalty
229 |             min_p: Minimum p parameter
230 |             
231 |         Returns:
232 |             Dictionary payload for the API
233 |         """
234 |         system_content = "You are a helpful assistant."
235 |         
236 |         combined_content = f"<START_TEXT>{content}<END_TEXT>\n{instruction}"
237 |         
238 |         return {
239 |             "messages": [
240 |                 {"role": "system", "content": system_content},
241 |                 {"role": "user", "content": combined_content}
242 |             ],
243 |             "max_tokens": max_tokens,
244 |             "temperature": temperature,
245 |             "top_p": top_p,
246 |             "top_k": top_k,
247 |             "repetition_penalty": rep_pen,
248 |             "min_p": min_p,
249 |             "stream": True
250 |         }
251 |     
252 |     async def process_chunk(self, instruction: str, content: str,
253 |                          max_tokens: int = 2048,
254 |                          temperature: float = 0.2) -> str:
255 |         """ Process a single chunk with streaming output
256 |         
257 |         Args:
258 |             instruction: Instruction for processing
259 |             content: Text content to process
260 |             max_tokens: Maximum tokens to generate
261 |             temperature: Generation temperature
262 |             
263 |         Returns:
264 |             Generated text
265 |         """
266 |         payload = self._create_payload(
267 |             instruction=instruction,
268 |             content=content,
269 |             max_tokens=max_tokens,
270 |             temperature=temperature
271 |         )
272 |         
273 |         result = []
274 |         partial_line = ""
275 |         
276 |         try:
277 |             response = requests.post(
278 |                 self.api_url,
279 |                 json=payload,
280 |                 headers=self.headers,
281 |                 stream=True
282 |             )
283 |             
284 |             for line in response.iter_lines():
285 |                 if not line:
286 |                     continue
287 |                     
288 |                 # Remove the "data: " prefix and decode
289 |                 line_text = line.decode('utf-8')
290 |                 
291 |                 if line_text.startswith('data: '):
292 |                     line_text = line_text[6:]
293 |                 
294 |                 # Handle the "[DONE]" message
295 |                 if line_text == '[DONE]':
296 |                     break
297 |                     
298 |                 # Parse the JSON from the SSE stream
299 |                 try:
300 |                     data = json.loads(line_text)
301 |                     
302 |                     # Extract the delta content from the received data
303 |                     if 'choices' in data and len(data['choices']) > 0:
304 |                         if 'delta' in data['choices'][0]:
305 |                             if 'content' in data['choices'][0]['delta']:
306 |                                 token = data['choices'][0]['delta']['content']
307 |                                 # Handle any newlines in the token
308 |                                 if token.endswith('\n'):
309 |                                     partial_line += token[:-1]
310 |                                     print(partial_line)
311 |                                     partial_line = ""
312 |                                 elif '\n' in token:
313 |                                     parts = token.split('\n')
314 |                                     for i, part in enumerate(parts):
315 |                                         if i < len(parts) - 1:
316 |                                             print(partial_line + part)
317 |                                             partial_line = ""
318 |                                         else:
319 |                                             partial_line += part
320 |                                 else:
321 |                                     partial_line += token
322 |                                 
323 |                                 result.append(token)
324 |                 except json.JSONDecodeError:
325 |                     continue
326 |                 
327 |             if partial_line:
328 |                 print(partial_line)
329 |                 
330 |             return ''.join(result)
331 |                 
332 |         except Exception as e:
333 |             print(f"Error in API call: {str(e)}")
334 |             return ""
335 | 
336 | 
337 | class TextProcessor:
338 |     """ Handles processing of text documents with KoboldAPI """
339 |     
340 |     def __init__(self, api_url: str, 
341 |                  max_chunk_size: int = 4096,
342 |                  api_password: Optional[str] = None):
343 |         """ Initialize the text processor
344 |         
345 |         Args:
346 |             api_url: URL to the KoboldAPI server
347 |             max_chunk_size: Maximum chunk size in tokens
348 |             api_password: Optional API key/password
349 |         """
350 |         self.api_url = api_url
351 |         self.api_password = api_password
352 |         self.task_configs = {
353 |             'translate': {
354 |                 'chunk_ratio': 0.45,
355 |                 'instruction_template': (
356 |                     "Translate the text into {language}. "
357 |                     "Maintain linguistic flourish and authorial style."
358 |                 )
359 |             },
360 |             'summary': {
361 |                 'chunk_ratio': 0.8,
362 |                 'instruction_template': (
363 |                     "Extract the key points and themes from the text "
364 |                     "without developing conclusions. "
365 |                     "Be thorough but concise."
366 |                 )
367 |             },
368 |             'correct': {
369 |                 'chunk_ratio': 0.45,
370 |                 'instruction_template': (
371 |                     "Correct any grammar, spelling, and style errors. "
372 |                     "Preserve the original meaning and style."
373 |                 )
374 |             },
375 |             'distill': {
376 |                 'chunk_ratio': 0.8,
377 |                 'instruction_template': (
378 |                     "Rewrite the text to be concise without losing meaning."
379 |                 )
380 |             }
381 |         }
382 |         
383 |         # Initialize the chunking processor
384 |         self.chunker = ChunkingProcessor(
385 |             api_url=api_url,
386 |             max_chunk_length=max_chunk_size,
387 |             api_password=api_password
388 |         )
389 |         
390 |         # Use SSE processing for streaming output
391 |         self.processor = SSEProcessingClient(
392 |             api_url=api_url,
393 |             api_password=api_password
394 |         )
395 |         
396 |     def _get_task_config(self, task: str, language: str = "English") -> dict:
397 |         """ Get configuration for specified task
398 |         
399 |         Args:
400 |             task: Processing task name
401 |             language: Target language for translation
402 |             
403 |         Returns:
404 |             Task configuration dictionary
405 |         """
406 |         if task not in self.task_configs:
407 |             raise ValueError(f"Unknown task: {task}")
408 |             
409 |         config = self.task_configs[task].copy()
410 |         config['chunk_size'] = int(self.chunker.api_max_context * config['chunk_ratio'])
411 |         config['instruction'] = config['instruction_template'].format(
412 |             language=language
413 |         )
414 |         return config
415 |     
416 |     async def process_text(self, task: str, file_path: Union[str, Path], 
417 |                          language: str = "English") -> Tuple[List[str], Dict]:
418 |         """ Process text document with specified task
419 |         
420 |         Args:
421 |             task: Processing task name
422 |             file_path: Path to the text file
423 |             language: Target language for translation
424 |             
425 |         Returns:
426 |             Tuple of (results, metadata)
427 |         """
428 |         task_config = self._get_task_config(task, language)
429 |         
430 |         # Update the chunker's max size based on the task
431 |         self.chunker.max_chunk = task_config['chunk_size']
432 |         
433 |         # Chunk the file
434 |         chunks, metadata = self.chunker.chunk_file(file_path)
435 |         
436 |         print(f"\nProcessing {len(chunks)} chunks...")
437 |         results = []
438 |         
439 |         for i, (chunk, tokens) in enumerate(chunks, 1):
440 |             print(f"\nChunk {i}/{len(chunks)} ({tokens} tokens):")
441 |             try:
442 |                 result = await self.processor.process_chunk(
443 |                     instruction=task_config['instruction'],
444 |                     content=chunk
445 |                 )
446 |                 results.append(result)
447 |             except Exception as e:
448 |                 print(f"\nError processing chunk {i}: {e}")
449 |                 results.append(f"[Error processing chunk {i}]")
450 |                 
451 |         # Update metadata
452 |         metadata.update({
453 |             'processing_time': datetime.now().isoformat(),
454 |             'task': task,
455 |             'chunks_processed': len(chunks),
456 |             'language': language
457 |         })
458 |         
459 |         return results, metadata
460 | 
461 | 
462 | def write_output(output_path: str, results: List[str], metadata: Dict) -> None:
463 |     """ Write processing results to a file
464 |     
465 |     Args:
466 |         output_path: Path to output file
467 |         results: List of processed text chunks
468 |         metadata: Metadata about the processing
469 |     """
470 |     try:
471 |         with open(output_path, 'w', encoding='utf-8') as f:
472 |             f.write(f"File: {metadata.get('resourceName', 'Unknown')}\n")
473 |             f.write(f"Type: {metadata.get('Content-Type', 'Unknown')}\n")
474 |             f.write(f"Task: {metadata.get('task', 'Unknown')}\n")
475 |             f.write(f"Processed: {metadata.get('processing_time', 'Unknown')}\n")
476 |             f.write(f"Chunks: {metadata.get('chunks_processed', 0)}\n\n")
477 |             
478 |             for i, result in enumerate(results, 1):
479 |                 f.write(f"--- Chunk {i} ---\n\n")
480 |                 f.write(f"{result}\n\n")
481 |                 
482 |         print(f"\nOutput written to: {output_path}")
483 |     except Exception as e:
484 |         print(f"Error writing output file: {str(e)}")
485 | 
486 | 
487 | async def process_file(api_url: str, input_path: Path, task: str, 
488 |                       output_path: Optional[str] = None,
489 |                       language: str = "English", 
490 |                       max_chunk_size: int = 4096,
491 |                       api_password: Optional[str] = None) -> int:
492 |     """ Process a text file and save results
493 |     
494 |     Args:
495 |         api_url: URL to the KoboldAPI server
496 |         input_path: Path to input file
497 |         task: Processing task name
498 |         output_path: Path for output file (defaults to input + task)
499 |         language: Target language for translation
500 |         max_chunk_size: Maximum chunk size in tokens
501 |         api_password: Optional API key/password
502 |         
503 |     Returns:
504 |         Exit code (0 for success, 1 for error)
505 |     """
506 |     if not output_path:
507 |         input_stem = Path(input_path).stem
508 |         output_path = f"{input_stem}_{task}.txt"
509 |         
510 |     try:
511 |         processor = TextProcessor(
512 |             api_url=api_url,
513 |             max_chunk_size=max_chunk_size,
514 |             api_password=api_password
515 |         )
516 |         
517 |         results, metadata = await processor.process_text(task, input_path, language)
518 |         
519 |         write_output(output_path, results, metadata)
520 |         print("\nProcessing complete.")
521 |         return 0
522 |         
523 |     except Exception as e:
524 |         print(f"Error: {e}")
525 |         return 1
526 | 
527 | 
528 | async def main():
529 |     parser = argparse.ArgumentParser(
530 |         description="Process text documents with KoboldAPI"
531 |     )
532 |     parser.add_argument(
533 |         'input',
534 |         type=str,
535 |         help='Input text file path'
536 |     )
537 |     parser.add_argument(
538 |         '--task',
539 |         required=True,
540 |         choices=['summary', 'translate', 'correct', 'distill'],
541 |         help='Processing task to perform'
542 |     )
543 |     parser.add_argument(
544 |         '--api-url',
545 |         default='http://localhost:5001',
546 |         help='KoboldAPI URL'
547 |     )
548 |     parser.add_argument(
549 |         '--api-password',
550 |         default='',
551 |         help='API key/password'
552 |     )
553 |     parser.add_argument(
554 |         '--language',
555 |         default='English',
556 |         help='Target language for translation'
557 |     )
558 |     parser.add_argument(
559 |         '--output',
560 |         default=None,
561 |         help='Output file path'
562 |     )
563 |     parser.add_argument(
564 |         '--max-chunk-size',
565 |         default=4096,
566 |         type=int,
567 |         help='Maximum token size for a chunk'
568 |     )
569 |     
570 |     args = parser.parse_args()
571 |     
572 |     return await process_file(
573 |         api_url=args.api_url,
574 |         input_path=args.input,
575 |         task=args.task,
576 |         output_path=args.output,
577 |         language=args.language,
578 |         max_chunk_size=args.max_chunk_size,
579 |         api_password=args.api_password
580 |     )
581 | 
582 | 
583 | if __name__ == '__main__':
584 |     result = asyncio.run(main())
585 |     sys.exit(result)
586 | 


--------------------------------------------------------------------------------
/chunkify_littleprince_translate.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jabberjabberjabber/Chunkify/700d0ac8f83c792b17bc23f19e3c660f0fa59f18/chunkify_littleprince_translate.gif


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | regex
3 | extractous
4 | PyQt6


--------------------------------------------------------------------------------