3 |
5 |
6 | # Full Text Tabs Forever
7 |
8 | Search everything you read online. FTTF lets you search the full text of every web page you visit.
9 |
10 | Available in the [Chrome Web Store](https://chrome.google.com/webstore/detail/full-text-tabs-forever/gfmbnlbnapjmffgcnbopfgmflmlfghel).
11 |
12 | Available in the [Firefox Add-ons Store](https://addons.mozilla.org/en-US/firefox/addon/full-text-tabs-forever/).
13 |
14 | > **IMPORTANT FOR v2.0 USERS:** If you're upgrading from v1.x, see the [Database Migration](#database-migration-v20) section for instructions on migrating your existing data.
15 |
16 | _Firefox requires additional permissions. See [below](#firefox)._
17 |
18 |
19 |
20 | **Doesn't the browser do that already? How is this different?**
21 |
22 | Chrome does not let you search the text of pages you've visited, **only their URLs and titles**, and it deletes your history after a few months. Firefox keeps your history longer, but likewise only lets you search URLs and titles, not page content.
23 |
24 |
25 | FTTF is different:
26 |
27 | - **Full-Text Search Capabilities:** The full content of every page you've visited becomes searchable.
28 | - **Permanent History:** Your digital footprints are yours to keep. Nothing is deleted automatically or without your approval.
29 | - **Instant indexing:** FTTF builds its search index as you browse, so a page becomes searchable as soon as you visit it.
30 | - **For your eyes only:** Your browsing history is stored locally on your device, not on any external servers. Note that if you switch computers your FTTF history will not automatically come with you, though it can be exported.
31 |
32 |
33 |
34 | 
35 |
36 |
37 |
38 | **Who is it for?**
39 |
40 | Data hoarders like me who never want to delete anything and want everything to be searchable. More generally, if you've ever felt limited by the standard history search, you should try this out.
41 |
42 | **How it works:**
43 |
44 | Browser extensions have access to the pages you visit, which lets FTTF make an index of the content on any page. When a page loads, its content is extracted and indexed.
45 |
46 | Extracted? Yes, or "distilled" if you prefer. Full web pages are huge and carry a lot of markup and chrome unrelated to the content itself. FTTF ignores all of that: it acts like "reader mode", finding the relevant content on a page and indexing only that.
47 |
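Concretely, the extraction step looks roughly like the sketch below. It uses [Mozilla's Readability](https://github.com/mozilla/readability), which FTTF builds on, but this is a simplified illustration rather than the extension's actual content script (which also sends fields such as a markdown conversion of the page):

```ts
import { Readability } from "@mozilla/readability";

// Simplified sketch of the content script's job: distill the page,
// then hand the result to the background script for indexing.
async function extractAndIndex() {
  // Readability mutates the DOM it parses, so give it a clone.
  const article = new Readability(document.cloneNode(true) as Document).parse();

  if (!article) {
    // Nothing readable here (blank tab, login wall, etc.)
    await chrome.runtime.sendMessage(["nothingToIndex"]);
    return;
  }

  await chrome.runtime.sendMessage([
    "indexPage",
    {
      title: article.title,
      text_content: article.textContent,
      extractor: "readability",
    },
  ]);
}
```
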
48 | # Installation
49 |
50 | Install in your browser via the [Chrome Web Store](https://chrome.google.com/webstore/detail/full-text-tabs-forever/gfmbnlbnapjmffgcnbopfgmflmlfghel) or the [Firefox Add-ons Store](https://addons.mozilla.org/en-US/firefox/addon/full-text-tabs-forever/).
51 |
52 | # Testing
53 |
54 | This project uses `bun` as its unit-test runner, but not (currently) as a bundler. You will need to install `bun`, then:
55 |
56 | `bun test`
57 |
58 | Or, `bun run test` if you prefer.
59 |
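Tests live alongside the source as `*.test.ts` files and use the `bun:test` API. A minimal test in the style used in this repo:

```ts
import { describe, it, expect } from "bun:test";

describe("example", () => {
  it("does arithmetic", () => {
    expect(1 + 1).toBe(2);
  });
});
```
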
60 | # Note to self: Submitting a new version manually
61 |
62 | > How could this be automated? A rough sketch follows the list below.
63 |
64 | - Manually bump the version in the manifest file
65 | - Run the build
66 |   - `bun run build:chrome`
67 |   - `bun run build:firefox`
68 | - Submit
69 |   - Chrome
70 |     - Go to: https://chrome.google.com/webstore/devconsole/bc898ad5-018e-4774-b9ab-c4bef7b7f92b/gfmbnlbnapjmffgcnbopfgmflmlfghel/edit/package
71 |     - Upload the `fttf-chrome.zip` file
72 |   - Firefox
73 |     - Go to: https://addons.mozilla.org/en-US/developers/addon/full-text-tabs-forever/edit
74 |     - Upload the `fttf-firefox.zip` file
75 |     - Zip the original source code and upload that too: `zip -r src.zip src`
76 |
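As a partial answer to the automation question above, here is a rough sketch of scripting the build-and-zip half using the Bun shell. This script is hypothetical (not part of the repo), and the store uploads themselves would still be manual or require each store's publish API:

```ts
// scripts/package.ts (hypothetical helper); run with: bun scripts/package.ts
import { $ } from "bun";

// Keep the packaged version in sync with package.json
const { version } = await Bun.file("package.json").json();
console.log(`Packaging v${version}...`);

// Produce fttf-chrome.zip and fttf-firefox.zip
await $`bun run build:chrome`;
await $`bun run build:firefox`;

// Firefox review also wants the original source uploaded alongside the build
await $`zip -r src.zip src`;
```
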
77 | # Firefox
78 |
79 | Install here: https://addons.mozilla.org/en-US/firefox/addon/full-text-tabs-forever/
80 |
81 | Currently you have to manually enable additional permissions in Firefox.
84 |
85 | See this comment for more details: https://github.com/iansinnott/full-text-tabs-forever/issues/3#issuecomment-1963238416
86 |
87 | Support was added in: https://github.com/iansinnott/full-text-tabs-forever/pull/4.
88 |
89 | # Database Migration (v2.0)
90 |
91 | With version 2.0, Full Text Tabs Forever has migrated from SQLite (VLCN) to PostgreSQL (PgLite) as its database backend. This change brings several improvements:
92 |
93 | - Better full-text search capabilities with PostgreSQL's advanced text search
94 | - Support for vector embeddings for semantic search (coming soon)
95 | - Improved performance for large databases
96 | - More efficient storage of document fragments
97 |
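As a flavor of what the PostgreSQL backend enables, here is a minimal, self-contained full-text search example using PGlite. The `page` table and its columns are illustrative only, not the extension's actual schema:

```ts
import { PGlite } from "@electric-sql/pglite";

const db = new PGlite(); // in-memory here; the extension uses a persistent database

await db.exec(`
  CREATE TABLE page (
    id    SERIAL PRIMARY KEY,
    title TEXT,
    body  TEXT,
    -- generated column keeps the search vector in sync with the row
    search tsvector GENERATED ALWAYS AS (
      to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED
  );
  CREATE INDEX page_search_idx ON page USING GIN (search);
`);

await db.query("INSERT INTO page (title, body) VALUES ($1, $2)", [
  "Hello",
  "Full text search, running entirely in the browser.",
]);

const { rows } = await db.query(
  `SELECT title, ts_rank(search, websearch_to_tsquery('english', $1)) AS rank
     FROM page
    WHERE search @@ websearch_to_tsquery('english', $1)
    ORDER BY rank DESC`,
  ["full text"]
);
console.log(rows); // => [{ title: "Hello", rank: ... }]
```
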
98 | ## For Existing Users
99 |
100 | If you're upgrading from a previous version (v1.x), your data will not be lost! The extension includes a migration system that will:
101 |
102 | 1. Detect your existing VLCN (SQLite) database
103 | 2. Provide a simple one-click migration option in the Settings page
104 | 3. Transfer all your saved pages to the new PostgreSQL database
105 | 4. Show real-time progress during migration
106 | 5. Preserve all your searchable content
107 |
108 | To migrate your data:
109 |
110 | 1. After upgrading, open the extension
111 | 2. Go to the Settings page
112 | 3. Find the "Import VLCN Database (v1)" section
113 | 4. Click the "Import VLCN Database" button
114 | 5. Wait for the migration to complete - this may take several minutes depending on how many pages you've saved
115 | 6. Your data is now accessible in the new database system!
116 |
117 | The migration happens entirely on your device, and no data is sent to external servers. Your privacy remains protected throughout the process.
118 |
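Under the hood, the Settings page drives the migration through the extension's message channel. The snippet below is a simplified sketch of that flow; the message names match the adapter code, but the surrounding logic is abbreviated:

```ts
// Ask the background script whether a v1 (VLCN) database exists
const status = await chrome.runtime.sendMessage(["checkVLCNMigrationStatus"]);

if (status.available && !status.migrated) {
  // Kick off the batched import. Progress updates arrive separately
  // as "vlcnMigrationStatus" runtime messages.
  const result = await chrome.runtime.sendMessage(["importVLCNDocumentsV1"]);
  console.log(`Imported ${result.imported} documents (${result.duplicates} duplicates)`);
}
```
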
119 | # TODO
120 |
121 | - [ ] Backfill history
122 |   Currently only new pages you visit are indexed, but we could backfill by opening every page in the browser's history that hasn't yet been indexed. An optional feature, but a useful one.
123 | - [ ] Backup and sync
124 |   Improved export/import capabilities for moving data between devices.
125 | - [ ] Semantic search
126 |   Leverage vector embeddings in the new PostgreSQL backend for more intelligent searching.
127 | - [ ] Integrate with [browser-gopher](https://github.com/iansinnott/browser-gopher)
128 |   Browser gopher and [BrowserParrot](https://www.browserparrot.com/) were the initial impetus to create a better way to ingest full text web pages, without triggering a Cloudflare captcha party on your home connection.
129 | - [x] Migrate to PostgreSQL
130 |   Replace SQLite with a more powerful database backend using PgLite.
131 | - [x] Improve discoverability of functionality.
132 |   There is now a button to open the command palette. Still not much GUI, but enough to be discovered.
133 | - [x] Firefox
134 |   ~~This should not be too difficult since this project was started with web extension polyfills. However, there is currently some chrome specific code.~~
135 |   It appears that the APIs do not have to be rewritten to work in Firefox. See this PR for details: https://github.com/iansinnott/full-text-tabs-forever/pull/4
136 |
137 | # Contributing
138 |
139 | PRs welcome!
140 |
--------------------------------------------------------------------------------
/scripts/release.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | main() {
4 | echo "Releasing new version..."
5 | echo
6 | echo " PWD: $PWD"
7 |
8 | local version=$(jq -r '.version' package.json)
9 |
10 | # Replace version in src/manifest.json
11 | sed -i '' -e "s/\"version\": \".*\"/\"version\": \"$version\"/g" src/manifest.json
12 |
13 | # amend last commit
14 | git add src/manifest.json > /dev/null
15 | git commit --amend --no-edit > /dev/null
16 |
17 | # upsert the tag. if running yarn version the tag will have been created already
18 | git tag -d "v$version" > /dev/null 2>&1 || true
19 | git tag -a "v$version" -m "v$version" > /dev/null
20 |
21 | echo " Tag: v$version"
22 | echo " Commit: $(git rev-parse HEAD)"
23 | echo
24 | echo "Don't forget to push the tag to GitHub: git push --tags"
25 | }
26 |
27 | main
--------------------------------------------------------------------------------
/scripts/replace-manifest.cjs:
--------------------------------------------------------------------------------
1 | /**
2 | * Because chrome is so sensitive about the manifest file this script serves to
3 | * modify it for distribution.
4 | */
5 | const { readFileSync, writeFileSync } = require("fs");
6 | const path = require("path");
7 |
8 | const modifyManifest = (manifest) => {
9 | delete manifest["$schema"];
10 | };
11 |
12 | try {
13 | const manifestV3 = JSON.parse(
14 | readFileSync(path.resolve(__dirname, "../dist/manifest.json"), "utf8")
15 | );
16 |
17 | // Mutate the manifest object
18 | modifyManifest(manifestV3);
19 |
20 | writeFileSync(
21 | path.resolve(__dirname, "../dist/manifest.json"),
22 | JSON.stringify(manifestV3, null, 2)
23 | );
24 |
25 | console.log("Manifest converted v3 -> v2");
26 | } catch (err) {
27 | console.error("Could not build manifest", err);
28 | }
29 |
--------------------------------------------------------------------------------
/scripts/resize-images.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Input file
4 | input_file="$1"
5 |
6 | if [[ ! -e $input_file ]]; then
7 | echo "File does not exist"
8 | exit 1
9 | fi
10 |
11 | if [[ ${input_file: -4} != ".png" ]]; then
12 | echo "File is not a PNG"
13 | exit 1
14 | fi
15 |
16 | # Output directory
17 | output_dir="src/assets"
18 |
19 | # Create the output directory if it doesn't exist
20 | mkdir -p $output_dir
21 |
22 | # Icon sizes
23 | sizes=(16 48 128)
24 |
25 | # Generate the icons
26 | for size in "${sizes[@]}"; do
27 | base_name=$(basename "$input_file" .png)
28 | echo "Generating ${size}x${size} icon..."
29 | convert "$input_file" -resize "${size}x${size}" "$output_dir/${base_name}_${size}.png"
30 | done
--------------------------------------------------------------------------------
/src/assets/icon-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-1.png
--------------------------------------------------------------------------------
/src/assets/icon-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-2.png
--------------------------------------------------------------------------------
/src/assets/icon-cropped-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-cropped-1.png
--------------------------------------------------------------------------------
/src/assets/icon-cropped-1_128.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-cropped-1_128.png
--------------------------------------------------------------------------------
/src/assets/icon-cropped-1_16.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-cropped-1_16.png
--------------------------------------------------------------------------------
/src/assets/icon-cropped-1_48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-cropped-1_48.png
--------------------------------------------------------------------------------
/src/assets/icon-cropped-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon-cropped-2.png
--------------------------------------------------------------------------------
/src/assets/icon_128.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon_128.png
--------------------------------------------------------------------------------
/src/assets/icon_16.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon_16.png
--------------------------------------------------------------------------------
/src/assets/icon_48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/icon_48.png
--------------------------------------------------------------------------------
/src/assets/star-empty-38.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/star-empty-38.png
--------------------------------------------------------------------------------
/src/assets/star-filled-38.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iansinnott/full-text-tabs-forever/25795d53a1522841956f26b1f6772d9cb340b51a/src/assets/star-filled-38.png
--------------------------------------------------------------------------------
/src/background.ts:
--------------------------------------------------------------------------------
1 | // import browser, { omnibox, Runtime } from "webextension-polyfill";
2 |
3 | import { PgLiteBackend } from "./background/backend-pglite";
4 | import { log } from "./common/logs";
5 | import { debounce } from "./common/utils";
6 | import { BackendAdapter } from "./background/backend-adapter";
7 |
8 | // Although there were initially multiple adapters there is no mainly one.
9 | const adapter = new BackendAdapter({
10 | backend: new PgLiteBackend(),
11 | runtime: chrome.runtime,
12 | });
13 |
14 | /**
15 | * Expose for debugging
16 | * @example await fttf.backend._db.execO(`select * from sqlite_master;`)
17 | */
18 | globalThis.fttf = adapter;
19 |
20 | export type FTTF = {
21 | adapter: BackendAdapter;
22 | };
23 |
24 | if (adapter.onMessage) {
25 | chrome.runtime.onMessage.addListener((...args) => adapter.onMessage(...args));
26 | }
27 |
28 | // @note We do not support spas currently. URL changes trigger here, but we do
29 | // not then instruct the frontend to send the full text.
30 | const updateHandler = debounce(
31 | async (tabId: number, changeInfo: chrome.tabs.TabChangeInfo, tab: chrome.tabs.Tab) => {
32 | console.debug("%ctab update", "color:gray;", "no action performed", tab.url);
33 | // browser.tabs.sendMessage(tabId, ["onTabUpdated", { tabId, changeInfo }]);
34 | },
35 | 200
36 | );
37 |
38 | // Listen for tab updates, because the content script normally only runs on load. This is for SPA apps
39 | chrome.tabs.onUpdated.addListener((...args) => updateHandler(...args));
40 |
41 | // When the extension button is clicked, log a message
42 | chrome.action.onClicked.addListener(async () => {
43 | await adapter.openIndexPage();
44 | });
45 |
--------------------------------------------------------------------------------
/src/background/backend-adapter.ts:
--------------------------------------------------------------------------------
1 | import type { SendResponse } from "./backend";
2 | import { VLCN } from "./backend-vlcn";
3 | import { PgLiteBackend } from "./backend-pglite";
4 | import { log } from "../common/logs";
5 |
6 | export type BackendAdapterRuntime = {
7 | sendMessage: typeof chrome.runtime.sendMessage;
8 | getURL: typeof chrome.runtime.getURL;
9 | };
10 |
11 | export class BackendAdapter {
12 | backend: PgLiteBackend;
13 | runtime: BackendAdapterRuntime;
14 | _vlcn: VLCN | null = null;
15 |
16 | constructor({ backend, runtime }: { backend: PgLiteBackend; runtime: BackendAdapterRuntime }) {
17 | this.backend = backend;
18 | this.runtime = runtime;
19 | }
20 |
21 | onMessage(message: any, sender: chrome.runtime.MessageSender, sendResponse: SendResponse) {
22 | // Special case for migrating from VLCN to PgLite
23 | if (message[0] === "importVLCNDocuments" || message[0] === "importVLCNDocumentsV1") {
24 | this.importVLCNDocumentsV1()
25 | .then((result) => {
26 | sendResponse({ ok: true, ...result });
27 | })
28 | .catch((err) => {
29 | sendResponse({ error: err.message });
30 | });
31 | return true;
32 | }
33 |
34 | // Add handler for checking VLCN migration status
35 | if (message[0] === "checkVLCNMigrationStatus") {
36 | this.checkVLCNMigrationStatus()
37 | .then((result) => {
38 | sendResponse(result);
39 | })
40 | .catch((err) => {
41 | sendResponse({ error: err.message });
42 | });
43 | return true;
44 | }
45 |
46 | let waitForResponse = false;
47 | try {
48 | const { tab } = sender;
49 | const [method, payload] = message as [string, any];
50 |
51 | if (sender.url !== tab?.url) {
52 | console.log(`%cinfo`, "color:yellow;", "sender URL and tab URL differ. probably iframe");
53 | }
54 |
55 | // @ts-ignore This could be handled better. unimportant for now
56 | if (typeof this.backend[method] === "function") {
57 | waitForResponse = true;
58 | // @ts-ignore
59 | this.backend[method](payload, sender)
60 | .then((ret) => {
61 | sendResponse(ret);
62 | })
63 | .catch((err) => {
64 | console.error(`backend :: err :: ${method} ::`, payload);
65 | console.error(err);
66 | sendResponse({ error: err.message, stack: err.stack });
67 | });
68 | } else {
69 | console.warn(`%c${method}`, "color:yellow;", "is not a valid method", payload);
70 | sendResponse({ error: `'${method}' is not a valid RPC` });
71 | }
72 | } catch (err) {
73 | console.error("Could not parse message", message, sender, err);
74 | sendResponse({ error: err.message });
75 | }
76 |
77 | return waitForResponse; // Keep channel open for async response. Yikes
78 | }
79 |
80 | async checkVLCNMigrationStatus() {
81 | try {
82 | const isComplete = await this.isMigrationComplete();
83 |
84 | if (isComplete) {
85 | return { available: true, migrated: true };
86 | }
87 |
88 | if (!this._vlcn) {
89 | this._vlcn = new VLCN();
90 | try {
91 | await this._vlcn.readyPromise;
92 | } catch (err) {
93 | console.error("Failed to initialize VLCN", err);
94 | return { available: false, error: err.message };
95 | }
96 | }
97 |
98 | const status = await this._vlcn.getStatus();
99 | if (!status.ok) {
100 | return { available: false, error: status.error };
101 | }
102 |
103 | // Check if there are documents to migrate
104 | const count = await this._vlcn.sql<{
105 | count: number;
106 | }>`select count(*) as count from "document";`;
107 |
108 | const documentCount = count[0].count;
109 |
110 | // Flag the migration as complete so that we don't continue to initialize
111 | // VLCN every time. Ultimately we will remove VLCN completely.
112 | if (documentCount === 0) {
113 | await this.setMigrationComplete();
114 | }
115 |
116 | return {
117 | available: true,
118 | migrated: false,
119 | documentCount,
120 | };
121 | } catch (err) {
122 | console.error("Error checking VLCN migration status", err);
123 | return { available: false, error: err.message };
124 | }
125 | }
126 |
127 | // Created for debugging workflow
128 | async openIndexPage() {
129 | const [existingTab] = await chrome.tabs.query({
130 | url: this.runtime.getURL("index.html"),
131 | });
132 |
133 | if (existingTab) {
134 | await chrome.tabs.update(existingTab.id!, { active: true });
135 | } else {
136 | await chrome.tabs.create({ url: chrome.runtime.getURL("index.html") });
137 | }
138 | }
139 |
140 | async setMigrationComplete() {
141 | // First create the table if it doesn't exist
142 | await this.backend.db!.exec(
143 | `CREATE TABLE IF NOT EXISTS migration_info (key TEXT PRIMARY KEY, value TEXT);`
144 | );
145 |
146 | // Then insert the migration flag
147 | await this.backend.db!.exec(
148 | `INSERT INTO migration_info (key, value) VALUES ('migrated_to_pglite', '1') ON CONFLICT(key) DO UPDATE SET value = '1';`
149 | );
150 | }
151 |
152 | async isMigrationComplete() {
153 | try {
154 | const result = await this.backend.db!.query<{ value: string }>(
155 | `SELECT value FROM migration_info WHERE key = 'migrated_to_pglite';`
156 | );
157 | return result.rows[0]?.value === "1";
158 | } catch (error) {
159 | // If we haven't run the migration yet don't consider this an error
160 | if (error instanceof Error && error.message.includes("does not exist")) {
161 | return false;
162 | }
163 |
164 | throw error;
165 | }
166 | }
167 |
168 | async importVLCNDocumentsV1() {
169 | try {
170 | // Send initial status update
171 | this.runtime.sendMessage({
172 | type: "vlcnMigrationStatus",
173 | status: "starting",
174 | message: "Initializing VLCN database...",
175 | });
176 |
177 | if (!this._vlcn) {
178 | this._vlcn = new VLCN();
179 | await this._vlcn.readyPromise;
180 | }
181 |
182 | // Check document count
183 | const count = await this._vlcn.sql<{
184 | count: number;
185 | }>`select count(*) as count from "document";`;
186 |
187 | console.log("vlcnAdapter :: count", count);
188 |
189 | if (count[0].count === 0) {
190 | this.runtime.sendMessage({
191 | type: "vlcnMigrationStatus",
192 | status: "empty",
193 | message: "No documents found in the VLCN database.",
194 | });
195 | return { imported: 0, message: "No documents found in VLCN database" };
196 | }
197 |
198 | // Send update with document count
199 | this.runtime.sendMessage({
200 | type: "vlcnMigrationStatus",
201 | status: "fetching",
202 | message: `Found ${count[0].count} documents to migrate...`,
203 | });
204 |
205 | // Process documents in batches
206 | const BATCH_SIZE = 100;
207 | let imported = 0;
208 | let duplicates = 0;
209 | let processed = 0;
210 | const totalDocuments = count[0].count;
211 |
212 | // Send update before importing
213 | this.runtime.sendMessage({
214 | type: "vlcnMigrationStatus",
215 | status: "importing",
216 | message: `Beginning import of ${totalDocuments} documents...`,
217 | total: totalDocuments,
218 | current: 0,
219 | });
220 |
221 | while (processed < totalDocuments) {
222 | // Fetch batch of documents
223 | const batchQuery = `SELECT
224 | id,
225 | title,
226 | url,
227 | excerpt,
228 | mdContent,
229 | mdContentHash,
230 | publicationDate,
231 | hostname,
232 | lastVisit,
233 | lastVisitDate,
234 | extractor,
235 | createdAt,
236 | updatedAt
237 | FROM "document"
238 | LIMIT ${BATCH_SIZE} OFFSET ${processed};`;
239 |
240 | const batch = await this._vlcn?.db.execA(batchQuery);
241 |
242 | if (batch.length === 0) {
243 | break; // No more documents to process
244 | }
245 |
246 | if (processed === 0) {
247 | // Log sample of first batch only
248 | console.log(
249 | "vlcnAdapter :: docs sample",
250 | batch.slice(0, 3).map((d) => ({ id: d[0], title: d[1], url: d[2] }))
251 | );
252 | }
253 |
254 | // Import current batch
255 | const batchResult = await this.backend.importDocumentsJSONv1({ document: batch });
256 |
257 | imported += batchResult.imported;
258 | duplicates += batchResult.duplicates;
259 | processed += batch.length;
260 |
261 | // Update progress
262 | this.runtime.sendMessage({
263 | type: "vlcnMigrationStatus",
264 | status: "importing",
265 | message: `Imported ${processed} of ${totalDocuments} documents...`,
266 | total: totalDocuments,
267 | current: processed,
268 | });
269 | }
270 |
271 | const result = { imported, duplicates };
272 |
273 | // Send completion status
274 | this.runtime.sendMessage({
275 | type: "vlcnMigrationStatus",
276 | status: "complete",
277 | message: `Migration complete. Imported ${result.imported} documents (${result.duplicates} were duplicates).`,
278 | result,
279 | });
280 |
281 | // Mark VLCN database as migrated to prevent duplicate migrations
282 | try {
283 | await this.setMigrationComplete();
284 |
285 | console.log("Marked VLCN database as migrated successfully");
286 | } catch (err) {
287 | console.error("Error marking VLCN database as migrated", err);
288 | }
289 |
290 | return result;
291 | } catch (error) {
292 | console.error("VLCN migration failed", error);
293 |
294 | // Send error status
295 | this.runtime.sendMessage({
296 | type: "vlcnMigrationStatus",
297 | status: "error",
298 | message: `Migration failed: ${error.message}`,
299 | error: error.message,
300 | });
301 |
302 | return { error: error.message };
303 | }
304 | }
305 | }
306 |
--------------------------------------------------------------------------------
/src/background/backend-debug.ts:
--------------------------------------------------------------------------------
1 | /**
2 | * This backend is used for debugging purposes. It does not index anything.
3 | */
4 |
5 | import { formatDebuggablePayload } from "../common/utils";
6 | import { Backend, DetailRow } from "./backend";
7 |
8 | export class DebugBackend implements Backend {
9 | getStatus: Backend["getStatus"] = async () => {
10 | return {
11 | ok: true,
12 | };
13 | };
14 |
15 | search: Backend["search"] = async (search) => {
16 | console.debug(`backend#%c${"search"}`, "color:lime;", search);
17 | return {
18 | ok: true,
19 | results: [],
20 | count: 0,
21 | perfMs: 0,
22 | query: search.query,
23 | };
24 | };
25 |
26 | async findOne(query: { where: { url: string } }): Promise<DetailRow | null> {
27 | console.debug(`backend#%c${"findOne"}`, "color:lime;", query);
28 | return null;
29 | }
30 |
31 | getPageStatus: Backend["getPageStatus"] = async (payload, sender) => {
32 | const { tab } = sender;
33 | let shouldIndex = tab?.url?.startsWith("http"); // ignore chrome extensions, about:blank, etc
34 |
35 | try {
36 | const url = new URL(tab?.url || "");
37 | if (url.hostname === "localhost") shouldIndex = false;
38 | if (url.hostname.endsWith(".local")) shouldIndex = false;
39 | } catch (err) {
40 | // should not happen
41 | throw err;
42 | }
43 |
44 | console.debug(`%c${"getPageStatus"}`, "color:lime;", { shouldIndex, url: tab?.url }, payload);
45 |
46 | return {
47 | shouldIndex,
48 | };
49 | };
50 |
51 | indexPage: Backend["indexPage"] = async (payload, sender) => {
52 | const { tab } = sender;
53 |
54 | // remove adjacent whitespace since it serves no purpose. The html or
55 | // markdown content stores formatting.
56 | const plainText = payload.text_content?.replace(/[ \t]+/g, " ").replace(/\n+/g, "\n");
57 |
58 | console.debug(`%c${"indexPage"}`, "color:lime;", tab?.url);
59 | console.debug(formatDebuggablePayload({ ...payload, textContent: plainText }));
60 | return {
61 | message: "debug backend does not index pages",
62 | };
63 | };
64 |
65 | nothingToIndex: Backend["nothingToIndex"] = async (payload, sender) => {
66 | const { tab } = sender;
67 | console.debug(`%c${"nothingToIndex"}`, "color:beige;", tab?.url);
68 | return {
69 | ok: true,
70 | };
71 | };
72 |
73 | getRecent: Backend["getRecent"] = async (options) => {
74 | console.debug(`backend#%c${"getRecent"}`, "color:lime;", options);
75 | return {
76 | ok: true,
77 | results: [],
78 | count: 0,
79 | perfMs: 0,
80 | };
81 | };
82 | }
83 |
--------------------------------------------------------------------------------
/src/background/backend.ts:
--------------------------------------------------------------------------------
1 | import type { Runtime } from "webextension-polyfill";
2 | import type { Readability } from "@mozilla/readability";
3 |
4 | export type SendResponse = (response?: any) => void;
5 |
6 | export type RemoteProcWithSender<T = any, R = any> = (
7 |   payload: T,
8 |   sender: Runtime.MessageSender
9 | ) => Promise<R>;
10 | export type RemoteProc<T = any, R = any> = (payload: T) => Promise<R>;
11 |
12 | type ReadabilityArticle = Omit<NonNullable<ReturnType<Readability["parse"]>>, "content">;
13 |
14 | export type Article = ReadabilityArticle & {
15 | extractor: string;
16 | /** Optional for now b/c i'm not sending it over the wire if turndown is used in the content script */
17 | html_content?: string;
18 | /** Optional because the parsing can fail */
19 | md_content?: string;
20 | text_content?: string;
21 | date?: string;
22 | _extraction_time: number;
23 | };
24 |
25 | export type ArticleRow = Omit<Article, "date"> & {
26 | id: number;
27 | md_content_hash?: string;
28 | md_content?: string;
29 | url: string;
30 | hostname: string;
31 | search_words?: string[];
32 | last_visit?: number; // Timestamp
33 | last_visit_date?: string;
34 | updated_at: number;
35 | created_at: number; // Timestamp
36 | publication_date?: number;
37 | };
38 |
39 | /** @deprecated don't use urls directly for now. use documents which have URLs */
40 | export type UrlRow = {
41 | url: string;
42 | url_hash: string;
43 | title?: string;
44 | last_visit?: number; // Timestamp
45 | hostname: string;
46 | text_content_hash?: string;
47 | search_words?: string[];
48 | };
49 |
50 | export type ResultRow = {
51 | rowid: number;
52 | id: number;
53 | entity_id: number;
54 | attribute: string;
55 | snippet?: string;
56 | url: string;
57 | hostname: string;
58 | title?: string;
59 | excerpt?: string;
60 | last_visit?: number; // Timestamp
61 | last_visit_date?: string;
62 | md_content_hash?: string;
63 | updated_at: number;
64 | created_at: number; // Timestamp
65 | };
66 |
67 | export type DetailRow = ResultRow & {
68 | md_content?: string;
69 | };
70 |
71 | type FirstArg<T> = T extends (arg: infer U, ...args: any[]) => any ? U : never;
72 |
73 | export type RpcMessage =
74 | | [method: "getPageStatus"]
75 | | [method: "indexPage", payload: FirstArg]
76 | | [method: "nothingToIndex"]
77 | | [method: "getStats"]
78 | | [method: "getStatus"]
79 | | [method: "exportJson"]
80 | | [method: "importJson"]
81 | | [method: "reindex"]
82 | | [method: "search", payload: FirstArg]
83 | | [method: string, payload?: any];
84 |
85 | export type DBDump = Record<string, any[][]>;
86 |
87 | export interface Backend {
88 | getStatus(): Promise<{ ok: true } | { ok: false; error: string; detail?: any }>;
89 | getPageStatus: (_: any, sender: { tab: { url: string } }) => Promise<any>;
90 | indexPage: (payload: Article, sender: { tab: { url: string } }) => Promise<any>;
91 | nothingToIndex: RemoteProcWithSender;
92 | search: RemoteProc<
93 | {
94 | query: string;
95 | limit?: number;
96 | offset?: number;
97 | orderBy: "updated_at" | "rank" | "last_visit";
98 | preprocessQuery?: boolean;
99 | },
100 | {
101 | ok: boolean;
102 | results: ResultRow[];
103 | count?: number;
104 | perfMs: number;
105 | query: string;
106 | }
107 | >;
108 | getRecent(options: { limit?: number; offset?: number }): Promise<{
109 | ok: boolean;
110 | results: ResultRow[];
111 | count?: number;
112 | perfMs: number;
113 | }>;
114 | findOne(query: { where: { url: string } }): Promise<DetailRow | null>;
115 | exportJson?(): Promise<DBDump>;
116 | importDocumentsJSONv1?(payload: {
117 | document: any[][];
118 | }): Promise<{ imported: number; duplicates: number }>;
119 | }
120 |
--------------------------------------------------------------------------------
/src/background/embedding/pipeline.ts:
--------------------------------------------------------------------------------
1 | /**
2 | * For use in background.js - Handles requests from the UI, runs the model, then
3 | * sends back a response
4 | */
5 |
6 | import { pipeline, env, type FeatureExtractionPipeline } from "@xenova/transformers";
7 |
8 | export type TransformersProgress =
9 | | {
10 | status: "done" | "initiate" | "download";
11 | name: string;
12 | file: string;
13 | }
14 | | {
15 | status: "progress";
16 | name: string;
17 | file: string;
18 | progress: number;
19 | loaded: number;
20 | total: number;
21 | }
22 | | {
23 | status: "ready";
24 | task: string;
25 | model: string;
26 | };
27 |
28 | // Skip initial check for local models, since we are not loading any local models.
29 | env.allowLocalModels = false;
30 |
31 | // Due to a bug in onnxruntime-web, we must disable multithreading for now.
32 | // See https://github.com/microsoft/onnxruntime/issues/14445 for more information.
33 | env.backends.onnx.wasm.numThreads = 1;
34 |
35 | class PipelineSingleton {
36 | static task = "feature-extraction" as const;
37 | static model = "Xenova/all-MiniLM-L6-v2";
38 | static instance: FeatureExtractionPipeline | null = null;
39 |
40 | static async getInstance(progress_callback?: (x: TransformersProgress) => void) {
41 | if (this.instance === null) {
42 | console.time("loading pipeline");
43 | this.instance = await pipeline(this.task, this.model, { progress_callback });
44 | console.timeEnd("loading pipeline");
45 | }
46 |
47 | return this.instance;
48 | }
49 | }
50 |
51 | export const createTensor = async (text: string) => {
52 | // Get the pipeline instance. This will load and build the model when run for the first time.
53 | let model = await PipelineSingleton.getInstance((data) => {
54 | console.log("progress ::", data);
55 | });
56 |
57 | // Actually run the model on the input text
58 | let tensor = await model(text, { pooling: "mean", normalize: true });
59 |
60 | return tensor;
61 | };
62 |
63 | // Create generic classify function, which will be reused for the different types of events.
64 | export const createEmbedding = async (text: string) => {
65 | const tensor = await createTensor(text);
66 | return tensor.tolist()?.[0] as number[];
67 | };
68 |
--------------------------------------------------------------------------------
/src/background/pglite/HAX_pglite.ts:
--------------------------------------------------------------------------------
1 | /**
2 | * HAX: Load PGlite in a service worker
3 | *
4 | * This is a temporary solution to allow PGlite to work in a service worker.
5 | * Hopefully in future versions this will not be necessary. The core issue here
6 | * is that PGlite, perhaps via some internal emscripten logic, is using the
7 | * _synchronous_ XMLHttpRequest API to load assets. This poses two issues:
8 | *
9 | * - chrome does not support XMLHttpRequest AT ALL in service workers
10 | * - we cannot create a full polyfill for XMLHttpRequest because we cannot mimic the synchronous behavior
11 | *
12 | * Thus this script simply loads the relevant bytes into memory and hands them
13 | * back if requested via the correct URL.
14 | *
15 | * @todo Not sure if vite grabs the relevant asset and puts it in the build;
16 | * might need to create a plugin for that. Works for the `dev` command but might
17 | * not work for `build`.
18 | */
19 |
20 | const assetCache = new Map<string, ArrayBuffer>();
21 |
22 | async function preloadAssets() {
23 | // NOTE: The wasm file exists in the pglite package but does not seem to be used. preloading the data file was enough
24 | const assetUrls = [
25 | chrome.runtime.getURL("/assets/postgres-CkP7QCDB.data"), // 0.2.17
26 | ];
27 |
28 | for (const url of assetUrls) {
29 | try {
30 | const response = await fetch(url);
31 | if (!response.ok) {
32 | console.log(`failed to fetch asset :: ${url}`);
33 | continue;
34 | }
35 | const arrayBuffer = await response.arrayBuffer();
36 | assetCache.set(url, arrayBuffer);
37 | } catch (error) {
38 | console.error(`failed to preload asset :: ${url}`, error);
39 | }
40 | }
41 | }
42 |
43 | // As with XMLHttpRequest, this is not supported in the service worker context.
44 | class ProgressEventPolyfill {
45 | type: string;
46 | constructor(type: string) {
47 | this.type = type;
48 | }
49 | }
50 |
51 | // A partial polyfill for XMLHttpRequest to support the loading of pglite in a
52 | // service worker
53 | class XMLHttpRequestPolyfill {
54 | private url: string = "";
55 | public onload: ((this: XMLHttpRequest, ev: ProgressEvent) => any) | null = null;
56 | public onerror: ((this: XMLHttpRequest, ev: ProgressEvent) => any) | null = null;
57 | public status: number = 0;
58 | public responseText: string = "";
59 | public response: any = null;
60 |
61 | open(method: string, url: string) {
62 | console.log("open ::", { method, url });
63 | this.url = url;
64 | }
65 |
66 | send(body: any = null) {
67 | console.log("send ::", { body });
68 | if (assetCache.has(this.url)) {
69 | this.response = assetCache.get(this.url);
70 | this.status = 200;
71 | if (this.onload) {
72 | // @ts-expect-error
73 | this.onload.call(this, new ProgressEventPolyfill("load") as any);
74 | }
75 | } else {
76 | console.error(`asset not preloaded :: ${this.url}`);
77 | this.status = 404;
78 | if (this.onerror) {
79 | // @ts-expect-error
80 | this.onerror.call(this, new ProgressEventPolyfill("error") as any);
81 | }
82 | }
83 | }
84 | }
85 |
86 | (globalThis as any).XMLHttpRequest = XMLHttpRequestPolyfill;
87 | (globalThis as any).ProgressEvent = ProgressEventPolyfill;
88 |
89 | // Preload assets BEFORE importing PGlite
90 | //
91 | // NOTE: This will require vite-plugin-top-level-await. Chrome will not allow
92 | // top level await in service workers even if supported by the browser in other
93 | // context.
94 | await preloadAssets();
95 |
96 | import { PGlite } from "@electric-sql/pglite";
97 |
98 | export { PGlite };
99 |
--------------------------------------------------------------------------------
/src/background/pglite/defaultBlacklistRules.ts:
--------------------------------------------------------------------------------
1 | export const defaultBlacklistRules: Array<[string, "url_only" | "no_index"]> = [
2 | ["https://news.ycombinator.com", "url_only"],
3 | ["https://news.ycombinator.com/news", "url_only"],
4 | ["https://news.ycombinator.com/new", "url_only"],
5 | ["https://news.ycombinator.com/best", "url_only"],
6 | ["http://localhost%", "no_index"],
7 | ["https://localhost%", "no_index"],
8 | ["https://www.bankofamerica.com%", "url_only"],
9 | ["https://www.chase.com%", "url_only"],
10 | ["https://www.wellsfargo.com%", "url_only"],
11 | ["https://www.citibank.com%", "url_only"],
12 | ["https://www.capitalone.com%", "url_only"],
13 | ["https://app.mercury.com%", "url_only"],
14 | ["https://www.schwab.com%", "url_only"],
15 | ["https://www.fidelity.com%", "url_only"],
16 | ["https://www.vanguard.com%", "url_only"],
17 | ["https://www.etrade.com%", "url_only"],
18 | ["https://www.tdameritrade.com%", "url_only"],
19 | ["https://www.robinhood.com%", "url_only"],
20 | ["https://www.paypal.com%", "url_only"],
21 | ["https://www.venmo.com%", "url_only"],
22 | ["https://www.facebook.com", "url_only"],
23 | ["https://www.amazon.com%", "url_only"],
24 | ["https://www.ebay.com%", "url_only"],
25 | ["https://www.dropbox.com", "url_only"],
26 | ["https://drive.google.com%", "url_only"],
27 | ["https://www.coinbase.com%", "url_only"],
28 | ["https://www.webmd.com", "url_only"],
29 | ["https://%.local", "no_index"],
30 | ["https://%.internal", "no_index"],
31 | ["https://twitter.com", "url_only"],
32 | ["https://twitter.com/home", "url_only"],
33 | ["https://x.com", "url_only"],
34 | ["https://x.com/home", "url_only"],
35 | ["https://www.linkedin.com", "url_only"],
36 | ["https://www.tiktok.com", "url_only"],
37 | ["https://mail.google.com", "no_index"],
38 | ["https://outlook.live.com%", "no_index"],
39 | ["https://docs.google.com%", "url_only"],
40 | ["https://www.office.com%", "url_only"],
41 | ["https://slack.com", "url_only"],
42 | ["https://zoom.us%", "url_only"],
43 | ["https://www.ask.com/web?q=%", "url_only"],
44 | ["https://www.baidu.com/s?%", "url_only"],
45 | ["https://www.reddit.com/search%", "url_only"],
46 | ["https://www.bing.com/search%", "url_only"],
47 | ["https://search.yahoo.com/search%", "url_only"],
48 | ["https://www.duckduckgo.com/?q=%", "url_only"],
49 | ["https://yandex.com/search/?%", "url_only"],
50 | ["https://%dashlane.com%", "no_index"],
51 | ["https://%bitwarden.com%", "no_index"],
52 | ["https://%lastpass.com%", "no_index"],
53 | ["https://%1password.com%", "no_index"],
54 | ["https://kagi.com/search%", "url_only"],
55 | ["https://www.google.com/search%", "url_only"],
56 | ];
57 |
--------------------------------------------------------------------------------
/src/background/pglite/job_queue.test.ts:
--------------------------------------------------------------------------------
1 | // @ts-nocheck
2 | import { describe, it, expect, beforeEach, afterEach, mock } from "bun:test";
3 | import { JobQueue, JOB_QUEUE_SCHEMA } from "./job_queue";
4 | import { PGlite } from "@electric-sql/pglite";
5 | import * as defaultTasks from "./tasks";
6 |
7 | describe("JobQueue", () => {
8 | let db: PGlite;
9 | let jobQueue: JobQueue;
10 | let mockTasks: typeof defaultTasks;
11 |
12 | beforeEach(async () => {
13 | // Create an in-memory PGLite instance
14 | db = new PGlite("memory://");
15 | await db.query(JOB_QUEUE_SCHEMA);
16 |
17 | // Create mock tasks
18 | mockTasks = {
19 | ...defaultTasks,
20 | generate_fragments: {
21 | handler: mock(() => Promise.resolve()),
22 | params: { parse: (p: any) => p },
23 | },
24 | };
25 |
26 | jobQueue = new JobQueue(db, mockTasks, 100);
27 | await jobQueue.initialize();
28 | });
29 |
30 | afterEach(async () => {
31 | // Clean up the database
32 | await db.query("DROP TABLE IF EXISTS task");
33 | await db.close();
34 | });
35 |
36 | it("should initialize the job queue", async () => {
37 | const result = await db.query<{ count: number }>("SELECT COUNT(*) as count FROM task");
38 | expect(result.rows[0].count).toBe(0);
39 | });
40 |
41 | it("should enqueue a task", async () => {
42 | const taskType = "generate_fragments";
43 | const params = { articleId: 1 };
44 |
45 | const taskId = await jobQueue.enqueue(taskType, params);
46 | expect(taskId).toBeGreaterThan(0);
47 |
48 | const result = await db.query<{ count: number }>("SELECT COUNT(*) as count FROM task");
49 | expect(result.rows[0].count).toBe(1);
50 | });
51 |
52 | it("should not enqueue duplicate tasks", async () => {
53 | const taskType = "generate_fragments";
54 | const params = { articleId: 1 };
55 |
56 | const taskId1 = await jobQueue.enqueue(taskType, params);
57 | const taskId2 = await jobQueue.enqueue(taskType, params);
58 |
59 | expect(taskId1).toBeGreaterThan(0);
60 | expect(taskId2).toBeUndefined();
61 |
62 | const result = await db.query<{ count: number }>("SELECT COUNT(*) as count FROM task");
63 | expect(result.rows[0].count).toBe(1);
64 | });
65 |
66 | it("should process pending tasks", async () => {
67 | const taskType = "generate_fragments";
68 | const params = { articleId: 1 };
69 |
70 | await jobQueue.enqueue(taskType, params);
71 |
72 | await jobQueue.processPendingTasks();
73 |
74 | // Wait for a short time to allow the task to be processed
75 | await new Promise((resolve) => setTimeout(resolve, 100));
76 |
77 | const result = await db.query<{ count: number }>("SELECT COUNT(*) as count FROM task");
78 | expect(result.rows[0].count).toBe(0);
79 | expect(mockTasks[taskType].handler).toHaveBeenCalledTimes(1);
80 | });
81 |
82 | it("should mark failed tasks", async () => {
83 | const taskType = "generate_fragments";
84 | const params = { articleId: 1 };
85 |
86 | // Mock the task handler to throw an error
87 | mockTasks[taskType] = {
88 | handler: mock(() => Promise.reject(new Error("Test error"))),
89 | params: { parse: (p: any) => p },
90 | };
91 |
92 | await jobQueue.enqueue(taskType, params);
93 |
94 | await jobQueue.processPendingTasks();
95 |
96 | // Wait for a short time to allow the task to be processed
97 | await new Promise((resolve) => setTimeout(resolve, 100));
98 |
99 | const result = await db.query<{ count: number; failed_count: number }>(
100 | "SELECT COUNT(*) as count, COUNT(failed_at) as failed_count FROM task"
101 | );
102 | expect(result.rows[0].count).toBe(1);
103 | expect(result.rows[0].failed_count).toBe(1);
104 | });
105 |
106 | it("should stop processing tasks when requested", async () => {
107 | const taskType = "generate_fragments";
108 | const params = { articleId: 1 };
109 |
110 | // Mock the task handler
111 | mockTasks[taskType] = {
112 | handler: mock(() => new Promise((resolve) => setTimeout(resolve, 500))),
113 | params: { parse: (p: any) => p },
114 | };
115 |
116 | await jobQueue.enqueue(taskType, params);
117 | await jobQueue.enqueue(taskType, { articleId: 2 });
118 |
119 | const processPromise = jobQueue.processPendingTasks();
120 |
121 | // Stop the queue after a short delay
122 | setTimeout(() => jobQueue.stop(), 100);
123 |
124 | await processPromise;
125 |
126 | const result = await db.query<{ count: number }>("SELECT COUNT(*) as count FROM task");
127 | expect(result.rows[0].count).toBe(1); // One task should remain unprocessed
128 | });
129 | });
130 |
--------------------------------------------------------------------------------
/src/background/pglite/job_queue.ts:
--------------------------------------------------------------------------------
1 | import type { PGlite, Transaction } from "@electric-sql/pglite";
2 | import type { TaskDefinition } from "./tasks";
3 | import * as defaultTasks from "./tasks";
4 |
5 | const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
6 |
7 | type DBWriter = Pick<PGlite | Transaction, "query">;
8 |
9 | export const JOB_QUEUE_SCHEMA = `
10 | CREATE TABLE IF NOT EXISTS task (
11 | id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
12 | task_type TEXT NOT NULL,
13 | params JSONB DEFAULT '{}'::jsonb NOT NULL,
14 | created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP NOT NULL,
15 | failed_at TIMESTAMP WITH TIME ZONE,
16 | error TEXT,
17 | CONSTRAINT task_task_type_params_unique UNIQUE(task_type, params)
18 | );
19 | `;
20 |
21 | export class JobQueue {
22 | private isProcessing: boolean = false;
23 | private shouldStop: boolean = false;
24 |
25 | constructor(
26 | private db: PGlite,
27 | private tasks: typeof defaultTasks = defaultTasks,
28 | private taskInterval: number = 1000
29 | ) {}
30 |
31 | async initialize() {
32 | await this.db.query(JOB_QUEUE_SCHEMA);
33 | }
34 |
35 | async enqueue(
36 | taskType: keyof typeof this.tasks,
37 | params: object = {},
38 | tx: DBWriter = this.db
39 | ): Promise<number | undefined> {
40 | const task = this.tasks[taskType];
41 |
42 | if (!task) {
43 | throw new Error(`Task type ${taskType} not implemented`);
44 | }
45 |
46 | // Make sure params are valid before adding to queue
47 | task.params?.parse(params);
48 |
49 | const result = await tx.query<{ id: number }>(
50 | `
51 | INSERT INTO task (task_type, params)
52 | VALUES ($1, $2::jsonb)
53 | ON CONFLICT (task_type, params) DO NOTHING
54 | RETURNING id
55 | `,
56 | [taskType, params]
57 | );
58 |
59 | const taskId = result.rows[0]?.id;
60 |
61 | return taskId;
62 | }
63 |
64 | /**
65 | * Process a single task from the queue
66 | *
67 | * NOTE: a few things about this queue strategy:
68 | * - priority queue based on logic in the ORDER BY clause. add cases as needed
69 | * - random order if no priority is set
70 | */
71 | private async processQueue() {
72 | let processedId: number | null = null;
73 |
74 | try {
75 | await this.db.transaction(async (tx) => {
76 | const result = await tx.query<{
77 | id: number;
78 | task_type: string;
79 | params: Record<string, any>;
80 | }>(`
81 | SELECT id, task_type, params::jsonb
82 | FROM task
83 | WHERE failed_at IS NULL
84 | ORDER BY
85 | CASE
86 | WHEN task_type = 'generate_fragments' THEN 0
87 | ELSE random()
88 | END,
89 | created_at
90 | LIMIT 1
91 | FOR UPDATE SKIP LOCKED
92 | `);
93 |
94 | if (!result.rows.length) {
95 | console.log("task :: empty queue");
96 | return;
97 | }
98 |
99 | const { id, task_type, params } = result.rows[0];
100 |
101 | processedId = id;
102 |
103 | if (!(task_type in this.tasks)) {
104 | console.warn(`task :: ${task_type} :: not implemented`);
105 | await this.markTaskAsFailed(tx, id, "Task type not implemented");
106 | return;
107 | }
108 |
109 | const task = this.tasks[task_type as keyof typeof this.tasks] as TaskDefinition;
110 | const start = performance.now();
111 | try {
112 | await task.handler(tx, task.params?.parse(params));
113 | await tx.query("DELETE FROM task WHERE id = $1", [id]);
114 | } catch (error) {
115 | console.error(`task :: error`, error.message);
116 | throw error;
117 | } finally {
118 | console.log(
119 | `task :: ${performance.now() - start}ms :: ${task_type} :: ${JSON.stringify(params)}`
120 | );
121 | }
122 | });
123 | } catch (error) {
124 | console.error(`task :: processQueue :: error`, error);
125 |
126 | // NOTE this cannot be done within the transaction. using the tx after a
127 | // failure will result in an error saying the transaction is aborted.
128 | if (processedId) {
129 | await this.markTaskAsFailed(this.db, processedId, error.message);
130 | }
131 | }
132 | }
133 |
134 | private async markTaskAsFailed(tx: DBWriter, id: number, errorMessage: string) {
135 | await tx.query(
136 | `
137 | UPDATE task
138 | SET failed_at = CURRENT_TIMESTAMP, error = $1
139 | WHERE id = $2
140 | `,
141 | [errorMessage, id]
142 | );
143 | }
144 |
145 | async processPendingTasks() {
146 | if (this.isProcessing) {
147 | return;
148 | }
149 |
150 | this.isProcessing = true;
151 | this.shouldStop = false;
152 |
153 | const getPendingCount = async () => {
154 | const pendingTasks = await this.db.query<{ count: number }>(`
155 | SELECT COUNT(*) as count FROM task
156 | WHERE failed_at IS NULL
157 | `);
158 | return pendingTasks.rows[0].count;
159 | };
160 |
161 | try {
162 | while ((await getPendingCount()) > 0 && !this.shouldStop) {
163 | await this.processQueue();
164 | await sleep(this.taskInterval);
165 | }
166 | } finally {
167 | this.isProcessing = false;
168 | }
169 | }
170 |
171 | stop() {
172 | this.shouldStop = true;
173 | }
174 | }
175 |
--------------------------------------------------------------------------------
/src/background/pglite/migration-manager.test.ts:
--------------------------------------------------------------------------------
1 | import { describe, it, expect, beforeEach, mock } from "bun:test";
2 | import { PGlite } from "@electric-sql/pglite"; // Use the direct import for testing
3 | import { MigrationManager, Migration } from "./migration-manager";
4 |
5 | describe("MigrationManager", () => {
6 | let db: PGlite;
7 | let migrationManager: MigrationManager;
8 |
9 | beforeEach(async () => {
10 | if (db) {
11 | await db.close();
12 | }
13 |
14 | // Create a new in-memory database for each test
15 | db = new PGlite();
16 | migrationManager = new MigrationManager(db);
17 | });
18 |
19 | it("should initialize with no migrations", async () => {
20 | const status = await migrationManager.applyMigrations();
21 |
22 | expect(status.ok).toBe(true);
23 | expect(status.currentVersion).toBe(0);
24 | expect(status.availableVersion).toBe(0);
25 | expect(status.pendingCount).toBe(0);
26 | });
27 |
28 | it("should register migrations correctly", () => {
29 | const migration1: Migration = {
30 | version: 1,
31 | name: "test_migration_1",
32 | description: "Test migration 1",
33 | sql: "CREATE TABLE test1 (id SERIAL PRIMARY KEY);",
34 | };
35 |
36 | const migration2: Migration = {
37 | version: 2,
38 | name: "test_migration_2",
39 | description: "Test migration 2",
40 | sql: "CREATE TABLE test2 (id SERIAL PRIMARY KEY);",
41 | };
42 |
43 | migrationManager.registerMigration(migration1);
44 | migrationManager.registerMigration(migration2);
45 |
46 | // We're testing internal state here, so we need to cast to access private properties
47 | const migrations = (migrationManager as any).migrations;
48 | expect(migrations.length).toBe(2);
49 | expect(migrations[0].version).toBe(1);
50 | expect(migrations[1].version).toBe(2);
51 | });
52 |
53 | it("should apply migrations in order", async () => {
54 | const migration1: Migration = {
55 | version: 1,
56 | name: "test_migration_1",
57 | description: "Test migration 1",
58 | sql: "CREATE TABLE test1 (id SERIAL PRIMARY KEY);",
59 | };
60 |
61 | const migration2: Migration = {
62 | version: 2,
63 | name: "test_migration_2",
64 | description: "Test migration 2",
65 | sql: "CREATE TABLE test2 (id SERIAL PRIMARY KEY);",
66 | };
67 |
68 | migrationManager.registerMigration(migration1);
69 | migrationManager.registerMigration(migration2);
70 |
71 | const status = await migrationManager.applyMigrations();
72 |
73 | expect(status.ok).toBe(true);
74 | expect(status.currentVersion).toBe(2);
75 | expect(status.availableVersion).toBe(2);
76 | expect(status.pendingCount).toBe(0);
77 |
78 | // Verify tables were created
79 | const result1 = await db.query(
80 | "SELECT table_name FROM information_schema.tables WHERE table_name = 'test1'"
81 | );
82 | const result2 = await db.query(
83 | "SELECT table_name FROM information_schema.tables WHERE table_name = 'test2'"
84 | );
85 |
86 | expect(result1.rows.length).toBe(1);
87 | expect(result2.rows.length).toBe(1);
88 | });
89 |
90 | it("should only apply pending migrations", async () => {
91 | // First migration
92 | const migration1: Migration = {
93 | version: 1,
94 | name: "test_migration_1",
95 | description: "Test migration 1",
96 | sql: "CREATE TABLE test1 (id SERIAL PRIMARY KEY);",
97 | };
98 |
99 | migrationManager.registerMigration(migration1);
100 | await migrationManager.applyMigrations();
101 |
102 | // Second migration
103 | const migration2: Migration = {
104 | version: 2,
105 | name: "test_migration_2",
106 | description: "Test migration 2",
107 | sql: "CREATE TABLE test2 (id SERIAL PRIMARY KEY);",
108 | };
109 |
110 | migrationManager.registerMigration(migration2);
111 | const status = await migrationManager.applyMigrations();
112 |
113 | expect(status.ok).toBe(true);
114 | expect(status.currentVersion).toBe(2);
115 |
116 | // Verify the migrations table has 2 records
117 | const migrationsResult = await db.query<{ version: number }>(
118 | "SELECT * FROM migrations ORDER BY version"
119 | );
120 | expect(migrationsResult.rows.length).toBe(2);
121 | expect(migrationsResult.rows[0].version).toBe(1);
122 | expect(migrationsResult.rows[1].version).toBe(2);
123 | });
124 |
125 | it("should handle errors in migrations", async () => {
126 | const migration1: Migration = {
127 | version: 1,
128 | name: "test_migration_1",
129 | description: "Test migration 1",
130 | sql: "CREATE TABLE test1 (id SERIAL PRIMARY KEY);",
131 | };
132 |
133 | // This migration has invalid SQL
134 | const migration2: Migration = {
135 | version: 2,
136 | name: "invalid_migration",
137 | description: "Invalid SQL migration",
138 | sql: "CREATE TABLE WITH INVALID SYNTAX!!!",
139 | };
140 |
141 | migrationManager.registerMigration(migration1);
142 | migrationManager.registerMigration(migration2);
143 |
144 | const status = await migrationManager.applyMigrations();
145 |
146 | expect(status.ok).toBe(false);
147 | expect(status.currentVersion).toBe(1); // Only the first migration should be applied
148 |
149 | // Verify only the first table exists
150 | const result1 = await db.query(
151 | "SELECT table_name FROM information_schema.tables WHERE table_name = 'test1'"
152 | );
153 | const result2 = await db.query(
154 | "SELECT table_name FROM information_schema.tables WHERE table_name = 'test2'"
155 | );
156 |
157 | expect(result1.rows.length).toBe(1);
158 | expect(result2.rows.length).toBe(0);
159 | });
160 |
161 | it("should handle migrations with out-of-order versions", async () => {
162 | const migration2: Migration = {
163 | version: 2,
164 | name: "test_migration_2",
165 | description: "Test migration 2",
166 | sql: "CREATE TABLE test2 (id SERIAL PRIMARY KEY);",
167 | };
168 |
169 | const migration1: Migration = {
170 | version: 1,
171 | name: "test_migration_1",
172 | description: "Test migration 1",
173 | sql: "CREATE TABLE test1 (id SERIAL PRIMARY KEY);",
174 | };
175 |
176 | // Register in reverse order
177 | migrationManager.registerMigration(migration2);
178 | migrationManager.registerMigration(migration1);
179 |
180 | const status = await migrationManager.applyMigrations();
181 |
182 | expect(status.ok).toBe(true);
183 | expect(status.currentVersion).toBe(2);
184 |
185 | // Verify migrations were applied in correct order
186 | const migrationsResult = await db.query<{ version: number }>(
187 | "SELECT * FROM migrations ORDER BY version"
188 | );
189 | expect(migrationsResult.rows.length).toBe(2);
190 | expect(migrationsResult.rows[0].version).toBe(1);
191 | expect(migrationsResult.rows[1].version).toBe(2);
192 | });
193 | });
194 |
--------------------------------------------------------------------------------
/src/background/pglite/migration-manager.ts:
--------------------------------------------------------------------------------
1 | import { PGlite } from "./HAX_pglite";
2 | import { Transaction } from "@electric-sql/pglite";
3 |
4 | /**
5 | * Simple migration interface for defining database schema changes
6 | * with forward-only migrations
7 | */
8 | export interface Migration {
9 | version: number;
10 | name: string;
11 | description: string;
12 | sql: string; // SQL to execute for this migration
13 | }
14 |
15 | /**
16 | * Migration status
17 | */
18 | export interface MigrationStatus {
19 | ok: boolean;
20 | currentVersion: number;
21 | availableVersion: number;
22 | pendingCount: number;
23 | }
24 |
25 | /**
26 | * A simple, forward-only migration manager for PGlite
27 | */
28 | export class MigrationManager {
29 | private db: PGlite;
30 | private migrations: Migration[] = [];
31 | private currentVersion = 0;
32 | private highestVersion = 0;
33 |
34 | constructor(db: PGlite) {
35 | this.db = db;
36 | }
37 |
38 | /**
39 | * Register a migration with the manager
40 | */
41 | registerMigration(migration: Migration): void {
42 | this.migrations.push(migration);
43 |
44 | // Update highest available version
45 | this.highestVersion = Math.max(this.highestVersion, migration.version);
46 |
47 | // Sort migrations by version
48 | this.migrations.sort((a, b) => a.version - b.version);
49 | }
50 |
51 | /**
52 | * Check if a table exists
53 | */
54 | private async checkTableExists(tableName: string): Promise<boolean> {
55 | try {
56 | const result = await this.db.query<{ exists: boolean }>(
57 | "SELECT EXISTS (SELECT FROM pg_tables WHERE tablename = $1) as exists",
58 | [tableName]
59 | );
60 | return result.rows[0]?.exists || false;
61 | } catch (error) {
62 | // If this fails, assume table doesn't exist
63 | console.warn(`Error checking if table ${tableName} exists:`, error);
64 | return false;
65 | }
66 | }
67 |
68 | /**
69 | * Get current migration version from the database
70 | */
71 | private async getCurrentVersion(): Promise<number> {
72 | try {
73 | const migrationsTableExists = await this.checkTableExists('migrations');
74 |
75 | if (!migrationsTableExists) {
76 | return 0; // No migrations applied yet
77 | }
78 |
79 | const result = await this.db.query<{ max_version: number }>(
80 | "SELECT MAX(version) as max_version FROM migrations"
81 | );
82 |
83 | return result.rows[0]?.max_version || 0;
84 | } catch (error) {
85 | console.error("Error getting current migration version:", error);
86 | return 0;
87 | }
88 | }
89 |
90 | /**
91 | * Apply a single migration
92 | */
93 | private async applyMigration(migration: Migration): Promise<boolean> {
94 | try {
95 | console.debug(`Applying migration ${migration.name} (v${migration.version})...`);
96 |
97 | const startTime = performance.now();
98 |
99 | await this.db.transaction(async (tx) => {
100 | // Execute migration SQL
101 | await tx.exec(migration.sql);
102 |
103 | // Record migration in the migrations table
104 | await tx.query(
105 | "INSERT INTO migrations (version, name, description, applied_at) VALUES ($1, $2, $3, $4)",
106 | [migration.version, migration.name, migration.description, Date.now()]
107 | );
108 | });
109 |
110 | const duration = Math.round(performance.now() - startTime);
111 | console.debug(`Migration ${migration.name} (v${migration.version}) applied successfully in ${duration}ms`);
112 |
113 | return true;
114 | } catch (error) {
115 | console.error(`Error applying migration ${migration.name} (v${migration.version}):`, error);
116 | return false;
117 | }
118 | }
119 |
120 | /**
121 | * Apply all pending migrations
122 | */
123 | async applyMigrations(): Promise<MigrationStatus> {
124 | try {
125 | // Ensure migrations table exists
126 | const migrationsTableExists = await this.checkTableExists('migrations');
127 |
128 | if (!migrationsTableExists) {
129 | // Create migrations table if it doesn't exist
130 | await this.db.exec(`
131 | CREATE TABLE IF NOT EXISTS migrations (
132 | id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
133 | version INTEGER UNIQUE NOT NULL,
134 | name TEXT NOT NULL,
135 | description TEXT,
136 | applied_at BIGINT NOT NULL
137 | );
138 | `);
139 | }
140 |
141 | // Check if the migrations table has the required columns
142 | try {
143 | await this.db.query("SELECT name FROM migrations LIMIT 0");
144 | } catch (error) {
145 | console.warn("Migrations table exists but may be missing columns. Attempting to upgrade schema...");
146 | // Add missing columns if they don't exist
147 | try {
148 | await this.db.exec("ALTER TABLE migrations ADD COLUMN IF NOT EXISTS name TEXT NOT NULL DEFAULT 'legacy_migration'");
149 | await this.db.exec("ALTER TABLE migrations ADD COLUMN IF NOT EXISTS description TEXT");
150 | console.debug("Successfully upgraded migrations table schema");
151 | } catch (alterError) {
152 | console.error("Failed to alter migrations table:", alterError);
153 | throw alterError;
154 | }
155 | }
156 |
157 | // Get current version
158 | this.currentVersion = await this.getCurrentVersion();
159 | console.debug(`Current migration version: ${this.currentVersion}`);
160 |
161 | // Find pending migrations
162 | const pendingMigrations = this.migrations.filter(m => m.version > this.currentVersion);
163 | console.debug(`Found ${pendingMigrations.length} pending migrations`);
164 |
165 | if (pendingMigrations.length === 0) {
166 | return {
167 | ok: true,
168 | currentVersion: this.currentVersion,
169 | availableVersion: this.highestVersion,
170 | pendingCount: 0
171 | };
172 | }
173 |
174 | // Apply migrations in order
175 | for (const migration of pendingMigrations) {
176 | const success = await this.applyMigration(migration);
177 |
178 | if (!success) {
179 | return {
180 | ok: false,
181 | currentVersion: this.currentVersion,
182 | availableVersion: this.highestVersion,
183 | pendingCount: pendingMigrations.length
184 | };
185 | }
186 |
187 | this.currentVersion = migration.version;
188 | }
189 |
190 | return {
191 | ok: true,
192 | currentVersion: this.currentVersion,
193 | availableVersion: this.highestVersion,
194 | pendingCount: 0
195 | };
196 | } catch (error) {
197 | console.error("Error applying migrations:", error);
198 |
199 | return {
200 | ok: false,
201 | currentVersion: this.currentVersion,
202 | availableVersion: this.highestVersion,
203 | pendingCount: this.migrations.filter(m => m.version > this.currentVersion).length
204 | };
205 | }
206 | }
207 | }
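
Not part of the file above — a minimal usage sketch, assuming callers construct the manager directly (import paths mirror this repo's layout; the no-arg `PGlite()` constructor is used here just to keep the sketch self-contained):

```ts
import { PGlite } from "./HAX_pglite";
import { MigrationManager } from "./migration-manager";
import { migration as initialSchema } from "./migrations/001_init";

// Register every known migration, then apply whatever is pending.
// applyMigrations() skips versions already recorded in the migrations
// table, so running it on every startup is safe.
const db = new PGlite(); // no-arg constructor gives an in-memory database
const manager = new MigrationManager(db);
manager.registerMigration(initialSchema);

const status = await manager.applyMigrations();
if (!status.ok) {
  console.error(`Migrations halted at v${status.currentVersion}`, status);
}
```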
--------------------------------------------------------------------------------
/src/background/pglite/migrations/001_init.ts:
--------------------------------------------------------------------------------
1 | import { Migration } from '../migration-manager';
2 |
3 | export const migration: Migration = {
4 | version: 1,
5 | name: 'initial_schema',
6 | description: 'Initial schema creation with base tables for documents and search',
7 | sql: `
8 | -- make sure pgvector is enabled
9 | CREATE EXTENSION IF NOT EXISTS vector;
10 | CREATE EXTENSION IF NOT EXISTS pg_trgm;
11 |
12 | CREATE TABLE IF NOT EXISTS document (
13 | id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
14 | title TEXT,
15 | url TEXT UNIQUE NOT NULL,
16 | excerpt TEXT,
17 | md_content TEXT,
18 | md_content_hash TEXT,
19 | publication_date BIGINT,
20 | hostname TEXT,
21 | last_visit BIGINT,
22 | last_visit_date TEXT,
23 | extractor TEXT,
24 | created_at BIGINT NOT NULL DEFAULT EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) * 1000,
25 | updated_at BIGINT
26 | );
27 |
28 | CREATE INDEX IF NOT EXISTS document_hostname ON document (hostname);
29 |
30 | CREATE TABLE IF NOT EXISTS document_fragment (
31 | id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
32 | entity_id BIGINT NOT NULL REFERENCES document (id) ON DELETE CASCADE,
33 | attribute TEXT,
34 | value TEXT,
35 | fragment_order INTEGER,
36 | created_at BIGINT NOT NULL DEFAULT EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) * 1000,
37 | search_vector tsvector,
38 | content_vector vector(384)
39 | );
40 |
41 | CREATE OR REPLACE FUNCTION update_document_fragment_fts() RETURNS TRIGGER AS $$
42 | BEGIN
43 | NEW.search_vector := to_tsvector('simple', NEW.value);
44 | RETURN NEW;
45 | END;
46 | $$ LANGUAGE plpgsql;
47 |
48 | -- Trigger to update search vector
49 | DROP TRIGGER IF EXISTS update_document_fragment_fts_trigger ON document_fragment;
50 | CREATE TRIGGER update_document_fragment_fts_trigger
51 | BEFORE INSERT OR UPDATE ON document_fragment
52 | FOR EACH ROW EXECUTE FUNCTION update_document_fragment_fts();
53 |
54 | -- Index for full-text search
55 | CREATE INDEX IF NOT EXISTS idx_document_fragment_search_vector ON document_fragment USING GIN(search_vector);
56 |
57 | -- Index for trigram similarity search, i.e. postgres trigram
58 | -- NOTE: Disabled for now. Takes up a significant amount of space and not yet proven useful for this project
59 | --CREATE INDEX IF NOT EXISTS trgm_idx_document_fragment_value ON document_fragment USING GIN(value gin_trgm_ops);
60 |
61 | CREATE TABLE IF NOT EXISTS blacklist_rule (
62 | id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
63 | pattern TEXT UNIQUE NOT NULL,
64 | level TEXT NOT NULL CHECK (level IN ('no_index', 'url_only')),
65 | created_at BIGINT NOT NULL DEFAULT EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) * 1000
66 | );
67 |
68 | CREATE INDEX IF NOT EXISTS idx_blacklist_rule_pattern ON blacklist_rule (pattern);
69 |
70 | CREATE TABLE IF NOT EXISTS migrations (
71 | id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
72 | version INTEGER UNIQUE NOT NULL,
73 | name TEXT NOT NULL,
74 | description TEXT,
75 | applied_at BIGINT NOT NULL
76 | );
77 | `
78 | };
79 |
80 | // For backward compatibility with existing code
81 | export const sql = migration.sql;
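
The trigger defined above keeps `search_vector` in sync on every insert or update, so consumers only ever write plain text and query with tsquery operators. A hedged sketch of a search against this schema — `db` is assumed to be the extension's PGlite handle, and the `ts_rank`/`websearch_to_tsquery` choices are illustrative, not taken from this repo:

```ts
// Full-text search over indexed fragments, joined back to their documents.
// 'simple' matches the config used by update_document_fragment_fts().
const results = await db.query<{ url: string; title: string; rank: number }>(
  `SELECT d.url, d.title, ts_rank(df.search_vector, q) AS rank
     FROM document_fragment df
     JOIN document d ON d.id = df.entity_id,
          websearch_to_tsquery('simple', $1) q
    WHERE df.search_vector @@ q
    ORDER BY rank DESC
    LIMIT 20`,
  ["full text tabs"]
);
```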
--------------------------------------------------------------------------------
/src/background/pglite/tasks.ts:
--------------------------------------------------------------------------------
1 | import type { Transaction } from "@electric-sql/pglite";
2 | import { z } from "zod";
3 | import { createEmbedding } from "../embedding/pipeline";
4 | import { getArticleFragments, segment } from "../../common/utils";
5 |
6 | /**
7 | * A helper for type inference.
8 | */
9 | function createTask<T extends z.AnyZodObject | undefined = undefined>({
10 | params = z.object({}),
11 | handler,
12 | }: {
13 | params?: T;
14 | handler: (
15 | tx: Transaction,
16 | params: T extends z.AnyZodObject ? z.infer<T> : undefined
17 | ) => Promise<void>;
18 | }) {
19 | return { params, handler } as const;
20 | }
21 |
22 | export type TaskDefinition = ReturnType<
23 | typeof createTask
24 | >;
25 |
26 | export const generate_vector = createTask({
27 | params: z.object({
28 | fragment_id: z.number(),
29 | }),
30 | handler: async (tx, params) => {
31 | const result = await tx.query<{ value: string }>(
32 | "SELECT value FROM document_fragment WHERE id = $1",
33 | [params.fragment_id]
34 | );
35 | const embedding = await createEmbedding(result.rows[0].value);
36 | await tx.query("UPDATE document_fragment SET content_vector = $1 WHERE id = $2", [
37 | JSON.stringify(embedding),
38 | params.fragment_id,
39 | ]);
40 | },
41 | });
42 |
43 | export const generate_fragments = createTask({
44 | params: z.object({
45 | document_id: z.number(),
46 | }),
47 | handler: async (tx, params) => {
48 | const document = await tx.query<{
49 | id: number;
50 | title: string;
51 | url: string;
52 | excerpt: string;
53 | md_content: string;
54 | }>("SELECT * FROM document WHERE id = $1", [params.document_id]);
55 | const row = document.rows[0];
56 |
57 | if (!row) {
58 | throw new Error("Document not found");
59 | }
60 |
61 | const fragments = getArticleFragments(row.md_content || "");
62 |
63 | const sql = `
64 | INSERT INTO document_fragment (
65 | entity_id,
66 | attribute,
67 | value,
68 | fragment_order
69 | ) VALUES ($1, $2, $3, $4)
70 | ON CONFLICT DO NOTHING;
71 | `;
72 |
73 | let triples: [e: number, a: string, v: string, o: number][] = [];
74 | if (row.title) triples.push([params.document_id, "title", segment(row.title), 0]);
75 | if (row.excerpt) triples.push([params.document_id, "excerpt", segment(row.excerpt), 0]);
76 | if (row.url) triples.push([params.document_id, "url", row.url, 0]);
77 | triples = triples.concat(
78 | fragments
79 | .filter((x) => x.trim())
80 | .map((fragment, i) => {
81 | return [params.document_id, "content", fragment, i];
82 | })
83 | );
84 |
85 | const logLimit = 5;
86 | console.debug(
87 | `generate_fragments :: triples :: ${triples.length} (${Math.max(0, triples.length - logLimit)} omitted)`,
88 | triples.slice(0, logLimit)
89 | );
90 |
91 | for (const param of triples) {
92 | await tx.query(sql, param);
93 | }
94 | },
95 | });
96 |
97 | export const ping = createTask({
98 | handler: async () => {
99 | console.log("Pong!");
100 | },
101 | });
102 |
103 | export const failing_task = createTask({
104 | handler: async () => {
105 | throw new Error("This task always fails");
106 | },
107 | });
108 |
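
The tasks above are plain `{ params, handler }` pairs; whatever runs them is expected to validate the raw params against the zod schema and hand the handler an open transaction. A minimal, hypothetical runner illustrating that contract (the repo's actual task queue lives elsewhere; `runTask` and its signature are assumptions for the sketch):

```ts
import { PGlite } from "./HAX_pglite";
import * as tasks from "./tasks";

// Hypothetical: look up a task by name, validate its params, and run the
// handler inside a transaction so a throwing task rolls back cleanly.
async function runTask(db: PGlite, name: keyof typeof tasks, rawParams: unknown) {
  const task = tasks[name];
  const params = task.params.parse(rawParams ?? {});
  await db.transaction((tx) => task.handler(tx, params as never));
}

// e.g. await runTask(db, "generate_fragments", { document_id: 42 });
```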
--------------------------------------------------------------------------------
/src/common/logs.ts:
--------------------------------------------------------------------------------
1 | export function log(...args: string[]) {
2 | console.log(...args);
3 | }
4 |
--------------------------------------------------------------------------------
/src/common/utils.test.ts:
--------------------------------------------------------------------------------
1 | import { describe, it, expect } from "bun:test";
2 |
3 | import { getArticleFragments, segment, sanitizeHtmlAllowMark } from "./utils";
4 |
5 | describe("getArticleFragments", () => {
6 | it("should handle longform, multi-paragraph text", () => {
7 | const longText = `# Introduction
8 |
9 | Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
10 |
11 | ## Section 1
12 |
13 | Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
14 |
15 | ### Subsection 1.1
16 |
17 | Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.`;
18 |
19 | const fragments = getArticleFragments(longText);
20 | expect(fragments.length).toBeGreaterThan(1);
21 | expect(fragments[0]).toBe(
22 | "# Introduction Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
23 | );
24 | expect(fragments[1]).toBe(
25 | "## Section 1 Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
26 | );
27 | expect(fragments[2]).toBe(
28 | "### Subsection 1.1 Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo."
29 | );
30 | });
31 |
32 | it("should handle short text below minFragmentLength", () => {
33 | const shortText = "This is a short text.";
34 | const fragments = getArticleFragments(shortText);
35 | expect(fragments).toHaveLength(1);
36 | expect(fragments[0]).toBe(shortText);
37 | });
38 |
39 | it("should handle empty input", () => {
40 | const fragments = getArticleFragments("");
41 | expect(fragments).toHaveLength(0);
42 | });
43 |
44 | it("should handle input with only headings", () => {
45 | const headingsOnly = `# Heading 1
46 | ## Heading 2
47 | ### Heading 3`;
48 | const fragments = getArticleFragments(headingsOnly);
49 | expect(fragments).toHaveLength(3);
50 | expect(fragments[0]).toBe("# Heading 1");
51 | expect(fragments[1]).toBe("## Heading 2");
52 | expect(fragments[2]).toBe("### Heading 3");
53 | });
54 |
55 | it("should handle input with very long paragraphs", () => {
56 | const longParagraph = "Lorem ipsum ".repeat(100);
57 | const fragments = getArticleFragments(longParagraph);
58 | expect(fragments.length).toBe(1);
59 | expect(fragments[0].length).toBeGreaterThan(100);
60 | });
61 |
62 | it("should respect custom minFragmentLength", () => {
63 | const text = `Short para 1.
64 |
65 | Slightly longer paragraph 2.
66 |
67 | Even longer paragraph 3 with more content.`;
68 |
69 | const fragments = getArticleFragments(text);
70 | expect(fragments[0]).toBe(
71 | "Short para 1. Slightly longer paragraph 2. Even longer paragraph 3 with more content."
72 | );
73 | });
74 | });
75 |
76 | describe("getArticleFragments with plain text", () => {
77 | it("should handle a single long paragraph", () => {
78 | const text =
79 | "This is a long paragraph that should be treated as a single fragment. It contains multiple sentences and goes on for a while to ensure it exceeds the minimum fragment length of 100 characters. The content is not particularly meaningful, but it serves the purpose of this test case.";
80 | const fragments = getArticleFragments(text);
81 | expect(fragments).toHaveLength(1);
82 | expect(fragments[0]).toBe(text);
83 | });
84 |
85 | it("should split long text into multiple fragments", () => {
86 | const text =
87 | "First paragraph that is long enough to be its own fragment. It contains multiple sentences to exceed the minimum length of 100 characters.\n\nSecond paragraph that is also long enough to be a separate fragment. It also has multiple sentences and exceeds 100 characters.\n\nThird paragraph, again long enough to be distinct and over 100 characters in length.";
88 | const fragments = getArticleFragments(text);
89 | expect(fragments).toHaveLength(3);
90 | expect(fragments[0]).toContain("First paragraph");
91 | expect(fragments[1]).toContain("Second paragraph");
92 | expect(fragments[2]).toContain("Third paragraph");
93 | });
94 |
95 | it("should combine short paragraphs", () => {
96 | const text =
97 | "Short para 1.\n\nAnother short one.\n\nYet another.\n\nStill short.\n\nNeed more text to reach 100 characters. This should do it, creating a single fragment.";
98 | const fragments = getArticleFragments(text);
99 | expect(fragments).toHaveLength(1);
100 | expect(fragments[0]).toContain("Short para 1.");
101 | expect(fragments[0]).toContain("Need more text to reach 100 characters.");
102 | });
103 |
104 | it("should handle text with varying paragraph lengths", () => {
105 | const text =
106 | "Short intro.\n\nThis is a much longer paragraph that should be its own fragment because it exceeds the minimum length of 100 characters. It contains multiple sentences to ensure it's long enough.\n\nAnother short paragraph.\n\nYet another long paragraph that should be separate. It also contains multiple sentences and exceeds the minimum length of 100 characters to be its own fragment.";
107 | const fragments = getArticleFragments(text);
108 | expect(fragments).toHaveLength(2);
109 | expect(fragments[0]).toContain("This is a much longer paragraph");
110 | expect(fragments[1]).toContain("Yet another long paragraph");
111 | });
112 |
113 | it("should handle text with line breaks but no paragraphs", () => {
114 | const text =
115 | "This is a text\nwith line breaks\nbut no paragraph\nbreaks. It should\nbe treated as one\nfragment. We need to add more text to ensure it exceeds 100 characters and becomes a valid fragment.";
116 | const fragments = getArticleFragments(text);
117 | expect(fragments).toHaveLength(1);
118 | expect(fragments[0]).toBe(
119 | "This is a text with line breaks but no paragraph breaks. It should be treated as one fragment. We need to add more text to ensure it exceeds 100 characters and becomes a valid fragment."
120 | );
121 | });
122 | });
123 |
124 | describe("segment", () => {
125 | it("should not affect normal English text", () => {
126 | const text = "This is a normal English sentence.";
127 | expect(segment(text)).toBe(text);
128 | });
129 |
130 | it("should handle empty string", () => {
131 | expect(segment("")).toBe("");
132 | });
133 |
134 | it("should handle text with numbers and punctuation", () => {
135 | const text = "Hello, world! This is test #123.";
136 | expect(segment(text)).toBe(text);
137 | });
138 |
139 | it("should segment text with non-Latin characters", () => {
140 | const text = "こんにちは世界";
141 | const segmented = segment(text);
142 | expect(segmented).toBe("こんにちは 世界");
143 | });
144 |
145 | it("should handle mixed Latin and non-Latin text", () => {
146 | const text = "Hello こんにちは world 世界";
147 | const segmented = segment(text);
148 | expect(segmented).toBe("Hello こんにちは world 世界");
149 | });
150 |
151 | it("should handle mixed Latin and Mandarin Chinese text", () => {
152 | const text = "Hello 你好世界我是一个人工智能助手 world 这是一个测试";
153 | const segmented = segment(text);
154 | expect(segmented).toBe("Hello 你好 世界 我是 一个 人工 智能 助手 world 这 是 一个 测试");
155 | });
156 |
157 | it("should handle chinese with punctuation", () => {
158 | const text =
159 | "你好,世界!这是一个测试句子,用于检查中文文本的分段功能。我们希望确保即使在有标点符号的情况下,文本也能正确分段。";
160 | const segmented = segment(text);
161 | expect(segmented).toBe(
162 | "你好 , 世界 ! 这 是 一个 测试 句子 , 用于 检查 中文 文本 的 分段 功能 。 我们 希望 确保 即使 在 有 标点 符号 的 情况 下 , 文本 也能 正确 分段 。"
163 | );
164 | });
165 | });
166 |
167 | describe("sanitizeHtmlAllowMark", () => {
168 | it("should preserve mark tags while removing all other HTML tags", () => {
169 | const html = '<div>Text with <mark>highlighted</mark> and <b>bold</b> and <i>italic</i> parts</div>';
170 | const sanitized = sanitizeHtmlAllowMark(html);
171 | expect(sanitized).toBe('Text with <mark>highlighted</mark> and bold and italic parts');
172 | });
173 |
174 | it("should strip attributes from mark tags", () => {
175 | const html = 'Text with <mark class="highlight" style="color: red">attributes</mark>';
176 | const sanitized = sanitizeHtmlAllowMark(html);
177 | expect(sanitized).toBe('Text with <mark>attributes</mark>');
178 | });
179 |
180 | it("should handle empty input", () => {
181 | expect(sanitizeHtmlAllowMark("")).toBe("");
182 | expect(sanitizeHtmlAllowMark(null as any)).toBe("");
183 | expect(sanitizeHtmlAllowMark(undefined as any)).toBe("");
184 | });
185 |
186 | it("should remove script tags and their content", () => {
187 | const html = 'Text with <script>alert("xss");</script>scripts';
188 | const sanitized = sanitizeHtmlAllowMark(html);
189 | expect(sanitized).toBe('Text with scripts');
190 | });
191 |
192 | it("should remove style tags and their content", () => {
193 | const html = 'Text with <style>.danger { color: red; }</style>styles';
194 | const sanitized = sanitizeHtmlAllowMark(html);
195 | expect(sanitized).toBe('Text with styles');
196 | });
197 |
198 | it("should handle complex nested HTML while preserving mark tags", () => {
199 | const html = `
200 | <div class="container">
201 | <h1>Title</h1>
202 | <p>Paragraph with <mark>highlighted</mark> text and <script>alert("danger")</script>dangerous content</p>