13 |
14 | Values within documents can have the following data types:
15 |
16 | - String
17 | - Integer
18 | - Float
19 | - List
20 | - Dictionary
21 |
22 | When a document is added, it is given a `uuid` key that uniquely identifies the document.
23 |
24 |
25 | Dictionaries are not indexable. You can store dictionaries and they will be returned in payloads, but you cannot run search operations on them.
26 |
--------------------------------------------------------------------------------
/docs/pages/templates/update.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Update a Document
4 | permalink: /update/
5 | ---
6 |
7 |
8 | You need a document UUID to update a document. You can retrieve a UUID by searching for a document.
9 |
10 | Here is an example showing how to update a document:
11 |
12 | ```python
13 | response = index.search(
14 | {
15 | "query": {"title": {"equals": "tolerate it"}},
16 | "limit": 10,
17 | "sort_by": "title",
18 | }
19 | )
20 |
21 | uuid = response["documents"][0]["uuid"]
22 |
23 | index.update(uuid, {"title": "tolerate it (folklore)", "artist": "Taylor Swift"})
24 | ```
25 |
26 | `update` is an override operation. This means you must provide the full document that you want to save, instead of only the fields you want to update.
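
The difference between an override and a partial update can be sketched with a plain dictionary standing in for the document store. This is an illustration only: `store`, `override_update`, and `merge_update` are hypothetical names and are not part of JameSQL.

```python
# Hypothetical in-memory store keyed by UUID; not JameSQL's implementation.
store = {"abc": {"title": "tolerate it", "artist": "Taylor Swift"}}


def override_update(uuid, document):
    """Replace the stored document wholesale, as an override operation does."""
    store[uuid] = dict(document)


def merge_update(uuid, fields):
    """Partial update, shown only for contrast: untouched fields survive."""
    store[uuid] = {**store[uuid], **fields}


# Overriding with a partial document drops every field that was left out.
override_update("abc", {"title": "tolerate it (folklore)"})
artist_was_dropped = "artist" not in store["abc"]
```

Under override semantics, the `artist` field has to be sent again with every update, which is why the example above passes both `title` and `artist`.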
27 |
--------------------------------------------------------------------------------
/docs/pages/templates/highlight.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Highlight Results
4 | permalink: /highlight/
5 | ---
6 |
7 | You can extract context around results. This data can be used to show a snippet of the document that contains the query term.
8 |
9 | Here is an example of a query that highlights context around all instances of the term "sky" in the `lyric` field:
10 |
11 | ```python
12 | query = {
13 | "query": {
14 | "lyric": {
15 | "contains": "sky",
16 | "highlight": True,
17 | "highlight_stride": 3
18 | }
19 | }
20 | }
21 | ```
22 |
23 | `highlight_stride` sets how many words to retrieve before and after each match.
24 |
25 | All documents returned by this query will have a `_context` key that contains the context around all instances of the term "sky".
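
To make the effect of `highlight_stride` concrete, here is a minimal sketch of how a word window around each match could be extracted. The `highlight` function is hypothetical, and its tokenization (whitespace splitting, case-insensitive comparison, punctuation stripping) is an assumption rather than JameSQL's actual behavior.

```python
def highlight(text, term, stride=3):
    """Return `stride` words of context on each side of every match of `term`."""
    words = text.split()
    contexts = []
    for i, word in enumerate(words):
        # Assumption: compare case-insensitively and ignore edge punctuation.
        if word.strip(".,!?;:'\"").lower() == term.lower():
            start = max(0, i - stride)
            end = min(len(words), i + stride + 1)
            contexts.append(" ".join(words[start:end]))
    return contexts


snippets = highlight("I looked up at the sky and it was maroon", "sky", stride=3)
```

With a stride of 3, the single match yields one snippet made of the three words before `sky`, the term itself, and the three words after it.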
--------------------------------------------------------------------------------
/docs/pages/templates/code-search.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Code Search
4 | permalink: /code-search/
5 | ---
6 |
7 | You can use JameSQL to efficiently search through code.
8 |
9 | To do so, first create a `TRIGRAM_CODE` index on the field you want to search.
10 |
11 | When you add documents, include at least the following two fields:
12 |
13 | - `file_name`: The name of the file the code is in.
14 | - `code`: The code you want to index.
15 |
16 | When you search for code, all matching documents will have a `_context` key with the following structure:
17 |
18 |
24 |
25 | This tells you on what line your search matched, and the code that matched. This information is ideal to highlight specific lines relevant to your query.
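
The idea behind a trigram index can be sketched in plain Python: every three-character substring of a line points back to the file and line it came from, and a query matches where all of its trigrams intersect. This is a sketch under stated assumptions, not the `TRIGRAM_CODE` implementation. `build_index`, `search`, the zero-based line numbers, and the keys of the result dictionaries are all hypothetical.

```python
from collections import defaultdict


def get_trigrams(text):
    """Return every three-character substring of `text`."""
    return [text[i : i + 3] for i in range(len(text) - 2)]


def build_index(files):
    """Map each trigram to the set of (file_name, line_number) pairs containing it."""
    index = defaultdict(set)
    for file_name, code in files.items():
        for line_num, line in enumerate(code.split("\n")):
            for trigram in get_trigrams(line):
                index[trigram].add((file_name, line_num))
    return index


def search(index, files, query):
    trigrams = get_trigrams(query)
    if not trigrams:
        return []  # queries shorter than three characters have no trigrams
    candidates = set.intersection(*(index.get(t, set()) for t in trigrams))
    results = []
    for file_name, line_num in sorted(candidates):
        line = files[file_name].split("\n")[line_num]
        # Sharing every trigram does not guarantee a substring match, so verify.
        if query in line:
            results.append({"file_name": file_name, "line": line_num, "code": line})
    return results


files = {"simplifier.py": "def simplifier(terms):\n    new_terms = []\n    return new_terms"}
index = build_index(files)
matches = search(index, files, "new_terms")
```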
--------------------------------------------------------------------------------
/docs/pages/templates/aggregate-metrics.md:
--------------------------------------------------------------------------------
1 |
2 | You can find the total number of unique values for the fields returned by a query using an `aggregate` query. This is useful for presenting the total number of options available in a search space to a user.
3 |
4 | You can use the following query to find the total number of unique values for each field across the documents whose `lyric` field contains the term "sky":
5 |
6 |
--------------------------------------------------------------------------------
/docs/pages/templates/delete.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Delete a Document
4 | permalink: /delete/
5 | ---
6 |
7 | You need a document UUID to delete a document. You can retrieve a UUID by searching for a document.
8 |
9 | Here is an example showing how to delete a document:
10 |
11 |
--------------------------------------------------------------------------------
/.github/workflows/test.yml:
--------------------------------------------------------------------------------
1 | name: JameSQL Test Workflow (macOS and Ubuntu)
2 |
3 | on:
4 | pull_request:
5 | branches: [main]
6 | push:
7 | branches: [main]
8 |
9 | jobs:
10 | build-dev-test:
11 | runs-on: ${{ matrix.os }}
12 | strategy:
13 | matrix:
14 | os: ["ubuntu-latest", "macos-latest"]
15 | python-version: ["3.10", "3.11", "3.12", "3.13"]
16 | steps:
17 | - name: 🛎️ Checkout
18 | uses: actions/checkout@v4
19 | - name: 🐍 Set up Python ${{ matrix.python-version }}
20 | uses: actions/setup-python@v5
21 | with:
22 | python-version: ${{ matrix.python-version }}
23 | check-latest: true
24 |
25 | - name: 📦 Install dependencies
26 | run: |
27 | python -m pip install --upgrade pip
28 | pip install -e .
29 | pip install -e .[dev]
30 | - name: 🧪 Test
31 | run: |
32 | python -m pytest tests/*.py
33 |
--------------------------------------------------------------------------------
/.github/workflows/release.yml:
--------------------------------------------------------------------------------
1 | name: Publish Workflow
2 |
3 | on:
4 | release:
5 | types: [created]
6 |
7 | jobs:
8 | build:
9 | runs-on: ubuntu-latest
10 | strategy:
11 | matrix:
12 | python-version: [3.12]
13 | steps:
14 | - name: 🛎️ Checkout
15 | uses: actions/checkout@v4
16 | with:
17 | ref: ${{ github.head_ref }}
18 | - name: 🐍 Set up Python ${{ matrix.python-version }}
19 | uses: actions/setup-python@v5
20 | with:
21 | python-version: ${{ matrix.python-version }}
22 | - name: 🦾 Install dependencies
23 | run: |
24 | python -m pip install --upgrade pip twine wheel
25 | - name: 🚀 Publish to PyPi
26 | env:
27 | PYPI_USERNAME: ${{ secrets.PYPI_USERNAME }}
28 | PYPI_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
29 | run: |
30 | python setup.py sdist bdist_wheel
31 | twine check dist/*
32 | twine upload dist/* -u ${PYPI_USERNAME} -p ${PYPI_PASSWORD} --verbose
33 |
--------------------------------------------------------------------------------
/.github/workflows/benchmark.yml:
--------------------------------------------------------------------------------
1 | name: JameSQL Benchmark Workflow
2 |
3 | on:
4 | pull_request:
5 | branches: [main]
6 | push:
7 | branches: [main]
8 |
9 | jobs:
10 | build-dev-test:
11 | runs-on: ${{ matrix.os }}
12 | strategy:
13 | matrix:
14 | os: ["ubuntu-latest", "macos-latest"]
15 | python-version: ["3.10", "3.11", "3.12", "3.13"]
16 | steps:
17 | - name: 🛎️ Checkout
18 | uses: actions/checkout@v4
19 | - name: 🐍 Set up Python ${{ matrix.python-version }}
20 | uses: actions/setup-python@v5
21 | with:
22 | python-version: ${{ matrix.python-version }}
23 | check-latest: true
24 |
25 | - name: 📦 Install dependencies
26 | run: |
27 | python -m pip install --upgrade pip
28 | pip install -e .
29 | pip install -e .[dev]
30 |
31 | - name: 🧪 Run benchmark stress test
32 | env:
33 | SITE_ENV: production
34 | run: |
35 | python -m pytest ./tests/*.py --benchmark
36 | python -m pytest ./tests/*.py --long-benchmark
37 |
38 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 James
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/.github/workflows/windows.yml:
--------------------------------------------------------------------------------
1 | name: JameSQL Test Workflow (Windows)
2 |
3 | on:
4 | pull_request:
5 | branches: [main]
6 | push:
7 | branches: [main]
8 |
9 | jobs:
10 | build-dev-test:
11 | runs-on: ${{ matrix.os }}
12 | strategy:
13 | matrix:
14 | os: ["windows-latest"]
15 | python-version: ["3.10", "3.11", "3.12", "3.13"]
16 | steps:
17 | - name: 🛎️ Checkout
18 | uses: actions/checkout@v4
19 | - name: 🐍 Set up Python ${{ matrix.python-version }}
20 | uses: actions/setup-python@v5
21 | with:
22 | python-version: ${{ matrix.python-version }}
23 | check-latest: true
24 |
25 | - name: 📦 Install dependencies
26 | run: |
27 | python -m pip install --upgrade pip
28 | pip install -e .
29 | pip install -e .[dev]
30 | - name: 🧪 Test
31 | run: |
32 | python -m pytest tests/aggregation.py tests/data_types.py tests/group_by.py tests/highlight.py tests/range_queries.py tests/save_and_load.py tests/script_lang.py tests/string_queries_categorical_and_range.py tests/string_query.py tests/test.py
33 |
--------------------------------------------------------------------------------
/.github/workflows/documentation.yml:
--------------------------------------------------------------------------------
1 | name: Publish Documentation
2 |
3 | on:
4 | push:
5 | branches:
6 | - main
7 |
8 | jobs:
9 | build:
10 | runs-on: ubuntu-latest
11 |
12 | steps:
13 | - name: Checkout code
14 | uses: actions/checkout@v4
15 |
16 | - name: Set up Python
17 | uses: actions/setup-python@v5
18 | with:
19 | python-version: '3.13'
20 | check-latest: true
21 |
22 | - name: Install dependencies
23 | run: |
24 | python -m pip install --upgrade pip
25 | python -m pip install pygments bs4 lxml
26 | python -m pip install git+https://github.com/capjamesg/aurora
27 | cd docs
28 | - name: Build main site
29 | env:
30 | SITE_ENV: ${{ secrets.SITE_ENV }}
31 | run: |
32 | cd docs
33 | aurora build
34 | - name: rsync deployments
35 | uses: burnett01/rsync-deployments@7.0.1
36 | with:
37 | switches: -avzr
38 | path: "docs/_site/*"
39 | remote_path: ${{ secrets.REMOTE_PATH }}
40 | remote_host: ${{ secrets.SERVER_HOST }}
41 | remote_user: ${{ secrets.SERVER_USERNAME }}
42 | remote_key: ${{ secrets.KEY }}
43 |
--------------------------------------------------------------------------------
/tests/concurrency.py:
--------------------------------------------------------------------------------
1 | import json
2 | import random
3 | import threading
4 |
5 | from jamesql import JameSQL
6 | from jamesql.index import GSI_INDEX_STRATEGIES
7 |
8 |
9 | def test_threading():
10 | with open("tests/fixtures/documents.json") as f:
11 | documents = json.load(f)
12 |
13 | index = JameSQL()
14 |
15 | index.create_gsi("title", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
16 | index.create_gsi("lyric", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
17 |
18 | for document in documents * 100:
19 | document = document.copy()
20 | index.add(document, doc_id=str(random.randint(0, 1000000)))
21 |
22 | def query(i):
23 | if i == 0:
24 | document = documents[0].copy()
25 | document["title"] = "teal"
26 | index.add(document, "xyz")
27 | index.create_gsi("title", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
28 |
29 | assert len(index.string_query_search("teal")["documents"]) == 1
30 |
31 | threads = []
32 |
33 | for i in range(2500):
34 | t = threading.Thread(target=query, args=(i,))
35 | threads.append(t)
36 | t.start()
37 |
38 | for t in threads:
39 | t.join()
40 |
41 | assert len(index.global_index) == 301
42 | assert index.global_index["xyz"]["title"] == "teal"
43 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 | from setuptools import find_packages
3 | import re
4 |
5 | with open("./jamesql/__init__.py", 'r') as f:
6 | content = f.read()
7 | version = re.search(r'__version__\s*=\s*[\'"]([^\'"]*)[\'"]', content).group(1)
8 |
9 | with open("README.md", "r") as fh:
10 | long_description = fh.read()
11 |
12 | setuptools.setup(
13 | name="jamesql",
14 | version=version,
15 | author="capjamesg",
16 | author_email="jamesg@jamesg.blog",
17 | description="A JameSQL database implemented in Python.",
18 | long_description=long_description,
19 | long_description_content_type="text/markdown",
20 | url="https://github.com/capjamesg/jamesql",
21 | install_requires=[
22 | "pybmoore",
23 | "pygtrie",
24 | "lark",
25 | "btrees",
26 | "nltk",
27 | "sortedcontainers"
28 | ],
29 | packages=find_packages(exclude=("tests",)),
30 | extras_require={
31 | "dev": ["flake8", "black==22.3.0", "isort", "twine", "pytest", "wheel", "flask", "orjson", "tqdm", "deepdiff"],
32 | },
33 | classifiers=[
34 | "Programming Language :: Python :: 3",
35 | "License :: OSI Approved :: MIT License",
36 | "Operating System :: OS Independent",
37 | ],
38 | python_requires=">=3.7",
39 | )
40 |
--------------------------------------------------------------------------------
/web/landing.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | Document
7 |
32 |
33 |
34 |
35 |
JameSQL
36 |
Fast, in-memory search.
37 |
38 |
39 |
--------------------------------------------------------------------------------
/jamesql/query_simplifier.py:
--------------------------------------------------------------------------------
1 | def normalize_operator_query(t):
2 | if isinstance(t, str):
3 | return t
4 |
5 | return "_".join(t)
6 |
7 |
8 | def simplifier(terms):
9 | new_terms = []
10 | outer_terms = set()
11 | to_remove = set()
12 |
13 | for i, t in enumerate(terms):
14 | if isinstance(t, str) and t not in outer_terms:
15 | outer_terms.add(t)
16 | new_terms.append(t)
17 |
18 | for i, t in enumerate(terms):
19 | normalized_terms = normalize_operator_query(t)
20 | if isinstance(t, list) and t[1] == "OR":
21 | for inner_term in t:
22 | if inner_term == "OR":
23 | continue
24 |
25 | if inner_term not in outer_terms:
26 | outer_terms.add(inner_term)
27 | new_terms.append(inner_term)
28 | elif (
29 | isinstance(t, list)
30 | and t[1] == "AND"
31 | and normalized_terms not in outer_terms
32 | ):
33 | new_terms.append(t[0])
34 | new_terms.append("AND")
35 | new_terms.append(t[2])
36 | elif (
37 | isinstance(t, list)
38 | and t[0] == "NOT"
39 | and normalized_terms not in outer_terms
40 | ):
41 | new_terms.append("-" + t[1])
42 |
43 | if t[1] in outer_terms:
44 | to_remove.add(t[1])
45 | to_remove.add("-" + t[1])
46 |
47 | return [i for i in new_terms if normalize_operator_query(i) not in to_remove]
48 |
--------------------------------------------------------------------------------
/docs/pages/templates/quickstart.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Quickstart
4 | permalink: /quickstart/
5 | ---
6 |
7 |
You can create a JameSQL database in five lines of code.
8 |
9 |
## Install JameSQL
10 |
11 | First, install JameSQL:
12 |
13 |
14 | pip install jamesql
15 |
16 |
17 |
## Insert Records
18 |
19 | Then, create a new Python file and add the following code:
20 |
21 |
22 | from jamesql import JameSQL, GSI_INDEX_STRATEGIES
23 |
24 | index = JameSQL.load()
25 |
26 | index.add({"title": "tolerate it", "lyric": "Use my best colors for your portrait"})
27 |
28 |
29 |
## Create an Index
30 |
31 | To retrieve longer pieces of text in the `lyric` key efficiently, we are going to use the `CONTAINS` index type. This creates an inverted index that maps each word in the text to the documents that contain it.
32 |
33 |
50 | {"documents": [{"title": "tolerate it", "lyric": "Use my best colors for your portrait" …}]}
51 |
52 |
53 | We have successfully built a database!
54 |
55 |
--------------------------------------------------------------------------------
/tests/fixtures/code/simplifier.py:
--------------------------------------------------------------------------------
1 | def normalize_operator_query(t):
2 | if isinstance(t, str):
3 | return t
4 |
5 | return "_".join(t)
6 |
7 |
8 | def simplifier(terms):
9 | new_terms = []
10 | outer_terms = set()
11 | to_remove = set()
12 |
13 | for i, t in enumerate(terms):
14 | if isinstance(t, str) and t not in outer_terms:
15 | outer_terms.add(t)
16 | new_terms.append(t)
17 |
18 | for i, t in enumerate(terms):
19 | normalized_terms = normalize_operator_query(t)
20 | if isinstance(t, list) and t[1] == "OR":
21 | for inner_term in t:
22 | if inner_term == "OR":
23 | continue
24 |
25 | if inner_term not in outer_terms:
26 | outer_terms.add(inner_term)
27 | new_terms.append(inner_term)
28 | elif (
29 | isinstance(t, list)
30 | and t[1] == "AND"
31 | and normalized_terms not in outer_terms
32 | ):
33 | new_terms.append(t)
34 | outer_terms.add(normalized_terms)
35 | if t[0] in outer_terms:
36 | to_remove.add(t[0])
37 | if t[2] in outer_terms:
38 | to_remove.add(t[2])
39 | elif (
40 | isinstance(t, list)
41 | and t[0] == "NOT"
42 | and normalized_terms not in outer_terms
43 | ):
44 | if t[1] in outer_terms:
45 | to_remove.add(t[1])
46 |
47 | return [i for i in new_terms if normalize_operator_query(i) not in to_remove]
48 |
--------------------------------------------------------------------------------
/docs/pages/templates/ranking.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | permalink: /ranking/
4 | title: Document Ranking
5 | ---
6 |
7 | By default, documents are returned in no particular order. If you provide a `sort_by` field, documents are sorted by that field.
8 |
9 | For more advanced ranking, you can use the `boost` feature. This feature lets you boost the value of a field in a document to calculate a final score.
10 |
11 | The default score for each field is `1`.
12 |
13 | To use this feature, you must use `boost` on fields that have an index.
14 |
15 | Here is an example of a query that uses the `boost` feature:
16 |
17 | ```python
18 | {
19 | "query": {
20 | "or": {
21 | "post": {
22 | "contains": "taylor swift",
23 | "strict": False,
24 | "boost": 1
25 | },
26 | "title": {
27 | "contains": "desk",
28 | "strict": True,
29 | "boost": 25
30 | }
31 | }
32 | },
33 | "limit": 4,
34 | "sort_by": "_score",
35 | }
36 | ```
37 |
38 | This query would search for documents whose `post` field contains `taylor swift` or whose `title` field contains `desk`. The `title` field is boosted by 25, so documents that match the `title` field are ranked higher.
39 |
40 | The score for each document before boosting is equal to the number of times the query condition is satisfied. For example, if a post contains `taylor swift` twice, the score for that document is `2`; if a title contains `desk` once, the score for that document is `1`.
41 |
42 | Documents are then ranked in decreasing order of score.
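
The scoring described above can be sketched as follows. The sketch assumes that the boost multiplies the per-field match count and that the per-field results are summed; the text does not confirm either point. The `score` function and the sample documents are hypothetical.

```python
def score(document, conditions):
    """Sum, over all fields, the number of matches times that field's boost."""
    total = 0
    for field, condition in conditions.items():
        text = str(document.get(field, "")).lower()
        occurrences = text.count(condition["contains"].lower())
        # Assumption: the boost is a multiplier on the raw match count.
        total += occurrences * condition.get("boost", 1)
    return total


conditions = {
    "post": {"contains": "taylor swift", "boost": 1},
    "title": {"contains": "desk", "boost": 25},
}

documents = [
    {"title": "Concert notes", "post": "taylor swift played; taylor swift sang"},
    {"title": "My desk", "post": "A tour of my workspace"},
]

# Rank in decreasing order of score.
ranked = sorted(documents, key=lambda d: score(d, conditions), reverse=True)
```

Here the first document matches `taylor swift` twice for a score of 2, while the second matches `desk` once in the boosted `title` field for a score of 25, so it ranks first.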
--------------------------------------------------------------------------------
/docs/pages/templates/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: JameSQL
4 | permalink: /
5 | ---
6 |
7 | An in-memory, NoSQL database implemented in Python, with support for building custom ranking algorithms.
8 |
9 | You can run full text search queries on thousands of documents with multiple fields in < 1ms.
10 |
11 | ## Demo
12 |
13 | [Try a site search engine built with JameSQL](https://jamesg.blog/search-pages/).
14 |
15 |
18 |
19 | ## Ideal use case
20 |
21 | JameSQL is designed for small-scale search projects where objects can easily be loaded into and stored in memory.
22 |
23 | James uses it for his [personal website search engine](https://jamesg.blog/search-pages/), which indexes 1,000+ documents (500,000+ words).
24 |
25 | On James' search engine, results are computed in < 10ms and returned to a client in < 70ms.
--------------------------------------------------------------------------------
/tests/fixtures/code/simplifier_demo.py:
--------------------------------------------------------------------------------
1 | import math
2 | import os
3 | from collections import defaultdict
4 |
5 |
6 | def get_trigrams(line):
7 | return [line[i : i + 3] for i in range(len(line) - 2)]
8 |
9 |
10 | index = defaultdict(list)
11 |
12 | # read all Markdown files under DIR
13 | DIR = "./pages/posts/"
14 | id2line = {}
15 | doc_lengths = {}
16 |
17 | for root, dirs, files in os.walk(DIR):
18 | for file in files:
19 | if file.endswith(".md"):
20 | with open(os.path.join(root, file), "r") as file:
21 | code = file.read()
22 |
23 | code_lines = code.split("\n")
24 | total_lines = len(code_lines)
25 |
26 | for line_num, line in enumerate(code_lines):
27 | trigrams = get_trigrams(line)
28 |
29 | if len(trigrams) == 0:
30 | id2line[f"{file.name}:{line_num}"] = line
31 |
32 | for trigram in trigrams:
33 | index[trigram].append((file, line_num))
34 | id2line[f"{file.name}:{line_num}"] = line
35 |
36 | doc_lengths[file.name] = total_lines
37 |
38 | query = "coffee"
39 | context = 0
40 |
41 | trigrams = get_trigrams(query)
42 |
43 | candidates = set(index[trigrams[0]])
44 | # print([file.name + ":" + str(line_num) for file, line_num in candidates])
45 | for trigram in trigrams:
46 | candidates = candidates.intersection(set(index[trigram]))
47 |
48 | for file, line_num in candidates:
49 | print(f"{file.name}:{line_num}")
50 | for i in range(
51 | max(0, line_num - context), min(doc_lengths[file.name], line_num + context + 1)
52 | ):
53 | line = id2line[f"{file.name}:{i}"]
54 | print(f"{i}: {line}")
55 |
56 | print()
57 |
--------------------------------------------------------------------------------
/docs/pages/templates/search.html:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | permalink: /search-pages/
4 | title: Search
5 | notoc: true
6 | ---
7 |
8 |
search results for ""
9 |
10 |
11 |
12 |
13 |
--------------------------------------------------------------------------------
/schema.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | from enum import Enum
4 | from typing import Dict, Optional
5 |
6 | from pydantic import BaseModel, ConfigDict, model_validator
7 |
8 | VALID_QUERY_TYPES = ["contains", "equals", "starts_with"]
9 | VALID_OPERATOR_QUERY_TYPES = ["or", "and"]
10 |
11 |
12 | class QueryType(str, Enum):
13 | contains = "contains"
14 | equals = "equals"
15 | starts_with = "starts_with"
16 |
17 |
18 | class AndOperatorQueryType(str, Enum):
19 | and_ = "and"
20 |
21 |
22 | class OrOperatorQueryType(str, Enum):
23 | or_ = "or"
24 |
25 |
26 | class QueryItem(BaseModel):
27 | model_config = ConfigDict(extra="forbid")
28 |
29 | contains: Optional[str] = None
30 | equals: Optional[str] = None
31 | starts_with: Optional[str] = None
32 |
33 | strict: Optional[bool] = False
34 | boost: Optional[int] = 1
35 |
36 | # ensure that only one of the query types is used
37 | @model_validator(mode="after")
38 | def validate_query_type(cls, v):
39 | query_types = [v.contains, v.equals, v.starts_with]
40 |
41 | if len([qt for qt in query_types if qt]) > 1:
42 | raise ValueError("Only one query type can be used")
43 |
44 | return v
45 |
46 |
47 | class RootQuery(BaseModel):
48 | query: (
49 | Dict[AndOperatorQueryType, Dict[str, QueryItem]]
50 | | Dict[OrOperatorQueryType, Dict[str, QueryItem]]
51 | | Dict[str, QueryItem]
52 | )
53 | limit: Optional[int] = 10
54 | sort_by: Optional[str] = "score"
55 |
56 |
57 | query = {
58 | "query": {
59 | "or": {
60 | "post": {"contains": "taylor swift", "strict": False, "boost": 1},
61 | "title": {"contains": "my desk", "strict": True, "boost": 25},
62 | }
63 | },
64 | "limit": 4,
65 | "sort_by": "score",
66 | }
67 |
68 | # validate query
69 | print(RootQuery(**query))
70 |
--------------------------------------------------------------------------------
/jamesql/script_lang.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import math
3 |
4 | from lark import Transformer
5 |
6 | grammar = """
7 | start: query
8 |
9 | query: decay | "(" query OPERATOR query ")" | logarithm | FLOAT | WORD
10 | logarithm: LOGARITHM "(" query ")"
11 | OPERATOR: "+" | "-" | "*" | "/"
12 | LOGARITHM: "log"
13 | decay: "decay" WORD
14 |
15 | WORD: /[a-zA-Z0-9_]+/
16 | FLOAT: /[0-9]+(\.[0-9]+)?/
17 |
18 | %import common.WS
19 | %ignore WS
20 | """
21 |
22 | OPERATOR_METHODS = {
23 | "+": lambda x, y: x + y,
24 | "-": lambda x, y: x - y,
25 | "*": lambda x, y: x * y,
26 | "/": lambda x, y: x / y,
27 | }
28 |
29 |
30 | class JameSQLScriptTransformer(Transformer):
31 | def __init__(self, document):
32 | self.document = document
33 |
34 | def query(self, items):
35 | if len(items) == 1:
36 | return items[0]
37 |
38 | left = items[0]
39 | operator = items[1]
40 | right = items[2]
41 |
42 | operator_command = OPERATOR_METHODS[operator]
43 |
44 | return operator_command(left, right)
45 |
46 | def logarithm(self, items):
47 | # + 0.1 removes the possibility of log(0)
48 | # which would return a math domain error
49 | return math.log(items[1] + 0.1)
50 |
51 | def start(self, items):
52 | return items[0]
53 |
54 | def decay(self, items):
55 | # decay by half for every 30 days
56 | # items[0] is a timestamp string in %Y-%m-%dT%H:%M:%S format
57 | days_since_post = (
58 | datetime.datetime.now()
59 | - datetime.datetime.strptime(items[0], "%Y-%m-%dT%H:%M:%S")
60 | ).days
61 |
62 | return 1.1 ** (days_since_post / 30)
63 |
64 | def WORD(self, items):
65 | if items.value.isdigit():
66 | return float(items.value)
67 |
68 | return self.document[items.value]
69 |
70 | def FLOAT(self, items):
71 | return float(items.value)
72 |
73 | def OPERATOR(self, items):
74 | return items.value
75 |
--------------------------------------------------------------------------------
/docs/pages/templates/autosuggest.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | permalink: /autosuggest/
4 | title: Autosuggest
5 | ---
6 |
7 | You can enable autosuggest using one or more fields in an index. This can be used to efficiently find records that start with a given prefix.
8 |
9 | To enable autosuggest on an index, run:
10 |
11 |
16 |
17 | Where `field` is the name of the field on which you want to enable autosuggest.
18 |
19 | You can enable autosuggest on multiple fields:
20 |
21 |
25 |
26 | When you enable autosuggest on a field, JameSQL will create a trie index for that field. This index is used to efficiently find records that start with a given prefix.
27 |
28 | To run an autosuggest query, use the following code:
29 |
30 |
33 |
34 | This will automatically return records that start with the prefix `started`.
35 |
36 | The `match_full_record` parameter indicates whether to return full record names, or any records starting with a term.
37 |
38 | `match_full_record=True` means that the full record name will be returned. This is ideal to enable selection between full records.
39 |
40 | `match_full_record=False` means that any records starting with the term will be returned. This is ideal for autosuggesting single words.
41 |
42 | For example, given the query `start`, matching against full records with `match_full_record=True` would return:
43 |
44 | - `Started with a kiss`
45 |
46 | This is the content of a full document.
47 |
48 | `match_full_record=False`, on the other hand, would return:
49 |
50 | - `started`
51 | - `started with a kiss`
52 |
53 | This contains both a root word starting with `start` and full documents starting with `start`.
54 |
55 | This feature is case insensitive.
56 |
57 | The `limit` argument limits the number of results returned.
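
The prefix lookup described above can be illustrated with a small trie. This is a sketch, not JameSQL's trie index: the `Trie` class and its methods are hypothetical, and indexing both the full record and each of its words is an assumption made to mirror the `match_full_record=False` example.

```python
from itertools import islice


class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None  # set when a complete key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for char in key.lower():  # lowercase keys for case-insensitive lookup
            node = node.children.setdefault(char, TrieNode())
        node.value = key.lower()

    def _walk(self, node):
        """Yield every stored key below `node`, in alphabetical order."""
        if node.value is not None:
            yield node.value
        for char in sorted(node.children):
            yield from self._walk(node.children[char])

    def starts_with(self, prefix, limit=5):
        node = self.root
        for char in prefix.lower():
            node = node.children.get(char)
            if node is None:
                return []
        return list(islice(self._walk(node), limit))


trie = Trie()
title = "Started with a kiss"
trie.insert(title)  # the full record
for word in title.split():
    trie.insert(word)  # each individual word

suggestions = trie.starts_with("start")
```

Looking up a prefix costs one step per character of the prefix, regardless of how many records are stored, which is what makes a trie a good fit for autosuggest.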
--------------------------------------------------------------------------------
/docs/pages/templates/spelling-correction.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Spelling Correction
4 | permalink: /spelling-correction/
5 | ---
6 |
7 | It is recommended that you check the spelling of words before you run a query.
8 |
9 | This is because correcting the spelling of a word can improve the accuracy of your search results.
10 |
11 | ### Correcting the spelling of a single word
12 |
13 | To recommend a spelling correction for a query, use the following code:
14 |
15 | ```python
16 | index = ...
17 |
18 | suggestion = index.spelling_correction("taylr swift")
19 | ```
20 |
21 | This will return a single suggestion. The suggestion will be the word that is most likely to be the correct spelling of the word you provided.
22 |
23 | Spelling correction first generates segmentations of a word, like:
24 |
25 | - `t aylorswift`
26 | - `ta ylorswift`
27 |
28 | If a segmentation is valid, it is returned.
29 |
30 | For example, if the user types in `taylorswift`, one candidate segmentation is `taylor swift`. If `taylor swift` is common in the index, `taylor swift` will be returned as the suggestion.
31 |
32 | Spelling correction works by inserting, deleting, or replacing one character at each position in the input query. The transformed strings are then looked up in the index to find whether they are present and, if so, how common they are.
33 |
34 | The most common suggestion is then returned.
35 |
36 | For example, if you provide the word `tayloi` and `taylor` is common in the index, the suggestion will be `taylor`.
37 |
38 | If no correction is found after one transformation, correction is attempted with two transformations of the input string.
39 |
40 | If the word you provided is already spelled correctly, the suggestion will be the word you provided. If spelling correction is not possible (i.e. the word is too distant from any word in the index), the suggestion will be `None`.
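
The single-edit and double-edit lookups described above can be sketched as follows. This is a generic edit-based corrector written for illustration, not the code behind `spelling_correction()`. The `edits1` and `correct` functions, the lowercase ASCII alphabet, and the `word_counts` frequency table are all assumptions.

```python
import string
from collections import Counter


def edits1(word):
    """All strings one deletion, replacement, or insertion away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + replaces + inserts)


def correct(word, word_counts):
    """Return the most frequent known word within two edits, or None."""
    if word in word_counts:
        return word  # already spelled correctly
    candidates = [w for w in edits1(word) if w in word_counts]
    if not candidates:
        # Fall back to two edits only when one edit finds nothing.
        candidates = [
            w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in word_counts
        ]
    if not candidates:
        return None
    return max(candidates, key=word_counts.get)


# Hypothetical word frequencies standing in for the index.
word_counts = Counter({"taylor": 12, "tailor": 2, "swift": 9})
suggestion = correct("tayloi", word_counts)
```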
41 |
42 | ### Correcting a string query
43 |
44 | If you are correcting a string query submitted with the `string_query_search()` function, spelling will be automatically corrected using the algorithm above. No configuration is required.
--------------------------------------------------------------------------------
/docs/pages/templates/conditions/operators.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Operators
3 | layout: default
4 | permalink: /conditions/operators
5 | ---
6 |
7 | There are three operators you can use for condition matching:
8 |
9 | - `equals`
10 | - `contains`
11 | - `starts_with`
12 |
13 | Here is an example of a query that searches for documents that have the `artist` field set to `Taylor Swift`:
14 |
15 | ```python
16 | query = {
17 | "query": {
18 | "artist": {
19 | "equals": "Taylor Swift"
20 | }
21 | }
22 | }
23 | ```
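
As a rough mental model, the three operators can be read as simple string tests against a field's value. The sketch below assumes case-sensitive comparison and treats `contains` as a substring test; JameSQL's index-backed matching may differ on both points. `OPERATORS` and `matches` are hypothetical names.

```python
# Each operator maps to a test on (field value, search term).
OPERATORS = {
    "equals": lambda value, term: value == term,
    "contains": lambda value, term: term in value,
    "starts_with": lambda value, term: value.startswith(term),
}


def matches(document, query):
    """Return True if every field condition in `query` holds for `document`."""
    for field, condition in query.items():
        value = document.get(field)
        if not isinstance(value, str):
            return False  # missing or non-string fields never match here
        for operator, term in condition.items():
            if not OPERATORS[operator](value, term):
                return False
    return True


document = {"title": "tolerate it", "artist": "Taylor Swift"}
is_match = matches(document, {"artist": {"equals": "Taylor Swift"}})
```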
24 |
25 | These operators can be used with three query types:
26 |
27 | - `and`
28 | - `or`
29 | - `not`
30 |
31 | ### and
32 |
33 | You can also search for documents that have the `artist` field set to `Taylor Swift` and the `title` field set to `tolerate it`:
34 |
35 | ```python
36 | query = {
37 |     "query": {
38 |         "and": [
39 |             {
40 |                 "artist": {
41 |                     "equals": "Taylor Swift"
42 |                 }
43 |             },
44 |             {
45 |                 "title": {
46 |                     "equals": "tolerate it"
47 |                 }
48 |             }
49 |         ]
50 |     }
51 | }
52 | ```
53 |
54 | ### or
55 |
56 | You can nest conditions to create complex queries, like:
57 |
58 | ```python
59 | query = {
60 |     "query": {
61 |         "or": {
62 |             "and": [
63 |                 {"title": {"starts_with": "tolerate"}},
64 |                 {"title": {"contains": "it"}},
65 |             ],
66 |             "lyric": {"contains": "kiss"},
67 |         }
68 |     },
69 |     "limit": 2,
70 |     "sort_by": "title",
71 | }
72 | ```
73 |
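74 | This will return documents whose `title` starts with `tolerate` and contains `it`, or whose `lyric` contains `kiss`. At most two documents are returned, sorted by `title`.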
75 |
76 | ### not
77 |
78 | You can search for documents that do not match a query by using the `not` operator. Here is an example of a query that searches for lyrics that contain `sky` but not `kiss`:
79 |
80 | ```python
81 | query = {
82 |     "query": {
83 |         "and": {
84 |             "or": [
85 |                 {"lyric": {"contains": "sky", "boost": 3}},
86 |             ],
87 |             "not": {"lyric": {"contains": "kiss"}},
88 |         }
89 |     },
90 |     "limit": 10,
91 |     "sort_by": "title",
92 | }
93 | ```
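One way to reason about `and`, `or`, and `not` is as set operations over matching documents: `and` is an intersection, `or` is a union, and `not` is a complement. The toy evaluator below is a sketch under that assumption. The names `OPERATORS`, `matches`, and `evaluate` are hypothetical, the comparison is case-sensitive, and keys such as `boost` are ignored. It is not how JameSQL executes queries internally.

```python
OPERATORS = {
    "equals": lambda value, target: value == target,
    "contains": lambda value, target: target in value,
    "starts_with": lambda value, target: value.startswith(target),
}


def matches(document, field, condition):
    """Check one field condition; non-operator keys such as `boost` are skipped."""
    value = str(document.get(field, ""))
    return all(
        OPERATORS[op](value, target)
        for op, target in condition.items()
        if op in OPERATORS
    )


def evaluate(query, documents):
    """Return the set of positions in `documents` that satisfy `query`."""
    everything = set(range(len(documents)))
    results = []
    for key, value in query.items():
        if key in ("and", "or"):
            # Clauses may be given as a list or as a dictionary.
            clauses = value if isinstance(value, list) else [{k: v} for k, v in value.items()]
            sets = [evaluate(clause, documents) for clause in clauses]
            if not sets:
                results.append(set())
            elif key == "and":
                results.append(set.intersection(*sets))
            else:
                results.append(set.union(*sets))
        elif key == "not":
            results.append(everything - evaluate(value, documents))
        else:
            results.append(
                {i for i, doc in enumerate(documents) if matches(doc, key, value)}
            )
    return set.intersection(*results) if results else set()
```

Call `evaluate(query["query"], documents)` with a list of dictionaries to see which positions a nested query selects.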
--------------------------------------------------------------------------------
/docs/pages/templates/string-queries.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: String Queries
4 | permalink: /string-query/
5 | ---
6 |
7 | JameSQL supports string queries. String queries are single strings that use special syntax to assert the meaning of parts of a string.
8 |
9 | For example, you could use the following query to find documents where the `title` field contains `tolerate it` and any field contains `mural`:
10 |
11 |
12 | title:"tolerate it" mural
13 |
14 |
15 | The following operators are supported:
16 |
| Operator | Description |
| --- | --- |
| `-term` | Search for documents that do not contain `term`. |
| `term` | Search for documents that contain `term`. |
| `term1 term2` | Search for documents that contain `term1` and `term2`. |
| `'term1 term2'` | Search for the literal phrase `term1 term2` in documents. |
| `field:'term'` | Search for documents where the `field` field contains `term` (e.g. `title:"tolerate it"`). |
| `field^2 term` | Boost the score of documents where the `field` field matches the query `term` by 2. |
51 |
52 | This feature turns a string query into a JameSQL query, which is then executed and the results returned.
53 |
54 | To run a string query, use the following code:
55 |
56 | ```python
57 | results = index.string_query_search("title:'tolerate it' mural")
58 | ```
59 |
60 | When you run a string query, JameSQL will attempt to simplify the query to make it more efficient. For example, if you search for `-sky sky mural`, the query will be `mural` because `-sky` negates the `sky` mention.
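The simplification rule in the example above can be sketched as follows. This is only an illustration of that one rule (a term that appears both negated and non-negated is dropped). The function name `simplify` is hypothetical, the sketch splits on whitespace and so does not handle quoted phrases or `field:` prefixes, and JameSQL's own simplifier may apply further rules.

```python
def simplify(query):
    """Drop terms that appear both as `term` and as `-term`."""
    terms = query.split()
    negated = {t[1:] for t in terms if t.startswith("-") and len(t) > 1}
    positive = {t for t in terms if not t.startswith("-")}
    cancelled = negated & positive
    # Keep the original order of the surviving terms.
    kept = [t for t in terms if t.lstrip("-") not in cancelled]
    return " ".join(kept)
```

For example, `simplify("-sky sky mural")` returns `mural`, while `simplify("-sky mural")` is left unchanged.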
61 |
--------------------------------------------------------------------------------
/docs/hooks.py:
--------------------------------------------------------------------------------
1 |
2 | from pygments import highlight
3 | from pygments.lexers import PythonLexer, HtmlLexer
4 | from pygments.formatters import HtmlFormatter
5 | from bs4 import BeautifulSoup
6 |
7 | languages = {
8 |     "python": PythonLexer(),
9 |     "html": HtmlLexer(),
10 |     "text": HtmlLexer(),
11 | }
12 |
13 | def generate_table_of_contents(file_name, page_state, site_state):
14 |     page = BeautifulSoup(page_state["page"].contents, 'html.parser')
15 |     h2s = page.find_all('h2')
16 |     toc = []
17 |     for h2 in h2s:
18 |         toc.append({
19 |             "text": h2.text,
20 |             "id": h2.text.lower().replace(" ", "-"),
21 |             "children": []
22 |         })
23 |         h3s = h2.find_next_siblings('h3')
24 |         for h3 in h3s:
25 |             # skip any h3 whose nearest preceding h2 is a different heading
26 |             if h3.find_previous_sibling('h2') != h2:
27 |                 continue
28 |             toc[-1]["children"].append({
29 |                 "text": h3.text,
30 |                 "id": h3.text.lower().replace(" ", "-"),
31 |             })
32 |     page_state["page"].toc = toc
33 |
34 |     return page_state
35 |
36 | def highlight_code(file_name, page_state, _, page_contents):
37 |     print(f"Checking {file_name}")
38 |     if ".txt" in file_name or ".xml" in file_name:
39 |         return page_contents
40 |     print(f"Highlighting code in {file_name}")
41 |     soup = BeautifulSoup(page_contents, 'lxml')
42 |
43 |     for pre in soup.find_all('pre'):
44 |         code = pre.find('code')
45 |         try:
46 |             language = code['class'][0].split("language-")[1]
47 |             code = highlight(code.text, languages[language], HtmlFormatter())
48 |         except Exception:
49 |             continue
50 |
51 |         pre.replace_with(BeautifulSoup(code, 'html.parser'))
52 |
53 |     css = HtmlFormatter().get_style_defs('.highlight')
54 |     css = f"<style>{css}</style>"
55 |
56 |     # this happens for bookmarks
57 |     if not soup.find("body"):
58 |         return ""
59 |
60 |     body = soup.find('body')
61 |     body.insert(0, BeautifulSoup(css, 'html.parser'))
62 |
63 |     # get every h2 and add id= to it
64 |     for h2 in soup.find_all('h2'):
65 |         h2['id'] = h2.text.lower().replace(" ", "-")
66 |     for h3 in soup.find_all('h3'):
67 |         h3['id'] = h3.text.lower().replace(" ", "-")
68 |
69 |     return str(soup)
--------------------------------------------------------------------------------
/docs/pages/templates/matching.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Search Matching
4 | permalink: /matching/
5 | ---
6 |
7 | ### Strict matching
8 |
9 | By default, a search query on a text field will find any document where the field contains any word in the query string. For example, a query for `tolerate it` on a `title` field will match any document whose `title` contains `tolerate` or `it`. This is called a non-strict match.
10 |
11 | Non-strict matches are the default because they are faster to compute than strict matches.
12 |
13 | If you want to find documents where terms appear next to each other in a field, you can do so with a strict match. Here is an example of a strict match:
14 |
15 | ```python
16 | query = {
17 |     "query": {
18 |         "title": {
19 |             "contains": "tolerate it",
20 |             "strict": True
21 |         }
22 |     }
23 | }
24 | ```
25 |
26 | This will return documents whose title contains `tolerate it` as a single phrase.
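The difference between the two modes can be illustrated with a small sketch. The helper names `non_strict_match` and `strict_match` are hypothetical and the tokenisation (lowercasing and splitting on whitespace) is an assumption. JameSQL's index-backed implementation will differ.

```python
def non_strict_match(query, text):
    """True if any word of the query appears as a word in the text."""
    words = text.lower().split()
    return any(term in words for term in query.lower().split())


def strict_match(query, text):
    """True if the query words appear next to each other, in order, in the text."""
    terms = query.lower().split()
    words = text.lower().split()
    return any(
        words[i:i + len(terms)] == terms
        for i in range(len(words) - len(terms) + 1)
    )
```

A title such as `it was all a dream` passes the non-strict check for `tolerate it` because it contains `it`, but fails the strict check because the phrase `tolerate it` does not appear.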
27 |
28 | ### Fuzzy matching
29 |
30 | By default, search queries look for the exact string provided. This means that if a query contains a typo (e.g. searching for `tolerate ip` instead of `tolerate it`), no documents will be returned.
31 |
32 | JameSQL implements a limited form of fuzzy matching. This means that if a query contains a typo, JameSQL will still return documents that match the query.
33 |
34 | Fuzzy matching tolerates a single typo, where a typo is one incorrectly typed character. Text that differs from the query by more than one character will not be matched. JameSQL does not support fuzzy matching that accounts for missing or additional characters (e.g. `tolerate itt` will not match `tolerate it`).
35 |
36 | You can enable fuzzy matching by setting the `fuzzy` key to `True` in the query. Here is an example of a query that uses fuzzy matching:
37 |
38 | ```python
39 | query = {
40 |     "query": {
41 |         "title": {
42 |             "contains": "tolerate ip",
43 |             "fuzzy": True
44 |         }
45 |     }
46 | }
47 | ```
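The one-typo rule described above amounts to comparing two strings of equal length and allowing at most one differing character. The sketch below shows that rule in isolation. `within_one_typo` is a hypothetical helper, not part of the JameSQL API.

```python
def within_one_typo(query, candidate):
    """True if the strings have equal length and differ in at most one position."""
    if len(query) != len(candidate):
        # Missing or additional characters are not treated as typos.
        return False
    return sum(a != b for a, b in zip(query, candidate)) <= 1
```

Under this rule `tolerate ip` matches `tolerate it`, while `tolerate itt` does not because its length differs.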
48 |
49 | ### Wildcard matching
50 |
51 | You can match documents using a single wildcard character. This character is represented by an asterisk `*`.
52 |
53 | ```python
54 | query = {
55 |     "query": {
56 |         "title": {
57 |             "contains": "tolerat* it",
58 |             "fuzzy": True
59 |         }
60 |     }
61 | }
62 | ```
63 |
64 | This query will look for all words that match the pattern `tolerat* it`, where the `*` character can be any single character.
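Because `*` stands for exactly one character, a wildcard pattern behaves like a regular expression in which each `*` becomes `.`. The sketch below illustrates this equivalence using Python's `re` module. `wildcard_to_regex` is a hypothetical helper, and JameSQL may implement wildcards differently.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern in which `*` matches exactly one character."""
    # Escape the literal parts so that only `*` is treated specially.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".".join(literal_parts))
```

`wildcard_to_regex("tolerat* it").fullmatch("tolerate it")` succeeds, while `tolerated it` is rejected because the `*` would need to cover two characters.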
--------------------------------------------------------------------------------
/docs/pages/templates/script-scores.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | permalink: /script-scores/
4 | title: Script Scores
5 | ---
6 |
7 | The script score feature lets you write custom scripts to calculate the score for each document. This is useful if you want to calculate a score based on multiple fields, including numeric fields.
8 |
9 | Script scores are applied after all documents are retrieved.
10 |
11 | The script score feature supports the following mathematical operations:
12 |
13 | - `+` (addition)
14 | - `-` (subtraction)
15 | - `*` (multiplication)
16 | - `/` (division)
17 | - `log` (logarithm)
18 | - `decay` (timeseries decay)
19 |
20 | You can apply a script score at the top level of your query:
21 |
22 | ```python
23 | {
24 |     "query": {
25 |         "or": {
26 |             "post": {
27 |                 "contains": "taylor swift",
28 |                 "strict": False,
29 |                 "boost": 1
30 |             },
31 |             "title": {
32 |                 "contains": "desk",
33 |                 "strict": True,
34 |                 "boost": 25
35 |             }
36 |         }
37 |     },
38 |     "limit": 4,
39 |     "sort_by": "_score",
40 |     "script_score": "((post + title) * 2)"
41 | }
42 | ```
43 |
44 | The above example will calculate the score of documents by adding the score of the `post` field and the `title` field, then multiplying the result by `2`.
45 |
46 | A script score is made up of terms. A term is a field name or number (float or int), followed by an operator, followed by another term or number. Terms can be nested.
47 |
48 | All terms must be enclosed within parentheses.
49 |
50 | To compute a score that adds the `post` score to `title` and multiplies the result by `2`, use the following code:
51 |
52 | ```text
53 | ((post + title) * 2)
54 | ```
55 |
56 | Invalid forms of this query include:
57 |
58 | - `post + title * 2` (missing parentheses)
59 | - `(post + title * 2)` (terms can only include one operator)
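The grammar described above (every term parenthesized, one operator per term) can be checked with a small recursive evaluator. The sketch below is an illustration only: `evaluate_script` and the `field_scores` mapping are hypothetical, only the four arithmetic operators are handled (`log` and `decay` are left out), and JameSQL's own parser may behave differently on malformed input.

```python
import re

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate_script(script, field_scores):
    """Evaluate a fully parenthesized script against per-field scores."""
    tokens = re.findall(r"\(|\)|[+\-*/]|[A-Za-z_]\w*|\d+(?:\.\d+)?", script)
    pos = 0

    def operand():
        nonlocal pos
        token = tokens[pos]
        if token == "(":
            return term()  # nested term
        pos += 1
        if token in field_scores:
            return field_scores[token]
        return float(token)  # raises ValueError for unknown names

    def term():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("terms must be enclosed in parentheses")
        pos += 1
        left = operand()
        op = tokens[pos]
        if op not in OPS:
            raise ValueError(f"expected an operator, got {op!r}")
        pos += 1
        right = operand()
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("terms can only include one operator")
        pos += 1
        return OPS[op](left, right)

    result = term()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result
```

With `field_scores = {"post": 1.5, "title": 3.0}`, the script `((post + title) * 2)` evaluates to `9.0`, and both invalid forms listed above raise `ValueError`.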
60 |
61 | The `decay` function lets you decay a value by `0.9 ** days_since_post / 30`. This gradually decreases the rank of older documents as time passes, which is useful when you want more recent documents to be ranked higher. `decay` only works with timeseries.
62 |
63 | Here is an example of `decay` in use:
64 |
65 | ```
66 | (_score * decay published)
67 | ```
68 |
69 | This will apply the `decay` function to the `published` field.
70 |
71 | Data must be stored as a Python `datetime` object for the `decay` function to work.
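To get a feel for the decay factor, here is a small sketch. It assumes the exponent is `days_since_post / 30`, i.e. `0.9 ** (days_since_post / 30)`, so that a document loses about 10% of its score for every 30 days of age. That grouping is an assumption about the formula above, and the `decay` function and its `now` parameter here are hypothetical stand-ins rather than JameSQL's implementation.

```python
from datetime import datetime


def decay(published, now=None):
    """Return a multiplier in (0, 1] that shrinks as `published` gets older."""
    now = now or datetime.now()
    days_since_post = (now - published).days
    # Assumption: the exponent is days / 30, not (0.9 ** days) / 30.
    return 0.9 ** (days_since_post / 30)
```

Under this reading, a document published today keeps its full score, one published 30 days ago is multiplied by 0.9, and one published 60 days ago by roughly 0.81.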
--------------------------------------------------------------------------------
/tests/autosuggest.py:
--------------------------------------------------------------------------------
1 | import json
2 | import sys
3 | from contextlib import ExitStack as DoesNotRaise
4 |
5 | import pytest
6 | from deepdiff import DeepDiff
7 |
8 | from jamesql import JameSQL
9 | from jamesql.index import GSI_INDEX_STRATEGIES
10 |
11 |
12 | def pytest_addoption(parser):
13 |     parser.addoption("--benchmark", action="store")
14 |
15 |
16 | @pytest.fixture(scope="session")
17 | def create_indices(request):
18 |     with open("tests/fixtures/documents.json") as f:
19 |         documents = json.load(f)
20 |
21 |     index = JameSQL()
22 |
23 |     for document in documents:
24 |         index.add(document)
25 |
26 |     index.create_gsi("title", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
27 |     index.create_gsi("lyric", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
28 |
29 |     index.enable_autosuggest("title")
30 |
31 |     with open("tests/fixtures/documents.json") as f:
32 |         documents = json.load(f)
33 |
34 |     if request.config.getoption("--benchmark") or request.config.getoption(
35 |         "--long-benchmark"
36 |     ):
37 |         large_index = JameSQL()
38 |
39 |         for document in documents * 100000:
40 |             if request.config.getoption("--long-benchmark"):
41 |                 document = document.copy()
42 |                 document["title"] = "".join(
43 |                     [
44 |                         word + " "
45 |                         for word in document["title"].split()
46 |                         for _ in range(10)
47 |                     ]
48 |                 )
49 |             large_index.add(document)
50 |
51 |         large_index.create_gsi("title", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
52 |         large_index.create_gsi("lyric", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
53 |
54 |         large_index.enable_autosuggest("title")
55 |     else:
56 |         large_index = None
57 |
58 |     return index, large_index
59 |
60 |
61 | @pytest.mark.parametrize(
62 |     "query, suggestion",
63 |     [
64 |         ("tolerat", "tolerate"),
65 |         ("toler", "tolerate"),
66 |         ("th", "the"),
67 |         ("b", "bolter"),
68 |         ("he", ""),  # not in index; part of another word
69 |         ("cod", ""),  # not in index
70 |     ],
71 | )
72 | def test_autosuggest(create_indices, query, suggestion):
73 |     index = create_indices[0]
74 |     large_index = create_indices[1]
75 |
76 |     if suggestion != "":
77 |         assert index.autosuggest(query)[0] == suggestion
78 |
79 |     if large_index and suggestion != "":
80 |         assert large_index.autosuggest(query)[0] == suggestion
81 |
--------------------------------------------------------------------------------
/tests/spelling_correction.py:
--------------------------------------------------------------------------------
1 | import json
2 | import sys
3 | from contextlib import ExitStack as DoesNotRaise
4 |
5 | import pytest
6 | from deepdiff import DeepDiff
7 |
8 | from jamesql import JameSQL
9 | from jamesql.index import GSI_INDEX_STRATEGIES
10 |
11 |
12 | def pytest_addoption(parser):
13 |     parser.addoption("--benchmark", action="store")
14 |
15 |
16 | @pytest.fixture(scope="session")
17 | def create_indices(request):
18 |     with open("tests/fixtures/documents.json") as f:
19 |         documents = json.load(f)
20 |
21 |     index = JameSQL()
22 |
23 |     for document in documents:
24 |         index.add(document)
25 |
26 |     index.create_gsi("title", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
27 |     index.create_gsi("lyric", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
28 |
29 |     with open("tests/fixtures/documents.json") as f:
30 |         documents = json.load(f)
31 |
32 |     if request.config.getoption("--benchmark") or request.config.getoption(
33 |         "--long-benchmark"
34 |     ):
35 |         large_index = JameSQL()
36 |
37 |         for document in documents * 100000:
38 |             if request.config.getoption("--long-benchmark"):
39 |                 document = document.copy()
40 |                 document["title"] = "".join(
41 |                     [
42 |                         word + " "
43 |                         for word in document["title"].split()
44 |                         for _ in range(10)
45 |                     ]
46 |                 )
47 |             large_index.add(document)
48 |
49 |         large_index.create_gsi("title", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
50 |         large_index.create_gsi("lyric", strategy=GSI_INDEX_STRATEGIES.CONTAINS)
51 |     else:
52 |         large_index = None
53 |
54 |     return index, large_index
55 |
56 |
57 | @pytest.mark.parametrize(
58 |     "query, corrected_query",
59 |     [
60 |         ("tolerat", "tolerate"),
61 |         ("tolerateit", "tolerate it"),  # test segmentation
62 |         (
63 |             "startedwith",
64 |             "started with",
65 |         ),  # query word that appears uppercase in corpus of text
66 |         ("toleratt", "tolerate"),
67 |         ("toleratt", "tolerate"),
68 |         ("tolerate", "tolerate"),
69 |         ("toler", "toler"),  # not in index
70 |         ("cod", "cod"),  # not in index
71 |     ],
72 | )
73 | def test_spelling_correction(create_indices, query, corrected_query):
74 |     index = create_indices[0]
75 |     large_index = create_indices[1]
76 |
77 |     assert index.spelling_correction(query) == corrected_query
78 |
79 |     if large_index:
80 |         assert large_index.spelling_correction(query) == corrected_query
81 |
--------------------------------------------------------------------------------
/tests/gsi_type_inference.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | import pytest
4 |
5 | from jamesql import JameSQL
6 | from jamesql.index import GSI_INDEX_STRATEGIES
7 |
8 |
9 | def pytest_addoption(parser):
10 |     parser.addoption("--benchmark", action="store")
11 |
12 |
13 | @pytest.mark.timeout(20)
14 | def test_gsi_type_inference(request):
15 |     with open("tests/fixtures/documents_with_varied_data_types.json") as f:
16 |         documents = json.load(f)
17 |
18 |     index = JameSQL()
19 |
20 |     for document in documents:
21 |         index.add(document)
22 |
23 |     # check gsi type
24 |     assert index.gsis["title"]["strategy"] == GSI_INDEX_STRATEGIES.CONTAINS.name
25 |     assert index.gsis["lyric"]["strategy"] == GSI_INDEX_STRATEGIES.CONTAINS.name
26 |     assert index.gsis["listens"]["strategy"] == GSI_INDEX_STRATEGIES.NUMERIC.name
27 |     assert index.gsis["album_in_stock"]["strategy"] == GSI_INDEX_STRATEGIES.FLAT.name
28 |     assert index.gsis["rating"]["strategy"] == GSI_INDEX_STRATEGIES.NUMERIC.name
29 |     assert index.gsis["metadata"]["strategy"] == GSI_INDEX_STRATEGIES.NOT_INDEXABLE.name
30 |     assert (
31 |         index.gsis["record_last_updated"]["strategy"] == GSI_INDEX_STRATEGIES.DATE.name
32 |     )
33 |
34 |     with open("tests/fixtures/documents_with_varied_data_types.json") as f:
35 |         documents = json.load(f)
36 |
37 |     if request.config.getoption("--benchmark") or request.config.getoption(
38 |         "--long-benchmark"
39 |     ):
40 |         large_index = JameSQL()
41 |
42 |         for document in documents * 100000:
43 |             if request.config.getoption("--long-benchmark"):
44 |                 document = document.copy()
45 |                 document["title"] = "".join(
46 |                     [
47 |                         word + " "
48 |                         for word in document["title"].split()
49 |                         for _ in range(10)
50 |                     ]
51 |                 )
52 |             large_index.add(document)
53 |
54 |         assert (
55 |             large_index.gsis["title"]["strategy"] == GSI_INDEX_STRATEGIES.CONTAINS.name
56 |         )
57 |         assert (
58 |             large_index.gsis["lyric"]["strategy"] == GSI_INDEX_STRATEGIES.CONTAINS.name
59 |         )
60 |         assert (
61 |             large_index.gsis["listens"]["strategy"] == GSI_INDEX_STRATEGIES.NUMERIC.name
62 |         )
63 |         assert (
64 |             large_index.gsis["album_in_stock"]["strategy"]
65 |             == GSI_INDEX_STRATEGIES.FLAT.name
66 |         )
67 |         assert (
68 |             large_index.gsis["rating"]["strategy"] == GSI_INDEX_STRATEGIES.NUMERIC.name
69 |         )
70 |         assert (
71 |             large_index.gsis["metadata"]["strategy"]
72 |             == GSI_INDEX_STRATEGIES.NOT_INDEXABLE.name
73 |         )
74 |         assert (
75 |             large_index.gsis["record_last_updated"]["strategy"]
76 |             == GSI_INDEX_STRATEGIES.DATE.name
77 |         )
78 |
--------------------------------------------------------------------------------
/docs/pages/templates/storage-and-consistency.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Data Storage and Consistency
4 | permalink: /storage-and-consistency/
5 | ---
6 |
7 | JameSQL indices are stored in memory and on disk.
8 |
9 | When you call the `add()` method, the document is appended to an `index.jamesql` file in the directory in which your program is running. This file is serialized as JSONL.
10 |
11 | When you load an index, all entries in the `index.jamesql` file will be read back into memory.
12 |
13 | _Note: You will need to manually reconstruct your indices using the `create_gsi()` method after loading an index._
14 |
15 | ## Data Consistency
16 |
17 | When you call `add()`, a `journal.jamesql` file is created. This is used to store the contents of the `add()` operation you are executing. If JameSQL terminates during an `add()` call for any reason (e.g. system crash, program termination), this journal will be used to reconcile the database.
18 |
19 | Next time you initialize a JameSQL instance, your documents in `index.jamesql` will be read into memory. Then, the transactions in `journal.jamesql` will be replayed to ensure the index is consistent. Finally, the `journal.jamesql` file will be deleted.
20 |
21 | You can access the JSON of the last transaction issued, sans the `uuid`, by calling `index.last_transaction`.
22 |
23 | If you were in the middle of ingesting data, this could be used to resume the ingestion process from where you left off by allowing you to skip records that were already ingested.
24 |
25 | ## Reducing Precision for Large Results Pages
26 |
27 | By default, JameSQL assigns scores to the top 1,000 documents in each clause in a query. Consider the following query:
28 |
29 |
48 |
49 | The `{ "artist": { "equals": "Taylor Swift" } }` clause will return the top 1,000 documents that match the query. The `{ "title": { "equals": "tolerate it" } }` clause will return the top 1,000 documents that match the query.
50 |
51 | These will then be combined and sorted to return the 10 documents, out of the 2,000 processed, that have the highest scores.
52 |
53 | This means that if a large number of documents match a query, the top 10 results may not be precisely the most relevant documents, but rather an approximation of them.
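The effect of scoring only the top documents per clause can be seen in a small sketch. The function `combine_top_k`, its parameters, and the choice to sum scores across clauses are assumptions made for illustration. JameSQL's scoring and combination logic may differ.

```python
import heapq


def combine_top_k(clause_scores, k=1000, limit=10):
    """Keep the top `k` documents of each clause, then rank the combined scores."""
    combined = {}
    for scores in clause_scores:  # each item maps document id -> score
        top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        for doc_id, score in top:
            # Assumption: scores from different clauses are added together.
            combined[doc_id] = combined.get(doc_id, 0) + score
    return heapq.nlargest(limit, combined.items(), key=lambda item: item[1])
```

A document that ranks just outside the top `k` in every clause is dropped even if its combined score would have been the highest, which is why a smaller `k` trades precision for speed.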
54 |
55 | You can override the number of documents to consider with:
56 |
57 |
20 |
21 | See the table below for a list of available index strategies.
22 |
23 | ## Indexing strategies
24 |
25 | The following index strategies are available:
26 |
27 |
28 |
29 |
30 |
Index Strategy
31 |
Description
32 |
33 |
34 |
35 |
36 |
37 | GSI_INDEX_STRATEGIES.CONTAINS
38 |
39 |
40 | Creates a reverse index for the field. This is useful for fields that contain longer strings (i.e. body text in a blog post). TF-IDF is used to search fields structured with the CONTAINS type.
41 |
42 |
43 |
44 |
45 | GSI_INDEX_STRATEGIES.NUMERIC
46 |
47 |
48 | Creates several buckets to allow for efficient search of numeric values, especially values with high cardinality.
49 |
50 |
51 |
52 |
53 | GSI_INDEX_STRATEGIES.FLAT
54 |
55 |
56 | Stores the field as the data type it is. A flat index is created of values that are not strings or numbers. This is the default. For example, if you are indexing document titles and don't need to do a starts_with query, you may choose a flat index to allow for efficient equals and contains queries.
57 |
58 |
59 |
60 |
61 | GSI_INDEX_STRATEGIES.PREFIX
62 |
63 |
64 | Creates a trie index for the field. This is useful for fields that contain short strings (i.e. titles).
65 |
66 |
67 |
68 |
69 | GSI_INDEX_STRATEGIES.CATEGORICAL
70 |
71 |
72 | Creates a categorical index for the field. This is useful for fields that contain specific categories (i.e. genres).
73 |
74 |
75 |
76 |
77 | GSI_INDEX_STRATEGIES.TRIGRAM_CODE
78 |
79 |
80 | Creates a character-level trigram index for the field. This is useful for efficient code search. See the "Code Search" documentation later in this README for more information about using code search with JameSQL.
81 |
82 |
83 |
84 |
--------------------------------------------------------------------------------
/docs/pages/templates/search.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: Search for Documents
4 | permalink: /search/
5 | ---
6 |
7 | There are two ways you can run a search:
8 |
9 | - Using a natural language query with JameSQL operators; or
10 | - Using a JSON DSL.
11 |
12 | ## Using the JSON DSL
13 |
14 | A query has the following format:
15 |
16 |
26 |
27 | - `query` is a dictionary that contains the fields to search for.
28 | - `limit` is the maximum number of documents to return. (default 10)
29 | - `sort_by` is the field to sort by. (default None)
30 | - `skip` is the number of documents to skip. This is useful for implementing pagination. (default 0)
31 |
32 | `limit`, `sort_by`, and `skip` are optional.
33 |
34 | Within the `query` key you can query for documents that match one or more conditions.
35 |
36 | An empty query returns no documents.
37 |
38 | ### Running a search
39 |
40 | To search for documents that match a query, use the following code:
41 |
42 |
43 | result = index.search(query)
44 |
45 |
46 | This returns a JSON payload with the following structure:
47 |
59 |
60 | You can search through multiple pages with the `scroll()` method:
61 |
62 |
63 | result = index.scroll(query)
64 |
65 |
66 | `scroll()` returns a generator that yields documents in the same format as `search()`.
67 |
68 | ## Retrieve All Documents
69 |
70 | You can retrieve all documents by using a catch-all query, which uses the following syntax:
71 |
72 |
80 |
81 | This is useful if you want to page through documents. You should supply a `sort_by` field to ensure the order of documents is consistent.
82 |
83 | ### Response
84 |
85 | All valid queries return responses in the following form:
86 |
87 |
98 |
99 | `documents` is a list of documents that match the query. `query_time` is the amount of time it took to execute the query. `total_results` is the total number of documents that match the query before applying any `limit`.
100 |
101 | `total_results` is useful for implementing pagination.
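As a sketch of how `limit`, `skip`, and `total_results` fit together for pagination, the generator below requests one page at a time until every matching document has been seen. `iter_pages` is a hypothetical helper. It takes the search function as an argument (for example `index.search`) and relies only on the `documents` and `total_results` keys described above.

```python
def iter_pages(search, query, page_size=10):
    """Yield successive pages of documents for `query`."""
    skip = 0
    while True:
        response = search({**query, "limit": page_size, "skip": skip})
        documents = response["documents"]
        if not documents:
            break
        yield documents
        skip += len(documents)
        if skip >= response["total_results"]:
            break
```

Supplying a `sort_by` field in `query` keeps the order of documents consistent from one page to the next.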
102 |
103 | If an error was encountered, the response will be in the following form:
104 |
105 |
You can find documents where a field is less than, greater than, less than or equal to, or greater than or equal to a value with a range query. Here is an example of a query that looks for documents where the year field is greater than 2010:
You can find the total number of unique values for the fields returned by a query using an aggregate query. This is useful for presenting the total number of options available in a search space to a user.
76 |
You can use the following query to find the total number of unique values for all fields whose lyric field contains the term “sky”: