├── test ├── __init__.py ├── test_app.py └── test_embed.py ├── .github ├── CODEOWNERS ├── linters │ ├── .hadolint.yaml │ ├── .markdown-lint.yaml │ └── .python-lint ├── pull_request_template.md ├── dependabot.yml ├── ISSUE_TEMPLATE │ ├── feature-request.md │ └── bug-report.md └── workflows │ ├── embed.yaml │ ├── lint.yaml │ ├── sync-space.yaml │ ├── build.yaml │ ├── unit-test.yaml │ └── integration-test.yaml ├── Modelfile ├── dev-requirements.txt ├── pytest.ini ├── cypress ├── test.sh ├── tsconfig.json ├── e2e │ ├── on_chat_start │ │ └── spec.cy.ts │ └── set_chat_settings │ │ └── spec.cy.ts └── cypress.config.ts ├── package.json ├── .dockerignore ├── .gitignore ├── requirements.txt ├── config.yaml ├── Dockerfile ├── SECURITY.md ├── LICENSE ├── .pre-commit-config.yaml ├── CONTRIBUTING.md ├── .chainlit └── config.toml ├── CODE_OF_CONDUCT.md ├── embed.py ├── app.py └── README.md /test/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @tylertitsworth 2 | -------------------------------------------------------------------------------- /Modelfile: -------------------------------------------------------------------------------- 1 | FROM llama3-groq-tool-use:8b 2 | -------------------------------------------------------------------------------- /dev-requirements.txt: -------------------------------------------------------------------------------- 1 | pylint 2 | pytest 3 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | pythonpath = test/ 3 | -------------------------------------------------------------------------------- /.github/linters/.hadolint.yaml: 
-------------------------------------------------------------------------------- 1 | --- 2 | ignored: 3 | - DL3006 4 | - DL3008 5 | - DL3009 6 | - DL3059 7 | -------------------------------------------------------------------------------- /.github/linters/.markdown-lint.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | MD013: false 3 | MD024: false 4 | MD034: false 5 | MD039: false 6 | MD041: false 7 | -------------------------------------------------------------------------------- /cypress/test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | export TEST=true 4 | chainlit run -h app.py & 5 | sleep 10 6 | npx cypress run --record false --config-file cypress/cypress.config.ts 7 | kill %% 8 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "dependencies": { 3 | "cypress": "13.16.0", 4 | "npx": "^10.2.2", 5 | "shell-exec": "^1.1.2", 6 | "ts-node": "^10.9.2", 7 | "typescript": "^5.7.2" 8 | } 9 | } 10 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ## Describe your changes 2 | 3 | ## Issue ticket number and link 4 | 5 | ## How you have validated your changes 6 | 7 | ## How you have added/modified tests for your changes 8 | -------------------------------------------------------------------------------- /.dockerignore: -------------------------------------------------------------------------------- 1 | !model/*/q4_0.bin 2 | .env 3 | .git/ 4 | .github/ 5 | .gitignore 6 | .pytest_cache/ 7 | .venv/ 8 | .vscode 9 | **__pycache__** 10 | Dockerfile 11 | memory/ 12 | model/*/* 13 | node_modules/ 14 | sources/ 15 | test_data/ 16 | 
-------------------------------------------------------------------------------- /.github/linters/.python-lint: -------------------------------------------------------------------------------- 1 | [MESSAGES CONTROL] 2 | disable=duplicate-code, import-error, line-too-long, missing-module-docstring, no-name-in-module, protected-access, redefined-outer-name, too-few-public-methods, abstract-method, arguments-differ 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | **__pycache__** 2 | .chainlit/*.db 3 | .chainlit/translations 4 | .env 5 | .files 6 | .pytest_cache/ 7 | .venv 8 | .vscode 9 | chainlit.md 10 | cypress/screenshots 11 | data 12 | memory 13 | model 14 | node_modules 15 | sources 16 | test_data 17 | -------------------------------------------------------------------------------- /cypress/tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "lib": [ 4 | "ESNext", 5 | "dom" 6 | ], 7 | "strictNullChecks": true, 8 | "types": [ 9 | "cypress", 10 | "node" 11 | ] 12 | }, 13 | "include": [ 14 | "**/*.ts" 15 | ] 16 | } 17 | -------------------------------------------------------------------------------- /cypress/e2e/on_chat_start/spec.cy.ts: -------------------------------------------------------------------------------- 1 | describe('on_chat_start', () => { 2 | before(() => { 3 | cy.visit('/') 4 | }) 5 | it('should correctly run on_chat_start', () => { 6 | const messages = cy.get('.step') 7 | messages.should('have.length', 1) 8 | 9 | messages.eq(0).should('contain.text', 'Ah my good fellow!') 10 | }) 11 | }) 12 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | #
https://github.com/mediawiki-utilities/python-mwxml/pull/19 2 | git+https://github.com/gdedrouas/python-mwxml@xml_format_0.11 3 | git+https://github.com/mediawiki-utilities/python-mwtypes@updates_schema_0.11 4 | chainlit==1.3.1 5 | chromadb>=0.4.22 6 | discord>=2.3.2 7 | fastapi>=0.109.1 8 | langchain==0.3.4 9 | langchainhub>=0.1.20 10 | langchain-chroma>=0.1.2 11 | langchain-community==0.3.3 12 | langchain-huggingface>=0.0.3 13 | langchain-ollama==0.2.0 14 | openai>=1.12.0 15 | mwparserfromhell==0.6.6 16 | sentence-transformers>=2.3.0 17 | starlette>=0.36.2 18 | tqdm>=4.66.1 19 | -------------------------------------------------------------------------------- /cypress/cypress.config.ts: -------------------------------------------------------------------------------- 1 | import { defineConfig } from 'cypress' 2 | 3 | export default defineConfig({ 4 | projectId: 'iqos6r', 5 | component: { 6 | devServer: { 7 | framework: 'react', 8 | bundler: 'vite' 9 | } 10 | }, 11 | viewportWidth: 1200, 12 | 13 | e2e: { 14 | supportFile: false, 15 | defaultCommandTimeout: 50000, 16 | video: false, 17 | baseUrl: 'http://127.0.0.1:8000', 18 | setupNodeEvents (on) { 19 | on('task', { 20 | log (message) { 21 | console.log(message) 22 | return null 23 | } 24 | }) 25 | } 26 | } 27 | }) 28 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | data_dir: ./data 3 | # Huggingface 4 | embeddings_model: sentence-transformers/all-mpnet-base-v2 5 | introduction: Ah my good fellow! 6 | # Sources 7 | mediawikis: 8 | - dnd4e 9 | - dnd5e 10 | - darksun 11 | - dragonlance 12 | - eberron 13 | - exandria 14 | - forgottenrealms 15 | - greyhawk 16 | - planescape 17 | - ravenloft 18 | - spelljammer 19 | # Ollama 20 | model: volo 21 | question: How many eyestalks does a Beholder have? 
22 | settings: 23 | num_sources: 4 24 | repeat_penalty: 2.2 25 | temperature: 0 26 | top_k: 0 27 | top_p: 0.35 28 | # Sources Path 29 | source: ./sources 30 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ollama/ollama 2 | 3 | RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \ 4 | git \ 5 | python3 \ 6 | python3-pip 7 | 8 | RUN useradd -m -u 1000 user 9 | USER user 10 | ENV HOME=/home/user \ 11 | PATH=/home/user/.local/bin:$PATH \ 12 | TOKENIZERS_PARALLELISM=true 13 | 14 | WORKDIR $HOME/app 15 | 16 | COPY --chown=user . $HOME/app 17 | 18 | RUN python3 -m pip install --no-cache-dir -r requirements.txt 19 | 20 | EXPOSE 7860 21 | 22 | ENTRYPOINT [] 23 | CMD ["/bin/bash", "-c", "/bin/ollama create volo -f Modelfile && chainlit run app.py -h -d --port 7860"] 24 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Reporting a Vulnerability 4 | 5 | Report a vulnerability by using [GitHub's Security Reporting Tool](https://github.com/tylertitsworth/multi-mediawiki-rag/security/advisories/new). 6 | 7 | ## Responding to a Vulnerability Report 8 | 9 | Repository administrators will respond to reports promptly, within at most 14 days. After the first response to a report, if the responder cannot reproduce the vulnerability within 48 hours, the report will be closed. If the vulnerability can be reproduced, the report will be classified as critical priority and triaged as the next contribution to be committed to the repository's default branch.
10 | -------------------------------------------------------------------------------- /.github/dependabot.yml: -------------------------------------------------------------------------------- 1 | --- 2 | version: 2 3 | updates: 4 | - package-ecosystem: "pip" # See documentation for possible values 5 | directory: "." # Location of package manifests 6 | groups: 7 | python-requirements: 8 | patterns: 9 | - "*" 10 | schedule: 11 | interval: "weekly" 12 | - package-ecosystem: "github-actions" # See documentation for possible values 13 | directory: ".github/workflows" # Location of package manifests 14 | schedule: 15 | interval: "weekly" 16 | - package-ecosystem: "npm" 17 | directory: "/" 18 | groups: 19 | node-requirements: 20 | patterns: 21 | - "*" 22 | schedule: 23 | interval: "weekly" 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature-request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature Request 3 | about: Suggest an idea for this project 4 | title: "[FEATURE]" 5 | labels: enhancement 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 
21 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug-report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug Report 3 | about: Create a report to help us improve 4 | title: "[BUG]" 5 | labels: bug 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 16 | 1. Go to '...' 17 | 2. Click on '....' 18 | 3. Scroll down to '....' 19 | 4. See error 20 | 21 | **Expected behavior** 22 | A clear and concise description of what you expected to happen. 23 | 24 | **Screenshots** 25 | If applicable, add screenshots to help explain your problem. 26 | 27 | **Desktop (please complete the following information):** 28 | 29 | - OS: [e.g. iOS] 30 | - Browser [e.g. chrome, safari] 31 | - Version [e.g. 22] 32 | 33 | **Additional context** 34 | Add any other context about the problem here. 35 | -------------------------------------------------------------------------------- /.github/workflows/embed.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Embed 3 | permissions: read-all 4 | on: workflow_dispatch 5 | jobs: 6 | embed: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - uses: actions/checkout@v4 10 | - name: Install Python Requirements 11 | run: pip install -r requirements.txt 12 | - name: Download Sources and Data 13 | run: | 14 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 15 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 16 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 17 | --include *.xml \ 18 | --local-dir . 
\ 19 | --local-dir-use-symlinks False 20 | env: 21 | HF_HUB_ENABLE_HF_TRANSFER: 1 22 | - name: Embed VectorDB 23 | run: python embed.py 24 | - uses: actions/upload-artifact@v4 25 | with: 26 | name: VectorDB 27 | path: data/* 28 | if-no-files-found: 'error' 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023 Tyler Titsworth 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 
20 | -------------------------------------------------------------------------------- /.github/workflows/lint.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Lint 3 | permissions: read-all 4 | on: # yamllint disable-line rule:truthy 5 | push: null 6 | 7 | concurrency: 8 | group: ${{ github.workflow }}-${{ github.ref }} 9 | cancel-in-progress: true 10 | 11 | jobs: 12 | build: 13 | name: Lint 14 | runs-on: ubuntu-latest 15 | 16 | permissions: 17 | contents: read 18 | packages: read 19 | # To report GitHub Actions status checks 20 | statuses: write 21 | 22 | steps: 23 | - name: Checkout code 24 | uses: actions/checkout@v4 25 | with: 26 | fetch-depth: 0 27 | - name: Super-linter 28 | uses: super-linter/super-linter/slim@v6.8.0 29 | env: 30 | DEFAULT_BRANCH: main 31 | # To report GitHub Actions status checks 32 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 33 | GITHUB_ACTIONS_COMMAND_ARGS: '-ignore SC.*' 34 | TYPESCRIPT_STANDARD_TSCONFIG_FILE: cypress/tsconfig.json 35 | VALIDATE_CHECKOV: false 36 | VALIDATE_TYPESCRIPT_PRETTIER: false 37 | VALIDATE_PYTHON_FLAKE8: false 38 | VALIDATE_PYTHON_ISORT: false 39 | VALIDATE_JSCPD: false 40 | VALIDATE_PYTHON_MYPY: false 41 | -------------------------------------------------------------------------------- /.github/workflows/sync-space.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Sync Space 3 | permissions: read-all 4 | on: 5 | push: 6 | branches: 7 | - main 8 | 9 | jobs: 10 | sync: 11 | runs-on: ubuntu-latest 12 | steps: 13 | - uses: actions/checkout@v4 14 | with: 15 | fetch-depth: 0 16 | - name: Login with Huggingface CLI 17 | run: | 18 | pip install -U huggingface_hub[cli] hf-transfer 19 | git config --global credential.helper store 20 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 21 | - name: Upload Files 22 | env: 23 | HF_HUB_ENABLE_HF_TRANSFER: 1 24 | run: | 25 | files=$(git diff --name-only HEAD 
HEAD~1) 26 | deleted_files=$(git diff --name-only --diff-filter=A HEAD HEAD~1) 27 | upload_files() { 28 | local file=$1 29 | if [ "$file" != "README.md" ] && [ "$file" != ".gitignore" ] && [ "$file" != ".chainlit/config.toml" ] && [ "$file" != "Modelfile" ]; then 30 | huggingface-cli upload --repo-type space TotalSundae/dungeons-and-dragons $file $file 31 | fi 32 | } 33 | for file in $files; do 34 | if [[ ! "${deleted_files[@]}" =~ "${file}" ]]; then 35 | upload_files $file 36 | else 37 | echo "File $file needs to be deleted" 38 | fi 39 | done 40 | -------------------------------------------------------------------------------- /test/test_app.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import torch 3 | from langchain_community.embeddings import HuggingFaceEmbeddings 4 | from langchain_community.vectorstores import Chroma 5 | from embed import parse_args, load_config, load_documents 6 | from app import import_db, create_chain 7 | 8 | if not torch.cuda.is_available(): 9 | torch.set_num_threads(torch.get_num_threads() * 2) 10 | 11 | 12 | def test_import_db(): 13 | "Test import_db()." 
14 | config = load_config() 15 | config = parse_args(config, ["--test-embed"]) 16 | documents = load_documents(config) 17 | if not Path(config["data_dir"]).is_dir(): 18 | embeddings = HuggingFaceEmbeddings( 19 | model_name=config["embeddings_model"], cache_folder="./model" 20 | ) 21 | vectordb = Chroma.from_documents( 22 | documents=documents, 23 | embedding=embeddings, 24 | persist_directory=config["data_dir"], 25 | ) 26 | vectordb.persist() 27 | vectordb = import_db(config) 28 | assert Path("test_data").is_dir() 29 | assert vectordb.embeddings.model_name == config["embeddings_model"] 30 | assert vectordb.embeddings.cache_folder == "./model" 31 | 32 | 33 | def test_create_chain(): 34 | "Test create_chain()" 35 | config = load_config() 36 | config = parse_args(config, ["--test-embed"]) 37 | chain = create_chain(config) 38 | res = chain.invoke(config["question"]) 39 | assert res != "" 40 | # assert res["answer"] != "" 41 | # assert res["context"] != [] 42 | -------------------------------------------------------------------------------- /cypress/e2e/set_chat_settings/spec.cy.ts: -------------------------------------------------------------------------------- 1 | describe('set_chat_settings', () => { 2 | before(() => { 3 | cy.visit('/') 4 | }) 5 | 6 | it('should update inputs', () => { 7 | // Open chat settings modal 8 | cy.get('#chat-settings-open-modal').should('exist') 9 | cy.get('#chat-settings-open-modal').click() 10 | cy.get('#chat-settings-dialog').should('exist') 11 | 12 | cy.get('#num_sources').clear().type('5') 13 | cy.get('#num_sources').should('have.value', '5') 14 | 15 | cy.get('#temperature').clear().type('0.7') 16 | cy.get('#temperature').should('have.value', '0.7') 17 | 18 | cy.get('#repeat_penalty').type('{upArrow}{upArrow}').trigger('change') 19 | cy.get('#repeat_penalty').should('have.value', '2.4') 20 | 21 | cy.get('#top_k').type('{upArrow}{upArrow}').trigger('change') 22 | cy.get('#top_k').should('have.value', '2') 23 | 24 | 
cy.get('#top_p').clear().type('0.77') 25 | cy.get('#top_p').should('have.value', '0.77') 26 | 27 | cy.contains('Confirm').click() 28 | 29 | cy.get('.step').should('have.length', 1) 30 | 31 | // Check if inputs are updated 32 | cy.get('#chat-settings-open-modal').click() 33 | cy.get('#num_sources').should('have.value', '5') 34 | cy.get('#temperature').should('have.value', '0.7') 35 | cy.get('#repeat_penalty').should('have.value', '2.4') 36 | cy.get('#top_k').should('have.value', '2') 37 | cy.get('#top_p').should('have.value', '0.77') 38 | 39 | // Check if modal is correctly closed 40 | cy.contains('Cancel').click() 41 | cy.get('#chat-settings-dialog').should('not.exist') 42 | }) 43 | }) 44 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | ci: 3 | autofix_commit_msg: "[pre-commit.ci] auto fixes from pre-commit.com hooks" 4 | autofix_prs: true 5 | autoupdate_commit_msg: '[pre-commit.ci] pre-commit autoupdate' 6 | autoupdate_schedule: weekly 7 | skip: [hadolint-docker, markdownlint, pylint, pytest, cypress] 8 | submodules: false 9 | repos: 10 | - repo: https://github.com/pre-commit/pre-commit-hooks 11 | rev: v5.0.0 12 | hooks: 13 | - id: check-added-large-files 14 | - id: check-ast 15 | - id: check-merge-conflict 16 | - id: check-yaml 17 | - id: debug-statements 18 | - id: end-of-file-fixer 19 | - id: forbid-submodules 20 | - id: sort-simple-yaml 21 | files: config.yaml 22 | - id: trailing-whitespace 23 | - repo: https://github.com/hadolint/hadolint 24 | rev: v2.13.1-beta 25 | hooks: 26 | - id: hadolint-docker 27 | args: ["--config", ".github/linters/.hadolint.yaml"] 28 | - repo: https://github.com/igorshubovych/markdownlint-cli 29 | rev: v0.43.0 30 | hooks: 31 | - id: markdownlint 32 | args: ["--config", ".github/linters/.markdown-lint.yaml"] 33 | - repo: https://github.com/psf/black 34 | rev: 24.10.0 35 | hooks: 36 | 
- id: black 37 | - repo: local 38 | hooks: 39 | - id: pylint 40 | name: pylint 41 | entry: pylint 42 | language: system 43 | types: [python] 44 | args: ["--rcfile=.github/linters/.python-lint"] 45 | - id: pytest 46 | name: pytest 47 | entry: pytest 48 | language: system 49 | types: [python] 50 | args: ["test/test_app.py", "test/test_embed.py", "-W", "ignore::DeprecationWarning"] 51 | pass_filenames: false 52 | - id: cypress 53 | name: cypress 54 | entry: bash 55 | language: system 56 | types_or: [ts, python] 57 | args: ["cypress/test.sh"] 58 | -------------------------------------------------------------------------------- /.github/workflows/build.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Build Container 3 | permissions: read-all 4 | on: 5 | push: 6 | branches: 7 | - main 8 | 9 | jobs: 10 | container-build: 11 | permissions: 12 | packages: write 13 | runs-on: ubuntu-latest 14 | steps: 15 | - uses: actions/checkout@v4 16 | - uses: docker/login-action@v3 17 | with: 18 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 19 | username: ${{ github.actor }} 20 | password: ${{ github.token }} 21 | - uses: docker/metadata-action@v5 22 | id: meta 23 | with: 24 | images: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 25 | - uses: docker/build-push-action@v6 26 | with: 27 | context: . 
28 | push: true 29 | tags: ${{ steps.meta.outputs.tags }} 30 | labels: ${{ steps.meta.outputs.labels }} 31 | container-scan: 32 | needs: [container-build] 33 | permissions: 34 | contents: read # for actions/checkout to fetch code 35 | security-events: write # for github/codeql-action/upload-sarif to upload SARIF results 36 | actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status 37 | runs-on: ubuntu-latest 38 | steps: 39 | - uses: docker/login-action@v3 40 | with: 41 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 42 | username: ${{ github.actor }} 43 | password: ${{ github.token }} 44 | - uses: aquasecurity/trivy-action@0.29.0 45 | with: 46 | image-ref: ghcr.io/${{ github.repository_owner }}/${{ github.repository }}:main 47 | format: sarif 48 | output: trivy-results-main.sarif 49 | - uses: github/codeql-action/upload-sarif@v3 50 | with: 51 | sarif_file: trivy-results-main.sarif 52 | -------------------------------------------------------------------------------- /.github/workflows/unit-test.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Unit Tests 3 | permissions: read-all 4 | on: 5 | push: 6 | branches: 7 | - main 8 | pull_request: 9 | branches: 10 | - main 11 | paths: 12 | - '**.py' 13 | - 'requirements.txt' 14 | - 'config.yaml' 15 | - 'test/**.py' 16 | - 'cypress/**/**.ts' 17 | 18 | concurrency: 19 | group: ${{ github.workflow }}-${{ github.ref }} 20 | cancel-in-progress: true 21 | 22 | jobs: 23 | pytest: 24 | runs-on: ubuntu-latest 25 | steps: 26 | - uses: actions/checkout@v4 27 | - name: Install requirements 28 | run: | 29 | python -m pip install pytest 30 | python -m pip install -r requirements.txt 31 | - name: Download Sources 32 | run: | 33 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 34 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 35 | huggingface-cli download --repo-type space 
TotalSundae/dungeons-and-dragons \ 36 | --include *.xml \ 37 | --local-dir . \ 38 | --local-dir-use-symlinks False 39 | env: 40 | HF_HUB_ENABLE_HF_TRANSFER: 1 41 | - name: Basic Unit Test 42 | run: pytest test/test_embed.py -W ignore::DeprecationWarning 43 | cypress: 44 | runs-on: ubuntu-latest 45 | steps: 46 | - uses: actions/checkout@v4 47 | - name: Install Python Requirements 48 | run: python -m pip install -r requirements.txt 49 | - name: Setup Ollama 50 | run: | 51 | curl https://ollama.ai/install.sh | sh 52 | sleep 5 53 | ollama create volo -f ./Modelfile 54 | - uses: actions/setup-node@v4 55 | with: 56 | node-version: 18 57 | - uses: cypress-io/github-action@v6 58 | with: 59 | config-file: ${{ github.workspace }}/cypress/cypress.config.ts 60 | record: true 61 | spec: cypress/e2e/on_chat_start/spec.cy.ts 62 | start: chainlit run app.py -h 63 | wait-on: 'http://localhost:8000' 64 | wait-on-timeout: 10 65 | env: 66 | CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }} 67 | GITHUB_TOKEN: ${{ github.token }} 68 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 69 | - uses: cypress-io/github-action@v6 70 | with: 71 | config-file: ${{ github.workspace }}/cypress/cypress.config.ts 72 | record: true 73 | start: chainlit run app.py -h 74 | wait-on: 'http://localhost:8000' 75 | wait-on-timeout: 10 76 | env: 77 | CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }} 78 | GITHUB_TOKEN: ${{ github.token }} 79 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 80 | -------------------------------------------------------------------------------- /test/test_embed.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | import yaml 3 | from embed import ( 4 | parse_args, 5 | load_config, 6 | rename_duplicates, 7 | load_document, 8 | load_documents, 9 | ) 10 | 11 | Document = namedtuple("Document", ["page_content", "metadata"]) 12 | 13 | 14 | def test_parse_args(): 15 | "Test parse_args()." 
16 | config = load_config() 17 | config = parse_args(config, ["--test-embed"]) 18 | assert config["mediawikis"] == ["dnd5e"] 19 | assert config["data_dir"] == "./test_data" 20 | assert config["question"] == "What is the Armor Class of a Beholder?" 21 | 22 | 23 | def test_load_config(): 24 | "Test load_config()." 25 | with open("config.yaml", "r", encoding="utf-8") as file: 26 | data = yaml.safe_load(file) 27 | config = load_config() 28 | assert data == config 29 | 30 | 31 | def test_rename_duplicates(): 32 | "Test rename_duplicates()." 33 | documents = [ 34 | Document(page_content="document 1", metadata={"source": "mydoc"}), 35 | Document(page_content="document 2", metadata={"source": "mydoc"}), 36 | ] 37 | renamed_documents = rename_duplicates(documents) 38 | assert documents[0].page_content == renamed_documents[0].page_content 39 | assert documents[0].page_content != renamed_documents[1].page_content 40 | assert documents[0].metadata["source"] == renamed_documents[0].metadata["source"] 41 | assert documents[0].metadata["source"] != renamed_documents[1].metadata["source"] 42 | 43 | 44 | def test_load_document(): 45 | "Test load_document()." 46 | config = load_config() 47 | wiki = (config["source"], "dnd5e") 48 | documents = load_document(wiki) 49 | beholder_page = [ 50 | document 51 | for document in documents 52 | if document.metadata["source"] == "Beholder - dnd5e" 53 | ] 54 | assert "From Monster Manual, page 28." in beholder_page[0].page_content 55 | assert {"source": "Beholder - dnd5e"} == beholder_page[0].metadata 56 | 57 | 58 | def test_load_documents(): 59 | "Test load_documents()." 
60 | config = load_config() 61 | config = parse_args(config, ["--test-embed"]) 62 | documents = load_documents(config) 63 | beholder_page = [ 64 | document 65 | for document in documents 66 | if document.metadata["source"] == "Beholder - dnd5e" 67 | ] 68 | beholder_page_1 = [ 69 | document 70 | for document in documents 71 | if document.metadata["source"] == "Beholder - dnd5e_1" 72 | ] 73 | assert "From Monster Manual, page 28." in beholder_page[0].page_content 74 | assert {"source": "Beholder - dnd5e", "start_index": -1} == beholder_page[ 75 | 0 76 | ].metadata 77 | assert "Eye Rays." in beholder_page_1[0].page_content 78 | assert {"source": "Beholder - dnd5e_1", "start_index": -1} == beholder_page_1[ 79 | 0 80 | ].metadata 81 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to multi-mediawiki-rag 2 | 3 | Thank you for considering contributing to multi-mediawiki-rag! We welcome your help to make this project better. 4 | 5 | ## Getting Started 6 | 7 | Before you start contributing, please take a moment to review the following guidelines. 8 | 9 | ### Code of Conduct 10 | 11 | Please review our [Code of Conduct](CODE_OF_CONDUCT.md) to understand the standards we follow and the behavior expected in this project. 12 | 13 | ### How to Contribute 14 | 15 | 1. Fork the repository. 16 | 2. Create a new branch for your contribution: `git checkout -b feature/your-feature`. 17 | 3. Install [pre-commit](https://pre-commit.com/), [Docker](https://docs.docker.com/engine/install/), and [Node.js](https://nodejs.org/en/download). 18 | 4. Follow the repository [prerequisites](README.md#prerequisites) and [Unit Testing Prerequisites](README.md#testing). 19 | 5. Make your changes and commit them: `git commit -m 'Add your feature'`. 20 | 6. Push to the branch: `git push origin feature/your-feature`. 21 | 7. Submit a pull request.
22 | 23 | ## Contribution Guidelines 24 | 25 | To ensure a smooth and effective contribution process, please follow these guidelines: 26 | 27 | ### Reporting Issues 28 | 29 | - Before creating a new issue, check if it already exists. 30 | - Use a clear and descriptive title for the issue. 31 | - Provide a detailed description of the issue, including steps to reproduce it. 32 | 33 | ### Making Changes 34 | 35 | - Fork the repository and create a new branch for your changes. 36 | - Keep each pull request focused on a single feature or bugfix. 37 | - Write clear and descriptive commit messages. 38 | - Keep code changes concise and well-documented. 39 | - Ensure that your code adheres to the project's coding standards. 40 | 41 | ### Testing 42 | 43 | - Include tests for your changes, if applicable. 44 | - Ensure that all existing tests pass before submitting a pull request. 45 | - Provide information on how to test your changes. 46 | 47 | ### Documentation 48 | 49 | - If you make changes that affect the project's documentation, update it accordingly. 50 | - Document new features and functionalities. 51 | 52 | ### Code Style 53 | 54 | - Follow the established [code style](https://google.github.io/styleguide/pyguide.html) for this project. 55 | - Consistent and clean code is highly appreciated. 56 | 57 | ### Pull Requests 58 | 59 | - Include a summary of your changes in your pull request. 60 | - Reference the relevant issue(s) if applicable. 61 | - Be responsive to feedback and be ready to make further changes if necessary. 62 | 63 | ## Code of Conduct 64 | 65 | This project follows the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). Please review it to understand the expectations for participant behavior. 66 | 67 | ## License 68 | 69 | By contributing to multi-mediawiki-rag, you agree that your contributions will be licensed under the [MIT License](LICENSE). 70 | 71 | Thank you for your contribution! 
72 | -------------------------------------------------------------------------------- /.chainlit/config.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | # Whether to enable telemetry (default: true). No personal data is collected. 3 | enable_telemetry = true 4 | 5 | # List of environment variables to be provided by each user to use the app. 6 | user_env = [] 7 | 8 | # Duration (in seconds) during which the session is saved when the connection is lost 9 | session_timeout = 3600 10 | 11 | # Enable third parties caching (e.g LangChain cache) 12 | cache = true 13 | 14 | # Authorized origins 15 | allow_origins = ["*"] 16 | 17 | # Follow symlink for asset mount (see https://github.com/Chainlit/chainlit/issues/317) 18 | # follow_symlink = false 19 | 20 | [features] 21 | # Show the prompt playground 22 | prompt_playground = true 23 | 24 | # Process and display HTML in messages. This can be a security risk (see https://stackoverflow.com/questions/19603097/why-is-it-dangerous-to-render-user-generated-html-or-javascript) 25 | unsafe_allow_html = false 26 | 27 | # Process and display mathematical expressions. This can clash with "$" characters in messages. 28 | latex = false 29 | 30 | # Authorize users to upload files with messages 31 | [features.multi_modal] 32 | enabled = true 33 | accept = ["*/*"] 34 | max_files = 20 35 | max_size_mb = 500 36 | 37 | # Allows user to use speech to text 38 | [features.speech_to_text] 39 | enabled = false 40 | # See all languages here https://github.com/JamesBrill/react-speech-recognition/blob/HEAD/docs/API.md#language-string 41 | # language = "en-US" 42 | 43 | [UI] 44 | # Name of the app and chatbot. 45 | name = "Volo" 46 | 47 | # Show the readme while the thread is empty. 48 | show_readme_as_default = true 49 | 50 | # Description of the app and chatbot. This is used for HTML tags. 
51 | # description = "" 52 | 53 | # Large size content are by default collapsed for a cleaner ui 54 | default_collapse_content = true 55 | 56 | # The default value for the expand messages settings. 57 | default_expand_messages = true 58 | 59 | # Hide the chain of thought details from the user in the UI. 60 | hide_cot = false 61 | 62 | # Link to your github repo. This will add a github button in the UI's header. 63 | # github = "https://github.com/tylertitsworth/multi-mediawiki-rag" 64 | 65 | # Specify a CSS file that can be used to customize the user interface. 66 | # The CSS file can be served from the public directory or via an external link. 67 | # custom_css = "/public/test.css" 68 | 69 | # Specify a Javascript file that can be used to customize the user interface. 70 | # The Javascript file can be served from the public directory. 71 | # custom_js = "/public/test.js" 72 | 73 | # Specify a custom font url. 74 | # custom_font = "https://fonts.googleapis.com/css2?family=Inter:wght@400;500;700&display=swap" 75 | 76 | # Specify a custom build directory for the frontend. 77 | # This can be used to customize the frontend code. 78 | # Be careful: If this is a relative path, it should not start with a slash. 79 | # custom_build = "./public/build" 80 | 81 | # Override default MUI light theme. (Check theme.ts) 82 | [UI.theme] 83 | #font_family = "Inter, sans-serif" 84 | [UI.theme.light] 85 | #background = "#FAFAFA" 86 | #paper = "#FFFFFF" 87 | 88 | [UI.theme.light.primary] 89 | #main = "#F80061" 90 | #dark = "#980039" 91 | #light = "#FFE7EB" 92 | 93 | # Override default MUI dark theme. 
(Check theme.ts) 94 | [UI.theme.dark] 95 | #background = "#FAFAFA" 96 | #paper = "#FFFFFF" 97 | 98 | [UI.theme.dark.primary] 99 | #main = "#F80061" 100 | #dark = "#980039" 101 | #light = "#FFE7EB" 102 | 103 | 104 | [meta] 105 | generated_by = "1.0.502" 106 | -------------------------------------------------------------------------------- /.github/workflows/integration-test.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Integration Test 3 | permissions: read-all 4 | on: 5 | pull_request_review: 6 | types: [submitted] 7 | 8 | concurrency: 9 | group: ${{ github.workflow }}-${{ github.ref }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | container-build: 14 | if: > 15 | github.event.review.state == 'approved' || 16 | contains(github.event.pull_request.assignees.*.login, 'tylertitsworth') 17 | permissions: 18 | packages: write 19 | runs-on: ubuntu-latest 20 | steps: 21 | - uses: actions/checkout@v4 22 | - uses: docker/login-action@v3 23 | with: 24 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 25 | username: ${{ github.actor }} 26 | password: ${{ github.token }} 27 | - uses: docker/metadata-action@v5 28 | id: meta 29 | with: 30 | images: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 31 | - uses: docker/build-push-action@v6 32 | with: 33 | context: . 
34 | push: true 35 | tags: ${{ steps.meta.outputs.tags }} 36 | labels: ${{ steps.meta.outputs.labels }} 37 | container-scan: 38 | needs: [container-build] 39 | permissions: 40 | contents: read # for actions/checkout to fetch code 41 | security-events: write # for github/codeql-action/upload-sarif to upload SARIF results 42 | actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status 43 | runs-on: ubuntu-latest 44 | steps: 45 | - uses: docker/login-action@v3 46 | with: 47 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 48 | username: ${{ github.actor }} 49 | password: ${{ github.token }} 50 | - uses: aquasecurity/trivy-action@0.29.0 51 | with: 52 | image-ref: ghcr.io/${{ github.repository_owner }}/${{ github.repository }}:pr-${{ github.event.pull_request.number }} 53 | format: sarif 54 | output: trivy-results-pr-${{ github.event.pull_request.number }}.sarif 55 | - uses: github/codeql-action/upload-sarif@v3 56 | with: 57 | sarif_file: trivy-results-pr-${{ github.event.pull_request.number }}.sarif 58 | embed-test: 59 | needs: [container-build] 60 | runs-on: ubuntu-latest 61 | steps: 62 | - uses: actions/checkout@v4 63 | - name: Download Sources and Data 64 | run: | 65 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 66 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 67 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 68 | --include *.xml \ 69 | --local-dir . \ 70 | --local-dir-use-symlinks False 71 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 72 | --include data/* \ 73 | --local-dir . 
\ 74 | --local-dir-use-symlinks False 75 | env: 76 | HF_HUB_ENABLE_HF_TRANSFER: 1 77 | - name: Test Embedding and Chain Creation 78 | run: | 79 | docker run --shm-size=7GB \ 80 | -u root -w /home/user/app \ 81 | -v $PWD/data:/home/user/app/test_data \ 82 | -v $PWD/sources:/home/user/app/sources \ 83 | ghcr.io/${{ github.repository_owner }}/${{ github.repository }}:pr-${{ github.event.pull_request.number }} \ 84 | bash -c "pip install pytest && pytest test/test_embed.py -W ignore::DeprecationWarning" 85 | env: 86 | OLLAMA_HOST: "https://totalsundae-ollama.hf.space" 87 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 88 | e2e-test: 89 | needs: [embed-test] 90 | runs-on: ubuntu-latest 91 | steps: 92 | - uses: actions/checkout@v4 93 | - name: Install Python Requirements 94 | run: python -m pip install -r requirements.txt 95 | - name: Setup Ollama 96 | run: | 97 | curl https://ollama.ai/install.sh | sh 98 | sleep 5 99 | ollama create volo -f ./Modelfile 100 | env: 101 | OLLAMA_HOST: "https://totalsundae-ollama.hf.space" 102 | - name: Download Data 103 | run: | 104 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 105 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 106 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 107 | --include data/* \ 108 | --local-dir . 
\ 109 | --local-dir-use-symlinks False 110 | env: 111 | HF_HUB_ENABLE_HF_TRANSFER: 1 112 | - name: Move Data 113 | run: | 114 | mkdir test_data 115 | mv data/* test_data/ 116 | - uses: cypress-io/github-action@v6 117 | with: 118 | config-file: ${{ github.workspace }}/cypress/cypress.config.ts 119 | record: true 120 | start: chainlit run app.py -h 121 | wait-on: 'http://localhost:8000' 122 | wait-on-timeout: 10 123 | env: 124 | CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }} 125 | GITHUB_TOKEN: ${{ github.token }} 126 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 127 | TEST: true 128 | OLLAMA_HOST: "https://totalsundae-ollama.hf.space" 129 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 
14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 
55 | Examples of representing our community include using an official email address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | @tylertitsworth. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 
99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 
129 | -------------------------------------------------------------------------------- /embed.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | from collections import namedtuple 4 | from typing import Any 5 | 6 | import torch 7 | import yaml 8 | from langchain_community.document_loaders import MWDumpLoader 9 | from langchain_huggingface import HuggingFaceEmbeddings 10 | from langchain_chroma import Chroma 11 | from langchain_text_splitters import RecursiveCharacterTextSplitter 12 | from tqdm.contrib.concurrent import process_map 13 | 14 | Document = namedtuple("Document", ["page_content", "metadata"]) 15 | if not torch.cuda.is_available(): 16 | torch.set_num_threads(torch.get_num_threads() * 2) 17 | 18 | 19 | def parse_args(config: dict, args: list): 20 | """Parses command line arguments. 21 | 22 | Args: 23 | config (dict): items in config.yaml 24 | args (list(str)): user input parameters 25 | 26 | Returns: 27 | dict: dictionary of items in config.yaml, modified by user input parameters 28 | """ 29 | parser = argparse.ArgumentParser() 30 | parser.add_argument("--test-embed", dest="test_embed", action="store_true") 31 | args = parser.parse_args(args) 32 | if args.test_embed: 33 | config["mediawikis"] = ["dnd5e"] 34 | config["data_dir"] = "./test_data" 35 | config["question"] = "What is the Armor Class of a Beholder?" 36 | 37 | return config 38 | 39 | 40 | def load_config(): 41 | """Loads configuration from config.yaml file. 
42 | 43 | Returns: 44 | dict: items in config.yaml 45 | """ 46 | try: 47 | with open("config.yaml", "r", encoding="utf-8") as file: 48 | data = yaml.safe_load(file) 49 | except FileNotFoundError: 50 | print("Error: File config.yaml not found.") 51 | sys.exit(1) 52 | except yaml.YAMLError as err: 53 | print(f"Error reading YAML file: {err}") 54 | sys.exit(1) 55 | 56 | return data 57 | 58 | 59 | def rename_duplicates(documents: list[Document]): 60 | """Rename duplicate sources in a list of documents. 61 | 62 | Args: 63 | documents (list(Document)): input documents via loader.load() 64 | 65 | Returns: 66 | list(Document): input documents with modified source metadata 67 | """ 68 | document_counts = {} 69 | for idx, doc in enumerate(documents): 70 | doc_source = doc.metadata["source"] 71 | count = document_counts.get(doc_source, 0) + 1 72 | document_counts[doc_source] = count 73 | documents[idx].metadata["source"] = ( 74 | doc_source if count == 1 else f"{doc_source}_{count - 1}" 75 | ) 76 | 77 | return documents 78 | 79 | 80 | def load_document(wiki: tuple): 81 | """Loads an XML file of mediawiki pages into document format.
82 | 83 | Args: 84 | wiki (tuple): (path to the source directory, name of the wiki) 85 | 86 | Returns: 87 | list(Document): input documents from mediawikis config with modified source metadata 88 | """ 89 | # https://python.langchain.com/docs/integrations/document_loaders/mediawikidump 90 | loader = MWDumpLoader( 91 | encoding="utf-8", 92 | file_path=f"{wiki[0]}/{wiki[1]}_pages_current.xml", 93 | # https://www.mediawiki.org/wiki/Help:Namespaces 94 | namespaces=[0], 95 | skip_redirects=True, 96 | stop_on_error=False, 97 | ) 98 | # For each Document provided: 99 | # Modify the source metadata by accounting for duplicates (_n) 100 | # And append the mediawiki name (" - {wiki}") 101 | 102 | return [ 103 | Document(doc.page_content, {"source": doc.metadata["source"] + f" - {wiki[1]}"}) 104 | for doc in rename_duplicates(loader.load()) 105 | ] 106 | 107 | 108 | class CustomTextSplitter(RecursiveCharacterTextSplitter): 109 | """Creates a custom Character Text Splitter. 110 | 111 | Args: 112 | RecursiveCharacterTextSplitter (RecursiveCharacterTextSplitter): Generates chunks based on different separator rules 113 | """ 114 | 115 | def __init__(self, **kwargs: Any) -> None: 116 | separators = [r"\w(=){3}\n", r"\w(=){2}\n", r"\n\n", r"\n", r"\s"] 117 | super().__init__(separators=separators, keep_separator=False, **kwargs) 118 | 119 | 120 | def load_documents(config: dict): 121 | """Load all documents from each configured mediawiki using multiprocessing.
122 | 123 | Args: 124 | config (dict): items in config.yaml 125 | 126 | Returns: 127 | list(Document): input documents from mediawikis config with modified source metadata 128 | """ 129 | 130 | documents = sum( 131 | process_map( 132 | load_document, 133 | [(config["source"], wiki) for wiki in config["mediawikis"]], 134 | desc="Loading Documents", 135 | max_workers=torch.get_num_threads(), 136 | ), 137 | [], 138 | ) 139 | splitter = CustomTextSplitter( 140 | add_start_index=True, 141 | chunk_size=1000, 142 | is_separator_regex=True, 143 | ) 144 | documents = sum( 145 | process_map( 146 | splitter.split_documents, 147 | [[doc] for doc in documents], 148 | chunksize=1, 149 | desc="Splitting Documents", 150 | max_workers=torch.get_num_threads(), 151 | ), 152 | [], 153 | ) 154 | documents = rename_duplicates(documents) 155 | 156 | return documents 157 | 158 | 159 | if __name__ == "__main__": 160 | config = load_config() 161 | config = parse_args(config, sys.argv[1:]) 162 | documents = load_documents(config) 163 | print(f"Embedding {len(documents)} Documents, this may take a while.") 164 | # https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub 165 | embeddings = HuggingFaceEmbeddings( 166 | cache_folder="./model", 167 | model_name=config["embeddings_model"], 168 | show_progress=True, 169 | ) 170 | # https://python.langchain.com/docs/integrations/vectorstores/chroma 171 | vectordb = Chroma.from_documents( 172 | documents=documents, 173 | embedding=embeddings, 174 | persist_directory=config["data_dir"], 175 | ) 176 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import chainlit as cl 4 | from chainlit.input_widget import Slider, TextInput 5 | from langchain import hub 6 | from langchain.callbacks.base import BaseCallbackHandler 7 | from langchain.globals import set_llm_cache 8 | from 
langchain.schema.runnable.config import RunnableConfig 9 | from langchain_chroma import Chroma 10 | from langchain_community.cache import InMemoryCache 11 | from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler 12 | from langchain_core.output_parsers import StrOutputParser 13 | from langchain_core.runnables import RunnablePassthrough 14 | from langchain_huggingface import HuggingFaceEmbeddings 15 | from langchain_ollama import ChatOllama 16 | 17 | from embed import load_config, parse_args 18 | 19 | 20 | def import_db(config: dict): 21 | """Use existing Chroma vectorDB 22 | 23 | Args: 24 | config (dict): items in config.yaml 25 | 26 | Returns: 27 | Chroma: initialize a Chroma client. 28 | """ 29 | # https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub 30 | embeddings = HuggingFaceEmbeddings( 31 | cache_folder="./model", 32 | model_name=config["embeddings_model"], 33 | ) 34 | vectordb = Chroma( 35 | persist_directory=config["data_dir"], embedding_function=embeddings 36 | ) 37 | 38 | return vectordb 39 | 40 | 41 | def create_chain(config: dict): 42 | """Creates a conversation chain from a config file. 
43 | 44 | Args: 45 | config (dict): items in config.yaml 46 | 47 | Returns: 48 | Runnable: Langchain Runnable for use with ChatOllama 49 | """ 50 | if os.getenv("TEST"): 51 | config = parse_args(config, ["--test-embed"]) 52 | print("Running in TEST mode.") 53 | set_llm_cache(InMemoryCache()) 54 | callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]) 55 | vectordb = import_db(config) 56 | prompt = hub.pull("rlm/rag-prompt") 57 | # https://python.langchain.com/docs/integrations/llms/ollama 58 | model = ChatOllama( 59 | cache=True, 60 | callback_manager=callback_manager, 61 | model=config["model"], 62 | repeat_penalty=config["settings"]["repeat_penalty"], 63 | temperature=config["settings"]["temperature"], 64 | top_k=config["settings"]["top_k"], 65 | top_p=config["settings"]["top_p"], 66 | ) 67 | chain = ( 68 | { 69 | "context": vectordb.as_retriever( 70 | search_kwargs={"k": int(config["settings"]["num_sources"])} 71 | ), 72 | "question": RunnablePassthrough(), 73 | } 74 | | prompt 75 | | model 76 | | StrOutputParser() 77 | ) 78 | 79 | return chain 80 | 81 | 82 | async def update_cl(config: dict, settings: dict): 83 | """Update the model configuration. 84 | 85 | Args: 86 | config (dict): items in config.yaml 87 | settings (dict): user chat settings input 88 | """ 89 | if settings: 90 | config["settings"] = settings 91 | chain = create_chain(config) 92 | # https://docs.chainlit.io/api-reference/chat-settings 93 | inputs = [ 94 | TextInput( 95 | id="num_sources", 96 | label="# of Sources", 97 | initial=str(config["settings"]["num_sources"]), 98 | description="Number of sources returned based on their similarity score. The same source can be returned more than once. (Default: 4)", 99 | ), 100 | Slider( 101 | id="temperature", 102 | label="Temperature", 103 | initial=config["settings"]["temperature"], 104 | min=0, 105 | max=1, 106 | step=0.1, 107 | description="The temperature of the model. 
Increasing the temperature will make the model answer more creatively. (Default: 0.8)", 108 | ), 109 | Slider( 110 | id="repeat_penalty", 111 | label="Repeat Penalty", 112 | initial=config["settings"]["repeat_penalty"], 113 | min=1.0, 114 | max=3.0, 115 | step=0.1, 116 | description="Sets how strongly to penalize repetitions. A higher value will penalize repetitions more strongly. (Default: 1.1)", 117 | ), 118 | Slider( 119 | id="top_k", 120 | label="Top K", 121 | initial=config["settings"]["top_k"], 122 | min=0, 123 | max=100, 124 | step=1, 125 | description="Reduces the probability of generating nonsense. A higher value will give more diverse answers. (Default: 40)", 126 | ), 127 | Slider( 128 | id="top_p", 129 | label="Top P", 130 | initial=config["settings"]["top_p"], 131 | min=0, 132 | max=1, 133 | step=0.1, 134 | description="Works together with top-k. A higher value will lead to more diverse text. (Default: 0.9)", 135 | ), 136 | ] 137 | cl.user_session.set("chain", chain) 138 | 139 | await cl.ChatSettings(inputs).send() 140 | 141 | 142 | # https://docs.chainlit.io/integrations/langchain 143 | # https://docs.chainlit.io/examples/qa 144 | @cl.on_chat_start 145 | async def on_chat_start(): 146 | """ 147 | Triggered at the start of a chat session. It loads the model configuration from a file 148 | and sets it in the user session for future use. 149 | """ 150 | config = load_config() 151 | cl.user_session.set("config", config) 152 | await update_cl(config, {}) 153 | 154 | await cl.Message(content=config["introduction"]).send() 155 | 156 | 157 | @cl.on_message 158 | async def on_message(message: cl.Message): 159 | "Chat message handler." 160 | runnable = cl.user_session.get("chain") 161 | msg = cl.Message(content="") 162 | 163 | class PostMessageHandler(BaseCallbackHandler): 164 | """ 165 | Callback handler for handling the retriever and LLM processes. 166 | Used to post the sources of the retrieved documents as a Chainlit element. 
167 | """ 168 | 169 | def __init__(self, msg: cl.Message): 170 | BaseCallbackHandler.__init__(self) 171 | self.msg = msg 172 | self.sources = set() # To store unique sources 173 | 174 | def on_retriever_end(self, documents, **kwargs): 175 | "Save the sources found by the retriever." 176 | for d in documents: 177 | self.sources.add(d.metadata["source"]) # Add unique sources to the set 178 | 179 | def on_llm_end(self, response, **kwargs): 180 | "Stream the sources as a Chainlit element." 181 | if self.sources: 182 | sources_text = "\n".join(self.sources) 183 | self.msg.elements.append( 184 | cl.Text(name="Sources", content=sources_text, display="inline") 185 | ) 186 | 187 | async for chunk in runnable.astream( 188 | message.content, 189 | config=RunnableConfig( 190 | callbacks=[ 191 | cl.LangchainCallbackHandler(), 192 | PostMessageHandler(msg), 193 | ], 194 | ), 195 | ): 196 | await msg.stream_token(chunk) 197 | 198 | await msg.send() 199 | 200 | 201 | @cl.on_settings_update 202 | async def setup_agent(settings: dict): 203 | """Update Chat Settings.
204 | 205 | Args: 206 | settings (dict): user chat settings input 207 | """ 208 | config = cl.user_session.get("config") 209 | await update_cl(config, settings) 210 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Multi Mediawiki RAG Chatbot [![forks - multi-mediawiki-rag](https://img.shields.io/github/forks/tylertitsworth/multi-mediawiki-rag?style=social)](https://github.com/tylertitsworth/multi-mediawiki-rag) [![stars - multi-mediawiki-rag](https://img.shields.io/github/stars/tylertitsworth/multi-mediawiki-rag?style=social)](https://github.com/tylertitsworth/multi-mediawiki-rag) 2 | 3 | [![OS - Linux](https://img.shields.io/badge/OS-Linux-blue?logo=linux&logoColor=white)](https://www.linux.org/ "Go to Linux homepage") 4 | [![Made with Python](https://img.shields.io/badge/Python->=3.10-blue?logo=python&logoColor=white)](https://python.org "Go to Python homepage") 5 | [![Hosted on - Huggingface](https://img.shields.io/static/v1?label=Hosted+on&message=Huggingface&color=FFD21E)](https://huggingface.co/spaces/TotalSundae/dungeons-and-dragons) 6 | [![contributions - welcome](https://img.shields.io/badge/contributions-welcome-4ac41d)](https://github.com/tylertitsworth/multi-mediawiki-rag/blob/main/CONTRIBUTING.md) 7 | [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/8272/badge)](https://www.bestpractices.dev/projects/8272) 8 | ![Fossa](https://app.fossa.com/api/projects/git%2Bgithub.com%2Ftylertitsworth%2Fmulti-mediawiki-rag.svg?type=shield&skipHeapTracking=true&issueType=license) 9 | [![Unit Tests](https://github.com/tylertitsworth/multi-mediawiki-rag/actions/workflows/unit-test.yaml/badge.svg?branch=main)](https://github.com/tylertitsworth/multi-mediawiki-rag/actions/workflows/unit-test.yaml) 10 | [![pre-commit.ci 
status](https://results.pre-commit.ci/badge/github/tylertitsworth/multi-mediawiki-rag/main.svg)](https://results.pre-commit.ci/latest/github/tylertitsworth/multi-mediawiki-rag/main) 11 | 12 | [Chatbots](https://www.forbes.com/advisor/business/software/what-is-a-chatbot/) are very popular right now. Much openly accessible information is stored in some kind of [Mediawiki](https://en.wikipedia.org/wiki/MediaWiki). Creating a [RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG) Chatbot is becoming a very powerful alternative to traditional data gathering. This project provides a basic framework for creating your own chatbot to run locally on Linux. 13 | 14 | ## Table of Contents 15 | 16 | - [Multi Mediawiki RAG Chatbot ](#multi-mediawiki-rag-chatbot--) 17 | - [Table of Contents](#table-of-contents) 18 | - [About](#about) 19 | - [Architecture](#architecture) 20 | - [Runtime Filesystem](#runtime-filesystem) 21 | - [Quickstart](#quickstart) 22 | - [Prerequisites](#prerequisites) 23 | - [Create Custom LLM](#create-custom-llm) 24 | - [Use Model from Huggingface](#use-model-from-huggingface) 25 | - [Create Vector Database](#create-vector-database) 26 | - [Expected Output](#expected-output) 27 | - [Add Different Document Type to DB](#add-different-document-type-to-db) 28 | - [Start Chatbot](#start-chatbot) 29 | - [Start Discord Bot](#start-discord-bot) 30 | - [Testing](#testing) 31 | - [Cypress](#cypress) 32 | - [Pytest](#pytest) 33 | - [License](#license) 34 | 35 | ## About 36 | 37 | [Mediawikis](https://en.wikipedia.org/wiki/MediaWiki) hosted by [Fandom](https://www.fandom.com/) usually allow you to download an XML dump of the entire wiki as it currently exists. This project primarily leverages [Langchain](https://github.com/langchain-ai/langchain) with a few other open source projects to combine many of the readily available quickstart guides into a complete vertical application based on mediawiki data.
38 | 39 | ### Architecture 40 | 41 | ```mermaid 42 | graph TD; 43 | a[/xml dump a/] --MWDumpLoader--> emb 44 | b[/xml dump b/] --MWDumpLoader--> emb 45 | emb{Embedding} --> db 46 | db[(Chroma)] --Document Retriever--> lc 47 | hf(Huggingface) --Sentence Transformer --> emb 48 | hf --LLM--> modelfile 49 | modelfile[/Modelfile/] --> Ollama 50 | Ollama(((Ollama))) <-.ChatOllama.-> lc 51 | lc{Langchain} <-.LLMChain.-> cl(((Chainlit))) 52 | click db href "https://github.com/chroma-core/chroma" 53 | click hf href "https://huggingface.co/" 54 | click cl href "https://github.com/Chainlit/chainlit" 55 | click lc href "https://github.com/langchain-ai/langchain" 56 | click Ollama href "https://github.com/jmorganca/ollama" 57 | ``` 58 | 59 | ### Runtime Filesystem 60 | 61 | ```txt 62 | multi-mediawiki-rag # $HOME/app 63 | ├── .chainlit 64 | │ ├── .langchain.db # Server Cache 65 | │ └── config.toml # Server Config 66 | ├── app.py 67 | ├── chainlit.md 68 | ├── config.yaml 69 | ├── data # VectorDB 70 | │ ├── 47e4e036-****-****-****-************ 71 | │ │ └── * 72 | │ └── chroma.sqlite3 73 | ├── embed.py 74 | ├── entrypoint.sh 75 | └── requirements.txt 76 | ``` 77 | 78 | ## Quickstart 79 | 80 | These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. 81 | 82 | ### Prerequisites 83 | 84 | These steps assume you are using a modern Linux OS like [Ubuntu 22.04](https://www.releases.ubuntu.com/jammy/) with [Python 3.10+](https://www.python.org/downloads/release/python-3100/). 85 | 86 | ```bash 87 | apt-get install -y curl git python3-venv 88 | git clone https://github.com/tylertitsworth/multi-mediawiki-rag.git 89 | curl https://ollama.ai/install.sh | sh 90 | python -m venv .venv 91 | source .venv/bin/activate 92 | pip install -U pip setuptools wheel 93 | pip install -r requirements.txt 94 | ``` 95 | 96 | 1. Run the above setup steps 97 | 2.
Download a mediawiki's XML dump by browsing to `/wiki/Special:Statistics` or using a tool like [wikiteam3](https://pypi.org/project/wikiteam3/) 98 | 1. If Downloading, download only the current pages, not the entire history 99 | 2. If using `wikiteam3`, scrape only namespace 0 100 | 3. Provide in the following format: `sources/_pages_current.xml` 101 | 3. Edit [`config.yaml`](config.yaml) with the location of your XML mediawiki data you downloaded in step 1 and other configuration information 102 | 103 | > [!CAUTION] 104 | > Installing [Ollama](https://github.com/jmorganca/ollama) will create a new user and a service on your system. Follow the [manual installation steps](https://github.com/jmorganca/ollama/blob/main/docs/linux.md#manual-install) to avoid this step and instead launch the ollama API using `ollama serve`. 105 | 106 | ### Create Custom LLM 107 | 108 | After installing Ollama we can use a [Modelfile](https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md) to download and tune an LLM to be more precise for Document Retrieval QA. 109 | 110 | ```bash 111 | ollama create volo -f ./Modelfile 112 | ``` 113 | 114 | > [!TIP] 115 | > Choose a model from the [Ollama model library](https://ollama.ai/library) and download with `ollama pull :`, then edit the `model` field in [`config.yaml`](config.yaml) with the same information. 116 | 117 | #### Use Model from Huggingface 118 | 119 | 1. Download a model of choice from [Huggingface](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) with `git clone https://huggingface.co// model/`. 120 | 2. If your model of choice is not in `GGUF` format, convert it with `docker run --rm -v $PWD/model/:/model ollama/quantize -q q4_0 /model`. 121 | 3. Modify the [Modelfile's](Modelfile) `FROM` line to contain the path to the `q4_0.bin` file in the modelname directory. 
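Putting those steps together, a Modelfile for a locally converted model might look like the sketch below. The file path, parameter value, and system prompt are illustrative placeholders, not taken from this repo's Modelfile.

```txt
# Point Ollama at the locally converted q4_0 weights instead of a library model
FROM ./model/mymodel/q4_0.bin
# Lower temperature for more deterministic retrieval answers (illustrative value)
PARAMETER temperature 0.2
# Optional system prompt to steer the assistant toward wiki QA
SYSTEM You answer questions using only the retrieved mediawiki context.
```

Rebuilding with `ollama create volo -f ./Modelfile` then makes the tuned model available to the chatbot.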

### Create Vector Database

Your XML data needs to be loaded and transformed into embeddings to create a [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) VectorDB.

```bash
python embed.py
```

#### Expected Output

```txt
2023-12-16 09:50:53 - Loaded .env file
2023-12-16 09:50:55 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 09:51:18 - Use pytorch device: cpu
2023-12-16 09:56:09 - Anonymized telemetry enabled. See
https://docs.trychroma.com/telemetry for more information.
Batches: 100%|████████████████████████████████████████| 1303/1303 [1:23:14<00:00, 3.83s/it]
...
Batches: 100%|████████████████████████████████████████| 1172/1172 [1:04:08<00:00, 3.28s/it]
2023-12-16 19:47:01 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 19:47:33 - Use pytorch device: cpu
Batches: 100%|████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
A Tako was an intelligent race of octopuses found in the Kara-Tur setting. They were known for
their territorial nature and combat skills, as well as having incredible camouflaging abilities
that allowed them to blend into various environments. Takos lived in small tribes with a
matriarchal society led by one or two female rulers. Their diet consisted mainly of crabs,
lobsters, oysters, and shellfish, while their ink was highly sought after for use in calligraphy
within Kara-Tur.
```

#### Add Different Document Type to DB

Choose a new [File type Document Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/) or [App Document Loader](https://python.langchain.com/docs/integrations/document_loaders/), and add those documents to the database using your own script.
Check out the provided [Example](examples/add_more_docs.py).

### Start Chatbot

```bash
chainlit run app.py -h
```

Access the Chatbot GUI at `http://localhost:8000`.

### Start Discord Bot

```bash
export DISCORD_BOT_TOKEN=...
chainlit run app.py -h
```

> [!TIP]
> Develop locally with [ngrok](https://dashboard.ngrok.com/get-started/setup/linux).

## Hosting

This chatbot is hosted for free on [Huggingface Spaces](https://huggingface.co/spaces/TotalSundae/dungeons-and-dragons), which means it is very slow due to the minimal hardware resources allocated to it. The provided [Dockerfile](./Dockerfile) offers a generic method for hosting this solution as one unified container; however, that method is not ideal and can lead to many issues if used for professional production systems.

## Testing

### Cypress

[Cypress](https://www.cypress.io/) tests modern web applications with visual debugging. It is used to test the [Chainlit](https://github.com/Chainlit/chainlit) UI functionality.

```bash
npm install
# Run Test Suite
bash cypress/test.sh
```

> [!NOTE]
> Cypress requires `node >= 16`.

### Pytest

[Pytest](https://docs.pytest.org/en/7.4.x/) is a mature, full-featured Python testing tool that helps you write better programs.
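A unit test in this style looks like the hypothetical example below; the real suites live in `test/test_embed.py` and `test/test_app.py`, and the helper function here is a stand-in, not code from this repo.

```python
# Hypothetical pytest-style unit test for a simple chunking helper
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks (stand-in for an embedding helper)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def test_chunk_text_covers_all_characters():
    text = "abcdefghij"
    chunks = chunk_text(text, 4)
    # The final chunk may be shorter, but nothing is lost or duplicated
    assert chunks == ["abcd", "efgh", "ij"]
    assert "".join(chunks) == text
```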

```bash
pip install pytest
# Test Embedding Functions
pytest test/test_embed.py -W ignore::DeprecationWarning
# Test e2e with Ollama Backend
pytest test -W ignore::DeprecationWarning
```

## License

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Ftylertitsworth%2Fmulti-mediawiki-rag.svg?type=large&issueType=license)](https://app.fossa.com/projects/git%2Bgithub.com%2Ftylertitsworth%2Fmulti-mediawiki-rag?ref=badge_large&issueType=license)