├── test ├── __init__.py ├── test_app.py └── test_embed.py ├── .github ├── CODEOWNERS ├── linters │ ├── .hadolint.yaml │ ├── .markdown-lint.yaml │ └── .python-lint ├── pull_request_template.md ├── dependabot.yml ├── ISSUE_TEMPLATE │ ├── feature-request.md │ └── bug-report.md └── workflows │ ├── embed.yaml │ ├── lint.yaml │ ├── sync-space.yaml │ ├── build.yaml │ ├── unit-test.yaml │ └── integration-test.yaml ├── Modelfile ├── dev-requirements.txt ├── pytest.ini ├── cypress ├── test.sh ├── tsconfig.json ├── e2e │ ├── on_chat_start │ │ └── spec.cy.ts │ └── set_chat_settings │ │ └── spec.cy.ts └── cypress.config.ts ├── package.json ├── .dockerignore ├── .gitignore ├── requirements.txt ├── config.yaml ├── Dockerfile ├── SECURITY.md ├── LICENSE ├── .pre-commit-config.yaml ├── CONTRIBUTING.md ├── .chainlit └── config.toml ├── CODE_OF_CONDUCT.md ├── embed.py ├── app.py └── README.md /test/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | * @tylertitsworth 2 | -------------------------------------------------------------------------------- /Modelfile: -------------------------------------------------------------------------------- 1 | FROM llama3-groq-tool-use:8b 2 | -------------------------------------------------------------------------------- /dev-requirements.txt: -------------------------------------------------------------------------------- 1 | pylint 2 | pytest 3 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | pythonpath = test/ 3 | -------------------------------------------------------------------------------- /.github/linters/.hadolint.yaml: 
-------------------------------------------------------------------------------- 1 | --- 2 | ignored: 3 | - DL3006 4 | - DL3008 5 | - DL3009 6 | - DL3059 7 | -------------------------------------------------------------------------------- /.github/linters/.markdown-lint.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | MD013: false 3 | MD024: false 4 | MD034: false 5 | MD039: false 6 | MD041: false 7 | -------------------------------------------------------------------------------- /cypress/test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | export TEST=true 4 | chainlit run -h app.py & 5 | sleep 10 6 | npx cypress run --record false --config-file cypress/cypress.config.ts 7 | kill %% 8 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "dependencies": { 3 | "cypress": "13.16.0", 4 | "npx": "^10.2.2", 5 | "shell-exec": "^1.1.2", 6 | "ts-node": "^10.9.2", 7 | "typescript": "^5.7.2" 8 | } 9 | } 10 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ## Describe your changes 2 | 3 | ## Issue ticket number and link 4 | 5 | ## How you have validated your changes 6 | 7 | ## How you have added/modified tests for your changes 8 | -------------------------------------------------------------------------------- /.dockerignore: -------------------------------------------------------------------------------- 1 | !model/*/q4_0.bin 2 | .env 3 | .git/ 4 | .github/ 5 | .gitignore 6 | .pytest_cache/ 7 | .venv/ 8 | .vscode 9 | **__pycache__** 10 | Dockerfile 11 | memory/ 12 | model/*/* 13 | node_modules/ 14 | sources/ 15 | test_data/ 16 | 
-------------------------------------------------------------------------------- /.github/linters/.python-lint: -------------------------------------------------------------------------------- 1 | [MESSAGES CONTROL] 2 | disable=duplicate-code, import-error, line-too-long, missing-module-docstring, no-name-in-module, protected-access, redefined-outer-name, too-few-public-methods, abstract-method, arguments-differ 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | **__pycache__** 2 | .chainlit/*.db 3 | .chainlit/translations 4 | .env 5 | .files 6 | .pytest_cache/ 7 | .venv 8 | .vscode 9 | chainlit.md 10 | cypress/screenshots 11 | data 12 | memory 13 | model 14 | node_modules 15 | sources 16 | test_data 17 | -------------------------------------------------------------------------------- /cypress/tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "lib": [ 4 | "ESNext", 5 | "dom" 6 | ], 7 | "strictNullChecks": true, 8 | "types": [ 9 | "cypress", 10 | "node" 11 | ] 12 | }, 13 | "include": [ 14 | "**/*.ts" 15 | ] 16 | } 17 | -------------------------------------------------------------------------------- /cypress/e2e/on_chat_start/spec.cy.ts: -------------------------------------------------------------------------------- 1 | describe('on_chat_start', () => { 2 | before(() => { 3 | cy.visit('/') 4 | }) 5 | it('should correctly run on_chat_start', () => { 6 | const messages = cy.get('.step') 7 | messages.should('have.length', 1) 8 | 9 | messages.eq(0).should('contain.text', 'Ah my good fellow!') 10 | }) 11 | }) 12 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | #
https://github.com/mediawiki-utilities/python-mwxml/pull/19 2 | git+https://github.com/gdedrouas/python-mwxml@xml_format_0.11 3 | git+https://github.com/mediawiki-utilities/python-mwtypes@updates_schema_0.11 4 | chainlit==1.3.1 5 | chromadb>=0.4.22 6 | discord>=2.3.2 7 | fastapi>=0.109.1 8 | langchain==0.3.4 9 | langchainhub>=0.1.20 10 | langchain-chroma>=0.1.2 11 | langchain-community==0.3.3 12 | langchain-huggingface>=0.0.3 13 | langchain-ollama==0.2.0 14 | openai>=1.12.0 15 | mwparserfromhell==0.6.6 16 | sentence-transformers>=2.3.0 17 | starlette>=0.36.2 18 | tqdm>=4.66.1 19 | -------------------------------------------------------------------------------- /cypress/cypress.config.ts: -------------------------------------------------------------------------------- 1 | import { defineConfig } from 'cypress' 2 | 3 | export default defineConfig({ 4 | projectId: 'iqos6r', 5 | component: { 6 | devServer: { 7 | framework: 'react', 8 | bundler: 'vite' 9 | } 10 | }, 11 | viewportWidth: 1200, 12 | 13 | e2e: { 14 | supportFile: false, 15 | defaultCommandTimeout: 50000, 16 | video: false, 17 | baseUrl: 'http://127.0.0.1:8000', 18 | setupNodeEvents (on) { 19 | on('task', { 20 | log (message) { 21 | console.log(message) 22 | return null 23 | } 24 | }) 25 | } 26 | } 27 | }) 28 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | data_dir: ./data 3 | # Huggingface 4 | embeddings_model: sentence-transformers/all-mpnet-base-v2 5 | introduction: Ah my good fellow! 6 | # Sources 7 | mediawikis: 8 | - dnd4e 9 | - dnd5e 10 | - darksun 11 | - dragonlance 12 | - eberron 13 | - exandria 14 | - forgottenrealms 15 | - greyhawk 16 | - planescape 17 | - ravenloft 18 | - spelljammer 19 | # Ollama 20 | model: volo 21 | question: How many eyestalks does a Beholder have? 
22 | settings: 23 | num_sources: 4 24 | repeat_penalty: 2.2 25 | temperature: 0 26 | top_k: 0 27 | top_p: 0.35 28 | # Sources Path 29 | source: ./sources 30 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ollama/ollama 2 | 3 | RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \ 4 | git \ 5 | python3 \ 6 | python3-pip 7 | 8 | RUN useradd -m -u 1000 user 9 | USER user 10 | ENV HOME=/home/user \ 11 | PATH=/home/user/.local/bin:$PATH \ 12 | TOKENIZERS_PARALLELISM=true 13 | 14 | WORKDIR $HOME/app 15 | 16 | COPY --chown=user . $HOME/app 17 | 18 | RUN python3 -m pip install --no-cache-dir -r requirements.txt 19 | 20 | EXPOSE 7860 21 | 22 | ENTRYPOINT [] 23 | CMD ["/bin/bash", "-c", "/bin/ollama create volo -f Modelfile && chainlit run app.py -h -d --port 7860"] 24 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Reporting a Vulnerability 4 | 5 | Report a vulnerability by using [GitHub's Security Reporting Tool](https://github.com/tylertitsworth/multi-mediawiki-rag/security/advisories/new). 6 | 7 | ## Responding to a Vulnerability Report 8 | 9 | Repository administrators will respond to reports promptly, within at most 14 days. After the first response to a report, if the responder cannot reproduce the vulnerability within 48 hours, the report will be closed. If the vulnerability can be reproduced, the report will be classified as critical priority and triaged as the next contribution to be committed to the repository's default branch.
10 | -------------------------------------------------------------------------------- /.github/dependabot.yml: -------------------------------------------------------------------------------- 1 | --- 2 | version: 2 3 | updates: 4 | - package-ecosystem: "pip" # See documentation for possible values 5 | directory: "." # Location of package manifests 6 | groups: 7 | python-requirements: 8 | patterns: 9 | - "*" 10 | schedule: 11 | interval: "weekly" 12 | - package-ecosystem: "github-actions" # See documentation for possible values 13 | directory: ".github/workflows" # Location of package manifests 14 | schedule: 15 | interval: "weekly" 16 | - package-ecosystem: "npm" 17 | directory: "/" 18 | groups: 19 | node-requirements: 20 | patterns: 21 | - "*" 22 | schedule: 23 | interval: "weekly" 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature-request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature Request 3 | about: Suggest an idea for this project 4 | title: "[FEATURE]" 5 | labels: enhancement 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 
21 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug-report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug Report 3 | about: Create a report to help us improve 4 | title: "[BUG]" 5 | labels: bug 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 16 | 1. Go to '...' 17 | 2. Click on '....' 18 | 3. Scroll down to '....' 19 | 4. See error 20 | 21 | **Expected behavior** 22 | A clear and concise description of what you expected to happen. 23 | 24 | **Screenshots** 25 | If applicable, add screenshots to help explain your problem. 26 | 27 | **Desktop (please complete the following information):** 28 | 29 | - OS: [e.g. iOS] 30 | - Browser [e.g. chrome, safari] 31 | - Version [e.g. 22] 32 | 33 | **Additional context** 34 | Add any other context about the problem here. 35 | -------------------------------------------------------------------------------- /.github/workflows/embed.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Embed 3 | permissions: read-all 4 | on: workflow_dispatch 5 | jobs: 6 | embed: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - uses: actions/checkout@v4 10 | - name: Install Python Requirements 11 | run: pip install -r requirements.txt 12 | - name: Download Sources and Data 13 | run: | 14 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 15 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 16 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 17 | --include *.xml \ 18 | --local-dir . 
\ 19 | --local-dir-use-symlinks False 20 | env: 21 | HF_HUB_ENABLE_HF_TRANSFER: 1 22 | - name: Embed VectorDB 23 | run: python embed.py 24 | - uses: actions/upload-artifact@v4 25 | with: 26 | name: VectorDB 27 | path: data/* 28 | if-no-files-found: 'error' 29 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023 Tyler Titsworth 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 
20 | -------------------------------------------------------------------------------- /.github/workflows/lint.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Lint 3 | permissions: read-all 4 | on: # yamllint disable-line rule:truthy 5 | push: null 6 | 7 | concurrency: 8 | group: ${{ github.workflow }}-${{ github.ref }} 9 | cancel-in-progress: true 10 | 11 | jobs: 12 | build: 13 | name: Lint 14 | runs-on: ubuntu-latest 15 | 16 | permissions: 17 | contents: read 18 | packages: read 19 | # To report GitHub Actions status checks 20 | statuses: write 21 | 22 | steps: 23 | - name: Checkout code 24 | uses: actions/checkout@v4 25 | with: 26 | fetch-depth: 0 27 | - name: Super-linter 28 | uses: super-linter/super-linter/slim@v6.8.0 29 | env: 30 | DEFAULT_BRANCH: main 31 | # To report GitHub Actions status checks 32 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 33 | GITHUB_ACTIONS_COMMAND_ARGS: '-ignore SC.*' 34 | TYPESCRIPT_STANDARD_TSCONFIG_FILE: cypress/tsconfig.json 35 | VALIDATE_CHECKOV: false 36 | VALIDATE_TYPESCRIPT_PRETTIER: false 37 | VALIDATE_PYTHON_FLAKE8: false 38 | VALIDATE_PYTHON_ISORT: false 39 | VALIDATE_JSCPD: false 40 | VALIDATE_PYTHON_MYPY: false 41 | -------------------------------------------------------------------------------- /.github/workflows/sync-space.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Sync Space 3 | permissions: read-all 4 | on: 5 | push: 6 | branches: 7 | - main 8 | 9 | jobs: 10 | sync: 11 | runs-on: ubuntu-latest 12 | steps: 13 | - uses: actions/checkout@v4 14 | with: 15 | fetch-depth: 0 16 | - name: Login with Huggingface CLI 17 | run: | 18 | pip install -U huggingface_hub[cli] hf-transfer 19 | git config --global credential.helper store 20 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 21 | - name: Upload Files 22 | env: 23 | HF_HUB_ENABLE_HF_TRANSFER: 1 24 | run: | 25 | files=$(git diff --name-only HEAD 
HEAD~1) 26 | deleted_files=$(git diff --name-only --diff-filter=A HEAD HEAD~1) 27 | upload_files() { 28 | local file=$1 29 | if [ "$file" != "README.md" ] && [ "$file" != ".gitignore" ] && [ "$file" != ".chainlit/config.toml" ] && [ "$file" != "Modelfile" ]; then 30 | huggingface-cli upload --repo-type space TotalSundae/dungeons-and-dragons $file $file 31 | fi 32 | } 33 | for file in $files; do 34 | if [[ ! "${deleted_files[@]}" =~ "${file}" ]]; then 35 | upload_files $file 36 | else 37 | echo "File $file needs to be deleted" 38 | fi 39 | done 40 | -------------------------------------------------------------------------------- /test/test_app.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import torch 3 | from langchain_community.embeddings import HuggingFaceEmbeddings 4 | from langchain_community.vectorstores import Chroma 5 | from embed import parse_args, load_config, load_documents 6 | from app import import_db, create_chain 7 | 8 | if not torch.cuda.is_available(): 9 | torch.set_num_threads(torch.get_num_threads() * 2) 10 | 11 | 12 | def test_import_db(): 13 | "Test import_db()." 
14 | config = load_config() 15 | config = parse_args(config, ["--test-embed"]) 16 | documents = load_documents(config) 17 | if not Path(config["data_dir"]).is_dir(): 18 | embeddings = HuggingFaceEmbeddings( 19 | model_name=config["embeddings_model"], cache_folder="./model" 20 | ) 21 | vectordb = Chroma.from_documents( 22 | documents=documents, 23 | embedding=embeddings, 24 | persist_directory=config["data_dir"], 25 | ) 26 | vectordb.persist() 27 | vectordb = import_db(config) 28 | assert Path("test_data").is_dir() 29 | assert vectordb.embeddings.model_name == config["embeddings_model"] 30 | assert vectordb.embeddings.cache_folder == "./model" 31 | 32 | 33 | def test_create_chain(): 34 | "Test create_chain()" 35 | config = load_config() 36 | config = parse_args(config, ["--test-embed"]) 37 | chain = create_chain(config) 38 | res = chain.invoke(config["question"]) 39 | assert res != "" 40 | # assert res["answer"] != "" 41 | # assert res["context"] != [] 42 | -------------------------------------------------------------------------------- /cypress/e2e/set_chat_settings/spec.cy.ts: -------------------------------------------------------------------------------- 1 | describe('set_chat_settings', () => { 2 | before(() => { 3 | cy.visit('/') 4 | }) 5 | 6 | it('should update inputs', () => { 7 | // Open chat settings modal 8 | cy.get('#chat-settings-open-modal').should('exist') 9 | cy.get('#chat-settings-open-modal').click() 10 | cy.get('#chat-settings-dialog').should('exist') 11 | 12 | cy.get('#num_sources').clear().type('5') 13 | cy.get('#num_sources').should('have.value', '5') 14 | 15 | cy.get('#temperature').clear().type('0.7') 16 | cy.get('#temperature').should('have.value', '0.7') 17 | 18 | cy.get('#repeat_penalty').type('{upArrow}{upArrow}').trigger('change') 19 | cy.get('#repeat_penalty').should('have.value', '2.4') 20 | 21 | cy.get('#top_k').type('{upArrow}{upArrow}').trigger('change') 22 | cy.get('#top_k').should('have.value', '2') 23 | 24 | 
cy.get('#top_p').clear().type('0.77') 25 | cy.get('#top_p').should('have.value', '0.77') 26 | 27 | cy.contains('Confirm').click() 28 | 29 | cy.get('.step').should('have.length', 1) 30 | 31 | // Check if inputs are updated 32 | cy.get('#chat-settings-open-modal').click() 33 | cy.get('#num_sources').should('have.value', '5') 34 | cy.get('#temperature').should('have.value', '0.7') 35 | cy.get('#repeat_penalty').should('have.value', '2.4') 36 | cy.get('#top_k').should('have.value', '2') 37 | cy.get('#top_p').should('have.value', '0.77') 38 | 39 | // Check if modal is correctly closed 40 | cy.contains('Cancel').click() 41 | cy.get('#chat-settings-dialog').should('not.exist') 42 | }) 43 | }) 44 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | ci: 3 | autofix_commit_msg: "[pre-commit.ci] auto fixes from pre-commit.com hooks" 4 | autofix_prs: true 5 | autoupdate_commit_msg: '[pre-commit.ci] pre-commit autoupdate' 6 | autoupdate_schedule: weekly 7 | skip: [hadolint-docker, markdownlint, pylint, pytest, cypress] 8 | submodules: false 9 | repos: 10 | - repo: https://github.com/pre-commit/pre-commit-hooks 11 | rev: v5.0.0 12 | hooks: 13 | - id: check-added-large-files 14 | - id: check-ast 15 | - id: check-merge-conflict 16 | - id: check-yaml 17 | - id: debug-statements 18 | - id: end-of-file-fixer 19 | - id: forbid-submodules 20 | - id: sort-simple-yaml 21 | files: config.yaml 22 | - id: trailing-whitespace 23 | - repo: https://github.com/hadolint/hadolint 24 | rev: v2.13.1-beta 25 | hooks: 26 | - id: hadolint-docker 27 | args: ["--config", ".github/linters/.hadolint.yaml"] 28 | - repo: https://github.com/igorshubovych/markdownlint-cli 29 | rev: v0.43.0 30 | hooks: 31 | - id: markdownlint 32 | args: ["--config", ".github/linters/.markdown-lint.yaml"] 33 | - repo: https://github.com/psf/black 34 | rev: 24.10.0 35 | hooks: 36 | 
- id: black 37 | - repo: local 38 | hooks: 39 | - id: pylint 40 | name: pylint 41 | entry: pylint 42 | language: system 43 | types: [python] 44 | args: ["--rcfile=.github/linters/.python-lint"] 45 | - id: pytest 46 | name: pytest 47 | entry: pytest 48 | language: system 49 | types: [python] 50 | args: ["test/test_app.py", "test/test_embed.py", "-W", "ignore::DeprecationWarning"] 51 | pass_filenames: false 52 | - id: cypress 53 | name: cypress 54 | entry: bash 55 | language: system 56 | types_or: [ts, python] 57 | args: ["cypress/test.sh"] 58 | -------------------------------------------------------------------------------- /.github/workflows/build.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Build Container 3 | permissions: read-all 4 | on: 5 | push: 6 | branches: 7 | - main 8 | 9 | jobs: 10 | container-build: 11 | permissions: 12 | packages: write 13 | runs-on: ubuntu-latest 14 | steps: 15 | - uses: actions/checkout@v4 16 | - uses: docker/login-action@v3 17 | with: 18 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 19 | username: ${{ github.actor }} 20 | password: ${{ github.token }} 21 | - uses: docker/metadata-action@v5 22 | id: meta 23 | with: 24 | images: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 25 | - uses: docker/build-push-action@v6 26 | with: 27 | context: . 
28 | push: true 29 | tags: ${{ steps.meta.outputs.tags }} 30 | labels: ${{ steps.meta.outputs.labels }} 31 | container-scan: 32 | needs: [container-build] 33 | permissions: 34 | contents: read # for actions/checkout to fetch code 35 | security-events: write # for github/codeql-action/upload-sarif to upload SARIF results 36 | actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status 37 | runs-on: ubuntu-latest 38 | steps: 39 | - uses: docker/login-action@v3 40 | with: 41 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 42 | username: ${{ github.actor }} 43 | password: ${{ github.token }} 44 | - uses: aquasecurity/trivy-action@0.29.0 45 | with: 46 | image-ref: ghcr.io/${{ github.repository_owner }}/${{ github.repository }}:main 47 | format: sarif 48 | output: trivy-results-main.sarif 49 | - uses: github/codeql-action/upload-sarif@v3 50 | with: 51 | sarif_file: trivy-results-main.sarif 52 | -------------------------------------------------------------------------------- /.github/workflows/unit-test.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Unit Tests 3 | permissions: read-all 4 | on: 5 | push: 6 | branches: 7 | - main 8 | pull_request: 9 | branches: 10 | - main 11 | paths: 12 | - '**.py' 13 | - 'requirements.txt' 14 | - 'config.yaml' 15 | - 'test/**.py' 16 | - 'cypress/**/**.ts' 17 | 18 | concurrency: 19 | group: ${{ github.workflow }}-${{ github.ref }} 20 | cancel-in-progress: true 21 | 22 | jobs: 23 | pytest: 24 | runs-on: ubuntu-latest 25 | steps: 26 | - uses: actions/checkout@v4 27 | - name: Install requirements 28 | run: | 29 | python -m pip install pytest 30 | python -m pip install -r requirements.txt 31 | - name: Download Sources 32 | run: | 33 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 34 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 35 | huggingface-cli download --repo-type space 
TotalSundae/dungeons-and-dragons \ 36 | --include *.xml \ 37 | --local-dir . \ 38 | --local-dir-use-symlinks False 39 | env: 40 | HF_HUB_ENABLE_HF_TRANSFER: 1 41 | - name: Basic Unit Test 42 | run: pytest test/test_embed.py -W ignore::DeprecationWarning 43 | cypress: 44 | runs-on: ubuntu-latest 45 | steps: 46 | - uses: actions/checkout@v4 47 | - name: Install Python Requirements 48 | run: python -m pip install -r requirements.txt 49 | - name: Setup Ollama 50 | run: | 51 | curl https://ollama.ai/install.sh | sh 52 | sleep 5 53 | ollama create volo -f ./Modelfile 54 | - uses: actions/setup-node@v4 55 | with: 56 | node-version: 18 57 | - uses: cypress-io/github-action@v6 58 | with: 59 | config-file: ${{ github.workspace }}/cypress/cypress.config.ts 60 | record: true 61 | spec: cypress/e2e/on_chat_start/spec.cy.ts 62 | start: chainlit run app.py -h 63 | wait-on: 'http://localhost:8000' 64 | wait-on-timeout: 10 65 | env: 66 | CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }} 67 | GITHUB_TOKEN: ${{ github.token }} 68 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 69 | - uses: cypress-io/github-action@v6 70 | with: 71 | config-file: ${{ github.workspace }}/cypress/cypress.config.ts 72 | record: true 73 | start: chainlit run app.py -h 74 | wait-on: 'http://localhost:8000' 75 | wait-on-timeout: 10 76 | env: 77 | CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }} 78 | GITHUB_TOKEN: ${{ github.token }} 79 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 80 | -------------------------------------------------------------------------------- /test/test_embed.py: -------------------------------------------------------------------------------- 1 | from collections import namedtuple 2 | import yaml 3 | from embed import ( 4 | parse_args, 5 | load_config, 6 | rename_duplicates, 7 | load_document, 8 | load_documents, 9 | ) 10 | 11 | Document = namedtuple("Document", ["page_content", "metadata"]) 12 | 13 | 14 | def test_parse_args(): 15 | "Test parse_args()." 
16 | config = load_config() 17 | config = parse_args(config, ["--test-embed"]) 18 | assert config["mediawikis"] == ["dnd5e"] 19 | assert config["data_dir"] == "./test_data" 20 | assert config["question"] == "What is the Armor Class of a Beholder?" 21 | 22 | 23 | def test_load_config(): 24 | "Test load_config()." 25 | with open("config.yaml", "r", encoding="utf-8") as file: 26 | data = yaml.safe_load(file) 27 | config = load_config() 28 | assert data == config 29 | 30 | 31 | def test_rename_duplicates(): 32 | "Test rename_duplicates()." 33 | documents = [ 34 | Document(page_content="document 1", metadata={"source": "mydoc"}), 35 | Document(page_content="document 2", metadata={"source": "mydoc"}), 36 | ] 37 | renamed_documents = rename_duplicates(documents) 38 | assert documents[0].page_content == renamed_documents[0].page_content 39 | assert documents[0].page_content != renamed_documents[1].page_content 40 | assert documents[0].metadata["source"] == renamed_documents[0].metadata["source"] 41 | assert documents[0].metadata["source"] != renamed_documents[1].metadata["source"] 42 | 43 | 44 | def test_load_document(): 45 | "Test load_document()." 46 | config = load_config() 47 | wiki = (config["source"], "dnd5e") 48 | documents = load_document(wiki) 49 | beholder_page = [ 50 | document 51 | for document in documents 52 | if document.metadata["source"] == "Beholder - dnd5e" 53 | ] 54 | assert "From Monster Manual, page 28." in beholder_page[0].page_content 55 | assert {"source": "Beholder - dnd5e"} == beholder_page[0].metadata 56 | 57 | 58 | def test_load_documents(): 59 | "Test load_documents()." 
60 | config = load_config() 61 | config = parse_args(config, ["--test-embed"]) 62 | documents = load_documents(config) 63 | beholder_page = [ 64 | document 65 | for document in documents 66 | if document.metadata["source"] == "Beholder - dnd5e" 67 | ] 68 | beholder_page_1 = [ 69 | document 70 | for document in documents 71 | if document.metadata["source"] == "Beholder - dnd5e_1" 72 | ] 73 | assert "From Monster Manual, page 28." in beholder_page[0].page_content 74 | assert {"source": "Beholder - dnd5e", "start_index": -1} == beholder_page[ 75 | 0 76 | ].metadata 77 | assert "Eye Rays." in beholder_page_1[0].page_content 78 | assert {"source": "Beholder - dnd5e_1", "start_index": -1} == beholder_page_1[ 79 | 0 80 | ].metadata 81 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to multi-mediawiki-rag 2 | 3 | Thank you for considering contributing to multi-mediawiki-rag! We welcome your help to make this project better. 4 | 5 | ## Getting Started 6 | 7 | Before you start contributing, please take a moment to review the following guidelines. 8 | 9 | ### Code of Conduct 10 | 11 | Please review our [Code of Conduct](CODE_OF_CONDUCT.md) to understand the standards we follow and the behavior expected in this project. 12 | 13 | ### How to Contribute 14 | 15 | 1. Fork the repository. 16 | 2. Create a new branch for your contribution: `git checkout -b feature/your-feature`. 17 | 3. Install [pre-commit](https://pre-commit.com/), [Docker](https://docs.docker.com/engine/install/), and [Node.js](https://nodejs.org/en/download). 18 | 4. Follow the repository [prerequisites](README.md#prerequisites) and [Unit Testing Prerequisites](README.md#testing). 19 | 5. Make your changes and commit them: `git commit -m 'Add your feature'`. 20 | 6. Push to the branch: `git push origin feature/your-feature`. 21 | 7. Submit a pull request.
22 | 23 | ## Contribution Guidelines 24 | 25 | To ensure a smooth and effective contribution process, please follow these guidelines: 26 | 27 | ### Reporting Issues 28 | 29 | - Before creating a new issue, check if it already exists. 30 | - Use a clear and descriptive title for the issue. 31 | - Provide a detailed description of the issue, including steps to reproduce it. 32 | 33 | ### Making Changes 34 | 35 | - Fork the repository and create a new branch for your changes. 36 | - Keep each pull request focused on a single feature or bugfix. 37 | - Write clear and descriptive commit messages. 38 | - Keep code changes concise and well-documented. 39 | - Ensure that your code adheres to the project's coding standards. 40 | 41 | ### Testing 42 | 43 | - Include tests for your changes, if applicable. 44 | - Ensure that all existing tests pass before submitting a pull request. 45 | - Provide information on how to test your changes. 46 | 47 | ### Documentation 48 | 49 | - If you make changes that affect the project's documentation, update it accordingly. 50 | - Document new features and functionalities. 51 | 52 | ### Code Style 53 | 54 | - Follow the established [code style](https://google.github.io/styleguide/pyguide.html) for this project. 55 | - Consistent and clean code is highly appreciated. 56 | 57 | ### Pull Requests 58 | 59 | - Include a summary of your changes in your pull request. 60 | - Reference the relevant issue(s) if applicable. 61 | - Be responsive to feedback and be ready to make further changes if necessary. 62 | 63 | ## Code of Conduct 64 | 65 | This project follows the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). Please review it to understand the expectations for participant behavior. 66 | 67 | ## License 68 | 69 | By contributing to multi-mediawiki-rag, you agree that your contributions will be licensed under the [MIT License](LICENSE). 70 | 71 | Thank you for your contribution! 
72 | -------------------------------------------------------------------------------- /.chainlit/config.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | # Whether to enable telemetry (default: true). No personal data is collected. 3 | enable_telemetry = true 4 | 5 | # List of environment variables to be provided by each user to use the app. 6 | user_env = [] 7 | 8 | # Duration (in seconds) during which the session is saved when the connection is lost 9 | session_timeout = 3600 10 | 11 | # Enable third parties caching (e.g LangChain cache) 12 | cache = true 13 | 14 | # Authorized origins 15 | allow_origins = ["*"] 16 | 17 | # Follow symlink for asset mount (see https://github.com/Chainlit/chainlit/issues/317) 18 | # follow_symlink = false 19 | 20 | [features] 21 | # Show the prompt playground 22 | prompt_playground = true 23 | 24 | # Process and display HTML in messages. This can be a security risk (see https://stackoverflow.com/questions/19603097/why-is-it-dangerous-to-render-user-generated-html-or-javascript) 25 | unsafe_allow_html = false 26 | 27 | # Process and display mathematical expressions. This can clash with "$" characters in messages. 28 | latex = false 29 | 30 | # Authorize users to upload files with messages 31 | [features.multi_modal] 32 | enabled = true 33 | accept = ["*/*"] 34 | max_files = 20 35 | max_size_mb = 500 36 | 37 | # Allows user to use speech to text 38 | [features.speech_to_text] 39 | enabled = false 40 | # See all languages here https://github.com/JamesBrill/react-speech-recognition/blob/HEAD/docs/API.md#language-string 41 | # language = "en-US" 42 | 43 | [UI] 44 | # Name of the app and chatbot. 45 | name = "Volo" 46 | 47 | # Show the readme while the thread is empty. 48 | show_readme_as_default = true 49 | 50 | # Description of the app and chatbot. This is used for HTML tags. 
51 | # description = "" 52 | 53 | # Large size content are by default collapsed for a cleaner ui 54 | default_collapse_content = true 55 | 56 | # The default value for the expand messages settings. 57 | default_expand_messages = true 58 | 59 | # Hide the chain of thought details from the user in the UI. 60 | hide_cot = false 61 | 62 | # Link to your github repo. This will add a github button in the UI's header. 63 | # github = "https://github.com/tylertitsworth/multi-mediawiki-rag" 64 | 65 | # Specify a CSS file that can be used to customize the user interface. 66 | # The CSS file can be served from the public directory or via an external link. 67 | # custom_css = "/public/test.css" 68 | 69 | # Specify a Javascript file that can be used to customize the user interface. 70 | # The Javascript file can be served from the public directory. 71 | # custom_js = "/public/test.js" 72 | 73 | # Specify a custom font url. 74 | # custom_font = "https://fonts.googleapis.com/css2?family=Inter:wght@400;500;700&display=swap" 75 | 76 | # Specify a custom build directory for the frontend. 77 | # This can be used to customize the frontend code. 78 | # Be careful: If this is a relative path, it should not start with a slash. 79 | # custom_build = "./public/build" 80 | 81 | # Override default MUI light theme. (Check theme.ts) 82 | [UI.theme] 83 | #font_family = "Inter, sans-serif" 84 | [UI.theme.light] 85 | #background = "#FAFAFA" 86 | #paper = "#FFFFFF" 87 | 88 | [UI.theme.light.primary] 89 | #main = "#F80061" 90 | #dark = "#980039" 91 | #light = "#FFE7EB" 92 | 93 | # Override default MUI dark theme. 
(Check theme.ts) 94 | [UI.theme.dark] 95 | #background = "#FAFAFA" 96 | #paper = "#FFFFFF" 97 | 98 | [UI.theme.dark.primary] 99 | #main = "#F80061" 100 | #dark = "#980039" 101 | #light = "#FFE7EB" 102 | 103 | 104 | [meta] 105 | generated_by = "1.0.502" 106 | -------------------------------------------------------------------------------- /.github/workflows/integration-test.yaml: -------------------------------------------------------------------------------- 1 | --- 2 | name: Integration Test 3 | permissions: read-all 4 | on: 5 | pull_request_review: 6 | types: [submitted] 7 | 8 | concurrency: 9 | group: ${{ github.workflow }}-${{ github.ref }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | container-build: 14 | if: > 15 | github.event.review.state == 'approved' || 16 | contains(github.event.pull_request.assignees.*.login, 'tylertitsworth') 17 | permissions: 18 | packages: write 19 | runs-on: ubuntu-latest 20 | steps: 21 | - uses: actions/checkout@v4 22 | - uses: docker/login-action@v3 23 | with: 24 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 25 | username: ${{ github.actor }} 26 | password: ${{ github.token }} 27 | - uses: docker/metadata-action@v5 28 | id: meta 29 | with: 30 | images: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 31 | - uses: docker/build-push-action@v6 32 | with: 33 | context: . 
34 | push: true 35 | tags: ${{ steps.meta.outputs.tags }} 36 | labels: ${{ steps.meta.outputs.labels }} 37 | container-scan: 38 | needs: [container-build] 39 | permissions: 40 | contents: read # for actions/checkout to fetch code 41 | security-events: write # for github/codeql-action/upload-sarif to upload SARIF results 42 | actions: read # only required for a private repository by github/codeql-action/upload-sarif to get the Action run status 43 | runs-on: ubuntu-latest 44 | steps: 45 | - uses: docker/login-action@v3 46 | with: 47 | registry: ghcr.io/${{ github.repository_owner }}/${{ github.repository }} 48 | username: ${{ github.actor }} 49 | password: ${{ github.token }} 50 | - uses: aquasecurity/trivy-action@0.29.0 51 | with: 52 | image-ref: ghcr.io/${{ github.repository_owner }}/${{ github.repository }}:pr-${{ github.event.pull_request.number }} 53 | format: sarif 54 | output: trivy-results-pr-${{ github.event.pull_request.number }}.sarif 55 | - uses: github/codeql-action/upload-sarif@v3 56 | with: 57 | sarif_file: trivy-results-pr-${{ github.event.pull_request.number }}.sarif 58 | embed-test: 59 | needs: [container-build] 60 | runs-on: ubuntu-latest 61 | steps: 62 | - uses: actions/checkout@v4 63 | - name: Download Sources and Data 64 | run: | 65 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 66 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 67 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 68 | --include *.xml \ 69 | --local-dir . \ 70 | --local-dir-use-symlinks False 71 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 72 | --include data/* \ 73 | --local-dir . 
\ 74 | --local-dir-use-symlinks False 75 | env: 76 | HF_HUB_ENABLE_HF_TRANSFER: 1 77 | - name: Test Embedding and Chain Creation 78 | run: | 79 | docker run --shm-size=7GB \ 80 | -u root -w /home/user/app \ 81 | -v $PWD/data:/home/user/app/test_data \ 82 | -v $PWD/sources:/home/user/app/sources \ 83 | ghcr.io/${{ github.repository_owner }}/${{ github.repository }}:pr-${{ github.event.pull_request.number }} \ 84 | bash -c "pip install pytest && pytest test/test_embed.py -W ignore::DeprecationWarning" 85 | env: 86 | OLLAMA_HOST: "https://totalsundae-ollama.hf.space" 87 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 88 | e2e-test: 89 | needs: [embed-test] 90 | runs-on: ubuntu-latest 91 | steps: 92 | - uses: actions/checkout@v4 93 | - name: Install Python Requirements 94 | run: python -m pip install -r requirements.txt 95 | - name: Setup Ollama 96 | run: | 97 | curl https://ollama.ai/install.sh | sh 98 | sleep 5 99 | ollama create volo -f ./Modelfile 100 | env: 101 | OLLAMA_HOST: "https://totalsundae-ollama.hf.space" 102 | - name: Download Data 103 | run: | 104 | python -m pip install -U "huggingface-hub[cli]" hf_transfer 105 | huggingface-cli login --token ${{ secrets.HF_TOKEN }} 106 | huggingface-cli download --repo-type space TotalSundae/dungeons-and-dragons \ 107 | --include data/* \ 108 | --local-dir . 
\ 109 | --local-dir-use-symlinks False 110 | env: 111 | HF_HUB_ENABLE_HF_TRANSFER: 1 112 | - name: Move Data 113 | run: | 114 | mkdir test_data 115 | mv data/* test_data/ 116 | - uses: cypress-io/github-action@v6 117 | with: 118 | config-file: ${{ github.workspace }}/cypress/cypress.config.ts 119 | record: true 120 | start: chainlit run app.py -h 121 | wait-on: 'http://localhost:8000' 122 | wait-on-timeout: 10 123 | env: 124 | CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }} 125 | GITHUB_TOKEN: ${{ github.token }} 126 | LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }} 127 | TEST: true 128 | OLLAMA_HOST: "https://totalsundae-ollama.hf.space" 129 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 
14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 
55 | Examples of representing our community include using an official email address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | @tylertitsworth. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 
99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 
129 | -------------------------------------------------------------------------------- /embed.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | from collections import namedtuple 4 | from typing import Any 5 | 6 | import torch 7 | import yaml 8 | from langchain_community.document_loaders import MWDumpLoader 9 | from langchain_huggingface import HuggingFaceEmbeddings 10 | from langchain_chroma import Chroma 11 | from langchain_text_splitters import RecursiveCharacterTextSplitter 12 | from tqdm.contrib.concurrent import process_map 13 | 14 | Document = namedtuple("Document", ["page_content", "metadata"]) 15 | if not torch.cuda.is_available(): 16 | torch.set_num_threads(torch.get_num_threads() * 2) 17 | 18 | 19 | def parse_args(config: dict, args: list): 20 | """Parses command line arguments. 21 | 22 | Args: 23 | config (dict): items in config.yaml 24 | args (list(str)): user input parameters 25 | 26 | Returns: 27 | dict: dictionary of items in config.yaml, modified by user input parameters 28 | """ 29 | parser = argparse.ArgumentParser() 30 | parser.add_argument("--test-embed", dest="test_embed", action="store_true") 31 | args = parser.parse_args(args) 32 | if args.test_embed: 33 | config["mediawikis"] = ["dnd5e"] 34 | config["data_dir"] = "./test_data" 35 | config["question"] = "What is the Armor Class of a Beholder?" 36 | 37 | return config 38 | 39 | 40 | def load_config(): 41 | """Loads configuration from config.yaml file. 
42 | 43 | Returns: 44 | dict: items in config.yaml 45 | """ 46 | try: 47 | with open("config.yaml", "r", encoding="utf-8") as file: 48 | data = yaml.safe_load(file) 49 | except FileNotFoundError: 50 | print("Error: File config.yaml not found.") 51 | sys.exit(1) 52 | except yaml.YAMLError as err: 53 | print(f"Error reading YAML file: {err}") 54 | sys.exit(1) 55 | 56 | return data 57 | 58 | 59 | def rename_duplicates(documents: list[Document]): 60 | """Rename duplicate sources in a list of documents. 61 | 62 | Args: 63 | documents (list(Document)): input documents via loader.load() 64 | 65 | Returns: 66 | list(Document): input documents with modified source metadata 67 | """ 68 | document_counts = {} 69 | for idx, doc in enumerate(documents): 70 | doc_source = doc.metadata["source"] 71 | count = document_counts.get(doc_source, 0) + 1 72 | document_counts[doc_source] = count 73 | documents[idx].metadata["source"] = ( 74 | doc_source if count == 1 else f"{doc_source}_{count - 1}" 75 | ) 76 | 77 | return documents 78 | 79 | 80 | def load_document(wiki: tuple): 81 | """Loads an XML file of mediawiki pages into document format.
82 | 83 | Args: 84 | wiki (tuple): (path to the source directory, name of the wiki) 85 | 86 | Returns: 87 | list(Document): input documents from mediawikis config with modified source metadata 88 | """ 89 | # https://python.langchain.com/docs/integrations/document_loaders/mediawikidump 90 | loader = MWDumpLoader( 91 | encoding="utf-8", 92 | file_path=f"{wiki[0]}/{wiki[1]}_pages_current.xml", 93 | # https://www.mediawiki.org/wiki/Help:Namespaces 94 | namespaces=[0], 95 | skip_redirects=True, 96 | stop_on_error=False, 97 | ) 98 | # For each Document provided: 99 | # Modify the source metadata by accounting for duplicates (_n) 100 | # And append the mediawiki name (" - {wiki}") 101 | 102 | return [ 103 | Document(doc.page_content, {"source": doc.metadata["source"] + f" - {wiki[1]}"}) 104 | for doc in rename_duplicates(loader.load()) 105 | ] 106 | 107 | 108 | class CustomTextSplitter(RecursiveCharacterTextSplitter): 109 | """Creates a custom Character Text Splitter. 110 | 111 | Args: 112 | RecursiveCharacterTextSplitter (RecursiveCharacterTextSplitter): Generates chunks based on different separator rules 113 | """ 114 | 115 | def __init__(self, **kwargs: Any) -> None: 116 | separators = [r"\w(=){3}\n", r"\w(=){2}\n", r"\n\n", r"\n", r"\s"] 117 | super().__init__(separators=separators, keep_separator=False, **kwargs) 118 | 119 | 120 | def load_documents(config: dict): 121 | """Load all documents from each configured mediawiki using multiprocessing.
122 | 123 | Args: 124 | config (dict): items in config.yaml 125 | 126 | Returns: 127 | list(Document): input documents from mediawikis config with modified source metadata 128 | """ 129 | 130 | documents = sum( 131 | process_map( 132 | load_document, 133 | [(config["source"], wiki) for wiki in config["mediawikis"]], 134 | desc="Loading Documents", 135 | max_workers=torch.get_num_threads(), 136 | ), 137 | [], 138 | ) 139 | splitter = CustomTextSplitter( 140 | add_start_index=True, 141 | chunk_size=1000, 142 | is_separator_regex=True, 143 | ) 144 | documents = sum( 145 | process_map( 146 | splitter.split_documents, 147 | [[doc] for doc in documents], 148 | chunksize=1, 149 | desc="Splitting Documents", 150 | max_workers=torch.get_num_threads(), 151 | ), 152 | [], 153 | ) 154 | documents = rename_duplicates(documents) 155 | 156 | return documents 157 | 158 | 159 | if __name__ == "__main__": 160 | config = load_config() 161 | config = parse_args(config, sys.argv[1:]) 162 | documents = load_documents(config) 163 | print(f"Embedding {len(documents)} Documents, this may take a while.") 164 | # https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub 165 | embeddings = HuggingFaceEmbeddings( 166 | cache_folder="./model", 167 | model_name=config["embeddings_model"], 168 | show_progress=True, 169 | ) 170 | # https://python.langchain.com/docs/integrations/vectorstores/chroma 171 | vectordb = Chroma.from_documents( 172 | documents=documents, 173 | embedding=embeddings, 174 | persist_directory=config["data_dir"], 175 | ) 176 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | import chainlit as cl 4 | from chainlit.input_widget import Slider, TextInput 5 | from langchain import hub 6 | from langchain.callbacks.base import BaseCallbackHandler 7 | from langchain.globals import set_llm_cache 8 | from 
langchain.schema.runnable.config import RunnableConfig 9 | from langchain_chroma import Chroma 10 | from langchain_community.cache import InMemoryCache 11 | from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler 12 | from langchain_core.output_parsers import StrOutputParser 13 | from langchain_core.runnables import RunnablePassthrough 14 | from langchain_huggingface import HuggingFaceEmbeddings 15 | from langchain_ollama import ChatOllama 16 | 17 | from embed import load_config, parse_args 18 | 19 | 20 | def import_db(config: dict): 21 | """Use existing Chroma vectorDB 22 | 23 | Args: 24 | config (dict): items in config.yaml 25 | 26 | Returns: 27 | Chroma: initialize a Chroma client. 28 | """ 29 | # https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub 30 | embeddings = HuggingFaceEmbeddings( 31 | cache_folder="./model", 32 | model_name=config["embeddings_model"], 33 | ) 34 | vectordb = Chroma( 35 | persist_directory=config["data_dir"], embedding_function=embeddings 36 | ) 37 | 38 | return vectordb 39 | 40 | 41 | def create_chain(config: dict): 42 | """Creates a conversation chain from a config file. 
43 | 44 | Args: 45 | config (dict): items in config.yaml 46 | 47 | Returns: 48 | Runnable: Langchain Runnable for use with ChatOllama 49 | """ 50 | if os.getenv("TEST"): 51 | config = parse_args(config, ["--test-embed"]) 52 | print("Running in TEST mode.") 53 | set_llm_cache(InMemoryCache()) 54 | callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]) 55 | vectordb = import_db(config) 56 | prompt = hub.pull("rlm/rag-prompt") 57 | # https://python.langchain.com/docs/integrations/llms/ollama 58 | model = ChatOllama( 59 | cache=True, 60 | callback_manager=callback_manager, 61 | model=config["model"], 62 | repeat_penalty=config["settings"]["repeat_penalty"], 63 | temperature=config["settings"]["temperature"], 64 | top_k=config["settings"]["top_k"], 65 | top_p=config["settings"]["top_p"], 66 | ) 67 | chain = ( 68 | { 69 | "context": vectordb.as_retriever( 70 | search_kwargs={"k": int(config["settings"]["num_sources"])} 71 | ), 72 | "question": RunnablePassthrough(), 73 | } 74 | | prompt 75 | | model 76 | | StrOutputParser() 77 | ) 78 | 79 | return chain 80 | 81 | 82 | async def update_cl(config: dict, settings: dict): 83 | """Update the model configuration. 84 | 85 | Args: 86 | config (dict): items in config.yaml 87 | settings (dict): user chat settings input 88 | """ 89 | if settings: 90 | config["settings"] = settings 91 | chain = create_chain(config) 92 | # https://docs.chainlit.io/api-reference/chat-settings 93 | inputs = [ 94 | TextInput( 95 | id="num_sources", 96 | label="# of Sources", 97 | initial=str(config["settings"]["num_sources"]), 98 | description="Number of sources returned based on their similarity score. The same source can be returned more than once. (Default: 4)", 99 | ), 100 | Slider( 101 | id="temperature", 102 | label="Temperature", 103 | initial=config["settings"]["temperature"], 104 | min=0, 105 | max=1, 106 | step=0.1, 107 | description="The temperature of the model. 
Increasing the temperature will make the model answer more creatively. (Default: 0.8)", 108 | ), 109 | Slider( 110 | id="repeat_penalty", 111 | label="Repeat Penalty", 112 | initial=config["settings"]["repeat_penalty"], 113 | min=1.0, 114 | max=3.0, 115 | step=0.1, 116 | description="Sets how strongly to penalize repetitions. A higher value will penalize repetitions more strongly. (Default: 1.1)", 117 | ), 118 | Slider( 119 | id="top_k", 120 | label="Top K", 121 | initial=config["settings"]["top_k"], 122 | min=0, 123 | max=100, 124 | step=1, 125 | description="Reduces the probability of generating nonsense. A higher value will give more diverse answers. (Default: 40)", 126 | ), 127 | Slider( 128 | id="top_p", 129 | label="Top P", 130 | initial=config["settings"]["top_p"], 131 | min=0, 132 | max=1, 133 | step=0.1, 134 | description="Works together with top-k. A higher value will lead to more diverse text. (Default: 0.9)", 135 | ), 136 | ] 137 | cl.user_session.set("chain", chain) 138 | 139 | await cl.ChatSettings(inputs).send() 140 | 141 | 142 | # https://docs.chainlit.io/integrations/langchain 143 | # https://docs.chainlit.io/examples/qa 144 | @cl.on_chat_start 145 | async def on_chat_start(): 146 | """ 147 | Triggered at the start of a chat session. It loads the model configuration from a file 148 | and sets it in the user session for future use. 149 | """ 150 | config = load_config() 151 | cl.user_session.set("config", config) 152 | await update_cl(config, {}) 153 | 154 | await cl.Message(content=config["introduction"]).send() 155 | 156 | 157 | @cl.on_message 158 | async def on_message(message: cl.Message): 159 | "Chat message handler." 160 | runnable = cl.user_session.get("chain") 161 | msg = cl.Message(content="") 162 | 163 | class PostMessageHandler(BaseCallbackHandler): 164 | """ 165 | Callback handler for handling the retriever and LLM processes. 166 | Used to post the sources of the retrieved documents as a Chainlit element. 
167 | """ 168 | 169 | def __init__(self, msg: cl.Message): 170 | BaseCallbackHandler.__init__(self) 171 | self.msg = msg 172 | self.sources = set() # To store unique sources 173 | 174 | def on_retriever_end(self, documents, **kwargs): 175 | "Save the sources found by the retriever." 176 | for d in documents: 177 | self.sources.add(d.metadata["source"]) # Add unique sources to the set 178 | 179 | def on_llm_end(self, response, **kwargs): 180 | "Stream the sources as a Chainlit element." 181 | if self.sources: 182 | sources_text = "\n".join(self.sources) 183 | self.msg.elements.append( 184 | cl.Text(name="Sources", content=sources_text, display="inline") 185 | ) 186 | 187 | async for chunk in runnable.astream( 188 | message.content, 189 | config=RunnableConfig( 190 | callbacks=[ 191 | cl.LangchainCallbackHandler(), 192 | PostMessageHandler(msg), 193 | ], 194 | ), 195 | ): 196 | await msg.stream_token(chunk) 197 | 198 | await msg.send() 199 | 200 | 201 | @cl.on_settings_update 202 | async def setup_agent(settings: dict): 203 | """Update Chat Settings.
204 | 205 | Args: 206 | settings (dict): user chat settings input 207 | """ 208 | config = cl.user_session.get("config") 209 | await update_cl(config, settings) 210 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Multi Mediawiki RAG Chatbot [![forks - multi-mediawiki-rag](https://img.shields.io/github/forks/tylertitsworth/multi-mediawiki-rag?style=social)](https://github.com/tylertitsworth/multi-mediawiki-rag) [![stars - multi-mediawiki-rag](https://img.shields.io/github/stars/tylertitsworth/multi-mediawiki-rag?style=social)](https://github.com/tylertitsworth/multi-mediawiki-rag) 2 | 3 | [![OS - Linux](https://img.shields.io/badge/OS-Linux-blue?logo=linux&logoColor=white)](https://www.linux.org/ "Go to Linux homepage") 4 | [![Made with Python](https://img.shields.io/badge/Python->=3.10-blue?logo=python&logoColor=white)](https://python.org "Go to Python homepage") 5 | [![Hosted on - Huggingface](https://img.shields.io/static/v1?label=Hosted+on&message=Huggingface&color=FFD21E)](https://huggingface.co/spaces/TotalSundae/dungeons-and-dragons) 6 | [![contributions - welcome](https://img.shields.io/badge/contributions-welcome-4ac41d)](https://github.com/tylertitsworth/multi-mediawiki-rag/blob/main/CONTRIBUTING.md) 7 | [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/8272/badge)](https://www.bestpractices.dev/projects/8272) 8 | ![Fossa](https://app.fossa.com/api/projects/git%2Bgithub.com%2Ftylertitsworth%2Fmulti-mediawiki-rag.svg?type=shield&skipHeapTracking=true&issueType=license) 9 | [![Unit Tests](https://github.com/tylertitsworth/multi-mediawiki-rag/actions/workflows/unit-test.yaml/badge.svg?branch=main)](https://github.com/tylertitsworth/multi-mediawiki-rag/actions/workflows/unit-test.yaml) 10 | [![pre-commit.ci 
status](https://results.pre-commit.ci/badge/github/tylertitsworth/multi-mediawiki-rag/main.svg)](https://results.pre-commit.ci/latest/github/tylertitsworth/multi-mediawiki-rag/main) 11 | 12 | [Chatbots](https://www.forbes.com/advisor/business/software/what-is-a-chatbot/) are very popular right now. Much openly accessible information is stored in some kind of [Mediawiki](https://en.wikipedia.org/wiki/MediaWiki). Creating a [RAG](https://research.ibm.com/blog/retrieval-augmented-generation-RAG) Chatbot is becoming a very powerful alternative to traditional data gathering. This project provides a basic framework for creating your own chatbot to run locally on Linux. 13 | 14 | ## Table of Contents 15 | 16 | - [Multi Mediawiki RAG Chatbot ](#multi-mediawiki-rag-chatbot--) 17 | - [Table of Contents](#table-of-contents) 18 | - [About](#about) 19 | - [Architecture](#architecture) 20 | - [Runtime Filesystem](#runtime-filesystem) 21 | - [Quickstart](#quickstart) 22 | - [Prerequisites](#prerequisites) 23 | - [Create Custom LLM](#create-custom-llm) 24 | - [Use Model from Huggingface](#use-model-from-huggingface) 25 | - [Create Vector Database](#create-vector-database) 26 | - [Expected Output](#expected-output) 27 | - [Add Different Document Type to DB](#add-different-document-type-to-db) 28 | - [Start Chatbot](#start-chatbot) 29 | - [Start Discord Bot](#start-discord-bot) 30 | - [Testing](#testing) 31 | - [Cypress](#cypress) 32 | - [Pytest](#pytest) 33 | - [License](#license) 34 | 35 | ## About 36 | 37 | [Mediawikis](https://en.wikipedia.org/wiki/MediaWiki) hosted by [Fandom](https://www.fandom.com/) usually allow you to download an XML dump of the entire wiki as it currently exists. This project primarily leverages [Langchain](https://github.com/langchain-ai/langchain) with a few other open source projects to combine many of the readily available quickstart guides into a complete vertical application based on mediawiki data.
38 | 39 | ### Architecture 40 | 41 | ```mermaid 42 | graph TD; 43 | a[/xml dump a/] --MWDumpLoader--> emb 44 | b[/xml dump b/] --MWDumpLoader--> emb 45 | emb{Embedding} --> db 46 | db[(Chroma)] --Document Retriever--> lc 47 | hf(Huggingface) --Sentence Transformer --> emb 48 | hf --LLM--> modelfile 49 | modelfile[/Modelfile/] --> Ollama 50 | Ollama(((Ollama))) <-.ChatOllama.-> lc 51 | lc{Langchain} <-.LLMChain.-> cl(((Chainlit))) 52 | click db href "https://github.com/chroma-core/chroma" 53 | click hf href "https://huggingface.co/" 54 | click cl href "https://github.com/Chainlit/chainlit" 55 | click lc href "https://github.com/langchain-ai/langchain" 56 | click Ollama href "https://github.com/jmorganca/ollama" 57 | ``` 58 | 59 | ### Runtime Filesystem 60 | 61 | ```txt 62 | multi-mediawiki-rag # $HOME/app 63 | ├── .chainlit 64 | │ ├── .langchain.db # Server Cache 65 | │ └── config.toml # Server Config 66 | ├── app.py 67 | ├── chainlit.md 68 | ├── config.yaml 69 | ├── data # VectorDB 70 | │ ├── 47e4e036-****-****-****-************ 71 | │ │ └── * 72 | │ └── chroma.sqlite3 73 | ├── embed.py 74 | ├── entrypoint.sh 75 | └── requirements.txt 76 | ``` 77 | 78 | ## Quickstart 79 | 80 | These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. 81 | 82 | ### Prerequisites 83 | 84 | These steps assume you are using a modern Linux OS like [Ubuntu 22.04](https://www.releases.ubuntu.com/jammy/) with [Python 3.10+](https://www.python.org/downloads/release/python-3100/). 85 | 86 | ```bash 87 | apt-get install -y curl git python3-venv 88 | git clone https://github.com/tylertitsworth/multi-mediawiki-rag.git 89 | curl https://ollama.ai/install.sh | sh 90 | python -m venv .venv 91 | source .venv/bin/activate 92 | pip install -U pip setuptools wheel 93 | pip install -r requirements.txt 94 | ``` 95 | 96 | 1. Run the above setup steps 97 | 2.
Download a mediawiki's XML dump by browsing to `/wiki/Special:Statistics` or using a tool like [wikiteam3](https://pypi.org/project/wikiteam3/) 98 | 1. If Downloading, download only the current pages, not the entire history 99 | 2. If using `wikiteam3`, scrape only namespace 0 100 | 3. Provide in the following format: `sources/_pages_current.xml` 101 | 3. Edit [`config.yaml`](config.yaml) with the location of your XML mediawiki data you downloaded in step 1 and other configuration information 102 | 103 | > [!CAUTION] 104 | > Installing [Ollama](https://github.com/jmorganca/ollama) will create a new user and a service on your system. Follow the [manual installation steps](https://github.com/jmorganca/ollama/blob/main/docs/linux.md#manual-install) to avoid this step and instead launch the ollama API using `ollama serve`. 105 | 106 | ### Create Custom LLM 107 | 108 | After installing Ollama we can use a [Modelfile](https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md) to download and tune an LLM to be more precise for Document Retrieval QA. 109 | 110 | ```bash 111 | ollama create volo -f ./Modelfile 112 | ``` 113 | 114 | > [!TIP] 115 | > Choose a model from the [Ollama model library](https://ollama.ai/library) and download with `ollama pull :`, then edit the `model` field in [`config.yaml`](config.yaml) with the same information. 116 | 117 | #### Use Model from Huggingface 118 | 119 | 1. Download a model of choice from [Huggingface](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) with `git clone https://huggingface.co// model/`. 120 | 2. If your model of choice is not in `GGUF` format, convert it with `docker run --rm -v $PWD/model/:/model ollama/quantize -q q4_0 /model`. 121 | 3. Modify the [Modelfile's](Modelfile) `FROM` line to contain the path to the `q4_0.bin` file in the modelname directory. 
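Putting those steps together, a Modelfile for a locally converted model might look like the sketch below. The file path, parameter value, and system prompt are illustrative placeholders, not taken from this repo's Modelfile.

```txt
# Point Ollama at the locally converted q4_0 weights instead of a library model
FROM ./model/mymodel/q4_0.bin
# Lower temperature for more deterministic retrieval answers (illustrative value)
PARAMETER temperature 0.2
# Optional system prompt to steer the assistant toward wiki QA
SYSTEM You answer questions using only the retrieved mediawiki context.
```

Rebuilding with `ollama create volo -f ./Modelfile` then makes the tuned model available to the chatbot.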

### Create Vector Database

Your XML data needs to be loaded and transformed into embeddings to create a [Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) VectorDB.

```bash
python embed.py
```

#### Expected Output

```txt
2023-12-16 09:50:53 - Loaded .env file
2023-12-16 09:50:55 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 09:51:18 - Use pytorch device: cpu
2023-12-16 09:56:09 - Anonymized telemetry enabled. See
https://docs.trychroma.com/telemetry for more information.
Batches: 100%|████████████████████████████████████████| 1303/1303 [1:23:14<00:00, 3.83s/it]
...
Batches: 100%|████████████████████████████████████████| 1172/1172 [1:04:08<00:00, 3.28s/it]
2023-12-16 19:47:01 - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2023-12-16 19:47:33 - Use pytorch device: cpu
Batches: 100%|████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
A Tako was an intelligent race of octopuses found in the Kara-Tur setting. They were known for
their territorial nature and combat skills, as well as having incredible camouflaging abilities
that allowed them to blend into various environments. Takos lived in small tribes with a
matriarchal society led by one or two female rulers. Their diet consisted mainly of crabs,
lobsters, oysters, and shellfish, while their ink was highly sought after for use in calligraphy
within Kara-Tur.
```

#### Add Different Document Type to DB

Choose a new [File type Document Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/) or [App Document Loader](https://python.langchain.com/docs/integrations/document_loaders/), and add those documents to the database using your own script.
Check out the provided [Example](examples/add_more_docs.py).

### Start Chatbot

```bash
chainlit run app.py -h
```

Access the Chatbot GUI at `http://localhost:8000`.

### Start Discord Bot

```bash
export DISCORD_BOT_TOKEN=...
chainlit run app.py -h
```

> [!TIP]
> Develop locally with [ngrok](https://dashboard.ngrok.com/get-started/setup/linux).

## Hosting

This chatbot is hosted for free on [Huggingface Spaces](https://huggingface.co/spaces/TotalSundae/dungeons-and-dragons), which means it is very slow due to the minimal hardware resources allocated to it. The provided [Dockerfile](./Dockerfile) offers a generic method for hosting this solution as one unified container; however, that method is not ideal and can lead to many issues if used for professional production systems.

## Testing

### Cypress

[Cypress](https://www.cypress.io/) tests modern web applications with visual debugging. It is used to test the [Chainlit](https://github.com/Chainlit/chainlit) UI functionality.

```bash
npm install
# Run Test Suite
bash cypress/test.sh
```

> [!NOTE]
> Cypress requires `node >= 16`.

### Pytest

[Pytest](https://docs.pytest.org/en/7.4.x/) is a mature, full-featured Python testing tool that helps you write better programs.
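A unit test in this style looks like the hypothetical example below; the real suites live in `test/test_embed.py` and `test/test_app.py`, and the helper function here is a stand-in, not code from this repo.

```python
# Hypothetical pytest-style unit test for a simple chunking helper
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks (stand-in for an embedding helper)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def test_chunk_text_covers_all_characters():
    text = "abcdefghij"
    chunks = chunk_text(text, 4)
    # The final chunk may be shorter, but nothing is lost or duplicated
    assert chunks == ["abcd", "efgh", "ij"]
    assert "".join(chunks) == text
```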

```bash
pip install pytest
# Test Embedding Functions
pytest test/test_embed.py -W ignore::DeprecationWarning
# Test e2e with Ollama Backend
pytest test -W ignore::DeprecationWarning
```

## License

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Ftylertitsworth%2Fmulti-mediawiki-rag.svg?type=large&issueType=license)](https://app.fossa.com/projects/git%2Bgithub.com%2Ftylertitsworth%2Fmulti-mediawiki-rag?ref=badge_large&issueType=license)