├── .editorconfig ├── .gitattributes ├── .github ├── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── PULL_REQUEST_TEMPLATE.md ├── dependabot.yml └── workflows │ ├── add-pr-to-project.yml │ ├── automerge.yml │ ├── conventional-prs.yml │ ├── release-please.yml │ └── rust.yml ├── .gitignore ├── CHANGELOG.md ├── Cargo.toml ├── LICENSE.txt ├── README.md ├── benches ├── TTN.fasta └── translate_cds.rs ├── build.rs ├── codecov.yml ├── src ├── data │ ├── cdot │ │ ├── json.rs │ │ ├── mod.rs │ │ └── snapshots │ │ │ ├── hgvs__data__cdot__json__tests__deserialize_brca1.snap │ │ │ ├── hgvs__data__cdot__json__tests__provider_get_gene_info.snap │ │ │ ├── hgvs__data__cdot__json__tests__provider_get_tx_exons.snap │ │ │ ├── hgvs__data__cdot__json__tests__provider_get_tx_for_gene.snap │ │ │ ├── hgvs__data__cdot__json__tests__provider_get_tx_for_region_brca1.snap │ │ │ ├── hgvs__data__cdot__json__tests__provider_get_tx_for_region_empty.snap │ │ │ ├── hgvs__data__cdot__json__tests__provider_get_tx_info.snap │ │ │ └── hgvs__data__cdot__json__tests__provider_get_tx_mapping_options.snap │ ├── error.rs │ ├── interface.rs │ ├── mod.rs │ ├── uta.rs │ └── uta_sr.rs ├── lib.rs ├── mapper │ ├── alignment.rs │ ├── altseq.rs │ ├── assembly.rs │ ├── cigar.rs │ ├── error.rs │ ├── mod.rs │ ├── snapshots │ │ └── hgvs__mapper__variant__test__issue_131.snap │ └── variant.rs ├── normalizer.rs ├── parser │ ├── display.rs │ ├── ds.rs │ ├── error.rs │ ├── impl_parse.rs │ ├── impl_validate.rs │ ├── mod.rs │ └── parse_funcs.rs ├── sequences.rs └── validator │ ├── error.rs │ └── mod.rs ├── tables.in └── tests └── data ├── data ├── .gitignore ├── bootstrap.sh ├── cdot │ ├── cdot-0.2.12.refseq.grch37_grch38.brca1.txt │ ├── cdot-0.2.21.refseq.grch37_grch38.brca1.json │ ├── extract_gene.py │ └── tx-mane.brca1.tsv ├── subset.awk └── uta_20210129-subset.pgd.gz ├── mapper ├── gcp │ ├── ADRA2B-dbSNP.tsv │ ├── BAHCC1-dbSNP.tsv │ ├── DNAH11-HGMD.tsv │ ├── DNAH11-dbSNP-NM_001277115.tsv │ ├── 
DNAH11-dbSNP-NM_003777.tsv │ ├── DNAH11-dbSNP.tsv │ ├── FOLR3-dbSNP.tsv │ ├── JRK-dbSNP.tsv │ ├── NEFL-dbSNP.tsv │ ├── ORAI1-dbSNP.tsv │ ├── ZCCHC3-dbSNP.tsv │ ├── noncoding.tsv │ ├── real-met1.tsv │ ├── real.tsv │ └── regression.tsv ├── proj-near-disc.tsv ├── real_cp.tsv └── sanity_cp.tsv ├── parser ├── gauntlet ├── grammar_test.tsv └── reject └── seqrepo_cache.fasta /.editorconfig: -------------------------------------------------------------------------------- 1 | # http://editorconfig.org 2 | 3 | root = true 4 | 5 | [*] 6 | charset = utf-8 7 | end_of_line = lf 8 | insert_final_newline = true 9 | trim_trailing_whitespace = true 10 | 11 | [*.{py,rst,ini,rs,toml}] 12 | indent_style = space 13 | indent_size = 4 14 | 15 | [*.{html,css,scss,json,yml}] 16 | indent_style = space 17 | indent_size = 2 18 | 19 | [*.md] 20 | trim_trailing_whitespace = false 21 | 22 | [Makefile] 23 | indent_style = tab 24 | 25 | [nginx.conf] 26 | indent_style = space 27 | indent_size = 2 28 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | *.fasta filter=lfs diff=lfs merge=lfs -text 2 | tests/data/*.gz filter=lfs diff=lfs merge=lfs -text 3 | tests/data/*/*.gz filter=lfs diff=lfs merge=lfs -text 4 | tests/data/*.fasta filter=lfs diff=lfs merge=lfs -text 5 | tests/data/*/*.fasta filter=lfs diff=lfs merge=lfs -text 6 | tests/data/*/*.tsv filter=lfs diff=lfs merge=lfs -text 7 | tests/data/*/*/*.tsv filter=lfs diff=lfs merge=lfs -text 8 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: bug 6 | assignees: '' 7 | 8 | --- 9 | 10 | 17 | 18 | **Describe the bug** 19 | A clear and concise description of what the bug is. 
20 | 21 | **To Reproduce** 22 | Steps to reproduce the behavior: 23 | 1. Go to '...' 24 | 2. Click on '....' 25 | 3. Scroll down to '....' 26 | 4. See error 27 | 28 | **Expected behavior** 29 | A clear and concise description of what you expected to happen. 30 | 31 | **Screenshots** 32 | If applicable, add screenshots to help explain your problem. 33 | 34 | **Desktop (please complete the following information):** 35 | - OS: [e.g. iOS] 36 | - Browser [e.g. chrome, safari] 37 | - Version [e.g. 22] 38 | 39 | **Smartphone (please complete the following information):** 40 | - Device: [e.g. iPhone6] 41 | - OS: [e.g. iOS8.1] 42 | - Browser [e.g. stock browser, safari] 43 | - Version [e.g. 22] 44 | 45 | **Additional context** 46 | Add any other context about the problem here. 47 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: enhancement 6 | assignees: '' 7 | 8 | --- 9 | 10 | 17 | 18 | **Is your feature request related to a problem? Please describe.** 19 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 20 | 21 | **Describe the solution you'd like** 22 | A clear and concise description of what you want to happen. 23 | 24 | **Describe alternatives you've considered** 25 | A clear and concise description of any alternative solutions or features you've considered. 26 | 27 | **Additional context** 28 | Add any other context or screenshots about the feature request here. 
29 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | 15 | -------------------------------------------------------------------------------- /.github/dependabot.yml: -------------------------------------------------------------------------------- 1 | # Please see the documentation for all configuration options: 2 | # https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates 3 | 4 | version: 2 5 | updates: 6 | - package-ecosystem: "github-actions" 7 | directory: "/" 8 | schedule: 9 | interval: "weekly" 10 | 11 | - package-ecosystem: "cargo" 12 | directory: "/" 13 | schedule: 14 | interval: "weekly" 15 | groups: 16 | # Group together updates to noodles as there are peer dependencies. 17 | # 18 | # Also include "reverse" transitive dependencies 19 | noodles: 20 | patterns: 21 | - "seqrepo" 22 | - "noodles-*" 23 | -------------------------------------------------------------------------------- /.github/workflows/add-pr-to-project.yml: -------------------------------------------------------------------------------- 1 | name: add needs-review pull requests to projects 2 | 3 | on: 4 | pull_request: 5 | types: 6 | - labeled 7 | 8 | jobs: 9 | add-to-project: 10 | name: Add pull request to project 11 | runs-on: ubuntu-latest 12 | steps: 13 | - name: register pull requests with release planning project 14 | uses: actions/add-to-project@v1.0.2 15 | with: 16 | project-url: https://github.com/orgs/varfish-org/projects/2 17 | github-token: ${{ secrets.BOT_TOKEN }} 18 | labeled: needs-review 19 | -------------------------------------------------------------------------------- /.github/workflows/automerge.yml: -------------------------------------------------------------------------------- 1 | name: Dependabot auto-merge 2 | 3 | on: pull_request 4 | 5 | permissions: 6 | contents: write 7 | 
8 | jobs: 9 | dependabot: 10 | runs-on: ubuntu-latest 11 | if: ${{ github.actor == 'dependabot[bot]' }} 12 | steps: 13 | - name: Enable auto-merge for Dependabot PRs 14 | run: gh pr merge --auto --squash "$PR_URL" 15 | env: 16 | PR_URL: ${{github.event.pull_request.html_url}} 17 | # GitHub provides this variable in the CI env. You don't 18 | # need to add anything to the secrets vault. 19 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 20 | -------------------------------------------------------------------------------- /.github/workflows/conventional-prs.yml: -------------------------------------------------------------------------------- 1 | name: PR 2 | on: 3 | pull_request_target: 4 | types: 5 | - opened 6 | - reopened 7 | - edited 8 | - synchronize 9 | 10 | jobs: 11 | title-format: 12 | runs-on: ubuntu-latest 13 | steps: 14 | - uses: amannn/action-semantic-pull-request@v5.5.3 15 | env: 16 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 17 | with: 18 | validateSingleCommit: true 19 | -------------------------------------------------------------------------------- /.github/workflows/release-please.yml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | branches: 4 | - main 5 | 6 | name: release-please 7 | 8 | jobs: 9 | release-please: 10 | if: github.repository_owner == 'varfish-org' 11 | runs-on: ubuntu-latest 12 | steps: 13 | - uses: GoogleCloudPlatform/release-please-action@v4 14 | id: release 15 | with: 16 | release-type: rust 17 | package-name: hgvs 18 | token: ${{ secrets.BOT_TOKEN }} 19 | 20 | - uses: actions/checkout@v4 21 | if: ${{ steps.release.outputs.release_created }} 22 | 23 | - name: Install stable toolchain 24 | uses: actions-rs/toolchain@v1 25 | if: ${{ steps.release.outputs.release_created }} 26 | with: 27 | toolchain: stable 28 | override: true 29 | 30 | - uses: Swatinem/rust-cache@v2.7.8 31 | if: ${{ steps.release.outputs.release_created }} 32 | 33 | - name: Publish crate 34 | if: ${{ 
steps.release.outputs.release_created }} 35 | uses: actions-rs/cargo@v1 36 | with: 37 | command: publish 38 | args: --token ${{ secrets.CRATES_IO_TOKEN }} 39 | -------------------------------------------------------------------------------- /.github/workflows/rust.yml: -------------------------------------------------------------------------------- 1 | name: CI 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | pull_request: 8 | branches: 9 | - main 10 | 11 | jobs: 12 | Formatting: 13 | runs-on: ubuntu-latest 14 | steps: 15 | - name: Checkout repository 16 | uses: actions/checkout@v4 17 | 18 | - name: Install stable toolchain 19 | uses: dtolnay/rust-toolchain@stable 20 | with: 21 | components: rustfmt 22 | 23 | - name: Check format 24 | run: cargo fmt --check 25 | 26 | Linting: 27 | runs-on: ubuntu-latest 28 | steps: 29 | - name: Checkout repository 30 | uses: actions/checkout@v4 31 | with: 32 | lfs: true 33 | 34 | - name: Install stable toolchain 35 | uses: dtolnay/rust-toolchain@stable 36 | with: 37 | components: clippy 38 | 39 | - name: Lint with clippy 40 | run: cargo clippy --no-deps # --all-targets --all-features 41 | 42 | Testing: 43 | needs: [Formatting, Linting] 44 | runs-on: ubuntu-latest 45 | 46 | strategy: 47 | matrix: 48 | include: 49 | - label: fast 50 | - label: full 51 | 52 | services: 53 | # The tests need a postgres server; the data will be loaded later 54 | # after checkout. 55 | postgres: 56 | image: postgres 57 | env: 58 | POSTGRES_DB: uta 59 | POSTGRES_USER: uta_admin 60 | POSTGRES_PASSWORD: uta_admin 61 | options: >- 62 | --health-cmd pg_isready 63 | --health-interval 10s 64 | --health-timeout 5s 65 | --health-retries 5 66 | ports: 67 | - 5432:5432 68 | 69 | steps: 70 | - name: Checkout repository 71 | uses: actions/checkout@v4 72 | with: 73 | lfs: 'true' 74 | 75 | - name: Install host libraries 76 | run: sudo apt-get install -y libsqlite3-dev libsqlite3-0 77 | 78 | - name: Import test database. 
79 | run: | 80 | set -euo pipefail 81 | zcat tests/data/data/uta_20210129-subset.pgd.gz \ 82 | | psql -v ON_ERROR_STOP=1 -U uta_admin -h 0.0.0.0 -d uta 83 | shell: bash 84 | env: 85 | PGPASSWORD: uta_admin 86 | 87 | - name: Install stable toolchain 88 | uses: dtolnay/rust-toolchain@stable 89 | with: 90 | components: llvm-tools-preview # needed for cargo llvm-cov 91 | 92 | - uses: Swatinem/rust-cache@v2.7.8 93 | 94 | - name: Install cargo-llvm-cov 95 | uses: taiki-e/install-action@cargo-llvm-cov 96 | 97 | - name: Run cargo-llvm-cov with fast tests 98 | run: cargo llvm-cov --lcov --output-path lcov.info -- --test-threads 1 99 | env: 100 | TEST_UTA_DATABASE_URL: postgres://uta_admin:uta_admin@0.0.0.0/uta 101 | TEST_UTA_DATABASE_SCHEMA: uta_20210129 102 | TEST_SEQREPO_CACHE_MODE: read 103 | TEST_SEQREPO_CACHE_PATH: tests/data/seqrepo_cache.fasta 104 | if: ${{ matrix.label == 'fast' }} 105 | 106 | - name: Run cargo-test with full tests 107 | run: "cargo test --release -- --include-ignored" 108 | env: 109 | TEST_UTA_DATABASE_URL: postgres://uta_admin:uta_admin@0.0.0.0/uta 110 | TEST_UTA_DATABASE_SCHEMA: uta_20210129 111 | TEST_SEQREPO_CACHE_MODE: read 112 | TEST_SEQREPO_CACHE_PATH: tests/data/seqrepo_cache.fasta 113 | if: ${{ matrix.label == 'full' }} 114 | 115 | - name: Codecov submission of fast test results 116 | uses: codecov/codecov-action@v5 117 | with: 118 | verbose: true 119 | token: ${{ secrets.CODECOV_TOKEN }} 120 | if: ${{ matrix.label == 'fast' }} 121 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | perf.* 2 | 3 | /target 4 | /Cargo.lock 5 | 6 | *~ 7 | .*.sw? 
8 | /.vscode 9 | 10 | *.lock 11 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## [0.18.2](https://github.com/varfish-org/hgvs-rs/compare/v0.18.1...v0.18.2) (2025-03-31) 4 | 5 | 6 | ### Bug Fixes 7 | 8 | * incorporate biocommons PR 765: fix exception in g_to_c mapping ([#231](https://github.com/varfish-org/hgvs-rs/issues/231)) ([f658e9b](https://github.com/varfish-org/hgvs-rs/commit/f658e9b87e8981cbd3788c9d1cd7b9dc2a1215d2)) 9 | * miscellaneous out of range fixes ([#234](https://github.com/varfish-org/hgvs-rs/issues/234)) ([9d80c7e](https://github.com/varfish-org/hgvs-rs/commit/9d80c7e3bf88ae86029a6ea22ab7df7f1ad025aa)) 10 | 11 | ## [0.18.1](https://github.com/varfish-org/hgvs-rs/compare/v0.18.0...v0.18.1) (2025-02-25) 12 | 13 | 14 | ### Bug Fixes 15 | 16 | * introduce clamping to mimic hgvs python behaviour ([#225](https://github.com/varfish-org/hgvs-rs/issues/225)) ([4693f09](https://github.com/varfish-org/hgvs-rs/commit/4693f09c8fd048983430be806f62c206884386db)) 17 | * make altseq public ([#228](https://github.com/varfish-org/hgvs-rs/issues/228)) ([7447835](https://github.com/varfish-org/hgvs-rs/commit/74478359709e9c6a7bebaaabc6e678a7cf1a02d1)) 18 | 19 | ## [0.18.0](https://github.com/varfish-org/hgvs-rs/compare/v0.17.5...v0.18.0) (2025-02-12) 20 | 21 | 22 | ### Features 23 | 24 | * Renormalize during `g_to_n` if `renormalize_g` is set to true ([#223](https://github.com/varfish-org/hgvs-rs/issues/223)) ([305058d](https://github.com/varfish-org/hgvs-rs/commit/305058d7a491cd2e9a2e0fbdcf5edf928216ff0e)) 25 | 26 | ## [0.17.5](https://github.com/varfish-org/hgvs-rs/compare/v0.17.4...v0.17.5) (2025-01-07) 27 | 28 | 29 | ### Bug Fixes 30 | 31 | * Produce Ident protein edits ([#218](https://github.com/varfish-org/hgvs-rs/issues/218))
([44ab9d1](https://github.com/varfish-org/hgvs-rs/commit/44ab9d1bf61fdccd1fa1c666b9f4f870ee340629)) 32 | 33 | ## [0.17.4](https://github.com/varfish-org/hgvs-rs/compare/v0.17.3...v0.17.4) (2024-12-09) 34 | 35 | 36 | ### Bug Fixes 37 | 38 | * Sec and Stop coincide in SEPHS2 ([#214](https://github.com/varfish-org/hgvs-rs/issues/214)) ([8ae7e73](https://github.com/varfish-org/hgvs-rs/commit/8ae7e73336a1d8142f316b42209a1461f2608f0a)) 39 | 40 | ## [0.17.3](https://github.com/varfish-org/hgvs-rs/compare/v0.17.2...v0.17.3) (2024-10-04) 41 | 42 | 43 | ### Bug Fixes 44 | 45 | * Add catch-all tag `Other` to account for future cdot tag changes ([#204](https://github.com/varfish-org/hgvs-rs/issues/204)) ([e02c98c](https://github.com/varfish-org/hgvs-rs/commit/e02c98c56e5d52ff8d834484139fdad9124dd100)) 46 | 47 | ## [0.17.2](https://github.com/varfish-org/hgvs-rs/compare/v0.17.1...v0.17.2) (2024-08-06) 48 | 49 | 50 | ### Bug Fixes 51 | 52 | * add support for tag 'GENCODE Primary' in cdot json ([#201](https://github.com/varfish-org/hgvs-rs/issues/201)) ([0b37341](https://github.com/varfish-org/hgvs-rs/commit/0b37341359ca1f616cda86a8b0dd928dc3144f2f)) 53 | 54 | ## [0.17.1](https://github.com/varfish-org/hgvs-rs/compare/v0.17.0...v0.17.1) (2024-07-26) 55 | 56 | 57 | ### Bug Fixes 58 | 59 | * add selenoprotein biotype ([#197](https://github.com/varfish-org/hgvs-rs/issues/197)) ([e4a43b6](https://github.com/varfish-org/hgvs-rs/commit/e4a43b6e0027b819b243b2926bbdf026c0c77cd3)) 60 | 61 | ## [0.17.0](https://github.com/varfish-org/hgvs-rs/compare/v0.16.3...v0.17.0) (2024-07-25) 62 | 63 | 64 | ### Features 65 | 66 | * add variant region condition for dups ending in 3' UTR ([#192](https://github.com/varfish-org/hgvs-rs/issues/192)) ([#195](https://github.com/varfish-org/hgvs-rs/issues/195)) ([362909c](https://github.com/varfish-org/hgvs-rs/commit/362909ca0a7887cda6588f7da5671cb01ee4500f)) 67 | 68 | ## [0.16.3](https://github.com/varfish-org/hgvs-rs/compare/v0.16.2...v0.16.3) (2024-07-16) 
69 | 70 | 71 | ### Bug Fixes 72 | 73 | * BioType naming for Ig* types, derive Hash for BioType ([#189](https://github.com/varfish-org/hgvs-rs/issues/189)) ([62519c7](https://github.com/varfish-org/hgvs-rs/commit/62519c7c6344b465fb65964aa52d503688b091a2)) 74 | 75 | ## [0.16.2](https://github.com/varfish-org/hgvs-rs/compare/v0.16.1...v0.16.2) (2024-07-14) 76 | 77 | 78 | ### Bug Fixes 79 | 80 | * add biotype vaultRNA_primary_transcript ([#180](https://github.com/varfish-org/hgvs-rs/issues/180)) ([d59783c](https://github.com/varfish-org/hgvs-rs/commit/d59783c51c30feeb36c389b2c8242e79ef9b5ad4)) 81 | * add biotype vaultRNA_primary_transcript ([#183](https://github.com/varfish-org/hgvs-rs/issues/183)) ([4345377](https://github.com/varfish-org/hgvs-rs/commit/4345377031220063a63e34be8b1af90c567e94f2)) 82 | * add partial field to cdot json representation struct ([#179](https://github.com/varfish-org/hgvs-rs/issues/179)) ([113a06a](https://github.com/varfish-org/hgvs-rs/commit/113a06a4646e6418b739da2423f953eadaac4ccc)) 83 | * add Serialize to data::cdot::json representations ([#176](https://github.com/varfish-org/hgvs-rs/issues/176)) ([2de1b8a](https://github.com/varfish-org/hgvs-rs/commit/2de1b8ab03d7671099b04e9bdfe58251f80e94f9)) 84 | 85 | ## [0.16.1](https://github.com/varfish-org/hgvs-rs/compare/v0.16.0...v0.16.1) (2024-06-10) 86 | 87 | 88 | ### Bug Fixes 89 | 90 | * use array instead of Vec for Codon representation ([#169](https://github.com/varfish-org/hgvs-rs/issues/169)) ([e455168](https://github.com/varfish-org/hgvs-rs/commit/e455168bf72b36f7e1c662bfa5dc3c10d2e448b0)) 91 | 92 | 93 | ### Performance Improvements 94 | 95 | * cache mappers, avoid allocations, codons are arrays ([#172](https://github.com/varfish-org/hgvs-rs/issues/172)) ([34acaf4](https://github.com/varfish-org/hgvs-rs/commit/34acaf4e5151de7186ce06ef42850ae6be716e18)) 96 | 97 | ## [0.16.0](https://github.com/varfish-org/hgvs-rs/compare/v0.15.0...v0.16.0) (2024-03-05) 98 | 99 | 100 | ### Features 101 | 
102 | * adding vertebrate mitochondrial code ([#160](https://github.com/varfish-org/hgvs-rs/issues/160)) ([#161](https://github.com/varfish-org/hgvs-rs/issues/161)) ([eb6739b](https://github.com/varfish-org/hgvs-rs/commit/eb6739b4e19d00beb4bae60bbced6fa16f33cd06)) 103 | 104 | ## [0.15.0](https://github.com/varfish-org/hgvs-rs/compare/v0.14.1...v0.15.0) (2024-02-08) 105 | 106 | 107 | ### Miscellaneous Chores 108 | 109 | * bumping dependencies ([#157](https://github.com/varfish-org/hgvs-rs/issues/157)) ([6e75cd5](https://github.com/varfish-org/hgvs-rs/commit/6e75cd5dac6b5a885e682fdf57046497a8a45373)) 110 | 111 | ## [0.14.1](https://github.com/varfish-org/hgvs-rs/compare/v0.14.0...v0.14.1) (2023-11-19) 112 | 113 | 114 | ### Bug Fixes 115 | 116 | * allow import of BioType guide_RNA ([#148](https://github.com/varfish-org/hgvs-rs/issues/148)) ([0e1240f](https://github.com/varfish-org/hgvs-rs/commit/0e1240f3522d5dbf2b53e60ddc2df2d6f81d3b3a)) 117 | 118 | ## [0.14.0](https://github.com/varfish-org/hgvs-rs/compare/v0.13.2...v0.14.0) (2023-11-19) 119 | 120 | 121 | ### Features 122 | 123 | * adding support for selenoproteins ([#145](https://github.com/varfish-org/hgvs-rs/issues/145)) ([#146](https://github.com/varfish-org/hgvs-rs/issues/146)) ([c5e21e2](https://github.com/varfish-org/hgvs-rs/commit/c5e21e28aa9de2bd4738e1f52cd96930fb1a3e48)) 124 | 125 | ## [0.13.2](https://github.com/varfish-org/hgvs-rs/compare/v0.13.1...v0.13.2) (2023-11-08) 126 | 127 | 128 | ### Bug Fixes 129 | 130 | * add more biotypes from cdot JSON for vep110 release ([#141](https://github.com/varfish-org/hgvs-rs/issues/141)) ([4cb7cb6](https://github.com/varfish-org/hgvs-rs/commit/4cb7cb6bd4456d7341260e3f753ef58aa32aab48)) 131 | 132 | ## [0.13.1](https://github.com/varfish-org/hgvs-rs/compare/v0.13.0...v0.13.1) (2023-11-08) 133 | 134 | 135 | ### Bug Fixes 136 | 137 | * adding back missing Gene::biotype member ([#138](https://github.com/varfish-org/hgvs-rs/issues/138)) 
([84db645](https://github.com/varfish-org/hgvs-rs/commit/84db6458b0e3551913488662a74dd738c1279308)) 138 | 139 | ## [0.13.0](https://github.com/varfish-org/hgvs-rs/compare/v0.12.0...v0.13.0) (2023-11-08) 140 | 141 | 142 | ### Features 143 | 144 | * make compatible with cdot v0.2.21 ([#136](https://github.com/varfish-org/hgvs-rs/issues/136)) ([c677b33](https://github.com/varfish-org/hgvs-rs/commit/c677b33d000117914359412e0f44c1317a82cc7c)) 145 | 146 | ## [0.12.0](https://github.com/varfish-org/hgvs-rs/compare/v0.11.0...v0.12.0) (2023-10-21) 147 | 148 | 149 | ### Features 150 | 151 | * move out assemblies to biocommons_bioutils crate ([#134](https://github.com/varfish-org/hgvs-rs/issues/134)) ([#135](https://github.com/varfish-org/hgvs-rs/issues/135)) ([d06bd28](https://github.com/varfish-org/hgvs-rs/commit/d06bd28d17884df1999669a378a516d317ab28e0)) 152 | 153 | 154 | ### Bug Fixes 155 | 156 | * problem with annotating stop_retained insertions ([#131](https://github.com/varfish-org/hgvs-rs/issues/131)) ([#132](https://github.com/varfish-org/hgvs-rs/issues/132)) ([0f42d20](https://github.com/varfish-org/hgvs-rs/commit/0f42d20fa89bfc3db429ce6bfe53984dd03ba29c)) 157 | 158 | ## [0.11.0](https://github.com/varfish-org/hgvs-rs/compare/v0.10.1...v0.11.0) (2023-09-18) 159 | 160 | 161 | ### Features 162 | 163 | * various dependency updates, bump version to 0.11.0 ([#128](https://github.com/varfish-org/hgvs-rs/issues/128)) ([09ed016](https://github.com/varfish-org/hgvs-rs/commit/09ed0167624ea67354eef78e698f9b0439050b11)) 164 | 165 | ## [0.10.1](https://github.com/varfish-org/hgvs-rs/compare/v0.10.0...v0.10.1) (2023-07-04) 166 | 167 | 168 | ### Bug Fixes 169 | 170 | * properly configure dependabot for noodles ([#119](https://github.com/varfish-org/hgvs-rs/issues/119)) ([b539921](https://github.com/varfish-org/hgvs-rs/commit/b5399215294e555df06bc78c7ffca4d71acb4c75)) 171 | 172 | ## [0.10.0](https://github.com/varfish-org/hgvs-rs/compare/v0.9.0...v0.10.0) (2023-07-04) 173 | 174 | 
175 | ### Miscellaneous Chores 176 | 177 | * bump version to 0.10.0 ([7bf4cbd](https://github.com/varfish-org/hgvs-rs/commit/7bf4cbde1c4f09756943b339a7d28d05fc0b7e24)) 178 | 179 | ## [0.9.0](https://github.com/varfish-org/hgvs-rs/compare/v0.8.2...v0.9.0) (2023-06-12) 180 | 181 | 182 | ### Features 183 | 184 | * express existing thread safety (as most things are immutable) ([#115](https://github.com/varfish-org/hgvs-rs/issues/115)) ([0d6f241](https://github.com/varfish-org/hgvs-rs/commit/0d6f24177cfafa62796e9b5da069f8d1ca807aad)) 185 | 186 | ## [0.8.2](https://github.com/varfish-org/hgvs-rs/compare/v0.8.1...v0.8.2) (2023-06-08) 187 | 188 | 189 | ### Code Refactoring 190 | 191 | * replace linked-hash-map by indexmap ([#111](https://github.com/varfish-org/hgvs-rs/issues/111)) ([#112](https://github.com/varfish-org/hgvs-rs/issues/112)) ([f6d3ab4](https://github.com/varfish-org/hgvs-rs/commit/f6d3ab47dc79daab2ad837c5c739976061991926)) 192 | 193 | ## [0.8.1](https://github.com/varfish-org/hgvs-rs/compare/v0.8.0...v0.8.1) (2023-05-23) 194 | 195 | 196 | ### Bug Fixes 197 | 198 | * bump dependencies ([#108](https://github.com/varfish-org/hgvs-rs/issues/108)) ([af75b48](https://github.com/varfish-org/hgvs-rs/commit/af75b48b3f189010a9e6e27d8cd6a29477cb5198)) 199 | 200 | ## [0.8.0](https://github.com/varfish-org/hgvs-rs/compare/v0.7.0...v0.8.0) (2023-05-23) 201 | 202 | 203 | ### Features 204 | 205 | * loosen dependencies ([#106](https://github.com/varfish-org/hgvs-rs/issues/106)) ([c654507](https://github.com/varfish-org/hgvs-rs/commit/c6545077e5e0bad33d4e0168cf455c3dcc1c928a)) 206 | 207 | ## [0.7.0](https://github.com/varfish-org/hgvs-rs/compare/v0.6.2...v0.7.0) (2023-04-24) 208 | 209 | 210 | ### Features 211 | 212 | * proper error handling with enums and thiserror ([#69](https://github.com/varfish-org/hgvs-rs/issues/69)) ([#103](https://github.com/varfish-org/hgvs-rs/issues/103))
([add8248](https://github.com/varfish-org/hgvs-rs/commit/add8248cabe25d1f19993e7041b1e97279e18bdd)) 213 | 214 | ## [0.6.2](https://github.com/varfish-org/hgvs-rs/compare/v0.6.1...v0.6.2) (2023-04-18) 215 | 216 | 217 | ### Bug Fixes 218 | 219 | * issue with non-dup insertion at start of protein ([#99](https://github.com/varfish-org/hgvs-rs/issues/99)) ([#100](https://github.com/varfish-org/hgvs-rs/issues/100)) ([bc5b5cf](https://github.com/varfish-org/hgvs-rs/commit/bc5b5cf4cd77e56b96be879eb8c939d29e0c58e0)) 220 | 221 | ## [0.6.1](https://github.com/varfish-org/hgvs-rs/compare/v0.6.0...v0.6.1) (2023-04-06) 222 | 223 | 224 | ### Bug Fixes 225 | 226 | * cases where dup/inv goes beyond CDS ([#89](https://github.com/varfish-org/hgvs-rs/issues/89)) ([5d951b1](https://github.com/varfish-org/hgvs-rs/commit/5d951b131a1295fc6e83be2676834e5b1ea34244)) 227 | * only warn on combining RefAlt with whole gene deletion ([#92](https://github.com/varfish-org/hgvs-rs/issues/92)) ([bff6c72](https://github.com/varfish-org/hgvs-rs/commit/bff6c725ab10d56acb1f01809e7f4926f4a08d2e)) 228 | * out of bound panic in case of long deletions ([#91](https://github.com/varfish-org/hgvs-rs/issues/91)) ([f0bcaa6](https://github.com/varfish-org/hgvs-rs/commit/f0bcaa6a7aeaf70212cb7d6b8e49b5ace1aeeb92)) 229 | * out of bounds issue for protein sequence creation ([#93](https://github.com/varfish-org/hgvs-rs/issues/93)) ([a40a5f5](https://github.com/varfish-org/hgvs-rs/commit/a40a5f547406feefc9f952b209838914721857fd)) 230 | * out of bounds issue on 5'-to-3' shifting beyond CDS ([#94](https://github.com/varfish-org/hgvs-rs/issues/94)) ([4fccdb8](https://github.com/varfish-org/hgvs-rs/commit/4fccdb826fea15539b59630b2f9d742271aac3c4)) 231 | * problem with variants in multi-stop codon txs ([#95](https://github.com/varfish-org/hgvs-rs/issues/95)) ([#96](https://github.com/varfish-org/hgvs-rs/issues/96)) ([f25658c](https://github.com/varfish-org/hgvs-rs/commit/f25658c7ada6483970c46d0cf43bf53fd7a2b791)) 232 | 233 
| ## [0.6.0](https://github.com/varfish-org/hgvs-rs/compare/v0.5.2...v0.6.0) (2023-04-05) 234 | 235 | 236 | ### Features 237 | 238 | * make some tables visible in `hgvs::sequences` ([#87](https://github.com/varfish-org/hgvs-rs/issues/87)) ([d81bf8c](https://github.com/varfish-org/hgvs-rs/commit/d81bf8c5ee8548471972828b2985fe99be94eccc)) 239 | 240 | ## [0.5.2](https://github.com/varfish-org/hgvs-rs/compare/v0.5.1...v0.5.2) (2023-04-04) 241 | 242 | 243 | ### Performance Improvements 244 | 245 | * further speeding up translate_cds code ([#83](https://github.com/varfish-org/hgvs-rs/issues/83)) ([#85](https://github.com/varfish-org/hgvs-rs/issues/85)) ([60b071d](https://github.com/varfish-org/hgvs-rs/commit/60b071db524df11b1f8e9633074df7c3213fb8ed)) 246 | 247 | ## [0.5.1](https://github.com/varfish-org/hgvs-rs/compare/v0.5.0...v0.5.1) (2023-04-03) 248 | 249 | 250 | ### Performance Improvements 251 | 252 | * tune translate_cds implementation ([#80](https://github.com/varfish-org/hgvs-rs/issues/80)) ([#81](https://github.com/varfish-org/hgvs-rs/issues/81)) ([a608ba1](https://github.com/varfish-org/hgvs-rs/commit/a608ba1a62892b9b49e85c06d02665db94c4e4ad)) 253 | 254 | ## [0.5.0](https://github.com/varfish-org/hgvs-rs/compare/v0.4.0...v0.5.0) (2023-03-31) 255 | 256 | 257 | ### Features 258 | 259 | * allow configuring that there is no genome sequence ([#65](https://github.com/varfish-org/hgvs-rs/issues/65)) ([cd5b7bb](https://github.com/varfish-org/hgvs-rs/commit/cd5b7bb0f04369b34f4d8857d558915ae2ddccbb)) 260 | * replacing usages of unwrap() with expect or Result ([#70](https://github.com/varfish-org/hgvs-rs/issues/70)) ([#73](https://github.com/varfish-org/hgvs-rs/issues/73)) ([94d6f88](https://github.com/varfish-org/hgvs-rs/commit/94d6f88f44f1f574dad7ca3015c2dba2852b869e)) 261 | 262 | 263 | ### Bug Fixes 264 | 265 | * case of missing stop codon (in particular ENSEMBL) ([#72](https://github.com/varfish-org/hgvs-rs/issues/72)) 
([a44e28c](https://github.com/varfish-org/hgvs-rs/commit/a44e28c3ea57bea66cf9354c17b0f93e9a920636)) 266 | * fixing wrong validation assumption about insertions ([#64](https://github.com/varfish-org/hgvs-rs/issues/64)) ([f58ff53](https://github.com/varfish-org/hgvs-rs/commit/f58ff53ad017c65968afef64ad50c532be64f605)) 267 | * issue with transcripts missing stop codon ([#67](https://github.com/varfish-org/hgvs-rs/issues/67)) ([f70bb91](https://github.com/varfish-org/hgvs-rs/commit/f70bb91bba07f41110f32bc7da62f0cbc873e85d)) 268 | * return error if problem in normalization ([#71](https://github.com/varfish-org/hgvs-rs/issues/71)) ([b656d2a](https://github.com/varfish-org/hgvs-rs/commit/b656d2ae87e0cfbbb08a7ae6b055f2d9ce13f1bc)) 269 | * return error in normalization instead of unwrap() ([#68](https://github.com/varfish-org/hgvs-rs/issues/68)) ([8144db6](https://github.com/varfish-org/hgvs-rs/commit/8144db60b74bb5fa2ff625cfef73be2297ddfd0f)) 270 | 271 | ## [0.4.0](https://github.com/varfish-org/hgvs-rs/compare/v0.3.1...v0.4.0) (2023-03-27) 272 | 273 | 274 | ### Features 275 | 276 | * allow for disabling renormalization ([#53](https://github.com/varfish-org/hgvs-rs/issues/53)) ([b2f5dbe](https://github.com/varfish-org/hgvs-rs/commit/b2f5dbeb9904fd70494a030eab37bcf1edb8845a)) 277 | * improved normalizer configuration ([#52](https://github.com/varfish-org/hgvs-rs/issues/52)) ([2db5e92](https://github.com/varfish-org/hgvs-rs/commit/2db5e92316930e76d4b58d47ccc1d2ba2c1010fe)) 278 | * make sequences module public ([#54](https://github.com/varfish-org/hgvs-rs/issues/54)) ([cfbb134](https://github.com/varfish-org/hgvs-rs/commit/cfbb134b38d853273d734db7bdb7ce1beafa9ad8)) 279 | * more comprehensive clone for field-less enums ([#58](https://github.com/varfish-org/hgvs-rs/issues/58)) ([3ada6b2](https://github.com/varfish-org/hgvs-rs/commit/3ada6b2a21e24fea2feeb8fb89888710072d4b52)) 280 | 281 | 282 | ### Bug Fixes 283 | 284 | * cases of empty sequence 
([#61](https://github.com/varfish-org/hgvs-rs/issues/61)) ([0a6a109](https://github.com/varfish-org/hgvs-rs/commit/0a6a1094f7387daa92c7bbcbb1c08e2c0c3620aa)) 285 | * fixing projection issues with cdot::json::Provider ([#55](https://github.com/varfish-org/hgvs-rs/issues/55)) ([edd0de9](https://github.com/varfish-org/hgvs-rs/commit/edd0de9e12e781a3fbeb8fc49d623b8d66484e8d)) 286 | * fixing wrong validation assumption about insertions ([#62](https://github.com/varfish-org/hgvs-rs/issues/62)) ([bf027dc](https://github.com/varfish-org/hgvs-rs/commit/bf027dc3cdef0b07af4cc234f79499b9146426cf)) 287 | * interpret replace_reference condition ([#49](https://github.com/varfish-org/hgvs-rs/issues/49)) ([8326a55](https://github.com/varfish-org/hgvs-rs/commit/8326a558eeee75defe8c466ec996ba02f81116cd)) 288 | * issue with out of bounds deletion ([#59](https://github.com/varfish-org/hgvs-rs/issues/59)) ([475398d](https://github.com/varfish-org/hgvs-rs/commit/475398d0e831afa43abc53327abb14135df40aea)) 289 | * remove ref length validation ([#57](https://github.com/varfish-org/hgvs-rs/issues/57)) ([763b7a3](https://github.com/varfish-org/hgvs-rs/commit/763b7a38f69f1c1b87472fa655099a988c424294)) 290 | * truncate AA insertion at stop codon ([#60](https://github.com/varfish-org/hgvs-rs/issues/60)) ([54231b9](https://github.com/varfish-org/hgvs-rs/commit/54231b983b1cbc4480631bde05ab55e2a668ed72)) 291 | 292 | ## [0.3.1](https://github.com/varfish-org/hgvs-rs/compare/v0.3.0...v0.3.1) (2023-03-13) 293 | 294 | 295 | ### Bug Fixes 296 | 297 | * deriving PartialEq and Eq for enums ([#47](https://github.com/varfish-org/hgvs-rs/issues/47)) ([abd946e](https://github.com/varfish-org/hgvs-rs/commit/abd946e0b37444222ff4f30da99eb61d67ac1a3d)) 298 | 299 | ## [0.3.0](https://github.com/varfish-org/hgvs-rs/compare/v0.2.0...v0.3.0) (2023-03-13) 300 | 301 | 302 | ### Features 303 | 304 | * implement cdot data provider ([#43](https://github.com/varfish-org/hgvs-rs/issues/43)) 
([#44](https://github.com/varfish-org/hgvs-rs/issues/44)) ([3a8ed9d](https://github.com/varfish-org/hgvs-rs/commit/3a8ed9d49c1c34bb7295091afe82b7011d6826ef)) 305 | 306 | ## [0.2.0](https://github.com/varfish-org/hgvs-rs/compare/v0.1.1...v0.2.0) (2023-03-06) 307 | 308 | 309 | ### Features 310 | 311 | * implement PartialEq for Assembly ([#40](https://github.com/varfish-org/hgvs-rs/issues/40)) ([8808211](https://github.com/varfish-org/hgvs-rs/commit/8808211ed3f26c187f4d2787c23e680bbcdf38c3)) 312 | 313 | 314 | ### Bug Fixes 315 | 316 | * git lfs untrack 'src/static_data/**/*.json*' ([#41](https://github.com/varfish-org/hgvs-rs/issues/41)) ([fa1af68](https://github.com/varfish-org/hgvs-rs/commit/fa1af68c13b76bf1e9ba329159d3cf29b5893620)) 317 | 318 | ## [0.1.1](https://github.com/varfish-org/hgvs-rs/compare/v0.1.0...v0.1.1) (2023-03-03) 319 | 320 | 321 | ### Bug Fixes 322 | 323 | * release 0.1.1 on crates.io ([#38](https://github.com/varfish-org/hgvs-rs/issues/38)) ([e318abf](https://github.com/varfish-org/hgvs-rs/commit/e318abf6368b1f0b7160ad25f0880706a92fc662)) 324 | 325 | ## 0.1.0 (2023-03-03) 326 | 327 | 328 | ### Features 329 | 330 | * implement AlignmentMapper ([#14](https://github.com/varfish-org/hgvs-rs/issues/14)) ([#15](https://github.com/varfish-org/hgvs-rs/issues/15)) ([07b573d](https://github.com/varfish-org/hgvs-rs/commit/07b573df79601dab3bbb933258693f5afd55c55c)) 331 | * implement Display for HgvsVariant and related types ([#4](https://github.com/varfish-org/hgvs-rs/issues/4)) ([#6](https://github.com/varfish-org/hgvs-rs/issues/6)) ([8e81536](https://github.com/varfish-org/hgvs-rs/commit/8e815366b19639b932a837c916384e663524043a)) 332 | * implement parsing of HGVS variants ([#2](https://github.com/varfish-org/hgvs-rs/issues/2)) ([#3](https://github.com/varfish-org/hgvs-rs/issues/3)) ([dbcfd05](https://github.com/varfish-org/hgvs-rs/commit/dbcfd059802459d6a5ca595b560c955de2d5f4ac)) 333 | * implement VariantMapper 
([#12](https://github.com/varfish-org/hgvs-rs/issues/12)) ([#13](https://github.com/varfish-org/hgvs-rs/issues/13)) ([44b10de](https://github.com/varfish-org/hgvs-rs/commit/44b10de9bc312814dfdb6ca160dedd5322570e0c)) 334 | * implementing AssemblyMapper ([#7](https://github.com/varfish-org/hgvs-rs/issues/7)) ([#35](https://github.com/varfish-org/hgvs-rs/issues/35)) ([9f3e0f3](https://github.com/varfish-org/hgvs-rs/commit/9f3e0f3198b5b78b394692d3a99fedc65091c467)) 335 | * port over access to the UTA data structures ([#10](https://github.com/varfish-org/hgvs-rs/issues/10)) ([#11](https://github.com/varfish-org/hgvs-rs/issues/11)) ([3e71e32](https://github.com/varfish-org/hgvs-rs/commit/3e71e3285215eed80e152d520864686d543af2b0)) 336 | * port over assembly info from bioutils ([#8](https://github.com/varfish-org/hgvs-rs/issues/8)) ([#9](https://github.com/varfish-org/hgvs-rs/issues/9)) ([d583556](https://github.com/varfish-org/hgvs-rs/commit/d5835565c358f2b132e94f5496a77c01a1b96096)) 337 | * port over variant normalizer ([#19](https://github.com/varfish-org/hgvs-rs/issues/19)) ([#20](https://github.com/varfish-org/hgvs-rs/issues/20)) ([8593068](https://github.com/varfish-org/hgvs-rs/commit/8593068608a045e753565cd8e17decb0f429b26e)) 338 | * porting over test_hgvs_variantmapper_gcp ([#21](https://github.com/varfish-org/hgvs-rs/issues/21)) ([#27](https://github.com/varfish-org/hgvs-rs/issues/27)) ([7f925c8](https://github.com/varfish-org/hgvs-rs/commit/7f925c845e8c2188d7cf06cb0beea75fa38c2ec3)) 339 | * porting over test_variantmapper_cp_real ([#21](https://github.com/varfish-org/hgvs-rs/issues/21)) ([#26](https://github.com/varfish-org/hgvs-rs/issues/26)) ([6873324](https://github.com/varfish-org/hgvs-rs/commit/687332422d317cb173733491080b460773b6b2f2)) 340 | * start to port over test_hgvs_grammer_full.py ([#21](https://github.com/varfish-org/hgvs-rs/issues/21)) ([#31](https://github.com/varfish-org/hgvs-rs/issues/31)) 
([9d3e3b5](https://github.com/varfish-org/hgvs-rs/commit/9d3e3b532922095278abdb5ae57108f0c5067109)) 341 | 342 | 343 | ### Bug Fixes 344 | 345 | * variant mapper g_to_t used wrong logic ([#32](https://github.com/varfish-org/hgvs-rs/issues/32)) ([040e2d3](https://github.com/varfish-org/hgvs-rs/commit/040e2d3cd9ec77b2eecc755c4a8fc39a83f43101)) 346 | 347 | ## Changelog 348 | -------------------------------------------------------------------------------- /Cargo.toml: -------------------------------------------------------------------------------- 1 | [package] 2 | name = "hgvs" 3 | version = "0.18.2" 4 | edition = "2021" 5 | authors = ["Manuel Holtgrewe "] 6 | description = "Port of biocommons/hgvs to Rust" 7 | license = "Apache-2.0" 8 | repository = "https://github.com/varfish-org/hgvs-rs" 9 | readme = "README.md" 10 | rust-version = "1.80.0" 11 | 12 | [lib] 13 | name = "hgvs" 14 | path = "src/lib.rs" 15 | 16 | [dependencies] 17 | base16ct = "0.2" 18 | bio = "2.0" 19 | chrono = "0.4" 20 | enum-map = "2.4" 21 | flate2 = "1.0" 22 | log = "0.4" 23 | md-5 = "0.10" 24 | nom = "8.0" 25 | nom-language = "0.1.0" 26 | postgres = { version = "0.19", features = ["with-chrono-0_4"] } 27 | quick_cache = "0.6" 28 | regex = "1.7" 29 | rustc-hash = "2.0" 30 | seqrepo = { version = "0.10.3", features = ["cached"] } 31 | serde_json = "1.0" 32 | serde = { version = "1.0", features = ["derive"] } 33 | thiserror = "2.0" 34 | indexmap = { version = "2", features = ["serde"] } 35 | biocommons-bioutils = "0.1.0" 36 | ahash = "0.8.11" 37 | cached = "0.55.1" 38 | 39 | [dev-dependencies] 40 | anyhow = "1.0" 41 | criterion = "0.6" 42 | csv = "1.2" 43 | env_logger = "0.11" 44 | insta = { version = "1", features = ["yaml"] } 45 | pretty_assertions = "1.3" 46 | rstest = "0.25" 47 | test-log = "0.2" 48 | 49 | [[bench]] 50 | name = "translate_cds" 51 | harness = false 52 | -------------------------------------------------------------------------------- /LICENSE.txt: 
-------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 
40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Crates.io](https://img.shields.io/crates/d/hgvs.svg)](https://crates.io/crates/hgvs) 2 | [![Crates.io](https://img.shields.io/crates/v/hgvs.svg)](https://crates.io/crates/hgvs) 3 | [![Crates.io](https://img.shields.io/crates/l/hgvs.svg)](https://crates.io/crates/hgvs) 4 | [![CI](https://github.com/varfish-org/hgvs-rs/actions/workflows/rust.yml/badge.svg)](https://github.com/varfish-org/hgvs-rs/actions/workflows/rust.yml) 5 | [![codecov](https://codecov.io/gh/varfish-org/hgvs-rs/branch/main/graph/badge.svg?token=aZchhLWdzt)](https://codecov.io/gh/varfish-org/hgvs-rs) 6 | [![DOI](https://zenodo.org/badge/601272076.svg)](https://zenodo.org/badge/latestdoi/601272076) 7 | 8 | # hgvs-rs 9 | 10 | This is a port of [biocommons/hgvs](https://github.com/biocommons/hgvs) to the Rust programming language. 11 | The `data::cdot::*` code is based on a port of [SACGF/cdot](https://github.com/SACGF/cdot) to Rust. 
12 | 13 | ## Running Tests 14 | 15 | The tests need an instance of UTA to run. 16 | Either set up a local copy (with the minimal dataset in `tests/data/data/*.pgd.gz`) or use the public one. 17 | You will have to set the environment variables `TEST_UTA_DATABASE_URL` and `TEST_UTA_DATABASE_SCHEMA` appropriately. 18 | To use the public database: 19 | 20 | ``` 21 | export TEST_UTA_DATABASE_URL=postgres://anonymous:anonymous@uta.biocommons.org:/uta 22 | export TEST_UTA_DATABASE_SCHEMA=uta_20210129 23 | ``` 24 | 25 | Note that [seqrepo-rs](https://github.com/varfish-org/seqrepo-rs) is used for access to the genome contig sequences. 26 | It is inconvenient to provide subsets of sequences in SeqRepo format. 27 | Instead, we use a build-cache/read-cache approach that is also used by `biocommons/hgvs`. 28 | 29 | To build the cache, you will first need to download the seqrepo [as described in the biocommons/biocommons.seqrepo Quickstart](https://github.com/biocommons/biocommons.seqrepo#quick-start). 30 | Then, configure the tests for `hgvs-rs` as follows: 31 | 32 | ``` 33 | export TEST_SEQREPO_CACHE_MODE=write 34 | export TEST_SEQREPO_PATH=path/to/seqrepo/instance 35 | export TEST_SEQREPO_CACHE_PATH=tests/data/seqrepo_cache.fasta 36 | ``` 37 | 38 | When running the tests with `cargo test`, the cache file will be (re-)written. 39 | Note that you have to use `cargo test --release -- --test-threads 1 --include-ignored` when writing the cache, so that only a single test writes to the cache at any time. 40 | If you don't want to regenerate the cache, you can use the following settings. 41 | With these settings, the cache will only be read. 42 | 43 | ``` 44 | export TEST_SEQREPO_CACHE_MODE=read 45 | export TEST_SEQREPO_CACHE_PATH=tests/data/seqrepo_cache.fasta 46 | ``` 47 | 48 | With either setup in place, you can run the tests.
49 | 50 | ``` 51 | cargo test 52 | ``` 53 | 54 | ## Creating Reduced UTA Databases 55 | 56 | The script `tests/data/data/bootstrap.sh` makes it easy to build a reduced subset of the UTA database given a list of genes. 57 | The process is as follows: 58 | 59 | 1. You edit `bootstrap.sh` to include the HGNC gene symbols of the transcripts that you want to use. 60 | 2. You run the bootstrap script. 61 | This will download the given UTA dump and reduce it to the information related to these transcripts. 62 | 63 | ``` 64 | $ bootstrap.sh http://dl.biocommons.org/uta uta_20210129 65 | ``` 66 | 67 | The `*.pgd.gz` file is added to the Git repository via `git-lfs`, and this minimal database is used in CI. 68 | 69 | ## Some Timing Results 70 | 71 | (I don't want to call it "benchmarks" yet.) 72 | 73 | ### Deserialization of large cdot JSON files 74 | 75 | Host: 76 | 77 | - CPU: Intel(R) Xeon(R) E-2174G CPU @ 3.80GHz 78 | - Disk: NVME (WDC CL SN720 SDAQNTW-1T00-2000) 79 | 80 | Single running time results (no repetitions/warm start etc.): 81 | 82 | - ENSEMBL: 37s 83 | - RefSeq: 67s 84 | 85 | This includes loading and deserialization of the records only.
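For one-off wall-clock numbers like the above, a plain `std::time::Instant` wrapper suffices; the sketch below illustrates the approach, where `load_records` is a hypothetical stand-in for the actual cdot JSON loading and deserialization code (which in the real crate lives under `hgvs::data::cdot::json`).

```rust
use std::time::Instant;

// Hypothetical stand-in for loading and deserializing cdot records;
// here it only materializes a vector so the sketch runs stand-alone.
fn load_records(n: usize) -> Vec<u64> {
    (0..n as u64).collect()
}

fn main() {
    // Single cold run, no repetitions or warm-up -- matching how the
    // timings above were taken.
    let start = Instant::now();
    let records = load_records(1_000_000);
    println!("loaded {} records in {:?}", records.len(), start.elapsed());
}
```

For repeatable measurements with warm-up and statistics, prefer the Criterion harness already set up in `benches/translate_cds.rs`.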
86 | -------------------------------------------------------------------------------- /benches/TTN.fasta: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:87917059ae36051f6e7b498d63084cdea9b9121225e4644f53075ba1caff1f80 3 | size 105866 4 | -------------------------------------------------------------------------------- /benches/translate_cds.rs: -------------------------------------------------------------------------------- 1 | use criterion::{criterion_group, criterion_main, Criterion}; 2 | use hgvs::sequences::{translate_cds, TranslationTable}; 3 | use std::sync::LazyLock; 4 | 5 | /// TTN FASTA string from https://www.ncbi.nlm.nih.gov/nuccore/NM_001126114.1 6 | static TTN_FASTA: &str = include_str!("TTN.fasta"); 7 | 8 | /// Raw TTN sequence. 9 | static SEQ_TTN: LazyLock<String> = LazyLock::new(|| { 10 | let mut seq = String::new(); 11 | for line in TTN_FASTA.lines() { 12 | if !line.starts_with('>') { 13 | seq.push_str(line); 14 | } 15 | } 16 | seq 17 | }); 18 | 19 | fn criterion_benchmark(c: &mut Criterion) { 20 | c.bench_function("translate_cds TTN", |b| { 21 | b.iter(|| translate_cds(&SEQ_TTN, true, "*", TranslationTable::Standard).unwrap()) 22 | }); 23 | } 24 | 25 | criterion_group!(benches, criterion_benchmark); 26 | criterion_main!(benches); 27 | -------------------------------------------------------------------------------- /build.rs: -------------------------------------------------------------------------------- 1 | use std::env; 2 | use std::fs::File; 3 | use std::io::{BufWriter, Result, Write}; 4 | use std::path::Path; 5 | 6 | fn main() -> Result<()> { 7 | let out_dir = env::var("OUT_DIR").unwrap(); 8 | let dest_path = Path::new(&out_dir).join("tables_gen.rs"); 9 | let mut f = File::create(&dest_path).map(BufWriter::new)?; 10 | 11 | include_hardcoded_translation_tables(&mut f)?; 12 | generate_dna_ascii_map(&mut f)?; 13 | 14 | generate_codon_2bit_to_aa1_lut(&mut f)?;
15 | generate_codon_2bit_to_aa1_sec(&mut f)?; 16 | generate_codon_2bit_to_aa1_chrmt_vertebrate(&mut f)?; 17 | 18 | generate_aa1_to_aa3_str_lookup_function(&mut f)?; 19 | generate_aa1_to_aa3_str_lookup_table(&mut f)?; 20 | generate_aa3_to_aa1_lookup_function(&mut f)?; 21 | 22 | f.flush()?; 23 | println!("cargo::rerun-if-changed=build.rs"); 24 | Ok(()) 25 | } 26 | 27 | fn generate_dna_ascii_map(f: &mut BufWriter<File>) -> Result<()> { 28 | let mut result = [0; 256]; 29 | for c in 0..=255 { 30 | if c == b'u' || c == b'U' { 31 | result[c as usize] = b'T'; 32 | } else if c.is_ascii_lowercase() { 33 | result[c as usize] = c.to_ascii_uppercase(); 34 | } else { 35 | result[c as usize] = c; 36 | } 37 | } 38 | 39 | writeln!(f, "/// Mapping for DNA characters for normalization.")?; 40 | write!(f, "const DNA_ASCII_MAP: [u8; 256] = [")?; 41 | for v in result { 42 | write!(f, "{}, ", v)?; 43 | } 44 | writeln!(f, "];")?; 45 | Ok(()) 46 | } 47 | fn generate_codon_2bit_to_aa1_lut(f: &mut BufWriter<File>) -> Result<()> { 48 | let mut result = [0; 64]; 49 | for (i, (dna3, aa1)) in DNA_TO_AA1_LUT_VEC.iter().enumerate() { 50 | if i > 63 { 51 | break; // skip degenerate codons 52 | } 53 | let dna3_2bit = dna3_to_2bit(dna3.as_bytes()).expect("invalid dna3"); 54 | result[dna3_2bit as usize] = aa1.as_bytes()[0]; 55 | } 56 | write!(f, "const CODON_2BIT_TO_AA1_LUT: [u8; 64] = [")?; 57 | for v in result { 58 | write!(f, "{}, ", v)?; 59 | } 60 | writeln!(f, "];")?; 61 | Ok(()) 62 | } 63 | 64 | fn generate_codon_2bit_to_aa1_sec(f: &mut BufWriter<File>) -> Result<()> { 65 | let mut result = [0; 64]; 66 | for (i, (dna3, aa1)) in DNA_TO_AA1_SEC_VEC.iter().enumerate() { 67 | if i > 63 { 68 | break; // skip degenerate codons 69 | } 70 | let dna3_2bit = dna3_to_2bit(dna3.as_bytes()).expect("invalid dna3"); 71 | result[dna3_2bit as usize] = aa1.as_bytes()[0]; 72 | } 73 | write!(f, "const CODON_2BIT_TO_AA1_SEC: [u8; 64] = [")?; 74 | for v in result { 75 | write!(f, "{}, ", v)?; 76 | } 77 | writeln!(f, "];")?; 78 |
Ok(()) 79 | } 80 | 81 | fn generate_codon_2bit_to_aa1_chrmt_vertebrate(f: &mut BufWriter<File>) -> Result<()> { 82 | let mut result = [0; 64]; 83 | for (i, (dna3, aa1)) in DNA_TO_AA1_CHRMT_VERTEBRATE_VEC.iter().enumerate() { 84 | if i > 63 { 85 | break; // skip degenerate codons 86 | } 87 | let dna3_2bit = dna3_to_2bit(dna3.as_bytes()).expect("invalid dna3"); 88 | result[dna3_2bit as usize] = aa1.as_bytes()[0]; 89 | } 90 | write!(f, "const CODON_2BIT_TO_AA1_CHRMT_VERTEBRATE: [u8; 64] = [")?; 91 | for v in result { 92 | write!(f, "{}, ", v)?; 93 | } 94 | writeln!(f, "];")?; 95 | Ok(()) 96 | } 97 | 98 | fn generate_aa1_to_aa3_str_lookup_function(f: &mut BufWriter<File>) -> Result<()> { 99 | writeln!( 100 | f, 101 | "const fn _aa1_to_aa3_str(aa1: u8) -> Option<&'static str> {{" 102 | )?; 103 | writeln!(f, " match aa1 {{")?; 104 | for (aa3, aa1) in AA3_TO_AA1_VEC { 105 | writeln!(f, " b'{}' => Some(\"{}\"),", aa1, aa3)?; 106 | } 107 | writeln!(f, r" _ => None,")?; 108 | writeln!(f, " }}")?; 109 | writeln!(f, "}}")?; 110 | Ok(()) 111 | } 112 | 113 | fn generate_aa1_to_aa3_str_lookup_table(f: &mut BufWriter<File>) -> Result<()> { 114 | let mut result = [""; 256]; 115 | for (aa3, aa1) in AA3_TO_AA1_VEC { 116 | result[aa1.as_bytes()[0] as usize] = aa3; 117 | } 118 | write!(f, "const AA1_TO_AA3_STR: [Option<&str>; 256] = [")?; 119 | for v in result { 120 | if v.is_empty() { 121 | write!(f, "None, ")?; 122 | } else { 123 | write!(f, r##"Some("{}"), "##, v)?; 124 | } 125 | } 126 | writeln!(f, "];")?; 127 | Ok(()) 128 | } 129 | 130 | fn generate_aa3_to_aa1_lookup_function(f: &mut BufWriter<File>) -> Result<()> { 131 | writeln!(f, "const fn _aa3_to_aa1(aa3: &[u8]) -> Option<u8> {{")?; 132 | writeln!(f, " match aa3 {{")?; 133 | for (aa3, aa1) in AA3_TO_AA1_VEC { 134 | writeln!(f, " b\"{}\" => Some(b'{}'),", aa3, aa1)?; 135 | } 136 | writeln!(f, " _ => None,")?; 137 | writeln!(f, " }}")?; 138 | writeln!(f, "}}")?; 139 | Ok(()) 140 | } 141 | 142 | fn include_hardcoded_translation_tables(f: &mut BufWriter<File>) ->
Result<()> { 143 | let text = include_str!("tables.in"); 144 | writeln!(f, "{}", text)?; 145 | Ok(()) 146 | } 147 | 148 | const DNA_ASCII_TO_2BIT: [u8; 256] = { 149 | let mut result = [255; 256]; 150 | 151 | result[b'A' as usize] = 0; 152 | result[b'a' as usize] = 0; 153 | 154 | result[b'C' as usize] = 1; 155 | result[b'c' as usize] = 1; 156 | 157 | result[b'G' as usize] = 2; 158 | result[b'g' as usize] = 2; 159 | 160 | result[b'T' as usize] = 3; 161 | result[b't' as usize] = 3; 162 | result[b'U' as usize] = 3; 163 | result[b'u' as usize] = 3; 164 | result 165 | }; 166 | 167 | fn dna3_to_2bit(c: &[u8]) -> Option<u8> { 168 | let mut result = 0; 169 | for i in 0..3 { 170 | result <<= 2; 171 | let tmp = DNA_ASCII_TO_2BIT[c[i] as usize]; 172 | if tmp == 255 { 173 | return None; 174 | } 175 | result |= tmp; 176 | } 177 | Some(result) 178 | } 179 | 180 | // Hard-coded translation tables from src/tables.rs 181 | 182 | pub const AA3_TO_AA1_VEC: &[(&str, &str)] = &[ 183 | ("Ala", "A"), 184 | ("Arg", "R"), 185 | ("Asn", "N"), 186 | ("Asp", "D"), 187 | ("Cys", "C"), 188 | ("Gln", "Q"), 189 | ("Glu", "E"), 190 | ("Gly", "G"), 191 | ("His", "H"), 192 | ("Ile", "I"), 193 | ("Leu", "L"), 194 | ("Lys", "K"), 195 | ("Met", "M"), 196 | ("Phe", "F"), 197 | ("Pro", "P"), 198 | ("Ser", "S"), 199 | ("Thr", "T"), 200 | ("Trp", "W"), 201 | ("Tyr", "Y"), 202 | ("Val", "V"), 203 | ("Xaa", "X"), 204 | ("Ter", "*"), 205 | ("Sec", "U"), 206 | ]; 207 | 208 | const DNA_TO_AA1_LUT_VEC: &[(&str, &str)] = &[ 209 | ("AAA", "K"), 210 | ("AAC", "N"), 211 | ("AAG", "K"), 212 | ("AAT", "N"), 213 | ("ACA", "T"), 214 | ("ACC", "T"), 215 | ("ACG", "T"), 216 | ("ACT", "T"), 217 | ("AGA", "R"), 218 | ("AGC", "S"), 219 | ("AGG", "R"), 220 | ("AGT", "S"), 221 | ("ATA", "I"), 222 | ("ATC", "I"), 223 | ("ATG", "M"), 224 | ("ATT", "I"), 225 | ("CAA", "Q"), 226 | ("CAC", "H"), 227 | ("CAG", "Q"), 228 | ("CAT", "H"), 229 | ("CCA", "P"), 230 | ("CCC", "P"), 231 | ("CCG", "P"), 232 | ("CCT", "P"), 233 | ("CGA", "R"), 234
| ("CGC", "R"), 235 | ("CGG", "R"), 236 | ("CGT", "R"), 237 | ("CTA", "L"), 238 | ("CTC", "L"), 239 | ("CTG", "L"), 240 | ("CTT", "L"), 241 | ("GAA", "E"), 242 | ("GAC", "D"), 243 | ("GAG", "E"), 244 | ("GAT", "D"), 245 | ("GCA", "A"), 246 | ("GCC", "A"), 247 | ("GCG", "A"), 248 | ("GCT", "A"), 249 | ("GGA", "G"), 250 | ("GGC", "G"), 251 | ("GGG", "G"), 252 | ("GGT", "G"), 253 | ("GTA", "V"), 254 | ("GTC", "V"), 255 | ("GTG", "V"), 256 | ("GTT", "V"), 257 | ("TAA", "*"), 258 | ("TAC", "Y"), 259 | ("TAG", "*"), 260 | ("TAT", "Y"), 261 | ("TCA", "S"), 262 | ("TCC", "S"), 263 | ("TCG", "S"), 264 | ("TCT", "S"), 265 | // caveat lector 266 | ("TGA", "*"), 267 | ("TGC", "C"), 268 | ("TGG", "W"), 269 | ("TGT", "C"), 270 | ("TTA", "L"), 271 | ("TTC", "F"), 272 | ("TTG", "L"), 273 | ("TTT", "F"), 274 | // degenerate codons 275 | ("AAR", "K"), 276 | ("AAY", "N"), 277 | ("ACB", "T"), 278 | ("ACD", "T"), 279 | ("ACH", "T"), 280 | ("ACK", "T"), 281 | ("ACM", "T"), 282 | ("ACN", "T"), 283 | ("ACR", "T"), 284 | ("ACS", "T"), 285 | ("ACV", "T"), 286 | ("ACW", "T"), 287 | ("ACY", "T"), 288 | ("AGR", "R"), 289 | ("AGY", "S"), 290 | ("ATH", "I"), 291 | ("ATM", "I"), 292 | ("ATW", "I"), 293 | ("ATY", "I"), 294 | ("CAR", "Q"), 295 | ("CAY", "H"), 296 | ("CCB", "P"), 297 | ("CCD", "P"), 298 | ("CCH", "P"), 299 | ("CCK", "P"), 300 | ("CCM", "P"), 301 | ("CCN", "P"), 302 | ("CCR", "P"), 303 | ("CCS", "P"), 304 | ("CCV", "P"), 305 | ("CCW", "P"), 306 | ("CCY", "P"), 307 | ("CGB", "R"), 308 | ("CGD", "R"), 309 | ("CGH", "R"), 310 | ("CGK", "R"), 311 | ("CGM", "R"), 312 | ("CGN", "R"), 313 | ("CGR", "R"), 314 | ("CGS", "R"), 315 | ("CGV", "R"), 316 | ("CGW", "R"), 317 | ("CGY", "R"), 318 | ("CTB", "L"), 319 | ("CTD", "L"), 320 | ("CTH", "L"), 321 | ("CTK", "L"), 322 | ("CTM", "L"), 323 | ("CTN", "L"), 324 | ("CTR", "L"), 325 | ("CTS", "L"), 326 | ("CTV", "L"), 327 | ("CTW", "L"), 328 | ("CTY", "L"), 329 | ("GAR", "E"), 330 | ("GAY", "D"), 331 | ("GCB", "A"), 332 | ("GCD", "A"), 333 | ("GCH", 
"A"), 334 | ("GCK", "A"), 335 | ("GCM", "A"), 336 | ("GCN", "A"), 337 | ("GCR", "A"), 338 | ("GCS", "A"), 339 | ("GCV", "A"), 340 | ("GCW", "A"), 341 | ("GCY", "A"), 342 | ("GGB", "G"), 343 | ("GGD", "G"), 344 | ("GGH", "G"), 345 | ("GGK", "G"), 346 | ("GGM", "G"), 347 | ("GGN", "G"), 348 | ("GGR", "G"), 349 | ("GGS", "G"), 350 | ("GGV", "G"), 351 | ("GGW", "G"), 352 | ("GGY", "G"), 353 | ("GTB", "V"), 354 | ("GTD", "V"), 355 | ("GTH", "V"), 356 | ("GTK", "V"), 357 | ("GTM", "V"), 358 | ("GTN", "V"), 359 | ("GTR", "V"), 360 | ("GTS", "V"), 361 | ("GTV", "V"), 362 | ("GTW", "V"), 363 | ("GTY", "V"), 364 | ("MGA", "R"), 365 | ("MGG", "R"), 366 | ("MGR", "R"), 367 | ("TAR", "*"), 368 | ("TAY", "Y"), 369 | ("TCB", "S"), 370 | ("TCD", "S"), 371 | ("TCH", "S"), 372 | ("TCK", "S"), 373 | ("TCM", "S"), 374 | ("TCN", "S"), 375 | ("TCR", "S"), 376 | ("TCS", "S"), 377 | ("TCV", "S"), 378 | ("TCW", "S"), 379 | ("TCY", "S"), 380 | ("TGY", "C"), 381 | ("TRA", "*"), 382 | ("TTR", "L"), 383 | ("TTY", "F"), 384 | ("YTA", "L"), 385 | ("YTG", "L"), 386 | ("YTR", "L"), 387 | ]; 388 | 389 | /// Translation table for selenocysteine. 
390 | const DNA_TO_AA1_SEC_VEC: &[(&str, &str)] = &[ 391 | ("AAA", "K"), 392 | ("AAC", "N"), 393 | ("AAG", "K"), 394 | ("AAT", "N"), 395 | ("ACA", "T"), 396 | ("ACC", "T"), 397 | ("ACG", "T"), 398 | ("ACT", "T"), 399 | ("AGA", "R"), 400 | ("AGC", "S"), 401 | ("AGG", "R"), 402 | ("AGT", "S"), 403 | ("ATA", "I"), 404 | ("ATC", "I"), 405 | ("ATG", "M"), 406 | ("ATT", "I"), 407 | ("CAA", "Q"), 408 | ("CAC", "H"), 409 | ("CAG", "Q"), 410 | ("CAT", "H"), 411 | ("CCA", "P"), 412 | ("CCC", "P"), 413 | ("CCG", "P"), 414 | ("CCT", "P"), 415 | ("CGA", "R"), 416 | ("CGC", "R"), 417 | ("CGG", "R"), 418 | ("CGT", "R"), 419 | ("CTA", "L"), 420 | ("CTC", "L"), 421 | ("CTG", "L"), 422 | ("CTT", "L"), 423 | ("GAA", "E"), 424 | ("GAC", "D"), 425 | ("GAG", "E"), 426 | ("GAT", "D"), 427 | ("GCA", "A"), 428 | ("GCC", "A"), 429 | ("GCG", "A"), 430 | ("GCT", "A"), 431 | ("GGA", "G"), 432 | ("GGC", "G"), 433 | ("GGG", "G"), 434 | ("GGT", "G"), 435 | ("GTA", "V"), 436 | ("GTC", "V"), 437 | ("GTG", "V"), 438 | ("GTT", "V"), 439 | ("TAA", "*"), 440 | ("TAC", "Y"), 441 | ("TAG", "*"), 442 | ("TAT", "Y"), 443 | ("TCA", "S"), 444 | ("TCC", "S"), 445 | ("TCG", "S"), 446 | ("TCT", "S"), 447 | // caveat lector 448 | ("TGA", "U"), 449 | ("TGC", "C"), 450 | ("TGG", "W"), 451 | ("TGT", "C"), 452 | ("TTA", "L"), 453 | ("TTC", "F"), 454 | ("TTG", "L"), 455 | ("TTT", "F"), 456 | // degenerate codons 457 | ("AAR", "K"), 458 | ("AAY", "N"), 459 | ("ACB", "T"), 460 | ("ACD", "T"), 461 | ("ACH", "T"), 462 | ("ACK", "T"), 463 | ("ACM", "T"), 464 | ("ACN", "T"), 465 | ("ACR", "T"), 466 | ("ACS", "T"), 467 | ("ACV", "T"), 468 | ("ACW", "T"), 469 | ("ACY", "T"), 470 | ("AGR", "R"), 471 | ("AGY", "S"), 472 | ("ATH", "I"), 473 | ("ATM", "I"), 474 | ("ATW", "I"), 475 | ("ATY", "I"), 476 | ("CAR", "Q"), 477 | ("CAY", "H"), 478 | ("CCB", "P"), 479 | ("CCD", "P"), 480 | ("CCH", "P"), 481 | ("CCK", "P"), 482 | ("CCM", "P"), 483 | ("CCN", "P"), 484 | ("CCR", "P"), 485 | ("CCS", "P"), 486 | ("CCV", "P"), 487 | ("CCW", 
"P"), 488 | ("CCY", "P"), 489 | ("CGB", "R"), 490 | ("CGD", "R"), 491 | ("CGH", "R"), 492 | ("CGK", "R"), 493 | ("CGM", "R"), 494 | ("CGN", "R"), 495 | ("CGR", "R"), 496 | ("CGS", "R"), 497 | ("CGV", "R"), 498 | ("CGW", "R"), 499 | ("CGY", "R"), 500 | ("CTB", "L"), 501 | ("CTD", "L"), 502 | ("CTH", "L"), 503 | ("CTK", "L"), 504 | ("CTM", "L"), 505 | ("CTN", "L"), 506 | ("CTR", "L"), 507 | ("CTS", "L"), 508 | ("CTV", "L"), 509 | ("CTW", "L"), 510 | ("CTY", "L"), 511 | ("GAR", "E"), 512 | ("GAY", "D"), 513 | ("GCB", "A"), 514 | ("GCD", "A"), 515 | ("GCH", "A"), 516 | ("GCK", "A"), 517 | ("GCM", "A"), 518 | ("GCN", "A"), 519 | ("GCR", "A"), 520 | ("GCS", "A"), 521 | ("GCV", "A"), 522 | ("GCW", "A"), 523 | ("GCY", "A"), 524 | ("GGB", "G"), 525 | ("GGD", "G"), 526 | ("GGH", "G"), 527 | ("GGK", "G"), 528 | ("GGM", "G"), 529 | ("GGN", "G"), 530 | ("GGR", "G"), 531 | ("GGS", "G"), 532 | ("GGV", "G"), 533 | ("GGW", "G"), 534 | ("GGY", "G"), 535 | ("GTB", "V"), 536 | ("GTD", "V"), 537 | ("GTH", "V"), 538 | ("GTK", "V"), 539 | ("GTM", "V"), 540 | ("GTN", "V"), 541 | ("GTR", "V"), 542 | ("GTS", "V"), 543 | ("GTV", "V"), 544 | ("GTW", "V"), 545 | ("GTY", "V"), 546 | ("MGA", "R"), 547 | ("MGG", "R"), 548 | ("MGR", "R"), 549 | ("TAR", "*"), 550 | ("TAY", "Y"), 551 | ("TCB", "S"), 552 | ("TCD", "S"), 553 | ("TCH", "S"), 554 | ("TCK", "S"), 555 | ("TCM", "S"), 556 | ("TCN", "S"), 557 | ("TCR", "S"), 558 | ("TCS", "S"), 559 | ("TCV", "S"), 560 | ("TCW", "S"), 561 | ("TCY", "S"), 562 | ("TGY", "C"), 563 | ("TRA", "*"), 564 | ("TTR", "L"), 565 | ("TTY", "F"), 566 | ("YTA", "L"), 567 | ("YTG", "L"), 568 | ("YTR", "L"), 569 | ]; 570 | 571 | /// Vertebrate mitochondrial code, cf. 
https://en.wikipedia.org/wiki/Vertebrate_mitochondrial_code 572 | const DNA_TO_AA1_CHRMT_VERTEBRATE_VEC: &[(&str, &str)] = &[ 573 | ("AAA", "K"), 574 | ("AAC", "N"), 575 | ("AAG", "K"), 576 | ("AAT", "N"), 577 | ("ACA", "T"), 578 | ("ACC", "T"), 579 | ("ACG", "T"), 580 | ("ACT", "T"), 581 | // caveat lector 582 | ("AGA", "*"), 583 | ("AGC", "S"), 584 | // caveat lector 585 | ("AGG", "*"), 586 | ("AGT", "S"), 587 | // caveat lector 588 | ("ATA", "M"), 589 | ("ATC", "I"), 590 | ("ATG", "M"), 591 | ("ATT", "I"), 592 | ("CAA", "Q"), 593 | ("CAC", "H"), 594 | ("CAG", "Q"), 595 | ("CAT", "H"), 596 | ("CCA", "P"), 597 | ("CCC", "P"), 598 | ("CCG", "P"), 599 | ("CCT", "P"), 600 | ("CGA", "R"), 601 | ("CGC", "R"), 602 | ("CGG", "R"), 603 | ("CGT", "R"), 604 | ("CTA", "L"), 605 | ("CTC", "L"), 606 | ("CTG", "L"), 607 | ("CTT", "L"), 608 | ("GAA", "E"), 609 | ("GAC", "D"), 610 | ("GAG", "E"), 611 | ("GAT", "D"), 612 | ("GCA", "A"), 613 | ("GCC", "A"), 614 | ("GCG", "A"), 615 | ("GCT", "A"), 616 | ("GGA", "G"), 617 | ("GGC", "G"), 618 | ("GGG", "G"), 619 | ("GGT", "G"), 620 | ("GTA", "V"), 621 | ("GTC", "V"), 622 | ("GTG", "V"), 623 | ("GTT", "V"), 624 | ("TAA", "*"), 625 | ("TAC", "Y"), 626 | ("TAG", "*"), 627 | ("TAT", "Y"), 628 | ("TCA", "S"), 629 | ("TCC", "S"), 630 | ("TCG", "S"), 631 | ("TCT", "S"), 632 | // caveat lector 633 | ("TGA", "W"), 634 | ("TGC", "C"), 635 | ("TGG", "W"), 636 | ("TGT", "C"), 637 | ("TTA", "L"), 638 | ("TTC", "F"), 639 | ("TTG", "L"), 640 | ("TTT", "F"), 641 | // degenerate codons 642 | ("AAR", "K"), 643 | ("AAY", "N"), 644 | ("ACB", "T"), 645 | ("ACD", "T"), 646 | ("ACH", "T"), 647 | ("ACK", "T"), 648 | ("ACM", "T"), 649 | ("ACN", "T"), 650 | ("ACR", "T"), 651 | ("ACS", "T"), 652 | ("ACV", "T"), 653 | ("ACW", "T"), 654 | ("ACY", "T"), 655 | ("AGR", "R"), 656 | ("AGY", "S"), 657 | ("ATH", "I"), 658 | ("ATM", "I"), 659 | ("ATW", "I"), 660 | ("ATY", "I"), 661 | ("CAR", "Q"), 662 | ("CAY", "H"), 663 | ("CCB", "P"), 664 | ("CCD", "P"), 665 | ("CCH", 
"P"), 666 | ("CCK", "P"), 667 | ("CCM", "P"), 668 | ("CCN", "P"), 669 | ("CCR", "P"), 670 | ("CCS", "P"), 671 | ("CCV", "P"), 672 | ("CCW", "P"), 673 | ("CCY", "P"), 674 | ("CGB", "R"), 675 | ("CGD", "R"), 676 | ("CGH", "R"), 677 | ("CGK", "R"), 678 | ("CGM", "R"), 679 | ("CGN", "R"), 680 | ("CGR", "R"), 681 | ("CGS", "R"), 682 | ("CGV", "R"), 683 | ("CGW", "R"), 684 | ("CGY", "R"), 685 | ("CTB", "L"), 686 | ("CTD", "L"), 687 | ("CTH", "L"), 688 | ("CTK", "L"), 689 | ("CTM", "L"), 690 | ("CTN", "L"), 691 | ("CTR", "L"), 692 | ("CTS", "L"), 693 | ("CTV", "L"), 694 | ("CTW", "L"), 695 | ("CTY", "L"), 696 | ("GAR", "E"), 697 | ("GAY", "D"), 698 | ("GCB", "A"), 699 | ("GCD", "A"), 700 | ("GCH", "A"), 701 | ("GCK", "A"), 702 | ("GCM", "A"), 703 | ("GCN", "A"), 704 | ("GCR", "A"), 705 | ("GCS", "A"), 706 | ("GCV", "A"), 707 | ("GCW", "A"), 708 | ("GCY", "A"), 709 | ("GGB", "G"), 710 | ("GGD", "G"), 711 | ("GGH", "G"), 712 | ("GGK", "G"), 713 | ("GGM", "G"), 714 | ("GGN", "G"), 715 | ("GGR", "G"), 716 | ("GGS", "G"), 717 | ("GGV", "G"), 718 | ("GGW", "G"), 719 | ("GGY", "G"), 720 | ("GTB", "V"), 721 | ("GTD", "V"), 722 | ("GTH", "V"), 723 | ("GTK", "V"), 724 | ("GTM", "V"), 725 | ("GTN", "V"), 726 | ("GTR", "V"), 727 | ("GTS", "V"), 728 | ("GTV", "V"), 729 | ("GTW", "V"), 730 | ("GTY", "V"), 731 | ("MGA", "R"), 732 | ("MGG", "R"), 733 | ("MGR", "R"), 734 | ("TAR", "*"), 735 | ("TAY", "Y"), 736 | ("TCB", "S"), 737 | ("TCD", "S"), 738 | ("TCH", "S"), 739 | ("TCK", "S"), 740 | ("TCM", "S"), 741 | ("TCN", "S"), 742 | ("TCR", "S"), 743 | ("TCS", "S"), 744 | ("TCV", "S"), 745 | ("TCW", "S"), 746 | ("TCY", "S"), 747 | ("TGY", "C"), 748 | ("TRA", "*"), 749 | ("TTR", "L"), 750 | ("TTY", "F"), 751 | ("YTA", "L"), 752 | ("YTG", "L"), 753 | ("YTR", "L"), 754 | ]; 755 | -------------------------------------------------------------------------------- /codecov.yml: -------------------------------------------------------------------------------- 1 | # For more configuration details: 2 | 
# https://docs.codecov.io/docs/codecov-yaml 3 | 4 | # Check if this file is valid by running in bash: 5 | # curl -X POST --data-binary @.codecov.yml https://codecov.io/validate 6 | 7 | # Coverage configuration 8 | # ---------------------- 9 | coverage: 10 | status: 11 | patch: false 12 | 13 | range: 70..90 # First number represents red, and second represents green 14 | # (default is 70..100) 15 | round: down # up, down, or nearest 16 | precision: 2 # Number of decimal places, between 0 and 5 17 | 18 | # Ignoring Paths 19 | # -------------- 20 | # which folders/files to ignore 21 | 22 | # Pull request comments: 23 | # ---------------------- 24 | # Diff is the Coverage Diff of the pull request. 25 | # Files are the files impacted by the pull request 26 | comment: 27 | layout: diff, files # accepted in any order: reach, diff, flags, and/or files 28 | -------------------------------------------------------------------------------- /src/data/cdot/mod.rs: -------------------------------------------------------------------------------- 1 | //! Access to `cdot` transcripts. 2 | pub mod json; 3 | -------------------------------------------------------------------------------- /src/data/cdot/snapshots/hgvs__data__cdot__json__tests__provider_get_gene_info.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/data/cdot/json.rs 3 | expression: "&record" 4 | --- 5 | GeneInfoRecord { 6 | hgnc: "BRCA1", 7 | maploc: "17q21.31", 8 | descr: "BRCA1 DNA repair associated", 9 | summary: "This gene encodes a 190 kD nuclear phosphoprotein that plays a role in maintaining genomic stability, and it also acts as a tumor suppressor. The BRCA1 gene contains 22 exons spanning about 110 kb of DNA. The encoded protein combines with other tumor suppressors, DNA damage sensors, and signal transducers to form a large multi-subunit protein complex known as the BRCA1-associated genome surveillance complex (BASC). 
This gene product associates with RNA polymerase II, and through the C-terminal domain, also interacts with histone deacetylase complexes. This protein thus plays a role in transcription, DNA repair of double-stranded breaks, and recombination. Mutations in this gene are responsible for approximately 40% of inherited breast cancers and more than 80% of inherited breast and ovarian cancers. Alternative splicing plays a role in modulating the subcellular localization and physiological function of this gene. Many alternatively spliced transcript variants, some of which are disease-associated mutations, have been described for this gene, but the full-length natures of only some of these variants has been described. A related pseudogene, which is also located on chromosome 17, has been identified. [provided by RefSeq, May 2020]", 10 | aliases: [ 11 | "BRCAI", 12 | "BRCC1", 13 | "BROVCA1", 14 | "FANCS", 15 | "IRIS", 16 | "PNCA4", 17 | "PPP1R53", 18 | "PSCP", 19 | "RNF53", 20 | ], 21 | added: 1970-01-01T00:00:00, 22 | } 23 | -------------------------------------------------------------------------------- /src/data/cdot/snapshots/hgvs__data__cdot__json__tests__provider_get_tx_exons.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/data/cdot/json.rs 3 | expression: "&result" 4 | --- 5 | [ 6 | TxExonsRecord { 7 | hgnc: "BRCA1", 8 | tx_ac: "NM_007294.3", 9 | alt_ac: "NC_000017.10", 10 | alt_aln_method: "splign", 11 | alt_strand: -1, 12 | ord: 22, 13 | tx_start_i: 5699, 14 | tx_end_i: 7207, 15 | alt_start_i: 41196311, 16 | alt_end_i: 41197819, 17 | cigar: "1508M", 18 | tx_aseq: None, 19 | alt_aseq: None, 20 | tx_exon_set_id: 2147483647, 21 | alt_exon_set_id: 2147483647, 22 | tx_exon_id: 2147483647, 23 | alt_exon_id: 2147483647, 24 | exon_aln_id: 2147483647, 25 | }, 26 | TxExonsRecord { 27 | hgnc: "BRCA1", 28 | tx_ac: "NM_007294.3", 29 | alt_ac: "NC_000017.10", 30 | alt_aln_method: "splign", 31 | alt_strand: -1, 32 | 
ord: 21, 33 | tx_start_i: 5638, 34 | tx_end_i: 5699, 35 | alt_start_i: 41199659, 36 | alt_end_i: 41199720, 37 | cigar: "61M", 38 | tx_aseq: None, 39 | alt_aseq: None, 40 | tx_exon_set_id: 2147483647, 41 | alt_exon_set_id: 2147483647, 42 | tx_exon_id: 2147483647, 43 | alt_exon_id: 2147483647, 44 | exon_aln_id: 2147483647, 45 | }, 46 | TxExonsRecord { 47 | hgnc: "BRCA1", 48 | tx_ac: "NM_007294.3", 49 | alt_ac: "NC_000017.10", 50 | alt_aln_method: "splign", 51 | alt_strand: -1, 52 | ord: 20, 53 | tx_start_i: 5564, 54 | tx_end_i: 5638, 55 | alt_start_i: 41201137, 56 | alt_end_i: 41201211, 57 | cigar: "74M", 58 | tx_aseq: None, 59 | alt_aseq: None, 60 | tx_exon_set_id: 2147483647, 61 | alt_exon_set_id: 2147483647, 62 | tx_exon_id: 2147483647, 63 | alt_exon_id: 2147483647, 64 | exon_aln_id: 2147483647, 65 | }, 66 | TxExonsRecord { 67 | hgnc: "BRCA1", 68 | tx_ac: "NM_007294.3", 69 | alt_ac: "NC_000017.10", 70 | alt_aln_method: "splign", 71 | alt_strand: -1, 72 | ord: 19, 73 | tx_start_i: 5509, 74 | tx_end_i: 5564, 75 | alt_start_i: 41203079, 76 | alt_end_i: 41203134, 77 | cigar: "55M", 78 | tx_aseq: None, 79 | alt_aseq: None, 80 | tx_exon_set_id: 2147483647, 81 | alt_exon_set_id: 2147483647, 82 | tx_exon_id: 2147483647, 83 | alt_exon_id: 2147483647, 84 | exon_aln_id: 2147483647, 85 | }, 86 | TxExonsRecord { 87 | hgnc: "BRCA1", 88 | tx_ac: "NM_007294.3", 89 | alt_ac: "NC_000017.10", 90 | alt_aln_method: "splign", 91 | alt_strand: -1, 92 | ord: 18, 93 | tx_start_i: 5425, 94 | tx_end_i: 5509, 95 | alt_start_i: 41209068, 96 | alt_end_i: 41209152, 97 | cigar: "84M", 98 | tx_aseq: None, 99 | alt_aseq: None, 100 | tx_exon_set_id: 2147483647, 101 | alt_exon_set_id: 2147483647, 102 | tx_exon_id: 2147483647, 103 | alt_exon_id: 2147483647, 104 | exon_aln_id: 2147483647, 105 | }, 106 | TxExonsRecord { 107 | hgnc: "BRCA1", 108 | tx_ac: "NM_007294.3", 109 | alt_ac: "NC_000017.10", 110 | alt_aln_method: "splign", 111 | alt_strand: -1, 112 | ord: 17, 113 | tx_start_i: 5384, 114 | 
tx_end_i: 5425, 115 | alt_start_i: 41215349, 116 | alt_end_i: 41215390, 117 | cigar: "41M", 118 | tx_aseq: None, 119 | alt_aseq: None, 120 | tx_exon_set_id: 2147483647, 121 | alt_exon_set_id: 2147483647, 122 | tx_exon_id: 2147483647, 123 | alt_exon_id: 2147483647, 124 | exon_aln_id: 2147483647, 125 | }, 126 | TxExonsRecord { 127 | hgnc: "BRCA1", 128 | tx_ac: "NM_007294.3", 129 | alt_ac: "NC_000017.10", 130 | alt_aln_method: "splign", 131 | alt_strand: -1, 132 | ord: 16, 133 | tx_start_i: 5306, 134 | tx_end_i: 5384, 135 | alt_start_i: 41215890, 136 | alt_end_i: 41215968, 137 | cigar: "78M", 138 | tx_aseq: None, 139 | alt_aseq: None, 140 | tx_exon_set_id: 2147483647, 141 | alt_exon_set_id: 2147483647, 142 | tx_exon_id: 2147483647, 143 | alt_exon_id: 2147483647, 144 | exon_aln_id: 2147483647, 145 | }, 146 | TxExonsRecord { 147 | hgnc: "BRCA1", 148 | tx_ac: "NM_007294.3", 149 | alt_ac: "NC_000017.10", 150 | alt_aln_method: "splign", 151 | alt_strand: -1, 152 | ord: 15, 153 | tx_start_i: 5218, 154 | tx_end_i: 5306, 155 | alt_start_i: 41219624, 156 | alt_end_i: 41219712, 157 | cigar: "88M", 158 | tx_aseq: None, 159 | alt_aseq: None, 160 | tx_exon_set_id: 2147483647, 161 | alt_exon_set_id: 2147483647, 162 | tx_exon_id: 2147483647, 163 | alt_exon_id: 2147483647, 164 | exon_aln_id: 2147483647, 165 | }, 166 | TxExonsRecord { 167 | hgnc: "BRCA1", 168 | tx_ac: "NM_007294.3", 169 | alt_ac: "NC_000017.10", 170 | alt_aln_method: "splign", 171 | alt_strand: -1, 172 | ord: 14, 173 | tx_start_i: 4907, 174 | tx_end_i: 5218, 175 | alt_start_i: 41222944, 176 | alt_end_i: 41223255, 177 | cigar: "311M", 178 | tx_aseq: None, 179 | alt_aseq: None, 180 | tx_exon_set_id: 2147483647, 181 | alt_exon_set_id: 2147483647, 182 | tx_exon_id: 2147483647, 183 | alt_exon_id: 2147483647, 184 | exon_aln_id: 2147483647, 185 | }, 186 | TxExonsRecord { 187 | hgnc: "BRCA1", 188 | tx_ac: "NM_007294.3", 189 | alt_ac: "NC_000017.10", 190 | alt_aln_method: "splign", 191 | alt_strand: -1, 192 | ord: 13, 193 | 
tx_start_i: 4716, 194 | tx_end_i: 4907, 195 | alt_start_i: 41226347, 196 | alt_end_i: 41226538, 197 | cigar: "191M", 198 | tx_aseq: None, 199 | alt_aseq: None, 200 | tx_exon_set_id: 2147483647, 201 | alt_exon_set_id: 2147483647, 202 | tx_exon_id: 2147483647, 203 | alt_exon_id: 2147483647, 204 | exon_aln_id: 2147483647, 205 | }, 206 | TxExonsRecord { 207 | hgnc: "BRCA1", 208 | tx_ac: "NM_007294.3", 209 | alt_ac: "NC_000017.10", 210 | alt_aln_method: "splign", 211 | alt_strand: -1, 212 | ord: 12, 213 | tx_start_i: 4589, 214 | tx_end_i: 4716, 215 | alt_start_i: 41228504, 216 | alt_end_i: 41228631, 217 | cigar: "127M", 218 | tx_aseq: None, 219 | alt_aseq: None, 220 | tx_exon_set_id: 2147483647, 221 | alt_exon_set_id: 2147483647, 222 | tx_exon_id: 2147483647, 223 | alt_exon_id: 2147483647, 224 | exon_aln_id: 2147483647, 225 | }, 226 | TxExonsRecord { 227 | hgnc: "BRCA1", 228 | tx_ac: "NM_007294.3", 229 | alt_ac: "NC_000017.10", 230 | alt_aln_method: "splign", 231 | alt_strand: -1, 232 | ord: 11, 233 | tx_start_i: 4417, 234 | tx_end_i: 4589, 235 | alt_start_i: 41234420, 236 | alt_end_i: 41234592, 237 | cigar: "172M", 238 | tx_aseq: None, 239 | alt_aseq: None, 240 | tx_exon_set_id: 2147483647, 241 | alt_exon_set_id: 2147483647, 242 | tx_exon_id: 2147483647, 243 | alt_exon_id: 2147483647, 244 | exon_aln_id: 2147483647, 245 | }, 246 | TxExonsRecord { 247 | hgnc: "BRCA1", 248 | tx_ac: "NM_007294.3", 249 | alt_ac: "NC_000017.10", 250 | alt_aln_method: "splign", 251 | alt_strand: -1, 252 | ord: 10, 253 | tx_start_i: 4328, 254 | tx_end_i: 4417, 255 | alt_start_i: 41242960, 256 | alt_end_i: 41243049, 257 | cigar: "89M", 258 | tx_aseq: None, 259 | alt_aseq: None, 260 | tx_exon_set_id: 2147483647, 261 | alt_exon_set_id: 2147483647, 262 | tx_exon_id: 2147483647, 263 | alt_exon_id: 2147483647, 264 | exon_aln_id: 2147483647, 265 | }, 266 | TxExonsRecord { 267 | hgnc: "BRCA1", 268 | tx_ac: "NM_007294.3", 269 | alt_ac: "NC_000017.10", 270 | alt_aln_method: "splign", 271 | alt_strand: 
-1, 272 | ord: 9, 273 | tx_start_i: 902, 274 | tx_end_i: 4328, 275 | alt_start_i: 41243451, 276 | alt_end_i: 41246877, 277 | cigar: "3426M", 278 | tx_aseq: None, 279 | alt_aseq: None, 280 | tx_exon_set_id: 2147483647, 281 | alt_exon_set_id: 2147483647, 282 | tx_exon_id: 2147483647, 283 | alt_exon_id: 2147483647, 284 | exon_aln_id: 2147483647, 285 | }, 286 | TxExonsRecord { 287 | hgnc: "BRCA1", 288 | tx_ac: "NM_007294.3", 289 | alt_ac: "NC_000017.10", 290 | alt_aln_method: "splign", 291 | alt_strand: -1, 292 | ord: 8, 293 | tx_start_i: 825, 294 | tx_end_i: 902, 295 | alt_start_i: 41247862, 296 | alt_end_i: 41247939, 297 | cigar: "77M", 298 | tx_aseq: None, 299 | alt_aseq: None, 300 | tx_exon_set_id: 2147483647, 301 | alt_exon_set_id: 2147483647, 302 | tx_exon_id: 2147483647, 303 | alt_exon_id: 2147483647, 304 | exon_aln_id: 2147483647, 305 | }, 306 | TxExonsRecord { 307 | hgnc: "BRCA1", 308 | tx_ac: "NM_007294.3", 309 | alt_ac: "NC_000017.10", 310 | alt_aln_method: "splign", 311 | alt_strand: -1, 312 | ord: 7, 313 | tx_start_i: 779, 314 | tx_end_i: 825, 315 | alt_start_i: 41249260, 316 | alt_end_i: 41249306, 317 | cigar: "46M", 318 | tx_aseq: None, 319 | alt_aseq: None, 320 | tx_exon_set_id: 2147483647, 321 | alt_exon_set_id: 2147483647, 322 | tx_exon_id: 2147483647, 323 | alt_exon_id: 2147483647, 324 | exon_aln_id: 2147483647, 325 | }, 326 | TxExonsRecord { 327 | hgnc: "BRCA1", 328 | tx_ac: "NM_007294.3", 329 | alt_ac: "NC_000017.10", 330 | alt_aln_method: "splign", 331 | alt_strand: -1, 332 | ord: 6, 333 | tx_start_i: 673, 334 | tx_end_i: 779, 335 | alt_start_i: 41251791, 336 | alt_end_i: 41251897, 337 | cigar: "106M", 338 | tx_aseq: None, 339 | alt_aseq: None, 340 | tx_exon_set_id: 2147483647, 341 | alt_exon_set_id: 2147483647, 342 | tx_exon_id: 2147483647, 343 | alt_exon_id: 2147483647, 344 | exon_aln_id: 2147483647, 345 | }, 346 | TxExonsRecord { 347 | hgnc: "BRCA1", 348 | tx_ac: "NM_007294.3", 349 | alt_ac: "NC_000017.10", 350 | alt_aln_method: "splign", 351 | 
alt_strand: -1, 352 | ord: 5, 353 | tx_start_i: 533, 354 | tx_end_i: 673, 355 | alt_start_i: 41256138, 356 | alt_end_i: 41256278, 357 | cigar: "140M", 358 | tx_aseq: None, 359 | alt_aseq: None, 360 | tx_exon_set_id: 2147483647, 361 | alt_exon_set_id: 2147483647, 362 | tx_exon_id: 2147483647, 363 | alt_exon_id: 2147483647, 364 | exon_aln_id: 2147483647, 365 | }, 366 | TxExonsRecord { 367 | hgnc: "BRCA1", 368 | tx_ac: "NM_007294.3", 369 | alt_ac: "NC_000017.10", 370 | alt_aln_method: "splign", 371 | alt_strand: -1, 372 | ord: 4, 373 | tx_start_i: 444, 374 | tx_end_i: 533, 375 | alt_start_i: 41256884, 376 | alt_end_i: 41256973, 377 | cigar: "89M", 378 | tx_aseq: None, 379 | alt_aseq: None, 380 | tx_exon_set_id: 2147483647, 381 | alt_exon_set_id: 2147483647, 382 | tx_exon_id: 2147483647, 383 | alt_exon_id: 2147483647, 384 | exon_aln_id: 2147483647, 385 | }, 386 | TxExonsRecord { 387 | hgnc: "BRCA1", 388 | tx_ac: "NM_007294.3", 389 | alt_ac: "NC_000017.10", 390 | alt_aln_method: "splign", 391 | alt_strand: -1, 392 | ord: 3, 393 | tx_start_i: 366, 394 | tx_end_i: 444, 395 | alt_start_i: 41258472, 396 | alt_end_i: 41258550, 397 | cigar: "78M", 398 | tx_aseq: None, 399 | alt_aseq: None, 400 | tx_exon_set_id: 2147483647, 401 | alt_exon_set_id: 2147483647, 402 | tx_exon_id: 2147483647, 403 | alt_exon_id: 2147483647, 404 | exon_aln_id: 2147483647, 405 | }, 406 | TxExonsRecord { 407 | hgnc: "BRCA1", 408 | tx_ac: "NM_007294.3", 409 | alt_ac: "NC_000017.10", 410 | alt_aln_method: "splign", 411 | alt_strand: -1, 412 | ord: 2, 413 | tx_start_i: 312, 414 | tx_end_i: 366, 415 | alt_start_i: 41267742, 416 | alt_end_i: 41267796, 417 | cigar: "54M", 418 | tx_aseq: None, 419 | alt_aseq: None, 420 | tx_exon_set_id: 2147483647, 421 | alt_exon_set_id: 2147483647, 422 | tx_exon_id: 2147483647, 423 | alt_exon_id: 2147483647, 424 | exon_aln_id: 2147483647, 425 | }, 426 | TxExonsRecord { 427 | hgnc: "BRCA1", 428 | tx_ac: "NM_007294.3", 429 | alt_ac: "NC_000017.10", 430 | alt_aln_method: 
"splign", 431 | alt_strand: -1, 432 | ord: 1, 433 | tx_start_i: 213, 434 | tx_end_i: 312, 435 | alt_start_i: 41276033, 436 | alt_end_i: 41276132, 437 | cigar: "99M", 438 | tx_aseq: None, 439 | alt_aseq: None, 440 | tx_exon_set_id: 2147483647, 441 | alt_exon_set_id: 2147483647, 442 | tx_exon_id: 2147483647, 443 | alt_exon_id: 2147483647, 444 | exon_aln_id: 2147483647, 445 | }, 446 | TxExonsRecord { 447 | hgnc: "BRCA1", 448 | tx_ac: "NM_007294.3", 449 | alt_ac: "NC_000017.10", 450 | alt_aln_method: "splign", 451 | alt_strand: -1, 452 | ord: 0, 453 | tx_start_i: 0, 454 | tx_end_i: 213, 455 | alt_start_i: 41277287, 456 | alt_end_i: 41277500, 457 | cigar: "213M", 458 | tx_aseq: None, 459 | alt_aseq: None, 460 | tx_exon_set_id: 2147483647, 461 | alt_exon_set_id: 2147483647, 462 | tx_exon_id: 2147483647, 463 | alt_exon_id: 2147483647, 464 | exon_aln_id: 2147483647, 465 | }, 466 | ] 467 | -------------------------------------------------------------------------------- /src/data/cdot/snapshots/hgvs__data__cdot__json__tests__provider_get_tx_for_region_brca1.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/data/cdot/json.rs 3 | expression: "&result" 4 | --- 5 | [ 6 | TxForRegionRecord { 7 | tx_ac: "NM_007300.3", 8 | alt_ac: "NC_000017.10", 9 | alt_strand: -1, 10 | alt_aln_method: "splign", 11 | start_i: 41196311, 12 | end_i: 41277500, 13 | }, 14 | TxForRegionRecord { 15 | tx_ac: "NM_007294.3", 16 | alt_ac: "NC_000017.10", 17 | alt_strand: -1, 18 | alt_aln_method: "splign", 19 | start_i: 41196311, 20 | end_i: 41277500, 21 | }, 22 | TxForRegionRecord { 23 | tx_ac: "NM_007299.3", 24 | alt_ac: "NC_000017.10", 25 | alt_strand: -1, 26 | alt_aln_method: "splign", 27 | start_i: 41196311, 28 | end_i: 41277468, 29 | }, 30 | TxForRegionRecord { 31 | tx_ac: "NM_007297.3", 32 | alt_ac: "NC_000017.10", 33 | alt_strand: -1, 34 | alt_aln_method: "splign", 35 | start_i: 41196311, 36 | end_i: 41277468, 37 | }, 38 | TxForRegionRecord 
{ 39 | tx_ac: "NR_027676.2", 40 | alt_ac: "NC_000017.10", 41 | alt_strand: -1, 42 | alt_aln_method: "splign", 43 | start_i: 41196311, 44 | end_i: 41277381, 45 | }, 46 | TxForRegionRecord { 47 | tx_ac: "NM_007300.4", 48 | alt_ac: "NC_000017.10", 49 | alt_strand: -1, 50 | alt_aln_method: "splign", 51 | start_i: 41196311, 52 | end_i: 41277381, 53 | }, 54 | TxForRegionRecord { 55 | tx_ac: "NM_007299.4", 56 | alt_ac: "NC_000017.10", 57 | alt_strand: -1, 58 | alt_aln_method: "splign", 59 | start_i: 41196311, 60 | end_i: 41277381, 61 | }, 62 | TxForRegionRecord { 63 | tx_ac: "NM_007297.4", 64 | alt_ac: "NC_000017.10", 65 | alt_strand: -1, 66 | alt_aln_method: "splign", 67 | start_i: 41196311, 68 | end_i: 41277381, 69 | }, 70 | TxForRegionRecord { 71 | tx_ac: "NM_007294.4", 72 | alt_ac: "NC_000017.10", 73 | alt_strand: -1, 74 | alt_aln_method: "splign", 75 | start_i: 41196311, 76 | end_i: 41277381, 77 | }, 78 | TxForRegionRecord { 79 | tx_ac: "NR_027676.1", 80 | alt_ac: "NC_000017.10", 81 | alt_strand: -1, 82 | alt_aln_method: "splign", 83 | start_i: 41196311, 84 | end_i: 41277340, 85 | }, 86 | TxForRegionRecord { 87 | tx_ac: "NM_007298.3", 88 | alt_ac: "NC_000017.10", 89 | alt_strand: -1, 90 | alt_aln_method: "splign", 91 | start_i: 41196311, 92 | end_i: 41276132, 93 | }, 94 | ] 95 | -------------------------------------------------------------------------------- /src/data/cdot/snapshots/hgvs__data__cdot__json__tests__provider_get_tx_for_region_empty.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/data/cdot/json.rs 3 | expression: "&result" 4 | --- 5 | [] 6 | -------------------------------------------------------------------------------- /src/data/cdot/snapshots/hgvs__data__cdot__json__tests__provider_get_tx_info.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/data/cdot/json.rs 3 | expression: "&result" 4 | --- 5 | TxInfoRecord { 6 | hgnc: 
"BRCA1", 7 | cds_start_i: Some( 8 | 232, 9 | ), 10 | cds_end_i: Some( 11 | 5824, 12 | ), 13 | tx_ac: "NM_007294.3", 14 | alt_ac: "NC_000017.10", 15 | alt_aln_method: "splign", 16 | } 17 | -------------------------------------------------------------------------------- /src/data/cdot/snapshots/hgvs__data__cdot__json__tests__provider_get_tx_mapping_options.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/data/cdot/json.rs 3 | expression: "&result" 4 | --- 5 | [ 6 | TxMappingOptionsRecord { 7 | tx_ac: "NM_007294.3", 8 | alt_ac: "NC_000017.10", 9 | alt_aln_method: "splign", 10 | }, 11 | TxMappingOptionsRecord { 12 | tx_ac: "NM_007294.3", 13 | alt_ac: "NC_000017.11", 14 | alt_aln_method: "splign", 15 | }, 16 | ] 17 | -------------------------------------------------------------------------------- /src/data/error.rs: -------------------------------------------------------------------------------- 1 | //! Error type definition. 2 | 3 | use std::sync::Arc; 4 | use thiserror::Error; 5 | 6 | /// Error type for data. 
7 | #[derive(Error, Debug, Clone)] 8 | pub enum Error { 9 | #[error("UTA Postgres access error")] 10 | UtaPostgresError(#[from] Arc<postgres::Error>), 11 | #[error("sequence operation failed")] 12 | SequenceOperationFailed(#[from] crate::sequences::Error), 13 | #[error("problem with seqrepo access")] 14 | SeqRepoError(#[from] seqrepo::Error), 15 | #[error("no tx_exons for tx_ac={0}, alt_ac={1}, alt_aln_method={2}")] 16 | NoTxExons(String, String, String), 17 | #[error("could not get parent from {0}")] 18 | PathParent(String), 19 | #[error("could not get basename from {0}")] 20 | PathBasename(String), 21 | #[error("could not open cdot JSON file: {0}")] 22 | CdotJsonOpen(String), 23 | #[error("could not parse cdot JSON file: {0}")] 24 | CdotJsonParse(String), 25 | #[error("no gene found for {0}")] 26 | NoGeneFound(String), 27 | #[error("no transcript found for {0}")] 28 | NoTranscriptFound(String), 29 | #[error("no alignment found for {0} to {1}")] 30 | NoAlignmentFound(String, String), 31 | #[error("found no sequence record for accession {0}")] 32 | NoSequenceRecord(String), 33 | } 34 | -------------------------------------------------------------------------------- /src/data/interface.rs: -------------------------------------------------------------------------------- 1 | //! Definition of the interface for accessing the transcript database. 2 | 3 | use chrono::NaiveDateTime; 4 | use indexmap::IndexMap; 5 | 6 | use crate::{data::error::Error, sequences::TranslationTable}; 7 | use biocommons_bioutils::assemblies::Assembly; 8 | 9 | /// Information about a gene. 10 | /// 11 | /// ```text 12 | /// hgnc | ATM 13 | /// maploc | 11q22-q23 14 | /// descr | ataxia telangiectasia mutated 15 | /// summary | The protein encoded by this gene belongs to the PI3/PI4-kinase family. This...
16 | /// aliases | AT1,ATA,ATC,ATD,ATE,ATDC,TEL1,TELO1 17 | /// added | 2014-02-04 21:39:32.57125 18 | /// ``` 19 | #[derive(Debug, PartialEq, Default, Clone)] 20 | pub struct GeneInfoRecord { 21 | pub hgnc: String, 22 | pub maploc: String, 23 | pub descr: String, 24 | pub summary: String, 25 | pub aliases: Vec<String>, 26 | pub added: NaiveDateTime, 27 | } 28 | 29 | /// Information about similar transcripts. 30 | /// 31 | /// ```text 32 | /// tx_ac1 | NM_001285829.1 33 | /// tx_ac2 | ENST00000341255 34 | /// hgnc_eq | f 35 | /// cds_eq | f 36 | /// es_fp_eq | f 37 | /// cds_es_fp_eq | f 38 | /// cds_exon_lengths_fp_eq | t 39 | /// ``` 40 | /// 41 | /// Hint: "es" = "exon set", "fp" = "fingerprint", "eq" = "equal" 42 | /// 43 | /// "Exon structure" refers to the start and end coordinates on a 44 | /// specified reference sequence. Thus, having the same exon 45 | /// structure means that the transcripts are defined on the same 46 | /// reference sequence and have the same exon spans on that 47 | /// sequence. 48 | #[derive(Debug, PartialEq, Default, Clone)] 49 | pub struct TxSimilarityRecord { 50 | /// Accession of first transcript. 51 | pub tx_ac1: String, 52 | /// Accession of second transcript. 53 | pub tx_ac2: String, 54 | pub hgnc_eq: bool, 55 | /// Whether CDS sequences are identical. 56 | pub cds_eq: bool, 57 | /// Whether the full exon structures are identical (i.e., incl. UTR). 58 | pub es_fp_eq: bool, 59 | /// Whether the cds-clipped portions of the exon structures are identical 60 | /// (i.e., excluding UTR).
61 | pub cds_es_fp_eq: bool, 62 | pub cds_exon_lengths_fp_eq: bool, 63 | } 64 | 65 | ///```text 66 | /// hgnc | TGDS 67 | /// tx_ac | NM_001304430.1 68 | /// alt_ac | NC_000013.10 69 | /// alt_aln_method | blat 70 | /// alt_strand | -1 71 | /// ord | 0 72 | /// tx_start_i | 0 73 | /// tx_end_i | 301 74 | /// alt_start_i | 95248228 75 | /// alt_end_i | 95248529 76 | /// cigar | 301= 77 | /// tx_aseq | 78 | /// alt_aseq | 79 | /// tx_exon_set_id | 348239 80 | /// alt_exon_set_id | 722624 81 | /// tx_exon_id | 3518579 82 | /// alt_exon_id | 6063334 83 | /// exon_aln_id | 3461425 84 | ///``` 85 | #[derive(Debug, PartialEq, Default, Clone)] 86 | pub struct TxExonsRecord { 87 | pub hgnc: String, 88 | pub tx_ac: String, 89 | pub alt_ac: String, 90 | pub alt_aln_method: String, 91 | pub alt_strand: i16, 92 | pub ord: i32, 93 | pub tx_start_i: i32, 94 | pub tx_end_i: i32, 95 | pub alt_start_i: i32, 96 | pub alt_end_i: i32, 97 | pub cigar: String, 98 | pub tx_aseq: Option<String>, 99 | pub alt_aseq: Option<String>, 100 | pub tx_exon_set_id: i32, 101 | pub alt_exon_set_id: i32, 102 | pub tx_exon_id: i32, 103 | pub alt_exon_id: i32, 104 | pub exon_aln_id: i32, 105 | } 106 | 107 | /// ```text 108 | /// tx_ac | NM_001304430.2 109 | /// alt_ac | NC_000013.10 110 | /// alt_strand | -1 111 | /// alt_aln_method | splign 112 | /// start_i | 95226307 113 | /// end_i | 95248406 114 | /// ``` 115 | #[derive(Debug, PartialEq, Default, Clone)] 116 | pub struct TxForRegionRecord { 117 | pub tx_ac: String, 118 | pub alt_ac: String, 119 | pub alt_strand: i16, 120 | pub alt_aln_method: String, 121 | pub start_i: i32, 122 | pub end_i: i32, 123 | } 124 | 125 | /// ```text 126 | /// tx_ac | NM_199425.2 127 | /// alt_ac | NM_199425.2 128 | /// alt_aln_method | transcript 129 | /// cds_start_i | 283 130 | /// cds_end_i | 1003 131 | /// lengths | {707,79,410} 132 | /// hgnc | VSX1 133 | /// ``` 134 | #[derive(Debug, PartialEq, Default, Clone)] 135 | pub struct TxIdentityInfo { 136 | pub tx_ac: String, 137 | pub
alt_ac: String, 138 | pub alt_aln_method: String, 139 | pub cds_start_i: i32, 140 | pub cds_end_i: i32, 141 | pub lengths: Vec<i32>, 142 | pub hgnc: String, 143 | /// The translation table to use for this transcript. 144 | pub translation_table: TranslationTable, 145 | } 146 | 147 | /// ```text 148 | /// hgnc | ATM 149 | /// cds_start_i | 385 150 | /// cds_end_i | 9556 151 | /// tx_ac | NM_000051.3 152 | /// alt_ac | AC_000143.1 153 | /// alt_aln_method | splign 154 | /// ``` 155 | #[derive(Debug, PartialEq, Default, Clone)] 156 | pub struct TxInfoRecord { 157 | pub hgnc: String, 158 | pub cds_start_i: Option<i32>, 159 | pub cds_end_i: Option<i32>, 160 | pub tx_ac: String, 161 | pub alt_ac: String, 162 | pub alt_aln_method: String, 163 | } 164 | 165 | /// ```text 166 | /// -[ RECORD 1 ]--+---------------- 167 | /// tx_ac | ENST00000000233 168 | /// alt_ac | NC_000007.13 169 | /// alt_aln_method | genebuild 170 | /// -[ RECORD 2 ]--+---------------- 171 | /// tx_ac | ENST00000000412 172 | /// alt_ac | NC_000012.11 173 | /// alt_aln_method | genebuild 174 | /// ``` 175 | #[derive(Debug, PartialEq, Default, Clone)] 176 | pub struct TxMappingOptionsRecord { 177 | pub tx_ac: String, 178 | pub alt_ac: String, 179 | pub alt_aln_method: String, 180 | } 181 | 182 | /// Interface for data providers. 183 | pub trait Provider { 184 | /// Return the data version, e.g., `uta_20210129`. 185 | fn data_version(&self) -> &str; 186 | 187 | /// Return the schema version, e.g., `"1.1"`. 188 | fn schema_version(&self) -> &str; 189 | 190 | /// Return a map from accession to chromosome name for the given assembly. 191 | /// 192 | /// For example, when `assembly_name = "GRCh38.p5"`, the value for `"NC_000001.11"` 193 | /// would be `"1"`. 194 | /// 195 | /// # Arguments 196 | /// 197 | /// * `assembly` - The assembly to build the map for. 198 | fn get_assembly_map(&self, assembly: Assembly) -> IndexMap<String, String>; 199 | 200 | /// Returns the basic information about the gene.
201 | /// 202 | /// # Arguments 203 | /// 204 | /// * `hgnc` - HGNC gene name 205 | fn get_gene_info(&self, hgnc: &str) -> Result<GeneInfoRecord, Error>; 206 | 207 | /// Return the (single) associated protein accession for a given transcript accession, 208 | /// or None if not found. 209 | /// 210 | /// # Arguments 211 | /// 212 | /// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3') 213 | fn get_pro_ac_for_tx_ac(&self, tx_ac: &str) -> Result<Option<String>, Error>; 214 | 215 | /// Return full sequence for the given accession. 216 | /// 217 | /// # Arguments 218 | /// 219 | /// * `ac` -- accession 220 | fn get_seq(&self, ac: &str) -> Result<String, Error> { 221 | self.get_seq_part(ac, None, None) 222 | } 223 | 224 | /// Return sequence part for the given accession. 225 | /// 226 | /// # Arguments 227 | /// 228 | /// * `ac` -- accession 229 | /// * `begin` -- start position (0-based, start of sequence if missing) 230 | /// * `end` -- end position (0-based, end of sequence if missing) 231 | fn get_seq_part( 232 | &self, 233 | ac: &str, 234 | begin: Option<usize>, 235 | end: Option<usize>, 236 | ) -> Result<String, Error>; 237 | 238 | /// Returns a list of protein accessions for a given sequence. 239 | /// 240 | /// The list is guaranteed to contain at least one element with the MD5-based accession 241 | /// (MD5_01234abc..def56789) at the end of the list. 242 | fn get_acs_for_protein_seq(&self, seq: &str) -> Result<Vec<String>, Error>; 243 | 244 | /// Return a list of transcripts that are similar to the given transcript, with relevant 245 | /// similarity criteria. 246 | /// 247 | /// # Arguments 248 | /// 249 | /// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3') 250 | fn get_similar_transcripts(&self, tx_ac: &str) -> Result<Vec<TxSimilarityRecord>, Error>; 251 | 252 | /// Return transcript exon info for supplied accession (tx_ac, alt_ac, alt_aln_method), 253 | /// or empty `Vec` if not found.
254 | /// 255 | /// # Arguments 256 | /// 257 | /// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3') 258 | /// * `alt_ac` -- specific genomic sequence (e.g., NC_000011.4) 259 | /// * `alt_aln_method` -- sequence alignment method (e.g., splign, blat) 260 | fn get_tx_exons( 261 | &self, 262 | tx_ac: &str, 263 | alt_ac: &str, 264 | alt_aln_method: &str, 265 | ) -> Result<Vec<TxExonsRecord>, Error>; 266 | 267 | /// Return transcript info records for supplied gene, in order of decreasing length. 268 | /// 269 | /// # Arguments 270 | /// 271 | /// * `gene` - HGNC gene name 272 | fn get_tx_for_gene(&self, gene: &str) -> Result<Vec<TxInfoRecord>, Error>; 273 | 274 | /// Return transcripts that overlap given region. 275 | /// 276 | /// # Arguments 277 | /// 278 | /// * `alt_ac` -- reference sequence (e.g., NC_000007.13) 279 | /// * `alt_aln_method` -- alignment method (e.g., splign) 280 | /// * `start_i` -- 5' bound of region 281 | /// * `end_i` -- 3' bound of region 282 | fn get_tx_for_region( 283 | &self, 284 | alt_ac: &str, 285 | alt_aln_method: &str, 286 | start_i: i32, 287 | end_i: i32, 288 | ) -> Result<Vec<TxForRegionRecord>, Error>; 289 | 290 | /// Return features associated with a single transcript. 291 | /// 292 | /// # Arguments 293 | /// 294 | /// * `tx_ac` -- transcript accession with version (e.g., 'NM_199425.2') 295 | fn get_tx_identity_info(&self, tx_ac: &str) -> Result<TxIdentityInfo, Error>; 296 | 297 | /// Return a single transcript info for supplied accession (tx_ac, alt_ac, alt_aln_method), or None if not found. 298 | /// 299 | /// # Arguments 300 | /// 301 | /// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3') 302 | /// * `alt_ac` -- specific genomic sequence (e.g., NC_000011.4) 303 | /// * `alt_aln_method` -- sequence alignment method (e.g., splign, blat) 304 | fn get_tx_info( 305 | &self, 306 | tx_ac: &str, 307 | alt_ac: &str, 308 | alt_aln_method: &str, 309 | ) -> Result<TxInfoRecord, Error>; 310 | 311 | /// Return all transcript alignment sets for a given transcript accession (tx_ac).
312 | /// 313 | /// Returns empty list if transcript does not exist. Use this method to discover 314 | /// possible mapping options supported in the database. 315 | /// 316 | /// # Arguments 317 | /// 318 | /// * `tx_ac` -- transcript accession with version (e.g., 'NM_000051.3') 319 | fn get_tx_mapping_options(&self, tx_ac: &str) -> Result<Vec<TxMappingOptionsRecord>, Error>; 320 | } 321 | 322 | // 323 | // Copyright 2023 hgvs-rs Contributors 324 | // Copyright 2014 Bioutils Contributors 325 | // 326 | // Licensed under the Apache License, Version 2.0 (the "License"); 327 | // you may not use this file except in compliance with the License. 328 | // You may obtain a copy of the License at 329 | // 330 | // http://www.apache.org/licenses/LICENSE-2.0 331 | // 332 | // Unless required by applicable law or agreed to in writing, software 333 | // distributed under the License is distributed on an "AS IS" BASIS, 334 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 335 | // See the License for the specific language governing permissions and 336 | // limitations under the License. 337 | // 338 | -------------------------------------------------------------------------------- /src/data/mod.rs: -------------------------------------------------------------------------------- 1 | //! Datatypes, interfaces, and data access. 2 | 3 | pub mod cdot; 4 | pub mod error; 5 | pub mod interface; 6 | pub mod uta; 7 | pub mod uta_sr; -------------------------------------------------------------------------------- /src/data/uta_sr.rs: -------------------------------------------------------------------------------- 1 | //! Code for enabling UTA access with sequences coming from a SeqRepo. 2 | //! 3 | //! * https://github.com/biocommons/uta 4 | //! * https://github.com/biocommons/biocommons.seqrepo 5 | //!
* https://github.com/varfish-org/seqrepo-rs 6 | 7 | use std::path::PathBuf; 8 | use std::sync::Arc; 9 | 10 | use crate::data::uta; 11 | use crate::data::{ 12 | error::Error, interface, interface::GeneInfoRecord, interface::TxExonsRecord, 13 | interface::TxForRegionRecord, interface::TxIdentityInfo, interface::TxInfoRecord, 14 | interface::TxMappingOptionsRecord, interface::TxSimilarityRecord, 15 | }; 16 | use seqrepo::{self, AliasOrSeqId, SeqRepo}; 17 | 18 | /// Configuration for the `data::uta_sr::Provider`. 19 | #[derive(Debug, PartialEq, Clone)] 20 | pub struct Config { 21 | /// URL with the connection string, e.g. 22 | /// `"postgresql://anonymous:anonymous@uta.biocommons.org/uta"`. 23 | pub db_url: String, 24 | /// The database schema to use, corresponds to the data version, e.g., `uta_20210129`. 25 | pub db_schema: String, 26 | /// Path to the seqrepo directory, e.g., `/usr/local/share/seqrepo/latest`. The last path 27 | /// component is the "instance" name. 28 | pub seqrepo_path: String, 29 | } 30 | 31 | /// This provider serves information from a UTA and a SeqRepo. 32 | /// 33 | /// Transcripts come from a UTA Postgres database; sequences come from a SeqRepo. This makes 34 | /// genome contig information available, in contrast to `data::uta::Provider`. 35 | pub struct Provider { 36 | inner: uta::Provider, 37 | seqrepo: Arc<dyn seqrepo::Interface + Send + Sync>, 38 | } 39 | 40 | impl Provider { 41 | /// Create a new provider that uses UTA and SeqRepo information from the given configuration. 42 | /// 43 | /// This uses `seqrepo::SeqRepo` for the sequence repository. You can inject any 44 | /// `seqrepo::Interface` implementation using `Provider::with_seqrepo`. 45 | pub fn new(config: Config) -> Result<Self, Error> { 46 | let seqrepo = PathBuf::from(&config.seqrepo_path); 47 | let path = seqrepo 48 | .parent() 49 | .ok_or(Error::PathParent(config.seqrepo_path.clone()))?
50 | .to_str() 51 | .expect("problem with path to string conversion") 52 | .to_string(); 53 | let instance = seqrepo 54 | .file_name() 55 | .ok_or(Error::PathBasename(config.seqrepo_path.clone()))? 56 | .to_str() 57 | .expect("problem with path to string conversion") 58 | .to_string(); 59 | 60 | Ok(Self { 61 | inner: uta::Provider::with_config(&uta::Config { 62 | db_url: config.db_url.clone(), 63 | db_schema: config.db_schema, 64 | })?, 65 | seqrepo: Arc::new(SeqRepo::new(path, &instance)?), 66 | }) 67 | } 68 | 69 | /// Create a new provider that allows injecting a seqrepo. 70 | pub fn with_seqrepo( 71 | config: Config, 72 | seqrepo: Arc<dyn seqrepo::Interface + Send + Sync>, 73 | ) -> Result<Self, Error> { 74 | Ok(Self { 75 | inner: uta::Provider::with_config(&uta::Config { 76 | db_url: config.db_url.clone(), 77 | db_schema: config.db_schema, 78 | })?, 79 | seqrepo, 80 | }) 81 | } 82 | } 83 | 84 | impl interface::Provider for Provider { 85 | fn data_version(&self) -> &str { 86 | self.inner.data_version() 87 | } 88 | 89 | fn schema_version(&self) -> &str { 90 | self.inner.schema_version() 91 | } 92 | 93 | fn get_assembly_map( 94 | &self, 95 | assembly: biocommons_bioutils::assemblies::Assembly, 96 | ) -> indexmap::IndexMap<String, String> { 97 | self.inner.get_assembly_map(assembly) 98 | } 99 | 100 | fn get_gene_info(&self, hgnc: &str) -> Result<GeneInfoRecord, Error> { 101 | self.inner.get_gene_info(hgnc) 102 | } 103 | 104 | fn get_pro_ac_for_tx_ac(&self, tx_ac: &str) -> Result<Option<String>, Error> { 105 | self.inner.get_pro_ac_for_tx_ac(tx_ac) 106 | } 107 | 108 | fn get_seq_part( 109 | &self, 110 | ac: &str, 111 | begin: Option<usize>, 112 | end: Option<usize>, 113 | ) -> Result<String, Error> { 114 | let aos = AliasOrSeqId::Alias { 115 | value: ac.to_owned(), 116 | namespace: None, 117 | }; 118 | self.seqrepo 119 | .fetch_sequence_part(&aos, begin, end) 120 | .map_err(Error::SeqRepoError) 121 | } 122 | 123 | fn get_acs_for_protein_seq(&self, seq: &str) -> Result<Vec<String>, Error> { 124 | self.inner.get_acs_for_protein_seq(seq) 125 | } 126 | 127 | fn get_similar_transcripts(&self, tx_ac: &str) -> Result<Vec<TxSimilarityRecord>,
Error> { 128 | self.inner.get_similar_transcripts(tx_ac) 129 | } 130 | 131 | fn get_tx_exons( 132 | &self, 133 | tx_ac: &str, 134 | alt_ac: &str, 135 | alt_aln_method: &str, 136 | ) -> Result<Vec<TxExonsRecord>, Error> { 137 | self.inner.get_tx_exons(tx_ac, alt_ac, alt_aln_method) 138 | } 139 | 140 | fn get_tx_for_gene(&self, gene: &str) -> Result<Vec<TxInfoRecord>, Error> { 141 | self.inner.get_tx_for_gene(gene) 142 | } 143 | 144 | fn get_tx_for_region( 145 | &self, 146 | alt_ac: &str, 147 | alt_aln_method: &str, 148 | start_i: i32, 149 | end_i: i32, 150 | ) -> Result<Vec<TxForRegionRecord>, Error> { 151 | self.inner 152 | .get_tx_for_region(alt_ac, alt_aln_method, start_i, end_i) 153 | } 154 | 155 | fn get_tx_identity_info(&self, tx_ac: &str) -> Result<TxIdentityInfo, Error> { 156 | self.inner.get_tx_identity_info(tx_ac) 157 | } 158 | 159 | fn get_tx_info( 160 | &self, 161 | tx_ac: &str, 162 | alt_ac: &str, 163 | alt_aln_method: &str, 164 | ) -> Result<TxInfoRecord, Error> { 165 | self.inner.get_tx_info(tx_ac, alt_ac, alt_aln_method) 166 | } 167 | 168 | fn get_tx_mapping_options(&self, tx_ac: &str) -> Result<Vec<TxMappingOptionsRecord>, Error> { 169 | self.inner.get_tx_mapping_options(tx_ac) 170 | } 171 | } 172 | 173 | /// Code for helping with the setup of UTA providers, e.g., for setting up caching of SeqRepo results. 174 | #[cfg(test)] 175 | pub mod test_helpers { 176 | use anyhow::Error; 177 | use seqrepo::{CacheReadingSeqRepo, CacheWritingSeqRepo, SeqRepo}; 178 | use std::{path::PathBuf, sync::Arc}; 179 | 180 | use crate::data::interface; 181 | 182 | use super::{Config, Provider}; 183 | 184 | #[test] 185 | fn test_sync() { 186 | fn is_sync<T: Sync>() {} 187 | is_sync::<Provider>(); 188 | } 189 | 190 | /// Set up a UTA Provider with data source depending on environment variables. 191 | /// 192 | /// See README.md for information on environment variable setup.
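    /// A typical environment setup might look like the following (hypothetical
    /// values for illustration; see README.md for the authoritative description):
    ///
    /// ```text
    /// export TEST_UTA_DATABASE_URL=postgresql://anonymous:anonymous@uta.biocommons.org/uta
    /// export TEST_UTA_DATABASE_SCHEMA=uta_20210129
    /// export TEST_SEQREPO_CACHE_MODE=read
    /// export TEST_SEQREPO_CACHE_PATH=tests/data/seqrepo_cache.fasta
    /// ```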
193 | pub fn build_provider() -> Result<Arc<dyn interface::Provider + Send + Sync>, Error> { 194 | log::debug!("building provider..."); 195 | let db_url = std::env::var("TEST_UTA_DATABASE_URL") 196 | .expect("Environment variable TEST_UTA_DATABASE_URL undefined!"); 197 | let db_schema = std::env::var("TEST_UTA_DATABASE_SCHEMA") 198 | .expect("Environment variable TEST_UTA_DATABASE_SCHEMA undefined!"); 199 | let sr_cache_mode = std::env::var("TEST_SEQREPO_CACHE_MODE") 200 | .expect("Environment variable TEST_SEQREPO_CACHE_MODE undefined!"); 201 | let sr_cache_path = std::env::var("TEST_SEQREPO_CACHE_PATH") 202 | .expect("Environment variable TEST_SEQREPO_CACHE_PATH undefined!"); 203 | 204 | let (seqrepo, seqrepo_path) = if sr_cache_mode == "read" { 205 | log::debug!("reading provider..."); 206 | let seqrepo: Arc<dyn seqrepo::Interface + Send + Sync> = 207 | Arc::new(CacheReadingSeqRepo::new(sr_cache_path)?); 208 | log::debug!("construction done..."); 209 | (seqrepo, "".to_string()) 210 | } else if sr_cache_mode == "write" { 211 | log::debug!("writing provider..."); 212 | build_writing_sr(sr_cache_path)? 213 | } else { 214 | panic!("Invalid cache mode {}", &sr_cache_mode); 215 | }; 216 | log::debug!("now returning provider..."); 217 | 218 | Ok(Arc::new(Provider::with_seqrepo( 219 | Config { 220 | db_url, 221 | db_schema, 222 | seqrepo_path, 223 | }, 224 | seqrepo, 225 | )?)) 226 | } 227 | 228 | /// Helper that builds the cache writing SeqRepo with inner stock SeqRepo. 229 | pub fn build_writing_sr( 230 | sr_cache_path: String, 231 | ) -> Result<(Arc<dyn seqrepo::Interface + Send + Sync>, String), Error> { 232 | let seqrepo_path = std::env::var("TEST_SEQREPO_PATH") 233 | .expect("Environment variable TEST_SEQREPO_PATH undefined!"); 234 | let path_buf = PathBuf::from(seqrepo_path.clone()); 235 | let path = path_buf 236 | .parent() 237 | .ok_or(anyhow::anyhow!( 238 | "Could not get parent from {}", 239 | &seqrepo_path 240 | ))?
241 | .to_str() 242 | .expect("problem with path to string conversion") 243 | .to_string(); 244 | let instance = path_buf 245 | .file_name() 246 | .ok_or(anyhow::anyhow!( 247 | "Could not get basename from {}", 248 | &seqrepo_path 249 | ))? 250 | .to_str() 251 | .expect("problem with path to string conversion") 252 | .to_string(); 253 | let seqrepo: Arc<dyn seqrepo::Interface + Send + Sync> = Arc::new( 254 | CacheWritingSeqRepo::new(SeqRepo::new(path, &instance)?, sr_cache_path)?, 255 | ); 256 | Ok((seqrepo, seqrepo_path)) 257 | } 258 | } 259 | 260 | // 261 | // Copyright 2023 hgvs-rs Contributors 262 | // Copyright 2014 Bioutils Contributors 263 | // 264 | // Licensed under the Apache License, Version 2.0 (the "License"); 265 | // you may not use this file except in compliance with the License. 266 | // You may obtain a copy of the License at 267 | // 268 | // http://www.apache.org/licenses/LICENSE-2.0 269 | // 270 | // Unless required by applicable law or agreed to in writing, software 271 | // distributed under the License is distributed on an "AS IS" BASIS, 272 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 273 | // See the License for the specific language governing permissions and 274 | // limitations under the License. 275 | // 276 | -------------------------------------------------------------------------------- /src/lib.rs: -------------------------------------------------------------------------------- 1 | pub mod data; 2 | pub mod mapper; 3 | pub mod normalizer; 4 | pub mod parser; 5 | pub mod sequences; 6 | pub mod validator; 7 | -------------------------------------------------------------------------------- /src/mapper/cigar.rs: -------------------------------------------------------------------------------- 1 | //! Code supporting the `CigarMapper` 2 | 3 | use std::fmt::Display; 4 | 5 | use crate::mapper::Error; 6 | use nom::{combinator::all_consuming, multi::many0, Parser}; 7 | 8 | /// CIGAR operation as parsed from UTA.
9 | #[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)] 10 | pub enum CigarOp { 11 | /// = 12 | Eq, 13 | /// D 14 | Del, 15 | /// I 16 | Ins, 17 | /// M 18 | Match, 19 | /// N 20 | Skip, 21 | /// X 22 | Mismatch, 23 | } 24 | 25 | impl CigarOp { 26 | pub fn is_advance_ref(&self) -> bool { 27 | matches!( 28 | self, 29 | CigarOp::Eq | CigarOp::Match | CigarOp::Mismatch | CigarOp::Ins | CigarOp::Skip 30 | ) 31 | } 32 | 33 | pub fn is_advance_tgt(&self) -> bool { 34 | matches!( 35 | self, 36 | CigarOp::Eq | CigarOp::Match | CigarOp::Mismatch | CigarOp::Del 37 | ) 38 | } 39 | } 40 | 41 | impl TryFrom<char> for CigarOp { 42 | type Error = Error; 43 | 44 | fn try_from(value: char) -> Result<Self, Self::Error> { 45 | Ok(match value { 46 | '=' => Self::Eq, 47 | 'D' => Self::Del, 48 | 'I' => Self::Ins, 49 | 'M' => Self::Match, 50 | 'N' => Self::Skip, 51 | 'X' => Self::Mismatch, 52 | _ => return Err(Error::InvalidCigarValue(value)), 53 | }) 54 | } 55 | } 56 | 57 | impl From<CigarOp> for char { 58 | fn from(val: CigarOp) -> Self { 59 | match val { 60 | CigarOp::Eq => '=', 61 | CigarOp::Del => 'D', 62 | CigarOp::Ins => 'I', 63 | CigarOp::Match => 'M', 64 | CigarOp::Skip => 'N', 65 | CigarOp::Mismatch => 'X', 66 | } 67 | } 68 | } 69 | 70 | impl Display for CigarOp { 71 | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { 72 | write!(f, "{}", std::convert::Into::<char>::into(*self)) 73 | } 74 | } 75 | 76 | /// CIGAR element consisting of count and CIGAR operation.
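/// For example (illustrative only; see `CigarElement::from_strs` for the parsing
/// rule), an omitted count defaults to 1:
///
/// ```text
/// "3=" -> CigarElement { count: 3, op: CigarOp::Eq }
/// "M"  -> CigarElement { count: 1, op: CigarOp::Match }
/// ```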
77 | #[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)] 78 | pub struct CigarElement { 79 | pub count: i32, 80 | pub op: CigarOp, 81 | } 82 | 83 | impl Display for CigarElement { 84 | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { 85 | if self.count > 1 { 86 | write!(f, "{}", self.count)?; 87 | } 88 | write!(f, "{}", self.op) 89 | } 90 | } 91 | 92 | impl CigarElement { 93 | fn from_strs(count: &str, op: &str) -> Result<Self, Error> { 94 | Ok(CigarElement { 95 | count: if count.is_empty() { 96 | 1 97 | } else { 98 | str::parse(count).map_err(|_e| Error::InvalidCigarCount(count.to_string()))? 99 | }, 100 | op: op 101 | .chars() 102 | .next() 103 | .expect("CIGAR op is empty") 104 | .try_into() 105 | .map_err(|_e| Error::InvalidCigarOp(op.to_string()))?, 106 | }) 107 | } 108 | } 109 | 110 | #[derive(Debug, PartialEq, PartialOrd, Default, Clone)] 111 | pub struct CigarString { 112 | pub elems: Vec<CigarElement>, 113 | } 114 | 115 | impl CigarString { 116 | fn from(elems: Vec<CigarElement>) -> Self { 117 | Self { elems } 118 | } 119 | } 120 | 121 | impl std::ops::Deref for CigarString { 122 | type Target = Vec<CigarElement>; 123 | fn deref(&self) -> &Self::Target { 124 | &self.elems 125 | } 126 | } 127 | impl std::ops::DerefMut for CigarString { 128 | fn deref_mut(&mut self) -> &mut Self::Target { 129 | &mut self.elems 130 | } 131 | } 132 | 133 | impl Display for CigarString { 134 | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { 135 | for item in &self.elems { 136 | write!(f, "{}", &item)?
137 | } 138 | Ok(()) 139 | } 140 | } 141 | 142 | pub mod parse { 143 | use nom::{ 144 | bytes::complete::take_while_m_n, character::complete::digit0, error::context, 145 | sequence::pair, IResult, Parser, 146 | }; 147 | use nom_language::error::VerboseError; 148 | 149 | type Res<T, U> = IResult<T, U, VerboseError<T>>; 150 | 151 | use super::CigarElement; 152 | 153 | pub fn is_cigar_op_char(c: char) -> bool { 154 | "=DIMNX".contains(c) 155 | } 156 | 157 | pub fn cigar_element(input: &str) -> Res<&str, CigarElement> { 158 | context( 159 | "cigar_element", 160 | pair(digit0, take_while_m_n(1, 1, is_cigar_op_char)), 161 | ) 162 | .parse(input) 163 | .map(|(rest, (count, op))| { 164 | ( 165 | rest, 166 | CigarElement::from_strs(count, op).expect("CIGAR parsing failed"), 167 | ) 168 | }) 169 | } 170 | } 171 | 172 | /// Parse a CIGAR `str` into a real one. 173 | pub fn parse_cigar_string(input: &str) -> Result<CigarString, Error> { 174 | Ok(CigarString::from( 175 | all_consuming(many0(parse::cigar_element)) 176 | .parse(input) 177 | .map_err(|e| Error::InvalidCigarString(e.to_string()))? 178 | .1, 179 | )) 180 | } 181 | 182 | /// Provide coordinate mapping between two sequences whose alignment is given by a CIGAR string. 183 | /// 184 | /// CIGAR is about alignments between positions in two sequences. It is base-centric. 185 | /// 186 | /// Unfortunately, base-centric coordinate systems require additional complexity to refer to 187 | /// zero-width positions. 188 | /// 189 | /// This code uses interbase intervals. Interbase positions are zero-width boundaries between 190 | /// bases. They often look similar to zero-based, right open coordinates. (But don't call them 191 | /// that. It upsets me deeply.) The most important difference is that zero width intervals 192 | /// neatly represent insertions between bases (or before or after the sequence).
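///
/// As a small illustration (not part of the original documentation), the
/// interbase positions of the sequence `ACGT` are the five boundaries 0..=4:
///
/// ```text
///   A   C   G   T
/// 0   1   2   3   4
/// ```
///
/// The zero-width interval `[2, 2)` is the boundary between `C` and `G`, e.g., an
/// insertion point, while `[2, 3)` refers to the base `G` itself.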
193 | #[derive(Default, Debug, Clone)] 194 | pub struct CigarMapper { 195 | pub cigar_string: CigarString, 196 | pub ref_pos: Vec<i32>, 197 | pub tgt_pos: Vec<i32>, 198 | pub cigar_op: Vec<CigarOp>, 199 | pub ref_len: i32, 200 | pub tgt_len: i32, 201 | } 202 | 203 | #[derive(Debug, PartialEq)] 204 | pub struct CigarMapperResult { 205 | pub pos: i32, 206 | pub offset: i32, 207 | pub cigar_op: CigarOp, 208 | } 209 | 210 | impl CigarMapper { 211 | pub fn new(cigar_string: &CigarString) -> Self { 212 | let (ref_pos, tgt_pos, cigar_op) = Self::init(cigar_string); 213 | 214 | Self { 215 | cigar_string: cigar_string.clone(), 216 | ref_len: *ref_pos.last().expect("no reference positions?"), 217 | tgt_len: *tgt_pos.last().expect("no target positions?"), 218 | ref_pos, 219 | tgt_pos, 220 | cigar_op, 221 | } 222 | } 223 | 224 | /// For a given CIGAR string, return the start positions of each aligned segment in ref 225 | /// and tgt, and a list of CIGAR operators. 226 | fn init(cigar_string: &CigarString) -> (Vec<i32>, Vec<i32>, Vec<CigarOp>) { 227 | let cigar_len = cigar_string.len(); 228 | 229 | let mut ref_pos = vec![-1; cigar_len]; 230 | let mut tgt_pos = vec![-1; cigar_len]; 231 | let mut cigar_op = vec![CigarOp::Mismatch; cigar_len]; 232 | let mut ref_cur = 0; 233 | let mut tgt_cur = 0; 234 | for (i, CigarElement { count, op }) in cigar_string.iter().enumerate() { 235 | ref_pos[i] = ref_cur; 236 | tgt_pos[i] = tgt_cur; 237 | cigar_op[i] = *op; 238 | if op.is_advance_ref() { 239 | ref_cur += *count; 240 | } 241 | if op.is_advance_tgt() { 242 | tgt_cur += *count; 243 | } 244 | } 245 | ref_pos.push(ref_cur); 246 | tgt_pos.push(tgt_cur); 247 | 248 | (ref_pos, tgt_pos, cigar_op) 249 | } 250 | 251 | pub fn map_ref_to_tgt( 252 | &self, 253 | pos: i32, 254 | end: &str, 255 | strict_bounds: bool, 256 | ) -> Result<CigarMapperResult, Error> { 257 | self.map(&self.ref_pos, &self.tgt_pos, pos, end, strict_bounds) 258 | } 259 | 260 | pub fn map_tgt_to_ref( 261 | &self, 262 | pos: i32, 263 | end: &str, 264 | strict_bounds: bool, 265 | ) -> Result<CigarMapperResult, Error>
{ 266 | self.map(&self.tgt_pos, &self.ref_pos, pos, end, strict_bounds) 267 | } 268 | 269 | /// Map position between aligned segments. 270 | /// 271 | /// Positions in this function are 0-based, base-counting. 272 | fn map( 273 | &self, 274 | from_pos: &[i32], 275 | to_pos: &[i32], 276 | pos: i32, 277 | end: &str, 278 | strict_bounds: bool, 279 | ) -> Result<CigarMapperResult, Error> { 280 | if strict_bounds && (pos < 0 || pos > *from_pos.last().expect("no last position")) { 281 | return Err(Error::PositionBeyondTranscriptBounds( 282 | pos, 283 | format!("{:?}", from_pos), 284 | format!("{:?}", to_pos), 285 | )); 286 | } 287 | 288 | // Find aligned segment to use as basis for mapping. It is okay for pos to be 289 | // before first element or after last. 290 | let pos_i = { 291 | let mut pos_i = 0; 292 | while pos_i < self.cigar_op.len() { 293 | if pos < from_pos[pos_i + 1] { 294 | break; 295 | } 296 | pos_i += 1; 297 | } 298 | std::cmp::min(pos_i, self.cigar_op.len().saturating_sub(1)) 299 | }; 300 | 301 | let cigar_op = self.cigar_op[pos_i]; 302 | 303 | if cigar_op == CigarOp::Eq || cigar_op == CigarOp::Match || cigar_op == CigarOp::Mismatch { 304 | Ok(CigarMapperResult { 305 | pos: to_pos[pos_i] + (pos - from_pos[pos_i]), 306 | offset: 0, 307 | cigar_op, 308 | }) 309 | } else if cigar_op == CigarOp::Del || cigar_op == CigarOp::Ins { 310 | Ok(CigarMapperResult { 311 | pos: if end == "start" { 312 | to_pos[pos_i] - 1 313 | } else { 314 | to_pos[pos_i] 315 | }, 316 | offset: 0, 317 | cigar_op, 318 | }) 319 | } else if cigar_op == CigarOp::Skip { 320 | if pos - from_pos[pos_i] < from_pos[pos_i + 1] - pos { 321 | Ok(CigarMapperResult { 322 | pos: to_pos[pos_i] - 1, 323 | offset: pos - from_pos[pos_i] + 1, 324 | cigar_op, 325 | }) 326 | } else { 327 | Ok(CigarMapperResult { 328 | pos: to_pos[pos_i], 329 | offset: -(from_pos[pos_i + 1] - pos), 330 | cigar_op, 331 | }) 332 | } 333 | } else { 334 | Err(Error::CigarMapperError) 335 | } 336 | } 337 | } 338 | 339 | #[cfg(test)] 340 | mod test { 341 |
use anyhow::Error; 342 | use pretty_assertions::assert_eq; 343 | 344 | use super::{parse_cigar_string, CigarElement, CigarMapper, CigarMapperResult, CigarOp}; 345 | 346 | #[test] 347 | fn parse_cigar_string_simple() -> Result<(), Error> { 348 | // assert_eq!(parse_cigar_string("")?, vec![]); 349 | assert_eq!( 350 | parse_cigar_string("M")?.elems, 351 | vec![CigarElement { 352 | count: 1, 353 | op: CigarOp::Match 354 | }] 355 | ); 356 | assert_eq!( 357 | parse_cigar_string("MM")?.elems, 358 | vec![ 359 | CigarElement { 360 | count: 1, 361 | op: CigarOp::Match 362 | }, 363 | CigarElement { 364 | count: 1, 365 | op: CigarOp::Match 366 | } 367 | ] 368 | ); 369 | assert_eq!( 370 | parse_cigar_string("1M")?.elems, 371 | vec![CigarElement { 372 | count: 1, 373 | op: CigarOp::Match, 374 | },] 375 | ); 376 | assert_eq!( 377 | parse_cigar_string("1M2I3X")?.elems, 378 | vec![ 379 | CigarElement { 380 | count: 1, 381 | op: CigarOp::Match, 382 | }, 383 | CigarElement { 384 | count: 2, 385 | op: CigarOp::Ins, 386 | }, 387 | CigarElement { 388 | count: 3, 389 | op: CigarOp::Mismatch, 390 | }, 391 | ] 392 | ); 393 | assert_eq!( 394 | parse_cigar_string("1MI3X")?.elems, 395 | vec![ 396 | CigarElement { 397 | count: 1, 398 | op: CigarOp::Match, 399 | }, 400 | CigarElement { 401 | count: 1, 402 | op: CigarOp::Ins, 403 | }, 404 | CigarElement { 405 | count: 3, 406 | op: CigarOp::Mismatch, 407 | }, 408 | ] 409 | ); 410 | 411 | Ok(()) 412 | } 413 | 414 | #[test] 415 | fn cigar_mapper_simple() -> Result<(), Error> { 416 | // 0 1 2 3 4 5 6 7 8 9 tgt 417 | // = = = N N = X = N N N = I = D = 418 | // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ref 419 | let cigar = "3=2N=X=3N=I=D=".to_string(); 420 | let cigar_str = parse_cigar_string(&cigar)?; 421 | let cigar_mapper = CigarMapper::new(&cigar_str); 422 | 423 | assert_eq!(cigar_mapper.ref_len, 15); 424 | assert_eq!(cigar_mapper.tgt_len, 10); 425 | assert_eq!(cigar_mapper.ref_pos.len(), cigar_mapper.tgt_pos.len()); 426 | assert_eq!( 427 | 
cigar_mapper.ref_pos, 428 | vec![0, 3, 5, 6, 7, 8, 11, 12, 13, 14, 14, 15] 429 | ); 430 | assert_eq!( 431 | cigar_mapper.tgt_pos, 432 | vec![0, 3, 3, 4, 5, 6, 6, 7, 7, 8, 9, 10] 433 | ); 434 | 435 | // ref to tgt 436 | { 437 | let cases = vec![ 438 | (0, "start", 0, 0, CigarOp::Eq), 439 | (0, "end", 0, 0, CigarOp::Eq), 440 | (1, "start", 1, 0, CigarOp::Eq), 441 | (1, "end", 1, 0, CigarOp::Eq), 442 | (2, "start", 2, 0, CigarOp::Eq), 443 | (2, "end", 2, 0, CigarOp::Eq), 444 | (3, "start", 2, 1, CigarOp::Skip), 445 | (3, "end", 2, 1, CigarOp::Skip), 446 | (4, "start", 3, -1, CigarOp::Skip), 447 | (4, "end", 3, -1, CigarOp::Skip), 448 | (5, "start", 3, 0, CigarOp::Eq), 449 | (5, "end", 3, 0, CigarOp::Eq), 450 | (6, "start", 4, 0, CigarOp::Mismatch), 451 | (6, "end", 4, 0, CigarOp::Mismatch), 452 | (7, "start", 5, 0, CigarOp::Eq), 453 | (7, "end", 5, 0, CigarOp::Eq), 454 | (8, "start", 5, 1, CigarOp::Skip), 455 | (8, "end", 5, 1, CigarOp::Skip), 456 | (9, "start", 5, 2, CigarOp::Skip), 457 | (9, "end", 5, 2, CigarOp::Skip), 458 | (10, "start", 6, -1, CigarOp::Skip), 459 | (10, "end", 6, -1, CigarOp::Skip), 460 | (11, "start", 6, 0, CigarOp::Eq), 461 | (11, "end", 6, 0, CigarOp::Eq), 462 | (12, "start", 6, 0, CigarOp::Ins), 463 | (12, "end", 7, 0, CigarOp::Ins), 464 | (13, "start", 7, 0, CigarOp::Eq), 465 | (13, "end", 7, 0, CigarOp::Eq), 466 | (14, "start", 9, 0, CigarOp::Eq), 467 | (14, "end", 9, 0, CigarOp::Eq), 468 | ]; 469 | for (arg_pos, arg_end, pos, offset, cigar_op) in cases { 470 | assert_eq!( 471 | cigar_mapper.map_ref_to_tgt(arg_pos, arg_end, true)?, 472 | CigarMapperResult { 473 | pos, 474 | offset, 475 | cigar_op 476 | }, 477 | "case = {:?}", 478 | (arg_pos, arg_end, pos, offset, cigar_op) 479 | ); 480 | } 481 | } 482 | 483 | // tgt to ref 484 | { 485 | let cases = vec![ 486 | (0, "start", 0, 0, CigarOp::Eq), 487 | (0, "end", 0, 0, CigarOp::Eq), 488 | (1, "start", 1, 0, CigarOp::Eq), 489 | (1, "end", 1, 0, CigarOp::Eq), 490 | (2, "start", 2, 0, 
CigarOp::Eq), 491 | (2, "end", 2, 0, CigarOp::Eq), 492 | (3, "start", 5, 0, CigarOp::Eq), 493 | (3, "end", 5, 0, CigarOp::Eq), 494 | (4, "start", 6, 0, CigarOp::Mismatch), 495 | (4, "end", 6, 0, CigarOp::Mismatch), 496 | (5, "start", 7, 0, CigarOp::Eq), 497 | (5, "end", 7, 0, CigarOp::Eq), 498 | (6, "start", 11, 0, CigarOp::Eq), 499 | (6, "end", 11, 0, CigarOp::Eq), 500 | (7, "start", 13, 0, CigarOp::Eq), 501 | (7, "end", 13, 0, CigarOp::Eq), 502 | (8, "start", 13, 0, CigarOp::Del), 503 | (8, "end", 14, 0, CigarOp::Del), 504 | (9, "start", 14, 0, CigarOp::Eq), 505 | (9, "end", 14, 0, CigarOp::Eq), 506 | ]; 507 | for (arg_pos, arg_end, pos, offset, cigar_op) in cases { 508 | assert_eq!( 509 | cigar_mapper.map_tgt_to_ref(arg_pos, arg_end, true)?, 510 | CigarMapperResult { 511 | pos, 512 | offset, 513 | cigar_op 514 | }, 515 | "case = {:?}", 516 | (arg_pos, arg_end, pos, offset, cigar_op) 517 | ); 518 | } 519 | } 520 | 521 | Ok(()) 522 | } 523 | 524 | #[test] 525 | fn cigar_mapper_strict_bounds() -> Result<(), Error> { 526 | // 0 1 2 3 4 5 6 7 8 9 tgt 527 | // = = = N N = X = N N N = I = D = 528 | // 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ref 529 | let cigar = "3=2N=X=3N=I=D=".to_string(); 530 | let cigar_str = parse_cigar_string(&cigar)?; 531 | let cigar_mapper = CigarMapper::new(&cigar_str); 532 | 533 | // error for out of bounds on left? 534 | assert!(cigar_mapper.map_ref_to_tgt(-1, "start", true).is_err()); 535 | // ... and right? 
536 | assert!(cigar_mapper 537 | .map_ref_to_tgt(cigar_mapper.ref_len + 1, "start", true) 538 | .is_err()); 539 | 540 | // test whether 1 base outside bounds results in correct position 541 | assert_eq!( 542 | cigar_mapper.map_ref_to_tgt(0, "start", true)?, 543 | CigarMapperResult { 544 | pos: 0, 545 | offset: 0, 546 | cigar_op: CigarOp::Eq, 547 | } 548 | ); 549 | assert_eq!( 550 | cigar_mapper.map_ref_to_tgt(-1, "start", false)?, 551 | CigarMapperResult { 552 | pos: -1, 553 | offset: 0, 554 | cigar_op: CigarOp::Eq, 555 | } 556 | ); 557 | assert_eq!( 558 | cigar_mapper.map_ref_to_tgt(cigar_mapper.ref_len, "start", true)?, 559 | CigarMapperResult { 560 | pos: cigar_mapper.tgt_len, 561 | offset: 0, 562 | cigar_op: CigarOp::Eq, 563 | } 564 | ); 565 | assert_eq!( 566 | cigar_mapper.map_ref_to_tgt(cigar_mapper.ref_len - 1, "start", false)?, 567 | CigarMapperResult { 568 | pos: cigar_mapper.tgt_len - 1, 569 | offset: 0, 570 | cigar_op: CigarOp::Eq, 571 | } 572 | ); 573 | 574 | Ok(()) 575 | } 576 | } 577 | 578 | // 579 | // Copyright 2023 hgvs-rs Contributors 580 | // Copyright 2014 Bioutils Contributors 581 | // 582 | // Licensed under the Apache License, Version 2.0 (the "License"); 583 | // you may not use this file except in compliance with the License. 584 | // You may obtain a copy of the License at 585 | // 586 | // http://www.apache.org/licenses/LICENSE-2.0 587 | // 588 | // Unless required by applicable law or agreed to in writing, software 589 | // distributed under the License is distributed on an "AS IS" BASIS, 590 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 591 | // See the License for the specific language governing permissions and 592 | // limitations under the License. 593 | // 594 | -------------------------------------------------------------------------------- /src/mapper/error.rs: -------------------------------------------------------------------------------- 1 | //! Error type definition. 
2 | 3 | use thiserror::Error; 4 | 5 | /// Error type for variant mapping. 6 | #[derive(Error, Debug, Clone)] 7 | pub enum Error { 8 | #[error("validation error")] 9 | ValidationFailed(#[from] crate::validator::Error), 10 | #[error("normalization error")] 11 | NormalizationFailed(#[from] crate::normalizer::Error), 12 | #[error("parsing failed")] 13 | ParsingFailed(#[from] crate::parser::Error), 14 | #[error("sequence operation failed")] 15 | SequenceOperationFailed(#[from] crate::sequences::Error), 16 | #[error("problem accessing data")] 17 | DataError(#[from] crate::data::error::Error), 18 | #[error("expected a GenomeVariant but received {0}")] 19 | ExpectedGenomeVariant(String), 20 | #[error("expected a TxVariant but received {0}")] 21 | ExpectedTxVariant(String), 22 | #[error("expected a CdsVariant but received {0}")] 23 | ExpectedCdsVariant(String), 24 | #[error("no NAEdit in HGVS.c variant: {0}")] 25 | NoNAEditInHgvsC(String), 26 | #[error("must have ProtVariant")] 27 | NotProtVariant, 28 | #[error("could not construct HGVS.p variant")] 29 | ProtVariantConstructionFailed, 30 | #[error("cannot get altered sequence for missing positions")] 31 | NoAlteredSequenceForMissingPositions, 32 | #[error("variant is missing nucleic acid edit")] 33 | NaEditMissing, 34 | #[error("can only update reference for c, g, m, n, r")] 35 | CannotUpdateReference, 36 | #[error("invalid CIGAR value: {0}")] 37 | InvalidCigarValue(char), 38 | #[error("invalid CIGAR count: {0}")] 39 | InvalidCigarCount(String), 40 | #[error("invalid CIGAR op: {0}")] 41 | InvalidCigarOp(String), 42 | #[error("invalid CIGAR string: {0}")] 43 | InvalidCigarString(String), 44 | #[error( 45 | "position is beyond the bounds of transcript record (pos={0}, from_pos={1}, to_pos={2})" 46 | )] 47 | PositionBeyondTranscriptBounds(i32, String, String), 48 | #[error("algorithm error in CIGAR mapper")] 49 | CigarMapperError, 50 | #[error("not a GenomeVariant: {0}")] 51 | NotGenomeVariant(String), 52 | #[error("no
alignments for {0} in {1} using {2}")] 53 | NoAlignments(String, String, String), 54 | #[error( 55 | "multiple chromosome alignments for {0} in {1} using {2} (non- \ 56 | pseudoautosomal region) [{3}]" 57 | )] 58 | MultipleChromAlignsNonPar(String, String, String, String), 59 | #[error( 60 | "multiple chromosome alignments for {0} in {1} using {2} (likely \ 61 | pseudoautosomal region)" 62 | )] 63 | MultipleChromAlignsLikelyPar(String, String, String), 64 | #[error( 65 | "multiple chromosome alignments for {0} in {1} using {2} \ 66 | (in_par_assume={3} select {4} of them)" 67 | )] 68 | MultipleChromAlignsInParAssume(String, String, String, String, usize), 69 | #[error( 70 | "transcript {0} is not supported because its sequence length of 71 | {1} is not a multiple of 3" 72 | )] 73 | TranscriptLengthInvalid(String, usize), 74 | #[error("start pos out of range in reference sequence")] 75 | StartPosOutOfRange, 76 | #[error("got multiple AA variants which is not supported")] 77 | MultipleAAVariants, 78 | #[error("deletion sequence should not be empty")] 79 | DeletionSequenceEmpty, 80 | #[error("insertion sequence should not be empty")] 81 | InsertionSequenceEmpty, 82 | #[error("cannot build CIGAR string from empty exons")] 83 | EmptyExons, 84 | #[error("found no exons for tx_ac={0}, alt_ac={1}, alt_aln_method={2}")] 85 | NoExons(String, String, String), 86 | #[error("non-adjacent exons for tx_ac={0}, alt_ac={1}, alt_aln_method={2}: {3}")] 87 | NonAdjacentExons(String, String, String, String), 88 | #[error("CDS start and end must both be defined or undefined")] 89 | InconsistentCdsStartEnd, 90 | #[error("cannot project genome interval with missing start or end position: {0}")] 91 | MissingGenomeIntervalPosition(String), 92 | #[error("CDS is undefined for {0}; cannot map to c.
coordinates (non-coding transcript?)")] 93 | CdsUndefined(String), 94 | #[error("coordinate is outside the bounds of the reference sequence")] 95 | CoordinateOutsideReference, 96 | #[error("c.{0} coordinate is out of bounds")] 97 | CoordinateOutOfBounds(String), 98 | #[error("cannot convert interval start: {0} to usize")] 99 | CannotConvertIntervalStart(i32), 100 | #[error("cannot convert interval end: {0} to usize")] 101 | CannotConvertIntervalEnd(i32), 102 | #[error("general mapper error")] 103 | General, 104 | } 105 | -------------------------------------------------------------------------------- /src/mapper/mod.rs: -------------------------------------------------------------------------------- 1 | //! Code supporting mapping between coordinate systems. 2 | 3 | pub mod alignment; 4 | pub mod altseq; 5 | pub mod assembly; 6 | pub mod cigar; 7 | mod error; 8 | pub mod variant; 9 | 10 | pub use error::Error; 11 | -------------------------------------------------------------------------------- /src/mapper/snapshots/hgvs__mapper__variant__test__issue_131.snap: -------------------------------------------------------------------------------- 1 | --- 2 | source: src/mapper/variant.rs 3 | expression: "&var_p_test" 4 | --- 5 | ProtVariant: 6 | accession: 7 | value: NP_001240838.1 8 | gene_symbol: ~ 9 | loc_edit: NoChange 10 | 11 | -------------------------------------------------------------------------------- /src/parser/error.rs: -------------------------------------------------------------------------------- 1 | //! Error type definition. 2 | 3 | use thiserror::Error; 4 | 5 | /// Error type for parsing of HGVS expressions. 6 | #[derive(Error, Debug, Clone)] 7 | pub enum Error { 8 | /// Invalid genome interval. 9 | #[error("{0} is not a valid genome interval")] 10 | InvalidGenomeInterval(String), 11 | /// Invalid transcript interval. 12 | #[error("{0} is not a valid tx interval")] 13 | InvalidTxInterval(String), 14 | /// Invalid CDS interval. 
15 | #[error("{0} is not a valid CDS interval")] 16 | InvalidCdsInterval(String), 17 | /// Invalid HGVS expression. 18 | #[error("{0} is not a valid HGVS expression")] 19 | InvalidHgvsVariant(String), 20 | 21 | /// Ill-defined conversion. 22 | #[error("conversion of interval with different offsets (CDS start/end) is ill-defined: {0}")] 23 | IllDefinedConversion(String), 24 | /// Cannot convert `None` position into range. 25 | #[error("cannot convert interval with None position into range: {0}")] 26 | CannotNonePositionIntoRange(String), 27 | 28 | #[error("ref or alt must be non-empty in: {0}")] 29 | RefOrAltMustBeNonEmpty(String), 30 | #[error("number of deleted bases must be positive in: {0}")] 31 | NumDelBasesNotPositive(String), 32 | #[error("alternate bases must be non-empty in: {0}")] 33 | NumAltBasesEmpty(String), 34 | #[error("number of inverted bases must be positive in: {0}")] 35 | NumInvBasesNotPositive(String), 36 | } 37 | -------------------------------------------------------------------------------- /src/parser/impl_validate.rs: -------------------------------------------------------------------------------- 1 | //! Provide implementation of validation to data structures.
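The validation code in this file hangs off a small `Validateable` trait that every location/edit type implements. The following standalone sketch shows the shape of the pattern with a simplified interval type; the names mirror the crate but this is an illustration (a plain `String` stands in for the crate's error type):

```rust
// Simplified stand-in for the crate's `GenomeInterval`; positions are
// 1-based and `None` means "unknown" (which is valid, not an error).
#[derive(Debug)]
struct Interval {
    start: Option<i32>,
    end: Option<i32>,
}

trait Validateable {
    fn validate(&self) -> Result<(), String>;
}

impl Validateable for Interval {
    fn validate(&self) -> Result<(), String> {
        if let Some(start) = self.start {
            if start < 1 {
                return Err(format!("start must be positive: {:?}", self));
            }
        }
        if let Some(end) = self.end {
            if end < 1 {
                return Err(format!("end must be positive: {:?}", self));
            }
        }
        // Only check the ordering when both bounds are known.
        if let (Some(start), Some(end)) = (self.start, self.end) {
            if start > end {
                return Err(format!("start must be <= end: {:?}", self));
            }
        }
        Ok(())
    }
}

fn main() {
    assert!(Interval { start: Some(1), end: Some(2) }.validate().is_ok());
    assert!(Interval { start: None, end: None }.validate().is_ok());
    assert!(Interval { start: Some(2), end: Some(1) }.validate().is_err());
}
```

Composite types (e.g. a location plus an edit) then validate by delegating to `validate()` on each component, which is exactly what the `GenomeLocEdit` and `HgvsVariant` impls below do.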
2 | 3 | use std::ops::Range; 4 | 5 | use crate::validator::Error; 6 | use crate::validator::Validateable; 7 | 8 | use super::{ 9 | CdsInterval, CdsLocEdit, GenomeInterval, GenomeLocEdit, HgvsVariant, MtLocEdit, NaEdit, 10 | ProtLocEdit, RnaLocEdit, TxLocEdit, 11 | }; 12 | 13 | impl Validateable for NaEdit { 14 | fn validate(&self) -> Result<(), Error> { 15 | match &self { 16 | NaEdit::RefAlt { 17 | reference, 18 | alternative, 19 | } => { 20 | if reference.is_empty() && alternative.is_empty() { 21 | Err(Error::RefOrAltMustBeNonEmpty(format!("{:?}", self))) 22 | } else { 23 | Ok(()) 24 | } 25 | } 26 | NaEdit::NumAlt { count, alternative } => { 27 | if *count < 1 { 28 | Err(Error::NumDelBasesNotPositive(format!("{:?}", self))) 29 | } else if alternative.is_empty() { 30 | Err(Error::NumAltBasesEmpty(format!("{:?}", self))) 31 | } else { 32 | Ok(()) 33 | } 34 | } 35 | NaEdit::DelRef { reference: _ } => Ok(()), 36 | NaEdit::DelNum { count } => { 37 | if *count < 1 { 38 | Err(Error::NumDelBasesNotPositive(format!("{:?}", self))) 39 | } else { 40 | Ok(()) 41 | } 42 | } 43 | NaEdit::Ins { alternative: _ } => Ok(()), 44 | NaEdit::Dup { reference: _ } => Ok(()), 45 | NaEdit::InvRef { reference: _ } => Ok(()), 46 | NaEdit::InvNum { count } => { 47 | if *count < 1 { 48 | Err(Error::NumInvBasesNotPositive(format!("{:?}", self))) 49 | } else { 50 | Ok(()) 51 | } 52 | } 53 | } 54 | } 55 | } 56 | 57 | impl Validateable for HgvsVariant { 58 | fn validate(&self) -> Result<(), Error> { 59 | // NB: we only need to validate `self.loc_edit`. The cases that the Python library 60 | // considers are fended off by the Rust type system. 61 | match &self { 62 | HgvsVariant::CdsVariant { loc_edit, .. } => loc_edit.validate(), 63 | HgvsVariant::GenomeVariant { loc_edit, .. } => loc_edit.validate(), 64 | HgvsVariant::MtVariant { loc_edit, .. } => loc_edit.validate(), 65 | HgvsVariant::TxVariant { loc_edit, .. } => loc_edit.validate(), 66 | HgvsVariant::ProtVariant { loc_edit, .. 
} => loc_edit.validate(), 67 | HgvsVariant::RnaVariant { loc_edit, .. } => loc_edit.validate(), 68 | } 69 | } 70 | } 71 | 72 | impl Validateable for CdsLocEdit { 73 | fn validate(&self) -> Result<(), Error> { 74 | let loc = self.loc.inner(); 75 | loc.validate()?; 76 | 77 | let maybe_range: Result<Range<i32>, _> = loc.clone().try_into(); 78 | let range = if let Ok(range) = maybe_range { 79 | range 80 | } else { 81 | log::trace!( 82 | "Skipping CDS location because loc cannot be converted to range: {:?}", 83 | loc 84 | ); 85 | return Ok(()); 86 | }; 87 | 88 | match self.edit.inner() { 89 | NaEdit::RefAlt { .. } 90 | | NaEdit::DelRef { .. } 91 | | NaEdit::Dup { .. } 92 | | NaEdit::Ins { .. } 93 | | NaEdit::InvRef { .. } => { 94 | // We cannot make assumptions about reference length as we can have position 95 | // offsets. 96 | Ok(()) 97 | } 98 | NaEdit::DelNum { count } | NaEdit::NumAlt { count, .. } | NaEdit::InvNum { count } => { 99 | if range.len() as i32 != *count { 100 | Err(Error::ImpliedLengthMismatch(format!("{:?}", self))) 101 | } else { 102 | Ok(()) 103 | } 104 | } 105 | } 106 | } 107 | } 108 | 109 | impl Validateable for CdsInterval { 110 | fn validate(&self) -> Result<(), Error> { 111 | Ok(()) // TODO 112 | } 113 | } 114 | 115 | impl Validateable for GenomeLocEdit { 116 | fn validate(&self) -> Result<(), Error> { 117 | self.loc.inner().validate()?; 118 | self.edit.inner().validate() 119 | } 120 | } 121 | 122 | impl Validateable for GenomeInterval { 123 | fn validate(&self) -> Result<(), Error> { 124 | if let Some(start) = self.start { 125 | if start < 1 { 126 | return Err(Error::StartMustBePositive(format!("{:?}", self))); 127 | } 128 | } 129 | if let Some(end) = self.end { 130 | if end < 1 { 131 | return Err(Error::EndMustBePositive(format!("{:?}", self))); 132 | } 133 | } 134 | if let (Some(start), Some(end)) = (self.start, self.end) { 135 | if start > end { 136 | return Err(Error::StartMustBeLessThanEnd(format!("{:?}", self))); 137 | } 138 | } 139 | 140 | Ok(())
141 | } 142 | } 143 | 144 | impl Validateable for MtLocEdit { 145 | fn validate(&self) -> Result<(), Error> { 146 | Ok(()) // TODO 147 | } 148 | } 149 | 150 | impl Validateable for TxLocEdit { 151 | fn validate(&self) -> Result<(), Error> { 152 | Ok(()) // TODO 153 | } 154 | } 155 | 156 | impl Validateable for RnaLocEdit { 157 | fn validate(&self) -> Result<(), Error> { 158 | Ok(()) // TODO 159 | } 160 | } 161 | 162 | impl Validateable for ProtLocEdit { 163 | fn validate(&self) -> Result<(), Error> { 164 | Ok(()) // TODO 165 | } 166 | } 167 | 168 | #[cfg(test)] 169 | mod test { 170 | use crate::{ 171 | parser::GenomeInterval, 172 | validator::{Error, Validateable}, 173 | }; 174 | 175 | #[test] 176 | fn validate_genomeinterval() -> Result<(), Error> { 177 | let g_interval = GenomeInterval { 178 | start: Some(1), 179 | end: Some(2), 180 | }; 181 | assert!(g_interval.validate().is_ok()); 182 | 183 | let g_interval = GenomeInterval { 184 | start: Some(-1), 185 | end: Some(2), 186 | }; 187 | assert!(g_interval.validate().is_err()); 188 | 189 | let g_interval = GenomeInterval { 190 | start: Some(1), 191 | end: Some(-2), 192 | }; 193 | assert!(g_interval.validate().is_err()); 194 | 195 | let g_interval = GenomeInterval { 196 | start: Some(2), 197 | end: Some(1), 198 | }; 199 | assert!(g_interval.validate().is_err()); 200 | 201 | Ok(()) 202 | } 203 | } 204 | 205 | // 206 | // Copyright 2023 hgvs-rs Contributors 207 | // Copyright 2014 Bioutils Contributors 208 | // 209 | // Licensed under the Apache License, Version 2.0 (the "License"); 210 | // you may not use this file except in compliance with the License. 211 | // You may obtain a copy of the License at 212 | // 213 | // http://www.apache.org/licenses/LICENSE-2.0 214 | // 215 | // Unless required by applicable law or agreed to in writing, software 216 | // distributed under the License is distributed on an "AS IS" BASIS, 217 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
218 | // See the License for the specific language governing permissions and 219 | // limitations under the License. 220 | // 221 | -------------------------------------------------------------------------------- /src/parser/mod.rs: -------------------------------------------------------------------------------- 1 | //! This module contains the code for HGVS variant descriptions. 2 | //! 3 | //! The parsing functionality is provided through `Type::parse()` functions. 4 | //! The data structures also provide the `Display` trait for conversion to 5 | //! strings etc. 6 | 7 | mod display; 8 | mod ds; 9 | mod error; 10 | mod impl_parse; 11 | mod impl_validate; 12 | mod parse_funcs; 13 | 14 | use std::str::FromStr; 15 | 16 | pub use crate::parser::display::*; 17 | pub use crate::parser::ds::*; 18 | pub use crate::parser::error::*; 19 | use crate::parser::impl_parse::*; 20 | 21 | impl FromStr for HgvsVariant { 22 | type Err = Error; 23 | 24 | fn from_str(s: &str) -> Result<Self, Self::Err> { 25 | Self::parse(s) 26 | .map_err(|_e| Error::InvalidHgvsVariant(s.to_string())) 27 | .map(|(_rest, variant)| variant) 28 | } 29 | } 30 | 31 | impl FromStr for GenomeInterval { 32 | type Err = Error; 33 | 34 | fn from_str(s: &str) -> Result<Self, Self::Err> { 35 | Self::parse(s) 36 | .map_err(|_e| Error::InvalidGenomeInterval(s.to_string())) 37 | .map(|(_rest, g_interval)| g_interval) 38 | } 39 | } 40 | 41 | impl FromStr for TxInterval { 42 | type Err = Error; 43 | 44 | fn from_str(s: &str) -> Result<Self, Self::Err> { 45 | Self::parse(s) 46 | .map_err(|_e| Error::InvalidTxInterval(s.to_string())) 47 | .map(|(_rest, g_interval)| g_interval) 48 | } 49 | } 50 | 51 | impl FromStr for CdsInterval { 52 | type Err = Error; 53 | 54 | fn from_str(s: &str) -> Result<Self, Self::Err> { 55 | Self::parse(s) 56 | .map_err(|_e| Error::InvalidCdsInterval(s.to_string())) 57 | .map(|(_rest, g_interval)| g_interval) 58 | } 59 | } 60 | 61 | #[cfg(test)] 62 | mod test { 63 | use anyhow::Error; 64 | use std::{ 65 | fs::File, 66 | io::{BufRead, BufReader},
str::FromStr, 68 | }; 69 | 70 | use crate::parser::{ 71 | Accession, CdsFrom, CdsInterval, CdsLocEdit, CdsPos, GenomeInterval, Mu, NaEdit, 72 | }; 73 | 74 | use super::HgvsVariant; 75 | 76 | #[test] 77 | fn from_str_basic() -> Result<(), Error> { 78 | assert_eq!( 79 | HgvsVariant::from_str("NM_01234.5:c.22+1A>T")?, 80 | HgvsVariant::CdsVariant { 81 | accession: Accession { 82 | value: "NM_01234.5".to_string() 83 | }, 84 | gene_symbol: None, 85 | loc_edit: CdsLocEdit { 86 | loc: Mu::Certain(CdsInterval { 87 | start: CdsPos { 88 | base: 22, 89 | offset: Some(1), 90 | cds_from: CdsFrom::Start 91 | }, 92 | end: CdsPos { 93 | base: 22, 94 | offset: Some(1), 95 | cds_from: CdsFrom::Start 96 | } 97 | }), 98 | edit: Mu::Certain(NaEdit::RefAlt { 99 | reference: "A".to_string(), 100 | alternative: "T".to_string() 101 | }) 102 | } 103 | } 104 | ); 105 | 106 | Ok(()) 107 | } 108 | 109 | #[test] 110 | fn not_ok() -> Result<(), Error> { 111 | assert!(HgvsVariant::from_str("x").is_err()); 112 | 113 | Ok(()) 114 | } 115 | 116 | // This test uses the "gauntlet" file from the hgvs package. 117 | #[test] 118 | fn hgvs_gauntlet() -> Result<(), Error> { 119 | let reader = BufReader::new(File::open("tests/data/parser/gauntlet")?); 120 | 121 | for line in reader.lines() { 122 | let line = line?; 123 | let line = line.trim(); 124 | if !line.starts_with('#') && !line.is_empty() { 125 | let result = HgvsVariant::from_str(line); 126 | assert!(result.is_ok(), "line = {}; result = {:?}", &line, &result); 127 | } 128 | } 129 | 130 | Ok(()) 131 | } 132 | 133 | // This test uses the "reject" file from the hgvs package. 
134 | #[test] 135 | fn hgvs_reject() -> Result<(), Error> { 136 | let reader = BufReader::new(File::open("tests/data/parser/reject")?); 137 | 138 | for line in reader.lines() { 139 | let line = line?; 140 | let line = line.trim(); 141 | if !line.starts_with('#') && !line.is_empty() { 142 | assert!(HgvsVariant::from_str(line).is_err(), "line = {line}") 143 | } 144 | } 145 | 146 | Ok(()) 147 | } 148 | 149 | // Test genome interval parsing. 150 | #[test] 151 | fn genome_interval_from_str() -> Result<(), Error> { 152 | assert!(GenomeInterval::from_str("x").is_err()); 153 | assert_eq!( 154 | GenomeInterval::from_str("1")?, 155 | GenomeInterval { 156 | start: Some(1), 157 | end: Some(1) 158 | } 159 | ); 160 | assert_eq!( 161 | GenomeInterval::from_str("1_1")?, 162 | GenomeInterval { 163 | start: Some(1), 164 | end: Some(1) 165 | } 166 | ); 167 | assert_eq!( 168 | GenomeInterval::from_str("?_1")?, 169 | GenomeInterval { 170 | start: None, 171 | end: Some(1) 172 | } 173 | ); 174 | assert_eq!( 175 | GenomeInterval::from_str("1_?")?, 176 | GenomeInterval { 177 | start: Some(1), 178 | end: None 179 | } 180 | ); 181 | assert_eq!( 182 | GenomeInterval::from_str("?_?")?, 183 | GenomeInterval { 184 | start: None, 185 | end: None 186 | } 187 | ); 188 | 189 | Ok(()) 190 | } 191 | } 192 | 193 | // 194 | // Copyright 2023 hgvs-rs Contributors 195 | // Copyright 2014 Bioutils Contributors 196 | // 197 | // Licensed under the Apache License, Version 2.0 (the "License"); 198 | // you may not use this file except in compliance with the License. 199 | // You may obtain a copy of the License at 200 | // 201 | // http://www.apache.org/licenses/LICENSE-2.0 202 | // 203 | // Unless required by applicable law or agreed to in writing, software 204 | // distributed under the License is distributed on an "AS IS" BASIS, 205 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
206 | // See the License for the specific language governing permissions and 207 | // limitations under the License. 208 | // 209 | -------------------------------------------------------------------------------- /src/sequences.rs: -------------------------------------------------------------------------------- 1 | //! Utility code for working with sequences. 2 | //! 3 | //! Partially ported over from `bioutils.sequences`. 4 | 5 | use ahash::AHashMap; 6 | use md5::{Digest, Md5}; 7 | use std::sync::LazyLock; 8 | 9 | pub use crate::sequences::error::Error; 10 | 11 | include!(concat!(env!("OUT_DIR"), "/tables_gen.rs")); 12 | 13 | mod error { 14 | /// Error type for normalization of HGVS expressions. 15 | #[derive(thiserror::Error, Debug, Clone)] 16 | pub enum Error { 17 | #[error("invalid 1-letter amino acid: {0} at {1}")] 18 | InvalidOneLetterAminoAcid(String, String), 19 | #[error("invalid 3-letter amino acid: {0} at {1}")] 20 | InvalidThreeLetterAminoAcid(String, String), 21 | #[error("3-letter amino acid sequence length is not multiple of three: {0}")] 22 | InvalidThreeLetterAminoAcidLength(usize), 23 | #[error("codon is undefined in codon table: {0}")] 24 | UndefinedCodon(String), 25 | #[error("can only translate DNA sequences whose length is multiple of 3, but is: {0}")] 26 | UntranslatableDnaLenth(usize), 27 | #[error("character is not alphabetic: {0}")] 28 | NotAlphabetic(char), 29 | } 30 | } 31 | 32 | pub fn trim_common_prefixes(reference: &str, alternative: &str) -> (usize, String, String) { 33 | if reference.is_empty() || alternative.is_empty() { 34 | return (0, reference.to_string(), alternative.to_string()); 35 | } 36 | 37 | let mut trim = 0; 38 | while trim < reference.len() && trim < alternative.len() { 39 | if reference.chars().nth(trim) != alternative.chars().nth(trim) { 40 | break; 41 | } 42 | 43 | trim += 1; 44 | } 45 | 46 | ( 47 | trim, 48 | reference[trim..].to_string(), 49 | alternative[trim..].to_string(), 50 | ) 51 | } 52 | 53 | pub fn
trim_common_suffixes(reference: &str, alternative: &str) -> (usize, String, String) { 54 | if reference.is_empty() || alternative.is_empty() { 55 | return (0, reference.to_string(), alternative.to_string()); 56 | } 57 | 58 | let mut trim = 0; 59 | let mut i_r = reference.len(); 60 | let mut i_a = alternative.len(); 61 | let mut pad = 0; 62 | while trim < reference.len() && trim < alternative.len() { 63 | trim += 1; 64 | assert!(i_r > 0); 65 | assert!(i_a > 0); 66 | i_r -= 1; 67 | i_a -= 1; 68 | 69 | if reference.chars().nth(i_r) != alternative.chars().nth(i_a) { 70 | pad = 1; 71 | break; 72 | } 73 | } 74 | 75 | ( 76 | trim - pad, 77 | reference[..(i_r + pad)].to_string(), 78 | alternative[..(i_a + pad)].to_string(), 79 | ) 80 | } 81 | 82 | /// Reverse complementing shortcut. 83 | pub fn revcomp(seq: &str) -> String { 84 | std::str::from_utf8(&bio::alphabets::dna::revcomp(seq.as_bytes())) 85 | .expect("invalid utf-8 encoding") 86 | .to_string() 87 | } 88 | 89 | /// Allow selection of translation table. 90 | #[derive( 91 | Debug, 92 | Default, 93 | Clone, 94 | Copy, 95 | PartialEq, 96 | Eq, 97 | Hash, 98 | PartialOrd, 99 | Ord, 100 | serde::Serialize, 101 | serde::Deserialize, 102 | )] 103 | pub enum TranslationTable { 104 | #[default] 105 | Standard, 106 | Selenocysteine, 107 | VertebrateMitochondrial, 108 | } 109 | 110 | /// Coerces string of 1- or 3-letter amino acids to 1-letter representation. 111 | /// 112 | /// Fails if the sequence is not of valid 3/1-letter amino acids. 113 | /// 114 | /// # Args 115 | /// 116 | /// * `seq` -- An amino acid sequence. 117 | /// 118 | /// # Returns 119 | /// 120 | /// The sequence as 1-letter amino acids. 121 | #[allow(dead_code)] 122 | pub fn aa_to_aa1(seq: &str) -> Result<String, Error> { 123 | if looks_like_aa3_p(seq) { 124 | aa3_to_aa1(seq) 125 | } else { 126 | Ok(seq.to_string()) 127 | } 128 | } 129 | 130 | /// Coerces string of 1- or 3-letter amino acids to 3-letter representation.
131 | /// 132 | /// Fails if the sequence is not of valid 3/1-letter amino acids. 133 | /// 134 | /// # Args 135 | /// 136 | /// * `seq` -- An amino acid sequence. 137 | /// 138 | /// # Returns 139 | /// 140 | /// The sequence as 3-letter amino acids. 141 | #[allow(dead_code)] 142 | pub fn aa_to_aa3(seq: &str) -> Result<String, Error> { 143 | if looks_like_aa3_p(seq) { 144 | Ok(seq.to_string()) 145 | } else { 146 | aa1_to_aa3(seq) 147 | } 148 | } 149 | 150 | /// Converts string of 1-letter amino acids to 3-letter amino acids. 151 | /// 152 | /// Fails if the sequence is not of 1-letter amino acids. 153 | /// 154 | /// # Args 155 | /// 156 | /// * `seq` -- An amino acid sequence as 1-letter amino acids. 157 | /// 158 | /// # Returns 159 | /// 160 | /// The sequence as 3-letter amino acids. 161 | #[allow(dead_code)] 162 | pub fn aa1_to_aa3(seq: &str) -> Result<String, Error> { 163 | if seq.is_empty() { 164 | return Ok(String::new()); 165 | } 166 | 167 | let mut result = String::with_capacity(seq.len() * 3); 168 | 169 | for (i, aa1) in seq.as_bytes().iter().enumerate() { 170 | let aa3 = AA1_TO_AA3_STR[*aa1 as usize].ok_or_else(|| { 171 | Error::InvalidOneLetterAminoAcid(format!("{:?}", aa1), format!("{}", i + 1)) 172 | })?; 173 | result.push_str(aa3); 174 | } 175 | 176 | Ok(result) 177 | } 178 | 179 | /// Converts string of 3-letter amino acids to 1-letter amino acids. 180 | /// 181 | /// Fails if the sequence is not of 3-letter amino acids. 182 | /// 183 | /// # Args 184 | /// 185 | /// * `seq` -- An amino acid sequence as 3-letter amino acids. 186 | /// 187 | /// # Returns 188 | /// 189 | /// The sequence as 1-letter amino acids.
190 | #[allow(dead_code)] 191 | pub fn aa3_to_aa1(seq: &str) -> Result<String, Error> { 192 | if seq.len() % 3 != 0 { 193 | return Err(Error::InvalidThreeLetterAminoAcidLength(seq.len())); 194 | } 195 | 196 | let mut result = String::with_capacity(seq.len() / 3); 197 | 198 | for (i, aa3) in seq.as_bytes().chunks(3).enumerate() { 199 | let aa1 = _aa3_to_aa1(aa3).ok_or_else(|| { 200 | Error::InvalidThreeLetterAminoAcid(format!("{:?}", aa3), format!("{}", i + 1)) 201 | })? as char; 202 | result.push(aa1); 203 | } 204 | 205 | Ok(result) 206 | } 207 | 208 | /// Indicates whether a string looks like a 3-letter AA string. 209 | /// 210 | /// # Args 211 | /// 212 | /// * `seq` -- A sequence 213 | /// 214 | /// # Returns 215 | /// 216 | /// Whether the string is of the format of a 3-letter AA string. 217 | #[allow(dead_code)] 218 | fn looks_like_aa3_p(seq: &str) -> bool { 219 | seq.len() % 3 == 0 && seq.chars().nth(1).map(|c| c.is_lowercase()).unwrap_or(true) 220 | } 221 | 222 | type Codon = [u8; 3]; 223 | 224 | /// Allow translation of `&[u8]` DNA codons to `u8` amino acids. 225 | /// 226 | /// We use separate structs here to encapsulate getting the lazy static global data. 227 | struct CodonTranslator { 228 | /// Mapping for "normalizing" DNA ASCII character (to upper case and `U -> T`). 229 | dna_ascii_map: [u8; 256], 230 | 231 | /// Mapping from DNA ASCII to 2-bit representation. 232 | dna_ascii_to_2bit: [u8; 256], 233 | 234 | /// IUPAC ambiguity codes. 235 | iupac_ambiguity_codes: [u8; 13], 236 | 237 | /// Mapping from 2bit DNA codon to amino acid 1-letter ASCII. 238 | codon_2bit_to_aa1: [u8; 64], 239 | 240 | /// Mapping from DNA codon to amino acid 1-letter ASCII including degenerate codons. 241 | full_dna_to_aa1: &'static AHashMap<Codon, u8>, 242 | 243 | /// Buffer.
244 | codon: Codon, 245 | } 246 | 247 | static DNA_TO_AA1_LUT: LazyLock<AHashMap<Codon, u8>> = LazyLock::new(|| { 248 | let mut m = AHashMap::default(); 249 | for (dna, aa1) in DNA_TO_AA1_LUT_VEC { 250 | assert_eq!(dna.len(), 3); 251 | let d = dna.as_bytes(); 252 | m.insert([d[0], d[1], d[2]], aa1.as_bytes()[0]); 253 | } 254 | m 255 | }); 256 | 257 | static DNA_TO_AA1_SEC: LazyLock<AHashMap<Codon, u8>> = LazyLock::new(|| { 258 | let mut m = AHashMap::default(); 259 | for (dna, aa1) in DNA_TO_AA1_SEC_VEC { 260 | assert_eq!(dna.len(), 3); 261 | let d = dna.as_bytes(); 262 | m.insert([d[0], d[1], d[2]], aa1.as_bytes()[0]); 263 | } 264 | m 265 | }); 266 | 267 | static DNA_TO_AA1_CHRMT_VERTEBRATE: LazyLock<AHashMap<Codon, u8>> = LazyLock::new(|| { 268 | let mut m = AHashMap::default(); 269 | for (dna, aa1) in DNA_TO_AA1_CHRMT_VERTEBRATE_VEC { 270 | assert_eq!(dna.len(), 3); 271 | let d = dna.as_bytes(); 272 | m.insert([d[0], d[1], d[2]], aa1.as_bytes()[0]); 273 | } 274 | m 275 | }); 276 | 277 | impl CodonTranslator { 278 | /// Initialize the struct. 279 | pub fn new(table: TranslationTable) -> Self { 280 | Self { 281 | dna_ascii_map: DNA_ASCII_MAP, 282 | dna_ascii_to_2bit: DNA_ASCII_TO_2BIT, 283 | iupac_ambiguity_codes: IUPAC_AMBIGUITY_CODES, 284 | 285 | codon_2bit_to_aa1: match table { 286 | TranslationTable::Standard => CODON_2BIT_TO_AA1_LUT, 287 | TranslationTable::Selenocysteine => CODON_2BIT_TO_AA1_SEC, 288 | TranslationTable::VertebrateMitochondrial => CODON_2BIT_TO_AA1_CHRMT_VERTEBRATE, 289 | }, 290 | full_dna_to_aa1: match table { 291 | TranslationTable::Standard => &DNA_TO_AA1_LUT, 292 | TranslationTable::Selenocysteine => &DNA_TO_AA1_SEC, 293 | TranslationTable::VertebrateMitochondrial => &DNA_TO_AA1_CHRMT_VERTEBRATE, 294 | }, 295 | 296 | codon: [0; 3], 297 | } 298 | } 299 | 300 | /// Translate the given codon to an amino acid. 301 | /// 302 | /// # Args 303 | /// 304 | /// * `codon` -- A codon. 305 | /// 306 | /// # Returns 307 | /// 308 | /// The corresponding amino acid.
309 | pub fn translate(&mut self, codon: &[u8]) -> Result { 310 | // Normalize (to upper case etc.) codon. 311 | self.normalize_codon(codon); 312 | 313 | let translation = self 314 | // Attempt fast translation of codon 315 | .codon_to_aa1(&self.codon) 316 | // Fast translation fails, but slower hash map succeeded. 317 | .or_else(|| self.full_dna_to_aa1.get(&self.codon).copied()) 318 | // If this contains an ambiguous code, set aa to X, otherwise, throw error 319 | .or_else(|| { 320 | codon 321 | .iter() 322 | .any(|c| self.iupac_ambiguity_codes.contains(c)) 323 | .then_some(b'X') 324 | }); 325 | translation.ok_or_else(|| { 326 | Error::UndefinedCodon( 327 | std::str::from_utf8(codon) 328 | .expect("cannot decode UTF-8") 329 | .to_owned(), 330 | ) 331 | }) 332 | } 333 | 334 | fn dna3_to_2bit(&self, c: &[u8]) -> Option { 335 | let mut result = 0; 336 | for i in &c[..3] { 337 | result <<= 2; 338 | let tmp = self.dna_ascii_to_2bit[*i as usize]; 339 | if tmp == 255 { 340 | return None; 341 | } 342 | result |= tmp; 343 | } 344 | Some(result) 345 | } 346 | 347 | /// Helper function to extract normalized codon to `self.codon`. 348 | fn normalize_codon(&mut self, codon: &[u8]) { 349 | for (i, c) in codon[..3].iter().enumerate() { 350 | self.codon[i] = self.dna_ascii_map[*c as usize]; 351 | } 352 | } 353 | 354 | fn codon_to_aa1(&self, codon: &[u8]) -> Option { 355 | if let Some(val) = self.dna3_to_2bit(codon) { 356 | let tmp = self.codon_2bit_to_aa1[val as usize]; 357 | if tmp == 0 { 358 | None 359 | } else { 360 | Some(tmp) 361 | } 362 | } else { 363 | DNA_TO_AA1_LUT.get(codon).copied() 364 | } 365 | } 366 | } 367 | 368 | /// Translates a DNA or RNA sequence into a single-letter amino acid sequence. 369 | /// 370 | /// # Args 371 | /// 372 | /// * `seq` -- A nucleotide sequence. 373 | /// * `full_codons` -- If `true`, forces sequence to have length that is a multiple of 3 374 | /// and return an `Err` otherwise. 
If `false`, `ter_symbol` will be added as the last 375 | /// amino acid. This corresponds to biopython's behavior of padding the last codon with 376 | /// `N` characters. 377 | /// * `ter_symbol` -- Placeholder for the last amino acid if sequence length is not divisible 378 | /// by three and `full_codons` is `false`. 379 | /// * `translation_table` -- Indicates which codon to amino acid translation table to use. 380 | /// 381 | /// # Returns 382 | /// 383 | /// The corresponding single letter amino acid sequence. 384 | pub fn translate_cds( 385 | seq: &str, 386 | full_codons: bool, 387 | ter_symbol: &str, 388 | translation_table: TranslationTable, 389 | ) -> Result { 390 | if seq.is_empty() { 391 | return Ok("".to_string()); 392 | } 393 | 394 | if full_codons && seq.len() % 3 != 0 { 395 | return Err(Error::UntranslatableDnaLenth(seq.len())); 396 | } 397 | 398 | // Translate the codons from the input to result. 399 | let mut translator = CodonTranslator::new(translation_table); 400 | let mut result = String::with_capacity(seq.len() / 3); 401 | for chunk in seq.as_bytes().chunks_exact(3) { 402 | result.push(char::from(translator.translate(chunk)?)); 403 | } 404 | 405 | // Check for trailing bases and add the ter symbol if required. 406 | if !full_codons && seq.len() % 3 != 0 { 407 | result.push_str(ter_symbol); 408 | } 409 | 410 | Ok(result) 411 | } 412 | 413 | /// Converts sequence to normalized representation for hashing. 414 | /// 415 | /// Essentially, removes whitespace and asterisks, and uppercases the string. 416 | /// 417 | /// # Args 418 | /// 419 | /// * `seq` -- The sequence to be normalized. 420 | /// 421 | /// # Returns 422 | /// 423 | /// The sequence as a string of uppercase letters. 
424 | pub fn normalize_sequence(seq: &str) -> Result<String, Error> { 425 | let mut result = String::new(); 426 | 427 | for c in seq.chars() { 428 | if !c.is_whitespace() && c != '*' { 429 | let c = c.to_ascii_uppercase(); 430 | if c.is_alphabetic() { 431 | result.push(c) 432 | } else { 433 | return Err(Error::NotAlphabetic(c)); 434 | } 435 | } 436 | } 437 | 438 | Ok(result) 439 | } 440 | 441 | /// Convert sequence to unicode MD5 hex digest. 442 | /// 443 | /// Fails if normalization is not possible. 444 | /// 445 | /// # Args 446 | /// 447 | /// * `seq` -- A sequence 448 | /// * `normalize` -- Whether to normalize the sequence before conversion, i.e., to ensure 449 | /// representation as uppercase letters without whitespace or asterisks. 450 | /// 451 | /// # Returns 452 | /// 453 | /// Unicode MD5 hex digest representation of sequence. 454 | pub fn seq_md5(seq: &str, normalize: bool) -> Result<String, Error> { 455 | let seq = if normalize { 456 | normalize_sequence(seq)? 457 | } else { 458 | seq.to_owned() 459 | }; 460 | let mut hasher = Md5::new(); 461 | hasher.update(seq); 462 | let hash = hasher.finalize(); 463 | let mut buf = [0u8; 64]; 464 | let checksum = 465 | base16ct::lower::encode_str(&hash, &mut buf).expect("cannot perform base16 encoding"); 466 | Ok(checksum.to_owned()) 467 | } 468 | 469 | #[cfg(test)] 470 | mod test { 471 | use super::*; 472 | 473 | use pretty_assertions::assert_eq; 474 | 475 | #[test] 476 | fn suffix_trimming() { 477 | assert_eq!( 478 | trim_common_suffixes("", ""), 479 | (0, "".to_string(), "".to_string()) 480 | ); 481 | assert_eq!( 482 | trim_common_suffixes("", "C"), 483 | (0, "".to_string(), "C".to_string()) 484 | ); 485 | assert_eq!( 486 | trim_common_suffixes("C", ""), 487 | (0, "C".to_string(), "".to_string()) 488 | ); 489 | assert_eq!( 490 | trim_common_suffixes("A", "AA"), 491 | (1, "".to_string(), "A".to_string()) 492 | ); 493 | assert_eq!( 494 | trim_common_suffixes("AT", "AG"), 495 | (0, "AT".to_string(), "AG".to_string()) 496 | ); 497 | assert_eq!(
498 | trim_common_suffixes("ATCG", "AGCG"), 499 | (2, "AT".to_string(), "AG".to_string()) 500 | ); 501 | } 502 | 503 | #[test] 504 | fn prefix_trimming() { 505 | assert_eq!( 506 | trim_common_prefixes("", ""), 507 | (0, "".to_string(), "".to_string()) 508 | ); 509 | assert_eq!( 510 | trim_common_prefixes("", "C"), 511 | (0, "".to_string(), "C".to_string()) 512 | ); 513 | assert_eq!( 514 | trim_common_prefixes("C", ""), 515 | (0, "C".to_string(), "".to_string()) 516 | ); 517 | assert_eq!( 518 | trim_common_prefixes("TA", "GA"), 519 | (0, "TA".to_string(), "GA".to_string()) 520 | ); 521 | assert_eq!( 522 | trim_common_prefixes("CGTA", "CGGA"), 523 | (2, "TA".to_string(), "GA".to_string()) 524 | ); 525 | } 526 | 527 | #[test] 528 | fn revcomp_cases() { 529 | assert_eq!(revcomp(""), ""); 530 | assert_eq!(revcomp("A"), "T"); 531 | assert_eq!(revcomp("AG"), "CT"); 532 | assert_eq!(revcomp("CGAG"), "CTCG"); 533 | } 534 | 535 | #[test] 536 | fn aa_to_aa1_examples() -> Result<(), Error> { 537 | assert_eq!(aa_to_aa1("")?, ""); 538 | assert_eq!(aa_to_aa1("CATSARELAME")?, "CATSARELAME"); 539 | assert_eq!( 540 | aa_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")?, 541 | "CATSARELAME" 542 | ); 543 | 544 | Ok(()) 545 | } 546 | 547 | #[test] 548 | fn aa1_to_aa3_examples() -> Result<(), Error> { 549 | assert_eq!(aa1_to_aa3("")?, ""); 550 | assert_eq!( 551 | aa1_to_aa3("CATSARELAME")?, 552 | "CysAlaThrSerAlaArgGluLeuAlaMetGlu" 553 | ); 554 | 555 | Ok(()) 556 | } 557 | 558 | #[test] 559 | fn aa3_to_aa1_examples() -> Result<(), Error> { 560 | assert!(aa3_to_aa1("Te").is_err()); 561 | assert_eq!(aa3_to_aa1("")?, ""); 562 | assert_eq!( 563 | aa3_to_aa1("CysAlaThrSerAlaArgGluLeuAlaMetGlu")?, 564 | "CATSARELAME" 565 | ); 566 | 567 | Ok(()) 568 | } 569 | 570 | #[test] 571 | fn translate_cds_examples() -> Result<(), Error> { 572 | assert_eq!( 573 | translate_cds("ATGCGA", true, "*", TranslationTable::Standard)?, 574 | "MR" 575 | ); 576 | assert_eq!( 577 | translate_cds("AUGCGA", true, "*", 
TranslationTable::Standard)?, 578 | "MR" 579 | ); 580 | assert_eq!( 581 | translate_cds("", true, "*", TranslationTable::Standard)?, 582 | "" 583 | ); 584 | assert!(translate_cds("AUGCG", true, "*", TranslationTable::Standard).is_err()); 585 | assert_eq!( 586 | translate_cds("AUGCG", false, "*", TranslationTable::Standard)?, 587 | "M*" 588 | ); 589 | assert_eq!( 590 | translate_cds("ATGTAN", true, "*", TranslationTable::Standard)?, 591 | "MX" 592 | ); 593 | assert_eq!( 594 | translate_cds("CCN", true, "*", TranslationTable::Standard)?, 595 | "P" 596 | ); 597 | assert_eq!( 598 | translate_cds("TRA", true, "*", TranslationTable::Standard)?, 599 | "*" 600 | ); 601 | assert_eq!( 602 | translate_cds("TTNTA", false, "*", TranslationTable::Standard)?, 603 | "X*" 604 | ); 605 | assert_eq!( 606 | translate_cds("CTB", true, "*", TranslationTable::Standard)?, 607 | "L" 608 | ); 609 | assert_eq!( 610 | translate_cds("AGM", true, "*", TranslationTable::Standard)?, 611 | "X" 612 | ); 613 | assert_eq!( 614 | translate_cds("GAS", true, "*", TranslationTable::Standard)?, 615 | "X" 616 | ); 617 | assert_eq!( 618 | translate_cds("CUN", true, "*", TranslationTable::Standard)?, 619 | "L" 620 | ); 621 | assert!(translate_cds("AUGCGQ", true, "*", TranslationTable::Standard).is_err()); 622 | 623 | Ok(()) 624 | } 625 | 626 | #[test] 627 | fn seq_md5_examples() -> Result<(), Error> { 628 | assert_eq!(seq_md5("", true)?, "d41d8cd98f00b204e9800998ecf8427e"); 629 | assert_eq!(seq_md5("ACGT", true)?, "f1f8f4bf413b16ad135722aa4591043e"); 630 | assert_eq!(seq_md5("ACGT*", true)?, "f1f8f4bf413b16ad135722aa4591043e"); 631 | assert_eq!( 632 | seq_md5(" A C G T ", true)?, 633 | "f1f8f4bf413b16ad135722aa4591043e" 634 | ); 635 | assert_eq!(seq_md5("acgt", true)?, "f1f8f4bf413b16ad135722aa4591043e"); 636 | assert_eq!(seq_md5("acgt", false)?, "db516c3913e179338b162b2476d1c23f"); 637 | 638 | Ok(()) 639 | } 640 | 641 | #[test] 642 | fn normalize_sequence_examples() -> Result<(), Error> { 643 | 
assert_eq!(normalize_sequence("ACGT")?, "ACGT"); 644 | assert_eq!(normalize_sequence(" A C G T * ")?, "ACGT"); 645 | assert!(normalize_sequence("ACGT1").is_err()); 646 | 647 | Ok(()) 648 | } 649 | 650 | #[test] 651 | fn exercise_lazy_ds() { 652 | assert!(DNA_ASCII_MAP[0] == b'\0'); 653 | assert!(DNA_ASCII_TO_2BIT[b'A' as usize] == 0); 654 | assert!(AA3_TO_AA1_VEC[0] == ("Ala", "A")); 655 | assert!(DNA_TO_AA1_LUT_VEC[0] == ("AAA", "K")); 656 | assert!(DNA_TO_AA1_SEC_VEC[0] == ("AAA", "K")); 657 | assert!(DNA_TO_AA1_CHRMT_VERTEBRATE_VEC[0] == ("AAA", "K")); 658 | } 659 | 660 | #[test] 661 | fn codon_translator_standard() -> Result<(), Error> { 662 | let mut translator = CodonTranslator::new(TranslationTable::Standard); 663 | 664 | // Non-degenerate codon. 665 | assert_eq!(translator.translate(b"AAA")?, b'K'); 666 | // Degenerate codon. 667 | assert_eq!(translator.translate(b"AAR")?, b'K'); 668 | 669 | Ok(()) 670 | } 671 | 672 | #[test] 673 | fn codon_translator_sec() -> Result<(), Error> { 674 | let mut translator = CodonTranslator::new(TranslationTable::Selenocysteine); 675 | 676 | // Non-degenerate codon. 677 | assert_eq!(translator.translate(b"AAA")?, b'K'); 678 | // Degenerate codon. 679 | assert_eq!(translator.translate(b"AAR")?, b'K'); 680 | 681 | Ok(()) 682 | } 683 | 684 | #[test] 685 | fn codon_translator_chrmt_vertebrate() -> Result<(), Error> { 686 | let mut translator = CodonTranslator::new(TranslationTable::VertebrateMitochondrial); 687 | 688 | // Non-degenerate codon. 689 | assert_eq!(translator.translate(b"AAA")?, b'K'); 690 | // Degenerate codon. 691 | assert_eq!(translator.translate(b"AAR")?, b'K'); 692 | 693 | Ok(()) 694 | } 695 | } 696 | 697 | // 698 | // Copyright 2023 hgvs-rs Contributors 699 | // Copyright 2014 Bioutils Contributors 700 | // 701 | // Licensed under the Apache License, Version 2.0 (the "License"); 702 | // you may not use this file except in compliance with the License. 
703 | // You may obtain a copy of the License at 704 | // 705 | //     http://www.apache.org/licenses/LICENSE-2.0 706 | // 707 | // Unless required by applicable law or agreed to in writing, software 708 | // distributed under the License is distributed on an "AS IS" BASIS, 709 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 710 | // See the License for the specific language governing permissions and 711 | // limitations under the License. 712 | // 713 | -------------------------------------------------------------------------------- /src/validator/error.rs: -------------------------------------------------------------------------------- 1 | //! Error type definition. 2 | 3 | use thiserror::Error; 4 | 5 | /// Error type for validation of HGVS expressions. 6 | #[derive(Error, Debug, Clone)] 7 | pub enum Error { 8 | #[error("ref or alt must be non-empty in {0}")] 9 | RefOrAltMustBeNonEmpty(String), 10 | #[error("number of deleted bases must be positive in {0}")] 11 | NumDelBasesNotPositive(String), 12 | #[error("number of alternative bases must be positive in {0}")] 13 | NumAltBasesEmpty(String), 14 | #[error("number of inverted bases must be positive in {0}")] 15 | NumInvBasesNotPositive(String), 16 | 17 | #[error("length implied by coordinates must equal count: {0}")] 18 | ImpliedLengthMismatch(String), 19 | #[error("start must be >=1 in {0}")] 20 | StartMustBePositive(String), 21 | #[error("end must be >=1 in {0}")] 22 | EndMustBePositive(String), 23 | #[error("start <= end must hold in {0}")] 24 | StartMustBeLessThanEnd(String), 25 | } 26 | -------------------------------------------------------------------------------- /src/validator/mod.rs: -------------------------------------------------------------------------------- 1 | //! Implementation of validation. 
2 | 3 | mod error; 4 | 5 | use std::sync::Arc; 6 | 7 | use log::{error, warn}; 8 | 9 | pub use crate::validator::error::Error; 10 | use crate::{ 11 | data::interface::Provider, 12 | mapper::{variant::Config, variant::Mapper}, 13 | parser::HgvsVariant, 14 | }; 15 | 16 | /// Trait for validating variants, locations, etc. 17 | pub trait Validateable { 18 | fn validate(&self) -> Result<(), Error>; 19 | } 20 | 21 | /// Validation level specification. 22 | #[derive(Debug, PartialEq, Clone, Copy)] 23 | pub enum ValidationLevel { 24 | /// No validation. 25 | Null, 26 | /// Only inspect the variant description itself. 27 | Intrinsic, 28 | /// Full validation including checks based on sequence and intrinsics. 29 | Full, 30 | } 31 | 32 | impl ValidationLevel { 33 | pub fn validator( 34 | &self, 35 | strict: bool, 36 | provider: Arc<dyn Provider + Send + Sync>, 37 | ) -> Arc<dyn Validator + Send + Sync> { 38 | match self { 39 | ValidationLevel::Null => Arc::new(NullValidator::new()), 40 | ValidationLevel::Intrinsic => Arc::new(IntrinsicValidator::new(strict)), 41 | ValidationLevel::Full => Arc::new(FullValidator::new(strict, provider)), 42 | } 43 | } 44 | } 45 | 46 | /// Trait for validators. 47 | pub trait Validator { 48 | /// Return whether validation is strict. 49 | /// 50 | /// Validation is strict if errors cause `Err` results rather than just logging a warning. 51 | fn is_strict(&self) -> bool; 52 | 53 | /// Validate the given variant. 54 | /// 55 | /// Depending on the configuration and implementation of the validator, an `Err` will be 56 | /// returned or only a warning will be logged. 57 | fn validate(&self, var: &HgvsVariant) -> Result<(), Error>; 58 | } 59 | 60 | /// A validator that performs no validation. 
61 | pub struct NullValidator {} 62 | 63 | impl NullValidator { 64 | pub fn new() -> Self { 65 | Self {} 66 | } 67 | } 68 | 69 | impl Default for NullValidator { 70 | fn default() -> Self { 71 | Self::new() 72 | } 73 | } 74 | 75 | impl Validator for NullValidator { 76 | fn is_strict(&self) -> bool { 77 | false 78 | } 79 | 80 | fn validate(&self, _var: &HgvsVariant) -> Result<(), Error> { 81 | Ok(()) 82 | } 83 | } 84 | 85 | /// A validator that only performs intrinsic validation. 86 | /// 87 | /// This means that only the variant description itself is checked without considering the 88 | /// actual sequence. 89 | pub struct IntrinsicValidator { 90 | strict: bool, 91 | } 92 | 93 | impl IntrinsicValidator { 94 | pub fn new(strict: bool) -> Self { 95 | Self { strict } 96 | } 97 | } 98 | 99 | impl Validator for IntrinsicValidator { 100 | fn is_strict(&self) -> bool { 101 | self.strict 102 | } 103 | 104 | fn validate(&self, var: &HgvsVariant) -> Result<(), Error> { 105 | let res = var.validate(); 106 | match (&res, self.is_strict()) { 107 | (Ok(_), _) => Ok(()), 108 | (Err(_), false) => { 109 | warn!("Validation of {} failed: {:?}", var, res); 110 | Ok(()) 111 | } 112 | (Err(_), true) => { 113 | error!("Validation of {} failed: {:?}", var, res); 114 | res 115 | } 116 | } 117 | } 118 | } 119 | 120 | /// Attempts to determine if the HGVS name validates against external data sources. 121 | pub struct ExtrinsicValidator { 122 | strict: bool, 123 | #[allow(dead_code)] 124 | mapper: Mapper, 125 | } 126 | 127 | impl ExtrinsicValidator { 128 | pub fn new(strict: bool, provider: Arc<dyn Provider + Send + Sync>) -> Self { 129 | let config = Config { 130 | replace_reference: false, 131 | strict_validation: false, 132 | prevalidation_level: ValidationLevel::Null, 133 | add_gene_symbol: false, 134 | strict_bounds: true, 135 | renormalize_g: false, 136 | genome_seq_available: true, 137 | }; 138 | Self { 139 | strict, 140 | mapper: Mapper::new(&config, provider), 141 | } 142 | } 143 | } 144 | 145 | impl Validator 
for ExtrinsicValidator { 146 | fn is_strict(&self) -> bool { 147 | self.strict 148 | } 149 | 150 | fn validate(&self, var: &HgvsVariant) -> Result<(), Error> { 151 | // Check transcript bounds 152 | match var { 153 | HgvsVariant::CdsVariant { .. } | HgvsVariant::TxVariant { .. } => { 154 | let res = self.check_tx_bound(var); 155 | if res.is_err() { 156 | if self.is_strict() { 157 | error!("Validation of {} failed: {:?}", var, res); 158 | return res; 159 | } else { 160 | warn!("Validation of {} failed: {:?}", var, res); 161 | } 162 | } 163 | } 164 | _ => {} 165 | } 166 | 167 | // Check CDS bounds 168 | { 169 | let res = self.check_cds_bound(var); 170 | if res.is_err() { 171 | if self.is_strict() { 172 | error!("Validation of {} failed: {:?}", var, res); 173 | return res; 174 | } else { 175 | warn!("Validation of {} failed: {:?}", var, res); 176 | } 177 | } 178 | } 179 | 180 | // Check reference. 181 | { 182 | let res = self.check_ref(var); 183 | if res.is_err() { 184 | if self.is_strict() { 185 | error!("Validation of {} failed: {:?}", var, res); 186 | return res; 187 | } else { 188 | warn!("Validation of {} failed: {:?}", var, res); 189 | } 190 | } 191 | } 192 | 193 | Ok(()) 194 | } 195 | } 196 | 197 | impl ExtrinsicValidator { 198 | fn check_tx_bound(&self, _var: &HgvsVariant) -> Result<(), Error> { 199 | Ok(()) // TODO 200 | } 201 | 202 | fn check_cds_bound(&self, _var: &HgvsVariant) -> Result<(), Error> { 203 | Ok(()) // TODO 204 | } 205 | 206 | fn check_ref(&self, _var: &HgvsVariant) -> Result<(), Error> { 207 | Ok(()) // TODO 208 | } 209 | } 210 | 211 | /// Full validator performing both intrinsic and extrinsic validation. 
212 | pub struct FullValidator { 213 | intrinsic: IntrinsicValidator, 214 | extrinsic: ExtrinsicValidator, 215 | } 216 | 217 | impl FullValidator { 218 | pub fn new(strict: bool, provider: Arc<dyn Provider + Send + Sync>) -> Self { 219 | Self { 220 | intrinsic: IntrinsicValidator::new(strict), 221 | extrinsic: ExtrinsicValidator::new(strict, provider), 222 | } 223 | } 224 | } 225 | 226 | impl Validator for FullValidator { 227 | fn is_strict(&self) -> bool { 228 | self.intrinsic.is_strict() 229 | } 230 | 231 | fn validate(&self, var: &HgvsVariant) -> Result<(), Error> { 232 | self.intrinsic.validate(var)?; 233 | self.extrinsic.validate(var) 234 | } 235 | } 236 | 237 | // 238 | // Copyright 2023 hgvs-rs Contributors 239 | // Copyright 2014 Bioutils Contributors 240 | // 241 | // Licensed under the Apache License, Version 2.0 (the "License"); 242 | // you may not use this file except in compliance with the License. 243 | // You may obtain a copy of the License at 244 | // 245 | //     http://www.apache.org/licenses/LICENSE-2.0 246 | // 247 | // Unless required by applicable law or agreed to in writing, software 248 | // distributed under the License is distributed on an "AS IS" BASIS, 249 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 250 | // See the License for the specific language governing permissions and 251 | // limitations under the License. 
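The strict/lenient dispatch that `IntrinsicValidator` and `ExtrinsicValidator` repeat above (log a warning and continue in lenient mode, propagate the error in strict mode) can be factored into one small helper. The following is a minimal standalone sketch, not part of the hgvs-rs API; the names `Mode` and `enforce` are illustrative, and `eprintln!` stands in for the `log::warn!`/`log::error!` calls the real code uses:

```rust
// Minimal sketch of the strict vs. lenient validation dispatch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Strict,
    Lenient,
}

fn enforce(mode: Mode, res: Result<(), String>) -> Result<(), String> {
    match (res, mode) {
        // Success passes through regardless of mode.
        (Ok(()), _) => Ok(()),
        // Lenient mode: log the failure and report success anyway.
        (Err(e), Mode::Lenient) => {
            eprintln!("validation failed (ignored): {e}");
            Ok(())
        }
        // Strict mode: propagate the failure to the caller.
        (Err(e), Mode::Strict) => Err(e),
    }
}

fn main() {
    assert!(enforce(Mode::Lenient, Err("bad ref".into())).is_ok());
    assert!(enforce(Mode::Strict, Err("bad ref".into())).is_err());
    assert!(enforce(Mode::Strict, Ok(())).is_ok());
}
```

With such a helper, each of the three check blocks in `ExtrinsicValidator::validate` collapses to a single `enforce(...)?` call.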
252 | // 253 | -------------------------------------------------------------------------------- /tables.in: -------------------------------------------------------------------------------- 1 | const DNA_ASCII_TO_2BIT: [u8; 256] = { 2 | let mut result = [255; 256]; 3 | 4 | result[b'A' as usize] = 0; 5 | result[b'a' as usize] = 0; 6 | 7 | result[b'C' as usize] = 1; 8 | result[b'c' as usize] = 1; 9 | 10 | result[b'G' as usize] = 2; 11 | result[b'g' as usize] = 2; 12 | 13 | result[b'T' as usize] = 3; 14 | result[b't' as usize] = 3; 15 | result[b'U' as usize] = 3; 16 | result[b'u' as usize] = 3; 17 | result 18 | }; 19 | 20 | pub const AA3_TO_AA1_VEC: &[(&str, &str)] = &[ 21 | ("Ala", "A"), 22 | ("Arg", "R"), 23 | ("Asn", "N"), 24 | ("Asp", "D"), 25 | ("Cys", "C"), 26 | ("Gln", "Q"), 27 | ("Glu", "E"), 28 | ("Gly", "G"), 29 | ("His", "H"), 30 | ("Ile", "I"), 31 | ("Leu", "L"), 32 | ("Lys", "K"), 33 | ("Met", "M"), 34 | ("Phe", "F"), 35 | ("Pro", "P"), 36 | ("Ser", "S"), 37 | ("Thr", "T"), 38 | ("Trp", "W"), 39 | ("Tyr", "Y"), 40 | ("Val", "V"), 41 | ("Xaa", "X"), 42 | ("Ter", "*"), 43 | ("Sec", "U"), 44 | ]; 45 | 46 | const DNA_TO_AA1_LUT_VEC: &[(&str, &str)] = &[ 47 | ("AAA", "K"), 48 | ("AAC", "N"), 49 | ("AAG", "K"), 50 | ("AAT", "N"), 51 | ("ACA", "T"), 52 | ("ACC", "T"), 53 | ("ACG", "T"), 54 | ("ACT", "T"), 55 | ("AGA", "R"), 56 | ("AGC", "S"), 57 | ("AGG", "R"), 58 | ("AGT", "S"), 59 | ("ATA", "I"), 60 | ("ATC", "I"), 61 | ("ATG", "M"), 62 | ("ATT", "I"), 63 | ("CAA", "Q"), 64 | ("CAC", "H"), 65 | ("CAG", "Q"), 66 | ("CAT", "H"), 67 | ("CCA", "P"), 68 | ("CCC", "P"), 69 | ("CCG", "P"), 70 | ("CCT", "P"), 71 | ("CGA", "R"), 72 | ("CGC", "R"), 73 | ("CGG", "R"), 74 | ("CGT", "R"), 75 | ("CTA", "L"), 76 | ("CTC", "L"), 77 | ("CTG", "L"), 78 | ("CTT", "L"), 79 | ("GAA", "E"), 80 | ("GAC", "D"), 81 | ("GAG", "E"), 82 | ("GAT", "D"), 83 | ("GCA", "A"), 84 | ("GCC", "A"), 85 | ("GCG", "A"), 86 | ("GCT", "A"), 87 | ("GGA", "G"), 88 | ("GGC", "G"), 89 | ("GGG", "G"), 90 | 
("GGT", "G"), 91 | ("GTA", "V"), 92 | ("GTC", "V"), 93 | ("GTG", "V"), 94 | ("GTT", "V"), 95 | ("TAA", "*"), 96 | ("TAC", "Y"), 97 | ("TAG", "*"), 98 | ("TAT", "Y"), 99 | ("TCA", "S"), 100 | ("TCC", "S"), 101 | ("TCG", "S"), 102 | ("TCT", "S"), 103 | // caveat lector 104 | ("TGA", "*"), 105 | ("TGC", "C"), 106 | ("TGG", "W"), 107 | ("TGT", "C"), 108 | ("TTA", "L"), 109 | ("TTC", "F"), 110 | ("TTG", "L"), 111 | ("TTT", "F"), 112 | // degenerate codons 113 | ("AAR", "K"), 114 | ("AAY", "N"), 115 | ("ACB", "T"), 116 | ("ACD", "T"), 117 | ("ACH", "T"), 118 | ("ACK", "T"), 119 | ("ACM", "T"), 120 | ("ACN", "T"), 121 | ("ACR", "T"), 122 | ("ACS", "T"), 123 | ("ACV", "T"), 124 | ("ACW", "T"), 125 | ("ACY", "T"), 126 | ("AGR", "R"), 127 | ("AGY", "S"), 128 | ("ATH", "I"), 129 | ("ATM", "I"), 130 | ("ATW", "I"), 131 | ("ATY", "I"), 132 | ("CAR", "Q"), 133 | ("CAY", "H"), 134 | ("CCB", "P"), 135 | ("CCD", "P"), 136 | ("CCH", "P"), 137 | ("CCK", "P"), 138 | ("CCM", "P"), 139 | ("CCN", "P"), 140 | ("CCR", "P"), 141 | ("CCS", "P"), 142 | ("CCV", "P"), 143 | ("CCW", "P"), 144 | ("CCY", "P"), 145 | ("CGB", "R"), 146 | ("CGD", "R"), 147 | ("CGH", "R"), 148 | ("CGK", "R"), 149 | ("CGM", "R"), 150 | ("CGN", "R"), 151 | ("CGR", "R"), 152 | ("CGS", "R"), 153 | ("CGV", "R"), 154 | ("CGW", "R"), 155 | ("CGY", "R"), 156 | ("CTB", "L"), 157 | ("CTD", "L"), 158 | ("CTH", "L"), 159 | ("CTK", "L"), 160 | ("CTM", "L"), 161 | ("CTN", "L"), 162 | ("CTR", "L"), 163 | ("CTS", "L"), 164 | ("CTV", "L"), 165 | ("CTW", "L"), 166 | ("CTY", "L"), 167 | ("GAR", "E"), 168 | ("GAY", "D"), 169 | ("GCB", "A"), 170 | ("GCD", "A"), 171 | ("GCH", "A"), 172 | ("GCK", "A"), 173 | ("GCM", "A"), 174 | ("GCN", "A"), 175 | ("GCR", "A"), 176 | ("GCS", "A"), 177 | ("GCV", "A"), 178 | ("GCW", "A"), 179 | ("GCY", "A"), 180 | ("GGB", "G"), 181 | ("GGD", "G"), 182 | ("GGH", "G"), 183 | ("GGK", "G"), 184 | ("GGM", "G"), 185 | ("GGN", "G"), 186 | ("GGR", "G"), 187 | ("GGS", "G"), 188 | ("GGV", "G"), 189 | ("GGW", "G"), 190 
| ("GGY", "G"), 191 | ("GTB", "V"), 192 | ("GTD", "V"), 193 | ("GTH", "V"), 194 | ("GTK", "V"), 195 | ("GTM", "V"), 196 | ("GTN", "V"), 197 | ("GTR", "V"), 198 | ("GTS", "V"), 199 | ("GTV", "V"), 200 | ("GTW", "V"), 201 | ("GTY", "V"), 202 | ("MGA", "R"), 203 | ("MGG", "R"), 204 | ("MGR", "R"), 205 | ("TAR", "*"), 206 | ("TAY", "Y"), 207 | ("TCB", "S"), 208 | ("TCD", "S"), 209 | ("TCH", "S"), 210 | ("TCK", "S"), 211 | ("TCM", "S"), 212 | ("TCN", "S"), 213 | ("TCR", "S"), 214 | ("TCS", "S"), 215 | ("TCV", "S"), 216 | ("TCW", "S"), 217 | ("TCY", "S"), 218 | ("TGY", "C"), 219 | ("TRA", "*"), 220 | ("TTR", "L"), 221 | ("TTY", "F"), 222 | ("YTA", "L"), 223 | ("YTG", "L"), 224 | ("YTR", "L"), 225 | ]; 226 | 227 | /// Translation table for selenocysteine. 228 | const DNA_TO_AA1_SEC_VEC: &[(&str, &str)] = &[ 229 | ("AAA", "K"), 230 | ("AAC", "N"), 231 | ("AAG", "K"), 232 | ("AAT", "N"), 233 | ("ACA", "T"), 234 | ("ACC", "T"), 235 | ("ACG", "T"), 236 | ("ACT", "T"), 237 | ("AGA", "R"), 238 | ("AGC", "S"), 239 | ("AGG", "R"), 240 | ("AGT", "S"), 241 | ("ATA", "I"), 242 | ("ATC", "I"), 243 | ("ATG", "M"), 244 | ("ATT", "I"), 245 | ("CAA", "Q"), 246 | ("CAC", "H"), 247 | ("CAG", "Q"), 248 | ("CAT", "H"), 249 | ("CCA", "P"), 250 | ("CCC", "P"), 251 | ("CCG", "P"), 252 | ("CCT", "P"), 253 | ("CGA", "R"), 254 | ("CGC", "R"), 255 | ("CGG", "R"), 256 | ("CGT", "R"), 257 | ("CTA", "L"), 258 | ("CTC", "L"), 259 | ("CTG", "L"), 260 | ("CTT", "L"), 261 | ("GAA", "E"), 262 | ("GAC", "D"), 263 | ("GAG", "E"), 264 | ("GAT", "D"), 265 | ("GCA", "A"), 266 | ("GCC", "A"), 267 | ("GCG", "A"), 268 | ("GCT", "A"), 269 | ("GGA", "G"), 270 | ("GGC", "G"), 271 | ("GGG", "G"), 272 | ("GGT", "G"), 273 | ("GTA", "V"), 274 | ("GTC", "V"), 275 | ("GTG", "V"), 276 | ("GTT", "V"), 277 | ("TAA", "*"), 278 | ("TAC", "Y"), 279 | ("TAG", "*"), 280 | ("TAT", "Y"), 281 | ("TCA", "S"), 282 | ("TCC", "S"), 283 | ("TCG", "S"), 284 | ("TCT", "S"), 285 | // caveat lector 286 | ("TGA", "U"), 287 | ("TGC", "C"), 288 
| ("TGG", "W"), 289 | ("TGT", "C"), 290 | ("TTA", "L"), 291 | ("TTC", "F"), 292 | ("TTG", "L"), 293 | ("TTT", "F"), 294 | // degenerate codons 295 | ("AAR", "K"), 296 | ("AAY", "N"), 297 | ("ACB", "T"), 298 | ("ACD", "T"), 299 | ("ACH", "T"), 300 | ("ACK", "T"), 301 | ("ACM", "T"), 302 | ("ACN", "T"), 303 | ("ACR", "T"), 304 | ("ACS", "T"), 305 | ("ACV", "T"), 306 | ("ACW", "T"), 307 | ("ACY", "T"), 308 | ("AGR", "R"), 309 | ("AGY", "S"), 310 | ("ATH", "I"), 311 | ("ATM", "I"), 312 | ("ATW", "I"), 313 | ("ATY", "I"), 314 | ("CAR", "Q"), 315 | ("CAY", "H"), 316 | ("CCB", "P"), 317 | ("CCD", "P"), 318 | ("CCH", "P"), 319 | ("CCK", "P"), 320 | ("CCM", "P"), 321 | ("CCN", "P"), 322 | ("CCR", "P"), 323 | ("CCS", "P"), 324 | ("CCV", "P"), 325 | ("CCW", "P"), 326 | ("CCY", "P"), 327 | ("CGB", "R"), 328 | ("CGD", "R"), 329 | ("CGH", "R"), 330 | ("CGK", "R"), 331 | ("CGM", "R"), 332 | ("CGN", "R"), 333 | ("CGR", "R"), 334 | ("CGS", "R"), 335 | ("CGV", "R"), 336 | ("CGW", "R"), 337 | ("CGY", "R"), 338 | ("CTB", "L"), 339 | ("CTD", "L"), 340 | ("CTH", "L"), 341 | ("CTK", "L"), 342 | ("CTM", "L"), 343 | ("CTN", "L"), 344 | ("CTR", "L"), 345 | ("CTS", "L"), 346 | ("CTV", "L"), 347 | ("CTW", "L"), 348 | ("CTY", "L"), 349 | ("GAR", "E"), 350 | ("GAY", "D"), 351 | ("GCB", "A"), 352 | ("GCD", "A"), 353 | ("GCH", "A"), 354 | ("GCK", "A"), 355 | ("GCM", "A"), 356 | ("GCN", "A"), 357 | ("GCR", "A"), 358 | ("GCS", "A"), 359 | ("GCV", "A"), 360 | ("GCW", "A"), 361 | ("GCY", "A"), 362 | ("GGB", "G"), 363 | ("GGD", "G"), 364 | ("GGH", "G"), 365 | ("GGK", "G"), 366 | ("GGM", "G"), 367 | ("GGN", "G"), 368 | ("GGR", "G"), 369 | ("GGS", "G"), 370 | ("GGV", "G"), 371 | ("GGW", "G"), 372 | ("GGY", "G"), 373 | ("GTB", "V"), 374 | ("GTD", "V"), 375 | ("GTH", "V"), 376 | ("GTK", "V"), 377 | ("GTM", "V"), 378 | ("GTN", "V"), 379 | ("GTR", "V"), 380 | ("GTS", "V"), 381 | ("GTV", "V"), 382 | ("GTW", "V"), 383 | ("GTY", "V"), 384 | ("MGA", "R"), 385 | ("MGG", "R"), 386 | ("MGR", "R"), 387 | ("TAR", 
"*"), 388 | ("TAY", "Y"), 389 | ("TCB", "S"), 390 | ("TCD", "S"), 391 | ("TCH", "S"), 392 | ("TCK", "S"), 393 | ("TCM", "S"), 394 | ("TCN", "S"), 395 | ("TCR", "S"), 396 | ("TCS", "S"), 397 | ("TCV", "S"), 398 | ("TCW", "S"), 399 | ("TCY", "S"), 400 | ("TGY", "C"), 401 | ("TRA", "*"), 402 | ("TTR", "L"), 403 | ("TTY", "F"), 404 | ("YTA", "L"), 405 | ("YTG", "L"), 406 | ("YTR", "L"), 407 | ]; 408 | 409 | /// Vertebrate mitochondrial code, cf. https://en.wikipedia.org/wiki/Vertebrate_mitochondrial_code 410 | const DNA_TO_AA1_CHRMT_VERTEBRATE_VEC: &[(&str, &str)] = &[ 411 | ("AAA", "K"), 412 | ("AAC", "N"), 413 | ("AAG", "K"), 414 | ("AAT", "N"), 415 | ("ACA", "T"), 416 | ("ACC", "T"), 417 | ("ACG", "T"), 418 | ("ACT", "T"), 419 | // caveat lector 420 | ("AGA", "*"), 421 | ("AGC", "S"), 422 | // caveat lector 423 | ("AGG", "*"), 424 | ("AGT", "S"), 425 | // caveat lector 426 | ("ATA", "M"), 427 | ("ATC", "I"), 428 | ("ATG", "M"), 429 | ("ATT", "I"), 430 | ("CAA", "Q"), 431 | ("CAC", "H"), 432 | ("CAG", "Q"), 433 | ("CAT", "H"), 434 | ("CCA", "P"), 435 | ("CCC", "P"), 436 | ("CCG", "P"), 437 | ("CCT", "P"), 438 | ("CGA", "R"), 439 | ("CGC", "R"), 440 | ("CGG", "R"), 441 | ("CGT", "R"), 442 | ("CTA", "L"), 443 | ("CTC", "L"), 444 | ("CTG", "L"), 445 | ("CTT", "L"), 446 | ("GAA", "E"), 447 | ("GAC", "D"), 448 | ("GAG", "E"), 449 | ("GAT", "D"), 450 | ("GCA", "A"), 451 | ("GCC", "A"), 452 | ("GCG", "A"), 453 | ("GCT", "A"), 454 | ("GGA", "G"), 455 | ("GGC", "G"), 456 | ("GGG", "G"), 457 | ("GGT", "G"), 458 | ("GTA", "V"), 459 | ("GTC", "V"), 460 | ("GTG", "V"), 461 | ("GTT", "V"), 462 | ("TAA", "*"), 463 | ("TAC", "Y"), 464 | ("TAG", "*"), 465 | ("TAT", "Y"), 466 | ("TCA", "S"), 467 | ("TCC", "S"), 468 | ("TCG", "S"), 469 | ("TCT", "S"), 470 | // caveat lector 471 | ("TGA", "W"), 472 | ("TGC", "C"), 473 | ("TGG", "W"), 474 | ("TGT", "C"), 475 | ("TTA", "L"), 476 | ("TTC", "F"), 477 | ("TTG", "L"), 478 | ("TTT", "F"), 479 | // degenerate codons 480 | ("AAR", "K"), 481 | 
("AAY", "N"), 482 | ("ACB", "T"), 483 | ("ACD", "T"), 484 | ("ACH", "T"), 485 | ("ACK", "T"), 486 | ("ACM", "T"), 487 | ("ACN", "T"), 488 | ("ACR", "T"), 489 | ("ACS", "T"), 490 | ("ACV", "T"), 491 | ("ACW", "T"), 492 | ("ACY", "T"), 493 | ("AGR", "R"), 494 | ("AGY", "S"), 495 | ("ATH", "I"), 496 | ("ATM", "I"), 497 | ("ATW", "I"), 498 | ("ATY", "I"), 499 | ("CAR", "Q"), 500 | ("CAY", "H"), 501 | ("CCB", "P"), 502 | ("CCD", "P"), 503 | ("CCH", "P"), 504 | ("CCK", "P"), 505 | ("CCM", "P"), 506 | ("CCN", "P"), 507 | ("CCR", "P"), 508 | ("CCS", "P"), 509 | ("CCV", "P"), 510 | ("CCW", "P"), 511 | ("CCY", "P"), 512 | ("CGB", "R"), 513 | ("CGD", "R"), 514 | ("CGH", "R"), 515 | ("CGK", "R"), 516 | ("CGM", "R"), 517 | ("CGN", "R"), 518 | ("CGR", "R"), 519 | ("CGS", "R"), 520 | ("CGV", "R"), 521 | ("CGW", "R"), 522 | ("CGY", "R"), 523 | ("CTB", "L"), 524 | ("CTD", "L"), 525 | ("CTH", "L"), 526 | ("CTK", "L"), 527 | ("CTM", "L"), 528 | ("CTN", "L"), 529 | ("CTR", "L"), 530 | ("CTS", "L"), 531 | ("CTV", "L"), 532 | ("CTW", "L"), 533 | ("CTY", "L"), 534 | ("GAR", "E"), 535 | ("GAY", "D"), 536 | ("GCB", "A"), 537 | ("GCD", "A"), 538 | ("GCH", "A"), 539 | ("GCK", "A"), 540 | ("GCM", "A"), 541 | ("GCN", "A"), 542 | ("GCR", "A"), 543 | ("GCS", "A"), 544 | ("GCV", "A"), 545 | ("GCW", "A"), 546 | ("GCY", "A"), 547 | ("GGB", "G"), 548 | ("GGD", "G"), 549 | ("GGH", "G"), 550 | ("GGK", "G"), 551 | ("GGM", "G"), 552 | ("GGN", "G"), 553 | ("GGR", "G"), 554 | ("GGS", "G"), 555 | ("GGV", "G"), 556 | ("GGW", "G"), 557 | ("GGY", "G"), 558 | ("GTB", "V"), 559 | ("GTD", "V"), 560 | ("GTH", "V"), 561 | ("GTK", "V"), 562 | ("GTM", "V"), 563 | ("GTN", "V"), 564 | ("GTR", "V"), 565 | ("GTS", "V"), 566 | ("GTV", "V"), 567 | ("GTW", "V"), 568 | ("GTY", "V"), 569 | ("MGA", "R"), 570 | ("MGG", "R"), 571 | ("MGR", "R"), 572 | ("TAR", "*"), 573 | ("TAY", "Y"), 574 | ("TCB", "S"), 575 | ("TCD", "S"), 576 | ("TCH", "S"), 577 | ("TCK", "S"), 578 | ("TCM", "S"), 579 | ("TCN", "S"), 580 | ("TCR", "S"), 581 | 
("TCS", "S"), 582 | ("TCV", "S"), 583 | ("TCW", "S"), 584 | ("TCY", "S"), 585 | ("TGY", "C"), 586 | ("TRA", "*"), 587 | ("TTR", "L"), 588 | ("TTY", "F"), 589 | ("YTA", "L"), 590 | ("YTG", "L"), 591 | ("YTR", "L"), 592 | ]; 593 | 594 | const IUPAC_AMBIGUITY_CODES: [u8; 13] = *b"BDHVNUWSMKRYZ"; 595 | -------------------------------------------------------------------------------- /tests/data/data/.gitignore: -------------------------------------------------------------------------------- 1 | download/* 2 | -------------------------------------------------------------------------------- /tests/data/data/bootstrap.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | 3 | # Setup Logging ------------------------------------------------------------- 4 | 5 | log() 6 | { 7 | >&2 echo $@ 8 | } 9 | 10 | debug() 11 | { 12 | [[ "${VERBOSE-0}" -ne 0 ]] && >&2 echo $@ 13 | } 14 | 15 | set -euo pipefail 16 | 17 | if [[ "${VERBOSE-0}" -ne 0 ]]; then 18 | set -x 19 | fi 20 | 21 | psql-uta() 22 | { 23 | echo "set schema '$VERSION'; $1" \ 24 | | PGPASSWORD=anonymous psql --csv -h uta.biocommons.org -U anonymous -d uta \ 25 | | tail -n +3 \ 26 | | sort 27 | } 28 | 29 | pg-list() 30 | { 31 | echo $* \ 32 | | tr ' ' '\n' \ 33 | | sed -e "s/^/'/g" -e "s/$/'/g" \ 34 | | tr '\n' ',' \ 35 | | sed -e 's/,$//g' \ 36 | | sed -e "s/^/(/g" -e "s/$/)/g" 37 | } 38 | 39 | # Initialization ------------------------------------------------------------ 40 | 41 | if [[ "$#" -ne 2 ]]; then 42 | log "USAGE: bootstrap.sh DL_URL VERSION" 43 | log "" 44 | log "E.g.: bootstrap.sh http://dl.biocommons.org/uta uta_20210129" 45 | log "" 46 | log "Set VERBOSE=1 to increase verbosity." 47 | exit 1 48 | fi 49 | 50 | # path to the directory where the script resides. 51 | SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd ) 52 | 53 | # Download URL. 54 | DL_URL=$1 55 | 56 | # Database name/version. 
57 | VERSION=$2 58 | 59 | # Destination directory. 60 | DST=$SCRIPT_DIR 61 | 62 | # The HGNC symbols of the genes to fetch. 63 | set +e 64 | read -r -d '' GENES <download/gene.tsv 189 | psql-uta "select ac from transcript where hgnc in $PG_GENES;" >download/transcript.tsv 190 | psql-uta "select seq_anno_id from seq_anno where ac in (select ac from transcript where hgnc in $PG_GENES);" >download/seq_anno.tsv 191 | psql-uta "select seq_id from seq where seq_id in (select seq_id from seq_anno where ac in (select ac from transcript where hgnc in $PG_GENES));" >download/seq.tsv 192 | psql-uta "select associated_accession_id from associated_accessions where tx_ac in (select ac from transcript where hgnc in $PG_GENES);" >download/associated_accessions.tsv 193 | psql-uta "select exon_set_id from exon_set where tx_ac in (select ac from transcript where hgnc in $PG_GENES);" >download/exon_set.tsv 194 | psql-uta "select exon_id from exon where exon_set_id in (select exon_set_id from exon_set where tx_ac in (select ac from transcript where hgnc in $PG_GENES));" >download/exon.tsv 195 | psql-uta "select exon_aln_id from exon_aln where tx_exon_id in (select exon_id from exon where exon_set_id in (select exon_set_id from exon_set where tx_ac in (select ac from transcript where hgnc in $PG_GENES)));" >download/exon_aln.tsv 196 | 197 | # build sql subset 198 | 199 | pigz -d -c download/$VERSION.pgd.gz \ 200 | | awk -F ' ' -f subset.awk \ 201 | | pigz -c \ 202 | > $VERSION-subset.pgd.gz 203 | -------------------------------------------------------------------------------- /tests/data/data/cdot/extract_gene.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Extract transcripts from cdot JSON file.""" 3 | 4 | import gzip 5 | import json 6 | import os 7 | import sys 8 | import typing 9 | 10 | #: HGNC identifiers of genes to extract, from env or BRCA1. 
11 | HGNC_IDS = os.environ.get("HGNC_IDS", "HGNC:1100").split(",") 12 | 13 | 14 | def load_json(json_path: str) -> typing.Dict[str, typing.Any]: 15 | """Load JSON file.""" 16 | print(f"Loading {json_path}", file=sys.stderr) 17 | if json_path.endswith(".gz"): 18 | with gzip.open(json_path, "rt") as json_file: 19 | return json.load(json_file) 20 | else: 21 | with open(json_path, "rt") as json_file: 22 | return json.load(json_file) 23 | 24 | 25 | def extract_gene( 26 | json_path: str, hgncs_stripped: typing.List[str] 27 | ) -> typing.Dict[str, typing.Any]: 28 | """Extract data of genes, specified by HGNC identifiers without prefix.""" 29 | full_data = load_json(json_path) 30 | print("getting keys...", file=sys.stderr) 31 | keys = [] 32 | for key, gene in full_data["genes"].items(): 33 | if "hgnc" in gene and gene["hgnc"] in hgncs_stripped and key not in keys: 34 | keys.append(key) 35 | print(f"- keys = {keys}", file=sys.stderr) 36 | print("extracting...", file=sys.stderr) 37 | full_data["genes"] = {key: full_data["genes"][key] for key in keys} 38 | full_data["transcripts"] = { 39 | key: value 40 | for key, value in full_data["transcripts"].items() 41 | if "hgnc" in value and value["hgnc"] in hgncs_stripped 42 | } 43 | return full_data 44 | 45 | 46 | def main(json_paths: typing.List[str]): 47 | """Extract transcripts from cdot JSON file.""" 48 | hgncs_stripped = [hgnc_id.replace("HGNC:", "") for hgnc_id in HGNC_IDS] 49 | for json_path in json_paths: 50 | gene_data = extract_gene(json_path, hgncs_stripped) 51 | symbols = "_".join( 52 | [gene["gene_symbol"].lower() for gene in gene_data["genes"].values()] 53 | ) 54 | out_path = json_path.replace(".gz", "").replace(".json", f".{symbols}.json") 55 | print(f"writing to {out_path}...", file=sys.stderr) 56 | with open(out_path, "wt") as outputf: 57 | json.dump(gene_data, outputf, indent=2) 58 | 59 | 60 | if __name__ == "__main__": 61 | main(sys.argv[1:]) 62 | 
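The core of `extract_gene.py` is a subset filter: keep only the map entries whose `hgnc` field matches one of the requested identifiers. In Rust (the language of the rest of this crate), the same pattern is a `HashMap::retain`. A minimal sketch under stated assumptions: plain `HashMap<String, String>` stands in for the parsed cdot JSON, and `subset_by_hgnc` is an illustrative name, not a crate API:

```rust
use std::collections::HashMap;

// Keep only genes whose (prefix-stripped) HGNC id is in `keep`,
// mirroring the dict comprehensions in extract_gene.py.
fn subset_by_hgnc(genes: &mut HashMap<String, String>, keep: &[&str]) {
    genes.retain(|_key, hgnc| keep.contains(&hgnc.as_str()));
}

fn main() {
    let mut genes = HashMap::new();
    genes.insert("BRCA1".to_string(), "1100".to_string());
    genes.insert("TTN".to_string(), "12403".to_string());

    // Same as HGNC_IDS="HGNC:1100" with the "HGNC:" prefix stripped.
    subset_by_hgnc(&mut genes, &["1100"]);

    assert_eq!(genes.len(), 1);
    assert!(genes.contains_key("BRCA1"));
}
```

The Python script applies this filter twice, once to `genes` and once to `transcripts`, both keyed by the same stripped HGNC list.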
-------------------------------------------------------------------------------- /tests/data/data/cdot/tx-mane.brca1.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ec11974722a1c66832928fa2ad6444ddd5c1070664d3f2bef9fae6b3f2fe3227 3 | size 786 4 | -------------------------------------------------------------------------------- /tests/data/data/subset.awk: -------------------------------------------------------------------------------- 1 | BEGIN { 2 | OFS = FS; 3 | copy_table = ""; 4 | 5 | # define list of known tables 6 | a = "gene|transcript|seq_anno|seq|associated_accessions|" \ 7 | "exon_set|exon|exon_aln"; 8 | split(a, names, "|") 9 | 10 | # read known ids 11 | for (i in names) { 12 | name = names[i]; 13 | while ((getline line < ("download/" name ".tsv")) > 0) { 14 | known[name, line] = 1 15 | # print name, "|", line 16 | } 17 | close("download/" name ".tsv") 18 | } 19 | } 20 | 21 | { 22 | if ($0 ~ /anonymous/) { 23 | next; # skip for testing 24 | } 25 | 26 | # Materializing `tx_similarity_mv` is slow for large databases, so we would 27 | # rather not materialize it.
28 | if ($0 ~ /CREATE MATERIALIZED VIEW .*.tx_similarity_mv/) { 29 | gsub(/MATERIALIZED /, "", $0); 30 | skip_no_data=1; 31 | print; 32 | next; 33 | } else if ($0 ~ /WITH NO DATA/ && skip_no_data == 1) { 34 | gsub(/WITH NO DATA/, "", $0); 35 | print; 36 | next; 37 | } else if ($0 ~ /CREATE .*INDEX .*ON .*tx_similarity_mv.*;/ || 38 | $0 ~ /REFRESH MATERIALIZED VIEW .*.tx_similarity_mv/) { 39 | print "-- " $0; 40 | next; 41 | } 42 | 43 | if ($0 ~ /^COPY/) { 44 | table_name = $2; 45 | gsub(/.*?\./, "", table_name); 46 | if (table_name != "meta" && table_name != "origin") { 47 | print $0; # do not print twice ;-) 48 | } 49 | } else if ($0 ~ /^\\./) { 50 | table_name = ""; 51 | } 52 | 53 | if (table_name == "" || table_name == "meta" || \ 54 | table_name == "origin" || known[table_name, $1] == 1) { 55 | print $0; 56 | } 57 | } 58 | -------------------------------------------------------------------------------- /tests/data/data/uta_20210129-subset.pgd.gz: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:16cdd19f287cab73b33844bf675afbafd51950d323c72927533cb57002e86ea5 3 | size 2016343 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/ADRA2B-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:c8e0bdc6ef20e42443287a4a0e5b2bf8f21a71eedd31ab4bdb690f8d12989a1e 3 | size 9734 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/BAHCC1-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ae1b52922db903c5f142a10759c0abdf293d4c2f05497af9ea135ea8fa914ecd 3 | size 19445 4 | -------------------------------------------------------------------------------- 
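The core idea of `subset.awk` is to keep a dump row only while inside a `COPY` block whose table is known and whose first column is one of the previously collected ids. A minimal Python sketch of that row filter follows; the table name, accessions, and dump lines are invented toy data, not taken from the real UTA dump.

```python
# Stand-in for the ids gathered from download/*.tsv:
# (table name, first column) pairs that should be kept.
known = {("transcript", "NM_007294.4")}

dump = [
    "COPY uta.transcript (ac, hgnc) FROM stdin;",
    "NM_007294.4\tBRCA1",  # known id -> kept
    "NM_000059.4\tBRCA2",  # unknown id -> dropped
    "\\.",
]

out = []
table_name = ""
for line in dump:
    if line.startswith("COPY"):
        # Entering a COPY block; strip the schema prefix from the table name.
        table_name = line.split()[1].split(".")[-1]
        out.append(line)
    elif line.startswith("\\."):
        # "\." terminates the COPY block; back to pass-through mode.
        table_name = ""
        out.append(line)
    elif table_name == "" or (table_name, line.split("\t")[0]) in known:
        # Outside COPY blocks everything passes; inside, only known ids do.
        out.append(line)

print(out)  # the BRCA2 row is filtered out
```

The awk original applies the same gate via `known[table_name, $1]`, with extra cases for the `meta` and `origin` tables, which are always copied through.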
/tests/data/mapper/gcp/DNAH11-HGMD.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:54787bb1db25660a096db7fc7b495b8958d78aa787248294e6254bd9bc3d785d 3 | size 1395 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/DNAH11-dbSNP-NM_001277115.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:c09c30abed3d5ac940d4a18c0bb6a7f99bbece0db83f0ae60f42e88dd8c2c45a 3 | size 13049 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/DNAH11-dbSNP-NM_003777.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:56324d1298160dfcb554265d9171b3b67be94c301ab13ff257d1f045ef7bbc36 3 | size 11741 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/DNAH11-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:3e54aaf2281b8b8512e01c04ac046f92f8f24e23b76ed1397cf8248f4845d691 3 | size 1466622 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/FOLR3-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:41156482c58dbc7720b570d3ce285fa4c8335f3368fde6605e38db9bd9c8f8ea 3 | size 12317 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/JRK-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid 
sha256:bd8a17624d0e2fe1cab92914ca9649455884bf83a62d63dac5f44dce5befdae4 3 | size 17974 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/NEFL-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:0798273cda49189898ec32099e2de8b3db5f01981a25289b7345a4d24c9d1a53 3 | size 2386 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/ORAI1-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:757c66afb1c660797d90e251410450933206688fe7d8ec64391ebc47d1fe0c4a 3 | size 16072 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/ZCCHC3-dbSNP.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:677a7938e1f0b8ddf2a2ed56a3283e5d3ffa5122af227797648760747b4742b7 3 | size 6450 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/noncoding.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:7d6a32ab8dda82bfe6ab3581bc81791c3a7e9de8d9e8815f62a581edc8cc69b7 3 | size 22994 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/real-met1.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:4006f3113321d8c9c3d7655818d159595952f29b5bdfaac600aab699525597ac 3 | size 97 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/real.tsv: 
-------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:948229daffb1d3b789d27c1c583d617a90c48d274a86a3fbf6b4325012f18bc5 3 | size 5644 4 | -------------------------------------------------------------------------------- /tests/data/mapper/gcp/regression.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:9f67aff8e534648869a20ddb956225f6d93b52d20247cc56045d31c97215bb71 3 | size 5707 4 | -------------------------------------------------------------------------------- /tests/data/mapper/proj-near-disc.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:8cf77bc78753fb7f905df12be763a5a3c8293a75242030e52b2110cd28b3e340 3 | size 3330 4 | -------------------------------------------------------------------------------- /tests/data/mapper/real_cp.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:192a4203d7a434276c9ae8e92151a5a165fdde5db1437a627f04e180b668fdf2 3 | size 893 4 | -------------------------------------------------------------------------------- /tests/data/mapper/sanity_cp.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ea13fbdb42d2fa7736081fcd7de378425de31f02cdb92b749512b288a8d89760 3 | size 698 4 | -------------------------------------------------------------------------------- /tests/data/parser/gauntlet: -------------------------------------------------------------------------------- 1 | # HGVS test variants 2 | # This file attempts to exercise all components of the HGVS parser (but 3 | # not in all combinations). 
4 | 5 | # some negative examples are shown commented out with '#- ' 6 | # These may be used in the future for expected failures 7 | 8 | 9 | ############################################################################ 10 | #### TYPES 11 | AC_01234.5:c.1A>T 12 | AC_01234.5:g.1A>T 13 | AC_01234.5:m.1A>T 14 | AC_01234.5:n.1A>T 15 | AC_01234.5:r.1a>u 16 | AC_01234.5:p.Ala1Ser 17 | 18 | 19 | ############################################################################ 20 | #### POSITIONS 21 | #### g. positions (m., n. identical) 22 | 23 | # ? 24 | # 34+? 25 | # (34+56A>T) 26 | 27 | 28 | 29 | AC_01234.5:g.1A>T 30 | AC_01234.5:g.1_22A>T 31 | 32 | #- AC_01234.5:g.0A>T 33 | #- AC_01234.5:g.*1A>T 34 | 35 | #### c. positions not covered by g. (r. identical) 36 | 37 | 38 | 39 | #### p. positions 40 | 41 | 42 | ############################################################################ 43 | #### EDITS 44 | #### g. edits (m., n. identical) 45 | 46 | #!unsupported: AC_01234.5:g.1209_4523(12_45) 47 | #!unsupported: AC_01234.5:g.123TG[4] 48 | #!unsupported: AC_01234.5:g.123_124[4] 49 | #!unsupported: AC_01234.5:g.123_678conNG_012232.1:g.9456_10011 50 | AC_01234.5:g.5dup 51 | AC_01234.5:g.5dupT 52 | AC_01234.5:g.7_8dup 53 | AC_01234.5:g.7_8dupTG 54 | 55 | 56 | # complex -- same allele (maybe) 57 | #!unsupported: AC_01234.5:c.[76A>T;77G>T] 58 | #!unsupported: AC_01234.5:c.[76A>C(;)283G>C] 59 | 60 | # mosaic and chimeric: 61 | #!unsupported: AC_01234.5:c.[=/83G>C] 62 | #!unsupported: AC_01234.5:c.[=//83G>C] 63 | 64 | # compound: 65 | #!unsupported: AC_01234.5:c.[76A>C];[76A>C] 66 | #!unsupported: AC_01234.5:c.[76A>C];[(76A>C)] 67 | #!unsupported: AC_01234.5:c.[76A>C];[?] 68 | #!unsupported: AC_01234.5:c.[76A>C];[=] 69 | #!unsupported: AC_01234.5:c.[76A>C];[0] 70 | 71 | #### c. edits not covered by g. (r. identical) 72 | 73 | #### p. 
edits 74 | 75 | 76 | 77 | 78 | AC_01234.5:c.*46T>A 79 | AC_01234.5:c.-14G>C 80 | AC_01234.5:c.112_117delinsTG 81 | AC_01234.5:c.113delinsTACTAGC 82 | AC_01234.5:c.114_115delinsA 83 | #!unsupported: AC_01234.5:c.1210-12T(5_9) 84 | #!unsupported: AC_01234.5:c.1210-12T[5] 85 | #!unsupported: AC_01234.5:c.123+74TG(3_6) 86 | #!unsupported: AC_01234.5:c.203_506inv 87 | AC_01234.5:c.76A>C 88 | AC_01234.5:c.76_77delinsTT 89 | AC_01234.5:c.76_77insT 90 | AC_01234.5:c.76_78del 91 | AC_01234.5:c.76_78delACT 92 | AC_01234.5:c.77_79dup 93 | AC_01234.5:c.77_79dupCTG 94 | AC_01234.5:c.88+1G>T 95 | AC_01234.5:c.89-2A>C 96 | 97 | 98 | #!unsupported: AC_01234.5:g.1209_4523(12_45) 99 | #!unsupported: AC_01234.5:g.123TG[4] 100 | #!unsupported: AC_01234.5:g.123_124[4] 101 | #!unsupported: AC_01234.5:g.123_678conNG_012232.1:g.9456_10011 102 | AC_01234.5:g.5dup 103 | AC_01234.5:g.5dupT 104 | AC_01234.5:g.7_8dup 105 | AC_01234.5:g.7_8dupTG 106 | 107 | 108 | 109 | # uncertain: 110 | #!unsupported: AC_01234.5:c.88-?_923+?del 111 | #!unsupported: AC_01234.5:c.1032-?_1357+?(3) 112 | #!unsupported: AC_01234.5:c.1032-?_1357+?[3] 113 | 114 | # lax / invitae-isms 115 | AC_01234.5:c.76_78del3 116 | #!unsupported: AC_01234.5:c.76_78copy3 117 | 118 | 119 | 120 | 121 | ############################################################################ 122 | #### NOT SUPPORTED 123 | 124 | # t(X;4)(p21.2;q35)(c.857+101_857+102) 125 | 126 | 127 | -------------------------------------------------------------------------------- /tests/data/parser/grammar_test.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:3945ebc4cf372b7312bc36aae3b4b95d575d99631fbf549e91a46614650016b3 3 | size 12531 4 | -------------------------------------------------------------------------------- /tests/data/parser/reject: -------------------------------------------------------------------------------- 1 | 
#NC_000001.10:g.155208383_155208384dup2 dupN is not valid HGVS 2 | 3 | x 4 | 5 | #previous line is empty 6 | -------------------------------------------------------------------------------- /tests/data/seqrepo_cache.fasta: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:7fa25f4f84328b56334d1a3db4dd5fe671bed75993ccda7b6bf4dbbc4ae1f8f9 3 | size 673799 4 | --------------------------------------------------------------------------------