├── .gitignore ├── .travis.yml ├── Cargo.lock ├── Cargo.toml ├── LICENSE ├── README.md ├── ci ├── before_deploy.sh ├── construct_linux_binary.py ├── install.sh ├── script.sh ├── sha256.sh └── utils.sh ├── editdistancealleles.py ├── paper ├── AML_508084.txt ├── AML_548327.txt ├── AML_721214.txt ├── AML_782328.txt ├── AML_809653.txt ├── coverage.png ├── figure2.R ├── figure3.R ├── figureS2.R ├── figures │ ├── figure2-updated.pdf │ ├── figure3-newlayout.pdf │ ├── figureS1.pdf │ ├── figureS2.pdf │ └── workflow-updated.pdf ├── main.tex ├── sample_gt.txt ├── schlacount.bib ├── table1_AML_DRB1.tsv ├── table2_AML_C.tsv ├── table4_paulson_discovery.R └── workflow.png ├── prepare_reference.sh ├── src ├── config.rs ├── em.rs ├── hla.rs ├── io.rs ├── locus.rs ├── main.rs └── mapping.rs └── test ├── barcodes0.tsv ├── barcodes1.tsv ├── barcodes7.tsv ├── cds_ABC.fa ├── fake_db ├── Allele_status.txt ├── hla_gen.fasta ├── hla_nuc.fasta └── hla_nuc.fasta.idx ├── genomic_ABC.fa ├── test.bam ├── test.bam.bai ├── test_allele_fasta.mtx └── test_call.mtx /.gitignore: -------------------------------------------------------------------------------- 1 | # Generated by Cargo 2 | # will have compiled files and executables 3 | /target/ 4 | 5 | # These are backup files generated by rustfmt 6 | **/*.rs.bk 7 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: rust 2 | env: 3 | global: 4 | - PROJECT_NAME: scHLAcount 5 | - RUST_BACKTRACE: full 6 | matrix: 7 | fast_finish: true 8 | include: 9 | - os: osx 10 | rust: stable 11 | env: TARGET=x86_64-apple-darwin 12 | - os: linux 13 | rust: stable 14 | env: TARGET=x86_64-unknown-linux-gnu 15 | install: ci/install.sh 16 | script: ci/script.sh 17 | before_deploy: ci/before_deploy.sh 18 | deploy: 19 | provider: releases 20 | api_key: 21 | secure: 
"HHeptIdhv70nFn/qmKyRj42p4GNHE2AZxTgS9TEOL/rcN1+7G5UJojydGqhGlcrCxPLHeJjccgDKEZdvLidh9wFIjNdtyGC25gK49nwRZ773fuvNcRfZjrea96D85TFvMsUv8u+JPPIY9MZb+kBmsVgMCIw14Nr7Tq0tjQiW8TilP2V4n1fobcNYdDbYMPxYue9Os+Prz7K3cMMn3qVjlm5EkZIWDNygSf1+R+WGDMpdcHCu1jC+y6OtScJeqAtnQlz+6vvPV9L3EvSqj5GTTYpfWXNz0CkHr7E8w9BAZ6Lfrm0GEJ7UNY2UXaq/Nhd5yK6HL6GqObeERmcXbUf8jZySTrI/ccbtGna2wm+1BXnItjSmjsDzzrMWABuuo6KuNE9Egy3atmGt4qmqqNpmqX2c2Gany3LHFvjhJQ2mxyTjqDJyV0dY7dCzjL0OC1/IEV4SIurg0sig3e7fSysrwwvVz3mJxW3YGhGfWtYWjEMxnwu/1FpcxYkgxeO5/p3vqVQabDuE1JeTtemr7RR4xDlkRqpZCLsIi58zv1pkgQshH0ssTWJkEK8viuYpSCrp7NniBD/xufnHnBCEhOF0glV4pAFEq9gBlnf5nl3QUR9Qh9H1Yjrin4muBCo/SpobanDypqtt4P3htHgBfsfVALK+ddqReaFs8gnu2DsNJmI=" 22 | file: deployment/${PROJECT_NAME}-${TRAVIS_TAG}-${TARGET}.tar.gz 23 | file_glob: true 24 | skip_cleanup: true 25 | on: 26 | repo: 10XGenomics/scHLAcount 27 | branch: master 28 | tags: true 29 | -------------------------------------------------------------------------------- /Cargo.toml: -------------------------------------------------------------------------------- 1 | [package] 2 | name = "sc_hla_count" 3 | version = "0.2.0" 4 | authors = ["Charlotte Darby ", "Ian Fiddes ", "Patrick Marks "] 5 | 6 | [dependencies] 7 | bio = "0.31" 8 | clap = "*" 9 | csv = "1" 10 | debruijn = { git = "https://github.com/10XGenomics/rust-debruijn" } 11 | debruijn_mapping = { git = "https://github.com/10XGenomics/rust-pseudoaligner.git", rev = "ae4aadbce921d3233e3bd9b3ee8a8466804b20ad" } 12 | failure = "*" 13 | human-panic = "1.0.1" 14 | itertools = "0.8" 15 | log = "*" 16 | regex = "1" 17 | rust-htslib = "0.36" 18 | serde = "1.0" 19 | simplelog = "0.5.0" 20 | sprs = "*" 21 | tempfile = "*" 22 | terminal_size = "*" 23 | flate2 = "1" 24 | 25 | [dependencies.smallvec] 26 | features = ["serde"] 27 | version = "0.6.8" 28 | 29 | [profile.release] 30 | debug = true 31 | -------------------------------------------------------------------------------- /LICENSE: 
-------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 10x Genomics 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # scHLAcount 2 | 3 | ## Overview 4 | scHLAcount allows you to count molecules in single-cell RNA-seq data for class I genes HLA-A, B, and C; and class II genes DPA1, DPB1, DRA1, DRB1, DQA1, and DQB1 using a personalized reference genome. You can either use provided HLA types determined by alternative methods or call HLA types with this tool then quantify against those calls. See the [Using scHLAcount](#using-schlacount) section for more details. 
5 | 6 | ![Workflow Figure](paper/workflow.png) 7 | 8 | ## Uses 9 | scHLAcount can be used to look at allele-specific expression of HLA genes. It can also be used to evaluate loss of heterozygosity by overlaying cell-specific counts onto an expression-based t-SNE projection and looking for clusters with complete loss of one haplotype. General loss of HLA expression can also be evaluated with scHLAcount, which performs better at this task than default Cell Ranger, particularly in the case where the sample has HLA haplotypes that are diverged from the reference. 10 | 11 | ``` 12 | scHLAcount DEV 13 | 14 | HLA genotyping and allele-specific expression for single-cell RNA sequencing 15 | 16 | USAGE: 17 | sc_hla_count [FLAGS] [OPTIONS] --bam --cell-barcodes 18 | 19 | FLAGS: 20 | --use-exact-count If specified, will use exact alignment to allele sequences to count moleucles (very slow!) 21 | -h, --help Prints help information 22 | --primary-alignments If specified, will use primary alignments only 23 | --unmapped If specified, will also use unmapped reads for genotyping 24 | -V, --version Prints version information 25 | 26 | OPTIONS: 27 | -b, --bam Cellranger BAM file 28 | -c, --cell-barcodes File with cell barcodes to be evaluated 29 | -f, --fasta-cds Multi-FASTA file with CDS sequence of each allele [default: ] 30 | -g, --fasta-genomic Multi-FASTA file with genomic sequence of each allele [default: ] 31 | -d, --hladb-dir Directory of the IMGT-HLA database [default: ] 32 | -i, --hla-index debruijn_mapping pseudoalignment index file constructed from IMGT-HLA database [default: ] 33 | --log-level Logging level [default: error] [possible values: info, debug, error] 34 | -o, --out-dir [default: hla-typer-results] 35 | --pl-tmp Directory to write the pseudoaligner temporary files generated [default: pseudoaligner_tmp] 36 | -r, --region Samtools-format region string of reads to use [default: 6:28510120-33480577] 37 | ``` 38 | 39 | ## Limitations 40 | 41 | While scHLAcount can
determine HLA haplotypes given an HLA database like the one at IMGT, our testing has shown that alternative tools such as [arcasHLA](https://github.com/RabadanLab/arcasHLA) perform better at HLA genotyping. Therefore, we recommend that you use either alternative methods or arcasHLA to determine your genotypes before using scHLAcount to assign allele-specific counts in your single-cell RNA-seq dataset. 42 | 43 | We have determined that the best results for genotyping and allele-specific counting are found with 5' GEX data. There is a much stronger coverage bias towards the 3' end of the transcript in 3' GEX data, which poses a problem for genotyping and molecule counting of class I genes because most of the variable sites between these three paralogs are contained in exons 2 and 3, which are at the 5' end of the transcript. The following figure shows the coverage profile for 5', 3'v2 and 3'v3 GEX assays, normalized to 0 and 1 for the minimum and maximum coverage seen in the region, respectively, for each assay. 44 | 45 | ![GEX Coverage Figure](paper/coverage.png) 46 | 47 | ## Installation 48 | 49 | scHLAcount has automatically generated downloadable binaries for generic Linux and macOS under the [releases page](https://github.com/10XGenomics/scHLAcount/releases). The Linux binaries are expected to work on [our supported Operating Systems](https://support.10xgenomics.com/os-support). 50 | 51 | ## Compiling from source 52 | scHLAcount is a standard Rust executable project that works with stable Rust >=1.13. 53 | 54 | If you need to compile from source, [install Rust](https://www.rust-lang.org/en-US/install.html), then type `cargo build --release` from within the directory containing the scHLAcount source code. The executable will appear at `target/release/sc_hla_count`. As usual, it's important to use a release build to get good performance. 
55 | 56 | ## Testing 57 | If you have compiled scHLAcount from source, you can run the tiny test dataset by typing the command `cargo test --release` from within the directory containing the scHLAcount source code. 58 | 59 | The test data files in the `test/` folder also provide a simple example of the inputs and outputs for scHLAcount. 60 | 61 | ## Support 62 | 63 | scHLAcount is provided as an open-source tool for use by the community. Although we cannot provide full support for the software, please submit a GitHub Issue if you have any problems, questions or comments. We would also be happy to consider Pull Requests that fix bugs or provide enhancements. 64 | 65 | Scripts in the `/paper` directory show how to reproduce results from our manuscript and are not supported. 66 | 67 | # Using scHLAcount 68 | 69 | ## Case 1: You have HLA genotypes for some or all class I / class II genes 70 | 71 | Other Requirements: samtools 72 | 73 | 1. Download the IMGT/HLA database, available at [GitHub](https://github.com/ANHIG/IMGTHLA) or [FTP](ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/). You only need the `hla_gen.fasta` and `hla_nuc.fasta` files, but you can download the whole database if you choose. 74 | 2. Use `samtools faidx` to index the `hla_gen.fasta` and `hla_nuc.fasta` files. 75 | 3. Create a file of the known genotypes, at most two per gene, with one genotype on each line. Follow the template at `paper/sample_gt.txt`. 76 | 4. We strongly recommend that if genotypes are unknown for any of the genes, you put the reference genome allele for those genes in the known genotypes file. Alleles represented in the GRCh38 primary assembly are listed below: 77 | ``` 78 | A*03:01:01:01 79 | B*07:02:01:01 80 | C*07:02:01:01 81 | DQA1*01:02:01:01 82 | DQB1*06:02:01:01 83 | DRB1*15:01:01:01 84 | DPA1*01:03:01:01 85 | DPB1*04:01:01:01 86 | ``` 87 | 5. 
If the indexed IMGT/HLA database files are not in the current directory, edit the `prepare_reference.sh` file to point to these files. 88 | 6. Run `prepare_reference.sh known_genotypes.txt` to get your custom references `cds.fasta` and `gen.fasta`. The samtools command will fail unless the coding and genomic sequences of every specified allele are present in the database! If multiple alleles are present that match the provided level of specificity of the genotype, one will be chosen arbitrarily. 89 | 7. Run scHLAcount with the custom references as `-f` and `-g` parameters. Do not use the `-i` and `-d` parameters. 90 | 91 | 92 | ## Case 2: You do not have HLA genotypes 93 | 94 | 1. Download the IMGT/HLA database, available at [GitHub](https://github.com/ANHIG/IMGTHLA) or [FTP](ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/). You only need the files `hla_gen.fasta`, `hla_nuc.fasta`, and `Allele_status.txt`, but you can download the whole database if you choose. 95 | 2. The directory containing these files should be provided as the `-d` parameter to scHLAcount. 96 | 3. Run scHLAcount with the `-d` parameter. Do not use the `-f`, `-g`, or `-i` parameters. 97 | 4. If you run the program again and want to skip building the index, just specify the file `hla_nuc.fasta.idx` as the `-i` parameter. This file is located in the pseudoaligner temporary folder specified by the `--pl-tmp` parameter. 98 | 5. If you run the program again and want to skip calculating the genotypes, you can use the `pseudoaligner_nuc.fa` and `pseudoaligner_gen.fa` files as the `-f` and `-g` parameters. These files are located in the pseudoaligner temporary folder specified by the `--pl-tmp` parameter. 99 | 100 | ## Outputs 101 | scHLAcount produces count matrices in the same [Matrix Market Exchange](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices) format that Cell Ranger uses. This is a sparse matrix format that can be read by common packages. 
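As a rough illustration of the format, here is a minimal Python sketch that parses a Matrix Market coordinate file using only the standard library and sums counts per feature. It is not part of scHLAcount: the helper names `read_mtx_coordinate` and `counts_per_feature` are hypothetical, the tiny inline matrix and its labels are made up for the example, and it assumes a `general` coordinate matrix with an explicit value on every entry line. In practice you would read the output matrix and its labels file (named `count_matrix.mtx` and `labels.tsv` in `paper/figure2.R`) with an existing package, such as R's `Matrix::readMM` as used in that script.

```python
from collections import defaultdict


def read_mtx_coordinate(lines):
    """Parse Matrix Market coordinate lines into ((n_rows, n_cols), {(row, col): value}).

    Indices in the file are 1-based; they are converted to 0-based here.
    Assumes every entry line has three fields: row, column, value.
    """
    shape = None
    entries = {}
    for raw in lines:
        line = raw.strip()
        # Skip the banner line, comment lines, and blank lines.
        if not line or line.startswith("%"):
            continue
        fields = line.split()
        if shape is None:
            # First data line holds: n_rows n_cols n_nonzero
            shape = (int(fields[0]), int(fields[1]))
            continue
        row, col = int(fields[0]) - 1, int(fields[1]) - 1
        entries[(row, col)] = float(fields[2])
    return shape, entries


def counts_per_feature(entries, labels):
    """Sum counts across all barcodes (columns) for each row label."""
    totals = defaultdict(float)
    for (row, _col), value in entries.items():
        totals[labels[row]] += value
    return dict(totals)


# Made-up example: 3 features (two alleles plus the gene-level row) x 2 barcodes.
example = """%%MatrixMarket matrix coordinate integer general
3 2 4
1 1 5
2 1 1
2 2 7
3 2 2
""".splitlines()
labels = ["A*02:01:01:01", "A*03:01:01:01", "A"]

shape, entries = read_mtx_coordinate(example)
print(shape, counts_per_feature(entries, labels))
```

To use the sketch on a real file, pass an open file handle instead of `example` and read the row labels from the labels file, one per line.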
Column labels are the cell barcodes included in the cell barcode input file (specified with `--cell-barcodes`). 102 | 103 | 104 | ## Genotyping Algorithm Details 105 | 106 | If HLA genotypes are not available, scHLAcount provides a preliminary implementation of a genotyping algorithm similar to the one demonstrated on bulk RNA-seq data in arcasHLA (Orenbuch et al., Bioinformatics 2019) and HLApers (Aguiar et al., PLoS Genetics 2019). Although 5' and 3' Gene Expression read coverage is highly skewed towards one end of the transcript (see [Limitations](#limitations)), we have found that combining reads from all cells provides enough coverage along the length of the transcript to achieve similar results to those demonstrated in bulk RNA-seq. 107 | 108 | We use the `Allele_status.txt` metadata file from the IMGT/HLA database to select the full-length coding sequences of alleles from `hla_nuc.fasta` for genes HLA-A, -B, -C, DPA1, DPB1, DRA1, DRB1, DQA1, and DQB1 in the IMGT HLA database that also have complete genomic sequences available. We exclude null alleles (with suffix of N) because by definition, we would never call these genotypes from RNA sequencing data. A colored de Bruijn graph of these alleles is constructed using k-mer size 24. We build on the Rust [debruijn_mapping](https://github.com/10XGenomics/rust-pseudoaligner) crate. From the aligned BAM file, all reads aligned to the MHC region (default is GRCh38 coordinates 6:28510120-33480577) are pseudo-aligned to the graph, and the set of alleles to which they align (the "equivalence class") is reported if the length of the alignment is at least 40 bases, with up to 2 mismatches permitted outside the initial seed. Expectation maximization (EM) is performed on the equivalence class counts, with the accelerated implementation SquareM of Varadhan and Roland (Scandinavian J of Statistics, 2008). 109 | 110 | For each of the eight HLA genes included in the graph, we rank the alleles of that gene by weight in the EM. 
To determine the diploid genotype, we consider all pairs of alleles that explained at least 100 reads (a pair of alleles explains a read if the read pseudoaligned to an equivalence class containing either allele sequence). We select the pair of alleles that explain the most reads, if the number of reads explained by the 2nd allele alone is at least 15% of the reads explained by the 1st allele. If this is not true, we report only a single genotype: the highest ranked by weight in the EM. For each gene, we report the top 5 allele pairs ranked by number of reads explained, and the top 10 alleles ranked by EM weight in an auxiliary output file, found in the pseudoaligner temporary folder specified by the `--pl-tmp` parameter. This metadata is useful to gauge confidence in the reported genotypes compared to other contenders. The coding sequence of the allele or pair from `hla_nuc.fasta` and the genomic sequence from `hla_gen.fasta` (IMGT/HLA database) are written to FASTA format files for all genes. 111 | 112 | ## Molecule Counting Algorithm Details 113 | 114 | Using multi-FASTA files of the coding and genomic sequences of alleles (generated using `prepare_reference.sh` from a list of genotypes or internally by scHLAcount), two colored de Bruijn graph indexes are built using k-mer size 24. One graph contains coding sequences and one contains genomic sequences. From the aligned BAM file, each read aligned to the MHC region (default is GRCh38 coordinates 6:28510120-33480577) is first pseudo-aligned to the coding sequence graph. If there is no alignment of at least 60 bases (2 mismatches are permitted outside the initial seed), the read is pseudo-aligned to the genomic sequence graph and retained if the alignment is at least 60 bases. In the datasets we've studied, less than 5% of reads with any alignment were aligned to the genomic sequence. 
The genomic sequence step provides a failsafe for samples that are derived from nuclei, haplotypes with short CDS assemblies in the database, and the occasional read from intronic or UTR sequences. 115 | 116 | Reads are then collated by molecule, which in 10x Genomics data is identified by the 16bp cell barcode (CB) and the 10bp or 12bp unique molecular identifier (UMI), depending on the assay version. All reads sharing a CB and UMI originated from the same RNA molecule, but individual reads may have different equivalence classes according to the pseudoalignment. We only retain reads where the equivalence class corresponds to a single allele (in which case the read is assigned to that allele; e.g. HLA-A*02:01) or the two alleles of the same gene (the read is assigned to the gene; e.g. HLA-A). If at least half of the reads from a molecule are assigned to a particular gene, the molecule is assigned to that gene or one of its alleles using a consensus of the constituent reads’ equivalence classes. Considering only the reads assigned to the molecule's most prevalent gene, if both or neither allele has at least 10% of the reads, the molecule is assigned to the gene. Otherwise, the molecule is assigned to the more prevalent allele. 117 | 118 | ## Advanced Parameter Specifications 119 | 120 | Default parameters were selected based on our test datasets with genotypes with two- or three-field resolution, where we expect the personalized reference to have very few mismatches with the allele present in the reads. scHLAcount selects an arbitrary allele from the database consistent with the provided genotypes. If the genotypes provided are lower-resolution (e.g. the one-field genotype A\*02 is lower-resolution than the three-field genotype A\*02:01:01), scHLAcount arbitrarily selects a representative sequence from all A\*02 alleles. 
Therefore, when only lower-resolution genotypes are available, the pseudoalignments of reads to the personalized reference may contain more mismatches and users may want to decrease the k-mer length or decrease the minimum significant alignment length. 121 | 122 | k-mer length is set to 20 in the debruijn_mapping crate. To change the k-mer length, you need to clone the repo of this crate, change [its configuration file](https://github.com/10XGenomics/rust-pseudoaligner/blob/master/src/config.rs), and change the [scHLAcount Cargo.toml file](https://github.com/10XGenomics/scHLAcount/blob/master/Cargo.toml#L11) to point to your local, modified version of the debruijn_mapping crate. Then you will need to re-compile the scHLAcount program. 123 | 124 | Other scHLAcount-specific parameters are specified in our configuration file [here](https://github.com/10XGenomics/scHLAcount/blob/master/src/config.rs) and are described below. After changing the configuration file, you will need to re-compile the scHLAcount program. 125 | 126 | *Genotyping parameters* 127 | MIN_SCORE_CALL - minimum length of pseudoalignment required to use a read in genotyping 128 | Parameters prefixed with "EM" pertain to the expectation-maximization step of the genotyping algorithm. 
129 | MIN_READS_CALL - minimum number of reads required to be assigned to a gene in order to call a genotype 130 | HOMOZYGOUS_TH - maximum proportion of reads assigned to an allele to call a homozygous genotype 131 | PAIRS_TO_OUTPUT - in the genotyping output file, report this many allele pairs and their scores 132 | WEIGHTS_TO_OUTPUT - in the genotyping output file, report this many alleles and their scores 133 | 134 | *Molecule-counting parameters* 135 | MIN_SCORE_COUNT_PSEUDO - minimum length of pseudoalignment required to use a read in molecule-counting 136 | MIN_SCORE_COUNT_ALIGNMENT - minimum length of pseudoalignment required to use a read in molecule-counting 137 | GENE_CONSENSUS_THRESHOLD - minimum proportion of reads in a UMI that must be assigned to (any alleles of) a single gene, otherwise the UMI is not counted 138 | ALLELE_CONSENSUS_THRESHOLD - minimum proportion of reads in a UMI that must be assigned to the most prevalent allele, otherwise the UMI is counted at the gene level not at the allele level 139 | 140 | The gene names for genotyping and molecule counting are specified in the code [here](https://github.com/10XGenomics/scHLAcount/blob/master/src/hla.rs#L108). 141 | 142 | ## Future Work 143 | 144 | **General** 145 | 146 | - Extend to other HLA genes and pseudogenes 147 | - Allow multiple disjoint regions to be specified for read extraction 148 | 149 | **Genotyping Step** 150 | 151 | - Option to use UMI counts instead of read counts 152 | - Improve genotyping algorithm, especially on 3' GEX 153 | 154 | **Molecule Counting Step** 155 | 156 | - Option to use unmapped reads 157 | -------------------------------------------------------------------------------- /ci/before_deploy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # package the build artifacts 4 | 5 | set -ex 6 | 7 | . 
"$(dirname $0)/utils.sh" 8 | 9 | # Generate artifacts for release 10 | mk_artifacts() { 11 | cargo build --target "$TARGET" --release 12 | } 13 | 14 | mk_tarball() { 15 | # When cross-compiling, use the right `strip` tool on the binary. 16 | local gcc_prefix="$(gcc_prefix)" 17 | # Create a temporary dir that contains our staging area. 18 | # $tmpdir/$name is what eventually ends up as the deployed archive. 19 | local tmpdir="$(mktemp -d)" 20 | local name="${PROJECT_NAME}-${TRAVIS_TAG}-${TARGET}" 21 | local staging="$tmpdir/$name" 22 | mkdir -p "$staging" 23 | # The deployment directory is where the final archive will reside. 24 | # This path is known by the .travis.yml configuration. 25 | local out_dir="$(pwd)/deployment" 26 | mkdir -p "$out_dir" 27 | # Find the correct (most recent) Cargo "out" directory. The out directory 28 | # contains shell completion files and the man page. 29 | local cargo_out_dir="$(cargo_out_dir "target/$TARGET")" 30 | 31 | # Copy the scHLAcount binary and strip it. 32 | cp "target/$TARGET/release/sc_hla_count" "$staging/sc_hla_count" 33 | #"${gcc_prefix}strip" "$staging/sc_hla_count" 34 | # Copy the licenses and README. 
35 | cp {README.md,LICENSE} "$staging/" 36 | 37 | (cd "$tmpdir" && tar czf "$out_dir/$name.tar.gz" "$name") 38 | rm -rf "$tmpdir" 39 | } 40 | 41 | main() { 42 | mk_artifacts 43 | mk_tarball 44 | } 45 | 46 | main 47 | -------------------------------------------------------------------------------- /ci/construct_linux_binary.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script copies what the Travis build is doing, but allows you to build in your own environment 3 | """ 4 | 5 | import argparse 6 | import subprocess 7 | import os 8 | import sys 9 | import github3 10 | import shutil 11 | 12 | def parse_args(): 13 | args = argparse.ArgumentParser() 14 | args.add_argument('--token', required=True) 15 | args.add_argument('--repo', default='vartrix') 16 | args.add_argument('--owner', default='10xgenomics') 17 | args.add_argument('--target', default='target/release/vartrix') 18 | args.add_argument('--extra-items', nargs='+', default=['README.md', 'LICENSE']) 19 | args.add_argument('--tag', help='Will use the last release if not set') 20 | return args.parse_args() 21 | 22 | 23 | def construct_tarball(target, extra_items, file_name): 24 | """Package a tarball of the binary + any extra items""" 25 | o = subprocess.check_output(['mktemp', '-d'], universal_newlines=True).strip() # -d creates a directory; strip the trailing newline 26 | tar_dir = file_name.replace('.tar.gz', '') 27 | if not os.path.isdir(o): 28 | raise Exception('failed to make temp dir') 29 | os.makedirs(os.path.join(o, tar_dir)) 30 | shutil.copy(target, os.path.join(o, tar_dir, os.path.basename(target))) 31 | for f in extra_items: 32 | shutil.copy(f, os.path.join(o, tar_dir, os.path.basename(f))) 33 | #subprocess.check_call(['strip', os.path.join(o, tar_dir, os.path.basename(target))]) 34 | os.chdir(o) 35 | subprocess.check_call(['tar', 'czf', file_name, tar_dir]) 36 | tarball = os.path.join(o, file_name) 37 | os.chdir('../') 38 | return tarball 39 | 40 | 41 | def upload_to_github(args): 42 | github = 
github3.login(token=args.token) 43 | repo = github.repository(args.owner, args.repo) 44 | releases = list(repo.releases()) 45 | 46 | if args.tag is not None: 47 | tag = args.tag 48 | try: 49 | release = [x for x in releases if x.tag_name == args.tag][0] 50 | except IndexError: 51 | print("WARNING: provided tag does not exist. Using last release") 52 | release = releases[-1] 53 | else: 54 | release = releases[-1] 55 | tag = release.tag_name 56 | 57 | file_name = 'vartrix-{}-x86_64-linux.tar.gz'.format(tag) 58 | tarball = construct_tarball(args.target, args.extra_items, file_name) 59 | try: 60 | release.upload_asset('application', file_name, open(tarball, 'rb'), label=file_name) 61 | except github3.GitHubError as e: 62 | raise Exception(e.errors) 63 | 64 | 65 | if __name__ == '__main__': 66 | args = parse_args() 67 | upload_to_github(args) 68 | -------------------------------------------------------------------------------- /ci/install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # install stuff needed for the `script` phase 4 | 5 | # Where rustup gets installed. 6 | export PATH="$PATH:$HOME/.cargo/bin" 7 | 8 | set -ex 9 | 10 | . "$(dirname $0)/utils.sh" 11 | 12 | install_rustup() { 13 | curl https://sh.rustup.rs -sSf \ 14 | | sh -s -- -y --default-toolchain="$TRAVIS_RUST_VERSION" 15 | rustc -V 16 | cargo -V 17 | } 18 | 19 | install_targets() { 20 | if [ $(host) != "$TARGET" ]; then 21 | rustup target add $TARGET 22 | fi 23 | } 24 | 25 | install_osx_dependencies() { 26 | if ! 
is_osx; then 27 | return 28 | fi 29 | 30 | brew install asciidoc docbook-xsl 31 | } 32 | 33 | configure_cargo() { 34 | local prefix=$(gcc_prefix) 35 | if [ -n "${prefix}" ]; then 36 | local gcc_suffix= 37 | if [ -n "$GCC_VERSION" ]; then 38 | gcc_suffix="-$GCC_VERSION" 39 | fi 40 | local gcc="${prefix}gcc${gcc_suffix}" 41 | 42 | # information about the cross compiler 43 | "${gcc}" -v 44 | 45 | # tell cargo which linker to use for cross compilation 46 | mkdir -p .cargo 47 | cat >>.cargo/config <&2 7 | exit 1 8 | fi 9 | version="$1" 10 | 11 | # Linux and Darwin builds. 12 | for arch in i686 x86_64; do 13 | for target in apple-darwin unknown-linux-musl; do 14 | url="https://github.com/10xgenomics/scHLAcount/releases/download/$version/scHLAcount-$version-$arch-$target.tar.gz" 15 | sha=$(curl -sfSL "$url" | sha256sum) 16 | echo "$version-$arch-$target $sha" 17 | done 18 | done 19 | 20 | # Source. 21 | for ext in zip tar.gz; do 22 | url="https://github.com/10xgenomics/scHLAcount/archive/$version.$ext" 23 | sha=$(curl -sfSL "$url" | sha256sum) 24 | echo "source.$ext $sha" 25 | done 26 | -------------------------------------------------------------------------------- /ci/utils.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Various utility functions used through CI. 4 | 5 | # Finds Cargo's `OUT_DIR` directory from the most recent build. 6 | # 7 | # This requires one parameter corresponding to the target directory 8 | # to search for the build output. 9 | cargo_out_dir() { 10 | # This works by finding the most recent stamp file, which is produced by 11 | # every scHLAcount build. 
12 | target_dir="$1" 13 | find "$target_dir" -name scHLAcount-stamp -print0 \ 14 | | xargs -0 ls -t \ 15 | | head -n1 \ 16 | | xargs dirname 17 | } 18 | 19 | host() { 20 | case "$TRAVIS_OS_NAME" in 21 | linux) 22 | echo x86_64-unknown-linux-gnu 23 | ;; 24 | osx) 25 | echo x86_64-apple-darwin 26 | ;; 27 | esac 28 | } 29 | 30 | architecture() { 31 | case "$TARGET" in 32 | x86_64-*) 33 | echo amd64 34 | ;; 35 | i686-*|i586-*|i386-*) 36 | echo i386 37 | ;; 38 | arm*-unknown-linux-gnueabihf) 39 | echo armhf 40 | ;; 41 | *) 42 | die "architecture: unexpected target $TARGET" 43 | ;; 44 | esac 45 | } 46 | 47 | gcc_prefix() { 48 | case "$(architecture)" in 49 | armhf) 50 | echo arm-linux-gnueabihf- 51 | ;; 52 | *) 53 | return 54 | ;; 55 | esac 56 | } 57 | 58 | is_ssse3_target() { 59 | case "$(architecture)" in 60 | amd64) return 0 ;; 61 | *) return 1 ;; 62 | esac 63 | } 64 | 65 | is_x86() { 66 | case "$(architecture)" in 67 | amd64|i386) return 0 ;; 68 | *) return 1 ;; 69 | esac 70 | } 71 | 72 | is_arm() { 73 | case "$(architecture)" in 74 | armhf) return 0 ;; 75 | *) return 1 ;; 76 | esac 77 | } 78 | 79 | is_linux() { 80 | case "$TRAVIS_OS_NAME" in 81 | linux) return 0 ;; 82 | *) return 1 ;; 83 | esac 84 | } 85 | 86 | is_osx() { 87 | case "$TRAVIS_OS_NAME" in 88 | osx) return 0 ;; 89 | *) return 1 ;; 90 | esac 91 | } 92 | -------------------------------------------------------------------------------- /editdistancealleles.py: -------------------------------------------------------------------------------- 1 | import editdistance 2 | import sys 3 | from subprocess import Popen, PIPE 4 | 5 | HLANUC="hla_nuc.fasta" 6 | 7 | if len(sys.argv) != 3: 8 | print("Usage: python editdistancealleles.py A*01:01:01 A*03:02:01\nNote: if you do not include the maximum level of specificity (e.g. 
four fields) an ARBITRARY allele will be chosen matching the prefix given.") 9 | sys.exit() 10 | 11 | a1 = sys.argv[1] 12 | a2 = sys.argv[2] 13 | 14 | a1p = Popen(["grep", "-F", "-m", "1", a1, HLANUC],stdout=PIPE,stderr=PIPE,universal_newlines=True) # text mode so the output is str, not bytes 15 | (a1_stdout, _) = a1p.communicate() 16 | a1_name = a1_stdout.split()[0][1:] 17 | a1_allele = a1_stdout.split()[1] 18 | a1p = Popen(["samtools", "faidx", HLANUC, a1_name],stdout=PIPE,stderr=PIPE,universal_newlines=True) 19 | (a1_stdout, _) = a1p.communicate() 20 | a1_seq = "".join([s.strip() for s in a1_stdout.split()[1:]]) 21 | 22 | a2p = Popen(["grep", "-F", "-m", "1", a2, HLANUC],stdout=PIPE,stderr=PIPE,universal_newlines=True) 23 | (a2_stdout, _) = a2p.communicate() 24 | a2_name = a2_stdout.split()[0][1:] 25 | a2_allele = a2_stdout.split()[1] 26 | a2p = Popen(["samtools", "faidx", HLANUC, a2_name],stdout=PIPE,stderr=PIPE,universal_newlines=True) 27 | (a2_stdout, _) = a2p.communicate() 28 | a2_seq = "".join([s.strip() for s in a2_stdout.split()[1:]]) 29 | 30 | d = editdistance.eval(a1_seq,a2_seq) 31 | 32 | print("CDS edit distance:", a1_allele, a2_allele, d) 33 | -------------------------------------------------------------------------------- /paper/AML_508084.txt: -------------------------------------------------------------------------------- 1 | A*68:01:01:01 2 | A*01:01:01:01 3 | B*07:02:01:01 4 | B*27:05:02:01 5 | C*07:02:01:01 6 | C*07:04:01:01 7 | DRB1*01:01:01 8 | DRB1*15:01:01:01 9 | DQB1*05:01:01:01 10 | DQB1*06:02:01:01 11 | DQA1*01:02:01:01 12 | DPA1*01:03:01:01 13 | DPB1*04:01:01:01 14 | -------------------------------------------------------------------------------- /paper/AML_548327.txt: -------------------------------------------------------------------------------- 1 | A*68:01:01:01 2 | A*02:06:01:01 3 | B*51:01:01:01 4 | B*44:05:01 5 | C*02:02:02:01 6 | DQB1*02:01:01 7 | DQB1*02:02:01:01 8 | DRB1*07:01:01:01 9 | DRB1*03:01:01:01 10 | DQA1*01:02:01:01 11 | DPA1*01:03:01:01 12 | DPB1*04:01:01:01 13 | -------------------------------------------------------------------------------- 
/paper/AML_721214.txt: -------------------------------------------------------------------------------- 1 | A*01:01:01:01 2 | A*03:01:01:01 3 | B*18:01:01:01 4 | B*14:01:01:01 5 | C*08:02:01:01 6 | C*07:40 7 | DQB1*02:02:01:01 8 | DQB1*03:02:01:01 9 | DRB1*07:01:01:01 10 | DRB1*04:03:01:01 11 | DQA1*01:02:01:01 12 | DPA1*01:03:01:01 13 | DPB1*04:01:01:01 14 | -------------------------------------------------------------------------------- /paper/AML_782328.txt: -------------------------------------------------------------------------------- 1 | A*32:01:01:01 2 | B*37:01:01:01 3 | B*15:01:01:01 4 | C*06:02:01:01 5 | C*03:04:01:01 6 | DRB1*04:01:01:01 7 | DRB1*10:01:01:01 8 | DQB1*05:01:01:01 9 | DQB1*03:02:01:01 10 | DQA1*01:02:01:01 11 | DPA1*01:03:01:01 12 | DPB1*04:01:01:01 13 | -------------------------------------------------------------------------------- /paper/AML_809653.txt: -------------------------------------------------------------------------------- 1 | A*68:02:01:01 2 | A*31:01:02:01 3 | B*27:05:02:01 4 | B*14:02:01:01 5 | C*08:02:01:01 6 | C*07:02:01:01 7 | DRB1*11:01:01:01 8 | DRB1*01:03:01 9 | DQB1*03:01:01:01 10 | DQA1*01:02:01:01 11 | DPA1*01:03:01:01 12 | DPB1*04:01:01:01 13 | -------------------------------------------------------------------------------- /paper/coverage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/coverage.png -------------------------------------------------------------------------------- /paper/figure2.R: -------------------------------------------------------------------------------- 1 | 2 | library(Matrix) 3 | library(Seurat) 4 | library(ggplot2) 5 | library(gridExtra) 6 | options(bitmapType='cairo') 7 | 8 | # Code to generate Figure 2 (Petti AML Subject 809653 DRB1 case study) 9 | 10 | # Get the cell type labels 11 | # R data object from 
https://zenodo.org/record/3066262#.XUnM5ZNKijh 12 | AML3 <- readRDS("/mnt/park1/compbio/HLA/aml_seurat/809653.seurat.rds") 13 | celltypes.paper <- AML3@meta.data$CellType 14 | names(celltypes.paper) <- dimnames(AML3@data)[[2]] 15 | rm(AML3) 16 | 17 | # Data processed with Cell Ranger 2.1.1 18 | CELLRANGERDIR <- "/mnt/analysis/marsoc/pipestances/H3TNHDSXX/SC_RNA_COUNTER_PD/64474/HEAD/outs/filtered_gene_bc_matrices_mex/GRCh38/" 19 | # Output of scHLAcount using true genotype FASTA files 20 | ANALYSISDIR <- "/mnt/park1/compbio/HLA/only-count-true-ref/AML3" 21 | 22 | # Read Cell Ranger data 23 | cells.data <- Read10X(data.dir = CELLRANGERDIR) 24 | cells <- CreateSeuratObject(counts = cells.data) 25 | 26 | # Add cell type labels as metadata column 27 | cells <- AddMetaData(object=cells,col.name="celltype",metadata=celltypes.paper) 28 | 29 | # Perform pre-processing with Seurat 30 | cells <- NormalizeData(cells) 31 | cells <- FindVariableFeatures(cells, selection.method = "vst", nfeatures = 2000) 32 | all.genes <- rownames(cells) 33 | cells <- ScaleData(cells, features = all.genes) 34 | 35 | cells <- RunPCA(cells, features = VariableFeatures(object = cells)) 36 | cells <- FindNeighbors(cells, dims = 1:10) 37 | cells <- FindClusters(cells, resolution = 0.5) 38 | cells <- RunTSNE(cells,check_duplicates = F) 39 | 40 | # Read scHLAcount data 41 | mat <- readMM(file = paste0(ANALYSISDIR, "/count_matrix.mtx")) 42 | feature.names <- read.delim(paste0(ANALYSISDIR,"/labels.tsv"), header = FALSE, stringsAsFactors = FALSE) 43 | dimnames(mat)[[1]] <- feature.names$V1 44 | dimnames(mat)[[2]] <- dimnames(cells)[[2]] 45 | 46 | # Normalize counts and insert into matrix as metadata columns 47 | c <- median(cells$nCount_RNA) 48 | curr_gene <- "" 49 | gene_names <- c() 50 | has_two_alleles <- c() 51 | n_seen = 1 52 | 53 | for (i in seq(1,length(feature.names$V1))) { 54 | f <- feature.names$V1[i] 55 | s <- strsplit(f, "*", fixed=T) 56 | if (s[[1]][1] == curr_gene) { 57 | n_seen <- n_seen + 
1 58 | } else { 59 | if (n_seen == 1 && curr_gene != "") { 60 | #one allele 61 | gene_names <- c(gene_names, curr_gene) 62 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1+c*mat[i-1,]/cells$nCount_RNA)) 63 | } else if (n_seen == 3) { 64 | #two alleles 65 | gene_names <- c(gene_names, curr_gene) 66 | has_two_alleles <- c(has_two_alleles, curr_gene) 67 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1+c*(mat[i-1,]+mat[i-2,]+mat[i-3,])/cells$nCount_RNA)) 68 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"sumnolog"),metadata=c*(mat[i-1,]+mat[i-2,]+mat[i-3,])/cells$nCount_RNA) 69 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene),metadata=c*mat[i-1,]/cells$nCount_RNA) 70 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"allele1"),metadata=c*mat[i-3,]/cells$nCount_RNA) 71 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"allele2"),metadata=c*mat[i-2,]/cells$nCount_RNA) 72 | x <- paste0("gene",curr_gene,"allele1") 73 | y <- paste0("gene",curr_gene,"allele2") 74 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"ratio"),metadata=cells@meta.data[x]/(cells@meta.data[x]+cells@meta.data[y])) 75 | n_seen = 1 76 | } 77 | curr_gene <- s[[1]][1] 78 | } 79 | } 80 | i <- i+1 81 | if (n_seen == 1 && curr_gene != "") { 82 | #one allele 83 | gene_names <- c(gene_names, curr_gene) 84 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1+c*mat[i-1,]/cells$nCount_RNA)) 85 | } else if (n_seen == 3) { 86 | #two alleles 87 | gene_names <- c(gene_names, curr_gene) 88 | has_two_alleles <- c(has_two_alleles, curr_gene) 89 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1+c*(mat[i-1,]+mat[i-2,]+mat[i-3,])/cells$nCount_RNA)) 90 | cells <- 
AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"sumnolog"),metadata=c*(mat[i-1,]+mat[i-2,]+mat[i-3,])/cells$nCount_RNA) 91 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene),metadata=c*mat[i-1,]/cells$nCount_RNA) 92 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"allele1"),metadata=c*mat[i-3,]/cells$nCount_RNA) 93 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"allele2"),metadata=c*mat[i-2,]/cells$nCount_RNA) 94 | x <- paste0("gene",curr_gene,"allele1") 95 | y <- paste0("gene",curr_gene,"allele2") 96 | cells <- AddMetaData(object=cells,col.name=paste0("gene",curr_gene,"ratio"),metadata=cells@meta.data[x]/(cells@meta.data[x]+cells@meta.data[y])) 97 | } 98 | 99 | # Load completed analysis up to this point 100 | #save(cells, file="/mnt/park1/compbio/HLA/aml_seurat/AML3.Rdata") 101 | load("/mnt/park1/compbio/HLA/aml_seurat/AML3.Rdata") 102 | 103 | CELLTYPE_PLOT <- DimPlot(object=cells,group.by='celltype', label=F, cells=rownames(cells@meta.data)[!is.na(cells$celltype)]) + ggtitle("(e) Cell types from Petti et al.") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 104 | 105 | # Bug in Seurat related to plotting, fails when a NA is first: https://github.com/satijalab/seurat/issues/1853 106 | # Temporary fix: reordering the rows so a NA doesn't come first 107 | new.cell.order <- rownames(cells@meta.data)[order(cells@meta.data["geneDRB1ratio"])] 108 | DRB1_ASE_PLOT <- FeaturePlot(cells, "geneDRB1ratio", reduction="tsne",cols=c('blue','red'), cells = new.cell.order) + ggtitle("(b) Fraction of molecules assigned to allele 01:03") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 109 | 110 | DRB1_EXPR_PLOT <- FeaturePlot(cells, "geneDRB1sum", reduction="tsne", cells = new.cell.order) + 
ggtitle("(a) HLA-DRB1 normalized expression") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 111 | 112 | C_ASE_PLOT <- FeaturePlot(cells, "geneCratio", reduction="tsne",cols=c('blue','red'), cells = new.cell.order) + ggtitle("(d) Fraction of molecules assigned to allele 07:02") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 113 | 114 | C_EXPR_PLOT <- FeaturePlot(cells, "geneCsum", reduction="tsne", cells = new.cell.order) + ggtitle("(c) HLA-C normalized expression") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 115 | 116 | 117 | COMBINED <- grid.arrange(DRB1_EXPR_PLOT,DRB1_ASE_PLOT,C_EXPR_PLOT,C_ASE_PLOT,CELLTYPE_PLOT, 118 | layout_matrix=rbind(c(1,1,1,2,2,2),c(3,3,3,4,4,4),c(5,5,5,5,NA,NA))) 119 | 120 | ggsave("figure2-updated.pdf", COMBINED, width=10, height=12, units="in") 121 | 122 | # Table 4 (HLA-DRB1) 123 | 124 | TYPES <- unique(celltypes.paper) 125 | ncells <- sapply(TYPES, function(ct) length(na.omit(cells@meta.data[paste0("gene","DRB1","sum")][cells$celltype == ct,1]))) # Number of cells 126 | a1 <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","DRB1","allele1")][cells$celltype == ct,1]))) 127 | a2 <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","DRB1","allele2")][cells$celltype == ct,1]))) 128 | s <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","DRB1","sum")][cells$celltype == ct,1]))) 129 | s_nolog <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","DRB1","sumnolog")][cells$celltype == ct,1]))) 130 | expr_DRB1 <- s/ncells # Normalized and log'd total expression 131 | expr_nolog_DRB1 <- s_nolog/ncells # Normalized and *not* log'd total 
expression 132 | ratio_DRB1 <- a1/(a1+a2) # Normalized but not log'd counts 133 | table4 <- cbind(ncells,ratio_DRB1,expr_nolog_DRB1) 134 | 135 | # Table 5 (HLA-C) 136 | 137 | a1 <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","C","allele1")][cells$celltype == ct,1]))) 138 | a2 <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","C","allele2")][cells$celltype == ct,1]))) 139 | s <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","C","sum")][cells$celltype == ct,1]))) 140 | s_nolog <- sapply(TYPES, function(ct) sum(na.omit(cells@meta.data[paste0("gene","C","sumnolog")][cells$celltype == ct,1]))) 141 | expr_C <- s/ncells # Normalized and log'd total expression 142 | expr_nolog_C <- s_nolog/ncells # Normalized and *not* log'd total expression 143 | ratio_C <- a1/(a1+a2) # Normalized but not log'd counts 144 | table5 <- cbind(ncells,ratio_C,expr_nolog_C) 145 | 146 | 147 | -------------------------------------------------------------------------------- /paper/figure3.R: -------------------------------------------------------------------------------- 1 | library(Matrix) 2 | library(Seurat) 3 | library(ggplot2) 4 | library(gridExtra) 5 | options(bitmapType='cairo') 6 | 7 | # Data processing commands for the expression matrix from Paulson Supplementary Data, updated to Seurat v3 syntax 8 | 9 | raw_data <- read.csv('/mnt/park1/compbio/HLA/paulson_data/GSE118056_raw.expMatrix.csv', header = TRUE, row.names = 1) 10 | data <- log2(1 + sweep(raw_data, 2, median(colSums(raw_data))/colSums(raw_data), '*')) # Normalization 11 | cellTypes <- sapply(strsplit(colnames(data), ".", fixed=T), function(x) x[2]) 12 | cellTypes <- ifelse(cellTypes == '1', 'PBMC', 'Tumor') 13 | seurat <- CreateSeuratObject(counts = data, project = '10x_MCC_2') # already normalized 14 | seurat <- AddMetaData(object = seurat, metadata = apply(raw_data, 2, sum), col.name = 'nUMI_raw') 15 | seurat <- AddMetaData(object = seurat, metadata = 
cellTypes, col.name = 'cellTypes') 16 | seurat <- ScaleData(object = seurat, vars.to.regress = c('nUMI_raw'), model.use = 'linear', use.umi = FALSE) 17 | seurat <- FindVariableFeatures(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.05, x.high.cutoff = 4, y.cutoff = 0.5) 18 | seurat <- RunPCA(object = seurat, pc.genes = seurat@var.genes, pcs.compute = 40) 19 | seurat <- RunTSNE(object = seurat, dims.use = 1:10, perplexity = 50, do.fast = TRUE) 20 | seurat <- FindNeighbors(seurat, dims = 1:10, k.param = 20, reduction="pca") 21 | seurat <- FindClusters(seurat, resolution = 0.6) 22 | 23 | # Read in scHLAcount matrixes 24 | mat1 <- readMM(file = paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692286", "/count_matrix.mtx")) 25 | feature.names <- read.delim(paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692286","/labels.tsv"), header = FALSE, stringsAsFactors = FALSE) 26 | cell.names <- read.delim(gzfile(paste0("/mnt/park1/compbio/HLA/paulson_data/cellranger_outs/SRR7692286", "/outs/filtered_feature_bc_matrix/","barcodes.tsv.gz")), header = FALSE, stringsAsFactors = FALSE) 27 | cell.names <- sapply(strsplit(cell.names$V1,"-",fixed=T), function(x) paste0(x[1], ".1")) 28 | dimnames(mat1)[[1]] <- feature.names$V1 29 | dimnames(mat1)[[2]] <- cell.names 30 | 31 | mat2 <- readMM(file = paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692287", "/count_matrix.mtx")) 32 | feature.names <- read.delim(paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692287","/labels.tsv"), header = FALSE, stringsAsFactors = FALSE) 33 | cell.names <- read.delim(gzfile(paste0("/mnt/park1/compbio/HLA/paulson_data/cellranger_outs/SRR7692287", "/outs/filtered_feature_bc_matrix/","barcodes.tsv.gz")), header = FALSE, stringsAsFactors = FALSE) 34 | cell.names <- sapply(strsplit(cell.names$V1,"-",fixed=T), function(x) paste0(x[1], ".1")) 35 | dimnames(mat2)[[1]] <- feature.names$V1 36 | dimnames(mat2)[[2]] <- cell.names 
37 | 38 | # Tumor 39 | mat3 <- readMM(file = paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692288", "/count_matrix.mtx")) 40 | feature.names <- read.delim(paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692288","/labels.tsv"), header = FALSE, stringsAsFactors = FALSE) 41 | cell.names <- read.delim(gzfile(paste0("/mnt/park1/compbio/HLA/paulson_data/cellranger_outs/SRR7692288", "/outs/filtered_feature_bc_matrix/","barcodes.tsv.gz")), header = FALSE, stringsAsFactors = FALSE) 42 | cell.names <- sapply(strsplit(cell.names$V1,"-",fixed=T), function(x) paste0(x[1], ".2")) 43 | dimnames(mat3)[[1]] <- feature.names$V1 44 | dimnames(mat3)[[2]] <- cell.names 45 | 46 | mat4 <- readMM(file = paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692289", "/count_matrix.mtx")) 47 | feature.names <- read.delim(paste0("/mnt/park1/compbio/HLA/only-count-true-ref/paulson-7692289","/labels.tsv"), header = FALSE, stringsAsFactors = FALSE) 48 | cell.names <- read.delim(gzfile(paste0("/mnt/park1/compbio/HLA/paulson_data/cellranger_outs/SRR7692289", "/outs/filtered_feature_bc_matrix/","barcodes.tsv.gz")), header = FALSE, stringsAsFactors = FALSE) 49 | cell.names <- sapply(strsplit(cell.names$V1,"-",fixed=T), function(x) paste0(x[1], ".2")) 50 | dimnames(mat4)[[1]] <- feature.names$V1 51 | dimnames(mat4)[[2]] <- cell.names 52 | 53 | mat <- cbind(mat1,mat2,mat3,mat4) 54 | 55 | # Normalize counts and insert into matrix as metadata columns 56 | c <- median(seurat$nCount_RNA) 57 | curr_gene <- "" 58 | gene_names <- c() 59 | has_two_alleles <- c() 60 | n_seen = 1 61 | 62 | for (i in seq(1,length(feature.names$V1))) { 63 | f <- feature.names$V1[i] 64 | s <- strsplit(f, "*", fixed=T) 65 | if (s[[1]][1] == curr_gene) { 66 | n_seen <- n_seen + 1 67 | } else { 68 | if (n_seen == 1 && curr_gene != "") { 69 | #one allele 70 | gene_names <- c(gene_names, curr_gene) 71 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=mat[i-1,]) 72 | 
seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1 + c*seurat@meta.data[paste0("gene",curr_gene,"sum")]/seurat$nCount_RNA)) 73 | } else if (n_seen == 3) { 74 | #two alleles 75 | gene_names <- c(gene_names, curr_gene) 76 | has_two_alleles <- c(has_two_alleles, curr_gene) 77 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=mat[i-1,]+mat[i-2,]+mat[i-3,]) 78 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene),metadata=mat[i-1,]) 79 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele1"),metadata=mat[i-3,]) 80 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele2"),metadata=mat[i-2,]) 81 | 82 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1 + c*seurat@meta.data[paste0("gene",curr_gene,"sum")]/seurat$nCount_RNA)) 83 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sumnolog"),metadata=c*seurat@meta.data[paste0("gene",curr_gene,"sum")]/seurat$nCount_RNA) 84 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene),metadata=c*seurat@meta.data[paste0("gene",curr_gene)]/seurat$nCount_RNA) 85 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele1"),metadata=c*seurat@meta.data[paste0("gene",curr_gene,"allele1")]/seurat$nCount_RNA) 86 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele2"),metadata=c*seurat@meta.data[paste0("gene",curr_gene,"allele2")]/seurat$nCount_RNA) 87 | x <- paste0("gene",curr_gene,"allele1") 88 | y <- paste0("gene",curr_gene,"allele2") 89 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"ratio"),metadata=seurat@meta.data[x]/(seurat@meta.data[x]+seurat@meta.data[y])) 90 | n_seen = 1 91 | } 92 | curr_gene <- s[[1]][1] 93 | } 94 | } 95 | i <- i+1 96 | if (n_seen == 1 && curr_gene != "") { 97 | #one allele 98 | gene_names <- c(gene_names, curr_gene) 99 | 
seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=mat[i-1,]) 100 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1 + c*seurat@meta.data[paste0("gene",curr_gene,"sum")]/seurat$nCount_RNA)) 101 | } else if (n_seen == 3) { 102 | #two alleles 103 | gene_names <- c(gene_names, curr_gene) 104 | has_two_alleles <- c(has_two_alleles, curr_gene) 105 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=mat[i-1,]+mat[i-2,]+mat[i-3,]) 106 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene),metadata=mat[i-1,]) 107 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele1"),metadata=mat[i-3,]) 108 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele2"),metadata=mat[i-2,]) 109 | 110 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sum"),metadata=log2(1 + c*seurat@meta.data[paste0("gene",curr_gene,"sum")]/seurat$nCount_RNA)) 111 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"sumnolog"),metadata=c*seurat@meta.data[paste0("gene",curr_gene,"sum")]/seurat$nCount_RNA) 112 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene),metadata=c*seurat@meta.data[paste0("gene",curr_gene)]/seurat$nCount_RNA) 113 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele1"),metadata=c*seurat@meta.data[paste0("gene",curr_gene,"allele1")]/seurat$nCount_RNA) 114 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"allele2"),metadata=c*seurat@meta.data[paste0("gene",curr_gene,"allele2")]/seurat$nCount_RNA) 115 | x <- paste0("gene",curr_gene,"allele1") 116 | y <- paste0("gene",curr_gene,"allele2") 117 | seurat <- AddMetaData(object=seurat,col.name=paste0("gene",curr_gene,"ratio"),metadata=seurat@meta.data[x]/(seurat@meta.data[x]+seurat@meta.data[y])) 118 | } 119 | 120 | # Clusters determined by marker genes 121 | 
tumorClusters <- c(2,3,4,6,7,10,11) 122 | tumorCells <- seurat@meta.data["seurat_clusters"][,1] %in% tumorClusters 123 | normalCells <- !tumorCells 124 | 125 | cellTypes1 <- ifelse(tumorCells, 'Tumor', 'Non-Tumor') 126 | seurat <- AddMetaData(object = seurat, metadata = cellTypes1, col.name = 'cellTypes1') 127 | 128 | # Load completed analysis up to this point 129 | #save(seurat, file="/mnt/park1/compbio/HLA/paulson_robj/paulson_val_seurat.Rdata") 130 | load("/mnt/park1/compbio/HLA/paulson_robj/paulson_val_seurat.Rdata") 131 | 132 | # Plots 133 | 134 | CELLTYPEPLOT <- TSNEPlot(seurat, group.by = 'cellTypes1') + ggtitle("Cell types inferred from marker genes") + theme(legend.position = c(0,0.1)) 135 | ggsave("figureS1.pdf", CELLTYPEPLOT, width=5, height=5, units="in") 136 | 137 | a_plt <- FeaturePlot(seurat, paste0("gene","A", "sum"), reduction="tsne") + ggtitle("(a) HLA-A normalized expression") + theme(plot.title = element_text(size=12)) 138 | c_plt <- FeaturePlot(seurat, paste0("gene","B", "sum"), reduction="tsne") + ggtitle("(c) HLA-B normalized expression") + theme(plot.title = element_text(size=12)) 139 | e_plt <- FeaturePlot(seurat, paste0("gene","C", "sum"), reduction="tsne") + ggtitle("(e) HLA-C normalized expression") + theme(plot.title = element_text(size=12)) 140 | 141 | # Bug in Seurat related to plotting, fails when a NA is first: https://github.com/satijalab/seurat/issues/1853 142 | # Temporary fix: reordering the rows so a NA doesn't come first 143 | new.cell.order <- rownames(seurat@meta.data)[order(seurat@meta.data[paste0("gene","A","ratio")])] 144 | b_plt <- FeaturePlot(seurat, paste0("gene","A", "ratio"), reduction="tsne", cols=c('blue','red'), cells=new.cell.order) + ggtitle("(b) Fraction of molecules assigned to allele A*02:01") + theme(plot.title = element_text(size=12)) 145 | d_plt <- FeaturePlot(seurat, paste0("gene","B", "ratio"), reduction="tsne", cols=c('blue','red')) + ggtitle("(d) Fraction of molecules assigned to allele B*35:01") + 
theme(plot.title = element_text(size=12)) 146 | f_plt <- FeaturePlot(seurat, paste0("gene","C", "ratio"), reduction="tsne", cols=c('blue','red')) + ggtitle("(f) Fraction of molecules assigned to allele C1") + theme(plot.title = element_text(size=12)) 147 | 148 | COMBINED <- grid.arrange(a_plt, b_plt, c_plt, d_plt, e_plt, f_plt, 149 | layout_matrix=rbind(c(1,2),c(3,4),c(5,6))) 150 | ggsave("figure3.pdf", COMBINED, width=8.5, height=10, units="in") 151 | 152 | CELLTYPEPLOT <- TSNEPlot(seurat, group.by = 'cellTypes1') + ggtitle("(g) Cell types") + theme(plot.title = element_text(size=12)) + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 153 | 154 | a_plt <- FeaturePlot(seurat, paste0("gene","A", "sum"), reduction="tsne") + ggtitle("(a) HLA-A") + theme(plot.title = element_text(size=12)) + theme(legend.position = "none") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 155 | b_plt <- FeaturePlot(seurat, paste0("gene","B", "sum"), reduction="tsne") + ggtitle("(b) HLA-B") + theme(plot.title = element_text(size=12)) + theme(legend.position = "none") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 156 | c_plt <- FeaturePlot(seurat, paste0("gene","C", "sum"), reduction="tsne") + ggtitle("(c) HLA-C") + theme(plot.title = element_text(size=12)) + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 157 | 158 | # Bug in Seurat related to plotting, fails when a NA is first: https://github.com/satijalab/seurat/issues/1853 159 | # Temporary fix: reordering the rows so a NA doesn't come first 160 | new.cell.order <- 
rownames(seurat@meta.data)[order(seurat@meta.data[paste0("gene","A","ratio")])] 161 | d_plt <- FeaturePlot(seurat, paste0("gene","A", "ratio"), reduction="tsne", cols=c('blue','red'), cells=new.cell.order) + ggtitle("(d) Fraction molecules\nA*02:01") + theme(plot.title = element_text(size=12)) + theme(legend.position = "none") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 162 | e_plt <- FeaturePlot(seurat, paste0("gene","B", "ratio"), reduction="tsne", cols=c('blue','red')) + ggtitle("(e) Fraction molecules\nB*35:01") + theme(plot.title = element_text(size=12)) + theme(legend.position = "none") + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 163 | f_plt <- FeaturePlot(seurat, paste0("gene","C", "ratio"), reduction="tsne", cols=c('blue','red')) + ggtitle("(f) Fraction molecules C1") + theme(plot.title = element_text(size=12)) + theme(axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank(),axis.title=element_text(size=10)) 164 | 165 | COMBINED <- grid.arrange(a_plt, b_plt, c_plt, d_plt, e_plt, f_plt, CELLTYPEPLOT, 166 | layout_matrix=rbind(c(1,1,1,2,2,2,3,3,3,3),c(4,4,4,5,5,5,6,6,6,6),c(7,7,7,7,7,NA,NA,NA,NA,NA))) 167 | ggsave("figure3-newlayout.pdf", COMBINED, width=8.5, height=8, units="in") 168 | 169 | 170 | 171 | # Table 8 172 | n_tumor <- sum(seurat$cellTypes1 == "Tumor") 173 | mean(na.omit(seurat$geneAsumnolog[seurat$cellTypes1 == "Tumor"])) 174 | mean(na.omit(seurat$geneBsumnolog[seurat$cellTypes1 == "Tumor"])) 175 | mean(na.omit(seurat$geneCsumnolog[seurat$cellTypes1 == "Tumor"])) 176 | mean(na.omit(seurat$geneAratio[seurat$cellTypes1 == "Tumor"]))*100 177 | mean(na.omit(seurat$geneBratio[seurat$cellTypes1 == "Tumor"]))*100 178 | mean(na.omit(seurat$geneCratio[seurat$cellTypes1 == 
"Tumor"]))*100 179 | 180 | n_normal <- sum(seurat$cellTypes1 == "Non-Tumor") 181 | mean(na.omit(seurat$geneAsumnolog[seurat$cellTypes1 == "Non-Tumor"])) 182 | mean(na.omit(seurat$geneBsumnolog[seurat$cellTypes1 == "Non-Tumor"])) 183 | mean(na.omit(seurat$geneCsumnolog[seurat$cellTypes1 == "Non-Tumor"])) 184 | mean(na.omit(seurat$geneAratio[seurat$cellTypes1 == "Non-Tumor"]))*100 185 | mean(na.omit(seurat$geneBratio[seurat$cellTypes1 == "Non-Tumor"]))*100 186 | mean(na.omit(seurat$geneCratio[seurat$cellTypes1 == "Non-Tumor"]))*100 187 | 188 | 189 | # Table 7 (Discovery Subject) 190 | 191 | # Load completed analysis, same preprocessing steps as for Validation Subject 192 | # save(PBMC, file="/mnt/park1/compbio/HLA/paulson_robj/paulson_disc_pbmc.Rdata") 193 | # save(tumor, file="/mnt/park1/compbio/HLA/paulson_robj/paulson_disc_tumor.Rdata") 194 | load("/mnt/park1/compbio/HLA/paulson_robj/paulson_disc_tumor.Rdata") 195 | load("/mnt/park1/compbio/HLA/paulson_robj/paulson_disc_pbmc.Rdata") 196 | 197 | n_tumor <- sum(tumor$cellTypes1 == "Tumor") 198 | mean(na.omit(tumor$geneAsumnolog[tumor$cellTypes1 == "Tumor"])) 199 | mean(na.omit(tumor$geneBsumnolog[tumor$cellTypes1 == "Tumor"])) 200 | mean(na.omit(tumor$geneCsumnolog[tumor$cellTypes1 == "Tumor"])) 201 | mean(na.omit(tumor$geneAratio[tumor$cellTypes1 == "Tumor"]))*100 202 | mean(na.omit(tumor$geneBratio[tumor$cellTypes1 == "Tumor"]))*100 203 | mean(na.omit(tumor$geneCratio[tumor$cellTypes1 == "Tumor"]))*100 204 | 205 | n_normal <- sum(tumor$cellTypes1 == "Non-Tumor") 206 | mean(na.omit(tumor$geneAsumnolog[tumor$cellTypes1 == "Non-Tumor"])) 207 | mean(na.omit(tumor$geneBsumnolog[tumor$cellTypes1 == "Non-Tumor"])) 208 | mean(na.omit(tumor$geneCsumnolog[tumor$cellTypes1 == "Non-Tumor"])) 209 | mean(na.omit(tumor$geneAratio[tumor$cellTypes1 == "Non-Tumor"]))*100 210 | mean(na.omit(tumor$geneBratio[tumor$cellTypes1 == "Non-Tumor"]))*100 211 | mean(na.omit(tumor$geneCratio[tumor$cellTypes1 == "Non-Tumor"]))*100 212 | 213 | 
n_normal <- length(PBMC$TimePoints) 214 | mean(na.omit(PBMC$geneAsumnolog)) 215 | mean(na.omit(PBMC$geneBsumnolog)) 216 | mean(na.omit(PBMC$geneCsumnolog)) 217 | mean(na.omit(PBMC$geneAratio))*100 218 | mean(na.omit(PBMC$geneBratio))*100 219 | mean(na.omit(PBMC$geneCratio))*100 220 | -------------------------------------------------------------------------------- /paper/figureS2.R: -------------------------------------------------------------------------------- 1 | 2 | # Figure S2 3 | #==> hlaa.bed <== 4 | #6 29942532 29945870 5 | 6 | #==> hlab.bed <== 7 | #6 31353875 31357179 8 | 9 | #==> hlac.bed <== 10 | #6 31268749 31272092 11 | 12 | # samtools view -b /mnt/yard2/ian/sra-data/paulson/SRR7692286/outs/possorted_genome_bam.bam 6:29942532-29945870 | bedtools coverage -split -a hlaa.bed -b stdin -d | cut -f5 > acov5p.txt 13 | # samtools view -b /mnt/yard2/ian/sra-data/paulson/SRR7692286/outs/possorted_genome_bam.bam 6:31353875-31357179 | bedtools coverage -split -a hlab.bed -b stdin -d | cut -f5 > bcov5p.txt 14 | # samtools view -b /mnt/yard2/ian/sra-data/paulson/SRR7692286/outs/possorted_genome_bam.bam 6:31268749-31272092 | bedtools coverage -split -a hlac.bed -b stdin -d | cut -f5 > ccov5p.txt 15 | 16 | # samtools view -b yard/paulson_merged.bam 6:29942532-29945870 | bedtools coverage -split -a hlaa.bed -b stdin -d | cut -f5 > acov3p.txt 17 | # samtools view -b yard/paulson_merged.bam 6:31353875-31357179 | bedtools coverage -split -a hlab.bed -b stdin -d | cut -f5 > bcov3p.txt 18 | # samtools view -b yard/paulson_merged.bam 6:31268749-31272092 | bedtools coverage -split -a hlac.bed -b stdin -d | cut -f5 > ccov3p.txt 19 | 20 | 21 | acov3p <- read.delim("acov3p.txt",header=F) 22 | bcov3p <- read.delim("bcov3p.txt",header=F) 23 | ccov3p <- read.delim("ccov3p.txt",header=F) 24 | 25 | acov5p <- read.delim("acov5p.txt",header=F) 26 | bcov5p <- read.delim("bcov5p.txt",header=F) 27 | ccov5p <- read.delim("ccov5p.txt",header=F) 28 | 29 | 
par(mfrow=c(3,1),mar=c(2,2,2,1),cex.main=2,cex.axis=1.3) 30 | plot(seq(1,3338),sapply(acov5p$V1, function(x) (x-min(acov5p))/(max(acov5p)-min(acov5p))), ylim=c(-0.1,1),type='l',lwd=2,main="HLA-A Read Coverage chr6:29942532-29945870 (+ strand)",ylab="") 31 | lines(seq(1,3338),sapply(acov3p$V1, function(x) (x-min(acov3p))/(max(acov3p)-min(acov3p))), type='l',col='red',lwd=2) 32 | legend("top",c("3' GEX", "5' GEX"),col=c("red","black"),lty=c(1,1),lwd=c(3,3),cex=2) 33 | plot(seq(1,3304),sapply(bcov5p$V1, function(x) (x-min(bcov5p))/(max(bcov5p)-min(bcov5p))), ylim=c(-0.1,1),type='l',lwd=2,main="HLA-B Read Coverage chr6:31353875-31357179 (- strand)",ylab="",xlim=c(3304,0)) 34 | lines(seq(1,3304),sapply(bcov3p$V1, function(x) (x-min(bcov3p))/(max(bcov3p)-min(bcov3p))), type='l',col='red',lwd=2) 35 | plot(seq(1,3343),sapply(ccov5p$V1, function(x) (x-min(ccov5p))/(max(ccov5p)-min(ccov5p))), ylim=c(-0.1,1),type='l',lwd=2,main="HLA-C Read Coverage chr6:31268749-31272092 (- strand)",ylab="",xlim=c(3343,0)) 36 | lines(seq(1,3343),sapply(ccov3p$V1, function(x) (x-min(ccov3p))/(max(ccov3p)-min(ccov3p))), type='l',col='red',lwd=2) 37 | -------------------------------------------------------------------------------- /paper/figures/figure2-updated.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/figures/figure2-updated.pdf -------------------------------------------------------------------------------- /paper/figures/figure3-newlayout.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/figures/figure3-newlayout.pdf -------------------------------------------------------------------------------- /paper/figures/figureS1.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/figures/figureS1.pdf -------------------------------------------------------------------------------- /paper/figures/figureS2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/figures/figureS2.pdf -------------------------------------------------------------------------------- /paper/figures/workflow-updated.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/figures/workflow-updated.pdf -------------------------------------------------------------------------------- /paper/main.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article}[12pt] 2 | \usepackage[utf8]{inputenc} 3 | \usepackage{graphicx} 4 | \usepackage[margin=1in]{geometry} 5 | \usepackage{wrapfig} 6 | \usepackage{url} 7 | \usepackage{amsmath} 8 | \usepackage{authblk} 9 | \usepackage{placeins} 10 | \usepackage{hyperref} 11 | 12 | \usepackage[minbibnames=10,maxbibnames=10,style=authoryear,isbn=false,url=false]{biblatex} 13 | \addbibresource{schlacount.bib} 14 | 15 | \newcommand{\beginsupplement}{% 16 | \setcounter{table}{0} 17 | \renewcommand{\thetable}{S\arabic{table}}% 18 | \setcounter{figure}{0} 19 | \renewcommand{\thefigure}{S\arabic{figure}}% 20 | } 21 | 22 | \renewcommand{\baselinestretch}{1.25} 23 | 24 | \title{scHLAcount: Allele-specific HLA expression from single-cell gene expression data} 25 | \author[1]{Charlotte A.~Darby} 26 | \author[2]{Michael J.~T.~Stubbington} 27 | \author[2]{Patrick J.~Marks} 28 | \author[2]{Álvaro Martínez Barrio \thanks{Correspondence to ambarrio@10xgenomics.com}} 29 | \author[2]{Ian T.~Fiddes} 30 | \affil[1]{Department of Computer 
Science, Johns Hopkins University, Baltimore MD} 31 | \affil[2]{10x Genomics, Pleasanton CA} 32 | 33 | \begin{document} 34 | 35 | \maketitle 36 | 37 | \begin{abstract} 38 | Studies in bulk RNA sequencing data suggest cell-type and allele-specific expression of the human leukocyte antigen (HLA) genes. These loci are extremely diverse and they function as part of the major histocompatibility complex (MHC), which is responsible for antigen presentation. Mutation and/or misregulation of expression of HLA genes has implications in diseases, especially cancer. Immune responses to tumor cells can be evaded through HLA loss of function. However, bulk RNA-seq does not fully disentangle cell type specificity and allelic expression. Here we present scHLAcount, a workflow for computing allele-specific molecule counts of the HLA genes in single cells using an individualized reference. We demonstrate that scHLAcount can be used to find cell-type specific allelic expression of HLA genes in blood cells, and to detect different allelic expression patterns between tumor and normal cells in patient biopsies. scHLAcount is available at \url{https://github.com/10XGenomics/scHLAcount}. 39 | \end{abstract} 40 | 41 | 42 | \section*{Introduction} 43 | 44 | The major histocompatibility complex (MHC) locus of human chromosome 6 is important for antigen presentation, containing genes for both class I and class II human leukocyte antigen (HLA). This locus is highly variable in the human population, with hundreds of characterized alleles that can be considerably divergent. Class I HLA alleles are responsible for neoantigen presentation, and therefore HLA haplotype information for a patient is important for developing targeted immunotherapies. Loss of HLA expression or function is likely a major driver of immunotherapy evasion. 
Loss of HLA class I expression has been demonstrated in relapse after immunotherapy treatment of Merkel cell carcinoma~\parencite{Paulson2018} and loss of HLA class II expression was observed in relapse after hematopoietic stem-cell transplantation for acute myeloid leukemia (AML)~\parencite{Christopher2018}. Genomic loss of heterozygosity of HLA has been detected in 40\% of non-small-cell lung cancers using the LOHHLA algorithm, which uses information about the individual’s HLA genotype to determine copy number~\parencite{McGranahan2017}. 45 | 46 | In bulk RNA-seq data, expression of MHC locus genes is often underestimated due to poor mappability caused by variability in the locus. Tools that build custom diploid references, such as AltHapAlignR~\parencite{Lee2018}, improve expression quantification. HLApers extended the diploid reference model to improve allele-specific expression estimates~\parencite{Aguiar2019} for eQTL mapping. 47 | 48 | We seek to apply this concept to single cell gene expression data, such as those produced by 10x Genomics' Single Cell Immune Profiling (5' capture) and Gene Expression (GEX) (3' capture) Solutions. Single cell expression analysis software, such as 10x Genomics' Cell Ranger, produces a matrix of molecule counts for each gene in each cell. HLA expression is systematically underestimated when using the reference genome compared to a personalized diploid reference~\parencite{Aguiar2019}. Therefore, as Cell Ranger relies on alignment to the reference genome, per-cell molecule counts for HLA genes are also likely to be underestimated, and potentially skewed by haplotype and population of origin. 49 | 50 | HLA allele-specific expression (ASE) has been seen in lymphoblastoid cell lines~\parencite{Aguiar2019}. In a study of allele-specific expression of HLA-A, -B, and -C genes in peripheral blood mononuclear cell (PBMC) subsets, no cell type specific allele preference was found~\parencite{Greene2011}. 
However, alleles in the rhesus macaque with significant cell type specific expression were found. Some HLA-C alleles with consistently higher expression have been found by qPCR~\parencite{Bettens2014}; this has also been observed for some alleles of the class II genes HLA-DQB1 and HLA-DQA1~\parencite{Zajacova2018}. 51 | 52 | The recent paradigm shift in solid tumor treatment by immune-checkpoint blockade (ICB) therapies has not been followed with a parallel development in prognostic biomarkers. Currently, the only FDA-approved biomarker is high PD-L1 expression~\parencite{Eisenstein2017}, although many others are being investigated~\parencite{Conway2018}. PD-1 blockade is effective when antigens are presented by the MHC of tumor cells and lymphocytes successfully infiltrate the tumor and recognize those antigens. Increased heterozygosity at HLA class I loci has been linked to better overall survival in ICB, especially when associated with certain HLA types~\parencite{Chowell2018}. HLA class II genes are also expressed in some tumor cells showing positive ICB response~\parencite{Johnson2016}. Here we provide a tool to study allele-specific expression of HLA genes at single cell resolution. 53 | 54 | \newpage 55 | \section*{Results} 56 | 57 | \begin{wrapfigure}{r}{.46\textwidth} 58 | \includegraphics[width=\hsize]{figures/workflow-updated.pdf} 59 | \caption{scHLAcount pipeline illustration} 60 | \label{fig:workflow} 61 | \end{wrapfigure} 62 | 63 | scHLAcount is a postprocessing workflow for single cell gene expression data that performs allele-specific molecule counting for the main HLA class I and class II genes in each cell based on user-supplied HLA genotypes. Each molecule is assigned to an allele based on the consensus of pseudoalignments of the constituent reads to a personalized HLA reference graph. The workflow is illustrated in Figure~\ref{fig:workflow}. 
64 | 65 | We demonstrate that HLA genes present cell-type specific expression~\parencite{Boegel2018} and that HLA loss of expression can be evaluated per-cell and per-cluster. Using five AML samples published in~\parencite{Petti2019} for which HLA class I and class II genotypes were provided by the authors, we demonstrate the ability to find cell type specific allele bias when cell types have been annotated using marker genes. We also analyze data from two Merkel cell carcinoma (MCC) patients published in~\parencite{Yost2019} and extend their finding that HLA class I expression is lost, to show that this expression loss may be allele-specific. Both datasets illustrate that most molecules in 5' GEX data can be assigned to a specific allele when the individual is heterozygous, resulting in dataset-wide estimates of allele bias in molecule counts. 66 | 67 | \subsection*{Acute myeloid leukemia (AML)} 68 | 69 | 10x Genomics Chromium 5' GEX library data derived from five subjects with AML, as described in~\parencite{Petti2019}, were reanalyzed. Genotypes for HLA-A, -B, -C, -DRB1, and -DQB1 at two-field resolution were provided by the authors. 70 | 71 | Given the genotypes, we built custom diploid references; the allele from the GRCh38 primary assembly was used for genes HLA-DPA1, DPB1, and DQA1, for which genotypes were not available. Raw scHLAcount molecule counts are summarized in Table \ref{tab:aml}. Molecule counts were normalized with the following formula: 72 | 73 | \[ 74 | \text{median molecule count} \times \text{raw molecule count} / \text{cell molecule count} 75 | \] 76 | 77 | Normalization and dimensionality reduction of the gene expression matrix generated by Cell Ranger v2.1.1 were performed using Seurat v3.0.2~\parencite{Stuart2019}. For all the biallelic genes in each subject, we calculated the average normalized expression per gene and the fraction of the normalized expression for each allele in the nine cell types with at least 100 cells assigned. 
As observed in the T cell dataset, some genes had more expression of one allele than the other. Results for subject 809653 with the class II gene HLA-DRB1 are listed in Table \ref{tab:amldrb1} and visualized on a t-SNE dimensionality reduction plot in Figure \ref{fig:amldrb1}a,b. Depending on cell type, 42\% to 54\% of HLA-DRB1 molecules are assigned to the DRB1*01:03 allele. This allele preference does not show a trend with average expression. For the same subject, 27\% to 41\% of HLA-C molecules are assigned to C*07:02, depending on cell type (Figure \ref{fig:amldrb1}c,d; Table \ref{tab:amlc}). 78 | 79 | %TODO 80 | % Prepare figures in the background? 81 | \begin{table}[h] 82 | \begin{tabular}{|l|r|r|r|} 83 | \hline 84 | Cell type & \multicolumn{1}{l|}{\# cells} & 85 | \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% of DRB1 molecules \\ assigned to 01:03 allele\end{tabular}} & 86 | \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Avg. HLA-DRB1 \\ normalized expression\end{tabular}} \\ \hline 87 | ERY & 3,728 & 41.9 & 0.238 \\ \hline 88 | T-CELL & 10,942 & 44.8 & 0.741 \\ \hline 89 | PRE-B-CELL & 336 & 47.4 & 1.162 \\ \hline 90 | B-CELL & 868 & 47.4 & 14.185 \\ \hline 91 | HSC & 2,261 & 52.1 & 5.247 \\ \hline 92 | MEP & 560 & 53.0 & 3.411 \\ \hline 93 | DEND (M) & 620 & 53.7 & 17.602 \\ \hline 94 | ERY (CD34+) & 432 & 53.9 & 2.153 \\ \hline 95 | MONO & 1,366 & 54.0 & 7.390 \\ \hline 96 | \end{tabular} 97 | \caption{Normalized expression and allele-specific expression of HLA-DRB1 for subject 809653 from \parencite{Petti2019}, stratified by cell type. Average is taken over all cells assigned to a particular cell type.} 98 | \label{tab:amldrb1} 99 | \end{table} 100 | 101 | 102 | \begin{table}[h] 103 | \begin{tabular}{|l|r|r|r|} 104 | \hline 105 | Cell type & \multicolumn{1}{l|}{\# cells} & 106 | \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% of HLA-C molecules \\ assigned to 07:02 allele\end{tabular}} & 107 | \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Avg. 
HLA-C \\ normalized expression\end{tabular}} \\ \hline 108 | B-CELL & 868 & 26.7 & 5.184 \\ \hline 109 | MONO & 1,366 & 32.7 & 5.813 \\ \hline 110 | PRE-B-CELL & 336 & 33.9 & 3.266 \\ \hline 111 | DEND (M) & 620 & 35.1 & 3.890 \\ \hline 112 | T-CELL & 10,942 & 37.0 & 8.926 \\ \hline 113 | HSC & 2,261 & 38.8 & 3.281 \\ \hline 114 | MEP & 560 & 40.3 & 2.578 \\ \hline 115 | ERY (CD34+) & 432 & 40.9 & 2.429 \\ \hline 116 | ERY & 3,728 & 41.0 & 0.386 \\ \hline 117 | \end{tabular} 118 | \caption{Normalized expression and allele-specific expression of HLA-C for subject 809653 from \parencite{Petti2019}, stratified by cell type. Average is taken over all cells assigned to a particular cell type.} 119 | \label{tab:amlc} 120 | \end{table} 121 | 122 | 123 | \begin{figure} 124 | \includegraphics[width=0.95\textwidth]{figures/figure2-updated.pdf} 125 | \caption{ \textbf{(a)} For each cell, color indicates log$_2$(1 + normalized expression) of HLA-DRB1. \textbf{(b)} For each cell, color indicates the fraction of HLA-DRB1 molecules assigned to an allele that are assigned to the 01:03 allele of subject 809653. Overall, 95.4\% of HLA-DRB1 molecules are assigned to an allele. Gray cells have no HLA-DRB1 molecules assigned to an allele. \textbf{(c)} For each cell, color indicates log$_2$(1 + normalized expression) of HLA-C. \textbf{(d)} For each cell, color indicates the fraction of HLA-C molecules assigned to an allele that are assigned to the 07:02 allele. \textbf{(e)} Cell types as inferred in~\parencite{Petti2019}.} 126 | \label{fig:amldrb1} 127 | \end{figure} 128 | 129 | \subsection*{Merkel cell carcinoma (MCC)} 130 | 131 | Genotypes for genes HLA-A, -B, and -C for the discovery and validation subjects in~\parencite{Paulson2018} were provided by the authors. Here, alleles not explicitly reported in their publication are given a placeholder name (e.g.~`A1'). Using scHLAcount with a custom reference for the diploid genotype of genes HLA-A, -B, and -C (and GRCh38 primary assembly alleles for the class II genes), we calculated allele-resolved molecule counts. Raw molecule counts were normalized as described above. 
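As an illustrative sketch of this normalization (the numbers are hypothetical and not taken from any dataset in this study, and we assume that ``median molecule count'' denotes the median of the per-cell total molecule counts), consider a cell with 4 raw HLA-A molecules and 2{,}000 molecules in total, in a dataset whose median is 5{,}000 molecules per cell. Its normalized HLA-A count would be
\[
5000 \times 4 / 2000 = 10 ,
\]
so the same raw count in a cell with twice as many total molecules (4{,}000) would normalize to half that value, $5000 \times 4 / 4000 = 5$.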
132 | 133 | Normalization, dimensionality reduction, and clustering were performed using Seurat v3.0.2~\parencite{Stuart2019} following Paulson et al.~\parencite{Paulson2018}. For the discovery subject, we used the filtered expression matrices for tumor and PBMC samples available at GEO accession GSE117988; for the validation subject, the matrix is available at GSE118056. 134 | 135 | \begin{table}[h] 136 | \begin{tabular}{|l|l|r|r|r|r|r|r|} 137 | \hline 138 | Subject & Assay type & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to\\ an allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to\\ an allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to\\ an allele\end{tabular}} \\ \hline 139 | & & \multicolumn{2}{c|}{HLA-A} & \multicolumn{2}{c|}{HLA-B} & \multicolumn{2}{c|}{HLA-C} \\ \hline 140 | \begin{tabular}[c]{@{}l@{}}Discovery\\ (Tumor)\end{tabular} & 3' GEX & 0.866 & 5.34 & 0.391 & 40.76 & 0.639 & 64.31 \\ \hline 141 | \begin{tabular}[c]{@{}l@{}}Discovery\\ (PBMC)\end{tabular} & 3' GEX & 0.855 & 6.42 & 0.449 & 45.98 & 0.767 & 67.94 \\ \hline 142 | \begin{tabular}[c]{@{}l@{}}Validation\\ (Tumor)\end{tabular} & 5' GEX & 0.878 & 81.17 & 0.896 & 91.69 & 0.745 & 80.68 \\ \hline 143 | \begin{tabular}[c]{@{}l@{}}Validation\\ (PBMC)\end{tabular} & 5' GEX & 1.050 & 87.71 & 1.073 & 94.41 & 1.033 & 89.65 \\ \hline 144 | \end{tabular} 145 | \caption{scHLAcount analysis of discovery patient tumor (2 time points) and PBMC (4 time points) and validation patient tumor and PBMC (1 time point each)~\parencite{Paulson2018}. 
Raw molecule counts for genes HLA-A, -B, and -C are compared to Cell Ranger counts normalized to 1.0. GEX = gene expression.} 146 | \label{tab:paulsonsummary} 147 | \end{table} 148 | 149 | \subsubsection*{Discovery subject} 150 | 151 | For this subject, the ``tumor dataset'' comprises cells taken from two time points in treatment; the ``PBMC dataset'' comprises cells taken from four time points in treatment. Unsupervised clustering of the tumor dataset resulted in 15 clusters. As described in Paulson et al.~\parencite{Paulson2018}, we identified 11 of these clusters comprising 7,131 cells as putative tumor cells using the tumor marker genes NCAM1, KRT20, CHGA, and ENO2 and the non-tumor marker genes CD3D, CD34, CD61, and Fibronectin. The remaining four clusters contained 300 putative normal cells. 152 | 153 | Due to the nature of 3' GEX data, nearly all reads are sequenced from the opposite end of the HLA-A transcript from the variable sites used to define HLA types (Figure~\ref{fig:coverage}). These variable sites are mostly located in exons 2 and 3, while the 3' ends of the transcripts are mostly homologous between the class I genes~\parencite{Boegel2018}. As a result of the coverage distribution of 3' GEX data, very few HLA-A molecules could be assigned to an allele. We observe far fewer molecules from scHLAcount compared to Cell Ranger, especially in gene HLA-B, because UMIs that only contain reads from the 3' end of the transcript are ambiguously aligned to all class I genes and such molecules are not counted by our algorithm. 154 | 155 | As previously reported, HLA-B expression is markedly lower in tumor cells than in non-tumor cells and PBMC (Table \ref{tab:paulsondisc}). Additionally, HLA-A and HLA-C expression appears to be reduced in tumor cells. 
156 | 157 | %TODO 158 | \begin{table}[h] 159 | \begin{tabular}{|l|r|r|r|r|r|r|} 160 | \hline 161 | \begin{tabular}[c]{@{}l@{}}Gene\\ Genotype\end{tabular} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}Tumor cells\\ (n=7,131)\end{tabular}} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}Non-tumor cells\\ (n=300)\end{tabular}} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}PBMC\\ (n=12,874)\end{tabular}} \\ \hline 162 | & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Average\\ normalized\\ expression\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}\% molecules\\ assigned to\\ allele 1\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Average \\ normalized \\ expression\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}\% molecules\\ assigned to\\ allele 1\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Average \\ normalized \\ expression\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}\% molecules\\ assigned to\\ allele 1\end{tabular}} \\ \hline 163 | \begin{tabular}[c]{@{}l@{}}HLA-A\\ A1/A2\end{tabular} & 0.724 & 24.98 & 3.392 & 43.78 & 1.958 & 40.83 \\ \hline 164 | \begin{tabular}[c]{@{}l@{}}HLA-B\\ 35:02/B2\end{tabular} & 0.115 & 76.11 & 3.156 & 61.70 & 1.713 & 63.97 \\ \hline 165 | \begin{tabular}[c]{@{}l@{}}HLA-C\\ C1/C2\end{tabular} & 0.209 & 49.54 & 3.802 & 59.58 & 1.918 & 59.17 \\ \hline 166 | \end{tabular} 167 | \caption{Average overall and allele-specific expression of HLA class I genes in the discovery subject of~\parencite{Paulson2018}.} 168 | \label{tab:paulsondisc} 169 | \end{table} 170 | 171 | 172 | \subsubsection*{Validation subject} 173 | 174 | For this subject, the ``tumor dataset'' and ``PBMC dataset'' comprise cells taken from a single time point after relapse. Unsupervised clustering of all cells together resulted in 18 clusters. 
As described in Paulson et al.~\parencite{Paulson2018}, we identified seven of these clusters comprising 4,682 cells as putative tumor cells using the tumor marker genes NCAM1, KRT20, Large T Antigen, and Small T Antigen. Only 17 of these cells originated from the PBMC dataset. The remaining 6,209 cells were designated putative normal cells and comprised 5,731 cells from the PBMC dataset and 478 cells from the tumor dataset, which~\cite{Paulson2018} identified as tumor-infiltrating leukocytes and tumor-associated macrophages (Figure \ref{fig:paulsonval}g). 175 | 176 | Compared to Cell Ranger molecule counts, we inferred more molecules for the PBMC dataset and fewer molecules for the tumor dataset. At least 80\% of scHLAcount molecules were assigned to an allele for class I genes (Table \ref{tab:paulsonsummary}). 177 | 178 | Dividing cells into tumor and normal as described above, we corroborate the observation from~\cite{Paulson2018} that HLA-A expression is greatly reduced in tumor cells compared to infiltrating immune cells (Figure \ref{fig:paulsonval}a). No marked allele-specific bias in expression is observed in cells in either category. Additionally, we observe decreased expression of HLA-B and HLA-C in tumor cells (Figure \ref{fig:paulsonval}b,c). While non-tumor cells display approximately balanced expression of the two alleles of these genes, tumor cells have only 13\% of allele-resolved HLA-B expression from allele 35:01 and 6\% of allele-resolved HLA-C expression from allele `C1' (Table \ref{tab:paulsonval}). 
179 | 180 | \begin{table}[h] 181 | \begin{tabular}{|l|r|r|r|r|} 182 | \hline 183 | \begin{tabular}[c]{@{}l@{}}Gene\\ Genotype\end{tabular} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}Tumor cells\\ (n=4862)\end{tabular}} & \multicolumn{2}{c|}{\begin{tabular}[c]{@{}c@{}}Non-tumor cells\\ (n=6209)\end{tabular}} \\ \hline 184 | & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Average\\ normalized\\ expression\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}\% molecules\\ assigned to\\ allele 1\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}Average \\ normalized \\ expression\end{tabular}} & \multicolumn{1}{c|}{\begin{tabular}[c]{@{}c@{}}\% molecules\\ assigned to\\ allele 1\end{tabular}} \\ \hline 185 | \begin{tabular}[c]{@{}l@{}}HLA-A\\ 02:01/A2\end{tabular} & 0.060 & 39.7 & 4.154 & 56.8 \\ \hline 186 | \begin{tabular}[c]{@{}l@{}}HLA-B\\ 35:01/B2\end{tabular} & 0.511 & 13.4 & 5.172 & 50.4 \\ \hline 187 | \begin{tabular}[c]{@{}l@{}}HLA-C\\ C1/C2\end{tabular} & 0.327 & 6.3 & 4.991 & 46.8 \\ \hline 188 | \end{tabular} 189 | \caption{Average overall and allele-specific expression of HLA class I genes in the validation subject of~\parencite{Paulson2018}.} 190 | \label{tab:paulsonval} 191 | \end{table} 192 | 193 | \begin{figure} 194 | %\includegraphics[width=\textwidth]{figures/figure3.pdf} 195 | %\caption{log$_2$(1 + normalized expression) of HLA-A (a) HLA-B (c) and HLA-C (e) and allele preference for HLA-A*02:01 (b) HLA-B*35:01 (d) and HLA-C1 (f) for the validation subject of \parencite{Paulson2018}. 
Values are plotted per cell; aggregate statistics shown in Table \ref{tab:paulsonval}.} 196 | %\figsupp[Inferred cell type.]{Cell type inferred for the validation patient in~\cite{Petti2019}}{\includegraphics[width=\textwidth]{figures/figureS1.pdf}}\label{figsupp:celltype} 197 | 198 | \includegraphics[width=\textwidth]{figures/figure3-newlayout.pdf} 199 | \caption{log$_2$(1 + normalized expression) of HLA-A (a), HLA-B (b), and HLA-C (c), and allele preference for HLA-A*02:01 (d), HLA-B*35:01 (e), and HLA-C1 (f) for the validation subject of \parencite{Paulson2018}. Values are plotted per cell; aggregate statistics are shown in Table \ref{tab:paulsonval}. (g) Cell types inferred using marker genes.} 200 | \label{fig:paulsonval} 201 | \end{figure} 202 | 203 | 204 | 205 | 206 | \section*{Discussion} 207 | 208 | Tumor evasion of immunotherapy is of growing concern, as novel and expensive treatment modalities find themselves stymied by this evolutionary response. scHLAcount provides a simple way to assign reads from scRNA-seq experiments to MHC alleles. This is a powerful tool for investigating allele-specific expression, loss of heterozygosity, and mutational or epigenetic suppression of HLA expression in tumor immune evasion. Additionally, using a personalized reference and counting with scHLAcount often recovers more molecules than using the standard reference and counting with Cell Ranger. This has the potential to improve gene expression based clustering in cells where MHC genes are a major component of the expression profile. 209 | 210 | scHLAcount could be extended to apply to any other locus where there is common structural variation present in the human population. The approach of using De Bruijn graphs to improve isoform and haplotype quantification has been considered before~\parencite{Patro2017,Bray2016}, but had not been applied to scRNA-seq data before this study. 
A recent pre-print~\parencite{Tian2019} genotyped individual cells for HLA class I using scRNA-seq data but did not address allele-specific expression on the molecule level. 211 | 212 | We have found that 5' GEX data is preferable to 3' GEX data for genotyping and assigning molecules to alleles, because the sequencing coverage is not as limited to one end of the transcript (Figure \ref{fig:coverage}). The three class I genes have considerable sequence homology except in exons 2 and 3, and virtually all of the coverage of 3' GEX data falls in the later exons. As a result, few UMIs have a read from the variable exons and can be assigned to a specific allele, and many UMIs have reads only in regions homologous among the three genes, so these molecules are not counted. 213 | 214 | 215 | \section*{Methods and Materials} 216 | 217 | The numbered steps in Figure~\ref{fig:workflow} correspond to the numbers in parentheses in this section. 218 | 219 | To make FASTA files of the coding and genomic sequences of the alleles present in the sample (1), users need to provide a list of genotypes (2) and download the IMGT/HLA allele sequence database (3). These genotypes can be assayed by specialized molecular tests, such as sequence-specific oligonucleotide probe PCR (PCR-SSOP), sequence-specific primed PCR (PCR-SSP), or Sanger sequence-based typing (SBT)~\parencite{Erlich2012}. Alternatively, algorithms for sequence-based typing from next-generation sequencing reads of the genome, exome, or transcriptome utilize comprehensive allele databases such as IMGT/HLA~\parencite{Robinson2015} to successfully infer genotypes (reviewed in~\cite{Bauer2018}). Following the pseudoalignment approach~\cite{Bray2016}, scHLAcount builds two colored De Bruijn graph indexes, one containing the CDS sequences and one containing genomic sequences, using a k-mer length of 24. 
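As a rough guide to what this choice of $k$ implies (a general property of k-mer decomposition, stated here as background rather than as a description of the scHLAcount implementation), a sequence of length $n \geq k$ contains
\[
n - k + 1
\]
overlapping k-mers. With $k = 24$, a hypothetical 100-base read would therefore contribute $100 - 24 + 1 = 77$ k-mers, and a sequence shorter than 24 bases would contribute none.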
220 | 221 | Reads aligned to the MHC region (GRCh38 coordinates \texttt{chr6:28510120-33480577}) (4) corresponding to valid cell barcodes (5) are extracted from the BAM file (6). Each read is first pseudo-aligned to the CDS graph, yielding the set of alleles that could have generated the read (also referred to as the equivalence class)~\cite{Bray2016}. If there is no alignment of at least 60 bases (2 mismatches are permitted outside the initial seed k-mer), the read is pseudo-aligned to the genomic sequence graph and retained if the alignment is at least 60 bases (7). In the datasets studied, fewer than 5\% of reads that failed to align to the CDS were successfully aligned to the genomic sequence. This genomic alignment step is intended to rescue reads that may be haplotype specific in 3' or 5' UTR regions. It also provides a mechanism to handle single-nucleus RNA-seq libraries. 222 | 223 | Reads sharing a cell barcode and unique molecular identifier (UMI) are assumed to originate from the same RNA molecule. At recommended sequencing depths with modest sequence saturation, there are typically 1--3 reads per UMI. Individual reads may have different equivalence classes according to their pseudoalignment. We ignore reads whose equivalence class contains more than one gene. If more than half of the reads from a molecule are assigned to a particular gene, that molecule is counted toward one of that gene's alleles (e.g.~HLA-A 02:01), based on the constituent reads' equivalence classes. If the allele is ambiguous, the molecule is counted toward the gene (e.g.~HLA-A) instead. The output is a sparse molecule count matrix (8) where each column corresponds to a barcode in the provided cell barcode list, and each row corresponds to an allele. 224 | 225 | \section*{Acknowledgements} 226 | 227 | We thank Kelly Paulson and Allegra Petti for providing HLA genotypes of the subjects in the MCC and AML studies respectively. 
228 | 229 | \section*{Author Contributions} 230 | % https://www.cell.com/pb/assets/raw/shared/guidelines/CRediT-taxonomy.pdf 231 | Conceptualization, A.M.B. and I.T.F.; 232 | Methodology, C.A.D.; 233 | Software, C.A.D. and I.T.F.; 234 | Investigation, C.A.D.; 235 | Data Curation, C.A.D. and I.T.F.; 236 | Writing - Original Draft, C.A.D. and I.T.F.; 237 | Writing - Review \& Editing, C.A.D., M.J.T.S., P.J.M., A.M.B. and I.T.F.; 238 | Visualization, C.A.D.; 239 | Supervision, A.M.B. and P.J.M. 240 | 241 | \section*{Competing Interests} 242 | 243 | M.J.T.S., P.J.M., and A.M.B. are employees of 10x Genomics. P.J.M. and A.M.B. are shareholders of 10x Genomics. I.T.F. is also a shareholder of 10x Genomics and at the time of this writing was employed at 10x Genomics. M.J.T.S. is an option holder of 10x Genomics. C.A.D. was an intern at 10x Genomics. C.A.D., P.J.M., A.M.B. and I.T.F. have filed a provisional patent for ideas presented in this work on behalf of 10x Genomics. 244 | 245 | \newpage 246 | 247 | \printbibliography 248 | 249 | \newpage 250 | \FloatBarrier 251 | 252 | \section*{Supplementary Material} 253 | \beginsupplement 254 | 255 | \paragraph*{GRCh38 primary assembly alleles.} 256 | 257 | Genotypes present in the GRCh38 primary assembly were inferred using Kourami v0.9.6~\parencite{kourami}. Two million 200-bp error-free reads were simulated from GRCh38 \texttt{chr6:28510120-33480577}, which is approximately 80-fold coverage of the region. Reads were aligned to the Kourami reference panel and genotypes were inferred; all listed genotypes had 100\% sequence identity with respect to the corresponding database sequence. 
258 | 259 | \noindent A*03:01:01G \\ 260 | B*07:02:01G \\ 261 | C*07:02:01G \\ 262 | DQA1*01:02:01G \\ 263 | DQB1*06:02:01G \\ 264 | DRB1*15:01:01G \\ 265 | DPA1*01:03:01G \\ 266 | DPB1*04:01:01G \\ 267 | 268 | \paragraph*{Computational performance.} 269 | 270 | On the scRNA-seq dataset from donor 4 from \cite{vdjappnote}, scHLAcount analyzed 58 million reads aligned to the MHC region in 83 minutes (55 minutes spent genotyping; 28 minutes spent counting). Maximum memory usage was 1.5 GB. 271 | 272 | \begin{table}[ht] 273 | \centering 274 | \begin{tabular}{|l|l|l|l|l|l|} 275 | \hline 276 | Subject & HLA-A & HLA-B & HLA-C & HLA-DQB1 & HLA-DRB1 \\ \hline 277 | 508084 & 68:01/01:01 & \textbf{07:02}/27:05 & \textbf{07:02}/07:04 & 05:01/\textbf{06:02} & 01:01/\textbf{15:01}\\ \hline 278 | 548327 & 68:01/02:06 & 51:01/44:05 & 02:02 & 02:02/02:01 & 07:01/03:01 \\ \hline 279 | 721214 & \textbf{03:01}/01:01 & 18:01/14:01 & 08:02/07:40 & 03:02/02:02 & 07:01/04:03 \\ \hline 280 | 782328 & 32:01 & 37:01/15:01 & 06:02/03:04 & 05:01/03:02 & 04:01/10:01 \\ \hline 281 | 809653 & 68:02/31:01 & 27:05/14:02 & 08:02/\textbf{07:02} & 03:01 & 11:01/01:03 \\ \hline 282 | \end{tabular} 283 | \caption{Genotypes for subjects from~\cite{Petti2019}, provided by the authors in personal communication with permission to include here. 
Genotypes shared with the GRCh38 primary assembly are in \textbf{bold text}.} 284 | \label{tab:amlgt} 285 | \end{table} 286 | 287 | \begin{table} 288 | \begin{tabular}{|l|r|r|r|r|r|r|r|r|} 289 | \hline 290 | Subject & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to\\ an allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to\\ an allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to\\ an allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to \\ an allele\end{tabular}} \\ \hline 291 | & \multicolumn{2}{c|}{HLA-A} & \multicolumn{2}{c|}{HLA-B} & \multicolumn{2}{c|}{HLA-C} & \multicolumn{2}{c|}{HLA-DQB1} \\ \hline 292 | 508084 & 1.039 & 95.13 & 1.066 & 87.22 & 0.885 & 60.77 & 1.028 & 95.89 \\ \hline 293 | 548327 & 1.165 & 86.26 & 1.061 & 93.09 & 1.032 & n/a & 2.721 & 2.27 \\ \hline 294 | 721214 & 1.180 & 69.44 & 1.137 & 90.09 & 0.908 & 93.63 & 3.319 & 98.95 \\ \hline 295 | 782328 & 1.154 & n/a & 0.880 & 63.95 & 0.957 & 89.77 & 1.010 & 99.15 \\ \hline 296 | 809653 & 1.083 & 87.21 & 1.154 & 96.53 & 0.911 & 91.74 & 1.070 & n/a \\ \hline 297 | \end{tabular} 298 | \begin{tabular}{|l|r|r|r|r|r|} 299 | \hline 300 | Subject & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}Custom\\ diploid\\ reference\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}\% molecules\\ assigned to \\ an allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}GRCh38\\ allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}GRCh38\\ 
allele\end{tabular}} & \multicolumn{1}{l|}{\begin{tabular}[c]{@{}l@{}}GRCh38\\ allele\end{tabular}} \\ \hline 301 | & \multicolumn{2}{c|}{HLA-DRB1} & \multicolumn{1}{c|}{HLA-DPA1} & \multicolumn{1}{c|}{HLA-DPB1} & \multicolumn{1}{c|}{HLA-DQA1} \\ \hline 302 | 508084 & 1.641 & 74.60 & 1.135 & 1.024 & 1.086 \\ \hline 303 | 548327 & 1.920 & 89.52 & 1.180 & 1.172 & 2.087 \\ \hline 304 | 721214 & 1.745 & 89.05 & 1.217 & 1.050 & 2.058 \\ \hline 305 | 782328 & 1.125 & 92.12 & 1.276 & 1.078 & 1.274 \\ \hline 306 | 809653 & 1.066 & 95.43 & 1.136 & 1.050 & 1.455 \\ \hline 307 | \end{tabular} 308 | \caption{Using the custom diploid reference or GRCh38 allele as denoted, raw molecule count for each gene is compared to Cell Ranger counts normalized to 1.0. Subject 548327 is homozygous for HLA-C, Subject 782328 is homozygous for HLA-A, and Subject 809653 is homozygous for HLA-DQB1.} 309 | \label{tab:aml} 310 | \end{table} 311 | 312 | \begin{figure} 313 | \includegraphics[width=0.8\textwidth]{figures/figureS2.pdf} 314 | \caption{Read coverage of HLA Class I genes for 3' GEX and 5' GEX. Minimum and maximum coverage for each assay in the region shown is normalized to 0 and 1 respectively. The 3' dataset is merged from SRR7722937-SRR7722942 and the 5' dataset is SRR7692286~\parencite{Paulson2018}. 
GEX = gene expression} 315 | \label{fig:coverage} 316 | \end{figure} 317 | 318 | \end{document} 319 | -------------------------------------------------------------------------------- /paper/sample_gt.txt: -------------------------------------------------------------------------------- 1 | A*03:01:01:01 2 | B*07:02:01:01 3 | C*07:02:01:01 4 | DQA1*01:02:01:01 5 | DQB1*06:02:01:01 6 | DRB1*15:01:01:01 7 | DPA1*01:03:01:01 8 | DPB1*04:01:01:01 9 | 10 | -------------------------------------------------------------------------------- /paper/table1_AML_DRB1.tsv: -------------------------------------------------------------------------------- 1 | n_cells gene_norm_count a1_norm_count a2_norm_count avg_count a1_ratio min_ratio 2 | ERY 3728 871.414062222469 58.2423876329442 340.773432436192 0.233748407248516 0.145965108909348 0.145965108909348 3 | T-CELL 10942 8498.53113883834 897.071628577537 3404.33690738686 0.776689009215714 0.208552993996514 0.208552993996514 4 | PRE-B-CELL 336 444.282315820471 77.9479596268836 173.493333881458 1.32226879708473 0.310004607991319 0.310004607991319 5 | ERY (CD34+) 432 938.319487784538 48.1775957572324 479.5524990456 2.1720358513531 0.0912921135855105 0.0912921135855105 6 | HSC 2261 12546.1075707841 1277.82248810992 5872.52357674129 5.54891975709161 0.178707782325569 0.178707782325569 7 | B-CELL 868 15588.1627639117 3701.70390177779 5629.35242259153 17.9587128616494 0.396707915277533 0.396707915277533 8 | DEND (M) 620 12660.165954853 2209.59445853481 5609.59533232102 20.4196225078273 0.282586114116174 0.282586114116174 9 | MONO 1366 9813.06340219331 233.897362495492 5168.35692284053 7.18379458432893 0.0432962519240138 0.0432962519240138 10 | CMP 49 52.1334325345648 0.571366968520914 21.0433577930818 1.06394760274622 0.0264341542546919 0.0264341542546919 11 | EOS 1 0 0 0 0 NA NA 12 | NK 24 4.95140815047913 0 2.05120545354081 0.20630867293663 0 0 13 | MEP 560 1934.98885904625 111.85212984258 966.275752261376 3.45533724829688 0.10374662570112 
0.10374662570112 14 | GRAN 8 49.3066872774216 12.4984570592598 15.0992529386868 6.1633359096777 0.452880223039874 0.452880223039874 15 | BASO 1 3.65349264705882 0 0 3.65349264705882 NA NA 16 | NKT 4 36.4425327230037 1.07665222101842 13.7106723332991 9.11063318075092 0.0728091289985288 0.0728091289985288 17 | DEND (L) 3 46.9110682362611 0 24.0340405727527 15.6370227454204 0 0 18 | MEGA 1 0 0 0 0 NA NA 19 | -------------------------------------------------------------------------------- /paper/table2_AML_C.tsv: -------------------------------------------------------------------------------- 1 | n_cells gene_norm_count a1_norm_count a2_norm_count avg_count a1_ratio min_ratio 2 | ERY 3728 1437.59598172291 530.713429517305 763.697922596658 0.38562123973254 0.41000368905184 0.41000368905184 3 | T-CELL 10942 97670.8970291829 33477.9031660512 56983.335909831 8.9262380761454 0.370080086322593 0.370080086322593 4 | PRE-B-CELL 336 1097.36856249978 352.440558542802 687.990067759211 3.26597786458268 0.338744890464707 0.338744890464707 5 | ERY (CD34+) 432 1049.28201996472 383.57978596218 553.744389029159 2.42889356473316 0.409228521141817 0.409228521141817 6 | HSC 2261 7418.54829460474 2602.73727007066 4102.14566944203 3.28109168270886 0.388185340974771 0.388185340974771 7 | B-CELL 868 4499.57904212626 1132.22990460303 3100.74684468954 5.1838468227261 0.267478413339325 0.267478413339325 8 | DEND (M) 620 2411.94950335687 779.692941760964 1440.73128418747 3.89024113444657 0.35114593537995 0.35114593537995 9 | MONO 1366 7939.906983054 2436.04916267755 5013.15173529128 5.81252341365593 0.327021541779304 0.327021541779304 10 | CMP 49 151.022611372485 59.8780445847758 84.1226386211071 3.08209410964254 0.415817781219593 0.415817781219593 11 | EOS 1 5.70710696338837 2.85355348169419 2.85355348169419 5.70710696338837 0.5 0.5 12 | NK 24 399.907820575923 153.429815451202 217.46590053974 16.6628258573301 0.413673733171265 0.413673733171265 13 | MEP 560 1443.75424413973 527.208349508618 
781.183180132244 2.57813257882094 0.402943872353966 0.402943872353966 14 | GRAN 8 47.9159990057599 22.1090376653902 23.2135564479448 5.98949987571998 0.487814920966432 0.487814920966432 15 | BASO 1 0 0 0 0 NA NA 16 | NKT 4 34.6193645635334 13.4475182741851 15.798163197327 8.65484114088336 0.45981210208024 0.45981210208024 17 | DEND (L) 3 12.8480134768442 3.88789918070928 7.67763351175513 4.28267115894807 0.336162568909814 0.336162568909814 18 | MEGA 1 11.4123233215548 0.877871024734982 10.5344522968198 11.4123233215548 0.0769230769230769 0.0769230769230769 19 | -------------------------------------------------------------------------------- /paper/table4_paulson_discovery.R: -------------------------------------------------------------------------------- 1 | 2 | # See figure3.R for example processing workflow 3 | 4 | # PBMC (Normal cells) 5 | load("/mnt/park1/compbio/HLA/paulson_robj/paulson_disc_pbmc.Rdata") 6 | 7 | # Average normalized expression 8 | mean(na.omit(PBMC$geneAsumnolog)) 9 | mean(na.omit(PBMC$geneBsumnolog)) 10 | mean(na.omit(PBMC$geneCsumnolog)) 11 | 12 | # Molecules assigned to allele 1 13 | sum(na.omit(PBMC$geneAallele1))/(sum(na.omit(PBMC$geneAallele1)) + sum(na.omit(PBMC$geneAallele2))) 14 | sum(na.omit(PBMC$geneBallele1))/(sum(na.omit(PBMC$geneBallele1)) + sum(na.omit(PBMC$geneBallele2))) 15 | sum(na.omit(PBMC$geneCallele1))/(sum(na.omit(PBMC$geneCallele1)) + sum(na.omit(PBMC$geneCallele2))) 16 | 17 | 18 | # Tumor (mixture of tumor and normal cells) 19 | load("/mnt/park1/compbio/HLA/paulson_robj/paulson_disc_tumor.Rdata") 20 | 21 | # Tumor cells 22 | 23 | # Average normalized expression 24 | mean(na.omit(tumor$geneAsumnolog[tumor$cellTypes1 == "Tumor"])) 25 | mean(na.omit(tumor$geneBsumnolog[tumor$cellTypes1 == "Tumor"])) 26 | mean(na.omit(tumor$geneCsumnolog[tumor$cellTypes1 == "Tumor"])) 27 | 28 | # Molecules assigned to allele 1 29 | sum(na.omit(tumor$geneAallele1[tumor$cellTypes1 == 
"Tumor"]))/(sum(na.omit(tumor$geneAallele1[tumor$cellTypes1 == "Tumor"])) + sum(na.omit(tumor$geneAallele2[tumor$cellTypes1 == "Tumor"]))) 30 | sum(na.omit(tumor$geneBallele1[tumor$cellTypes1 == "Tumor"]))/(sum(na.omit(tumor$geneBallele1[tumor$cellTypes1 == "Tumor"])) + sum(na.omit(tumor$geneBallele2[tumor$cellTypes1 == "Tumor"]))) 31 | sum(na.omit(tumor$geneCallele1[tumor$cellTypes1 == "Tumor"]))/(sum(na.omit(tumor$geneCallele1[tumor$cellTypes1 == "Tumor"])) + sum(na.omit(tumor$geneCallele2[tumor$cellTypes1 == "Tumor"]))) 32 | 33 | # Normal cells 34 | 35 | # Average normalized expression 36 | mean(na.omit(tumor$geneAsumnolog[tumor$cellTypes1 != "Tumor"])) 37 | mean(na.omit(tumor$geneBsumnolog[tumor$cellTypes1 != "Tumor"])) 38 | mean(na.omit(tumor$geneCsumnolog[tumor$cellTypes1 != "Tumor"])) 39 | 40 | # Molecules assigned to allele 1 41 | sum(na.omit(tumor$geneAallele1[tumor$cellTypes1 != "Tumor"]))/(sum(na.omit(tumor$geneAallele1[tumor$cellTypes1 != "Tumor"])) + sum(na.omit(tumor$geneAallele2[tumor$cellTypes1 != "Tumor"]))) 42 | sum(na.omit(tumor$geneBallele1[tumor$cellTypes1 != "Tumor"]))/(sum(na.omit(tumor$geneBallele1[tumor$cellTypes1 != "Tumor"])) + sum(na.omit(tumor$geneBallele2[tumor$cellTypes1 != "Tumor"]))) 43 | sum(na.omit(tumor$geneCallele1[tumor$cellTypes1 != "Tumor"]))/(sum(na.omit(tumor$geneCallele1[tumor$cellTypes1 != "Tumor"])) + sum(na.omit(tumor$geneCallele2[tumor$cellTypes1 != "Tumor"]))) 44 | -------------------------------------------------------------------------------- /paper/workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/paper/workflow.png -------------------------------------------------------------------------------- /prepare_reference.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | HLANUC="hla_nuc.fasta" 4 | HLAGEN="hla_gen.fasta" 5 | 
6 | GTFILE=$1 7 | 8 | while read -r line; do grep -F -m 1 $line $HLANUC >> tmpallele.txt; done < $GTFILE 9 | 10 | samtools faidx $HLANUC $(cut -f1 -d' ' tmpallele.txt | tr '>' ' ' | tr '\n' ' ') > cds.fasta 11 | samtools faidx $HLAGEN $(cut -f1 -d' ' tmpallele.txt | tr '>' ' ' | tr '\n' ' ') > gen.fasta 12 | 13 | while read -r line; do IFS=' ' read -r f1 f2 <<<"$line"; sed -i "" "s/$f1/$f1 $f2/g" cds.fasta; done < tmpallele.txt 14 | while read -r line; do IFS=' ' read -r f1 f2 f3 <<<"$line"; sed -i "" "s/$f1/$f1 $f2/g" gen.fasta; done < tmpallele.txt 15 | 16 | rm tmpallele.txt 17 | 18 | -------------------------------------------------------------------------------- /src/config.rs: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2019 10x Genomics, Inc. All rights reserved. 2 | 3 | /* Constants */ 4 | pub const MIN_SCORE_CALL: usize = 40; 5 | pub const MIN_SCORE_COUNT_PSEUDO: usize = 60; 6 | pub const MIN_SCORE_COUNT_ALIGNMENT: usize = 60; 7 | pub const GENE_CONSENSUS_THRESHOLD: f64 = 0.5; 8 | pub const ALLELE_CONSENSUS_THRESHOLD: f64 = 0.1; 9 | 10 | pub const EM_ITERS: usize = 2000; 11 | pub const EM_REL_TH: f64 = 5e-4; 12 | pub const EM_ABS_TH: f64 = 5e-3; 13 | pub const EM_CARE_TH: f64 = 1e-5; 14 | pub const MIN_READS_CALL: usize = 100; 15 | pub const HOMOZYGOUS_TH: f64 = 0.15; 16 | 17 | pub const PROC_BC_SEQ_TAG: &[u8] = b"CB"; 18 | pub const PROC_UMI_SEQ_TAG: &[u8] = b"UB"; 19 | 20 | pub const PAIRS_TO_OUTPUT: usize = 5; 21 | pub const WEIGHTS_TO_OUTPUT: usize = 10; 22 | 23 | /* Types */ 24 | use smallvec::SmallVec; 25 | pub type Barcode = SmallVec<[u8; 24]>; 26 | pub type Umi = SmallVec<[u8; 16]>; 27 | pub type EqClass = SmallVec<[u32; 4]>; 28 | -------------------------------------------------------------------------------- /src/em.rs: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2019 10x Genomics, Inc. All rights reserved. 
2 | 3 | use debruijn_mapping::{config, pseudoaligner, utils}; 4 | use failure::Error; 5 | use itertools::Itertools; 6 | 7 | use std::cmp::max; 8 | use std::collections::{HashMap, HashSet}; 9 | use std::fs::File; 10 | use std::io::{BufWriter, Write}; 11 | use std::path::PathBuf; 12 | 13 | use crate::config::{ 14 | EqClass, EM_ABS_TH, EM_CARE_TH, EM_ITERS, EM_REL_TH, HOMOZYGOUS_TH, MIN_READS_CALL, 15 | PAIRS_TO_OUTPUT, WEIGHTS_TO_OUTPUT, 16 | }; 17 | use crate::mapping::EqClassDb; 18 | 19 | #[derive(Default)] 20 | pub struct EqClassCounts { 21 | pub nitems: usize, 22 | pub counts_reads: HashMap<EqClass, u32>, 23 | pub counts_umi: HashMap<EqClass, u32>, 24 | pub nreads: u32, 25 | } 26 | 27 | impl EqClassCounts { 28 | /* pub fn new() -> EqClassCounts { 29 | EqClassCounts { 30 | nitems: 0, 31 | counts_reads: HashMap::new(), 32 | counts_umi: HashMap::new(), 33 | nreads: 0, 34 | } 35 | }*/ 36 | pub fn pair_reads_explained(&self, a: u32, b: u32) -> u32 { 37 | let mut exp = 0; 38 | for (cls, count) in self.counts_reads.iter() { 39 | if cls.contains(&a) || cls.contains(&b) { 40 | exp += *count; 41 | } 42 | } 43 | exp 44 | } 45 | } 46 | 47 | #[allow(clippy::needless_range_loop)] 48 | impl EmProblem for EqClassCounts { 49 | fn init(&self) -> Vec<f64> { 50 | vec![1.0 / (self.nitems as f64); self.nitems] 51 | } 52 | 53 | fn reg(&self, theta: &mut [f64]) { 54 | let mut norm = 0.0; 55 | 56 | for i in 0..theta.len() { 57 | // Clamp weight 58 | let mut v = theta[i]; 59 | if v > 1.0 { 60 | v = 1.0 61 | }; 62 | if v < 0.0 { 63 | v = 1e-15 64 | }; 65 | theta[i] = v; 66 | norm += v; 67 | } 68 | 69 | let inv_norm = 1.0 / norm; 70 | for i in 0..theta.len() { 71 | theta[i] *= inv_norm; 72 | } 73 | } 74 | 75 | fn f(&self, theta1: &[f64], theta2: &mut [f64]) { 76 | let nitems = self.nitems; 77 | 78 | let mut total_counts = 0.0; 79 | 80 | for i in 0..theta2.len() { 81 | theta2[i] = 0.0; 82 | } 83 | 84 | for (class, count) in &self.counts_umi { 85 | let mut norm = 0.0; 86 | for tx in class { 87 | norm += theta1[*tx as usize];
88 | } 89 | 90 | for tx in class { 91 | let tx_count = theta1[*tx as usize] / norm * f64::from(*count); 92 | theta2[*tx as usize] += tx_count; 93 | total_counts += tx_count; 94 | } 95 | } 96 | 97 | let mut max_abs_diff = 0.0; 98 | let mut max_rel_diff = 0.0; 99 | 100 | for i in 0..nitems { 101 | let old_weights = theta1[i]; 102 | let new_weights = theta2[i] / total_counts; 103 | 104 | let abs_diff = (old_weights - new_weights).abs(); 105 | let rel_diff = abs_diff / old_weights; 106 | 107 | if abs_diff > max_abs_diff { 108 | max_abs_diff = abs_diff; 109 | } 110 | 111 | if new_weights > 1e-2 && rel_diff > max_rel_diff { 112 | max_rel_diff = rel_diff 113 | } 114 | 115 | theta2[i] = new_weights; 116 | } 117 | } 118 | 119 | fn likelihood(&self, theta: &[f64]) -> f64 { 120 | let mut ll = 0.0; 121 | 122 | for (class, count) in &self.counts_umi { 123 | // Exclude empty equivalence classes 124 | if class.is_empty() { 125 | continue; 126 | } 127 | 128 | let mut theta_sum = 0.0; 129 | for tx in class { 130 | theta_sum += theta[*tx as usize]; 131 | } 132 | 133 | ll += f64::from(*count) * theta_sum.ln(); 134 | } 135 | 136 | ll 137 | } 138 | } 139 | 140 | #[allow(dead_code)] 141 | pub fn em(eq_classes: &EqClassCounts) -> Vec<f64> { 142 | let nitems = eq_classes.nitems; 143 | 144 | // initialize weights 145 | let mut weights = vec![1.0 / (nitems as f64); nitems]; 146 | 147 | let mut iters = 0; 148 | 149 | // Abundance required to 'care' about a relative change 150 | //let rel_care_thresh = 1e-3 / (nitems as f64); 151 | 152 | loop { 153 | let mut pseudocounts = vec![0.0; nitems]; 154 | let mut total_counts = 0.0; 155 | 156 | for (class, count) in &eq_classes.counts_umi { 157 | let mut norm = 0.0; 158 | for tx in class { 159 | norm += weights[*tx as usize]; 160 | } 161 | 162 | for tx in class { 163 | let tx_count = weights[*tx as usize] / norm * f64::from(*count); 164 | pseudocounts[*tx as usize] += tx_count; 165 | total_counts += tx_count; 166 | } 167 | } 168 | 169 | let mut
max_abs_diff = 0.0; 170 | let mut max_rel_diff = 0.0; 171 | let mut simpsons = 0.0; 172 | 173 | for i in 0..nitems { 174 | let old_weights = weights[i]; 175 | let new_weights = pseudocounts[i] / total_counts; 176 | 177 | let abs_diff = (old_weights - new_weights).abs(); 178 | let rel_diff = abs_diff / old_weights; 179 | 180 | if abs_diff > max_abs_diff { 181 | max_abs_diff = abs_diff; 182 | } 183 | 184 | if new_weights > 1e-2 && rel_diff > max_rel_diff { 185 | max_rel_diff = rel_diff 186 | } 187 | 188 | weights[i] = new_weights; 189 | simpsons += new_weights * new_weights; 190 | } 191 | 192 | let ll = eq_classes.likelihood(&weights); 193 | debug!( 194 | "iter: {}, ll: {}, div: {}, rel_diff: {}, abs_diff: {}", 195 | iters, 196 | ll, 197 | 1.0 / simpsons, 198 | max_rel_diff, 199 | max_abs_diff 200 | ); 201 | iters += 1; 202 | if (max_abs_diff < 0.00005 && max_rel_diff < 0.0001) || iters > 5000 { 203 | break; 204 | } 205 | } 206 | 207 | weights 208 | } 209 | 210 | /// Encapsulate an EM optimization problem so that it can run through an accelerated EM loop (SquareM). 211 | pub trait EmProblem { 212 | // Make an initial estimate of the parameter vector. May be a naive estimate. 213 | fn init(&self) -> Vec<f64>; 214 | 215 | // Regularize a parameter vector -- fix up any inconsistencies in the parameters. 216 | // E.g. legalize values or enforce global constraints. 217 | fn reg(&self, theta: &mut [f64]); 218 | 219 | // Update the parameters -- one EM step 220 | fn f(&self, theta1: &[f64], theta_2: &mut [f64]); 221 | 222 | // Compute the log-likelihood of a parameter set 223 | fn likelihood(&self, theta: &[f64]) -> f64; 224 | } 225 | 226 | /// SquareM EM acceleration method. 227 | /// As described in: 228 | /// Varadhan, Ravi, and Christophe Roland. 229 | /// "Simple and globally convergent methods for accelerating the convergence of any EM algorithm." 230 | /// Scandinavian Journal of Statistics 35.2 (2008): 335-353.
231 | /// Takes an implementation of `EmProblem` and applies the accelerated EM algorithm. 232 | pub fn squarem<T: EmProblem>(p: &T) -> Vec<f64> { 233 | // Array for holding theta 234 | let mut theta0 = p.init(); 235 | let mut theta1 = theta0.clone(); 236 | let mut theta2 = theta0.clone(); 237 | let mut theta_sq = theta0.clone(); 238 | 239 | let mut r = theta0.clone(); 240 | let mut v = theta0.clone(); 241 | 242 | let n = theta0.len(); 243 | let mut iters = 0; 244 | 245 | let rel_care_thresh = Some(EM_CARE_TH); 246 | 247 | loop { 248 | // Get theta1 249 | p.f(&theta0, &mut theta1); 250 | p.f(&theta1, &mut theta2); 251 | 252 | let mut rsq: f64 = 0.0; 253 | let mut vsq: f64 = 0.0; 254 | 255 | for i in 0..n { 256 | r[i] = theta1[i] - theta0[i]; 257 | rsq += r[i].powi(2); 258 | 259 | v[i] = theta2[i] - theta1[i] - r[i]; 260 | vsq += v[i].powi(2); 261 | } 262 | 263 | let mut alpha = -rsq.sqrt() / vsq.sqrt(); 264 | let mut alpha_tries = 1; 265 | let mut lsq: f64; 266 | let mut l2: f64; 267 | 268 | loop { 269 | let alpha_sq = alpha.powi(2); 270 | 271 | for i in 0..n { 272 | theta_sq[i] = theta0[i] - 2.0 * alpha * r[i] + alpha_sq * v[i] 273 | } 274 | 275 | p.reg(&mut theta_sq); 276 | 277 | lsq = p.likelihood(&theta_sq); 278 | l2 = p.likelihood(&theta2); 279 | 280 | if lsq > l2 || alpha_tries > 5 { 281 | break; 282 | } else { 283 | alpha = (alpha + -1.0) / 2.0; 284 | } 285 | 286 | alpha_tries += 1; 287 | } 288 | 289 | let (max_rel_diff, max_abs_diff) = if lsq > l2 { 290 | let diff = diffs(&theta0, &theta_sq, rel_care_thresh); 291 | std::mem::swap(&mut theta0, &mut theta_sq); 292 | diff 293 | } else { 294 | let diff = diffs(&theta0, &theta2, rel_care_thresh); 295 | std::mem::swap(&mut theta0, &mut theta2); 296 | diff 297 | }; 298 | 299 | debug!( 300 | "iter: {}, ll2: {}, llsq: {}, alpha_tries: {}, rel_diff: {}, abs_diff: {}", 301 | iters, l2, lsq, alpha_tries, max_rel_diff, max_abs_diff 302 | ); 303 | iters += 1; 304 | 305 | if (max_abs_diff < EM_ABS_TH && max_rel_diff < EM_REL_TH) || iters >
EM_ITERS { 306 | break; 307 | } 308 | } 309 | 310 | theta0 311 | } 312 | 313 | /// Compute the change in the parameter vectors, returning the largest relative and absolute change, respectively. 314 | /// Only parameters with a value greater than rel_thresh (if set) are counted in the relative change check. 315 | fn diffs(t1: &[f64], t2: &[f64], rel_thresh: Option<f64>) -> (f64, f64) { 316 | let mut max_abs_diff = 0.0; 317 | let mut max_rel_diff = 0.0; 318 | 319 | for i in 0..t1.len() { 320 | let old_weights = t1[i]; 321 | let new_weights = t2[i]; 322 | 323 | let abs_diff = (old_weights - new_weights).abs(); 324 | let rel_diff = abs_diff / old_weights; 325 | 326 | if abs_diff > max_abs_diff { 327 | max_abs_diff = abs_diff; 328 | } 329 | 330 | if rel_thresh.map_or(true, |thresh| new_weights > thresh) && rel_diff > max_rel_diff { 331 | max_rel_diff = rel_diff 332 | } 333 | } 334 | 335 | (max_rel_diff, max_abs_diff) 336 | } 337 | 338 | #[allow(clippy::cognitive_complexity)] 339 | pub fn em_wrapper( 340 | hla_index: PathBuf, 341 | hla_counts: PathBuf, 342 | cds_db: PathBuf, 343 | gen_db: PathBuf, 344 | outdir: &str, 345 | ) -> Result<(PathBuf, PathBuf), Error> { 346 | let index: pseudoaligner::Pseudoaligner<config::KmerType> = utils::read_obj(&hla_index)?; 347 | let mut eq_counts: EqClassDb = utils::read_obj(&hla_counts)?; 348 | let mut alleles_called: HashSet<String> = HashSet::new(); 349 | 350 | let weights_name: PathBuf = [outdir, "weights.tsv"].iter().collect(); 351 | let mut weights_file = BufWriter::new(File::create(weights_name)?); 352 | writeln!(weights_file, "allele_name\tem_weight\treads_explained")?; 353 | 354 | let pairs_name: PathBuf = [outdir, "pairs.tsv"].iter().collect(); 355 | let mut pairs_file = BufWriter::new(File::create(pairs_name)?); 356 | writeln!(pairs_file, "allele_1\tallele_2\treads_explained")?; 357 | 358 | let (eq_class_counts, reads_explained) = eq_counts.eq_class_counts(index.tx_names.len()); 359 | let weights = squarem(&eq_class_counts); 360 | // (index, EM_weight,
allele_name, num_reads_explained) 361 | let weight_names: Vec<(usize, f64, &String, usize)> = weights 362 | .into_iter() 363 | .enumerate() 364 | .map(|(i, w)| (i, w, &index.tx_names[i], reads_explained[i])) 365 | .collect(); 366 | let mut weight_names: Vec<(u32, f64, &String, &str, usize)> = weight_names 367 | .into_iter() 368 | .map(|(i, w, n, e)| (i as u32, w, n, n.split('*').next().unwrap(), e)) 369 | .collect(); 370 | weight_names.sort_by(|(_, wa, _, _, _), (_, wb, _, _, _)| (-wa).partial_cmp(&-wb).unwrap()); //sort by descending weight 371 | weight_names.sort_by_key(|x| x.3); //sort by gene name, stable wrt weight 372 | 373 | for (gene, weights) in &weight_names.iter().group_by(|x| (x.3)) { 374 | info!("Evaluating weights for gene {}", gene); 375 | let mut weights_written = 0; 376 | let weights: Vec<_> = weights.collect(); 377 | for (_, w, n, _, exp) in &weights { 378 | if *exp > MIN_READS_CALL { 379 | writeln!( 380 | weights_file, 381 | "{}\t{}\t{}", 382 | n, 383 | w, 384 | f64::from(*exp as u32) / f64::from(eq_class_counts.nreads) 385 | )?; 386 | weights_written += 1; 387 | if weights_written == WEIGHTS_TO_OUTPUT { 388 | break; 389 | } 390 | } else { 391 | break; 392 | } 393 | } 394 | if weights_written == 0 { 395 | continue; 396 | } 397 | // (allele1_name, allele2_name, frac_reads_explained_jointly, max_frac_reads_explained_single) 398 | let mut pairs: Vec<(&String, &String, f64, f64)> = Vec::new(); 399 | 'outer: for i in 0..weights_written - 1 { 400 | for j in i + 1..weights_written { 401 | let exp = eq_class_counts.pair_reads_explained(weights[i].0, weights[j].0); 402 | if exp > MIN_READS_CALL as u32 { 403 | pairs.push(( 404 | weights[i].2, 405 | weights[j].2, 406 | f64::from(exp) / f64::from(eq_class_counts.nreads), 407 | f64::from(max(weights[i].4, weights[j].4) as u32) 408 | / f64::from(eq_class_counts.nreads), 409 | )); 410 | } else { 411 | break 'outer; 412 | } 413 | } 414 | } 415 | if !pairs.is_empty() { 416 | pairs.sort_by(|(_, _, wa, _), (_, _, 
wb, _)| (-wa).partial_cmp(&-wb).unwrap()); 417 | for p in pairs 418 | .iter() 419 | .take(std::cmp::min(PAIRS_TO_OUTPUT, pairs.len())) 420 | { 421 | writeln!(pairs_file, "{}\t{}\t{}", p.0, p.1, p.2)?; 422 | } 423 | // check for homozygous 424 | if pairs[0].2 * HOMOZYGOUS_TH > pairs[0].2 - pairs[0].3 { 425 | //println!("{} {}",pairs[0].2 * HOMOZYGOUS_TH,pairs[0].2 - pairs[0].3); 426 | alleles_called.insert(pairs[0].0.clone()); 427 | alleles_called.insert(pairs[0].1.clone()); 428 | } else { 429 | alleles_called.insert(weights[0].2.clone()); 430 | } 431 | } else if !weight_names.is_empty() { 432 | alleles_called.insert(weights[0].2.clone()); 433 | } 434 | } 435 | 436 | info!("Writing CDS of {} alleles to file", alleles_called.len()); 437 | use bio::io::fasta; 438 | let gen_name: PathBuf = [outdir, "gen_pseudoaln.fasta"].iter().collect(); 439 | let mut gen_file = fasta::Writer::to_file(gen_name.clone())?; 440 | let cds_name: PathBuf = [outdir, "cds_pseudoaln.fasta"].iter().collect(); 441 | let mut cds_file = fasta::Writer::to_file(cds_name.clone())?; 442 | let cds_db_file = fasta::Reader::from_file(&cds_db)?; 443 | let gen_db_file = fasta::Reader::from_file(&gen_db)?; 444 | 445 | for result in cds_db_file.records() { 446 | let record = result?; 447 | let allele_str = record.desc().ok_or_else(|| format_err!("no HLA allele"))?; 448 | let allele_str = allele_str 449 | .split(' ') 450 | .next() 451 | .ok_or_else(|| format_err!("no HLA allele"))?; 452 | if alleles_called.contains(allele_str) { 453 | info!("Writing CDS of {}", allele_str); 454 | cds_file.write_record(&record)?; 455 | } 456 | } 457 | cds_file.flush()?; 458 | info!( 459 | "Writing genomic sequence of {} alleles to file", 460 | alleles_called.len() 461 | ); 462 | for result in gen_db_file.records() { 463 | let record = result?; 464 | let allele_str = record.desc().ok_or_else(|| format_err!("no HLA allele"))?; 465 | let allele_str = allele_str 466 | .split(' ') 467 | .next() 468 | .ok_or_else(|| format_err!("no 
HLA allele"))?; 469 | if alleles_called.contains(allele_str) { 470 | info!("Writing genomic sequence of {}", allele_str); 471 | gen_file.write_record(&record)?; 472 | } 473 | } 474 | gen_file.flush()?; 475 | Ok((gen_name, cds_name)) 476 | } 477 | 478 | #[cfg(test)] 479 | mod test_em { 480 | use super::*; 481 | use config::EqClass; 482 | 483 | fn test_ds() -> EqClassCounts { 484 | let mut counts = HashMap::new(); 485 | 486 | let eq_a = EqClass::from(vec![0]); 487 | let eq_ab = EqClass::from(vec![0, 1]); 488 | 489 | let eq_c = EqClass::from(vec![2]); 490 | let eq_d = EqClass::from(vec![3]); 491 | 492 | counts.insert(eq_a, 1); 493 | counts.insert(eq_ab, 19); 494 | counts.insert(eq_c, 10); 495 | counts.insert(eq_d, 10); 496 | 497 | EqClassCounts { 498 | counts_umi: counts, 499 | counts_reads: HashMap::new(), 500 | nitems: 4, 501 | nreads: 0, 502 | } 503 | } 504 | 505 | fn test2_ds() -> EqClassCounts { 506 | let mut counts = HashMap::new(); 507 | 508 | let eq_a = EqClass::from(vec![0]); 509 | let eq_ab = EqClass::from(vec![0, 1]); 510 | 511 | let eq_c = EqClass::from(vec![2]); 512 | let eq_d = EqClass::from(vec![3]); 513 | 514 | let eq_e = EqClass::from(vec![4, 5]); 515 | 516 | counts.insert(eq_a, 1); 517 | counts.insert(eq_ab, 19); 518 | counts.insert(eq_c, 10); 519 | counts.insert(eq_d, 10); 520 | counts.insert(eq_e, 20); 521 | 522 | EqClassCounts { 523 | counts_umi: counts, 524 | counts_reads: HashMap::new(), 525 | nitems: 6, 526 | nreads: 0, 527 | } 528 | } 529 | 530 | #[test] 531 | fn simple_inf() { 532 | let eq_c = test_ds(); 533 | let res = em(&eq_c); 534 | 535 | println!("{:?}", res); 536 | } 537 | 538 | #[test] 539 | fn med_inf() { 540 | let eq_c = test2_ds(); 541 | let res = em(&eq_c); 542 | 543 | println!("{:?}", res); 544 | } 545 | 546 | #[test] 547 | fn accel_inf() { 548 | let eq_c = test_ds(); 549 | let res = squarem(&eq_c); 550 | 551 | println!("{:?}", res); 552 | } 553 | 554 | } 555 | 
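The plain EM update that `em` and `EmProblem::f` perform above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository's code: `em_step` and `run_em` are hypothetical names, a plain `Vec<u32>` stands in for `EqClass`, and there is no convergence check or SquareM acceleration.

```rust
use std::collections::HashMap;

/// One plain EM update over equivalence-class counts (hypothetical helper).
/// Each class's count is split among its member alleles in proportion to
/// the current weights; the summed shares are then renormalized to sum to 1.
fn em_step(counts: &HashMap<Vec<u32>, u32>, theta: &[f64]) -> Vec<f64> {
    let mut next = vec![0.0; theta.len()];
    let mut total = 0.0;
    for (class, count) in counts {
        let norm: f64 = class.iter().map(|&tx| theta[tx as usize]).sum();
        if norm <= 0.0 {
            continue; // skip empty classes or classes with no remaining weight
        }
        for &tx in class {
            let share = theta[tx as usize] / norm * f64::from(*count);
            next[tx as usize] += share;
            total += share;
        }
    }
    if total > 0.0 {
        for w in next.iter_mut() {
            *w /= total;
        }
    }
    next
}

/// Run a fixed number of EM updates from a uniform start (hypothetical helper).
fn run_em(counts: &HashMap<Vec<u32>, u32>, nitems: usize, iters: usize) -> Vec<f64> {
    let mut theta = vec![1.0 / nitems as f64; nitems];
    for _ in 0..iters {
        theta = em_step(counts, &theta);
    }
    theta
}

fn main() {
    // Same toy counts as `test_ds` in the test module above.
    let mut counts = HashMap::new();
    counts.insert(vec![0], 1);
    counts.insert(vec![0, 1], 19);
    counts.insert(vec![2], 10);
    counts.insert(vec![3], 10);
    let weights = run_em(&counts, 4, 50);
    println!("{:?}", weights);
}
```

On these toy counts the two singleton classes each keep a quarter of the total weight, while the single unambiguous count for allele 0 gradually pulls the shared `{0, 1}` class toward allele 0.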
-------------------------------------------------------------------------------- /src/hla.rs: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2019 10x Genomics, Inc. All rights reserved. 2 | 3 | use bio::io::fasta; 4 | use debruijn::dna_string::DnaString; 5 | use debruijn_mapping::build_index::build_index; 6 | use debruijn_mapping::config::KmerType; 7 | use failure::Error; 8 | use itertools::Itertools; 9 | use regex::Regex; 10 | use serde::Serialize; 11 | 12 | use std::collections::{HashMap, HashSet}; 13 | use std::fmt; 14 | use std::io::Read; 15 | use std::path::PathBuf; 16 | use std::str::FromStr; 17 | use std::string::String; 18 | 19 | /// Represent an HLA allele. The gene and 1st field (`f1`) are required, 20 | /// additional fields are optional (`f2`, `f3`, `f4`). The expression character 21 | /// is currently dropped. 22 | /// See this reference for details: http://hla.alleles.org/nomenclature/naming.html 23 | #[derive(Clone, Debug, Serialize, Deserialize, Eq, PartialEq, Ord, PartialOrd)] 24 | pub struct Allele { 25 | pub gene: Vec<u8>, 26 | pub f1: u16, 27 | pub f2: Option<u16>, 28 | pub f3: Option<u16>, 29 | pub f4: Option<u16>, 30 | pub name: Vec<u8>, 31 | } 32 | 33 | impl fmt::Display for Allele { 34 | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { 35 | write!(f, "{}", String::from_utf8(self.name.clone()).unwrap()) 36 | } 37 | } 38 | 39 | /// Parser for HLA alleles. 40 | pub struct AlleleParser { 41 | valid_regex: Regex, 42 | field_regex: Regex, 43 | } 44 | 45 | impl AlleleParser { 46 | 47 | /// Initialize an `AlleleParser` 48 | pub fn new() -> AlleleParser { 49 | let valid_regex = Regex::new("^[A-Z0-9]+[*][0-9]+(:[0-9]+)*[A-Z]?$").unwrap(); 50 | let field_regex = Regex::new("[0-9]+(:[0-9]+)*").unwrap(); 51 | AlleleParser { 52 | valid_regex, 53 | field_regex, 54 | } 55 | } 56 | 57 | /// Parse an HLA allele string.
The string must be a valid HLA allele as defined by 58 | /// http://hla.alleles.org/nomenclature/naming.html 59 | pub fn parse(&self, s: &str) -> Result<Allele, Error> { 60 | if !self.valid_regex.is_match(s) { 61 | return Err(format_err!("invalid allele string: {}", s)); 62 | } 63 | 64 | let mut star_split = s.split('*'); 65 | let gene = star_split 66 | .next() 67 | .ok_or_else(|| format_err!("no split: {}", s))?; 68 | let suffix = star_split 69 | .next() 70 | .ok_or_else(|| format_err!("invalid allele no star separator: {}", s))?; 71 | 72 | let flds = self 73 | .field_regex 74 | .find(suffix) 75 | .ok_or_else(|| format_err!("no alleles found {}", s))?; 76 | 77 | let fld_str = flds.as_str(); 78 | let mut flds = fld_str.split(':'); 79 | let f1 = u16::from_str(flds.next().unwrap()).unwrap(); 80 | let f2 = flds.next().map(|f| u16::from_str(f).unwrap()); 81 | let f3 = flds.next().map(|f| u16::from_str(f).unwrap()); 82 | let f4 = flds.next().map(|f| u16::from_str(f).unwrap()); 83 | 84 | Ok(Allele { 85 | gene: gene.as_bytes().to_vec(), 86 | f1, 87 | f2, 88 | f3, 89 | f4, 90 | name: s.as_bytes().to_vec(), 91 | }) 92 | } 93 | } 94 | 95 | /// Load the HLA nucleotide sequence database, typically downloaded from: 96 | /// ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_nuc.fasta 97 | /// This method selects a single representative sequence for each 2-digit allele. The representative 98 | /// sequence is simply the longest sequence in the group of all entries with the same 2-digit allele. 99 | /// Only alleles from the main class-I and class-II genes are returned (A,B,C,DRB1,DPA1,DPB1,DQA1,DQB1) 100 | /// If `use_filter == true`, only alleles listed in `allele_set` will be returned. If `use_filter == false`, 101 | /// the method returns only the longest sequence for each 2-digit allele.
102 | pub fn read_hla_cds( 103 | reader: fasta::Reader<impl Read>, 104 | allele_set: HashSet<String>, 105 | use_filter: bool, 106 | ) -> Result<(Vec<DnaString>, Vec<String>), Error> { 107 | //TODO make these arguments or config variables 108 | let genes = [ 109 | b"A".to_vec(), 110 | b"B".to_vec(), 111 | b"C".to_vec(), 112 | b"DRB1".to_vec(), 113 | b"DPA1".to_vec(), 114 | b"DPB1".to_vec(), 115 | b"DQA1".to_vec(), 116 | b"DQB1".to_vec(), 117 | ]; 118 | let mut seqs = Vec::new(); 119 | let mut transcript_counter = 0; 120 | let mut tx_ids = Vec::new(); 121 | let allele_parser = AlleleParser::new(); 122 | 123 | info!("Starting reading the Fasta file"); 124 | let mut hlas = Vec::new(); 125 | for result in reader.records() { 126 | let record = result?; 127 | 128 | let dna_string = DnaString::from_acgt_bytes(record.seq()); 129 | let allele_str = record.desc().ok_or_else(|| format_err!("no HLA allele"))?; 130 | let allele_str = allele_str 131 | .split(' ') 132 | .next() 133 | .ok_or_else(|| format_err!("no HLA allele"))?; 134 | if use_filter && !allele_set.contains(&allele_str.to_string()) { 135 | continue; 136 | } 137 | 138 | let allele = allele_parser.parse(allele_str)?; 139 | if genes.contains(&allele.gene) { 140 | let tx_id = record.id().to_string(); 141 | let data = (allele, tx_id, allele_str.to_string(), dna_string); 142 | hlas.push(data); 143 | } 144 | } 145 | 146 | hlas.sort(); 147 | 148 | let mut lengths = HashMap::new(); 149 | 150 | // If we are not filtering using the external database, do this step 151 | if !use_filter { 152 | // All the alleles with a common 2-digit prefix must have the same length -- they can only differ by a synonymous mutation. 153 | // Collate these lengths so that we can filter out non-full length sequences.
154 | for (two_digit, alleles) in &hlas.iter().group_by(|v| (v.0.gene.clone(), v.0.f1, v.0.f2)) { 155 | let mut ma: Vec<_> = alleles.collect(); 156 | //println!("td: {:?}, alleles: {:?}", two_digit, ma.len()); 157 | 158 | // Pick the longest representative 159 | ma.sort_by_key(|v| v.3.len()); 160 | let longest = ma.pop().unwrap(); 161 | let (_, _, _, dna_string) = longest; 162 | 163 | lengths.insert(two_digit.clone(), dna_string.len()); 164 | } 165 | } 166 | 167 | for (three_digit, alleles) in &hlas 168 | .iter() 169 | .group_by(|v| (v.0.gene.clone(), v.0.f1, v.0.f2, v.0.f3)) 170 | { 171 | let mut ma: Vec<_> = alleles.collect(); 172 | //println!("td: {:?}, alleles: {:?}", three_digit, ma.len()); 173 | 174 | // Pick the longest representative 175 | ma.sort_by_key(|v| v.3.len()); 176 | 177 | let longest = ma.pop().unwrap(); 178 | let (_, _, allele_str, dna_string) = longest; 179 | 180 | // Get the length of longest 2-digit entry 181 | if !use_filter { 182 | let req_len = lengths[&(three_digit.0.clone(), three_digit.1, three_digit.2)]; 183 | 184 | let mylen = dna_string.len(); 185 | 186 | //println!("td: {:?}, alleles: {:?}, max_len: {}, req_len: {}", three_digit, nalleles, mylen, req_len); 187 | 188 | if mylen >= req_len { 189 | //TODO 190 | //the CDS only has three-digit resolution so having more than that in allele_str is misleading 191 | //when that's reported as the HLA type 192 | seqs.push(dna_string.clone()); 193 | tx_ids.push(allele_str.to_string()); 194 | transcript_counter += 1; 195 | } 196 | } else { 197 | seqs.push(dna_string.clone()); 198 | tx_ids.push(allele_str.to_string()); 199 | transcript_counter += 1; 200 | } 201 | } 202 | 203 | info!( 204 | "Read {} Alleles, deduped into {} full-length 3-digit alleles", 205 | hlas.len(), 206 | transcript_counter 207 | ); 208 | Ok((seqs, tx_ids)) 209 | } 210 | 211 | 212 | /// Same functionality as `read_hla_cds` but returns the allele sequences as byte arrays rather 213 | /// than DnaStrings. 
214 | pub fn read_hla_cds_string( 215 | reader: fasta::Reader, 216 | allele_set: HashSet, 217 | use_filter: bool, 218 | ) -> Result<(Vec>, Vec), Error> { 219 | 220 | let (dna_strings, tx_ids) = read_hla_cds(reader, allele_set, use_filter)?; 221 | let byte_strings = dna_strings.into_iter().map(|s| s.to_ascii_vec()).collect(); 222 | Ok((byte_strings, tx_ids)) 223 | } 224 | 225 | 226 | /// Create a DeBruijn graph index of the HLA alleles listed in the CSV `allele_status` 227 | /// using allele sequences loaded from the FASTA files `hla_fasta`, and write the index 228 | /// to `hla_index`. The index is in an opaque serde/bincode format & can generally only be 229 | /// read by the same build of scHLAcount that produced it. 230 | pub fn make_hla_index( 231 | hla_fasta: PathBuf, 232 | hla_index: PathBuf, 233 | allele_status: PathBuf, 234 | ) -> Result { 235 | info!("Building index from fasta"); 236 | let mut allele_set = HashSet::new(); 237 | let mut rdr = csv::ReaderBuilder::new() 238 | .comment(Some(b'#')) 239 | .from_path(allele_status)?; 240 | for result in rdr.records() { 241 | let record = result?; 242 | let name: String = record[0].parse()?; 243 | let conf: String = record[3].parse()?; 244 | let partial: String = record[6].parse()?; 245 | let dna: String = record[7].parse()?; 246 | //don't use null alleles; we will never see them in RNA! 
247 | if name.rfind('N').is_none() && conf == "Confirmed" && partial == "Full" && dna == "gDNA" { 248 | allele_set.insert(name); 249 | } 250 | } 251 | info!( 252 | "Found {} \"Confirmed\" + \"Full\" + \"gDNA\" + non-Null alleles in allele status file", 253 | allele_set.len() 254 | ); 255 | let fasta = fasta::Reader::from_file(hla_fasta)?; 256 | let (seqs, tx_names) = read_hla_cds(fasta, allele_set, true)?; 257 | let tx_gene_map = HashMap::new(); 258 | let index = build_index::(&seqs, &tx_names, &tx_gene_map)?; 259 | info!("Finished building index!"); 260 | info!("Writing index to disk"); 261 | debruijn_mapping::utils::write_obj(&index, hla_index.clone())?; 262 | info!("Finished writing index!"); 263 | Ok(hla_index) 264 | } 265 | 266 | #[cfg(test)] 267 | mod test { 268 | use super::*; 269 | 270 | const T1: &str = "A*01:01:01:01"; 271 | 272 | #[test] 273 | fn test_parse1() { 274 | let parser = AlleleParser::new(); 275 | let al = parser.parse(T1).unwrap(); 276 | assert_eq!(String::from_utf8(al.gene).unwrap(), "A"); 277 | assert_eq!(al.f1, 1); 278 | assert_eq!(al.f2, Some(1)); 279 | assert_eq!(al.f3, Some(1)); 280 | assert_eq!(al.f4, Some(1)); 281 | } 282 | 283 | const T2: &str = "A*01:01:38L"; 284 | 285 | #[test] 286 | fn test_parse2() { 287 | let parser = AlleleParser::new(); 288 | let al = parser.parse(T2).unwrap(); 289 | assert_eq!(String::from_utf8(al.gene).unwrap(), "A"); 290 | assert_eq!(al.f1, 1); 291 | assert_eq!(al.f2, Some(1)); 292 | assert_eq!(al.f3, Some(38)); 293 | assert_eq!(al.f4, None); 294 | } 295 | 296 | const T3: &str = "MICB*012"; 297 | 298 | #[test] 299 | fn test_parse3() { 300 | let parser = AlleleParser::new(); 301 | let al = parser.parse(T3).unwrap(); 302 | assert_eq!(String::from_utf8(al.gene).unwrap(), "MICB"); 303 | assert_eq!(al.f1, 12); 304 | assert_eq!(al.f2, None); 305 | assert_eq!(al.f3, None); 306 | assert_eq!(al.f4, None); 307 | } 308 | 309 | const T4: &str = "MICB*012,5"; 310 | 311 | #[test] 312 | fn test_parse4() { 313 | let parser 
= AlleleParser::new(); 314 | let al = parser.parse(T4); 315 | assert!(al.is_err()); 316 | } 317 | } 318 | -------------------------------------------------------------------------------- /src/io.rs: -------------------------------------------------------------------------------- 1 | use failure::{Error, ResultExt}; 2 | use serde::{de::DeserializeOwned, Serialize}; 3 | use std::{ 4 | io::{BufRead, BufReader, BufWriter, Read, Seek, Write}, 5 | path::Path, 6 | }; 7 | 8 | /// Open a reader for a text or gzip file 9 | pub fn open_file(p: impl AsRef) -> Result, Error> { 10 | let p = p.as_ref(); 11 | 12 | let mut file = 13 | std::fs::File::open(p).with_context(|_| format!("Error opening file: {:?}", p))?; 14 | 15 | let mut buf = vec![0u8; 4]; 16 | file.read_exact(&mut buf[..])?; 17 | file.seek(std::io::SeekFrom::Start(0))?; 18 | 19 | if &buf[0..2] == &[0x1F, 0x8B] { 20 | let gz = flate2::read::MultiGzDecoder::new(file); 21 | let buf_reader = BufReader::with_capacity(1 << 17, gz); 22 | Ok(Box::new(buf_reader)) 23 | } else { 24 | let buf_reader = BufReader::with_capacity(32 * 1024, file); 25 | Ok(Box::new(buf_reader)) 26 | } 27 | } -------------------------------------------------------------------------------- /src/locus.rs: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2019 10x Genomics, Inc. All rights reserved. 
2 | 3 | use failure::Error; 4 | use regex::Regex; 5 | use std::fmt; 6 | use std::str::FromStr; 7 | 8 | #[derive(PartialEq, Eq, Ord, PartialOrd, Hash, Debug, Deserialize, Clone)] 9 | pub struct Locus { 10 | pub chrom: String, 11 | pub start: u32, 12 | pub end: u32, 13 | } 14 | 15 | impl fmt::Display for Locus { 16 | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { 17 | write!(f, "{}:{}-{}", self.chrom, self.start, self.end) 18 | } 19 | } 20 | 21 | fn remove_commas(s: &str) -> String { 22 | let ss = s.to_string(); 23 | ss.replace(",", "") 24 | } 25 | 26 | impl FromStr for Locus { 27 | type Err = Error; 28 | 29 | fn from_str(s: &str) -> Result { 30 | let re = Regex::new(r"^(.*):([0-9,]+)(-|..)([0-9,]+)$").unwrap(); 31 | let cap = re.captures(s); 32 | 33 | if cap.is_none() { 34 | return Err(format_err!("invalid locus: {}", s)); 35 | } 36 | 37 | let cap = cap.unwrap(); 38 | 39 | let start_s = remove_commas(cap.get(2).unwrap().as_str()); 40 | let end_s = remove_commas(cap.get(4).unwrap().as_str()); 41 | 42 | Ok(Locus { 43 | chrom: cap.get(1).unwrap().as_str().to_string(), 44 | start: FromStr::from_str(&start_s).unwrap(), 45 | end: FromStr::from_str(&end_s).unwrap(), 46 | }) 47 | } 48 | } 49 | -------------------------------------------------------------------------------- /src/main.rs: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2019 10x Genomics, Inc. All rights reserved. 
2 | 3 | extern crate bio; 4 | extern crate clap; 5 | extern crate debruijn; 6 | extern crate debruijn_mapping; 7 | #[macro_use] 8 | extern crate failure; 9 | #[macro_use] 10 | extern crate human_panic; 11 | extern crate itertools; 12 | #[macro_use] 13 | extern crate log; 14 | extern crate regex; 15 | extern crate rust_htslib; 16 | #[macro_use] 17 | extern crate serde; 18 | extern crate simplelog; 19 | extern crate smallvec; 20 | extern crate sprs; 21 | extern crate tempfile; 22 | extern crate terminal_size; 23 | 24 | use clap::{App, Arg}; 25 | 26 | use failure::{Error, ResultExt}; 27 | use simplelog::*; 28 | use sprs::io::write_matrix_market; 29 | use sprs::TriMat; 30 | use terminal_size::{terminal_size, Width}; 31 | 32 | use std::collections::HashMap; 33 | use std::fs::{create_dir_all, File}; 34 | use std::io::{BufRead, BufReader, BufWriter, Write}; 35 | use std::path::{Path, PathBuf}; 36 | use std::process; 37 | use std::str::FromStr; 38 | use std::string::String; 39 | 40 | mod mapping; 41 | use mapping::{map_and_count_pseudo, map_and_count_sw, mapping_wrapper}; 42 | 43 | mod em; 44 | use em::em_wrapper; 45 | 46 | mod hla; 47 | use hla::make_hla_index; 48 | 49 | mod locus; 50 | use locus::Locus; 51 | 52 | mod config; 53 | use config::Barcode; 54 | 55 | mod io; 56 | 57 | fn get_args() -> clap::App<'static, 'static> { 58 | App::new("scHLAcount") 59 | .set_term_width(if let Some((Width(w), _)) = terminal_size() { w as usize } else { 120 }) 60 | .version("DEV") 61 | .author("Charlotte Darby and Ian Fiddes and Patrick Marks ") 62 | .about("HLA genotyping and allele-specific expression for single-cell RNA sequencing") 63 | // Required parameters 64 | .arg(Arg::with_name("bam") 65 | .short("b") 66 | .long("bam") 67 | .value_name("FILE") 68 | .help("Cellranger BAM file") 69 | .required(true)) 70 | .arg(Arg::with_name("cell_barcodes") 71 | .short("c") 72 | .long("cell-barcodes") 73 | .value_name("FILE") 74 | .help("File with cell barcodes to be evaluated") 75 | 
.required(true)) 76 | // Output parameters (optional) 77 | .arg(Arg::with_name("out_dir") 78 | .short("o") 79 | .long("out-dir") 80 | .value_name("OUTPUT_DIR") 81 | .default_value("hla-typer-results")) 82 | .arg(Arg::with_name("pseudoalignertmp") 83 | .long("pl-tmp") 84 | .value_name("PSEUDOALIGNER_TMP") 85 | .default_value("pseudoaligner_tmp") 86 | .help("Directory to write the pseudoaligner temporary files generated")) 87 | // Input parameters (optional) 88 | .arg(Arg::with_name("fastagenomic") 89 | .short("g") 90 | .long("fasta-genomic") 91 | .value_name("FILE") 92 | .help("Multi-FASTA file with genomic sequence of each allele") 93 | .default_value("")) 94 | .arg(Arg::with_name("fastacds") 95 | .short("f") 96 | .long("fasta-cds") 97 | .value_name("FILE") 98 | .help("Multi-FASTA file with CDS sequence of each allele") 99 | .default_value("")) 100 | .arg(Arg::with_name("hladbdir") 101 | .short("d") 102 | .long("hladb-dir") 103 | .value_name("PATH") 104 | .help("Directory of the IMGT-HLA database") 105 | .default_value("")) 106 | .arg(Arg::with_name("hlaindex") 107 | .short("i") 108 | .long("hla-index") 109 | .value_name("FILE") 110 | .help("debruijn_mapping pseudoalignment index file constructed from IMGT-HLA database") 111 | .default_value("")) 112 | // Configuration parameters (optional) 113 | .arg(Arg::with_name("region") 114 | .short("r") 115 | .long("region") 116 | .value_name("STRING") 117 | .help("Samtools-format region string of reads to use") 118 | .default_value("6:28510120-33480577")) 119 | .arg(Arg::with_name("log_level") 120 | .long("log-level") 121 | .possible_values(&["info", "debug", "error"]) 122 | .default_value("info") 123 | .help("Logging level")) 124 | .arg(Arg::with_name("primary_alignments") 125 | .long("primary-alignments") 126 | .help("If specified, will use primary alignments only")) 127 | .arg(Arg::with_name("exact_count") 128 | .long("use-exact-count") 129 | .help("If specified, will use exact alignment to allele sequences to count 
molecules (very slow!)")) 130 | .arg(Arg::with_name("unmapped") 131 | .long("unmapped") 132 | .help("If specified, will also use unmapped reads for genotyping")) 133 | } 134 | 135 | fn main() { 136 | setup_panic!(); // pretty panics for users 137 | let mut cli_args = Vec::new(); 138 | for arg in std::env::args_os() { 139 | cli_args.push(arg.into_string().unwrap()); 140 | } 141 | let res = _main(cli_args); 142 | if let Err(e) = res { 143 | println!("Failed with error: {}", e); 144 | println!("Additional details\n:{:?}", e); 145 | process::exit(1); 146 | } 147 | } 148 | 149 | // constructing a _main allows for us to run regression tests way more easily 150 | #[allow(clippy::cognitive_complexity)] 151 | fn _main(cli_args: Vec<String>) -> Result<(), Error> { 152 | let args = get_args().get_matches_from(cli_args); 153 | let fasta_file_gen = args.value_of("fastagenomic").unwrap_or_default(); 154 | let fasta_file_cds = args.value_of("fastacds").unwrap_or_default(); 155 | let hla_db_dir = args.value_of("hladbdir").unwrap_or_default(); 156 | let pseudoaligner_tmpdir = args.value_of("pseudoalignertmp").unwrap_or_default(); 157 | let hla_index = args.value_of("hlaindex").unwrap_or_default(); 158 | let bam_file = args.value_of("bam").expect("You must provide a BAM file"); 159 | let cell_barcodes = args 160 | .value_of("cell_barcodes") 161 | .expect("You must provide a cell barcodes file"); 162 | let region = args.value_of("region").unwrap_or_default(); 163 | let out_dir = args.value_of("out_dir").unwrap_or_default(); 164 | let primary_only = args.is_present("primary_alignments"); 165 | let exact_count = args.is_present("exact_count"); 166 | let use_unmapped = args.is_present("unmapped"); 167 | let ll = args.value_of("log_level").unwrap(); 168 | 169 | let ll = match ll { 170 | "info" => LevelFilter::Info, 171 | "debug" => LevelFilter::Debug, 172 | "error" => LevelFilter::Error, 173 | &_ => { 174 | return Err(format_err!("Log level must be 'info', 'debug', or 'error'")); 175 | } 176 |
}; 177 | let _ = SimpleLogger::init(ll, Config::default()); 178 | 179 | check_inputs_exist(bam_file, cell_barcodes, out_dir)?; 180 | 181 | let region: Locus = Locus::from_str(region).expect("Failed to parse region string"); 182 | let (mut genomic, mut cds): (PathBuf, PathBuf) = 183 | (PathBuf::from(fasta_file_gen), PathBuf::from(fasta_file_cds)); 184 | let bam_file = PathBuf::from(bam_file); 185 | // If the CDS or genomic FASTA files were not provided, generate them 186 | if fasta_file_gen.is_empty() || fasta_file_cds.is_empty() { 187 | check_inputs_exist_pseudoaligner(pseudoaligner_tmpdir)?; 188 | let p_tmpdir: PathBuf = [pseudoaligner_tmpdir, "pseudoaligner"].iter().collect(); 189 | if hla_db_dir.is_empty() { 190 | return Err(format_err!( 191 | "Must provide either -d (database directory) or both -g and -f (FASTA sequences)." 192 | )); 193 | } 194 | check_inputs_exist_hla_db(hla_db_dir)?; 195 | let (counts_file, hla_index1) : (PathBuf, PathBuf) = 196 | // If the index for pseudoalignment was not provided, generate it 197 | if hla_index.is_empty() { 198 | let db_fasta: PathBuf = [hla_db_dir, "hla_nuc.fasta"].iter().collect(); 199 | let allele_status: PathBuf = [hla_db_dir, "Allele_status.txt"].iter().collect(); 200 | let p_hla_index: PathBuf = [pseudoaligner_tmpdir,"hla_nuc.fasta.idx"].iter().collect(); 201 | let hla_index_generated: PathBuf = make_hla_index(db_fasta,p_hla_index,allele_status).expect("Pseudoaligner index building failed"); 202 | (mapping_wrapper(hla_index_generated.clone(), p_tmpdir, bam_file.clone(), Some(&region), use_unmapped)?, hla_index_generated) 203 | } else { 204 | check_inputs_exist_hla_idx(hla_index)?; 205 | (mapping_wrapper(PathBuf::from(hla_index), p_tmpdir, bam_file.clone(), Some(&region), use_unmapped)?, PathBuf::from(hla_index)) 206 | }; 207 | 208 | let cdsdb: PathBuf = [hla_db_dir, "hla_nuc.fasta"].iter().collect(); 209 | let gendb: PathBuf = [hla_db_dir, "hla_gen.fasta"].iter().collect(); 210 | let i = em_wrapper(hla_index1,
counts_file, cdsdb, gendb, pseudoaligner_tmpdir).context("EM/pseudoalignment-consensus step failed")?; 211 | genomic = i.0; 212 | cds = i.1; 213 | } 214 | 215 | check_inputs_exist_fasta(&cds, &genomic)?; 216 | 217 | let cell_barcodes = load_barcodes(&cell_barcodes)?; 218 | let (entries, nrows, metrics, rownames) = if exact_count { 219 | map_and_count_sw( 220 | bam_file, 221 | &out_dir, 222 | &cell_barcodes, 223 | &region, 224 | cds, 225 | genomic, 226 | primary_only, 227 | )? 228 | } else { 229 | map_and_count_pseudo( 230 | bam_file, 231 | &out_dir, 232 | &cell_barcodes, 233 | &region, 234 | cds, 235 | genomic, 236 | primary_only, 237 | )? 238 | }; 239 | 240 | info!( 241 | "Initialized a {} features x {} cell barcodes matrix. Allele set: {}", 242 | nrows, 243 | cell_barcodes.len(), 244 | rownames.len(), 245 | ); 246 | 247 | let mut matrix: TriMat = TriMat::new((nrows, cell_barcodes.len())); 248 | 249 | info!( 250 | "Number of alignments evaluated (with 'CB' and 'UB' tags): {}", 251 | metrics.num_reads 252 | ); 253 | if primary_only { 254 | info!( 255 | "Number of alignments skipped due to not being primary: {}", 256 | metrics.num_non_primary 257 | ); 258 | } else { 259 | info!( 260 | "Number of alignments that were not primary: {}", 261 | metrics.num_non_primary 262 | ); 263 | } 264 | info!("Number of alignments skipped due to not being associated with a cell barcode in the list provided: {}", metrics.num_not_cell_bc); 265 | info!( 266 | "Number of reads with no alignment score above threshold: {}", 267 | metrics.num_not_aligned 268 | ); 269 | info!( 270 | "Number of alignments to CDS sequence: {}", 271 | metrics.num_cds_align 272 | ); 273 | info!( 274 | "Number of alignments to genomic sequence: {}", 275 | metrics.num_gen_align 276 | ); 277 | 278 | for e in entries { 279 | matrix.add_triplet(e.row as usize, e.column, e.value); 280 | } 281 | 282 | let d: PathBuf = [out_dir, "count_matrix.mtx"].iter().collect(); 283 | write_matrix_market(d.to_str().unwrap(), &matrix)?;
284 | debug!("Wrote reference matrix file"); 285 | 286 | let d: PathBuf = [out_dir, "summary.tsv"].iter().collect(); 287 | let mut summary_file = BufWriter::new(File::create(d)?); 288 | let m: sprs::CsMat = matrix.to_csr(); 289 | for (row_ind, row_vec) in m.outer_iterator().enumerate() { 290 | let mut s = 0; 291 | for (_col_ind, &val) in row_vec.iter() { 292 | s += val; 293 | } 294 | debug!("{} - {} molecules", rownames[row_ind], s); 295 | writeln!(summary_file, "{}\t{}", rownames[row_ind], s)?; 296 | } 297 | Ok(()) 298 | } 299 | 300 | /* Validate Input/Output Files/Paths */ 301 | 302 | pub fn check_inputs_exist(bam_file: &str, cell_barcodes: &str, out_dir: &str) -> Result<(), Error> { 303 | for path in [bam_file, cell_barcodes].iter() { 304 | if !Path::new(&path).exists() { 305 | return Err(format_err!("Input {:?} does not exist", path)); 306 | } 307 | } 308 | // check for BAM/CRAM index 309 | let extension = Path::new(bam_file).extension().unwrap().to_str().unwrap(); 310 | match extension { 311 | "bam" => { 312 | let bai = bam_file.to_owned() + ".bai"; 313 | if !Path::new(&bai).exists() { 314 | return Err(format_err!("BAM index {} does not exist", bai)); 315 | } 316 | } 317 | "cram" => { 318 | let crai = bam_file.to_owned() + ".crai"; 319 | if !Path::new(&crai).exists() { 320 | return Err(format_err!("CRAM index {} does not exist", crai)); 321 | } 322 | } 323 | &_ => { 324 | return Err(format_err!( 325 | "BAM file did not end in .bam or .cram. 
Unable to validate" 326 | )); 327 | } 328 | } 329 | if !Path::new(&out_dir).exists() { 330 | match create_dir_all(&out_dir) { 331 | Err(_e) => { 332 | return Err(format_err!( 333 | "Couldn't create results directory at {}", 334 | out_dir 335 | )); 336 | } 337 | _ => { 338 | info!("Created output directory at {}", out_dir); 339 | } 340 | } 341 | } else { 342 | return Err(format_err!( 343 | "Specified output directory {} already exists", 344 | out_dir 345 | )); 346 | } 347 | Ok(()) 348 | } 349 | 350 | pub fn check_inputs_exist_pseudoaligner(path: &str) -> Result<(), Error> { 351 | if !Path::new(&path).exists() { 352 | match create_dir_all(&path) { 353 | Err(_e) => { 354 | return Err(format_err!("Couldn't create temp directory at {}", path)); 355 | } 356 | _ => { 357 | info!("Created temp directory at {}", path); 358 | } 359 | } 360 | } else { 361 | return Err(format_err!("Specified tempdir {} already exists", path)); 362 | } 363 | Ok(()) 364 | } 365 | 366 | pub fn check_inputs_exist_hla_db(path: &str) -> Result<(), Error> { 367 | if !Path::new(&path).exists() { 368 | return Err(format_err!( 369 | "IMGT-HLA database directory {} does not exist", 370 | path 371 | )); 372 | } 373 | for file in ["hla_gen.fasta", "hla_nuc.fasta", "Allele_status.txt"].iter() { 374 | let p: PathBuf = [path, file].iter().collect(); 375 | if !p.exists() { 376 | return Err(format_err!( 377 | "IMGT-HLA database file {} does not exist", 378 | file 379 | )); 380 | } 381 | } 382 | Ok(()) 383 | } 384 | 385 | pub fn check_inputs_exist_fasta(fasta_cds: &PathBuf, fasta_gen: &PathBuf) -> Result<(), Error> { 386 | for path in [fasta_cds, fasta_gen].iter() { 387 | if !Path::new(path).exists() { 388 | return Err(format_err!("Input file {:?} does not exist", path)); 389 | } 390 | } 391 | Ok(()) 392 | } 393 | 394 | pub fn check_inputs_exist_hla_idx(path: &str) -> Result<(), Error> { 395 | if !Path::new(&path).exists() { 396 | return Err(format_err!( 397 | "Pseudoalignment index {} does not exist. 
Omit parameter -i to generate automatically.", 398 | path 399 | )); 400 | } 401 | Ok(()) 402 | } 403 | 404 | /* Helper Functions */ 405 | 406 | pub fn load_barcodes(filename: impl AsRef<Path>) -> Result<HashMap<Barcode, u32>, Error> { 407 | let r = io::open_file(filename)?; 408 | let reader = BufReader::with_capacity(32 * 1024, r); 409 | 410 | let mut bc_set = HashMap::new(); 411 | 412 | for (i, l) in reader.lines().enumerate() { 413 | let cb = Barcode::from_slice(l?.as_bytes()); 414 | bc_set.insert(cb, i as u32); 415 | } 416 | let num_bcs = bc_set.len(); 417 | if num_bcs == 0 { 418 | return Err(format_err!( 419 | "Loaded 0 barcodes. Is your barcode file gzipped or empty?" 420 | )); 421 | } 422 | debug!("Loaded {} barcodes", num_bcs); 423 | Ok(bc_set) 424 | } 425 | 426 | #[cfg(test)] 427 | mod tests { 428 | use super::*; 429 | use sprs::io::read_matrix_market; 430 | use tempfile::tempdir; 431 | 432 | #[test] 433 | fn test_allele_fasta() { 434 | //This test takes ~1min in compilation "debug" mode 435 | //21 alignments x 6 allele sequences 436 | let mut cmds = Vec::new(); 437 | let tmp_dir = tempdir().unwrap(); 438 | let pl_dir = tmp_dir.path().join("p"); 439 | let pl_dir = pl_dir.to_str().unwrap(); 440 | let result_dir = tmp_dir.path().join("r"); 441 | let out_file = result_dir.join("count_matrix.mtx"); 442 | let out_file = out_file.to_str().unwrap(); 443 | let result_dir = result_dir.to_str().unwrap(); 444 | for l in &[ 445 | "scHLAcount", 446 | "-b", 447 | "test/test.bam", 448 | "-g", 449 | "test/genomic_ABC.fa", 450 | "-f", 451 | "test/cds_ABC.fa", 452 | "-c", 453 | "test/barcodes1.tsv", 454 | "-o", 455 | result_dir, 456 | "--pl-tmp", 457 | pl_dir, 458 | ] { 459 | cmds.push(l.to_string()); 460 | } 461 | let res = _main(cmds); 462 | assert!(!res.is_err()); 463 | let seen_mat: TriMat = read_matrix_market(out_file).unwrap(); 464 | let expected_mat: TriMat = read_matrix_market("test/test_allele_fasta.mtx").unwrap(); 465 | assert_eq!(seen_mat.to_csr(), expected_mat.to_csr()); 466 | } 467 | 468 |
#[test] 469 | fn test_call() { 470 | let mut cmds = Vec::new(); 471 | let tmp_dir = tempdir().unwrap(); 472 | let pl_dir = tmp_dir.path().join("p"); 473 | let pl_dir = pl_dir.to_str().unwrap(); 474 | let result_dir = tmp_dir.path().join("r"); 475 | let out_file = result_dir.join("count_matrix.mtx"); 476 | let out_file = out_file.to_str().unwrap(); 477 | let result_dir = result_dir.to_str().unwrap(); 478 | for l in &[ 479 | "scHLAcount", 480 | "-b", 481 | "test/test.bam", 482 | "-d", 483 | "test/fake_db", 484 | "-i", 485 | "test/fake_db/hla_nuc.fasta.idx", 486 | "-c", 487 | "test/barcodes0.tsv", 488 | "-o", 489 | result_dir, 490 | "--pl-tmp", 491 | pl_dir, 492 | ] { 493 | cmds.push(l.to_string()); 494 | } 495 | let res = _main(cmds); 496 | assert!(!res.is_err()); 497 | let seen_mat: TriMat = read_matrix_market(out_file).unwrap(); 498 | let expected_mat: TriMat = read_matrix_market("test/test_call.mtx").unwrap(); 499 | assert_eq!(seen_mat.to_csr(), expected_mat.to_csr()); 500 | } 501 | } 502 | -------------------------------------------------------------------------------- /test/barcodes0.tsv: -------------------------------------------------------------------------------- 1 | AAACCTGAGAGCAATT-1 2 | 3 | -------------------------------------------------------------------------------- /test/barcodes1.tsv: -------------------------------------------------------------------------------- 1 | AAACCTGAGACCACGA-1 -------------------------------------------------------------------------------- /test/barcodes7.tsv: -------------------------------------------------------------------------------- 1 | AAACCTGAGACCACGA-1 2 | AAACCTGAGAGCAATT-1 3 | AAACCTGAGGCTCTTA-1 4 | AAACCTGAGTTGTCGT-1 5 | AAACCTGCAAACGCGA-1 6 | AAACCTGCAAGAGGCT-1 7 | AAACCTGCACTCTGTC-1 8 | -------------------------------------------------------------------------------- /test/cds_ABC.fa: -------------------------------------------------------------------------------- 1 | >HLA:HLA21338 A*02:01:154 1098 bp 2 | 
ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACC 3 | CAGACCTGGGCGGGCTCTCACTCCATGAGGTATTTCTTCACATCCGTGTCCCGGCCCGGC 4 | CGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 5 | GACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGT 6 | CCGGAGTATTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGACTCACCGAGTG 7 | GACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCGTCCAG 8 | AGGATGTATGGCTGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGTACCACCAGTAC 9 | GCCTACGACGGCAAGGATTACATCGCCCTGAAAGAGGACCTGCGCTCTTGGACCGCGGCG 10 | GACATGGCAGCTCAGACCACCAAGCACAAGTGGGAGGCGGCCCATGTGGCGGAGCAGTTG 11 | AGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 12 | GAGACGCTGCAGCGCACGGACGCCCCCAAAACGCATATGACTCACCACGCTGTCTCTGAC 13 | CATGAAGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGATCACACTGACC 14 | TGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGACCAGGCCTGCA 15 | GGGGATGGAACCTTCCAGAAGTGGGCGGCTGTGGTGGTGCCTTCTGGACAGGAGCAGAGA 16 | TACACCTGCCATGTGCAGCATGAGGGTTTGCCCAAGCCCCTCACCCTGAGATGGGAGCCG 17 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCATTGCTGGCCTGGTTCTCTTTGGAGCT 18 | GTGATCACTGGAGCTGTGGTCGCTGCTGTGATGTGGAGGAGGAAGAGCTCAGATAGAAAA 19 | GGAGGGAGCTACTCTCAGGCTGCAAGCAGTGACAGTGCCCAGGGTTCTGATGTGTCTCTC 20 | ACAGCTTGTAAAGTGTGA 21 | >HLA:HLA00097 A*31:01:02:01 1098 bp 22 | ATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGGCCCTGACC 23 | CAGACCTGGGCGGGCTCCCACTCCATGAGGTATTTCACCACATCCGTGTCCCGGCCCGGC 24 | CGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 25 | GACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGAGG 26 | CCTGAGTATTGGGACCAGGAGACACGGAATGTGAAGGCCCACTCACAGATTGACCGAGTG 27 | GACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCATCCAG 28 | ATGATGTATGGCTGCGACGTGGGGTCGGACGGGCGCTTCCTCCGCGGGTACCAGCAGGAC 29 | GCCTACGACGGCAAGGATTACATCGCCTTGAACGAGGACCTGCGCTCTTGGACCGCGGCG 30 | GACATGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGTTG 31 | AGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 32 | GAGACGCTGCAGCGCACGGACCCCCCCAAGACGCATATGACTCACCACGCTGTCTCTGAC 33 | 
CATGAGGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGATCACACTGACC 34 | TGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGACCAGGCCTGCA 35 | GGGGATGGAACCTTCCAGAAGTGGGCGTCTGTGGTGGTGCCTTCTGGACAGGAGCAGAGA 36 | TACACCTGCCATGTGCAGCATGAGGGTCTCCCCAAGCCCCTCACCCTGAGATGGGAGCCG 37 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCATTGCTGGCCTAGTTCTCTTTGGAGCT 38 | GTGTTCGCTGGAGCTGTGGTCGCTGCTGTGAGGTGGAGGAGGAAGAGCTCAGATAGAAAA 39 | GGAGGGAGCTACTCTCAGGCTGCAAGCAGTGACAGTGCCCAGGGCTCTGATATGTCTCTC 40 | ACAGCTTGTAAAGTGTGA 41 | >HLA:HLA00344 B*51:01:01:01 1089 bp 42 | ATGCGGGTCACGGCGCCCCGAACCGTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACC 43 | GAGACCTGGGCCGGCTCCCACTCCATGAGGTATTTCTACACCGCCATGTCCCGGCCCGGC 44 | CGCGGGGAGCCCCGCTTCATTGCAGTGGGCTACGTGGACGACACCCAGTTCGTGAGGTTC 45 | GACAGCGACGCCGCGAGTCCGAGGACGGAGCCCCGGGCGCCATGGATAGAGCAGGAGGGG 46 | CCGGAGTATTGGGACCGGAACACACAGATCTTCAAGACCAACACACAGACTTACCGAGAG 47 | AACCTGCGGATCGCGCTCCGCTACTACAACCAGAGCGAGGCCGGGTCTCACACTTGGCAG 48 | ACGATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATAACCAGTAC 49 | GCCTACGACGGCAAAGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCGGCG 50 | GACACCGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGCTG 51 | AGAGCCTACCTGGAGGGCCTGTGCGTGGAGTGGCTCCGCAGACACCTGGAGAACGGGAAG 52 | GAGACGCTGCAGCGCGCGGACCCCCCAAAGACACACGTGACCCACCACCCCGTCTCTGAC 53 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACC 54 | TGGCAGCGGGATGGCGAGGACCAAACTCAGGACACTGAGCTTGTGGAGACCAGACCAGCA 55 | GGAGATAGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 56 | TACACATGCCATGTACAGCATGAGGGGCTGCCGAAGCCCCTCACCCTGAGATGGGAGCCA 57 | TCTTCCCAGTCCACCATCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTT 58 | GTGGTCATCGGAGCTGTGGTCGCTACTGTGATGTGTAGGAGGAAGAGCTCAGGTGGAAAA 59 | GGAGGGAGCTACTCTCAGGCTGCGTCCAGCGACAGTGCCCAGGGCTCTGATGTGTCTCTC 60 | ACAGCTTGA 61 | >HLA:HLA00225 B*27:05:02:01 1089 bp 62 | ATGCGGGTCACGGCGCCCCGAACCCTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACC 63 | GAGACCTGGGCTGGCTCCCACTCCATGAGGTATTTCCACACCTCCGTGTCCCGGCCCGGC 64 | CGCGGGGAGCCCCGCTTCATCACCGTGGGCTACGTGGACGACACGCTGTTCGTGAGGTTC 65 | 
GACAGCGACGCCGCGAGTCCGAGAGAGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGG 66 | CCGGAGTATTGGGACCGGGAGACACAGATCTGCAAGGCCAAGGCACAGACTGACCGAGAG 67 | GACCTGCGGACCCTGCTCCGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAG 68 | AATATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGTACCACCAGGAC 69 | GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCCGCG 70 | GACACGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGCTG 71 | AGAGCCTACCTGGAGGGCGAGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 72 | GAGACGCTGCAGCGCGCGGACCCCCCAAAGACACACGTGACCCACCACCCCATCTCTGAC 73 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACC 74 | TGGCAGCGGGATGGCGAGGACCAAACTCAGGACACTGAGCTTGTGGAGACCAGACCAGCA 75 | GGAGATAGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 76 | TACACATGCCATGTACAGCATGAGGGGCTGCCGAAGCCCCTCACCCTGAGATGGGAGCCG 77 | TCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTT 78 | GTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGGTGGAAAA 79 | GGAGGGAGCTACTCTCAGGCTGCGTGCAGCGACAGTGCCCAGGGCTCTGATGTGTCTCTC 80 | ACAGCTTGA 81 | >HLA:HLA00405 C*02:02:02:01 1101 bp 82 | ATGCGGGTCATGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGGAGCCCTGGCCCTGACC 83 | GAGACCTGGGCCTGCTCCCACTCCATGAGGTATTTCTACACCGCTGTGTCCCGGCCCAGC 84 | CGCGGAGAGCCCCACTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 85 | GACAGCGACGCCGCGAGTCCAAGAGGGGAGCCGCGGGCGCCGTGGGTGGAGCAGGAGGGG 86 | CCGGAGTATTGGGACCGGGAGACACAGAAGTACAAGCGCCAGGCACAGACTGACCGAGTG 87 | AACCTGCGGAAACTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAG 88 | AGGATGTACGGCTGCGACCTGGGGCCCGACGGGCGCCTCCTCCGCGGGTATGACCAGTCC 89 | GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCCGCG 90 | GACACAGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGTGG 91 | AGAGCCTACCTGGAGGGCGAGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 92 | GAGACGCTGCAGCGCGCGGAACACCCAAAGACACACGTGACCCACCATCCCGTCTCTGAC 93 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTACGGAGATCACACTGACC 94 | TGGCAGCGGGATGGCGAGGACCAAACTCAGGACACCGAGCTTGTGGAGACCAGGCCAGCA 95 | GGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 96 | 
TACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCACCCTGAGATGGGAGCCA 97 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGGCCTGGCTGTCCTGGCTGTC 98 | CTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAGGAGGAAGAGCTCAGGTGGA 99 | AAAGGAGGGAGCTGCTCTCAGGCTGCGTCCAGCAACAGTGCCCAGGGCTCTGATGAGTCT 100 | CTCATCGCTTGTAAAGCCTGA 101 | >HLA:HLA00462 C*14:02:01:01 1101 bp 102 | ATGCGGGTCATGGCGCCCCGAACCCTCATCCTGCTGCTCTCGGGAGCCCTGGCCCTGACC 103 | GAGACCTGGGCCTGCTCCCACTCCATGAGGTATTTCTCCACATCCGTGTCCCGGCCCGGC 104 | CGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 105 | GACAGCGACGCCGCGAGTCCGAGAGGGGAGCCGCGGGCGCCGTGGGTGGAGCAGGAGGGG 106 | CCGGAGTATTGGGACCGGGAGACACAGAAGTACAAGCGCCAGGCACAGACTGACCGAGTG 107 | AGCCTGCGGAACCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAG 108 | TGGATGTTTGGCTGCGACCTGGGGCCCGACGGGCGCCTCCTCCGCGGGTATGACCAGTCC 109 | GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGATCTGCGCTCCTGGACCGCCGCG 110 | GACACGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGCGG 111 | AGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 112 | GAGACGCTGCAGCGCGCGGAACACCCAAAGACACACGTGACCCACCATCCCGTCTCTGAC 113 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACC 114 | TGGCAGTGGGATGGGGAGGACCAAACTCAGGACACCGAGCTTGTGGAGACCAGGCCAGCA 115 | GGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 116 | TACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCACCCTGAGATGGGAGCCG 117 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGGCCTGGCTGTCCTGGCTGTC 118 | CTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAGGAGGAAGAGCTCAGGTGGA 119 | AAAGGAGGGAGCTGCTCTCAGGCTGCGTCCAGCAACAGTGCCCAGGGCTCTGATGAGTCT 120 | CTCATCGCTTGTAAAGCCTGA 121 | -------------------------------------------------------------------------------- /test/fake_db/Allele_status.txt: -------------------------------------------------------------------------------- 1 | # file: Allele_status.txt 2 | Allele,Cells,Groups,Confirmed,Start,End,Partial,Type 3 | A*02:01:154,2,2,Confirmed,1,1098,Full,gDNA 4 | A*31:01:02:01,14,9,Confirmed,1,1098,Full,gDNA 5 | B*51:01:01:01,14,9,Confirmed,1,1089,Full,gDNA 6 
| B*27:05:02:01,11,8,Confirmed,1,1089,Full,gDNA 7 | C*02:02:02:01,5,6,Confirmed,1,1101,Full,gDNA 8 | C*14:02:01:01,8,10,Confirmed,1,1101,Full,gDNA 9 | -------------------------------------------------------------------------------- /test/fake_db/hla_gen.fasta: -------------------------------------------------------------------------------- 1 | >HLA:HLA21338 A*02:01:154 3517 bp 2 | CAGAAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTGGCTCTCAGGGTCTCAG 3 | GCCCCGAAGGCGGTGTATGGATTGGGGAGTCCCAGCCTTGGGGATTCCCCAACTCCGCAG 4 | TTTCTTTTCTCCCTCTCCCAACCTATGTAGGGTCCTTCTTCCTGGATACTCACGACGCGG 5 | ACCCAGTTCTCACTCCCATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCG 6 | GTCGCGGTTCTAAAGTCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGG 7 | ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACC 8 | CAGACCTGGGCGGGTGAGTGCGGGGTCGGGAGGGAAACGGCCTCTGTGGGGAGAAGCAAC 9 | GGGCCCGCCTGGCGGGGGCGCAGGACCCGGGAAGCCGCGCCGGGAGGAGGGTCGGGCGGG 10 | TCTCAGCCACTCCTCGTCCCCAGGCTCTCACTCCATGAGGTATTTCTTCACATCCGTGTC 11 | CCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTT 12 | CGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGA 13 | GCAGGAGGGTCCGGAGTATTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGAC 14 | TCACCGAGTGGACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTGAGTG 15 | ACCCCGGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCAGGTCGCCCA 16 | CAGTCTCCGGGTCCGAGATCCGCCCCGAAGCCGCGGGACCCCGAGACCCTTGCCCCGGGA 17 | GAGGCCCAGGCGCCTTTACCCGGTTTCATTTTCAGTTTAGGCCAAAAATCCCCCCAGGTT 18 | GGTCGGGGCGGGGCGGGGCTCGGGGGACCGGGCTGACCGCGGGGTCCGGGCCAGGTTCTC 19 | ACACCGTCCAGAGGATGTATGGCTGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGT 20 | ACCACCAGTACGCCTACGACGGCAAGGATTACATCGCCCTGAAAGAGGACCTGCGCTCTT 21 | GGACCGCGGCGGACATGGCAGCTCAGACCACCAAGCACAAGTGGGAGGCGGCCCATGTGG 22 | CGGAGCAGTTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGG 23 | AGAACGGGAAGGAGACGCTGCAGCGCACGGGTACCAGGGGCCACGGGGCGCCTCCCTGAT 24 | CGCCTGTAGATCTCCCGGGCTGGCCTCCCACAAGGAGGGGAGACAATTGGGACCAACACT 25 | AGAATATCGCCCTCCCTCTGGTCCTGAGGGAGAGGAATCCTCCTGGGTTTCCAGATCCTG 26 | 
TACCAGAGAGTGACTCTGAGGTTCCGCCCTGCTCTCTGACACAATTAAGGGATAAAATCT 27 | CTGAAGGAATGACGGGAAGACGATCCCTCGAATACTGATGAGTGGTTCCCTTTGACACAC 28 | ACAGGCAGCAGCCTTGGGCCCGTGACTTTTCCTCTCAGGCCTTGTTCTCTGCTTCACACT 29 | CAATGTGTGTGGGGGTCTGAGTCCAGCACTTCTGAGTCCTTCAGCCTCCACTCAGGTCAG 30 | GACCAGAAGTCGCTGTTCCCTCTTCAGGGACTAGAATTTTCCACGGAATAGGAGATTATC 31 | CCAGGTGCCTGTGTCCAGGCTGGTGTCTGGGTTCTGTGCTCCCTTCCCCATCCCAGGTGT 32 | CCTGTCCATTCTCAAGATAGCCACATGTGTGCTGGAGGAGTGTCCCATGACAGATGCAAA 33 | ATGCCTGAATGATCTGACTCTTCCTGACAGACGCCCCCAAAACGCATATGACTCACCACG 34 | CTGTCTCTGACCATGAAGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGA 35 | TCACACTGACCTGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGA 36 | CCAGGCCTGCAGGGGATGGAACCTTCCAGAAGTGGGCGGCTGTGGTGGTGCCTTCTGGAC 37 | AGGAGCAGAGATACACCTGCCATGTGCAGCATGAGGGTTTGCCCAAGCCCCTCACCCTGA 38 | GATGGGGTAAGGAGGGAGACGGGGGTGTCATGTCTTTTAGGGAAAGCAGGAGCCTCTCTG 39 | ACCTTTAGCAGGGTCAGGGCCCCTCACCTTCCCCTCTTTTCCCAGAGCCGTCTTCCCAGC 40 | CCACCATCCCCATCGTGGGCATCATTGCTGGCCTGGTTCTCTTTGGAGCTGTGATCACTG 41 | GAGCTGTGGTCGCTGCTGTGATGTGGAGGAGGAAGAGCTCAGGTGGGGAAGGGGTGAAGG 42 | GTGGGTCTGAGATTTCTTGTCTCACTGAGGGTTCCAAGACCCAGGTAGAAGTGTGCCCTG 43 | CCTCGTTACTGGGAAGCACCACCCACAATTATGGGCCTACCCAGCCTGGGCCCTGTGTGC 44 | CAGCACTTACTCTTTTGTAAAGCACCTGTTAAAATGAAGGACAGATTTATCACCTTGATT 45 | ACAGCGGTGATGGGACCTGATCCCAGCAGTCACAAGTCACAGGGGAAGGTCCCTGAGGAC 46 | CTTCAGGAGGGCGGTTGGTCCAGGACCCACACCTGCTTTCTTCATGTTTCCTGATCCCGC 47 | CCTGGGTCTGCAGTCACACATTTCTGGAAACTTCTCTGAGGTCCAAGACTTGGAGGTTCC 48 | TCTAGGACCTTAAGGCCCTGACTCCTTTCTGGTATCTCACAGGACATTTTCTTCCCACAG 49 | ATAGAAAAGGAGGGAGCTACTCTCAGGCTGCAAGTAAGTATGAAGGAGGCTGATGCCTGA 50 | GGTCCTTGGGATATTGTGTTTGGGAGCCCATGGGGGAGCTCACCCACCCCACAATTCCTC 51 | CTCTAGCCACATCTTCTGTGGGATCTGACCAGGTTCTGTTTTTGTTCTACCCCAGGCAGT 52 | GACAGTGCCCAGGGTTCTGATGTGTCTCTCACAGCTTGTAAAGGTGAGAGCCTGGAGGGC 53 | CTGATGTGTGTTGGGTGTTGGGCGGAACAGTGGACACAGCTGTGCTATGGGGTTTCTTTC 54 | CATTGGATGTATTGAGCATGCGATGGGCTGTTTAAAGTGTGACCCCTCACTGTGACAGAT 55 | ACGAATTTGTTCATGAATATTTTTTTCTATAGTGTGAGACAGCTGCCTTGTGTGGGACTG 56 | 
AGAGGCAAGAGTTGTTCCTGCCCTTCCCTTTGTGACTTGAAGAACCCTGACTTTGTTTCT 57 | GCAAAGGCACCTGCATGTGTCTGTGTTCGTGTAGGCATAATGTGAGGAGGTGGGGAGACC 58 | ACCCCACCCCCATGTCCACCATGACCCTCTTCCCACGCTGACCTGTGCTCCCTCCCCAAT 59 | CATCTTTCCTGTTCCAGAGAGGTGGGGCTGAGGTGTCTCCATCTCTGTCTCAACTTCATG 60 | GTGCACTGAGCTGTAACTTCTTCCTTCCCTATTAAAA 61 | >HLA:HLA00097 A*31:01:02:01 3518 bp 62 | CAGGAGCAGAGGGGTCAGGGCGAAGTACCAGGGCCCCAGGCGTGGCTCTCAGGGTCTCAG 63 | GCCCCGAAGGCGGTGTATGGATTGGGGAGTCCCAGCCTTGGGGATTCCCCAACTCCGCAG 64 | TTTCTTTTCTCCCTCTCCCAACCTATGTAGGGTCCTTCTTCCTGGATACTCACGACGCGG 65 | ACCCAGTTCTCACTCCCATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCG 66 | GTCGCGGTTCTAAAGTCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGG 67 | ATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGGCCCTGACC 68 | CAGACCTGGGCGGGTGAGTGCGGGGTCGTGGGGAAACCGCCTCTGCGGGGAGAAGCAAGG 69 | GGCCCGCCCGGCGGGGGCGCAGGACCCGGGTAGCCGCGCCGGGAGGAGGGTCGGGCGGAT 70 | CTCAGCCACTCCTCGCCCCCAGGCTCCCACTCCATGAGGTATTTCACCACATCCGTGTCC 71 | CGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTC 72 | GTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAG 73 | CAGGAGAGGCCTGAGTATTGGGACCAGGAGACACGGAATGTGAAGGCCCACTCACAGATT 74 | GACCGAGTGGACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTGAGTGA 75 | CCCCAGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCAGGTCACCCAC 76 | AGTCTCCGGGTCCGAGATCCACCCCGAAGCCGCGGGACCCCGAGACCCTTGCCCCGGGAG 77 | AGGCCCAGGCGCCTTTACCCGGTTTCATTTTCAGTTTAGGCCAAAAATCCCCCCGGGTTG 78 | GTCGGGGCCGGACGGGGCTCGGGGGACTGGGCTGACCGTGGGGTCGGGGCCAGGTTCTCA 79 | CACCATCCAGATGATGTATGGCTGCGACGTGGGGTCGGACGGGCGCTTCCTCCGCGGGTA 80 | CCAGCAGGACGCCTACGACGGCAAGGATTACATCGCCTTGAACGAGGACCTGCGCTCTTG 81 | GACCGCGGCGGACATGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGC 82 | GGAGCAGTTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGA 83 | GAACGGGAAGGAGACGCTGCAGCGCACGGGTACCAGGGGCCACGGGGCGCCTCCCTGATC 84 | GCCTGTAGATCTCCCGGGCTGGCCTCCCACAAGGAGGGGAGACAATTGGGACCAACACTA 85 | GAATATCACCCTCCCTCTGGTCCTGAGGGAGAGGAATCCTCCTGGGTTTCCAGATCCTGT 86 | ACCAGAGAGTGACTCTGAGGTTCCGCCCTGCTCTGTGACACAATTAAGGGATAAAATCTC 87 | 
TGAAGGAATGACGGGAAGACGATCCCTCGAATACTGATGAGTGGTTCCCTTTGACACACA 88 | CCGGCAGCAGCCTTGGGCCCGTGACTTTTCCTCTCAGGCCTTGTTCTCTGCTTCACACTC 89 | AATGTGTGTGGGGGTCTGAGTCCAGCACTTCTGAGTCCCTCAGCCTCCACTCAGGTCAGG 90 | ACCAGAAGTCGCTGTTCCCTCTTCAGGGACTAGAATTTTCCACGGAATAGGAGATTATCC 91 | CAGGTGCCTGTGTCCAGGCTGGTGTCTGGGTTCTGTGCTCCCTTCCCCATCCCAGGTGTC 92 | CTGTCCATTCTCAAGATAGCCACATGTGTGCTGGAGGAGTGTCCCATTACAGATGCAAAA 93 | TGCCTGAATGTTCTGACTCTTCCTGACAGACCCCCCCAAGACGCATATGACTCACCACGC 94 | TGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGAT 95 | CACACTGACCTGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGAC 96 | CAGGCCTGCAGGGGATGGAACCTTCCAGAAGTGGGCGTCTGTGGTGGTGCCTTCTGGACA 97 | GGAGCAGAGATACACCTGCCATGTGCAGCATGAGGGTCTCCCCAAGCCCCTCACCCTGAG 98 | ATGGGGTAAGGAGGGAGATGGGGGTGTCATGTCTTTTAGGGAAAGCAGGAGCCTCTCTGA 99 | CCTTTAGCAGGGTCAGGGCCCCTCACCTTCCCCTCTTTTCCCAGAGCCGTCTTCCCAGCC 100 | CACCATCCCCATCGTGGGCATCATTGCTGGCCTAGTTCTCTTTGGAGCTGTGTTCGCTGG 101 | AGCTGTGGTCGCTGCTGTGAGGTGGAGGAGGAAGAGCTCAGGTGGGGTGAAGGGGTGAAG 102 | GGTGGGTCTGAGATTTCTTGTCTCACTGAGGGTTCCAAGACCCAGGTAGAAGTGTGCCCT 103 | GCCTCGTTACTGGGAAGCACCATCCACAATTATGGGCCTACCCAGCCTGGGCCCTGTGTG 104 | CCAGCACTTACTCTTTTGTAAAGCACCTGTTAAAATGAAGGACAGATTTATCACCTTGAT 105 | TATGGCGGTGATGGGACCTGATCCCAGCAGTCACAAGTCACAGGGGAAGGTCCCTGAGGA 106 | CCTTCAGGAGGGCGGTTGGTCCAGGACCCACACCTGCTTTCTTCATGTTTCCTGATCCCG 107 | CCCTGGGTCTGCAGTCACACATTTCTGGAAACTTCTCTGAGGTCCAAGACTTGGAGGTTC 108 | CTCTAGGACCTTAAGGCCCTGGCTCCTTTCTGGTATCTCACAGGACATTTTCTTCCCACA 109 | GATAGAAAAGGAGGGAGCTACTCTCAGGCTGCAAGTAAGTATGAAGGAGGATGATCCAAG 110 | AAATCACTGGGATATTGTGTTTGGGAGCCCGTGGGGGAGCTCACCCACCCCACAATTCCT 111 | CCTCTAGCCACATCTTCTGTGGGATCTGACCAGGTTCTGTTTTTGTCCTACCCCAGGCAG 112 | TGACAGTGCCCAGGGCTCTGATATGTCTCTCACAGCTTGTAAAGGTGAGAGCCTGGAGGG 113 | CCTGATGTGTGTTGGGTGTTGGGCGGAACAGTGGACGCAGCTGTGCTATGGGGTTTCTTT 114 | GCATTGGATGTATTGAGCATGCGATGGGCTGTTTAAAGTGTGACTCCTCACTGTGACAGA 115 | TACGAATTTGTTCATGAATATTTTTTTCTATAGTGTGAGACAGCTGCCTTGTGTGGGACT 116 | GAGAGGCAAGATTTGTTCCTGCCCTTCCCTTTGTGACTTGAAGTACCCTGACTTTGTTTC 117 | 
TGCAAAGGCACCTGCATGTGTCTGTGTTCTTGTAGGCATAATGTGAGGAGGTGGGGAGAC 118 | CACCCCACCCCCATGTCCACCATGACCCTCTTCCCACGCTGACCTGTGCTCCCTCCCCAA 119 | TCATCTTTCCTGTTCCAGAGAGGTGGGGCTGAGGTGTCTCCATCTCTGCCTCAACTTCAT 120 | GGTGCACTGAGCTGTAACTTCTTCCTTCCCTATTAAAA 121 | >HLA:HLA00344 B*51:01:01:01 4085 bp 122 | GATCAGGACGAAGTCCCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGAGCCT 123 | TGTCTGCATTGGGGAGGCGCAGCGTTGGGGATTCCCCACTCCCACGAGTTTCACTTCTTC 124 | TCCCAACCTATGTCGGGTCCTTCTTCCAGGATACTCGTGACGCGTCCCCATTTCCCACTC 125 | CCATTGGGTGTCGGATATCTAGAGAAGCCAATCAGTGTCGCCGGGGTCCCAGTTCTAAAG 126 | TCCCCACGCACCCACCCGGACTCAGAATCTCCTCAGACGCCGAGATGCGGGTCACGGCGC 127 | CCCGAACCGTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACCGAGACCTGGGCCGGTG 128 | AGTGCGGGGTCGGGAGGGAAATGGCCTCTGTGGGGAGGAGCGAGGGGACCGCAGGCGGGG 129 | GCGCAGGACCTGAGGAGCCGCGCCGGGAGGAGGGTCGGGCGGGTCTCAGCCCCTCCTCGC 130 | CCCCAGGCTCCCACTCCATGAGGTATTTCTACACCGCCATGTCCCGGCCCGGCCGCGGGG 131 | AGCCCCGCTTCATTGCAGTGGGCTACGTGGACGACACCCAGTTCGTGAGGTTCGACAGCG 132 | ACGCCGCGAGTCCGAGGACGGAGCCCCGGGCGCCATGGATAGAGCAGGAGGGGCCGGAGT 133 | ATTGGGACCGGAACACACAGATCTTCAAGACCAACACACAGACTTACCGAGAGAACCTGC 134 | GGATCGCGCTCCGCTACTACAACCAGAGCGAGGCCGGTGAGTGACCCCGGCCCGGGGCGC 135 | AGGTCACGACTCCCCATCCCCCACGTACGGCCCGGGTCGCCCCGAGTCTCCGGGTCCGAG 136 | ATCCGCCTCCCTGAGGCCGCGGGACCCGCCCAGACCCTCGACCGGCGAGAGCCCCAGGCG 137 | CGTTTACCCGGTTTCATTTTCAGTTGAGGCCAAAATCCCCGCGGGTTGGTCGGGGCGGGG 138 | CGGGGCTCGGGGGACGGTGCTGACCGCGGGGCCGGGGCCAGGGTCTCACACTTGGCAGAC 139 | GATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATAACCAGTACGC 140 | CTACGACGGCAAAGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCGGCGGA 141 | CACCGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGCTGAG 142 | AGCCTACCTGGAGGGCCTGTGCGTGGAGTGGCTCCGCAGACACCTGGAGAACGGGAAGGA 143 | GACGCTGCAGCGCGCGGGTACCAGGGGCAGTGGGGAGCCTTCCCCATCTCCTATAGGTCG 144 | CCGGGGATGGCCTCCCACGAGAAGAGGAGGAAAATGGGATCAGCGCTAGAATGTCGCCCT 145 | CCCTTGAATGGAGAATGGCATGAGTTTTCCTGAGTTTCCTCTGAGGGCCCCCTCTTCTCT 146 | CTAGGACAATTAAGGGATGACGTCTCTGAGGAAATGGAGGGGAAGACAGTCCCTAGAATA 147 | 
CTGATCAGGGGTCCCCTTTGACCCCTGCAGCAGCCTTGGGAACCGTGACTTTTCCTCTCA 148 | GGCCTTGTTCTCTGCCTCACACTCAGTGTGTTTGGGGCTCTGATTCCAGCACTTCTGAGT 149 | CACTTTACCTCCACTCAGATCAGGAGCAGAAGTCCCTGTTCCCCGCTCAGAGACTCGAAC 150 | TTTCCAATGAATAGGAGATTATCCCAGGTGCCTGCGTCCAGGCTGGTGTCTGGGTTCTGT 151 | GCCCCTTCCCCACACCAGGTGTCCTGTCCATTCTCAGGCTGGTCACATGGGTGGTCCTAG 152 | GGTGTCCCATGAGAGATGCAAAGCGCCTGAATTTTCTGACTCTTCCCATCAGACCCCCCA 153 | AAGACACACGTGACCCACCACCCCGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCC 154 | CTGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACT 155 | CAGGACACTGAGCTTGTGGAGACCAGACCAGCAGGAGATAGAACCTTCCAGAAGTGGGCA 156 | GCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACATGCCATGTACAGCATGAGGGG 157 | CTGCCGAAGCCCCTCACCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTCATATCTCTTC 158 | TCAGGGAAAGCAGGAGCCCTTCTGGAGCCCTTCAGCAGGGTCAGGGCCCCTCGTCTTCCC 159 | CTCCTTTCCCAGAGCCATCTTCCCAGTCCACCATCCCCATCGTGGGCATTGTTGCTGGCC 160 | TGGCTGTCCTAGCAGTTGTGGTCATCGGAGCTGTGGTCGCTACTGTGATGTGTAGGAGGA 161 | AGAGCTCAGGTAGGGAAGGGGTGAGGGGTGGGGTCTGGGTTTTCTTGTCCCACTGGGGGT 162 | TTCAAGCCCCAGGTAGAAGTGTTCCCTGCCTCATTACTGGGAAGCAGCATCCACACAGGG 163 | GCTAACGCAGCCTGGGACCCTGTGTGCCAGCACTTACTCTTTTGTGCAGCACATGTGACA 164 | ATGAAGGACGGATGTATCACCTTGATGGTTGTGGTGTTGGGGTCCTGATTTCAGCATTCA 165 | TGAGTCAGGGGAAGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGGTCCAGGACCC 166 | ACACTTGCTTTCCTCGTGTTTCCTGATCCTGCCTTGGGTCTGTAGTCATACTTCTGGAAA 167 | TTCCTTTTGGGTCCAAGACGAGGAGGTTCCTCTAAGATCTCATGGCCCTGCTTCCTCCCA 168 | GTCCCCTCACAGGACATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTACTCTCAGGCTG 169 | CGTGTAAGTGGTGGGGGTGGGAGTGTGGAGGAGCTCACCCACCCCATAATTCCTCCTGTC 170 | CCACGTCTCCTGCGGGCTCTGACCAGGTCCTGTTTTTGTTCTACTCCAGCCAGCGACAGT 171 | GCCCAGGGCTCTGATGTGTCTCTCACAGCTTGAAAAGGTGAGATTCTTGGGGTCTAGAGT 172 | GGGCGGGGGGGGCGGGGAGGGGGCAGAGGGGAAAGGCCTGGGTAATGGAGATTCTTTGAT 173 | TGGGATGTTTCGCGTGTGTCGTGGGCTGTTCAGAGTGTCATCACTTACCATGACTAACCA 174 | GAATTTGTTCATGACTGTTGTTTTCTGTAGCCTGAGACAGCTGTCTTGTGAGGGACTGAG 175 | ATGCAGGATTTCTTCACTCCTCCCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTG 176 | 
CAAAGGCACCTGAATGTGTCTGCGTTCCTGTTAGCATAATGTGAGGAGGTGGAGAGACAG 177 | CCCACCCTTGTGTCCACTGTGACCCCTGTTCCCATGCTGACCTGTGTTTCCTCCCCAGTC 178 | ATCTTTCTTGTTCCAGAGAGGTGGGGCTGGATGTCTCCATCTCTGTCTCAACTTTATGTG 179 | CACTGAGCTGCAACTTCTTACTTCCCTACTGAAAATAAGAATCTGAATATACATTTGTTT 180 | TCTCAAATATTTGCTATGAGAGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAATGT 181 | GAGAGAGCAAATAAAGACCTGAGAACCTTCCAGAATCTGCATGTTCGCTGTGCTGAGTCT 182 | ATTGCAGGTGGGGTGTGGAGAAGGCTGTGGGGGGCCGAGTGTGGACAGGGCCTGTGCCCA 183 | GTTGTTGTTGAGCCCATCATGGGCTTTATGTGGTTAGTCCTCAGCTGGGTCACCTTCACT 184 | GCCCCATTGTCCTTGTCCCTTCAGCGGAAACTTGTCCAGTGGGAGCTGTGACCACAGAGG 185 | CTCACACATCGCCCAGGGTGGCCCCTGCACACGGGGGTCTCTGTGCATTCTGAGACAAAT 186 | TTTCAGAGCCATTCACCTCCTGCCCTGCTTCTAGAGCTCCTTTTCTGCTCTGCTCTCCTG 187 | CCCTCTCTCCCTGCCCTGGTTCTAGTGATCTTGGTGCTGAATCCAATCCCAACTCATGAA 188 | TCTGTAAAGCAGAGTCTAATTTAGAGTTACATTTGTCTGTGAAATTGGACCCATCATCAA 189 | GGACTGTTCTTTCCTGAAGAGAGAACCTGATTGTGTGCTGCAGTGTGCTGGGGCAGGGGG 190 | TGCGG 191 | >HLA:HLA00225 B*27:05:02:01 4083 bp 192 | GATCAGGACGAAGTCCCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGAGCCT 193 | TGTCTGCATTGGGGAGGCGCAGCATTGGGGATTCCCCACTCCCACGAGTTTCACTTCTTC 194 | TCCCAACCTATGTCGGGTCCTTCTTCCAGGATACTCGTGACGCGTCCCCATTTCCCACTC 195 | CCATTGGGTGTCGGGTGTCTAGAGAAGCCAATCAGTGTCGCCGGGGTCCCAGTTCTAAAG 196 | TCCCCACGCACCCACCCGGACTCAGAATCTCCTCAGACGCCGAGATGCGGGTCACGGCGC 197 | CCCGAACCCTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACCGAGACCTGGGCTGGTG 198 | AGTGCGGGGTCGGCAGGGAAATGGCCTCTGTGGGGAGGAGCGAGGGGACCGCAGGCGGGG 199 | GCGCAGGACCCGGGGAGCCGCGCCGGGAGGAGGGTCGGGCGGGTCTCAGCCCCTCCTCGC 200 | CCCCAGGCTCCCACTCCATGAGGTATTTCCACACCTCCGTGTCCCGGCCCGGCCGCGGGG 201 | AGCCCCGCTTCATCACCGTGGGCTACGTGGACGACACGCTGTTCGTGAGGTTCGACAGCG 202 | ACGCCGCGAGTCCGAGAGAGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGGCCGGAGT 203 | ATTGGGACCGGGAGACACAGATCTGCAAGGCCAAGGCACAGACTGACCGAGAGGACCTGC 204 | GGACCCTGCTCCGCTACTACAACCAGAGCGAGGCCGGTGAGTGACCCCGGCCCGGGGCGC 205 | AGGTCACGACTCCCCATCCCCCACGTACGGCCCGGGTCGCCCCGAGTCTCCGGGTCCGAG 206 | ATCCGCCCCCGAGGCCGCGGGACCCGCCCAGACCCTCGACCGGCGAGAGCCCCAGGCGCG 207 | 
TTTACCCGGTTTCATTTTCAGTTGAGGCCAAAATCCCCGCGGGTTGGTCGGGGCGGGGCG 208 | GGGCTCGGGGGGACGGGGCTGACCGCGGGGGCGGGGCCAGGGTCTCACACCCTCCAGAAT 209 | ATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGTACCACCAGGACGCC 210 | TACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCCGCGGAC 211 | ACGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGCTGAGA 212 | GCCTACCTGGAGGGCGAGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAG 213 | ACGCTGCAGCGCGCGGGTACCAGGGGCAGTGGGGAGCCTTCCCCATCTCCTATAGGTCGC 214 | CGGGGATGGCCTCCCACGAGAAGAGGAGGAAAATGGGATCAGCGCTAGAATGTCGCCCTC 215 | CCTTGAATGGAGAATGGCATGAGTTTTCCTGAGTTTCCTCTGAGGGCCCCCTCTTCTCTC 216 | TAGGACAATTAAGGGATGACGTCTCTGAGGAAATGGAGGGGAAGACAGTCCCTAGAATAC 217 | TGATCAGGGGTCCCCTTTGACCCCTGCAGCAGCCTTGGGAACCGTGACTTTTCCTCTCAG 218 | GCCTTGTTCTCTGCCTCACACTCAGTGTGTTTGGGGCTCTGATTCCAGCACTTCTGAGTC 219 | ACTTTACCTCCACTCAGATCAGGAGCAGAAGTCCCTGTTCCCCGCTCAGAGACTCGAACT 220 | TTCCAATGAATAGGAGATTATCCCAGGTGCCTGCGTCCAGGCTGGTGTCTGGGTTCTGTG 221 | CCCCTTCCCCACCCCAGGTGTCCTGTCCATTCTCAGGCTGGTCACATGGGTGGTCCTAGG 222 | GTGTCCCATGAGAGATGCAAAGCGCCTGAATTTTCTGACTCTTCCCATCAGACCCCCCAA 223 | AGACACACGTGACCCACCACCCCATCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCC 224 | TGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTC 225 | AGGACACTGAGCTTGTGGAGACCAGACCAGCAGGAGATAGAACCTTCCAGAAGTGGGCAG 226 | CTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACATGCCATGTACAGCATGAGGGGC 227 | TGCCGAAGCCCCTCACCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTCATATCTCTTCT 228 | CAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCTTCCTTTCCCAG 229 | AGCCGTCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAG 230 | CAGTTGTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGGTA 231 | GGGAAGGGGTGAGGGGTGGGGTCTGAGTTTTCTTGTCCCACTGGGGGTTTCAAGCCCCAG 232 | GTAGAAGTGTTCCCTGCCTCATTACTGGGAAGCAGCATCCACACAGGGGCTAACGCAGCC 233 | TGGGACCCTGTGTGCCAGCACTTACTCTTTTGTGCAGCACATGTGACAATGAAGGACGGA 234 | TGTATCACCTTGGTGGTTGTGGTGTTGGGGTCCTGATTCCAGCATTCATGAGTCAGGGGA 235 | AGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGGTCCAGGACCCACACTTGCTTTC 236 | 
CTCGTGTTTCCTGATCCTGCCTTGGGTCTGTAGTCATACTTCTGGAAATTCCTTTTGGGT 237 | CCAAGACGAGGAGGTTCCTCTAAGATCTCATGGCCCTGCTTCCTCCCAGTCCCCTCACAG 238 | GGCATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTACTCTCAGGCTGCGTGTAAGTGAT 239 | GGGGGTGGGAGTGTGGAGGAGCTCACCCACCCCCTAATTCCTCCTGTCCCACGTCTCCTG 240 | CGGGCTCTGACCAGGTCCTGTTTTTGTTCTACTCCAGGCAGCGACAGTGCCCAGGGCTCT 241 | GATGTGTCTCTCACAGCTTGAAAAGGTGAGATTCTTGGGGTCTAGAGTGGGTGGGGTGGC 242 | AGGTCTGGGGGTGGGTGGGGCAGTGGGGAAAGGCCTGGGTAATGGAGATTCTTTGATTGG 243 | GATGTTTCGCGTGTGTGGTGGGCTGTTTAGACTGTCATCACTTACCATGACTAACCAGAA 244 | TTTGTTCATGACTGTTGTTTTCTGTAGCCTGAGACAGCTGTCTTGTGAGGGACTGAGATG 245 | CAGGATTTCTTCACGCCTCCCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTGCAA 246 | AGGCACCTGAATGTGTCTGCGTCCCTGTTAGCATAATGTGAGGAGGTGGAGAGACCAGCC 247 | CACCCCCGTGTCCACTGTGACCCCTGTTCCCATGCTGACCTGTGTTTCCTCCCCAGTCAT 248 | CTTTCCTGTTCCAGAGAGGTGGGGCTGGATGTCTCCATCTCTGTCTCAACTTTATGTGCA 249 | CTGAGCTGCAACTTCTTACTTCCCTACTGAAAATAAGAATCTGAATATAAATTTGTTTTC 250 | TCAAATATTTGCTATGAGAGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAATTTGA 251 | GAGAGCAAATAAAGACCTGAGAACCTTCCAGAATCTGCATGTTCGCTGTGCTGAGTCTGT 252 | TGCAGGTGGGGTGTGGAGAAGGCTGTGGGGGGCCGAGTGTGGACGGGGCCTGTGCCCATT 253 | TGGTGTTGAGTCCATCATGGGCTTTATGTGGTTAGTCCTCAGCTGGGTCACCTTCACTGC 254 | TCCATTGTCCTTGTCCCTTCAGTGGAAACTTGTCCAGCGGGAGCTGTGACCACAGAGGCT 255 | CACACATCGCCCTGGGCGGCCCCTGCACACGGGGGTCTCTGTGCATTCTGAGACAAATTT 256 | TCAGAGCCATTCACCTCTTGCCCTGCTTCTAGAGCTCCTTTTCTGCTCTGCTCTCCTGCC 257 | CTCTCTCCCTGCCCTGGTTCTAGTGATCTTGGTGCTGAATCCAATCCCAACTCATGAATC 258 | TGTAAAGCAGAGTCTAATTTAGACTTACATTTGTCTGTGAAATTGGACCCGTCATCAAGG 259 | ACTGTTCTTTCCTGAAGAGAGAACCTGATTGTGTGCTGCAGTGTGCTGGGGCAGGGGGTG 260 | CGG 261 | >HLA:HLA00405 C*02:02:02:01 4295 bp 262 | TTATTTTGCTGGATGTAGTTTAATATTACCTGAGGTAAGGTAAGGCAAAGAGTGGGAGGC 263 | AGGGAGTCCAGTTCAGGGACGGGGATTCCAGGAGAAGTGAAGGGGAAGGGGCTGGGCGCA 264 | GCCTGGGGGTCTCTCCCTGGTTTCCACAGACAGATCCTTGGCCAGGACTCAGGCACACAG 265 | TGTGACAAAGATGCTTGGTGTAGGAGAAGAGGGATCAGGACGAAGTCCCAGGTCCCGGGC 266 | GGGGCTCTCAGGGTCTCAGGCTCCAAGGGCCGTGTCTGCACTGGGGAGGCGCCGCGTTGA 267 | 
GGATTCTCCACTCCCCTGAGTTTCACTTCTTCTCCCAACCTGCGACGGGTCCTTCTTCCT 268 | GAATACTCATGACGCGTCCCCAATTCCCACTCCCATTGGGTGTCGGGTTCTAGAGAAGCC 269 | AATCAGCGTCTCCGCAGTCCCGGTTCTAAAGTCCCCAGTCACCCACCCGGACTCGGATTC 270 | TCCCCAGACGCCGAGATGCGGGTCATGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGGA 271 | GCCCTGGCCCTGACCGAGACCTGGGCCTGTGAGTGCGGGGTTGGGAGGGAAACGGCCTCT 272 | GCGGAGAGGAGCGAGGGGCCCGCCCGGCGAGGGCGCAGGACCCGGGGAGCCGCGCAGGGA 273 | GGAGGGTCGGGCGGGTCTCAGCCCCTCCTCTCCCCCAGGCTCCCACTCCATGAGGTATTT 274 | CTACACCGCTGTGTCCCGGCCCAGCCGCGGAGAGCCCCACTTCATCGCAGTGGGCTACGT 275 | GGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGTCCAAGAGGGGAGCCGCG 276 | GGCGCCGTGGGTGGAGCAGGAGGGGCCGGAGTATTGGGACCGGGAGACACAGAAGTACAA 277 | GCGCCAGGCACAGACTGACCGAGTGAACCTGCGGAAACTGCGCGGCTACTACAACCAGAG 278 | CGAGGCCGGTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCCCTCCCCATCCCCCACG 279 | GACGGCCCGGGTCGCCCCGAGTCTCCGGGTCTGAGATCCACCCCGAGGCTGCGGAACCCG 280 | CCCAGACCCTCGACCGGAGAGAGCCCCAGTCACCTTTACCCGGTTTCATTTTCAGTTTAG 281 | GCCAAAATCCCCGCGGGTTGGTCGGGGCTGGGGCGGGGCTCGGGGGACGGGCTGACCACG 282 | GGGGCGGGGCCAGGGTCTCACACCCTCCAGAGGATGTACGGCTGCGACCTGGGGCCCGAC 283 | GGGCGCCTCCTCCGCGGGTATGACCAGTCCGCCTACGACGGCAAGGATTACATCGCCCTG 284 | AACGAGGACCTGCGCTCCTGGACCGCCGCGGACACAGCGGCTCAGATCACCCAGCGCAAG 285 | TGGGAGGCGGCCCGTGAGGCGGAGCAGTGGAGAGCCTACCTGGAGGGCGAGTGCGTGGAG 286 | TGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCGCGGGTACCAGGGGC 287 | AGTGGGGAGCCTTCCCCATCTCCTGTAGATCTCCCGGGATGGCCTCCCACGAGGAGGGGA 288 | GGAAAATGGGATCAGCGCTAGAATATCGCCCTCCCTTGAATGGAGAATGGGATGAGTTTT 289 | CCTGAGTTTCCTCTGAGGGCCCCCTCTGCTCTCTAGGACAATTAAGGGATGAAGTCCTTG 290 | AGGAAATGGAGGGGAAGACAGTCCCTGGAATACTGATCAGGGGTCCCCTTTGACCACTTT 291 | GACCACTGCAGCAGCTGTGGTCAGGCTGCTGACCTTTCTCTCAGGCCTTGTTCTCTGCCT 292 | CACGCTCAATGTGTTTGAAGGTTTGATTCCAGCTTTTCTGAGTCCTTCGGCCTCCACTCA 293 | GGTCAGGACCAGAAGTCGCTGTTCCTCCCTCAGAGACTAGAACTTTCCAATGAATAGGAG 294 | ATTATCCCAGGTGCCTGTGTCCAGGCTGGCGTCTGGGTTCTGTGCCCCCTTCCCCACCCC 295 | AGGTGTCCTGTCCATTCTCAGGATAGTCACATGGGCGCTGTTGGAGTGTCGCAAGAGAGA 296 | 
TACAAAGTGTCTGAATTTTCTGACTCTTCCCGTCAGAACACCCAAAGACACACGTGACCC 297 | ACCATCCCGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTA 298 | CGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTCAGGACACCGAGCTTG 299 | TGGAGACCAGGCCAGCAGGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTT 300 | CTGGAGAAGAGCAGAGATACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCA 301 | CCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAG 302 | TCCTGGAGCCCTTCAGCTGGGTCAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCC 303 | CCTCCTTTCCCAGAGCCATCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGGC 304 | CTGGCTGTCCTGGCTGTCCTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAGG 305 | AGGAAGAGCTCAGGTAGGGAAGGGGTGAGGAGTGGGGTCTGGGTTTTCTTGTCCCACTGG 306 | GAGTTTCAAGCCCCAGGTAGAAGTGTGCCCCACCTCGTTACTGGAAGCACCATCCACACA 307 | TGGGCCATCCCAGCCTGGGACCCTGTGTGCCAGCACTTACTCTGTTGTGAAGCACATGAC 308 | AATGAAGGACAGATGTATCACCTTGATGATTATGGTGTTGGGGTCCTTGATTCCAGCATT 309 | CATGAGTCAGGGGAAGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGCTCCAGAAC 310 | CCACAGCTGCTTTCCCCGTGTTTCCTGATCCTGCCCTGGGTCTGCAGTCATAGTTCTGGA 311 | AACTTCTCTTGGGTCCAAGACTAGGAGGTTCCCCTAAGATTGCATGGCCCTGCCTCCTCC 312 | CTGTCCCCTCACAGGGCATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTGCTCTCAGGC 313 | TGCGTGTAAGTGATGGCGGCAGGCGTGTGGAGGAGCTCACCCACCCCATAATTCCTCTTG 314 | TCCCACATCTCCTGCGGGCTCTGACCAGGTCTTTTTTTTTGTTCTACCCCAGCCAGCAAC 315 | AGTGCCCAGGGCTCTGATGAGTCTCTCATCGCTTGTAAAGGTGAGATTCTGGGAGCTGAA 316 | GTGGTCGGGGGTGGGGCAGAGGGAAAAGGCCTAGGTAATGGGGATCCTTTGATAGGGACG 317 | TTTCGAATGTGTGGTGAGCTGTTCAGAGTGTCATCACTTACCATGACTGACCTGAATTTG 318 | TTCATGACTATTGTGTTCTGTAGCCTGAGACAGCTGCCTGTGTGGGACTGAGATGCAGGA 319 | TTTCTTCACACCTCTCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTACAAAGGCA 320 | TCTGAATGTGTCTGCGTTCCTGTTAGCATAATGTGAGGAGGTGGAGAGACAGCCCACCCC 321 | CGTGTCCACCGTGACCCCTGTCCCCACACTGACCTGTGTTCCCTCCCCGATCATCTTTCC 322 | TGTTCCAGAGAAGTGGGCTGGATGTCTCCATCTCTGTCTCAACTTTACGTGTACTGAGCT 323 | GCAACTTCTTCCCTACTGAAAATAAGAATCTGAATATAAATTTGTTTTCTCAAATATTTG 324 | CTATGAGAGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAAGTTGAGAGAGCAAATA 325 | 
AAGACCTGAGAACCTTCCAGAATCCGCATGTTCGCTGTGCTGAGTCTGTTGCAGGTGGGG 326 | GTGGGGAAGGCTGTGAGGAGACGAGTGTGGACGGGGCCTGTGCCTAGTTGCTGTTCAGTT 327 | CTTCATGGGCTTTATGTGGTCAGTCCTCAGCTGGGTCACCTTCACTGCTCCATTGTCCTT 328 | GTCCCTTCAGTGGAAACTTGTCCAGCGGGAGCTGTGACCACAGAGGCTCACACATCGCCC 329 | AGGGCAGCCCCTGCACACGGGAGTCCCTGTGCTTTCTGAGACAAATTTTCAGACCCATTC 330 | AGCTCCTGCCCTCCTTCTAGGGCTCCTCTTCTGCTTTGGTCTCCTGCCCTCTCTCCCTTC 331 | CCTGATTCCAGTAATCTTCGTGCTGACTCCAATCCCAACTCATGAATCTAAAGCAGAGCC 332 | TAATTTAGATTTATATTTGTTTGTAAAATTGGGTCCATAGTCTAGAATTGTTCCTTCCTG 333 | AAGAGAAACCTGATTGTGTGCTGCAGTGTGCAGGG 334 | >HLA:HLA00462 C*14:02:01:01 4304 bp 335 | TTATTTTGCTGGATGTAGTTTAATATTACCTGAGGTAAGGTAAGGCAAAGAGTGGGAGGC 336 | AGGGAGTCCAGTTCAGGGACGGGGATTCCAGGAGAAGTGAAGGGGAAGGGGCTGGGCGCA 337 | GCCTGGGGGTCTCTCCCTGGTTTCCACAGACAGATCCTTGGCCAGGACTCAGGCACACAG 338 | TGTGACAAAGATGCTTGGTGTAGGAGAAGAGGGATCAGGACGAAGTCCCAGGTCCCGGGC 339 | GGGGCTCTCAGGGTCTCAGGCTCCAAGGGCCGTGTCTGCACTGGGGAGGCGCCGCGTTGA 340 | GGATTCTCCACTCCCCTGAGTTTCACTTCTTCTCCCAACCTGCGTCGGGTCCTTCTTCCT 341 | GAATACTCATGACGCGTCCCCAATTCCCACTCCCATTGGGTGTCGGGTTCTAGAGAAGCC 342 | AATCAGCGTCTCCGCAGTCCCGGTTCTAAAGTCCCCAGTCACCCACCCGGACTCAGATTC 343 | TCCCCAGACGCCGAGATGCGGGTCATGGCGCCCCGAACCCTCATCCTGCTGCTCTCGGGA 344 | GCCCTGGCCCTGACCGAGACCTGGGCCTGTGAGTGCGGGGTTAGGAGGGAAACGGCCTCT 345 | GCGGAGAGGAGCGAGGGGCCCGCCCGGCGAGGGCGCAGGACCCGGGGAGCCGCGCAGGGA 346 | GGAGGGTCGGGCGGGTCTCAGCCACTCCTCGTCCCCAGGCTCCCACTCCATGAGGTATTT 347 | CTCCACATCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGT 348 | GGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGTCCGAGAGGGGAGCCGCG 349 | GGCGCCGTGGGTGGAGCAGGAGGGGCCGGAGTATTGGGACCGGGAGACACAGAAGTACAA 350 | GCGCCAGGCACAGACTGACCGAGTGAGCCTGCGGAACCTGCGCGGCTACTACAACCAGAG 351 | CGAGGCCGGTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCCCTCCCCATCCCCCACG 352 | GACGGCCCGGGTCGCCCCGAGTCTCCCCGTCTGAGATCCACCCCGAGGCTGCGGAACCCG 353 | CCCAGACCCTCGACCGGAGAGAGCCCCAGTCACCTTTACCCGGTTTCATTTTCAGTTTAG 354 | GCCAAAATCCCCGCGGGTTGGTCGGGACTGGGGCGGGGCTCGGGGGACGGGGCTGACCAC 355 | 
GGGGGCGGGGCCAGGGTCTCACACCCTCCAGTGGATGTTTGGCTGCGACCTGGGGCCCGA 356 | CGGGCGCCTCCTCCGCGGGTATGACCAGTCCGCCTACGACGGCAAGGATTACATCGCCCT 357 | GAACGAGGATCTGCGCTCCTGGACCGCCGCGGACACGGCGGCTCAGATCACCCAGCGCAA 358 | GTGGGAGGCGGCCCGTGAGGCGGAGCAGCGGAGAGCCTACCTGGAGGGCACGTGCGTGGA 359 | GTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCGCGGGTACCAGGGG 360 | CAGTGGGGAGCCTTCCCCATCTCCCGTAGATCTCCCGGGATGGCCTCCCACGAGGAGGGG 361 | AGGAAAATGGGATCAGCGCTAGAATATCGCCCTCCCTTGAATGGAGAATGGGATGAGTTT 362 | TCCTGAGTTTCCTCTGAGGGCCCCCTCTGCTCTCTAGGACAATTAAGGGATGAAGTCCTT 363 | GAGGAAATGGAGGGGAAGACAGTCCCTGGAATACTGATCAGGGGTCCCCTTTGACCACTT 364 | TGACCACTGCAGCAGCTGTGGTCAGGCTGCTGACCTTTCTCTCAGGCCTTGTTCTCTGCC 365 | TCACGCTCAATGTGTTTGAAGGTTTGATTCCAGCTTTTCTGAGTCCTTCGGCCTCCACTC 366 | AGGTCAGGACCAGAAGTCGCTGTTCCTCCCTCAGAGACTAGAACTTTCCAATGAATAGGA 367 | GATTATCCCAGGTGCCTGTGTCCAGGCTGGCGTCTGGGTTCTGTGCCCCCTTCCCCACCC 368 | CAGGTGTCCTGTCCATTCTCAGGATGGTCACATGGGCGCTGTTGGAGTGTCGCAAGAGAG 369 | ATACAAAGTGTCTGAATTTTCTGACTCTTCCCGTCAGAACACCCAAAGACACACGTGACC 370 | CACCATCCCGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCT 371 | GCGGAGATCACACTGACCTGGCAGTGGGATGGGGAGGACCAAACTCAGGACACCGAGCTT 372 | GTGGAGACCAGGCCAGCAGGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCT 373 | TCTGGAGAAGAGCAGAGATACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTC 374 | ACCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTGATGTGTCTTCTCAGGGAAAGCAGAA 375 | GTCCTGGAGCCCTTCAGCCGGGTCAGGGCTGAGGCTTGGAGGTCAGGGCCCCTCACCTTC 376 | CCCTCCTTTCCCAGAGCCGTCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGG 377 | CCTGGCTGTCCTGGCTGTCCTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAG 378 | GAGGAAGAGCTCAGGTAGGGAAGGGGTGAGGAGTGGGGTCTGGGTTTTCTTGTTCCACTG 379 | GGAGTTTCAAGCCCCAGGTAGAAGTGTGCCCCACCTCGTTACTGGAAGCACCATCCACAC 380 | ATGGGCCATCCCAGCCTGGGACCCTGTGTGCCAGCACTTACTCTGTTGTGAAGCACATGA 381 | CAATGAAGGACAGATGTATCACCTTGATGATTATGGTGTTGGGGTCCTTGATTCCAGCAT 382 | TCATGAGTCAGGGGAAGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGCTTCAGGA 383 | CCCACAGCTGCTTTCCCCGTGTTTCCTGATCCTGCCCTGGGTCTGCAGTCATAGTTCTGG 384 | 
AAACTTCTCTTGGGTCCAAGACTAGGAGGTTCCCCTAAGATCGCATGGCCCTGCCTCCTC 385 | CCTGTCCCCTCACAGGGCATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTGCTCTCAGG 386 | CTGCGTGTAAGTGATGGCGGTGGGCGTGTGGAGGAGCTCACCCACCCCATAATTCCTCTT 387 | GTCCCACATCTCCTGCGGGCTCTGACCAGGTCTTTTTTTTTGTTCTACCCCAGCCAGCAA 388 | CAGTGCCCAGGGCTCTGATGAGTCTCTCATCGCTTGTAAAGGTGAGATTCTGGGGAGCTG 389 | AAGTGGTCGGGGGTGGGGCAGAGGGAAAAGGCCTGGGTAATGGGGATCCTTTGATTGGGA 390 | CGTTTCGAATGTGTGGTGAGCTGTTCAGAGTGTCATCACTTACCATGACTGACCTGAATT 391 | TGTTCATGACTATTGTGTTCTGTAGCCTGAGACAGCTGCCTGTGTGGGACTGAGATGCAG 392 | GATTTCTTCACACCTCTCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTGCAAAGG 393 | CATCTGAATGTGTCTGCGTTCCTGTTAGCATAATGTGAGGAGGTGGAGAGACAGCCCACC 394 | CCCGTGTCCACCGTGACCCCTGTCCCCACACTGACCTGTGTTCCCTCCCCGATCATCTTT 395 | CCTGTTCCAGAGAAGTGGGCTGGATGTCTCCATCTCTGTCTCAACTTCATGGTGCGCTGA 396 | GCTGCAACTTCTTACTTCCCTAATGAAGTTAAGAAGCTGAATATAAATTTGTTTTCTCAA 397 | ATATTTGCTATGAAGGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAAGTTGAGAGA 398 | GCAAATAAAGACCTGAGAACCTTCCAGAATCCGCATGTTCGCTGTGCTGAGTCTGTTGCA 399 | GGTGGGGGTGGGGAAGGCTGTGAGGAGACGAGTGTGGACGGGGCCTGTGCCTAGTTGCTG 400 | TTCAGTTCTTCATGGGCTCTATGTGGTCAGTCCTCAGCTGGGTCACCTTCACTGCTCCAT 401 | TGTCCTTGTCCCTTCAGTGGAAACTTGTCCAGCGGGAGCTGTGACCACAGAGGCTCACAC 402 | ATCGCCCAGGGCAGCCCCTGCACACGGGAGTCCCTGTGCTTTCTGAGACAAATTTTCAGA 403 | CCCATTCAGCTCCTGCCCTCCTTCTAGGGCTCCTCTTCTGCTTTGGTCTCCTGCCCTCTC 404 | TCCCTTCCCTGATTCCAGTAATCTTCATGCTGACTCCAATCCCAACTCATGAATCTAAAG 405 | CAGAGCCTAATTTAGATTTATATTTGTTTGTAAAATTGGGTCCATAGTCTAGAATTGTTC 406 | CTTCCTGAAGAGAGAAACCTGATTGTGTGCTGCAGTGTGTGGGG -------------------------------------------------------------------------------- /test/fake_db/hla_nuc.fasta: -------------------------------------------------------------------------------- 1 | >HLA:HLA21338 A*02:01:154 1098 bp 2 | ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACC 3 | CAGACCTGGGCGGGCTCTCACTCCATGAGGTATTTCTTCACATCCGTGTCCCGGCCCGGC 4 | CGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 5 | 
GACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGT 6 | CCGGAGTATTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGACTCACCGAGTG 7 | GACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCGTCCAG 8 | AGGATGTATGGCTGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGTACCACCAGTAC 9 | GCCTACGACGGCAAGGATTACATCGCCCTGAAAGAGGACCTGCGCTCTTGGACCGCGGCG 10 | GACATGGCAGCTCAGACCACCAAGCACAAGTGGGAGGCGGCCCATGTGGCGGAGCAGTTG 11 | AGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 12 | GAGACGCTGCAGCGCACGGACGCCCCCAAAACGCATATGACTCACCACGCTGTCTCTGAC 13 | CATGAAGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGATCACACTGACC 14 | TGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGACCAGGCCTGCA 15 | GGGGATGGAACCTTCCAGAAGTGGGCGGCTGTGGTGGTGCCTTCTGGACAGGAGCAGAGA 16 | TACACCTGCCATGTGCAGCATGAGGGTTTGCCCAAGCCCCTCACCCTGAGATGGGAGCCG 17 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCATTGCTGGCCTGGTTCTCTTTGGAGCT 18 | GTGATCACTGGAGCTGTGGTCGCTGCTGTGATGTGGAGGAGGAAGAGCTCAGATAGAAAA 19 | GGAGGGAGCTACTCTCAGGCTGCAAGCAGTGACAGTGCCCAGGGTTCTGATGTGTCTCTC 20 | ACAGCTTGTAAAGTGTGA 21 | >HLA:HLA00097 A*31:01:02:01 1098 bp 22 | ATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGGCCCTGACC 23 | CAGACCTGGGCGGGCTCCCACTCCATGAGGTATTTCACCACATCCGTGTCCCGGCCCGGC 24 | CGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 25 | GACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGAGG 26 | CCTGAGTATTGGGACCAGGAGACACGGAATGTGAAGGCCCACTCACAGATTGACCGAGTG 27 | GACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACCATCCAG 28 | ATGATGTATGGCTGCGACGTGGGGTCGGACGGGCGCTTCCTCCGCGGGTACCAGCAGGAC 29 | GCCTACGACGGCAAGGATTACATCGCCTTGAACGAGGACCTGCGCTCTTGGACCGCGGCG 30 | GACATGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGTTG 31 | AGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 32 | GAGACGCTGCAGCGCACGGACCCCCCCAAGACGCATATGACTCACCACGCTGTCTCTGAC 33 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGATCACACTGACC 34 | TGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGACCAGGCCTGCA 35 | GGGGATGGAACCTTCCAGAAGTGGGCGTCTGTGGTGGTGCCTTCTGGACAGGAGCAGAGA 36 | 
TACACCTGCCATGTGCAGCATGAGGGTCTCCCCAAGCCCCTCACCCTGAGATGGGAGCCG 37 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCATTGCTGGCCTAGTTCTCTTTGGAGCT 38 | GTGTTCGCTGGAGCTGTGGTCGCTGCTGTGAGGTGGAGGAGGAAGAGCTCAGATAGAAAA 39 | GGAGGGAGCTACTCTCAGGCTGCAAGCAGTGACAGTGCCCAGGGCTCTGATATGTCTCTC 40 | ACAGCTTGTAAAGTGTGA 41 | >HLA:HLA00344 B*51:01:01:01 1089 bp 42 | ATGCGGGTCACGGCGCCCCGAACCGTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACC 43 | GAGACCTGGGCCGGCTCCCACTCCATGAGGTATTTCTACACCGCCATGTCCCGGCCCGGC 44 | CGCGGGGAGCCCCGCTTCATTGCAGTGGGCTACGTGGACGACACCCAGTTCGTGAGGTTC 45 | GACAGCGACGCCGCGAGTCCGAGGACGGAGCCCCGGGCGCCATGGATAGAGCAGGAGGGG 46 | CCGGAGTATTGGGACCGGAACACACAGATCTTCAAGACCAACACACAGACTTACCGAGAG 47 | AACCTGCGGATCGCGCTCCGCTACTACAACCAGAGCGAGGCCGGGTCTCACACTTGGCAG 48 | ACGATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATAACCAGTAC 49 | GCCTACGACGGCAAAGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCGGCG 50 | GACACCGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGCTG 51 | AGAGCCTACCTGGAGGGCCTGTGCGTGGAGTGGCTCCGCAGACACCTGGAGAACGGGAAG 52 | GAGACGCTGCAGCGCGCGGACCCCCCAAAGACACACGTGACCCACCACCCCGTCTCTGAC 53 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACC 54 | TGGCAGCGGGATGGCGAGGACCAAACTCAGGACACTGAGCTTGTGGAGACCAGACCAGCA 55 | GGAGATAGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 56 | TACACATGCCATGTACAGCATGAGGGGCTGCCGAAGCCCCTCACCCTGAGATGGGAGCCA 57 | TCTTCCCAGTCCACCATCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTT 58 | GTGGTCATCGGAGCTGTGGTCGCTACTGTGATGTGTAGGAGGAAGAGCTCAGGTGGAAAA 59 | GGAGGGAGCTACTCTCAGGCTGCGTCCAGCGACAGTGCCCAGGGCTCTGATGTGTCTCTC 60 | ACAGCTTGA 61 | >HLA:HLA00225 B*27:05:02:01 1089 bp 62 | ATGCGGGTCACGGCGCCCCGAACCCTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACC 63 | GAGACCTGGGCTGGCTCCCACTCCATGAGGTATTTCCACACCTCCGTGTCCCGGCCCGGC 64 | CGCGGGGAGCCCCGCTTCATCACCGTGGGCTACGTGGACGACACGCTGTTCGTGAGGTTC 65 | GACAGCGACGCCGCGAGTCCGAGAGAGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGG 66 | CCGGAGTATTGGGACCGGGAGACACAGATCTGCAAGGCCAAGGCACAGACTGACCGAGAG 67 | GACCTGCGGACCCTGCTCCGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAG 68 | 
AATATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGTACCACCAGGAC 69 | GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCCGCG 70 | GACACGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGCTG 71 | AGAGCCTACCTGGAGGGCGAGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 72 | GAGACGCTGCAGCGCGCGGACCCCCCAAAGACACACGTGACCCACCACCCCATCTCTGAC 73 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACC 74 | TGGCAGCGGGATGGCGAGGACCAAACTCAGGACACTGAGCTTGTGGAGACCAGACCAGCA 75 | GGAGATAGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 76 | TACACATGCCATGTACAGCATGAGGGGCTGCCGAAGCCCCTCACCCTGAGATGGGAGCCG 77 | TCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTT 78 | GTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGGTGGAAAA 79 | GGAGGGAGCTACTCTCAGGCTGCGTGCAGCGACAGTGCCCAGGGCTCTGATGTGTCTCTC 80 | ACAGCTTGA 81 | >HLA:HLA00405 C*02:02:02:01 1101 bp 82 | ATGCGGGTCATGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGGAGCCCTGGCCCTGACC 83 | GAGACCTGGGCCTGCTCCCACTCCATGAGGTATTTCTACACCGCTGTGTCCCGGCCCAGC 84 | CGCGGAGAGCCCCACTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 85 | GACAGCGACGCCGCGAGTCCAAGAGGGGAGCCGCGGGCGCCGTGGGTGGAGCAGGAGGGG 86 | CCGGAGTATTGGGACCGGGAGACACAGAAGTACAAGCGCCAGGCACAGACTGACCGAGTG 87 | AACCTGCGGAAACTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAG 88 | AGGATGTACGGCTGCGACCTGGGGCCCGACGGGCGCCTCCTCCGCGGGTATGACCAGTCC 89 | GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCCGCG 90 | GACACAGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGTGG 91 | AGAGCCTACCTGGAGGGCGAGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 92 | GAGACGCTGCAGCGCGCGGAACACCCAAAGACACACGTGACCCACCATCCCGTCTCTGAC 93 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTACGGAGATCACACTGACC 94 | TGGCAGCGGGATGGCGAGGACCAAACTCAGGACACCGAGCTTGTGGAGACCAGGCCAGCA 95 | GGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 96 | TACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCACCCTGAGATGGGAGCCA 97 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGGCCTGGCTGTCCTGGCTGTC 98 | CTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAGGAGGAAGAGCTCAGGTGGA 99 | 
AAAGGAGGGAGCTGCTCTCAGGCTGCGTCCAGCAACAGTGCCCAGGGCTCTGATGAGTCT 100 | CTCATCGCTTGTAAAGCCTGA 101 | >HLA:HLA00462 C*14:02:01:01 1101 bp 102 | ATGCGGGTCATGGCGCCCCGAACCCTCATCCTGCTGCTCTCGGGAGCCCTGGCCCTGACC 103 | GAGACCTGGGCCTGCTCCCACTCCATGAGGTATTTCTCCACATCCGTGTCCCGGCCCGGC 104 | CGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTC 105 | GACAGCGACGCCGCGAGTCCGAGAGGGGAGCCGCGGGCGCCGTGGGTGGAGCAGGAGGGG 106 | CCGGAGTATTGGGACCGGGAGACACAGAAGTACAAGCGCCAGGCACAGACTGACCGAGTG 107 | AGCCTGCGGAACCTGCGCGGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAG 108 | TGGATGTTTGGCTGCGACCTGGGGCCCGACGGGCGCCTCCTCCGCGGGTATGACCAGTCC 109 | GCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGATCTGCGCTCCTGGACCGCCGCG 110 | GACACGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGCGG 111 | AGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAG 112 | GAGACGCTGCAGCGCGCGGAACACCCAAAGACACACGTGACCCACCATCCCGTCTCTGAC 113 | CATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACC 114 | TGGCAGTGGGATGGGGAGGACCAAACTCAGGACACCGAGCTTGTGGAGACCAGGCCAGCA 115 | GGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGA 116 | TACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCACCCTGAGATGGGAGCCG 117 | TCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGGCCTGGCTGTCCTGGCTGTC 118 | CTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAGGAGGAAGAGCTCAGGTGGA 119 | AAAGGAGGGAGCTGCTCTCAGGCTGCGTCCAGCAACAGTGCCCAGGGCTCTGATGAGTCT 120 | CTCATCGCTTGTAAAGCCTGA 121 | -------------------------------------------------------------------------------- /test/fake_db/hla_nuc.fasta.idx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/test/fake_db/hla_nuc.fasta.idx -------------------------------------------------------------------------------- /test/genomic_ABC.fa: -------------------------------------------------------------------------------- 1 | >HLA:HLA21338 A*02:01:154 3517 bp 2 | CAGAAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTGGCTCTCAGGGTCTCAG 3 | 
GCCCCGAAGGCGGTGTATGGATTGGGGAGTCCCAGCCTTGGGGATTCCCCAACTCCGCAG 4 | TTTCTTTTCTCCCTCTCCCAACCTATGTAGGGTCCTTCTTCCTGGATACTCACGACGCGG 5 | ACCCAGTTCTCACTCCCATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCG 6 | GTCGCGGTTCTAAAGTCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGG 7 | ATGGCCGTCATGGCGCCCCGAACCCTCGTCCTGCTACTCTCGGGGGCTCTGGCCCTGACC 8 | CAGACCTGGGCGGGTGAGTGCGGGGTCGGGAGGGAAACGGCCTCTGTGGGGAGAAGCAAC 9 | GGGCCCGCCTGGCGGGGGCGCAGGACCCGGGAAGCCGCGCCGGGAGGAGGGTCGGGCGGG 10 | TCTCAGCCACTCCTCGTCCCCAGGCTCTCACTCCATGAGGTATTTCTTCACATCCGTGTC 11 | CCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGTGGACGACACGCAGTT 12 | CGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGA 13 | GCAGGAGGGTCCGGAGTATTGGGACGGGGAGACACGGAAAGTGAAGGCCCACTCACAGAC 14 | TCACCGAGTGGACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTGAGTG 15 | ACCCCGGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCAGGTCGCCCA 16 | CAGTCTCCGGGTCCGAGATCCGCCCCGAAGCCGCGGGACCCCGAGACCCTTGCCCCGGGA 17 | GAGGCCCAGGCGCCTTTACCCGGTTTCATTTTCAGTTTAGGCCAAAAATCCCCCCAGGTT 18 | GGTCGGGGCGGGGCGGGGCTCGGGGGACCGGGCTGACCGCGGGGTCCGGGCCAGGTTCTC 19 | ACACCGTCCAGAGGATGTATGGCTGCGACGTGGGGTCGGACTGGCGCTTCCTCCGCGGGT 20 | ACCACCAGTACGCCTACGACGGCAAGGATTACATCGCCCTGAAAGAGGACCTGCGCTCTT 21 | GGACCGCGGCGGACATGGCAGCTCAGACCACCAAGCACAAGTGGGAGGCGGCCCATGTGG 22 | CGGAGCAGTTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGG 23 | AGAACGGGAAGGAGACGCTGCAGCGCACGGGTACCAGGGGCCACGGGGCGCCTCCCTGAT 24 | CGCCTGTAGATCTCCCGGGCTGGCCTCCCACAAGGAGGGGAGACAATTGGGACCAACACT 25 | AGAATATCGCCCTCCCTCTGGTCCTGAGGGAGAGGAATCCTCCTGGGTTTCCAGATCCTG 26 | TACCAGAGAGTGACTCTGAGGTTCCGCCCTGCTCTCTGACACAATTAAGGGATAAAATCT 27 | CTGAAGGAATGACGGGAAGACGATCCCTCGAATACTGATGAGTGGTTCCCTTTGACACAC 28 | ACAGGCAGCAGCCTTGGGCCCGTGACTTTTCCTCTCAGGCCTTGTTCTCTGCTTCACACT 29 | CAATGTGTGTGGGGGTCTGAGTCCAGCACTTCTGAGTCCTTCAGCCTCCACTCAGGTCAG 30 | GACCAGAAGTCGCTGTTCCCTCTTCAGGGACTAGAATTTTCCACGGAATAGGAGATTATC 31 | CCAGGTGCCTGTGTCCAGGCTGGTGTCTGGGTTCTGTGCTCCCTTCCCCATCCCAGGTGT 32 | CCTGTCCATTCTCAAGATAGCCACATGTGTGCTGGAGGAGTGTCCCATGACAGATGCAAA 33 | 
ATGCCTGAATGATCTGACTCTTCCTGACAGACGCCCCCAAAACGCATATGACTCACCACG 34 | CTGTCTCTGACCATGAAGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGA 35 | TCACACTGACCTGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGA 36 | CCAGGCCTGCAGGGGATGGAACCTTCCAGAAGTGGGCGGCTGTGGTGGTGCCTTCTGGAC 37 | AGGAGCAGAGATACACCTGCCATGTGCAGCATGAGGGTTTGCCCAAGCCCCTCACCCTGA 38 | GATGGGGTAAGGAGGGAGACGGGGGTGTCATGTCTTTTAGGGAAAGCAGGAGCCTCTCTG 39 | ACCTTTAGCAGGGTCAGGGCCCCTCACCTTCCCCTCTTTTCCCAGAGCCGTCTTCCCAGC 40 | CCACCATCCCCATCGTGGGCATCATTGCTGGCCTGGTTCTCTTTGGAGCTGTGATCACTG 41 | GAGCTGTGGTCGCTGCTGTGATGTGGAGGAGGAAGAGCTCAGGTGGGGAAGGGGTGAAGG 42 | GTGGGTCTGAGATTTCTTGTCTCACTGAGGGTTCCAAGACCCAGGTAGAAGTGTGCCCTG 43 | CCTCGTTACTGGGAAGCACCACCCACAATTATGGGCCTACCCAGCCTGGGCCCTGTGTGC 44 | CAGCACTTACTCTTTTGTAAAGCACCTGTTAAAATGAAGGACAGATTTATCACCTTGATT 45 | ACAGCGGTGATGGGACCTGATCCCAGCAGTCACAAGTCACAGGGGAAGGTCCCTGAGGAC 46 | CTTCAGGAGGGCGGTTGGTCCAGGACCCACACCTGCTTTCTTCATGTTTCCTGATCCCGC 47 | CCTGGGTCTGCAGTCACACATTTCTGGAAACTTCTCTGAGGTCCAAGACTTGGAGGTTCC 48 | TCTAGGACCTTAAGGCCCTGACTCCTTTCTGGTATCTCACAGGACATTTTCTTCCCACAG 49 | ATAGAAAAGGAGGGAGCTACTCTCAGGCTGCAAGTAAGTATGAAGGAGGCTGATGCCTGA 50 | GGTCCTTGGGATATTGTGTTTGGGAGCCCATGGGGGAGCTCACCCACCCCACAATTCCTC 51 | CTCTAGCCACATCTTCTGTGGGATCTGACCAGGTTCTGTTTTTGTTCTACCCCAGGCAGT 52 | GACAGTGCCCAGGGTTCTGATGTGTCTCTCACAGCTTGTAAAGGTGAGAGCCTGGAGGGC 53 | CTGATGTGTGTTGGGTGTTGGGCGGAACAGTGGACACAGCTGTGCTATGGGGTTTCTTTC 54 | CATTGGATGTATTGAGCATGCGATGGGCTGTTTAAAGTGTGACCCCTCACTGTGACAGAT 55 | ACGAATTTGTTCATGAATATTTTTTTCTATAGTGTGAGACAGCTGCCTTGTGTGGGACTG 56 | AGAGGCAAGAGTTGTTCCTGCCCTTCCCTTTGTGACTTGAAGAACCCTGACTTTGTTTCT 57 | GCAAAGGCACCTGCATGTGTCTGTGTTCGTGTAGGCATAATGTGAGGAGGTGGGGAGACC 58 | ACCCCACCCCCATGTCCACCATGACCCTCTTCCCACGCTGACCTGTGCTCCCTCCCCAAT 59 | CATCTTTCCTGTTCCAGAGAGGTGGGGCTGAGGTGTCTCCATCTCTGTCTCAACTTCATG 60 | GTGCACTGAGCTGTAACTTCTTCCTTCCCTATTAAAA 61 | >HLA:HLA00097 A*31:01:02:01 3518 bp 62 | CAGGAGCAGAGGGGTCAGGGCGAAGTACCAGGGCCCCAGGCGTGGCTCTCAGGGTCTCAG 63 | GCCCCGAAGGCGGTGTATGGATTGGGGAGTCCCAGCCTTGGGGATTCCCCAACTCCGCAG 64 | 
TTTCTTTTCTCCCTCTCCCAACCTATGTAGGGTCCTTCTTCCTGGATACTCACGACGCGG 65 | ACCCAGTTCTCACTCCCATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCG 66 | GTCGCGGTTCTAAAGTCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGG 67 | ATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTTGGGGGCCCTGGCCCTGACC 68 | CAGACCTGGGCGGGTGAGTGCGGGGTCGTGGGGAAACCGCCTCTGCGGGGAGAAGCAAGG 69 | GGCCCGCCCGGCGGGGGCGCAGGACCCGGGTAGCCGCGCCGGGAGGAGGGTCGGGCGGAT 70 | CTCAGCCACTCCTCGCCCCCAGGCTCCCACTCCATGAGGTATTTCACCACATCCGTGTCC 71 | CGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTC 72 | GTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAG 73 | CAGGAGAGGCCTGAGTATTGGGACCAGGAGACACGGAATGTGAAGGCCCACTCACAGATT 74 | GACCGAGTGGACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTGAGTGA 75 | CCCCAGCCCGGGGCGCAGGTCACGACCTCTCATCCCCCACGGACGGGCCAGGTCACCCAC 76 | AGTCTCCGGGTCCGAGATCCACCCCGAAGCCGCGGGACCCCGAGACCCTTGCCCCGGGAG 77 | AGGCCCAGGCGCCTTTACCCGGTTTCATTTTCAGTTTAGGCCAAAAATCCCCCCGGGTTG 78 | GTCGGGGCCGGACGGGGCTCGGGGGACTGGGCTGACCGTGGGGTCGGGGCCAGGTTCTCA 79 | CACCATCCAGATGATGTATGGCTGCGACGTGGGGTCGGACGGGCGCTTCCTCCGCGGGTA 80 | CCAGCAGGACGCCTACGACGGCAAGGATTACATCGCCTTGAACGAGGACCTGCGCTCTTG 81 | GACCGCGGCGGACATGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGC 82 | GGAGCAGTTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGA 83 | GAACGGGAAGGAGACGCTGCAGCGCACGGGTACCAGGGGCCACGGGGCGCCTCCCTGATC 84 | GCCTGTAGATCTCCCGGGCTGGCCTCCCACAAGGAGGGGAGACAATTGGGACCAACACTA 85 | GAATATCACCCTCCCTCTGGTCCTGAGGGAGAGGAATCCTCCTGGGTTTCCAGATCCTGT 86 | ACCAGAGAGTGACTCTGAGGTTCCGCCCTGCTCTGTGACACAATTAAGGGATAAAATCTC 87 | TGAAGGAATGACGGGAAGACGATCCCTCGAATACTGATGAGTGGTTCCCTTTGACACACA 88 | CCGGCAGCAGCCTTGGGCCCGTGACTTTTCCTCTCAGGCCTTGTTCTCTGCTTCACACTC 89 | AATGTGTGTGGGGGTCTGAGTCCAGCACTTCTGAGTCCCTCAGCCTCCACTCAGGTCAGG 90 | ACCAGAAGTCGCTGTTCCCTCTTCAGGGACTAGAATTTTCCACGGAATAGGAGATTATCC 91 | CAGGTGCCTGTGTCCAGGCTGGTGTCTGGGTTCTGTGCTCCCTTCCCCATCCCAGGTGTC 92 | CTGTCCATTCTCAAGATAGCCACATGTGTGCTGGAGGAGTGTCCCATTACAGATGCAAAA 93 | TGCCTGAATGTTCTGACTCTTCCTGACAGACCCCCCCAAGACGCATATGACTCACCACGC 94 | 
TGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGAGCTTCTACCCTGCGGAGAT 95 | CACACTGACCTGGCAGCGGGATGGGGAGGACCAGACCCAGGACACGGAGCTCGTGGAGAC 96 | CAGGCCTGCAGGGGATGGAACCTTCCAGAAGTGGGCGTCTGTGGTGGTGCCTTCTGGACA 97 | GGAGCAGAGATACACCTGCCATGTGCAGCATGAGGGTCTCCCCAAGCCCCTCACCCTGAG 98 | ATGGGGTAAGGAGGGAGATGGGGGTGTCATGTCTTTTAGGGAAAGCAGGAGCCTCTCTGA 99 | CCTTTAGCAGGGTCAGGGCCCCTCACCTTCCCCTCTTTTCCCAGAGCCGTCTTCCCAGCC 100 | CACCATCCCCATCGTGGGCATCATTGCTGGCCTAGTTCTCTTTGGAGCTGTGTTCGCTGG 101 | AGCTGTGGTCGCTGCTGTGAGGTGGAGGAGGAAGAGCTCAGGTGGGGTGAAGGGGTGAAG 102 | GGTGGGTCTGAGATTTCTTGTCTCACTGAGGGTTCCAAGACCCAGGTAGAAGTGTGCCCT 103 | GCCTCGTTACTGGGAAGCACCATCCACAATTATGGGCCTACCCAGCCTGGGCCCTGTGTG 104 | CCAGCACTTACTCTTTTGTAAAGCACCTGTTAAAATGAAGGACAGATTTATCACCTTGAT 105 | TATGGCGGTGATGGGACCTGATCCCAGCAGTCACAAGTCACAGGGGAAGGTCCCTGAGGA 106 | CCTTCAGGAGGGCGGTTGGTCCAGGACCCACACCTGCTTTCTTCATGTTTCCTGATCCCG 107 | CCCTGGGTCTGCAGTCACACATTTCTGGAAACTTCTCTGAGGTCCAAGACTTGGAGGTTC 108 | CTCTAGGACCTTAAGGCCCTGGCTCCTTTCTGGTATCTCACAGGACATTTTCTTCCCACA 109 | GATAGAAAAGGAGGGAGCTACTCTCAGGCTGCAAGTAAGTATGAAGGAGGATGATCCAAG 110 | AAATCACTGGGATATTGTGTTTGGGAGCCCGTGGGGGAGCTCACCCACCCCACAATTCCT 111 | CCTCTAGCCACATCTTCTGTGGGATCTGACCAGGTTCTGTTTTTGTCCTACCCCAGGCAG 112 | TGACAGTGCCCAGGGCTCTGATATGTCTCTCACAGCTTGTAAAGGTGAGAGCCTGGAGGG 113 | CCTGATGTGTGTTGGGTGTTGGGCGGAACAGTGGACGCAGCTGTGCTATGGGGTTTCTTT 114 | GCATTGGATGTATTGAGCATGCGATGGGCTGTTTAAAGTGTGACTCCTCACTGTGACAGA 115 | TACGAATTTGTTCATGAATATTTTTTTCTATAGTGTGAGACAGCTGCCTTGTGTGGGACT 116 | GAGAGGCAAGATTTGTTCCTGCCCTTCCCTTTGTGACTTGAAGTACCCTGACTTTGTTTC 117 | TGCAAAGGCACCTGCATGTGTCTGTGTTCTTGTAGGCATAATGTGAGGAGGTGGGGAGAC 118 | CACCCCACCCCCATGTCCACCATGACCCTCTTCCCACGCTGACCTGTGCTCCCTCCCCAA 119 | TCATCTTTCCTGTTCCAGAGAGGTGGGGCTGAGGTGTCTCCATCTCTGCCTCAACTTCAT 120 | GGTGCACTGAGCTGTAACTTCTTCCTTCCCTATTAAAA 121 | >HLA:HLA00344 B*51:01:01:01 4085 bp 122 | GATCAGGACGAAGTCCCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGAGCCT 123 | TGTCTGCATTGGGGAGGCGCAGCGTTGGGGATTCCCCACTCCCACGAGTTTCACTTCTTC 124 | 
TCCCAACCTATGTCGGGTCCTTCTTCCAGGATACTCGTGACGCGTCCCCATTTCCCACTC 125 | CCATTGGGTGTCGGATATCTAGAGAAGCCAATCAGTGTCGCCGGGGTCCCAGTTCTAAAG 126 | TCCCCACGCACCCACCCGGACTCAGAATCTCCTCAGACGCCGAGATGCGGGTCACGGCGC 127 | CCCGAACCGTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACCGAGACCTGGGCCGGTG 128 | AGTGCGGGGTCGGGAGGGAAATGGCCTCTGTGGGGAGGAGCGAGGGGACCGCAGGCGGGG 129 | GCGCAGGACCTGAGGAGCCGCGCCGGGAGGAGGGTCGGGCGGGTCTCAGCCCCTCCTCGC 130 | CCCCAGGCTCCCACTCCATGAGGTATTTCTACACCGCCATGTCCCGGCCCGGCCGCGGGG 131 | AGCCCCGCTTCATTGCAGTGGGCTACGTGGACGACACCCAGTTCGTGAGGTTCGACAGCG 132 | ACGCCGCGAGTCCGAGGACGGAGCCCCGGGCGCCATGGATAGAGCAGGAGGGGCCGGAGT 133 | ATTGGGACCGGAACACACAGATCTTCAAGACCAACACACAGACTTACCGAGAGAACCTGC 134 | GGATCGCGCTCCGCTACTACAACCAGAGCGAGGCCGGTGAGTGACCCCGGCCCGGGGCGC 135 | AGGTCACGACTCCCCATCCCCCACGTACGGCCCGGGTCGCCCCGAGTCTCCGGGTCCGAG 136 | ATCCGCCTCCCTGAGGCCGCGGGACCCGCCCAGACCCTCGACCGGCGAGAGCCCCAGGCG 137 | CGTTTACCCGGTTTCATTTTCAGTTGAGGCCAAAATCCCCGCGGGTTGGTCGGGGCGGGG 138 | CGGGGCTCGGGGGACGGTGCTGACCGCGGGGCCGGGGCCAGGGTCTCACACTTGGCAGAC 139 | GATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATAACCAGTACGC 140 | CTACGACGGCAAAGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCGGCGGA 141 | CACCGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGAGGCGGAGCAGCTGAG 142 | AGCCTACCTGGAGGGCCTGTGCGTGGAGTGGCTCCGCAGACACCTGGAGAACGGGAAGGA 143 | GACGCTGCAGCGCGCGGGTACCAGGGGCAGTGGGGAGCCTTCCCCATCTCCTATAGGTCG 144 | CCGGGGATGGCCTCCCACGAGAAGAGGAGGAAAATGGGATCAGCGCTAGAATGTCGCCCT 145 | CCCTTGAATGGAGAATGGCATGAGTTTTCCTGAGTTTCCTCTGAGGGCCCCCTCTTCTCT 146 | CTAGGACAATTAAGGGATGACGTCTCTGAGGAAATGGAGGGGAAGACAGTCCCTAGAATA 147 | CTGATCAGGGGTCCCCTTTGACCCCTGCAGCAGCCTTGGGAACCGTGACTTTTCCTCTCA 148 | GGCCTTGTTCTCTGCCTCACACTCAGTGTGTTTGGGGCTCTGATTCCAGCACTTCTGAGT 149 | CACTTTACCTCCACTCAGATCAGGAGCAGAAGTCCCTGTTCCCCGCTCAGAGACTCGAAC 150 | TTTCCAATGAATAGGAGATTATCCCAGGTGCCTGCGTCCAGGCTGGTGTCTGGGTTCTGT 151 | GCCCCTTCCCCACACCAGGTGTCCTGTCCATTCTCAGGCTGGTCACATGGGTGGTCCTAG 152 | GGTGTCCCATGAGAGATGCAAAGCGCCTGAATTTTCTGACTCTTCCCATCAGACCCCCCA 153 | 
AAGACACACGTGACCCACCACCCCGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCC 154 | CTGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACT 155 | CAGGACACTGAGCTTGTGGAGACCAGACCAGCAGGAGATAGAACCTTCCAGAAGTGGGCA 156 | GCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACATGCCATGTACAGCATGAGGGG 157 | CTGCCGAAGCCCCTCACCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTCATATCTCTTC 158 | TCAGGGAAAGCAGGAGCCCTTCTGGAGCCCTTCAGCAGGGTCAGGGCCCCTCGTCTTCCC 159 | CTCCTTTCCCAGAGCCATCTTCCCAGTCCACCATCCCCATCGTGGGCATTGTTGCTGGCC 160 | TGGCTGTCCTAGCAGTTGTGGTCATCGGAGCTGTGGTCGCTACTGTGATGTGTAGGAGGA 161 | AGAGCTCAGGTAGGGAAGGGGTGAGGGGTGGGGTCTGGGTTTTCTTGTCCCACTGGGGGT 162 | TTCAAGCCCCAGGTAGAAGTGTTCCCTGCCTCATTACTGGGAAGCAGCATCCACACAGGG 163 | GCTAACGCAGCCTGGGACCCTGTGTGCCAGCACTTACTCTTTTGTGCAGCACATGTGACA 164 | ATGAAGGACGGATGTATCACCTTGATGGTTGTGGTGTTGGGGTCCTGATTTCAGCATTCA 165 | TGAGTCAGGGGAAGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGGTCCAGGACCC 166 | ACACTTGCTTTCCTCGTGTTTCCTGATCCTGCCTTGGGTCTGTAGTCATACTTCTGGAAA 167 | TTCCTTTTGGGTCCAAGACGAGGAGGTTCCTCTAAGATCTCATGGCCCTGCTTCCTCCCA 168 | GTCCCCTCACAGGACATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTACTCTCAGGCTG 169 | CGTGTAAGTGGTGGGGGTGGGAGTGTGGAGGAGCTCACCCACCCCATAATTCCTCCTGTC 170 | CCACGTCTCCTGCGGGCTCTGACCAGGTCCTGTTTTTGTTCTACTCCAGCCAGCGACAGT 171 | GCCCAGGGCTCTGATGTGTCTCTCACAGCTTGAAAAGGTGAGATTCTTGGGGTCTAGAGT 172 | GGGCGGGGGGGGCGGGGAGGGGGCAGAGGGGAAAGGCCTGGGTAATGGAGATTCTTTGAT 173 | TGGGATGTTTCGCGTGTGTCGTGGGCTGTTCAGAGTGTCATCACTTACCATGACTAACCA 174 | GAATTTGTTCATGACTGTTGTTTTCTGTAGCCTGAGACAGCTGTCTTGTGAGGGACTGAG 175 | ATGCAGGATTTCTTCACTCCTCCCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTG 176 | CAAAGGCACCTGAATGTGTCTGCGTTCCTGTTAGCATAATGTGAGGAGGTGGAGAGACAG 177 | CCCACCCTTGTGTCCACTGTGACCCCTGTTCCCATGCTGACCTGTGTTTCCTCCCCAGTC 178 | ATCTTTCTTGTTCCAGAGAGGTGGGGCTGGATGTCTCCATCTCTGTCTCAACTTTATGTG 179 | CACTGAGCTGCAACTTCTTACTTCCCTACTGAAAATAAGAATCTGAATATACATTTGTTT 180 | TCTCAAATATTTGCTATGAGAGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAATGT 181 | GAGAGAGCAAATAAAGACCTGAGAACCTTCCAGAATCTGCATGTTCGCTGTGCTGAGTCT 182 | 
ATTGCAGGTGGGGTGTGGAGAAGGCTGTGGGGGGCCGAGTGTGGACAGGGCCTGTGCCCA 183 | GTTGTTGTTGAGCCCATCATGGGCTTTATGTGGTTAGTCCTCAGCTGGGTCACCTTCACT 184 | GCCCCATTGTCCTTGTCCCTTCAGCGGAAACTTGTCCAGTGGGAGCTGTGACCACAGAGG 185 | CTCACACATCGCCCAGGGTGGCCCCTGCACACGGGGGTCTCTGTGCATTCTGAGACAAAT 186 | TTTCAGAGCCATTCACCTCCTGCCCTGCTTCTAGAGCTCCTTTTCTGCTCTGCTCTCCTG 187 | CCCTCTCTCCCTGCCCTGGTTCTAGTGATCTTGGTGCTGAATCCAATCCCAACTCATGAA 188 | TCTGTAAAGCAGAGTCTAATTTAGAGTTACATTTGTCTGTGAAATTGGACCCATCATCAA 189 | GGACTGTTCTTTCCTGAAGAGAGAACCTGATTGTGTGCTGCAGTGTGCTGGGGCAGGGGG 190 | TGCGG 191 | >HLA:HLA00225 B*27:05:02:01 4083 bp 192 | GATCAGGACGAAGTCCCAGGCCCCGGGCGGGGCTCTCAGGGTCTCAGGCTCCGAGAGCCT 193 | TGTCTGCATTGGGGAGGCGCAGCATTGGGGATTCCCCACTCCCACGAGTTTCACTTCTTC 194 | TCCCAACCTATGTCGGGTCCTTCTTCCAGGATACTCGTGACGCGTCCCCATTTCCCACTC 195 | CCATTGGGTGTCGGGTGTCTAGAGAAGCCAATCAGTGTCGCCGGGGTCCCAGTTCTAAAG 196 | TCCCCACGCACCCACCCGGACTCAGAATCTCCTCAGACGCCGAGATGCGGGTCACGGCGC 197 | CCCGAACCCTCCTCCTGCTGCTCTGGGGGGCAGTGGCCCTGACCGAGACCTGGGCTGGTG 198 | AGTGCGGGGTCGGCAGGGAAATGGCCTCTGTGGGGAGGAGCGAGGGGACCGCAGGCGGGG 199 | GCGCAGGACCCGGGGAGCCGCGCCGGGAGGAGGGTCGGGCGGGTCTCAGCCCCTCCTCGC 200 | CCCCAGGCTCCCACTCCATGAGGTATTTCCACACCTCCGTGTCCCGGCCCGGCCGCGGGG 201 | AGCCCCGCTTCATCACCGTGGGCTACGTGGACGACACGCTGTTCGTGAGGTTCGACAGCG 202 | ACGCCGCGAGTCCGAGAGAGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGGCCGGAGT 203 | ATTGGGACCGGGAGACACAGATCTGCAAGGCCAAGGCACAGACTGACCGAGAGGACCTGC 204 | GGACCCTGCTCCGCTACTACAACCAGAGCGAGGCCGGTGAGTGACCCCGGCCCGGGGCGC 205 | AGGTCACGACTCCCCATCCCCCACGTACGGCCCGGGTCGCCCCGAGTCTCCGGGTCCGAG 206 | ATCCGCCCCCGAGGCCGCGGGACCCGCCCAGACCCTCGACCGGCGAGAGCCCCAGGCGCG 207 | TTTACCCGGTTTCATTTTCAGTTGAGGCCAAAATCCCCGCGGGTTGGTCGGGGCGGGGCG 208 | GGGCTCGGGGGGACGGGGCTGACCGCGGGGGCGGGGCCAGGGTCTCACACCCTCCAGAAT 209 | ATGTATGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGTACCACCAGGACGCC 210 | TACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGAGCTCCTGGACCGCCGCGGAC 211 | ACGGCGGCTCAGATCACCCAGCGCAAGTGGGAGGCGGCCCGTGTGGCGGAGCAGCTGAGA 212 | GCCTACCTGGAGGGCGAGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAG 213 | 
ACGCTGCAGCGCGCGGGTACCAGGGGCAGTGGGGAGCCTTCCCCATCTCCTATAGGTCGC 214 | CGGGGATGGCCTCCCACGAGAAGAGGAGGAAAATGGGATCAGCGCTAGAATGTCGCCCTC 215 | CCTTGAATGGAGAATGGCATGAGTTTTCCTGAGTTTCCTCTGAGGGCCCCCTCTTCTCTC 216 | TAGGACAATTAAGGGATGACGTCTCTGAGGAAATGGAGGGGAAGACAGTCCCTAGAATAC 217 | TGATCAGGGGTCCCCTTTGACCCCTGCAGCAGCCTTGGGAACCGTGACTTTTCCTCTCAG 218 | GCCTTGTTCTCTGCCTCACACTCAGTGTGTTTGGGGCTCTGATTCCAGCACTTCTGAGTC 219 | ACTTTACCTCCACTCAGATCAGGAGCAGAAGTCCCTGTTCCCCGCTCAGAGACTCGAACT 220 | TTCCAATGAATAGGAGATTATCCCAGGTGCCTGCGTCCAGGCTGGTGTCTGGGTTCTGTG 221 | CCCCTTCCCCACCCCAGGTGTCCTGTCCATTCTCAGGCTGGTCACATGGGTGGTCCTAGG 222 | GTGTCCCATGAGAGATGCAAAGCGCCTGAATTTTCTGACTCTTCCCATCAGACCCCCCAA 223 | AGACACACGTGACCCACCACCCCATCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCC 224 | TGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTC 225 | AGGACACTGAGCTTGTGGAGACCAGACCAGCAGGAGATAGAACCTTCCAGAAGTGGGCAG 226 | CTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACATGCCATGTACAGCATGAGGGGC 227 | TGCCGAAGCCCCTCACCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTCATATCTCTTCT 228 | CAGGGAAAGCAGGAGCCCTTCAGCAGGGTCAGGGCCCCTCATCTTCCCTTCCTTTCCCAG 229 | AGCCGTCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAG 230 | CAGTTGTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGGTA 231 | GGGAAGGGGTGAGGGGTGGGGTCTGAGTTTTCTTGTCCCACTGGGGGTTTCAAGCCCCAG 232 | GTAGAAGTGTTCCCTGCCTCATTACTGGGAAGCAGCATCCACACAGGGGCTAACGCAGCC 233 | TGGGACCCTGTGTGCCAGCACTTACTCTTTTGTGCAGCACATGTGACAATGAAGGACGGA 234 | TGTATCACCTTGGTGGTTGTGGTGTTGGGGTCCTGATTCCAGCATTCATGAGTCAGGGGA 235 | AGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGGTCCAGGACCCACACTTGCTTTC 236 | CTCGTGTTTCCTGATCCTGCCTTGGGTCTGTAGTCATACTTCTGGAAATTCCTTTTGGGT 237 | CCAAGACGAGGAGGTTCCTCTAAGATCTCATGGCCCTGCTTCCTCCCAGTCCCCTCACAG 238 | GGCATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTACTCTCAGGCTGCGTGTAAGTGAT 239 | GGGGGTGGGAGTGTGGAGGAGCTCACCCACCCCCTAATTCCTCCTGTCCCACGTCTCCTG 240 | CGGGCTCTGACCAGGTCCTGTTTTTGTTCTACTCCAGGCAGCGACAGTGCCCAGGGCTCT 241 | GATGTGTCTCTCACAGCTTGAAAAGGTGAGATTCTTGGGGTCTAGAGTGGGTGGGGTGGC 242 | 
AGGTCTGGGGGTGGGTGGGGCAGTGGGGAAAGGCCTGGGTAATGGAGATTCTTTGATTGG 243 | GATGTTTCGCGTGTGTGGTGGGCTGTTTAGACTGTCATCACTTACCATGACTAACCAGAA 244 | TTTGTTCATGACTGTTGTTTTCTGTAGCCTGAGACAGCTGTCTTGTGAGGGACTGAGATG 245 | CAGGATTTCTTCACGCCTCCCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTGCAA 246 | AGGCACCTGAATGTGTCTGCGTCCCTGTTAGCATAATGTGAGGAGGTGGAGAGACCAGCC 247 | CACCCCCGTGTCCACTGTGACCCCTGTTCCCATGCTGACCTGTGTTTCCTCCCCAGTCAT 248 | CTTTCCTGTTCCAGAGAGGTGGGGCTGGATGTCTCCATCTCTGTCTCAACTTTATGTGCA 249 | CTGAGCTGCAACTTCTTACTTCCCTACTGAAAATAAGAATCTGAATATAAATTTGTTTTC 250 | TCAAATATTTGCTATGAGAGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAATTTGA 251 | GAGAGCAAATAAAGACCTGAGAACCTTCCAGAATCTGCATGTTCGCTGTGCTGAGTCTGT 252 | TGCAGGTGGGGTGTGGAGAAGGCTGTGGGGGGCCGAGTGTGGACGGGGCCTGTGCCCATT 253 | TGGTGTTGAGTCCATCATGGGCTTTATGTGGTTAGTCCTCAGCTGGGTCACCTTCACTGC 254 | TCCATTGTCCTTGTCCCTTCAGTGGAAACTTGTCCAGCGGGAGCTGTGACCACAGAGGCT 255 | CACACATCGCCCTGGGCGGCCCCTGCACACGGGGGTCTCTGTGCATTCTGAGACAAATTT 256 | TCAGAGCCATTCACCTCTTGCCCTGCTTCTAGAGCTCCTTTTCTGCTCTGCTCTCCTGCC 257 | CTCTCTCCCTGCCCTGGTTCTAGTGATCTTGGTGCTGAATCCAATCCCAACTCATGAATC 258 | TGTAAAGCAGAGTCTAATTTAGACTTACATTTGTCTGTGAAATTGGACCCGTCATCAAGG 259 | ACTGTTCTTTCCTGAAGAGAGAACCTGATTGTGTGCTGCAGTGTGCTGGGGCAGGGGGTG 260 | CGG 261 | >HLA:HLA00405 C*02:02:02:01 4295 bp 262 | TTATTTTGCTGGATGTAGTTTAATATTACCTGAGGTAAGGTAAGGCAAAGAGTGGGAGGC 263 | AGGGAGTCCAGTTCAGGGACGGGGATTCCAGGAGAAGTGAAGGGGAAGGGGCTGGGCGCA 264 | GCCTGGGGGTCTCTCCCTGGTTTCCACAGACAGATCCTTGGCCAGGACTCAGGCACACAG 265 | TGTGACAAAGATGCTTGGTGTAGGAGAAGAGGGATCAGGACGAAGTCCCAGGTCCCGGGC 266 | GGGGCTCTCAGGGTCTCAGGCTCCAAGGGCCGTGTCTGCACTGGGGAGGCGCCGCGTTGA 267 | GGATTCTCCACTCCCCTGAGTTTCACTTCTTCTCCCAACCTGCGACGGGTCCTTCTTCCT 268 | GAATACTCATGACGCGTCCCCAATTCCCACTCCCATTGGGTGTCGGGTTCTAGAGAAGCC 269 | AATCAGCGTCTCCGCAGTCCCGGTTCTAAAGTCCCCAGTCACCCACCCGGACTCGGATTC 270 | TCCCCAGACGCCGAGATGCGGGTCATGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGGA 271 | GCCCTGGCCCTGACCGAGACCTGGGCCTGTGAGTGCGGGGTTGGGAGGGAAACGGCCTCT 272 | GCGGAGAGGAGCGAGGGGCCCGCCCGGCGAGGGCGCAGGACCCGGGGAGCCGCGCAGGGA 273 | 
GGAGGGTCGGGCGGGTCTCAGCCCCTCCTCTCCCCCAGGCTCCCACTCCATGAGGTATTT 274 | CTACACCGCTGTGTCCCGGCCCAGCCGCGGAGAGCCCCACTTCATCGCAGTGGGCTACGT 275 | GGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGTCCAAGAGGGGAGCCGCG 276 | GGCGCCGTGGGTGGAGCAGGAGGGGCCGGAGTATTGGGACCGGGAGACACAGAAGTACAA 277 | GCGCCAGGCACAGACTGACCGAGTGAACCTGCGGAAACTGCGCGGCTACTACAACCAGAG 278 | CGAGGCCGGTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCCCTCCCCATCCCCCACG 279 | GACGGCCCGGGTCGCCCCGAGTCTCCGGGTCTGAGATCCACCCCGAGGCTGCGGAACCCG 280 | CCCAGACCCTCGACCGGAGAGAGCCCCAGTCACCTTTACCCGGTTTCATTTTCAGTTTAG 281 | GCCAAAATCCCCGCGGGTTGGTCGGGGCTGGGGCGGGGCTCGGGGGACGGGCTGACCACG 282 | GGGGCGGGGCCAGGGTCTCACACCCTCCAGAGGATGTACGGCTGCGACCTGGGGCCCGAC 283 | GGGCGCCTCCTCCGCGGGTATGACCAGTCCGCCTACGACGGCAAGGATTACATCGCCCTG 284 | AACGAGGACCTGCGCTCCTGGACCGCCGCGGACACAGCGGCTCAGATCACCCAGCGCAAG 285 | TGGGAGGCGGCCCGTGAGGCGGAGCAGTGGAGAGCCTACCTGGAGGGCGAGTGCGTGGAG 286 | TGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCGCGGGTACCAGGGGC 287 | AGTGGGGAGCCTTCCCCATCTCCTGTAGATCTCCCGGGATGGCCTCCCACGAGGAGGGGA 288 | GGAAAATGGGATCAGCGCTAGAATATCGCCCTCCCTTGAATGGAGAATGGGATGAGTTTT 289 | CCTGAGTTTCCTCTGAGGGCCCCCTCTGCTCTCTAGGACAATTAAGGGATGAAGTCCTTG 290 | AGGAAATGGAGGGGAAGACAGTCCCTGGAATACTGATCAGGGGTCCCCTTTGACCACTTT 291 | GACCACTGCAGCAGCTGTGGTCAGGCTGCTGACCTTTCTCTCAGGCCTTGTTCTCTGCCT 292 | CACGCTCAATGTGTTTGAAGGTTTGATTCCAGCTTTTCTGAGTCCTTCGGCCTCCACTCA 293 | GGTCAGGACCAGAAGTCGCTGTTCCTCCCTCAGAGACTAGAACTTTCCAATGAATAGGAG 294 | ATTATCCCAGGTGCCTGTGTCCAGGCTGGCGTCTGGGTTCTGTGCCCCCTTCCCCACCCC 295 | AGGTGTCCTGTCCATTCTCAGGATAGTCACATGGGCGCTGTTGGAGTGTCGCAAGAGAGA 296 | TACAAAGTGTCTGAATTTTCTGACTCTTCCCGTCAGAACACCCAAAGACACACGTGACCC 297 | ACCATCCCGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTA 298 | CGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTCAGGACACCGAGCTTG 299 | TGGAGACCAGGCCAGCAGGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTT 300 | CTGGAGAAGAGCAGAGATACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCA 301 | CCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTCATGTGTCTTCTCAGGGAAAGCAGAAG 302 | 
TCCTGGAGCCCTTCAGCTGGGTCAGGGCTGAGGCTTGGGGGTCAGGGCCCCTCACCTTCC 303 | CCTCCTTTCCCAGAGCCATCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGGC 304 | CTGGCTGTCCTGGCTGTCCTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAGG 305 | AGGAAGAGCTCAGGTAGGGAAGGGGTGAGGAGTGGGGTCTGGGTTTTCTTGTCCCACTGG 306 | GAGTTTCAAGCCCCAGGTAGAAGTGTGCCCCACCTCGTTACTGGAAGCACCATCCACACA 307 | TGGGCCATCCCAGCCTGGGACCCTGTGTGCCAGCACTTACTCTGTTGTGAAGCACATGAC 308 | AATGAAGGACAGATGTATCACCTTGATGATTATGGTGTTGGGGTCCTTGATTCCAGCATT 309 | CATGAGTCAGGGGAAGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGCTCCAGAAC 310 | CCACAGCTGCTTTCCCCGTGTTTCCTGATCCTGCCCTGGGTCTGCAGTCATAGTTCTGGA 311 | AACTTCTCTTGGGTCCAAGACTAGGAGGTTCCCCTAAGATTGCATGGCCCTGCCTCCTCC 312 | CTGTCCCCTCACAGGGCATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTGCTCTCAGGC 313 | TGCGTGTAAGTGATGGCGGCAGGCGTGTGGAGGAGCTCACCCACCCCATAATTCCTCTTG 314 | TCCCACATCTCCTGCGGGCTCTGACCAGGTCTTTTTTTTTGTTCTACCCCAGCCAGCAAC 315 | AGTGCCCAGGGCTCTGATGAGTCTCTCATCGCTTGTAAAGGTGAGATTCTGGGAGCTGAA 316 | GTGGTCGGGGGTGGGGCAGAGGGAAAAGGCCTAGGTAATGGGGATCCTTTGATAGGGACG 317 | TTTCGAATGTGTGGTGAGCTGTTCAGAGTGTCATCACTTACCATGACTGACCTGAATTTG 318 | TTCATGACTATTGTGTTCTGTAGCCTGAGACAGCTGCCTGTGTGGGACTGAGATGCAGGA 319 | TTTCTTCACACCTCTCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTACAAAGGCA 320 | TCTGAATGTGTCTGCGTTCCTGTTAGCATAATGTGAGGAGGTGGAGAGACAGCCCACCCC 321 | CGTGTCCACCGTGACCCCTGTCCCCACACTGACCTGTGTTCCCTCCCCGATCATCTTTCC 322 | TGTTCCAGAGAAGTGGGCTGGATGTCTCCATCTCTGTCTCAACTTTACGTGTACTGAGCT 323 | GCAACTTCTTCCCTACTGAAAATAAGAATCTGAATATAAATTTGTTTTCTCAAATATTTG 324 | CTATGAGAGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAAGTTGAGAGAGCAAATA 325 | AAGACCTGAGAACCTTCCAGAATCCGCATGTTCGCTGTGCTGAGTCTGTTGCAGGTGGGG 326 | GTGGGGAAGGCTGTGAGGAGACGAGTGTGGACGGGGCCTGTGCCTAGTTGCTGTTCAGTT 327 | CTTCATGGGCTTTATGTGGTCAGTCCTCAGCTGGGTCACCTTCACTGCTCCATTGTCCTT 328 | GTCCCTTCAGTGGAAACTTGTCCAGCGGGAGCTGTGACCACAGAGGCTCACACATCGCCC 329 | AGGGCAGCCCCTGCACACGGGAGTCCCTGTGCTTTCTGAGACAAATTTTCAGACCCATTC 330 | AGCTCCTGCCCTCCTTCTAGGGCTCCTCTTCTGCTTTGGTCTCCTGCCCTCTCTCCCTTC 331 | 
CCTGATTCCAGTAATCTTCGTGCTGACTCCAATCCCAACTCATGAATCTAAAGCAGAGCC 332 | TAATTTAGATTTATATTTGTTTGTAAAATTGGGTCCATAGTCTAGAATTGTTCCTTCCTG 333 | AAGAGAAACCTGATTGTGTGCTGCAGTGTGCAGGG 334 | >HLA:HLA00462 C*14:02:01:01 4304 bp 335 | TTATTTTGCTGGATGTAGTTTAATATTACCTGAGGTAAGGTAAGGCAAAGAGTGGGAGGC 336 | AGGGAGTCCAGTTCAGGGACGGGGATTCCAGGAGAAGTGAAGGGGAAGGGGCTGGGCGCA 337 | GCCTGGGGGTCTCTCCCTGGTTTCCACAGACAGATCCTTGGCCAGGACTCAGGCACACAG 338 | TGTGACAAAGATGCTTGGTGTAGGAGAAGAGGGATCAGGACGAAGTCCCAGGTCCCGGGC 339 | GGGGCTCTCAGGGTCTCAGGCTCCAAGGGCCGTGTCTGCACTGGGGAGGCGCCGCGTTGA 340 | GGATTCTCCACTCCCCTGAGTTTCACTTCTTCTCCCAACCTGCGTCGGGTCCTTCTTCCT 341 | GAATACTCATGACGCGTCCCCAATTCCCACTCCCATTGGGTGTCGGGTTCTAGAGAAGCC 342 | AATCAGCGTCTCCGCAGTCCCGGTTCTAAAGTCCCCAGTCACCCACCCGGACTCAGATTC 343 | TCCCCAGACGCCGAGATGCGGGTCATGGCGCCCCGAACCCTCATCCTGCTGCTCTCGGGA 344 | GCCCTGGCCCTGACCGAGACCTGGGCCTGTGAGTGCGGGGTTAGGAGGGAAACGGCCTCT 345 | GCGGAGAGGAGCGAGGGGCCCGCCCGGCGAGGGCGCAGGACCCGGGGAGCCGCGCAGGGA 346 | GGAGGGTCGGGCGGGTCTCAGCCACTCCTCGTCCCCAGGCTCCCACTCCATGAGGTATTT 347 | CTCCACATCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCAGTGGGCTACGT 348 | GGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGTCCGAGAGGGGAGCCGCG 349 | GGCGCCGTGGGTGGAGCAGGAGGGGCCGGAGTATTGGGACCGGGAGACACAGAAGTACAA 350 | GCGCCAGGCACAGACTGACCGAGTGAGCCTGCGGAACCTGCGCGGCTACTACAACCAGAG 351 | CGAGGCCGGTGAGTGACCCCGGCCCGGGGCGCAGGTCACGACCCCTCCCCATCCCCCACG 352 | GACGGCCCGGGTCGCCCCGAGTCTCCCCGTCTGAGATCCACCCCGAGGCTGCGGAACCCG 353 | CCCAGACCCTCGACCGGAGAGAGCCCCAGTCACCTTTACCCGGTTTCATTTTCAGTTTAG 354 | GCCAAAATCCCCGCGGGTTGGTCGGGACTGGGGCGGGGCTCGGGGGACGGGGCTGACCAC 355 | GGGGGCGGGGCCAGGGTCTCACACCCTCCAGTGGATGTTTGGCTGCGACCTGGGGCCCGA 356 | CGGGCGCCTCCTCCGCGGGTATGACCAGTCCGCCTACGACGGCAAGGATTACATCGCCCT 357 | GAACGAGGATCTGCGCTCCTGGACCGCCGCGGACACGGCGGCTCAGATCACCCAGCGCAA 358 | GTGGGAGGCGGCCCGTGAGGCGGAGCAGCGGAGAGCCTACCTGGAGGGCACGTGCGTGGA 359 | GTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCGCGGGTACCAGGGG 360 | CAGTGGGGAGCCTTCCCCATCTCCCGTAGATCTCCCGGGATGGCCTCCCACGAGGAGGGG 361 | 
AGGAAAATGGGATCAGCGCTAGAATATCGCCCTCCCTTGAATGGAGAATGGGATGAGTTT 362 | TCCTGAGTTTCCTCTGAGGGCCCCCTCTGCTCTCTAGGACAATTAAGGGATGAAGTCCTT 363 | GAGGAAATGGAGGGGAAGACAGTCCCTGGAATACTGATCAGGGGTCCCCTTTGACCACTT 364 | TGACCACTGCAGCAGCTGTGGTCAGGCTGCTGACCTTTCTCTCAGGCCTTGTTCTCTGCC 365 | TCACGCTCAATGTGTTTGAAGGTTTGATTCCAGCTTTTCTGAGTCCTTCGGCCTCCACTC 366 | AGGTCAGGACCAGAAGTCGCTGTTCCTCCCTCAGAGACTAGAACTTTCCAATGAATAGGA 367 | GATTATCCCAGGTGCCTGTGTCCAGGCTGGCGTCTGGGTTCTGTGCCCCCTTCCCCACCC 368 | CAGGTGTCCTGTCCATTCTCAGGATGGTCACATGGGCGCTGTTGGAGTGTCGCAAGAGAG 369 | ATACAAAGTGTCTGAATTTTCTGACTCTTCCCGTCAGAACACCCAAAGACACACGTGACC 370 | CACCATCCCGTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCT 371 | GCGGAGATCACACTGACCTGGCAGTGGGATGGGGAGGACCAAACTCAGGACACCGAGCTT 372 | GTGGAGACCAGGCCAGCAGGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCT 373 | TCTGGAGAAGAGCAGAGATACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTC 374 | ACCCTGAGATGGGGTAAGGAGGGGGATGAGGGGTGATGTGTCTTCTCAGGGAAAGCAGAA 375 | GTCCTGGAGCCCTTCAGCCGGGTCAGGGCTGAGGCTTGGAGGTCAGGGCCCCTCACCTTC 376 | CCCTCCTTTCCCAGAGCCGTCTTCCCAGCCCACCATCCCCATCGTGGGCATCGTTGCTGG 377 | CCTGGCTGTCCTGGCTGTCCTAGCTGTCCTAGGAGCTGTGGTGGCTGTTGTGATGTGTAG 378 | GAGGAAGAGCTCAGGTAGGGAAGGGGTGAGGAGTGGGGTCTGGGTTTTCTTGTTCCACTG 379 | GGAGTTTCAAGCCCCAGGTAGAAGTGTGCCCCACCTCGTTACTGGAAGCACCATCCACAC 380 | ATGGGCCATCCCAGCCTGGGACCCTGTGTGCCAGCACTTACTCTGTTGTGAAGCACATGA 381 | CAATGAAGGACAGATGTATCACCTTGATGATTATGGTGTTGGGGTCCTTGATTCCAGCAT 382 | TCATGAGTCAGGGGAAGGTCCCTGCTAAGGACAGACCTTAGGAGGGCAGTTGCTTCAGGA 383 | CCCACAGCTGCTTTCCCCGTGTTTCCTGATCCTGCCCTGGGTCTGCAGTCATAGTTCTGG 384 | AAACTTCTCTTGGGTCCAAGACTAGGAGGTTCCCCTAAGATCGCATGGCCCTGCCTCCTC 385 | CCTGTCCCCTCACAGGGCATTTTCTTCCCACAGGTGGAAAAGGAGGGAGCTGCTCTCAGG 386 | CTGCGTGTAAGTGATGGCGGTGGGCGTGTGGAGGAGCTCACCCACCCCATAATTCCTCTT 387 | GTCCCACATCTCCTGCGGGCTCTGACCAGGTCTTTTTTTTTGTTCTACCCCAGCCAGCAA 388 | CAGTGCCCAGGGCTCTGATGAGTCTCTCATCGCTTGTAAAGGTGAGATTCTGGGGAGCTG 389 | AAGTGGTCGGGGGTGGGGCAGAGGGAAAAGGCCTGGGTAATGGGGATCCTTTGATTGGGA 390 | 
CGTTTCGAATGTGTGGTGAGCTGTTCAGAGTGTCATCACTTACCATGACTGACCTGAATT 391 | TGTTCATGACTATTGTGTTCTGTAGCCTGAGACAGCTGCCTGTGTGGGACTGAGATGCAG 392 | GATTTCTTCACACCTCTCCTTTGTGACTTCAAGAGCCTCTGGCATCTCTTTCTGCAAAGG 393 | CATCTGAATGTGTCTGCGTTCCTGTTAGCATAATGTGAGGAGGTGGAGAGACAGCCCACC 394 | CCCGTGTCCACCGTGACCCCTGTCCCCACACTGACCTGTGTTCCCTCCCCGATCATCTTT 395 | CCTGTTCCAGAGAAGTGGGCTGGATGTCTCCATCTCTGTCTCAACTTCATGGTGCGCTGA 396 | GCTGCAACTTCTTACTTCCCTAATGAAGTTAAGAAGCTGAATATAAATTTGTTTTCTCAA 397 | ATATTTGCTATGAAGGGTTGATGGATTAATTAAATAAGTCAATTCCTGGAAGTTGAGAGA 398 | GCAAATAAAGACCTGAGAACCTTCCAGAATCCGCATGTTCGCTGTGCTGAGTCTGTTGCA 399 | GGTGGGGGTGGGGAAGGCTGTGAGGAGACGAGTGTGGACGGGGCCTGTGCCTAGTTGCTG 400 | TTCAGTTCTTCATGGGCTCTATGTGGTCAGTCCTCAGCTGGGTCACCTTCACTGCTCCAT 401 | TGTCCTTGTCCCTTCAGTGGAAACTTGTCCAGCGGGAGCTGTGACCACAGAGGCTCACAC 402 | ATCGCCCAGGGCAGCCCCTGCACACGGGAGTCCCTGTGCTTTCTGAGACAAATTTTCAGA 403 | CCCATTCAGCTCCTGCCCTCCTTCTAGGGCTCCTCTTCTGCTTTGGTCTCCTGCCCTCTC 404 | TCCCTTCCCTGATTCCAGTAATCTTCATGCTGACTCCAATCCCAACTCATGAATCTAAAG 405 | CAGAGCCTAATTTAGATTTATATTTGTTTGTAAAATTGGGTCCATAGTCTAGAATTGTTC 406 | CTTCCTGAAGAGAGAAACCTGATTGTGTGCTGCAGTGTGTGGGG -------------------------------------------------------------------------------- /test/test.bam: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/test/test.bam -------------------------------------------------------------------------------- /test/test.bam.bai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/10XGenomics/scHLAcount/fe65c271a8629bbf41c48de271f344eb4cd1509b/test/test.bam.bai -------------------------------------------------------------------------------- /test/test_allele_fasta.mtx: -------------------------------------------------------------------------------- 1 | %%MatrixMarket matrix coordinate integer general 2 | % written by sprs 3 | 9 1 9 4 | 1 1 9 5 | 2 1 
10 6 | 3 1 1 7 | 4 1 16 8 | 5 1 23 9 | 6 1 3 10 | 7 1 12 11 | 8 1 8 12 | 9 1 3 13 | -------------------------------------------------------------------------------- /test/test_call.mtx: -------------------------------------------------------------------------------- 1 | %%MatrixMarket matrix coordinate integer general 2 | % written by sprs 3 | 5 2 1 4 | 2 1 1 5 | --------------------------------------------------------------------------------