├── README.md ├── Structural_Variant_Sets ├── Nonredundant_Structural_Variants │ ├── Deletions │ │ └── README.md │ ├── Duplications │ │ └── README.md │ ├── Insertions │ │ └── README.md │ ├── QuickStart.md │ ├── README.md │ └── ToolGuide.md └── README.md ├── images ├── NR_sv_chr1.PNG ├── NR_sv_chr1_detail.PNG ├── NR_sv_chr1_zoom.PNG ├── galaxy.PNG ├── galaxy_vcf.PNG ├── stub ├── ucsc_browser.PNG └── ucsc_browser_ins.PNG ├── nr_stats_tables ├── ftp_manifest-table4.20191104.inc.md └── test ├── specs ├── README.md ├── dbVar.xsd ├── dbVarSubmissionTemplate_v3.4.xlsx └── dbVarSubmissionTemplate_v3.5.xlsx └── tutorials └── README.md /README.md: -------------------------------------------------------------------------------- 1 | # dbVar (https://www.ncbi.nlm.nih.gov/dbvar) 2 | ## dbVar is NCBI's database of human genomic structural variation – insertions, deletions, duplications, inversions, mobile elements, and translocations 3 | ============================ 4 | 5 | ### directory layout 6 | 7 | . 8 | +-- Structural_Variant_Sets # dbVar Reference SV Project & Data 9 | +-- specs # dbVar Design and Schema Specifications 10 | +-- tutorials # Initial dir and README setup 11 | +-- README.md 12 | -------------------------------------------------------------------------------- /Structural_Variant_Sets/Nonredundant_Structural_Variants/Deletions/README.md: -------------------------------------------------------------------------------- 1 | # dbVar Human Nonredundant Structural Variants – Deletions 2 | 3 | ### Work in progress - data subject to change 4 | 5 | Documentation updated: 11/27/2019 6 | 7 | ## Data Summary 8 | 9 |

10 | Deletions

| Type and FTP Directory | GRCh37 | GRCh38 |
|:-----------------------|----------:|----------:|
| All | 2,520,481 | 2,508,155 |
| Common | 211,888 | 211,465 |
| Pathogenic | 11,296 | 11,105 |
| Somatic | 23,360 | 23,319 |

39 | 40 | All files are available in **bed**, **bedpe**, and **tsv** formats: 41 | 42 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/#github) 43 | 44 | The variant types in the NR "deletions" files are: 45 | 46 | * alu_deletion 47 | * copy_number_loss 48 | * deletion 49 | * herv_deletion 50 | * line1_deletion 51 | * sva_deletion 52 | 53 | 54 | ## Records in the NR SV deletions files 55 | 56 | ### Please note: 57 | 58 | * The fields type, method, analysis, platform, variant, study, clinical_significance, clinvar_accession, and gene may contain multiple values. 59 | * Each of the values is associated with one or more calls found in the variant field. 60 | * The values in the variant field are "dbVar call accessions". 61 | 62 | * Records in the NR SV deletions files contain the following tab-separated fields. 63 | 64 | | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size 65 | 66 | 67 | ## Example record 1: 68 | 69 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size 70 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|---------|--------------------|------------------|------ 71 | X | 153293294 | 153296201 | 1 | deletion | Curated | Curated | NA | LSDB_submitted_variants | nssv7487065 | Pathogenic | SCV000222455 | medium 72 | 73 | ### Explanation: 74 | 75 | * The non-redundant coordinates for this record in dbVar are chrX, with 76 | an outermost start of 153293294 and outermost stop of 153296201. 77 | 78 | * The variant_count of 1 indicates there is only one SV with an exact match to the 79 | given placement. This count does not include SVs with a partial match. 
80 | 81 | * The variant_type is "deletion". 82 | 83 | * The method and the analysis indicate how the one variant was evaluated. 84 | 85 | * NA indicates that no platform was submitted for this variant. 86 | 87 | * LSDB_submitted_variants is the study name as found in dbVar. 88 | 89 | * The dbVar variant accession (the variant field) is "nssv7487065". 90 | 91 | * The clinical_assertion is "Pathogenic". 92 | 93 | * The variant has an accession in ClinVar of SCV000222455. 94 | 95 | * bin_size = small (length < 50 bp), medium (50 bp <= length < 1000000 bp), large (length >= 1000000 bp). Length = outermost_stop - outermost_start + 1. 96 | 97 | * URLs using the study accession or variant accession can be created to access the data 98 | in dbVar, e.g.: 99 | https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd103/ 100 | https://www.ncbi.nlm.nih.gov/dbvar/?term=nssv7487065 101 | 102 | * From the latter page you may click on the "Variant Region ID" on the left to see 103 | the variant's region in the NCBI Variation Viewer at: 104 | https://www.ncbi.nlm.nih.gov/dbvar/variants/nsv1197457/ 105 | 106 | * The SCV accession can be used to find the record in ClinVar by searching on ClinVar's home page: 107 | https://www.ncbi.nlm.nih.gov/clinvar/ 108 | 109 | ## Example record 2: 110 | 111 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size 112 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|---------|--------------------|------------------|----- 113 | 1 | 72300544 | 72346418 | 7 | copy_number_loss;deletion | Oligo_aCGH;Sequencing | Probe_signal_intensity;Read_depth | Agilent 24M aCGH;Illumina IIx | Park2010;Ju2010 | nssv1423530:nssv1425248:nssv1428032:nssv1428830:nssv1434173:nssv1439464:nssv1420391 | | | medium 114 | 115 | ### Explanation: 116 | 117 | * This is a more complicated example deletion NR record containing multiple 118 | variants 
with multiple types, methods, and analyses from multiple studies, using 119 | multiple platforms. This record does not contain clinical_significance or a 120 | ClinVar accession. 121 | 122 | # Questions or feedback 123 | 124 | * Please email dbvar@ncbi.nlm.nih.gov or create an issue on this GitHub page. 125 | 126 | # Thanks! 127 | 128 | Thanks for your interest in the dbVar human "non-redundant structural variations" (nr SVs) 129 | data files from NCBI. 130 | 131 | Please check back for updates soon. 132 | -------------------------------------------------------------------------------- /Structural_Variant_Sets/Nonredundant_Structural_Variants/Duplications/README.md: -------------------------------------------------------------------------------- 1 | # dbVar Human Nonredundant Structural Variants – Duplications 2 | 3 | ### Work in progress - data subject to change 4 | 5 | Documentation updated: 11/27/2019 6 | 7 | ## Data Summary 8 | 9 |

10 | Duplications

| Type and FTP Directory | GRCh37 | GRCh38 |
|:-----------------------|--------:|--------:|
| All | 418,648 | 408,362 |
| Common | 59,576 | 59,150 |
| Pathogenic | 4,337 | 4,206 |
| Somatic | 15,103 | 15,077 |

39 | 40 | All files are available in **bed**, **bedpe**, and **tsv** formats: 41 | 42 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/#github) 43 | 44 | The variant types in the NR "duplications" files are: 45 | 46 | * copy_number_gain 47 | * copy_number_variation 48 | * duplication 49 | * tandem_duplication 50 | 51 | ## Records in the NR SV duplications files 52 | 53 | ### Please note: 54 | 55 | * The fields type, method, analysis, platform, variant, study, clinical_significance, clinvar_accession, and gene may contain multiple values. 56 | * Each of the values is associated with one or more calls found in the variant field. 57 | * The values in the variant field are "dbVar call accessions". 58 | 59 | * Records in the NR SV duplications files contain the following tab-separated fields. 60 | 61 | | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession 62 | 63 | 64 | ## Example record 1: 65 | 66 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size 67 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|---------|--------------------|------------------|------ 68 | 15 | 90243115 | 90477618 | 1 | copy_number_gain | Oligo_aCGH | Probe_signal_intensity | NA | ClinGen_Laboratory-Submitted | nssv13652018 | Uncertain significance | SCV000495160 | medium 69 | 70 | ### Explanation: 71 | 72 | * The non-redundant coordinates for this record in dbVar are chr15, with 73 | an outermost start of 90243115 and outermost stop of 90477618. 74 | 75 | * The variant_count of 1 indicates there is only one ssv with an exact match to 76 | the given placement. 
This count does not include SVs with a partial match. 77 | 78 | * The variant_type is "copy_number_gain". 79 | 80 | * The method and the analysis indicate how the one variant was evaluated. 81 | 82 | * NA indicates that no platform was submitted for this variant. 83 | 84 | * ClinGen_Laboratory-Submitted is the study name as found in dbVar. 85 | 86 | * The dbVar variant accession (the variant field) is "nssv13652018". 87 | 88 | * The clinical_assertion is "Uncertain significance". 89 | 90 | * The variant has an accession in ClinVar of SCV000495160. 91 | 92 | * bin_size = small (length < 50 bp), medium (50 bp <= length < 1000000 bp), large (length >= 1000000 bp). Length = outermost_stop - outermost_start + 1. 93 | 94 | * URLs using the study name or variant accession can be created to access the data 95 | in dbVar, e.g.: 96 | https://www.ncbi.nlm.nih.gov/dbvar/?term=ClinGen_Laboratory-Submitted 97 | https://www.ncbi.nlm.nih.gov/dbvar/?term=nssv13652018 98 | 99 | * From the latter page you may click on the "Variant Region ID" on the left to see 100 | the variant's region in the NCBI Variation Viewer at: 101 | https://www.ncbi.nlm.nih.gov/dbvar/variants/nsv2769232/ 102 | 103 | * The SCV accession can be used to find the record in ClinVar by searching on ClinVar's home page: 104 | https://www.ncbi.nlm.nih.gov/clinvar/ 105 | 106 | ## Example record 2: 107 | 108 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size 109 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|---------|--------------------|------------------|----- 110 | 8 | 75857739 | 75858539 | 2 | duplication;tandem_duplication | Sequencing | Split_read_and_paired-end_mapping;Read_depth_and_paired-end_mapping | Illumina HiSeq X Ten;Illumina HiSeq 2000 | Wong2016;Alsmadi2014 | essv26064592;nssv3988418 | | | medium 111 | 112 | ### Explanation: 113 | 114 | * This is a 
more complicated example duplication NR record containing multiple 115 | variants with multiple types, methods, and analyses from multiple studies, using 116 | multiple platforms. This record does not contain clinical_significance or a 117 | ClinVar accession. 118 | 119 | # Questions or feedback 120 | 121 | * Please email dbvar@ncbi.nlm.nih.gov or create an issue on this GitHub page. 122 | 123 | # Thanks! 124 | 125 | Thanks for your interest in the dbVar human "non-redundant structural variations" (NR SVs) 126 | data files from NCBI. 127 | 128 | Please check back for updates soon. 129 | -------------------------------------------------------------------------------- /Structural_Variant_Sets/Nonredundant_Structural_Variants/Insertions/README.md: -------------------------------------------------------------------------------- 1 | # dbVar Human Nonredundant Structural Variants – Insertions 2 | 3 | ### Work in progress - data subject to change 4 | 5 | Documentation updated: 11/27/2019 6 | 7 | ## Data Summary 8 | 9 |

10 | Insertions

| Type and FTP Directory | GRCh37 | GRCh38 |
|:-----------------------|----------:|----------:|
| All | 1,305,137 | 1,309,867 |
| Common | 124,902 | 124,833 |
| Pathogenic | 71 | 71 |
| Somatic | 0 | 0 |

39 | 40 | 41 | All files are available in **bed**, **bedpe**, and **tsv** formats: 42 | 43 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions/#github) 44 | 45 | The variant types in the NR "insertions" files are: 46 | 47 | * alu_insertion 48 | * insertion 49 | * line1_insertion 50 | * mobile_element_insertion 51 | * novel_sequence_insertion 52 | * sva_insertion 53 | 54 | ## Records in NR SV insertions files 55 | 56 | ### Please note: 57 | 58 | * The fields type, method, analysis, platform, variant, study, clinical_significance, clinvar_accession, and gene may contain multiple values. 59 | * Each of the values is associated with one or more calls found in the variant field. 60 | * The values in the variant field are "dbVar call accessions". 61 | 62 | * Records in the NR SV insertions files contain the following tab-separated fields. 63 | 64 | | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | min_insertion_length | max_insertion_length | 65 | 66 | 67 | ## Example record 1: 68 | 69 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size | min_insertion_length | max_insertion_length | 70 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|---------|-----------------------------------|-------------|------|----------------------|----------------------| 71 | 1 | 370037 | 370038 | 1 | insertion | Sequencing | Local_sequence_assembly | NA | Fan2017 | nssv14027289 | | | small | 94 | 94 72 | 73 | ### Explanation: 74 | 75 | * The non-redundant coordinates for this record in dbVar are chr1, with 76 | an outermost start of 370037 and outermost stop of 370038. 
77 | 78 | * The variant_count of 1 indicates there is only one SV with an exact match to the 79 | given placement. This count does not include SVs with a partial match. 80 | 81 | * The variant_type is "insertion". 82 | 83 | * The method of "Sequencing" and the analysis of "Local_sequence_assembly" 84 | indicate how the variant was evaluated. 85 | 86 | * 94 indicates the minimum insertion length, which is the same as the maximum insertion length since there is only one SV at this set of NR coordinates. 87 | 88 | * NA indicates that no platform was specified for this variant. 89 | 90 | * Fan2017 is the study name as found in dbVar. 91 | 92 | * The dbVar variant accession (the variant field) is "nssv14027289". 93 | 94 | * The clinical_assertion is not provided for this variant. 95 | 96 | * There is no clinvar_accession for this variant. 97 | 98 | * bin_size = small (length < 50 bp), medium (50 bp <= length < 1000000 bp), large (length >= 1000000 bp), where length = outermost_stop - outermost_start + 1. 99 | 100 | * URLs using the study name or variant accession can be created to access the data 101 | in dbVar, e.g.: 102 | https://www.ncbi.nlm.nih.gov/dbvar/?term=Fan2017 103 | https://www.ncbi.nlm.nih.gov/dbvar/?term=nssv14027289 104 | 105 | * From the latter page you may click on the "Variant Region ID" on the left to see 106 | the variant's region in the NCBI Variation Viewer at: 107 | https://www.ncbi.nlm.nih.gov/dbvar/variants/nsv3056167/ 108 | 109 | ## Example record 2: 110 | 111 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size | min_insertion_length | max_insertion_length 112 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|---------|-----------------------------------|-----------|--------|----------------------|----------------------| 113 | 1 | 147236943 | 147236943 | 2 | insertion;line1_insertion | Sequencing | 
Sequence_alignment;Split_read_and_paired-end_mapping | Sanger Sequencing;HiSeq 2000 | Levy2007;Gardner2017 | essv4283099;nssv14075648 | | | small | 10 | 6014 114 | 115 | ### Explanation: 116 | 117 | * This is a more complicated example insertion NR record containing multiple variants with multiple variant_types, methods, analyses, studies, platforms, and insertion_lengths. 118 | 119 | # Questions or feedback 120 | 121 | * Please email dbvar@ncbi.nlm.nih.gov or create an issue on this GitHub page. 122 | 123 | # Thanks! 124 | 125 | Thanks for your interest in the dbVar human "non-redundant structural variations" (nr SVs) 126 | data files from NCBI. 127 | 128 | Please check back for updates soon. 129 | -------------------------------------------------------------------------------- /Structural_Variant_Sets/Nonredundant_Structural_Variants/QuickStart.md: -------------------------------------------------------------------------------- 1 | # Quick Start 2 | 3 | 4 | ## A guide to using dbVar non-redundant structural variant (NR-SV) data and annotations 5 | 6 | **Updated** September 12, 2018 7 | 8 | 9 | ---------- 10 | 11 | 12 | 13 | The ***Use Cases*** below illustrate some of the ways dbVar NR SV data and associated annotations can be used. Please send feedback and any questions to dbvar-dev@ncbi.nlm.nih.gov. 14 | 15 | dbVar NR data are provided in GRCh37 and GRCh38 coordinates in TSV, BED, and BEDPE formats. Please choose the file(s) best suited to your application. For example, BED files are easy to compute on but are lightweight and lack metadata such as clinical assertions that may be useful to your analysis. 16 | 17 | **NOTE:** At the present time not all clinical SVs in ClinVar have been ported to dbVar. In addition, clinical assertions contained in NR-SV datasets are not diagnostic and should be interpreted with caution. For more information please visit [ClinVar's Clinical Significance](https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/) informational page. 
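When deciding which format to download, it can help to compare how many columns each file actually carries. The sketch below is illustrative only: `count_columns` is a hypothetical helper (not part of any dbVar tooling), it assumes tab-separated records and that lines starting with `#` are headers or comments, and the `.bed.gz` file name in the usage comment is assumed by analogy with the `.bedpe.gz` names used in the use cases.

```shell
# Hypothetical helper: print the number of tab-separated columns in the
# first data record read from stdin. Lines starting with '#' are assumed
# to be header or comment lines and are skipped, as are empty lines.
count_columns() {
    awk -F'\t' '!/^#/ && NF > 0 { print NF; exit }'
}

# Possible usage with local copies of the NR files (file names assumed):
#   zcat GRCh37.nr_deletions.bed.gz   | count_columns
#   zcat GRCh37.nr_deletions.bedpe.gz | count_columns
```

A file with more columns is likely to carry the extra metadata (method, study, clinical assertion); the authoritative column list remains the one given in the README for each variant set.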
18 | 19 | #### Disclaimer 20 | 21 | The information on this website is not intended for direct diagnostic use or medical decision-making without review by a genetics professional. Individuals should not change their health behavior solely on the basis of information contained on this website. NIH does not independently verify the submitted information. If you have questions about the information contained on this website, please see a health care professional. More information about [NCBI's disclaimer policy](https://www.ncbi.nlm.nih.gov/About/disclaimer.html) is available. 22 | 23 | 24 | 25 | 26 | ---------- 27 | 28 | ## Use Cases: 29 | 30 | 31 | - [Obtain human clinically-relevant CNVs](#obtain-human-clinically-relevant-cnvs) 32 | - [Find overlaps between NR-SV and candidate structural variants](#find-overlaps-between-nr-sv-and-candidate-structural-variants) 33 | - [Find overlaps between NR SVs and annotation datasets](#find-overlaps-between-nr-svs-and-annotation-datasets) 34 | 35 | 36 | 37 | ---------- 38 | 39 | ### Obtain human clinically-relevant CNVs 40 | 41 | #### Objectives: 42 | * Group NR SV records with clinically-relevant CNVs by assembly and by type 43 | * Obtain clinically-relevant NR SVs in bedpe format so they may be used for further investigation, e.g., to: 44 | * determine genome locations where pathogenic copy number loss variants are found in dbVar 45 | * determine overlaps between these variants and a candidate set of variants 46 | 47 | #### Summary of Results: 48 | Using the process described in this use case, the following results were obtained on Aug 20, 2018: 49 | 50 | Two files of NR SVs, each NR SV containing at least one clinically relevant CNV, 51 | were generated for assembly GRCh37, one for copy_number_loss, and one for copy_number_gain.
52 | 53 | 54 | | NR SV file | Records with clinical_significance | 55 | |:--------------------------------------:|:----------------------------------:| 56 | | GRCh37.copy_number_gain.with_SCV.bedpe | 13016 | 57 | | GRCh37.copy_number_loss.with_SCV.bedpe | 10872 | 58 | 59 | 60 | As an example of further processing of the output files, here is a table showing counts of 61 | NR SV records with pathogenic clinical_significance: 62 | 63 | 64 | | Count | GRCh37 gains | GRCh37 losses | 65 | |:---------------------:|:---------------------:|:---------------------:| 66 | |records with Pathogenic only|1974|3939| 67 | |records with Pathogenic and another clinical_significance|44|48| 68 | |records with one or more Pathogenic|2018|3987| 69 | | **Total records with clinical significance** | **13016** | **10872** | 70 | 71 | 72 | 73 | ##### 1. Go to ftp site for deletions: 74 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/) 75 | 76 | ##### 2. Download deletions bedpe file: 77 | ###### GRCh37.nr_deletions.bedpe.gz 78 | ##### 3. Get human clinically-relevant copy_number_loss records from deletions bedpe file: 79 | ```markdown 80 | zcat GRCh37.nr_deletions.bedpe.gz | grep SCV > GRCh37.copy_number_loss.with_SCV.bedpe 81 | ``` 82 | ##### 4. Go to ftp site for duplications: 83 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/) 84 | ##### 5. Download duplications bedpe file: 85 | ###### GRCh37.nr_duplications.bedpe.gz 86 | 87 | ##### 6. Get human clinically-relevant copy_number_gain records from duplications bedpe file: 88 | ```markdown 89 | zcat GRCh37.nr_duplications.bedpe.gz | grep SCV > GRCh37.copy_number_gain.with_SCV.bedpe 90 | ``` 91 | 92 | ##### 7. 
Link to dbVar web pages for additional variant details 93 | 94 | Create URLs for dbVar variant pages using the nssv accessions in the bedpe files. For example: 95 | 96 | [https://www.ncbi.nlm.nih.gov/dbvar/?term=nssv13649440](https://www.ncbi.nlm.nih.gov/dbvar/?term=nssv13649440) 97 | 98 | From that page you may click on the "Variant Region ID" on the left to see the variant's region in the NCBI Variation Viewer at: 99 | 100 | [https://www.ncbi.nlm.nih.gov/dbvar/variants/nsv533950/](https://www.ncbi.nlm.nih.gov/dbvar/variants/nsv533950/) 101 | 102 | ##### 8. Link to ClinVar web pages for additional clinical details 103 | 104 | Use an SCV accession obtained from the bedpe files, e.g. SCV000045941, to find the corresponding record in ClinVar by searching for the SCV on ClinVar's home page: 105 | 106 | [https://www.ncbi.nlm.nih.gov/clinvar/](https://www.ncbi.nlm.nih.gov/clinvar/) 107 | 108 | 109 | ---------- 110 | 111 | ### Find overlaps between NR-SV and candidate structural variants 112 | This case may be helpful if you would like to know how candidate structural variants compare to existing variants in dbVar. 113 | #### 1. Format your file of candidate structural variants as either gff, bed, or bedpe 114 | For example: 115 | variant_calls.gff: 116 | ```markdown 117 | chr13 dbVar copy_number_gain 100217593 100233402 . + . ID=chr13_100217593_100233402_copy_number_gain_1 118 | chr10 dbVar copy_number_loss 78342182 78366362 . + . ID=chr10_78342182_78366362_copy_number_loss_1 119 | chr5 dbVar copy_number_loss 113158233 113172463 . + . ID=ch 120 | ``` 121 | #### 2. 
Download NR files 122 | Visit these directories to select the appropriate files according to assembly (GRCh37, GRCh38) or file type (bed, bedpe): 123 | 124 | [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions) 125 | 126 | [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications) 127 | 128 | [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions) 129 | 130 | NOTE: the .bedpe files have the most information about the variants; however, you can use .bed if you only care about the placements. Refer to https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants for the list of columns included in the files. 131 | 132 | For example, bedpe files on GRCh37: 133 | 134 | [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/GRCh37.nr_deletions.bedpe.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/GRCh37.nr_deletions.bedpe.gz) 135 | 136 | [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/GRCh37.nr_duplications.bedpe.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/GRCh37.nr_duplications.bedpe.gz) 137 | 138 | [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions/GRCh37.nr_insertions.bedpe.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions/GRCh37.nr_insertions.bedpe.gz) 139 | 140 | 141 | #### 3. Unzip NR files 142 | ```markdown 143 | gunzip GRCh37.nr_deletions.bedpe.gz 144 | gunzip GRCh37.nr_duplications.bedpe.gz 145 | gunzip GRCh37.nr_insertions.bedpe.gz 146 | ``` 147 | #### 4. 
Run bedtools against each NR file 148 | For example, the default reports any overlap: 149 | ```markdown 150 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_deletions.bedpe 151 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_duplications.bedpe 152 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_insertions.bedpe 153 | ``` 154 | Example of narrower selection criteria, using at least 75% reciprocal overlap, and saving columns from both files to an output file: 155 | ```markdown 156 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_deletions.bedpe -wo -r -f .75 > variant_calls_X_nr_deletions.gff 157 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_duplications.bedpe -wo -r -f .75 > variant_calls_X_nr_duplications.gff 158 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_insertions.bedpe -wo -r -f .75 > variant_calls_X_nr_insertions.gff 159 | ``` 160 | #### 5. Interpret results: 161 | 162 | For example, identify unique copy_number_loss variants with at least 75% reciprocal overlap with NR deletions: 163 | ```markdown 164 | cut -f 9 variant_calls_X_nr_deletions.gff | sort -u | grep loss | wc -l 165 | 675 166 | ``` 167 | 168 | Identify unique copy_number_gain variants with at least 75% reciprocal overlap with NR duplications: 169 | ```markdown 170 | cut -f 9 variant_calls_X_nr_duplications.gff | grep gain | sort -u | wc -l 171 | 558 172 | ``` 173 | 174 | Identify any variants with at least 75% reciprocal overlap to an NR variant with clinical significance of Pathogenic: 175 | ```markdown 176 | grep SCV variant_calls_X_nr_*.gff | grep Pathogenic 177 | chr16 dbVar copy_number_loss 14945195 16363239 . + . ID=chr16_14945195_16363239_copy_number_loss_1 chr16 14809981 16477578 . -1 -1 chr16_14809981_16477578_del . . . copy_number_loss Oligo_aCGH Probe_signal_intensity NA ClinGen_Laboratory-Submitted nssv585211;nssv3396567 Pathogenic SCV000175002 large . . 1418045 178 | ``` 179 | The -wo option to bedtools intersect returns the full set of columns from the NR bedpe file.
You can use these columns to filter the results based on other criteria, such as method, platform, or study name. Refer to [https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants](https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants) for the list of columns included in the files. 180 | #### 6. Other options 181 | - Run bedtools with "-f 1" to find variants with 100% reciprocal overlap in NR. 182 | - Run bedtools with "-u" instead of "-wo" to see only unique IDs from the variant file. For example: 183 | ```markdown 184 | bedtools intersect -a variant_calls.gff -b GRCh37.nr_deletions.bedpe -u -r -f 1 > variant_calls_X_nr_deletions.unique.100pct.gff 185 | more variant_calls_X_nr_deletions.unique.100pct.gff 186 | chr2 dbVar copy_number_loss 165850456 165864123 . + . ID=chr2_165850456_165864123_copy_number_loss_7 187 | 188 | grep chr2_165850456_165864123_copy_number_loss_7 variant_calls_X_nr_deletions.gff 189 | ``` 190 | 191 | 192 | ---------- 193 | 194 | ### Find overlaps between NR SVs and annotation datasets 195 | This case may be helpful if you would like to know if candidate structural variants overlap other NCBI annotation resources. 196 | 197 | #### 1. Format your file of candidate structural variants as either gff, bed, or bedpe. For example: 198 | variant_calls.gff: 199 | 200 | 201 | ``` 202 | chr12 dbVar short_tandem_repeat 132387673 132387821 . + . ID=chr12_132387673_132387821_short_tandem_repeat_1 203 | chr5 dbVar alu_insertion 62339523 62339830 . + . ID=chr5_62339523_62339830_alu_insertion_1 204 | chr9 dbVar insertion 137921881 137921881 . + . ID=chr9_137921881_137921881_insertion_7 205 | chr7 dbVar deletion 67715920 67716363 . + . ID=chr7_67715920_67716363_deletion_2 206 | ``` 207 | 208 | 209 | 210 | #### 2. 
Download and prepare annotation files 211 | GRCh37: 212 | Download the following: 213 | RefSeq annotations file, which include genes, exons, and regulatory regions: [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/ref_GRCh37.p13_top_level.gff3.gz](ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/ref_GRCh37.p13_top_level.gff3.gz) 214 | 215 | Paralogous alignments: [ftp://ftp.ncbi.nlm.nih.gov/pub/murphyte/PSV/GRCh37.p13_AR105/all.annot.gff3.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/murphyte/PSV/GRCh37.p13_AR105/all.annot.gff3.gz) 216 | 217 | Dosage Sensitivity (more information here: [https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd45/](https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd45/)): [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/gvf/nstd45.GRCh37.variant_region.gvf.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/gvf/nstd45.GRCh37.variant_region.gvf.gz) 218 | 219 | Combine files and translate the NCBI accessions to chromosome names 220 | ```markdown 221 | gunzip ref_GRCh37.p13_top_level.gff3.gz all.annot.gff3.gz nstd45.GRCh37.variant_region.gvf.gz 222 | cat ref_GRCh37.p13_top_level.gff3 all.annot.gff3 nstd45.GRCh37.variant_region.gvf | 223 | grep -v "###" | grep "^NC_" | sed "s/NC_000001.10/chr1/g" | sed "s/NC_000002.11/chr2/g" | sed "s/NC_000003.11/chr3/g" | 224 | sed "s/NC_000004.11/chr4/g" | sed "s/NC_000005.9/chr5/g" | sed "s/NC_000006.11/chr6/g" | sed "s/NC_000007.13/chr7/g" | 225 | sed "s/NC_000008.10/chr8/g" | sed "s/NC_000009.11/chr9/g" | sed "s/NC_000010.10/chr10/g" | sed "s/NC_000011.9/chr11/g" | 226 | sed "s/NC_000012.11/chr12/g" | sed "s/NC_000013.10/chr13/g" | sed "s/NC_000014.8/chr14/g" | sed "s/NC_000015.9/chr15/g" | 227 | sed "s/NC_000016.9/chr16/g" | sed "s/NC_000017.10/chr17/g" | sed "s/NC_000018.9/chr18/g" | sed "s/NC_000019.9/chr19/g" | 228 | sed "s/NC_000020.10/chr20/g" | sed "s/NC_000021.8/chr21/g" | sed "s/NC_000022.10/chr22/g" | 
sed "s/NC_000023.10/chrX/g" | 229 | sed "s/NC_000024.9/chrY/g" | sed "s/NC_012920.1/chrMT/g" | sort > annotations_37.gff 230 | ``` 231 | 232 | GRCh38: 233 | Download the following: 234 | 235 | RefSeq annotations file, which include genes, exons, and regulatory regions: [ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz](ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/GFF/ref_GRCh38.p12_top_level.gff3.gz) 236 | 237 | Assembly anomalies (more information here: [https://www.ncbi.nlm.nih.gov/genome/tools/remap/docs/alignments](https://www.ncbi.nlm.nih.gov/genome/tools/remap/docs/alignments)): [ftp://ftp.ncbi.nlm.nih.gov/pub/remap/Homo_sapiens/current/GCF_000001405.25_GRCh37.p13/GCF_000001405.38_GRCh38.p12/GCF_000001405.25-GCF_000001405.38.gff](ftp://ftp.ncbi.nlm.nih.gov/pub/remap/Homo_sapiens/current/GCF_000001405.25_GRCh37.p13/GCF_000001405.38_GRCh38.p12/GCF_000001405.25-GCF_000001405.38.gff) 238 | 239 | Paralogous alignments: [ftp://ftp.ncbi.nlm.nih.gov/pub/murphyte/PSV/GRCh38.p2_AR107/GRCh38.p2_all.align.gff3.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/murphyte/PSV/GRCh38.p2_AR107/GRCh38.p2_all.align.gff3.gz) 240 | 241 | Dosage Sensitivity (more information here: [https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd45/](https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd45/)): [ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/gvf/nstd45.GRCh38.variant_region.gvf.gz](ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Homo_sapiens/by_study/gvf/nstd45.GRCh38.variant_region.gvf.gz) 242 | 243 | Combine files and translate the NCBI accessions to chromosome names 244 | ```markdown 245 | gunzip ref_GRCh38.p12_top_level.gff3.gz GRCh38.p2_all.align.gff3.gz nstd45.GRCh38.variant_region.gvf.gz 246 | cat GCF_000001405.25-GCF_000001405.38.gff ref_GRCh38.p12_top_level.gff3 GRCh38.p2_all.align.gff3 nstd45.GRCh38.variant_region.gvf | 247 | grep -v "###" | grep "^NC_" | sed "s/NC_000001.11/chr1/g" | sed "s/NC_000002.12/chr2/g" | sed 
"s/NC_000003.12/chr3/g" | 248 | sed "s/NC_000004.12/chr4/g" | sed "s/NC_000005.10/chr5/g" | sed "s/NC_000006.12/chr6/g" | sed "s/NC_000007.14/chr7/g" | 249 | sed "s/NC_000008.11/chr8/g" | sed "s/NC_000009.12/chr9/g" | sed "s/NC_000010.11/chr10/g" | sed "s/NC_000011.10/chr11/g" | 250 | sed "s/NC_000012.12/chr12/g" | sed "s/NC_000013.11/chr13/g" | sed "s/NC_000014.9/chr14/g" | sed "s/NC_000015.10/chr15/g" | 251 | sed "s/NC_000016.10/chr16/g" | sed "s/NC_000017.11/chr17/g" | sed "s/NC_000018.10/chr18/g" | sed "s/NC_000019.10/chr19/g" | 252 | sed "s/NC_000020.11/chr20/g" | sed "s/NC_000021.9/chr21/g" | sed "s/NC_000022.11/chr22/g" | sed "s/NC_000023.11/chrX/g" | 253 | sed "s/NC_000024.10/chrY/g" | sed "s/NC_012920.1/chrMT/g" | sort > annotations_38.gff 254 | ``` 255 | 256 | #### 3. Run bedtools 257 | This shows how you could identify the overlaps between your candidate structural variants and the downloaded annotation files. 258 | These examples identify overlaps with at least 75% reciprocal overlap, and the output files will contain columns from both the variant and annotation files. 259 | If your data is on GRCh37: 260 | ```markdown 261 | bedtools intersect -a variant_calls.gff -b annotations_37.gff -wo -r -f .75 > variant_calls.gff_X_annotations_37.gff.min75 262 | ``` 263 | If your data is on GRCh38: 264 | ```markdown 265 | bedtools intersect -a variant_calls.gff -b annotations_38.gff -wo -r -f .75 > variant_calls.gff_X_annotations_38.gff.min75 266 | ``` 267 | 268 | #### 4. Interpret Results 269 | Identify unique variants with >75% reciprocal overlap with GRCh38 annotation files: 270 | ```markdown 271 | cut -f 9 variant_calls.gff_X_annotations_38.gff.min75 | sort -u| wc -l 272 | ``` 273 | Examine the result file. 
The GFF files used for annotation will contain a rich set of descriptions in the attribute column, for example: 274 | ```markdown 275 | cut -f 18 variant_calls.gff_X_annotations_38.gff.min75 | more 276 | ID=rna15957;Parent=gene5140;Dbxref=GeneID:105373346,Genbank:XR_002959474.1;Name=XR_002959474.1;gbkey= 277 | ncRNA;gene=LOC105373346;model_evidence=Supporting evidence includes similarity to: 5 ESTs%2C 20 long 278 | SRA reads%2C and 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 4 279 | samples with support for all annotated introns;product=uncharacterized LOC105373346%2C transcript var 280 | iant X25;transcript_id=XR_002959474.1 281 | ``` 282 | 283 | NOTE: For intersections between GFF and GFF files, col 19 will contain the number of overlapping bases. 284 | 285 | ---------- 286 | -------------------------------------------------------------------------------- /Structural_Variant_Sets/Nonredundant_Structural_Variants/README.md: -------------------------------------------------------------------------------- 1 | # dbVar Human Nonredundant Structural Variants (NR SVs) 2 | 3 | ### Work in progress - data subject to change 4 | 5 | Documentation updated: 04/23/2020 6 | 7 | ## Data Summary 8 | 9 | 10 | See also [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/release_notes/NR_stats.latest.txt](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/release_notes/NR_stats.latest.txt#github) 11 | 12 | FTP Directory: nonredundant
13 | Last modified: Jul 5, 2022
14 | File types: bed, bedpe, tsv
15 |

16 | Deletions

| Type and FTP Directory | GRCh37 | GRCh38 |
|:-----|:---------|:---------|
| All | 1,944,208 | 1,944,574 |
| Common | 211,888 | 211,465 |
| Pathogenic | 17,333 | 17,160 |
| Somatic | 23,365 | 23,324 |
| All-ACMG | 5,827 | 5,838 |
| Common-ACMG | 392 | 403 |
| Pathogenic-ACMG | 2,840 | 2,841 |
| Somatic-ACMG | 966 | 965 |

66 | Duplications

| Type and FTP Directory | GRCh37 | GRCh38 |
|:-----|:---------|:---------|
| All | 658,229 | 659,117 |
| Common | 59,576 | 59,150 |
| Pathogenic | 4,655 | 4,531 |
| Somatic | 15,105 | 15,079 |
| All-ACMG | 4,608 | 4,682 |
| Common-ACMG | 125 | 129 |
| Pathogenic-ACMG | 903 | 902 |
| Somatic-ACMG | 811 | 810 |

116 | Insertions

| Type and FTP Directory | GRCh37 | GRCh38 |
|:-----|:---------|:---------|
| All | 1,667,972 | 1,678,782 |
| Common | 121,240 | 121,187 |
| Pathogenic | 182 | 182 |
| All-ACMG | 3,413 | 3,489 |
| Common-ACMG | 255 | 261 |
| Pathogenic-ACMG | 55 | 55 |

155 | 156 | All files are available in **bed**, **bedpe**, and **tsv** formats: 157 | 158 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/#github) 159 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/insertions//#github) 160 | [https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/](https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications//#github) 161 | 162 | 163 | 164 | ## Description of NR SV data files: 165 | 166 | * Sets of "non-redundant structural variations" (NR SVs) derived from dbVar are 167 | available via FTP as tab delimited files by assembly, GRCh37 & GRCh38, and type of variant. 168 | 169 | * Non-redundant refers to variant coordinates, i.e. chr, outermost start, and 170 | outermost stop. Please note: the non-redundant coordinates are based strictly 171 | on exact overlap of coordinates, not on partial overlaps. 
172 | 173 | * Other features of NR SV files: 174 | * variant calls are from germline samples only (no somatic) 175 | * placements are "BestAvailable" on the assembly (guarantees no duplicate placements for a variant) 176 | * placements are on finished chromosomes only (not on NT_ or NW_ contigs) 177 | * placements are 1-based in the .tsv files 178 | * placements are zero-based start and 1-based stop in .bed and .bedpe files 179 | * insertion_length is set to sequence length if the sequence was submitted to dbVar without a specific insertion_length 180 | * insertions submitted to dbVar without insertion_length or submitted sequence are not included in the NR files 181 | 182 | * Other files based on NR SV files: 183 | * NR SV files annotated with overlapping ACMG genes 184 | * NR SV files in .bed format 185 | * NR SV files in .bedpe format 186 | 187 | 188 | 189 | 190 | ## File format: 191 | 192 | Column | NR SV TSV File | BED File | BEDPE File | 193 | -----|:---------|:---|:------ 194 | |1|chr|chr|chr| 195 | |2|outermost_start (1-based)|outermost_start (0-based)|outermost_start (0-based)| 196 | |3|outermost_stop (1-based)|outermost_stop (1-based)|outermost_stop (1-based)| 197 | |4|variant_count|NR_SV_id|.  (chrom2)| 198 | |5|variant_type| |-1  (start2)| 199 | |6|method| |-1  (end2)| 200 | |7|analysis| |NR_SV_id| 201 | |8|platform| |.  (score)| 202 | |9|study| |.  (strand1)| 203 | |10|variant| |.  
(strand2)|
|11|clinical_assertion| |variant_count|
|12|clinvar_accession| |variant_type|
|13|bin_size| |method|
|14|min_insertion_length*| |analysis|
|15|max_insertion_length*| |platform|
|16| | |study|
|17| | |variant|
|18| | |clinical_assertion|
|19| | |clinvar_accession|
|20| | |bin_size|
|21| | |min_insertion_length|
|22| | |max_insertion_length|

Please note:
* \* = NR_SV TSV fields 14 and 15 are in nr_insertion.tsv files only
* NR_SV_id = chr_outermost_start_outermost_stop_type, where type is del, dup, or ins
* bin_size = small (length < 50 bp), medium (50 bp <= length < 1,000,000 bp), large (length >= 1,000,000 bp). Length = outermost_stop - outermost_start + 1.
* In all cases, bedpe columns 4 through 6, and 8 through 10, are populated with default values per the bedpe specification
* The bed and bedpe specifications are found here: https://bedtools.readthedocs.io/en/latest/content/general-usage.html


## Some fields may have multiple values:
* The fields type, method, analysis, platform, variant, study, clinical_significance, clinvar_accession, and gene may contain multiple values.
* Each of the values is associated with one or more calls found in the variant field.
* The values in the variant field are "dbVar call accessions".
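
The bin_size rule and the multi-value fields described above can be sketched in a few lines of shell. This is a minimal illustration, not part of the dbVar tooling: the helper names `bin_size` and `explode_calls` are hypothetical, the input is assumed to be an uncompressed NR SV .tsv file with the column order given in the file format table, and multiple values are assumed to be joined with semicolons, as in the ACMG example record in this README.

```shell
# Hypothetical helper: classify a placement using the documented bin_size rule.
# Usage: bin_size OUTERMOST_START OUTERMOST_STOP   (1-based, as in the .tsv files)
bin_size() {
  len=$(( $2 - $1 + 1 ))            # Length = outermost_stop - outermost_start + 1
  if [ "$len" -lt 50 ]; then
    echo small
  elif [ "$len" -lt 1000000 ]; then
    echo medium
  else
    echo large
  fi
}

# Hypothetical helper: print one line per dbVar call accession.
# Assumes tab-separated input, column 10 = variant, values joined by ';'.
explode_calls() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                               # skip comment lines, if any
    {
      n = split($10, calls, ";")                # one entry per call accession
      for (i = 1; i <= n; i++) print $1, $2, $3, calls[i]
    }'
}
```

For example, `bin_size 1889055 1889055` reports `small`, which matches the bin_size shown for the insertion example record, and `gunzip -c GRCh38.nr_insertions.tsv.gz | explode_calls` would list each call accession next to its NR coordinates.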

# Records in the NR SV files:

## Records in the deletions or duplications NR SV files, e.g.:

| chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size |
|----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|-------------|-------------------|-------------------|---------|
| 15 | 98085101 | 101843270 | 1 | copy_number_loss | Oligo_aCGH | Probe_signal_intensity | Agilent ISCA 44K | ClinGen_Laboratory-Submitted | nssv14082018 | Pathogenic | SCV000586438 |

## Records in the insertions NR SV files, e.g.:

| chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession | bin_size | min_insertion_length | max_insertion_length |
|----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|-------------|-------------------|---------|-------|----------|----------------------|
| 1 | 1889055 | 1889055 | 1 | alu_insertion | Sequencing | Split_read_and_paired-end_mapping | HiSeq 2000 | Gardner2017 | nssv14051747 | | | small | 258 | 258 |

* Only insertion records have the min_insertion_length and max_insertion_length fields

## Records in the NR SV .bed files, e.g.
247 | 248 | chr | outermost_start | outermost_stop | NR_SV_id 249 | ----|-----------------|----------------|----- 250 | chr1 | 0 | 10000 | chr1_0_10000_del 251 | 252 | * Placements in bed files are zero-based start and one-based stop 253 | * name is comprised of chromosome, outermost_start, outermost_stop, and type (del, dup, or ins) 254 | * NR SV .bed files may be used with a variety of tools as shown in the tutorial: 255 | 256 | https://github.com/ncbi/dbvar/blob/master/Structural_Variant_Sets/Nonredundant_Structural_Variants/ToolGuide.md 257 | 258 | 259 | ## Records in the NR SV .bedpe files, e.g. 260 | 261 | chr | outermost_start | outermost_stop | chrom2 | start2 | end2 | NR_SV_id | score | strand1 | strand2 | variant_count | variant_type | method | analysis | platform | study | variant | clinical_assertion | clinvar_accession 262 | -------|--------|------|--------|--------|------|------|-------|---------|---------|----------|--------------|--------|----------|----------|-------|----|--------------------|------------------ 263 | chr1 | 14873 | 7527302 | . | -1 | -1 | chr1_14873_7527302_del | . | . | . 
| 1 | copy_number_loss | Oligo_aCGH | Probe_signal_intensity | NA | ClinGen_Laboratory-Submitted | nssv13638713 | Pathogenic | SCV000495999 264 | 265 | * Placements in bedpe files are zero-based start and one-based stop 266 | * bedpe files are normally used for disjointed genomic sequences 267 | * the NR bedpe files do not contain disjointed genomic sequences, and instead use default values for chrom2, start2 and end2 268 | * the NR bedpe files use the optional fields starting at field 11 to hold: chr, outermost_start, outermost_stop, name, and all the additional fields which are in the NR SV files 269 | 270 | # ACMG files as "proof of principle" for a "use case" 271 | 272 | ## Records in the NR SV ACMG files contain a field for the ACMG gene that overlaps the variant, e.g.: 273 | 274 | chr | outermost_start | outermost_stop | variant_count | variant_type | method | analysis | platform | study | variant | clinical_significance | clinvar_accession | bin_size | min_insertion_length | max_insertion_length | gene 275 | ----|-----------------|----------------|---------------|--------------|--------|----------|----------|-------|-------------|-------------------|-------------------|----------------------|----------------------|-----|----- 276 | 3| 30621876 | 30621876 | 2 | alu_insertion | Merging;Sequencing | Merging;Split_read_and_paired-end_mapping | See merged experiments;HiSeq 2000 | 1000_Genomes_Consortium_Phase_3_SV_Submission;Gardner2017 | essv18243203;nssv14059593 | | | small |279 | 280 | TGFBR2 277 | 278 | ### Caveats for the ACMG files: 279 | 280 | * ACMG files are provided as a "proof of principle" for a "use case" for the NR SV files. 281 | * ACMG files are based on region/gene overlaps, and are missing a few call/gene overlaps in the cases where the parent variant region does not include all of the variant call placement and the region does not overlap the gene or is upstream or downstream of the gene. 
282 | * Placements in the ACMG files do not account for confidence intervals, even though the overlaps reported in the file were determined using confidence intervals. This results in a few of the overlaps that are not supported by the placements as reported in the file. 283 | 284 | For information on ACMG genes please see: 285 | https://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/ 286 | 287 | # Methods and Analyses 288 | ## example values 289 | 290 | | Methods include, e.g. | Analyses include, e.g. | 291 | |:--------------------:|:--------------------:| 292 | | BAC_aCGH | Curated | 293 | | Curated | Genotyping | 294 | | MLPA | Local_sequence_assembly | 295 | | Merging | Merging | 296 | | Multiple | Multiple | 297 | | Not_provided | Not_provided | 298 | | Oligo_aCGH | Optical_mapping | 299 | | Optical_mapping | Other | 300 | | ROMA | Paired-end_mapping | 301 | | SNP_array | Probe_signal_intensity | 302 | | Sequencing | Read_depth | 303 | | qPCR | SNP_genotyping_analysis | 304 | | ROMA | Sequence_alignment | 305 | | SNP_array | Split_read_mapping | 306 | | Sequencing | de_novo_sequence_assembly | 307 | 308 | # README files for deletions, insertions, and duplications 309 | 310 | Please see README files for deletions, insertions, and duplications for example 311 | records and additional details. 312 | 313 | * Deletions: https://github.com/ncbi/dbvar/blob/master/Structural_Variant_Sets/Nonredundant_Structural_Variants/Deletions/README.md 314 | * Insertions: https://github.com/ncbi/dbvar/blob/master/Structural_Variant_Sets/Nonredundant_Structural_Variants/Insertions/README.md 315 | * Duplications: 316 | https://github.com/ncbi/dbvar/blob/master/Structural_Variant_Sets/Nonredundant_Structural_Variants/Duplications/README.md 317 | 318 | # Brief Outline of algorithm used to generate NR-SVs. 319 | 320 | The algorithm makes use of previously existing scripts. 

Input files are generated from the dbVar database with tab-separated values and
contain SVs by assembly, type, and other relevant fields.

Selected type files are grouped into "aggregated type files" as specified above,
by chr.

The "aggregated type files" are converted into XML records containing all the
necessary fields required by the NR process.

The XML is then parsed to generate SV records with coordinates, type,
method, analysis, platform, insertion_length, SV accession, and study.

The SV records are then processed to generate NR SV tab-separated value (tsv) files by assembly and type, as described above, e.g.
* GRCh38.nr_deletions.tsv.gz

# Tutorials

Examples of using the NR files with various tools and browsers can be found in: https://github.com/ncbi/dbvar/blob/master/Structural_Variant_Sets/Nonredundant_Structural_Variants/ToolGuide.md

# Questions or feedback

* Please email dbvar@ncbi.nlm.nih.gov or create an issue on this GitHub page.

# Thanks!

Thanks for your interest in the dbVar human "non-redundant structural variations" (NR SVs)
data files from NCBI.

Please check back soon for further updates.

--------------------------------------------------------------------------------
/Structural_Variant_Sets/Nonredundant_Structural_Variants/ToolGuide.md:
--------------------------------------------------------------------------------
# Tool Guide – How to Use dbVar's NR SV Data Files
## Purpose

This tutorial demonstrates how to intersect dbVar's non-redundant (NR) files with other genomic interval files using popular tools and browsers. By the end of the tutorial you should be able to, for example, calculate overlaps between genes and NR deletions throughout the human genome.
5 | 6 | ## Tools 7 | - [Input Files](#input-files) 8 | - [Bedtools](#bedtools) 9 | - [Galaxy](#galaxy) 10 | - [UCSC Genome Browser](#ucsc-genome-browser) 11 | - [NCBI Sequence Viewer](#ncbi-sequence-viewer) 12 | - [Installation Notes (Linux)](#installation-notes-linux) 13 | 14 | ## Input Files 15 | ### dbVar NR Files 16 | NOTES: 17 | - Many genome browsers can access dbVar's NR data files directly using the URLs provided, avoiding the need to download. 18 | - Some locally-installed tools may require you to download data files before use. 19 | - BED files have 0-based starts and 1-based stops. (Standard non-BED dbVar files use 1-based starts.) 20 | - Chromosome names contain "chr", e.g., **chrX**. 21 | - Some scenarios may require you to edit data files after download, using any plain text editor. Instructions are provided (see ***Post-Download instructions*** below). For example, the UCSC Genome Browser requires BED files to include a track name and description, and the removal of any placements on chrMT (usually located at the end of a file). 22 | - All **FTP directory/ file** paths in the table below should be prefixed with: ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/... 
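
As a rough sketch of two of the notes above, the following shell helpers build a full download URL from the directory and file names listed in the table, and shift BED starts to the 1-based convention used by the non-BED dbVar files. The names `NR_BASE`, `nr_url`, and `bed_to_one_based` are hypothetical and introduced only for illustration. The `curl` call in the comment assumes `curl` is installed and that the FTP server is reachable.

```shell
# Common prefix for all NR file paths (taken from the note above).
NR_BASE="ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant"

# Hypothetical helper: build a full URL from a directory and a file name.
# Usage: nr_url deletions GRCh38.nr_deletions.bed.gz
nr_url() {
  printf '%s/%s/%s\n' "$NR_BASE" "$1" "$2"
}

# Hypothetical helper: convert BED intervals (0-based start, 1-based stop)
# to 1-based starts. Track, browser, and comment lines are dropped.
bed_to_one_based() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^(track|browser|#)/ { next }
       { $2 = $2 + 1; print }'        # only the start shifts; the stop is unchanged
}

# Example download (needs network access, so it is left commented out):
# curl -O "$(nr_url deletions GRCh38.nr_deletions.bed.gz)"
```

Only the start column changes in the conversion because a BED stop is already equal to the 1-based stop.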
23 | 24 | |File Content|File format|FTP directory/ file|Post-Download instructions| 25 | |------------|-----------|--------|--------------------------| 26 | |non-redundant ***Deletions***|BED|...deletions/ GRCh38.nr_deletions.bed.gz|```gunzip GRCh38.nr_deletions.bed.gz; echo "track name=\"dbVar NR deletions\" description=\"non-redundant deletions from dbVar\"" > GRCh38.nr_deletions_ucsc.bed; grep -v ^chrMT GRCh38.nr_deletions.bed >> GRCh38.nr_deletions_ucsc.bed```| 27 | |non-redundant ***Duplications***|BED|...duplications/ GRCh38.nr_duplications.bed.gz|```gunzip GRCh38.nr_duplications.bed.gz```| 28 | |non-redundant ***Insertions***|BED|...insertions/ GRCh38.nr_insertions.bed.gz|```gunzip GRCh38.nr_insertions.bed.gz```| 29 | 30 | 31 | ### Query Files 32 | 33 | Download and run instructions to generate these modified files for testing intersections using locally-installed tools, such as **bedtools:** 34 | 35 | NOTES: 36 | - FTP files are located at ftp://ftp.ncbi.nlm.nih.gov 37 | - clinvar_chr.vcf 38 | - "chr" in chromosome names for consistency with the .bed files 39 | - genes_chr.gff 40 | - for simplicity, filter just the genes on finished chromosomes 41 | - convert chromosome accessions to names for consistency with the .bed files 42 | 43 | **NOTE:** these types of modifications may not be necessary depending on the format of your specific input file. 
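
The chromosome renaming described in the notes above can also be sketched without naming each accession version. This is an illustrative alternative, not the documented procedure. The function name `rename_chroms` is hypothetical, and the sketch assumes that RefSeq accessions NC_000001 through NC_000024 correspond to chr1 through chr22, chrX, and chrY, and NC_012920 to chrMT, which is the same mapping used by the `sed` commands in this guide. Unlike those commands, it rewrites only column 1 and ignores the accession version.

```shell
# Hypothetical helper: rename RefSeq chromosome accessions in column 1 of a GFF.
# Comment lines pass through unchanged; records on other sequences
# (for example NT_ or NW_ contigs) are dropped.
rename_chroms() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }
    $1 ~ /^NC_0000(0[1-9]|1[0-9]|2[0-4])\./ {
      n = substr($1, 8, 2) + 0               # last two digits of the accession number
      if (n == 23)      $1 = "chrX"
      else if (n == 24) $1 = "chrY"
      else              $1 = "chr" n
      print
      next
    }
    $1 ~ /^NC_012920\./ { $1 = "chrMT"; print }'
}
```

A possible use would be `rename_chroms < ref_GRCh38.p12_top_level.gff3 > renamed.gff`, where `renamed.gff` is a placeholder output name.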
44 | 45 | |File Content|File format|FTP directory/ file|Modified File Name|Post-Download instructions| 46 | |------------|-----------|-----------------|------------------|--------------------------| 47 | |Clinical variants|.vcf|/pub/clinvar/vcf_GRCh38/ clinvar.vcf.gz|clinvar_chr.vcf|gunzip clinvar.vcf.gz; grep "^#" clinvar.vcf > clinvar_chr.vcf; grep -v "^#" clinvar.vcf \| sed "s/^/chr/" >> clinvar_chr.vcf| 48 | |Human genes|.gff|/refseq/H_sapiens/H_sapiens/GFF/ ref_GRCh38.p12_top_level.gff3.gz|genes_chr.gff|gunzip ref_GRCh38.p12_top_level.gff3.gz; grep "^#" ref_GRCh38.p12_top_level.gff3 > genes_chr.gff; cat ref_GRCh38.p12_top_level.gff3 \| awk -F'\t' '$3~/^gene$/' \| grep "^NC_" \| sed "s/NC_000001.11/chr1/g" \| sed "s/NC_000002.12/chr2/g" \| sed "s/NC_000003.12/chr3/g" \| sed "s/NC_000004.12/chr4/g" \| sed "s/NC_000005.10/chr5/g" \| sed "s/NC_000006.12/chr6/g" \| sed "s/NC_000007.14/chr7/g" \| sed "s/NC_000008.11/chr8/g" \| sed "s/NC_000009.12/chr9/g" \| sed "s/NC_000010.11/chr10/g" \| sed "s/NC_000011.10/chr11/g" \| sed "s/NC_000012.12/chr12/g" \| sed "s/NC_000013.11/chr13/g" \| sed "s/NC_000014.9/chr14/g" \| sed "s/NC_000015.10/chr15/g" \| sed "s/NC_000016.10/chr16/g" \| sed "s/NC_000017.11/chr17/g" \| sed "s/NC_000018.10/chr18/g" \| sed "s/NC_000019.10/chr19/g" \| sed "s/NC_000020.11/chr20/g" \| sed "s/NC_000021.9/chr21/g" \| sed "s/NC_000022.11/chr22/g" \| sed "s/NC_000023.11/chrX/g" \| sed "s/NC_000024.10/chrY/g" \| sed "s/NC_012920.1/chrMT/g" >> genes_chr.gff| 49 | 50 | 51 | # Bedtools: 52 | ### Compute Intersections 53 | Refer to: 54 | - 55 | - **Installation Notes** section at end of this document. 
56 | 57 | To find ClinVar variants that intersect dbVar deletions, run: 58 | 59 | `bedtools intersect -a clinvar_chr.vcf -b GRCh38.nr_deletions.bed -u > clinvar_dbvar_deletions.vcf` 60 | 61 | To find genes that intersect dbVar insertions, run: 62 | 63 | `bedtools intersect -a genes_chr.gff -b GRCh38.nr_insertions.bed -u > gene_dbvar_insertions.gff` 64 | 65 | *NOTE:* The **-u** option provides a **unique** set of overlaps. 66 | 67 | 68 | # Galaxy: 69 | ### Compute Intersections 70 | - Go to the online Galaxy server: 71 | - If the server is down, select an alternate server from the displayed list. 72 | - Select **Get Data** from the **Tools** menubar 73 | - Select **Upload File** from your computer under **Get Data** 74 | - In the **Download from web or upload from disk** window 75 | - select **Paste/Fetch data** 76 | - Paste the FTP URL in the text box, for example: 77 | - Select **Start** 78 | - Wait for the file to complete loading 79 | - In the **Download from web or upload from disk** window 80 | - Select **Paste/Fetch data** 81 | - Paste the FTP URL in the text box, for example: 82 | - Select **Start** 83 | - Wait for the file to complete loading 84 | - Select **Close** 85 | - Both files should now be displayed in the **History** column 86 | - Select **Operate on Genomic Intervals** from the **Tools** menubar (you may need to scroll down) 87 | - Select **Intersect** under **Operate on Genomic Intervals** 88 | - In the **Intersect the intervals of two datasets (Galaxy Version 1.0.0)** window: 89 | - Select Downloaded file 1 as the First dataset 90 | - Select Downloaded file 2 as the Second dataset 91 | - Select **Execute** 92 | - Wait for the job to complete 93 | - When the job is complete, a new file will be displayed in the **History** column, named "**Intersect on data ... 
and data ...**"
- Click the **Eye** icon on the new intersect results file to view the data
- The intersection results will be displayed in the center column of the page
- For example:
![Galaxy](../../images/galaxy.PNG?raw=true "Galaxy")

### Compute Intersections (BED and VCF)
- Select **Get Data** from the **Tools** menubar
- Select **Upload File from your computer** under **Get Data**
- In the **Download from web or upload from disk** window
- Select **Choose local file**
- Navigate to the clinvar_chr.vcf file or any other local VCF file
- Select **Start**
- Wait for the file to complete loading
- Select **Close**
- Select **NGC VCF Manipulating** from the **Tools** menubar (you may need to scroll down)
- Select **VCF-BEDintersect**
- In the **VCF-BEDintersect: Intersect VCF and BED datasets (Galaxy Version 1.0.0)** window:
- Select VCF file, for example, clinvar_chr.vcf
- Select BED file, for example, the downloaded GRCh38.nr_deletions.bed.gz
- Select **Execute**
- Wait for the job to complete
- When the job is complete, a new file will be displayed in the **History** column, named **VCF-BEDintersect:on data ...
and data ...** 116 | - Click the **Eye** icon on the new intersect results file to view the data 117 | - The intersection results will be displayed in the center column of page 118 | - For example: 119 | ![Galaxy VCF](../../images/galaxy_vcf.PNG?raw=true "Galaxy") 120 | 121 | 122 | # UCSC Genome Browser: 123 | ### Browse 124 | 125 | - Open the **UCSC Genome Browser**: 126 | - Select **GRCh38/hg38** in the **Human Assembly** pull-down 127 | - Select **Go** 128 | - Select **Custom Tracks** from the **My Data** menu in the header menu 129 | - In the **Paste URLs or data** entry form, select **Choose File** 130 | - Navigate to select your bed file, for example, **GRCh38.nr_deletions_ucsc.bed** 131 | - Select **Submit** 132 | - NOTE: file requirements: 133 | - "track" line containing name and description 134 | - chr\* chromosome identifiers 135 | - no chrMT 136 | - Follow instructions in **Post-Download instructions** to generate valid file. 137 | - The custom track name should be displayed in the **Manage Custom Tracks** page 138 | - Select **Genome Browser** from the header menu 139 | - The custom track should be displayed 140 | - Right-click on the track to change the display from **dense** to **full** 141 | - Zoom in or out sufficiently to see the individual variants 142 | - Example of display: 143 | ![UCSC Genome Browser](../../images/ucsc_browser.PNG?raw=true "UCSC Genome Browser") 144 | 145 | Add a second track of insertions: 146 | - Select **Custom Tracks** from the **My Data** menu in the header menu 147 | - In the **Paste URLs or data** text entry, paste the URL of the NR insertion file: 148 | - NOTE: 149 | - there are no chrMT in the insertion file 150 | - load without a track name 151 | - The new track should be displayed as "User Track" in the **Manage Custom Tracks** page 152 | - Select **Genome Browser** from the header menu 153 | - Example of display: 154 | ![UCSC Genome Browser insertions](../../images/ucsc_browser_ins.PNG?raw=true "UCSC Genome Browser 
insertions")

### Compute Intersections
- Select **Table Browser** from the **Tools** menu in the header menu.
- Select "Custom Tracks" from **group**, if not already displayed.
- Select the custom track name, for example, **dbVar NR deletions** from **track.**
- Select **create** next to **intersection.**
- The next window will be named: **Intersect with** *your track name*, for example: **Intersect with dbVar NR deletions.**
- Update the parameters, or use the defaults: **Genes and Gene Predictions.**
- Select **submit** from the **Intersect with ...** window.
- In the **Table Browser** window:
- Select an **output format** or leave as **BED**
- Enter a name into **output file** (or leave blank to display output in browser)
- Select **get output**
- In the **Output ... as BED** window, select **getBED**
- The output is generated in a file or displayed in the browser.


# NCBI Sequence Viewer:
### Browse
- Open **NCBI Sequence Viewer** for GRCh38, chromosome 1 in browser:
- Select **Tracks**
- In **Configure Page** window, select **Custom Data**
- In **Data Source** menu, select **URL**
- In **URL** form, enter:
- **URL**
- **Track Name**: dbVar NR deletions
- Select **Upload**
- In **URL** form, enter:
- **URL**
- **Track Name**: dbVar NR duplications
- Select **Upload**
- In **URL** form, enter:
- **URL**
- **Track Name**: dbVar NR insertions
- Select **Upload**
- In **URL** form:
- Select **Configure**
- All tracks should be displayed.
- Chromosome 1 top level:
![NCBI Sequence Browser chr1](../../images/NR_sv_chr1.PNG?raw=true "NCBI Sequence Browser chr1")

- Chromosome 1 zoomed in:
![NCBI Sequence Browser zoom](../../images/NR_sv_chr1_zoom.PNG?raw=true "NCBI Sequence Browser zoomed")

- Chromosome 1 showing detailed non-redundant deletions, duplications, and insertions:
![NCBI Sequence Browser detail](../../images/NR_sv_chr1_detail.PNG?raw=true "NCBI Sequence Browser detail")


# Installation Notes (Linux)
## Bedtools
- Find the latest bedtools directory in GitHub, for example:
- Download the \*.tar.gz file
- Go to the installation directory and run: `make`
- Add the bin directory to your path, for example:
`export PATH=/home/bedtools2.27/bin/:$PATH`

--------------------------------------------------------------------------------
/Structural_Variant_Sets/README.md:
--------------------------------------------------------------------------------
# dbVar Structural Variant (SV) Datasets

## Work in progress, subject to change

Advances in genomic technologies have revealed structural variations (SV) to be prevalent in all human DNA, and subsequent research has increasingly implicated SV in phenotypic diversity and disease. In the next few years millions of genomes will be sequenced, leading to the discovery of many millions of SVs that will need to be analyzed to understand their functional impacts. A critical step in variant analysis and interpretation is the comparison of SVs found in an individual or population to known variants in public databases such as dbVar. Such comparisons can lead to biological insights into the possible functions of novel variants. To address the need for an SV reference, dbVar is creating structural variation datasets of known SV, complete with rich genomic and biological annotations, for use in novel SV analysis, annotation, and related workflows.

dbVar is an NCBI database of human genomic structural variation with more than 5 million submitted SVs from 157 human studies. These include data from large diversity projects such as the 1000 Genomes Project and a global population CNV survey ([Sudmant et al. 2015](https://www.ncbi.nlm.nih.gov/pubmed/26249230)) and from clinical resources such as ClinVar and ClinGen. We utilized this large collection of variants to generate several non-redundant SV datasets, separated according to variant type. We have begun annotating these datasets using available information on genes, molecular consequence, clinical significance, dosage sensitivity, regulatory regions, and relevant genomic structural features such as repetitive regions, segmental duplications, and assembly-assembly alignment anomalous regions. These reference data and annotations are intended to facilitate the integration and comparison of dbVar SV data with other genome annotations (such as disease phenotype and population frequencies) and to provide insights into the impact of the SV on biological functions.

The resulting alpha-release datasets are currently available in tab-delimited formats at https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants. Our goal is to update these files on a regular basis, incorporating new variant submissions, genomic features, Gene, RefSeq, ClinVar, and other annotation information as it becomes available. We encourage users to test these files and to provide feedback, either by emailing dbvar@ncbi.nlm.nih.gov or by submitting a request directly at https://github.com/ncbi/dbvar/issues.
10 | 11 | ============================ 12 | -------------------------------------------------------------------------------- /images/NR_sv_chr1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/NR_sv_chr1.PNG -------------------------------------------------------------------------------- /images/NR_sv_chr1_detail.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/NR_sv_chr1_detail.PNG -------------------------------------------------------------------------------- /images/NR_sv_chr1_zoom.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/NR_sv_chr1_zoom.PNG -------------------------------------------------------------------------------- /images/galaxy.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/galaxy.PNG -------------------------------------------------------------------------------- /images/galaxy_vcf.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/galaxy_vcf.PNG -------------------------------------------------------------------------------- /images/stub: -------------------------------------------------------------------------------- 1 | this is just a stub 2 | -------------------------------------------------------------------------------- /images/ucsc_browser.PNG: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/ucsc_browser.PNG -------------------------------------------------------------------------------- /images/ucsc_browser_ins.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/images/ucsc_browser_ins.PNG -------------------------------------------------------------------------------- /nr_stats_tables/ftp_manifest-table4.20191104.inc.md: -------------------------------------------------------------------------------- 1 | FTP Directory: nonredundant
2 | Last modified: Oct 22, 2019
3 | File types: bed, bedpe, tsv
4 |

5 | Deletions

Type and FTP Directory | GRCh37 | GRCh38
-----------------------|--------|-------
All | 2,565,798 | 2,553,342
Common | 256,543 | 255,992
Pathogenic | 11,296 | 11,105
Somatic | 23,360 | 23,319
34 |

35 | Duplications

Type and FTP Directory | GRCh37 | GRCh38
-----------------------|--------|-------
All | 428,103 | 417,693
Common | 68,992 | 68,447
Pathogenic | 4,337 | 4,206
Somatic | 15,103 | 15,077
64 |

65 | Insertions

Type and FTP Directory | GRCh37 | GRCh38
-----------------------|--------|-------
All | 1,322,707 | 1,327,375
Common | 138,871 | 138,737
Pathogenic | 71 | 71
Somatic | 0 | 0
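
The files counted above are distributed as bed, bedpe, and tsv. As a minimal sketch of how a tsv record might be read, the Python below parses one tab-separated line into a dictionary. The column order is taken from the field list in the Deletions README; that the same order applies to every nonredundant file, and that header or comment lines start with `#`, are assumptions. The helper name `parse_nr_tsv` is hypothetical and not part of any dbVar tool.

```python
import csv
import io

# Column order as listed in the Deletions README (assumed to hold for all NR tsv files).
NR_COLUMNS = [
    "chr", "outermost_start", "outermost_stop", "variant_count",
    "variant_type", "method", "analysis", "platform", "study",
    "variant", "clinical_assertion", "clinvar_accession", "bin_size",
]

# Columns converted to integers after parsing.
INT_COLUMNS = ("outermost_start", "outermost_stop", "variant_count")


def parse_nr_tsv(handle):
    """Yield one dict per record from a tab-separated NR file handle.

    Blank lines and lines starting with '#' are skipped (assumed header style).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(NR_COLUMNS):
            raise ValueError(f"expected {len(NR_COLUMNS)} fields, got {len(row)}")
        record = dict(zip(NR_COLUMNS, row))
        for key in INT_COLUMNS:
            record[key] = int(record[key])
        yield record


# Example record 1 from the Deletions README, written as one tab-separated line.
sample = (
    "X\t153293294\t153296201\t1\tdeletion\tCurated\tCurated\tNA\t"
    "LSDB_submitted_variants\tnssv7487065\tPathogenic\tSCV000222455\tmedium\n"
)

records = list(parse_nr_tsv(io.StringIO(sample)))
span = records[0]["outermost_stop"] - records[0]["outermost_start"]
print(records[0]["variant"], span)
```

Fields that may hold multiple values (for example `variant` or `study`) are left as raw strings here, because the separator used inside those fields is not stated in this document.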
94 | 95 | -------------------------------------------------------------------------------- /nr_stats_tables/test: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 |

Supported Features

Capabilitybrowserbrowserclient browser client browser
23 | 24 | access_binary 25 |
26 | 27 | provide access to raw binary data of the file 28 |
[1] [8]
38 | 39 | access_image_binary 40 |
41 | 42 | provide access to raw binary data of the image 43 | 44 |
[1] [8]
54 | 55 | display_media 56 |
57 | 58 | display binary data as thumbs for example 59 |
[8]
69 | 70 | do_cors 71 |
72 | 73 | make cross-domain requests 74 | 75 |
85 | 86 | drag_and_drop 87 |
88 | 89 | accept files dragged and dropped from the desktop 90 | 91 |
101 | 102 | filter_by_extension 103 |
104 | 105 | filter files in selection dialog by their extensions 106 | 107 |
[7] [7]
117 | 118 | resize_image 119 |
120 | 121 | resize image 122 | 123 |
[1]
133 | 134 | report_upload_progress 135 |
136 | 137 | periodically report how many bytes were uploaded 138 | 139 |
[2]
149 | 150 | return_response_headers 151 |
152 | 153 | provide access to the headers of http response 154 | 155 |
165 | 166 | return_response_type 167 |
168 | 169 | support http response of specific type 170 |
[9] [9]
180 | 181 | return_status_code 182 |
183 | 184 | return http status code of the response 185 |
[10] [10] [10]
195 | 196 | send_custom_headers 197 |
198 | 199 | send custom http header with the request 200 | 201 |
[12]
211 | 212 | select_file 213 |
214 | 215 | pick up a file from a dialog 216 | 217 |
[3]
227 | 228 | select_folder 229 |
230 | 231 | select a folder from a dialog 232 | 233 |
[4]
243 | 244 | select_multiple 245 |
246 | 247 | select multiple files at once from a file dialog 248 |
[5]
258 | 259 | send_binary_string 260 |
261 | 262 | send raw binary data (typically a binary string) 263 | 264 |
[1]
274 | 275 | send_browser_cookies 276 |
277 | 278 | send browser cookies with http request 279 |
289 | 290 | send_multipart 291 |
292 | 293 | send multipart/form-data 294 | 295 |
305 | 306 | slice_blob 307 |
308 | 309 | slice the file or blob 310 |
[1]
320 | 321 | stream_upload 322 |
323 | 324 | upload file without preloading it to memory 325 |
335 | 336 | summon_file_dialog 337 |
338 | 339 | programmatically trigger file dialog 340 |
[6]
350 | 351 | upload_filesize 352 |
353 | upload file of specific size 354 |
[1]
364 | use_http_method 365 |
366 | use specific http method
[11] [11] [11] [11]
376 | -------------------------------------------------------------------------------- /specs/README.md: -------------------------------------------------------------------------------- 1 | # dbVar Design and Schema Specifications 2 | ============================ 3 | 4 | ### directory layout 5 | 6 | . 7 | +-- README.md 8 | +-- dbVar.xsd # dbVar database schema 9 | +-- dbVarSubmissionTemplate_v3.5.xlsx # dbVar Submission Template (Excel) 10 | -------------------------------------------------------------------------------- /specs/dbVar.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | The positive-decimal type specifies a positive decimal value. 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | Controlled vocabulary for Analysis Type 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | Controlled vocabulary to identify if this is From or To in a breakpoint pair 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | Controlled vocabulary for subject collection 65 | 66 | 67 | 68 | 69 | 70 | 71 | Individual included in 1000 Genomes 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | Individual included in CEPH panel 80 | 81 | 82 | 83 | 84 | 85 | 86 | Individual included in International HapMap Project 87 | 88 | 89 | 90 | 91 | 92 | 93 | Individual included in Human Genome Diversity Project 94 | 95 | 96 | 97 | 98 | 99 | 100 | Individual included in National Institute of Neurological Disorders and Stroke repository 101 | 102 | 103 | 104 | 105 | 106 | 107 | Individual included in Ontario Population Genomics Platform 108 | 109 | 110 | 111 | 112 | 113 | 114 | Individual included in Polymorphism Discovery Resource 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | Controlled vocabulary for contact type 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | Controlled 
vocabulary for Experiment Type 138 | 139 | 140 | 141 | 142 | 143 | 144 | Experiment was used to discover structural variants 145 | 146 | 147 | 148 | 149 | 150 | 151 | Experiment was used to validate structural variants 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | Controlled vocabulary for genomic placement methods, default is genomic placements received from submitter. 162 | 163 | 164 | 165 | 166 | 167 | 168 | Original placement was mapped to another assembly using NCBI assembly-to-assembly remapping software. 169 | 170 | 171 | 172 | 173 | 174 | 175 | Original placement was mapped to another assembly by ad-hoc method at NCBI. 176 | 177 | 178 | 179 | 180 | 181 | 182 | Original placement was on an unplaced contig which is an artifact of the assembly. 183 | 184 | 185 | 186 | 187 | 188 | 189 | Variant location was identified by cytoband, which were placed on genome by NCBI. 190 | 191 | 192 | 193 | 194 | 195 | 196 | Original genomic placement provided by submitter. 197 | 198 | 199 | 200 | 201 | 202 | 203 | Variant location was identified by HGVS nomenclature, in the hgvs_name field. 204 | 205 | 206 | 207 | 208 | 209 | 210 | Original placement did not successfully remap to specified assembly. 211 | 212 | 213 | 214 | 215 | 216 | 217 | Variant could not be placed on submitted assembly. 218 | 219 | 220 | 221 | 222 | 223 | 224 | Submitted placement is on an assembly that is not in INSDC. Placements cannot be validated or loaded. 
225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | Controlled vocabulary for database Link, for an archive, phenotype, or sample resource 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | Controlled vocabulary for Method Type 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | Controlled vocabulary for Variant_call Origin 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | Controlled vocabulary for ranking placements 350 | 351 | 352 | 353 | 354 | 355 | 356 | This is the best placement available in the assembly. 357 | (Regardless of whether it is submitted or remapped, single or multiple, first or second pass, genomic, cytogenetic, or HGVS) 358 | 359 | 360 | 361 | 362 | 363 | 364 | There are better placements on the same assembly. 365 | Indicates that this is a remapped placement that does not have the best coverage score. 366 | 367 | 368 | 369 | 370 | 371 | 372 | This remapped placement was generated using an unknown remapper algorithm and is therefore of low confidence. 373 | 374 | 375 | 376 | 377 | 378 | 379 | This placement is a submitted artifact and is therefore of low confidence. 380 | 381 | 382 | 383 | 384 | 385 | 386 | There is no placement submitted or remapped on the assembly. 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | Controlled vocabulary indicating if a placement is single or multiple within the same assembly. 396 | 397 | 398 | 399 | 400 | 401 | 402 | This is the only placement in this assembly. 
403 | 404 | 405 | 406 | 407 | 408 | 409 | This is one of multiple placements in the same assembly. 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | Controlled vocabulary for remap failures 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | Controlled vocabulary for placement Recipient 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | Controlled vocabulary for reference_type 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | Sampleset was Case-set 460 | 461 | 462 | 463 | 464 | 465 | 466 | Sampleset was control-set 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | Controlled vocabulary for sequence type 475 | 476 | 477 | 478 | 479 | 480 | 481 | last landmark affected by variant 482 | 483 | 484 | 485 | 486 | 487 | 488 | first unaffected landmark after variant 489 | 490 | 491 | 492 | 493 | 494 | 495 | last unaffected landmark before variant 496 | 497 | 498 | 499 | 500 | 501 | 502 | first landmark affected by variant 503 | 504 | 505 | 506 | 507 | 508 | 509 | typically a sequence trace that shows the variant breakpoint, and which was used to make the variant call 510 | 511 | 512 | 513 | 514 | 515 | 516 | typically the ID(s) of a probe(s) which were observed to have increased or decreased signal intensity, and so were used to call the variant 517 | 518 | 519 | 520 | 521 | 522 | 523 | the actual variant sequence; can either be literal sequence under 100bp, or a GENBANK accession containing the literal sequence if greater than 100bp 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | Controlled vocabulary for subject Sex 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | Controlled vocabulary for strand Type 545 | 546 | 547 | 548 | 549 | 550 | 551 | + strand 552 | 553 | 554 | 555 | 556 | 557 | 558 | - strand 559 | 560 | 561 | 562 | 563 | 564 | 565 | This value can be assigned by the converter if a value for strand is expected but was not submitted. 
566 | 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | Controlled vocabulary for Study Type 575 | 576 | 577 | 578 | 579 | 580 | 581 | Comparison between healthy and diseased individuals. 582 | 583 | 584 | 585 | 586 | 587 | 588 | Disease-associated variation. 589 | 590 | 591 | 592 | 593 | 594 | 595 | A collection of variation 596 | 597 | 598 | 599 | 600 | 601 | 602 | Variation in healthy population. 603 | 604 | 605 | 606 | 607 | 608 | 609 | Somatic Variation 610 | 611 | 612 | 613 | 614 | 615 | 616 | Somatic variation detected from comparison of tumor to healthy tissue from the same individual. 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | Controlled vocabulary for Study Type 626 | 627 | 628 | 629 | 630 | 631 | 632 | Subject Age in Days 633 | 634 | 635 | 636 | 637 | 638 | 639 | Subject Age in Weeks 640 | 641 | 642 | 643 | 644 | 645 | 646 | Subject Age in Months 647 | 648 | 649 | 650 | 651 | 652 | 653 | Subject Age in Years 654 | 655 | 656 | 657 | 658 | 659 | 660 | Subject Gestational Age in Days 661 | 662 | 663 | 664 | 665 | 666 | 667 | Subject Gestational Age in Weeks 668 | 669 | 670 | 671 | 672 | 673 | 674 | Subject Gestational Age in Months 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | Controlled vocabulary for Validation result 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | Controlled vocabulary for Variant Call 696 | Type 697 | 698 | 699 | 700 | 701 | 702 | 703 | 704 | 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | 716 | 717 | 718 | 719 | 720 | 721 | 722 | 723 | 724 | 725 | 726 | 727 | 728 | 729 | 730 | 731 | Controlled vocabulary for Variant_call Zygosity 732 | 733 | 734 | 735 | 736 | 737 | 738 | Hemizygous variant_call 739 | 740 | 741 | 742 | 743 | 744 | 745 | Heterozygous variant_call 746 | 747 | 748 | 749 | 750 | 751 | 752 | Homozygous variant_call 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | Controlled vocabulary for Variant Region 762 | Type 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 
771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | Database name from controlled vocabulary 788 | 789 | 790 | 791 | 792 | 793 | 794 | Database ID 795 | 796 | 797 | 798 | 799 | 800 | 801 | Indicates the link is not valid. 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | 810 | 811 | 812 | 813 | Indicates the URL is not valid. 814 | 815 | 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | Placement of the variant to a sequenced genome 824 | 825 | 826 | 827 | 828 | 829 | chromosome name 830 | 831 | 832 | 833 | 834 | contig accession 835 | 836 | 837 | 838 | 839 | chromosome accession 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | The left confidence interval around start. Use instead of outer_start when there are confidence intervals. 851 | 852 | 853 | 854 | 855 | The right confidence interval around start. Use instead of inner_start when there are confidence intervals. 856 | 857 | 858 | 859 | 860 | The left confidence interval around stop. Use instead of inner_stop when there are confidence intervals. 861 | 862 | 863 | 864 | 865 | The right confidence interval around stop. Use instead of outer_stop when there are confidence intervals. 866 | 867 | 868 | 869 | 870 | 871 | 872 | The assembly unit to which the mapped_id belongs 873 | 874 | 875 | 876 | 877 | First Pass means the remapping is based on the 'First Pass' or reciprocal best hit alignments. 
'Second Pass' means the remapping is based on the non-reciprocal best hit alignments 878 | 879 | 880 | 881 | 882 | 884 | 885 | 886 | 1=remapped placement is on a different AC_ or NC_ chromosome than the source chromosome 887 | 888 | 889 | 890 | 891 | 1=remapped placement was best-scoring placement within a cluster/overlap/subset on same chromosome and alignment (First Pass or Second Pass), and other remapped placements were dropped 892 | 893 | 894 | 895 | 896 | 897 | 898 | Placement of the variant to a cytogenetic band location 899 | 900 | 901 | 902 | 903 | NCBI Taxonomy ID (https://www.ncbi.nlm.nih.gov/taxonomy) 904 | 905 | 906 | 907 | 908 | eg 7p12-7p11.2 909 | 910 | 911 | 912 | 913 | 914 | 915 | 916 | indicate status for alternate placement 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | Variant Calls have an HGVS expression associated with a placement 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 | 933 | 934 | Variant Regions may have variant_call_id, mutation_order, and mutation_molecule associated with a placement 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | 943 | 944 | 945 | 946 | 947 | 948 | 949 | The contact's first name. Must have either a first_name or a last_name. 950 | 951 | 952 | 953 | 954 | 955 | 956 | The email is optional in the submission because it can be obtained from myncbi_id. 957 | 958 | 959 | 960 | 961 | 962 | MyNCBI login system ID 963 | (https://www.ncbi.nlm.nih.gov/myncbi) 964 | 965 | 966 | 967 | 968 | 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 | 981 | 982 | 983 | optional phenotype description, although standard terms are preferred 984 | 985 | 986 | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | Top Level Submission Package 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 | 1017 | version of the XSD 1018 | 1019 | 1020 | 1021 | 1022 | Versions used in submission files. 
1023 | Format: Major.Component.Update 1024 | Major = structural change in the model. 1025 | Component = change such as addition of a new object 1026 | Update = minor change such as addition of CF type or update/addition/removal of a field. 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | May 18, 2010 1034 | Added SUBMISSION/version. 1035 | Added STRUCTVAR/VARIANT_SET/VARIANT/MERGED_VARIANT/id for merged variants. 1036 | Changed ANALYSIS/id, METHOD/id to positiveInteger in order to agree with exchange. 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | Jun 8, 2010 1044 | Added HPO to PhenotypeType. 1045 | Replaced STRUCTVAR/VARIANT_SET/VARIANT/ALLELE/phenotype with STRUCTVAR/VARIANT_SET/VARIANT/ALLELE/PHENOTYPE. 1046 | Changed STRUCTVAR/VARIANT_SET/VARIANT/ALLELE string to STRUCTVAR/VARIANT_SET/VARIANT/ALLELE/DESCRIPTION. 1047 | 1048 | 1049 | 1050 | 1051 | 1052 | 1053 | Jun 24, 2010 1054 | Added STUDY/NCBI_submission_id 1055 | 1056 | 1057 | 1058 | 1059 | 1060 | 1061 | July 26, 2010 1062 | Add Curated to StudyType. 1063 | Add LSDB, MassSpec, Optical mapping to MethodTypeCV. 1064 | Added VARIANT/PLACEMENT/GENOME/placement_method, GenomicPlacementMethodCV. 1065 | Moved STRUCTVAR/VARIANT_SET/VARIANT/ALLELE/DESCRIPTION to STRUCTVAR/VARIANT_SET/VARIANT/DESCRIPTION. 1066 | Removed STRUCTVAR/VARIANT_SET/VARIANT/notes, since DESCRIPTION will be used instead. 1067 | Added Library to SampleTypeCV, since it is in Exchange.xsd. 1068 | Added definitions to StudyTypeCV. 1069 | 1070 | 1071 | 1072 | 1073 | 1074 | 1075 | August 12, 2010 1076 | Changed start, outer_start to nonNegativeInteger to allow for base 0. 1077 | Changed start, stop to optional for studies that only have outer_start, outer_stop. 1078 | 1079 | 1080 | 1081 | 1082 | 1083 | 1084 | October 28, 2010 1085 | Added inner_start, inner_stop to placements. 1086 | Added Genomic to GenomicPlacementMethodCV. 1087 | Added SUBJECT/PEDIGREE/chromosome_complement. 1088 | Added Karyotype to MethodCV for validation. 
1089 | Added AssemblyCV and use it in ANALYSIS/REFERENCE/assembly and PLACEMENT/GENOME/assembly. 1090 | Removed STRUCTVAR/base. 1091 | 1092 | 1093 | 1094 | 1095 | 1096 | 1097 | November 18, 2010 1098 | Per STV-604, removed these from AlleleTypeCV: 1099 | Amplification, CNV, Deletion, DelIns, Duplication, Gain+Loss, LOH, Ring, Translocation. 1100 | Per STV-619, added ClinicalSignificanceCV. 1101 | Per STV-723, added ALLELE/loss_of_heterozygosity. 1102 | 1103 | 1104 | 1105 | 1106 | 1107 | 1108 | December 3, 2010 1109 | Set VARIANT_SET to optional, since ISCA will submit instances to existing variants. This also allows generation of temporary xml files with only instances. 1110 | STV-731: Add dbSNP to MethodTypeCV. 1111 | STV-732: Add "Conventional CGH" to MethodTypeCV. 1112 | 1113 | 1114 | 1115 | 1116 | 1117 | 1118 | December 20, 2010 1119 | STV-622: added GRCh37.p1, GRCh37.p2. 1120 | 1121 | 1122 | 1123 | 1124 | 1125 | 1126 | January 19, 2011 1127 | VLOAD-106: added Sscrofa9. 1128 | STV-761: added SAMPLE/SAMPLESET/id. 1129 | STV-709: Allow only 1 ANALYSIS/REFERENCE. 1130 | 1131 | 1132 | 1133 | 1134 | 1135 | 1136 | February 17, 2011 1137 | STV-782: Removed ANALYSIS/REFERENCE/POOLED. 1138 | Allow ANALYSIS/REFERENCE/SAMPLE and SAMPLESET to be repeated. 1139 | STV-783: Added VARIANT/ALLELE/mode_of_inheritence. 1140 | 1141 | 1142 | 1143 | 1144 | 1145 | 1146 | March 14, 2011 1147 | VLOAD-100: added Btau_4.0. 1148 | 1149 | 1150 | 1151 | 1152 | 1153 | 1154 | April 19, 2011 1155 | VLOAD-13: Although SUBMISSION/submission_id is required in the Exchange.xsd, STUDY/NCBI_submission_id should be optional since it is not included in FTP files. 1156 | VLOAD-128, STV-797: Added STUDY/ALIASES/alias. 1157 | VLOAD-129: Made sampleset id's integer. 1158 | VLOAD-132: Replaced Everted with "Tandem duplication" in AlleleTypeCV. 1159 | 1160 | 1161 | 1162 | 1163 | 1164 | 1165 | July 7, 2011 1166 | STV-674: Changed values in GenomePlacementMethodCV. 
1167 | STV-888: Added PlacementType/remap_score. 1168 | STV-731, STV-800: Removed 'Sequence alingnment', 'Read-depth analysis', 'Paired-end mapping', 'SNP genotyping analysis', 1169 | 'Composite Approach', dbSNP, LSDB from MethodTypeCV. 1170 | 1171 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | August 24, 2011 1178 | STV-950: Major updates for new submission template. 1179 | 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | 1187 | 1188 | Jira issue number for the study, i.e. VLOAD-123. 1189 | 1190 | 1191 | 1192 | 1193 | 1194 | 1195 | Submitter study_id in the form of 'AuthorYear' 1196 | 1197 | 1198 | 1199 | 1200 | 1201 | 1202 | 1203 | 1204 | 1205 | Submitter or other contact. 1206 | 1207 | 1208 | 1209 | 1210 | 1211 | 1212 | 1213 | Details of study 1214 | 1215 | 1216 | 1217 | 1218 | 1219 | 1220 | 1221 | 1222 | 1223 | 1224 | NCBI Taxonomy ID (https://www.ncbi.nlm.nih.gov/taxonomy) 1225 | 1226 | 1227 | 1228 | 1229 | 1230 | 1231 | 1232 | 1233 | 1234 | 1235 | NCBI PubMed ID (https://www.ncbi.nlm.nih.gov/pubmed/) 1236 | 1237 | 1238 | 1239 | 1240 | 1241 | 1242 | 1243 | This gets indexed. Use this for alternate names, as for 1000 genomes. 1244 | 1245 | 1246 | 1247 | 1248 | 1249 | 1250 | 1251 | generic study links 1252 | 1253 | 1254 | 1255 | 1256 | 1257 | 1258 | BioProject accession (https://www.ncbi.nlm.nih.gov/bioproject) 1259 | 1260 | 1261 | 1262 | 1263 | 1264 | 1265 | Submitter location of data eg PI lab 1266 | 1267 | 1268 | 1269 | 1270 | 1271 | 1272 | Indicates the study URL is not valid. 
1273 | 1274 | 1275 | 1276 | 1277 | 1278 | 1279 | dbGaP identifier 1280 | 1281 | 1282 | 1283 | 1284 | 1285 | 1286 | EGA identifier 1287 | 1288 | 1289 | 1290 | 1291 | 1292 | 1293 | DDBJ/EBI/NCBI assigned accession for the study 1294 | 1295 | 1296 | 1297 | 1298 | 1299 | 1300 | 1301 | 1302 | 1303 | 1304 | 1305 | 1306 | Date of scheduled release of data 1307 | 1308 | 1309 | 1310 | 1311 | 1312 | Enter 1 if this study was loaded by NCBI/EBI from historical data 1313 | 1314 | 1315 | 1316 | 1317 | 1318 | 1319 | 1320 | When a study is updated and linked to an external data source (e.g. COSMIC) this attribute indicates the version/date of the external data source 1321 | 1322 | 1323 | 1324 | 1325 | 1326 | 1327 | For updating studies, the freeze date of the newly submitted data 1328 | 1329 | 1330 | 1331 | 1332 | 1333 | 1334 | Indicates the study is curated. 1335 | 1336 | 1337 | 1338 | 1339 | 1340 | 1341 | 1342 | 1343 | 1344 | Details of methods used to generate raw data 1345 | 1346 | 1347 | 1348 | 1349 | 1350 | 1351 | 1352 | Details of method used to generate data 1353 | 1354 | 1355 | 1356 | 1357 | 1359 | 1360 | 1361 | Free text method description 1362 | 1363 | 1364 | 1365 | 1366 | 1367 | 1368 | 1369 | 1370 | 1371 | 1372 | 1373 | Analysis of the data provided by the method 1374 | 1375 | 1376 | 1377 | 1378 | 1380 | 1381 | 1382 | Free text description of the analysis 1383 | 1384 | 1385 | 1386 | 1387 | 1388 | 1389 | 1390 | Analysis type, from AnalysisTypeCV 1391 | 1392 | 1393 | 1394 | 1396 | 1397 | 1398 | Reference Type, from ReferenceTypeCV 1399 | 1400 | 1401 | 1402 | 1403 | 1404 | 1405 | Reference value 1406 | 1407 | 1408 | 1409 | 1410 | 1411 | 1412 | 1413 | 1414 | Analysis of the data provided by the method 1415 | 1416 | 1417 | 1418 | 1419 | 1421 | 1422 | 1423 | Free text description of the analysis 1424 | 1425 | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 | Analysis type, from AnalysisTypeCV 1432 | 1433 | 1434 | 1435 | 1437 | 1438 | 1439 | Reference Type, from ReferenceTypeCV 1440 
| 1441 | 1442 | 1443 | 1444 | 1445 | 1446 | Reference value 1447 | 1448 | 1449 | 1450 | 1451 | 1452 | 1453 | 1454 | 1455 | Analysis of the data provided by the method 1456 | 1457 | 1458 | 1459 | 1460 | 1462 | 1463 | 1464 | Free text description of the analysis 1465 | 1466 | 1467 | 1468 | 1469 | 1470 | 1471 | 1472 | Analysis type, from AnalysisTypeCV 1473 | 1474 | 1475 | 1476 | 1478 | 1479 | 1480 | Reference Type, from ReferenceTypeCV 1481 | 1482 | 1483 | 1484 | 1485 | 1486 | 1487 | Reference value 1488 | 1489 | 1490 | 1491 | 1492 | 1493 | 1494 | 1495 | 1496 | 1497 | 1499 | 1500 | Details of software/algorithm parameters used 1501 | 1502 | 1503 | 1504 | 1505 | 1506 | 1507 | Short description or keywords for software/algorithm(s) used eg ADM2, Birdsuite, BrkPtr, CNVFinder 1508 | 1509 | 1510 | 1511 | 1512 | 1513 | 1514 | 1515 | 1516 | 1518 | 1519 | Platform links. 1520 | 1521 | 1522 | 1523 | 1524 | 1525 | 1526 | Name of sequencing platform eg Capillary, 454, Helicos, Solexa, SOLiD or name of array platform if not in GEO or Array Express 1527 | 1528 | 1529 | 1530 | 1531 | 1532 | 1533 | 1534 | 1535 | 1536 | 1537 | 1538 | 1539 | 1540 | 1541 | 1542 | 1543 | 1544 | Details of curation by DDBJ/EBI/NCBI, if this has occurred 1545 | 1546 | 1547 | 1548 | 1549 | 1550 | 1551 | 1552 | 1553 | 1554 | 1555 | 1556 | 1557 | IDs of experiments merged to make this one 1558 | 1559 | 1560 | 1561 | 1563 | 1564 | 1565 | 1566 | 1567 | 1568 | 1569 | id of the experiment supplied by submitter 1570 | 1571 | 1572 | 1573 | 1574 | 1575 | 1576 | SRA Experiment accession (https://www.ncbi.nlm.nih.gov/books/NBK47529/#_SRA_Quick_Sub_BK_Experiment_) 1577 | 1578 | 1579 | 1580 | 1581 | 1582 | 1583 | For multicenter studies, the location where the experiment took place 1584 | 1585 | 1586 | 1587 | 1588 | 1589 | Estimate of the resolution (kb) 1590 | 1591 | 1592 | 1593 | 1594 | 1595 | 1596 | 1597 | 1598 | 1599 | Details of each sampleset/population used in the study 1600 | 1601 | 1602 | 1603 | 1604 | 1605 | 
1606 | 1607 | 1608 | 1609 | 1610 | NCBI Taxonomy ID (https://www.ncbi.nlm.nih.gov/taxonomy) 1611 | 1612 | 1613 | 1614 | 1615 | 1616 | 1618 | 1619 | 1620 | 1621 | 1622 | Sampleset ID supplied by submitter 1623 | 1624 | 1625 | 1626 | 1627 | 1628 | 1629 | Submitter defined sampleset name 1630 | 1631 | 1632 | 1633 | 1634 | 1635 | 1636 | Number of samples in sampleset 1637 | 1638 | 1639 | 1640 | 1641 | 1642 | 1643 | 1644 | 1645 | Population of the sampleset 1646 | 1647 | 1648 | 1649 | 1650 | 1651 | 1652 | 1653 | 1654 | 1655 | Details of each sample used in the study 1656 | 1657 | 1658 | 1659 | 1660 | 1661 | 1662 | 1663 | 1664 | 1665 | Sample sampleset_id 1666 | 1667 | 1668 | 1669 | 1670 | 1671 | 1672 | 1673 | 1674 | If sample information has been deposited to a samples database (e.g. BioSD), or is commercially available (e.g. from the Jackson Laboratory) provide the details 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 | Sample ID supplied by submitter 1683 | 1684 | 1685 | 1686 | 1687 | 1688 | 1689 | Cell/tissue type from which the sample was obtained 1690 | 1691 | 1692 | 1693 | 1694 | 1695 | 1696 | Reference to subject ID supplied by submitter 1697 | 1698 | 1699 | 1700 | 1701 | 1702 | cancer, histology, toxicology,... 1703 | 1704 | 1705 | 1706 | 1707 | 1708 | 1 if sample was derived from cancerous cells or tissue 1709 | 1710 | 1711 | 1712 | 1713 | 1714 | 1715 | The sample's karyotype in ISCN notation, e.g. 
47,XXY or 46,XY,t(12;16)(q13;p11) 1716 | 1717 | 1718 | 1719 | 1720 | 1721 | 1722 | Sample accession from DDBJ/EBI/NCBI BioSample 1723 | 1724 | 1725 | 1726 | 1727 | 1728 | 1729 | 1730 | 1731 | 1732 | Details of each subject/individual used in the study 1733 | 1734 | 1735 | 1736 | 1737 | 1738 | 1739 | 1740 | 1741 | 1742 | Subject ID supplied by submitter 1743 | 1744 | 1745 | 1746 | 1747 | 1748 | 1749 | NCBI tax ID (https://www.ncbi.nlm.nih.gov/taxonomy) 1750 | 1751 | 1752 | 1753 | 1754 | 1755 | 1756 | Reference to maternal Subject ID 1757 | 1758 | 1759 | 1760 | 1761 | 1762 | 1763 | Reference to paternal Subject ID 1764 | 1765 | 1766 | 1767 | 1768 | 1769 | 1770 | 1771 | Indicates if the subject is included in either the International HapMap Project (HapMap) or Human Genome Diversity Project (HGDP) 1772 | 1773 | 1774 | 1775 | 1776 | 1777 | 1778 | The subject's karyotype in ISCN notation, e.g. 47,XXY or 46,XY,t(12;16)(q13;p11) 1779 | 1780 | 1781 | 1782 | 1783 | 1784 | 1785 | Free-form subject ethnicity as defined by submitter 1786 | 1787 | 1788 | 1789 | 1790 | 1791 | 1792 | The subject's age 1793 | 1794 | 1795 | 1796 | 1797 | 1798 | 1799 | The subject's age units 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | Details of each variant call identified in the study 1810 | 1811 | 1812 | 1813 | 1814 | 1815 | 1816 | 1817 | 1818 | 1819 | 1820 | 1821 | 1822 | Free-form description of variant 1823 | 1824 | 1825 | 1826 | 1827 | 1828 | 1829 | 1830 | 1831 | References SAMPLE/sample_id. 1832 | 1833 | 1834 | 1835 | 1836 | 1837 | 1838 | 1839 | 1841 | 1842 | 1843 | References SAMPLESET/sampleset_id. 
1844 | 1845 | 1846 | 1847 | 1848 | 1849 | 1850 | 1851 | 1852 | 1854 | 1855 | 1856 | raw sequence eg ATGTTAA/atgttaa of inserted or replacement sequence 1857 | 1858 | 1859 | 1860 | 1861 | Evidence or Support sequences 1862 | 1863 | 1864 | 1865 | 1866 | 1867 | 1868 | Submitter id of the VARIANT_CALL 1869 | 1870 | 1871 | 1872 | 1873 | 1874 | DDBJ/EBI/NCBI assigned accession of the variant call 1875 | 1876 | 1877 | 1878 | 1879 | 1880 | 1881 | Source of Clinical Significance: ClinVar, Submitter, ClinGen Dosage Sensitivity Map 1882 | 1883 | 1884 | 1885 | 1886 | 1887 | Studies with clinical_significance should be submitted directly to ClinVar. 1888 | There are still some legacy studies with clinical_significance, including ClinGen. 1889 | Changed from ClinicalSignificanceCV to string to allow types from ClinVar aggregated at the region level. 1890 | 1891 | 1892 | 1893 | 1894 | 1895 | If this call is an insertion of any kind, this field indicates the length of the inserted sequence, and is required (value may be approximate, e.g., "150-300"). Can also be used for a mapping-based deletion, to indicate the approximate size of the deleted sequence. 1896 | 1897 | 1898 | 1899 | 1900 | 1901 | 1902 | 1903 | 1904 | Needs to be a string to support values like "4+". 1905 | 1906 | 1907 | 1908 | 1909 | 1910 | Indicates the expected copy number for this variant in the default state. 
Number of variant probes or clones
Average log2 ratio for variant probes
1/true indicates low-quality data, as in some variants in 1000 genomes project (used for filtering)
From 1000 Genomes VCF spec: AC : allele count in genotypes, for each ALT allele, in the same order as listed
From 1000 Genomes VCF spec: AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary
From 1000 Genomes VCF spec: AN : total number of alleles in called genotypes
Use for short tandem repeat. Combine repeat motif, repeat count, and optional indication if this is the reference allele. Example: "[TG]12.5 (ref)".
Details of each variant region identified in the study
Free-form description of variant
Variant Calls that were merged into this region.
Variant Regions that were merged into this region.
The submitter defined id of the VARIANT_REGION
DDBJ/EBI/NCBI assigned accession of the variant region
The assertion method used to assert the variant region
1/true indicates low-quality data, as in some variants in 1000 genomes project (used for filtering)
Use for short tandem repeat. Repeat motif, for example: "TG".
Use for short tandem repeat. Sorted list of repeat counts for reference and alternates. For example: "13.5,14.5,15.5,16.5,17.5,18.5,19.5,20.5,21.5,22.5,23.5,26.5".
Internal sample id
Internal sampleset id
1=ancestral allele, 0=non-ancestral allele
Submitted genotype 0|1, 0/1 etc. Where '|' is phased and '/' is unphased, and the digits refer to the alternate allele code for this genotype
Submitted genotype eg heterozygous, hemizygous, haploid
List of dbVar accessions to be deleted
dbVar sv accession
--------------------------------------------------------------------------------
/specs/dbVarSubmissionTemplate_v3.4.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/specs/dbVarSubmissionTemplate_v3.4.xlsx
--------------------------------------------------------------------------------
/specs/dbVarSubmissionTemplate_v3.5.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ncbi/dbvar/505191f154a26edb59b71920310b831491897a6a/specs/dbVarSubmissionTemplate_v3.5.xlsx
--------------------------------------------------------------------------------
/tutorials/README.md:
--------------------------------------------------------------------------------
# dbVar Tutorials

## Tutorials for NR data:
https://github.com/ncbi/dbvar/blob/master/Structural_Variant_Sets/Nonredundant_Structural_Variants/ToolGuide.md
--------------------------------------------------------------------------------