└── README.md

/README.md:
--------------------------------------------------------------------------------
# Imputation beagle tutorial
Throughout the protocol we assume a Bash shell.

This tutorial is forked from: https://www.protocols.io/run/genotype-imputation-workflow-v3-0-xbgfijw

The tutorial is split into the following steps:

1. Install Anaconda
2. Install the software
3. Run the imputation

## Create a new env that we call imputation

`conda create -n imputation python=3.6 anaconda`

Activate your new env:

`conda activate imputation`

Install the required packages/software:
```
conda install -c bioconda eagle
conda install -c bioconda beagle
conda install -c r r-base
conda install -c bioconda bcftools
conda install -c conda-forge r-data.table
conda install -c conda-forge r-sm
```

## or Download and install the software packages

Required software packages are listed below with the versions used in this protocol. However, using the latest versions is
recommended.
* BCFtools v1.7 (or later version) http://www.htslib.org/download/
* R v3.4.1 (or later version) https://www.r-project.org/
* R package data.table https://github.com/Rdatatable/data.table/wiki/Installation
* R package sm https://cran.r-project.org/package=sm
* Eagle v2.3.5 https://data.broadinstitute.org/alkesgroup/Eagle/
* Beagle v4.1 beagle.27Jan18.7e1.jar https://faculty.washington.edu/browning/beagle/b4_1.html
* Beagle bref bref.27Jan18.7e1.jar https://faculty.washington.edu/browning/beagle/bref.27Jan18.7e1.jar

## Reference genome and genetic map files

Fasta files

Homo sapiens assembly hg38 version 0 is used and the required files are:
* Homo_sapiens_assembly38.fasta
* Homo_sapiens_assembly38.fasta.fai

The files are available for download from the Broad Institute storage in Google Cloud at: [hg38](https://console.cloud.google.com/storage/browser/broad-references/hg38/v0/?pli=1)


1.2.2 Genetic map files for phasing with Eagle

A genetic map file (all chromosomes in a single file) with recombination frequencies for GRCh38/hg38 is available on the Eagle download page:

https://data.broadinstitute.org/alkesgroup/Eagle/downloads/tables/

genetic_map_hg38_withX.txt.gz

We have processed the file with the command below in order to split it per chromosome with correct headers.
The resulting files are saved as:
eagle_chr#_b38.map (where # is the chromosome number)

```
for CHR in {1..23}; do
    # Match the chromosome column exactly: a bare prefix such as ^1
    # would also pick up chromosomes 10-19
    zcat genetic_map_hg38_withX.txt.gz | \
    awk -v chr="${CHR}" '$1 == chr' | \
    sed '1ichr position COMBINED_rate(cM/Mb) Genetic_Map(cM)' \
    > eagle_chr${CHR}_b38.map
done
```
Note: Currently the chromosome notation in the Eagle genetic map files is only the chromosome number without 'chr', and chrX is '23'. Starting from Eagle v2.4, chromosome notation with the 'chr' tag is also supported.
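
When splitting the map per chromosome, keep in mind that a bare prefix pattern such as `^1` also matches chromosomes 10 to 19. The sketch below contrasts a prefix match with an exact match on the first column. It runs on a made-up miniature map: the file name `mock_map.txt` and all values in it are hypothetical and only mimic the four-column layout described above.

```shell
# Made-up miniature map in the Eagle layout (values are illustrative only)
cat > mock_map.txt <<'EOF'
chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)
1 55550 0.0 0.0
10 48486 0.1 0.0
11 70079 0.2 0.0
23 2781479 0.3 0.0
EOF

# Prefix match: counts every line that merely starts with "1"
prefix_rows=$(grep -c '^1' mock_map.txt)

# Exact match on the first column: chromosome 1 only
exact_rows=$(awk -v chr=1 '$1 == chr' mock_map.txt | wc -l | tr -d ' ')

echo "prefix match: ${prefix_rows} rows, exact match: ${exact_rows} rows"
```

On this mock input the prefix match counts three rows (chromosomes 1, 10 and 11), while the exact match keeps only the single chromosome 1 row.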


1.2.3 Genetic map files for imputation with Beagle

Genetic map files for Beagle are available on the Beagle download page:

http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/

plink.GRCh38.map.zip

Unzip the files and change the chromosome notation from PLINK format (only number or X) to GRCh38/hg38 standard notation with the 'chr' tag as follows:

```
# Unzip the files
unzip plink.GRCh38.map.zip

# Rename chromosome 23
mv plink.chrX.GRCh38.map plink.chr23.GRCh38.map

# Add 'chr' tag to the beginning of the line and
# store the output with suitable filename
for CHR in {1..23}; do
    cat plink.chr${CHR}.GRCh38.map | \
    sed 's/^/chr/' \
    > beagle_chr${CHR}_b38.map
done
```

Note: For GRCh38/hg38, the chromosome notation in the Beagle genetic map files is 'chr#' and chromosome 23 is 'chrX'.

1.3. Imputation reference panel files

1.3.1 Obtain the reference panel files

For increased imputation accuracy, we recommend using a population-specific imputation reference panel, if available.

If population-specific reference data is not available, data from, for instance, the 1000 Genomes Project (1000 GP) (www.nature.com/articles/nature15393) can be used instead.

GRCh38/hg38 files are available at the EBI 1000 Genomes FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/

```
wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr{{1..22},X}_GRCh38.genotypes.20170504.vcf.gz{,.tbi}
```
NOTE:
The reference panel files should contain:
* phased genotypes,
* chromosome names with 'chr' and chromosome X as 'chrX',
* all variants as biallelic records,
* only SNPs and INDELs,
* only unique variants,
* non-missing data,
* chrX as diploid genotypes, and
* only unique IDs

1.3.2 Minimum quality control

Here, we have piped most of the processing steps together in order to save a significant amount of time by avoiding writing out multiple intermediate files. If your imputation reference panel does not require all the steps, modify the command accordingly.

For 1000GP data:
* Generate a text file containing space-separated old and new chromosome names. This is required to rename the numerical chromosome names with the 'chr' tag. Apply the new chromosome names with 'bcftools annotate'.
* Remove the rare variants, here singletons and doubletons, by setting an AC threshold with 'bcftools view'.
* Split multiallelic sites to biallelic records with 'bcftools norm'.
* Keep only SNPs and INDELs with 'bcftools view'. Here, the 1000GP data include a VT tag in the INFO field, and the data also contain structural variants, which should be excluded.
* Align the variants to the reference genome with 'bcftools norm' in order to have the REF and ALT alleles in the shortest possible representation and to confirm that the REF allele matches the reference genome. Additionally, remove duplicate variants (-d none).
* After alignment, remove multiallelic records with 'bcftools view', since these are formed during the alignment if the REF does not match the reference genome.
* Finally, remove sites containing missing data with 'bcftools view'.


```
# Generate a chromosome renaming file
# (overwrite rather than append, so re-runs do not duplicate lines)
for CHR in {1..23} X ; do
    echo ${CHR} chr${CHR}
done > chr_names.txt

# Multiple processing commands piped together
for CHR in {1..22} X; do
    bcftools annotate --rename-chrs chr_names.txt \
        ALL.chr${CHR}_GRCh38.genotypes.20170504.vcf.gz -Ou | \
    bcftools view -e 'INFO/AC<3 | INFO/AN-INFO/AC<3' -Ou | \
    bcftools norm -m -any -Ou | \
    bcftools view -i 'INFO/VT="SNP" | INFO/VT="INDEL"' -Ou | \
    bcftools norm -f Homo_sapiens_assembly38.fasta -d none -Ou | \
    bcftools view -m 2 -M 2 -Ou | \
    bcftools view -g ^miss -Oz -o 1000GP_chr${CHR}.vcf.gz
done
```

If multiallelic sites are present in your data and you want to preserve them throughout the protocol, set the ID field to unique IDs, e.g. in the format CHR_POS_REF_ALT. (RSIDs might contain duplicates when the multiallelic sites are decomposed.)

```
for CHR in {1..23}; do
    bcftools annotate \
        --set-id '%CHROM\_%POS\_%REF\_%ALT' \
        panel_chr${CHR}.vcf.gz \
        -Oz -o panel_SNPID_chr${CHR}.vcf.gz
done
```

1.3.3 Convert haploid genotypes to homozygous diploids

Often chrX is represented as haploid genotypes for males; however, Beagle can only handle diploid genotypes.
The 'bcftools +fixploidy' command used here produces unphased diploid genotypes. Since the former haploid genotypes are then written in diploid format as REF/REF or ALT/ALT, we can set the phase for those alleles with a simple sed replacement.
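
To illustrate the sed replacement on its own, here is a minimal sketch on made-up genotype fields (not real VCF records). It assumes biallelic data, so the only homozygous calls are `0/0` and `1/1`; the variable names are hypothetical.

```shell
# Three made-up genotype fields: two unphased homozygous calls
# and one call that is already phased
unphased=$(printf '0/0\t1/1\t0|1')

# Replace the unphased separator "/" with the phased separator "|"
# on homozygous calls only; everything else is left untouched
phased=$(printf '%s\n' "$unphased" | sed 's#0/0#0|0#g;s#1/1#1|1#g')

printf '%s\n' "$phased"
```

Because a homozygous genotype carries the same allele on both haplotypes, marking it as phased does not add any assumption about the data. An unphased heterozygous call such as `0/1` would not be changed by this replacement.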

The chrX ploidy can be corrected as follows:

```
# Fix the chromosome X ploidy to phased diploid
# Requires a ploidy.txt file containing
# space-separated CHROM,FROM,TO,SEX,PLOIDY
echo "chrX 1 156040895 M 2" > ploidy.txt
bcftools +fixploidy \
    1000GP_chrX.vcf.gz -Ov -- -p ploidy.txt | \
    sed 's#0/0#0\|0#g;s#1/1#1\|1#g' | \
bcftools view -Oz -o 1000GP_chr23.vcf.gz
```

1.3.4 Duplicate ID removal

Remove duplicate IDs. If you wish to preserve all multiallelic sites, replace the ID column with a unique ID, e.g. CHR_POS_REF_ALT (as indicated in Step 1.3.2).

Here, 1000GP did not contain multiallelic sites after AC filtering, and thus RSIDs were preserved in the ID column. Since RSIDs are not always unique, duplicates should be removed.

```
for CHR in {1..23}; do
    bcftools query -f '%ID\n' 1000GP_chr${CHR}.vcf.gz | \
    sort | uniq -d > 1000GP_chr${CHR}.dup_id

    if [[ -s 1000GP_chr${CHR}.dup_id ]]; then
        bcftools view -e ID=@1000GP_chr${CHR}.dup_id \
            1000GP_chr${CHR}.vcf.gz \
            -Oz -o 1000GP_filtered_chr${CHR}.vcf.gz
    else
        mv 1000GP_chr${CHR}.vcf.gz \
            1000GP_filtered_chr${CHR}.vcf.gz
    fi
done
```

1.3.5 Reference panel allele frequencies

Generate a tab-delimited file of the reference panel allele frequencies, one variant per line, with columns CHR, SNP (in format CHR_POS_REF_ALT), REF, ALT, AF (including the header line).
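
As a rough illustration of this layout, the sketch below writes a two-variant mock file and checks that every data row has five tab-separated fields and that the SNP column agrees with the CHR, REF and ALT columns. The file name `mock_panel.frq`, the positions, alleles and AF values are all made up for the example.

```shell
# Made-up two-variant example of the target layout (values are illustrative only)
printf 'CHR\tSNP\tREF\tALT\tAF\n'               >  mock_panel.frq
printf 'chr1\tchr1_10177_A_AC\tA\tAC\t0.425\n'  >> mock_panel.frq
printf 'chrX\tchrX_2781514_C_A\tC\tA\t0.012\n'  >> mock_panel.frq

# Count data rows that break the layout: not exactly five fields,
# or a SNP ID whose CHR, REF or ALT part differs from the other columns
bad_rows=$(awk -F'\t' 'NR > 1 {
    split($2, id, "_")
    if (NF != 5 || id[1] != $1 || id[3] != $3 || id[4] != $4) n++
} END { print n + 0 }' mock_panel.frq)

echo "rows violating the layout: ${bad_rows}"
```

The same awk check could be pointed at a real frequency file once it has been generated, provided the chromosome names themselves contain no underscores.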

First, update (or add) the AF values in the INFO field by calculating them with the BCFtools plugin +fill-tags:
```
# Check if the VCF does NOT contain AF in the INFO field,
# and calculate it with bcftools +fill-tags plugin

for CHR in {1..23}; do
    bcftools +fill-tags \
        1000GP_filtered_chr${CHR}.vcf.gz \
        -Oz -o 1000GP_AF_chr${CHR}.vcf.gz -- -t AF
done
```
Extract the wanted fields from each VCF file and combine them into a single output file with the header:

```
# Generate a tab-delimited header
echo -e 'CHR\tSNP\tREF\tALT\tAF' \
    > 1000GP_imputation_all.frq

# Query the required fields from the VCF file
# and append to the allele frequency file
for CHR in {1..23}; do
    bcftools query \
    -f '%CHROM\t%CHROM\_%POS\_%REF\_%ALT\t%REF\t%ALT\t%INFO/AF\n' \
    1000GP_AF_chr${CHR}.vcf.gz \
    >> 1000GP_imputation_all.frq
done
```
Note: Chromosome notation in the panel.frq file should follow the GRCh38/hg38 notations ('chr#' for autosomal chromosomes and 'chrX' for chromosome 23).



1.3.6 Create binary reference panel files

The phased reference panel files per chromosome are required in bref format (bref = binary reference). For more information, see the Beagle documentation at the Beagle site:
https://faculty.washington.edu/browning/beagle/bref.16Dec15.pdf.

The required bref.*.jar can be downloaded from the Beagle site:
```
wget https://faculty.washington.edu/browning/beagle/bref.27Jan18.7e1.jar
```
Use the processed imputation reference panel VCFs as inputs for the example command below.
The output files have the suffix '.bref' instead of '.vcf.gz'.
```
# Convert each file to bref format
for CHR in {1..23}; do
    java -jar /path/to/bref.27Jan18.7e1.jar \
        1000GP_AF_chr${CHR}.vcf.gz
done
```

1.3.7 Generate a list of the reference panel sample IDs

A list of the sample IDs present in the reference panel, one sample ID per line, can be generated from any of the VCF files as in the example below (assuming that all chromosomes contain the same set of samples):
```
bcftools query -l 1000GP_AF_chr22.vcf.gz \
    > 1000GP_sample_IDs.txt
```
1.4. You are ready to start!

As the last preparatory step, let's go over the required input data file(s) and the expected final output files.

1.4.1 Input file:

Post-QC chip genotype data in VCFv4.2 format, with chrX genotypes as diploid genotypes:
DATASET.vcf.gz

Note: Chromosome notation should follow the GRCh38/hg38 notations (e.g. 'chr#' for autosomal chromosomes, 'chrX', 'chrY' and 'chrM').

Note: If the input data was lifted over from an older genome build to build version 38, careful inspection of the data is highly recommended before proceeding with the protocol.

Note: If chrX is represented as haploid genotypes, use only the first command of step 1.3.3, 'bcftools +fixploidy' (not the other two piped commands), to convert them to diploid genotypes.


1.4.2 Final output files:

DATASET_imputed_info_chr#.vcf.gz (where # is the chromosome number)

DATASET_postimputation_summary_plots.pdf

Note: Several intermediate files are created during the protocol. Those files can be used for troubleshooting and deleted once successful imputation is confirmed.



--------------------------------------------------------------------------------