└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | # Imputation beagle tutorial
  2 | Throughout the protocol we assume the Bash shell.
  3 | 
  4 | This is a tutorial forked from: https://www.protocols.io/run/genotype-imputation-workflow-v3-0-xbgfijw
  5 | 
  6 | The tutorial is split into the following steps:
  7 | 
  8 | 1. Install Anaconda
  9 | 2. Install the required software
 10 | 3. Run the imputation
 11 | 
 12 | ## Create a new conda environment called imputation
 13 | 
 14 | `conda create -n imputation python=3.6 anaconda`
 15 | 
 16 | Activate your new env:
 17 | 
 18 | `conda activate imputation`
 19 | 
 20 | Install the required packages and software:
 21 | ```
 22 | conda install -c bioconda eagle
 23 | conda install -c bioconda beagle
 24 | conda install -c r r-base
 25 | conda install -c bioconda bcftools
 26 | conda install -c conda-forge r-data.table
 27 | conda install -c conda-forge r-sm
 28 | ```
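
After installation, it can help to confirm that the tools are visible in the active environment. The sketch below is not part of the original protocol: `missing_tools` is a hypothetical helper, and the executable names passed to it are assumptions that may differ between package versions.

```shell
# Hypothetical helper: print every command name from the argument list
# that cannot be found on the current PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing_tools bcftools Rscript java
```

An empty output means that every listed command was found.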
 29 | 
 30 | ## Or download and install the software packages manually
 31 | 
 32 | Required software packages are listed below with the versions used in this protocol. However, using the latest versions is
 33 | recommended.
 34 | * BCFtools v1.7 (or later version) http://www.htslib.org/download/
 35 | * R v3.4.1 (or later version) https://www.r-project.org/
 36 | * R package data.table https://github.com/Rdatatable/data.table/wiki/Installation
 37 | * R package sm https://cran.r-project.org/package=sm
 38 | * Eagle v2.3.5 https://data.broadinstitute.org/alkesgroup/Eagle/
 39 | * Beagle v4.1 beagle.27Jan18.7e1.jar https://faculty.washington.edu/browning/beagle/b4_1.html
 40 | * Beagle bref bref.27Jan18.7e1.jar https://faculty.washington.edu/browning/beagle/bref.27Jan18.7e1.jar
 41 | 
 42 | ## Reference genome and genetic map files
 43 |  
 44 | ### FASTA files
 45 | Homo sapiens assembly hg38 version 0 is used. The files are available for download from the Broad Institute storage in Google Cloud at [hg38](https://console.cloud.google.com/storage/browser/broad-references/hg38/v0/?pli=1). The required files are:
 46 | * Homo_sapiens_assembly38.fasta
 47 | * Homo_sapiens_assembly38.fasta.fai
 48 | 
 49 | 
 50 | 
 51 | ### 1.2.2 Genetic map files for phasing with Eagle
 52 | A genetic map file (all chromosomes in a single file) with recombination frequencies for GRCh38/hg38 is available for download from the Eagle download page:
 53 | https://data.broadinstitute.org/alkesgroup/Eagle/downloads/tables/
 54 | genetic_map_hg38_withX.txt.gz
 55 | 
 56 | We processed the file with the command below to split it per chromosome with correct headers.
 57 | The resulting files are saved as:
 58 | eagle_chr#_b38.map (where # is the chromosome number)
 59 | 
 60 | ```
 61 | for CHR in {1..23}; do
 62 |     zcat genetic_map_hg38_withX.txt.gz | \
 63 |     awk -v chr="${CHR}" '$1 == chr' | \
 64 |     sed '1ichr position COMBINED_rate(cM/Mb) Genetic_Map(cM)' \
 65 |     > eagle_chr${CHR}_b38.map
 66 | done
 67 | ```
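
When splitting the map per chromosome, the chromosome column has to be matched exactly, because a prefix match on `1` also picks up chromosomes 10 to 19. The sketch below shows the difference on made-up rows; `toy_map` and `split_chr` are hypothetical names used only for illustration.

```shell
# Toy genetic map rows (values are made up for illustration).
toy_map() {
    printf '%s\n' \
        'chr position COMBINED_rate(cM/Mb) Genetic_Map(cM)' \
        '1 55550 2.98 0.00' \
        '10 48232 0.12 0.00' \
        '11 130000 0.50 0.10' \
        '2 10000 1.00 0.00'
}

# Hypothetical helper: keep only rows whose first column equals the requested chromosome.
split_chr() {
    awk -v chr="$1" '$1 == chr'
}

# A prefix match counts three rows here (chromosomes 1, 10 and 11).
toy_map | grep -c '^1'

# An exact match on the first column returns only the chromosome 1 row.
toy_map | split_chr 1
```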
 68 | Note: The Eagle genetic map files currently use only the chromosome number, without the 'chr' tag, and chrX is coded as '23'. Starting from Eagle v2.4, chromosome notation with the 'chr' tag is also supported.
 69 | 
 70 | 
 71 | ### 1.2.3 Genetic map files for imputation with Beagle
 72 | Genetic map files for Beagle are available for download from the Beagle download page:
 73 | http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/
 74 | plink.GRCh38.map.zip
 75 | 
 76 | Unzip the files and change the chromosome notation from PLINK format (only number or X) to GRCh38/hg38 standard notation with 'chr' tag as follows:
 77 | 
 78 | ```
 79 | # Unzip the files
 80 | unzip plink.GRCh38.map.zip
 81 | 
 82 | # Rename chromosome 23
 83 | mv plink.chrX.GRCh38.map plink.chr23.GRCh38.map
 84 | 
 85 | # Add 'chr' tag to the beginning of the line and 
 86 | # store the output with suitable filename
 87 | for CHR in {1..23}; do 
 88 |     cat plink.chr${CHR}.GRCh38.map | \
 89 |     sed 's/^/chr/' \
 90 |     > beagle_chr${CHR}_b38.map
 91 | done
 92 | ```
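
A quick way to sanity-check the renaming is to list the distinct chromosome labels in a converted file; exactly one label is expected per file. The sketch below applies the same `sed` expression to made-up PLINK-format rows. The helper names and the toy values are assumptions for illustration only.

```shell
# Toy PLINK-format map rows: chromosome, marker ID, cM position, bp position
# (made-up values).
toy_plink_map() {
    printf '%s\n' \
        'X . 0.000000 2781479' \
        'X . 0.015000 2785000'
}

# Same sed expression as in the renaming loop: prepend 'chr' to every line.
add_chr_tag() {
    sed 's/^/chr/'
}

# List the distinct chromosome labels after renaming.
toy_plink_map | add_chr_tag | cut -d ' ' -f 1 | sort -u
```

For the file renamed to chromosome 23, the label inside the file is still `chrX`, since only the file name was changed.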
 93 | 
 94 | Note: For GRCh38/hg38, the chromosome notation in the Beagle genetic map files is 'chr#' and chromosome 23 is 'chrX'.
 95 | ## 1.3 Imputation reference panel files
 96 | 
 97 | ### 1.3.1 Obtain the reference panel files
 98 | For increased imputation accuracy, we recommend using a population-specific imputation reference panel, if available.
 99 | 
100 | If population-specific reference data is not available, data from, for instance, the 1000 Genomes Project (1000 GP) (www.nature.com/articles/nature15393) can be used instead.
101 | 
102 | GRCh38/hg38 files are available at the EBI 1000 Genomes FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/
103 | 
104 | ```
105 | wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr{{1..22},X}_GRCh38.genotypes.20170504.vcf.gz{,.tbi}
106 | ```
107 | NOTE: 
108 | The reference panel files should contain: 
109 | * phased genotypes,
110 | * chromosome names with 'chr' and chromosome X as 'chrX', 
111 | * all variants as biallelic records,
112 | * only SNPs and INDELs,
113 | * only unique variants, 
114 | * non-missing data, 
115 | * chrX as diploid genotypes, and
116 | * only unique IDs
117 | 
118 | ### 1.3.2 Minimum quality control
119 | 
120 | Here, we have piped most of the processing steps together to save a significant amount of time by avoiding writing out multiple intermediate files. If your imputation reference panel does not require all the steps, modify the command accordingly.
121 | 
122 | For 1000GP data:
123 | 1. Generate a text file containing space-separated old and new chromosome names. This is required to rename the numerical chromosome names with the 'chr' tag. Apply the new chromosome names with 'bcftools annotate'.
124 | 2. Remove the rare variants, here singletons and doubletons, by setting an AC threshold with 'bcftools view'.
125 | 3. Split multiallelic sites into biallelic records with 'bcftools norm'.
126 | 4. Keep only SNPs and INDELs with 'bcftools view'. The 1000GP data include a VT tag in the INFO field, and the data also contain structural variants, which should be excluded.
127 | 5. Align the variants to the reference genome with 'bcftools norm' so that the REF and ALT alleles are in the shortest possible representation and to confirm that the REF allele matches the reference genome. Additionally, remove duplicate variants (-d none).
128 | 6. After alignment, remove multiallelic records with 'bcftools view', since these are formed during the alignment if the REF allele does not match the reference genome.
129 | 7. Finally, remove sites containing missing data with 'bcftools view'.
130 | 
131 | 
132 | ```
133 | # Generate a chromosome renaming file
134 | for CHR in {1..22} X ; do 
135 |     echo ${CHR} chr${CHR}
136 | done > chr_names.txt
137 | 
138 | # Multiple processing commands piped together
139 | for CHR in {1..22} X; do
140 |     bcftools annotate --rename-chrs chr_names.txt \
141 |         ALL.chr${CHR}_GRCh38.genotypes.20170504.vcf.gz -Ou | \
142 |     bcftools view -e 'INFO/AC<3 | INFO/AN-INFO/AC<3' -Ou | \
143 |     bcftools norm -m -any -Ou | \
144 |     bcftools view -i 'INFO/VT="SNP" | INFO/VT="INDEL"' -Ou | \
145 |     bcftools norm -f Homo_sapiens_assembly38.fasta -d none -Ou | \
146 |     bcftools view -m 2 -M 2 -Ou | \
147 |     bcftools view -g ^miss -Oz -o 1000GP_chr${CHR}.vcf.gz
148 | done
149 | ```
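
The exclusion expression `'INFO/AC<3 | INFO/AN-INFO/AC<3'` drops a site when either the ALT allele count or the REF allele count (AN minus AC) is below 3. The sketch below mirrors that logic in plain `awk` on made-up counts; `keep_by_allele_count` is a hypothetical helper and not a bcftools feature.

```shell
# Hypothetical helper mirroring the exclusion expression
# 'INFO/AC<3 | INFO/AN-INFO/AC<3'.
# Reads "ID AC AN" rows and prints the IDs of the sites that would be kept.
keep_by_allele_count() {
    awk '{ ac = $2; an = $3; if (!(ac < 3 || an - ac < 3)) print $1 }'
}

# Made-up sites: only the one with at least 3 copies of both alleles survives.
printf '%s\n' \
    'singleton 1 5008' \
    'doubleton 2 5008' \
    'common 1200 5008' \
    'near_fixed 5007 5008' | keep_by_allele_count
```

Raising or lowering the threshold of 3 changes how many rare variants remain in the panel.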
150 |  
151 | If multiallelic sites are present in your data, in order to preserve them throughout the protocol, set ID field with unique IDs e.g. in format CHR_POS_REF_ALT. (RSIDs might contain duplicates, when the multiallelic sites are decomposed.)
152 | 
153 | ```
154 | for CHR in {1..23}; do
155 |     bcftools annotate \
156 |     --set-id '%CHROM\_%POS\_%REF\_%ALT' \
157 |     panel_chr${CHR}.vcf.gz  \
158 |     -Oz -o panel_SNPID_chr${CHR}.vcf.gz
159 | done
160 | ```
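
To illustrate what the CHR_POS_REF_ALT format achieves, the sketch below rebuilds the ID column of tab-separated VCF records with `awk`. It is only an illustration on made-up records; `set_unique_id` is a hypothetical helper, and on real data the `bcftools annotate --set-id` command is the appropriate tool.

```shell
# Hypothetical helper: rewrite column 3 (ID) of tab-separated VCF records
# as CHROM_POS_REF_ALT. Header lines starting with '#' pass through unchanged.
set_unique_id() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         { $3 = $1 "_" $2 "_" $4 "_" $5; print }'
}

# Two decomposed records of one multiallelic site share an rsID (made-up values).
printf 'chr1\t1000\trs123\tA\tC\nchr1\t1000\trs123\tA\tG\n' | set_unique_id | cut -f 3
```

The two records, which previously shared one RSID, now carry distinct IDs.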
161 | 
162 | ### 1.3.3 Convert haploid genotypes to homozygous diploids
163 | chrX is often represented as haploid genotypes for males; however, Beagle can only handle diploid genotypes.
164 | The 'bcftools +fixploidy' command below produces unphased diploid genotypes. Since the formerly haploid genotypes are now in diploid format as REF/REF or ALT/ALT, we can set the phase for those alleles with a simple sed replacement.
165 | 
166 | The chrX ploidy can be corrected as follows:
167 | 
168 | ```
169 | # Fix the chromosome X ploidy to phased diploid
170 | # Requires a ploidy.txt file containing 
171 | # space-separated CHROM,FROM,TO,SEX,PLOIDY 
172 | echo "chrX 1 156040895 M 2" > ploidy.txt
173 | bcftools +fixploidy \
174 |     1000GP_chrX.vcf.gz -Ov -- -p ploidy.txt | \
175 |     sed 's#0/0#0\|0#g;s#1/1#1\|1#g' | \
176 | bcftools view -Oz -o 1000GP_chr23.vcf.gz
177 | ```
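
The effect of the `sed` replacement can be seen on a single made-up VCF record: unphased homozygous genotypes become phased, while genotypes that are already phased are left as they are. `phase_homozygous` is a hypothetical wrapper used only for this illustration.

```shell
# Replace unphased homozygous genotypes (0/0, 1/1) with phased ones (0|0, 1|1).
phase_homozygous() {
    sed 's#0/0#0|0#g;s#1/1#1|1#g'
}

# One toy record with three samples (made-up values); print only the genotypes.
printf 'chrX\t2781514\trs1\tC\tT\t.\tPASS\t.\tGT\t0/0\t1/1\t0|1\n' \
    | phase_homozygous | cut -f 10-12
```

Note that this is a plain text substitution over the whole line, so it assumes that the patterns `0/0` and `1/1` occur only in the genotype columns.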
178 | 
179 | ### 1.3.4 Duplicate ID removal
180 | Remove duplicate IDs. If you wish to preserve all multiallelic sites, replace the ID column with a unique ID, e.g. CHR_POS_REF_ALT (as indicated in Step 1.3.2).
181 | 
182 | Here, 1000GP did not contain multiallelic sites after AC filtering, and thus RSIDs were preserved in the ID column. Since RSIDs are not always unique, duplicates should be removed.
183 | 
184 | ```
185 | for CHR in {1..23}; do
186 |     bcftools query -f '%ID\n' 1000GP_chr${CHR}.vcf.gz | \
187 |     sort | uniq -d > 1000GP_chr${CHR}.dup_id
188 | 
189 |     if [[ -s 1000GP_chr${CHR}.dup_id ]]; then
190 |     	bcftools view -e ID=@1000GP_chr${CHR}.dup_id \
191 |     	1000GP_chr${CHR}.vcf.gz \
192 |         -Oz -o 1000GP_filtered_chr${CHR}.vcf.gz
193 |     else 
194 |     	mv 1000GP_chr${CHR}.vcf.gz \
195 |         1000GP_filtered_chr${CHR}.vcf.gz
196 |     fi
197 | done
198 | ```
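
The duplicate detection above relies on `sort | uniq -d`, which prints each repeated value once. A minimal illustration on made-up IDs, with `list_duplicate_ids` as a hypothetical wrapper name:

```shell
# Hypothetical helper: list every ID that occurs more than once on standard input.
list_duplicate_ids() {
    sort | uniq -d
}

# Toy ID column, as `bcftools query -f '%ID\n'` would print it (made-up rsIDs).
printf '%s\n' rs10 rs20 rs20 rs30 rs10 rs40 | list_duplicate_ids
```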
199 | ### 1.3.5 Reference panel allele frequencies
200 | Generate a tab-delimited file of the reference panel allele frequencies, one variant per line, with columns CHR, SNP (in format CHR_POS_REF_ALT), REF, ALT, AF (including the header line).
201 | 
202 | First, add (or update) the AF values in the INFO field by calculating them with the BCFtools plugin +fill-tags:
203 | ```
204 | # Calculate AF with the bcftools +fill-tags plugin;
205 | # any existing AF values in the INFO field are recalculated
206 | 
207 | for CHR in {1..23}; do
208 |     bcftools +fill-tags \
209 |         1000GP_filtered_chr${CHR}.vcf.gz \
210 |         -Oz -o 1000GP_AF_chr${CHR}.vcf.gz -- -t AF
211 | done
212 | ```
213 | Extract the required fields from each VCF file and combine them into a single output file with the header:
214 | 
215 | ```
216 | # Generate a tab-delimited header
217 | echo -e 'CHR\tSNP\tREF\tALT\tAF' \
218 |     > 1000GP_imputation_all.frq
219 | 
220 | # Query the required fields from the VCF file 
221 | # and append to the allele frequency file 
222 | for CHR in {1..23}; do
223 |     bcftools query \
224 |     -f '%CHROM\t%CHROM\_%POS\_%REF\_%ALT\t%REF\t%ALT\t%INFO/AF\n' \
225 |     1000GP_AF_chr${CHR}.vcf.gz \
226 |     >> 1000GP_imputation_all.frq
227 | done
228 | ```
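
Before using the allele frequency file downstream, a simple format check can catch malformed lines. The sketch below is an assumption-laden illustration and not part of the original protocol: `check_frq` is a hypothetical helper that reports lines without exactly five tab-separated columns or with an AF value outside the range 0 to 1.

```shell
# Hypothetical check: report lines of a .frq file that do not have exactly five
# tab-separated columns or whose AF is not a number in [0, 1].
# Prints nothing if every line passes.
check_frq() {
    awk 'BEGIN { FS = "\t" }
         NR == 1 { next }    # skip the header line
         NF != 5 || $5 !~ /^[0-9.eE+-]+$/ || $5 < 0 || $5 > 1 {
             print "line " NR ": " $0
         }'
}

# Toy file (made-up values): the last line has an impossible AF and is reported.
printf 'CHR\tSNP\tREF\tALT\tAF\nchr1\tchr1_1000_A_C\tA\tC\t0.25\nchr1\tchr1_2000_G_T\tG\tT\t1.5\n' \
    | check_frq
```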
229 | Note: Chromosome notation in the allele frequency file (here 1000GP_imputation_all.frq) should follow the GRCh38/hg38 notations ('chr#' for autosomal chromosomes and 'chrX' for chromosome 23).
230 |  
231 | 
232 | 
233 | ### 1.3.6 Create binary reference panel files
234 | The phased reference panel files per chromosome are required in bref format (bref = binary reference). For more information, see Beagle documentation at Beagle site:
235 | https://faculty.washington.edu/browning/beagle/bref.16Dec15.pdf.
236 |  
237 | The required bref.*.jar is downloaded from Beagle site:
238 | ```
239 | wget https://faculty.washington.edu/browning/beagle/bref.27Jan18.7e1.jar
240 | ```
241 | Use the processed imputation reference panel VCFs as inputs for the example command below. 
242 | The output files have the suffix '.bref' instead of '.vcf.gz'.
243 | ```
244 | # Convert each file to bref format
245 | for CHR in {1..23}; do
246 |     java -jar /path/to/bref.27Jan18.7e1.jar \
247 |         1000GP_AF_chr${CHR}.vcf.gz
248 | done
249 | ```
250 | 
251 | ### 1.3.7 Generate a list of the reference panel sample IDs
252 | A list of the sample IDs present in the reference panel, one sample ID per line, can be generated from any of the VCF files as in the example below (assuming that all chromosomes contain the same set of samples):
253 | ```
254 | bcftools query -l 1000GP_AF_chr22.vcf.gz \
255 |     > 1000GP_sample_IDs.txt
256 | ```
257 | ## 1.4 You are ready to start!
258 | As the last preparatory step, let's go over the required input data file(s) and the expected final output files!
259 |  
260 | ### 1.4.1 Input file
261 | 
262 | Post-QC chip genotype data in VCFv4.2 format, with chrX genotypes as diploid genotypes:
263 | DATASET.vcf.gz
264 |  
265 | Note: Chromosome notation should follow the GRCh38/hg38 notations (e.g. 'chr#' for autosomal chromosomes, 'chrX', 'chrY' and 'chrM').
266 | 
267 | Note: If the input data was lifted over from an older genome build to build version 38, cautious inspection of the data is highly recommended before proceeding with the protocol.
268 | 
269 | Note: If chrX is represented as haploid genotypes, first follow step 1.3.3 and run only the 'bcftools +fixploidy' command (not the other two piped commands) to convert them to diploid genotypes.
270 |  
271 |  
272 | ### 1.4.2 Final output files
273 | * DATASET_imputed_info_chr#.vcf.gz (where # is the chromosome number)
274 | * DATASET_postimputation_summary_plots.pdf
275 | 
276 | Note: Several intermediate files are created during the protocol. Those files can be used for troubleshooting and deleted once the successful imputation is confirmed.
277 | 
278 | 
279 | 
280 | 


--------------------------------------------------------------------------------