├── ADMIXTURE_Tut.md ├── BASH Commands to Navigate the server.md ├── CITATION.cff ├── GENESIS_PCRelate_Tut.md ├── GRM_Computation_Methods.md ├── Jennifer's R Tutorials.pdf ├── Markdown_Tut.md ├── PLINK_QC.md ├── README.md ├── Subsetting_tutorial.pdf ├── Useful_databases.md ├── ggplot_manhattan.r ├── prefev1_loco_180322.png ├── prefev1_loco_qqman_rsid_180322.png └── prefev1_loco_rsid_180322.png /ADMIXTURE_Tut.md: -------------------------------------------------------------------------------- 1 | # ADMIXTURE Tutorial 2 | #### Pagé Goddard 3 | #### Sep 28, 2017 4 | ___ 5 | ## Content 6 | * [ADMIXTURE Overview](#summary) 7 | * [Set up: Choose your file locations](#files) 8 | * [Pipeline](#format) 9 | - [Formatting genotype data](#format) 10 | - [Build popfile](#popfile) 11 | - [Run Admixture](#admix) 12 | * [Parsing Results](#results) 13 | - [ADMIXTURE Output](#output) 14 | - [Reformat results](#reformat) 15 | - [Plotting Results](#plot) 16 | * [Bonus: Thinning Files](#thin) 17 | ___ 18 | 19 | Tutorial for calculating global ancestry proportions from **known ancestral populations** using ADMIXTURE. 
20 | 21 | Sample data: SAGE 2 22 | 23 | [ADMIXTURE manual](https://www.genetics.ucla.edu/software/admixture/admixture-manual.pdf) 24 | 25 | 26 | ### ADMIXTURE Summary 27 | Input | Output 28 | -----------|-------- 29 | test.bed \(*plink*\) | test.k.Q \(*ancestry fractions*\) 30 | test.bim \(*plink*\) | test.k.P \(*allele freq of inferred pop*\) 31 | test.fam \(*plink*\) 32 | test.pop \(*generate in R*\) 33 | 34 | * test includes cohort of interest AND reference populations 35 | * test binary files are 1/2 encoded 36 | * k = number of reference populations 37 | 38 | 39 | ### File Locations 40 | *script variables and directories* 41 | ```bash 42 | date="171231" # update current yymmdd 43 | wkdir="$HOME/wkdir" # choose your working directory 44 | ancestraldir="path/to/your/ancestry.reference.panels.directory" 45 | datadir="path/to/your/genotype.data.dir" 46 | 47 | # data files 48 | sagegenofile="$datadir/genotype.data" # I've set it up this way so that you can make a second object (e.g. galagenofile) for another dataset 49 | ancestralgenofile="$ancestraldir/ref.panels" # reference panel genotypes 50 | 51 | # all your data should be in PLINK binary format (.bed .bim .fam) 52 | ``` 53 | 54 | ## PLINK: format data 55 | ADMIXTURE requires a single input fileset that contains both your population of unknown ancestry and the reference populations of known ancestry. 56 | If you do not have a combined cohort + reference panel file, you must merge them together in PLINK. 57 | **NB**: PLINK will merge the files in alphabetical order by FID, so you may have your reference panels merged into the middle of the file rather than appended. 
58 | ```bash 59 | # PLINK variables 60 | merge="genotype_ancestry_attempt1_${date}" 61 | ancestry_flip="ref_panels_flippedalleles_${date}" 62 | myadmix="genotype_ancestry_merge_${date}" 63 | 64 | # merge cohort + reference pops 65 | plink --bfile $sagegenofile --bmerge $ancestralgenofile.bed $ancestralgenofile.bim $ancestralgenofile.fam --make-bed --out $merge 66 | 67 | # if the merge fails on mismatched snps, flip the snps listed in the .missnp file and merge again (PLINK 1.9 names it ${merge}-merge.missnp; PLINK 1.07 names it ${merge}.missnp) 68 | plink --bfile $ancestralgenofile --flip ${merge}-merge.missnp --make-bed --out $ancestry_flip 69 | plink --bfile $sagegenofile --bmerge $ancestry_flip --make-bed --out $merge 70 | 71 | # if the error persists, remove the listed snps from both filesets with --exclude ${merge}-merge.missnp 72 | 73 | # recode Cohort + Ref binary files in 1/2 format 74 | plink --bfile $merge --recode 12 --out $myadmix 75 | ``` 76 | 77 | *optional: thin files for admixture, see bottom* 78 | 79 | 80 | ## R: build popfile 81 | A .pop file is required for supervised ADMIXTURE estimation. Each line of the .pop file corresponds to the individual listed on the same line of the .fam file. If the individual is a population reference, the .pop file line should be a string (beginning with an alphanumeric character) designating the population. If the individual is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to indicate that the ancestry should be estimated. 
The final format should be a **single column** with **no header** that lines up with the fam file as shown here: 82 | 83 | fam_ID_source | popfile 84 | ------|------ 85 | sage | - 86 | sage | - 87 | sage | - 88 | euro_ref | ceu 89 | euro_ref | ceu 90 | afr_ref | yri 91 | afr_ref | yri 92 | sage | - 93 | sage | - 94 | 95 | ```R 96 | # R variables 97 | setwd("PATH_TO/wkdir") 98 | date <- "171231" # update current yymmdd 99 | mydat <- "genotype_ancestry_merge_" # same prefix as myadmix 100 | merge <- read.table(file=paste(mydat, date, ".fam", sep=""), header = F, sep = ' ') # .fam file has the samples for your population and reference populations 101 | ceu <- read.table("samples_ceu.txt", header = F, sep = ' ') # european ancestry 102 | yri <- read.table("samples_yri.txt", header = F, sep = ' ') # african ancestry 103 | nam <- read.table("samples_nam.txt", header = F, sep = ' ') # native american ancestry 104 | 105 | # make popfile; read.table names the columns V1, V2, ... (V1 = FID, V2 = IID; match on V2 if your sample lists hold IIDs that differ from the FIDs) 106 | merge$pop = ifelse(merge$V1 %in% ceu$V1, 'CEU', ifelse(merge$V1 %in% yri$V1, 'YRI', '-')) 107 | # if ID is from CEU ref, write CEU; 108 | # if ID is from YRI ref, write YRI; 109 | # (to include the NAM ref, nest a third ifelse for nam$V1 that writes NAM, and set npop to 3 below) 110 | # if ID is in neither ref, it is a SAGE individual with unknown ancestry; write - as placeholder 111 | popfile <- as.data.frame(merge$pop) 112 | 113 | # write out; col.names = F keeps the header out of the file 114 | write.table(popfile, file=paste(mydat, date, ".pop", sep=""), row.names = F, col.names = F, quote = F) 115 | # check that the prefix is the same as the $myadmix input 116 | ``` 117 | 118 | ## ADMIXTURE: calculate global ancestry 119 | ```bash 120 | # ADMIXTURE variables 121 | 122 | ## required 123 | myadmix="genotype_ancestry_merge_${date}" 124 | popfile=${myadmix}.pop 125 | npop="2" # must reflect number of populations in reference panel 126 | 127 | ## optional 128 | nthreads="64" 129 | accelnum="qn3" 130 | seed="2017" 131 | #bootnum="200" # optional 132 | 133 | 134 | # admixture script 135 | admixture $myadmix.ped $npop --supervised --seed=$seed -j$nthreads 136 | ``` 137 | 
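Supervised mode only works if the `.pop` file lines up with the samples line for line, so it can be worth checking this before launching the run above. The snippet below is a minimal sketch, not part of ADMIXTURE or PLINK: `check_popfile` is a hypothetical helper name, and it assumes a `.fam` and a `.pop` file that share the same prefix. A header row left in the `.pop` file is one common cause of a mismatch.

```bash
# Hypothetical helper (an assumption for illustration, not an ADMIXTURE tool):
# confirm that <prefix>.pop has exactly one line per sample in <prefix>.fam.
check_popfile() {
    local prefix="$1"
    local n_fam n_pop
    n_fam=$(awk 'END {print NR}' "${prefix}.fam")
    n_pop=$(awk 'END {print NR}' "${prefix}.pop")
    if [ "$n_fam" -ne "$n_pop" ]; then
        echo "mismatch: ${n_fam} samples in .fam, ${n_pop} lines in .pop" >&2
        return 1
    fi
    # tally the labels; "-" marks samples whose ancestry will be estimated
    sort "${prefix}.pop" | uniq -c
}
```

Usage would look like `check_popfile "$myadmix"`; the tally should show one row per reference population plus one row for `-`.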
138 | 139 | ## Parsing ADMIXTURE results 140 | 141 | 142 | ### Admixture output: 143 | myadmix.2.P - allele frequencies of the inferred ancestral populations (SNP x ancestry) 144 | *order: CEU YRI* 145 | ```bash 146 | wc -l $myadmix.2.P 147 | # 801660 lines 148 | head $myadmix.2.P 149 | # 0.006379 0.000010 150 | # 0.827382 0.404886 151 | # 0.048119 0.021815 152 | # 0.874641 0.306484 153 | ``` 154 | myadmix.2.Q - ancestry fractions for each individual (sample x ancestry) 155 | *order: CEU YRI* 156 | ```bash 157 | wc -l $myadmix.2.Q 158 | # 2165 lines 159 | head $myadmix.2.Q 160 | # 0.142886 0.857114 161 | # 0.123782 0.876218 162 | # 0.208400 0.791600 163 | # 0.109842 0.890158 164 | ``` 165 | 166 | ### Reformat results 167 | The following output can be used with programs such as [REAP](http://faculty.washington.edu/tathornt/software/REAP/REAP_Documentation.pdf) to calculate genetic relatedness matrices. 168 | 169 | ```bash 170 | # REAP variables 171 | reapinput="genotype_transposed_${date}" 172 | myID="${reapinput}_IID_only.txt" 173 | admixprop="admixprop_${date}" 174 | admixprop_sorted="admixprop_sorted_${date}" 175 | 176 | # transpose genotype files 177 | plink --bfile $sagegenofile --recode 12 transpose --output-missing-genotype 0 --out $reapinput 178 | 179 | # reformat 180 | cut -d' ' -f1 $reapinput.tfam > $myID # SAGE ID list 181 | cut -d' ' -f1-2 $merge.fam | paste -d' ' - $myadmix.2.Q > admix_out_IDs_${date} # paste global ancestries to IDs from the merged .fam; FID IID CEU YRI 182 | # -d' ' reads space as delimiter 183 | # -f1-2 selects fields 1 through 2 to extract with cut 184 | # - directs paste to use the standard input from the pipe instead of a file 185 | grep -Fwf $myID admix_out_IDs_${date} > $admixprop # extract sage only 186 | # -f read patterns from a file 187 | # -F read patterns as plain strings 188 | # -w match patterns as whole word 189 | awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' $admixprop $reapinput.tfam > $admixprop_sorted 190 | # sorts 
$admixprop to match SAGE fam file 191 | # FNR==NR holds true only while the first named file is being read 192 | # current line $0 is stored in associative array named x2 indexed by the first field [$1] 193 | # $1 in x2 will only start after the first named file has been read completely 194 | # looks at the first field of each line in the second file, and prints the corresponding line from the first 195 | ``` 196 | 197 | **output** 198 | *order: FID IID EUR AFR* 199 | ```bash 200 | head $admixprop_sorted 201 | # CH30380 CH30380 0.153140 0.846860 202 | # VA30171 VA30171 0.120529 0.879471 203 | # VA30167 VA30167 0.180591 0.819409 204 | # CH30357 CH30357 0.134801 0.865199 205 | ``` 206 | 207 | ### Plotting results 208 | 209 | ```R 210 | #### files must be formatted with: race, eur, afr, natam 211 | # no subjectID, race column must come first. check example inputs 212 | 213 | ### Call libraries 214 | require(truncnorm) 215 | require(ggplot2) 216 | require(gridExtra) 217 | require(reshape2) 218 | 219 | # set up space 220 | setwd("PATH_TO/wkdir") 221 | date <- "171231" # update current yymmdd 222 | sage <- read.table(paste0("genotype_ancestry_merge_", date, ".2.Q"), header = FALSE) # same prefix as $myadmix; shell variables are not expanded inside R 223 | colnames(sage) <- c("EUR","AFR") 224 | sage <- cbind(race = "AA", sage) # assign the result, otherwise the race column is not kept 225 | head(sage) 226 | # race EUR AFR 227 | # AA 0.142886 0.857114 228 | # AA 0.123782 0.876218 229 | # AA 0.208400 0.791600 230 | # AA 0.109842 0.890158 231 | 232 | # order data by increasing european ancestry 233 | eur.sort = sage[order(sage$EUR),] 234 | eur.sort$rank = 1:length(eur.sort$EUR) 235 | 236 | # format data into geom_bar input 237 | eur.sort.long = melt(eur.sort, id = c('rank','race'), variable.name = 'Ancestry') 238 | head(eur.sort.long) 239 | # race rank Ancestry value 240 | # AA 1 EUR 0.088553 241 | # AA 2 EUR 0.100432 242 | # AA 3 EUR 0.134868 243 | # AA 4 EUR 0.181657 244 | 245 | # create plots 246 | afr.plot <- ggplot(eur.sort.long, aes(x=rank, y=value, fill=Ancestry)) 247 | afr.plot <- afr.plot + geom_bar(width=1, stat='identity') # play with width to see what it does 248 | afr.plot <- 
afr.plot + theme(legend.position="bottom") # move legend 249 | afr.plot <- afr.plot + xlab('Individual') + ylab('Genetic ancestry') # label the axes 250 | afr.plot <- afr.plot + ggtitle('SAGE 2') # create a title 251 | afr.plot <- afr.plot + scale_fill_manual(labels=c("Eur","Afr"),values=c("#cc9999","#660099")) 252 | afr.plot <- afr.plot + theme(axis.text.x=element_blank(), 253 | axis.ticks.x=element_blank()) 254 | 255 | afr.plot 256 | ``` 257 | 258 | ![resulting plot](file:///C:/Users/page/Desktop/sage_ancestry_plot.png) 259 | 260 | 261 | \# **QED** \# 262 | 263 | *** 264 | 265 | ## Thinning files 266 | [Documentation](https://www.genetics.ucla.edu/software/admixture/admixture-manual.pdf) SEC 2.3 267 | 268 | * speeds up ADMIXTURE 269 | 270 | *SAGE 2* 271 | ```bash 272 | # PLINK variables 273 | datadir="path/to/your/genotype.data.dir" 274 | genofile=$myadmix 275 | slidingwindowsize=50 276 | slidewidth=10 277 | r2val="0.1" 278 | thingenofile="${myadmix}_thinned_indeppairwise_${slidingwindowsize}_${slidewidth}_${r2val}" 279 | 280 | # run PLINK: --indep-pairwise only writes the list of SNPs to keep (.prune.in); a second call with --extract writes the thinned binary files 281 | plink --bfile $genofile --indep-pairwise $slidingwindowsize $slidewidth $r2val --out $thingenofile && plink --bfile $genofile --extract $thingenofile.prune.in --make-bed --out $thingenofile 282 | ``` 283 | You can then continue through the popfile and admixture steps with the thinned files. 
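Before continuing, it may help to confirm that thinning changed only the SNPs and not the samples. The following is a rough sketch under stated assumptions: `check_thinned` is a hypothetical helper (not a PLINK command), and it expects the original and thinned filesets to each have a `.fam` and a `.bim` file.

```bash
# Hypothetical helper (an assumption for illustration, not a PLINK command):
# compare a thinned PLINK fileset with the fileset it was derived from.
check_thinned() {
    local orig="$1" thin="$2"
    # the sample order must be unchanged for an existing popfile to stay valid
    if ! cmp -s "${orig}.fam" "${thin}.fam"; then
        echo "sample order differs between ${orig}.fam and ${thin}.fam" >&2
        return 1
    fi
    local n_orig n_thin
    n_orig=$(awk 'END {print NR}' "${orig}.bim")
    n_thin=$(awk 'END {print NR}' "${thin}.bim")
    echo "SNPs: ${n_orig} before thinning, ${n_thin} after"
}
```

Usage would look like `check_thinned "$genofile" "$thingenofile"`. If the SNP count does not drop, the thinned files were probably written without extracting the pruned SNP list.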
284 | **NB**: Remember to check the sample order of thingenofile before making the popfile 285 | -------------------------------------------------------------------------------- /BASH Commands to Navigate the server.md: -------------------------------------------------------------------------------- 1 | # Navigating the Server: Bash Commands 2 | ### Notes by: Pagé Goddard 3 | ### Tutorial session by: Jennifer Liberto 4 | ___ 5 | ### Contents 6 | * [`ls` - Looking in Directories](#ls) 7 | * [Permissions](#permissions) 8 | * [`cd` - Changing Directories](#cd) 9 | * [`less` - Read files](#less) 10 | * [`screen` - Screen a command](#screen) 11 | * [`scp` - Copying files to desktop](#scp) 12 | 13 | 14 | ### look at your directory contents 15 | * `ls` lists just the file and subfolder names in your directory 16 | * `ls -1` does the same, but puts it all in a single column 17 | * `ll` lists contents with permissions, size, and date information (on most servers it is an alias for `ls -l`) 18 | * `ls -lrta` long listing (`-l`) of everything, including hidden files (`-a`), sorted by time (`-t`) in reverse (`-r`), so the latest is last in the list 19 | 20 | 21 | ### permissions 22 | `drwxr-xr--` *example 1* 23 | `-rwxr-xr--` *example 2* 24 | 25 | * initial `d` = directory; `-` = file; the three `rwx` triplets that follow apply to the owner, the group, and everyone else 26 | * `w` = can write 27 | * `x` = can execute 28 | * `r` = can read 29 | 30 | 31 | ### changing directories 32 | 33 | ```bash 34 | pwd 35 | /media/burchardraid01/datafreeze # example directory 36 | 37 | cd .. 
&& pwd # change then print new directory 38 | /media/burchardraid01 # parent directory 39 | 40 | cd ~ && pwd 41 | /media/.../your.directory # you are now in your home directory 42 | 43 | cd path/to/desired/directory && pwd 44 | path/to/desired/directory # you are now in the directory you asked for 45 | ``` 46 | 47 | `pwd` = print working directory; tells you where you are 48 | 49 | `cd` = change directory 50 | 51 | `.` = current directory 52 | 53 | `..` = parent directory (one level up) 54 | 55 | `~` = home 56 | 57 | ###### note: when typing a path, use `/` not backslash; if you get an error, check that you are not trying to `cd` into a file instead of a directory; if the error persists, check your slashes, your capitalization and your path 58 | 59 | 60 | ### visualize files 61 | `less filename` = opens a one-page view of the data without printing the whole thing to screen; move by line (arrow keys) or page (space); press `q` to quit 62 | 63 | `head filename` = prints first 10 lines of file 64 | 65 | * `head -3 filename` = prints first 3 lines 66 | * `tail filename` = prints last 10 lines of file 67 | * `tail -3 filename` = prints last 3 lines 68 | * `head -10 filename | tail -3` = prints lines 8-10 of file (the last 3 of the first 10 lines) 69 | * `cut -d' ' -f 1-5 filename` = print columns 1-5 of file; `d` is your file delimiter (the symbol that separates columns, usually space `' '`, comma `','`, or tab, which is the default for `cut`) 70 | * `cut -d' ' -f 1-5 filename | head -5` = print only rows 1-5 of columns 1-5 of file 71 | 72 | `vim filename` = lets you view and edit file 73 | 74 | * to edit, press `i` (for insert) and write as you please 75 | * navigate lines with arrow keys 76 | * to save press `Esc` and type :w 77 | * to quit, press `Esc` and type :q 78 | * to save and quit type :wq 79 | 80 | 81 | ### run a process in the background 82 | *this is especially useful if you are running an automated script that will take a while and doesn't require interaction after it starts* 83 | 84 | start a new screen 85 | ```bash 86 | screen 87 | # this 
changes your console window to a detachable "screen" 88 | # title will now read "screen 0: username@hostname:~/PATH/TO/current_directory" 89 | # normal title bar reads: "username@hostname:~/PATH/TO/current_directory" 90 | ``` 91 | start a new screen with a particular name 92 | ```bash 93 | screen -S name 94 | ``` 95 | detach your screen 96 | ```bash 97 | # while in the new screen type 98 | ctrl+A ctrl+D 99 | ``` 100 | return to your screen 101 | ```bash 102 | screen -r # if you only have one screen 103 | # if you have multiple screens, the list will print here 104 | 105 | screen -r [number/name] # if returning to one of multiple screens 106 | ``` 107 | kill an old screen 108 | ```bash 109 | screen -r [name/number] # return to screen 110 | 111 | # type 112 | ctrl+A k 113 | ``` 114 | 115 | 116 | ### copying files to your computer 117 | 118 | ##### on a mac/linux 119 | 120 | ```bash 121 | exit # leave server 122 | scp username@hostname.edu:path/to/file/filename.csv ~/Desktop 123 | ``` 124 | 125 | * translation: copy/paste \[this file on server\] \[to my personal desktop\] 126 | * `scp` = secure copy (copies files over a secure ssh connection) 127 | * to push to the server, just switch ~/Desktop and username@hostname.edu:/path 128 | 129 | ##### or use filezilla GUI 130 | 131 | Angel's step-by-step instructions for setting up `FileZilla` can be found on the Wiki [here](https://wiki.library.ucsf.edu/display/UAC/How+to+transfer+files+between+cesar+and+your+desktop+with+your+private+key) 132 | 133 | ## Congrats! You can now read your data into RStudio from your personal computer 134 | -------------------------------------------------------------------------------- /CITATION.cff: -------------------------------------------------------------------------------- 1 | cff-version: 1.2.0 2 | message: "If you found any of this code uniquely helpful, you may cite it as below." 
3 | authors: 4 | - family-names: "Goddard" 5 | given-names: "P" 6 | orcid: "https://orcid.org/0000-0001-8187-5316" 7 | - family-names: "Elhawary" 8 | given-names: "J" 9 | orcid: "https://orcid.org/0000-0003-3326-1680" 10 | title: "Burchardlab Tutorials {optional tutorial title}" 11 | version: 1.0.0 12 | doi: 10.5281/zenodo.1234 13 | date-released: 2021-08-16 14 | url: "https://github.com/pcgoddard/Burchardlab_Tutorials/wiki" 15 | -------------------------------------------------------------------------------- /GENESIS_PCRelate_Tut.md: -------------------------------------------------------------------------------- 1 | # GENESIS PC-Relate Tutorial 2 | #### Pagé Goddard 3 | 4 | "`GENESIS` uses `PC-AiR` for **population structure** inference that is robust to known or cryptic relatedness, and it uses `PC-Relate` for accurate **relatedness estimation** in the presence of population structure, admixture, and departures from Hardy-Weinberg equilibrium." 5 | 6 | ### Resources 7 | 8 | [GENESIS Vignette](https://rdrr.io/bioc/GENESIS/f/vignettes/pcair.Rmd), 9 | [Bioconductor Vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/GENESIS/inst/doc/pcair.html#plink-files) 10 | 11 | [KING Documentation](http://people.virginia.edu/~wc9c/KING/manual.html) 12 | 13 | [SNPRelate](https://www.rdocumentation.org/packages/SNPRelate/versions/1.6.4) 14 | 15 | ### Libraries 16 | ```R 17 | # be sure to install biocLite (on Bioconductor 3.8 and later, use BiocManager::install() instead) 18 | source("https://bioconductor.org/biocLite.R") 19 | biocLite(c('GENESIS',"GWASTools", "SNPRelate")) 20 | library(GENESIS) 21 | library(GWASTools) 22 | library(SNPRelate) 23 | library(gdsfmt) 24 | ``` 25 | 26 | ### Input Files 27 | GENESIS uses a model-free approach and thus requires only the genotype file and no externally calculated ancestry proportions. The functions in the `GENESIS` package read genotype data from a `GenotypeData` class object created by the `GWASTools` package. 
Through the use of `GWASTools`, a `GenotypeData` class object can easily be created from: 28 | 29 | * PLINK Files 30 | * GDS File 31 | * R Matrix of Genotype Data 32 | 33 | #### PLINK files 34 | The `SNPRelate` package provides the `snpgdsBED2GDS` function to convert binary PLINK files into a GDS file. 35 | 36 | argument | description 37 | -----------|----------- 38 | `bed.fn` | path to `PLINK.bed` file 39 | `bim.fn` | path to `PLINK.bim` file 40 | `fam.fn` | path to `PLINK.fam` file 41 | `out.gdsfn`|path for output GDS file 42 | 43 | ```R 44 | snpgdsBED2GDS(bed.fn = "genotype.bed", bim.fn = "genotype.bim", fam.fn = "genotype.fam", out.gdsfn = "mygenotype.gds") 45 | ``` 46 | 47 | Then continue with GDS file instructions: 48 | 49 | #### GDS files 50 | 51 | ```R 52 | mygeno <- GdsGenotypeReader(filename = "PATH_TO/mygenotype.gds") 53 | mygenoData <- GenotypeData(mygeno) 54 | ``` 55 | 56 | #### R Matrix 57 | argument | description 58 | -----------|----------- 59 | `genotype` | matrix of genotype values coded as 0 / 1 / 2; rows index SNPs; columns index samples 60 | `snpID` | integer vector of unique SNP IDs 61 | `chromosome` | integer vector specifying chr of each SNP 62 | `position` | integer vector specifying position of each SNP 63 | `scanID` | vector of unique individual IDs 64 | 65 | ```R 66 | mygeno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID, chromosome = chromosome, position = position, scanID = scanID) 67 | 68 | mygenoData <- GenotypeData(mygeno) 69 | ``` 70 | 71 | ### Pairwise Measures of Ancestry Divergence 72 | PC-AiR needs a **mutually unrelated** and **ancestry representative** subset of individuals. The `KING-robust` kinship coefficient estimator provides negative estimates for unrelated pairs with divergent ancestry, which is used to prioritize ancestrally-diverse individuals for the representative subset. The estimates can be calculated with the external KING software (and read into R with `king2mat` from `GENESIS`) or with `snpgdsIBDKING` from `SNPRelate`. 
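Once KING has been run (see the next subsection), the negative estimates can be inspected directly in the `.kin0` file. The snippet below is only a sketch: `list_divergent_pairs` is a hypothetical helper name, the column positions assume the `.kin0` layout shown in this tutorial (Kinship in column 8), and the default cutoff of -0.022 is borrowed from the divergence threshold that `pcair` reports in its summary further down.

```bash
# Hypothetical helper (an assumption for illustration, not part of KING):
# list between-family pairs whose KING kinship estimate falls below a
# negative cutoff, i.e. pairs flagged as having divergent ancestry.
list_divergent_pairs() {
    local kin0="$1" thresh="${2:--0.022}"
    # NR > 1 skips the header; $8 is the Kinship column of the .kin0 file
    awk -v t="$thresh" 'NR > 1 && $8 < t {print $1, $2, $3, $4, $8}' "$kin0"
}
```

Usage would look like `list_divergent_pairs king.kin0` or, with a stricter cutoff, `list_divergent_pairs king.kin0 -0.1`. Each output row holds FID1, ID1, FID2, ID2 and the kinship estimate.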
73 | 74 | ##### with KING-robust 75 | ###### KING (external [program](http://people.virginia.edu/~wc9c/KING/manual.html)) 76 | * **input**: PLINK binary file `genotype.bed` 77 | * **command**: king -b `genotype.bed` --kinship 78 | * **output**: `king.kin`, `king.kin0` 79 | 80 | KING Output: `.kin` and `.kin0` text files. `king2mat` extracts kinship coefficients for `GENESIS` functions. 81 | 82 | ```bash 83 | king -b genotype.bed --kinship 84 | ``` 85 | 86 | ```bash 87 | # sample output 88 | 89 | head king.kin # within family 90 | ## FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 Kinship Error 91 | ## 28 1 2 2359853 0.000 0.2500 0.162 0.0008 0.2459 0 92 | ## 28 1 3 2351257 0.000 0.2500 0.161 0.0008 0.2466 0 93 | ## 28 2 3 2368538 1.000 0.0000 0.120 0.0634 -0.0108 0 94 | ## 117 1 2 2354279 0.000 0.2500 0.163 0.0006 0.2477 0 95 | ## 117 1 3 2358957 0.000 0.2500 0.164 0.0006 0.2490 0 96 | 97 | head king.kin0 # between family 98 | ## FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship 99 | ## 28 3 117 1 2360618 0.143 0.0267 0.1356 100 | ## 28 3 117 2 2352628 0.161 0.0009 0.2441 101 | ## 28 3 117 3 2354540 0.120 0.0624 -0.0119 102 | ## 28 3 1344 1 2361807 0.093 0.1095 -0.2295 103 | ## 28 3 1344 12 2367180 0.094 0.1091 -0.2225 104 | ``` 105 | 106 | ###### convert to matrix with king2mat (GENESIS package) 107 | * **input**: `.kin` and `.kin0` and `iids` 108 | * **command**: `king2mat` 109 | * **output**: matrix 110 | 111 | `iids` = vector of individual IDs, in the same order as in the GenotypeData object 112 | 113 | ```R 114 | # read individual IDs from GenotypeData object 115 | iids <- getScanID(mygenoData) 116 | head(iids) 117 | 118 | # create matrix of KING estimates (point to your own KING output; system.file() only finds files shipped inside the GENESIS package) 119 | KINGmat <- king2mat(file.kin0 = "PATH_TO/king.kin0", 120 | file.kin = "PATH_TO/king.kin", 121 | iids = iids) 122 | ``` 123 | 124 | ```R 125 | # sample output 126 | 127 | KINGmat[1:5,1:5] 128 | 129 | ## NA19919 NA19916 NA19835 NA20282 NA19703 130 | ## NA19919 0.5000 -0.0009 
-0.0059 -0.0080 0.0014 131 | ## NA19916 -0.0009 0.5000 -0.0063 -0.0150 -0.0039 132 | ## NA19835 -0.0059 -0.0063 0.5000 -0.0094 -0.0104 133 | ## NA20282 -0.0080 -0.0150 -0.0094 0.5000 -0.0134 134 | ## NA19703 0.0014 -0.0039 -0.0104 -0.0134 0.5000 135 | ``` 136 | 137 | ##### with SNPRelate 138 | ###### preferred since it can be done without leaving R 139 | The vignette states: "Alternative to running the KING software, the `snpgdsIBDKING` function from the `SNPRelate` package can be used to calculate the KING-robust estimates directly from a GDS file. The output of this function contains a matrix of pairwise estimates, which can be used by the `GENESIS` functions" **this is misleading**: the returned matrix has no sample labels. You must extract the matrix and add the IIDs as row and column names before `GENESIS` can use it. 140 | 141 | Input: `mygenotype.gds` 142 | Command: `snpgdsIBDKING` 143 | Output: matrix of pairwise estimates 144 | 145 | Options: 146 | **type** ("KING-robust" - for admixed pop, "KING-homo" - for homogeneous pop), 147 | **sample.id** (choose samples subset; default all), 148 | **snp.id** (choose SNPs subset; default all), 149 | **autosome.only** (default TRUE), 150 | **maf** (to use the SNPs with ">=maf" only; default no threshold), 151 | **family.id** (default NULL; all individuals treated as singletons. 
If provided, within- and between-family relationships are estimated differently), 152 | **verbose** 153 | 154 | ```r 155 | # snpgdsIBDKING requires a gds object; you cannot just point the command to a .gds file 156 | # read in the GDS file you just generated and verify its class 157 | gdsfile <- snpgdsOpen(paste(genotype,".gds",sep="")) # genotype = your file prefix as a string, e.g. "mygenotype" 158 | class(gdsfile) #check for "gds.class" 159 | ``` 160 | ```r 161 | # calculate KING IBD Kinship coefficients 162 | ibd_king <- snpgdsIBDKING(gdsfile, type="KING-robust", verbose=TRUE) 163 | ``` 164 | ```r 165 | class(ibd_king) 166 | # snpgdsIBDClass (5 elements) 167 | names(ibd_king) 168 | # [1] "sample.id" "snp.id" "afreq" "IBS0" "kinship" 169 | ``` 170 | ```r 171 | # extract kinship matrix 172 | KINGmat = as.matrix(ibd_king$kinship) 173 | 174 | # check output 175 | KINGmat[1:5,][,1:5] 176 | 177 | ## 0.5000000000 -0.004695793 0.001941412 -0.0009816093 -0.01903618 178 | ## -0.0046957930 0.500000000 -0.009200767 -0.0070120462 -0.03430987 179 | ## 0.0019414117 -0.009200767 0.500000000 -0.0085147706 -0.01809720 180 | ## -0.0009816093 -0.007012046 -0.008514771 0.5000000000 -0.02480543 181 | ## -0.0190361844 -0.034309866 -0.018097203 -0.0248054274 0.50000000 182 | 183 | # add row and column labels 184 | rownames(KINGmat) <- c(ibd_king$sample.id) 185 | colnames(KINGmat) <- c(ibd_king$sample.id) 186 | ``` 187 | 188 | ```R 189 | # SAGE2 output using SNPRelate 190 | 191 | KINGmat[1:5,][,1:5] 192 | 193 | ## CH30380 VA30171 VA30167 CH30357 VA70028 194 | ## CH30380 0.5000000000 -0.004695793 0.001941412 -0.0009816093 -0.01903618 195 | ## VA30171 -0.0046957930 0.500000000 -0.009200767 -0.0070120462 -0.03430987 196 | ## VA30167 0.0019414117 -0.009200767 0.500000000 -0.0085147706 -0.01809720 197 | ## CH30357 -0.0009816093 -0.007012046 -0.008514771 0.5000000000 -0.02480543 198 | ## VA70028 -0.0190361844 -0.034309866 -0.018097203 -0.0248054274 0.50000000 199 | ``` 200 | 201 | ### Running PC-AiR 202 | Uses pairwise measures of kinship and ancestry 
divergence to determine the unrelated and representative subset for analysis. 203 | 204 | The KING-robust estimates are always used as measures of ancestry divergence for unrelated pairs of individuals; they can also be used as measures of kinship for relatives (NOTE: they may be biased measures of kinship for admixed relatives with different ancestry) 205 | 206 | ###### Input: 207 | input | description 208 | ------|------ 209 | `genoData` | `GenotypeData` class object 210 | `kinMat` | matrix of pairwise kinship coefficient estimates (KING-robust estimates or other source) 211 | `divMat` | matrix of pairwise measures of ancestry divergence (KING-robust estimates) 212 | 213 | ```R 214 | # run PC-AiR 215 | mypcair <- pcair(genoData = mygenoData, kinMat = KINGmat, divMat = KINGmat) 216 | ``` 217 | You should see the following verbiage: 218 | 219 | ``` 220 | Partitioning Samples into Related and Unrelated Sets... 221 | Unrelated Set: 1665 Samples 222 | Related Set: 320 Samples 223 | Running Analysis with 748665 SNPs - in 75 Block(s) 224 | Computing Genetic Correlation Matrix for the Unrelated Set: Block 1 of 75 ... 225 | ... 226 | Computing Genetic Correlation Matrix for the Unrelated Set: Block 75 of 75 ... 227 | Performing PCA on the Unrelated Set... 228 | Predicting PC Values for the Related Set: Block 1 of 75 ... 229 | ... 230 | Predicting PC Values for the Related Set: Block 75 of 75 ... 231 | Concatenating Results... 232 | ``` 233 | 234 | ###### Output: 235 | 236 | ```R 237 | summary(mypcair) 238 | ``` 239 | 240 | ``` 241 | Call: 242 | pcair(genoData = myGenoData, kinMat = KINGmat, divMat = KINGmat) 243 | 244 | PCA Method: PC-AiR 245 | 246 | Sample Size: 1985 247 | Unrelated Set: 1665 Samples 248 | Related Set: 320 Samples 249 | 250 | Kinship Threshold: 0.02209709 251 | Divergence Threshold: -0.02209709 252 | 253 | Principal Components Returned: 20 254 | Eigenvalues: 11.286 2.216 1.95 1.924 1.914 1.866 1.849 1.837 1.815 1.797 ... 
255 | 256 | MAF Filter: 0.01 257 | SNPs Used: 644529 258 | ``` 259 | 260 | ###### Options 261 | 262 | using a reference population alongside the sample, perhaps to determine which PCs capture which ancestral groups 263 | 264 | ```r 265 | pcair(genoData = mygenoData, unrel.set = IDs) # IDs of individuals from ref panel included in mygenoData 266 | 267 | pcair(genoData = mygenoData, kinMat = KINGmat, divMat = KINGmat, unrel.set = IDs) # IDs of individuals from ref panel included in mygenoData; partition IDs first, then sample 268 | ``` 269 | 270 | ###### Plotting PC-AiR PCs 271 | 272 | `plot` method provided by `GENESIS` package. Each point represents one individual. Visualization of population structure to **identify clusters of individuals with similar ancestry**. Can be altered by standard `plot` function manipulation. [Basis](https://www.rdocumentation.org/packages/graphics/versions/3.4.0/topics/plot) and [More Details](http://www.statmethods.net/advgraphs/parameters.html) 273 | 274 | default: black dots = unrelated subset; blue pluses = related subset 275 | 276 | ```r 277 | # plot top 2 PCs 278 | plot(mypcair) 279 | 280 | # plot PCs 3 and 4 281 | plot(mypcair, vx = 3, vy = 4) 282 | ``` 283 | 284 | ### Running PC-Relate 285 | Provides **genetic relatedness estimates**. Uses the top PCs from PC-AiR (the ancestry-capturing components) to adjust for population structure and individual ancestry in the sample. 
286 | 287 | ###### Input: 288 | 289 | input | file | description 290 | ------|------|------ 291 | `training.set` | `mypcair$unrels` | vector of IIDs specifying unrelated subset to be used for ancestry adjustment per SNP 292 | `genoData` | `mygenoData` | GenotypeData class object 293 | `pcMat` | `mypcair$vectors[,1:n]` | matrix; columns are PCs 1 - n 294 | 295 | ```r 296 | # run PC-Relate 297 | mypcrelate <- pcrelate(genoData = mygenoData, pcMat = mypcair$vectors[,1:2], training.set = mypcair$unrels) 298 | ``` 299 | 300 | You should see the following verbiage: 301 | ``` 302 | Running Analysis with 748665 SNPs - in 75 Block(s) 303 | Running Analysis with 1985 Samples - in 1 Block(s) 304 | Using 2 PC(s) in pcMat to Calculate Adjusted Estimates 305 | Using 1665 Samples in training.set to Estimate PC effects on Allele Frequencies 306 | Computing PC-Relate Estimates... 307 | ...SNP Block 1 of 75 Completed - 7.165 mins 308 | ... 309 | ...SNP Block 75 of 75 Completed - 5.876 mins 310 | Performing Small Sample Correction... 
311 | ``` 312 | 313 | ###### Output: 314 | * `write.to.gds = FALSE` = pcrelate obj (default) 315 | * `write.to.gds = TRUE` = gds file `tmp_pcrelate.gds` 316 | 317 | to read `tmp_pcrelate.gds`: 318 | ```r 319 | library(gdsfmt) 320 | mypcrelate <- openfn.gds("tmp_pcrelate.gds") 321 | ``` 322 | 323 | ###### to parse PCRelate outputs: 324 | 325 | command | description 326 | --------|-------- 327 | `pcrelateReadKinship` | make a table of pairwise relatedness estimates 328 | `pcrelateReadInbreed` | make a table of individual inbreeding coefficients 329 | `pcrelateMakeGRM` | make a genetic relatedness matrix 330 | 331 | option | description 332 | -----|----- 333 | `pcrelObj` | output from pcrelate; either a class pcrelate object or a GDS file 334 | `scan.include` | vector of individual IDs specifying which individuals to include in the table or matrix; default NULL 335 | `kin.thresh` | minimum kinship coefficient value to include in the table 336 | `f.thresh` | minimum inbreeding coefficient value to include in the table 337 | `scaleKin` | factor to multiply the kinship coefficients by in the GRM; default 2 338 | 339 | ```r 340 | # make pairwise relatedness estimates table 341 | relatepairs.tbl <- pcrelateReadKinship(pcrelObj = mypcrelate, kin.thresh = 2^(-9/2)) 342 | 343 | # make inbreeding coefficient table 344 | inbreedcoef.tbl <- pcrelateReadInbreed(pcrelObj = mypcrelate, f.thresh = 2^(-11/2)) 345 | 346 | # make GRM 347 | mygrm <- pcrelateMakeGRM(pcrelObj = mypcrelate, scan.include = iids[1:5], scaleKin = 2) 348 | ``` 349 | 350 | check results 351 | ```r 352 | # grm output 353 | 354 | mygrm[1:5,][,1:5] 355 | 356 | ## CH30380 VA30171 VA30167 CH30357 VA70028 357 | ## CH30380 0.9921236602 0.001587362 0.004589604 0.002636372 0.0008911428 358 | ## VA30171 0.0015873618 1.006070553 0.002143034 0.001831966 -0.0025865272 359 | ## VA30167 0.0045896042 0.002143034 1.002365658 -0.005183818 0.0049334393 360 | ## CH30357 0.0026363721 0.001831966 -0.005183818 1.003906724 
0.0062427112 361 | ## VA70028 0.0008911428 -0.002586527 0.004933439 0.006242711 1.0005800629 362 | 363 | quantile(mygrm) 364 | 365 | ## 0% 25% 50% 75% 100% 366 | ## -3.260058e-02 -3.014012e-03 -5.799979e-05 2.950375e-03 1.169809e+00 367 | ``` 368 | save files 369 | ```r 370 | # for .csv output: add sep=',' and change ".txt" to ".csv" 371 | 372 | write.table(mygrm, paste("GENESIS_Kincoef_matrix_",genotype,date,".txt",sep=""), row.names = F, quote = F) 373 | write.table(relatepairs.tbl.nothresh, paste("GENESIS_relatedpairs_",genotype,date,".txt",sep=""), row.names = F, quote = F) 374 | write.table(inbreedcoef.tbl.nothresh, paste("GENESIS_inbreedcoef_",genotype,date,".txt",sep=""), row.names = F, quote = F) 375 | ``` 376 | 377 | note: to look at just the first 5 rows and first 5 columns of matrix 378 | ```bash 379 | # in shell 380 | cut -d' ' -f 1-5 test.file | head -5 - 381 | ``` 382 | ```r 383 | # in r 384 | test.file[1:5,][,1:5] 385 | ``` 386 | 387 | ## \# QED \# 388 | -------------------------------------------------------------------------------- /GRM_Computation_Methods.md: -------------------------------------------------------------------------------- 1 | # Methods for Computing Genetic Relatedness Matrices (GRMs) 2 | ## Pagé Goddard 3 | 4 | ### Resources 5 | 6 | ###### TOPmed Pipeline 7 | * [github](https://github.com/UW-GAC/analysis_pipeline) 8 | * [slides](https://uw-gac.github.io/topmed_workshop_2017/computing-a-grm.html) 9 | 10 | ###### GCTA 11 | * [GCTA Documentation](http://cnsgenomics.com/software/gcta/#GREML) 12 | * [GCTA Publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/pdf/main.pdf) 13 | 14 | ###### REAP 15 | * [REAP Documentation](http://faculty.washington.edu/tathornt/software/REAP/REAP_Documentation.pdf), 16 | * [REAP Publication](http://www.cell.com/ajhg/fulltext/S0002-9297(12)00309-6) 17 | 18 | ###### GENESIS 19 | * [GENESIS Vignette](https://rdrr.io/bioc/GENESIS/f/vignettes/pcair.Rmd), 20 | * [GENESIS
Publication](https://www.ncbi.nlm.nih.gov/pubmed/26748516), 21 | (also see the TOPmed slides above) 22 | 23 | ### TOPmed Pipeline 24 | The TOPmed analysis pipeline is a great resource for association study design in general, but it is linked here because it includes a recommended approach for GRM computation. For GRM computation with an eye for confounding ancestry, **TOPmed recommends the GENESIS approach**: 25 | 26 | 1. KING 27 | 2. PC-AIR 28 | 3. PC-Relate 29 | 30 | See below for more details 31 | 32 | **NB:** "Section 3 Computing a GRM" calculates a basic Genetic Relationship matrix using the `SNPRelate` package in R but does not take into account ancestry or population structure. For the more robust approach, see "Section 4 PC-Relate." 33 | 34 | ### GCTA 35 | * commandline program 36 | **Importance of GRMs:** Allows for identification of closely related individuals. The objective of downstream `GCTA` analysis is to provide a heritability estimate of the genetic variation captured by all SNPs (vs. GWAS, which estimates variation captured by single SNPs). Including close relatives could bias the results with variance driven by pedigree phenotypic correlations. 37 | 38 | ```bash 39 | # estimate genetic relatedness from SNPs 40 | gcta64 --bfile input.binary --make-grm --out output.files 41 | ``` 42 | 43 | From the publication: *As a by-product, we provide a function in GCTA to **calculate the eigenvectors of the GRM**, which is asymptotically equivalent to those from the PCA implemented in EIGENSTRAT11 because the GRM (Ajk) defined in GCTA is approximately half of the covariance matrix (Jjk) used in EIGENSTRAT. The only purpose of developing this function is to calculate eigenvectors and then include them in the model as covariates to capture variance due to population structure.
More sophisticated analyses of the population structure can be found in programs such as EIGENSTRAT and STRUCTURE.* 44 | 45 | ###### PROs 46 | * super easy to run 47 | - takes plink files 48 | - no extra input required 49 | * fast 50 | 51 | ###### CONs 52 | * does not take into account ancestry or any external population structure proxy 53 | - not a reliable estimator for admixed populations 54 | * GCTA has been [criticized](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987787/pdf/pnas.201608425.pdf) for unreliable estimates (granted, I haven't read into this too much) 55 | * I (Pagé) don't really understand the math... 56 | 57 | ### REAP 58 | * commandline program 59 | * concept: use pre-computed ancestry measures to adjust for population structure 60 | * approach: model-based 61 | - calculate **global ancestry per individual** and **allele frequency per ancestral group** using something like `ADMIXTURE` 62 | - calculate relatedness coefficients after adjusting for ancestry 63 | 64 | ###### PROs 65 | * accounts for population ancestry 66 | * designed for admixed populations 67 | * easy one-liner once you have your admixture output 68 | 69 | ###### CONs 70 | * requires additional inputs calculated externally 71 | - not tough to calculate; see my [ADMIXTURE Tutorial](https://github.com/pcgoddard/Burchardlab_Tutorials/blob/master/ADMIXTURE_Tut.md) 72 | - potential for inaccuracies in admixed pop of unknown/poorly defined ancestries 73 | * model-based methods can be confounded by familial relatedness due to inability to distinguish between ancestral groups and clusters of close relatives ([source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/pdf/main.pdf)) 74 | 75 | ### GENESIS 76 | * R 77 | * concept: training on an unrelated subpopulation and using PCs to correct for population structure 78 | * approach: model-free 79 | - KING-robust: estimate ancestral divergence and apparent relatedness separately 80 | - PC-Air: use PCs to capture the ethnic components of the
population structure and identify the unrelated and related clusters 81 | + ancestral divergence scores used to ensure the unrelated subpop is representative of the full population's ancestry distribution 82 | - PC-Relate: calculate kinship coefficient for unrelated group first, then extrapolate to the related group, using ancestry-representative PCs to correct for pop structure 83 | + unrelated first: prevent confounding by related individuals 84 | + using first n PCs: (at your discretion) to capture pop structure / ethnic diversity 85 | 86 | ###### PROs 87 | * ~~doesn't need external inputs~~ 88 | * good track record 89 | - TOPmed 90 | - favored in comparison studies (after more computationally heavy IBD-inference approaches) 91 | * designed to work well for both admixed and homogeneous populations 92 | 93 | ###### CONs 94 | * PC-Relate took about 8 hours to run (RStudio) 95 | * claims to not require external input, but it kind of does - it can all be done in R so you can totally set up a pipeline for it though 96 | - uses KING-robust estimates (commandline) 97 | - or SNPRelate function (in R) 98 | - requires some reformatting from either output to prep for PC-Air 99 | 100 | --- 101 | 102 | ## Overall winner: GENESIS/TOPmed Pipeline 103 | see [tutorial](https://github.com/pcgoddard/Burchardlab_Tutorials/blob/master/GENESIS_PCRelate_Tut.md) 104 | -------------------------------------------------------------------------------- /Jennifer's R Tutorials.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/Jennifer's R Tutorials.pdf -------------------------------------------------------------------------------- /Markdown_Tut.md: -------------------------------------------------------------------------------- 1 | # Markdown Tutorial 2 | #### Pagé Goddard 3 | #### Sep 1, 2017 4 | --- 5 | ## Contents 6 | * [Sublime
Add-ons](#add-ons) 7 | * [Headers](#headers) 8 | * [Emphasis](#emphasis) 9 | * [Lists](#lists) 10 | - [unordered lists](#unordered) 11 | - [ordered lists](#ordered) 12 | - [task lists / check boxes](#tasks) 13 | * [Links](#links) 14 | - [URLs](#url) 15 | - [images](#images) 16 | - [Table of Contents / Anchor Tags](#anchor) 17 | * [Block Quotes](#quotes) 18 | * [Code Inserts](#code) 19 | - [Inline highlighting](#inline) 20 | - [Block Code](#block) 21 | * [Tables](#tables) 22 | * [Line Breaks](#linebreaks) 23 | * [Using Special Characters](#escape) 24 | * [Emojis :sparkles:](#emojis) 25 | 26 | --- 27 | 28 | ## Sublime add-ons 29 | * MarkdownEditor *# for easier visualization pre-publication* 30 | * OmniMarkupPreviewer *# for previewing fully formatted document* 31 | - ctrl + alt + o 32 | - OR MarkdownBuddy 33 | 34 | 35 | ## Headers 36 | Use \# to indicate header level 37 | # \# Heading1 38 | ## \#\# Sub-heading 39 | ###### \#\#\#\#\#\# level 6 subheading 40 | 41 | 42 | ## Emphasis 43 | *italic:* \*word\* or \_word\_ 44 | 45 | **bold**: \*\*word\*\* or \_\_bold\_\_ 46 | 47 | ***combined:*** \*\*\*word\*\*\* or \_\_\_word\_\_\_ 48 | 49 | strikethrough: word or ~~ word ~~ 50 | 51 | 52 | ## Lists 53 | 54 | ### unordered 55 | * unordered 56 | * lists 57 | - use \* 58 | - and \- 59 | 60 | 61 | ### ordered 62 | 1. ordered 63 | 2. lists 64 | 3. use 65 | 1. numbers 66 | 2. but not letters 67 | * number of tabs determines bullet level 68 | 69 | 70 | ### task list 71 | - [x] - [x] this is a **complete** item 72 | - [ ] - [ ] this is an *incomplete* item 73 | - [ ] - [ ] list autopopulates format 74 | 75 | 76 | ## Links 77 | 78 | ### URLs 79 | example: [wikipedia](https://en.wikipedia.org/wiki/Main_Page) 80 | 81 | `\[Alt Text](url)` 82 | 83 | 84 | ### Images 85 | All linked images must be hosted online. You can link to an image on your local machine but it will not be viewable in the published markdown on other devices. 
When the image is not publishable, the Alt Text input will be shown. 86 | 87 | ![cute puppy](http://www.zarias.com/wp-content/uploads/2015/12/61-cute-puppies.jpg) 88 | 89 | `\!\[Alt Text](url)` 90 | 91 | #### you can resize images using standard HTML 92 | 93 | 94 | 95 | `\Alt Text` 96 | 97 | *note: the alt="" input is optional 98 | 99 | 100 | ### Table of Contents Links 101 | This requires use of **anchor tags** where you want the table of contents to link to. 102 | 103 | `\` 104 | 105 | **\# This is my header!** 106 | 107 | * *note: remove the `\` before the first `>` to activate the anchor tag* 108 | * *personal preference: I like to put the anchor tag **above** the tagged section so that when the link jumps to that section you still see the header* 109 | 110 | You can then link to that line from anywhere in the document using: 111 | 112 | `\[My header\]\(\#anchor_tag\)` 113 | 114 | 115 | ## Block quotes 116 | the following lines will be a quote 117 | 118 | \> it was the best of times 119 | 120 | \> it was the worst of times 121 | 122 | > it was the best of times 123 | > 124 | > it was the worst of times 125 | 126 | 127 | ## Code Blocks 128 | #### Inline Code 129 | 130 | This is your \`code\` to highlight 131 | 132 | This is your `code` to highlight 133 | 134 | 135 | #### Fenced Code Blocks 136 | 137 | \```javascript 138 | 139 | function test() { 140 | 141 | console.log("hello world"); 142 | 143 | \``` 144 | 145 | ```javascript 146 | function test() { 147 | console.log("hello world"); 148 | } 149 | ``` 150 | 151 | to put any text in a code box, just indent it once 152 | 153 | any text 154 | 155 | 156 | ## Tables 157 | * tables use | and - to indicate field divisions 158 | 159 | column 1 | column 2 160 | 161 | \---------|--------- *# this line is necessary to indicate table format* 162 | 163 | cell 1.1 | cell 2.1 164 | 165 | cell 1.2 | cell 2.2 166 | 167 | column 1 | column 2 168 | ---------|--------- 169 | cell 1.1 | cell 2.1 170 | cell 1.2 | cell 2.2 171 | 172 | 
173 | ## Line breaks 174 | * 3 or more \* \- or \_ 175 | --- 176 | 177 | 178 | ## Escaping characters 179 | removes special syntax meaning: 180 | *italic* v. \*asterisks\* 181 | 182 | `\character` can be used on the following: 183 | 184 | \\ \` \* \# \_ \- \+ \. 185 | \{ \} \[ \] \( \) \! 186 | 187 | 188 | ## Emojis (github) 189 | 190 | (just remove the spaces) 191 | 192 | :+1: = \: \+ 1 \: 193 | :sparkles: = \: sparkles \: 194 | :octocat: = \: octocat \: 195 | -------------------------------------------------------------------------------- /PLINK_QC.md: -------------------------------------------------------------------------------- 1 | # Plink QC Pipeline 2 | ### Pagé Goddard 3 | ##### Oct 6 2017 4 | --- 5 | ## Content 6 | * [Resources](#resources) 7 | * [Basics of Quality Control](#intro) 8 | * [Example PLINK Command](#example) 9 | * [(Some) PLINK QC Commands](#commands) 10 | * [How to read the Log file](#logs) 11 | * [Combine genotype chr files](#concatenate) 12 | * [Sample Pipeline](#pipeline) 13 | - [Variables](#vars) 14 | - [Update sex](#sex) 15 | - [Remove unwanted samples](#rmv) 16 | - [Filter SNPs with low genotyping efficiency](#geno) 17 | - [Filter samples with low genotyping efficiency](#mind) 18 | - [Filter out too closely related individuals](#cryptic) 19 | - [Filter rare SNPs](#maf) 20 | - [Filter SNPs out of HWE](#hwe) 21 | - [Option: Subset data](#subset) 22 | - [Get QC counts for each QC step](#stats) 23 | 24 | 25 | --- 26 | 27 | ## Resources 28 | [PLINK summary statistic commands](http://zzz.bwh.harvard.edu/plink/summary.shtml) 29 | 30 | [COG-Genomics PLINK command list](https://www.cog-genomics.org/plink/1.9/data) 31 | 32 | [Tufts QC Vignette](http://sites.tufts.edu/cbi/files/2013/01/GWAS_Exercise5_QC.pdf) 33 | 34 | [PMC3066182](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3066182/) Turner, Stephen et al. “Quality Control Procedures for Genome Wide Association Studies.” Current protocols in human genetics / editorial board, Jonathan L.
Haines ... [et al.] CHAPTER (2011): Unit1.19. PMC. Web. 6 Oct. 2017. 35 | 36 | 37 | ## Basics of Quality Control (QC) 38 | *why we care:* With GWAS, hundreds of thousands of genotypes are generated so even a small percentage of genotyping error can lead to spurious GWAS results. This tutorial focuses on **downstream QC** (i.e. data cleaning after you have the genotype calls). 39 | 40 | ##### Two parts of downstream QC 41 | 42 | 1. subject / sample-based 43 | 2. variant / snp-based 44 | 45 | **Sample QC Concerns** 46 | 47 | term | desc | typical threshold 48 | -----|------|------ 49 | sample-specific missingness rate | proportion of missing genotypes per individual | 50 | gender discordance | check that self-reported gender matches genotyped gender | 51 | cryptic relatedness | undisclosed familial relationships; duplicate enrollment | 52 | replicate discordance | agreement with independent genotyping | 53 | population outliers | subjects with significantly different genetic background | 54 | 55 | **SNP QC Concerns** 56 | 57 | term | desc | typical threshold 58 | -----|------|------ 59 | SNP-specific missingness rate | proportion of failed genotype assays per variant | 60 | minor allele frequency | low freq alleles more likely to represent genotyping error | 0.05 61 | replicate discordance | agreement with independent genotyping | 62 | Hardy-Weinberg equilibrium | | 63 | Mendelian errors | in family data evidence of non-Mendelian transmission | N/A 64 | 65 | 66 | ## Example Plink command 67 | ```bash 68 | 69 | plink --noweb --bfile inputfile --remove removeme.txt --make-bed --out outputfile 70 | 71 | # basic command if all genotype data is in one file 72 | ``` 73 | * `plink` - calls plink in shell environment 74 | * `--noweb` - usually not necessary; tells PLINK not to connect to the internet (which makes it slow); PLINK 1.9 accepts the flag but reports in its log that it has no effect 75 | * `--bfile` - tells plink to read your binary files `inputfile.bim`, `inputfile.fam`, `inputfile.bed` 76 | * `--remove` - tells plink to remove all
IDs listed in the named file 77 | * `--make-bed` - write the outfile in our favorite `.bed`, `.bim`, `.fam` form 78 | * `--out` - write my outfile with the following prefix 79 | 80 | ```bash 81 | for ((i=1;i<=22;i++)) 82 | do 83 | plink --noweb --bfile inputfile_chr$i --remove removeme.txt --make-bed --out outputfile_chr$i 84 | done 85 | 86 | # if geno data is split by chromosome 87 | ``` 88 | * `for ((i=1;i<=22;i++))` - "for every value of `i` between 1 and 22, run the following script;" in the plink script, we then tell bash that the `i` refers to chromosome number in the file name using `filename_chr$i`; this is useful if your genotype data is divided by chromosome 89 | * `--bfile` - read from binary files with the prefix `inputfile_chr$i` where `$i` is the chromosome number from your `for` loop; your bfiles are your `.bim`, `.bed`, and `.fam` files, where `.bim` and `.fam` are the files containing IDs for the SNPs and individuals, respectively, and `.bed` is the genotype file 90 | * `--remove` - there are a ton of different commands to use here, depending on what you want to do (see resources). A typical pipeline example can be found below.
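
Before launching 22 PLINK jobs, it can help to print the commands the loop would run. The following is a minimal dry-run sketch, not part of the original pipeline: `build_plink_cmd` is a hypothetical helper (not a PLINK feature), the file prefixes are the placeholder names used above, and `seq` replaces the `for ((...))` form only so the sketch also runs under plain `sh`.

```bash
# Dry run: print the PLINK command for each chromosome instead of executing it.
# build_plink_cmd is a hypothetical helper for illustration only.
build_plink_cmd() {
    # $1 is the chromosome number that gets substituted into both file prefixes
    echo "plink --noweb --bfile inputfile_chr$1 --remove removeme.txt --make-bed --out outputfile_chr$1"
}

# Same range as the loop above: autosomes 1 through 22
for i in $(seq 1 22)
do
    build_plink_cmd "$i"
done
```

Once the printed commands look right, the `echo` can be dropped so the loop calls `plink` directly, as in the example above.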
91 | 92 | 93 | ## Some Plink command options for QC 94 | These options replace the `--remove filename` option 95 | 96 | option | description | notes 97 | -------|-------------|- 98 | `--update-sex filename` | updates sex label for individuals in filename | expects file with `FID` `IID` `sex (1,2)`, no header 99 | `--remove filename` | removes samples in list | expects file with `FID` `IID`, no header 100 | `--geno 0.05` | filters out SNPs with a genotyping rate below 95% | to get a file listing the snps that remain after filtering, include `--write-snplist` 101 | `--mind 0.05` | excludes individuals with genotyping rates below 95% | to get the list of the missingness rates include `--missing` 102 | `--hwe 0.0001` | filters out SNPs with HWE exact test p-value below threshold | recommend setting a low threshold as major deviation (p-val of e-50, eg) is likely genotyping error while true SNP-trait association shows slight deviation 103 | `--maf` | filters out SNPs with minor allele freq below threshold | default 0.01 104 | `--genome --min 0.025` | identity by descent calculation to estimate pairwise relatedness (PI_HAT); `--min 0.025` limits the `.genome` report to pairs with PI_HAT of at least 0.025 and does not itself remove any individuals | These calculations are not LD-sensitive. It is usually a good idea to perform some form of LD-based pruning before invoking them.
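
To make the `--geno 0.05` cutoff in the table concrete, the sketch below applies the same threshold by hand to a small missingness report. It assumes the column layout of the `.lmiss` file that `--missing` writes in PLINK 1.9 (`CHR SNP N_MISS N_GENO F_MISS`); the file name `mock.lmiss`, the SNP IDs, and the counts are made up for illustration.

```bash
# Mock variant-missingness report; real reports come from `plink --missing`.
# All values below are invented for illustration.
cat > mock.lmiss <<'EOF'
 CHR   SNP   N_MISS   N_GENO   F_MISS
   1   rs1        2      100     0.02
   1   rs2        8      100     0.08
   1   rs3        5      100     0.05
EOF

# List variants whose missing rate (column 5) exceeds 0.05, skipping the header.
# A variant sitting exactly at 0.05 is not flagged.
flagged=$(awk 'NR > 1 && $5 > 0.05 { print $2 }' mock.lmiss)
echo "$flagged"   # prints rs2
```

The same pattern applied to the sample-level report (`.imiss`) would give a preview of what `--mind` removes, although the column positions differ there and should be checked against the file header first.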
105 | 106 | 107 | 108 | ## Log files 109 | * every plink command has an automatic log file output: `outfileprefix.log` 110 | * **number of SNPs/Samples loaded** and **number that passed QC** 111 | * errors, warnings if applicable 112 | 113 | *sample log file:* 114 | ```bash 115 | PLINK v1.90b3.29 64-bit (24 Dec 2015) 116 | Options in effect: 117 | --bfile sage_imputedgeno_gwas_171004_no21plus_chr1 118 | --maf 0.05 119 | --make-bed 120 | --noweb 121 | --out sage_imputedgeno_gwas_171004_maf05_finalQC_chr1 122 | 123 | Hostname: burchardlab.ucsf.edu 124 | Working directory: /media/BurchardRaid01/LabShare/Home/pgoddard/telo_wd_171004 125 | Start time: Wed Oct 4 15:01:38 2017 126 | 127 | Note: --noweb has no effect since no web check is implemented yet. 128 | Random number seed: 1507154498 129 | 257655 MB RAM detected; reserving 128827 MB for main workspace. 130 | 1967966 variants loaded from .bim file. # SNP input 131 | 1715 people (822 males, 890 females, 3 ambiguous) loaded from .fam. # Sample input 132 | Ambiguous sex IDs written to 133 | sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.nosex . 134 | Using 1 thread (no multithreaded calculations invoked. 135 | Before main variant filters, 1715 founders and 0 nonfounders present. 136 | Calculating allele frequencies... done. 137 | 1387541 variants removed due to minor allele threshold(s) # action taken 138 | (--maf/--max-maf/--mac/--max-mac). 139 | 580425 variants and 1715 people pass filters and QC. # SNP/Sample output 140 | Note: No phenotypes present. 141 | --make-bed to sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.bed + 142 | sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.bim + 143 | sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.fam ... done. 
144 | 145 | End time: Wed Oct 4 15:01:40 2017 146 | ``` 147 | 148 | 149 | # Example Run 150 | 151 | ##### If you want to merge all chromosomes into one file 152 | ```bash 153 | ### Concatenate genotype data 154 | 155 | plink --bfile $datadir/$sagedata --chr 1-22 --make-bed --out SAGEbase_081517 156 | 157 | # the next step would then look like: 158 | 159 | ###1. Updating Sex in .fam file 160 | plink --noweb --bfile SAGEbase_081517 --update-sex $sexupdated --make-bed --out ${out}_sex_chr 161 | # no for loop 162 | # no $i as we no longer need to call each chr individually 163 | # this step makes your QC run slower but your directory will look cleaner 164 | ``` 165 | 166 | ## Pipeline 167 | This pipeline will walk you through the steps used to QC the imputed SAGE2 genotype data for the Telomere project. We did not merge chromosomes. 168 | 169 | * **note:** depending on your analysis, you can choose to keep or remove any subset of individuals; removing saliva samples is always recommended, as it is preferred to work with genotypes obtained from whole blood. To do this, you need an external tab-delimited, 2-field file with the respective FID IID info for each individual you wish to remove. 170 | 171 | 172 | **set variables** 173 | 174 | * in bash, we can define variables in the environment as `var='value'` to call on later using `$var` 175 | * **useful shorthand**, especially when working with cumbersome file paths or bash scripting (which will be a different tutorial); think of it as **nicknaming your data** 176 | * I prefer this approach because I **know where everything is** coming from and going to and what it is called **without having to parse** my script; it is also **easier to adapt** the script to new data because you only need to update the variable rather than every line of the script.
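
One detail worth seeing in isolation is why the pipeline writes `${out}_sex_chr$i` with braces. The sketch below uses the same placeholder value for `date` as the setup block and an arbitrary chromosome number; it assumes the shell is not running with `set -u`, under which the brace-less line would raise an error instead of expanding to nothing.

```bash
# Variables as nicknames; the values are placeholders.
date=171004
out="mygwasdat_${date}"
i=3

# Make sure no variable named out_sex_chr exists, so the pitfall is reproducible.
unset out_sex_chr

# Braces mark where the variable name ends.
with_braces="${out}_sex_chr${i}"

# Without braces, bash reads the name as "out_sex_chr", which is unset and expands to nothing.
without_braces="$out_sex_chr$i"

echo "$with_braces"      # prints mygwasdat_171004_sex_chr3
echo "$without_braces"   # prints 3
```

This is why every output prefix in the steps that follow is written as `${out}_...` rather than `$out_...`.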
177 | 178 | ```bash 179 | ## Set up environment 180 | 181 | # values 182 | date=171004 183 | 184 | # directories 185 | wrkdir="$HOME/wrkdir" #choose your working directory 186 | datadir="path/to/genodata" #identify where your genotype data are 187 | 188 | # reference files 189 | sagedata="genodata" #to run it will be ${sagedata}${i} # i will be the chr number 1-22 190 | sexupdated="path/to/file" # list of FID IID and updated sex # optional 191 | removeme_saliva="path/to/file" # list of FID IID for individuals with geno data from saliva rather than blood # optional 192 | removeme_old="path/to/file" # list of individuals outside of age range # optional 193 | 194 | # output files 195 | out="mygwasdat_${date}" 196 | 197 | # move into working directory 198 | cd "$wrkdir" 199 | ``` 200 | 201 | *note: my genotype data is split across chromosomes, so we are looping PLINK over all 22 autosomes; to get the total number of variants at each QC step we will sum across all the files at the end. Alternatively, you can run a plink command to concatenate the genotype data into one file, but the QC process is much faster when working with smaller files* 202 | 203 | 204 | ##### 1. Make sure sex is up-to-date 205 | ```bash 206 | ###1. Updating Sex in .fam file 207 | 208 | for ((i=1;i<=22;i++)) 209 | do plink --noweb --bfile $datadir/$sagedata$i --update-sex $sexupdated --make-bed --out ${out}_sex_chr$i 210 | done 211 | 212 | # prints log to screen; look for these lines: 213 | # --update-sex: 1987 people updated, 130 IDs not present. 214 | # 332237 variants and 1990 people pass filters and QC. 215 | 216 | # the log is also written to the ${out}_sex_chr$i.log files so don't worry about losing the information on the screen 217 | ``` 218 | 219 | ##### 2.
Remove saliva samples 220 | ```bash 221 | for ((i=1;i<=22;i++)) 222 | do plink --noweb --bfile ${out}_sex_chr$i --remove $removeme_saliva --make-bed --out ${out}_nosaliva_chr$i 223 | done 224 | 225 | # 332237 variants and 1954 people pass filters and QC. 226 | ``` 227 | 228 | ##### 3: Remove individuals over 21 229 | ```bash 230 | for ((i=1;i<=22;i++)) 231 | do plink --noweb --bfile ${out}_nosaliva_chr$i --remove $removeme_old --make-bed --out ${out}_no21plus_chr$i 232 | done 233 | 234 | # 332237 variants and 1715 people pass filters and QC. 235 | ``` 236 | 237 | 238 | ##### 4: Filtered out SNPs with genotyping efficiency below 95% 239 | ```bash 240 | for ((i=1;i<=22;i++)) 241 | do plink --noweb --bfile ${out}_no21plus_chr$i --geno 0.05 --make-bed --out ${out}_geno05_chr$i 242 | done 243 | 244 | # 0 variants removed due to missing genotype data (--geno) 245 | # 332237 variants and 1715 people pass filters and QC. 246 | ``` 247 | 248 | 249 | ##### 5: Filtered out individuals with genotyping efficiency below 95% 250 | ```bash 251 | for ((i=1;i<=22;i++)) 252 | do plink --noweb --bfile ${out}_geno05_chr$i --mind 0.05 --make-bed --out ${out}_mind05_chr$i 253 | done 254 | 255 | # 0 people removed due to missing genotype data (--mind) 256 | # 332237 variants and 1715 people pass filters and QC. 257 | ``` 258 | 259 | 260 | ##### 6: Screen for Cryptic relatedness 261 | ```bash 262 | for ((i=1;i<=22;i++)) 263 | do plink --noweb --bfile ${out}_mind05_chr$i --genome --min 0.025 --make-bed --out ${out}_decrypted_chr$i 264 | done 265 | 266 | # 1954 people (906 males, 1044 females, 4 ambiguous) loaded from .fam. 267 | # 1967966 variants and 1954 people pass filters and QC. 
268 | ``` 269 | 270 | 271 | ##### 7: Remove SNPs with MAF < 0.05 272 | ```bash 273 | for ((i=1;i<=22;i++)) 274 | do plink --noweb --bfile ${out}_decrypted_chr$i --maf 0.05 --make-bed --out ${out}_maf05_chr$i 275 | done 276 | 277 | # 230023 variants removed due to minor allele threshold(s) 278 | # 580425 variants and 1715 people pass filters and QC. 279 | ``` 280 | 281 | 282 | ##### 8: Filtered SNPs that fail a HWE cutoff p<0.0001 283 | ```bash 284 | for ((i=1;i<=22;i++)) 285 | do plink --noweb --bfile ${out}_maf05_chr$i --hwe 0.0001 --make-bed --out ${out}_hwe0001_chr$i 286 | done 287 | 288 | # 195 variants removed due to Hardy-Weinberg exact test. 289 | ``` 290 | 291 | 292 | ##### clean up space 293 | ```bash 294 | # make a separate directory for each QC stage and move the appropriate files into each 295 | # be sure to leave the last set of QC files in your working directory for easy access 296 | 297 | mkdir QC1_update_sex && mv ${out}_sex_chr* QC1_update_sex 298 | mkdir QC2_remove_saliva && mv ${out}_nosaliva_chr* QC2_remove_saliva 299 | mkdir QC3_remove_over21 && mv ${out}_no21plus_chr* QC3_remove_over21 300 | mkdir QC4_genotype_efficiency && mv ${out}_geno05_chr* QC4_genotype_efficiency 301 | mkdir QC5_individ_efficiency && mv ${out}_mind05_chr* QC5_individ_efficiency 302 | mkdir QC6_cryptic_relatedness && mv ${out}_decrypted_chr* QC6_cryptic_relatedness 303 | mkdir QC7_maf_05 && mv ${out}_maf05_chr* QC7_maf_05 304 | 305 | ls -1 $wrkdir 306 | # QC1_update_sex 307 | # QC2_remove_saliva 308 | # QC3_remove_over21 309 | # QC4_genotype_efficiency 310 | # QC5_individ_efficiency 311 | # QC6_cryptic_relatedness 312 | # QC7_maf_05 313 | ``` 314 | 315 | 316 | ##### 9: Subset into two populations: male controls, female controls 317 | 318 | ```r 319 | #R work: Select subjects to keep for analysis (locally in R) 320 | 321 | setwd("/media/BurchardRaid01/LabShare/Home/pgoddard/telo_wd_171004") 322 | 323 | # partition by sex and case/control status 324 | pheno <-
read.csv("/media/BurchardRaid01/LabShare/Home/azeiger/Telomere/SAGE/sage2_clean2016_02_23_de_ident.csv", header=T) 325 | telodat <- read.csv("raw_data/tel_res_MQ_SAGE2_09252017_FINAL.csv", header=T) 326 | telodat2 <- merge(pheno[,c("SubjectID", "Male")], telodat, by.x="SubjectID", by.y="SampleID") 327 | 328 | fm_cntl <- telodat2[telodat2$Male=="Female" & telodat2$Status=="control",] 329 | m_cntl <- telodat2[telodat2$Male=="Male" & telodat2$Status=="control",] 330 | 331 | length(telodat$SampleID) # 1540 332 | length(fm_cntl$SubjectID) # 336 333 | length(m_cntl$SubjectID) # 260 334 | 335 | # keep individuals with no missing covariates 336 | # covariate file generated with covar script: /Dropbox/Telomeres/script_telo_covars_171005.R 337 | covars <- read.table("tel_phenocovars_allSage2_100417.txt", header=T) 338 | nomisscovar <- na.omit(covars) 339 | 340 | fm_cntl_clean <- fm_cntl[fm_cntl$SubjectID %in% nomisscovar$FID,] 341 | m_cntl_clean <- m_cntl[m_cntl$SubjectID %in% nomisscovar$FID,] 342 | 343 | length(fm_cntl_clean$SubjectID) # 180 344 | length(m_cntl_clean$SubjectID) # 139 345 | 346 | 347 | # create subset ID lists for PLINK 348 | keep_female <- data.frame(fm_cntl_clean$SubjectID, fm_cntl_clean$SubjectID) 349 | keep_male <- data.frame(m_cntl_clean$SubjectID, m_cntl_clean$SubjectID) 350 | 351 | head(keep_female) 352 | # BP70001 BP70001 353 | # BP70002 BP70002 354 | # BP70004 BP70004 355 | # BP70005 BP70005 356 | # BP70006 BP70006 357 | 358 | # save subset ID list as txt file to wd 359 | write.table(keep_female, "keep_female_171006.txt", row.names=F, col.names=F, quote=FALSE, sep=" ") 360 | write.table(keep_male, "keep_male_171006.txt", row.names=F, col.names=F, quote=FALSE, sep=" ") 361 | ``` 362 | 363 | ```bash 364 | # PLINK work 365 | 366 | ## Plink command to extract female controls (n=180): 367 | for ((i=1;i<=22;i++)) 368 | do plink --bfile ${out}_maf05_finalQC_chr$i --keep keep_female_${date}.txt --make-bed --out ${out}_female_controls_chr$i 369 | done 370 | 
371 | # check 372 | wc -l ${out}_female_controls_chr22.fam 373 | # 165 374 | # 15 female controls with all covariates removed by QC 375 | 376 | ## Plink command to extract male controls (n=139): 377 | for ((i=1;i<=22;i++)) 378 | do plink --noweb --bfile ${out}_maf05_finalQC_chr$i --keep keep_male_${date}.txt --make-bed --out ${out}_male_controls_chr$i 379 | done 380 | 381 | # check 382 | wc -l ${out}_male_controls_chr22.fam 383 | # 130 384 | # 9 male controls with all covariates removed by QC 385 | ``` 386 | 387 | ##### clean up space 388 | ```bash 389 | # finish clean up 390 | mkdir QC8_maf_05 && mv ${out}_maf05_finalQC_chr* QC8_maf_05 391 | mkdir QC8_split_fm_cntl && mv ${out}_female_controls_chr* QC8_split_fm_cntl 392 | mkdir QC8_split_m_cntl && mv ${out}_male_controls_chr* QC8_split_m_cntl 393 | ``` 394 | 395 | 396 | ##### Final QC Stats 397 | 398 | ```bash 399 | # Log summary 400 | # --update-sex: 1987 people updated, 130 IDs not present. 401 | # 332237 variants and 1990 people pass filters and QC. 402 | # Removed 36 people with saliva data 403 | # 0 variants removed due to missing genotype data (--geno) 404 | # 0 people removed due to missing genotype data (--mind) 405 | # 195 variants removed due to Hardy-Weinberg exact test.
406 | # 0 people removed due to cryptic relatedness 407 | # Removed 239 people over 21 408 | # 230023 variants removed due to minor allele threshold(s) 409 | # extracted controls and subset by sex 410 | 411 | # After plink QC: 412 | # 102019 variants and 1715 people pass filters and QC 413 | 414 | # After downscaling to samples with telomere data: 415 | # total: 545 416 | # fm: 307 417 | # m: 238 418 | ``` 419 | 420 | ###### Get summary counts per step 421 | ```bash 422 | # sum line count for each chromosome .bim file 423 | # if data is not split by chr, you can do wc -l path/to/finalQCfile.bim 424 | cd $wrkdir/QC_171004/ 425 | wc -l QC1_update_sex/${out}_sex_chr*.bim 426 | wc -l QC2_remove_saliva/${out}_nosaliva_chr*.bim 427 | wc -l QC3_remove_over21/${out}_no21plus_chr*.bim 428 | wc -l QC4_genotype_efficiency/${out}_geno05_chr*.bim 429 | wc -l QC5_individ_efficiency/${out}_mind05_chr*.bim 430 | wc -l QC6_cryptic_relatedness/${out}_decrypted_chr*.bim 431 | wc -l QC7_maf_05/${out}_maf05_chr*.bim 432 | wc -l QC8_hwe/${out}_hwe0001_chr*.bim 433 | ``` 434 | 435 | ###### SNPs 436 | 437 | command | removed | remaining 438 | ----------|--------------|---------- 439 | **initial** | **-** | **25236191** 440 | geno | 0 | 25236191 441 | MAF | 17711861 | 7524330 442 | HWE | 5154 | 7519176 443 | **total** | **17717015** | **7519176** 444 | 445 | 446 | ###### Individuals 447 | 448 | command | removed | remaining 449 | ----------|---------|---------- 450 | **initial** | **-** | **1990** 451 | saliva | 36 | 1954 452 | over 21 | 239 | 1715 453 | mind | 0 | 1715 454 | cryptic | 0 | 1715 455 | cntl only | 1170 | 545 456 | all covar | 250 | 295 457 | 458 | * total: 295 459 | * female: 165 460 | * male: 130 461 | 462 | \**`cntl only` = only individuals without asthma; `all covar` = individuals with data for all relevant covariates* 463 | 464 | ## \#QED# 465 | -------------------------------------------------------------------------------- /README.md:
-------------------------------------------------------------------------------- 1 | # Burchardlab Tutorials 2 | ## Pagé Goddard 3 | 4 | I have set up this repository as a catch-all for the notes I make as I learn new programs and design pipelines. My tutorials focus on code I have used successfully, with notes about what needed troubleshooting. I have attempted to make these walkthroughs clear and accurate, with helpful details not included in the available vignettes. 5 | 6 | - **ADMIXTURE** - used to estimate global genetic ancestry; walkthrough of a supervised admixture run on African Americans 7 | - **BASH Commands** - navigating the server; basic bash commands for navigation, finding and viewing data; also detaching screens for background processes 8 | - **GENESIS_PCRelate** - used to estimate relatedness measurements in a sample; genetic relatedness matrix generation 9 | - **GRM Methods Comparison** - evaluates the pros and cons of 3 common GRM computation programs (GENESIS, REAP, GCTA) and concludes that GENESIS is the most robust for admixed populations 10 | - **Markdown** - quick summary of the markdown syntax used to create these files 11 | - **PLINK_QC** - intro to running quality control for genotype array data in PLINK to prepare for GWAS; the example shown is the process used for the telomere project 12 | - **Useful_databases** - a dynamic, curated list of bioinformatic databases by data type / research question 13 | - **ggplot_manhattan** - function for creating a beautiful and customizable Manhattan plot using ggplot2 14 | 15 | See the Wiki tab for more details and tutorials.
16 | -------------------------------------------------------------------------------- /Subsetting_tutorial.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/Subsetting_tutorial.pdf -------------------------------------------------------------------------------- /Useful_databases.md: -------------------------------------------------------------------------------- 1 | # Useful Bioinformatic Databases 2 | ## Pagé Goddard 3 | ### Oct 7, 2017 4 | 5 | This is a dynamic document where I will be adding and annotating databases I come across with their data types, possible uses, and current limitations. If you have any databases that you use regularly or have found useful in the past, please send them my way and I'll add them here! 6 | 7 | ### Overview of all databases included: 8 | 9 | 10 | Database | data type | uses | limitations 11 | ---------|-----------|------|------------ 12 | UCSC Table Browser | curated genomic annotations from multiple sources | find functional and sequence information by genomic location | 13 | HPO Human Phenotype Ontology | Phenotypic information with associated genes (mined from OMIM, Orphanet, DECIPHER) | finding related phenotypes and genes given phenotypes 14 | OMIM Online Mendelian Inheritance in Man | phenotype annotations and gene annotations | Mendelian phenotypes with known/suspected genetic cause and disease genes with related phenotypes; includes summaries, references and hypotheses for gene function if unconfirmed | only contains monogenic diseases and related genes (but is super powerful for those) 15 | [MGI](http://www.informatics.jax.org/) | Phenotype and disease ontologies 16 | 17 | ### Annotating SNPs 18 | So you ran your GWAS and got some SNPs. Now we want to gauge their biological and clinical potential.
19 | 20 | * snpinfo 21 | * dbSNP 22 | * Ensembl VEP 23 | * UCSC Genome Browser rsid annotator 24 | 25 | ### Annotating Genes 26 | Let's say you run a GWAS and identify a set of genes from your SNPs that you now want to investigate. 27 | 28 | *Gene Function* 29 | 30 | * UniProt 31 | * GeneCards 32 | * NCBI Gene search 33 | 34 | *Gene-phenotype links* 35 | 36 | * MGI (http://www.informatics.jax.org/) 37 | * RNAi database (http://www.genomernai.org/) 38 | * OMIM 39 | -------------------------------------------------------------------------------- /ggplot_manhattan.r: -------------------------------------------------------------------------------- 1 | # This function builds on code shared by the R Graph Gallery and Getting Genetics Done (see wiki for sources) to produce a customizable Manhattan plot using ggplot2. 2 | 3 | # Libraries ==== 4 | library(readr) 5 | library(ggrepel) 6 | library(ggplot2) 7 | library(dplyr) 8 | library(RColorBrewer) 9 | 10 | # Variables ==== 11 | mypalette <- c("#E2709A", "#CB4577", "#BD215B", "#970F42", "#75002B") # chr color palette 12 | mysnps <- c("rs11801961","rs116558464","rs61703161") # snps to highlight 13 | sig = 5e-8 # genome-wide significance threshold line 14 | sugg = 1e-6 # suggestive threshold line 15 | 16 | # Core Function ==== 17 | gg.manhattan <- function(df, threshold, hlight, col, ylims, title){ 18 | # format df 19 | df.tmp <- df %>% 20 | 21 | # Compute chromosome size 22 | group_by(CHR) %>% 23 | summarise(chr_len=max(BP)) %>% 24 | 25 | # Calculate cumulative position of each chromosome 26 | mutate(tot=cumsum(chr_len)-chr_len) %>% 27 | select(-chr_len) %>% 28 | 29 | # Add this info to the initial dataset 30 | left_join(df, ., by=c("CHR"="CHR")) %>% 31 | 32 | # Add a cumulative position of each SNP 33 | arrange(CHR, BP) %>% 34 | mutate( BPcum=BP+tot) %>% 35 | 36 | # Add highlight and annotation information 37 | mutate( is_highlight=ifelse(SNP %in% hlight, "yes", "no")) %>% 38 | mutate( is_annotate=ifelse(P < threshold, "yes", "no")) 39
| 40 | # get chromosome center positions for x-axis 41 | axisdf <- df.tmp %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 ) 42 | 43 | ggplot(df.tmp, aes(x=BPcum, y=-log10(P))) + 44 | # Show all points 45 | geom_point(aes(color=as.factor(CHR)), alpha=0.8, size=2) + 46 | scale_color_manual(values = rep(col, 22 )) + 47 | 48 | # custom X axis: 49 | scale_x_continuous( labels = axisdf$CHR, breaks= axisdf$center ) + 50 | scale_y_continuous(expand = c(0, 0), limits = ylims) + # expand=c(0,0) removes space between plot area and x axis 51 | 52 | # add plot and axis titles 53 | ggtitle(paste0(title)) + 54 | labs(x = "Chromosome") + 55 | 56 | # add genome-wide sig and sugg lines (sig and sugg are the globals defined under Variables) 57 | geom_hline(yintercept = -log10(sig)) + 58 | geom_hline(yintercept = -log10(sugg), linetype="dashed") + 59 | 60 | # Add highlighted points 61 | #geom_point(data=subset(df.tmp, is_highlight=="yes"), color="orange", size=2) + 62 | 63 | # Add label using ggrepel to avoid overlapping 64 | geom_label_repel(data=df.tmp[df.tmp$is_annotate=="yes",], aes(label=as.factor(SNP)), alpha=0.7, size=5, force=1.3) + 65 | 66 | # Customize the theme: 67 | theme_bw(base_size = 22) + 68 | theme( 69 | plot.title = element_text(hjust = 0.5), 70 | legend.position="none", 71 | panel.border = element_blank(), 72 | panel.grid.major.x = element_blank(), 73 | panel.grid.minor.x = element_blank() 74 | ) 75 | } 76 | 77 | # Run Function ==== 78 | gg.manhattan(df, threshold=1e-6, hlight=mysnps, col=mypalette, ylims=c(0,10), title="My Manhattan Plot") 79 | -------------------------------------------------------------------------------- /prefev1_loco_180322.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/prefev1_loco_180322.png -------------------------------------------------------------------------------- /prefev1_loco_qqman_rsid_180322.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/prefev1_loco_qqman_rsid_180322.png -------------------------------------------------------------------------------- /prefev1_loco_rsid_180322.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/prefev1_loco_rsid_180322.png --------------------------------------------------------------------------------