├── ADMIXTURE_Tut.md
├── BASH Commands to Navigate the server.md
├── CITATION.cff
├── GENESIS_PCRelate_Tut.md
├── GRM_Computation_Methods.md
├── Jennifer's R Tutorials.pdf
├── Markdown_Tut.md
├── PLINK_QC.md
├── README.md
├── Subsetting_tutorial.pdf
├── Useful_databases.md
├── ggplot_manhattan.r
├── prefev1_loco_180322.png
├── prefev1_loco_qqman_rsid_180322.png
└── prefev1_loco_rsid_180322.png
/ADMIXTURE_Tut.md:
--------------------------------------------------------------------------------
1 | # ADMIXTURE Tutorial
2 | #### Pagé Goddard
3 | #### Sep 28, 2017
4 | ___
5 | ## Content
6 | * [ADMIXTURE Overview](#summary)
7 | * [Set up: Choose your file locations](#files)
8 | * [Pipeline](#format)
9 | - [Formatting genotype data](#format)
10 | - [Build popfile](#popfile)
11 | - [Run Admixture](#admix)
12 | * [Parsing Results](#results)
13 | - [ADMIXTURE Output](#output)
14 | - [Reformat results](#reformat)
15 | - [Plotting Results](#plot)
16 | * [Bonus: Thinning Files](#thin)
17 | ___
18 |
19 | Tutorial for calculating global ancestry proportions from **known ancestral populations** using ADMIXTURE.
20 |
21 | Sample data: SAGE 2
22 |
23 | [ADMIXTURE manual](https://www.genetics.ucla.edu/software/admixture/admixture-manual.pdf)
24 |
25 |
26 | ### ADMIXTURE Summary
27 | Input | Output
28 | -----------|--------
29 | test.bed \(*plink*\) | test.k.Q \(*ancestry fractions*\)
30 | test.bim \(*plink*\) | test.k.P \(*allele freq of inferred pop*\)
31 | test.fam \(*plink*\)
32 | test.pop \(*generate in R*\)
33 |
34 | * test includes cohort of interest AND reference populations
35 | * test binary files are 1/2 encoded
36 | * k = number of reference populations
37 |
38 |
39 | ### File Locations
40 | *script variables and directories*
41 | ```bash
42 | date="171231" # update current yymmdd
43 | wkdir="$HOME/wkdir" # choose your working directory
44 | ancestraldir="path/to/your/ancestry.reference.panels.directory"
45 | datadir="path/to/your/genotype.data.dir"
46 |
47 | # data files
48 | sagegenofile="$datadir/genotype.data" # I've set it up this way so that you can make a second object (eg: galagenofile) for another dataset
49 | ancestralgenofile="$ancestraldir/ref.panels" # reference panel genotypes
50 |
51 | # all your data should be in PLINK binary format (.bed .bim .fam)
52 | ```
53 |
54 | ## PLINK: format data
55 | ADMIXTURE requires a single input dataset that contains both your population of unknown ancestry and the reference populations of known ancestry.
56 | If you do not already have a combined cohort + reference panel file, merge the two in PLINK.
57 | **NB**: plink will merge the files in alphabetical order by FID, so you may have your reference panels merged into the middle of the file rather than appended.
58 | ```bash
59 | # PLINK variables
60 | merge="genotype_ancestry_attempt1_${date}"
61 | ancestry_flip="ref_panels_flippedalleles_${date}"
62 | myadmix="genotype_ancestry_merge_${date}"
63 |
64 | # merge cohort + reference pops
65 | plink --bfile $sagegenofile --bmerge $ancestralgenofile.bed $ancestralgenofile.bim $ancestralgenofile.fam --make-bed --out $merge
66 |
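67 | # if the merge fails on strand-mismatched snps, flip the snps listed in the missnp file and merge again
68 | plink --bfile $ancestralgenofile --flip ${merge}-merge.missnp --make-bed --out $ancestry_flip # PLINK 1.9 names the list ${merge}-merge.missnp; PLINK 1.07 uses ${merge}.missnp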
69 | plink --bfile $sagegenofile --bmerge $ancestry_flip --make-bed --out $merge
70 |
71 | # if error persists, remove the listed snps from both datasets with --exclude <missnp file> and merge again
72 |
73 | # recode Cohort + Ref binary files in 1/2 format
74 | plink --bfile $merge --recode 12 --out $myadmix
75 | ```
76 |
77 | *optional: thin files for admixture, see bottom*
78 |
79 |
80 | ## R: build popfile
81 | A .pop file is required for supervised admixture estimation. Each line of the .pop file corresponds to the individual listed on the same line number in the .fam file. If the individual is a population reference, the .pop file line should be a string (beginning with an alphanumeric character) designating the population. If the individual is of unknown ancestry, use "-" (or a blank line, or any non-alphanumeric character) to indicate that the ancestry should be estimated. The final format should be a **single column** with **no header** that lines up with the fam file as shown here:
82 |
83 | fam_ID_source | popfile
84 | ------|------
85 | sage | -
86 | sage | -
87 | sage | -
88 | euro_ref | ceu
89 | euro_ref | ceu
90 | afr_ref | yri
91 | afr_ref | yri
92 | sage | -
93 | sage | -
94 |
95 | ```R
96 | # R variables
97 | setwd("PATH_TO/wkdir")
98 | date <- "171231" # update current yymmdd
99 | mydat <- "genotype_ancestry_merge_" # same prefix as myadmix
100 | merge <- read.table(file=paste(mydat, date, ".fam", sep=""), header = F, sep = ' ') # .fam file has the samples for your population and reference populations
101 | ceu <- read.table("samples_ceu.txt", header = F, sep = ' ') # european ancestry
102 | yri <- read.table("samples_yri.txt", header = F, sep = ' ') # african ancestry
103 | nam <- read.table("samples_nam.txt", header = F, sep = ' ') # native american ancestry
104 |
105 | # make popfile
106 | merge$pop = ifelse(merge$V2 %in% ceu$V1, 'CEU', ifelse(merge$V2 %in% yri$V1, 'YRI', '-')) # V2 = IID column of the .fam (header = F names the columns V1, V2, ...)
107 | # if IID is from CEU ref, write CEU;
108 | # if IID is from YRI ref, write YRI;
109 | # (for a third reference such as NAM, nest one more ifelse for nam$V1 and set npop to 3);
110 | # if IID is in neither ref, it is a SAGE individual with unknown ancestry; write - as placeholder
111 | popfile <- as.data.frame(merge$pop)
112 |
113 | # write out
114 | write.table(popfile, file=paste(mydat, date, ".pop", sep=""), row.names = F, col.names = F, quote = F) # col.names = F: the popfile must not have a header
115 | # check that same prefix as $myadmix input
116 | ```
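Because ADMIXTURE matches the `.pop` file to the samples purely by line number, a quick shell check before running can catch a misaligned popfile. The sketch below is illustrative only: `check_popfile` and the `demo.*` files are hypothetical names, not part of the original pipeline; substitute your own `.fam` and `.pop` files.

```bash
# hypothetical helper: compare line counts of a .fam file and its .pop file
check_popfile() {
    n_fam=$(wc -l < "$1" | tr -d ' ')
    n_pop=$(wc -l < "$2" | tr -d ' ')
    if [ "$n_fam" -ne "$n_pop" ]; then
        echo "MISMATCH: $n_fam fam lines vs $n_pop pop lines"
        return 1
    fi
    echo "OK: $n_fam lines"
}

# tiny demo files (made up for illustration)
demo=$(mktemp -d)
printf '%s\n' 'sage1 sage1 0 0 1 -9' 'ceu1 ceu1 0 0 1 -9' 'yri1 yri1 0 0 2 -9' > "$demo/demo.fam"
printf '%s\n' '-' 'CEU' 'YRI' > "$demo/demo.pop"

check_result=$(check_popfile "$demo/demo.fam" "$demo/demo.pop")
echo "$check_result"

# count how many samples carry each label ("-" = ancestry to be estimated)
sort "$demo/demo.pop" | uniq -c
n_unknown=$(grep -c '^-$' "$demo/demo.pop")
```

A matching line count does not prove the labels are on the right rows, so it is still worth viewing the two files side by side with `paste` for a few reference samples.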
117 |
118 | ## ADMIXTURE: calculate global ancestry
119 | ```bash
120 | # ADMIXTURE variables
121 |
122 | ## required
123 | myadmix="genotype_ancestry_merge_${date}"
124 | popfile=${myadmix}.pop
125 | npop="2" # must reflect number of populations in reference panel
126 |
127 | ## optional
128 | nthreads="64"
129 | accelnum="qn3"
130 | seed="2017"
131 | #bootnum="200" # optional
132 |
133 |
134 | # admixture script
135 | admixture $myadmix.ped $npop --supervised --seed=$seed -j$nthreads
136 | ```
137 |
138 |
139 | ## Parsing ADMIXTURE results
140 |
141 |
142 | ### Admixture output:
143 | myadmix.2.P - allele frequencies of the inferred ancestral populations (SNP x ancestry)
144 | *order: CEU YRI*
145 | ```bash
146 | wc -l $myadmix.2.P
147 | # 801660 lines
148 | head $myadmix.2.P
149 | # 0.006379 0.000010
150 | # 0.827382 0.404886
151 | # 0.048119 0.021815
152 | # 0.874641 0.306484
153 | ```
154 | myadmix.2.Q - ancestry fractions for each individual (sample x ancestry)
155 | *order: CEU YRI*
156 | ```bash
157 | wc -l $myadmix.2.Q
158 | # 2165 lines
159 | head $myadmix.2.Q
160 | # 0.142886 0.857114
161 | # 0.123782 0.876218
162 | # 0.208400 0.791600
163 | # 0.109842 0.890158
164 | ```
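Each row of the `.Q` file holds one individual's ancestry fractions, so every row should sum to 1 and the row count should equal the number of samples. A minimal sketch of that sanity check, using a hypothetical demo file filled with the four rows shown above (point it at your own `$myadmix.2.Q` instead):

```bash
# hypothetical demo file holding the four rows shown above
demo_q=$(mktemp)
printf '%s\n' '0.142886 0.857114' '0.123782 0.876218' '0.208400 0.791600' '0.109842 0.890158' > "$demo_q"

# count rows whose ancestry fractions do not sum to 1 (within a small tolerance)
bad_rows=$(awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i
                  if (s < 0.999 || s > 1.001) bad++ }
                END { print bad + 0 }' "$demo_q")
echo "rows not summing to 1: $bad_rows"

# the number of rows should equal the number of samples in the .fam file
n_rows=$(wc -l < "$demo_q" | tr -d ' ')
echo "rows: $n_rows"
```

The tolerance of 0.001 is an arbitrary choice for this sketch; it only needs to absorb the six-decimal rounding in the output.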
165 |
166 | ### Reformat results
167 | The following output can be used with programs such as [REAP](http://faculty.washington.edu/tathornt/software/REAP/REAP_Documentation.pdf) to calculate genetic relatedness matrices
168 |
169 | ```bash
170 | # REAP variables
171 | reapinput="genotype_transposed_${date}"
172 | myID="${reapinput}_IID_only.txt"
173 | admixprop="admixprop_${date}"
174 | admixprop_sorted="admixprop_sorted_${date}"
175 |
176 | # transpose genotype files
177 | plink --bfile $sagegenofile --recode 12 transpose --output-missing-genotype 0 --out $reapinput
178 |
179 | # reformat
180 | cut -d' ' -f1 $reapinput.tfam > $myID # SAGE ID list
181 | cut -d' ' -f1-2 merged_trial.fam | paste -d' ' - $myadmix.2.Q > admix_out_IDs_${date} # paste global ancestries to IDs; FID IID CEU YRI
182 | # -d' ' reads space as delimiter
183 | # -f1-2 selects fields 1 through 2 to extract with cut
184 | # - directs paste to use the standard input from the pipe instead of a file
185 | grep -Fwf $myID admix_out_IDs_${date} > $admixprop # extract sage only
186 | # -f read patterns from file1
187 | # -F read patterns as plain strings
188 | # -w match patterns as whole word
189 | awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' $admixprop $reapinput.tfam > $admixprop_sorted
190 | # sorts $admixprop to match SAGE fam file
191 | # FNR==NR holds true for first named file
192 | # current line $0 is stored in associative array named x2 indexed by the first field [$1]
193 | # $1 in x2 will only start after first name file has been read completely
194 | # looks at the first field of line in second file, and prints the corresponding line from first
195 | ```
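The two-file `awk` idiom used for the sort can be tried on a toy example before running it on real data. The files and values below are made up for illustration. Note that rows whose ID does not appear in the second file are dropped, which is what restricts the output to the cohort of interest.

```bash
demo=$(mktemp -d)
# ancestry table in arbitrary order: FID IID EUR AFR (made-up values)
printf '%s\n' 'B B 0.2 0.8' 'C C 0.3 0.7' 'A A 0.1 0.9' > "$demo/admixprop"
# target order, e.g. the leading columns of a .tfam file
printf '%s\n' 'A A 0 0 1 -9' 'B B 0 0 2 -9' > "$demo/order.tfam"

# first file is loaded into the array x2; second file drives the output order
sorted=$(awk 'FNR==NR {x2[$1] = $0; next} $1 in x2 {print x2[$1]}' "$demo/admixprop" "$demo/order.tfam")
echo "$sorted"

first_line=$(echo "$sorted" | head -1)
n_sorted=$(echo "$sorted" | wc -l | tr -d ' ')
```

Here individual `C` is absent from `order.tfam`, so only `A` and `B` are printed, in the order given by the second file.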
196 |
197 | **output**
198 | *order: FID IID EUR AFR*
199 | ```bash
200 | head $admixprop_sorted
201 | # CH30380 CH30380 0.153140 0.846860
202 | # VA30171 VA30171 0.120529 0.879471
203 | # VA30167 VA30167 0.180591 0.819409
204 | # CH30357 CH30357 0.134801 0.865199
205 | ```
206 |
207 | ### Plotting results
208 |
209 | ```R
210 | #### files must be formatted with: race, eur, afr, natam
211 | # no subjectID, race column must come first. check example inputs
212 |
213 | ### Call libraries
214 | require(truncnorm)
215 | require(ggplot2)
216 | require(gridExtra)
217 | require(reshape2)
218 |
219 | # set up space
220 | setwd("PATH_TO/wkdir")
221 | date <- "171231" # update current yymmdd
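222 | sage <- read.table(paste0("genotype_ancestry_merge_", date, ".2.Q"), header = F) # the $myadmix.2.Q file; R does not expand shell variables
223 | colnames(sage) <- c("EUR","AFR")
224 | sage <- cbind(race = "AA", sage) # assign the result so the race column is kept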
225 | head(sage)
226 | # race EUR AFR
227 | # AA 0.142886 0.857114
228 | # AA 0.123782 0.876218
229 | # AA 0.208400 0.791600
230 | # AA 0.109842 0.890158
231 |
232 | # order data by increasing european ancestry
233 | eur.sort = sage[order(sage$EUR),]
234 | eur.sort$rank = 1:length(eur.sort$EUR)
235 |
236 | # format data into geom_bar input
237 | eur.sort.long = melt(eur.sort, id = c('rank','race'), variable.name = 'Ancestry')
238 | head(eur.sort.long)
239 | # race rank Ancestry value
240 | # AA 1 EUR 0.088553
241 | # AA 2 EUR 0.100432
242 | # AA 3 EUR 0.134868
243 | # AA 4 EUR 0.181657
244 |
245 | # create plots
246 | afr.plot <- ggplot(eur.sort.long, aes(x=rank, y=value, fill=Ancestry))
247 | afr.plot <- afr.plot + geom_bar(width=1, stat='identity') # play with width to see what it does
248 | afr.plot <- afr.plot + theme(legend.position="bottom") # move legend
249 | afr.plot <- afr.plot + xlab('Individual') + ylab('Genetic ancestry')# label the axes
250 | afr.plot <- afr.plot + ggtitle('SAGE 2') # create a title
251 | afr.plot <- afr.plot + scale_fill_manual(labels=c("Eur","Afr"),values=c("#cc9999","#660099"))
252 | afr.plot <- afr.plot + theme(axis.text.x=element_blank(),
253 | axis.ticks.x=element_blank())
254 |
255 | afr.plot
256 | ```
257 |
258 | 
259 |
260 |
261 | \# **QED** \#
262 |
263 | ***
264 |
265 | ## Thinning files
266 | [Documentation](https://www.genetics.ucla.edu/software/admixture/admixture-manual.pdf) SEC 2.3
267 |
268 | * speeds up ADMIXTURE
269 |
270 | *SAGE 2*
271 | ```bash
272 | # PLINK variables
273 | datadir="path/to/your/genotype.data.dir"
274 | genofile=$myadmix
275 | slidingwindowsize=50
276 | slidewidth=10
277 | r2val="0.1"
278 | thingenofile="${myadmix}_thinned_indeppairwise_${slidingwindowsize}_${slidewidth}_${r2val}"
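279 | # run PLINK: step 1 only writes the list of SNPs to keep ($thingenofile.prune.in)
280 | plink --bfile $genofile --indep-pairwise $slidingwindowsize $slidewidth $r2val --out $thingenofile
281 | plink --bfile $genofile --extract $thingenofile.prune.in --make-bed --out $thingenofile # step 2: write the thinned binary files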
282 | ```
283 | You can then continue through the popfile and admixture steps with the thinned files.
284 | **NB**: Remember to test the order of thingenofile before making popfile
285 |
--------------------------------------------------------------------------------
/BASH Commands to Navigate the server.md:
--------------------------------------------------------------------------------
1 | # Navigating the Server: Bash Commands
2 | ### Notes by: Pagé Goddard
3 | ### Tutorial session by: Jennifer Liberto
4 | ___
5 | ### Contents
6 | * [`ls` - Looking in Directories](#ls)
7 | * [Permissions](#permissions)
8 | * [`cd` - Changing Directories](#cd)
9 | * [`less` - Read files](#less)
10 | * [`screen` - Screen a command](#screen)
11 | * [`scp` - Copying files to desktop](#scp)
12 |
13 |
14 | ### look at your directory contents
15 | * `ls` lists just the file and subfolder names in your directory
16 | * `ls -1` does the same, but puts it all in a single column
17 | * `ll` lists contents with permissions, size, and date information
18 | * `ls -lrta` lists everything, including hidden files (`-a`), in long format (`-l`), sorted by time (`-t`) in reverse (`-r`), so the newest entries come last
19 |
20 |
21 | ### permissions
22 | `drwxr-xr--` *example 1*
23 | `-rwxr-xr--` *example 2*
24 |
25 | * initial `d` = directory; `-` = file
26 | * `w` = can write
27 | * `x` = can execute
28 | * `r` = can read
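
After the first character, the nine remaining characters are three `rwx` triplets: one for the file's owner, one for its group, and one for everyone else. The sketch below reproduces both example strings in a throwaway directory; the names `example_dir` and `example_file` are hypothetical.

```bash
demo=$(mktemp -d)
mkdir "$demo/example_dir"
touch "$demo/example_file"

# 754 = rwx for the owner, r-x for the group, r-- for everyone else
chmod 754 "$demo/example_dir" "$demo/example_file"

# keep only the 10-character permission string from the long listing
dir_perms=$(ls -ld "$demo/example_dir" | cut -c1-10)
file_perms=$(ls -l "$demo/example_file" | cut -c1-10)
echo "$dir_perms"    # drwxr-xr--
echo "$file_perms"   # -rwxr-xr--
```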
29 |
30 |
31 | ### changing directories
32 |
33 | ```bash
34 | pwd
35 | /media/burchardraid01/datafreeze # example directory
36 |
37 | cd .. && pwd # change then print new directory
38 | /media/burchardraid01 # parent directory
39 |
40 | cd ~ && pwd
41 | /media/.../your.directory # you are now in your home directory
42 |
43 | cd path/to/desired/directory && pwd
44 | path/to/desired/directory # you are now in the directory you asked for
45 | ```
46 |
47 | `pwd` = print working directory; tells you where you are
48 |
49 | `cd` = change directory
50 |
51 | `.` = current directory
52 |
53 | `..` = parent directory (one level up)
54 |
55 | `~` = home
56 |
57 | ###### note: when typing a path, use `/` not backslash; if you get an error, check that you are not trying to `cd` into a file instead of the directory; if error persists, check your slashes, your capitalization and your path
58 |
59 |
60 | ### visualize files
61 | `less filename` = opens 1 page view of data without printing the whole thing to screen; move by line (arrow keys) or page (space)
62 |
63 | `head filename` = prints first 10 lines of file
64 |
65 | * `head -3 filename` = prints first 3 lines
66 | * `tail filename` = prints last 10 lines of file
67 | * `tail -3 filename` = print last 3 lines
68 | * `head -13 filename | tail -3` = print the 3 lines after line 10 of file (lines 11-13)
69 | * `cut -d' ' -f 1-5 filename` = print columns 1-5 of file; `d` is your file delimiter (the symbol that separates columns, usually space `' '`, comma `','`, or tab `'\t'`)
70 | * `cut -d' ' -f 1-5 filename | head -5 -` = print only rows 1-5 of columns 1-5 of file
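
Combining `head` and `tail` to pull out a block of lines is easy to get off by a few rows, so it helps to try it on a file whose contents are its own line numbers. A small sketch with a hypothetical demo file:

```bash
demo_file=$(mktemp)
seq 1 20 > "$demo_file"          # 20 lines containing the numbers 1 to 20

after_ten=$(head -13 "$demo_file" | tail -3)          # lines 11-13
same_with_sed=$(sed -n '11,13p' "$demo_file")         # sed prints the same range directly
last_three_of_ten=$(head -10 "$demo_file" | tail -3)  # lines 8-10, not 11-13
echo "$after_ten"
```

`sed -n 'START,ENDp'` is often the clearer choice when you already know the exact line range.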
71 |
72 | `vim filename` = lets you view and edit file
73 |
74 | * to edit, press `i` (for insert) and write as you please
75 | * navigate lines with arrow keys
76 | * to save, press `Esc` and type `:w`
77 | * to quit, press `Esc` and type `:q`
78 | * to save and quit, press `Esc` and type `:wq`
79 |
80 |
81 | ### run a process in the background
82 | *this is especially useful if you are running an automated script that will take a while and doesn't require interaction after it starts*
83 |
84 | start a new screen
85 | ```bash
86 | screen
87 | # this changes your console window to a detachable "screen"
88 | # title will now read "screen 0: username@hostname:~/PATH/TO/current_directory"
89 | # normal title bar reads: "username@hostname:~/PATH/TO/current_directory"
90 | ```
91 | start a new screen with a particular name
92 | ```bash
93 | screen -S name
94 | ```
95 | detach your screen
96 | ```bash
97 | # while in the new screen type
98 | ctrl+A ctrl+D
99 | ```
100 | return to your screen
101 | ```bash
102 | screen -r # if you only have one screen
103 | # if you have multiple screens, the list will print here
104 |
105 | screen -r [number/name] # if returning to one of multiple screens
106 | ```
107 | kill an old screen
108 | ```bash
109 | screen -r [name/number] # return to screen
110 |
111 | # type
112 | ctrl+A K
113 | ```
114 |
115 |
116 | ### copying files to your computer
117 |
118 | ##### on a mac/linux
119 |
120 | ```bash
121 | exit # leave server
122 | scp username@hostname.edu:path/to/file/filename.csv ~/Desktop
123 | ```
124 |
125 | * translation: copy/paste \[this file on server\] \[to my personal desktop\]
126 | * `scp` = secure copy (copies files over a secure ssh connection)
127 | * to push to server, just switch ~/Desktop and username@hostname.edu:/path
128 |
129 | ##### or use filezilla GUI
130 |
131 | Angel's step-by-step instructions for setting up `FileZilla` can be found on the Wiki [here](https://wiki.library.ucsf.edu/display/UAC/How+to+transfer+files+between+cesar+and+your+desktop+with+your+private+key)
132 |
133 | ## Congrats! you can now read your data into R studio from your personal computer
134 |
--------------------------------------------------------------------------------
/CITATION.cff:
--------------------------------------------------------------------------------
1 | cff-version: 1.2.0
2 | message: "If you found any of this code uniquely helpful, you may cite it as below."
3 | authors:
4 | - family-names: "Goddard"
5 | given-names: "P"
6 | orcid: "https://orcid.org/0000-0001-8187-5316"
7 | - family-names: "Elhawary"
8 | given-names: "J"
9 | orcid: "https://orcid.org/0000-0003-3326-1680"
10 | title: "Burchardlab Tutorials {optional tutorial title}"
11 | version: 1.0.0
12 | doi: 10.5281/zenodo.1234
13 | date-released: 2021-08-16
14 | url: "https://github.com/pcgoddard/Burchardlab_Tutorials/wiki"
15 |
--------------------------------------------------------------------------------
/GENESIS_PCRelate_Tut.md:
--------------------------------------------------------------------------------
1 | # GENESIS PC-Relate Tutorial
2 | #### Pagé Goddard
3 |
4 | "`GENESIS` uses `PC-AiR` for **population structure** inference that is robust to known or cryptic relatedness, and it uses `PC-Relate` for accurate **relatedness estimation** in the presence of population structure, admixture, and departures from Hardy-Weinberg equilibrium."
5 |
6 | ### Resources
7 |
8 | [GENESIS Vignette](https://rdrr.io/bioc/GENESIS/f/vignettes/pcair.Rmd),
9 | [Bioconductor Vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/GENESIS/inst/doc/pcair.html#plink-files)
10 |
11 | [KING Documentation](http://people.virginia.edu/~wc9c/KING/manual.html)
12 |
13 | [SNPRelate](https://www.rdocumentation.org/packages/SNPRelate/versions/1.6.4)
14 |
15 | ### Libraries
16 | ```R
17 | # be sure to install biocLite
18 | source("https://bioconductor.org/biocLite.R")
19 | biocLite(c('GENESIS',"GWASTools", "SNPRelate"))
20 | library(GENESIS)
21 | library(GWASTools)
22 | library(SNPRelate)
23 | library(gdsfmt)
24 | ```
25 |
26 | ### Input Files
27 | GENESIS uses a model-free approach and thus requires only the genotype file and no externally calculated ancestry proportions. The functions in the `GENESIS` package read genotype data from a GenotypeData class object created by the `GWASTools` package. Through the use of `GWASTools`, a `GenotypeData` class object can easily be created from:
28 |
29 | * Plink Files
30 | * GDS File
31 | * R Matrix of Genotype Data
32 |
33 | #### PLINK files
34 | The `SNPRelate` package provides the `snpgdsBED2GDS` function to convert binary PLINK files into a GDS file.
35 |
36 | file | description
37 | -----------|-----------
38 | `bed.fn` | path to `PLINK.bed` file
39 | `bim.fn` | path to `PLINK.bim` file
40 | `fam.fn` | path to `PLINK.fam` file
41 | `out.gdsfn`|path for output GDS file
42 |
43 | ```R
44 | snpgdsBED2GDS(bed.fn = "genotype.bed", bim.fn = "genotype.bim", fam.fn = "genotype.fam", out.gdsfn = "mygenotype.gds")
45 | ```
46 |
47 | Then continue with GDS file instructions:
48 |
49 | #### GDS files
50 |
51 | ```R
52 | mygeno <- GdsGenotypeReader(filename = "PATH_TO/mygenotype.gds")
53 | mygenoData <- GenotypeData(mygeno)
54 | ```
55 |
56 | #### R Matrix
57 | file | description
58 | -----------|-----------
59 | `genotype` | matrix of genotype values coded as 0 / 1 / 2; rows index SNPs; columns index samples
60 | `snpID` | integer vector of unique SNP IDs
61 | `chromosome` | integer vector specifying chr of each snp
62 | `position` | integer vector specifying position of each SNP
63 | `scanID` | vector of unique individual IDs
64 |
65 | ```R
66 | mygeno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID, chromosome = chromosome, position = position, scanID = scanID)
67 |
68 | mygenoData <- GenotypeData(mygeno)
69 | ```
70 |
71 | ### Pairwise Measures of Ancestry Divergence
72 | Identify a **mutually unrelated** and **ancestry representative** subset of individuals. The `KING-robust` kinship coefficient estimator gives negative estimates for unrelated pairs with divergent ancestry, which is used to prioritize ancestrally diverse individuals for the representative subset. The estimates can be calculated with the external `KING` program (then imported with `king2mat` from `GENESIS`) or with `snpgdsIBDKING` from `SNPRelate`.
73 |
74 | ##### with KING-robust
75 | ###### KING (external [program](http://people.virginia.edu/~wc9c/KING/manual.html))
76 | * **input**: PLINK binary ped file `genotype.bed`
77 | * **command**: king -b `genotype.bed` --kinship
78 | * **output**: `king.kin`, `king.kin0`
79 |
80 | KING Output: `.kin` and `.kin0` text files. `king2mat` extracts kinship coefficients for `GENESIS` functions.
81 |
82 | ```bash
83 | king -b genotype.bed --kinship
84 | ```
85 |
86 | ```bash
87 | # sample output
88 |
89 | head king.kin # within family
90 | ## FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 Kinship Error
91 | ## 28 1 2 2359853 0.000 0.2500 0.162 0.0008 0.2459 0
92 | ## 28 1 3 2351257 0.000 0.2500 0.161 0.0008 0.2466 0
93 | ## 28 2 3 2368538 1.000 0.0000 0.120 0.0634 -0.0108 0
94 | ## 117 1 2 2354279 0.000 0.2500 0.163 0.0006 0.2477 0
95 | ## 117 1 3 2358957 0.000 0.2500 0.164 0.0006 0.2490 0
96 |
97 | head king.kin0 # between family
98 | ## FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship
99 | ## 28 3 117 1 2360618 0.143 0.0267 0.1356
100 | ## 28 3 117 2 2352628 0.161 0.0009 0.2441
101 | ## 28 3 117 3 2354540 0.120 0.0624 -0.0119
102 | ## 28 3 1344 1 2361807 0.093 0.1095 -0.2295
103 | ## 28 3 1344 12 2367180 0.094 0.1091 -0.2225
104 | ```
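Before moving into R, the KING output can be screened in the shell for closely related pairs. The sketch below keeps between-family pairs whose `Kinship` column exceeds a cutoff. The demo file is hypothetical and simply holds the five rows shown above, and the cutoff of 0.0884 is an assumption taken from the second-degree lower bound in the KING manual's inference ranges; adjust it to your own needs.

```bash
demo_kin0=$(mktemp)
# header plus the five between-family rows shown above
printf '%s\n' \
  'FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship' \
  '28 3 117 1 2360618 0.143 0.0267 0.1356' \
  '28 3 117 2 2352628 0.161 0.0009 0.2441' \
  '28 3 117 3 2354540 0.120 0.0624 -0.0119' \
  '28 3 1344 1 2361807 0.093 0.1095 -0.2295' \
  '28 3 1344 12 2367180 0.094 0.1091 -0.2225' > "$demo_kin0"

# keep the header and any pair with Kinship (column 8) above the cutoff
cutoff=0.0884
related=$(awk -v t="$cutoff" 'NR == 1 || $8 > t' "$demo_kin0")
echo "$related"

# number of related pairs, not counting the header
n_related=$(echo "$related" | awk 'NR > 1' | wc -l | tr -d ' ')
```

For a within-family `.kin` file the same filter applies, but `Kinship` is column 9 rather than column 8.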
105 |
106 | ###### convert to matrix with king2mat (GENESIS package)
107 | * **input**: `.kin` and `.kin0` and `iid`
108 | * **command**: `king2mat`
109 | * **output**: matrix
110 |
111 | `iid` = text file containing iids in order from GenotypeData
112 |
113 | ```R
114 | # read individual IDs from GenotypeData object
115 | iids <- getScanID(mygenoData)
116 | head(iids)
117 |
118 | # create matrix of KING estimates
119 | KINGmat <- king2mat(file.kin0 = "PATH_TO/wkdir/king.kin0", # between-family estimates from KING
120 | file.kin = "PATH_TO/wkdir/king.kin", # within-family estimates from KING (system.file only finds files shipped inside a package)
121 | iids = iids)
122 | ```
123 |
124 | ```R
125 | # sample output
126 |
127 | KINGmat[1:5,1:5]
128 |
129 | ## NA19919 NA19916 NA19835 NA20282 NA19703
130 | ## NA19919 0.5000 -0.0009 -0.0059 -0.0080 0.0014
131 | ## NA19916 -0.0009 0.5000 -0.0063 -0.0150 -0.0039
132 | ## NA19835 -0.0059 -0.0063 0.5000 -0.0094 -0.0104
133 | ## NA20282 -0.0080 -0.0150 -0.0094 0.5000 -0.0134
134 | ## NA19703 0.0014 -0.0039 -0.0104 -0.0134 0.5000
135 | ```
136 |
137 | ##### with SNPRelate
138 | ###### preferred since it can be done without leaving R
139 | The vignette states: "Alternative to running the KING software, the `snpgdsIBDKING` function from the `SNPRelate` package can be used to calculate the KING-robust estimates directly from a GDS file. The output of this function contains a matrix of pairwise estimates, which can be used by the `GENESIS` functions". **In practice this is not quite true:** the returned kinship matrix has no row or column names, so you must extract it and add the IIDs as row and column names yourself.
140 |
141 | Input: `mygenotype.gds`
142 | Command: `snpgdsIBDKING`
143 | Output: matrix of pairwise estimates
144 |
145 | Options:
146 | **type** ("KING-robust" - for admixed pop,"KING-homo" - for homogeneous pop),
147 | **sample.id** (choose samples subset; default all),
148 | **snp.id** (choose SNPs subset; default all),
149 | **autosome.only** (default TRUE),
150 | **maf** (to use the SNPs with ">=maf" only; default no threshold)
151 | **family.id** (default NULL; all individuals treated as singletons. If provided, within- and between- family relationships are estimated differently),
152 | **verbose**
153 |
154 | ```r
155 | # snpgdsIBDKING requires a gds object; you cannot just point command to a .gds file
156 | # read in the GDS file you just generated and verify its class
157 | gdsfile <- snpgdsOpen(paste(genotype,".gds",sep=""))
158 | class(gdsfile) #check for "gds.class"
159 | ```
160 | ```r
161 | # calculate KING IBD Kinship coefficients
162 | ibd_king <- snpgdsIBDKING(gdsfile, type="KING-robust", verbose=TRUE)
163 | ```
164 | ```r
165 | class(ibd_king)
166 | # snpgdsIBDClass (5 elements)
167 | names(ibd_king)
168 | # [1] "sample.id" "snp.id" "afreq" "IBS0" "kinship"
169 | ```
170 | ```r
171 | # extract kinship matrix
172 | KINGmat = as.matrix(ibd_king$kinship)
173 |
174 | # check output
175 | KINGmat[1:5,][,1:5]
176 |
177 | ## 0.5000000000 -0.004695793 0.001941412 -0.0009816093 -0.01903618
178 | ## -0.0046957930 0.500000000 -0.009200767 -0.0070120462 -0.03430987
179 | ## 0.0019414117 -0.009200767 0.500000000 -0.0085147706 -0.01809720
180 | ## -0.0009816093 -0.007012046 -0.008514771 0.5000000000 -0.02480543
181 | ## -0.0190361844 -0.034309866 -0.018097203 -0.0248054274 0.50000000
182 |
183 | #add row and column labels
184 | rownames(KINGmat) <- c(ibd_king$sample.id)
185 | colnames(KINGmat) <- c(ibd_king$sample.id)
186 | ```
187 |
188 | ```R
189 | # SAGE2 output using SNPRelate
190 |
191 | KINGmat[1:5,][,1:5]
192 |
193 | ## CH30380 VA30171 VA30167 CH30357 VA70028
194 | ## CH30380 0.5000000000 -0.004695793 0.001941412 -0.0009816093 -0.01903618
195 | ## VA30171 -0.0046957930 0.500000000 -0.009200767 -0.0070120462 -0.03430987
196 | ## VA30167 0.0019414117 -0.009200767 0.500000000 -0.0085147706 -0.01809720
197 | ## CH30357 -0.0009816093 -0.007012046 -0.008514771 0.5000000000 -0.02480543
198 | ## VA70028 -0.0190361844 -0.034309866 -0.018097203 -0.0248054274 0.50000000
199 | ```
200 |
201 | ### Running PC-AIR
202 | Uses pairwise measures of kinship and ancestry divergence to determine the unrelated and representative subset for analysis.
203 |
204 | The KING-robust estimates are always used as measures of ancestry divergence for unrelated pairs of individuals; can also be used as measures of kinship for relatives (NOTE: they may be biased measures of kinship for admixed relatives with different ancestry)
205 |
206 | ###### Input:
207 | input | description
208 | ------|------
209 | `genoData` | `GenotypeData` class object
210 | `kinMat` | matrix of pairwise kinship coefficient estimates (KING-robust estimates or other source)
211 | `divMat` | matrix of pairwise measures of ancestry divergence (KING-robust estimates)
212 |
213 | ```R
214 | # run PC-AiR
215 | mypcair <- pcair(genoData = mygenoData, kinMat = KINGmat, divMat = KINGmat)
216 | ```
217 | You should see the following verbiage:
218 |
219 | ```
220 | Partitioning Samples into Related and Unrelated Sets...
221 | Unrelated Set: 1665 Samples
222 | Related Set: 320 Samples
223 | Running Analysis with 748665 SNPs - in 75 Block(s)
224 | Computing Genetic Correlation Matrix for the Unrelated Set: Block 1 of 75 ...
225 | ...
226 | Computing Genetic Correlation Matrix for the Unrelated Set: Block 75 of 75 ...
227 | Performing PCA on the Unrelated Set...
228 | Predicting PC Values for the Related Set: Block 1 of 75 ...
229 | ...
230 | Predicting PC Values for the Related Set: Block 75 of 75 ...
231 | Concatenating Results...
232 | ```
233 |
234 | ###### Output:
235 |
236 | ```R
237 | summary(mypcair)
238 | ```
239 |
240 | ```
241 | Call:
242 | pcair(genoData = myGenoData, kinMat = KINGmat, divMat = KINGmat)
243 |
244 | PCA Method: PC-AiR
245 |
246 | Sample Size: 1985
247 | Unrelated Set: 1665 Samples
248 | Related Set: 320 Samples
249 |
250 | Kinship Threshold: 0.02209709
251 | Divergence Threshold: -0.02209709
252 |
253 | Principal Components Returned: 20
254 | Eigenvalues: 11.286 2.216 1.95 1.924 1.914 1.866 1.849 1.837 1.815 1.797 ...
255 |
256 | MAF Filter: 0.01
257 | SNPs Used: 644529
258 | ```
259 |
260 | ###### Options
261 |
262 | using a reference population alongside sample; perhaps to determine which PCs capture which ancestral groups
263 |
264 | ```r
265 | pcair(genoData = mygenoData, unrel.set = IDs) # IDs of individuals from ref panel included in mygenoData
266 |
267 | pcair(genoData = mygenoData, kinMat = KINGmat, divMat = KINGmat, unrel.set = IDs) # IDs of individuals from ref panel included in mygenoData; partition IDs first, then sample
268 | ```
269 |
270 | ###### Plotting PC-Air PCs
271 |
272 | The `plot` method is provided by the `GENESIS` package. Each point represents one individual. It visualizes population structure to **identify clusters of individuals with similar ancestry**, and can be altered by standard `plot` function manipulation. [Basis](https://www.rdocumentation.org/packages/graphics/versions/3.4.0/topics/plot) and [More Details](http://www.statmethods.net/advgraphs/parameters.html)
273 |
274 | default: black dots = unrelated subset**;** blue pluses = related subsets
275 |
276 | ```r
277 | # plot top 2 PCs
278 | plot(mypcair)
279 |
280 | # plot PCs 3 and 4
281 | plot(mypcair, vx = 3, vy = 4)
282 | ```
283 |
284 | ### Running PC-Relate
285 | Provides **genetic relatedness estimates**. Uses top PCs from PC-Air (ancestry capturing components) to adjust for population structure & individual ancestry in sample.
286 |
287 | ###### Input:
288 |
289 | input | file | description
290 | ------|------|------
291 | `training.set` | `mypcair`$`unrels` | vector of IIDs specifying unrelated subset to be used for ancestry adjustment per SNP
292 | `genoData` | `mygenoData` | GenotypeData class object
293 | `pcMat` | `mypcair`$`vectors[,1:n]` | matrix; columns are PCs 1 - n
294 |
295 | ```r
296 | # run PC-Relate
297 | mypcrelate <- pcrelate(genoData = mygenoData, pcMat = mypcair$vectors[,1:2], training.set = mypcair$unrels)
298 | ```
299 |
300 | You should see the following verbiage:
301 | ```
302 | Running Analysis with 748665 SNPs - in 75 Block(s)
303 | Running Analysis with 1985 Samples - in 1 Block(s)
304 | Using 2 PC(s) in pcMat to Calculate Adjusted Estimates
305 | Using 1665 Samples in training.set to Estimate PC effects on Allele Frequencies
306 | Computing PC-Relate Estimates...
307 | ...SNP Block 1 of 75 Completed - 7.165 mins
308 | ...
309 | ...SNP Block 75 of 75 Completed - 5.876 mins
310 | Performing Small Sample Correction...
311 | ```
312 |
313 | ###### Output:
314 | * `write.to.gds = FALSE` = pcrelate obj (default)
315 | * `write.to.gds = TRUE` = gds file `tmp_pcrelate.gds`
316 |
317 | to read `tmp_pcrelate.gds`:
318 | ```r
319 | library(gdsfmt)
320 | mypcrelate <- openfn.gds("tmp_pcrelate.gds")
321 | ```
322 |
323 | ###### to parse PCRelate outputs:
324 |
325 | command | description
326 | --------|--------
327 | `pcrelateReadKinship` | make a table of pairwise relatedness estimates
328 | `pcrelateReadInbreed` | make a table of individual inbreeding coefficients
329 | `pcrelateMakeGRM` | make a genetic relatedness matrix
330 |
331 | option | description
332 | -----|-----
333 | `pcrelObj` | output from pcrelate; either a class pcrelate object or a GDS file
334 | `scan.include` | vector of individual IDs specifying which individuals to include in the table or matrix; default NULL
335 | `kin.thresh` | minimum kinship coefficient value to include in the table
336 | `f.thresh` | minimum inbreeding coefficient value to include in the table
337 | `scaleKin` | factor to multiply the kinship coefficients by in the GRM; default 2
338 |
339 | ```r
340 | # make pairwise relatedness estimates table
341 | relatepairs.tbl <- pcrelateReadKinship(pcrelObj = mypcrelate, kin.thresh = 2^(-9/2))
342 |
343 | # make inbreeding coefficient table
344 | inbreedcoef.tbl <- pcrelateReadInbreed(pcrelObj = mypcrelate, f.thresh = 2^(-11/2))
345 |
346 | # make GRM
347 | mygrm <- pcrelateMakeGRM(pcrelObj = mypcrelate, scan.include = iids[1:5], scaleKin = 2)
348 | ```
349 |
350 | check results
351 | ```r
352 | # grm output
353 |
354 | mygrm[1:5,][,1:5]
355 |
356 | ## CH30380 VA30171 VA30167 CH30357 VA70028
357 | ## CH30380 0.9921236602 0.001587362 0.004589604 0.002636372 0.0008911428
358 | ## VA30171 0.0015873618 1.006070553 0.002143034 0.001831966 -0.0025865272
359 | ## VA30167 0.0045896042 0.002143034 1.002365658 -0.005183818 0.0049334393
360 | ## CH30357 0.0026363721 0.001831966 -0.005183818 1.003906724 0.0062427112
361 | ## VA70028 0.0008911428 -0.002586527 0.004933439 0.006242711 1.0005800629
362 |
363 | quantile(mygrm)
364 |
365 | ## 0% 25% 50% 75% 100%
366 | ## -3.260058e-02 -3.014012e-03 -5.799979e-05 2.950375e-03 1.169809e+00
367 | ```
368 | save files
369 | ```r
370 | # for .csv output: add sep=',' and change ".txt" to ".csv"
371 |
372 | write.table(mygrm, paste("GENESIS_Kincoef_matrix_",genotype,date,".txt",sep=""), row.names = F, quote = F)
373 | write.table(relatepairs.tbl.nothresh, paste("GENESIS_relatedpairs_",genotype,date,".txt",sep=""), row.names = F, quote = F)
374 | write.table(inbreedcoef.tbl.nothresh, paste("GENESIS_inbreedcoef_",genotype,date,".txt",sep=""), row.names = F, quote = F)
375 | ```
376 |
377 | note: to look at just the first 5 rows and first 5 columns of matrix
378 | ```bash
379 | # in shell
380 | cut -d' ' -f 1-5 test.file | head -5 -
381 | ```
382 | ```r
383 | # in r
384 | test.file[1:5,][,1:5]
385 | ```
386 |
387 | ## \# QED \#
388 |
--------------------------------------------------------------------------------
/GRM_Computation_Methods.md:
--------------------------------------------------------------------------------
1 | # Methods for Computing Genetic Relatedness Matrices (GRMs)
2 | ## Pagé Goddard
3 |
4 | ### Resources
5 |
6 | ###### TOPmed Pipeline
7 | * [github](https://github.com/UW-GAC/analysis_pipeline)
8 | * [slides](https://uw-gac.github.io/topmed_workshop_2017/computing-a-grm.html)
9 |
10 | ###### GCTA
11 | * [GCTA Documentation](http://cnsgenomics.com/software/gcta/#GREML)
12 | * [GCTA Publication](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/pdf/main.pdf)
13 |
14 | ###### REAP
15 | * [REAP Documentation](http://faculty.washington.edu/tathornt/software/REAP/REAP_Documentation.pdf)
16 | * [REAP Publication](http://www.cell.com/ajhg/fulltext/S0002-9297(12)00309-6)
17 |
18 | ###### GENESIS
19 | * [GENESIS Vignette](https://rdrr.io/bioc/GENESIS/f/vignettes/pcair.Rmd)
20 | * [GENESIS Publication](https://www.ncbi.nlm.nih.gov/pubmed/26748516)
21 | (also see the TOPmed slides above)
22 |
23 | ### TOPmed Pipeline
24 | The TOPmed analysis pipeline is a great resource for association study design in general, but it is linked here because it includes a recommended approach for GRM computation. For GRM computation with an eye for confounding ancestry, **TOPmed recommends the GENESIS approach**:
25 |
26 | 1. KING
27 | 2. PC-AIR
28 | 3. PC-Relate
29 |
30 | See below for more details
31 |
32 | **NB:** "Section 3 Computing a GRM" calculates a basic genetic relationship matrix using the `SNPRelate` package in R but does not take into account ancestry or population structure. For the more robust approach, see "Section 4 PC-Relate."
33 |
34 | ### GCTA
35 | * commandline program
36 | **Importance of GRMs:** Allows for identification of closely related individuals. The objective of downstream `GCTA` analysis is to estimate the heritability captured by all SNPs jointly (vs. GWAS, which estimates variation captured by single SNPs). Including close relatives could bias the results with variance driven by pedigree phenotypic correlations.
37 |
38 | ```bash
39 | # estimate genetic relatedness from SNPs
40 | gcta64 --bfile input.binary --make-grm --out output.files
41 | ```
42 |
43 | From the publication: *As a by-product, we provide a function in GCTA to **calculate the eigenvectors of the GRM**, which is asymptotically equivalent to those from the PCA implemented in EIGENSTRAT because the GRM (Ajk) defined in GCTA is approximately half of the covariance matrix (Jjk) used in EIGENSTRAT. The only purpose of developing this function is to calculate eigenvectors and then include them in the model as covariates to capture variance due to population structure. More sophisticated analyses of the population structure can be found in programs such as EIGENSTRAT and STRUCTURE.*
44 |
45 | ###### PROs
46 | * super easy to run
47 | - takes plink files
48 | - no extra input required
49 | * fast
50 |
51 | ###### CONs
52 | * does not take into account ancestry or any external population structure proxy
53 | - not a reliable estimator for admixed populations
54 | * GCTA has been [criticized](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987787/pdf/pnas.201608425.pdf) for unreliable estimates (granted, I haven't read into this too much)
55 | * I (Pagé) don't really understand the math...
56 |
57 | ### REAP
58 | * commandline program
59 | * concept: use pre-computed ancestry measures to adjust for population structure
60 | * approach: model-based
61 | - calculate **global ancestry per individual** and **allele frequency per ancestral group** using something like `ADMIXTURE`
62 | - calculate relatedness coefficients after adjusting for ancestry
63 |
64 | ###### PROs
65 | * accounts for population ancestry
66 | * designed for admixed populations
67 | * easy one-liner once you have your admixture output
68 |
69 | ###### CONs
70 | * requires additional inputs calculated externally
71 | - not tough to calculate; see my [ADMIXTURE Tutorial](https://github.com/pcgoddard/Burchardlab_Tutorials/blob/master/ADMIXTURE_Tut.md)
72 | - potential for inaccuracies in admixed pop of unknown/poorly defined ancestries
73 | * model-based methods can be confounded by familial relatedness due to inability to distinguish b/w ancestral groups and clusters of close relatives ([source](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014363/pdf/main.pdf))
74 |
75 | ### GENESIS
76 | * R
77 | * concept: training on unrelated subpopulation and using PCs to correct for population structure
78 | * approach: model-free
79 | - KING-robust: estimate ancestral divergence and apparent relatedness separately
80 | - PC-Air: use PCs to capture the ethnic components of the population structure and identify the unrelated and related clusters
81 | + ancestral divergence scores used to ensure the unrelated subpop is representative of the full population's ancestry distribution
82 | - PC-Relate: calculate kinship coefficient for unrelated group first, then extrapolate to the related group, using ancestry-representative PCs to correct for pop structure
83 | + unrelated first: prevent confounding by related individuals
84 | + using first n PCs: (at your discretion) to capture pop structure / ethnic diversity
85 |
86 | ###### PROs
87 | * ~~doesn't need external inputs~~
88 | * good track record
89 | - TOPmed
90 | - favored in comparison studies (after more computationally heavy IBD-inference approaches)
91 | * designed to work well for both admixed and homogeneous populations
92 |
93 | ###### CONs
94 | * PC-relate took about 8 hours to run (R-studio)
95 | * claims to not require external input, but it kind of does - it can all be done in R so you can totally set up a pipeline for it though
96 | - uses KING-robust estimates (commandline)
97 | - or SNPRelate function (in R)
98 | - requires some reformatting from either output to prep for PC-Air
99 |
100 | ---
101 |
102 | ## Overall winner: GENESIS/TOPmed Pipeline
103 | see [tutorial](https://github.com/pcgoddard/Burchardlab_Tutorials/blob/master/GENESIS_PCRelate_Tut.md)
104 |
--------------------------------------------------------------------------------
/Jennifer's R Tutorials.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/Jennifer's R Tutorials.pdf
--------------------------------------------------------------------------------
/Markdown_Tut.md:
--------------------------------------------------------------------------------
1 | # Markdown Tutorial
2 | #### Pagé Goddard
3 | #### Sep 1, 2017
4 | ---
5 | ## Contents
6 | * [Sublime Add-ons](#add-ons)
7 | * [Headers](#headers)
8 | * [Emphasis](#emphasis)
9 | * [Lists](#lists)
10 | - [unordered lists](#unordered)
11 | - [ordered lists](#ordered)
12 | - [task lists / check boxes](#tasks)
13 | * [Links](#links)
14 | - [URLs](#url)
15 | - [images](#images)
16 | - [Table of Contents / Anchor Tags](#anchor)
17 | * [Block Quotes](#quotes)
18 | * [Code Inserts](#code)
19 | - [Inline highlighting](#inline)
20 | - [Block Code](#block)
21 | * [Tables](#tables)
22 | * [Line Breaks](#linebreaks)
23 | * [Using Special Characters](#escape)
24 | * [Emojis :sparkles:](#emojis)
25 |
26 | ---
27 |
28 | ## Sublime add-ons
29 | * MarkdownEditor *# for easier visualization pre-publication*
30 | * OmniMarkupPreviewer *# for previewing fully formatted document*
31 | - ctrl + alt + o
32 | - OR MarkdownBuddy
33 |
34 |
35 | ## Headers
36 | Use \# to indicate header level
37 | # \# Heading1
38 | ## \#\# Sub-heading
39 | ###### \#\#\#\#\#\# level 6 subheading
40 |
41 |
42 | ## Emphasis
43 | *italic:* \*word\* or \_word\_
44 |
45 | **bold**: \*\*word\*\* or \_\_bold\_\_
46 |
47 | ***combined:*** \*\*\*word\*\*\* or \_\_\_word\_\_\_
48 |
49 | ~~strikethrough:~~ \~\~word\~\~
50 |
51 |
52 | ## Lists
53 |
54 | ### unordered
55 | * unordered
56 | * lists
57 | - use \*
58 | - and \-
59 |
60 |
61 | ### ordered
62 | 1. ordered
63 | 2. lists
64 | 3. use
65 | 1. numbers
66 | 2. but not letters
67 | * number of tabs determines bullet level
68 |
69 |
70 | ### task list
71 | - [x] \- \[x\] this is a **complete** item
72 | - [ ] \- \[ \] this is an *incomplete* item
73 | - [ ] \- \[ \] list autopopulates format
74 |
75 |
76 | ## Links
77 |
78 | ### URLs
79 | example: [wikipedia](https://en.wikipedia.org/wiki/Main_Page)
80 |
81 | `[Link Text](url)`
82 |
83 |
84 | ### Images
85 | All linked images must be hosted online. You can link to an image on your local machine but it will not be viewable in the published markdown on other devices. When the image is not publishable, the Alt Text input will be shown.
86 |
87 | 
88 |
89 | `![Alt Text](url)`
90 |
91 | #### you can resize images using standard HTML
92 |
93 |
94 |
95 | `<img src="url" alt="Alt Text" width="pixels">`
96 |
97 | *note: the alt="" input is optional*
98 |
99 |
100 | ### Table of Contents Links
101 | This requires use of **anchor tags** where you want the table of contents to link to.
102 |
103 | `<a name="anchor_tag"\></a>`
104 |
105 | **\# This is my header!**
106 |
107 | * *note: remove the `\` before the first `>` to activate the anchor tag*
108 | * *personal preference: I like to put the anchor tag **above** the tagged section so that when the link jumps to that section you still see the header*
109 |
110 | You can then link to that line from anywhere in the document using:
111 |
112 | `[My header](#anchor_tag)`
113 |
114 |
115 | ## Block quotes
116 | the following lines will be a quote
117 |
118 | \> it was the best of times
119 |
120 | \> it was the worst of times
121 |
122 | > it was the best of times
123 | >
124 | > it was the worst of times
125 |
126 |
127 | ## Code Blocks
128 | #### Inline Code
129 |
130 | This is your \`code\` to highlight
131 |
132 | This is your `code` to highlight
133 |
134 |
135 | #### Fenced Code Blocks
136 |
137 | \```javascript
138 |
139 | function test() {
140 |
141 | console.log("hello world");
142 |
143 | \```
144 |
145 | ```javascript
146 | function test() {
147 | console.log("hello world");
148 | }
149 | ```
150 |
151 | to put any text in a code box, just indent it once (one tab or four spaces)
152 |
153 | any text
154 |
155 |
156 | ## Tables
157 | * tables use | and - to indicate field divisions
158 |
159 | column 1 | column 2
160 |
161 | \---------|--------- *# this line is necessary to indicate table format*
162 |
163 | cell 1.1 | cell 2.1
164 |
165 | cell 1.2 | cell 2.2
166 |
167 | column 1 | column 2
168 | ---------|---------
169 | cell 1.1 | cell 2.1
170 | cell 1.2 | cell 2.2
171 |
172 |
173 | ## Line breaks
174 | * 3 or more \* \- or \_
175 | ---
176 |
177 |
178 | ## Escaping characters
179 | removes special syntax meaning:
180 | *italic* v. \*asterisks\*
181 |
182 | `\character` can be used on the following:
183 |
184 | \\ \` \* \# \_ \- \+ \.
185 | \{ \} \[ \] \( \) \!
186 |
187 |
188 | ## Emojis (github)
189 |
190 | (just remove the spaces)
191 |
192 | :+1: = \: \+ 1 \:
193 | :sparkles: = \: sparkles \:
194 | :octocat: = \: octocat \:
195 |
--------------------------------------------------------------------------------
/PLINK_QC.md:
--------------------------------------------------------------------------------
1 | # Plink QC Pipeline
2 | ### Pagé Goddard
3 | ##### Oct 6 2017
4 | ---
5 | ## Content
6 | * [Resources](#resources)
7 | * [Basics of Quality Control](#intro)
8 | * [Example PLINK Command](#example)
9 | * [(Some) PLINK QC Commands](#commands)
10 | * [How to read the Log file](#logs)
11 | * [Combine genotype chr files](#concatenate)
12 | * [Sample Pipeline](#pipeline)
13 | - [Variables](#vars)
14 | - [Update sex](#sex)
15 | - [Remove unwanted samples](#rmv)
16 | - [Filter SNPs with low genotyping efficiency](#geno)
17 | - [Filter samples with low genotyping efficiency](#mind)
18 | - [Filter out too closely related individuals](#cryptic)
19 | - [Filter rare SNPs](#maf)
20 | - [Filter SNPs out of HWE](#hwe)
21 | - [Option: Subset data](#subset)
22 | - [Get QC counts for each QC step](#stats)
23 |
24 |
25 | ---
26 |
27 | ## Resources
28 | [PLINK summary statistic commands](http://zzz.bwh.harvard.edu/plink/summary.shtml)
29 |
30 | [COG-Genomics PLINK command list](https://www.cog-genomics.org/plink/1.9/data)
31 |
32 | [Tufts QC Vignette](http://sites.tufts.edu/cbi/files/2013/01/GWAS_Exercise5_QC.pdf)
33 |
34 | [PMC3066182](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3066182/) Turner, Stephen et al. “Quality Control Procedures for Genome Wide Association Studies.” Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.] CHAPTER (2011): Unit1.19. PMC. Web. 6 Oct. 2017.
35 |
36 |
37 | ## Basics of Quality Control (QC)
38 | *why we care:* With GWAS, hundreds of thousands of genotypes are generated so even a small percentage of genotyping error can lead to spurious GWAS results. This tutorial focuses on **downstream QC** (i.e. data cleaning after you have the genotype calls).
39 |
40 | ##### Two parts of downstream QC
41 |
42 | 1. subject / sample-based
43 | 2. variant / snp-based
44 |
45 | **Sample QC Concerns**
46 |
47 | term | desc | typical threshold
48 | -----|------|------
49 | sample-specific missingness rate | proportion of missing genotypes per individual |
50 | gender discordance | check that self-reported gender matches genotyped gender |
51 | cryptic relatedness | undisclosed familial relationships; duplicate enrollment |
52 | replicate discordance | agreement with independent genotyping |
53 | population outliers | subjects with significantly different genetic background |
54 |
55 | **SNP QC Concerns**
56 |
57 | term | desc | typical threshold
58 | -----|------|------
59 | SNP-specific missingness rate | proportion of failed genotype assays per variant |
60 | minor allele frequency | low freq alleles more likely to represent genotyping error | 0.05
61 | replicate discordance | agreement with independent genotyping |
62 | Hardy-Weinberg equilibrium | |
63 | Mendelian errors | in family data evidence of non-Mendelian transmission | N/A
64 |
65 |
66 | ## Example Plink command
67 | ```bash
68 |
69 | plink --noweb --bfile inputfile --remove removeme.txt --make-bed --out outputfile
70 |
71 | # basic command if all genotype data is in one file
72 | ```
73 | * `plink` - calls plink in shell environment
74 | * `--noweb` - usually not necessary; tells PLINK not to connect to the internet (which makes it slow)
75 | * `--bfile` - tells plink to read your binary files `inputfile.bim`, `inputfile.fam`, `inputfile.bed`
76 | * `--remove` - tells plink to remove all IDs listed in the named file
77 | * `--make-bed` - write the outfile in our favorite `.bed`, `.bim`, `.fam` form
78 | * `--out` - write my outfile with the following prefix
79 |
80 | ```bash
81 | for ((i=1;i<=22;i++))
82 | do
83 | plink --noweb --bfile inputfile_chr$i --remove removeme.txt --make-bed --out outputfile_chr$i
84 | done
85 |
86 | # if geno data is split by chromosome
87 | ```
88 | * `for ((i=1;i<=22;i++))` - "for every value of `i` between 1 and 22, run the following script;" in the plink script, we then tell bash that the `i` refers to chromosome number in the file name using `filename_chr$i`; this is useful if your genotype data is divided by chromosome
89 | * `--bfile` - read from binary files with the prefix `inputfile_chr$i` where `$i` is the chromosome number from your `for` loop; your bfiles are your `.bim`, `.bed`, and `.fam` files, where `.bim` and `.fam` contain the IDs for the SNPs and individuals and `.bed` is the genotype file
90 | * `--remove` - there are a ton of different commands to use here, depending on what you want to do (see resources). A typical pipeline example can be found below.
91 |
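Before launching 22 PLINK jobs it can help to print the commands first and check the file names. The sketch below does only that: `plink_cmds` is a made-up helper name, the prefixes are placeholders, and it uses a plain `while` loop (instead of the `for ((...))` form above) so it also runs under `sh`.

```bash
# Print the PLINK command for each autosome instead of running it.
# plink_cmds is a hypothetical helper; the prefixes are placeholders.
plink_cmds() (
    in_prefix=$1
    remove_file=$2
    out_prefix=$3
    i=1
    while [ "$i" -le 22 ]; do
        echo "plink --noweb --bfile ${in_prefix}_chr${i} --remove ${remove_file} --make-bed --out ${out_prefix}_chr${i}"
        i=$((i + 1))
    done
)

# usage: look at the commands first, then pipe them to bash to actually run them
# plink_cmds inputfile removeme.txt outputfile | head -2
# plink_cmds inputfile removeme.txt outputfile | bash
```

Printing first is a cheap way to catch a wrong prefix or a missing `$i` before any output files are written.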
92 |
93 | ## Some Plink command options for QC
94 | These options replace the `--remove filename` option
95 |
96 | option | description | notes
97 | -------|-------------|-
98 | `--update-sex filename` | updates sex label for individuals in filename | expects file with `FID` `IID` `sex (1,2)`, no header
99 | `--remove filename` | removes samples in list | expects file with `FID` `IID`, no header
100 | `--geno 0.05` | filters out SNPs with a genotyping rate below 95% (more than 5% missing calls) | to get a file listing the SNPs that remain after filtering, include `--write-snplist`
101 | `--mind 0.05` | exclude individuals with genotype rates below 95% | to get the list of the missingness rates include `--missing`
102 | `--hwe 0.0001` | filters out SNPs with HWE exact test p-value below threshold | recommend setting a low threshold as major deviation (p-val of e-50, eg) is likely genotyping error while true SNP-trait association shows slight deviation
103 | `--maf` | filters out SNPs with minor allele freq below threshold | default 0.01
104 | `--genome --min 0.025` | pairwise identity-by-descent (IBD) estimation; `--min 0.025` limits the `.genome` report to pairs with `PI_HAT` ≥ 0.025 | `--genome` only reports pairs, it does not remove anyone; flagged individuals have to be removed in a separate `--remove` step. These calculations are not LD-sensitive. It is usually a good idea to perform some form of LD-based pruning before invoking them.
105 |
106 |
107 |
108 | ## Log files
109 | * every plink command has an automatic log file output: `outfileprefix.log`
110 | * **number of SNPs/Samples loaded** and **number that passed QC**
111 | * errors, warnings if applicable
112 |
113 | *sample log file:*
114 | ```bash
115 | PLINK v1.90b3.29 64-bit (24 Dec 2015)
116 | Options in effect:
117 | --bfile sage_imputedgeno_gwas_171004_no21plus_chr1
118 | --maf 0.05
119 | --make-bed
120 | --noweb
121 | --out sage_imputedgeno_gwas_171004_maf05_finalQC_chr1
122 |
123 | Hostname: burchardlab.ucsf.edu
124 | Working directory: /media/BurchardRaid01/LabShare/Home/pgoddard/telo_wd_171004
125 | Start time: Wed Oct 4 15:01:38 2017
126 |
127 | Note: --noweb has no effect since no web check is implemented yet.
128 | Random number seed: 1507154498
129 | 257655 MB RAM detected; reserving 128827 MB for main workspace.
130 | 1967966 variants loaded from .bim file. # SNP input
131 | 1715 people (822 males, 890 females, 3 ambiguous) loaded from .fam. # Sample input
132 | Ambiguous sex IDs written to
133 | sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.nosex .
134 | Using 1 thread (no multithreaded calculations invoked).
135 | Before main variant filters, 1715 founders and 0 nonfounders present.
136 | Calculating allele frequencies... done.
137 | 1387541 variants removed due to minor allele threshold(s) # action taken
138 | (--maf/--max-maf/--mac/--max-mac).
139 | 580425 variants and 1715 people pass filters and QC. # SNP/Sample output
140 | Note: No phenotypes present.
141 | --make-bed to sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.bed +
142 | sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.bim +
143 | sage_imputedgeno_gwas_171004_maf05_finalQC_chr1.fam ... done.
144 |
145 | End time: Wed Oct 4 15:01:40 2017
146 | ```
147 |
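When there are 22 logs per QC step, reading each one gets old fast. A minimal sketch for pulling out just the count lines annotated in the sample log above; `plink_log_summary` is a made-up helper name and the patterns are simply the phrases that appear in that PLINK 1.9 log, so other PLINK versions may word things differently.

```bash
# Pull the "loaded", "removed" and "pass filters" lines out of PLINK logs.
# plink_log_summary is a hypothetical helper; the patterns are taken from
# the PLINK 1.9 sample log shown above.
plink_log_summary() {
    grep -H -E 'loaded from|removed due to|pass filters and QC' "$@"
}

# usage: one line per match, prefixed with the log file name
# plink_log_summary sage_imputedgeno_gwas_171004_maf05_finalQC_chr*.log
```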
148 |
149 | # Example Run
150 |
151 | ##### If you want to merge all chromosomes into one file
152 | ```bash
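153 | ### Keep autosomes 1-22 in a single fileset (assumes $datadir/$sagedata already holds all chromosomes; per-chromosome filesets would need --merge-list instead)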
154 |
155 | plink --bfile $datadir/$sagedata --chr 1-22 --make-bed --out SAGEbase_081517
156 |
157 | # the next step would then look like:
158 |
159 | ###1. Updating Sex in .fam file
160 | plink --noweb --bfile SAGEbase_081517 --update-sex $sexupdated --make-bed --out ${out}_sex_chr
161 | # no for loop
162 | # no $i as we no longer need to call each chr individually
163 | # this step makes your QC run slower but your directory will look cleaner
164 | ```
165 |
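If the genotype data really is split into one fileset per chromosome, PLINK 1.9 can merge them with `--merge-list`, which reads a text file with one fileset prefix per line. A small sketch for writing that list, assuming the per-chromosome prefixes follow the `${sagedata}${i}` pattern used in this tutorial; `make_merge_list` is a made-up helper name.

```bash
# Write the chr2..chr22 fileset prefixes, one per line, for --merge-list.
# make_merge_list is a hypothetical helper; chr1 is left out because it is
# passed to PLINK as the base fileset via --bfile.
make_merge_list() (
    prefix=$1
    i=2
    while [ "$i" -le 22 ]; do
        echo "${prefix}${i}"
        i=$((i + 1))
    done
)

# usage:
# make_merge_list "$datadir/$sagedata" > merge_list.txt
# plink --bfile "$datadir/${sagedata}1" --merge-list merge_list.txt --make-bed --out SAGEbase_081517
```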
166 | ## Pipeline
167 | This pipeline will walk you through the steps used to QC the imputed SAGE2 genotype data for the Telomere project. We did not merge chromosomes.
168 |
169 | * **note:** depending on your analysis, you can choose to keep or remove any subset of individuals; removing saliva samples is always recommended, as it is preferred to work with genotypes sequenced from whole blood. To do this, you need an external tab-delimited, 2-field file with the respective FID IID info for each individual you wish to remove.
170 |
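If all you have is a plain list of sample IDs, the two-field removal file can be built with one `awk` call. This sketch assumes FID and IID are the same for every sample (true for the SAGE ID lists built later in this tutorial, but check your own `.fam` file); `make_remove_file` is a made-up helper name.

```bash
# Turn a one-ID-per-line list into a tab-delimited FID IID file for --remove.
# make_remove_file is a hypothetical helper; it assumes FID == IID.
make_remove_file() {
    # NF > 0 skips blank lines
    awk 'NF > 0 { printf "%s\t%s\n", $1, $1 }' "$1"
}

# usage:
# make_remove_file saliva_ids.txt > removeme_saliva.txt
```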
171 |
172 | **set variables**
173 |
174 | * in bash, we can define variables in the environment as `var='value'` to call on later using `$var`
175 | * **useful shorthand**, especially when working with cumbersome file paths or bash scripting (which will be a different tutorial); think of it as **nicknaming your data**
176 | * I prefer this approach because I **know where everything is** coming from and going to and what it is called **without having to parse** my script; it is also **easier to adapt** the script to new data because you only need to update the variable rather than every line of the script.
177 |
178 | ```bash
179 | ## Set up environment
180 |
181 | # values
182 | date=171004
183 |
184 | # directories
185 | wrkdir="$HOME/wrkdir" #choose your working directory
186 | datadir="path/to/genodata" #identify where your genotype data are
187 |
188 | # reference files
189 | sagedata="genodata" #to run it will be ${sagedata}${i} # i will be the chr number 1-22
190 | sexupdated="path/to/file" # list of FID IID and updated sex # optional
191 | removeme_saliva="path/to/file" # list of FID IID for individuals with geno data from saliva rather than blood # optional
192 | removeme_old="path/to/file" # list of individuals outside of age range # optional
193 |
194 | # output files
195 | out="mygwasdat_${date}"
196 |
197 | # move into working directory
198 | cd $wrkdir
199 | ```
200 |
201 | *note: my genotype data is split across chromosomes, so we are looping PLINK over all 22 autosomes; to get total number of variants at each QC step we will sum across all the files at the end. Alternatively, you can run a plink command to concatenate the genotype data into one file, but the QC process is much faster when working with smaller files*
202 |
203 |
204 | ##### 1. Make sure sex is up-to-date
205 | ```bash
206 | ###1. Updating Sex in .fam file
207 |
208 | for ((i=1;i<=22;i++))
209 | do plink --noweb --bfile $datadir/$sagedata$i --update-sex $sexupdated --make-bed --out ${out}_sex_chr$i
210 | done
211 |
212 | # prints log to screen; look for these lines:
213 | # --update-sex: 1987 people updated, 130 IDs not present.
214 | # 332237 variants and 1990 people pass filters and QC.
215 |
216 | # the log is also written to the ${out}_sex_chr$i.log files so don't worry about losing the information on the screen
217 | ```
218 |
219 | ##### 2. Remove saliva samples
220 | ```bash
221 | for ((i=1;i<=22;i++))
222 | do plink --noweb --bfile ${out}_sex_chr$i --remove $removeme_saliva --make-bed --out ${out}_nosaliva_chr$i
223 | done
224 |
225 | # 332237 variants and 1954 people pass filters and QC.
226 | ```
227 |
228 | ##### 3: Remove individuals over 21
229 | ```bash
230 | for ((i=1;i<=22;i++))
231 | do plink --noweb --bfile ${out}_nosaliva_chr$i --remove $removeme_old --make-bed --out ${out}_no21plus_chr$i
232 | done
233 |
234 | # 332237 variants and 1715 people pass filters and QC.
235 | ```
236 |
237 |
238 | ##### 4: Filtered out SNPs with genotyping efficiency below 95%
239 | ```bash
240 | for ((i=1;i<=22;i++))
241 | do plink --noweb --bfile ${out}_no21plus_chr$i --geno 0.05 --make-bed --out ${out}_geno05_chr$i
242 | done
243 |
244 | # 0 variants removed due to missing genotype data (--geno)
245 | # 332237 variants and 1715 people pass filters and QC.
246 | ```
247 |
248 |
249 | ##### 5: Filtered out individuals with genotyping efficiency below 95%
250 | ```bash
251 | for ((i=1;i<=22;i++))
252 | do plink --noweb --bfile ${out}_geno05_chr$i --mind 0.05 --make-bed --out ${out}_mind05_chr$i
253 | done
254 |
255 | # 0 people removed due to missing genotype data (--mind)
256 | # 332237 variants and 1715 people pass filters and QC.
257 | ```
258 |
259 |
260 | ##### 6: Screen for Cryptic relatedness
261 | ```bash
262 | for ((i=1;i<=22;i++))
263 | do plink --noweb --bfile ${out}_mind05_chr$i --genome --min 0.025 --make-bed --out ${out}_decrypted_chr$i
264 | done
265 |
266 | # 1954 people (906 males, 1044 females, 4 ambiguous) loaded from .fam.
267 | # 1967966 variants and 1954 people pass filters and QC.
268 | ```
269 |
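`--genome` writes a `.genome` report of pairwise IBD estimates; nothing is removed from the data at this step, so the report still has to be read. A minimal sketch for listing the pairs at or above a chosen `PI_HAT`, assuming the default PLINK 1.9 `.genome` column layout (where `PI_HAT` is the 10th column); `related_pairs` is a made-up helper name and the 0.2 cutoff in the usage line is only an illustration.

```bash
# List pairs from a PLINK .genome report with PI_HAT at or above a threshold.
# related_pairs is a hypothetical helper; it assumes the default column order
# FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT ... (PI_HAT = column 10).
related_pairs() {
    # NR > 1 skips the header row; "+ 0" forces a numeric comparison
    awk -v thr="$2" 'NR > 1 && $10 + 0 >= thr + 0 { print $1, $2, $3, $4, $10 }' "$1"
}

# usage: prints FID1 IID1 FID2 IID2 PI_HAT for each flagged pair
# related_pairs ${out}_decrypted_chr1.genome 0.2
```

One member of each flagged pair can then go into a removal list for a later `--remove` step.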
270 |
271 | ##### 7: Remove SNPs with MAF < 0.05
272 | ```bash
273 | for ((i=1;i<=22;i++))
274 | do plink --noweb --bfile ${out}_decrypted_chr$i --maf 0.05 --make-bed --out ${out}_maf05_chr$i
275 | done
276 |
277 | # 230023 variants removed due to minor allele threshold(s)
278 | # 580425 variants and 1715 people pass filters and QC.
279 | ```
280 |
281 |
282 | ##### 8: Filtered SNPs that fail a HWE cutoff p<0.0001
283 | ```bash
284 | for ((i=1;i<=22;i++))
285 | do plink --noweb --bfile ${out}_maf05_chr$i --hwe 0.0001 --make-bed --out ${out}_hwe0001_chr$i
286 | done
287 |
288 | # 195 variants removed due to Hardy-Weinberg exact test.
289 | ```
290 |
291 |
292 | ##### clean up space
293 | ```bash
294 | # make a separate directory for each QC stage and move the appropriate files into each
295 | # be sure to leave the last set of QC files in your working directory for easy access
296 |
297 | mkdir QC1_update_sex && mv ${out}_sex_chr* QC1_update_sex
298 | mkdir QC2_remove_saliva && mv ${out}_nosaliva_chr* QC2_remove_saliva
299 | mkdir QC3_remove_over21 && mv ${out}_no21plus_chr* QC3_remove_over21
300 | mkdir QC4_genotype_efficiency && mv ${out}_geno05_chr* QC4_genotype_efficiency
301 | mkdir QC5_individ_efficiency && mv ${out}_mind05_chr* QC5_individ_efficiency
302 | mkdir QC6_cryptic_relatedness && mv ${out}_decrypted_chr* QC6_cryptic_relatedness
303 | mkdir QC7_maf_05 && mv ${out}_maf05_chr* QC7_maf_05
304 |
305 | ls -1 $wrkdir
306 | # QC1_update_sex
307 | # QC2_remove_saliva
308 | # QC3_remove_over21
309 | # QC4_genotype_efficiency
310 | # QC5_individ_efficiency
311 | # QC6_cryptic_relatedness
312 | # QC7_maf_05
313 | ```
314 |
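Typing each directory name twice (once for `mkdir`, once for `mv`) makes it easy for the two to drift apart. A small sketch that takes the directory name once; `stash` is a made-up helper name, and `mkdir -p` is used so re-running the clean up does not fail if the directory already exists.

```bash
# Create a directory (if needed) and move the given files into it.
# stash is a hypothetical helper: first argument is the directory,
# everything after it is the list of files to move.
stash() (
    dir=$1
    shift
    mkdir -p "$dir" && mv "$@" "$dir"/
)

# usage:
# stash QC4_genotype_efficiency ${out}_geno05_chr*
# stash QC5_individ_efficiency ${out}_mind05_chr*
```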
315 |
316 | ##### 9: Subset into two populations: male controls, female controls
317 |
318 | ```r
319 | #R work: Select subjects to keep for analysis (locally in R)
320 |
321 | setwd("/media/BurchardRaid01/LabShare/Home/pgoddard/telo_wd_171004")
322 |
323 | # partition by sex and case/control status
324 | pheno <- read.csv("/media/BurchardRaid01/LabShare/Home/azeiger/Telomere/SAGE/sage2_clean2016_02_23_de_ident.csv", header=T)
325 | telodat <- read.csv("raw_data/tel_res_MQ_SAGE2_09252017_FINAL.csv", header=T)
326 | telodat2 <- merge(pheno[,c("SubjectID", "Male")], telodat, by.x="SubjectID", by.y="SampleID")
327 |
328 | fm_cntl <- telodat2[telodat2$Male=="Female" & telodat2$Status=="control",]
329 | m_cntl <- telodat2[telodat2$Male=="Male" & telodat2$Status=="control",]
330 |
331 | length(telodat$SampleID) # 1540
332 | length(fm_cntl$SubjectID) # 336
333 | length(m_cntl$SubjectID) # 260
334 |
335 | # keep individuals with no missing covariates
336 | # covariate file generated with covar script: /Dropbox/Telomeres/script_telo_covars_171005.R
337 | covars <- read.table("tel_phenocovars_allSage2_100417.txt", header=T)
338 | nomisscovar <- na.omit(covars)
339 |
340 | fm_cntl_clean <- fm_cntl[fm_cntl$SubjectID %in% nomisscovar$FID,]
341 | m_cntl_clean <- m_cntl[m_cntl$SubjectID %in% nomisscovar$FID,]
342 |
343 | length(fm_cntl_clean$SubjectID) # 180
344 | length(m_cntl_clean$SubjectID) # 139
345 |
346 |
347 | # create subset ID lists for PLINK
348 | keep_female <- data.frame(fm_cntl_clean$SubjectID, fm_cntl_clean$SubjectID)
349 | keep_male <- data.frame(m_cntl_clean$SubjectID, m_cntl_clean$SubjectID)
350 |
351 | head(keep_female)
352 | # BP70001 BP70001
353 | # BP70002 BP70002
354 | # BP70004 BP70004
355 | # BP70005 BP70005
356 | # BP70006 BP70006
357 |
358 | # save subset ID list as txt file to wd
359 | write.table(keep_female, "keep_female_171006.txt", row.names=F, col.names=F, quote=FALSE, sep=" ")
360 | write.table(keep_male, "keep_male_171006.txt", row.names=F, col.names=F, quote=FALSE, sep=" ")
361 | ```
362 |
363 | ```bash
364 | # PLINK work
365 |
366 | ## Plink command to extract female controls (n=180):
367 | for ((i=1;i<=22;i++))
368 | do plink --bfile ${out}_maf05_finalQC_chr$i --keep keep_female_171006.txt --make-bed --out ${out}_female_controls_chr$i
369 | done
370 |
371 | # check
372 | wc -l ${out}_female_controls_chr22.fam
373 | # 165
374 | # 15 female controls with all covariates removed by QC
375 |
376 | ## Plink command to extract male controls (n=139):
377 | for ((i=1;i<=22;i++))
378 | do plink --noweb --bfile ${out}_maf05_finalQC_chr$i --keep keep_male_171006.txt --make-bed --out ${out}_male_controls_chr$i
379 | done
380 |
381 | # check
382 | wc -l ${out}_male_controls_chr22.fam
383 | # 130
384 | # 9 male controls with all covariates removed by QC
385 | ```
386 |
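The `wc -l` check shows how many samples were kept but not which ones dropped out. A minimal sketch for listing the IDs that are in the keep file but missing from the resulting `.fam`, assuming both files carry the IID in their second column (the standard `.fam` layout and the keep lists written above); `missing_from_fam` is a made-up helper name.

```bash
# Print IIDs that appear in a keep list but not in a .fam file.
# missing_from_fam is a hypothetical helper: $1 = keep file, $2 = .fam file.
missing_from_fam() {
    # first pass (the .fam file) records every IID; second pass (the keep
    # file) prints the IIDs that were never recorded
    awk 'NR == FNR { seen[$2] = 1; next } !($2 in seen) { print $2 }' "$2" "$1"
}

# usage:
# missing_from_fam keep_female_171006.txt ${out}_female_controls_chr22.fam
```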
387 | ##### clean up space
388 | ```bash
389 | # finish clean up
390 | mkdir QC8_maf_05 && mv ${out}_maf05_finalQC_chr* QC8_maf_05
391 | mkdir QC8_split_fm_cntl && mv ${out}_female_controls_chr* QC8_split_fm_cntl
392 | mkdir QC8_split_m_cntl && mv ${out}_male_controls_chr* QC8_split_m_cntl
393 | ```
394 |
395 |
396 | ##### Final QC Stats
397 |
398 | ```bash
399 | # Log summary
400 | # --update-sex: 1987 people updated, 130 IDs not present.
401 | # 332237 variants and 1990 people pass filters and QC.
402 | # Removed 36 people with saliva data
403 | # 0 variants removed due to missing genotype data (--geno)
404 | # 0 people removed due to missing genotype data (--mind)
405 | # 195 variants removed due to Hardy-Weinberg exact test.
406 | # 0 people removed due to cryptic relatedness
407 | # Removed 239 people over 21
408 | # 230023 variants removed due to minor allele threshold(s)
409 | # extracted controls and subset by sex
410 |
411 | # After plink QC:
412 | # 102019 variants and 1715 people pass filters and QC
413 |
414 | # After downscaling to samples with telomere data:
415 | # total: 545
416 | # fm: 307
417 | # m: 238
418 | ```
419 |
420 | ###### Get summary counts per step
421 | ```bash
422 | # sum line count for each chromosome .bim file
423 | # if data is not split by chr, you can do wc -l path/to/finalQCfile.bim
424 | cd $wrkdir/QC_171004/
425 | wc -l QC1_update_sex/${out}_sex_chr*.bim
426 | wc -l QC2_remove_saliva/${out}_nosaliva_chr*.bim
427 | wc -l QC3_remove_over21/${out}_no21plus_chr*.bim
428 | wc -l QC4_genotype_efficiency/${out}_geno05_chr*.bim
429 | wc -l QC5_individ_efficiency/${out}_mind05_chr*.bim
430 | wc -l QC6_cryptic_relatedness/${out}_decrypted_chr*.bim
431 | wc -l QC7_maf_05/${out}_maf05_chr*.bim
432 | wc -l QC8_hwe/${out}_hwe0001_chr*.bim
433 | ```
434 |
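`wc -l` on a glob prints one count per file plus a total line. If you only want the single number for the tables below, the files can be concatenated first. A small sketch; `count_variants` is a made-up helper name.

```bash
# Total number of variants across a set of .bim files (one variant per line).
# count_variants is a hypothetical helper; tr strips the padding some
# wc implementations add.
count_variants() {
    cat "$@" | wc -l | tr -d ' '
}

# usage:
# count_variants QC7_maf_05/${out}_maf05_chr*.bim
```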
435 | ###### SNPs
436 |
437 | command | removed | remaining
438 | ----------|--------------|----------
439 | **initial** | **-** | **25236191**
440 | geno | 0 | 25236191
441 | MAF | 17711861 | 7524330
442 | HWE | 5154 | 7519176
443 | **total** | **17717015** | **7519176**
444 |
445 |
446 | ###### Individuals
447 |
448 | command | removed | remaining
449 | ----------|---------|----------
450 | **initial** | **-** | **1990**
451 | saliva | 36 | 1954
452 | over 21 | 239 | 1715
453 | mind | 0 | 1715
454 | cryptic | 0 | 1715
455 | cntl only | 1170 | 545
456 | all covar | 250 | 295
457 |
458 | * total: 295
459 | * female: 165
460 | * male: 130
461 |
462 | \**`cntl only` = only individuals without asthma; `all covar` = individuals with data for all relevant covariates*
463 |
464 | ## \#QED#
465 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Burchardlab Tutorials
2 | ## Pagé Goddard
3 |
4 | I set up this repository as a catch-all for the notes I make as I learn new programs and design pipelines. My tutorials focus on code I have used successfully, with notes about what needed troubleshooting. I have tried to make these walkthroughs clear and accurate, with helpful details not included in the available vignettes.
5 |
6 | - **ADMIXTURE** - used to estimate global genetic ancestry; walkthrough of a supervised admixed run on African Americans
7 | - **BASH Commands** - navigating the server; basic bash commands for navigation, finding and viewing data; also detaching screens for background processes
8 | - **GENESIS_PCRelate** - used to estimate relatedness in a sample; genetic relatedness matrix (GRM) generation
9 | - **GRM Methods Comparison** - evaluates the pros and cons of 3 common GRM computation programs (GENESIS, REAP, GCTA) and concludes that GENESIS is the most robust for admixed populations
10 | - **Markdown** - quick summary of markdown syntax used to create the following files
11 | - **PLINK_QC** - intro to running quality control for genotype array data in PLINK to prepare for GWAS; example shown is the process used for the telomere project
12 | - **Useful_databases** - a dynamic, curated list of bioinformatics databases by data type / research question
13 | - **ggplot_manhattan** - function for creating a beautiful and customizable manhattan plot using ggplot2
14 |
15 | See the Wiki tab for more details and tutorials.
16 |
--------------------------------------------------------------------------------
/Subsetting_tutorial.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/Subsetting_tutorial.pdf
--------------------------------------------------------------------------------
/Useful_databases.md:
--------------------------------------------------------------------------------
1 | # Useful Bioinformatic Databases
2 | ## Pagé Goddard
3 | ### Oct 7, 2017
4 |
5 | This is a dynamic document where I will be adding and annotating databases I come across with their data types, possible uses, and current limitations. If you have any databases that you use regularly or have found useful in the past, please send them my way and I'll add them here!
6 |
7 | ### Overview of all databases included:
8 |
10 | Database | Data type | Uses | Limitations
11 | ---------|-----------|------|------------
12 | UCSC Table Browser | curated genomic annotations from multiple sources | find functional and sequence information by genomic location |
13 | HPO (Human Phenotype Ontology) | phenotypic information with associated genes (mined from OMIM, Orphanet, DECIPHER) | find related phenotypes and genes given phenotypes |
14 | OMIM (Online Mendelian Inheritance in Man) | phenotype and gene annotations | Mendelian phenotypes with known/suspected genetic cause and disease genes with related phenotypes; includes summaries, references and hypotheses for gene function if unconfirmed | only contains monogenic diseases and related genes (but is super powerful for those)
15 | [MGI](http://www.informatics.jax.org/) | phenotype and disease ontologies | |
16 |
17 | ### Annotating SNPs
18 | So you ran your GWAS and got some SNPs. Now we want to gauge their biological and clinical potential.
19 |
20 | * snpinfo
21 | * dbSNP
22 | * Ensembl VEP
23 | * UCSC Genome Browser rsid annotator
24 |
25 | ### Annotating Genes
26 | Let's say you run a GWAS and identify a set of genes from your SNPs that you now want to investigate.
27 |
28 | *Gene Function*
29 |
30 | * UniProt
31 | * GeneCards
32 | * NCBI Gene search
33 |
34 | *Gene-phenotype links*
35 |
36 | * MGI (http://www.informatics.jax.org/)
37 | * RNAi database (http://www.genomernai.org/)
38 | * OMIM
39 |
--------------------------------------------------------------------------------
/ggplot_manhattan.r:
--------------------------------------------------------------------------------
1 | # This function builds on code shared by the R Graph Gallery and Getting Genetics Done (see wiki for sources) to produce a customizable manhattan plot using ggplot2.
2 |
3 | # Libraries ====
4 | library(readr)
5 | library(ggrepel)
6 | library(ggplot2)
7 | library(dplyr)
8 | library(RColorBrewer)
9 |
10 | # Variables ====
11 | mypalette <- c("#E2709A", "#CB4577", "#BD215B", "#970F42", "#75002B") # chr color palette
12 | mysnps <- c("rs11801961","rs116558464","rs61703161") # snps to highlight
13 | sig = 5e-8 # significant threshold line
14 | sugg = 1e-6 # suggestive threshold line
15 |
16 | # Core Function ====
17 | gg.manhattan <- function(df, threshold, hlight, col, ylims, title){
18 | # format df
19 | df.tmp <- df %>%
20 |
21 | # Compute chromosome size
22 | group_by(CHR) %>%
23 | summarise(chr_len=max(BP)) %>%
24 |
25 | # Calculate cumulative position of each chromosome
26 | mutate(tot=cumsum(chr_len)-chr_len) %>%
27 | select(-chr_len) %>%
28 |
29 | # Add this info to the initial dataset
30 | left_join(df, ., by=c("CHR"="CHR")) %>%
31 |
32 | # Add a cumulative position of each SNP
33 | arrange(CHR, BP) %>%
34 | mutate( BPcum=BP+tot) %>%
35 |
36 | # Add highlight and annotation information
37 | mutate( is_highlight=ifelse(SNP %in% hlight, "yes", "no")) %>%
38 | mutate( is_annotate=ifelse(P < threshold, "yes", "no"))
39 |
40 | # get chromosome center positions for x-axis
41 | axisdf <- df.tmp %>% group_by(CHR) %>% summarize(center=( max(BPcum) + min(BPcum) ) / 2 )
42 |
43 | ggplot(df.tmp, aes(x=BPcum, y=-log10(P))) +
44 | # Show all points
45 | geom_point(aes(color=as.factor(CHR)), alpha=0.8, size=2) +
46 | scale_color_manual(values = rep(col, 22 )) +
47 |
48 | # custom X axis:
49 | scale_x_continuous( labels = axisdf$CHR, breaks = axisdf$center ) +
50 | scale_y_continuous(expand = c(0, 0), limits = ylims) + # expand=c(0,0) removes space between plot area and x axis
51 |
52 | # add plot and axis titles
53 | ggtitle(paste0(title)) +
54 | labs(x = "Chromosome") +
55 |
56 | # add genome-wide sig and sugg lines
57 | geom_hline(yintercept = -log10(sig)) +
58 | geom_hline(yintercept = -log10(sugg), linetype="dashed") +
59 |
60 | # Add highlighted points
61 | #geom_point(data=subset(df.tmp, is_highlight=="yes"), color="orange", size=2) +
62 |
63 | # Add label using ggrepel to avoid overlapping
64 | geom_label_repel(data=df.tmp[df.tmp$is_annotate=="yes",], aes(label=as.factor(SNP)), alpha=0.7, size=5, force=1.3) + # alpha is a fixed setting, so it goes outside aes()
65 |
66 | # Custom the theme:
67 | theme_bw(base_size = 22) +
68 | theme(
69 | plot.title = element_text(hjust = 0.5),
70 | legend.position="none",
71 | panel.border = element_blank(),
72 | panel.grid.major.x = element_blank(),
73 | panel.grid.minor.x = element_blank()
74 | )
75 | }
76 |
77 | # Run Function (df must contain columns CHR, BP, SNP and P) ====
78 | gg.manhattan(df, threshold=1e-6, hlight=mysnps, col=mypalette, ylims=c(0,10), title="My Manhattan Plot")
79 |
--------------------------------------------------------------------------------
/prefev1_loco_180322.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/prefev1_loco_180322.png
--------------------------------------------------------------------------------
/prefev1_loco_qqman_rsid_180322.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/prefev1_loco_qqman_rsid_180322.png
--------------------------------------------------------------------------------
/prefev1_loco_rsid_180322.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pcgoddard/Burchardlab_Tutorials/eb6a26fe6c5fc7bc1d9f4d9f1e144f0107df5d4e/prefev1_loco_rsid_180322.png
--------------------------------------------------------------------------------