YEAR: 2021
COPYRIGHT HOLDER: Salvador Herrando-Pérez
122 |
123 | Voila! Now to go back to the old working directory:
124 |
125 |
126 | ```r
127 | setwd(oldwd)
128 | ```
129 |
130 |
131 |
--------------------------------------------------------------------------------
/vignettes/Converting_VCF_and_PLINK_formats.Rmd.orig:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Using VCF and PLINK formatted files with smartsnp"
3 | output:
4 | rmarkdown::html_vignette:
5 | toc: true
6 | toc_depth: 2
7 | description: >
8 | This Vignette shows how you can use VCF and PLINK formatted variant files with smartsnp.
9 | vignette: >
10 | %\VignetteIndexEntry{Using VCF and PLINK formatted files with smartsnp}
11 | %\VignetteEngine{knitr::rmarkdown}
12 | %\VignetteEncoding{UTF-8}
13 | ---
14 |
15 | ```{r setup, echo = FALSE, message = FALSE}
16 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") # collapse and comment are chunk options
17 | knitr::opts_knit$set(root.dir = normalizePath("~/Dropbox/Salva_PCA/converting_VCF_to_PLINK_and_GENOTYPE"))
18 | options(tibble.print_min = 4L, tibble.print_max = 4L)
19 | library(smartsnp)
20 | library(sim1000G)
21 | set.seed(1014)
22 |
23 | ```
24 |
25 | Package *smartsnp* is not a data-conversion tool. Inspired by the command-line tool SMARTPCA, *smartsnp* handles SNP datasets in text format and in SMARTPCA formats (uncompressed = EIGENSTRAT or compressed binary = PACKEDANCESTRYMAP) and a general genotype matrix format. However, both VCF (.vcf) and PLINK (.bed) formats are frequently used for storing genetic variation data. In this vignette we provide a quick and robust way to transform these two formats into a general genotype matrix (i.e. where homozygous genotypes are coded as 0 or 2, and heterozygotes as 1) that can be used with the *smartsnp* package.
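
To make the 0/1/2 coding concrete, here is a toy sketch in base R. It assumes biallelic, VCF-style genotype strings; the function name `alt_count` is hypothetical and is not part of *smartsnp* or *plink2*, and the sketch does not claim to reproduce how *plink2* chooses the counted allele.

```r
# Toy illustration: recode VCF-style genotype strings as allele counts.
# Assumes biallelic sites; "." marks a missing allele.
alt_count <- function(gt) {
  alleles <- strsplit(gt, "[/|]")          # split on unphased (/) or phased (|) separators
  vapply(alleles, function(a) {
    if (any(a == ".")) NA_integer_ else sum(as.integer(a))
  }, integer(1))
}

gt <- c("0/0", "0/1", "1/1", "./.", "1|0")
counts <- alt_count(gt)
counts
```

Homozygotes map to 0 or 2, heterozygotes to 1, and missing calls to `NA`.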
26 |
27 | The general strategy is to use the *plink2* software to transform VCF or PLINK/bed files into a general (transposed) genotype matrix. It is "transposed" because PLINK's default additive recoding (.raw) has samples in rows, whereas the general input file for *smartsnp* has samples in columns.
28 |
29 | We make heavy use of the *plink2* software, which is a comprehensive update to Shaun Purcell's original PLINK command-line program. Binary downloads and an installation guide are available here:
30 | https://www.cog-genomics.org/plink2
31 |
32 | In the *plink2* manual, the file format that we transform our data into is called ".traw (variant-major additive component file)". See here for more information:
33 | https://www.cog-genomics.org/plink/1.9/formats#traw
34 |
35 | Note that the *.traw* format can be used directly with the latest development version of *smartsnp*; no further data transformation is necessary.
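
As a rough sketch of what a *.traw* file looks like, the snippet below writes a tiny made-up example to a temporary file and reads it back with base R. The six leading columns (CHR, SNP, (C)M, POS, COUNTED, ALT) follow the PLINK format description linked above; the SNP names, sample names and genotypes are invented for illustration.

```r
# A tiny, made-up .traw file: six annotation columns, then one column per sample.
traw_lines <- c(
  "CHR\tSNP\t(C)M\tPOS\tCOUNTED\tALT\tfam1_ind1\tfam2_ind2\tfam3_ind3",
  "4\tsnp1\t0\t1000\tA\tG\t0\t1\t2",
  "4\tsnp2\t0\t2000\tC\tG\t2\tNA\t0"
)
traw_file <- tempfile(fileext = ".traw")
writeLines(traw_lines, traw_file)

# check.names = FALSE keeps the "(C)M" header as written
traw <- read.table(traw_file, header = TRUE, sep = "\t",
                   check.names = FALSE, stringsAsFactors = FALSE)

# Drop the six annotation columns to get the SNP x sample genotype matrix
geno <- as.matrix(traw[, -(1:6)])
rownames(geno) <- traw$SNP
geno
```

Each row is a variant and each remaining column is a sample, with missing genotypes written as `NA`.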
36 |
37 | ## Download a small example VCF
38 |
39 | The R package *sim1000G* contains a small VCF, an unfiltered region from the 1000 genomes Phase III sequencing data VCF, chromosome 4, CEU samples. We will first load the package and then use this file as an example dataset.
40 |
41 | ```{r, message=FALSE, error=FALSE, eval=FALSE}
42 | library(sim1000G)
43 |
44 | # First set the current working directory (cwd) to where you want to download the data to (note that the *Downloads* folder might not exist on your computer, e.g. if you have a Windows system):
45 | oldwd <- getwd()
46 | setwd("~/Downloads/")
47 | ```
48 |
49 | ```{r, message=FALSE, error=FALSE}
50 | examples_dir <- system.file("examples", package = "sim1000G")
51 | vcf_file <- file.path(examples_dir, "region.vcf.gz")
52 |
53 | file.copy(from = vcf_file, to = "./") # Copy the file to the cwd
54 | ```
55 |
56 | ## VCF to PLINK (.bed)
57 |
58 | As a first step, we show how to transform a VCF file into a PLINK/bed format. Note that the VCF is gzipped, but *plink2* can directly use gzipped files.
59 |
60 | We will use the *system* function for calling *plink2*. This is effectively the same as running the quoted command on the command line.
61 |
62 | ```{r}
63 | system("plink --vcf region.vcf.gz --make-bed --out region")
64 | ```
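
If the external program is not installed, a bare *system* call only returns a non-zero status. A small wrapper can fail early with a clearer message. This is a minimal sketch, assuming the binary is called `plink` on your machine; `run_tool` is a hypothetical helper, not part of *smartsnp*.

```r
# Hypothetical helper: run an external tool only if it can be found on the PATH.
run_tool <- function(command, args = character()) {
  if (!nzchar(Sys.which(command))) {
    stop("Could not find '", command, "' on the PATH; install it or adjust PATH.")
  }
  # system2() returns the exit status of the command (0 usually means success)
  status <- system2(command, args)
  if (status != 0) {
    warning("'", command, "' exited with status ", status)
  }
  invisible(status)
}

# Usage, mirroring the command above (not run here):
# run_tool("plink", c("--vcf", "region.vcf.gz", "--make-bed", "--out", "region"))
```

Passing the arguments as a character vector to `system2` also avoids building one long command string by hand.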
65 |
66 | The --out parameter defines the name of the output file (without any suffix); you can set it to any arbitrary string. After running this command, you will see three files that make up the PLINK/bed format (region.bim, region.bed, region.fam). See the definition of the PLINK/bed (binary biallelic genotype table) file format for more information:
67 | https://www.cog-genomics.org/plink/1.9/formats#bed
68 |
69 | The *plink2* software offers a wide range of options for filtering and transforming the data that could be useful for your analysis. See the manual:
70 | https://www.cog-genomics.org/plink/1.9/
71 |
72 | If you don't want to make use of any further *plink2* functionality, then you can also directly transform the VCF file into the .traw format. See section "Directly transforming VCF to raw genotype (.traw)" below.
73 |
74 | ## PLINK to raw genotype (.traw)
75 |
76 | Now we will use the *plink2* software to transform the .bed file into raw genotypes. Again, note that we will need a "transposed" version since *smartsnp* assumes that samples are in columns, not rows.
77 |
78 | ```{r}
79 | system("plink --bfile region --recode A-transpose --out region_genotypeMatrix")
80 | ```
81 |
82 | Again, the --out parameter defines the name of the output file, without the suffix. After running this command, you will see a "region_genotypeMatrix.traw" file. This file can be directly used with smartsnp.
83 |
84 | ## VCF to raw genotype (.traw)
85 |
86 | We could have skipped the intermediate step of transforming the VCF into a PLINK format. The *plink2* software can transform the VCF directly into the .traw format.
87 |
88 | ```{r}
89 | system("plink --vcf region.vcf.gz --recode A-transpose --out region_genotypeMatrix")
90 | ```
91 |
92 | ## Running smartpca
93 |
94 | The VCF file contains data from a single group only (CEU). However, just to demonstrate that this file can be used with smartsnp, we'll run a simple PCA. Importantly, you will have to set the *missing_value* parameter to "NA".
95 |
96 | ```{r, eval=FALSE}
97 | # To use .traw files, we will need to load the latest development version of smartsnp.
98 | install.packages("devtools")
99 | devtools::install_github("ChristianHuber/smartsnp")
100 | ```
101 |
102 | ```{r data_transformation_example_plot, message=FALSE, fig.height = 7, fig.width = 7, fig.align = "center"}
103 | # Load the PLINK (.fam) file to get the number of samples
104 | numSamples = nrow(read.table("region.fam"))
105 |
106 | # There is just a single group in this data
107 | group_id <- rep(c("CEU"), length.out = numSamples)
108 |
109 | # Running smart_pca
110 | sm.pca <- smart_pca(snp_data = "region_genotypeMatrix.traw", # file written by plink via --out region_genotypeMatrix
111 | sample_group = group_id,
112 | missing_value = NA)
113 |
114 | # Here is a plot of the first two components:
115 | plot(sm.pca$pca.sample_coordinates[, c(3,4)])
116 | ```
117 |
118 | Voila! Now to go back to the old working directory:
119 |
120 | ```{r, eval=FALSE}
121 | setwd(oldwd)
122 | ```
123 |
124 |
125 |
--------------------------------------------------------------------------------
/inst/extdata/mallard_snps_Kraus2013:
--------------------------------------------------------------------------------
1 | ss263068950
2 | ss263068952
3 | ss263068953
4 | ss263068954
5 | ss263068955
6 | ss263068956
7 | ss263068957
8 | ss263068958
9 | ss263068959
10 | ss263068960
11 | ss263068961
12 | ss263068962
13 | ss263068963
14 | ss263068964
15 | ss263068965
16 | ss263068967
17 | ss263068968
18 | ss263068969
19 | ss263068970
20 | ss263068971
21 | ss263068972
22 | ss263068973
23 | ss263068974
24 | ss263068975
25 | ss263068976
26 | ss263068977
27 | ss263068978
28 | ss263068979
29 | ss263068980
30 | ss263068981
31 | ss263068982
32 | ss263068983
33 | ss263068984
34 | ss263068985
35 | ss263068986
36 | ss263068987
37 | ss263068989
38 | ss263068991
39 | ss263068992
40 | ss263068993
41 | ss263068994
42 | ss263068995
43 | ss263068996
44 | ss263068997
45 | ss263068998
46 | ss263068999
47 | ss263069000
48 | ss263069002
49 | ss263069004
50 | ss263069005
51 | ss263069006
52 | ss263069007
53 | ss263069008
54 | ss263069009
55 | ss263069010
56 | ss263069012
57 | ss263069013
58 | ss263069014
59 | ss263069015
60 | ss263069017
61 | ss263069018
62 | ss263069019
63 | ss263069020
64 | ss263069021
65 | ss263069022
66 | ss263069023
67 | ss263069024
68 | ss263069025
69 | ss263069026
70 | ss263069027
71 | ss263069028
72 | ss263069029
73 | ss263069030
74 | ss263069031
75 | ss263069032
76 | ss263069033
77 | ss263069034
78 | ss263069035
79 | ss263069036
80 | ss263069037
81 | ss263069038
82 | ss263069039
83 | ss263069040
84 | ss263069041
85 | ss263069042
86 | ss263069043
87 | ss263069044
88 | ss263069045
89 | ss263069046
90 | ss263069048
91 | ss263069049
92 | ss263069050
93 | ss263069051
94 | ss263069052
95 | ss263069053
96 | ss263069054
97 | ss263069055
98 | ss263069056
99 | ss263069057
100 | ss263069058
101 | ss263069059
102 | ss263069060
103 | ss263069061
104 | ss263069062
105 | ss263069063
106 | ss263069064
107 | ss263069065
108 | ss263069066
109 | ss263069067
110 | ss263069068
111 | ss263069069
112 | ss263069070
113 | ss263069071
114 | ss263069072
115 | ss263069073
116 | ss263069074
117 | ss263069075
118 | ss263069076
119 | ss263069077
120 | ss263069078
121 | ss263069079
122 | ss263069080
123 | ss263069081
124 | ss263069082
125 | ss263069083
126 | ss263069084
127 | ss263069085
128 | ss263069086
129 | ss263069087
130 | ss263069088
131 | ss263069089
132 | ss263069090
133 | ss263069091
134 | ss263069092
135 | ss263069093
136 | ss263069094
137 | ss263069095
138 | ss263069096
139 | ss263069097
140 | ss263069098
141 | ss263069099
142 | ss263069100
143 | ss263069101
144 | ss263069102
145 | ss263069103
146 | ss263069104
147 | ss263069105
148 | ss263069106
149 | ss263069108
150 | ss263069109
151 | ss263069110
152 | ss263069111
153 | ss263069112
154 | ss263069113
155 | ss263069114
156 | ss263069115
157 | ss263069116
158 | ss263069117
159 | ss263069118
160 | ss263069119
161 | ss263069120
162 | ss263069121
163 | ss263069122
164 | ss263069123
165 | ss263069124
166 | ss263069125
167 | ss263069126
168 | ss263069127
169 | ss263069128
170 | ss263069129
171 | ss263069130
172 | ss263069131
173 | ss263069132
174 | ss263069133
175 | ss263069134
176 | ss263069136
177 | ss263069137
178 | ss263069138
179 | ss263069139
180 | ss263069140
181 | ss263069141
182 | ss263069142
183 | ss263069143
184 | ss263069144
185 | ss263069145
186 | ss263069146
187 | ss263069147
188 | ss263069148
189 | ss263069149
190 | ss263069150
191 | ss263069151
192 | ss263069152
193 | ss263069153
194 | ss263069154
195 | ss263069155
196 | ss263069156
197 | ss263069157
198 | ss263069158
199 | ss263069159
200 | ss263069160
201 | ss263069162
202 | ss263069163
203 | ss263069164
204 | ss263069165
205 | ss263069166
206 | ss263069167
207 | ss263069168
208 | ss263069169
209 | ss263069170
210 | ss263069171
211 | ss263069172
212 | ss263069173
213 | ss263069174
214 | ss263069175
215 | ss263069176
216 | ss263069177
217 | ss263069178
218 | ss263069179
219 | ss263069180
220 | ss263069181
221 | ss263069182
222 | ss263069183
223 | ss263069184
224 | ss263069185
225 | ss263069186
226 | ss263069187
227 | ss263069188
228 | ss263069189
229 | ss263069190
230 | ss263069191
231 | ss263069192
232 | ss263069193
233 | ss263069195
234 | ss263069196
235 | ss263069198
236 | ss263069199
237 | ss263069200
238 | ss263069201
239 | ss263069202
240 | ss263069203
241 | ss263069204
242 | ss263069205
243 | ss263069206
244 | ss263069207
245 | ss263069208
246 | ss263069209
247 | ss263069210
248 | ss263069211
249 | ss263069212
250 | ss263069213
251 | ss263069214
252 | ss263069215
253 | ss263069217
254 | ss263069218
255 | ss263069219
256 | ss263069220
257 | ss263069221
258 | ss263069222
259 | ss263069223
260 | ss263069224
261 | ss263069225
262 | ss263069226
263 | ss263069227
264 | ss263069228
265 | ss263069229
266 | ss263069230
267 | ss263069231
268 | ss263069232
269 | ss263069233
270 | ss263069234
271 | ss263069235
272 | ss263069236
273 | ss263069237
274 | ss263069238
275 | ss263069239
276 | ss263069240
277 | ss263069241
278 | ss263069243
279 | ss263069244
280 | ss263069245
281 | ss263069246
282 | ss263069247
283 | ss263069248
284 | ss263069249
285 | ss263069250
286 | ss263069251
287 | ss263069252
288 | ss263069253
289 | ss263069254
290 | ss263069255
291 | ss263069256
292 | ss263069257
293 | ss263069258
294 | ss263069259
295 | ss263069261
296 | ss263069262
297 | ss263069263
298 | ss263069264
299 | ss263069265
300 | ss263069266
301 | ss263069267
302 | ss263069268
303 | ss263069269
304 | ss263069270
305 | ss263069271
306 | ss263069272
307 | ss263069273
308 | ss263069274
309 | ss263069275
310 | ss263069276
311 | ss263069277
312 | ss263069278
313 | ss263069279
314 | ss263069280
315 | ss263069281
316 | ss263069282
317 | ss263069283
318 | ss263069284
319 | ss263069286
320 | ss263069287
321 | ss263069288
322 | ss263069289
323 | ss263069290
324 | ss263069291
325 | ss263069292
326 | ss263069293
327 | ss263069294
328 | ss263069295
329 | ss263069296
330 | ss263069297
331 | ss263069298
332 | ss263069299
333 | ss263069300
334 | ss263069302
335 | ss263069303
336 | ss263069304
337 | ss263069305
338 | ss263069306
339 | ss263069307
340 | ss263069308
341 | ss263069309
342 | ss263069310
343 | ss263069311
344 | ss263069312
345 | ss263069313
346 | ss263069314
347 | ss263069315
348 | ss263069316
349 | ss263069317
350 | ss263069318
351 | ss263069319
352 | ss263069320
353 | ss263069321
354 | ss263069323
355 | ss263069324
356 | ss263069325
357 | ss263069326
358 | ss263069327
359 | ss263069328
360 | ss263069329
361 | ss263069330
362 | ss263069331
363 | ss263069332
364 | ss263069333
365 |
--------------------------------------------------------------------------------
/docs/news/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 | NEWS.md
133 | This Vignette shows how you can use VCF and PLINK formatted variant files with smartsnp.
141 |This Vignette provides an example of how to project ancient DNA onto modern data using the smartsnp package.
143 |This Vignette provides an example analysis of genetic data using the smartsnp package.
145 |Copyright (c) 2021 Salvador Herrando-Pérez
137 |Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
138 |The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
139 |THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
140 |
147 |
--------------------------------------------------------------------------------
/docs/reference/read_packedancestrymap.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 | R/read_packedancestrymap.R
134 | read_packedancestrymap.Rd
This function loads genotype data in PACKEDANCESTRYMAP format (binary or compressed).
read_packedancestrymap(pref)
pref: The prefix of the file name that contains the genotype data (i.e., without the
Returns a list containing a single element:
geno: Genotype data as R matrix.
229 |
230 | Voila! Note that we have plotted the negative of PC1 (i.e. -PC1) here. The only reason for this is to give the plot the same orientation as the original plot in Lazaridis et al. (2016), Fig. 1B. Importantly, changing the sign of any axis of a PCA does not change its interpretation, and different software can give different signs. See the excellent explanation here:
231 |
232 | https://stats.stackexchange.com/questions/88880/does-the-sign-of-scores-or-of-loadings-in-pca-or-fa-have-a-meaning-may-i-revers
233 |
234 | Finally, let's move back to the old working directory:
235 |
236 |
237 | ```r
238 | setwd(oldwd)
239 | ```
240 |
241 |
242 |
243 |
244 |
245 |
246 |
247 |
--------------------------------------------------------------------------------
/vignettes/aDNA_smartpca_analysis.Rmd.orig:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Projecting ancient samples"
3 | output:
4 | rmarkdown::html_vignette:
5 | toc: true
6 | toc_depth: 2
7 | description: >
8 | This Vignette provides an example of how to project ancient DNA onto modern data using
9 | the smartsnp package.
10 | vignette: >
11 | %\VignetteIndexEntry{Projecting ancient samples}
12 | %\VignetteEngine{knitr::rmarkdown}
13 | %\VignetteEncoding{UTF-8}
14 | ---
15 |
16 | ```{r setup, echo = FALSE, message = FALSE}
17 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>") # collapse and comment are chunk options
18 | knitr::opts_knit$set(root.dir = normalizePath("~/Dropbox/Salva_PCA/TESTING/Lazaridis_2016"))
19 | options(tibble.print_min = 4L, tibble.print_max = 4L)
20 | library(smartsnp)
21 | set.seed(1014)
22 |
23 | ```
24 |
25 | This Vignette provides an example of projecting ancient DNA onto modern data in a PCA analysis using the *smartsnp* package. We will use data from one of the first large-scale ancient DNA studies, Lazaridis et al. 2016:
26 |
27 | Lazaridis et al. "Genomic insights into the origin of farming in the ancient Near East", Nature volume 536, pages 419–424 (2016).
28 |
29 | The data is available online but needs to be pre-processed. In particular, the aDNA data needs to be merged with modern data. All steps can be completed in R, but some require command-line software to be installed and will not work on a Windows machine.
30 |
31 | If you are just interested in how to run *smart_pca* with ancient samples (all you need is an index vector *aDNA_inds* with the column numbers of the ancient samples), feel free to go straight to the section "Running smartpca" below.
32 |
33 | ## Install package *smartsnp*
34 |
35 | Select one of two options.
36 |
37 | Install development version from GitHub:
38 |
39 | ```{r, eval = FALSE}
40 | install.packages("devtools")
41 | devtools::install_github("ChristianHuber/smartsnp")
42 | ```
43 |
44 | Install release version from CRAN:
45 |
46 | ```{r, eval = FALSE}
47 | install.packages("smartsnp")
48 | ```
49 |
50 | Load the package:
51 |
52 | ```{r}
53 | library(smartsnp)
54 | ```
55 |
56 | ## Downloading the data
57 |
58 | First, set the working directory to a location where you want to download and process the files. In my case, I'm choosing the Downloads directory in my home folder.
59 |
60 | ```{r, eval=FALSE}
61 | oldwd <- getwd()
62 | setwd("~/Downloads/")
63 | ```
64 |
65 | We will download the data provided here https://reich.hms.harvard.edu/datasets using a command-line software called *wget*. Alternatively, you can download the file using a browser and the link.
66 | Note that this is quite a large file, >200 Mb!
67 |
68 | ```{r, eval=FALSE}
69 | system("wget https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/NearEastPublic.tar.gz")
70 | ```
71 |
72 | The downloaded data has to be unzipped. I will unzip it into a new folder called "data".
73 |
74 | ```{r, eval=FALSE}
75 | system("mkdir data") # Make a new folder called "data"
76 | system("tar -xvf NearEastPublic.tar.gz -C ./data") # Unzip data into this folder
77 | system("rm NearEastPublic.tar.gz") # Remove the zip file
78 | ```
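
The shell commands above (*mkdir*, *tar*, *rm*) also have base-R counterparts (`dir.create`, `untar`, `file.remove`), and `download.file` can stand in for *wget*. The sketch below demonstrates the unpacking steps on a small archive created in a temporary directory, so it runs without the large download; all file names in it are made up for the demonstration.

```r
# Build a small example archive in a temporary directory
work_dir <- tempfile("neareast_demo_")
dir.create(work_dir)
src_dir <- file.path(work_dir, "src")
dir.create(src_dir)
writeLines("example", file.path(src_dir, "example.txt"))

archive <- file.path(work_dir, "example.tar.gz")
old <- setwd(src_dir)
tar(archive, files = "example.txt", compression = "gzip")
setwd(old)

# Base-R counterparts of mkdir, tar -xvf ... -C and rm
data_dir <- file.path(work_dir, "data")
dir.create(data_dir)               # like: mkdir data
untar(archive, exdir = data_dir)   # like: tar -xvf archive -C ./data
file.remove(archive)               # like: rm archive

extracted <- list.files(data_dir)
extracted
```

For the real data, the same calls would be applied to the downloaded NearEastPublic.tar.gz instead of the demonstration archive.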
79 |
80 |
81 | ## Select subsets of individuals with convertf
82 |
83 | The data is in a PACKEDANCESTRYMAP format. The ancient and modern data is in two different files, and we are only interested in the Western Eurasian subset of the modern samples.
84 |
85 | In the next step, we will extract the Western Eurasian samples from the full set of modern samples. Then, we will merge the modern with the ancient data.
86 |
87 | The *convertf* and *mergeit* command-line programs of the Eigensoft package conveniently perform these two operations. See here for installation:
88 | https://github.com/DReichLab/EIG
89 |
90 | To run Eigensoft software within R using the *system* function, you might need to first explicitly tell R where the software can be found:
91 |
92 | ```{r}
93 | # Path to Eigensoft binaries (might be different on your computer):
94 | pathToEIGENSOFT = "~/repos/EIG/bin/"
95 | Sys.setenv(PATH = paste(Sys.getenv("PATH"), path.expand(pathToEIGENSOFT), sep = ":")) # expand "~" before appending to PATH
96 | ```
97 |
98 | We need to generate a parameter file for *convertf* that contains all the file names. We also need to generate a list of West Eurasian populations in a text file.
99 |
100 | ```{r, eval=TRUE}
101 | # Generating a text file with West Eurasian group names
102 | westEurasian_pops <- c(
103 | "Abkhasian", "Adygei", "Albanian", "Armenian", "Assyrian", "Balkar", "Basque", "BedouinA", "BedouinB", "Belarusian", "Bulgarian", "Canary_Islander",
104 | "Chechen", "Croatian", "Cypriot", "Czech", "Druze", "English", "Estonian", "Finnish", "French", "Georgian", "German", "Greek", "Hungarian", "Icelandic",
105 | "Iranian", "Irish", "Irish_Ulster", "Italian_North", "Italian_South", "Jew_Ashkenazi", "Jew_Georgian", "Jew_Iranian", "Jew_Iraqi", "Jew_Libyan", "Jew_Moroccan",
106 | "Jew_Tunisian", "Jew_Turkish", "Jew_Yemenite", "Jordanian", "Kumyk", "Lebanese_Christian", "Lebanese", "Lebanese_Muslim", "Lezgin", "Lithuanian", "Maltese",
107 | "Mordovian", "North_Ossetian", "Norwegian", "Orcadian", "Palestinian", "Polish", "Romanian", "Russian", "Sardinian", "Saudi", "Scottish", "Shetlandic", "Sicilian",
108 | "Sorb", "Spanish_North", "Spanish", "Syrian", "Turkish", "Ukrainian"
109 | )
110 | ```
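
The *convertf* parameter file refers to a population list on disk (`poplistname`), so the vector of group names has to be written to a text file, one name per line. A minimal sketch follows; it uses a shortened demonstration vector and a temporary file so that it can be run anywhere. In the actual workflow you would presumably write `westEurasian_pops` to "WestEurasia.poplist.txt" in the working directory.

```r
# Shortened demonstration vector (the full list is defined above)
westEurasian_pops_demo <- c("Abkhasian", "Adygei", "Albanian")

# Write one population name per line
poplist_file <- file.path(tempdir(), "WestEurasia.poplist.txt")
writeLines(westEurasian_pops_demo, con = poplist_file)

# Check the result
readLines(poplist_file)
```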
111 |
112 | ```{r, eval=FALSE}
113 |
114 | # Generating the parameter file for convertf:
115 | par.ANCESTRYMAP.FILTER <- c(
116 | "genotypename: ./data/HumanOriginsPublic2068.geno",
117 | "snpname: ./data/HumanOriginsPublic2068.snp",
118 | "indivname: ./data/HumanOriginsPublic2068.ind",
119 | "poplistname: ./WestEurasia.poplist.txt",
120 | "genotypeoutname: ./data/HumanOriginsPublic2068.WestEurasia.geno",
121 | "snpoutname: ./data/HumanOriginsPublic2068.WestEurasia.snp",
122 | "indivoutname: ./data/HumanOriginsPublic2068.WestEurasia.ind"
123 | )
124 |
125 | writeLines(par.ANCESTRYMAP.FILTER, con = "par.ANCESTRYMAP.FILTER")
126 |
127 | # Now run convertf using the system command in R. This is equivalent to running the quoted command in a terminal:
128 |
129 | system("convertf -p par.ANCESTRYMAP.FILTER")
130 | ```
131 |
132 | ## Merging ancient with modern data using mergeit
133 |
134 | Now we combine the ancient samples with the modern data using *mergeit*.
135 | Again, we first need a parameter file and then we can run *mergeit* with the *system* function in R (or alternatively in the terminal).
136 |
137 | ```{r, eval = FALSE}
138 | params <- c(
139 | "geno1: ./data/HumanOriginsPublic2068.WestEurasia.geno",
140 | "snp1: ./data/HumanOriginsPublic2068.WestEurasia.snp",
141 | "ind1: ./data/HumanOriginsPublic2068.WestEurasia.ind",
142 | "geno2: ./data/AncientLazaridis2016.geno",
143 | "snp2: ./data/AncientLazaridis2016.snp",
144 | "ind2: ./data/AncientLazaridis2016.ind",
145 | "genooutfilename: ./data/AncientLazaridis2016_ModernWestEurasia.geno",
146 | "snpoutfilename: ./data/AncientLazaridis2016_ModernWestEurasia.snp",
147 | "indoutfilename: ./data/AncientLazaridis2016_ModernWestEurasia.ind"
148 | )
149 |
150 | writeLines(params, con = "mergeit.params.txt")
151 |
152 | system("mergeit -p mergeit.params.txt")
153 | ```
154 |
155 | ## Running smartpca
156 |
157 | We still need two additional vectors before we can run *smartsnp*: one that defines the ancient samples, and one that defines which samples we want to remove before running the PCA.
158 |
159 | ```{r}
160 | # Group names of the ancient groups
161 | aDNA_inds <- c("Anatolia_ChL", "Anatolia_N", "Armenia_ChL", "Armenia_EBA", "Armenia_MLBA", "CHG", "EHG", "Europe_EN", "Europe_LNBA", "Europe_MNChL", "Iberia_BA", "Iran_ChL", "Iran_HotuIIIb", "Iran_LN", "Iran_N", "Levant_BA", "Levant_N", "Natufian", "SHG", "Steppe_EMBA", "Steppe_Eneolithic", "Steppe_IA", "Steppe_MLBA", "Switzerland_HG", "WHG")
162 |
163 | # Contains group names of all groups in merged data
164 | GR <- read.table("./data/AncientLazaridis2016_ModernWestEurasia.ind", header=F)
165 |
166 | # Vector defining ancient and modern groups
167 | SA <- ifelse(GR$V3 %in% westEurasian_pops, "modern", "ancient")
168 |
169 | # Samples to remove:
170 | sample.rem <- c("Mota", "Denisovan", "Chimp", "Mbuti.DG", "Altai",
171 | "Vi_merge", "Clovis", "Kennewick", "Chuvash", "Ust_Ishim",
172 | "AG2", "MA1", "MezE", "hg19ref", "Kostenki14")
173 |
174 | # Simple index vectors that determine which samples to remove and which to use for PCA ordination or projection:
175 | SR <- which(GR$V3 %in% sample.rem) # Column numbers of samples to remove
176 | SP <- which(SA == "ancient") # Column numbers of samples to project (i.e. aDNA)
177 | ```
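
To see how these index vectors behave, here is a toy example with six made-up samples. All object names are hypothetical. It also shows that defining the projected samples as "everything that is not modern" includes the samples marked for removal, whereas matching against an explicit list of ancient groups does not; how *smart_pca* treats samples listed in both vectors is not asserted here.

```r
# Group label of each sample (i.e. of each genotype column), in file order
groups       <- c("French", "Natufian", "Chimp", "Greek", "WHG", "Mota")
modern_pops  <- c("French", "Greek")
ancient_pops <- c("Natufian", "WHG")
to_remove    <- c("Chimp", "Mota")

SR_demo <- which(groups %in% to_remove)       # columns to remove
SP_demo <- which(groups %in% ancient_pops)    # columns to project (explicit list)
SP_alt  <- which(!(groups %in% modern_pops))  # "not modern" also catches Chimp and Mota

SR_demo
SP_demo
SP_alt
```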
178 |
179 | Now we are finally ready to run smart_pca:
180 |
181 | ```{r run__smart_pca, message = FALSE}
182 | # Running smart_pca:
183 | sm.pca <- smart_pca(snp_data = "./data/AncientLazaridis2016_ModernWestEurasia.geno",
184 | sample_group = GR$V3, missing_value = 9, missing_impute = "mean",
185 | scaling = "drift", program_svd = "RSpectra", pc_axes = 2,
186 | sample_remove = SR, sample_project = SP, pc_project = 1:2)
187 |
188 | # To see more information on the different parameter options:
189 | ?smart_pca
190 | ```
191 |
192 | ## Plotting the results using ggplot2
193 |
194 | We can have a look at the result using *ggplot2* (and *ggrepel* for labeling the groups). The *data.table* package is used to simplify some data operations.
195 | For more info on these packages, see:
196 |
197 | https://ggplot2.tidyverse.org/
198 |
199 | https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
200 |
201 | https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
202 |
203 | ```{r lazaridis_plot, fig.height = 7, fig.width = 7, fig.align = "center", message=FALSE, error=FALSE}
204 | # This needs the R libraries ggplot2, ggrepel, and data.table.
205 | library(ggplot2)
206 | library(ggrepel)
207 | library(data.table)
208 |
209 | # Plotting with ggplot2 and labeling the groups with ggrepel:
210 |
211 | smart_mva.aDNA_WestEurasia.evec <- data.table(sm.pca$pca.sample_coordinates)
212 | smart_mva.aDNA_WestEurasia.evec[, c("PC1mean", "PC2mean") := .(mean(PC1), mean(PC2)), Group]
213 | smart_mva.aDNA_WestEurasia.evec[, Name := GR$V1]
214 |
215 | ggplot() +
216 | geom_point(data = smart_mva.aDNA_WestEurasia.evec[Class == "PCA"], aes(-PC1, PC2), col="grey", alpha=0.5) +
217 | geom_point(data = smart_mva.aDNA_WestEurasia.evec[Class == "Projected" ], aes(-PC1, PC2, fill=Group, shape=Group), size=3) +
218 | scale_shape_manual(values=rep(21:25, 100)) +
219 | geom_label_repel(data = smart_mva.aDNA_WestEurasia.evec[Class == "Projected",.SD[1], Group], aes(-PC1mean, PC2mean, label=Group, col=Group), alpha=0.7, segment.color="NA") +
220 | theme_bw() + theme(legend.position = "none")
221 |
222 |
223 | ```
224 |
225 | Voila! Note that we have plotted the negative of PC1 (i.e. -PC1) here. The only reason for this is to give the plot the same orientation as the original plot in Lazaridis et al. (2016), Fig. 1B. Importantly, changing the sign of any axis of a PCA does not change its interpretation, and different software can give different signs. See the excellent explanation here:
226 |
227 | https://stats.stackexchange.com/questions/88880/does-the-sign-of-scores-or-of-loadings-in-pca-or-fa-have-a-meaning-may-i-revers
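
The sign argument can be checked numerically. The sketch below uses base R's `prcomp` on random data (not *smart_pca*, and not the Lazaridis data): flipping the sign of PC1 in both the scores and the loadings reproduces the same centered data, and the distances between samples in the PC1-PC2 plane are unchanged.

```r
set.seed(1)
X <- matrix(rnorm(50), nrow = 10, ncol = 5)   # 10 samples, 5 variables
pca <- prcomp(X, center = TRUE, scale. = FALSE)

scores   <- pca$x
loadings <- pca$rotation

# Flip the sign of the first axis in both scores and loadings
scores_flipped   <- scores
loadings_flipped <- loadings
scores_flipped[, 1]   <- -scores[, 1]
loadings_flipped[, 1] <- -loadings[, 1]

# Both versions reconstruct the same centered data
recon         <- scores %*% t(loadings)
recon_flipped <- scores_flipped %*% t(loadings_flipped)
same_reconstruction <- isTRUE(all.equal(recon, recon_flipped))

# Pairwise distances between samples in the PC1-PC2 plane are unchanged
d_orig    <- as.matrix(dist(scores[, 1:2]))
d_flipped <- as.matrix(dist(scores_flipped[, 1:2]))
same_distances <- isTRUE(all.equal(d_orig, d_flipped))

c(same_reconstruction, same_distances)
```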
228 |
229 | Finally, let's move back to the old working directory:
230 |
231 | ```{r, eval=FALSE}
232 | setwd(oldwd)
233 | ```
234 |
235 |
236 |
237 |
238 |
239 |
240 |
241 |
--------------------------------------------------------------------------------
/docs/docsearch.css:
--------------------------------------------------------------------------------
1 | /* Docsearch -------------------------------------------------------------- */
2 | /*
3 | Source: https://github.com/algolia/docsearch/
4 | License: MIT
5 | */
6 |
7 | .algolia-autocomplete {
8 | display: block;
9 | -webkit-box-flex: 1;
10 | -ms-flex: 1;
11 | flex: 1
12 | }
13 |
14 | .algolia-autocomplete .ds-dropdown-menu {
15 | width: 100%;
16 | min-width: none;
17 | max-width: none;
18 | padding: .75rem 0;
19 | background-color: #fff;
20 | background-clip: padding-box;
21 | border: 1px solid rgba(0, 0, 0, .1);
22 | box-shadow: 0 .5rem 1rem rgba(0, 0, 0, .175);
23 | }
24 |
25 | @media (min-width:768px) {
26 | .algolia-autocomplete .ds-dropdown-menu {
27 | width: 175%
28 | }
29 | }
30 |
31 | .algolia-autocomplete .ds-dropdown-menu::before {
32 | display: none
33 | }
34 |
35 | .algolia-autocomplete .ds-dropdown-menu [class^=ds-dataset-] {
36 | padding: 0;
37 | background-color: rgb(255,255,255);
38 | border: 0;
39 | max-height: 80vh;
40 | }
41 |
42 | .algolia-autocomplete .ds-dropdown-menu .ds-suggestions {
43 | margin-top: 0
44 | }
45 |
46 | .algolia-autocomplete .algolia-docsearch-suggestion {
47 | padding: 0;
48 | overflow: visible
49 | }
50 |
51 | .algolia-autocomplete .algolia-docsearch-suggestion--category-header {
52 | padding: .125rem 1rem;
53 | margin-top: 0;
54 | font-size: 1.3em;
55 | font-weight: 500;
56 | color: #00008B;
57 | border-bottom: 0
58 | }
59 |
60 | .algolia-autocomplete .algolia-docsearch-suggestion--wrapper {
61 | float: none;
62 | padding-top: 0
63 | }
64 |
65 | .algolia-autocomplete .algolia-docsearch-suggestion--subcategory-column {
66 | float: none;
67 | width: auto;
68 | padding: 0;
69 | text-align: left
70 | }
71 |
72 | .algolia-autocomplete .algolia-docsearch-suggestion--content {
73 | float: none;
74 | width: auto;
75 | padding: 0
76 | }
77 |
78 | .algolia-autocomplete .algolia-docsearch-suggestion--content::before {
79 | display: none
80 | }
81 |
82 | .algolia-autocomplete .ds-suggestion:not(:first-child) .algolia-docsearch-suggestion--category-header {
83 | padding-top: .75rem;
84 | margin-top: .75rem;
85 | border-top: 1px solid rgba(0, 0, 0, .1)
86 | }
87 |
88 | .algolia-autocomplete .ds-suggestion .algolia-docsearch-suggestion--subcategory-column {
89 | display: block;
90 | padding: .1rem 1rem;
91 | margin-bottom: 0.1;
92 | font-size: 1.0em;
93 | font-weight: 400
94 | /* display: none */
95 | }
96 |
97 | .algolia-autocomplete .algolia-docsearch-suggestion--title {
98 | display: block;
99 | padding: .25rem 1rem;
100 | margin-bottom: 0;
101 | font-size: 0.9em;
102 | font-weight: 400
103 | }
104 |
105 | .algolia-autocomplete .algolia-docsearch-suggestion--text {
106 | padding: 0 1rem .5rem;
107 | margin-top: -.25rem;
108 | font-size: 0.8em;
109 | font-weight: 400;
110 | line-height: 1.25
111 | }
112 |
113 | .algolia-autocomplete .algolia-docsearch-footer {
114 | width: 110px;
115 | height: 20px;
116 | z-index: 3;
117 | margin-top: 10.66667px;
118 | float: right;
119 | font-size: 0;
120 | line-height: 0;
121 | }
122 |
123 | .algolia-autocomplete .algolia-docsearch-footer--logo {
124 | background-image: url("data:image/svg+xml;utf8,");
125 | background-repeat: no-repeat;
126 | background-position: 50%;
127 | background-size: 100%;
128 | overflow: hidden;
129 | text-indent: -9000px;
130 | width: 100%;
131 | height: 100%;
132 | display: block;
133 | transform: translate(-8px);
134 | }
135 |
136 | .algolia-autocomplete .algolia-docsearch-suggestion--highlight {
137 | color: #FF8C00;
138 | background: rgba(232, 189, 54, 0.1)
139 | }
140 |
141 |
142 | .algolia-autocomplete .algolia-docsearch-suggestion--text .algolia-docsearch-suggestion--highlight {
143 | box-shadow: inset 0 -2px 0 0 rgba(105, 105, 105, .5)
144 | }
145 |
146 | .algolia-autocomplete .ds-suggestion.ds-cursor .algolia-docsearch-suggestion--content {
147 | background-color: rgba(192, 192, 192, .15)
148 | }
149 |
--------------------------------------------------------------------------------
/vignettes/mallard_smartpca_analysis.Rmd.orig:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Example PCA, PERMANOVA and PERMDISP analysis"
3 | output:
4 | rmarkdown::html_vignette:
5 | toc: true
6 | toc_depth: 2
7 | description: >
8 | This Vignette provides an example analysis of genetic data using
9 | the smartsnp package.
10 | vignette: >
11 | %\VignetteIndexEntry{Example PCA, PERMANOVA and PERMDISP analysis}
12 | %\VignetteEngine{knitr::rmarkdown}
13 | %\VignetteEncoding{UTF-8}
14 | ---
15 |
16 | ```{r, echo = FALSE, message = FALSE}
17 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
18 | options(tibble.print_min = 4L, tibble.print_max = 4L)
19 | library(smartsnp)
20 | set.seed(1014)
21 | ```
22 |
23 | This Vignette provides an example analysis of genetic data using the smartsnp package.
24 |
25 | ## Description of the data
26 |
27 | Multivariate analysis of mallard genotypes using the dataset published by Kraus et al. 2013.
28 |
29 | Paper = https://onlinelibrary.wiley.com/doi/10.1111/mec.12098
30 |
31 | Dataset = https://datadryad.org/stash/dataset/doi:10.5061/dryad.1bq39
32 |
33 | Population SEAP removed from dataset as its geographic background is unclear (Robert Kraus, pers. comm., 02/06/2021).
34 |
35 | Populations GBAB, GBFE and GBNM (British Isles) removed from dataset as these individuals might have mixed with captive/feral mallards (Robert Kraus, pers. comm., 03/06/2021).
36 |
37 | Three datasets are available. They are part of the *smartsnp* package, so you don't need to download or process the data from Dryad:
38 |
39 | * Genotype data (mallard_genotype_Kraus2012.txt) = 364 SNPs (rows) x 695 individuals (columns), individuals comprise 55 populations and 10 flyways. Genotypes are 0, 1, 2 or (for missing values) 9
40 | * Group names (mallard_samples_Kraus2013.txt) = 695 rows x 3 columns, column 1 = flyway names, column 2 = population names, column 3 = individual names
41 | * SNP names (mallard_snps_Kraus2013.txt) = 364 rows x 1 column
42 |
43 | The study supports panmixia in cosmopolitan bird species (see Kraus et al. 2013):
44 |
45 | "...Only Greenland is genetically differentiated from the remaining mallard
46 | population, and to a lesser extent, slight differentiation is observed between
47 | flyways in Europe and North America".
48 |
49 | "...There is a lack of clear population structure, suggesting that the world's
50 | mallards, perhaps with minor exceptions, form a single large, mainly
51 | interbreeding population".
52 |
53 |
54 | ## Install package *smartsnp* (use one option)
55 |
56 | From GitHub:
57 |
58 | ```{r, eval = FALSE}
59 | install.packages("devtools")
60 | devtools::install_github("ChristianHuber/smartsnp")
61 | ```
62 |
63 | From CRAN:
64 |
65 | ```{r, eval = FALSE}
66 | install.packages("smartsnp")
67 | ```
68 |
69 | ## Load package
70 |
71 | ```{r}
72 | library(smartsnp)
73 | ```
74 |
75 | ## Create group factor
76 |
77 | Load group file (flyway = categorical predictor in PERMANOVA and PERMDISP tests):
78 |
79 | ```{r}
80 | pathToFile <- system.file("extdata", "mallard_samples_Kraus2013", package = "smartsnp")
81 | my_groups <- c(data.table::fread(pathToFile, header = FALSE))[[1]]
82 | length(my_groups) #number of individuals
83 | length(table(my_groups)) #number of flyways
84 | table(my_groups) #number of individuals per flyway
85 | ```
86 |
87 | Number of populations (not needed for analysis hereafter):
88 |
89 | ```{r}
90 | my_pops <- c(data.table::fread(pathToFile, header = FALSE))[[2]]
91 | length(table(my_pops)) #number of populations
92 | table(my_pops) #number of individuals per population
93 | ```
94 |
95 | Code per individual (not needed for analysis hereafter):
96 |
97 | ```{r}
98 | my_indv <- c(data.table::fread(pathToFile, header = FALSE))[[3]]
99 | ```
100 |
101 | SNP names (not needed for analysis hereafter):
102 |
103 | ```{r}
104 | pathToFile <- system.file("extdata", "mallard_snps_Kraus2013", package = "smartsnp")
105 | my_snps <- c(data.table::fread(pathToFile, header = FALSE))[[1]]
106 | length(my_snps) # number of snps
107 | ```
108 |
109 | ## Run *smart_pca*
110 |
111 | Run PCA with truncated SVD (PCA 1 x PCA 2 axes) and assign results to object pcaR (missing values imputed with means, SNPs scaled to control genetic drift):
112 |
113 | ```{r, message=FALSE}
114 | pathToFile <- system.file("extdata", "mallard_genotype_Kraus2012", package = "smartsnp")
115 | pcaR <- smart_pca(snp_data = pathToFile, sample_group = my_groups, missing_impute = "mean")
116 | ```
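
The preprocessing named above (mean imputation, drift scaling, SVD) can be sketched in base R. This is an illustrative sketch only, not the internal code of smart_pca: the toy matrix `geno` is made up, base `svd()` stands in for the truncated SVD, and the scaling follows the SMARTPCA formula of Patterson, Price and Reich (2006), where each centered SNP is divided by sqrt(p(1 - p)) with p = mean / 2.

```r
set.seed(1014)

# Hypothetical toy genotype matrix: 50 SNPs (rows) x 20 samples (columns)
geno <- matrix(sample(0:2, 50 * 20, replace = TRUE), nrow = 50)
geno[sample(length(geno), 30)] <- 9   # 9 codes missing values
geno[geno == 9] <- NA

# Impute each missing value with the mean of its SNP (row)
snp_mean <- rowMeans(geno, na.rm = TRUE)
missing <- which(is.na(geno), arr.ind = TRUE)
geno[missing] <- snp_mean[missing[, "row"]]

# Drop zero-variance SNPs, which would give a division by zero below
keep <- apply(geno, 1, var) > 0
geno <- geno[keep, , drop = FALSE]
snp_mean <- snp_mean[keep]

# Drift scaling: center each SNP, then divide by sqrt(p * (1 - p))
p <- snp_mean / 2
scaled <- (geno - snp_mean) / sqrt(p * (1 - p))

# SVD of the scaled SNP x sample matrix, keeping the first two axes
sv <- svd(scaled, nu = 2, nv = 2)
coords <- sv$v %*% diag(sv$d[1:2])    # sample coordinates on PC1 and PC2
colnames(coords) <- c("PC1", "PC2")
head(coords)
```

Because imputed values equal the SNP mean, they become zero after centering and so do not pull a sample along any axis.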
117 |
118 | pcaR is a list with 3 elements:
119 |
120 | ```{r}
121 | class(pcaR)
122 | names(pcaR)
123 | str(pcaR)
124 | ```
125 |
126 | Assign statistical results to objects pcaR_eigen, pcaR_load and pcaR_coord:
127 |
128 | ```{r}
129 | pcaR_eigen <- pcaR$pca.eigenvalues # extract eigenvalues (PCA1 and PCA2 axes explain 3.5% of SNP variation across individuals)
130 | pcaR_load <- pcaR$pca.snp_loadings # extract principal coefficients (high SNP loadings indicate loci with stronger variation across individuals)
131 | pcaR_coord <- pcaR$pca.sample_coordinates # extract principal components (position of individuals in PCA space used to generate the ordination)
132 | ```
133 |
134 | Plot PCA:
135 |
136 | ```{r pca_plot_mallard, fig.height = 7, fig.width = 7, fig.align = "center"}
137 | cols <- rainbow(length(table(my_groups)))
138 | plot(pcaR$pca.sample_coordinates[,c("PC1","PC2")], cex = 1.5,
139 | bg = cols[as.factor(my_groups)], pch = 21, col = "black", main = "mallard genotype smartpca")
140 | legend("topleft", legend = levels(as.factor(my_groups)), cex = 1, pch = 21,
141 | pt.cex = 1.0, col = "black", pt.bg = cols, text.col = cols)
142 | ```
143 |
144 | Greenland individuals cluster in one of the corners of the ordination, supporting a distinct SNP composition relative to the remaining flyways.
145 |
146 |
147 | ## Run *smart_permanova*
148 |
149 | Run PERMANOVA test (group location in PCA1 x PCA2 space) and assign results to object permanovaR (missing values imputed with means, SNPs scaled to control genetic drift).
150 | Notice that pairwise tests increase computing time considerably, as there are 45 pairwise comparisons to make for 10 flyways, each calculating a p value based on 9,999 permutations of the data (default permutation_n = 9999).
151 |
152 | ```{r, message=FALSE}
153 | pathToFile <- system.file("extdata", "mallard_genotype_Kraus2012", package = "smartsnp")
154 | permanovaR <- smart_permanova(snp_data = pathToFile, sample_group = my_groups,
155 | target_space = "pca", missing_impute = "mean", pairwise = TRUE)
156 | ```
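
The count of 45 pairwise comparisons and the effect of a multiple-testing correction can be checked with base R. This is a sketch under stated assumptions: the flyway labels and raw p values are made up for illustration, and `p.adjust()` with `method = "holm"` is used because "holm" is the documented default of `pairwise_method`; it is not claimed to be the exact internal call.

```r
# Hypothetical labels for 10 flyways
flyways <- paste0("flyway_", 1:10)

# All unordered pairs of flyways: choose(10, 2) = 45 comparisons
pairs <- t(combn(flyways, 2))
n_pairs <- nrow(pairs)
n_pairs

# Made-up raw p values, one per pairwise test
set.seed(1014)
p_raw <- runif(n_pairs, min = 1e-04, max = 0.2)

# Holm correction: adjusted p values are never smaller than the raw ones
p_holm <- p.adjust(p_raw, method = "holm")
head(data.frame(group1 = pairs[, 1], group2 = pairs[, 2], p_raw, p_holm))
```

The number of pairs grows quadratically with the number of groups, which is why pairwise testing dominates computing time.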
157 |
158 | permanovaR is a list with 5 elements:
159 |
160 | ```{r}
161 | class(permanovaR)
162 | names(permanovaR)
163 | str(permanovaR)
164 | ```
165 |
166 | Assign sample summary to object permP:
167 |
168 | ```{r}
169 | permP <- permanovaR$permanova.samples
170 | ```
171 |
172 | Show PERMANOVA tables (global and pairwise):
173 |
174 | ```{r}
175 | permanovaR$permanova.global_test
176 | ```
177 |
178 | For the mallard dataset, the p value is 1e-04.
179 | As with other frequentist tests, p values should be interpreted as the probability of obtaining differences at least as large as those observed if the null hypothesis of no differences between groups is true.
180 | The lower the p value, the weaker the support for the null hypothesis.
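
A permutation p value of this kind can be sketched in base R. The example below is a simplified, univariate stand-in (made-up data, difference between two group means as the statistic), not the PERMANOVA computation itself. It assumes the common convention that the observed arrangement counts as one permutation, so with 9999 permutations the smallest attainable p value is 1 / 10000 = 1e-04.

```r
set.seed(1014)

# Hypothetical data: one coordinate per individual, two groups of 20
group <- rep(c("A", "B"), each = 20)
coord <- c(rnorm(20, mean = 0), rnorm(20, mean = 0.5))

# Test statistic: absolute difference between the two group means
group_diff <- function(x, g) abs(mean(x[g == "A"]) - mean(x[g == "B"]))
observed <- group_diff(coord, group)

# Null distribution: recompute the statistic after shuffling group labels
n_perm <- 9999
permuted <- replicate(n_perm, group_diff(coord, sample(group)))

# Proportion of arrangements (observed one included) at least as extreme
p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
p_min <- 1 / (n_perm + 1)   # smallest attainable p value
c(p_value = p_value, p_min = p_min)
```

Under this convention a reported p value of 1e-04 means that none of the permuted statistics reached the observed one.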
181 |
182 | ```{r}
183 | head(permanovaR$permanova.pairwise_test)
184 | ```
185 |
186 | The lowest p values (resulting from pairwise comparisons) consistently occur between the Greenland and the remaining flyways, supporting a unique SNP composition mostly in Greenland mallards.
187 |
188 | ## Run *smart_permdisp*
189 |
190 | Run PERMDISP test (group dispersion in PCA1 x PCA2 space) and assign results to object permdispR (missing values imputed with means, SNPs scaled to control genetic drift). Heteroscedasticity tests in combination with ANOVA tests tell whether the ANOVA F statistic is driven by mean and/or variance differences among groups in a univariate context. Location and dispersion (multivariate context) are analogous to mean and variance in a univariate context. As the number of individuals per flyway differs a great deal among flyways, PERMDISP is run to control for sample-size bias (samplesize_bias = TRUE).
191 |
192 | ```{r, message=FALSE}
193 | pathToFile <- system.file("extdata", "mallard_genotype_Kraus2012", package = "smartsnp")
194 | permdispR <- smart_permdisp(snp_data = pathToFile, sample_group = my_groups,
195 | target_space = "pca", missing_impute = "mean", pairwise = TRUE, samplesize_bias = TRUE)
196 | ```
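
The idea of group dispersion can be sketched in base R as the distance of each individual to the centroid of its group in ordination space. The data and group sizes below are made up, and the sqrt(n / (n - 1)) weighting is one common small-sample adjustment shown purely as an assumption; it is not claimed to be the exact correction applied by smart_permdisp.

```r
set.seed(1014)

# Hypothetical ordination: 5 individuals in one group, 50 in another
group <- factor(rep(c("small", "large"), times = c(5, 50)))
coords <- cbind(PC1 = rnorm(55), PC2 = rnorm(55))

# Centroid of each group on each axis (rows = groups, columns = axes)
centroids <- apply(coords, 2, function(x) tapply(x, group, mean))

# Dispersion = Euclidean distance of each individual to its group centroid
dispersion <- sqrt(rowSums((coords - centroids[as.character(group), ])^2))

# Assumed sample-size adjustment: inflate distances by sqrt(n / (n - 1))
n_group <- table(group)
weight <- sqrt(n_group / (n_group - 1))
dispersion_adj <- dispersion * as.numeric(weight[as.character(group)])

# Mean dispersion per group, before and after the adjustment
rbind(raw = tapply(dispersion, group, mean),
      adjusted = tapply(dispersion_adj, group, mean))
```

The adjustment matters most for the small group, whose centroid is estimated from few individuals and therefore sits closer to them than the true center would.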
197 |
198 | permdispR is a list with 7 elements:
199 |
200 | ```{r}
201 | class(permdispR)
202 | names(permdispR)
203 | str(permdispR)
204 | ```
205 |
206 | Assign sample summary to object permD, where column Sample_dispersion shows the dispersion of individuals relative to their flyway:
207 |
208 | ```{r}
209 | permD <- permdispR$permdisp.samples
210 | ```
211 |
212 | Show PERMDISP tables (global and pairwise):
213 |
214 | ```{r}
215 | permdispR$permdisp.global_test
216 | ```
217 |
218 | For the mallard dataset, the p value is 0.0073:
219 |
220 | ```{r}
221 | str(permdispR$permdisp.pairwise_test)
222 | ```
223 |
224 | Most PERMDISP pairwise tests show relatively high p values (i.e., high probability of the observed differences in dispersion if the null hypothesis of no dispersion differences among groups is true), indicating that PERMANOVA tests mainly captured differences in location. The lowest p values for the PERMDISP pairwise tests among Eurasian flyways occur for the Europe North Western (ENW) flyway versus the other flyways as seen in the ordination plot (i.e., ENW individuals are widely spread over both the PCA1 and PCA2 axes).
225 |
226 |
227 | ## Run *smart_mva*
228 |
229 | Run PCA, PERMANOVA and PERMDISP tests (group location and dispersion in PCA1 x PCA2 space), and assign results to object mvaR. No pairwise comparisons are applied (default: pairwise = FALSE), so computation will be relatively fast. This is a wrapper function that runs the three other functions of the package (smart_pca, smart_permanova, smart_permdisp) in a single job.
230 |
231 | ```{r, message=FALSE}
232 | pathToFile <- system.file("extdata", "mallard_genotype_Kraus2012", package = "smartsnp")
233 | mvaR <- smart_mva(snp_data = pathToFile, sample_group = my_groups,
234 | target_space = "pca", missing_impute = "mean", samplesize_bias = TRUE)
235 | ```
236 |
237 | mvaR is a list with three elements (data, pca, test):
238 |
239 | ```{r}
240 | class(mvaR)
241 | names(mvaR)
242 | str(mvaR)
243 | ```
244 |
245 | Element 1 = scaled dataset (none, covariance, correlation, drift) in a matrix and array (rows = SNPs, columns = samples):
246 |
247 | ```{r}
248 | class(mvaR$data)
249 | dim(mvaR$data)
250 | str(mvaR$data)
251 | ```
252 |
253 | Element 2 = PCA results in a list:
254 |
255 | ```{r}
256 | class(mvaR$pca)
257 | names(mvaR$pca)
258 | str(mvaR$pca)
259 | ```
260 |
261 | Show PCA results:
262 |
263 | ```{r}
264 | head(mvaR$pca$pca.eigenvalues) #extract eigenvalues
265 | head(mvaR$pca$pca.sample_coordinates) #extract coordinates of individuals in PCA1 x PCA2 space
266 | head(mvaR$pca$pca.snp_loadings) #extract SNP loadings
267 | ```
268 |
269 | Element 3 = PERMANOVA and PERMDISP results in a list:
270 |
271 | ```{r}
272 | class(mvaR$test)
273 | names(mvaR$test)
274 | str(mvaR$test)
275 | ```
276 |
277 | Multiple-testing correction applied:
278 |
279 | ```{r}
280 | mvaR$test$test.pairwise_correction
281 | ```
282 |
283 | Number of permutations to estimate p value:
284 |
285 | ```{r}
286 | mvaR$test$test.permutation_number
287 | ```
288 |
289 | Seed for random generator:
290 |
291 | ```{r}
292 | mvaR$test$test.permutation_seed
293 | ```
294 |
295 | Summary of samples:
296 |
297 | ```{r}
298 | head(mvaR$test$test_samples)
299 | ```
300 |
301 | Show PERMANOVA table:
302 |
303 | ```{r}
304 | mvaR$test$permanova.global_test #global test
305 | mvaR$test$permanova.pairwise_test #pairwise tests
306 | ```
307 |
308 | Show PERMDISP table:
309 |
310 | ```{r}
311 | mvaR$test$permdisp.global_test #global test
312 | mvaR$test$permdisp.pairwise_test #pairwise tests
313 | ```
314 |
315 | Sample-size correction applied:
316 |
317 | ```{r}
318 | mvaR$test$permdisp.bias
319 | ```
320 |
321 | Location of flyways in ordination:
322 |
323 | ```{r}
324 | mvaR$test$permdisp.group_location
325 | ```
326 |
327 |
328 |
--------------------------------------------------------------------------------
/man/smart_mva.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/smart_mva.R
3 | \name{smart_mva}
4 | \alias{smart_mva}
5 | \title{Smart Multivariate Analyses (wrapper of PCA, PERMANOVA and PERMDISP)}
6 | \arguments{
7 | \item{snp_data}{File name read from working directory.
8 | SNP = rows, samples = columns without row names or column headings.
9 | SNP values must be count data (no decimals allowed).
10 | File extension detected automatically whether text or \code{EIGENSTRAT}.
11 | See details.}
12 |
13 | \item{packed_data}{Logical value for \code{EIGENSTRAT}, irrelevant for text data.
14 | Default \code{packed_data = FALSE} assumes uncompressed \code{EIGENSTRAT}.
15 | \code{packed_data = TRUE} for compressed or binary \code{EIGENSTRAT} (\code{PACKENDANCESTRYMAP}).}
16 |
17 | \item{sample_group}{Character or numeric vector assigning samples to groups.
18 | Coerced to factor.}
19 |
20 | \item{sample_remove}{Logical \code{FALSE} or numeric vector indicating column numbers (samples) to be removed from computations.
21 | Default \code{sample_remove = FALSE} keeps all samples.}
22 |
23 | \item{snp_remove}{Logical \code{FALSE} or numeric vector indicating row numbers (SNPs) to be removed from computations.
24 | Default \code{snp_remove = FALSE} keeps all SNPs.
25 | See details.}
26 |
27 | \item{pca}{Logical indicating if PCA is computed.
28 | Default \code{TRUE}.}
29 |
30 | \item{permanova}{Logical indicating if PERMANOVA is computed.
31 | Default \code{TRUE}.}
32 |
33 | \item{permdisp}{Logical indicating if PERMDISP is computed.
34 | Default \code{TRUE}.}
35 |
36 | \item{missing_value}{Number \code{9} or string \code{NA} indicating missing value.
37 | Default \code{missing_value = 9} as in \code{EIGENSTRAT}.
38 | If no missing values present, no effect on computation.}
39 |
40 | \item{missing_impute}{String handling missing values.
41 | Default \code{missing_impute = "mean"} replaces missing values of each SNP by mean of non-missing values across samples.
42 | \code{missing_impute = "remove"} removes SNPs with at least one missing value.
43 | If no missing values present, no effect on computation.}
44 |
45 | \item{scaling}{String. Default \code{scaling = "drift"} scales SNPs to control for expected allele frequency dispersion caused by genetic drift (SMARTPCA).
46 | \code{scaling = "center"} for \code{centering} (covariance-based PCA).
47 | \code{scaling = "sd"} for \code{centered} SNPs divided by standard deviation (correlation-based PCA).
48 | \code{scaling = "none"} for no scaling.
49 | See details.}
50 |
51 | \item{program_svd}{String indicating R package computing singular value decomposition (SVD).
52 | Default \code{program_svd = "Rspectra"} for \code{\link[RSpectra]{svds}}.
53 | \code{program_svd = "bootSVD"} for \code{\link[bootSVD]{fastSVD}}.
54 | See details.}
55 |
56 | \item{sample_project}{Numeric vector indicating column numbers (ancient samples) projected onto (modern) PCA space.
57 | Default \code{sample_project = FALSE} implements no projection.
58 | See details.}
59 |
60 | \item{pc_project}{Numeric vector indicating the ranks of the PCA axes ancient samples are projected onto. Default \code{pc_project = c(1, 2)} for PCA axes 1 and 2.
61 | If \code{program_svd = "Rspectra"}, \code{length(pc_project)} must be smaller than or equal to \code{pc_axes}.
62 | No effect on computation, if no ancient samples present.}
63 |
64 | \item{sample_distance}{Type of inter-sample proximity computed (distance, similarity, dissimilarity).
65 | Default is \code{Euclidean distance}.
66 | See details.}
67 |
68 | \item{program_distance}{A string value indicating R package to estimate proximities between pairs of samples.
69 | Default \code{program_distance = "Rfast"} uses function \code{\link[Rfast]{Dist}}; \code{program_distance = "vegan"} uses \code{\link[vegan]{vegdist}}.
70 | See details.}
71 |
72 | \item{target_space}{String.
73 | Default \code{target_space = "multidimensional"} applies PERMANOVA and/or PERMDISP to sample-by-sample triangular matrix computed from variable-by-sample data, \code{pc_axes} has no effect on computation. \code{target_space = "pca"} applies PERMANOVA and/or PERMDISP to sample-by-sample data in PCA space, \code{pc_axes} determines number of PCA axes for testing.}
74 |
75 | \item{pc_axes}{Number of PCA axes computed always starting with PCA axis 1.
76 | Default \code{pc_axes = 2} computes PCA axes 1 and 2 if \code{target_space = "pca"}.
77 | No effect on computation if \code{target_space = "multidimensional"}.}
78 |
79 | \item{pairwise}{Logical.
80 | Default \code{pairwise = FALSE} computes global test. \code{pairwise = TRUE} computes global and pairwise tests.}
81 |
82 | \item{pairwise_method}{String specifying type of correction for multiple testing.
83 | Default \code{"holm"}.}
84 |
85 | \item{permutation_n}{Number of permutations resulting in PERMANOVA/PERMDISP test \emph{p value}.
86 | Default \code{9999}.}
87 |
88 | \item{permutation_seed}{Number fixing random generator of permutations.
89 | Default \code{1}.}
90 |
91 | \item{dispersion_type}{String indicating quantification of group dispersion whether relative to spatial \code{"median"} or \code{"centroid"} in PERMDISP.
92 | Default \code{"median"}.}
93 |
94 | \item{samplesize_bias}{Logical. \code{samplesize_bias = TRUE} for dispersion weighted by number of samples per group in PERMDISP.
95 | Default \code{samplesize_bias = FALSE} for no weighting.}
96 | }
97 | \value{
98 | Returns a list containing the following elements:
99 | \itemize{
100 | \item{pca.snp_loadings}{Dataframe of principal coefficients of SNPs.
101 | One set of coefficients per PCA axis computed.}
102 | \item{pca.eigenvalues}{Dataframe of eigenvalues, variance and cumulative variance explained.
103 | One eigenvalue per PCA axis computed.}
104 | \item{pca.sample_coordinates}{Dataframe showing PCA sample summary. Column \emph{Group} assigns samples to groups. Column \emph{Class} specifies if samples "Removed" from PCA or "Projected" onto PCA space.
105 | Sequence of additional columns shows principal components (coordinates) of samples in PCA space (1 column per PCA computed named PC1, PC2, ...).}
106 | \item{test_samples}{Dataframe showing test sample summary.
107 | Column \emph{Group} assigns samples to tested groups.
108 | Column \emph{Class} specifies if samples were used in, or removed from, testing (PERMANOVA and/or PERMDISP).
109 | Column \emph{Sample_dispersion} shows dispersion of individual samples relative to spatial \code{"median"} or \code{"centroid"} used in PERMDISP.}
110 | \item{permanova.global_test}{List showing PERMANOVA table with degrees of freedom, sum of squares, mean sum of squares, \emph{F} statistic, variance explained (\emph{R2}) and \emph{p} value.}
111 | \item{permanova.pairwise_test}{List showing PERMANOVA table with \emph{F} statistic, variance explained (\emph{R2}), \emph{p} value and corrected \emph{p} value per pair of groups.}
112 | \item{permdisp.global_test}{List showing PERMDISP table with degrees of freedom, sum of squares, mean sum of squares, \emph{F} statistic and \emph{p} value.}
113 | \item{permdisp.pairwise_test}{List showing PERMDISP table with \emph{F} statistic, \emph{p} value and corrected \emph{p} value per pair of groups.
114 | Obtained only if \code{pairwise = TRUE}.}
115 | \item{permdisp.bias}{String indicating if PERMDISP dispersion corrected for number of samples per group.}
116 | \item{permdisp.group_location}{Dataframe showing coordinates of spatial \code{"median"} or \code{"centroid"} per group in PERMDISP.}
117 | \item{test.pairwise_correction}{String indicating type of correction for multiple testing in PERMANOVA and/or PERMDISP.}
118 | \item{test.permutation_number}{Number of permutations applied to obtain the distribution of \emph{F} statistic of PERMANOVA and/or PERMDISP.}
119 | \item{test.permutation_seed}{Number fixing random generator of permutations of PERMANOVA and/or PERMDISP for reproducibility of results.}
120 | }
121 | }
122 | \description{
123 | Computes Principal Component Analysis (PCA) for variable x sample genotype data, such as Single Nucleotide Polymorphisms (SNP), in combination with Permutational Multivariate Analysis of Variance (PERMANOVA) and Permutational Multivariate Analysis of Dispersion (PERMDISP).
124 | A wrapper of functions \code{smart_pca}, \code{smart_permanova} and \code{smart_permdisp}.
125 | Genetic markers such as SNPs can be scaled by \code{centering}, z-scores and genetic drift-based dispersion.
126 | The latter follows the SMARTPCA implementation of Patterson, Price and Reich (2006).
127 | Optimized to run fast computation for big datasets.
128 | }
129 | \details{
130 | See details in other functions for conceptualization of PCA (\code{smart_pca}) (Hotelling 1933), SMARTPCA (Patterson, Price and Reich 2006), PERMANOVA (\code{smart_permanova}) (Anderson 2001) and PERMDISP (\code{smart_permdisp}) (Anderson 2006), types of scaling, ancient projection, and correction for multiple testing.\cr
131 |
132 | Users can compute any combination of the three analyses by assigning \code{TRUE} or \code{FALSE} to \code{pca} and/or \code{permanova} and/or \code{permdisp}.\cr
133 |
134 | PERMANOVA and PERMDISP exclude samples (columns) specified in either \code{sample_remove} or \code{sample_project}.
135 | Projected samples are not used for testing as their PCA coordinates are derived from, and therefore depend on, the coordinates of non-projected samples.\cr
136 |
137 | Data read from working directory with SNPs as rows and samples as columns. Two alternative formats: (1) text file of SNPs by samples (file extension and column separators recognized automatically) read using \code{\link[data.table]{fread}}; or (2) duet of \code{EIGENSTRAT} files (see \url{https://reich.hms.harvard.edu/software}) using \code{\link[vroom]{vroom_fwf}}, including a genotype file of SNPs by samples (\code{*.geno}), and a sample file (\code{*.ind}) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers.
138 | For \code{EIGENSTRAT}, vector \code{sample_group} assigns samples to groups retrievable from column 3 of file \code{*.ind}.
139 | SNPs with zero variance removed prior to SVD to optimize computation time and avoid undefined values if \code{scaling = "sd"} or \code{"drift"}.\cr
140 |
141 | Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (\code{sample_remove}) and/or row numbers for SNPs (\code{snp_remove}) to be removed from computations.
142 | Function stops if the final number of SNPs is 1 or 2.
143 | \code{EIGENSOFT} was conceived for the analysis of human genes, so its SMARTPCA suite accepts 22 (autosomal) chromosomes by default.
144 | If >22 chromosomes are provided and the internal parameter \code{numchrom} is not set to the target number of chromosomes, SMARTPCA automatically subsets chromosomes 1 to 22.
145 | In contrast, \code{smart_mva} accepts any number of autosomes with or without the sex chromosomes from an \code{EIGENSTRAT} file.\cr
146 | }
147 | \examples{
148 | # Path to example genotype matrix "dataSNP"
149 | pathToGenoFile = system.file("extdata", "dataSNP", package = "smartsnp")
150 |
151 | # Assign 50 samples to each of two groups and colors
152 | my_groups <- as.factor(c(rep("A", 50), rep("B", 50))); cols = c("red", "blue")
153 |
154 | # Run PCA, PERMANOVA and PERMDISP
155 | mvaR <- smart_mva(snp_data = pathToGenoFile, sample_group = my_groups)
156 | mvaR$pca$pca.eigenvalues # extract PCA eigenvalues
157 | head(mvaR$pca$pca.snp_loadings) # extract principal coefficients (SNP loadings)
158 | head(mvaR$pca$pca.sample_coordinates) # extract PCA principal components (sample position in PCA space)
159 |
160 | # plot PCA
161 | plot(mvaR$pca$pca.sample_coordinates[,c("PC1","PC2")], cex = 2,
162 | pch = 19, col = cols[my_groups], main = "genotype smartpca")
163 | legend("topleft", legend = levels(my_groups), cex = 1,
164 | pch = 19, col = cols, text.col = cols)
165 |
166 | # Extract PERMANOVA table
167 | mvaR$test$permanova.global_test
168 |
169 | # Extract PERMDISP table
170 | mvaR$test$permdisp.global_test # extract PERMDISP table
171 |
172 | # Extract sample summary and dispersion of individual samples used in PERMDISP
173 | mvaR$test$test_samples
174 |
175 | }
176 | \seealso{
177 | \code{\link{smart_pca}},
178 | \code{\link{smart_permanova}},
179 | \code{\link{smart_permdisp}}
180 | }
181 |
--------------------------------------------------------------------------------
/man/smart_pca.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/smart_pca.R
3 | \name{smart_pca}
4 | \alias{smart_pca}
5 | \title{Smart Principal Component Analysis}
6 | \arguments{
7 | \item{snp_data}{File name read from working directory.
8 | SNP = rows, samples = columns without row names or column headings.
9 | SNP values must be count data (no decimals allowed). File extension detected automatically whether text or \code{EIGENSTRAT}.
10 | See details.}
11 |
12 | \item{packed_data}{Logical value for \code{EIGENSTRAT}, irrelevant for text data.
13 | Default \code{packed_data = FALSE} assumes uncompressed \code{EIGENSTRAT}.
14 | \code{packed_data = TRUE} for compressed or binary \code{EIGENSTRAT} (\code{PACKENDANCESTRYMAP}).}
15 |
16 | \item{sample_group}{Character or numeric vector assigning samples to groups.
17 | Coerced to factor.}
18 |
19 | \item{sample_remove}{Logical \code{FALSE} or numeric vector indicating column numbers (samples) to be removed from computations.
20 | Default \code{sample_remove = FALSE} keeps all samples.}
21 |
22 | \item{snp_remove}{Logical \code{FALSE} or numeric vector indicating row numbers (SNPs) to be removed from computations.
23 | Default \code{snp_remove = FALSE} keeps all SNPs.
24 | See details.}
25 |
26 | \item{missing_value}{Number \code{9} or string \code{NA} indicating missing value.
27 | Default \code{missing_value = 9} as in \code{EIGENSTRAT}.
28 | If no missing values present, no effect on computation.}
29 |
30 | \item{missing_impute}{String handling missing values.
31 | Default \code{missing_impute = "mean"} replaces missing values of each SNP by mean of non-missing values across samples.
32 | \code{missing_impute = "remove"} removes SNPs with at least one missing value.
33 | If no missing values present, no effect on computation.}
34 |
35 | \item{scaling}{String. Default \code{scaling = "drift"} scales SNPs to control for expected allele frequency dispersion caused by genetic drift (SMARTPCA).
36 | \code{scaling = "center"} for \code{centering} (covariance-based PCA).
37 | \code{scaling = "sd"} for \code{centered} SNPs divided by standard deviation (correlation-based PCA).
38 | \code{scaling = "none"} for no scaling.
39 | See details.}
40 |
41 | \item{program_svd}{String indicating R package computing singular value decomposition (SVD).
42 | Default \code{program_svd = "Rspectra"} for \code{\link[RSpectra]{svds}}.
43 | \code{program_svd = "bootSVD"} for \code{\link[bootSVD]{fastSVD}}.
44 | See details.}
45 |
46 | \item{pc_axes}{A numeric value.
47 | If \code{program_svd = "Rspectra"} this argument indicates number of PCA axes computed starting with PCA axis 1.
48 | Default \code{pc_axes = 2} computes PCA axes 1 and 2.
49 | No effect on computation if \code{program_svd = "bootSVD"} since all PCA axes are computed.}
50 |
51 | \item{sample_project}{Numeric vector indicating column numbers (ancient samples) projected onto (modern) PCA space.
52 | Default \code{sample_project = FALSE} indicates no samples will be used for projection.
53 | See details.}
54 |
55 | \item{pc_project}{Numeric vector indicating the ranks of the PCA axes ancient samples are projected onto.
56 | Default \code{pc_project = c(1, 2)} for PCA axes 1 and 2. If \code{program_svd = "Rspectra"}, \code{length(pc_project)} must be smaller than or equal to \code{pc_axes}.
57 | No effect on computation, if no ancient samples present.}
58 | }
59 | \value{
60 | Returns a list containing the following elements:
61 | \itemize{
62 | \item {\code{pca.snp_loadings}} {Dataframe of principal coefficients of SNPs. One set of coefficients per PCA axis computed.}\cr
63 | \item {\code{pca.eigenvalues}} {Dataframe of eigenvalues, variance and cumulative variance explained. One eigenvalue per PCA axis computed.}\cr
64 | \item {\code{pca.sample_coordinates}} {Dataframe showing PCA sample summary.
65 | Column \emph{Group} assigns samples to groups.
66 | Column \emph{Class} specifies if samples "Removed" from PCA or "Projected" onto PCA space.
67 | Sequence of additional columns shows principal components (coordinates) of samples in PCA space (1 column per PCA computed named PC1, PC2, ...).}
68 | }
69 | }
70 | \description{
71 | Computes Principal Component Analysis (PCA) for variable x sample genotype data including covariance (\code{centered}), correlation (z-score) and SMARTPCA scaling,
72 | and implements projection of ancient samples onto modern PCA space. SMARTPCA scaling controls for genetic drift when variables are bi-allelic genetic markers
73 | such as single nucleotide polymorphisms (SNP) following Patterson, Price and Reich (2006).
74 | Optimized to run fast singular value decomposition for big datasets.
75 | }
76 | \details{
77 | PCA is a rigid rotation of a Cartesian coordinate system (samples = points, axes = variables or SNPs) that maximizes the dispersion of points along a new system of axes (Pearson 1901; Hotelling 1933; Jolliffe 2002).
78 | In rotated space (ordination), axes are \code{principal axes} (PCA axes), \code{eigenvalues} measure variance explained, and \code{principal coefficients} measure importance of SNPs (eigenvectors), \code{principal components} are coordinates of samples (i.e., linear combinations of scaled variables weighted by eigenvectors).
79 | Principal coefficients are direction cosines between original and PCA axes (Legendre & Legendre 2012). PCA can be computed by \code{eigenanalysis} or, as implemented here, singular value decomposition (SVD). \cr
80 |
81 | SNPs can be scaled in four different ways prior to SVD: (1) no scaling; (2) covariance: SNPs \code{centered} such that \emph{M(i,j)} = \emph{C(i,j)} minus \emph{mean(j)}, where \emph{C(i,j)} is the number of variant alleles for SNP \emph{j} and sample \emph{i}, and \emph{M(i,j)} is the \code{centered} value of each data point; (3) correlation (z-scores): SNPs \code{centered} then divided by standard deviation \emph{sd(j)}; (4) SMARTPCA: SNPs \code{centered} then divided by \emph{sqrt(p(j)(1-p(j)))}, where \emph{p(j)} equals \emph{mean(j)} divided by \emph{2}, quantifies the underlying allele frequency (autosomal chromosomes) and conceptualizes that SNP frequency changes at rate proportional to \emph{sqrt(p(j)(1-p(j)))} per generation due to genetic drift (Patterson, Price and Reich 2006).
82 | SMARTPCA standardization results in all SNPs that comply with Hardy-Weinberg equilibrium having identical variance.
83 | SMARTPCA (Patterson, Price and Reich 2006) and \code{EIGENSTRAT} (Price, Patterson, Plenge, Weinblatt, Shadick and Reich 2006) are the computing suites of software \code{EIGENSOFT} (\url{https://reich.hms.harvard.edu/software}).\cr
84 |
85 | \code{\link[RSpectra]{svds}} runs singular value decomposition much faster than \code{\link[bootSVD]{fastSVD}}. With \code{\link[RSpectra]{svds}}, \code{pc_axes} indicates the number of eigenvalues and eigenvectors computed starting from PCA axis 1. \code{\link[bootSVD]{fastSVD}} computes all eigenvalues and eigenvectors. Eigenvalues are calculated from singular values divided by number of samples minus 1. If the number of samples equals the number of SNPs, \code{\link[bootSVD]{fastSVD}} prints a message alerting that no computing efficiency is achieved for square matrices.\cr
86 |
87 | Ancient samples (with many missing values) can be projected onto modern PCA space derived from modern samples.
88 | Following Nelson, Taylor and MacGregor (1996), the projected coordinates of a given ancient sample equal the slope coefficient of a linear fit through the origin of the (scaled) non-missing SNP values of that sample (response) versus the principal coefficients of the same SNPs in modern samples.
89 | Number of projected coordinates per ancient sample given by \code{length(pc_ancient)}.
90 | With \code{\link[RSpectra]{svds}}, \code{pc_axes} must be larger than or equal to \code{length(pc_ancient)}.\cr
91 |
92 | Data read from working directory with SNPs as rows and samples as columns.
93 | Two alternative formats: (1) text file of SNPs by samples (file extension and column separators recognized automatically) read using \code{\link[data.table]{fread}}; or (2) duet of \code{EIGENSTRAT} files (see \url{https://reich.hms.harvard.edu/software}) using \code{\link[vroom]{vroom_fwf}}, including a genotype file of SNPs by samples (\code{*.geno}), and a sample file (\code{*.ind}) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers.
94 | For \code{EIGENSTRAT}, vector \code{sample_group} assigns samples to groups retrievable from column 3 of file \code{*.ind}. SNPs with zero variance removed prior to SVD to optimize computation time and avoid undefined values if \code{scaling = "sd"} or \code{"drift"}.\cr
95 |
96 | Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (\code{sample_remove}) and/or row numbers for SNPs (\code{snp_remove}) to be removed from computations.
97 | Function stops if the final number of SNPs is 1 or 2.
98 | \code{EIGENSOFT} was conceived for the analysis of human genes, so its SMARTPCA suite accepts 22 (autosomal) chromosomes by default.
99 | If >22 chromosomes are provided and the internal parameter \code{numchrom} is not set to the target number of chromosomes of interest, SMARTPCA automatically subsets chromosomes 1 to 22.
100 | In contrast, \code{smart_pca} accepts any number of autosomes with or without the sex chromosomes from an \code{EIGENSTRAT} file.\cr
101 | }
102 | \examples{
103 | # Path to example genotype matrix "dataSNP"
104 | pathToGenoFile = system.file("extdata", "dataSNP", package = "smartsnp")
105 |
106 | # Example 1: modern samples
107 | #assign 50 samples to each of two groups and colors
108 | my_groups <- c(rep("A", 50), rep("B", 50)); cols = c("red", "blue")
109 | #run PCA with truncated SVD (PCA 1 x PCA 2)
110 | pcaR1 <- smart_pca(snp_data = pathToGenoFile, sample_group = my_groups)
111 | pcaR1$pca.eigenvalues # extract eigenvalues
112 | head(pcaR1$pca.snp_loadings) # extract principal coefficients (SNP loadings)
113 | head(pcaR1$pca.sample_coordinates) # extract principal components (sample position in PCA space)
114 | #plot PCA
115 | plot(pcaR1$pca.sample_coordinates[,c("PC1","PC2")], cex = 2,
116 | pch = 19, col = cols[as.factor(my_groups)], main = "genotype smartpca")
117 | legend("topleft", legend = levels(as.factor(my_groups)), cex =1,
118 | pch = 19, col = cols, text.col = cols)
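
# Illustrative sketch (base R only, not smart_pca's internal code): the scalings described
# in Details computed by hand. toy_geno is hypothetical made-up data with 3 SNPs in rows
# and 6 samples in columns; values are counts of variant alleles.
toy_geno <- matrix(c(0, 1, 2, 1, 0, 2,
                     2, 2, 1, 0, 1, 0,
                     0, 0, 1, 0, 1, 1), nrow = 3, byrow = TRUE)
snp_mean <- rowMeans(toy_geno) # mean(j) of each SNP
toy_center <- toy_geno - snp_mean # covariance scaling: subtract SNP mean from each row
toy_sd <- toy_center / apply(toy_geno, 1, sd) # correlation scaling: divide by sd(j)
p <- snp_mean / 2 # allele frequency p(j)
toy_drift <- toy_center / sqrt(p * (1 - p)) # SMARTPCA scaling: divide by sqrt(p(j)(1-p(j)))
round(rowMeans(toy_drift), 10) # every scaled SNP is centered on zero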
119 |
120 | # Example 2: modern and ancient samples (ancient samples projected onto modern PCA space)
121 | #assign samples 1st to 10th per group to ancient
122 | my_ancient <- c(1:10, 51:60)
123 | #run PCA with truncated SVD (PCA 1 x PCA 2)
124 | pcaR2 <- smart_pca(snp_data = pathToGenoFile, sample_group = my_groups, sample_project = my_ancient)
125 | pcaR2$pca.eigenvalues # extract eigenvalues
126 | head(pcaR2$pca.snp_loadings) # extract principal coefficients (SNP loadings)
127 | head(pcaR2$pca.sample_coordinates) # extract principal components (sample position in PCA space)
128 | #assign samples to groups (A, ancient, B) and colors
129 | my_groups[my_ancient] <- "ancient"; cols = c("red", "black", "blue")
130 | #plot PCA
131 | plot(pcaR2$pca.sample_coordinates[,c("PC1","PC2")],
132 | cex = 2, col = cols[as.factor(my_groups)], pch = 19, main = "genotype smartpca")
133 | legend("topleft", legend = levels(as.factor(my_groups)), cex = 1,
134 | pch = 19, col = cols, text.col = cols)
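
# Illustrative sketch of the projection described in Details (hypothetical values, not
# smart_pca's internal code). Assumption of this sketch: one PCA axis is fitted at a time.
# The coordinate of an ancient sample on that axis is the slope of a least-squares fit
# through the origin, using only the SNPs that are not missing in the ancient sample.
toy_loadings <- c(0.5, -0.5, 0.5, -0.5) # made-up principal coefficients of 4 SNPs on PC1
toy_ancient <- c(1.2, NA, 0.4, -0.8) # made-up scaled SNP values of one ancient sample
obs <- !is.na(toy_ancient) # SNPs genotyped in the ancient sample
y <- toy_ancient[obs]; x <- toy_loadings[obs]
toy_coord <- unname(coef(lm(y ~ 0 + x))) # slope of regression through the origin
all.equal(toy_coord, sum(x * y) / sum(x^2)) # closed-form slope gives the same value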
135 |
136 | }
137 | \references{
138 | Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441.\cr
139 |
140 | Jolliffe, I.T. (2002) Principal Component Analysis (Springer, New York, USA).\cr
141 |
142 | Legendre, P. & L. F. J. Legendre (2012). Numerical ecology. Developments in environmental modelling (Elsevier, Oxford, UK).\cr
143 |
144 | Nelson, P.R.C., P.A. Taylor, and J.F. MacGregor (1996) Missing data methods in PCA and PLS: score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35, 45-65.\cr
145 |
146 | Patterson, N.J., A. L. Price and D. Reich (2006) Population structure and eigenanalysis. PLoS Genetics, 2, e190.\cr
147 |
148 | Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559-572.\cr
149 |
150 | Price, A.L., N.J. Patterson, R.M. Plenge, M.E. Weinblatt, N.A. Shadick and D. Reich (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38, 904-909.
151 | }
152 | \seealso{
153 | \code{\link[bootSVD]{fastSVD}} (package \bold{bootSVD}),
154 | \code{\link[foreach]{foreach}} (package \bold{foreach}),
155 | \code{\link[data.table]{fread}} (package \bold{data.table}),
156 | \code{\link[Rfast]{rowVars}} (package \bold{Rfast}),
157 | \code{\link[RSpectra]{svds}} (package \bold{RSpectra}),
158 | \code{\link[vroom]{vroom_fwf}} (package \bold{vroom})
159 | }
160 |
--------------------------------------------------------------------------------
/man/smart_permanova.Rd:
--------------------------------------------------------------------------------
1 | % Generated by roxygen2: do not edit by hand
2 | % Please edit documentation in R/smart_permanova.R
3 | \name{smart_permanova}
4 | \alias{smart_permanova}
5 | \title{Smart Permutational Multivariate Analysis of Variance}
6 | \arguments{
7 | \item{snp_data}{File name read from working directory.
8 | SNP = rows, samples = columns without row names or column headings.
9 | SNP values must be count data (no decimals allowed).
10 | File extension detected automatically whether text or \code{EIGENSTRAT}.
11 | See details.}
12 |
13 | \item{packed_data}{Logical value for \code{EIGENSTRAT}, irrelevant for text data.
14 | Default \code{packed_data = FALSE} assumes uncompressed \code{EIGENSTRAT}.
15 | \code{packed_data = TRUE} for compressed or binary \code{EIGENSTRAT} (\code{PACKENDANCESTRYMAP}).}
16 |
17 | \item{sample_group}{Character or numeric vector assigning samples to groups. Coerced to factor.}
18 |
19 | \item{sample_remove}{Logical \code{FALSE} or numeric vector indicating column numbers (samples) to be removed from computations.
20 | Default \code{sample_remove = FALSE} keeps all samples.}
21 |
22 | \item{snp_remove}{Logical \code{FALSE} or numeric vector indicating row numbers (SNPs) to be removed from computations.
23 | Default \code{snp_remove = FALSE} keeps all SNPs. See details.}
24 |
25 | \item{missing_value}{Number \code{9} or string \code{NA} indicating missing value.
26 | Default \code{missing_value = 9} as in \code{EIGENSTRAT}.
27 | If no missing values present, no effect on computation.}
28 |
29 | \item{missing_impute}{String handling missing values.
30 | Default \code{missing_impute = "mean"} replaces missing values of each SNP by mean of non-missing values across samples.
31 | \code{missing_impute = "remove"} removes SNPs with at least one missing value.
32 | If no missing values present, no effect on computation.}
33 |
34 | \item{scaling}{String. Default \code{scaling = "drift"} scales SNPs to control for expected allele frequency dispersion caused by genetic drift (SMARTPCA).
35 | \code{scaling = "center"} for \code{centering} (covariance-based PCA).
36 | \code{scaling = "sd"} for \code{centered} SNPs divided by standard deviation (correlation-based PCA).
37 | \code{scaling = "none"} for no scaling.
38 | See details.}
39 |
40 | \item{sample_distance}{Type of inter-sample proximity computed (distance, similarity, dissimilarity).
41 | Default is \code{Euclidean distance}. See details.}
42 |
43 | \item{program_distance}{A string value indicating R package to estimate proximities between pairs of samples.
44 | Default \code{program_distance = "Rfast"} uses function \code{\link[Rfast]{Dist}}; \code{program_distance = "vegan"} uses \code{\link[vegan]{vegdist}}.
45 | See details.}
46 |
47 | \item{target_space}{String.
48 | Default \code{target_space = "multidimensional"} applies PERMANOVA to sample-by-sample triangular matrix computed from variable-by-sample data, \code{pc_axes} has no effect on computation.
49 | \code{target_space = "pca"} applies PERMANOVA to sample-by-sample data in PCA space, \code{pc_axes} determines number of PCA axes for testing.}
50 |
51 | \item{pc_axes}{Number of PCA axes computed always starting with PCA axis 1. Default \code{pc_axes = 2} computes PCA axes 1 and 2 if \code{target_space = "pca"}.
52 | No effect on computation if \code{target_space = "multidimensional"}.}
53 |
54 | \item{pairwise}{Logical.
55 | Default \code{pairwise = FALSE} computes global test.
56 | \code{pairwise = TRUE} computes global and pairwise tests.}
57 |
58 | \item{pairwise_method}{String specifying type of correction for multiple testing.
59 | Default \code{"holm"}.
60 | See details.}
61 |
62 | \item{permutation_n}{Number of permutations resulting in PERMANOVA test \emph{p value}.
63 | Default \code{9999}.}
64 |
65 | \item{permutation_seed}{Number fixing random generator of permutations.
66 | Default \code{1}.}
67 | }
68 | \value{
69 | Returns a list containing the following elements:
70 | \itemize{
71 | \item{permanova.samples}{Dataframe showing sample summary.
72 | Column \emph{Group} assigns samples to tested groups.
73 | Column \emph{Class} specifies if samples were used in, or removed from, testing.}
74 | \item{permanova.global_test}{List showing table with degrees of freedom, sum of squares, mean sum of squares, \emph{F} statistic, variance explained (\emph{R2}) and \emph{p} value.}
75 | \item{permanova.pairwise_test}{List showing table with \emph{F} statistic, variance explained (\emph{R2}), \emph{p} value and corrected \emph{p} value per pair of groups.
76 | Obtained only if \code{pairwise = TRUE}.}
77 | \item{permanova.pairwise_correction}{String indicating type of correction for multiple testing.}
78 | \item{permanova.permutation_number}{Number of permutations applied to obtain the distribution of \emph{p value}.}
79 | \item{permanova.permutation_seed}{Number fixing random generator of permutations for reproducibility of results.}
80 | }
81 | }
82 | \description{
83 | Computes Permutational Multivariate Analysis of Variance (PERMANOVA) for testing differences in group location using multivariate data. Variance partitioning computed on a sample-by-sample triangular matrix obtained from variable-by-sample data following Anderson (2001).
84 | Calculates a range of inter-sample distances, similarities and dissimilarities.
85 | Includes control for genetic drift for bi-allelic genetic markers such as single nucleotide polymorphisms (SNP) following Patterson, Price and Reich (2006) that can be combined with SMART Principal Component Analysis (PCA). Optimized to run fast matrix building and permutations for big datasets in ecological, evolutionary and genomic research.
86 | }
87 | \details{
88 | PERMANOVA is a form of linear modelling that partitions variation in a triangular matrix of inter-sample proximities obtained from variable-by-sample data.
89 | Uses permutations to estimate the probability of observed group differences in SNP composition given a null hypothesis of no differences between groups (Anderson 2001).
90 | Proximity between samples can be any type of distance, similarity or dissimilarity.
91 | Original acronym \code{NPMANOVA} (Non-Parametric MANOVA) replaced with PERMANOVA (Anderson 2004, 2017).\cr
92 |
93 | Univariate ANOVA captures differences in mean and variance, referred to as location and dispersion in PERMANOVA's multivariate context (Anderson & Walsh 2013; Warton, Wright and Wang 2012).
94 | To attribute group differences to location (position of sample groups) and/or dispersion (spread of sample groups), PERMANOVA must be combined with PERMDISP as implemented through \code{smart_permdisp}.\cr
95 |
96 | Function \code{smart_permanova} uses \code{\link[vegan]{adonis}} to fit formula \code{snp_eucli ~ sample_group}, where \code{snp_eucli} is the sample-by-sample triangular matrix in Principal Coordinate Analysis (Gower 1966) space.
97 | Current version restricted to one-way designs (one categorical predictor) though PERMANOVA can handle >1 crossed and/or nested factors (Anderson 2001) and continuous predictors (McArdle & Anderson 2001).
98 | If >2 sample groups tested, \code{pairwise = TRUE} allows pairwise testing and correction for multiple testing by \code{holm (Holm)} [default], \code{hochberg (Hochberg)}, \code{hommel (Hommel)}, \code{bonferroni (Bonferroni)}, \code{BY (Benjamini-Yekutieli)}, \code{BH (Benjamini-Hochberg)} or \code{fdr (False Discovery Rate)}.\cr
99 |
100 | For big data, \code{\link[Rfast]{Dist}} builds sample-by-sample triangular matrix much faster than \code{\link[vegan]{vegdist}}.
101 | \code{\link[Rfast]{Dist}} computes proximities \code{euclidean}, \code{manhattan}, \code{canberra1}, \code{canberra2}, \code{minimum}, \code{maximum}, \code{minkowski}, \code{bhattacharyya}, \code{hellinger}, \code{kullback_leibler} and \code{jensen_shannon}. \code{\link[vegan]{vegdist}} computes \code{manhattan}, \code{euclidean}, \code{canberra}, \code{clark}, \code{bray}, \code{kulczynski}, \code{jaccard}, \code{gower}, \code{altGower}, \code{morisita}, \code{horn}, \code{mountford}, \code{raup}, \code{binomial}, \code{chao}, \code{cao} and \code{mahalanobis}.
102 | Euclidean distance required for SMARTPCA scaling.\cr
103 |
104 | \code{sample_remove} should include both samples removed from PCA and ancient samples projected onto PCA space (if any).\cr
105 |
106 | Data read from working directory with SNPs as rows and samples as columns.
107 | Two alternative formats: (1) text file of SNPs by samples (file extension and column separators recognized automatically) read using \code{\link[data.table]{fread}}; or (2) duet of \code{EIGENSTRAT} files (see \url{https://reich.hms.harvard.edu/software}) using \code{\link[vroom]{vroom_fwf}}, including a genotype file of SNPs by samples (\code{*.geno}), and a sample file (\code{*.ind}) containing three vectors assigning individual samples to unique user-predefined groups (populations), sexes (or other user-defined descriptor) and alphanumeric identifiers.
108 | For \code{EIGENSTRAT}, vector \code{sample_group} assigns samples to groups retrievable from column 3 of file \code{*.ind}.
109 | SNPs with zero variance removed prior to SVD to optimize computation time and avoid undefined values if \code{scaling = "sd"} or \code{"drift"}.\cr
110 |
111 | Users can select subsets of samples or SNPs by introducing a vector including column numbers for samples (\code{sample_remove}) and/or row numbers for SNPs (\code{snp_remove}) to be removed from computations.
112 | Function stops if the final number of SNPs is 1 or 2.
113 | \code{EIGENSOFT} was conceived for the analysis of human genes, so its SMARTPCA suite accepts 22 (autosomal) chromosomes by default.
114 | If >22 chromosomes are provided and the internal parameter \code{numchrom} is not set to the target number of chromosomes of interest, SMARTPCA automatically subsets chromosomes 1 to 22.
115 | In contrast, \code{smart_permanova} accepts any number of autosomes with or without the sex chromosomes from an \code{EIGENSTRAT} file.\cr
116 | }
117 | \examples{
118 | # Path to example genotype matrix "dataSNP"
119 | pathToGenoFile = system.file("extdata", "dataSNP", package = "smartsnp")
120 |
121 | # Assign 50 samples to each of two groups
122 | my_groups <- as.factor(c(rep("A", 50), rep("B", 50)))
123 |
124 | # Run PERMANOVA
125 | permanovaR <- smart_permanova(snp_data = pathToGenoFile, sample_group = my_groups)
126 |
127 | # Extract summary table assigning samples to groups
128 | permanovaR$permanova.samples
129 |
130 | # Extract PERMANOVA table
131 | permanovaR$permanova.global_test
132 |
133 | # Plot means of squares per group
134 | #run pca with truncated SVD (PCA 1 x PCA 2)
135 | pcaR1 <- smart_pca(snp_data = pathToGenoFile, sample_group = my_groups)
136 | #compute Euclidean inter-sample distances in PCA space (triangular matrix)
137 | snp_eucli <- vegan::vegdist(pcaR1$pca.sample_coordinates[,c("PC1","PC2")], method = "euclidean")
138 | #run PERMANOVA
139 | permanova <- vegan::adonis(formula = snp_eucli ~ my_groups, permutations = 9999)
140 | #extract meanSqs (groups versus residuals)
141 | meanSqs <- as.matrix(t(permanova$aov.tab$MeanSqs[1:2]))
142 | colnames(meanSqs) <- c("Groups", "Residuals")
143 | #two horizontal plots
144 | oldpar <- par(mfrow = c(2,1), oma = c(0,5,0.1,0.1), lwd = 2)
145 | barplot(meanSqs, horiz = TRUE, main = "PERMANOVA mean of squares",
146 | cex.names = 2, cex.main = 2, col = c("grey40"))
147 | #run ANOSIM
148 | anosimD <- vegan::anosim(snp_eucli, my_groups, permutations = 999)
149 | #remove outputs for clean plotting
150 | #anosimD[2] <- ""; anosimD[5] <- ""
151 | par(mar = c(5, 0.1, 3.5, 0.1))
152 | plot(anosimD, xlab = "", ylab = "distance/similarity ranks",
153 | main = "Inter-sample proximity ranks", cex.main =2, cex.axis = 2,
154 | col = c("cyan", "red", "blue"))
155 | par(oldpar)
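
# Illustrative sketch of correction for multiple testing (see Details) using
# stats::p.adjust on hypothetical pairwise p values. toy_p is made-up data, and this is
# not necessarily how smart_permanova computes its corrected p values internally.
toy_p <- c(AvsB = 0.01, AvsC = 0.02, BvsC = 0.04)
p.adjust(toy_p, method = "holm") # Holm correction (the default pairwise_method)
p.adjust(toy_p, method = "bonferroni") # Bonferroni correction, more conservative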
156 |
157 | }
158 | \references{
159 | Anderson, M. J. (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26, 32-46.\cr
160 | Anderson, M. J. (2004). PERMANOVA_2factor: a FORTRAN computer program for permutational multivariate analysis of variance (for any two-factor ANOVA design) using permutation tests (Department of Statistics, University of Auckland, New Zealand).\cr
161 | Anderson, M. J. & D. C. I. Walsh (2013) PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecological Monographs, 83, 557-574.\cr
162 | Gower, J. C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325-338.\cr
163 | McArdle, B. H. & M. J. Anderson (2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology, 82, 290-297.\cr
164 | Patterson, N., A. L. Price and D. Reich (2006) Population structure and eigenanalysis. PLoS Genetics, 2, e190.\cr
165 | Warton, D. I., S. T. Wright and Y. Wang (2012) Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3, 89-101.
166 | }
167 | \seealso{
168 | \code{\link[vegan]{adonis}} (package \bold{vegan}),
169 | \code{\link[Rfast]{Dist}} (package \bold{Rfast}),
170 | \code{\link[data.table]{fread}} (package \bold{data.table}),
171 | \code{\link[vegan]{vegdist}} (package \bold{vegan}),
172 | \code{\link[vroom]{vroom_fwf}} (package \bold{vroom})
173 | }
174 |
--------------------------------------------------------------------------------