├── LICENSE ├── README.md └── rscripts ├── calc_EHHS.R ├── calc_LD.R ├── calc_allele_sharing.R ├── calc_hwe_chisq.R ├── calc_hwe_fisher.R ├── calc_iES.R ├── calc_neiFis_multispop.R ├── calc_neiFis_onepop.R ├── calc_snp_stats.R ├── calc_wcFst_spop_pairs.R ├── calc_wcFstats.R ├── exampleI.ASdist.nj.png ├── exampleI.R ├── exampleI_data.RData ├── exampleI_functions.RData ├── geno_to_allelecnt.R ├── gwas_lm.R ├── plclust_in_colour.R ├── plot_marker_lox.R ├── plot_markers_by_set.R └── simgeno.R /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 2, June 1991 3 | 4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc., 5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 8 | 9 | Preamble 10 | 11 | The licenses for most software are designed to take away your 12 | freedom to share and change it. By contrast, the GNU General Public 13 | License is intended to guarantee your freedom to share and change free 14 | software--to make sure the software is free for all its users. This 15 | General Public License applies to most of the Free Software 16 | Foundation's software and to any other program whose authors commit to 17 | using it. (Some other Free Software Foundation software is covered by 18 | the GNU Lesser General Public License instead.) You can apply it to 19 | your programs, too. 20 | 21 | When we speak of free software, we are referring to freedom, not 22 | price. Our General Public Licenses are designed to make sure that you 23 | have the freedom to distribute copies of free software (and charge for 24 | this service if you wish), that you receive source code or can get it 25 | if you want it, that you can change the software or use pieces of it 26 | in new free programs; and that you know you can do these things. 27 | 28 | To protect your rights, we need to make restrictions that forbid 29 | anyone to deny you these rights or to ask you to surrender the rights. 30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 292 | 293 | {description} 294 | Copyright (C) {year} {fullname} 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 331 | 332 | {signature of Ty Coon}, 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 340 | 341 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Handy R functions for genetics research 2 | 3 | Originally hosted at http://evachan.org/rscripts.html, these R functions were initially written for my own research. Throughout the years, I've updated them (and fixed bugs) based on suggestions from users. If you find these useful in your own research, please cite this git repository. If you spot bugs or have suggestions for improvement, please let me know. Or, better, submit a pull request :) 4 | 5 | [Statistical Functions] (https://github.com/ekfchan/evachan.org-Rscripts#statistical-functions) 6 | [Plotting Functions] (https://github.com/ekfchan/evachan.org-Rscripts#plotting-functions) 7 | [Example Data] (https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-data-and-usage) 8 | 9 | ###The *geno* object 10 | 11 | The geno object (see [exampleI.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI.R)) on which all analysis depend was originally written for diploid data. All R functions in the repository should be applicable to diploid data. In some cases, should also be applicable to multi-allelic data. In fact, when these functions were written, they were geared towards genotyping array data. Keep that in mind when using these scripts. 12 | 13 | 14 | ###Statistical Functions 15 | 16 | [geno_to_allelecnt.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/geno_to_allelecnt.R) 17 | A function to convert biallelic unphased SNP genotypes, such as {AA,CC,GG,TT,AC,AG,AT,CG,CT,GT}, to number of copies/counts {0,1,2} of the reference (or arbitrary) allele. 18 | [See [example II](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-ii) and simgeno.R for example and usage.] 19 | [simgeno.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/simgeno.R) 20 | Very simple function to generate a biallelic unphased SNP genotype matrix in the format {AA,CC,GG,TT,AC,AG,AT,CG,CT,GT}. Used predominantly to test [geno_to_allelecnt.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/geno_to_allelecnt.R). 21 | [See [example II](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-ii) for usage and purpose.] 22 | [calc_EHHS.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_EHHS.R) 23 | A function to calculate the normalised homozygosity between the i-th and j-th loci, EHHS(geno)i,j, for a given chromosome / linkage group ([Tang, Thornton, Stoneking 2007](http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050171)) 24 | [calc_iES.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_iES.R) 25 | A function to calculate the integrated EHHS statistic, iES, as described in Tang, Thornton and Stoneking (2007). You'd probably want to [calculate the EHHS](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_EHHS.R) first! 26 | [calc_LD.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_LD.R) 27 | Given a biallelic genotype matrix, calculates one or more measures of linkage disequilibrium between all locus-pairs. The available LD measures include: [D](http://www.jstor.org/sici?sici=0014-3820%28196012%2914%3A4%3C458%3ATEDOCP%3E2.0.CO%3B2-4), [D'](http://www.genetics.org/cgi/reprint/49/1/49), [r2](http://www.springerlink.com/content/g6449ph0v65t5w87/), [X2 (chi-square)](http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WXD-4F1SCHP-33&_user=4421&_rdoc=1&_fmt=&_orig=search&_sort=d&_docanchor=&view=c&_acct=C000059598&_version=1&_urlVersion=0&_userid=4421&md5=e0ec8112b03fb20f4212ae2b3e7d9fee), [X2' (chi-square-prime)](http://www.genetics.org/cgi/content/abstract/86/1/227). 28 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 29 | [calc_snp_stats.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_snp_stats.R) 30 | A function to calculate basic SNP stats, including: allele frequency (p), MAF (minor allele frequency), MGF (minor genotype frequency), and tests for deviation from HWE (X2 test and Fisher's Exact test). 31 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 32 | [gwas_lm.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/gwas_lm.R) 33 | Performs single-locus (SNP) genome-wide association tests for one or more traits simultaneously under one or more of five inheritance models (additive, co-dominance, dominance, recessive, over-dominance) using linear regression. 34 | [calc_hwe_fisher.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_hwe_fisher.R) 35 | A script to test for deviation from HWE using Fisher's Exact test. This test is also incorporated into [calc_snp_stats.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_snp_stats.R). 36 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 37 | [calc_hwe_chisq.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_hwe_chisq.R) 38 | A script to test for deviation from HWE using Pearson's Chi-Squared test. This test is also incorporated into calc_snp_stats.R. 39 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 40 | [calc_neiFis_multispop.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_neiFis_multispop.R) 41 | A script to calculate inbreeding coefficients, [Fis](http://www3.interscience.wiley.com/journal/119623803/abstract), for each sub-population using a given set of SNP markers. 42 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 43 | [calc_neiFis_onepop.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_neiFis_onepop.R) 44 | A script to calculate inbreeding coefficients, [Fis](http://www3.interscience.wiley.com/journal/119623803/abstract), for a given population using a given set of SNP markers. 45 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 46 | [calc_wcFstats.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_wcFstats.R) 47 | A script to estimate the variance components and fixation indices as described in [Weir & Cockerham 1984 Evolution 38(6) : 1358-1370](http://www.jstor.org/stable/2408641?&Search=yes&term=weir&term=cockerham&list=hide&searchUri=%2Faction%2FdoBasicSearch%3FQuery%3Dweir%2Bcockerham%26jc%3Dj100004%26wc%3Don%26Search.x%3D0%26Search.y%3D0%26Search%3DSearch&item=2&ttl=275&returnArticleService=showArticle). 48 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage. ] 49 | [calc_wcFst_spop_pairs.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_wcFst_spop_pairs.R) 50 | A script to estimate Fst (theta) values for each pair of sub-populations using the method of [Weir & Cockerham 1984 Evolution 38(6): 1358-1370](http://www.jstor.org/stable/2408641?&Search=yes&term=weir&term=cockerham&list=hide&searchUri=%2Faction%2FdoBasicSearch%3FQuery%3Dweir%2Bcockerham%26jc%3Dj100004%26wc%3Don%26Search.x%3D0%26Search.y%3D0%26Search%3DSearch&item=2&ttl=275&returnArticleService=showArticle). 51 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 52 | [calc_allele_sharing.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/calc_allele_sharing.R) 53 | Calculates allele sharing distances between pairs of individuals (c.f. [Gao & Stramer 2007 BMC Genetics 8:34](http://www.biomedcentral.com/1471-2156/8/34)). 54 | [See [example I](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/README.md#example-i) for example data and usage.] 55 | 56 | 57 | ###Plotting Functions### 58 | 59 | [plclust_in_colour.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/plclust_in_colour.R) 60 | A modification of (wrapper to) plclust for plotting hclust (hierarchical cluster) objects with coloured leaf labels. 61 | [plot_marker_lox.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/plot_marker_lox.R) 62 | Generates a visual representation of the genetic positions of a set of markers. 63 | [plot_markers_by_set.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/plot_markers_by_set.R) 64 | A function to plot sets of markers on a map where the markers are coloured based on a defined variable. 65 | 66 | 67 | ###Example Data and Usage 68 | 69 | #####Example I 70 | [exampleI.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI.R) 71 | Download and read exampleI.R first. This script contains several very simple lines of codes for creating a geno and a subpop object, and their usages in the following scripts: 72 | ```R 73 | calc_wcFstats(geno, subpop) 74 | calc_wcFst_spop_pairs(geno, subpop) 75 | calc_neiFis_onepop(geno) 76 | calc_snp_stats(geno) 77 | calc_neiFis_multispop(geno,subpop) 78 | calc_LD(geno) 79 | calc_allele_sharing(geno) 80 | calc_hwe_chisq(geno) 81 | calc_hwe_fisher(geno) 82 | ``` 83 | [exampleI_data.RData](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI_data.RData) 84 | A R workspace containing an instance of a _geno_ and _subpop_ objects used in [exampleI.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI.R); i.e. the actual datasets corresponding to the outputs in [exampleI.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI.R). 85 | [exampleI_functions.RData](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI_functions.RData) 86 | A R workspace containing all functions used in [exampleI.R](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI.R). 87 | 88 | #####Example II 89 | ```R 90 | geno <- simgeno() 91 | alleleCount <- geno_to_allelecnt(geno) 92 | ``` 93 | 94 | ![exampleI.ASdist.nj](https://github.com/ekfchan/evachan.org-Rscripts/blob/master/rscripts/exampleI.ASdist.nj.png "NJ Tree of AS Distance Matrix") 95 | **Figure: NJ tree from Example I** 96 | 97 | -------------------------------------------------------------------------------- /rscripts/calc_EHHS.R: -------------------------------------------------------------------------------- 1 | calc_EHHS <- function(geno, thresh=0.1) { 2 | 3 | ## March 2010 4 | ## Eva KF Chan 5 | ## http://evachan.org 6 | ## 7 | ## Function to calculate the EHHS(geno)i,j values for a given chromosome as described in: 8 | ## Tang K, Thornton KR, Stoneking M (2007) A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome . PLoS Biol 5(7): e171 doi:10.1371/journal.pbio.0050171 9 | ## 10 | ## EHHS is the haplotype homozygosity between sites i and j, normalised by the homozygosity at site i: 11 | ## EHHS(geno)i,j = sum_k=1..n{ Ik,(hap ij) [1 if hap1 = hap2] } / suml=1..n{ I_i,(alle i) [1 if alle1 = alle2]} 12 | ## where: 13 | ## Ik,(hap ij) = identity of the two haplotypes between site i & j in one individual 14 | ## Il,(alle i) = identity of the alleles at site i 15 | ## 16 | ## EHHS_(geno)i,j = ( sum(k=1..n) {I_k,(hap ij) [1 if hap1 = hap2] ) 17 | ## ------------------------------------------------- 18 | ## ( sum(l=1..n) {I_l,(alle i) [1 if alle1 = alle2] ) 19 | ## 20 | ## = number of individuals where hap1 = hap2 21 | ## --------------------------------------------------- 22 | ## number of individuals where alle1 = alle2 at site i 23 | ## Parameters: 24 | ## geno: matrix of genotypes (0,1,2,NA) of size marker (row) by sample (column) 25 | ## <> 26 | ## thresh: the threshold [0,1] to which EHHS is calculated for all j moving away 27 | ## from site i until EHHS < thresh (0.1 by default) 28 | ## 29 | ## Output: 30 | ## Returns a matrix of size MxM (M=number of markers=nrow(geno)) of EHHS values 31 | ## calcualted for all i-th marker (row) to each j-th marker (colum) until EHH < thresh 32 | 33 | geno[geno==2] <- 0 ## 0=homozygous; 1=heterozygous 34 | M = nrow(geno) 35 | EHH <- matrix(NA, ncol=M, nrow=M, dimnames=list(rownames(geno),rownames(geno))) 36 | 37 | for(i in 1:M) { 38 | initial_list = which(geno[i,]==0) 39 | Ii = length(initial_list) 40 | EHH[i,i] = 1 41 | 42 | ## left-flank 43 | cur_list = initial_list 44 | j = i-1 45 | while( j >= 1 ) { 46 | tmp_list = which(geno[j,]==0) 47 | cur_list = intersect( cur_list, tmp_list ) 48 | Ij = length(cur_list) 49 | cur.EHH = Ij/Ii 50 | if (is.na(cur.EHH) | cur.EHH < thresh) { break } else { 51 | EHH[i,j] = cur.EHH 52 | } 53 | j = j-1 54 | } 55 | 56 | ## right-flank 57 | cur_list = initial_list 58 | j = i + 1 59 | while( j <= M ) { 60 | tmp_list = which(geno[j,]==0) 61 | cur_list = intersect( cur_list, tmp_list ) 62 | Ij = length(cur_list) 63 | cur.EHH = Ij/Ii 64 | if (is.na(cur.EHH) | cur.EHH < thresh) { break } else { 65 | EHH[i,j] = cur.EHH 66 | } 67 | j = j+1 68 | } 69 | } 70 | 71 | return(EHH) 72 | 73 | } 74 | -------------------------------------------------------------------------------- /rscripts/calc_LD.R: -------------------------------------------------------------------------------- 1 | calc_LD <- function( geno, inds=1:nrow(geno), get.D=T, get.Dprime=F, get.rsq=T, get.chisq=T, get.chisq_prime=F ) { 2 | ### Eva KF Chan 3 | ### 23 Feb 2009 4 | ### Last Modified: 29 Nov 2013 5 | ### 6 | ### Calculates D, D', r, chisq, chisq' 7 | ### Given locus A with allele frequencies pA & pa and locus B with allele frequencies pB and pb 8 | ### Let pAB be the allele frequencies of allele A/B. As the AB/ab is indistinguishable from Ab/aB, we assume equal probability for either assortment; i.e. For individuals with Aa at locus A and Bb at locus b, we assume p(AB/ab)=p(Ab/aB)=0.5. NOTE that this is assumption is part of the reason why this function is relatively fast, compare to, for example, the LD() function in R/genetics which estimates p(AB) using a maximum likelihood approach. 9 | ### D = pAB - pApB 10 | ### D' = { D/Dmax for D>=0 , where Dmax = min( pApb, papB ) 11 | ### = { D/Dmin for D<0 , where Dmin = max(-pApB,-papb ) 12 | ### r = D / sqrt( pApapBpb ) 13 | ### chi2 = (2ND^2) / (pApapBpb) 14 | ### chi2'= chisq / ( 2N(l-1) ) where l=min(k,m) 15 | ### N = # individuals 16 | ### k & m = # alelles in locus A & B 17 | ### 18 | ### Arguments: 19 | ### geno: m x n matrix of genotypes {0,1,2,NA} where m=number of markers, n=number of individuals 20 | ### inds: integer vector of marker indices (rows of geno) for subseting markers for LD calculation 21 | ### get.D: {T,F} Boolean value to indicate whether the D measure is to be calculated 22 | ### get.Dprime: {T,F} Boolean value to indicate whether the D' measure is to be calculated 23 | ### get.rsq: {T,F} Boolean value to indicate whether the r^2 measure is to be calculated 24 | ### get.chisq: {T,F} Boolean value to indicate whether the chi2 measure is to be calculated 25 | ### get.chisq_prime: {T,F} Boolean value to indicate whether the chi2' measure is to be calculated 26 | 27 | 28 | if( all(!get.D, !get.Dprime, !get.rsq, !get.chisq, !get.chisq_prime) ) { stop('Must request at least one LD statistic.\n') } 29 | D_prime <- rsq <- chisq <- chisq_prime <- df <- NULL 30 | D <- matrix(NA, nrow=nrow(geno), ncol=length(inds)) 31 | if( get.Dprime ) { D_prime <- matrix(NA, nrow=nrow(geno), ncol=length(inds)) } 32 | if( get.rsq ) { rsq <- matrix(NA, nrow=nrow(geno), ncol=length(inds)) } 33 | if( get.chisq | get.chisq_prime ) { 34 | chisq <- matrix(NA, nrow=nrow(geno), ncol=length(inds)) 35 | df <- matrix(NA, nrow=nrow(geno), ncol=length(inds)) 36 | if( get.chisq_prime ) { chisq_prime <- matrix(NA, nrow=nrow(geno), ncol=length(inds)) } 37 | } 38 | 39 | if( all(as.logical(!is.na(geno))) ) { #no missing data 40 | tmp.geno <- geno ## genotypes at locus A 41 | N <- ncol(tmp.geno) #number of individuals (diploid is assumed) 42 | pA <- ((2*apply(tmp.geno==0,1,sum,na.rm=T))+apply(tmp.geno==1,1,sum,na.rm=T)) / (2*N) 43 | pa <- 1-pA 44 | for(i in 1:length(inds)) { 45 | tmp.Bgeno <- matrix(tmp.geno[inds[i],],nrow=nrow(tmp.geno),ncol=ncol(tmp.geno),byrow=T) ## genotypes at locus B 46 | pB <- ((2*apply(tmp.Bgeno==0,1,sum,na.rm=T))+apply(tmp.Bgeno==1,1,sum,na.rm=T)) / (2*N) 47 | pb <- 1-pB 48 | pAB <- ((apply(tmp.geno==0 & tmp.Bgeno==0, 1, sum,na.rm=T)*2) + (apply(tmp.geno==1 & tmp.Bgeno==0, 1, sum,na.rm=T)) + (apply(tmp.geno==0 & tmp.Bgeno==1, 1, sum,na.rm=T)) + (apply(tmp.geno==1 & tmp.Bgeno==1, 1, sum,na.rm=T)*0.5)) / (2*N) 49 | D[,i] <- pAB-(pA*pB) 50 | if( get.Dprime ) { 51 | Dmax <- pmin(pA*pb, pa*pB) 52 | Dmin <- pmax(-pA*pB, -pa*pb) 53 | pos <- (D[,i]>=0) 54 | D_prime[which(pos),i] <- D[which(pos),i] / Dmax[which(pos)] 55 | D_prime[which(!pos),i] <- D[which(!pos),i] / Dmin[which(!pos)] 56 | } 57 | if( get.rsq ) { 58 | rsq[,i] <- (D[,i]*D[,i]) / (pA*pa*pB*pb) 59 | } 60 | if( get.chisq | get.chisq_prime ) { 61 | chisq[,i] <- (2*N*D[,i]*D[,i]) / (pA*pa*pB*pb) 62 | if( get.chisq_prime ) { 63 | k=2-as.integer(pA==0|pa==0) 64 | m=2-as.integer(pB==0|pb==0) 65 | #df[,i] <- (k-1)*(m-1) 66 | chisq_prime[,i] <- chisq[,i] / (2*N*pmin(k,m)) 67 | } 68 | } 69 | } 70 | } else { #at least one missing data point in geno 71 | for(i in 1:length(inds)) { 72 | tmp.geno <- geno[,!is.na(geno[inds[i],])] ## genotypes at locus A; i.e. all loci, but excluding samples with missing data at lcous B (i) 73 | tmp.Bgeno <- matrix(tmp.geno[inds[i],],nrow=nrow(tmp.geno),ncol=ncol(tmp.geno),byrow=T) ## genotypes at locus B (i.e. i-th locus); pulling from tmp.geno, so samples with missing data at i-th locus (B) will also be excluded 74 | tmp.Bgeno[is.na(tmp.geno)] <- NA #anytime where locus A (i.e. all non i-th locus) is missing, set as missing 75 | N <- rowSums(!is.na(tmp.geno)) 76 | pA <- ((2*apply(tmp.geno==0,1,sum,na.rm=T))+apply(tmp.geno==1,1,sum,na.rm=T)) / (2*N) 77 | pB <- ((2*apply(tmp.Bgeno==0,1,sum,na.rm=T))+apply(tmp.Bgeno==1,1,sum,na.rm=T)) / (2*N) 78 | pa <- 1-pA 79 | pb <- 1-pB 80 | pAB <- ((apply(tmp.geno==0 & tmp.Bgeno==0, 1, sum,na.rm=T)*2) + (apply(tmp.geno==1 & tmp.Bgeno==0, 1, sum,na.rm=T)) + (apply(tmp.geno==0 & tmp.Bgeno==1, 1, sum,na.rm=T)) + (apply(tmp.geno==1 & tmp.Bgeno==1, 1, sum,na.rm=T)*0.5)) / (2*N) 81 | D[,i] <- pAB-(pA*pB) 82 | if( get.Dprime ) { 83 | Dmax <- pmin(pA*pb, pa*pB) 84 | Dmin <- pmax(-pA*pB, -pa*pb) 85 | pos <- (D[,i]>=0) 86 | D_prime[which(pos),i] <- D[which(pos),i] / Dmax[which(pos)] 87 | D_prime[which(!pos),i] <- D[which(!pos),i] / Dmin[which(!pos)] 88 | } 89 | if( get.rsq ) { 90 | rsq[,i] <- (D[,i]*D[,i]) / (pA*pa*pB*pb) 91 | } 92 | if( get.chisq | get.chisq_prime ) { 93 | chisq[,i] <- (2*N*D[,i]*D[,i]) / (pA*pa*pB*pb) 94 | k=2-as.integer(pA==0|pa==0) 95 | m=2-as.integer(pB==0|pb==0) 96 | df[,i] <- (k-1)*(m-1) 97 | if( get.chisq_prime ) { 98 | chisq_prime[,i] <- chisq[,i] / (2*N*pmin(k,m)) 99 | } 100 | } 101 | } 102 | } 103 | if( !get.D ) { D <- NULL } 104 | if( !get.chisq ) { chisq <- NULL } 105 | return(list(D=D, Dprime=D_prime, rsq=rsq, chisq=chisq, chisq_prime=chisq_prime, chisq_df=df)) 106 | } 107 | -------------------------------------------------------------------------------- /rscripts/calc_allele_sharing.R: -------------------------------------------------------------------------------- 1 | calc_allele_sharing <- function(geno) 2 | { 3 | ## Eva KF Chan 4 | ## http://evachan.org 5 | ## 6 | ## A script to calculate allele sharing distance between pairs of individuals 7 | ## (c.f. Gao & Stramer 2007 BMC Genetics 8:34) 8 | ## D_ij = (1/L) * sum(d_ij(l)) 9 | ## where d_ij(l) = { 0 if individuals i & j have 2 alleles in common at l-th locus 10 | ## { 1 if individuals i & j have only 1 allele in common at l-th locus 11 | ## { 2 if individuals i & j have no allele in common at l-th locus 12 | ## and L = number of SNP loci. 13 | ## 14 | ## Input: 15 | ## geno: SNP-by-sample matrix of genotypes {0,1,2}; any other values are ignored. 16 | ## 17 | ## Output: 18 | ## symmetrical matix of allele-sharing distance between each pair of individuals (columns of geno) 19 | ## 20 | ## NOTE:: if one wants to use this distance matrix to obtain Ward's Minimum Variance 21 | ## Hierarchical Clustering as in Gao & Stramer 2007, simply use the following 22 | ## command: 23 | ## plot(allele.sharing.hclust <- hclust(as.dist(allele.sharing), method="ward")) 24 | 25 | n <- ncol(geno) ## number of individuals 26 | d <- matrix(NA, ncol=n, nrow=n, dimnames=list(colnames(geno),colnames(geno))) ## distance 27 | 28 | for(i in 1:n) 29 | { 30 | cat(i,"\n") 31 | z <- abs(geno - geno[,i]) 32 | d[,i] <- apply(z, 2, mean, na.rm=T) 33 | } 34 | 35 | d 36 | 37 | } 38 | -------------------------------------------------------------------------------- /rscripts/calc_hwe_chisq.R: -------------------------------------------------------------------------------- 1 | calc_hwe_chisq <- function(geno) { 2 | 3 | ## Copyright Eva Chan 2008 4 | ## http://evachan.org 5 | ## 6 | ## This is a function for testing the significance of deviation from HWE using Pearson's Chi-Squared test. 7 | ## chisq = [((Obs(AA)-Exp(AA))^2)/Exp(AA)] + [((Obs(Aa)-Exp(Aa))^2)/Exp(Aa)] + [((Obs(aa)-Exp(aa))^2)/Exp(aa)] 8 | ## df = 1 (# phenotypes - # alleles; i.e. 3 genotypes - 2 alleles) 9 | ## 10 | ## Input: 11 | ## geno: SNP-by-sample matrix of genotypes {0,1,2}; any other values are ignored. 12 | ## 13 | ## Output: two column matrix of Chi-square values and corresponding P-values for each SNP in geno. 14 | 15 | ## assign all non {0,1,2} to NA 16 | geno[(geno!=0) & (geno!=1) & (geno!=2)] <- NA 17 | geno <- as.matrix(geno) 18 | 19 | n0 <- apply(geno==0,1,sum,na.rm=T) 20 | n1 <- apply(geno==1,1,sum,na.rm=T) 21 | n2 <- apply(geno==2,1,sum,na.rm=T) 22 | n <- n0+n1+n2 23 | obs <- cbind(n0, n1, n2) 24 | p <- ((2*n0)+n1)/(2*n) 25 | q <- (1-p) 26 | expected <- cbind(p*p, 2*p*q, q*q) 27 | expected <- expected*n 28 | chisq <- (obs-expected) 29 | chisq <- (chisq*chisq) /expected 30 | chisq <- apply(chisq,1,sum) 31 | chisq.p <- 1-pchisq(chisq,df=1) 32 | 33 | res <- cbind(chisq=chisq, chisq.p=chisq.p) 34 | rownames(res) <- rownames(geno) 35 | res 36 | 37 | } 38 | -------------------------------------------------------------------------------- /rscripts/calc_hwe_fisher.R: -------------------------------------------------------------------------------- 1 | calc_hwe_fisher <- function(geno) { 2 | 3 | ## Copyright Eva Chan 2008 4 | ## htp://evachan.org 5 | ## 6 | ## This is a function for testing the significance of deviation from HWE 7 | ## using Fisher's Exact test. 8 | ## Note that the observed number of Aa and aA genotypes are identical if 9 | ## their sum is even, else Aa is always one more than aA. 10 | ## 11 | ## Input: 12 | ## geno: SNP-by-sample matrix of genotypes {0,1,2}; any other values are ignored. 13 | ## 14 | ## Output: two column matrix of Odds Ratio and corresponding P-values for each SNP in geno. 15 | 16 | ## assign all non {0,1,2} to NA 17 | geno[(geno!=0) & (geno!=1) & (geno!=2)] <- NA 18 | geno <- as.matrix(geno) 19 | 20 | n0 <- apply(geno==0, 1, sum, na.rm=T) 21 | n1 <- apply(geno==1, 1, sum, na.rm=T) 22 | n2 <- apply(geno==2, 1, sum, na.rm=T) 23 | 24 | z <- cbind(n0, ceiling(n1/2), floor(n1/2), n2) 25 | z <- lapply( split( z, 1:nrow(z) ), matrix, ncol=2 ) 26 | z <- lapply( z, fisher.test ) 27 | 28 | res <- cbind( odds.ratio = as.numeric(unlist(lapply(z, "[[", "estimate"))), 29 | p.values = as.numeric(unlist(lapply(z, "[[", "p.value"))) ) 30 | rownames(res) <- rownames(geno) 31 | res 32 | 33 | } 34 | -------------------------------------------------------------------------------- /rscripts/calc_iES.R: -------------------------------------------------------------------------------- 1 | calc_iES <- function(EHHS, lox) { 2 | 3 | ## Eva KF Chan 4 | ## Nov 2008 5 | ## http://evachan.org 6 | ## 7 | ## A function to calculate the iES statistics as per Tang K, Thornton KR, Stoneking M (2007) A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome . PLoS Biol 5(7): e171 doi:10.1371/journal.pbio.0050171 8 | ## iES integrates the area under the curve of EHHS against distance: 9 | ## iES_i = sum_for_j_from_a+1_to_b = { (EHHS_i,j-1 + EHHS_i,j) * (Pos_j - Pos_j-1) } / 2 10 | ## where: 11 | ## a & b are the two ending positions where EHHS < X 12 | ## Pos_j is the physical position of site j 13 | ## 14 | ## Parameters: 15 | ## EHHS: matrix of size MxM (M=number of markers=nrow(geno)) of EHHS values calcualted for all i-th marker (row) to each j-th marker (colum) until EHH < thresh 16 | ## <> 17 | ## lox: genomic location of the markers (row) in EHHS; THIS SHOULD BE IN SAME ORDER AS EHHS 18 | 19 | if( nrow(EHHS) != length(lox) ) { stop("Number of positions given does not agree with number of markers.\n") } 20 | 21 | M <- length(lox) 22 | iES <- rep(NA, M) 23 | 24 | x = lox[2:M] - lox[1:(M-1)] 25 | for(i in 1:M) { 26 | y = EHHS[i,1:(M-1)] + EHHS[i,2:M] 27 | if( !all(is.na(y)) ) { 28 | iES[i] = sum(y*x, na.rm=T) / 2 29 | }; rm(y) 30 | } 31 | 32 | iES 33 | 34 | } 35 | -------------------------------------------------------------------------------- /rscripts/calc_neiFis_multispop.R: -------------------------------------------------------------------------------- 1 | calc_neiFis_multispop <- function (geno, spop) { 2 | 3 | ## Copyright Eva Chan 2008 4 | ## eva@evachan.org 5 | ## 6 | ## A script to calculate inbreeding coefficients, Fis (Nei 1977 Ann Hum Genet 41:225-233), 7 | ## for each sub-population from a given set of SNP markers. 8 | ## 9 | ## Input: 10 | ## geno: SNP-by-sample matrix of genotypes {0,1,2}; any other values are ignored. 11 | ## spop: a factor indicating the sub-population to which the corresponding samples 12 | ## (columns) in geno belong. 13 | ## 14 | ## Output: list of 15 | ## 1) aveloc: numeric vector of Fis averaged over all loci for each sub-population, 16 | ## the total population (2nd last value), 17 | ## and average of total population (last value) 18 | ## 2) perloc: matrix of Fis per SNP (row) for each sub-population, 19 | ## the total population (2nd last column), 20 | ## and average of total population (last column) 21 | 22 | ## assign all non {0,1,2} to NA 23 | geno[(geno!=0) & (geno!=1) & (geno!=2)] <- NA 24 | geno <- as.matrix(geno) 25 | 26 | m = nrow(geno) ## number of markers 27 | N = ncol(geno) ## number of samples 28 | 29 | if( length(spop) != N ) { stop( "Number of samples with genotypes does not match provided number of spop.\n" ) } 30 | spop <- as.factor(as.character(spop)) 31 | unique.spop <- levels(spop) 32 | nspop <- length(unique.spop) 33 | 34 | ## determine numbers of each genotypes for each spop at each locus 35 | nNA <- nAA <- nAa <- naa <- matrix(NA, ncol=nspop, nrow=m, dimnames=list(NULL,unique.spop)) 36 | for(i in 1:nspop) { 37 | inds <- which(spop == unique.spop[i]) 38 | nAA[,i] <- apply(geno[,inds]==0,1,sum,na.rm=T) 39 | nAa[,i] <- apply(geno[,inds]==1,1,sum,na.rm=T) 40 | naa[,i] <- apply(geno[,inds]==2,1,sum,na.rm=T) 41 | } 42 | n <- nAA + nAa + naa 43 | nAA <- cbind(nAA, total=apply(nAA[,unique.spop],1,sum)) 44 | nAa <- cbind(nAa, total=apply(nAa[,unique.spop],1,sum)) 45 | naa <- cbind(naa, total=apply(naa[,unique.spop],1,sum)) 46 | n <- cbind(n, total=apply(n[,unique.spop],1,sum)) 47 | 48 | Ho <- (nAa/n) ## observed het 49 | p <- ((2*nAA)+nAa)/(2*n) ## allele freq 50 | He <- (n/(n-1)) * ((2*p*(1-p)) - (Ho/(2*n))) ## Nei's expected het 51 | 52 | s <- apply(!is.na(n[,unique.spop]),1,sum) ## number of spop per marker 53 | n_tilda <- s/apply((1/n[,unique.spop]),1,sum) ## harmonic mean of sample sizes 54 | Ho <- cbind(Ho, average=(apply(Ho[,unique.spop],1,sum,na.rm=T)/s)) ## Ho averged over samples 55 | He <- cbind(He, average=( (n_tilda/(n_tilda-1)) * ((apply(2*p[,unique.spop]*(1-p[,unique.spop]),1,sum,na.rm=T)/s) - (Ho[,"average"]/(2*n_tilda))) )) ## Nei's averaged He 56 | 57 | ncFis <- 1 - (apply(Ho,2,mean,na.rm=T) / apply(He,2,mean,na.rm=T)) 58 | ncFis.perloc <- 1 - (Ho/He) 59 | 60 | list(aveloc = ncFis, perloc = ncFis.perloc) 61 | 62 | } 63 | -------------------------------------------------------------------------------- /rscripts/calc_neiFis_onepop.R: -------------------------------------------------------------------------------- 1 | calc_neiFis_onepop <- function (geno) { 2 | 3 | ## Copyright Eva Chan 2008 4 | ## eva@evachan.org 5 | ## 6 | ## A script to calculate inbreeding coefficients, Fis (Nei 1977 Ann Hum Genet 41:225-233), 7 | ## for total population from a given set of SNP markers. 8 | ## 9 | ## Input: 10 | ## geno: SNP-by-sample matrix of genotypes {0,1,2}; any other values are ignored. 11 | ## 12 | ## Output: list of 13 | ## 1) aveloc: single Fis value averaged over all loci for the given population 14 | ## 2) perloc: numeric vector of Fis per SNP (row) for the given population 15 | 16 | ## assign all non {0,1,2} to NA 17 | geno[(geno!=0) & (geno!=1) & (geno!=2)] <- NA 18 | geno <- as.matrix(geno) 19 | 20 | m = nrow(geno) ## number of markers 21 | N = ncol(geno) ## number of samples 22 | 23 | ## determine numbers of each genotypes for the given pop at each locus 24 | nNA <- apply(is.na(geno),1,sum) 25 | nAA <- apply(geno==0,1,sum,na.rm=T) 26 | nAa <- apply(geno==1,1,sum,na.rm=T) 27 | naa <- apply(geno==2,1,sum,na.rm=T) 28 | n <- nAA + nAa + naa 29 | 30 | Ho <- (nAa/n) ## observed het 31 | p <- ((2*nAA)+nAa)/(2*n) ## allele freq 32 | He <- (n/(n-1)) * ((2*p*(1-p)) - (Ho/(2*n))) ## Nei's expected het 33 | 34 | ncFis <- 1 - (mean(Ho,na.rm=T) / mean(He,na.rm=T)) 35 | ncFis.perloc <- 1 - (Ho/He) 36 | 37 | list(aveloc = ncFis, perloc = ncFis.perloc) 38 | 39 | } 40 | -------------------------------------------------------------------------------- /rscripts/calc_snp_stats.R: -------------------------------------------------------------------------------- 1 | calc_snp_stats <- function(geno) 2 | { 3 | ## Eva KF Chan 4 | ## http://evachan.org 5 | ## 6 | ## Created: 21/08/07 7 | ## Last Modified: 21/10/12 8 | ## 9 | ## Function to calculate basic stats on SNPs, including: allele frequency, MAF, and exact estimate of HWE 10 | ## 11 | ## geno: snp-by-individual matrix of genotypes, {0,1,2}. 12 | ## NOTE:: any other values are ignored 13 | ## 14 | ## OUTPUT: data.frame of 15 | ## n, n0, n1, n2: number of samples with total non-missing genotype, and geno=0,1,or 2 16 | ## p: allele frequency 17 | ## maf & mgf: minor allele & genotype frequencies 18 | ## mono: {T,F} indicating if marker is monomorphic (MAF<0%) 19 | ## loh: {T,F} indicating if marker has loss of heterozygote 20 | ## hwe.chisq & hwe.chisq.p: chi-square test statistic for deviation from HWE and correp p-value 21 | ## hwe.fisher & hwe.fisher.p: Fisher's Exact test statistic for deviation from HWE and correp p-value 22 | ## 23 | 24 | m <- nrow(geno) ## number of snps 25 | n <- ncol(geno) ## number of individuals 26 | 27 | ## assign all non {0,1,2} to NA 28 | geno[(geno!=0) & (geno!=1) & (geno!=2)] <- NA 29 | geno <- as.matrix(geno) 30 | 31 | ## calc_n 32 | n0 <- apply(geno==0,1,sum,na.rm=T) 33 | n1 <- apply(geno==1,1,sum,na.rm=T) 34 | n2 <- apply(geno==2,1,sum,na.rm=T) 35 | 36 | n <- n0 + n1 + n2 37 | 38 | ## calculate allele frequencies 39 | p <- ((2*n0)+n1)/(2*n) 40 | q <- 1 - p 41 | maf <- pmin(p, q) 42 | mgf <- apply(cbind(n0,n1,n2),1,min) / n 43 | 44 | ## HWE: Chi-Square test 45 | obs <- cbind(n0=n0,n1=n1,n2=n2) 46 | exp <- cbind(p*p, 2*p*q, q*q) 47 | exp <- exp*n 48 | chisq <- (obs-exp) 49 | chisq <- (chisq*chisq) /exp 50 | hwe.chisq <- apply(chisq,1,sum) 51 | hwe.chisq.p <- 1-pchisq(hwe.chisq,df=1) 52 | 53 | ## HWE: Fisher's Exact test 54 | z <- cbind(n0, ceiling(n1/2), floor(n1/2), n2) 55 | z <- lapply( split( z, 1:nrow(z) ), matrix, ncol=2 ) 56 | z <- lapply( z, fisher.test ) 57 | hwe.fisher <- as.numeric(unlist(lapply(z, "[[", "estimate"))) 58 | hwe.fisher.p <- as.numeric(unlist(lapply(z, "[[", "p.value"))) 59 | 60 | # MODIFIED 21 Oct 2012: prior to this version, we had "mono=(mgf<0)" instead of "mono<(maf<0)" 61 | res <- data.frame( n=n, n0=n0, n1=n1, n2=n2, p=p, maf=maf, mgf=mgf, 62 | mono=(maf<=0), loh=(n1<=0), 63 | hwe.chisq=hwe.chisq, hwe.chisq.p=hwe.chisq.p, 64 | hwe.fisher=hwe.fisher, hwe.fisher.p=hwe.fisher.p, 65 | stringsAsFactors=F ) 66 | row.names(res) <- row.names(geno) 67 | res 68 | } -------------------------------------------------------------------------------- /rscripts/calc_wcFst_spop_pairs.R: -------------------------------------------------------------------------------- 1 | calc_wcFst_spop_pairs <- function(geno, spop, plot.nj=F) { 2 | 3 | ## Copyright Eva Chan 2008 4 | ## eva@evachan.org 5 | ## 6 | ## A script to estimate Fst (theta) values for each pair of sub-populations 7 | ## using the method of Weir & Cockerham 1984 Evolution 38(6): 1358-1370. 8 | ## 9 | ## Arguments 10 | ## ========= 11 | ## geno: matrix of genotypes with rows corresp. to markers and columns to individuals; 12 | ## notation for genotyeps are {0,1,2} indicating the number of one of the two alleles 13 | ## subpop: vector indicting the sub-popln to which the individuals belong to 14 | ## plot.nj: logical indicating whether a Neighbouring-Joining Tree of the results should be plotted. 15 | ## (note that the R/ape library is required for this); defaults to FALSE 16 | ## 17 | ## Side effects 18 | ## ============ 19 | ## output: Symmetical matix (in which only upper triangle is filled) of theta (Fst) values for each 20 | ## pair of unique sub-populaitons. 21 | ## plot: neighbouring-joining tree of results. 22 | 23 | ## assign all non {0,1,2} to NA 24 | geno[(geno!=0) & (geno!=1) & (geno!=2)] <- NA 25 | geno <- as.matrix(geno) 26 | 27 | N = ncol(geno) ## sample size 28 | if( length(spop) != N ) { stop( "Number of samples with genotypes does not match provided number of spop.\n" ) } 29 | spop <- as.factor(as.character(spop)) 30 | unique.spop <- levels(spop) 31 | nspop = length(unique.spop) 32 | 33 | n0 <- n1 <- n <- matrix(NA, ncol=nspop, nrow=nrow(geno), dimnames=list(NULL,unique.spop)) 34 | for(i in 1:nspop) { 35 | inds <- which(spop == unique.spop[i]) 36 | n0[,i] <- apply(geno[,inds]==0,1,sum,na.rm=T) 37 | n1[,i] <- apply(geno[,inds]==1,1,sum,na.rm=T) 38 | n[,i] <- apply(!is.na(geno[,inds]),1,sum,na.rm=T) 39 | } 40 | p <- ((2*n0)+n1)/(2*n) ## allele freq 41 | Ho <- (n1/n) ## observed het 42 | 43 | pairwise.wcFst <- matrix(NA, ncol=nspop, nrow=nspop) 44 | r=2 ## now, only two spops are examined at a time 45 | for( i in 1:(nspop-1) ) { 46 | for( j in (i+1):nspop ) { 47 | 48 | n_bar <- apply(n[,unique.spop[c(i,j)]],1,sum,na.rm=T)/r 49 | nc <- ((r*n_bar) - (apply((n[,unique.spop[c(i,j)]]*n[,unique.spop[c(i,j)]])/(r*n_bar),1,sum,na.rm=T))) / (r-1) 50 | p_bar <- apply( (n[,unique.spop[c(i,j)]]*p[,unique.spop[c(i,j)]])/(r*n_bar), 1, sum, na.rm=T ) 51 | s_square <- apply( (n[,unique.spop[c(i,j)]]*((p[,unique.spop[c(i,j)]]-p_bar)^2)) / ((r-1)*n_bar), 1, sum, na.rm=T ) 52 | h_bar <- apply((n[,unique.spop[c(i,j)]]*Ho[,unique.spop[c(i,j)]])/(r*n_bar), 1, sum, na.rm=T) 53 | 54 | a_hat <- (n_bar/nc) * ( s_square - ((1/(n_bar-1))*((p_bar*(1-p_bar)) - (((r-1)/r)*s_square) - ((1/4)*h_bar))) ) 55 | b_hat <- (n_bar/(n_bar-1)) * ((p_bar*(1-p_bar)) - (((r-1)/r)*s_square) - ((((2*n_bar)-1)/(4*n_bar))*h_bar)) 56 | c_hat <- h_bar/2 57 | 58 | inds <- which(is.finite(a_hat) & is.finite(b_hat) & is.finite(c_hat)) 59 | pairwise.wcFst[i,j] <- sum(a_hat[inds],na.rm=T) / sum(apply(cbind(a_hat,b_hat,c_hat)[inds,],1,sum,na.rm=T),na.rm=T) 60 | 61 | rm(n_bar, nc, p_bar, s_square, h_bar, a_hat, b_hat, c_hat,inds) 62 | } 63 | } 64 | colnames(pairwise.wcFst) <- rownames(pairwise.wcFst) <- unique.spop 65 | 66 | ## plot Neighbouring-Joining Tree 67 | if(plot.nj) { 68 | library(ape) 69 | pairwise.wcFst.nj <- nj(as.dist(t(pairwise.wcFst))) 70 | plot(pairwise.wcFst.nj, main="Weir & Cockerham's Fst",sub="neighbor joining",type="unrooted") 71 | } 72 | 73 | pairwise.wcFst 74 | 75 | } 76 | -------------------------------------------------------------------------------- /rscripts/calc_wcFstats.R: -------------------------------------------------------------------------------- 1 | calc_wcFstats <- function(geno, subpop) { 2 | 3 | ## Copyright Eva Chan 2008 4 | ## eva@evachan.org 5 | ## 6 | ## A script to estimate the variance components and fixation indices as described in 7 | ## Weir & Cockerham 1984 Evolution 38(6): 1358-1370. 8 | ## 9 | ## Arguments 10 | ## ========= 11 | ## geno: matrix of genotypes with rows corresp. to markers and columns to individuals; 12 | ## notation for genotyeps are {0,1,2} indicating the number of one of the two alleles 13 | ## subpop: vector indicting the sub-popln to which the individuals belong to 14 | ## 15 | ## Output 16 | ## ====== 17 | ## list of two objects: perloc and global 18 | ## perloc: matrix of 6 columns and as many rows as markers in geno 19 | ## the 6 columns contain the estiamted variance components and fixation indices per locus 20 | ## a = component of variance between subpops 21 | ## b = component of variance between individuals within subpops 22 | ## c = component of variance between gametes within individuals 23 | ## global: numeric vector of three values corresponding to the esimated F (Fit), theta (Fst), & 24 | ## f (Fis) across all loci 25 | ## 26 | ## Note 27 | ## ==== 28 | ## R/HIERFSTAT also estimate F-statistics using variance component estimation. 29 | ## Results from that package is not too different to those from this function; I suspect 30 | ## there are two sources of differences: 31 | ## 1) all estimates of variance component from HIERFSTAT are doubled in magnitude to those 32 | ## from this function (i.e. scaled by factor of 2); 33 | ## 2) rounding off variations may also be present. 34 | ## The scaled difference in the estimates of variance components poses no problem when 35 | ## calcualting fixation indicies as the values scaling factor is cancelled out in the 36 | ## calculation of the ratios. 37 | 38 | spop <- unique(as.character(subpop)) ## unique spops 39 | r <- length(spop) 40 | 41 | n11 <- n12 <- n22 <- matrix(NA, ncol=r, nrow=nrow(geno)) 42 | for(i in 1:r) { 43 | inds <- which(subpop == spop[i]) 44 | n11[,i] <- apply(geno[,inds]==0,1,sum,na.rm=T) 45 | n12[,i] <- apply(geno[,inds]==1,1,sum,na.rm=T) 46 | n22[,i] <- apply(geno[,inds]==2,1,sum,na.rm=T) 47 | } 48 | ni <- n11 + n12 + n22 49 | pi_tilda <- ((2 * n11) + n12) / (2 * ni) 50 | hi_tilda <- n12 / ni 51 | n_bar <- apply(ni,1,sum,na.rm=T)/r 52 | # C_square <- ( apply(ni*ni,1,sum,na.rm=T) - (n_bar*n_bar*r) ) / ( (n_bar*n_bar) * (r-1) ) ## mod 2/3/2008 53 | # nc <- n_bar * (1 - (C_square/r)) 54 | nc <- ((r*n_bar) - apply(((ni*ni)/(r*n_bar)),1,sum,na.rm=T)) / (r - 1) 55 | p_bar <- apply( (ni*pi_tilda)/(r*n_bar), 1, sum, na.rm=T ) 56 | s_square <- apply( (ni*((pi_tilda-p_bar)^2)) / ((r-1)*n_bar), 1, sum, na.rm=T ) 57 | h_bar <- apply((ni*hi_tilda)/(r*n_bar), 1, sum, na.rm=T) 58 | 59 | # F_hat = 1 - ( (h_bar*(1-(C_square/r))) / ( (2*p_bar*(1-p_bar)*(1-((n_bar*C_square)/(r*(n_bar-1))))) + (2*(s_square/r)*(1+(((r-1)*(n_bar*C_square))/(r*(n_bar-1))))) + ((h_bar/2)*(C_square/(r*(n_bar-1)))) )) 60 | # theta_hat <- (s_square - ((1/(n_bar-1))*((p_bar*(1-p_bar)) - (((r-1)/r)*s_square) - (h_bar/4)))) / (((1-((n_bar*C_square)/(r*(n_bar-1))))*p_bar*(1-p_bar)) + ((1+(((r-1)*n_bar*C_square)/(r*(n_bar-1))))*(s_square/r)) + ((C_square/(r*(n_bar-1)))*(h_bar/4))) 61 | # f_hat <- 1 - (h_hat / ((((2*n_bar)/(n_bar-1))*p_bar*(1-p_bar)) - (((2*n_bar*(r-1))/(r*(n_bar-1)))*s_square) - ((1/(n_bar-1))*(h_bar/2)))) 62 | 63 | a_hat <- (n_bar/nc) * ( s_square - ((1/(n_bar-1))*((p_bar*(1-p_bar)) - (((r-1)/r)*s_square) - ((1/4)*h_bar))) ) 64 | b_hat <- (n_bar/(n_bar-1)) * ((p_bar*(1-p_bar)) - (((r-1)/r)*s_square) - ((((2*n_bar)-1)/(4*n_bar))*h_bar)) 65 | c_hat <- h_bar/2 66 | 67 | F_hat <- 1 - (c_hat / (a_hat + b_hat + c_hat)) 68 | theta_hat <- a_hat / (a_hat + b_hat + c_hat) 69 | f_hat <- 1 - (c_hat / (b_hat + c_hat)) 70 | 71 | F_hat_w <- 1 - (sum(c_hat,na.rm=T) / sum((a_hat + b_hat + c_hat),na.rm=T)) 72 | theta_hat_w <- sum(a_hat,na.rm=T) / sum((a_hat + b_hat + c_hat),na.rm=T) 73 | f_hat_w <- 1 - (sum(c_hat,na.rm=T) / sum((b_hat + c_hat),na.rm=T)) 74 | 75 | list( perloc=cbind(a_hat=a_hat, b_hat=b_hat, c_hat=c_hat, F_hat=F_hat, theta_hat=theta_hat, f_hat=f_hat), 76 | global=c(F_hat=F_hat_w, theta_hat=theta_hat_w, f_hat=f_hat_w) ) 77 | 78 | } 79 | 80 | -------------------------------------------------------------------------------- /rscripts/exampleI.ASdist.nj.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ekfchan/evachan.org-Rscripts/cbe314f3877c361a8b99fb1ece4dfa92a2a0d728/rscripts/exampleI.ASdist.nj.png -------------------------------------------------------------------------------- /rscripts/exampleI.R: -------------------------------------------------------------------------------- 1 | 2 | ### ----------------------------------- ### 3 | ### Create Data ### 4 | ### ----------------------------------- ### 5 | 6 | ## Create data of 20 individuals belonging to three subpopulations (A,B,and C) 7 | subpop <- rep(LETTERS[1:3],c(8,5,7) ) 8 | 9 | ## Simulate genotype matrix of 100 markers by 20 individuals, based on subpop. 10 | ## Simulation assumes the 100 markers are unlinked. 11 | geno <- matrix( NA, ncol=20, nrow=100, 12 | dimnames=list(paste('M',sapply(mapply(rep,3-nchar(1:100),MoreArgs=list(x=0)),paste,collapse=''),1:100,sep=''),paste('S',sapply(mapply(rep,2-nchar(1:20),MoreArgs=list(x=0)),paste,collapse=''),1:20,sep=''))) 13 | 14 | for( i in 1:100 ) { 15 | for( s in LETTERS[1:3] ) { 16 | sinds <- which(subpop==s) 17 | tmp.prob = runif(1) 18 | geno[i,sinds] <- 19 | sample(0:1,length(sinds),replace=T,p=c(tmp.prob,1-tmp.prob)) + 20 | sample(0:1,length(sinds),replace=T,p=c(tmp.prob,1-tmp.prob)) 21 | rm(sinds, tmp.prob) 22 | }; rm(s) 23 | }; rm(i) 24 | 25 | 26 | ### ----------------------------------- ### 27 | ### Example Usage ### 28 | ### ----------------------------------- ### 29 | 30 | ### calc_wcFstats.R 31 | ### =============== 32 | wcFstats <- calc_wcFstats(geno,subpop) 33 | sapply(wcFstats,head) #peek at the output 34 | #$perloc 35 | # a_hat b_hat c_hat F_hat theta_hat f_hat 36 | #[1,] 0.134044839 -0.025315126 0.175 0.38321581 0.472438497 -0.169122807 37 | #[2,] 0.066264674 0.001680672 0.200 0.25357912 0.247306679 0.008333333 38 | #[3,] 0.252778401 0.034926471 0.050 0.85194173 0.748518669 0.411255411 39 | #[4,] 0.134213227 -0.045168067 0.200 0.30806660 0.464333073 -0.291723202 40 | #[5,] 0.001800148 0.014548319 0.200 0.07556544 0.008320593 0.067809058 41 | #[6,] 0.089654564 -0.038130252 0.125 0.29188224 0.507887911 -0.438935913 42 | # 43 | #$global 44 | # F_hat theta_hat f_hat 45 | #0.343454963 0.341996174 0.002216991 46 | 47 | 48 | ### calc_wcFst_spop_pairs.R 49 | ### ======================= 50 | wcFstats.spop.pairs <- calc_wcFst_spop_pairs( geno, subpop, TRUE ) 51 | wcFstats.spop.pairs #output: FST between the three pairs of subpopulations: A vs B, A vs C, and B vs C. 52 | # A B C 53 | #A NA 0.3850011 0.3405093 54 | #B NA NA 0.2961459 55 | #C NA NA NA 56 | 57 | 58 | ### calc_neiFis_onepop.R 59 | ### ==================== 60 | neiFis.popA <- calc_neiFis_onepop(geno[,which(subpop=='A')]) 61 | sapply(neiFis.popA,head) #peek at the output: 62 | #$aveloc 63 | #[1] 0.05077399 64 | # 65 | #$perloc 66 | # M001 M002 M003 M004 M005 M006 67 | # NaN -0.27272727 0.00000000 -0.07692308 0.30000000 NaN 68 | geno[1:6,which(subpop=='A')] #Note: markers at which FIS cannot be calculated have NA (missing) values. In this example, markers M001 & M006 are monomorphic in subpopulation A. 69 | # S01 S02 S03 S04 S05 S06 S07 S08 70 | #M001 0 0 0 0 0 0 0 0 71 | #M002 1 0 1 0 0 1 1 0 72 | #M003 0 1 0 0 0 0 0 0 73 | #M004 1 2 1 2 2 2 2 2 74 | #M005 0 1 1 0 2 1 0 2 75 | #M006 2 2 2 2 2 2 2 2 76 | 77 | 78 | ### calc_snp_stats.R 79 | ### ================ 80 | snp.stats <- calc_snp_stats(geno) 81 | head(snp.stats) #peek at the output 82 | # n n0 n1 n2 p maf mgf mono loh hwe.chisq hwe.chisq.p hwe.fisher hwe.fisher.p 83 | #M001 20 9 7 4 0.625 0.375 0.20 FALSE FALSE 1.28355556 0.2572389852 2.828469 0.3563467492 84 | #M002 20 8 8 4 0.600 0.400 0.20 FALSE FALSE 0.55555556 0.4560565403 1.929814 0.6479161705 85 | #M003 20 8 2 10 0.450 0.450 0.10 FALSE FALSE 12.73543516 0.0003587923 50.725102 0.0009228388 86 | #M004 20 4 8 8 0.400 0.400 0.20 FALSE FALSE 0.55555556 0.4560565403 1.929814 0.6479161705 87 | #M005 20 10 8 2 0.700 0.300 0.10 FALSE FALSE 0.04535147 0.8313590555 1.235838 1.0000000000 88 | #M006 20 1 5 14 0.175 0.175 0.05 FALSE FALSE 0.36018815 0.5484017591 2.218801 0.5087719298 89 | 90 | # Some of the statistics are actually not appropriate in the presence of known population substrucutre. 91 | # In our case, since our *geno* corresponded to individuals from three different subpopulations, we may wish to re-estimate the marker statistics on a by-population basis. 92 | snp.stats.popA <- calc_snp_stats(geno[,which(subpop=='A')]) #statistics for subpopulation A 93 | head(snp.stats.popA) #Note that markers M001, M006 (& others) were monomorphic and so chi-square could not be calculated. 94 | # n n0 n1 n2 p maf mgf mono loh hwe.chisq hwe.chisq.p hwe.fisher hwe.fisher.p 95 | #M001 8 8 0 0 1.0000 0.0000 0.00 TRUE TRUE NaN NaN 0.000000 1 96 | #M002 8 4 4 0 0.7500 0.2500 0.00 FALSE FALSE 0.88888889 0.3457786 0.000000 1 97 | #M003 8 7 1 0 0.9375 0.0625 0.00 FALSE FALSE 0.03555556 0.8504363 0.000000 1 98 | #M004 8 0 2 6 0.1250 0.1250 0.00 FALSE FALSE 0.16326531 0.6861678 0.000000 1 99 | #M005 8 3 3 2 0.5625 0.4375 0.25 FALSE FALSE 0.45351474 0.5006706 2.601683 1 100 | #M006 8 0 0 8 0.0000 0.0000 0.00 TRUE TRUE NaN NaN 0.000000 1 101 | 102 | 103 | ### calc_neiFis_multispop.R 104 | ### ======================= 105 | neiFis.allpops <- calc_neiFis_multispop(geno,subpop) 106 | sapply(neiFis.allpops,head) #peek at output 107 | #$aveloc ## average FIS for each subpopulation, of combined pop, and averaged across the 3 subpops 108 | # A B C total average 109 | # 0.050773994 -0.096671949 0.017681729 0.265446224 -0.008713363 110 | # 111 | #$perloc ## FIS values for EACH marker 112 | # A B C total average 113 | #[1,] NaN -6.000000e-01 1.428571e-01 0.27717391 -0.22953762 114 | #[2,] -0.27272727 -1.428571e-01 3.684211e-01 0.19148936 0.00980153 115 | #[3,] 0.00000000 6.000000e-01 NaN 0.80710660 0.46406559 116 | #[4,] -0.07692308 -2.220446e-16 -5.000000e-01 0.19148936 -0.28099627 117 | #[5,] 0.30000000 -1.428571e-01 -2.000000e-01 0.07317073 0.03715450 118 | #[6,] NaN -6.000000e-01 2.220446e-16 0.15929204 -0.47610485 119 | 120 | 121 | ### calc_LD.R 122 | ## ========== 123 | LDrsq.allpops.SNPs1to50 <- calc_LD( geno, inds=1:50, get.D=F, get.Dprime=F, get.rsq=T, get.chisq=F, get.chisq_prime=F ) 124 | sapply(LDrsq.allpops.SNPs1to50,'[',1:6,1:6) #peek at the output: notice that only $rsq is not NULL as we had asked calc_LD to return only r-square values. 125 | #$D 126 | #NULL 127 | # 128 | #$Dprime 129 | #NULL 130 | # 131 | #$rsq ##symmetircal matrix of r-squares estimated for all 100x100 pairs of markers 132 | # [,1] [,2] [,3] [,4] [,5] [,6] 133 | #[1,] 1.000000000 0.006734007 0.486531987 0.2045455 0.01010101 0.04306220 134 | #[2,] 0.006734007 1.000000000 0.001736111 0.1666667 0.16666667 0.03508772 135 | #[3,] 0.486531987 0.001736111 1.000000000 0.0234375 0.00000000 0.03508772 136 | #[4,] 0.204545455 0.166666667 0.023437500 1.0000000 0.00000000 0.21052632 137 | #[5,] 0.010101010 0.166666667 0.000000000 0.0000000 1.00000000 0.05263158 138 | #[6,] 0.043062201 0.035087719 0.035087719 0.2105263 0.05263158 1.00000000 139 | # 140 | #$chisq 141 | #NULL 142 | # 143 | #$chisq_prime 144 | #NULL 145 | # 146 | #$chisq_df 147 | #NULL 148 | 149 | # LD estimated using samples with known population structure may not be appropriate. 150 | # In our case here, we may wish to estimate LD using only samples from subpopulation A. 151 | LDrsq.popA.SNPs1to50 <- calc_LD( geno[,which(subpop=='A')], inds=1:50, get.D=F, get.Dprime=F, get.rsq=T, get.chisq=F, get.chisq_prime=F ) 152 | LDrsq.popA.SNPs1to50$rsq[1:6,1:6] #peek at output: Note the large numbers of missing data due to (1) small sample size and (2) monomophic loci 153 | # [,1] [,2] [,3] [,4] [,5] [,6] 154 | #[1,] NaN NaN NaN NaN NaN NaN 155 | #[2,] NaN 1.00000000 0.14285714 NaN 0.06666667 NaN 156 | #[3,] NaN 0.14285714 1.00000000 NaN 0.08571429 NaN 157 | #[4,] NaN NaN NaN NaN NaN NaN 158 | #[5,] NaN 0.06666667 0.08571429 NaN 1.00000000 NaN 159 | #[6,] NaN NaN NaN NaN NaN NaN 160 | 161 | 162 | ### calc_allele_sharing.R 163 | ### ===================== 164 | ASdist.allpop <- calc_allele_sharing(geno) 165 | rownames(ASdist.allpop) <- colnames(ASdist.allpop) <- colnames(geno) #Note that calc_allele_sharing does not preserve marker labels. 166 | 167 | #To use allele-sharing distance matrix as input for estiamting Neighbour-Joining Tree: 168 | library(ape) #the library with nj() 169 | ASdist.allpop.nj <- nj(ASdist.allpop) 170 | plot( ASdist.allpop.nj, tip.color=match(subpop,LETTERS[1:3]), type='unroot' ) 171 | 172 | 173 | ### calc_hwe_chisq.R 174 | ### ================ 175 | HWEchisq.allpop <- calc_hwe_chisq(geno) 176 | head(HWEchisq.allpop) #peek at output 177 | # chisq chisq.p 178 | #M001 1.28355556 0.2572389852 179 | #M002 0.55555556 0.4560565403 180 | #M003 12.73543516 0.0003587923 181 | #M004 0.55555556 0.4560565403 182 | #M005 0.04535147 0.8313590555 183 | #M006 0.36018815 0.5484017591 184 | 185 | # Testing of deviation from HWE in data with known population substructure may not be appropriate. 186 | # Here we re-estimate with only data from subpopulation A. 187 | HWEchisq.popA <- calc_hwe_chisq(geno[,which(subpop=='A')]) 188 | head(HWEchisq.popA) #Note: chi-square cannot be calculated on monomorphic markers (e.g. M001 & M006) 189 | # chisq chisq.p 190 | #M001 NaN NaN 191 | #M002 0.88888889 0.3457786 192 | #M003 0.03555556 0.8504363 193 | #M004 0.16326531 0.6861678 194 | #M005 0.45351474 0.5006706 195 | #M006 NaN NaN 196 | 197 | 198 | ### calc_hwe_fisher.R 199 | ### ================= 200 | HWEfisher.allpop <- calc_hwe_fisher(geno) 201 | head(HWEfisher.allpop) #peek at output 202 | # odds.ratio p.values 203 | #[1,] 2.828469 0.3563467492 204 | #[2,] 1.929814 0.6479161705 205 | #[3,] 50.725102 0.0009228388 206 | #[4,] 1.929814 0.6479161705 207 | #[5,] 1.235838 1.0000000000 208 | #[6,] 2.218801 0.5087719298 209 | 210 | # Testing of deviation from HWE in data with known population substructure may not be appropriate. 211 | # Here we re-estimate with only data from subpopulation A. 212 | HWEfisher.popA <- calc_hwe_fisher(geno[,which(subpop=='A')]) 213 | head(HWEfisher.popA) 214 | # odds.ratio p.values 215 | #[1,] 0.000000 1 216 | #[2,] 0.000000 1 217 | #[3,] 0.000000 1 218 | #[4,] 0.000000 1 219 | #[5,] 2.601683 1 220 | #[6,] 0.000000 1 221 | 222 | -------------------------------------------------------------------------------- /rscripts/exampleI_data.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ekfchan/evachan.org-Rscripts/cbe314f3877c361a8b99fb1ece4dfa92a2a0d728/rscripts/exampleI_data.RData -------------------------------------------------------------------------------- /rscripts/exampleI_functions.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ekfchan/evachan.org-Rscripts/cbe314f3877c361a8b99fb1ece4dfa92a2a0d728/rscripts/exampleI_functions.RData -------------------------------------------------------------------------------- /rscripts/geno_to_allelecnt.R: -------------------------------------------------------------------------------- 1 | geno_to_allelecnt <- function(geno, ref=NULL, info=FALSE) { 2 | 3 | ## http://evachan.org 4 | ## Eva KF Chan 5 | ## Created: 7 July 2014 6 | ## 7 | ## Converts a matrix of genotypes into a matrix of allele counts. 8 | ## It essentially converts the bi-allele SNP data format of {AA,AG,GG,CC,...} 9 | ## to the number of copies of the ref (or alphabetically "smaller") allele {0,1,2} 10 | ## Inputs: 11 | ## geno: Matrix of genotypes with rows corresponding to markers and columns 12 | ## to samples. NA is allowed. 13 | ## ref: Character vector of same size the number of rows in geno, representing 14 | ## the reference "allele". If absent, then conversion will be based on 15 | ## the alphabetically smaller allele. 16 | ## Output: 17 | ## If info is FALSE (default), the function returns a single matrix of the same size as geno, containing the counts of the reference/common allele at each marker (rows). 18 | ## If info is TRUE, a list will be return containing the matrix of allele counts as well as a data.frame of marker information. This is useful for checking which alelles are counted. 19 | 20 | if(!is.matrix(geno) | !mode(geno)=="character") { stop("geno must be of 'matrix' class and 'character' mode.\n") } 21 | if( !all(nchar(as.character(geno[!is.na(geno)]))==2) ) { stop("geno should contain bi-allelic genotypes, e.g. {AA,CC,GG,TT,AC,AG,AT,CG,CT,GT}\n") } 22 | 23 | markers <- data.frame( N=rowSums(!is.na(geno)) ) 24 | 25 | alleles <- apply(cbind(substr(geno,1,1),substr(geno,2,2)),1,unique) 26 | if( is.matrix(alleles) ) { alleles <- lapply(apply(alleles,2,as.list),as.character) } #2017-03-15: corrected apply direction 27 | alleles <- lapply(alleles,sort) 28 | markers$numAlleles = sapply(alleles,length) 29 | if( any(markers$numAlleles>2) ) { stop("markers {",paste(which(markers$numAlleles>2),collapse=","),"} contains more than two alleles.\n") } 30 | 31 | markers$A1 = NA 32 | inds <- which(markers$numAlleles>0) 33 | markers$A1[inds] <- sapply(alleles[inds],'[[',1) 34 | markers$A2 = NA 35 | inds <- which(markers$numAlleles>1) 36 | markers$A2[inds] <- sapply(alleles[inds],'[[',2) 37 | 38 | if(is.null(ref)) { ref <- markers$A1; markers$input_ref=NA } else { markers$input_ref=ref } 39 | # If ref allele was not known, the alphabetically smaller allele is used 40 | if(length(inds<-which(is.na(ref)))>0) { ref[inds] = markers$A1[inds] } 41 | alt <- rep(NA,length(ref)) 42 | inds <- which(ref==markers$A1); alt[inds] <- markers$A2[inds] 43 | inds <- which(ref==markers$A2); alt[inds] <- markers$A1[inds] 44 | inds <- which(is.na(alt)); alt[inds] = markers$A1[inds] #if neither alleles is the reference, arbitrarily assign the alphabetically smaller allele as the alt 45 | markers$ref = ref 46 | markers$alt = alt 47 | 48 | #if( any(ref!=markers$A1 & ref!=markers$A2) ) { warning("ref allele not present in geno for some markers. Conversions for these markers cannot be performed and will be coerced to NA.\n") } 49 | 50 | markers$G2 = paste(ref,ref,sep="") #2 copies of ref 51 | markers$G1.1 = paste(ref,alt,sep="") #1 copy of ref, ref allele coded first 52 | markers$G1.2 = paste(alt,ref,sep="") #1 copy of ref, reversed coding 53 | markers$G0 = paste(alt,alt,sep="") #0 copy of ref 54 | markers$G2[is.na(ref)] <- NA 55 | markers$G1.1[is.na(alt)] <- NA 56 | markers$G1.2[is.na(alt)] <- NA 57 | markers$G0[is.na(alt)] <- NA 58 | 59 | geno.as.num <- matrix( 0, ncol=ncol(geno), nrow=nrow(geno), dimnames=dimnames(geno) ) 60 | geno.as.num[geno==markers$G2] <- 2 61 | geno.as.num[geno==markers$G1.1 | geno==markers$G1.2] <- 1 62 | geno.as.num[geno==markers$G0] <- 0 63 | geno.as.num[which(is.na(markers$ref)),] = NA 64 | geno.as.num[is.na(geno)] = NA 65 | 66 | if( info ) { 67 | return( list(allelecnt=geno.as.num, markers=markers[,c("N","numAlleles","input_ref","ref","alt")]) ) 68 | } else { 69 | return(geno.as.num) 70 | } 71 | } -------------------------------------------------------------------------------- /rscripts/gwas_lm.R: -------------------------------------------------------------------------------- 1 | gwas_lm <- function(pheno, geno, model = NULL) { 2 | ### Copyright 2006 - 2008 Eva Chan 3 | ### http://evachan.org 4 | ### Created: August 2006 5 | ### Last modified: July 2014 6 | ### 7 | ### For each of the traits in pheno, perform linear regression on one or more 8 | ### allelic models, using the genotype data provided in geno 9 | ### 10 | ### *** inputs *** 11 | ### pheno: matrix, or data.frame, of trait values: one trait per column with 12 | ### rownames(pheno) being sample IDs 13 | ### geno: matrix of genotypes: {0,1,2,NA}: one marker per row with rownames(geno) being 14 | ### marker IDs and one individual per column with colnames(geno) being sample IDs 15 | ### model: character vector listing the inheritence models to be tested; 16 | ### avaiable options are: logadditive, codominance; dominance, overdominance, recessive, 17 | ### or all (which is all of these five models): first three letter matches 18 | ### 19 | ### *** output *** 20 | ### res$ 21 | ### trait$ 22 | ### model 23 | ### Outputs a list of traits of lists of models of m-by-6 matrix, 24 | ### where rows of matrix corrrespond to each marke and 25 | ### columns of matrix to correspond to "f.stat", "df1", "df2", "p.val", "r.sq", "adj.r.sq" 26 | 27 | stopifnot( nrow(pheno) == ncol(geno) ) 28 | 29 | ## Get phenotype names 30 | if(is.null(colnames(pheno))) { traits <- paste("trait",1:ncol(pheno),sep="") } else { traits <- colnames(pheno) } 31 | num.traits <- length(traits) 32 | 33 | ## Get marker names 34 | if(is.null(rownames(geno))) { markers <- paste("marker",1:nrow(geno),sep="") } else { markers <- rownames(geno) } 35 | num.markers <- length(markers) 36 | 37 | ## get sample names 38 | pheno.samples <- rownames(pheno) 39 | geno.samples <- colnames(geno) 40 | if(is.null(pheno.samples) | is.null(pheno.samples)) { 41 | if(nrow(pheno) == ncol(geno)) { 42 | samples <- paste("sample",1:nrow(pheno),sep="") 43 | } else { 44 | stop("Different sample numbers in pheno and geno!\n") 45 | } 46 | } else { 47 | if(!all(is.element(pheno.samples, geno.samples)) ) { 48 | stop("Different samples in pheno and geno!\n") 49 | } else { 50 | if(!all(pheno.samples == geno.samples) ) { #samples in different order between geno & pheno 51 | pheno <- pheno[match(geno.samples, pheno.samples),] 52 | } 53 | samples <- geno.samples 54 | } 55 | } 56 | rm(geno.samples, pheno.samples) 57 | 58 | ## Determine genetic models to test 59 | do.codom = do.logadd = do.dom = do.rec = do.overdom = F 60 | if(is.null(model) | is.na(model) | model=="") stop("No inheritance model selected\n") 61 | if(is.element("all", model)) { do.codom = T; do.logadd = T; do.dom = T; do.rec = T; do.overdom = T } 62 | submodel <- substr(model,1,3) 63 | if(is.element('add',submodel) | is.element('log',submodel)) { do.logadd=T } 64 | if(is.element('cod',submodel)) { do.codom=T } 65 | if(is.element('dom',submodel)) { do.dom=T } 66 | if(is.element('rec',submodel)) { do.rec=T } 67 | if(is.element('ove',submodel)) { do.overdom=T } 68 | 69 | res <- list() 70 | 71 | ## Perform GWAS on each trait 72 | for(i in 1:num.traits) 73 | { 74 | cat(traits[i],"\n") 75 | cur.phval <- as.vector(pheno[,i]) 76 | inds <- which(!is.na(cur.phval)) 77 | if (length(inds)<3) { next } #skip trait if fewer than 2 datapoints 78 | cur.phval <- cur.phval[inds] 79 | 80 | if(do.codom) codom.mat <- matrix(NA, ncol=6, nrow=num.markers, dimnames=list(markers,c("f.stat", "df1", "df2", "p.val", "r.sq", "adj.r.sq"))) 81 | if(do.logadd) logadd.mat <- matrix(NA, ncol=6, nrow=num.markers, dimnames=list(markers,c("f.stat", "df1", "df2", "p.val", "r.sq", "adj.r.sq"))) 82 | if(do.dom) dom.mat <- matrix(NA, ncol=6, nrow=num.markers, dimnames=list(markers,c("f.stat", "df1", "df2", "p.val", "r.sq", "adj.r.sq"))) 83 | if(do.overdom) overdom.mat <- matrix(NA, ncol=6, nrow=num.markers, dimnames=list(markers,c("f.stat", "df1", "df2", "p.val", "r.sq", "adj.r.sq"))) 84 | if(do.rec) rec.mat <- matrix(NA, ncol=6, nrow=num.markers, dimnames=list(markers,c("f.stat", "df1", "df2", "p.val", "r.sq", "adj.r.sq"))) 85 | 86 | for(j in 1:num.markers) 87 | { 88 | cur.geval <- geno[j,inds] 89 | if(nlevels(as.factor(cur.geval))<=1) {next} 90 | 91 | if(do.codom) 92 | { 93 | z <- summary( lm( cur.phval ~ as.factor(cur.geval) ) ) 94 | codom.mat[j,] <- c( z$fstatistic[1], z$fstatistic[2], z$fstatistic[3], 95 | 1-pf(z$fstatistic[1],z$fstatistic[2],z$fstatistic[3]), 96 | z$r.squared, z$adj.r.squared ) 97 | } 98 | 99 | if(do.logadd) 100 | { 101 | z <- summary( lm( cur.phval ~ as.numeric(cur.geval) ) ) 102 | logadd.mat[j,] <- c( z$fstatistic[1], z$fstatistic[2], z$fstatistic[3], 103 | 1-pf(z$fstatistic[1],z$fstatistic[2],z$fstatistic[3]), 104 | z$r.squared, z$adj.r.squared ) 105 | } 106 | 107 | if(do.dom) 108 | { 109 | g <- as.factor(cur.geval == 0) 110 | if(nlevels(g)>1) 111 | { 112 | z <- summary( lm( cur.phval ~ g ) ) 113 | dom.mat[j,] <- c( z$fstatistic[1], z$fstatistic[2], z$fstatistic[3], 114 | 1-pf(z$fstatistic[1],z$fstatistic[2],z$fstatistic[3]), 115 | z$r.squared, z$adj.r.squared ) 116 | } 117 | } 118 | 119 | if(do.rec) 120 | { 121 | g <- as.factor(cur.geval == 2) 122 | if(nlevels(g)>1) 123 | { 124 | z <- summary( lm( cur.phval ~ g ) ) 125 | rec.mat[j,] <- c( z$fstatistic[1], z$fstatistic[2], z$fstatistic[3], 126 | 1-pf(z$fstatistic[1],z$fstatistic[2],z$fstatistic[3]), 127 | z$r.squared, z$adj.r.squared ) 128 | } 129 | } 130 | 131 | if(do.overdom) 132 | { 133 | g <- as.factor(cur.geval == 1) 134 | if(nlevels(g)>1) 135 | { 136 | z <- summary( lm( cur.phval ~ g ) ) 137 | overdom.mat[j,] <- c( z$fstatistic[1], z$fstatistic[2], z$fstatistic[3], 138 | 1-pf(z$fstatistic[1],z$fstatistic[2],z$fstatistic[3]), 139 | z$r.squared, z$adj.r.squared ) 140 | } 141 | } 142 | } 143 | 144 | res[[traits[i]]] <- list() 145 | if(do.codom) res[[traits[i]]][["codom"]] <- codom.mat 146 | if(do.logadd) res[[traits[i]]][["logadd"]] <- logadd.mat 147 | if(do.dom) res[[traits[i]]][["dom"]] <- dom.mat 148 | if(do.rec) res[[traits[i]]][["rec"]] <- rec.mat 149 | if(do.overdom) res[[traits[i]]][["overdom"]] <- overdom.mat 150 | } 151 | 152 | res 153 | 154 | } -------------------------------------------------------------------------------- /rscripts/plclust_in_colour.R: -------------------------------------------------------------------------------- 1 | plclust_in_colour <- function( hclust, lab=hclust$labels, lab.col=rep(1,length(hclust$labels)), hang=0.1, ... ) 2 | { 3 | ## Eva KF Chan 2009 4 | ## Modifiction of plclust for plotting hclust objects *in colour*! 5 | ## Arguments: 6 | ## hclust: hclust object 7 | ## lab: a character vector of labels of the leaves of the tree 8 | ## lab.col: colour for the labels; NA=default device foreground colour 9 | ## hang: as in hclust & plclust 10 | ## Side effect: 11 | ## A display of hierarchical cluster with coloured leaf labels. 12 | 13 | y <- rep(hclust$height,2) 14 | x <- as.numeric(hclust$merge) 15 | 16 | y <- y[which(x<0)] 17 | x <- x[which(x<0)] 18 | 19 | x <- abs(x) 20 | 21 | y <- y[order(x)] 22 | x <- x[order(x)] 23 | 24 | plot( hclust, labels=F, hang=hang, ... ) 25 | text( x=x, y=y[hclust$order]-(max(hclust$height)*hang), labels=lab[hclust$order], col=lab.col[hclust$order], srt=90, adj=c(1,0.5), xpd=NA, ... ) 26 | 27 | } -------------------------------------------------------------------------------- /rscripts/plot_marker_lox.R: -------------------------------------------------------------------------------- 1 | plot_marker_lox <- function(chr, lox) { 2 | 3 | ## Copyright 2006-2008 Eva Chan 4 | ## eva@evachan.org 5 | ## 6 | ## This funciton generates a visual representation of a set of markers 7 | ## onto the genome. 8 | ## 9 | ## Inputs 10 | ## chr: vector of chromosomes 11 | ## lox: numeric vector of markers' positions on the corresponsing chrs 12 | ## NOTE:: chr and lox are assumed to be in same marker order!! 13 | ## 14 | 15 | ## remove markers with missing chr or pos 16 | inds <- which( is.na(chr) | is.na(lox) ) 17 | if(length(inds)>0) { 18 | warning(length(inds), " SNPs are missing map information; they are ignored.\n") 19 | chr <- chr[-inds] 20 | lox <- lox[-inds] 21 | } 22 | 23 | ## set non-integer chromosomes as integers 24 | unique.chrs <- unique(chr) 25 | suppressWarnings(unique.chrs.as.num <- as.integer(unique(unique.chrs))) 26 | non.int.chrs.ind <- which(is.na(unique.chrs.as.num)) 27 | chr.as.num <- chr 28 | if(length(non.int.chrs.ind)>0) { 29 | num.int.chrs <- length(unique.chrs) - length(non.int.chrs.ind) 30 | for(i in 1:length(non.int.chrs.ind)) { 31 | unique.chrs.as.num[non.int.chrs.ind[i]] <- num.int.chrs + i 32 | chr.as.num[which(chr==unique.chrs[non.int.chrs.ind[i]])] <- num.int.chrs + i 33 | } 34 | } 35 | chr.as.num <- as.integer(chr.as.num) 36 | unique.chrs <- unique.chrs[order(unique.chrs.as.num)] 37 | unique.chrs.as.num <- unique.chrs.as.num[order(unique.chrs.as.num)] 38 | 39 | ## set lox to Mb if in bases 40 | new.lox <- lox / 1000000 41 | if( max(new.lox) > 1 ) { 42 | lox <- new.lox 43 | yunit <- "(Mb)" 44 | rm(new.lox) 45 | } else { yunit = "(bases)" } 46 | 47 | ## calculate chromosome range 48 | chr.len <- rep(NA, length(unique.chrs)) 49 | for(i in 1:length(chr.len)) { 50 | chr.len[i] <- max(lox[which(chr==unique.chrs[i])],na.rm=T) 51 | } 52 | 53 | ## plot frame 54 | plot( unique.chrs.as.num, chr.len, ylim=c(0, max(chr.len)), pch="_", xlab="Chromosome", ylab=paste("position",yunit), las=1, axes=F, cex=1.2, col="dark grey" ) 55 | points( unique.chrs.as.num, rep(0, length(unique.chrs.as.num)), pch="_", cex=1.2, col="dark grey" ) 56 | axis(1, at=1:length(unique.chrs.as.num), labels=unique.chrs, las=1) 57 | axis(2, las=1) 58 | for(i in 1:length(chr.len)) { 59 | points( rep(unique.chrs.as.num[i],2), c(0,chr.len[i]), type="l" ) 60 | } 61 | 62 | ## plot markers 63 | points( chr.as.num, lox, pch="_" ) 64 | 65 | } 66 | -------------------------------------------------------------------------------- /rscripts/plot_markers_by_set.R: -------------------------------------------------------------------------------- 1 | plot_markers_by_set <- function(set, chrom, pos, horiz=F, usr.colours=NULL, pt.cex=1) { 2 | 3 | ## Copyright 2008 Eva Chan 4 | ## eva@evachan.org 5 | ## 6 | ## Generates a map of markers with different sets of markers marked with different colours. 7 | ## Parameters: 8 | ## set: factor indicating the set to which the corresponding markers belong. 9 | ## chrom: vector of chromosome to which the markers are located; sex chromosomes (X,Y) allowed. 10 | ## pos: numeric vector of chromosome position of markers. 11 | ## horiz: logical indicating if chromosomes should be represented horizontally. 12 | ## usr.colours: colour vector for each marker set; if NULL, default rainbow colours are used; 13 | ## if length is less than number of unique marker sets, colours will be recycled. 14 | ## pt.cex: expansion factor the markers and legend. 15 | ## NOTE:: Ordering of markers between the three parameters are assumed to be the same. 16 | 17 | ## check to ensure parameters are of the same length 18 | if( (length(chrom) != length(set)) || (length(pos) != length(set)) ) { 19 | stop("Parameters are of differnet length.\n") 20 | } 21 | 22 | ## exclude markers without known mapping info 23 | excl.inds <- unique(which(is.na(chrom) | is.na(pos))) 24 | if(length(excl.inds)>0) { 25 | warning(length(excl.inds)," markers have missing position and are excluded in plot.\n") 26 | set <- set[-excl.inds] 27 | chrom <- chrom[-excl.inds] 28 | pos <- pos[-excl.inds] 29 | } 30 | 31 | ## set "set" as factor variable 32 | set <- as.factor(as.character(set)) 33 | 34 | ## check to ensure positions are given in numeric variables 35 | if(!is.numeric(pos)) { 36 | if(sum(is.na(suppressWarnings(as.numeric(pos))))>0) { 37 | stop("Inappropriate chromosome positions.\n") 38 | } 39 | pos <- as.numeric(pos) 40 | } 41 | 42 | ## check for and recode sex chromosomes 43 | chrom.ori <- chrom 44 | chrom.as.num <- suppressWarnings(as.numeric(chrom)) 45 | if(sum(is.na(chrom.as.num)) > 0) { 46 | ## check for chrom X 47 | inds <- grep( "X", chrom, ignore.case=T ) 48 | if(length(inds)>0) { 49 | chrom[inds] <- sum(!is.na(unique(chrom.as.num))) + 1 50 | chrom.as.num <- suppressWarnings(as.numeric(chrom)) 51 | } 52 | } 53 | if(sum(is.na(chrom.as.num)) > 0) { 54 | ## check for chrom Y 55 | inds <- which( (chrom=="Y") | (chrom=="y") ) 56 | if(length(inds)>0) { 57 | chrom[inds] <- sum(!is.na(unique(chrom.as.num))) + 1 58 | chrom.as.num <- suppressWarnings(as.numeric(chrom)) 59 | } 60 | } 61 | if(sum(is.na(chrom.as.num)) > 0) { 62 | stop("Inappropriate chromosome:\n", paste(unique(chrom[which(is.na(chrom.as.num))]),collapse=", "), "\n") 63 | } 64 | 65 | ## set chromosome length 66 | pos.lab="" 67 | if( median(z <- (pos/1000000)) > 1 ) { 68 | pos <- z 69 | pos.lab <- "(Mb)" 70 | } else { 71 | if( median(z <- (pos/1000)) > 1 ) { 72 | pos <- z 73 | pos.lab <- "(kb)" 74 | } 75 | } 76 | 77 | unique.chroms <- sort(unique(chrom.as.num)) 78 | if(is.null(usr.colours)) { 79 | usr.colours <- rainbow(nlevels(set)) 80 | } 81 | if(length(usr.colours) < nlevels(set)) { 82 | warning("Colours provided is fewer than marker sets-- colours will be recycled\n") 83 | usr.colours <- rep(usr.colours,len=nlevels(set)) 84 | } 85 | if(horiz) { 86 | plot( c(0,max(pos[which(chrom.as.num==unique.chroms[1])])), c(unique.chroms[1],unique.chroms[1]), type="l", ylim=c(0,length(unique.chroms)), xlim=c(0,ceiling(1.1*max(pos))), ylab="Chromosomes", xlab=paste("Position",pos.lab), las=1, axes=F ) 87 | axis.lab <- unique(chrom.ori)[match(unique.chroms,unique(chrom.ori))] 88 | if(sum(is.na(axis.lab))==1) { 89 | axis.lab[(length(axis.lab))] <- "X" 90 | } 91 | if(sum(is.na(axis.lab))==2) { 92 | axis.lab[(length(axis.lab)-1):length(axis.lab)] <- c("X","Y") 93 | } 94 | axis(2, at=unique.chroms, labels=axis.lab, las=1 ) 95 | axis(1, las=1) 96 | for(i in 2:length(unique.chroms)) { 97 | points( c(0,max(pos[which(chrom.as.num==unique.chroms[i])])), c(unique.chroms[i],unique.chroms[i]), type="l" ) 98 | } 99 | for(i in 1:nlevels(set)) { 100 | text( pos[which(set==levels(set)[i])], chrom.as.num[which(set==levels(set)[i])], labels="|", col=usr.colours[i], cex=pt.cex ) 101 | } 102 | legend(max(pos),length(unique.chroms),legend=levels(set), col=usr.colours, pch=45, bty="o", horiz=F, pt.cex=pt.cex, ncol=1) 103 | } else { 104 | plot( c(unique.chroms[1],unique.chroms[1]), c(0,max(pos[which(chrom.as.num==unique.chroms[1])])), type="l", xlim=c(0,length(unique.chroms)), ylim=c(0,ceiling(1.1*max(pos))), xlab="Chromosomes", ylab=paste("Position",pos.lab), las=1, axes=F ) 105 | axis.lab <- unique(chrom.ori)[match(unique.chroms,unique(chrom.ori))] 106 | if(sum(is.na(axis.lab))==1) { 107 | axis.lab[(length(axis.lab))] <- "X" 108 | } 109 | if(sum(is.na(axis.lab))==2) { 110 | axis.lab[(length(axis.lab)-1):length(axis.lab)] <- c("X","Y") 111 | } 112 | axis(1, at=unique.chroms, labels=axis.lab, las=1 ) 113 | axis(2, las=1) 114 | for(i in 2:length(unique.chroms)) { 115 | points( c(unique.chroms[i],unique.chroms[i]), c(0,max(pos[which(chrom.as.num==unique.chroms[i])])), type="l" ) 116 | } 117 | for(i in 1:nlevels(set)) { 118 | text( chrom.as.num[which(set==levels(set)[i])], pos[which(set==levels(set)[i])], labels="--", col=usr.colours[i], cex=pt.cex ) 119 | } 120 | legend(2,max(pos),legend=levels(set), col=usr.colours, pch=45, bty="o", horiz=T, pt.cex=pt.cex) 121 | } 122 | 123 | } 124 | -------------------------------------------------------------------------------- /rscripts/simgeno.R: -------------------------------------------------------------------------------- 1 | simgeno <- function(M=100, N=30, propNA=0) { 2 | 3 | ## http://evachan.org 4 | ## Eva KF Chan 5 | ## Created: 7 July 2014 6 | ## 7 | ## Very simple function to simulate a matrix of biallelic unphased SNP genotypes in the format: {AA,CC,GG,TT,AC,AG,AT,CG,CT,GT}. This is written predominantly for the purpose of demonstrating the geno_toallelecnt.R function. 8 | ## Inputs: 9 | ## M: The number of SNP markers to simulate. 10 | ## N: The number of samples to simulate. 11 | ## propNA: The proportion of missing data to simulate. 12 | ## Output: 13 | ## A matrix of genotypes. 14 | 15 | alleles = c('A','C','G','T') 16 | a1 <- sample(alleles,M,replace=T) 17 | a2 <- sample(alleles,M,replace=T) 18 | g0=paste(a1,a1,sep='') 19 | g1=paste(a1,a2,sep='') 20 | g2=paste(a2,a2,sep='') 21 | 22 | geno <- matrix( NA, ncol=N, nrow=M, dimnames=list(paste("marker",1:M,sep=""),paste("sample",1:N,sep="")) ) 23 | for( i in 1:M ) { geno[i,] <- sample(c(g0[i],g1[i],g2[i]),N,replace=T) } 24 | if( propNA>0 ) { geno[sample(1:(M*N),ceiling(M*N*propNA))] <- NA } 25 | 26 | geno 27 | } 28 | --------------------------------------------------------------------------------