├── README ├── _static ├── image1.png ├── image2.png ├── image3.png ├── image4.png ├── image5.png ├── image6.png ├── image7.png ├── image8.png ├── image9.png └── image10.png ├── _build └── latex │ └── MultivariateAnalysis.pdf ├── LittleBookofRMultivariateAnalysis └── src │ └── multivariateanalysis.rst ├── index.rst ├── make.bat ├── conf.py └── src ├── installr.rst └── multivariateanalysis.rst /README: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /_static/image1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image1.png -------------------------------------------------------------------------------- /_static/image2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image2.png -------------------------------------------------------------------------------- /_static/image3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image3.png -------------------------------------------------------------------------------- /_static/image4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image4.png -------------------------------------------------------------------------------- /_static/image5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image5.png 
-------------------------------------------------------------------------------- /_static/image6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image6.png -------------------------------------------------------------------------------- /_static/image7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image7.png -------------------------------------------------------------------------------- /_static/image8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image8.png -------------------------------------------------------------------------------- /_static/image9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image9.png -------------------------------------------------------------------------------- /_static/image10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image10.png -------------------------------------------------------------------------------- /_build/latex/MultivariateAnalysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_build/latex/MultivariateAnalysis.pdf -------------------------------------------------------------------------------- /LittleBookofRMultivariateAnalysis/src/multivariateanalysis.rst: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/LittleBookofRMultivariateAnalysis/src/multivariateanalysis.rst -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | Welcome to a Little Book of R for Multivariate Analysis! 2 | ======================================================== 3 | 4 | By `Avril Coghlan `_, 5 | Wellcome Trust Sanger Institute, Cambridge, U.K. 6 | Email: alc@sanger.ac.uk 7 | 8 | This is a simple introduction to multivariate analysis using the R statistics software. 9 | 10 | There is a pdf version of this booklet available at: 11 | `https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf `_. 12 | 13 | If you like this booklet, you may also like to check out my booklet on using 14 | R for biomedical statistics, 15 | `http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/ 16 | `_, 17 | and my booklet on using R for time series analysis, 18 | `http://a-little-book-of-r-for-time-series.readthedocs.org/ 19 | `_. 20 | 21 | Contents: 22 | 23 | .. toctree:: 24 | :maxdepth: 3 25 | 26 | src/installr.rst 27 | src/multivariateanalysis.rst 28 | 29 | Acknowledgements 30 | ---------------- 31 | 32 | Thank you to Noel O'Boyle for helping in using Sphinx, `http://sphinx.pocoo.org `_, to create 33 | this document, and github, `https://github.com/ `_, to store different versions of the document 34 | as I was writing it, and readthedocs, `http://readthedocs.org/ `_, to build and distribute 35 | this document. 
36 | 37 | Contact 38 | ------- 39 | 40 | I will be very grateful if you will send me (`Avril Coghlan `_) corrections or suggestions for improvements to 41 | my email address alc@sanger.ac.uk 42 | 43 | License 44 | ------- 45 | 46 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License 47 | `_. 48 | 49 | -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | REM Command file for Sphinx documentation 4 | 5 | set SPHINXBUILD=C:\Python26\Scripts\sphinx-build 6 | set BUILDDIR=_build 7 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . 8 | if NOT "%PAPER%" == "" ( 9 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% 10 | ) 11 | 12 | if "%1" == "" goto help 13 | 14 | if "%1" == "help" ( 15 | :help 16 | echo.Please use `make ^` where ^ is one of 17 | echo. html to make standalone HTML files 18 | echo. dirhtml to make HTML files named index.html in directories 19 | echo. pickle to make pickle files 20 | echo. json to make JSON files 21 | echo. htmlhelp to make HTML files and a HTML help project 22 | echo. qthelp to make HTML files and a qthelp project 23 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter 24 | echo. changes to make an overview over all changed/added/deprecated items 25 | echo. linkcheck to check all external links for integrity 26 | echo. doctest to run all doctests embedded in the documentation if enabled 27 | goto end 28 | ) 29 | 30 | if "%1" == "clean" ( 31 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i 32 | del /q /s %BUILDDIR%\* 33 | goto end 34 | ) 35 | 36 | if "%1" == "html" ( 37 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html 38 | echo. 39 | echo.Build finished. The HTML pages are in %BUILDDIR%/html. 40 | goto end 41 | ) 42 | 43 | if "%1" == "dirhtml" ( 44 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml 45 | echo. 
46 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. 47 | goto end 48 | ) 49 | 50 | if "%1" == "pickle" ( 51 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle 52 | echo. 53 | echo.Build finished; now you can process the pickle files. 54 | goto end 55 | ) 56 | 57 | if "%1" == "json" ( 58 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json 59 | echo. 60 | echo.Build finished; now you can process the JSON files. 61 | goto end 62 | ) 63 | 64 | if "%1" == "htmlhelp" ( 65 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp 66 | echo. 67 | echo.Build finished; now you can run HTML Help Workshop with the ^ 68 | .hhp project file in %BUILDDIR%/htmlhelp. 69 | goto end 70 | ) 71 | 72 | if "%1" == "qthelp" ( 73 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp 74 | echo. 75 | echo.Build finished; now you can run "qcollectiongenerator" with the ^ 76 | .qhcp project file in %BUILDDIR%/qthelp, like this: 77 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\sampledoc.qhcp 78 | echo.To view the help file: 79 | echo.^> assistant -collectionFile %BUILDDIR%\qthelp\sampledoc.ghc 80 | goto end 81 | ) 82 | 83 | if "%1" == "latex" ( 84 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 85 | echo. 86 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. 87 | goto end 88 | ) 89 | 90 | if "%1" == "changes" ( 91 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes 92 | echo. 93 | echo.The overview file is in %BUILDDIR%/changes. 94 | goto end 95 | ) 96 | 97 | if "%1" == "linkcheck" ( 98 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck 99 | echo. 100 | echo.Link check complete; look for any errors in the above output ^ 101 | or in %BUILDDIR%/linkcheck/output.txt. 102 | goto end 103 | ) 104 | 105 | if "%1" == "doctest" ( 106 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest 107 | echo. 
108 | echo.Testing of doctests in the sources finished, look at the ^ 109 | results in %BUILDDIR%/doctest/output.txt. 110 | goto end 111 | ) 112 | 113 | if "%1" == "pdf" ( 114 | %SPHINXBUILD% -b pdf %ALLSPHINXOPTS% %BUILDDIR%/pdf 115 | echo. 116 | echo.Build finished. The PDF files are in _build/pdf. 117 | goto end 118 | ) 119 | 120 | :end 121 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # sampledoc documentation build configuration file, created by 4 | # sphinx-quickstart on Sat Jan 09 18:21:28 2010. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 13 | 14 | import sys, os 15 | 16 | # If extensions (or modules to document with autodoc) are in another directory, 17 | # add these directories to sys.path here. If the directory is relative to the 18 | # documentation root, use os.path.abspath to make it absolute, like shown here. 19 | #sys.path.append(os.path.abspath('.')) 20 | 21 | # -- General configuration ----------------------------------------------------- 22 | 23 | # Add any Sphinx extension module names here, as strings. They can be extensions 24 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 25 | # (Noel) adding rst2pdf 26 | extensions = ['sphinx.ext.autodoc'] # ,'rst2pdf.pdfbuilder'] 27 | 28 | # Add any paths that contain templates here, relative to this directory. 29 | templates_path = ['_templates'] 30 | 31 | # The suffix of source filenames. 32 | source_suffix = '.rst' 33 | 34 | # The encoding of source files. 35 | #source_encoding = 'utf-8' 36 | 37 | # The master toctree document. 
38 | master_doc = 'index' 39 | 40 | # General information about the project. 41 | project = u'Multivariate Analysis' 42 | copyright = u'2010, Avril Coghlan' 43 | 44 | # The version info for the project you're documenting, acts as replacement for 45 | # |version| and |release|, also used in various other places throughout the 46 | # built documents. 47 | # 48 | # The short X.Y version. 49 | version = '0.1' 50 | # The full version, including alpha/beta/rc tags. 51 | release = '0.1' 52 | 53 | # The language for content autogenerated by Sphinx. Refer to documentation 54 | # for a list of supported languages. 55 | #language = None 56 | 57 | # There are two options for replacing |today|: either, you set today to some 58 | # non-false value, then it is used: 59 | #today = '' 60 | # Else, today_fmt is used as the format for a strftime call. 61 | #today_fmt = '%B %d, %Y' 62 | 63 | # List of documents that shouldn't be included in the build. 64 | #unused_docs = [] 65 | 66 | # List of directories, relative to source directory, that shouldn't be searched 67 | # for source files. 68 | exclude_trees = ['_build'] 69 | 70 | # The reST default role (used for this markup: `text`) to use for all documents. 71 | #default_role = None 72 | 73 | # If true, '()' will be appended to :func: etc. cross-reference text. 74 | #add_function_parentheses = True 75 | 76 | # If true, the current module name will be prepended to all description 77 | # unit titles (such as .. function::). 78 | #add_module_names = True 79 | 80 | # If true, sectionauthor and moduleauthor directives will be shown in the 81 | # output. They are ignored by default. 82 | #show_authors = False 83 | 84 | # The name of the Pygments (syntax highlighting) style to use. 85 | pygments_style = 'sphinx' 86 | 87 | # A list of ignored prefixes for module index sorting. 
88 | #modindex_common_prefix = [] 89 | 90 | 91 | # -- Options for HTML output --------------------------------------------------- 92 | 93 | # The theme to use for HTML and HTML Help pages. Major themes that come with 94 | # Sphinx are currently 'default' and 'sphinxdoc'. 95 | # html_theme = 'default' 96 | html_theme = 'sphinxdoc' 97 | 98 | # Theme options are theme-specific and customize the look and feel of a theme 99 | # further. For a list of options available for each theme, see the 100 | # documentation. 101 | #html_theme_options = {} 102 | 103 | # Add any paths that contain custom themes here, relative to this directory. 104 | #html_theme_path = [] 105 | 106 | # The name for this set of Sphinx documents. If None, it defaults to 107 | # " v documentation". 108 | #html_title = None 109 | 110 | # A shorter title for the navigation bar. Default is the same as html_title. 111 | #html_short_title = None 112 | 113 | # The name of an image file (relative to this directory) to place at the top 114 | # of the sidebar. 115 | #html_logo = None 116 | 117 | # The name of an image file (within the static path) to use as favicon of the 118 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 119 | # pixels large. 120 | #html_favicon = None 121 | 122 | # Add any paths that contain custom static files (such as style sheets) here, 123 | # relative to this directory. They are copied after the builtin static files, 124 | # so a file named "default.css" will overwrite the builtin "default.css". 125 | html_static_path = ['_static'] 126 | 127 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 128 | # using the given strftime format. 129 | #html_last_updated_fmt = '%b %d, %Y' 130 | 131 | # If true, SmartyPants will be used to convert quotes and dashes to 132 | # typographically correct entities. 133 | #html_use_smartypants = True 134 | 135 | # Custom sidebar templates, maps document names to template names. 
136 | #html_sidebars = {} 137 | 138 | # Additional templates that should be rendered to pages, maps page names to 139 | # template names. 140 | #html_additional_pages = {} 141 | 142 | # If false, no module index is generated. 143 | #html_use_modindex = True 144 | 145 | # If false, no index is generated. 146 | #html_use_index = True 147 | 148 | # If true, the index is split into individual pages for each letter. 149 | #html_split_index = False 150 | 151 | # If true, links to the reST sources are added to the pages. 152 | #html_show_sourcelink = True 153 | 154 | # If true, an OpenSearch description file will be output, and all pages will 155 | # contain a tag referring to it. The value of this option must be the 156 | # base URL from which the finished HTML is served. 157 | #html_use_opensearch = '' 158 | 159 | # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). 160 | #html_file_suffix = '' 161 | 162 | # Output file base name for HTML help builder. 163 | htmlhelp_basename = 'sampledocdoc' 164 | 165 | 166 | # -- Options for LaTeX output -------------------------------------------------- 167 | 168 | # The paper size ('letter' or 'a4'). 169 | latex_paper_size = 'a4' 170 | 171 | # The font size ('10pt', '11pt' or '12pt'). 172 | #latex_font_size = '10pt' 173 | 174 | # Grouping the document tree into LaTeX files. List of tuples 175 | # (source start file, target name, title, author, documentclass [howto/manual]). 176 | latex_documents = [ 177 | ('index', 'MultivariateAnalysis.tex', u'A Little Book of R For Multivariate Analysis', 178 | u'Avril Coghlan', 'manual'), 179 | ] 180 | 181 | # The name of an image file (relative to this directory) to place at the top of 182 | # the title page. 183 | #latex_logo = None 184 | 185 | # For "manual" documents, if this is true, then toplevel headings are parts, 186 | # not chapters. 187 | #latex_use_parts = False 188 | 189 | # Additional stuff for the LaTeX preamble. 
190 | #latex_preamble = '' 191 | 192 | # Documents to append as an appendix to all manuals. 193 | #latex_appendices = [] 194 | 195 | # If false, no module index is generated. 196 | #latex_use_modindex = True 197 | 198 | # (Noel) The following is all from rst2pdf 199 | # -- Options for PDF output -------------------------------------------------- 200 | 201 | # Grouping the document tree into PDF files. List of tuples 202 | # (source start file, target name, title, author, options). 203 | # 204 | # If there is more than one author, separate them with \\. 205 | # For example: r'Guido van Rossum\\Fred L. Drake, Jr., editor' 206 | # 207 | # The options element is a dictionary that lets you override 208 | # this config per-document. 209 | # For example, 210 | # ('index', u'MyProject', u'My Project', u'Author Name', 211 | # dict(pdf_compressed = True)) 212 | # would mean that specific document would be compressed 213 | # regardless of the global pdf_compressed setting. 214 | 215 | pdf_documents = [ 216 | ('index', u'MyProject', u'My Project', u'Author Name'), 217 | ] 218 | 219 | # A comma-separated list of custom stylesheets. Example: 220 | pdf_stylesheets = ['sphinx','kerning','a4'] 221 | 222 | # Create a compressed PDF 223 | # Use True/False or 1/0 224 | # Example: compressed=True 225 | #pdf_compressed = False 226 | 227 | # A colon-separated list of folders to search for fonts. Example: 228 | # pdf_font_path = ['/usr/share/fonts', '/usr/share/texmf-dist/fonts/'] 229 | 230 | # Language to be used for hyphenation support 231 | #pdf_language = "en_US" 232 | 233 | # Mode for literal blocks wider than the frame. Can be 234 | # overflow, shrink or truncate 235 | #pdf_fit_mode = "shrink" 236 | 237 | # Section level that forces a break page. 
238 | # For example: 1 means top-level sections start in a new page 239 | # 0 means disabled 240 | #pdf_break_level = 0 241 | 242 | # When a section starts in a new page, force it to be 'even', 'odd', 243 | # or just use 'any' 244 | #pdf_breakside = 'any' 245 | 246 | # Insert footnotes where they are defined instead of 247 | # at the end. 248 | #pdf_inline_footnotes = True 249 | 250 | # verbosity level. 0 1 or 2 251 | #pdf_verbosity = 0 252 | 253 | # If false, no index is generated. 254 | #pdf_use_index = True 255 | 256 | # If false, no modindex is generated. 257 | #pdf_use_modindex = True 258 | 259 | # If false, no coverpage is generated. 260 | #pdf_use_coverpage = True 261 | 262 | # Documents to append as an appendix to all manuals. 263 | #pdf_appendices = [] 264 | 265 | # Enable experimental feature to split table cells. Use it 266 | # if you get "DelayedTable too big" errors 267 | #pdf_splittables = False 268 | 269 | # Set the default DPI for images 270 | #pdf_default_dpi = 72 271 | 272 | # Enable rst2pdf extension modules (default is empty list) 273 | # you need vectorpdf for better sphinx's graphviz support 274 | #pdf_extensions = ['vectorpdf'] 275 | 276 | 277 | -------------------------------------------------------------------------------- /src/installr.rst: -------------------------------------------------------------------------------- 1 | How to install R 2 | ================ 3 | 4 | Introduction to R 5 | ----------------- 6 | 7 | This little booklet has some information on how to use R for multivariate analysis. 8 | 9 | R (`www.r-project.org `_) is a commonly used 10 | free statistics software. R allows you to carry out statistical 11 | analyses in an interactive mode, as well as allowing simple programming. 12 | 13 | Installing R 14 | ------------ 15 | 16 | To use R, you first need to install the R program on your computer.
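If you can already open an R console on the machine, you can also ask R itself which version is installed. A minimal sketch (the exact version string printed will depend on your installation):

```r
# Print the version string of the running R installation,
# eg. something like "R version 2.10.0 (2009-10-26)"
print(R.version.string)

# More detail (platform, major/minor version numbers) is stored
# in the built-in "R.version" list
print(R.version$major)
print(R.version$minor)
```

This is a quick way to compare the installed version against the latest release listed on CRAN.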
17 | 18 | How to check if R is installed on a Windows PC 19 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 20 | 21 | Before you install R on your computer, the first thing to do is to check whether 22 | R is already installed on your computer (for example, by a previous user). 23 | 24 | These instructions will focus on installing R on a Windows PC. However, I will also 25 | briefly mention how to install R on a Macintosh or Linux computer (see below). 26 | 27 | If you are using a Windows PC, there are two ways you can check whether R is 28 | already installed on your computer: 29 | 30 | 1. Check if there is an "R" icon on the desktop of the computer that you are using. 31 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 2 instead. 32 | 2. Click on the "Start" menu at the bottom left of your Windows desktop, and then move your 33 | mouse over "All Programs" in the menu that pops up. See if "R" appears in the list 34 | of programs that pops up. If it does, it means that R is already installed on your 35 | computer, and you can start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, 36 | eg. R 2.10.0) from the list. 37 | 38 | If either (1) or (2) above succeeds in starting R, it means that R is already installed 39 | on the computer that you are using. (If neither succeeds, R is not installed yet.) 40 | If there is an old version of R installed on the Windows PC that you are using, 41 | it is worth installing the latest version of R, to make sure that you have all the 42 | latest R functions available to use. 43 | 44 | Finding out what is the latest version of R 45 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 46 | 47 | To find out what is the latest version of R, you can look at the CRAN (Comprehensive 48 | R Archive Network) website, `http://cran.r-project.org/ `_. 49 | 50 | Beside "The latest release" (about half way down the page), it will say something like 51 | "R-X.X.X.tar.gz" (eg.
"R-2.12.1.tar.gz"). This means that the latest release of R is X.X.X (for 52 | example, 2.12.1). 53 | 54 | New releases of R are made very regularly (approximately once a month), as R is actively being 55 | improved all the time. It is worthwhile installing new versions of R regularly, to make sure 56 | that you have a recent version of R (to ensure compatibility with all the latest versions of 57 | the R packages that you have downloaded). 58 | 59 | Installing R on a Windows PC 60 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 61 | 62 | To install R on your Windows computer, follow these steps: 63 | 64 | 1. Go to `http://ftp.heanet.ie/mirrors/cran.r-project.org `_. 65 | 2. Under "Download and Install R", click on the "Windows" link. 66 | 3. Under "Subdirectories", click on the "base" link. 67 | 4. On the next page, you should see a link saying something like "Download R 2.10.1 for Windows" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.11.1). 68 | Click on this link. 69 | 5. You may be asked if you want to save or run a file "R-2.10.1-win32.exe". Choose "Save" and 70 | save the file on the Desktop. Then double-click on the icon for the file to run it. 71 | 6. You will be asked what language to install it in - choose English. 72 | 7. The R Setup Wizard will appear in a window. Click "Next" at the bottom of the R Setup wizard 73 | window. 74 | 8. The next page says "Information" at the top. Click "Next" again. 75 | 9. The next page says "Information" at the top. Click "Next" again. 76 | 10. The next page says "Select Destination Location" at the top. 77 | By default, it will suggest installing R in "C:\\Program Files" on your computer. 78 | 11. Click "Next" at the bottom of the R Setup wizard window. 79 | 12. The next page says "Select components" at the top. Click "Next" again. 80 | 13. The next page says "Startup options" at the top. Click "Next" again. 81 | 14. The next page says "Select start menu folder" at the top. Click "Next" again. 82 | 15. 
The next page says "Select additional tasks" at the top. Click "Next" again. 83 | 16. R should now be installed. This will take about a minute. When R has finished, you will 84 | see "Completing the R for Windows Setup Wizard" appear. Click "Finish". 85 | 17. To start R, you can either follow step 18 or 19: 86 | 18. Check if there is an "R" icon on the desktop of the computer that you are using. 87 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 19 instead. 88 | 19. Click on the "Start" button at the bottom left of your computer screen, and then 89 | choose "All programs", and start R by selecting "R" (or R X.X.X, where 90 | X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 91 | 20. The R console (a rectangle) should pop up: 92 | 93 | |image3| 94 | 95 | How to install R on non-Windows computers (eg. Macintosh or Linux computers) 96 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 97 | 98 | The instructions above are for installing R on a Windows PC. If you want to install R 99 | on a computer that has a non-Windows operating system (for example, a Macintosh or a computer running Linux), 100 | you should download the appropriate R installer for that operating system at 101 | `http://ftp.heanet.ie/mirrors/cran.r-project.org 102 | `_ and 103 | follow the R installation instructions for the appropriate operating system at 104 | `http://ftp.heanet.ie/mirrors/cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed_003f 105 | `_. 106 | 107 | Installing R packages 108 | --------------------- 109 | 110 | R comes with some standard packages that are installed when you install R. However, in this 111 | booklet I will also tell you how to use some additional R packages that are useful, for example, 112 | the "rmeta" package. These additional packages do not come with the standard installation of R, 113 | so you need to install them yourself.
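If you prefer typing commands to clicking through menus, the same installation can usually be done directly from the R console with the built-in install.packages() function. A sketch, using the "rmeta" package from the text as the example (it needs an internet connection to reach a CRAN mirror):

```r
# Download and install the "rmeta" package from a CRAN mirror
# (does the same job as "Install package(s)" in the "Packages" menu)
install.packages("rmeta")

# After installing, load the package in each new R session before using it
library("rmeta")
```

As with the menu route, R may ask you to pick a download mirror the first time.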
114 | 115 | How to install an R package 116 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 117 | 118 | Once you have installed R on a Windows computer (following the steps above), you can install 119 | an additional package by following the steps below: 120 | 121 | 1. To start R, follow either step 2 or 3: 122 | 2. Check if there is an "R" icon on the desktop of the computer that you are using. 123 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 3 instead. 124 | 3. Click on the "Start" button at the bottom left of your computer screen, and then 125 | choose "All programs", and start R by selecting "R" (or R X.X.X, where 126 | X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 127 | 4. The R console (a rectangle) should pop up. 128 | 5. Once you have started R, you can now install an R package (eg. the "rmeta" package) by 129 | choosing "Install package(s)" from the "Packages" menu at the top of the R console. 130 | This will ask you what website you want to download the package from; you should choose 131 | "Ireland" (or another country, if you prefer). It will also bring up a list of available 132 | packages that you can install, and you should choose the package that you want to install 133 | from that list (eg. "rmeta"). 134 | 6. This will install the "rmeta" package. 135 | 7. The "rmeta" package is now installed. Whenever you want to use the "rmeta" package after this, 136 | after starting R, you first have to load the package by typing into the R console: 137 | 138 | .. highlight:: r 139 | 140 | :: 141 | 142 | > library("rmeta") 143 | 144 | Note that there are some additional R packages for bioinformatics that are part of a special 145 | set of R packages called Bioconductor (`www.bioconductor.org `_), 146 | such as the "yeastExpData" R package, the "Biostrings" R package, etc.
147 | These Bioconductor packages need to be installed using a different, Bioconductor-specific procedure 148 | (see `How to install a Bioconductor R package`_ below). 149 | 150 | How to install a Bioconductor R package 151 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 152 | 153 | The procedure above can be used to install the majority of R packages. However, the 154 | Bioconductor set of bioinformatics R packages need to be installed by a special procedure. 155 | Bioconductor (`www.bioconductor.org `_) 156 | is a group of R packages that have been developed for bioinformatics. This includes 157 | R packages such as "yeastExpData", "Biostrings", etc. 158 | 159 | To install the Bioconductor packages, follow these steps: 160 | 161 | 1. To start R, follow either step 2 or 3: 162 | 2. Check if there is an "R" icon on the desktop of the computer that you are using. 163 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 3 instead. 164 | 3. Click on the "Start" button at the bottom left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 165 | 4. The R console (a rectangle) should pop up. 166 | 5. Once you have started R, now type in the R console: 167 | 168 | .. highlight:: r 169 | 170 | :: 171 | 172 | > source("http://bioconductor.org/biocLite.R") 173 | > biocLite() 174 | 175 | 6. This will install a core set of Bioconductor packages ("affy", "affydata", "affyPLM", 176 | "annaffy", "annotate", "Biobase", "Biostrings", "DynDoc", "gcrma", "genefilter", 177 | "geneplotter", "hgu95av2.db", "limma", "marray", "matchprobes", "multtest", "ROC", 178 | "vsn", "xtable", "affyQCReport"). 179 | This takes a few minutes (eg. 10 minutes). 180 | 7. At a later date, you may wish to install some extra Bioconductor packages that do not belong 181 | to the core set of Bioconductor packages. 
For example, to install the Bioconductor package called 182 | "yeastExpData", start R and type in the R console: 183 | 184 | .. highlight:: r 185 | 186 | :: 187 | 188 | > source("http://bioconductor.org/biocLite.R") 189 | > biocLite("yeastExpData") 190 | 191 | 8. Whenever you want to use a package after installing it, you need to load it into R by typing: 192 | 193 | .. highlight:: r 194 | 195 | :: 196 | 197 | > library("yeastExpData") 198 | 199 | Running R 200 | ----------- 201 | 202 | To use R, you first need to start the R program on your computer. 203 | You should have already installed R on your computer (see above). 204 | 205 | To start R, you can either follow step 1 or 2: 206 | 1. Check if there is an "R" icon on the desktop of the computer that you are using. 207 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 2 instead. 208 | 2. Click on the "Start" button at the bottom left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 209 | 210 | This should bring up a new window, which is the *R console*. 211 | 212 | A brief introduction to R 213 | ------------------------- 214 | 215 | You will type R commands into the R console in order to carry out 216 | analyses in R. In the R console you will see: 217 | 218 | .. highlight:: r 219 | 220 | :: 221 | 222 | > 223 | 224 | This is the R prompt. We type the commands needed for a particular 225 | task after this prompt. The command is carried out after you hit 226 | the Return key. 227 | 228 | Once you have started R, you can start typing in commands, and the 229 | results will be calculated immediately, for example: 230 | 231 | .. highlight:: r 232 | 233 | :: 234 | 235 | > 2*3 236 | [1] 6 237 | > 10-3 238 | [1] 7 239 | 240 | All variables (scalars, vectors, matrices, etc.) created by R are 241 | called *objects*. 
In R, we assign values to variables using an 242 | arrow. For example, we can assign the value 2\*3 to the variable 243 | *x* using the command: 244 | 245 | .. highlight:: r 246 | 247 | :: 248 | 249 | > x <- 2*3 250 | 251 | To view the contents of any R object, just type its name, and the 252 | contents of that R object will be displayed: 253 | 254 | .. highlight:: r 255 | 256 | :: 257 | 258 | > x 259 | [1] 6 260 | 261 | There are several possible different types of objects in R, 262 | including scalars, vectors, matrices, arrays, data frames, tables, 263 | and lists. The scalar variable *x* above is one example of an R 264 | object. While a scalar variable such as *x* has just one element, a 265 | vector consists of several elements. The elements in a vector are 266 | all of the same type (eg. numeric or characters), while lists may 267 | include elements such as characters as well as numeric quantities. 268 | 269 | To create a vector, we can use the c() (combine) function. For 270 | example, to create a vector called *myvector* that has elements 271 | with values 8, 6, 9, 10, and 5, we type: 272 | 273 | .. highlight:: r 274 | 275 | :: 276 | 277 | > myvector <- c(8, 6, 9, 10, 5) 278 | 279 | To see the contents of the variable *myvector*, we can just type 280 | its name: 281 | 282 | .. highlight:: r 283 | 284 | :: 285 | 286 | > myvector 287 | [1] 8 6 9 10 5 288 | 289 | The [1] is the index of the first element in the vector. We can 290 | extract any element of the vector by typing the vector name with 291 | the index of that element given in square brackets. For example, to 292 | get the value of the 4th element in the vector *myvector*, we 293 | type: 294 | 295 | .. highlight:: r 296 | 297 | :: 298 | 299 | > myvector[4] 300 | [1] 10 301 | 302 | In contrast to a vector, a list can contain elements of different 303 | types, for example, both numeric and character elements. A list can 304 | also include other variables such as a vector. 
The list() function 305 | is used to create a list. For example, we could create a list 306 | *mylist* by typing: 307 | 308 | .. highlight:: r 309 | 310 | :: 311 | 312 | > mylist <- list(name="Fred", wife="Mary", myvector) 313 | 314 | We can then print out the contents of the list *mylist* by typing 315 | its name: 316 | 317 | .. highlight:: r 318 | 319 | :: 320 | 321 | > mylist 322 | $name 323 | [1] "Fred" 324 | 325 | $wife 326 | [1] "Mary" 327 | 328 | [[3]] 329 | [1] 8 6 9 10 5 330 | 331 | The elements in a list are numbered, and can be referred to using 332 | indices. We can extract an element of a list by typing the list 333 | name with the index of the element given in double square brackets 334 | (in contrast to a vector, where we only use single square 335 | brackets). Thus, we can extract the second and third elements from 336 | *mylist* by typing: 337 | 338 | .. highlight:: r 339 | 340 | :: 341 | 342 | > mylist[[2]] 343 | [1] "Mary" 344 | > mylist[[3]] 345 | [1] 8 6 9 10 5 346 | 347 | Elements of lists may also be named, and in this case the elements 348 | may be referred to by giving the list name, followed by "$", 349 | followed by the element name. For example, *mylist$name* is the 350 | same as *mylist[[1]]* and *mylist$wife* is the same as 351 | *mylist[[2]]*: 352 | 353 | .. highlight:: r 354 | 355 | :: 356 | 357 | > mylist$wife 358 | [1] "Mary" 359 | 360 | We can find out the names of the named elements in a list by using 361 | the attributes() function, for example: 362 | 363 | .. highlight:: r 364 | 365 | :: 366 | 367 | > attributes(mylist) 368 | $names 369 | [1] "name" "wife" "" 370 | 371 | When you use the attributes() function to find the named elements 372 | of a list variable, the named elements are always listed under a 373 | heading "$names". 
Therefore, we see that the named elements of the 374 | list variable *mylist* are called "name" and "wife", and we can 375 | retrieve their values by typing *mylist$name* and *mylist$wife*, 376 | respectively. 377 | 378 | Another type of object that you will encounter in R is a *table* 379 | variable. For example, if we made a vector variable *mynames* 380 | containing the names of children in a class, we can use the table() 381 | function to produce a table variable that contains the number of 382 | children with each possible name: 383 | 384 | .. highlight:: r 385 | 386 | :: 387 | 388 | > mynames <- c("Mary", "John", "Ann", "Sinead", "Joe", "Mary", "Jim", "John", "Simon") 389 | > table(mynames) 390 | mynames 391 | Ann Jim Joe John Mary Simon Sinead 392 | 1 1 1 2 2 1 1 393 | 394 | We can store the table variable produced by the function table(), 395 | and call the stored table "mytable", by typing: 396 | 397 | .. highlight:: r 398 | 399 | :: 400 | 401 | > mytable <- table(mynames) 402 | 403 | To access elements in a table variable, you need to use double 404 | square brackets, just like accessing elements in a list. For 405 | example, to access the fourth element in the table *mytable* (the 406 | number of children called "John"), we type: 407 | 408 | .. highlight:: r 409 | 410 | :: 411 | 412 | > mytable[[4]] 413 | [1] 2 414 | 415 | Alternatively, you can use the name of the fourth element in 416 | the table ("John") to find the value of that table element: 417 | 418 | .. highlight:: r 419 | 420 | :: 421 | 422 | > mytable[["John"]] 423 | [1] 2 424 | 425 | Functions in R usually require *arguments*, which are input 426 | variables (ie. objects) that are passed to them, which they then 427 | carry out some operation on. For example, the log10() function is 428 | passed a number, and it then calculates the log to the base 10 of 429 | that number: 430 | 431 | .. 
highlight:: r 432 | 433 | :: 434 | 435 | > log10(100) 436 | [1] 2 437 | 438 | In R, you can get help about a particular function by using the 439 | help() function. For example, if you want help about the log10() 440 | function, you can type: 441 | 442 | .. highlight:: r 443 | 444 | :: 445 | 446 | > help("log10") 447 | 448 | When you use the help() function, a box or webpage will pop up with 449 | information about the function that you asked for help with. 450 | 451 | If you are not sure of the name of a function, but think you know 452 | part of its name, you can search for the function name using the 453 | help.search() and RSiteSearch() functions. The help.search() function 454 | searches to see if you already have a function installed (from one of 455 | the R packages that you have installed) that may be related to some 456 | topic you're interested in. The RSiteSearch() function searches all 457 | R functions (including those in packages that you haven't yet installed) 458 | for functions related to the topic you are interested in. 459 | 460 | For example, if you want to know if there 461 | is a function to calculate the standard deviation of a set of 462 | numbers, you can search for the names of all installed functions containing 463 | the word "deviation" in their description by typing: 464 | 465 | ..
highlight:: r 466 | 467 | :: 468 | 469 | > help.search("deviation") 470 | Help files with alias or concept or title matching 471 | 'deviation' using fuzzy matching: 472 | 473 | genefilter::rowSds 474 | Row variance and standard deviation of 475 | a numeric array 476 | nlme::pooledSD Extract Pooled Standard Deviation 477 | stats::mad Median Absolute Deviation 478 | stats::sd Standard Deviation 479 | vsn::meanSdPlot Plot row standard deviations versus row means 480 | 481 | Among the functions that were found is the function sd() in the 482 | "stats" package (an R package that comes with the standard R 483 | installation), which is used for calculating the standard deviation. 484 | 485 | In the example above, the help.search() function found a relevant 486 | function (sd() here). However, if you did not find what you were looking 487 | for with help.search(), you could then use the RSiteSearch() function to 488 | see if a search of all functions described on the R website may find 489 | something relevant to the topic that you're interested in: 490 | 491 | .. highlight:: r 492 | 493 | :: 494 | 495 | > RSiteSearch("deviation") 496 | 497 | The results of the RSiteSearch() function will be hits to descriptions 498 | of R functions, as well as to R mailing list discussions of those 499 | functions. 500 | 501 | We can perform computations with R using objects such as scalars 502 | and vectors. For example, to calculate the average of the values in 503 | the vector *myvector* (ie. the average of 8, 6, 9, 10 and 5), we 504 | can use the mean() function: 505 | 506 | .. highlight:: r 507 | 508 | :: 509 | 510 | > mean(myvector) 511 | [1] 7.6 512 | 513 | We have been using built-in R functions such as mean(), 514 | length(), print(), plot(), etc. We can also create our own 515 | functions in R to do calculations that we want to carry out very 516 | often on different input data sets.
For example, we can create a 517 | function to calculate the value of 20 plus square of some input 518 | number: 519 | 520 | .. highlight:: r 521 | 522 | :: 523 | 524 | > myfunction <- function(x) { return(20 + (x*x)) } 525 | 526 | This function will calculate the square of a number (*x*), and then 527 | add 20 to that value. The return() statement returns the calculated 528 | value. Once you have typed in this function, the function is then 529 | available for use. For example, we can use the function for 530 | different input numbers (eg. 10, 25): 531 | 532 | .. highlight:: r 533 | 534 | :: 535 | 536 | > myfunction(10) 537 | [1] 120 538 | > myfunction(25) 539 | [1] 645 540 | 541 | To quit R, type: 542 | 543 | .. highlight:: r 544 | 545 | :: 546 | 547 | > q() 548 | 549 | 550 | Links and Further Reading 551 | ------------------------- 552 | 553 | Some links are included here for further reading. 554 | 555 | For a more in-depth introduction to R, a good online tutorial is 556 | available on the "Kickstarting R" website, 557 | `cran.r-project.org/doc/contrib/Lemon-kickstart `_. 558 | 559 | There is another nice (slightly more in-depth) tutorial to R 560 | available on the "Introduction to R" website, 561 | `cran.r-project.org/doc/manuals/R-intro.html `_. 562 | 563 | Acknowledgements 564 | ---------------- 565 | 566 | For very helpful comments and suggestions for improvements on the installation instructions, thank you very much to Friedrich Leisch and Phil Spector. 567 | 568 | Contact 569 | ------- 570 | 571 | I will be very grateful if you will send me (`Avril Coghlan `_) corrections or suggestions for improvements to 572 | my email address alc@sanger.ac.uk 573 | 574 | License 575 | ------- 576 | 577 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License 578 | `_. 579 | 580 | .. 
|image3| image:: ../_static/image3.png 581 | -------------------------------------------------------------------------------- /src/multivariateanalysis.rst: -------------------------------------------------------------------------------- 1 | Using R for Multivariate Analysis 2 | ================================= 3 | 4 | Multivariate Analysis 5 | --------------------- 6 | 7 | This booklet tells you how to use the R statistical software to carry out some simple multivariate analyses, 8 | with a focus on principal components analysis (PCA) and linear discriminant analysis (LDA). 9 | 10 | This booklet assumes that the reader has some basic knowledge of multivariate analyses, and 11 | the principal focus of the booklet is not to explain multivariate analyses, but rather 12 | to explain how to carry out these analyses using R. 13 | 14 | If you are new to multivariate analysis, and want to learn more about any of the concepts 15 | presented here, I would highly recommend the Open University book 16 | "Multivariate Analysis" (product code M249/03), available 17 | from `the Open University Shop `_. 18 | 19 | In the examples in this booklet, I will be using data sets from the UCI Machine 20 | Learning Repository, `http://archive.ics.uci.edu/ml <http://archive.ics.uci.edu/ml>`_. 21 | 22 | There is a pdf version of this booklet available at 23 | `https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf <https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf>`_. 24 | 25 | If you like this booklet, you may also like to check out my booklet on using 26 | R for biomedical statistics, 27 | `http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/ 28 | <http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/>`_, 29 | and my booklet on using R for time series analysis, 30 | `http://a-little-book-of-r-for-time-series.readthedocs.org/ 31 | <http://a-little-book-of-r-for-time-series.readthedocs.org/>`_.
32 | 33 | Reading Multivariate Analysis Data into R 34 | ----------------------------------------- 35 | 36 | The first thing that you will want to do to analyse your multivariate data will be to read 37 | it into R, and to plot the data. You can read data into R using the read.table() function. 38 | 39 | For example, the file `http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data 40 | `_ 41 | contains data on concentrations of 13 different chemicals in wines grown in the same region in Italy that are 42 | derived from three different cultivars. 43 | 44 | The data set looks like this: 45 | 46 | .. highlight:: r 47 | 48 | :: 49 | 50 | 1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065 51 | 1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050 52 | 1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185 53 | 1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480 54 | 1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735 55 | ... 56 | 57 | There is one row per wine sample. 58 | The first column contains the cultivar of a wine sample (labelled 1, 2 or 3), and the following thirteen columns 59 | contain the concentrations of the 13 different chemicals in that sample. 60 | The columns are separated by commas. 61 | 62 | When we read the file into R using the read.table() function, we need to use the "sep=" 63 | argument in read.table() to tell it that the columns are separated by commas. 64 | That is, we can read in the file using the read.table() function as follows: 65 | 66 | .. 
highlight:: r 67 | 68 | :: 69 | 70 | > wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", 71 | sep=",") 72 | > wine 73 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 74 | 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.640000 1.040 3.92 1065 75 | 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.380000 1.050 3.40 1050 76 | 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.680000 1.030 3.17 1185 77 | 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.800000 0.860 3.45 1480 78 | 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.320000 1.040 2.93 735 79 | ... 80 | 176 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.200000 0.590 1.56 835 81 | 177 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.300000 0.600 1.62 840 82 | 178 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.200000 0.610 1.60 560 83 | 84 | In this case the data on 178 samples of wine has been read into the variable 'wine'. 85 | 86 | Plotting Multivariate Data 87 | -------------------------- 88 | 89 | Once you have read a multivariate data set into R, the next step is usually to make a plot of the data. 90 | 91 | A Matrix Scatterplot 92 | ^^^^^^^^^^^^^^^^^^^^ 93 | 94 | One common way of plotting multivariate data is to make a "matrix scatterplot", showing each pair of 95 | variables plotted against each other. We can use the "scatterplotMatrix()" function from the "car" 96 | R package to do this. To use this function, we first need to install the "car" R package 97 | (for instructions on how to install an R package, see `How to install an R package 98 | <./installr.html#how-to-install-an-r-package>`_). 99 | 100 | Once you have installed the "car" R package, you can load the "car" R package by typing: 101 | 102 | .. highlight:: r 103 | 104 | :: 105 | 106 | > library("car") 107 | 108 | You can then use the "scatterplotMatrix()" function to plot the multivariate data. 
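If the "car" package is not already installed on your system, it can be fetched from CRAN first. This is a minimal sketch using R's standard install.packages() function (you may be prompted to choose a CRAN mirror the first time):

::

    > install.packages("car")
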
109 | 110 | To use the scatterplotMatrix() function, you need to give it as its input the variables 111 | that you want included in the plot. Say, for example, that we just want to include the 112 | variables corresponding to the concentrations of the first five chemicals. These are stored in 113 | columns 2-6 of the variable "wine". We can extract just these columns from the variable 114 | "wine" by typing: 115 | 116 | :: 117 | 118 | > wine[2:6] 119 | V2 V3 V4 V5 V6 120 | 1 14.23 1.71 2.43 15.6 127 121 | 2 13.20 1.78 2.14 11.2 100 122 | 3 13.16 2.36 2.67 18.6 101 123 | 4 14.37 1.95 2.50 16.8 113 124 | 5 13.24 2.59 2.87 21.0 118 125 | ... 126 | 127 | To make a matrix scatterplot of just these five variables using the scatterplotMatrix() function, we type: 128 | 129 | :: 130 | 131 | > scatterplotMatrix(wine[2:6]) 132 | 133 | 134 | |image1| 135 | 136 | 137 | In this matrix scatterplot, the diagonal cells show histograms of each of the variables, in this 138 | case the concentrations of the first five chemicals (variables V2, V3, V4, V5, V6). 139 | 140 | Each of the off-diagonal cells is a scatterplot of two of the five chemicals, for example, the second cell in the 141 | first row is a scatterplot of V2 (y-axis) against V3 (x-axis). 142 | 143 | A Scatterplot with the Data Points Labelled by their Group 144 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 145 | 146 | If you see an interesting scatterplot for two variables in the matrix scatterplot, you may want to 147 | plot that scatterplot in more detail, with the data points labelled by their group (their cultivar in this case). 148 | 149 | For example, in the matrix scatterplot above, the cell in the third column of the fourth row down is a scatterplot 150 | of V5 (y-axis) against V4 (x-axis). If you look at this scatterplot, it appears that there may be a 151 | positive relationship between V5 and V4.
152 | 153 | We may therefore decide to examine the relationship between V5 and V4 more closely, by plotting a scatterplot 154 | of these two variables, with the data points labelled by their group (their cultivar). To plot a scatterplot 155 | of two variables, we can use the "plot" R function. The V4 and V5 variables are stored in the columns 156 | V4 and V5 of the variable "wine", so can be accessed by typing wine$V4 or wine$V5. Therefore, to plot 157 | the scatterplot, we type: 158 | 159 | :: 160 | 161 | > plot(wine$V4, wine$V5) 162 | 163 | |image2| 164 | 165 | If we want to label the data points by their group (the cultivar of wine here), we can use the "text" function 166 | in R to plot some text beside every data point. In this case, the cultivar of wine is stored in the column 167 | V1 of the variable "wine", so we type: 168 | 169 | :: 170 | 171 | > text(wine$V4, wine$V5, wine$V1, cex=0.7, pos=4, col="red") 172 | 173 | If you look at the help page for the "text" function, you will see that "pos=4" will plot the text just to the 174 | right of the symbol for a data point. The "cex=0.7" option will plot the text at 70% of the default size, and 175 | setting "col" to "red" will plot the text in red. This gives us the following plot: 176 | 177 | |image4| 178 | 179 | We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to have 180 | lower values of V4 compared to the wines of cultivar 1. 181 | 182 | A Profile Plot 183 | ^^^^^^^^^^^^^^ 184 | 185 | Another type of plot that is useful is a "profile plot", which shows the variation in each of the 186 | variables, by plotting the value of each of the variables for each of the samples. 187 | 188 | The function "makeProfilePlot()" below can be used to make a profile plot. This function requires 189 | the "RColorBrewer" library.
To use this function, we first need to install the "RColorBrewer" R package 190 | (for instructions on how to install an R package, see `How to install an R package 191 | <./installr.html#how-to-install-an-r-package>`_). 192 | 193 | :: 194 | 195 | > makeProfilePlot <- function(mylist,names) 196 | { 197 | require(RColorBrewer) 198 | # find out how many variables we want to include 199 | numvariables <- length(mylist) 200 | # choose 'numvariables' colours from the "Set1" palette 201 | colours <- brewer.pal(numvariables,"Set1") 202 | # find out the minimum and maximum values of the variables: 203 | mymin <- 1e+20 204 | mymax <- -1e+20 205 | for (i in 1:numvariables) 206 | { 207 | vectori <- mylist[[i]] 208 | mini <- min(vectori) 209 | maxi <- max(vectori) 210 | if (mini < mymin) { mymin <- mini } 211 | if (maxi > mymax) { mymax <- maxi } 212 | } 213 | # plot the variables 214 | for (i in 1:numvariables) 215 | { 216 | vectori <- mylist[[i]] 217 | namei <- names[i] 218 | colouri <- colours[i] 219 | if (i == 1) { plot(vectori,col=colouri,type="l",ylim=c(mymin,mymax)) } 220 | else { points(vectori, col=colouri,type="l") } 221 | lastxval <- length(vectori) 222 | lastyval <- vectori[length(vectori)] 223 | text((lastxval-10),(lastyval),namei,col="black",cex=0.6) 224 | } 225 | } 226 | 227 | To use this function, you first need to copy and paste it into R. The arguments to the 228 | function are a list variable containing the variables themselves, and 229 | a vector containing the names of the variables that you want to plot.
230 | 231 | For example, to make a profile plot of the concentrations of the first five chemicals in the wine samples 232 | (stored in columns V2, V3, V4, V5, V6 of variable "wine"), we type: 233 | 234 | :: 235 | 236 | > library(RColorBrewer) 237 | > names <- c("V2","V3","V4","V5","V6") 238 | > mylist <- list(wine$V2,wine$V3,wine$V4,wine$V5,wine$V6) 239 | > makeProfilePlot(mylist,names) 240 | 241 | |image5| 242 | 243 | It is clear from the profile plot that the mean and standard deviation for V6 is 244 | quite a lot higher than that for the other variables. 245 | 246 | .. xxx why did they do quite a different profile plot in the assignment answer? I sent a Q to the forum 247 | 248 | Calculating Summary Statistics for Multivariate Data 249 | ---------------------------------------------------- 250 | 251 | Another thing that you are likely to want to do is to calculate summary statistics such as the 252 | mean and standard deviation for each of the variables in your multivariate data set. 253 | 254 | .. sidebar:: sapply 255 | 256 | The "sapply()" function can be used to apply some other function to each column 257 | in a data frame, eg. sapply(mydataframe,sd) will calculate the standard deviation of 258 | each column in a dataframe "mydataframe". 259 | 260 | This is easy to do, using the "mean()" and "sd()" functions in R. For example, say we want 261 | to calculate the mean and standard deviations of each of the 13 chemical concentrations in the 262 | wine samples. These are stored in columns 2-14 of the variable "wine". So we type: 263 | 264 | :: 265 | 266 | > sapply(wine[2:14],mean) 267 | V2 V3 V4 V5 V6 V7 268 | 13.0006180 2.3363483 2.3665169 19.4949438 99.7415730 2.2951124 269 | V8 V9 V10 V11 V12 V13 270 | 2.0292697 0.3618539 1.5908989 5.0580899 0.9574494 2.6116854 271 | V14 272 | 746.8932584 273 | 274 | This tells us that the mean of variable V2 is 13.0006180, the mean of V3 is 2.3363483, and so on. 
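As the sapply() sidebar suggests, the same pattern works with any summary function that accepts a numeric vector. For example, a sketch using the standard median() function to get the median of each of the 13 chemical concentrations (output not shown here):

::

    > sapply(wine[2:14],median)
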
275 | 276 | Similarly, to get the standard deviations of the 13 chemical concentrations, we type: 277 | 278 | :: 279 | 280 | > sapply(wine[2:14],sd) 281 | V2 V3 V4 V5 V6 V7 282 | 0.8118265 1.1171461 0.2743440 3.3395638 14.2824835 0.6258510 283 | V8 V9 V10 V11 V12 V13 284 | 0.9988587 0.1244533 0.5723589 2.3182859 0.2285716 0.7099904 285 | V14 286 | 314.9074743 287 | 288 | We can see here that it would make sense to standardise in order to compare the variables because the variables 289 | have very different standard deviations - the standard deviation of V14 is 314.9074743, while the standard deviation 290 | of V9 is just 0.1244533. Thus, in order to compare the variables, we need to standardise each variable so that 291 | it has a sample variance of 1 and sample mean of 0. We will explain below how to standardise the variables. 292 | 293 | Means and Variances Per Group 294 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 295 | 296 | It is often interesting to calculate the means and standard deviations for just the samples 297 | from a particular group, for example, for the wine samples from each cultivar. The cultivar 298 | is stored in the column "V1" of the variable "wine". 
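Before splitting the data up by cultivar, it can be useful to check how many samples there are from each cultivar. A quick way to do this is with the table() function (introduced earlier), applied to the "V1" column:

::

    > table(wine$V1)

     1  2  3
    59 71 48
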
299 | 300 | To extract out the data for just cultivar 2, we can type: 301 | 302 | :: 303 | 304 | > cultivar2wine <- wine[wine$V1=="2",] 305 | 306 | We can then calculate the mean and standard deviations of the 13 chemicals' concentrations, for 307 | just the cultivar 2 samples: 308 | 309 | :: 310 | 311 | > sapply(cultivar2wine[2:14],mean) 312 | V2 V3 V4 V5 V6 V7 V8 313 | 12.278732 1.932676 2.244789 20.238028 94.549296 2.258873 2.080845 314 | V9 V10 V11 V12 V13 V14 315 | 0.363662 1.630282 3.086620 1.056282 2.785352 519.507042 316 | > sapply(cultivar2wine[2:14],sd) 317 | V2 V3 V4 V5 V6 V7 V8 318 | 0.5379642 1.0155687 0.3154673 3.3497704 16.7534975 0.5453611 0.7057008 319 | V9 V10 V11 V12 V13 V14 320 | 0.1239613 0.6020678 0.9249293 0.2029368 0.4965735 157.2112204 321 | 322 | You can calculate the mean and standard deviation of the 13 chemicals' concentrations for just cultivar 1 samples, 323 | or for just cultivar 3 samples, in a similar way. 324 | 325 | However, for convenience, you might want to use the function "printMeanAndSdByGroup()" below, which 326 | prints out the mean and standard deviation of the variables for each group in your data set: 327 | 328 | :: 329 | 330 | > printMeanAndSdByGroup <- function(variables,groupvariable) 331 | { 332 | # find the names of the variables 333 | variablenames <- c(names(groupvariable),names(as.data.frame(variables))) 334 | # within each group, find the mean of each variable 335 | groupvariable <- groupvariable[,1] # ensures groupvariable is not a list 336 | means <- aggregate(as.matrix(variables) ~ groupvariable, FUN = mean) 337 | names(means) <- variablenames 338 | print(paste("Means:")) 339 | print(means) 340 | # within each group, find the standard deviation of each variable: 341 | sds <- aggregate(as.matrix(variables) ~ groupvariable, FUN = sd) 342 | names(sds) <- variablenames 343 | print(paste("Standard deviations:")) 344 | print(sds) 345 | # within each group, find the number of samples: 346 | samplesizes <-
aggregate(as.matrix(variables) ~ groupvariable, FUN = length) 347 | names(samplesizes) <- variablenames 348 | print(paste("Sample sizes:")) 349 | print(samplesizes) 350 | } 351 | 352 | To use the function "printMeanAndSdByGroup()", you first need to copy and paste it into R. The 353 | arguments of the function are the variables that you want to calculate means and standard deviations for, 354 | and the variable containing the group of each sample. For example, to calculate the mean and standard deviation 355 | for each of the 13 chemical concentrations, for each of the three different wine cultivars, we type: 356 | 357 | :: 358 | 359 | > printMeanAndSdByGroup(wine[2:14],wine[1]) 360 | [1] "Means:" 361 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 362 | 1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119 363 | 2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070 364 | 3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958 365 | [1] "Standard deviations:" 366 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 367 | 1 1 0.4621254 0.6885489 0.2271660 2.546322 10.49895 0.3389614 0.3974936 0.07004924 0.4121092 1.2385728 0.1164826 0.3570766 221.5208 368 | 2 2 0.5379642 1.0155687 0.3154673 3.349770 16.75350 0.5453611 0.7057008 0.12396128 0.6020678 0.9249293 0.2029368 0.4965735 157.2112 369 | 3 3 0.5302413 1.0879057 0.1846902 2.258161 10.89047 0.3569709 0.2935041 0.12413959 0.4088359 2.3109421 0.1144411 0.2721114 115.0970 370 | [1] "Sample sizes:" 371 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 372 | 1 1 59 59 59 59 59 59 59 59 59 59 59 59 59 373 | 2 2 71 71 71 71 71 71 71 71 71 71 71 71 71 374 | 3 3 48 48 48 48 48 48 48 48 48 48 48 48 48 375 | 376 | The function "printMeanAndSdByGroup()" also prints out the number of samples in each group. 
In this case,
we see that there are 59 samples of cultivar 1, 71 of cultivar 2, and 48 of cultivar 3.

Between-groups Variance and Within-groups Variance for a Variable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we want to calculate the within-groups variance for a particular variable (for example, for a particular
chemical's concentration), we can use the function "calcWithinGroupsVariance()" below:

::

    > calcWithinGroupsVariance <- function(variable,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # get the mean and standard deviation for each group:
         numtotal <- 0
         denomtotal <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata <- variable[groupvariable==leveli,]
            levelilength <- length(levelidata)
            # get the standard deviation for group i:
            sdi <- sd(levelidata)
            numi <- (levelilength - 1)*(sdi * sdi)
            denomi <- levelilength
            numtotal <- numtotal + numi
            denomtotal <- denomtotal + denomi
         }
         # calculate the within-groups variance
         Vw <- numtotal / (denomtotal - numlevels)
         return(Vw)
      }

.. Checked that this formula is correct.

You will need to copy and paste this function into R before you can use it.
For example, to calculate the within-groups variance of the variable V2 (the concentration of the first chemical),
we type:

::

    > calcWithinGroupsVariance(wine[2],wine[1])
    [1] 0.2620525

Thus, the within-groups variance for V2 is 0.2620525.

We can calculate the between-groups variance for a particular variable (e.g. V2) using the function
"calcBetweenGroupsVariance()" below:

::

    > calcBetweenGroupsVariance <- function(variable,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # calculate the overall grand mean:
         grandmean <- mean(variable)
         # get the mean and standard deviation for each group:
         numtotal <- 0
         denomtotal <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata <- variable[groupvariable==leveli,]
            levelilength <- length(levelidata)
            # get the mean and standard deviation for group i:
            meani <- mean(levelidata)
            sdi <- sd(levelidata)
            numi <- levelilength * ((meani - grandmean)^2)
            denomi <- levelilength
            numtotal <- numtotal + numi
            denomtotal <- denomtotal + denomi
         }
         # calculate the between-groups variance
         Vb <- numtotal / (numlevels - 1)
         Vb <- Vb[[1]]
         return(Vb)
      }

.. In the OU book, I think that they have the wrong formula - had N-G as denominator, I sent an email to the forum xxx

.. Note the between-groups-variance*(G-1) + within-groups-variance*(N-G) should be equal to TotalSS
.. calcTotalSS <- function(variable)
.. {
..    variable <- variable[[1]]
..    variablelen <- length(variable)
..    print(paste("variablelen=",variablelen))
..    grandmean <- mean(variable)
..    print(paste("grandmean=",grandmean))
..    totalss <- 0
..    for (i in 1:variablelen)
..    {
..       totalss <- totalss + ((variable[i] - grandmean)*(variable[i] - grandmean))
..    }
..    return(totalss)
.. }

Once you have copied and pasted this function into R, you can use it to calculate the between-groups
variance for a variable such as V2:

::

    > calcBetweenGroupsVariance(wine[2],wine[1])
    [1] 35.39742

Thus, the between-groups variance of V2 is 35.39742.

We can calculate the "separation" achieved by a variable as its between-groups variance divided by its
within-groups variance. Thus, the separation achieved by V2 is calculated as:

::

    > 35.39742/0.2620525
    [1] 135.0776

.. Note I think we can also get the within-groups and between-groups variance from the output of ANOVA:
..
.. summary(aov(wine[,2]~as.factor(wine[,1])))
..                       Df Sum Sq Mean Sq F value    Pr(>F)
.. as.factor(wine[, 1])   2 70.795  35.397  135.08 < 2.2e-16 ***
.. Residuals            175 45.859   0.262
.. ---
.. Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
..
.. Here the within-groups variance is 0.262 (called the mean square of residuals)
.. and the between-groups variance is 35.397. The ratio is 135.08 (the F statistic), which
.. is the same as the separation that I calculate (see above).
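The within-groups and between-groups variances can also be read off a standard one-way ANOVA table: they are the "Mean Sq" values for the group factor and for the residuals, and their ratio is the F statistic. A minimal sketch of this cross-check, using the built-in "iris" data set rather than the wine data so that it can be run on its own:

```r
# One-way ANOVA of one variable (Sepal.Length) against a group factor (Species).
fit <- aov(Sepal.Length ~ Species, data = iris)
anovatable <- summary(fit)[[1]]
meansq <- anovatable[["Mean Sq"]]
Vb <- meansq[1]   # between-groups variance (mean square for Species)
Vw <- meansq[2]   # within-groups variance (mean square of residuals)
separation <- Vb / Vw
# The separation equals the ANOVA F statistic:
all.equal(separation, anovatable[["F value"]][1])
```

This is the same quantity that "calcWithinGroupsVariance()" and "calcBetweenGroupsVariance()" compute by hand above, obtained from a single call to "aov()".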
If you want to calculate the separations achieved by all of the variables in a multivariate data set,
you can use the function "calcSeparations()" below:

::

    > calcSeparations <- function(variables,groupvariable)
      {
         # find out how many variables we have
         variables <- as.data.frame(variables)
         numvariables <- length(variables)
         # find the variable names
         variablenames <- colnames(variables)
         # calculate the separation for each variable
         for (i in 1:numvariables)
         {
            variablei <- variables[i]
            variablename <- variablenames[i]
            Vw <- calcWithinGroupsVariance(variablei, groupvariable)
            Vb <- calcBetweenGroupsVariance(variablei, groupvariable)
            sep <- Vb/Vw
            print(paste("variable",variablename,"Vw=",Vw,"Vb=",Vb,"separation=",sep))
         }
      }

.. I checked the formula and it is fine.

For example, to calculate the separations for each of the 13 chemical concentrations, we type:

::

    > calcSeparations(wine[2:14],wine[1])
    [1] "variable V2 Vw= 0.262052469153907 Vb= 35.3974249602692 separation= 135.0776242428"
    [1] "variable V3 Vw= 0.887546796746581 Vb= 32.7890184869213 separation= 36.9434249631837"
    [1] "variable V4 Vw= 0.0660721013425184 Vb= 0.879611357248741 separation= 13.312901199991"
    [1] "variable V5 Vw= 8.00681118121156 Vb= 286.41674636309 separation= 35.7716374073093"
    [1] "variable V6 Vw= 180.65777316441 Vb= 2245.50102788939 separation= 12.4295843381499"
    [1] "variable V7 Vw= 0.191270475224227 Vb= 17.9283572942847 separation= 93.7330096203673"
    [1] "variable V8 Vw= 0.274707514337437 Vb= 64.2611950235641 separation= 233.925872681549"
    [1] "variable V9 Vw= 0.0119117022132797 Vb= 0.328470157461624 separation= 27.5754171469659"
    [1] "variable V10 Vw= 0.246172943795542 Vb= 7.45199550777775 separation= 30.2713831702276"
    [1] "variable V11 Vw= 2.28492308133354 Vb= 275.708000822304 separation= 120.664018441003"
    [1] "variable V12 Vw= 0.0244876469432414 Vb= 2.48100991493829 separation= 101.3167953903"
    [1] "variable V13 Vw= 0.160778729560982 Vb= 30.5435083544253 separation= 189.972320578889"
    [1] "variable V14 Vw= 29707.6818705169 Vb= 6176832.32228483 separation= 207.920373902178"

Thus, the individual variable which gives the greatest separation between the groups (the wine cultivars) is
V8 (separation 233.9). As we will discuss below, the purpose of linear discriminant analysis (LDA) is to find the
linear combination of the individual variables that will give the greatest separation between the groups (cultivars here).
This hopefully will give a better separation than the best separation achievable by any individual variable (233.9
for V8 here).

Between-groups Covariance and Within-groups Covariance for Two Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a multivariate data set with several variables describing sampling units from different groups,
such as the wine samples from different cultivars, it is often of interest to calculate the within-groups
covariance and between-groups covariance for pairs of the variables.
This can be done using the following functions, which you will need to copy and paste into R to use them:

::

    > calcWithinGroupsCovariance <- function(variable1,variable2,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # get the covariance of variable 1 and variable 2 for each group:
         Covw <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata1 <- variable1[groupvariable==leveli,]
            levelidata2 <- variable2[groupvariable==leveli,]
            mean1 <- mean(levelidata1)
            mean2 <- mean(levelidata2)
            levelilength <- length(levelidata1)
            # get the covariance for this group:
            term1 <- 0
            for (j in 1:levelilength)
            {
               term1 <- term1 + ((levelidata1[j] - mean1)*(levelidata2[j] - mean2))
            }
            Cov_groupi <- term1 # covariance for this group
            Covw <- Covw + Cov_groupi
         }
         totallength <- nrow(variable1)
         Covw <- Covw / (totallength - numlevels)
         return(Covw)
      }

.. Checked this works fine.
.. Agrees with formula from Krzanowski's 'Principles of Multivariate Analysis' pages 294-295:
.. Covw = (1/(N-G)) Sum(from g=1 to G) [ Sum(over i) { (x_ig - x_hat_g)*(y_ig - y_hat_g) } ]

For example, to calculate the within-groups covariance for variables V8 and V11, we type:

::

    > calcWithinGroupsCovariance(wine[8],wine[11],wine[1])
    [1] 0.2866783

Similarly, we can calculate the between-groups covariance for a pair of variables using the function
"calcBetweenGroupsCovariance()" below:

::

    > calcBetweenGroupsCovariance <- function(variable1,variable2,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # calculate the grand means
         variable1mean <- mean(variable1)
         variable2mean <- mean(variable2)
         # calculate the between-groups covariance
         Covb <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata1 <- variable1[groupvariable==leveli,]
            levelidata2 <- variable2[groupvariable==leveli,]
            mean1 <- mean(levelidata1)
            mean2 <- mean(levelidata2)
            levelilength <- length(levelidata1)
            term1 <- (mean1 - variable1mean)*(mean2 - variable2mean)*(levelilength)
            Covb <- Covb + term1
         }
         Covb <- Covb / (numlevels - 1)
         Covb <- Covb[[1]]
         return(Covb)
      }

.. Formula from Krzanowski's 'Principles of Multivariate Analysis' pages 294-295
.. Covb = (1/(G-1)) * Sum(from g=1 to G) [ Sum(over i) { (n_g) * (x_hat_g - x_hat) * (y_hat_g - y_hat) } ]
.. xxx Note it doesn't give me the answer given for Q3(a)(ii) of assignment - put Q on forum

For example, to calculate the between-groups covariance for variables V8 and V11, we type:

::

    > calcBetweenGroupsCovariance(wine[8],wine[11],wine[1])
    [1] -60.41077

Thus, for V8 and V11, the between-groups covariance is -60.41 and the within-groups covariance is 0.29.
Since the within-groups covariance is positive (0.29), V8 and V11 are positively related within groups:
for individuals from the same group, individuals with a high value of V8 tend to have a high value of V11,
and vice versa. Since the between-groups covariance is negative (-60.41), V8 and V11 are negatively related between groups:
groups with a high mean value of V8 tend to have a low mean value of V11, and vice versa.

Calculating Correlations for Multivariate Data
----------------------------------------------

It is often of interest to investigate whether any of the variables in a multivariate data set are
significantly correlated.

To calculate the linear (Pearson) correlation coefficient for a pair of variables, you can use
the "cor.test()" function in R. For example, to calculate the correlation coefficient for the first
two chemicals' concentrations, V2 and V3, we type:

::

    > cor.test(wine$V2, wine$V3)
      Pearson's product-moment correlation
      data:  wine$V2 and wine$V3
      t = 1.2579, df = 176, p-value = 0.2101
      alternative hypothesis: true correlation is not equal to 0
      95 percent confidence interval:
       -0.05342959  0.23817474
      sample estimates:
            cor
      0.09439694

This tells us that the correlation coefficient is about 0.094, which is a very weak correlation.
Furthermore, the P-value for the statistical test of whether the correlation coefficient is
significantly different from zero is 0.21. This is much greater than 0.05 (which we can use here
as a cutoff for statistical significance), so there is very weak evidence that the correlation is non-zero.
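As a sketch of what "cor.test()" is reporting: the estimate is simply "cor(x, y)", and for the Pearson test the t statistic is r*sqrt(n-2)/sqrt(1-r^2) on n-2 degrees of freedom. This can be verified on the built-in "iris" data set (used here so the example runs on its own):

```r
# Reproduce the "cor" estimate and t statistic that cor.test() reports.
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)
r <- cor(x, y)                             # Pearson correlation coefficient
tstat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic with n-2 degrees of freedom
test <- cor.test(x, y)
all.equal(unname(test$estimate), r)        # TRUE
all.equal(unname(test$statistic), tstat)   # TRUE
```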
If you have a lot of variables, you can use "cor.test()" to calculate the correlation coefficient
for each pair of variables, but you might just be interested in finding out which pairs of variables
are most highly correlated. For this you can use the function "mosthighlycorrelated()" below.

The function "mosthighlycorrelated()" will print out the linear correlation coefficients for
each pair of variables in your data set, in order of the correlation coefficient. This lets you see
very easily which pairs of variables are most highly correlated.

::

    > mosthighlycorrelated <- function(mydataframe,numtoreport)
      {
         # find the correlations
         cormatrix <- cor(mydataframe)
         # set the correlations on the diagonal or lower triangle to zero,
         # so they will not be reported as the highest ones:
         diag(cormatrix) <- 0
         cormatrix[lower.tri(cormatrix)] <- 0
         # flatten the matrix into a dataframe for easy sorting
         fm <- as.data.frame(as.table(cormatrix))
         # assign human-friendly names
         names(fm) <- c("First.Variable", "Second.Variable","Correlation")
         # sort and print the top n correlations
         head(fm[order(abs(fm$Correlation),decreasing=TRUE),],n=numtoreport)
      }

To use this function, you will first have to copy and paste it into R. The arguments of the function
are the variables that you want to calculate the correlations for, and the number of top correlation
coefficients to print out (for example, you can tell it to print out the largest ten correlation coefficients, or
the largest 20).
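The same idea can be sanity-checked step by step on a built-in data set such as "mtcars" (chosen here only so the example is self-contained):

```r
# The core of mosthighlycorrelated(): flatten the upper triangle of the
# correlation matrix and sort by absolute correlation.
cormatrix <- cor(mtcars)
diag(cormatrix) <- 0                    # ignore self-correlations
cormatrix[lower.tri(cormatrix)] <- 0    # keep each pair of variables once
fm <- as.data.frame(as.table(cormatrix))
names(fm) <- c("First.Variable", "Second.Variable", "Correlation")
top <- head(fm[order(abs(fm$Correlation), decreasing=TRUE), ], n=3)
top   # the three most strongly correlated pairs of variables in mtcars
```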
For example, to calculate correlation coefficients between the concentrations of the 13 chemicals
in the wine samples, and to print out the top 10 pairwise correlation coefficients, you can type:

::

    > mosthighlycorrelated(wine[2:14], 10)
        First.Variable Second.Variable Correlation
    84              V7              V8   0.8645635
    150             V8             V13   0.7871939
    149             V7             V13   0.6999494
    111             V8             V10   0.6526918
    157             V2             V14   0.6437200
    110             V7             V10   0.6124131
    154            V12             V13   0.5654683
    132             V3             V12  -0.5612957
    118             V2             V11   0.5463642
    137             V8             V12   0.5434786

This tells us that the pair of variables with the highest linear correlation coefficient is
V7 and V8 (correlation = 0.86 approximately).

Standardising Variables
-----------------------

If you want to compare different variables that have different units or very different variances,
it is a good idea to first standardise the variables.

For example, we found above that the concentrations of the 13 chemicals in the wine samples show a wide range of
standard deviations, from 0.1244533 for V9 (variance 0.01548862) to 314.9074743 for V14 (variance 99166.72).
This is a range of approximately 6,402,554-fold in the variances.

As a result, it is not a good idea to use the unstandardised chemical concentrations as the input for a
principal component analysis (PCA, see below) of the wine samples: if you did that, the first principal
component would be dominated by the variables which show the largest variances, such as V14.

Thus, it would be a better idea to first standardise the variables so that they all have variance 1 and mean 0,
and to then carry out the principal component analysis on the standardised data. This would allow us to
find the principal components that provide the best low-dimensional representation of the variation in the
original data, without being overly biased by those variables that show the most variance in the original data.

You can standardise variables in R using the "scale()" function.

For example, to standardise the concentrations of the 13 chemicals in the wine samples, we type:

::

    > standardisedconcentrations <- as.data.frame(scale(wine[2:14]))

Note that we use the "as.data.frame()" function to convert the output of "scale()" into a
"data frame", which is the same type of R variable as the "wine" variable.

We can check that each of the standardised variables stored in "standardisedconcentrations"
has a mean of 0 and a standard deviation of 1 by typing:

::

    > sapply(standardisedconcentrations,mean)
               V2            V3            V4            V5            V6            V7
    -8.591766e-16 -6.776446e-17  8.045176e-16 -7.720494e-17 -4.073935e-17 -1.395560e-17
               V8            V9           V10           V11           V12           V13
     6.958263e-17 -1.042186e-16 -1.221369e-16  3.649376e-17  2.093741e-16  3.003459e-16
              V14
    -1.034429e-16
    > sapply(standardisedconcentrations,sd)
     V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14
      1   1   1   1   1   1   1   1   1   1   1   1   1

We see that the means of the standardised variables are all very tiny numbers and so are
essentially equal to 0, and the standard deviations of the standardised variables are all equal to 1.

Principal Component Analysis
----------------------------

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a
multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing
wine samples from three different cultivars. We can carry out a principal component analysis to investigate
whether we can capture most of the variation between samples using a smaller number of new variables (principal
components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.

To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardise
the variables under study using the "scale()" function (see above). This is necessary if the input variables
have very different variances, which is true in this case, as the concentrations of the 13 chemicals have
very different variances (see above).

Once you have standardised your variables, you can carry out a principal component analysis using the "prcomp()"
function in R.

For example, to standardise the concentrations of the 13 chemicals in the wine samples, and carry out a
principal components analysis on the standardised concentrations, we type:

::

    > standardisedconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables
    > wine.pca <- prcomp(standardisedconcentrations)                 # do a PCA

You can get a summary of the principal component analysis results using the "summary()" function on the
output of "prcomp()":

::

    > summary(wine.pca)
    Importance of components:
                             PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10
    Standard deviation     2.169 1.580 1.203 0.9586 0.9237 0.8010 0.7423 0.5903 0.5375 0.5009
    Proportion of Variance 0.362 0.192 0.111 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
    Cumulative Proportion  0.362 0.554 0.665 0.7360 0.8016 0.8510 0.8934 0.9202 0.9424 0.9617
                             PC11   PC12    PC13
    Standard deviation     0.4752 0.4108 0.32152
    Proportion of Variance 0.0174 0.0130 0.00795
    Cumulative Proportion  0.9791 0.9920 1.00000

This gives us the standard deviation of each component, and the proportion of variance explained by
each component. The standard deviation of the components is stored in a named element called "sdev" of the output
variable made by "prcomp()":

::

    > wine.pca$sdev
     [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128 0.5903367
     [9] 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244

The total variance explained by the components is the sum of the variances of the components:

::

    > sum((wine.pca$sdev)^2)
    [1] 13

In this case, we see that the total variance is 13, which is equal to the number of standardised variables (13 variables).
This is because for standardised data, the variance of each standardised variable is 1. The total variance is equal to the sum
of the variances of the individual variables, and since the variance of each standardised variable is 1, the
total variance should be equal to the number of variables (13 here).

Deciding How Many Principal Components to Retain
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to decide how many principal components should be retained,
it is common to summarise the results of a principal components analysis by making a scree plot, which we
can do in R using the "screeplot()" function:

::

    > screeplot(wine.pca, type="lines")

|image6|

The most obvious change in slope in the scree plot occurs at component 4, which is the "elbow" of the
scree plot. Therefore, it could be argued on the basis of the scree plot that the first three
components should be retained.
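The total-variance identity noted above (for standardised data, the variances of the components sum to the number of variables) holds for any data set. A minimal sketch using the four numeric columns of the built-in "iris" data set:

```r
# For standardised data each variable has variance 1, so the variances of
# the principal components must add up to the number of variables.
standardised <- scale(iris[1:4])    # 4 standardised variables
pca <- prcomp(standardised)
totalvariance <- sum(pca$sdev^2)
all.equal(totalvariance, 4)         # TRUE: equals the number of variables
```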
Another way of deciding how many components to retain is to use Kaiser's criterion:
that we should only retain principal components for which the variance is above 1 (when principal
component analysis was applied to standardised data). We can check this by finding the variance of each
of the principal components:

::

    > (wine.pca$sdev)^2
     [1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283 0.3484974
     [9] 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779

We see that the variance is above 1 for principal components 1, 2, and 3 (which have variances
4.71, 2.50, and 1.45, respectively). Therefore, using Kaiser's criterion, we would retain the first
three principal components.

A third way to decide how many principal components to retain is to keep the number of
components required to explain at least some minimum amount of the total variance. For example, if
it is important to explain at least 80% of the variance, we would retain the first five principal components,
as we can see from the output of "summary(wine.pca)" that the first five principal components
explain 80.2% of the variance (while the first four components explain just 73.6%, so are not sufficient).

Loadings for the Principal Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The loadings for the principal components are stored in a named element "rotation" of the variable
returned by "prcomp()". This contains a matrix with the loadings of each principal component, where
the first column in the matrix contains the loadings for the first principal component, the second
column contains the loadings for the second principal component, and so on.
Therefore, to obtain the loadings for the first principal component in our
analysis of the 13 chemical concentrations in wine samples, we type:

::

    > wine.pca$rotation[,1]
              V2           V3           V4           V5           V6           V7
    -0.144329395  0.245187580  0.002051061  0.239320405 -0.141992042 -0.394660845
              V8           V9          V10          V11          V12          V13
    -0.422934297  0.298533103 -0.313429488  0.088616705 -0.296714564 -0.376167411
             V14
    -0.286752227

This means that the first principal component is a linear combination of the variables:
-0.144*Z2 + 0.245*Z3 + 0.002*Z4 + 0.239*Z5 - 0.142*Z6 - 0.395*Z7 - 0.423*Z8 + 0.299*Z9
- 0.313*Z10 + 0.089*Z11 - 0.297*Z12 - 0.376*Z13 - 0.287*Z14, where Z2, Z3, Z4...Z14 are
the standardised versions of the variables V2, V3, V4...V14 (that each
have mean of 0 and variance of 1).

Note that the squares of the loadings sum to 1, as this is a constraint used in calculating the loadings:

::

    > sum((wine.pca$rotation[,1])^2)
    [1] 1

To calculate the values of the first principal component, we can define our own function to calculate
a principal component given the loadings and the input variables' values:

::

    > calcpc <- function(variables,loadings)
      {
         # make sure the input is a data frame
         variables <- as.data.frame(variables)
         # find the number of samples in the data set
         numsamples <- nrow(variables)
         # make a vector to store the component
         pc <- numeric(numsamples)
         # find the number of variables
         numvariables <- length(variables)
         # calculate the value of the component for each sample
         for (i in 1:numsamples)
         {
            valuei <- 0
            for (j in 1:numvariables)
            {
               valueij <- variables[i,j]
               loadingj <- loadings[j]
               valuei <- valuei + (valueij * loadingj)
            }
            pc[i] <- valuei
         }
         return(pc)
      }

We can then use the function to calculate the values of the first principal component for each sample in our
wine data:

::

    > calcpc(standardisedconcentrations, wine.pca$rotation[,1])
      [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
      [8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
     [15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
     [22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
     [29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
     [36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
    ...

In fact, the values of the first principal component are stored in the variable wine.pca$x[,1]
that was returned by the "prcomp()" function, so we can compare those values to the ones that we
calculated, and they should agree:

::

    > wine.pca$x[,1]
      [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
      [8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
     [15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
     [22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
     [29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
     [36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
    ...

We see that they do agree.

The first principal component has the highest loadings (in absolute value) for V8 (-0.423), V7 (-0.395), V13 (-0.376),
V10 (-0.313), V12 (-0.297), V14 (-0.287), V9 (0.299), V3 (0.245), and V5 (0.239). The loadings for V8, V7, V13,
V10, V12 and V14 are negative, while those for V9, V3, and V5 are positive. Therefore, an interpretation of the
first principal component is that it represents a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
and the concentrations of V9, V3 and V5.

Similarly, we can obtain the loadings for the second principal component by typing:

::

    > wine.pca$rotation[,2]
              V2           V3           V4           V5           V6           V7
     0.483651548  0.224930935  0.316068814 -0.010590502  0.299634003  0.065039512
              V8           V9          V10          V11          V12          V13
    -0.003359812  0.028779488  0.039301722  0.529995672 -0.279235148 -0.164496193
             V14
     0.364902832

This means that the second principal component is a linear combination of the variables:
0.484*Z2 + 0.225*Z3 + 0.316*Z4 - 0.011*Z5 + 0.300*Z6 + 0.065*Z7 - 0.003*Z8 + 0.029*Z9
+ 0.039*Z10 + 0.530*Z11 - 0.279*Z12 - 0.164*Z13 + 0.365*Z14, where Z2, Z3...Z14
are the standardised versions of variables V2, V3, ... V14 that each have mean 0 and variance 1.

Note that the squares of the loadings sum to 1, as above:

::

    > sum((wine.pca$rotation[,2])^2)
    [1] 1

The second principal component has the highest loadings for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316),
V6 (0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while
the loading for V12 is negative. Therefore, an interpretation of the second principal component is that
it represents a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of
V12. Note that the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between
the concentrations of V11 and V2, and the concentration of V12.
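The element-by-element loop in the "calcpc()" function above is equivalent to a single matrix multiplication of the standardised data by the loadings, which is essentially how "prcomp()" computes the scores itself. A minimal self-contained sketch, using the built-in "iris" data set rather than the wine data:

```r
# The principal component scores are just (standardised data) %*% (loadings).
standardised <- scale(iris[1:4])
pca <- prcomp(standardised)
pc1 <- standardised %*% pca$rotation[,1]       # first PC via matrix multiplication
all.equal(as.vector(pc1), unname(pca$x[,1]))   # TRUE: matches prcomp()'s scores
```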
Scatterplots of the Principal Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The values of the principal components are stored in a named element "x" of the variable returned by
"prcomp()". This contains a matrix with the principal components, where the first column in the matrix
contains the first principal component, the second column the second component, and so on.

Thus, in our example, "wine.pca$x[,1]" contains the first principal component, and
"wine.pca$x[,2]" contains the second principal component.

We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine
samples come from, by typing:

::

    > plot(wine.pca$x[,1],wine.pca$x[,2]) # make a scatterplot
    > text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red") # add labels

|image7|

The scatterplot shows the first principal component on the x-axis, and the second principal
component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1
have much lower values of the first principal component than wine samples of cultivar 3.
Therefore, the first principal component separates wine samples of cultivar 1 from those
of cultivar 3.

We can also see that wine samples of cultivar 2 have much higher values of the second
principal component than wine samples of cultivars 1 and 3. Therefore, the second principal
component separates samples of cultivar 2 from samples of cultivars 1 and 3.

Therefore, the first two principal components are reasonably useful for distinguishing wine
samples of the three different cultivars.
Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms of the
concentrations of these chemicals in the different cultivars, by printing out the means of the
standardised concentration variables in each cultivar, using the "printMeanAndSdByGroup()" function (see above):

::

    > printMeanAndSdByGroup(standardisedconcentrations,wine[1])
    [1] "Means:"
      V1         V2         V3         V4         V5          V6          V7          V8          V9        V10        V11        V12        V13        V14
    1  1  0.9166093 -0.2915199  0.3246886 -0.7359212  0.46192317  0.87090552  0.95419225 -0.57735640  0.5388633  0.2028288  0.4575567  0.7691811  1.1711967
    2  2 -0.8892116 -0.3613424 -0.4437061  0.2225094 -0.36354162 -0.05790375  0.05163434  0.01452785  0.0688079 -0.8503999  0.4323908  0.2446043 -0.7220731
    3  3  0.1886265  0.8928122  0.2572190  0.5754413 -0.03004191 -0.98483874 -1.24923710  0.68817813 -0.7641311  1.0085728 -1.2019916 -1.3072623 -0.3715295

Does it make sense that the first principal component can separate cultivar 1 from cultivar 3?
In cultivar 1, the mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171)
are very high compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736).
In cultivar 3, the mean values of V8 (-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372)
are very low compared to the mean values of V9 (0.688), V3 (0.893) and V5 (0.575).
Therefore, it does make sense that principal component 1 is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
and the concentrations of V9, V3 and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.
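The per-group means printed by "printMeanAndSdByGroup()" can also be obtained with base R's "aggregate()" function. A minimal sketch on the built-in "iris" data set (used here instead of the wine data so the example is self-contained):

```r
# Per-group means of standardised variables, using aggregate() from base R.
standardised <- as.data.frame(scale(iris[1:4]))
groupmeans <- aggregate(standardised, by=list(Species=iris$Species), FUN=mean)
groupmeans   # one row of standardised group means per species
```

Because the three iris species have equal sample sizes and the variables were standardised to mean 0, the group means of each variable average out to 0 across the groups.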
Above, we interpreted the second principal component as a contrast between the concentrations of V11,
V2, V14, V4, V6 and V3, and the concentration of V12.
In the light of the mean values of these variables in the different cultivars, does
it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3?
In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292)
are not very different from the mean value of V12 (0.458).
In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893)
are also not very different from the mean value of V12 (-1.202).
In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361)
are much less than the mean value of V12 (0.432).
Therefore, it makes sense that principal component 2 is a contrast between the concentrations of V11,
V2, V14, V4, V6 and V3, and the concentration of V12; and that principal component 2 can separate cultivar 2 from cultivars 1 and 3.

Linear Discriminant Analysis
----------------------------

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a
multivariate data set. For example, in the wine data set, we have 13 chemical concentrations describing wine samples from three cultivars.
By carrying out a principal component analysis, we found that most of the variation in the chemical concentrations
between the samples can be captured using the first two principal components,
where each of the principal components is a particular linear combination of the 13 chemical concentrations.
The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (the 13
chemical concentrations here) that give the best possible separation between the groups (wine cultivars here) in our
data set. Linear discriminant analysis is also known as "canonical discriminant analysis", or simply "discriminant analysis".

If we want to separate the wines by cultivar, the wines come from three different cultivars, so the number of groups (G) is 3,
and the number of variables is 13 (13 chemicals' concentrations; p = 13). The maximum number of useful discriminant
functions that can separate the wines by cultivar is the minimum of G-1 and p, and so in this case it is the minimum of 2 and 13,
which is 2. Thus, we can find at most 2 useful discriminant functions to separate the wines by cultivar, using the
13 chemical concentration variables.

You can carry out a linear discriminant analysis using the "lda()" function from the R "MASS" package.
To use this function, we first need to install the "MASS" R package
(for instructions on how to install an R package, see `How to install an R package
<./installr.html#how-to-install-an-r-package>`_).
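The min(G-1, p) rule can be seen in a self-contained sketch using "MASS" and the built-in "iris" data set (an illustration, not the wine data): with three groups and four variables, "lda()" returns min(3-1, 4) = 2 discriminant functions.

```r
library(MASS)   # provides the lda() function

# Three species (G = 3) and four measurements (p = 4), so lda() finds
# min(G-1, p) = 2 discriminant functions.
iris.lda <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris)
ncol(iris.lda$scaling)   # 2: one column of loadings per discriminant function
```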
1112 | 1113 | For example, to carry out a linear discriminant analysis using the 13 chemical concentrations in the wine samples, we type: 1114 | 1115 | :: 1116 | 1117 | > library("MASS") # load the MASS package 1118 | > wine.lda <- lda(wine$V1 ~ wine$V2 + wine$V3 + wine$V4 + wine$V5 + wine$V6 + wine$V7 + 1119 | wine$V8 + wine$V9 + wine$V10 + wine$V11 + wine$V12 + wine$V13 + 1120 | wine$V14) 1121 | 1122 | Loadings for the Discriminant Functions 1123 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1124 | 1125 | To get the values of the loadings of the discriminant functions for the wine data, we can type: 1126 | 1127 | :: 1128 | 1129 | > wine.lda 1130 | Coefficients of linear discriminants: 1131 | LD1 LD2 1132 | wine$V2 -0.403399781 0.8717930699 1133 | wine$V3 0.165254596 0.3053797325 1134 | wine$V4 -0.369075256 2.3458497486 1135 | wine$V5 0.154797889 -0.1463807654 1136 | wine$V6 -0.002163496 -0.0004627565 1137 | wine$V7 0.618052068 -0.0322128171 1138 | wine$V8 -1.661191235 -0.4919980543 1139 | wine$V9 -1.495818440 -1.6309537953 1140 | wine$V10 0.134092628 -0.3070875776 1141 | wine$V11 0.355055710 0.2532306865 1142 | wine$V12 -0.818036073 -1.5156344987 1143 | wine$V13 -1.157559376 0.0511839665 1144 | wine$V14 -0.002691206 0.0028529846 1145 | 1146 | This means that the first discriminant function is a linear combination of the variables: 1147 | -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 1148 | - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14, where 1149 | V2, V3, ... V14 are the concentrations of the 13 chemicals found in the wine samples. 1150 | For convenience, the values of each discriminant function (eg. the first discriminant function) 1151 | are scaled so that their mean value is zero (see below). 1152 | 1153 | Note that these loadings are calculated so that the within-group variance of each discriminant 1154 | function for each group (cultivar) is equal to 1, as will be demonstrated below.
1155 | 1156 | These scalings are also stored in the named element "scaling" of the variable returned 1157 | by the lda() function. This element contains a matrix, in which the first column contains 1158 | the loadings for the first discriminant function, the second column contains the loadings 1159 | for the second discriminant function and so on. For example, to extract the loadings for 1160 | the first discriminant function, we can type: 1161 | 1162 | :: 1163 | 1164 | > wine.lda$scaling[,1] 1165 | wine$V2 wine$V3 wine$V4 wine$V5 wine$V6 wine$V7 1166 | -0.403399781 0.165254596 -0.369075256 0.154797889 -0.002163496 0.618052068 1167 | wine$V8 wine$V9 wine$V10 wine$V11 wine$V12 wine$V13 1168 | -1.661191235 -1.495818440 0.134092628 0.355055710 -0.818036073 -1.157559376 1169 | wine$V14 1170 | -0.002691206 1171 | 1172 | To calculate the values of the first discriminant function, we can define our own function "calclda()": 1173 | 1174 | :: 1175 | 1176 | > calclda <- function(variables,loadings) 1177 | { 1178 | # store the variables in a data frame, and find the number of samples 1179 | variables <- as.data.frame(variables) 1180 | numsamples <- nrow(variables) 1181 | # make a vector to store the discriminant function 1182 | ld <- numeric(numsamples) 1183 | # find the number of variables 1184 | numvariables <- length(variables) 1185 | # calculate the value of the discriminant function for each sample 1186 | for (i in 1:numsamples) 1187 | { 1188 | valuei <- 0 1189 | for (j in 1:numvariables) 1190 | { 1191 | valueij <- variables[i,j] 1192 | loadingj <- loadings[j] 1193 | valuei <- valuei + (valueij * loadingj) 1194 | } 1195 | ld[i] <- valuei 1196 | } 1197 | # standardise the discriminant function so that its mean value is 0: 1198 | ld <- as.data.frame(scale(ld, center=TRUE, scale=FALSE)) 1199 | ld <- ld[[1]] 1200 | return(ld) 1201 | } 1202 | 1203 | The function calclda() simply calculates the value of a discriminant function 1204 | for each sample in the data set, for example, for the first discriminant function,
for each sample we calculate 1205 | the value using the equation -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 1206 | - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14. Furthermore, the "scale()" 1207 | command is used within the calclda() function in order to standardise the value of a discriminant function 1208 | (eg. the first discriminant function) so that its mean value (over all the wine samples) is 0. 1209 | 1210 | We can use the function calclda() to calculate the values of the first discriminant function for each sample in our 1211 | wine data: 1212 | 1213 | :: 1214 | 1215 | > calclda(wine[2:14], wine.lda$scaling[,1]) 1216 | [1] -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1217 | [7] -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1218 | [13] -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1219 | [19] -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1220 | [25] -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573 1221 | [31] -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996 1222 | ... 1223 | 1224 | .. This agrees with the values that we get in SPSS, except that the values in SPSS 1225 | .. are multiplied by -1, because the loadings are multiplied by -1, but that is fine.
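Note that the nested loop in calclda() can also be written as a single matrix product, which is a more compact (and faster) way of doing the same calculation. This is just a sketch, assuming the "wine" data frame and the "wine.lda" object created above:

::

    > # multiply the data matrix by the vector of loadings, then centre the
    > # resulting scores so that their mean is zero, as calclda() does
    > ld1 <- as.matrix(wine[2:14]) %*% wine.lda$scaling[,1]
    > ld1 <- as.numeric(scale(ld1, center=TRUE, scale=FALSE))
    > ld1[1:6] # should match the first six values printed by calclda() above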
1226 | 1227 | In fact, the values of the first linear discriminant function can be calculated using the 1228 | "predict()" function in R, so we can compare those to the ones that we calculated, and they 1229 | should agree: 1230 | 1231 | :: 1232 | 1233 | > wine.lda.values <- predict(wine.lda, wine[2:14]) 1234 | > wine.lda.values$x[,1] # contains the values for the first discriminant function 1235 | 1 2 3 4 5 6 1236 | -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1237 | 7 8 9 10 11 12 1238 | -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1239 | 13 14 15 16 17 18 1240 | -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1241 | 19 20 21 22 23 24 1242 | -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1243 | 25 26 27 28 29 30 1244 | -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573 1245 | 31 32 33 34 35 36 1246 | -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996 1247 | ... 1248 | 1249 | We see that they do agree. 1250 | 1251 | .. The loadings agree with those given in SPSS for the unstandardised variables. 1252 | .. In SPSS I get: 1253 | .. Unstandardised coeffs: 1254 | .. V2: 0.403, 0.872 1255 | .. V3: -0.165, 0.305 1256 | .. V4: 0.369, 2.346 1257 | .. V5: -0.155, -0.146 1258 | .. V6: 0.002, 0.000 1259 | .. V7: -0.618, -0.032 1260 | .. V8: 1.661, -0.492 1261 | .. V9: 1.496, -1.632 1262 | .. V10: -0.134, -0.307 1263 | .. V11: -0.355, 0.253 1264 | .. V12: 0.818, -1.516 1265 | .. V13: 1.158, 0.051 1266 | .. V14: 0.003, 0.003 1267 | .. Standardised coeffs: 1268 | .. V2: 0.207, 0.446 1269 | .. V3: -0.156, 0.288 1270 | .. V4: 0.095, 0.603 1271 | .. V5: -0.438, -0.414 1272 | .. V6: 0.029, -0.006 1273 | .. V7: -0.270, -0.014 1274 | .. V8: 0.871, -0.258 1275 | .. V9: 0.163, -0.178 1276 | .. V10: -0.067, -0.152 1277 | .. V11: -0.537, 0.383 1278 | .. V12: 0.128, -0.237 1279 | .. V13: 0.464, 0.021 1280 | .. V14: 0.464, 0.492 1281 | 1282 | .. 
Comment: 1283 | .. If you look at the output of calcSeparations, you can see that the within-group variances are 1. 1284 | .. The loadings are in wine.lda$scaling, I think. 1285 | .. The description for scaling in the help for lda() is: 1286 | .. a matrix which transforms observations to discriminant functions, normalized so that within groups 1287 | .. covariance matrix is spherical. 1288 | .. calcpc(wine[2:14], wine.lda$scaling[,1]) 1289 | .. -13.931031 -13.532745 -12.651506 -13.436540 -10.740768 -13.749476 1290 | .. -13.758165 -13.379134 -13.091615 -12.597411 -14.036666 -12.658863 1291 | .. -12.896889 -14.819033 -14.732101 -12.415539 -12.520157 -12.228879 1292 | .. -14.477190 -12.367318 -12.808265 -10.921558 -14.065937 -12.326676 1293 | .. -12.552434 -11.375609 -13.213215 -11.916701 -12.793881 -12.403802 1294 | .. -12.227055 -12.799449 -12.615850 -12.758324 -12.082695 -12.024907 1295 | .. -11.988872 -11.408131 -12.260050 -12.501839 -12.151442 -11.467997 1296 | .. -13.930512 -10.461148 -11.812826 -11.813907 -13.119666 -12.680540 1297 | .. mylda1 <- calcpc(wine[2:14], wine.lda$scaling[,1]) 1298 | .. summary(aov(mylda1~as.factor(wine[,1]))) 1299 | .. Df Sum Sq Mean Sq F value Pr(>F) 1300 | .. as.factor(wine[, 1]) 2 1589.3 794.65 794.65 < 2.2e-16 *** 1301 | .. Residuals 175 175.0 1.00 1302 | .. Do seem to have within-group variance=1. 1303 | .. Put the LDA1 and LDA2 calculated from SPSS in a file, can check if within-group variance=1: 1304 | .. spss <- read.table("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/OUBooks/MultivariateStats/wine.data_lda.txt",header=FALSE) 1305 | .. summary(aov(spss$V1~as.factor(wine[,1]))) 1306 | .. Df Sum Sq Mean Sq F value Pr(>F) 1307 | .. as.factor(wine[, 1]) 2 1589.3 794.65 794.65 < 2.2e-16 *** 1308 | .. Residuals 175 175.0 1.00 1309 | .. Has within-group variance=1. 1310 | .. plot(spss$V1, mylda1) # Have a correlation of -1 1311 | .. summary(mylda1) 1312 | .. Min. 1st Qu. Median Mean 3rd Qu. Max. 1313 | .. 
-14.820 -12.140 -9.529 -9.231 -6.396 -3.489 1314 | .. summary(spss$V1) 1315 | .. Min. 1st Qu. Median Mean 3rd Qu. Max. 1316 | .. -5.742e+00 -2.835e+00 2.978e-01 -5.618e-08 2.909e+00 5.588e+00 1317 | .. SPSS seems to have centred the data so that the mean of LDA1 is 0. 1318 | .. 1319 | .. 1320 | .. ... 1321 | .. wine.lda.values <- predict(wine.lda, wine[2:14]) 1322 | .. wine.lda.values$x[,1] # contains the values for the first discriminant function 1323 | .. 1 2 3 4 5 6 1324 | .. -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1325 | .. 7 8 9 10 11 12 1326 | .. -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1327 | .. 13 14 15 16 17 18 1328 | .. -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1329 | .. 19 20 21 22 23 24 1330 | .. -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1331 | .. Agrees perfectly with the values from SPSS (except the SPSS values are multiplied by -1, because the loadings are all multipled by 1332 | .. -1, but that doesn't matter). 1333 | 1334 | It doesn't matter whether the input variables for linear discriminant analysis are standardised or not, unlike 1335 | for principal components analysis in which it is often necessary to standardise the input variables. 1336 | However, using standardised variables in linear discriminant analysis makes it easier to interpret the loadings in 1337 | a linear discriminant function. 1338 | 1339 | In linear discriminant analysis, the standardised version of an input variable is defined so that it 1340 | has mean zero and within-groups variance of 1. Thus, we can calculate the "group-standardised" variable 1341 | by subtracting the mean from each value of the variable, and dividing by the within-groups standard deviation. 
1342 | To calculate the group-standardised version of a set of variables, we can use the function "groupStandardise()" below (which uses the "calcWithinGroupsVariance()" function defined earlier in this booklet): 1343 | 1344 | 1345 | :: 1346 | 1347 | > groupStandardise <- function(variables, groupvariable) 1348 | { 1349 | # find out how many variables we have 1350 | variables <- as.data.frame(variables) 1351 | numvariables <- length(variables) 1352 | # find the variable names 1353 | variablenames <- colnames(variables) 1354 | # calculate the group-standardised version of each variable 1355 | for (i in 1:numvariables) 1356 | { 1357 | variablei <- variables[i] 1358 | variablei_name <- variablenames[i] 1359 | variablei_Vw <- calcWithinGroupsVariance(variablei, groupvariable) 1360 | variablei_mean <- mean(variablei[[1]]) # take the column as a vector, as mean() of a data frame is deprecated 1361 | variablei_new <- (variablei - variablei_mean)/(sqrt(variablei_Vw)) 1362 | data_length <- nrow(variablei) 1363 | if (i == 1) { variables_new <- data.frame(row.names=seq(1,data_length)) } 1364 | variables_new[variablei_name] <- variablei_new 1365 | } 1366 | return(variables_new) 1367 | } 1368 | 1369 | For example, we can use the "groupStandardise()" function to calculate the group-standardised versions of the 1370 | chemical concentrations in wine samples: 1371 | 1372 | :: 1373 | 1374 | > groupstandardisedconcentrations <- groupStandardise(wine[2:14], wine[1]) 1375 | 1376 | We can then use the lda() function to perform linear discriminant analysis on the group-standardised variables: 1377 | 1378 | :: 1379 | 1380 | > wine.lda2 <- lda(wine$V1 ~ groupstandardisedconcentrations$V2 + groupstandardisedconcentrations$V3 + 1381 | groupstandardisedconcentrations$V4 + groupstandardisedconcentrations$V5 + 1382 | groupstandardisedconcentrations$V6 + groupstandardisedconcentrations$V7 + 1383 | groupstandardisedconcentrations$V8 + groupstandardisedconcentrations$V9 + 1384 | groupstandardisedconcentrations$V10 + groupstandardisedconcentrations$V11 + 1385 | groupstandardisedconcentrations$V12 + groupstandardisedconcentrations$V13 + 1386 | 
groupstandardisedconcentrations$V14) 1387 | > wine.lda2 1388 | Coefficients of linear discriminants: 1389 | LD1 LD2 1390 | groupstandardisedconcentrations$V2 -0.20650463 0.446280119 1391 | groupstandardisedconcentrations$V3 0.15568586 0.287697336 1392 | groupstandardisedconcentrations$V4 -0.09486893 0.602988809 1393 | groupstandardisedconcentrations$V5 0.43802089 -0.414203541 1394 | groupstandardisedconcentrations$V6 -0.02907934 -0.006219863 1395 | groupstandardisedconcentrations$V7 0.27030186 -0.014088108 1396 | groupstandardisedconcentrations$V8 -0.87067265 -0.257868714 1397 | groupstandardisedconcentrations$V9 -0.16325474 -0.178003512 1398 | groupstandardisedconcentrations$V10 0.06653116 -0.152364015 1399 | groupstandardisedconcentrations$V11 0.53670086 0.382782544 1400 | groupstandardisedconcentrations$V12 -0.12801061 -0.237174509 1401 | groupstandardisedconcentrations$V13 -0.46414916 0.020523349 1402 | groupstandardisedconcentrations$V14 -0.46385409 0.491738050 1403 | 1404 | It makes sense to interpret the loadings calculated using the group-standardised variables rather than the loadings for 1405 | the original (unstandardised) variables. 1406 | 1407 | In the first discriminant function calculated for the group-standardised variables, the largest loadings (in absolute value) 1408 | are given to V8 (-0.871), V11 (0.537), V13 (-0.464), V14 (-0.464), and V5 (0.438). The loadings for V8, V13 and V14 are negative, while 1409 | those for V11 and V5 are positive. Therefore, the discriminant function seems to represent a contrast between the concentrations of 1410 | V8, V13 and V14, and the concentrations of V11 and V5. 1411 | 1412 | We saw above that the individual variables which gave the greatest separations between the groups were V8 (separation 233.93), V14 (207.92), 1413 | V13 (189.97), V2 (135.08) and V11 (120.66).
These were mostly the same variables that had the largest loadings in the linear discriminant 1414 | function (loading for V8: -0.871, for V14: -0.464, for V13: -0.464, for V11: 0.537). 1415 | 1416 | We found above that variables V8 and V11 have a negative between-groups covariance (-60.41) and a positive within-groups covariance (0.29). 1417 | When the between-groups covariance and within-groups covariance for two variables have opposite signs, it indicates that a better separation 1418 | between groups can be obtained by using a linear combination of those two variables than by using either variable on its own. 1419 | 1420 | Thus, given that the two variables V8 and V11 have between-groups and within-groups covariances of opposite signs, and that these are two 1421 | of the variables that gave the greatest separations between groups when used individually, it is not surprising that these are the two 1422 | variables that have the largest loadings in the first discriminant function. 1423 | 1424 | Note that although the loadings for the group-standardised variables are easier to interpret than the loadings for the 1425 | unstandardised variables, the values of the discriminant function are the same regardless of whether we standardise 1426 | the input variables or not. 
For example, for wine data, we can calculate the value of the first discriminant function calculated 1427 | using the unstandardised and group-standardised variables by typing: 1428 | 1429 | :: 1430 | 1431 | > wine.lda.values <- predict(wine.lda, wine[2:14]) 1432 | > wine.lda.values$x[,1] # values for the first discriminant function, using the unstandardised data 1433 | 1 2 3 4 5 6 1434 | -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1435 | 7 8 9 10 11 12 1436 | -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1437 | 13 14 15 16 17 18 1438 | -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1439 | 19 20 21 22 23 24 1440 | -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1441 | ... 1442 | > wine.lda.values2 <- predict(wine.lda2, groupstandardisedconcentrations) 1443 | > wine.lda.values2$x[,1] # values for the first discriminant function, using the standardised data 1444 | 1 2 3 4 5 6 1445 | -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1446 | 7 8 9 10 11 12 1447 | -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1448 | 13 14 15 16 17 18 1449 | -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1450 | 19 20 21 22 23 24 1451 | -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1452 | ... 1453 | 1454 | .. Note these are the same values that I get using SPSS. 1455 | 1456 | We can see that although the loadings are different for the first discriminant functions calculated using 1457 | unstandardised and group-standardised data, the actual values of the first discriminant function are the same. 
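Rather than comparing the two sets of values by eye, we can also check the agreement directly. This is a sketch, assuming the objects calculated above:

::

    > # the two sets of scores should agree to within numerical precision
    > all.equal(as.numeric(wine.lda.values$x[,1]), as.numeric(wine.lda.values2$x[,1]))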
1458 | 1459 | Separation Achieved by the Discriminant Functions 1460 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1461 | To calculate the separation achieved by each discriminant function, we first need to calculate the 1462 | value of each discriminant function, by substituting the variables' values into the linear combination for 1463 | the discriminant function (eg. -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 1464 | - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14 for the first discriminant function), 1465 | and then scaling the values of the discriminant function so that their mean is zero. 1466 | 1467 | As mentioned above, we can do this using the "predict()" function in R. For example, 1468 | to calculate the value of the discriminant functions for the wine data, we type: 1469 | 1470 | :: 1471 | 1472 | > wine.lda.values <- predict(wine.lda, wine[2:14]) 1473 | 1474 | The returned variable has a named element "x" which is a matrix containing the linear discriminant functions: 1475 | the first column of x contains the first discriminant function, the second column of x contains the second 1476 | discriminant function, and so on (if there are more discriminant functions).
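For the wine data, for example, "x" has one row per wine sample and one column per discriminant function:

::

    > dim(wine.lda.values$x)
    [1] 178   2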
1477 | 1478 | We can therefore calculate the separations achieved by the two linear discriminant functions for the wine data by using the 1479 | "calcSeparations()" function (see above), which calculates the separation as the ratio of the between-groups 1480 | variance to the within-groups variance: 1481 | 1482 | :: 1483 | 1484 | > calcSeparations(wine.lda.values$x,wine[1]) 1485 | [1] "variable LD1 Vw= 1 Vb= 794.652200566216 separation= 794.652200566216" 1486 | [1] "variable LD2 Vw= 1 Vb= 361.241041493455 separation= 361.241041493455" 1487 | 1488 | As mentioned above, the loadings for each discriminant function are calculated in such a way that 1489 | the within-group variance (Vw) for each group (wine cultivar here) is equal to 1, as we see in the 1490 | output from calcSeparations() above. 1491 | 1492 | The output from calcSeparations() tells us that the separation achieved by the first (best) discriminant 1493 | function is 794.7, and the separation achieved by the second (second best) discriminant function is 361.2. 1494 | 1495 | Therefore, the total separation is the sum of these, which is (794.652200566216+361.241041493455=1155.893) 1496 | 1155.89, rounded to two decimal places. Therefore, the "percentage separation" achieved by the 1497 | first discriminant function is (794.652200566216*100/1155.893=) 68.75%, and the percentage separation achieved by the 1498 | second discriminant function is (361.241041493455*100/1155.893=) 31.25%. 1499 | 1500 | The "proportion of trace" that is printed when you type "wine.lda" (the variable returned by the lda() function) 1501 | is the percentage separation achieved by each discriminant function. 
For example, for the wine data we get the 1502 | same values as just calculated (68.75% and 31.25%): 1503 | 1504 | :: 1505 | 1506 | > wine.lda 1507 | Proportion of trace: 1508 | LD1 LD2 1509 | 0.6875 0.3125 1510 | 1511 | Therefore, the first discriminant function does achieve a good separation between the three groups (three cultivars), but the second 1512 | discriminant function does improve the separation of the groups by quite a large amount, so it is worth using the 1513 | second discriminant function as well. Therefore, to achieve a good separation of the groups (cultivars), 1514 | it is necessary to use both of the first two discriminant functions. 1515 | 1516 | We found above that the largest separation achieved for any of the individual variables (individual chemical concentrations) 1517 | was 233.9 for V8, which is quite a lot less than 794.7, the separation achieved by the first discriminant function. Therefore, 1518 | the effect of using more than one variable to calculate the discriminant function is that we can find a discriminant function 1519 | that achieves a far greater separation between groups than achieved by any one variable alone. 1520 | 1521 | The variable returned by the lda() function also has a named element "svd", which contains the ratio of 1522 | between- and within-group standard deviations for the linear discriminant variables, that is, the square 1523 | root of the "separation" value that we calculated using calcSeparations() above. When we calculate the 1524 | square of the value stored in "svd", we should get the same values as found using calcSeparations(): 1525 | 1526 | :: 1527 | 1528 | > (wine.lda$svd)^2 1529 | [1] 794.6522 361.2410 1530 | 1531 | 1532 | .. Note that these are also called "canonical F-statistics". 1533 | .. Note the F statistics I get from aov() are the same as the separation values that I calculate: 1534 | .. > summary(aov(wine.lda.values$x ~ as.factor(wine[,1]))) 1535 | .. Response LD1 : 1536 | .. 
Df Sum Sq Mean Sq F value Pr(>F) 1537 | .. as.factor(wine[, 1]) 2 1589.3 794.65 794.65 < 2.2e-16 *** 1538 | .. Residuals 175 175.0 1.00 1539 | .. --- 1540 | .. Response LD2 : 1541 | .. Df Sum Sq Mean Sq F value Pr(>F) 1542 | .. as.factor(wine[, 1]) 2 722.48 361.24 361.24 < 2.2e-16 *** 1543 | .. Residuals 175 175.00 1.00 1544 | 1545 | A Stacked Histogram of the LDA Values 1546 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1547 | 1548 | A nice way of displaying the results of a linear discriminant analysis (LDA) is to make a stacked histogram of the 1549 | values of the discriminant function for the samples from different groups (different wine cultivars in our example). 1550 | 1551 | We can do this using the "ldahist()" function in R. For example, to make a stacked histogram of the first discriminant 1552 | function's values for wine samples of the three different wine cultivars, we type: 1553 | 1554 | :: 1555 | 1556 | > ldahist(data = wine.lda.values$x[,1], g=wine$V1) 1557 | 1558 | |image8| 1559 | 1560 | We can see from the histogram that cultivars 1 and 3 are well separated by the first 1561 | discriminant function, since the values for the first cultivar are between -6 and -1, 1562 | while the values for cultivar 3 are between 2 and 6, and so there is no overlap in values. 1563 | 1564 | However, the separation achieved by the linear discriminant function on the training 1565 | set may be an overestimate. To get a more accurate idea of how well the first discriminant function 1566 | separates the groups, we would need to see a stacked histogram of the values for the three 1567 | cultivars using some unseen "test set", that is, using 1568 | a set of data that was not used to calculate the linear discriminant function. 1569 | 1570 | We see that the first discriminant function separates cultivars 1 and 3 very well, but 1571 | does not separate cultivars 1 and 2, or cultivars 2 and 3, so well. 
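As noted above, the separation measured on the training set may be optimistic. If no separate test set is available, one common alternative is leave-one-out cross-validation, which the lda() function supports through its "CV" argument. A sketch (when CV=TRUE, lda() returns the cross-validated class predictions in the named element "class", rather than a fitted model):

::

    > wine.lda.cv <- lda(wine[2:14], grouping=wine$V1, CV=TRUE) # leave-one-out
    > table(wine$V1, wine.lda.cv$class) # cross-validated confusion matrix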
1572 | 1573 | We therefore investigate whether the second discriminant function separates those cultivars, 1574 | by making a stacked histogram of the second discriminant function's values: 1575 | 1576 | :: 1577 | 1578 | > ldahist(data = wine.lda.values$x[,2], g=wine$V1) 1579 | 1580 | |image9| 1581 | 1582 | We see that the second discriminant function separates cultivars 1 and 2 quite well, although 1583 | there is a little overlap in their values. Furthermore, the second discriminant function also 1584 | separates cultivars 2 and 3 quite well, although again there is a little overlap in their values so 1585 | it is not perfect. 1586 | 1587 | Thus, we see that two discriminant functions are necessary to separate the cultivars, as was 1588 | discussed above (see the discussion of percentage separation above). 1589 | 1590 | Scatterplots of the Discriminant Functions 1591 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1592 | 1593 | We can obtain a scatterplot of the best two discriminant functions, with the data points labelled by cultivar, by typing: 1594 | 1595 | :: 1596 | 1597 | > plot(wine.lda.values$x[,1],wine.lda.values$x[,2]) # make a scatterplot 1598 | > text(wine.lda.values$x[,1],wine.lda.values$x[,2],wine$V1,cex=0.7,pos=4,col="red") # add labels 1599 | 1600 | |image10| 1601 | 1602 | From the scatterplot of the first two discriminant functions, we can see that the wines from the three 1603 | cultivars are well separated in the scatterplot. The first discriminant function (x-axis) 1604 | separates cultivars 1 and 3 very well, but does not perfectly separate cultivars 1605 | 1 and 2, or cultivars 2 and 3. 1606 | 1607 | The second discriminant function (y-axis) achieves a fairly good separation of cultivars 1608 | 1 and 2, and cultivars 2 and 3, although it is not totally perfect.
1609 | 1610 | To achieve a very good separation of the three cultivars, it would be best to use both the first and second 1611 | discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well, 1612 | and the second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well. 1613 | 1614 | Allocation Rules and Misclassification Rate 1615 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1616 | 1617 | We can calculate the mean values of the discriminant functions for each of the three cultivars using the 1618 | "printMeanAndSdByGroup()" function (see above): 1619 | 1620 | :: 1621 | 1622 | > printMeanAndSdByGroup(wine.lda.values$x,wine[1]) 1623 | [1] "Means:" 1624 | V1 LD1 LD2 1625 | 1 1 -3.42248851 1.691674 1626 | 2 2 -0.07972623 -2.472656 1627 | 3 3 4.32473717 1.578120 1628 | 1629 | We find that the mean value of the first discriminant function is -3.42248851 for cultivar 1, -0.07972623 for cultivar 2, 1630 | and 4.32473717 for cultivar 3. The mid-way point between the mean values for cultivars 1 and 2 is (-3.42248851-0.07972623)/2=-1.751107, 1631 | and the mid-way point between the mean values for cultivars 2 and 3 is (-0.07972623+4.32473717)/2 = 2.122505. 
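These mid-way points can also be calculated in R, using the mean values printed above:

::

    > ld1means <- c(-3.42248851, -0.07972623, 4.32473717) # mean of LD1 for cultivars 1, 2 and 3
    > (ld1means[1] + ld1means[2])/2 # cut-off between cultivars 1 and 2
    [1] -1.751107
    > (ld1means[2] + ld1means[3])/2 # cut-off between cultivars 2 and 3
    [1] 2.122505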
1632 | 1633 | Therefore, we can use the following allocation rule: 1634 | 1635 | * if the first discriminant function is <= -1.751107, predict the sample to be from cultivar 1 1636 | * if the first discriminant function is > -1.751107 and <= 2.122505, predict the sample to be from cultivar 2 1637 | * if the first discriminant function is > 2.122505, predict the sample to be from cultivar 3 1638 | 1639 | We can examine the accuracy of this allocation rule by using the "calcAllocationRuleAccuracy()" function below: 1640 | 1641 | :: 1642 | 1643 | > calcAllocationRuleAccuracy <- function(ldavalue, groupvariable, cutoffpoints) 1644 | { 1645 | # find out how many values the group variable can take 1646 | groupvariable2 <- as.factor(groupvariable[[1]]) 1647 | levels <- levels(groupvariable2) 1648 | numlevels <- length(levels) 1649 | # calculate the number of true positives and false negatives for each group 1650 | 1651 | for (i in 1:numlevels) 1652 | { 1653 | leveli <- levels[i] 1654 | levelidata <- ldavalue[groupvariable==leveli] 1655 | # see how many of the samples from this group are classified in each group 1656 | for (j in 1:numlevels) 1657 | { 1658 | levelj <- levels[j] 1659 | if (j == 1) 1660 | { 1661 | cutoff1 <- cutoffpoints[1] 1662 | cutoff2 <- "NA" 1663 | results <- summary(levelidata <= cutoff1) 1664 | } 1665 | else if (j == numlevels) 1666 | { 1667 | cutoff1 <- cutoffpoints[(numlevels-1)] 1668 | cutoff2 <- "NA" 1669 | results <- summary(levelidata > cutoff1) 1670 | } 1671 | else 1672 | { 1673 | cutoff1 <- cutoffpoints[(j-1)] 1674 | cutoff2 <- cutoffpoints[(j)] 1675 | results <- summary(levelidata > cutoff1 & levelidata <= cutoff2) 1676 | } 1677 | trues <- results["TRUE"] 1678 | trues <- trues[[1]] 1679 | print(paste("Number of samples of group",leveli,"classified as group",levelj," : ", 1680 | trues,"(cutoffs:",cutoff1,",",cutoff2,")")) 1681 | } 1682 | } 1683 | } 1684 | 1685 | For example, to calculate the accuracy for the wine
data based on the allocation 1686 | rule for the first discriminant function, we type: 1687 | 1688 | :: 1689 | 1690 | > calcAllocationRuleAccuracy(wine.lda.values$x[,1], wine[1], c(-1.751107, 2.122505)) 1691 | [1] "Number of samples of group 1 classified as group 1 : 56 (cutoffs: -1.751107 , NA )" 1692 | [1] "Number of samples of group 1 classified as group 2 : 3 (cutoffs: -1.751107 , 2.122505 )" 1693 | [1] "Number of samples of group 1 classified as group 3 : NA (cutoffs: 2.122505 , NA )" 1694 | [1] "Number of samples of group 2 classified as group 1 : 5 (cutoffs: -1.751107 , NA )" 1695 | [1] "Number of samples of group 2 classified as group 2 : 65 (cutoffs: -1.751107 , 2.122505 )" 1696 | [1] "Number of samples of group 2 classified as group 3 : 1 (cutoffs: 2.122505 , NA )" 1697 | [1] "Number of samples of group 3 classified as group 1 : NA (cutoffs: -1.751107 , NA )" 1698 | [1] "Number of samples of group 3 classified as group 2 : NA (cutoffs: -1.751107 , 2.122505 )" 1699 | [1] "Number of samples of group 3 classified as group 3 : 48 (cutoffs: 2.122505 , NA )" 1700 | 1701 | This can be displayed in a "confusion matrix": 1702 | 1703 | +------------+----------------------+----------------------+----------------------+ 1704 | | | Allocated to group 1 | Allocated to group 2 | Allocated to group 3 | 1705 | +============+======================+======================+======================+ 1706 | | Is group 1 | 56 | 3 | 0 | 1707 | +------------+----------------------+----------------------+----------------------+ 1708 | | Is group 2 | 5 | 65 | 1 | 1709 | +------------+----------------------+----------------------+----------------------+ 1710 | | Is group 3 | 0 | 0 | 48 | 1711 | +------------+----------------------+----------------------+----------------------+ 1712 | 1713 | There are 3+5+1=9 wine samples that are misclassified, out of (56+3+5+65+1+48=) 178 wine samples: 1714 | 3 samples from cultivar 1 are predicted to be from cultivar 2, 5 samples from cultivar 2 are 
predicted 1715 | to be from cultivar 1, and 1 sample from cultivar 2 is predicted to be from cultivar 3. 1716 | Therefore, the misclassification rate is 9/178, or 5.1%. The misclassification rate is quite low, 1717 | and therefore the accuracy of the allocation rule appears to be relatively high. 1718 | 1719 | However, this is probably an underestimate of the misclassification rate, as the allocation rule was based on this data (this is 1720 | the "training set"). If we calculated the misclassification rate for a separate "test set" consisting of data other than that 1721 | used to make the allocation rule, we would probably get a higher estimate of the misclassification rate. 1722 | 1723 | Links and Further Reading 1724 | ------------------------- 1725 | 1726 | Here are some links for further reading. 1727 | 1728 | For a more in-depth introduction to R, a good online tutorial is 1729 | available on the "Kickstarting R" website, 1730 | `cran.r-project.org/doc/contrib/Lemon-kickstart <http://cran.r-project.org/doc/contrib/Lemon-kickstart/>`_. 1731 | 1732 | There is another nice (slightly more in-depth) tutorial on R 1733 | available on the "Introduction to R" website, 1734 | `cran.r-project.org/doc/manuals/R-intro.html <http://cran.r-project.org/doc/manuals/R-intro.html>`_. 1735 | 1736 | To learn about multivariate analysis, I would highly recommend the book "Multivariate 1737 | analysis" (product code M249/03) by the Open University, available from `the Open University Shop 1738 | `_. 1739 | 1740 | There is a book available in the "Use R!" series on using R for multivariate analyses, 1741 | `An Introduction to Applied Multivariate Analysis with R `_ 1742 | by Everitt and Hothorn. 1743 | 1744 | Acknowledgements 1745 | ---------------- 1746 | 1747 | Many of the examples in this booklet are inspired by examples in the excellent Open University book, 1748 | "Multivariate Analysis" (product code M249/03), 1749 | available from `the Open University Shop `_.
1750 | 1751 | I am grateful to the UCI Machine Learning Repository, 1752 | `http://archive.ics.uci.edu/ml <http://archive.ics.uci.edu/ml>`_, for making data sets available 1753 | which I have used in the examples in this booklet. 1754 | 1755 | Thank you to the following users for very helpful comments: to Rich O'Hara and Patrick Hausmann for pointing 1756 | out that sd() and mean() are deprecated; to Arnau Serra-Cayuela for pointing out a typo 1757 | in the LDA section; to John Christie for suggesting a more compact form for my printMeanAndSdByGroup() function, 1758 | and to Rama Ramakrishnan for suggesting a more compact form for my mosthighlycorrelated() function. 1759 | 1760 | Contact 1761 | ------- 1762 | 1763 | I would be grateful if you would send me (`Avril Coghlan `_) corrections or suggestions for improvements to 1764 | my email address alc@sanger.ac.uk 1765 | 1766 | License 1767 | ------- 1768 | 1769 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License 1770 | `_. 1771 | 1772 | .. |image1| image:: ../_static/image1.png 1773 | :width: 500 1774 | .. |image2| image:: ../_static/image2.png 1775 | :width: 400 1776 | .. |image4| image:: ../_static/image4.png 1777 | :width: 400 1778 | .. |image5| image:: ../_static/image5.png 1779 | :width: 400 1780 | .. |image6| image:: ../_static/image6.png 1781 | :width: 400 1782 | .. |image7| image:: ../_static/image7.png 1783 | :width: 400 1784 | .. |image8| image:: ../_static/image8.png 1785 | :width: 400 1786 | .. |image9| image:: ../_static/image9.png 1787 | :width: 400 1788 | .. |image10| image:: ../_static/image10.png 1789 | :width: 400 1790 | --------------------------------------------------------------------------------