├── README ├── _static ├── image1.png ├── image2.png ├── image3.png ├── image4.png ├── image5.png ├── image6.png ├── image7.png ├── image8.png ├── image9.png └── image10.png ├── _build └── latex │ └── MultivariateAnalysis.pdf ├── LittleBookofRMultivariateAnalysis └── src │ └── multivariateanalysis.rst ├── index.rst ├── make.bat ├── conf.py └── src ├── installr.rst └── multivariateanalysis.rst /README: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /_static/image1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image1.png -------------------------------------------------------------------------------- /_static/image2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image2.png -------------------------------------------------------------------------------- /_static/image3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image3.png -------------------------------------------------------------------------------- /_static/image4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image4.png -------------------------------------------------------------------------------- /_static/image5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image5.png 
-------------------------------------------------------------------------------- /_static/image6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image6.png -------------------------------------------------------------------------------- /_static/image7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image7.png -------------------------------------------------------------------------------- /_static/image8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image8.png -------------------------------------------------------------------------------- /_static/image9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image9.png -------------------------------------------------------------------------------- /_static/image10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_static/image10.png -------------------------------------------------------------------------------- /_build/latex/MultivariateAnalysis.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/_build/latex/MultivariateAnalysis.pdf -------------------------------------------------------------------------------- /LittleBookofRMultivariateAnalysis/src/multivariateanalysis.rst: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/avrilcoghlan/LittleBookofRMultivariateAnalysis/HEAD/LittleBookofRMultivariateAnalysis/src/multivariateanalysis.rst -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | Welcome to a Little Book of R for Multivariate Analysis! 2 | ======================================================== 3 | 4 | By `Avril Coghlan `_, 5 | Wellcome Trust Sanger Institute, Cambridge, U.K. 6 | Email: alc@sanger.ac.uk 7 | 8 | This is a simple introduction to multivariate analysis using the R statistics software. 9 | 10 | There is a pdf version of this booklet available at: 11 | `https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf `_. 12 | 13 | If you like this booklet, you may also like to check out my booklet on using 14 | R for biomedical statistics, 15 | `http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/ 16 | `_, 17 | and my booklet on using R for time series analysis, 18 | `http://a-little-book-of-r-for-time-series.readthedocs.org/ 19 | `_. 20 | 21 | Contents: 22 | 23 | .. toctree:: 24 | :maxdepth: 3 25 | 26 | src/installr.rst 27 | src/multivariateanalysis.rst 28 | 29 | Acknowledgements 30 | ---------------- 31 | 32 | Thank you to Noel O'Boyle for helping in using Sphinx, `http://sphinx.pocoo.org `_, to create 33 | this document, and github, `https://github.com/ `_, to store different versions of the document 34 | as I was writing it, and readthedocs, `http://readthedocs.org/ `_, to build and distribute 35 | this document. 
36 | 37 | Contact 38 | ------- 39 | 40 | I will be very grateful if you will send me (`Avril Coghlan `_) corrections or suggestions for improvements to 41 | my email address alc@sanger.ac.uk 42 | 43 | License 44 | ------- 45 | 46 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License 47 | `_. 48 | 49 | -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | REM Command file for Sphinx documentation 4 | 5 | set SPHINXBUILD=C:\Python26\Scripts\sphinx-build 6 | set BUILDDIR=_build 7 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . 8 | if NOT "%PAPER%" == "" ( 9 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% 10 | ) 11 | 12 | if "%1" == "" goto help 13 | 14 | if "%1" == "help" ( 15 | :help 16 | echo.Please use `make ^` where ^ is one of 17 | echo. html to make standalone HTML files 18 | echo. dirhtml to make HTML files named index.html in directories 19 | echo. pickle to make pickle files 20 | echo. json to make JSON files 21 | echo. htmlhelp to make HTML files and a HTML help project 22 | echo. qthelp to make HTML files and a qthelp project 23 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter 24 | echo. changes to make an overview over all changed/added/deprecated items 25 | echo. linkcheck to check all external links for integrity 26 | echo. doctest to run all doctests embedded in the documentation if enabled 27 | goto end 28 | ) 29 | 30 | if "%1" == "clean" ( 31 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i 32 | del /q /s %BUILDDIR%\* 33 | goto end 34 | ) 35 | 36 | if "%1" == "html" ( 37 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html 38 | echo. 39 | echo.Build finished. The HTML pages are in %BUILDDIR%/html. 40 | goto end 41 | ) 42 | 43 | if "%1" == "dirhtml" ( 44 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml 45 | echo. 
46 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. 47 | goto end 48 | ) 49 | 50 | if "%1" == "pickle" ( 51 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle 52 | echo. 53 | echo.Build finished; now you can process the pickle files. 54 | goto end 55 | ) 56 | 57 | if "%1" == "json" ( 58 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json 59 | echo. 60 | echo.Build finished; now you can process the JSON files. 61 | goto end 62 | ) 63 | 64 | if "%1" == "htmlhelp" ( 65 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp 66 | echo. 67 | echo.Build finished; now you can run HTML Help Workshop with the ^ 68 | .hhp project file in %BUILDDIR%/htmlhelp. 69 | goto end 70 | ) 71 | 72 | if "%1" == "qthelp" ( 73 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp 74 | echo. 75 | echo.Build finished; now you can run "qcollectiongenerator" with the ^ 76 | .qhcp project file in %BUILDDIR%/qthelp, like this: 77 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\sampledoc.qhcp 78 | echo.To view the help file: 79 | echo.^> assistant -collectionFile %BUILDDIR%\qthelp\sampledoc.ghc 80 | goto end 81 | ) 82 | 83 | if "%1" == "latex" ( 84 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 85 | echo. 86 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. 87 | goto end 88 | ) 89 | 90 | if "%1" == "changes" ( 91 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes 92 | echo. 93 | echo.The overview file is in %BUILDDIR%/changes. 94 | goto end 95 | ) 96 | 97 | if "%1" == "linkcheck" ( 98 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck 99 | echo. 100 | echo.Link check complete; look for any errors in the above output ^ 101 | or in %BUILDDIR%/linkcheck/output.txt. 102 | goto end 103 | ) 104 | 105 | if "%1" == "doctest" ( 106 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest 107 | echo. 
108 | echo.Testing of doctests in the sources finished, look at the ^ 109 | results in %BUILDDIR%/doctest/output.txt. 110 | goto end 111 | ) 112 | 113 | if "%1" == "pdf" ( 114 | %SPHINXBUILD% -b pdf %ALLSPHINXOPTS% %BUILDDIR%/pdf 115 | echo. 116 | echo.Build finished. The PDF files are in _build/pdf. 117 | goto end 118 | ) 119 | 120 | :end 121 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # sampledoc documentation build configuration file, created by 4 | # sphinx-quickstart on Sat Jan 09 18:21:28 2010. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 13 | 14 | import sys, os 15 | 16 | # If extensions (or modules to document with autodoc) are in another directory, 17 | # add these directories to sys.path here. If the directory is relative to the 18 | # documentation root, use os.path.abspath to make it absolute, like shown here. 19 | #sys.path.append(os.path.abspath('.')) 20 | 21 | # -- General configuration ----------------------------------------------------- 22 | 23 | # Add any Sphinx extension module names here, as strings. They can be extensions 24 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 25 | # (Noel) adding rst2pdf 26 | extensions = ['sphinx.ext.autodoc'] # ,'rst2pdf.pdfbuilder'] 27 | 28 | # Add any paths that contain templates here, relative to this directory. 29 | templates_path = ['_templates'] 30 | 31 | # The suffix of source filenames. 32 | source_suffix = '.rst' 33 | 34 | # The encoding of source files. 35 | #source_encoding = 'utf-8' 36 | 37 | # The master toctree document. 
38 | master_doc = 'index' 39 | 40 | # General information about the project. 41 | project = u'Multivariate Analysis' 42 | copyright = u'2010, Avril Coghlan' 43 | 44 | # The version info for the project you're documenting, acts as replacement for 45 | # |version| and |release|, also used in various other places throughout the 46 | # built documents. 47 | # 48 | # The short X.Y version. 49 | version = '0.1' 50 | # The full version, including alpha/beta/rc tags. 51 | release = '0.1' 52 | 53 | # The language for content autogenerated by Sphinx. Refer to documentation 54 | # for a list of supported languages. 55 | #language = None 56 | 57 | # There are two options for replacing |today|: either, you set today to some 58 | # non-false value, then it is used: 59 | #today = '' 60 | # Else, today_fmt is used as the format for a strftime call. 61 | #today_fmt = '%B %d, %Y' 62 | 63 | # List of documents that shouldn't be included in the build. 64 | #unused_docs = [] 65 | 66 | # List of directories, relative to source directory, that shouldn't be searched 67 | # for source files. 68 | exclude_trees = ['_build'] 69 | 70 | # The reST default role (used for this markup: `text`) to use for all documents. 71 | #default_role = None 72 | 73 | # If true, '()' will be appended to :func: etc. cross-reference text. 74 | #add_function_parentheses = True 75 | 76 | # If true, the current module name will be prepended to all description 77 | # unit titles (such as .. function::). 78 | #add_module_names = True 79 | 80 | # If true, sectionauthor and moduleauthor directives will be shown in the 81 | # output. They are ignored by default. 82 | #show_authors = False 83 | 84 | # The name of the Pygments (syntax highlighting) style to use. 85 | pygments_style = 'sphinx' 86 | 87 | # A list of ignored prefixes for module index sorting. 
88 | #modindex_common_prefix = [] 89 | 90 | 91 | # -- Options for HTML output --------------------------------------------------- 92 | 93 | # The theme to use for HTML and HTML Help pages. Major themes that come with 94 | # Sphinx are currently 'default' and 'sphinxdoc'. 95 | # html_theme = 'default' 96 | html_theme = 'sphinxdoc' 97 | 98 | # Theme options are theme-specific and customize the look and feel of a theme 99 | # further. For a list of options available for each theme, see the 100 | # documentation. 101 | #html_theme_options = {} 102 | 103 | # Add any paths that contain custom themes here, relative to this directory. 104 | #html_theme_path = [] 105 | 106 | # The name for this set of Sphinx documents. If None, it defaults to 107 | # " v documentation". 108 | #html_title = None 109 | 110 | # A shorter title for the navigation bar. Default is the same as html_title. 111 | #html_short_title = None 112 | 113 | # The name of an image file (relative to this directory) to place at the top 114 | # of the sidebar. 115 | #html_logo = None 116 | 117 | # The name of an image file (within the static path) to use as favicon of the 118 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 119 | # pixels large. 120 | #html_favicon = None 121 | 122 | # Add any paths that contain custom static files (such as style sheets) here, 123 | # relative to this directory. They are copied after the builtin static files, 124 | # so a file named "default.css" will overwrite the builtin "default.css". 125 | html_static_path = ['_static'] 126 | 127 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 128 | # using the given strftime format. 129 | #html_last_updated_fmt = '%b %d, %Y' 130 | 131 | # If true, SmartyPants will be used to convert quotes and dashes to 132 | # typographically correct entities. 133 | #html_use_smartypants = True 134 | 135 | # Custom sidebar templates, maps document names to template names. 
136 | #html_sidebars = {} 137 | 138 | # Additional templates that should be rendered to pages, maps page names to 139 | # template names. 140 | #html_additional_pages = {} 141 | 142 | # If false, no module index is generated. 143 | #html_use_modindex = True 144 | 145 | # If false, no index is generated. 146 | #html_use_index = True 147 | 148 | # If true, the index is split into individual pages for each letter. 149 | #html_split_index = False 150 | 151 | # If true, links to the reST sources are added to the pages. 152 | #html_show_sourcelink = True 153 | 154 | # If true, an OpenSearch description file will be output, and all pages will 155 | # contain a tag referring to it. The value of this option must be the 156 | # base URL from which the finished HTML is served. 157 | #html_use_opensearch = '' 158 | 159 | # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). 160 | #html_file_suffix = '' 161 | 162 | # Output file base name for HTML help builder. 163 | htmlhelp_basename = 'sampledocdoc' 164 | 165 | 166 | # -- Options for LaTeX output -------------------------------------------------- 167 | 168 | # The paper size ('letter' or 'a4'). 169 | latex_paper_size = 'a4' 170 | 171 | # The font size ('10pt', '11pt' or '12pt'). 172 | #latex_font_size = '10pt' 173 | 174 | # Grouping the document tree into LaTeX files. List of tuples 175 | # (source start file, target name, title, author, documentclass [howto/manual]). 176 | latex_documents = [ 177 | ('index', 'MultivariateAnalysis.tex', u'A Little Book of R For Multivariate Analysis', 178 | u'Avril Coghlan', 'manual'), 179 | ] 180 | 181 | # The name of an image file (relative to this directory) to place at the top of 182 | # the title page. 183 | #latex_logo = None 184 | 185 | # For "manual" documents, if this is true, then toplevel headings are parts, 186 | # not chapters. 187 | #latex_use_parts = False 188 | 189 | # Additional stuff for the LaTeX preamble. 
190 | #latex_preamble = '' 191 | 192 | # Documents to append as an appendix to all manuals. 193 | #latex_appendices = [] 194 | 195 | # If false, no module index is generated. 196 | #latex_use_modindex = True 197 | 198 | # (Noel) The following is all from rst2pdf 199 | # -- Options for PDF output -------------------------------------------------- 200 | 201 | # Grouping the document tree into PDF files. List of tuples 202 | # (source start file, target name, title, author, options). 203 | # 204 | # If there is more than one author, separate them with \\. 205 | # For example: r'Guido van Rossum\\Fred L. Drake, Jr., editor' 206 | # 207 | # The options element is a dictionary that lets you override 208 | # this config per-document. 209 | # For example, 210 | # ('index', u'MyProject', u'My Project', u'Author Name', 211 | # dict(pdf_compressed = True)) 212 | # would mean that specific document would be compressed 213 | # regardless of the global pdf_compressed setting. 214 | 215 | pdf_documents = [ 216 | ('index', u'MyProject', u'My Project', u'Author Name'), 217 | ] 218 | 219 | # A comma-separated list of custom stylesheets. Example: 220 | pdf_stylesheets = ['sphinx','kerning','a4'] 221 | 222 | # Create a compressed PDF 223 | # Use True/False or 1/0 224 | # Example: compressed=True 225 | #pdf_compressed = False 226 | 227 | # A colon-separated list of folders to search for fonts. Example: 228 | # pdf_font_path = ['/usr/share/fonts', '/usr/share/texmf-dist/fonts/'] 229 | 230 | # Language to be used for hyphenation support 231 | #pdf_language = "en_US" 232 | 233 | # Mode for literal blocks wider than the frame. Can be 234 | # overflow, shrink or truncate 235 | #pdf_fit_mode = "shrink" 236 | 237 | # Section level that forces a break page. 
238 | # For example: 1 means top-level sections start in a new page 239 | # 0 means disabled 240 | #pdf_break_level = 0 241 | 242 | # When a section starts in a new page, force it to be 'even', 'odd', 243 | # or just use 'any' 244 | #pdf_breakside = 'any' 245 | 246 | # Insert footnotes where they are defined instead of 247 | # at the end. 248 | #pdf_inline_footnotes = True 249 | 250 | # verbosity level. 0 1 or 2 251 | #pdf_verbosity = 0 252 | 253 | # If false, no index is generated. 254 | #pdf_use_index = True 255 | 256 | # If false, no modindex is generated. 257 | #pdf_use_modindex = True 258 | 259 | # If false, no coverpage is generated. 260 | #pdf_use_coverpage = True 261 | 262 | # Documents to append as an appendix to all manuals. 263 | #pdf_appendices = [] 264 | 265 | # Enable experimental feature to split table cells. Use it 266 | # if you get "DelayedTable too big" errors 267 | #pdf_splittables = False 268 | 269 | # Set the default DPI for images 270 | #pdf_default_dpi = 72 271 | 272 | # Enable rst2pdf extension modules (default is empty list) 273 | # you need vectorpdf for better sphinx's graphviz support 274 | #pdf_extensions = ['vectorpdf'] 275 | 276 | 277 | -------------------------------------------------------------------------------- /src/installr.rst: -------------------------------------------------------------------------------- 1 | How to install R 2 | ================ 3 | 4 | Introduction to R 5 | ----------------- 6 | 7 | This little booklet has some information on how to use R for multivariate analysis. 8 | 9 | R (`www.r-project.org `_) is a commonly used 10 | free statistics software. R allows you to carry out statistical 11 | analyses in an interactive mode, as well as allowing simple programming. 12 | 13 | Installing R 14 | ------------ 15 | 16 | To use R, you first need to install the R program on your computer.
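If you can already open an R console on the machine, you can also ask R itself which version is installed. A minimal sketch (the exact version string printed will depend on your installation):

```r
# Print the version string of the running R installation,
# eg. something like "R version 2.10.0 (2009-10-26)"
print(R.version.string)

# More detail (platform, major/minor version numbers) is stored
# in the built-in "R.version" list
print(R.version$major)
print(R.version$minor)
```

This is a quick way to compare the installed version against the latest release listed on CRAN.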
17 | 18 | How to check if R is installed on a Windows PC 19 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 20 | 21 | Before you install R on your computer, the first thing to do is to check whether 22 | R is already installed on your computer (for example, by a previous user). 23 | 24 | These instructions will focus on installing R on a Windows PC. However, I will also 25 | briefly mention how to install R on a Macintosh or Linux computer (see below). 26 | 27 | If you are using a Windows PC, there are two ways you can check whether R is 28 | already installed on your computer: 29 | 30 | 1. Check if there is an "R" icon on the desktop of the computer that you are using. 31 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 2 instead. 32 | 2. Click on the "Start" menu at the bottom left of your Windows desktop, and then move your 33 | mouse over "All Programs" in the menu that pops up. See if "R" appears in the list 34 | of programs that pops up. If it does, it means that R is already installed on your 35 | computer, and you can start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, 36 | eg. R 2.10.0) from the list. 37 | 38 | If either (1) or (2) above succeeds in starting R, it means that R is already installed 39 | on the computer that you are using. (If neither succeeds, R is not installed yet.) 40 | If there is an old version of R installed on the Windows PC that you are using, 41 | it is worth installing the latest version of R, to make sure that you have all the 42 | latest R functions available to use. 43 | 44 | Finding out what is the latest version of R 45 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 46 | 47 | To find out what is the latest version of R, you can look at the CRAN (Comprehensive 48 | R Archive Network) website, `http://cran.r-project.org/ `_. 49 | 50 | Beside "The latest release" (about half way down the page), it will say something like 51 | "R-X.X.X.tar.gz" (eg.
"R-2.12.1.tar.gz"). This means that the latest release of R is X.X.X (for 52 | example, 2.12.1). 53 | 54 | New releases of R are made very regularly (approximately once a month), as R is actively being 55 | improved all the time. It is worthwhile installing new versions of R regularly, to make sure 56 | that you have a recent version of R (to ensure compatibility with all the latest versions of 57 | the R packages that you have downloaded). 58 | 59 | Installing R on a Windows PC 60 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 61 | 62 | To install R on your Windows computer, follow these steps: 63 | 64 | 1. Go to `http://ftp.heanet.ie/mirrors/cran.r-project.org `_. 65 | 2. Under "Download and Install R", click on the "Windows" link. 66 | 3. Under "Subdirectories", click on the "base" link. 67 | 4. On the next page, you should see a link saying something like "Download R 2.10.1 for Windows" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.11.1). 68 | Click on this link. 69 | 5. You may be asked if you want to save or run a file "R-2.10.1-win32.exe". Choose "Save" and 70 | save the file on the Desktop. Then double-click on the icon for the file to run it. 71 | 6. You will be asked what language to install it in - choose English. 72 | 7. The R Setup Wizard will appear in a window. Click "Next" at the bottom of the R Setup wizard 73 | window. 74 | 8. The next page says "Information" at the top. Click "Next" again. 75 | 9. The next page says "Information" at the top. Click "Next" again. 76 | 10. The next page says "Select Destination Location" at the top. 77 | By default, it will suggest installing R in "C:\\Program Files" on your computer. 78 | 11. Click "Next" at the bottom of the R Setup wizard window. 79 | 12. The next page says "Select components" at the top. Click "Next" again. 80 | 13. The next page says "Startup options" at the top. Click "Next" again. 81 | 14. The next page says "Select start menu folder" at the top. Click "Next" again. 82 | 15. 
The next page says "Select additional tasks" at the top. Click "Next" again. 83 | 16. R should now be installed. This will take about a minute. When R has finished, you will 84 | see "Completing the R for Windows Setup Wizard" appear. Click "Finish". 85 | 17. To start R, you can either follow step 18 or 19: 86 | 18. Check if there is an "R" icon on the desktop of the computer that you are using. 87 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 19 instead. 88 | 19. Click on the "Start" button at the bottom left of your computer screen, and then 89 | choose "All programs", and start R by selecting "R" (or R X.X.X, where 90 | X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 91 | 20. The R console (a rectangle) should pop up: 92 | 93 | |image3| 94 | 95 | How to install R on non-Windows computers (eg. Macintosh or Linux computers) 96 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 97 | 98 | The instructions above are for installing R on a Windows PC. If you want to install R 99 | on a computer that has a non-Windows operating system (for example, a Macintosh or a computer running Linux), 100 | you should download the appropriate R installer for that operating system at 101 | `http://ftp.heanet.ie/mirrors/cran.r-project.org 102 | `_ and 103 | follow the R installation instructions for the appropriate operating system at 104 | `http://ftp.heanet.ie/mirrors/cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-R-be-installed_003f 105 | `_. 106 | 107 | Installing R packages 108 | --------------------- 109 | 110 | R comes with some standard packages that are installed when you install R. However, in this 111 | booklet I will also tell you how to use some additional R packages that are useful, for example, 112 | the "rmeta" package. These additional packages do not come with the standard installation of R, 113 | so you need to install them yourself.
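If you prefer typing commands to clicking through menus, the same installation can usually be done directly from the R console with the built-in install.packages() function. A sketch, using the "rmeta" package from the text as the example (it needs an internet connection to reach a CRAN mirror):

```r
# Download and install the "rmeta" package from a CRAN mirror
# (does the same job as "Install package(s)" in the "Packages" menu)
install.packages("rmeta")

# After installing, load the package in each new R session before using it
library("rmeta")
```

As with the menu route, R may ask you to pick a download mirror the first time.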
114 | 115 | How to install an R package 116 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 117 | 118 | Once you have installed R on a Windows computer (following the steps above), you can install 119 | an additional package by following the steps below: 120 | 121 | 1. To start R, follow either step 2 or 3: 122 | 2. Check if there is an "R" icon on the desktop of the computer that you are using. 123 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 3 instead. 124 | 3. Click on the "Start" button at the bottom left of your computer screen, and then 125 | choose "All programs", and start R by selecting "R" (or R X.X.X, where 126 | X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 127 | 4. The R console (a rectangle) should pop up. 128 | 5. Once you have started R, you can now install an R package (eg. the "rmeta" package) by 129 | choosing "Install package(s)" from the "Packages" menu at the top of the R console. 130 | This will ask you what website you want to download the package from; you should choose 131 | "Ireland" (or another country, if you prefer). It will also bring up a list of available 132 | packages that you can install, and you should choose the package that you want to install 133 | from that list (eg. "rmeta"). 134 | 6. This will install the "rmeta" package. 135 | 7. The "rmeta" package is now installed. Whenever you want to use the "rmeta" package after this, 136 | after starting R, you first have to load the package by typing into the R console: 137 | 138 | .. highlight:: r 139 | 140 | :: 141 | 142 | > library("rmeta") 143 | 144 | Note that there are some additional R packages for bioinformatics that are part of a special 145 | set of R packages called Bioconductor (`www.bioconductor.org `_), 146 | such as the "yeastExpData" R package, the "Biostrings" R package, etc.
147 | These Bioconductor packages need to be installed using a different, Bioconductor-specific procedure 148 | (see `How to install a Bioconductor R package`_ below). 149 | 150 | How to install a Bioconductor R package 151 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 152 | 153 | The procedure above can be used to install the majority of R packages. However, the 154 | Bioconductor set of bioinformatics R packages need to be installed by a special procedure. 155 | Bioconductor (`www.bioconductor.org `_) 156 | is a group of R packages that have been developed for bioinformatics. This includes 157 | R packages such as "yeastExpData", "Biostrings", etc. 158 | 159 | To install the Bioconductor packages, follow these steps: 160 | 161 | 1. To start R, follow either step 2 or 3: 162 | 2. Check if there is an "R" icon on the desktop of the computer that you are using. 163 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 3 instead. 164 | 3. Click on the "Start" button at the bottom left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 165 | 4. The R console (a rectangle) should pop up. 166 | 5. Once you have started R, now type in the R console: 167 | 168 | .. highlight:: r 169 | 170 | :: 171 | 172 | > source("http://bioconductor.org/biocLite.R") 173 | > biocLite() 174 | 175 | 6. This will install a core set of Bioconductor packages ("affy", "affydata", "affyPLM", 176 | "annaffy", "annotate", "Biobase", "Biostrings", "DynDoc", "gcrma", "genefilter", 177 | "geneplotter", "hgu95av2.db", "limma", "marray", "matchprobes", "multtest", "ROC", 178 | "vsn", "xtable", "affyQCReport"). 179 | This takes a few minutes (eg. 10 minutes). 180 | 7. At a later date, you may wish to install some extra Bioconductor packages that do not belong 181 | to the core set of Bioconductor packages. 
For example, to install the Bioconductor package called 182 | "yeastExpData", start R and type in the R console: 183 | 184 | .. highlight:: r 185 | 186 | :: 187 | 188 | > source("http://bioconductor.org/biocLite.R") 189 | > biocLite("yeastExpData") 190 | 191 | 8. Whenever you want to use a package after installing it, you need to load it into R by typing: 192 | 193 | .. highlight:: r 194 | 195 | :: 196 | 197 | > library("yeastExpData") 198 | 199 | Running R 200 | ----------- 201 | 202 | To use R, you first need to start the R program on your computer. 203 | You should have already installed R on your computer (see above). 204 | 205 | To start R, you can either follow step 1 or 2: 206 | 1. Check if there is an "R" icon on the desktop of the computer that you are using. 207 | If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try step 2 instead. 208 | 2. Click on the "Start" button at the bottom left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, eg. R 2.10.0) from the menu of programs. 209 | 210 | This should bring up a new window, which is the *R console*. 211 | 212 | A brief introduction to R 213 | ------------------------- 214 | 215 | You will type R commands into the R console in order to carry out 216 | analyses in R. In the R console you will see: 217 | 218 | .. highlight:: r 219 | 220 | :: 221 | 222 | > 223 | 224 | This is the R prompt. We type the commands needed for a particular 225 | task after this prompt. The command is carried out after you hit 226 | the Return key. 227 | 228 | Once you have started R, you can start typing in commands, and the 229 | results will be calculated immediately, for example: 230 | 231 | .. highlight:: r 232 | 233 | :: 234 | 235 | > 2*3 236 | [1] 6 237 | > 10-3 238 | [1] 7 239 | 240 | All variables (scalars, vectors, matrices, etc.) created by R are 241 | called *objects*. 
In R, we assign values to variables using an 242 | arrow. For example, we can assign the value 2\*3 to the variable 243 | *x* using the command: 244 | 245 | .. highlight:: r 246 | 247 | :: 248 | 249 | > x <- 2*3 250 | 251 | To view the contents of any R object, just type its name, and the 252 | contents of that R object will be displayed: 253 | 254 | .. highlight:: r 255 | 256 | :: 257 | 258 | > x 259 | [1] 6 260 | 261 | There are several possible different types of objects in R, 262 | including scalars, vectors, matrices, arrays, data frames, tables, 263 | and lists. The scalar variable *x* above is one example of an R 264 | object. While a scalar variable such as *x* has just one element, a 265 | vector consists of several elements. The elements in a vector are 266 | all of the same type (eg. numeric or characters), while lists may 267 | include elements such as characters as well as numeric quantities. 268 | 269 | To create a vector, we can use the c() (combine) function. For 270 | example, to create a vector called *myvector* that has elements 271 | with values 8, 6, 9, 10, and 5, we type: 272 | 273 | .. highlight:: r 274 | 275 | :: 276 | 277 | > myvector <- c(8, 6, 9, 10, 5) 278 | 279 | To see the contents of the variable *myvector*, we can just type 280 | its name: 281 | 282 | .. highlight:: r 283 | 284 | :: 285 | 286 | > myvector 287 | [1] 8 6 9 10 5 288 | 289 | The [1] is the index of the first element in the vector. We can 290 | extract any element of the vector by typing the vector name with 291 | the index of that element given in square brackets. For example, to 292 | get the value of the 4th element in the vector *myvector*, we 293 | type: 294 | 295 | .. highlight:: r 296 | 297 | :: 298 | 299 | > myvector[4] 300 | [1] 10 301 | 302 | In contrast to a vector, a list can contain elements of different 303 | types, for example, both numeric and character elements. A list can 304 | also include other variables such as a vector. 
The list() function 305 | is used to create a list. For example, we could create a list 306 | *mylist* by typing: 307 | 308 | .. highlight:: r 309 | 310 | :: 311 | 312 | > mylist <- list(name="Fred", wife="Mary", myvector) 313 | 314 | We can then print out the contents of the list *mylist* by typing 315 | its name: 316 | 317 | .. highlight:: r 318 | 319 | :: 320 | 321 | > mylist 322 | $name 323 | [1] "Fred" 324 | 325 | $wife 326 | [1] "Mary" 327 | 328 | [[3]] 329 | [1] 8 6 9 10 5 330 | 331 | The elements in a list are numbered, and can be referred to using 332 | indices. We can extract an element of a list by typing the list 333 | name with the index of the element given in double square brackets 334 | (in contrast to a vector, where we only use single square 335 | brackets). Thus, we can extract the second and third elements from 336 | *mylist* by typing: 337 | 338 | .. highlight:: r 339 | 340 | :: 341 | 342 | > mylist[[2]] 343 | [1] "Mary" 344 | > mylist[[3]] 345 | [1] 8 6 9 10 5 346 | 347 | Elements of lists may also be named, and in this case the elements 348 | may be referred to by giving the list name, followed by "$", 349 | followed by the element name. For example, *mylist$name* is the 350 | same as *mylist[[1]]* and *mylist$wife* is the same as 351 | *mylist[[2]]*: 352 | 353 | .. highlight:: r 354 | 355 | :: 356 | 357 | > mylist$wife 358 | [1] "Mary" 359 | 360 | We can find out the names of the named elements in a list by using 361 | the attributes() function, for example: 362 | 363 | .. highlight:: r 364 | 365 | :: 366 | 367 | > attributes(mylist) 368 | $names 369 | [1] "name" "wife" "" 370 | 371 | When you use the attributes() function to find the named elements 372 | of a list variable, the named elements are always listed under a 373 | heading "$names". 
Therefore, we see that the named elements of the 374 | list variable *mylist* are called "name" and "wife", and we can 375 | retrieve their values by typing *mylist$name* and *mylist$wife*, 376 | respectively. 377 | 378 | Another type of object that you will encounter in R is a *table* 379 | variable. For example, if we made a vector variable *mynames* 380 | containing the names of children in a class, we can use the table() 381 | function to produce a table variable that contains the number of 382 | children with each possible name: 383 | 384 | .. highlight:: r 385 | 386 | :: 387 | 388 | > mynames <- c("Mary", "John", "Ann", "Sinead", "Joe", "Mary", "Jim", "John", "Simon") 389 | > table(mynames) 390 | mynames 391 | Ann Jim Joe John Mary Simon Sinead 392 | 1 1 1 2 2 1 1 393 | 394 | We can store the table variable produced by the function table(), 395 | and call the stored table "mytable", by typing: 396 | 397 | .. highlight:: r 398 | 399 | :: 400 | 401 | > mytable <- table(mynames) 402 | 403 | To access elements in a table variable, you need to use double 404 | square brackets, just like accessing elements in a list. For 405 | example, to access the fourth element in the table *mytable* (the 406 | number of children called "John"), we type: 407 | 408 | .. highlight:: r 409 | 410 | :: 411 | 412 | > mytable[[4]] 413 | [1] 2 414 | 415 | Alternatively, you can use the name of the fourth element in 416 | the table ("John") to find the value of that table element: 417 | 418 | .. highlight:: r 419 | 420 | :: 421 | 422 | > mytable[["John"]] 423 | [1] 2 424 | 425 | Functions in R usually require *arguments*, which are input 426 | variables (ie. objects) that are passed to them, which they then 427 | carry out some operation on. For example, the log10() function is 428 | passed a number, and it then calculates the log to the base 10 of 429 | that number: 430 | 431 | .. 
highlight:: r 432 | 433 | :: 434 | 435 | > log10(100) 436 | [1] 2 437 | 438 | In R, you can get help about a particular function by using the 439 | help() function. For example, if you want help about the log10() 440 | function, you can type: 441 | 442 | .. highlight:: r 443 | 444 | :: 445 | 446 | > help("log10") 447 | 448 | When you use the help() function, a box or webpage will pop up with 449 | information about the function that you asked for help with. 450 | 451 | If you are not sure of the name of a function, but think you know 452 | part of its name, you can search for the function name using the 453 | help.search() and RSiteSearch() functions. The help.search() function 454 | searches to see if you already have a function installed (from one of 455 | the R packages that you have installed) that may be related to some 456 | topic you're interested in. The RSiteSearch() function searches all 457 | R functions (including those in packages that you haven't yet installed) 458 | for functions related to the topic you are interested in. 459 | 460 | For example, if you want to know if there 461 | is a function to calculate the standard deviation of a set of 462 | numbers, you can search for the names of all installed functions containing 463 | the word "deviation" in their description by typing: 464 | 465 | ..
highlight:: r 466 | 467 | :: 468 | 469 | > help.search("deviation") 470 | Help files with alias or concept or title matching 471 | 'deviation' using fuzzy matching: 472 | 473 | genefilter::rowSds 474 | Row variance and standard deviation of 475 | a numeric array 476 | nlme::pooledSD Extract Pooled Standard Deviation 477 | stats::mad Median Absolute Deviation 478 | stats::sd Standard Deviation 479 | vsn::meanSdPlot Plot row standard deviations versus row means 480 | 481 | Among the functions that were found is the function sd() in the 482 | "stats" package (an R package that comes with the standard R 483 | installation), which is used for calculating the standard deviation. 484 | 485 | In the example above, the help.search() function found a relevant 486 | function (sd() here). However, if you did not find what you were looking 487 | for with help.search(), you could then use the RSiteSearch() function to 488 | see if a search of all functions described on the R website may find 489 | something relevant to the topic that you're interested in: 490 | 491 | .. highlight:: r 492 | 493 | :: 494 | 495 | > RSiteSearch("deviation") 496 | 497 | The results of the RSiteSearch() function will be hits to descriptions 498 | of R functions, as well as to R mailing list discussions of those 499 | functions. 500 | 501 | We can perform computations with R using objects such as scalars 502 | and vectors. For example, to calculate the average of the values in 503 | the vector *myvector* (ie. the average of 8, 6, 9, 10 and 5), we 504 | can use the mean() function: 505 | 506 | .. highlight:: r 507 | 508 | :: 509 | 510 | > mean(myvector) 511 | [1] 7.6 512 | 513 | We have been using built-in R functions such as mean(), 514 | length(), print(), plot(), etc. We can also create our own 515 | functions in R to do calculations that we want to carry out very 516 | often on different input data sets.
For example, we can create a 517 | function to calculate the value of 20 plus square of some input 518 | number: 519 | 520 | .. highlight:: r 521 | 522 | :: 523 | 524 | > myfunction <- function(x) { return(20 + (x*x)) } 525 | 526 | This function will calculate the square of a number (*x*), and then 527 | add 20 to that value. The return() statement returns the calculated 528 | value. Once you have typed in this function, the function is then 529 | available for use. For example, we can use the function for 530 | different input numbers (eg. 10, 25): 531 | 532 | .. highlight:: r 533 | 534 | :: 535 | 536 | > myfunction(10) 537 | [1] 120 538 | > myfunction(25) 539 | [1] 645 540 | 541 | To quit R, type: 542 | 543 | .. highlight:: r 544 | 545 | :: 546 | 547 | > q() 548 | 549 | 550 | Links and Further Reading 551 | ------------------------- 552 | 553 | Some links are included here for further reading. 554 | 555 | For a more in-depth introduction to R, a good online tutorial is 556 | available on the "Kickstarting R" website, 557 | `cran.r-project.org/doc/contrib/Lemon-kickstart `_. 558 | 559 | There is another nice (slightly more in-depth) tutorial to R 560 | available on the "Introduction to R" website, 561 | `cran.r-project.org/doc/manuals/R-intro.html `_. 562 | 563 | Acknowledgements 564 | ---------------- 565 | 566 | For very helpful comments and suggestions for improvements on the installation instructions, thank you very much to Friedrich Leisch and Phil Spector. 567 | 568 | Contact 569 | ------- 570 | 571 | I will be very grateful if you will send me (`Avril Coghlan `_) corrections or suggestions for improvements to 572 | my email address alc@sanger.ac.uk 573 | 574 | License 575 | ------- 576 | 577 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License 578 | `_. 579 | 580 | .. 
|image3| image:: ../_static/image3.png 581 | -------------------------------------------------------------------------------- /src/multivariateanalysis.rst: -------------------------------------------------------------------------------- 1 | Using R for Multivariate Analysis 2 | ================================= 3 | 4 | Multivariate Analysis 5 | --------------------- 6 | 7 | This booklet tells you how to use the R statistical software to carry out some simple multivariate analyses, 8 | with a focus on principal components analysis (PCA) and linear discriminant analysis (LDA). 9 | 10 | This booklet assumes that the reader has some basic knowledge of multivariate analyses, and 11 | the principal focus of the booklet is not to explain multivariate analyses, but rather 12 | to explain how to carry out these analyses using R. 13 | 14 | If you are new to multivariate analysis, and want to learn more about any of the concepts 15 | presented here, I would highly recommend the Open University book 16 | "Multivariate Analysis" (product code M249/03), available 17 | from `the Open University Shop `_. 18 | 19 | In the examples in this booklet, I will be using data sets from the UCI Machine 20 | Learning Repository, `http://archive.ics.uci.edu/ml <http://archive.ics.uci.edu/ml>`_. 21 | 22 | There is a pdf version of this booklet available at 23 | `https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf <https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf>`_. 24 | 25 | If you like this booklet, you may also like to check out my booklet on using 26 | R for biomedical statistics, 27 | `http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/ 28 | <http://a-little-book-of-r-for-biomedical-statistics.readthedocs.org/>`_, 29 | and my booklet on using R for time series analysis, 30 | `http://a-little-book-of-r-for-time-series.readthedocs.org/ 31 | <http://a-little-book-of-r-for-time-series.readthedocs.org/>`_.
32 | 33 | Reading Multivariate Analysis Data into R 34 | ----------------------------------------- 35 | 36 | The first thing that you will want to do to analyse your multivariate data will be to read 37 | it into R, and to plot the data. You can read data into R using the read.table() function. 38 | 39 | For example, the file `http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data 40 | `_ 41 | contains data on concentrations of 13 different chemicals in wines grown in the same region in Italy that are 42 | derived from three different cultivars. 43 | 44 | The data set looks like this: 45 | 46 | .. highlight:: r 47 | 48 | :: 49 | 50 | 1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065 51 | 1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050 52 | 1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185 53 | 1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480 54 | 1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735 55 | ... 56 | 57 | There is one row per wine sample. 58 | The first column contains the cultivar of a wine sample (labelled 1, 2 or 3), and the following thirteen columns 59 | contain the concentrations of the 13 different chemicals in that sample. 60 | The columns are separated by commas. 61 | 62 | When we read the file into R using the read.table() function, we need to use the "sep=" 63 | argument in read.table() to tell it that the columns are separated by commas. 64 | That is, we can read in the file using the read.table() function as follows: 65 | 66 | .. 
highlight:: r 67 | 68 | :: 69 | 70 | > wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", 71 | sep=",") 72 | > wine 73 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 74 | 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.640000 1.040 3.92 1065 75 | 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.380000 1.050 3.40 1050 76 | 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.680000 1.030 3.17 1185 77 | 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.800000 0.860 3.45 1480 78 | 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.320000 1.040 2.93 735 79 | ... 80 | 176 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.200000 0.590 1.56 835 81 | 177 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.300000 0.600 1.62 840 82 | 178 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.200000 0.610 1.60 560 83 | 84 | In this case the data on 178 samples of wine has been read into the variable 'wine'. 85 | 86 | Plotting Multivariate Data 87 | -------------------------- 88 | 89 | Once you have read a multivariate data set into R, the next step is usually to make a plot of the data. 90 | 91 | A Matrix Scatterplot 92 | ^^^^^^^^^^^^^^^^^^^^ 93 | 94 | One common way of plotting multivariate data is to make a "matrix scatterplot", showing each pair of 95 | variables plotted against each other. We can use the "scatterplotMatrix()" function from the "car" 96 | R package to do this. To use this function, we first need to install the "car" R package 97 | (for instructions on how to install an R package, see `How to install an R package 98 | <./installr.html#how-to-install-an-r-package>`_). 99 | 100 | Once you have installed the "car" R package, you can load the "car" R package by typing: 101 | 102 | .. highlight:: r 103 | 104 | :: 105 | 106 | > library("car") 107 | 108 | You can then use the "scatterplotMatrix()" function to plot the multivariate data. 
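If the "car" package is not already installed on your system, it can be fetched from CRAN first. This is a minimal sketch using R's standard install.packages() function (you may be prompted to choose a CRAN mirror the first time):

::

    > install.packages("car")
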
109 | 110 | To use the scatterplotMatrix() function, you need to give it as its input the variables 111 | that you want included in the plot. Say, for example, that we just want to include the 112 | variables corresponding to the concentrations of the first five chemicals. These are stored in 113 | columns 2-6 of the variable "wine". We can extract just these columns from the variable 114 | "wine" by typing: 115 | 116 | :: 117 | 118 | > wine[2:6] 119 | V2 V3 V4 V5 V6 120 | 1 14.23 1.71 2.43 15.6 127 121 | 2 13.20 1.78 2.14 11.2 100 122 | 3 13.16 2.36 2.67 18.6 101 123 | 4 14.37 1.95 2.50 16.8 113 124 | 5 13.24 2.59 2.87 21.0 118 125 | ... 126 | 127 | To make a matrix scatterplot of just these five variables using the scatterplotMatrix() function, we type: 128 | 129 | :: 130 | 131 | > scatterplotMatrix(wine[2:6]) 132 | 133 | 134 | |image1| 135 | 136 | 137 | In this matrix scatterplot, the diagonal cells show histograms of each of the variables, in this 138 | case the concentrations of the first five chemicals (variables V2, V3, V4, V5, V6). 139 | 140 | Each of the off-diagonal cells is a scatterplot of two of the five chemicals, for example, the second cell in the 141 | first row is a scatterplot of V2 (y-axis) against V3 (x-axis). 142 | 143 | A Scatterplot with the Data Points Labelled by their Group 144 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 145 | 146 | If you see an interesting scatterplot for two variables in the matrix scatterplot, you may want to 147 | plot that scatterplot in more detail, with the data points labelled by their group (their cultivar in this case). 148 | 149 | For example, in the matrix scatterplot above, the cell in the third column of the fourth row down is a scatterplot 150 | of V5 (y-axis) against V4 (x-axis). If you look at this scatterplot, it appears that there may be a 151 | positive relationship between V5 and V4.
152 | 153 | We may therefore decide to examine the relationship between V5 and V4 more closely, by plotting a scatterplot 154 | of these two variables, with the data points labelled by their group (their cultivar). To plot a scatterplot 155 | of two variables, we can use the "plot" R function. The V4 and V5 variables are stored in the columns 156 | V4 and V5 of the variable "wine", so can be accessed by typing wine$V4 or wine$V5. Therefore, to plot 157 | the scatterplot, we type: 158 | 159 | :: 160 | 161 | > plot(wine$V4, wine$V5) 162 | 163 | |image2| 164 | 165 | If we want to label the data points by their group (the cultivar of wine here), we can use the "text" function 166 | in R to plot some text beside every data point. In this case, the cultivar of wine is stored in the column 167 | V1 of the variable "wine", so we type: 168 | 169 | :: 170 | 171 | > text(wine$V4, wine$V5, wine$V1, cex=0.7, pos=4, col="red") 172 | 173 | If you look at the help page for the "text" function, you will see that "pos=4" will plot the text just to the 174 | right of the symbol for a data point. The "cex=0.7" option will plot the text at 70% of the default size, and 175 | setting "col" to "red" will plot the text in red. This gives us the following plot: 176 | 177 | |image4| 178 | 179 | We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to have 180 | lower values of V4 compared to the wines of cultivar 1. 181 | 182 | A Profile Plot 183 | ^^^^^^^^^^^^^^ 184 | 185 | Another type of plot that is useful is a "profile plot", which shows the variation in each of the 186 | variables, by plotting the value of each of the variables for each of the samples. 187 | 188 | The function "makeProfilePlot()" below can be used to make a profile plot. This function requires 189 | the "RColorBrewer" library.
To use this function, we first need to install the "RColorBrewer" R package 190 | (for instructions on how to install an R package, see `How to install an R package 191 | <./installr.html#how-to-install-an-r-package>`_). 192 | 193 | :: 194 | 195 | > makeProfilePlot <- function(mylist,names) 196 | { 197 | require(RColorBrewer) 198 | # find out how many variables we want to include 199 | numvariables <- length(mylist) 200 | # choose 'numvariables' colours from the "Set1" palette 201 | colours <- brewer.pal(numvariables,"Set1") 202 | # find out the minimum and maximum values of the variables: 203 | mymin <- 1e+20 204 | mymax <- -1e+20 205 | for (i in 1:numvariables) 206 | { 207 | vectori <- mylist[[i]] 208 | mini <- min(vectori) 209 | maxi <- max(vectori) 210 | if (mini < mymin) { mymin <- mini } 211 | if (maxi > mymax) { mymax <- maxi } 212 | } 213 | # plot the variables 214 | for (i in 1:numvariables) 215 | { 216 | vectori <- mylist[[i]] 217 | namei <- names[i] 218 | colouri <- colours[i] 219 | if (i == 1) { plot(vectori,col=colouri,type="l",ylim=c(mymin,mymax)) } 220 | else { points(vectori, col=colouri,type="l") } 221 | lastxval <- length(vectori) 222 | lastyval <- vectori[length(vectori)] 223 | text((lastxval-10),(lastyval),namei,col="black",cex=0.6) 224 | } 225 | } 226 | 227 | To use this function, you first need to copy and paste it into R. The arguments to the 228 | function are a list variable containing the variables themselves, and 229 | a vector containing the names of the variables that you want to plot.
230 | 231 | For example, to make a profile plot of the concentrations of the first five chemicals in the wine samples 232 | (stored in columns V2, V3, V4, V5, V6 of variable "wine"), we type: 233 | 234 | :: 235 | 236 | > library(RColorBrewer) 237 | > names <- c("V2","V3","V4","V5","V6") 238 | > mylist <- list(wine$V2,wine$V3,wine$V4,wine$V5,wine$V6) 239 | > makeProfilePlot(mylist,names) 240 | 241 | |image5| 242 | 243 | It is clear from the profile plot that the mean and standard deviation for V6 is 244 | quite a lot higher than that for the other variables. 245 | 246 | .. xxx why did they do quite a different profile plot in the assignment answer? I sent a Q to the forum 247 | 248 | Calculating Summary Statistics for Multivariate Data 249 | ---------------------------------------------------- 250 | 251 | Another thing that you are likely to want to do is to calculate summary statistics such as the 252 | mean and standard deviation for each of the variables in your multivariate data set. 253 | 254 | .. sidebar:: sapply 255 | 256 | The "sapply()" function can be used to apply some other function to each column 257 | in a data frame, eg. sapply(mydataframe,sd) will calculate the standard deviation of 258 | each column in a dataframe "mydataframe". 259 | 260 | This is easy to do, using the "mean()" and "sd()" functions in R. For example, say we want 261 | to calculate the mean and standard deviations of each of the 13 chemical concentrations in the 262 | wine samples. These are stored in columns 2-14 of the variable "wine". So we type: 263 | 264 | :: 265 | 266 | > sapply(wine[2:14],mean) 267 | V2 V3 V4 V5 V6 V7 268 | 13.0006180 2.3363483 2.3665169 19.4949438 99.7415730 2.2951124 269 | V8 V9 V10 V11 V12 V13 270 | 2.0292697 0.3618539 1.5908989 5.0580899 0.9574494 2.6116854 271 | V14 272 | 746.8932584 273 | 274 | This tells us that the mean of variable V2 is 13.0006180, the mean of V3 is 2.3363483, and so on. 
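As the sapply() sidebar suggests, the same pattern works with any summary function that accepts a numeric vector. For example, a sketch using the standard median() function to get the median of each of the 13 chemical concentrations (output not shown here):

::

    > sapply(wine[2:14],median)
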
275 | 276 | Similarly, to get the standard deviations of the 13 chemical concentrations, we type: 277 | 278 | :: 279 | 280 | > sapply(wine[2:14],sd) 281 | V2 V3 V4 V5 V6 V7 282 | 0.8118265 1.1171461 0.2743440 3.3395638 14.2824835 0.6258510 283 | V8 V9 V10 V11 V12 V13 284 | 0.9988587 0.1244533 0.5723589 2.3182859 0.2285716 0.7099904 285 | V14 286 | 314.9074743 287 | 288 | We can see here that it would make sense to standardise in order to compare the variables because the variables 289 | have very different standard deviations - the standard deviation of V14 is 314.9074743, while the standard deviation 290 | of V9 is just 0.1244533. Thus, in order to compare the variables, we need to standardise each variable so that 291 | it has a sample variance of 1 and sample mean of 0. We will explain below how to standardise the variables. 292 | 293 | Means and Variances Per Group 294 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 295 | 296 | It is often interesting to calculate the means and standard deviations for just the samples 297 | from a particular group, for example, for the wine samples from each cultivar. The cultivar 298 | is stored in the column "V1" of the variable "wine". 
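Before splitting the data up by cultivar, it can be useful to check how many samples there are from each cultivar. A quick way to do this is with the table() function (introduced earlier), applied to the "V1" column:

::

    > table(wine$V1)

     1  2  3
    59 71 48
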
299 | 300 | To extract out the data for just cultivar 2, we can type: 301 | 302 | :: 303 | 304 | > cultivar2wine <- wine[wine$V1=="2",] 305 | 306 | We can then calculate the mean and standard deviations of the 13 chemicals' concentrations, for 307 | just the cultivar 2 samples: 308 | 309 | :: 310 | 311 | > sapply(cultivar2wine[2:14],mean) 312 | V2 V3 V4 V5 V6 V7 V8 313 | 12.278732 1.932676 2.244789 20.238028 94.549296 2.258873 2.080845 314 | V9 V10 V11 V12 V13 V14 315 | 0.363662 1.630282 3.086620 1.056282 2.785352 519.507042 316 | > sapply(cultivar2wine[2:14],sd) 317 | V2 V3 V4 V5 V6 V7 V8 318 | 0.5379642 1.0155687 0.3154673 3.3497704 16.7534975 0.5453611 0.7057008 319 | V9 V10 V11 V12 V13 V14 320 | 0.1239613 0.6020678 0.9249293 0.2029368 0.4965735 157.2112204 321 | 322 | You can calculate the mean and standard deviation of the 13 chemicals' concentrations for just cultivar 1 samples, 323 | or for just cultivar 3 samples, in a similar way. 324 | 325 | However, for convenience, you might want to use the function "printMeanAndSdByGroup()" below, which 326 | prints out the mean and standard deviation of the variables for each group in your data set: 327 | 328 | :: 329 | 330 | > printMeanAndSdByGroup <- function(variables,groupvariable) 331 | { 332 | # find the names of the variables 333 | variablenames <- c(names(groupvariable),names(as.data.frame(variables))) 334 | # within each group, find the mean of each variable 335 | groupvariable <- groupvariable[,1] # ensures groupvariable is not a list 336 | means <- aggregate(as.matrix(variables) ~ groupvariable, FUN = mean) 337 | names(means) <- variablenames 338 | print(paste("Means:")) 339 | print(means) 340 | # within each group, find the standard deviation of each variable: 341 | sds <- aggregate(as.matrix(variables) ~ groupvariable, FUN = sd) 342 | names(sds) <- variablenames 343 | print(paste("Standard deviations:")) 344 | print(sds) 345 | # within each group, find the number of samples: 346 | samplesizes <-
aggregate(as.matrix(variables) ~ groupvariable, FUN = length) 347 | names(samplesizes) <- variablenames 348 | print(paste("Sample sizes:")) 349 | print(samplesizes) 350 | } 351 | 352 | To use the function "printMeanAndSdByGroup()", you first need to copy and paste it into R. The 353 | arguments of the function are the variables that you want to calculate means and standard deviations for, 354 | and the variable containing the group of each sample. For example, to calculate the mean and standard deviation 355 | for each of the 13 chemical concentrations, for each of the three different wine cultivars, we type: 356 | 357 | :: 358 | 359 | > printMeanAndSdByGroup(wine[2:14],wine[1]) 360 | [1] "Means:" 361 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 362 | 1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119 363 | 2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070 364 | 3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958 365 | [1] "Standard deviations:" 366 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 367 | 1 1 0.4621254 0.6885489 0.2271660 2.546322 10.49895 0.3389614 0.3974936 0.07004924 0.4121092 1.2385728 0.1164826 0.3570766 221.5208 368 | 2 2 0.5379642 1.0155687 0.3154673 3.349770 16.75350 0.5453611 0.7057008 0.12396128 0.6020678 0.9249293 0.2029368 0.4965735 157.2112 369 | 3 3 0.5302413 1.0879057 0.1846902 2.258161 10.89047 0.3569709 0.2935041 0.12413959 0.4088359 2.3109421 0.1144411 0.2721114 115.0970 370 | [1] "Sample sizes:" 371 | V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 372 | 1 1 59 59 59 59 59 59 59 59 59 59 59 59 59 373 | 2 2 71 71 71 71 71 71 71 71 71 71 71 71 71 374 | 3 3 48 48 48 48 48 48 48 48 48 48 48 48 48 375 | 376 | The function "printMeanAndSdByGroup()" also prints out the number of samples in each group. 
In this case,
we see that there are 59 samples of cultivar 1, 71 of cultivar 2, and 48 of cultivar 3.

Between-groups Variance and Within-groups Variance for a Variable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we want to calculate the within-groups variance for a particular variable (for example, for a particular
chemical's concentration), we can use the function "calcWithinGroupsVariance()" below:

::

    > calcWithinGroupsVariance <- function(variable,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # get the mean and standard deviation for each group:
         numtotal <- 0
         denomtotal <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata <- variable[groupvariable==leveli,]
            levelilength <- length(levelidata)
            # get the standard deviation for group i:
            sdi <- sd(levelidata)
            numi <- (levelilength - 1)*(sdi * sdi)
            denomi <- levelilength
            numtotal <- numtotal + numi
            denomtotal <- denomtotal + denomi
         }
         # calculate the within-groups variance
         Vw <- numtotal / (denomtotal - numlevels)
         return(Vw)
      }

.. Checked that this formula is correct.

You will need to copy and paste this function into R before you can use it.
For example, to calculate the within-groups variance of the variable V2 (the concentration of the first chemical),
we type:

::

    > calcWithinGroupsVariance(wine[2],wine[1])
    [1] 0.2620525

Thus, the within-groups variance for V2 is 0.2620525.

We can calculate the between-groups variance for a particular variable (e.g. V2) using the function
"calcBetweenGroupsVariance()" below:

::

    > calcBetweenGroupsVariance <- function(variable,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # calculate the overall grand mean:
         grandmean <- mean(variable)
         # get the mean and standard deviation for each group:
         numtotal <- 0
         denomtotal <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata <- variable[groupvariable==leveli,]
            levelilength <- length(levelidata)
            # get the mean and standard deviation for group i:
            meani <- mean(levelidata)
            sdi <- sd(levelidata)
            numi <- levelilength * ((meani - grandmean)^2)
            denomi <- levelilength
            numtotal <- numtotal + numi
            denomtotal <- denomtotal + denomi
         }
         # calculate the between-groups variance
         Vb <- numtotal / (numlevels - 1)
         Vb <- Vb[[1]]
         return(Vb)
      }

.. In the OU book, I think that they have the wrong formula - had N-G as denominator, I sent an email to the forum xxx

.. Note the between-groups-variance*(G-1) + within-groups-variance*(N-G) should be equal to TotalSS
.. calcTotalSS <- function(variable)
.. {
..    variable <- variable[[1]]
..    variablelen <- length(variable)
..    print(paste("variablelen=",variablelen))
..    grandmean <- mean(variable)
..    print(paste("grandmean=",grandmean))
..    totalss <- 0
..    for (i in 1:variablelen)
..    {
..       totalss <- totalss + ((variable[i] - grandmean)*(variable[i] - grandmean))
..    }
..    return(totalss)
.. }

Once you have copied and pasted this function into R, you can use it to calculate the between-groups
variance for a variable such as V2:

::

    > calcBetweenGroupsVariance(wine[2],wine[1])
    [1] 35.39742

Thus, the between-groups variance of V2 is 35.39742.

We can calculate the "separation" achieved by a variable as its between-groups variance divided by its
within-groups variance. Thus, the separation achieved by V2 is calculated as:

::

    > 35.39742/0.2620525
    [1] 135.0776

.. Note I think we can also get the within-groups and between-groups variance from the output of ANOVA:
..
.. summary(aov(wine[,2]~as.factor(wine[,1])))
..                       Df Sum Sq Mean Sq F value    Pr(>F)
.. as.factor(wine[, 1])   2 70.795  35.397  135.08 < 2.2e-16 ***
.. Residuals            175 45.859   0.262
.. ---
.. Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
..
.. Here the within-groups variance is 0.262 (called the mean square of residuals)
.. and the between-groups variance is 35.397. The ratio is 135.08 (the F statistic), which
.. is the same as the separation that I calculate (see above).
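The within-groups and between-groups variances can also be read off a standard one-way ANOVA table: they are the "Mean Sq" values for the group factor and for the residuals, and their ratio is the F statistic. A minimal sketch of this cross-check, using the built-in "iris" data set rather than the wine data so that it can be run on its own:

```r
# One-way ANOVA of one variable (Sepal.Length) against a group factor (Species).
fit <- aov(Sepal.Length ~ Species, data = iris)
anovatable <- summary(fit)[[1]]
meansq <- anovatable[["Mean Sq"]]
Vb <- meansq[1]   # between-groups variance (mean square for Species)
Vw <- meansq[2]   # within-groups variance (mean square of residuals)
separation <- Vb / Vw
# The separation equals the ANOVA F statistic:
all.equal(separation, anovatable[["F value"]][1])
```

This is the same quantity that "calcWithinGroupsVariance()" and "calcBetweenGroupsVariance()" compute by hand above, obtained from a single call to "aov()".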
If you want to calculate the separations achieved by all of the variables in a multivariate data set,
you can use the function "calcSeparations()" below:

::

    > calcSeparations <- function(variables,groupvariable)
      {
         # find out how many variables we have
         variables <- as.data.frame(variables)
         numvariables <- length(variables)
         # find the variable names
         variablenames <- colnames(variables)
         # calculate the separation for each variable
         for (i in 1:numvariables)
         {
            variablei <- variables[i]
            variablename <- variablenames[i]
            Vw <- calcWithinGroupsVariance(variablei, groupvariable)
            Vb <- calcBetweenGroupsVariance(variablei, groupvariable)
            sep <- Vb/Vw
            print(paste("variable",variablename,"Vw=",Vw,"Vb=",Vb,"separation=",sep))
         }
      }

.. I checked the formula and it is fine.

For example, to calculate the separations for each of the 13 chemical concentrations, we type:

::

    > calcSeparations(wine[2:14],wine[1])
    [1] "variable V2 Vw= 0.262052469153907 Vb= 35.3974249602692 separation= 135.0776242428"
    [1] "variable V3 Vw= 0.887546796746581 Vb= 32.7890184869213 separation= 36.9434249631837"
    [1] "variable V4 Vw= 0.0660721013425184 Vb= 0.879611357248741 separation= 13.312901199991"
    [1] "variable V5 Vw= 8.00681118121156 Vb= 286.41674636309 separation= 35.7716374073093"
    [1] "variable V6 Vw= 180.65777316441 Vb= 2245.50102788939 separation= 12.4295843381499"
    [1] "variable V7 Vw= 0.191270475224227 Vb= 17.9283572942847 separation= 93.7330096203673"
    [1] "variable V8 Vw= 0.274707514337437 Vb= 64.2611950235641 separation= 233.925872681549"
    [1] "variable V9 Vw= 0.0119117022132797 Vb= 0.328470157461624 separation= 27.5754171469659"
    [1] "variable V10 Vw= 0.246172943795542 Vb= 7.45199550777775 separation= 30.2713831702276"
    [1] "variable V11 Vw= 2.28492308133354 Vb= 275.708000822304 separation= 120.664018441003"
    [1] "variable V12 Vw= 0.0244876469432414 Vb= 2.48100991493829 separation= 101.3167953903"
    [1] "variable V13 Vw= 0.160778729560982 Vb= 30.5435083544253 separation= 189.972320578889"
    [1] "variable V14 Vw= 29707.6818705169 Vb= 6176832.32228483 separation= 207.920373902178"

Thus, the individual variable which gives the greatest separation between the groups (the wine cultivars) is
V8 (separation 233.9). As we will discuss below, the purpose of linear discriminant analysis (LDA) is to find the
linear combination of the individual variables that will give the greatest separation between the groups (cultivars here).
This hopefully will give a better separation than the best separation achievable by any individual variable (233.9
for V8 here).

Between-groups Covariance and Within-groups Covariance for Two Variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a multivariate data set with several variables describing sampling units from different groups,
such as the wine samples from different cultivars, it is often of interest to calculate the within-groups
covariance and between-groups covariance for pairs of the variables.
This can be done using the following functions, which you will need to copy and paste into R to use them:

::

    > calcWithinGroupsCovariance <- function(variable1,variable2,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # get the covariance of variable 1 and variable 2 for each group:
         Covw <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata1 <- variable1[groupvariable==leveli,]
            levelidata2 <- variable2[groupvariable==leveli,]
            mean1 <- mean(levelidata1)
            mean2 <- mean(levelidata2)
            levelilength <- length(levelidata1)
            # get the covariance for this group:
            term1 <- 0
            for (j in 1:levelilength)
            {
               term1 <- term1 + ((levelidata1[j] - mean1)*(levelidata2[j] - mean2))
            }
            Cov_groupi <- term1 # covariance for this group
            Covw <- Covw + Cov_groupi
         }
         totallength <- nrow(variable1)
         Covw <- Covw / (totallength - numlevels)
         return(Covw)
      }

.. Checked this works fine.
.. Agrees with formula from Krzanowski's 'Principles of Multivariate Analysis' pages 294-295:
.. Covw = (1/(N-G)) Sum(from g=1 to G) [ Sum(over i) { (x_ig - x_hat_g)*(y_ig - y_hat_g) } ]

For example, to calculate the within-groups covariance for variables V8 and V11, we type:

::

    > calcWithinGroupsCovariance(wine[8],wine[11],wine[1])
    [1] 0.2866783

Similarly, we can calculate the between-groups covariance for a pair of variables using the function
"calcBetweenGroupsCovariance()" below:

::

    > calcBetweenGroupsCovariance <- function(variable1,variable2,groupvariable)
      {
         # find out how many values the group variable can take
         groupvariable2 <- as.factor(groupvariable[[1]])
         levels <- levels(groupvariable2)
         numlevels <- length(levels)
         # calculate the grand means
         variable1mean <- mean(variable1)
         variable2mean <- mean(variable2)
         # calculate the between-groups covariance
         Covb <- 0
         for (i in 1:numlevels)
         {
            leveli <- levels[i]
            levelidata1 <- variable1[groupvariable==leveli,]
            levelidata2 <- variable2[groupvariable==leveli,]
            mean1 <- mean(levelidata1)
            mean2 <- mean(levelidata2)
            levelilength <- length(levelidata1)
            term1 <- (mean1 - variable1mean)*(mean2 - variable2mean)*(levelilength)
            Covb <- Covb + term1
         }
         Covb <- Covb / (numlevels - 1)
         Covb <- Covb[[1]]
         return(Covb)
      }

.. Formula from Krzanowski's 'Principles of Multivariate Analysis' pages 294-295
.. Covb = (1/(G-1)) * Sum(from g=1 to G) [ Sum(over i) { (n_g) * (x_hat_g - x_hat) * (y_hat_g - y_hat) } ]
.. xxx Note it doesn't give me the answer given for Q3(a)(ii) of assignment - put Q on forum

For example, to calculate the between-groups covariance for variables V8 and V11, we type:

::

    > calcBetweenGroupsCovariance(wine[8],wine[11],wine[1])
    [1] -60.41077

Thus, for V8 and V11, the between-groups covariance is -60.41 and the within-groups covariance is 0.29.
Since the within-groups covariance is positive (0.29), V8 and V11 are positively related within groups:
for individuals from the same group, individuals with a high value of V8 tend to have a high value of V11,
and vice versa. Since the between-groups covariance is negative (-60.41), V8 and V11 are negatively related between groups:
groups with a high mean value of V8 tend to have a low mean value of V11, and vice versa.

Calculating Correlations for Multivariate Data
----------------------------------------------

It is often of interest to investigate whether any of the variables in a multivariate data set are
significantly correlated.

To calculate the linear (Pearson) correlation coefficient for a pair of variables, you can use
the "cor.test()" function in R. For example, to calculate the correlation coefficient for the first
two chemicals' concentrations, V2 and V3, we type:

::

    > cor.test(wine$V2, wine$V3)
      Pearson's product-moment correlation
      data:  wine$V2 and wine$V3
      t = 1.2579, df = 176, p-value = 0.2101
      alternative hypothesis: true correlation is not equal to 0
      95 percent confidence interval:
       -0.05342959  0.23817474
      sample estimates:
            cor
      0.09439694

This tells us that the correlation coefficient is about 0.094, which is a very weak correlation.
Furthermore, the P-value for the statistical test of whether the correlation coefficient is
significantly different from zero is 0.21. This is much greater than 0.05 (which we can use here
as a cutoff for statistical significance), so there is very weak evidence that the correlation is non-zero.
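As a sketch of what "cor.test()" is reporting: the estimate is simply "cor(x, y)", and for the Pearson test the t statistic is r*sqrt(n-2)/sqrt(1-r^2) on n-2 degrees of freedom. This can be verified on the built-in "iris" data set (used here so the example runs on its own):

```r
# Reproduce the "cor" estimate and t statistic that cor.test() reports.
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)
r <- cor(x, y)                             # Pearson correlation coefficient
tstat <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic with n-2 degrees of freedom
test <- cor.test(x, y)
all.equal(unname(test$estimate), r)        # TRUE
all.equal(unname(test$statistic), tstat)   # TRUE
```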
If you have a lot of variables, you can use "cor.test()" to calculate the correlation coefficient
for each pair of variables, but you might just be interested in finding out which pairs of variables
are most highly correlated. For this you can use the function "mosthighlycorrelated()" below.

The function "mosthighlycorrelated()" will print out the linear correlation coefficients for
each pair of variables in your data set, in order of the correlation coefficient. This lets you see
very easily which pairs of variables are most highly correlated.

::

    > mosthighlycorrelated <- function(mydataframe,numtoreport)
      {
         # find the correlations
         cormatrix <- cor(mydataframe)
         # set the correlations on the diagonal or lower triangle to zero,
         # so they will not be reported as the highest ones:
         diag(cormatrix) <- 0
         cormatrix[lower.tri(cormatrix)] <- 0
         # flatten the matrix into a dataframe for easy sorting
         fm <- as.data.frame(as.table(cormatrix))
         # assign human-friendly names
         names(fm) <- c("First.Variable", "Second.Variable","Correlation")
         # sort and print the top n correlations
         head(fm[order(abs(fm$Correlation),decreasing=TRUE),],n=numtoreport)
      }

To use this function, you will first have to copy and paste it into R. The arguments of the function
are the variables that you want to calculate the correlations for, and the number of top correlation
coefficients to print out (for example, you can tell it to print out the largest ten correlation coefficients, or
the largest 20).
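The same idea can be sanity-checked step by step on a built-in data set such as "mtcars" (chosen here only so the example is self-contained):

```r
# The core of mosthighlycorrelated(): flatten the upper triangle of the
# correlation matrix and sort by absolute correlation.
cormatrix <- cor(mtcars)
diag(cormatrix) <- 0                    # ignore self-correlations
cormatrix[lower.tri(cormatrix)] <- 0    # keep each pair of variables once
fm <- as.data.frame(as.table(cormatrix))
names(fm) <- c("First.Variable", "Second.Variable", "Correlation")
top <- head(fm[order(abs(fm$Correlation), decreasing=TRUE), ], n=3)
top   # the three most strongly correlated pairs of variables in mtcars
```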
For example, to calculate correlation coefficients between the concentrations of the 13 chemicals
in the wine samples, and to print out the top 10 pairwise correlation coefficients, you can type:

::

    > mosthighlycorrelated(wine[2:14], 10)
        First.Variable Second.Variable Correlation
    84              V7              V8   0.8645635
    150             V8             V13   0.7871939
    149             V7             V13   0.6999494
    111             V8             V10   0.6526918
    157             V2             V14   0.6437200
    110             V7             V10   0.6124131
    154            V12             V13   0.5654683
    132             V3             V12  -0.5612957
    118             V2             V11   0.5463642
    137             V8             V12   0.5434786

This tells us that the pair of variables with the highest linear correlation coefficient is
V7 and V8 (correlation = 0.86 approximately).

Standardising Variables
-----------------------

If you want to compare different variables that have different units or very different variances,
it is a good idea to first standardise the variables.

For example, we found above that the concentrations of the 13 chemicals in the wine samples show a wide range of
standard deviations, from 0.1244533 for V9 (variance 0.01548862) to 314.9074743 for V14 (variance 99166.72).
This is a range of approximately 6,402,554-fold in the variances.

As a result, it is not a good idea to use the unstandardised chemical concentrations as the input for a
principal component analysis (PCA, see below) of the wine samples: if you did that, the first principal
component would be dominated by the variables which show the largest variances, such as V14.

Thus, it would be a better idea to first standardise the variables so that they all have variance 1 and mean 0,
and to then carry out the principal component analysis on the standardised data. This would allow us to
find the principal components that provide the best low-dimensional representation of the variation in the
original data, without being overly biased by those variables that show the most variance in the original data.

You can standardise variables in R using the "scale()" function.

For example, to standardise the concentrations of the 13 chemicals in the wine samples, we type:

::

    > standardisedconcentrations <- as.data.frame(scale(wine[2:14]))

Note that we use the "as.data.frame()" function to convert the output of "scale()" into a
"data frame", which is the same type of R variable as the "wine" variable.

We can check that each of the standardised variables stored in "standardisedconcentrations"
has a mean of 0 and a standard deviation of 1 by typing:

::

    > sapply(standardisedconcentrations,mean)
               V2            V3            V4            V5            V6            V7
    -8.591766e-16 -6.776446e-17  8.045176e-16 -7.720494e-17 -4.073935e-17 -1.395560e-17
               V8            V9           V10           V11           V12           V13
     6.958263e-17 -1.042186e-16 -1.221369e-16  3.649376e-17  2.093741e-16  3.003459e-16
              V14
    -1.034429e-16
    > sapply(standardisedconcentrations,sd)
     V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14
      1   1   1   1   1   1   1   1   1   1   1   1   1

We see that the means of the standardised variables are all very tiny numbers and so are
essentially equal to 0, and the standard deviations of the standardised variables are all equal to 1.

Principal Component Analysis
----------------------------

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a
multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing
wine samples from three different cultivars. We can carry out a principal component analysis to investigate
whether we can capture most of the variation between samples using a smaller number of new variables (principal
components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.

To carry out a principal component analysis (PCA) on a multivariate data set, the first step is often to standardise
the variables under study using the "scale()" function (see above). This is necessary if the input variables
have very different variances, which is true in this case, as the concentrations of the 13 chemicals have
very different variances (see above).

Once you have standardised your variables, you can carry out a principal component analysis using the "prcomp()"
function in R.

For example, to standardise the concentrations of the 13 chemicals in the wine samples, and carry out a
principal components analysis on the standardised concentrations, we type:

::

    > standardisedconcentrations <- as.data.frame(scale(wine[2:14])) # standardise the variables
    > wine.pca <- prcomp(standardisedconcentrations)                 # do a PCA

You can get a summary of the principal component analysis results using the "summary()" function on the
output of "prcomp()":

::

    > summary(wine.pca)
    Importance of components:
                             PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10
    Standard deviation     2.169 1.580 1.203 0.9586 0.9237 0.8010 0.7423 0.5903 0.5375 0.5009
    Proportion of Variance 0.362 0.192 0.111 0.0707 0.0656 0.0494 0.0424 0.0268 0.0222 0.0193
    Cumulative Proportion  0.362 0.554 0.665 0.7360 0.8016 0.8510 0.8934 0.9202 0.9424 0.9617
                             PC11   PC12    PC13
    Standard deviation     0.4752 0.4108 0.32152
    Proportion of Variance 0.0174 0.0130 0.00795
    Cumulative Proportion  0.9791 0.9920 1.00000

This gives us the standard deviation of each component, and the proportion of variance explained by
each component. The standard deviation of the components is stored in a named element called "sdev" of the output
variable made by "prcomp()":

::

    > wine.pca$sdev
     [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128 0.5903367
     [9] 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244

The total variance explained by the components is the sum of the variances of the components:

::

    > sum((wine.pca$sdev)^2)
    [1] 13

In this case, we see that the total variance is 13, which is equal to the number of standardised variables (13 variables).
This is because for standardised data, the variance of each standardised variable is 1. The total variance is equal to the sum
of the variances of the individual variables, and since the variance of each standardised variable is 1, the
total variance should be equal to the number of variables (13 here).

Deciding How Many Principal Components to Retain
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to decide how many principal components should be retained,
it is common to summarise the results of a principal components analysis by making a scree plot, which we
can do in R using the "screeplot()" function:

::

    > screeplot(wine.pca, type="lines")

|image6|

The most obvious change in slope in the scree plot occurs at component 4, which is the "elbow" of the
scree plot. Therefore, it could be argued on the basis of the scree plot that the first three
components should be retained.
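The total-variance identity noted above (for standardised data, the variances of the components sum to the number of variables) holds for any data set. A minimal sketch using the four numeric columns of the built-in "iris" data set:

```r
# For standardised data each variable has variance 1, so the variances of
# the principal components must add up to the number of variables.
standardised <- scale(iris[1:4])    # 4 standardised variables
pca <- prcomp(standardised)
totalvariance <- sum(pca$sdev^2)
all.equal(totalvariance, 4)         # TRUE: equals the number of variables
```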
Another way of deciding how many components to retain is to use Kaiser's criterion:
that we should only retain principal components for which the variance is above 1 (when principal
component analysis was applied to standardised data). We can check this by finding the variance of each
of the principal components:

::

    > (wine.pca$sdev)^2
     [1] 4.7058503 2.4969737 1.4460720 0.9189739 0.8532282 0.6416570 0.5510283 0.3484974
     [9] 0.2888799 0.2509025 0.2257886 0.1687702 0.1033779

We see that the variance is above 1 for principal components 1, 2, and 3 (which have variances
4.71, 2.50, and 1.45, respectively). Therefore, using Kaiser's criterion, we would retain the first
three principal components.

A third way to decide how many principal components to retain is to keep the number of
components required to explain at least some minimum amount of the total variance. For example, if
it is important to explain at least 80% of the variance, we would retain the first five principal components,
as we can see from the output of "summary(wine.pca)" that the first five principal components
explain 80.2% of the variance (while the first four components explain just 73.6%, so are not sufficient).

Loadings for the Principal Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The loadings for the principal components are stored in a named element "rotation" of the variable
returned by "prcomp()". This contains a matrix with the loadings of each principal component, where
the first column in the matrix contains the loadings for the first principal component, the second
column contains the loadings for the second principal component, and so on.
Therefore, to obtain the loadings for the first principal component in our
analysis of the 13 chemical concentrations in wine samples, we type:

::

    > wine.pca$rotation[,1]
              V2           V3           V4           V5           V6           V7
    -0.144329395  0.245187580  0.002051061  0.239320405 -0.141992042 -0.394660845
              V8           V9          V10          V11          V12          V13
    -0.422934297  0.298533103 -0.313429488  0.088616705 -0.296714564 -0.376167411
             V14
    -0.286752227

This means that the first principal component is a linear combination of the variables:
-0.144*Z2 + 0.245*Z3 + 0.002*Z4 + 0.239*Z5 - 0.142*Z6 - 0.395*Z7 - 0.423*Z8 + 0.299*Z9
- 0.313*Z10 + 0.089*Z11 - 0.297*Z12 - 0.376*Z13 - 0.287*Z14, where Z2, Z3, Z4...Z14 are
the standardised versions of the variables V2, V3, V4...V14 (that each
have mean of 0 and variance of 1).

Note that the squares of the loadings sum to 1, as this is a constraint used in calculating the loadings:

::

    > sum((wine.pca$rotation[,1])^2)
    [1] 1

To calculate the values of the first principal component, we can define our own function to calculate
a principal component given the loadings and the input variables' values:

::

    > calcpc <- function(variables,loadings)
      {
         # make sure the input is a data frame
         variables <- as.data.frame(variables)
         # find the number of samples in the data set
         numsamples <- nrow(variables)
         # make a vector to store the component
         pc <- numeric(numsamples)
         # find the number of variables
         numvariables <- length(variables)
         # calculate the value of the component for each sample
         for (i in 1:numsamples)
         {
            valuei <- 0
            for (j in 1:numvariables)
            {
               valueij <- variables[i,j]
               loadingj <- loadings[j]
               valuei <- valuei + (valueij * loadingj)
            }
            pc[i] <- valuei
         }
         return(pc)
      }

We can then use the function to calculate the values of the first principal component for each sample in our
wine data:

::

    > calcpc(standardisedconcentrations, wine.pca$rotation[,1])
      [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
      [8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
     [15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
     [22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
     [29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
     [36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
    ...

In fact, the values of the first principal component are stored in the variable wine.pca$x[,1]
that was returned by the "prcomp()" function, so we can compare those values to the ones that we
calculated, and they should agree:

::

    > wine.pca$x[,1]
      [1] -3.30742097 -2.20324981 -2.50966069 -3.74649719 -1.00607049 -3.04167373 -2.44220051
      [8] -2.05364379 -2.50381135 -2.74588238 -3.46994837 -1.74981688 -2.10751729 -3.44842921
     [15] -4.30065228 -2.29870383 -2.16584568 -1.89362947 -3.53202167 -2.07865856 -3.11561376
     [22] -1.08351361 -2.52809263 -1.64036108 -1.75662066 -0.98729406 -1.77028387 -1.23194878
     [29] -2.18225047 -2.24976267 -2.49318704 -2.66987964 -1.62399801 -1.89733870 -1.40642118
     [36] -1.89847087 -1.38096669 -1.11905070 -1.49796891 -2.52268490 -2.58081526 -0.66660159
    ...

We see that they do agree.

The first principal component has the highest loadings (in absolute value) for V8 (-0.423), V7 (-0.395), V13 (-0.376),
V10 (-0.313), V12 (-0.297), V14 (-0.287), V9 (0.299), V3 (0.245), and V5 (0.239). The loadings for V8, V7, V13,
V10, V12 and V14 are negative, while those for V9, V3, and V5 are positive. Therefore, an interpretation of the
first principal component is that it represents a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
and the concentrations of V9, V3 and V5.

Similarly, we can obtain the loadings for the second principal component by typing:

::

    > wine.pca$rotation[,2]
              V2           V3           V4           V5           V6           V7
     0.483651548  0.224930935  0.316068814 -0.010590502  0.299634003  0.065039512
              V8           V9          V10          V11          V12          V13
    -0.003359812  0.028779488  0.039301722  0.529995672 -0.279235148 -0.164496193
             V14
     0.364902832

This means that the second principal component is a linear combination of the variables:
0.484*Z2 + 0.225*Z3 + 0.316*Z4 - 0.011*Z5 + 0.300*Z6 + 0.065*Z7 - 0.003*Z8 + 0.029*Z9
+ 0.039*Z10 + 0.530*Z11 - 0.279*Z12 - 0.164*Z13 + 0.365*Z14, where Z2, Z3...Z14
are the standardised versions of variables V2, V3, ... V14 that each have mean 0 and variance 1.

Note that the squares of the loadings sum to 1, as above:

::

    > sum((wine.pca$rotation[,2])^2)
    [1] 1

The second principal component has the highest loadings for V11 (0.530), V2 (0.484), V14 (0.365), V4 (0.316),
V6 (0.300), V12 (-0.279), and V3 (0.225). The loadings for V11, V2, V14, V4, V6 and V3 are positive, while
the loading for V12 is negative. Therefore, an interpretation of the second principal component is that
it represents a contrast between the concentrations of V11, V2, V14, V4, V6 and V3, and the concentration of
V12. Note that the loadings for V11 (0.530) and V2 (0.484) are the largest, so the contrast is mainly between
the concentrations of V11 and V2, and the concentration of V12.
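The element-by-element loop in the "calcpc()" function above is equivalent to a single matrix multiplication of the standardised data by the loadings, which is essentially how "prcomp()" computes the scores itself. A minimal self-contained sketch, using the built-in "iris" data set rather than the wine data:

```r
# The principal component scores are just (standardised data) %*% (loadings).
standardised <- scale(iris[1:4])
pca <- prcomp(standardised)
pc1 <- standardised %*% pca$rotation[,1]       # first PC via matrix multiplication
all.equal(as.vector(pc1), unname(pca$x[,1]))   # TRUE: matches prcomp()'s scores
```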
Scatterplots of the Principal Components
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The values of the principal components are stored in a named element "x" of the variable returned by
"prcomp()". This contains a matrix with the principal components, where the first column in the matrix
contains the first principal component, the second column the second component, and so on.

Thus, in our example, "wine.pca$x[,1]" contains the first principal component, and
"wine.pca$x[,2]" contains the second principal component.

We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine
samples come from, by typing:

::

    > plot(wine.pca$x[,1],wine.pca$x[,2]) # make a scatterplot
    > text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red") # add labels

|image7|

The scatterplot shows the first principal component on the x-axis, and the second principal
component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1
have much lower values of the first principal component than wine samples of cultivar 3.
Therefore, the first principal component separates wine samples of cultivar 1 from those
of cultivar 3.

We can also see that wine samples of cultivar 2 have much higher values of the second
principal component than wine samples of cultivars 1 and 3. Therefore, the second principal
component separates samples of cultivar 2 from samples of cultivars 1 and 3.

Therefore, the first two principal components are reasonably useful for distinguishing wine
samples of the three different cultivars.
Above, we interpreted the first principal component as a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
and the concentrations of V9, V3 and V5. We can check whether this makes sense in terms of the
concentrations of these chemicals in the different cultivars, by printing out the means of the
standardised concentration variables in each cultivar, using the "printMeanAndSdByGroup()" function (see above):

::

    > printMeanAndSdByGroup(standardisedconcentrations,wine[1])
    [1] "Means:"
      V1         V2         V3         V4         V5          V6          V7          V8          V9        V10        V11        V12        V13        V14
    1  1  0.9166093 -0.2915199  0.3246886 -0.7359212  0.46192317  0.87090552  0.95419225 -0.57735640  0.5388633  0.2028288  0.4575567  0.7691811  1.1711967
    2  2 -0.8892116 -0.3613424 -0.4437061  0.2225094 -0.36354162 -0.05790375  0.05163434  0.01452785  0.0688079 -0.8503999  0.4323908  0.2446043 -0.7220731
    3  3  0.1886265  0.8928122  0.2572190  0.5754413 -0.03004191 -0.98483874 -1.24923710  0.68817813 -0.7641311  1.0085728 -1.2019916 -1.3072623 -0.3715295

Does it make sense that the first principal component can separate cultivar 1 from cultivar 3?
In cultivar 1, the mean values of V8 (0.954), V7 (0.871), V13 (0.769), V10 (0.539), V12 (0.458) and V14 (1.171)
are very high compared to the mean values of V9 (-0.577), V3 (-0.292) and V5 (-0.736).
In cultivar 3, the mean values of V8 (-1.249), V7 (-0.985), V13 (-1.307), V10 (-0.764), V12 (-1.202) and V14 (-0.372)
are very low compared to the mean values of V9 (0.688), V3 (0.893) and V5 (0.575).
Therefore, it does make sense that principal component 1 is a contrast between the concentrations of V8, V7, V13, V10, V12, and V14,
and the concentrations of V9, V3 and V5; and that principal component 1 can separate cultivar 1 from cultivar 3.
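The per-group means printed by "printMeanAndSdByGroup()" can also be obtained with base R's "aggregate()" function. A minimal sketch on the built-in "iris" data set (used here instead of the wine data so the example is self-contained):

```r
# Per-group means of standardised variables, using aggregate() from base R.
standardised <- as.data.frame(scale(iris[1:4]))
groupmeans <- aggregate(standardised, by=list(Species=iris$Species), FUN=mean)
groupmeans   # one row of standardised group means per species
```

Because the three iris species have equal sample sizes and the variables were standardised to mean 0, the group means of each variable average out to 0 across the groups.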
Above, we interpreted the second principal component as a contrast between the concentrations of V11,
V2, V14, V4, V6 and V3, and the concentration of V12.
In the light of the mean values of these variables in the different cultivars, does
it make sense that the second principal component can separate cultivar 2 from cultivars 1 and 3?
In cultivar 1, the mean values of V11 (0.203), V2 (0.917), V14 (1.171), V4 (0.325), V6 (0.462) and V3 (-0.292)
are not very different from the mean value of V12 (0.458).
In cultivar 3, the mean values of V11 (1.009), V2 (0.189), V14 (-0.372), V4 (0.257), V6 (-0.030) and V3 (0.893)
are also not very different from the mean value of V12 (-1.202).
In contrast, in cultivar 2, the mean values of V11 (-0.850), V2 (-0.889), V14 (-0.722), V4 (-0.444), V6 (-0.364) and V3 (-0.361)
are much less than the mean value of V12 (0.432).
Therefore, it makes sense that principal component 2 is a contrast between the concentrations of V11,
V2, V14, V4, V6 and V3, and the concentration of V12; and that principal component 2 can separate cultivar 2 from cultivars 1 and 3.

Linear Discriminant Analysis
----------------------------

The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a
multivariate data set. For example, in the wine data set, we have 13 chemical concentrations describing wine samples from three cultivars.
By carrying out a principal component analysis, we found that most of the variation in the chemical concentrations
between the samples can be captured using the first two principal components,
where each of the principal components is a particular linear combination of the 13 chemical concentrations.
The purpose of linear discriminant analysis (LDA) is to find the linear combinations of the original variables (the 13
chemical concentrations here) that give the best possible separation between the groups (wine cultivars here) in our
data set. Linear discriminant analysis is also known as "canonical discriminant analysis", or simply "discriminant analysis".

If we want to separate the wines by cultivar, the wines come from three different cultivars, so the number of groups (G) is 3,
and the number of variables is 13 (13 chemicals' concentrations; p = 13). The maximum number of useful discriminant
functions that can separate the wines by cultivar is the minimum of G-1 and p, and so in this case it is the minimum of 2 and 13,
which is 2. Thus, we can find at most 2 useful discriminant functions to separate the wines by cultivar, using the
13 chemical concentration variables.

You can carry out a linear discriminant analysis using the "lda()" function from the R "MASS" package.
To use this function, we first need to install the "MASS" R package
(for instructions on how to install an R package, see `How to install an R package
<./installr.html#how-to-install-an-r-package>`_).
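The min(G-1, p) rule can be seen in a self-contained sketch using "MASS" and the built-in "iris" data set (an illustration, not the wine data): with three groups and four variables, "lda()" returns min(3-1, 4) = 2 discriminant functions.

```r
library(MASS)   # provides the lda() function

# Three species (G = 3) and four measurements (p = 4), so lda() finds
# min(G-1, p) = 2 discriminant functions.
iris.lda <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris)
ncol(iris.lda$scaling)   # 2: one column of loadings per discriminant function
```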
1112 | 1113 | For example, to carry out a linear discriminant analysis using the 13 chemical concentrations in the wine samples, we type: 1114 | 1115 | :: 1116 | 1117 | > library("MASS") # load the MASS package 1118 | > wine.lda <- lda(wine$V1 ~ wine$V2 + wine$V3 + wine$V4 + wine$V5 + wine$V6 + wine$V7 + 1119 | wine$V8 + wine$V9 + wine$V10 + wine$V11 + wine$V12 + wine$V13 + 1120 | wine$V14) 1121 | 1122 | Loadings for the Discriminant Functions 1123 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1124 | 1125 | To get the values of the loadings of the discriminant functions for the wine data, we can type: 1126 | 1127 | :: 1128 | 1129 | > wine.lda 1130 | Coefficients of linear discriminants: 1131 | LD1 LD2 1132 | wine$V2 -0.403399781 0.8717930699 1133 | wine$V3 0.165254596 0.3053797325 1134 | wine$V4 -0.369075256 2.3458497486 1135 | wine$V5 0.154797889 -0.1463807654 1136 | wine$V6 -0.002163496 -0.0004627565 1137 | wine$V7 0.618052068 -0.0322128171 1138 | wine$V8 -1.661191235 -0.4919980543 1139 | wine$V9 -1.495818440 -1.6309537953 1140 | wine$V10 0.134092628 -0.3070875776 1141 | wine$V11 0.355055710 0.2532306865 1142 | wine$V12 -0.818036073 -1.5156344987 1143 | wine$V13 -1.157559376 0.0511839665 1144 | wine$V14 -0.002691206 0.0028529846 1145 | 1146 | This means that the first discriminant function is a linear combination of the variables: 1147 | -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 1148 | - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14, where 1149 | V2, V3, ... V14 are the concentrations of the 13 chemicals found in the wine samples. 1150 | For convenience, the values of each discriminant function (eg. the first discriminant function) 1151 | are scaled so that their mean value is zero (see below). 1152 | 1153 | Note that these loadings are calculated so that the within-group variance of each discriminant 1154 | function for each group (cultivar) is equal to 1, as will be demonstrated below.
1155 | 1156 | These scalings are also stored in the named element "scaling" of the variable returned 1157 | by the lda() function. This element contains a matrix, in which the first column contains 1158 | the loadings for the first discriminant function, the second column contains the loadings 1159 | for the second discriminant function and so on. For example, to extract the loadings for 1160 | the first discriminant function, we can type: 1161 | 1162 | :: 1163 | 1164 | > wine.lda$scaling[,1] 1165 | wine$V2 wine$V3 wine$V4 wine$V5 wine$V6 wine$V7 1166 | -0.403399781 0.165254596 -0.369075256 0.154797889 -0.002163496 0.618052068 1167 | wine$V8 wine$V9 wine$V10 wine$V11 wine$V12 wine$V13 1168 | -1.661191235 -1.495818440 0.134092628 0.355055710 -0.818036073 -1.157559376 1169 | wine$V14 1170 | -0.002691206 1171 | 1172 | To calculate the values of the first discriminant function, we can define our own function "calclda()": 1173 | 1174 | :: 1175 | 1176 | > calclda <- function(variables,loadings) 1177 | { 1178 | # store the variables in a data frame, and find the number of samples 1179 | variables <- as.data.frame(variables) 1180 | numsamples <- nrow(variables) 1181 | # make a vector to store the discriminant function 1182 | ld <- numeric(numsamples) 1183 | # find the number of variables 1184 | numvariables <- length(variables) 1185 | # calculate the value of the discriminant function for each sample 1186 | for (i in 1:numsamples) 1187 | { 1188 | valuei <- 0 1189 | for (j in 1:numvariables) 1190 | { 1191 | valueij <- variables[i,j] 1192 | loadingj <- loadings[j] 1193 | valuei <- valuei + (valueij * loadingj) 1194 | } 1195 | ld[i] <- valuei 1196 | } 1197 | # standardise the discriminant function so that its mean value is 0: 1198 | ld <- as.data.frame(scale(ld, center=TRUE, scale=FALSE)) 1199 | ld <- ld[[1]] 1200 | return(ld) 1201 | } 1202 | 1203 | The function calclda() simply calculates the value of a discriminant function 1204 | for each sample in the data set, for example, for the first discriminant function,
for each sample we calculate 1205 | the value using the equation -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 1206 | - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14. Furthermore, the "scale()" 1207 | command is used within the calclda() function in order to standardise the value of a discriminant function 1208 | (eg. the first discriminant function) so that its mean value (over all the wine samples) is 0. 1209 | 1210 | We can use the function calclda() to calculate the values of the first discriminant function for each sample in our 1211 | wine data: 1212 | 1213 | :: 1214 | 1215 | > calclda(wine[2:14], wine.lda$scaling[,1]) 1216 | [1] -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1217 | [7] -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1218 | [13] -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1219 | [19] -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1220 | [25] -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573 1221 | [31] -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996 1222 | ... 1223 | 1224 | .. This agrees with the values that we get in SPSS, except that the values in SPSS 1225 | .. are multiplied by -1, because the loadings are multiplied by -1, but that is fine.
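Note that the nested loop in calclda() can also be written as a single matrix product, which is a more compact (and faster) way of doing the same calculation. This is just a sketch, assuming the "wine" data frame and the "wine.lda" object created above:

::

    > # multiply the data matrix by the vector of loadings, then centre the
    > # resulting scores so that their mean is zero, as calclda() does
    > ld1 <- as.matrix(wine[2:14]) %*% wine.lda$scaling[,1]
    > ld1 <- as.numeric(scale(ld1, center=TRUE, scale=FALSE))
    > ld1[1:6] # should match the first six values printed by calclda() above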
1226 | 1227 | In fact, the values of the first linear discriminant function can be calculated using the 1228 | "predict()" function in R, so we can compare those to the ones that we calculated, and they 1229 | should agree: 1230 | 1231 | :: 1232 | 1233 | > wine.lda.values <- predict(wine.lda, wine[2:14]) 1234 | > wine.lda.values$x[,1] # contains the values for the first discriminant function 1235 | 1 2 3 4 5 6 1236 | -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1237 | 7 8 9 10 11 12 1238 | -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1239 | 13 14 15 16 17 18 1240 | -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1241 | 19 20 21 22 23 24 1242 | -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1243 | 25 26 27 28 29 30 1244 | -3.32164716 -2.14482223 -3.98242850 -2.68591432 -3.56309464 -3.17301573 1245 | 31 32 33 34 35 36 1246 | -2.99626797 -3.56866244 -3.38506383 -3.52753750 -2.85190852 -2.79411996 1247 | ... 1248 | 1249 | We see that they do agree. 1250 | 1251 | .. The loadings agree with those given in SPSS for the unstandardised variables. 1252 | .. In SPSS I get: 1253 | .. Unstandardised coeffs: 1254 | .. V2: 0.403, 0.872 1255 | .. V3: -0.165, 0.305 1256 | .. V4: 0.369, 2.346 1257 | .. V5: -0.155, -0.146 1258 | .. V6: 0.002, 0.000 1259 | .. V7: -0.618, -0.032 1260 | .. V8: 1.661, -0.492 1261 | .. V9: 1.496, -1.632 1262 | .. V10: -0.134, -0.307 1263 | .. V11: -0.355, 0.253 1264 | .. V12: 0.818, -1.516 1265 | .. V13: 1.158, 0.051 1266 | .. V14: 0.003, 0.003 1267 | .. Standardised coeffs: 1268 | .. V2: 0.207, 0.446 1269 | .. V3: -0.156, 0.288 1270 | .. V4: 0.095, 0.603 1271 | .. V5: -0.438, -0.414 1272 | .. V6: 0.029, -0.006 1273 | .. V7: -0.270, -0.014 1274 | .. V8: 0.871, -0.258 1275 | .. V9: 0.163, -0.178 1276 | .. V10: -0.067, -0.152 1277 | .. V11: -0.537, 0.383 1278 | .. V12: 0.128, -0.237 1279 | .. V13: 0.464, 0.021 1280 | .. V14: 0.464, 0.492 1281 | 1282 | .. 
Comment: 1283 | .. If you look at the output of calcSeparations, you can see that the within-group variances are 1. 1284 | .. The loadings are in wine.lda$scaling, I think. 1285 | .. The description for scaling in the help for lda() is: 1286 | .. a matrix which transforms observations to discriminant functions, normalized so that within groups 1287 | .. covariance matrix is spherical. 1288 | .. calcpc(wine[2:14], wine.lda$scaling[,1]) 1289 | .. -13.931031 -13.532745 -12.651506 -13.436540 -10.740768 -13.749476 1290 | .. -13.758165 -13.379134 -13.091615 -12.597411 -14.036666 -12.658863 1291 | .. -12.896889 -14.819033 -14.732101 -12.415539 -12.520157 -12.228879 1292 | .. -14.477190 -12.367318 -12.808265 -10.921558 -14.065937 -12.326676 1293 | .. -12.552434 -11.375609 -13.213215 -11.916701 -12.793881 -12.403802 1294 | .. -12.227055 -12.799449 -12.615850 -12.758324 -12.082695 -12.024907 1295 | .. -11.988872 -11.408131 -12.260050 -12.501839 -12.151442 -11.467997 1296 | .. -13.930512 -10.461148 -11.812826 -11.813907 -13.119666 -12.680540 1297 | .. mylda1 <- calcpc(wine[2:14], wine.lda$scaling[,1]) 1298 | .. summary(aov(mylda1~as.factor(wine[,1]))) 1299 | .. Df Sum Sq Mean Sq F value Pr(>F) 1300 | .. as.factor(wine[, 1]) 2 1589.3 794.65 794.65 < 2.2e-16 *** 1301 | .. Residuals 175 175.0 1.00 1302 | .. Do seem to have within-group variance=1. 1303 | .. Put the LDA1 and LDA2 calculated from SPSS in a file, can check if within-group variance=1: 1304 | .. spss <- read.table("C:/Documents and Settings/Avril Coughlan/My Documents/BACKEDUP/OUBooks/MultivariateStats/wine.data_lda.txt",header=FALSE) 1305 | .. summary(aov(spss$V1~as.factor(wine[,1]))) 1306 | .. Df Sum Sq Mean Sq F value Pr(>F) 1307 | .. as.factor(wine[, 1]) 2 1589.3 794.65 794.65 < 2.2e-16 *** 1308 | .. Residuals 175 175.0 1.00 1309 | .. Has within-group variance=1. 1310 | .. plot(spss$V1, mylda1) # Have a correlation of -1 1311 | .. summary(mylda1) 1312 | .. Min. 1st Qu. Median Mean 3rd Qu. Max. 1313 | .. 
-14.820 -12.140 -9.529 -9.231 -6.396 -3.489 1314 | .. summary(spss$V1) 1315 | .. Min. 1st Qu. Median Mean 3rd Qu. Max. 1316 | .. -5.742e+00 -2.835e+00 2.978e-01 -5.618e-08 2.909e+00 5.588e+00 1317 | .. SPSS seems to have centred the data so that the mean of LDA1 is 0. 1318 | .. 1319 | .. 1320 | .. ... 1321 | .. wine.lda.values <- predict(wine.lda, wine[2:14]) 1322 | .. wine.lda.values$x[,1] # contains the values for the first discriminant function 1323 | .. 1 2 3 4 5 6 1324 | .. -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1325 | .. 7 8 9 10 11 12 1326 | .. -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1327 | .. 13 14 15 16 17 18 1328 | .. -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1329 | .. 19 20 21 22 23 24 1330 | .. -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1331 | .. Agrees perfectly with the values from SPSS (except the SPSS values are multiplied by -1, because the loadings are all multipled by 1332 | .. -1, but that doesn't matter). 1333 | 1334 | It doesn't matter whether the input variables for linear discriminant analysis are standardised or not, unlike 1335 | for principal components analysis in which it is often necessary to standardise the input variables. 1336 | However, using standardised variables in linear discriminant analysis makes it easier to interpret the loadings in 1337 | a linear discriminant function. 1338 | 1339 | In linear discriminant analysis, the standardised version of an input variable is defined so that it 1340 | has mean zero and within-groups variance of 1. Thus, we can calculate the "group-standardised" variable 1341 | by subtracting the mean from each value of the variable, and dividing by the within-groups standard deviation. 
1342 | To calculate the group-standardised version of a set of variables, we can use the function "groupStandardise()" below (which uses the "calcWithinGroupsVariance()" function defined earlier in this booklet): 1343 | 1344 | 1345 | :: 1346 | 1347 | > groupStandardise <- function(variables, groupvariable) 1348 | { 1349 | # find out how many variables we have 1350 | variables <- as.data.frame(variables) 1351 | numvariables <- length(variables) 1352 | # find the variable names 1353 | variablenames <- colnames(variables) 1354 | # calculate the group-standardised version of each variable 1355 | for (i in 1:numvariables) 1356 | { 1357 | variablei <- variables[i] 1358 | variablei_name <- variablenames[i] 1359 | variablei_Vw <- calcWithinGroupsVariance(variablei, groupvariable) 1360 | variablei_mean <- mean(variablei[[1]]) # take the column as a vector, as mean() of a data frame is deprecated 1361 | variablei_new <- (variablei - variablei_mean)/(sqrt(variablei_Vw)) 1362 | data_length <- nrow(variablei) 1363 | if (i == 1) { variables_new <- data.frame(row.names=seq(1,data_length)) } 1364 | variables_new[variablei_name] <- variablei_new 1365 | } 1366 | return(variables_new) 1367 | } 1368 | 1369 | For example, we can use the "groupStandardise()" function to calculate the group-standardised versions of the 1370 | chemical concentrations in wine samples: 1371 | 1372 | :: 1373 | 1374 | > groupstandardisedconcentrations <- groupStandardise(wine[2:14], wine[1]) 1375 | 1376 | We can then use the lda() function to perform linear discriminant analysis on the group-standardised variables: 1377 | 1378 | :: 1379 | 1380 | > wine.lda2 <- lda(wine$V1 ~ groupstandardisedconcentrations$V2 + groupstandardisedconcentrations$V3 + 1381 | groupstandardisedconcentrations$V4 + groupstandardisedconcentrations$V5 + 1382 | groupstandardisedconcentrations$V6 + groupstandardisedconcentrations$V7 + 1383 | groupstandardisedconcentrations$V8 + groupstandardisedconcentrations$V9 + 1384 | groupstandardisedconcentrations$V10 + groupstandardisedconcentrations$V11 + 1385 | groupstandardisedconcentrations$V12 + groupstandardisedconcentrations$V13 + 1386 | 
groupstandardisedconcentrations$V14) 1387 | > wine.lda2 1388 | Coefficients of linear discriminants: 1389 | LD1 LD2 1390 | groupstandardisedconcentrations$V2 -0.20650463 0.446280119 1391 | groupstandardisedconcentrations$V3 0.15568586 0.287697336 1392 | groupstandardisedconcentrations$V4 -0.09486893 0.602988809 1393 | groupstandardisedconcentrations$V5 0.43802089 -0.414203541 1394 | groupstandardisedconcentrations$V6 -0.02907934 -0.006219863 1395 | groupstandardisedconcentrations$V7 0.27030186 -0.014088108 1396 | groupstandardisedconcentrations$V8 -0.87067265 -0.257868714 1397 | groupstandardisedconcentrations$V9 -0.16325474 -0.178003512 1398 | groupstandardisedconcentrations$V10 0.06653116 -0.152364015 1399 | groupstandardisedconcentrations$V11 0.53670086 0.382782544 1400 | groupstandardisedconcentrations$V12 -0.12801061 -0.237174509 1401 | groupstandardisedconcentrations$V13 -0.46414916 0.020523349 1402 | groupstandardisedconcentrations$V14 -0.46385409 0.491738050 1403 | 1404 | It makes sense to interpret the loadings calculated using the group-standardised variables rather than the loadings for 1405 | the original (unstandardised) variables. 1406 | 1407 | In the first discriminant function calculated for the group-standardised variables, the largest loadings (in absolute value) 1408 | are given to V8 (-0.871), V11 (0.537), V13 (-0.464), V14 (-0.464), and V5 (0.438). The loadings for V8, V13 and V14 are negative, while 1409 | those for V11 and V5 are positive. Therefore, the discriminant function seems to represent a contrast between the concentrations of 1410 | V8, V13 and V14, and the concentrations of V11 and V5. 1411 | 1412 | We saw above that the individual variables which gave the greatest separations between the groups were V8 (separation 233.93), V14 (207.92), 1413 | V13 (189.97), V2 (135.08) and V11 (120.66).
These were mostly the same variables that had the largest loadings in the linear discriminant 1414 | function (loading for V8: -0.871, for V14: -0.464, for V13: -0.464, for V11: 0.537). 1415 | 1416 | We found above that variables V8 and V11 have a negative between-groups covariance (-60.41) and a positive within-groups covariance (0.29). 1417 | When the between-groups covariance and within-groups covariance for two variables have opposite signs, it indicates that a better separation 1418 | between groups can be obtained by using a linear combination of those two variables than by using either variable on its own. 1419 | 1420 | Thus, given that the two variables V8 and V11 have between-groups and within-groups covariances of opposite signs, and that these are two 1421 | of the variables that gave the greatest separations between groups when used individually, it is not surprising that these are the two 1422 | variables that have the largest loadings in the first discriminant function. 1423 | 1424 | Note that although the loadings for the group-standardised variables are easier to interpret than the loadings for the 1425 | unstandardised variables, the values of the discriminant function are the same regardless of whether we standardise 1426 | the input variables or not. 
For example, for wine data, we can calculate the value of the first discriminant function calculated 1427 | using the unstandardised and group-standardised variables by typing: 1428 | 1429 | :: 1430 | 1431 | > wine.lda.values <- predict(wine.lda, wine[2:14]) 1432 | > wine.lda.values$x[,1] # values for the first discriminant function, using the unstandardised data 1433 | 1 2 3 4 5 6 1434 | -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1435 | 7 8 9 10 11 12 1436 | -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1437 | 13 14 15 16 17 18 1438 | -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1439 | 19 20 21 22 23 24 1440 | -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1441 | ... 1442 | > wine.lda.values2 <- predict(wine.lda2, groupstandardisedconcentrations) 1443 | > wine.lda.values2$x[,1] # values for the first discriminant function, using the standardised data 1444 | 1 2 3 4 5 6 1445 | -4.70024401 -4.30195811 -3.42071952 -4.20575366 -1.50998168 -4.51868934 1446 | 7 8 9 10 11 12 1447 | -4.52737794 -4.14834781 -3.86082876 -3.36662444 -4.80587907 -3.42807646 1448 | 13 14 15 16 17 18 1449 | -3.66610246 -5.58824635 -5.50131449 -3.18475189 -3.28936988 -2.99809262 1450 | 19 20 21 22 23 24 1451 | -5.24640372 -3.13653106 -3.57747791 -1.69077135 -4.83515033 -3.09588961 1452 | ... 1453 | 1454 | .. Note these are the same values that I get using SPSS. 1455 | 1456 | We can see that although the loadings are different for the first discriminant functions calculated using 1457 | unstandardised and group-standardised data, the actual values of the first discriminant function are the same. 
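Rather than comparing the two sets of values by eye, we can also check the agreement directly. This is a sketch, assuming the objects calculated above:

::

    > # the two sets of scores should agree to within numerical precision
    > all.equal(as.numeric(wine.lda.values$x[,1]), as.numeric(wine.lda.values2$x[,1]))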
1458 | 1459 | Separation Achieved by the Discriminant Functions 1460 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1461 | To calculate the separation achieved by each discriminant function, we first need to calculate the 1462 | value of each discriminant function, by substituting the variables' values into the linear combination for 1463 | the discriminant function (eg. -0.403*V2 + 0.165*V3 - 0.369*V4 + 0.155*V5 - 0.002*V6 + 0.618*V7 - 1.661*V8 1464 | - 1.496*V9 + 0.134*V10 + 0.355*V11 - 0.818*V12 - 1.158*V13 - 0.003*V14 for the first discriminant function), 1465 | and then scaling the values of the discriminant function so that their mean is zero. 1466 | 1467 | As mentioned above, we can do this using the "predict()" function in R. For example, 1468 | to calculate the value of the discriminant functions for the wine data, we type: 1469 | 1470 | :: 1471 | 1472 | > wine.lda.values <- predict(wine.lda, wine[2:14]) 1473 | 1474 | The returned variable has a named element "x" which is a matrix containing the linear discriminant functions: 1475 | the first column of x contains the first discriminant function, the second column of x contains the second 1476 | discriminant function, and so on (if there are more discriminant functions).
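For the wine data, for example, "x" has one row per wine sample and one column per discriminant function:

::

    > dim(wine.lda.values$x)
    [1] 178   2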
1477 | 1478 | We can therefore calculate the separations achieved by the two linear discriminant functions for the wine data by using the 1479 | "calcSeparations()" function (see above), which calculates the separation as the ratio of the between-groups 1480 | variance to the within-groups variance: 1481 | 1482 | :: 1483 | 1484 | > calcSeparations(wine.lda.values$x,wine[1]) 1485 | [1] "variable LD1 Vw= 1 Vb= 794.652200566216 separation= 794.652200566216" 1486 | [1] "variable LD2 Vw= 1 Vb= 361.241041493455 separation= 361.241041493455" 1487 | 1488 | As mentioned above, the loadings for each discriminant function are calculated in such a way that 1489 | the within-group variance (Vw) for each group (wine cultivar here) is equal to 1, as we see in the 1490 | output from calcSeparations() above. 1491 | 1492 | The output from calcSeparations() tells us that the separation achieved by the first (best) discriminant 1493 | function is 794.7, and the separation achieved by the second (second best) discriminant function is 361.2. 1494 | 1495 | Therefore, the total separation is the sum of these, which is (794.652200566216+361.241041493455=1155.893) 1496 | 1155.89, rounded to two decimal places. Therefore, the "percentage separation" achieved by the 1497 | first discriminant function is (794.652200566216*100/1155.893=) 68.75%, and the percentage separation achieved by the 1498 | second discriminant function is (361.241041493455*100/1155.893=) 31.25%. 1499 | 1500 | The "proportion of trace" that is printed when you type "wine.lda" (the variable returned by the lda() function) 1501 | is the percentage separation achieved by each discriminant function. 
For example, for the wine data we get the 1502 | same values as just calculated (68.75% and 31.25%): 1503 | 1504 | :: 1505 | 1506 | > wine.lda 1507 | Proportion of trace: 1508 | LD1 LD2 1509 | 0.6875 0.3125 1510 | 1511 | Therefore, the first discriminant function does achieve a good separation between the three groups (three cultivars), but the second 1512 | discriminant function does improve the separation of the groups by quite a large amount, so it is worth using the 1513 | second discriminant function as well. Therefore, to achieve a good separation of the groups (cultivars), 1514 | it is necessary to use both of the first two discriminant functions. 1515 | 1516 | We found above that the largest separation achieved for any of the individual variables (individual chemical concentrations) 1517 | was 233.9 for V8, which is quite a lot less than 794.7, the separation achieved by the first discriminant function. Therefore, 1518 | the effect of using more than one variable to calculate the discriminant function is that we can find a discriminant function 1519 | that achieves a far greater separation between groups than achieved by any one variable alone. 1520 | 1521 | The variable returned by the lda() function also has a named element "svd", which contains the ratio of 1522 | between- and within-group standard deviations for the linear discriminant variables, that is, the square 1523 | root of the "separation" value that we calculated using calcSeparations() above. When we calculate the 1524 | square of the value stored in "svd", we should get the same values as found using calcSeparations(): 1525 | 1526 | :: 1527 | 1528 | > (wine.lda$svd)^2 1529 | [1] 794.6522 361.2410 1530 | 1531 | 1532 | .. Note that these are also called "canonical F-statistics". 1533 | .. Note the F statistics I get from aov() are the same as the separation values that I calculate: 1534 | .. > summary(aov(wine.lda.values$x ~ as.factor(wine[,1]))) 1535 | .. Response LD1 : 1536 | .. 
Df Sum Sq Mean Sq F value Pr(>F) 1537 | .. as.factor(wine[, 1]) 2 1589.3 794.65 794.65 < 2.2e-16 *** 1538 | .. Residuals 175 175.0 1.00 1539 | .. --- 1540 | .. Response LD2 : 1541 | .. Df Sum Sq Mean Sq F value Pr(>F) 1542 | .. as.factor(wine[, 1]) 2 722.48 361.24 361.24 < 2.2e-16 *** 1543 | .. Residuals 175 175.00 1.00 1544 | 1545 | A Stacked Histogram of the LDA Values 1546 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1547 | 1548 | A nice way of displaying the results of a linear discriminant analysis (LDA) is to make a stacked histogram of the 1549 | values of the discriminant function for the samples from different groups (different wine cultivars in our example). 1550 | 1551 | We can do this using the "ldahist()" function in R. For example, to make a stacked histogram of the first discriminant 1552 | function's values for wine samples of the three different wine cultivars, we type: 1553 | 1554 | :: 1555 | 1556 | > ldahist(data = wine.lda.values$x[,1], g=wine$V1) 1557 | 1558 | |image8| 1559 | 1560 | We can see from the histogram that cultivars 1 and 3 are well separated by the first 1561 | discriminant function, since the values for the first cultivar are between -6 and -1, 1562 | while the values for cultivar 3 are between 2 and 6, and so there is no overlap in values. 1563 | 1564 | However, the separation achieved by the linear discriminant function on the training 1565 | set may be an overestimate. To get a more accurate idea of how well the first discriminant function 1566 | separates the groups, we would need to see a stacked histogram of the values for the three 1567 | cultivars using some unseen "test set", that is, using 1568 | a set of data that was not used to calculate the linear discriminant function. 1569 | 1570 | We see that the first discriminant function separates cultivars 1 and 3 very well, but 1571 | does not separate cultivars 1 and 2, or cultivars 2 and 3, so well. 
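As noted above, the separation measured on the training set may be optimistic. If no separate test set is available, one common alternative is leave-one-out cross-validation, which the lda() function supports through its "CV" argument. A sketch (when CV=TRUE, lda() returns the cross-validated class predictions in the named element "class", rather than a fitted model):

::

    > wine.lda.cv <- lda(wine[2:14], grouping=wine$V1, CV=TRUE) # leave-one-out
    > table(wine$V1, wine.lda.cv$class) # cross-validated confusion matrix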
1572 | 1573 | We therefore investigate whether the second discriminant function separates those cultivars, 1574 | by making a stacked histogram of the second discriminant function's values: 1575 | 1576 | :: 1577 | 1578 | > ldahist(data = wine.lda.values$x[,2], g=wine$V1) 1579 | 1580 | |image9| 1581 | 1582 | We see that the second discriminant function separates cultivars 1 and 2 quite well, although 1583 | there is a little overlap in their values. Furthermore, the second discriminant function also 1584 | separates cultivars 2 and 3 quite well, although again there is a little overlap in their values so 1585 | it is not perfect. 1586 | 1587 | Thus, we see that two discriminant functions are necessary to separate the cultivars, as was 1588 | discussed above (see the discussion of percentage separation above). 1589 | 1590 | Scatterplots of the Discriminant Functions 1591 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1592 | 1593 | We can obtain a scatterplot of the best two discriminant functions, with the data points labelled by cultivar, by typing: 1594 | 1595 | :: 1596 | 1597 | > plot(wine.lda.values$x[,1],wine.lda.values$x[,2]) # make a scatterplot 1598 | > text(wine.lda.values$x[,1],wine.lda.values$x[,2],wine$V1,cex=0.7,pos=4,col="red") # add labels 1599 | 1600 | |image10| 1601 | 1602 | From the scatterplot of the first two discriminant functions, we can see that the wines from the three 1603 | cultivars are well separated in the scatterplot. The first discriminant function (x-axis) 1604 | separates cultivars 1 and 3 very well, but does not perfectly separate cultivars 1605 | 1 and 2, or cultivars 2 and 3. 1606 | 1607 | The second discriminant function (y-axis) achieves a fairly good separation of cultivars 1608 | 1 and 2, and cultivars 2 and 3, although it is not totally perfect.
1609 | 1610 | To achieve a very good separation of the three cultivars, it would be best to use both the first and second 1611 | discriminant functions together, since the first discriminant function can separate cultivars 1 and 3 very well, 1612 | and the second discriminant function can separate cultivars 1 and 2, and cultivars 2 and 3, reasonably well. 1613 | 1614 | Allocation Rules and Misclassification Rate 1615 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1616 | 1617 | We can calculate the mean values of the discriminant functions for each of the three cultivars using the 1618 | "printMeanAndSdByGroup()" function (see above): 1619 | 1620 | :: 1621 | 1622 | > printMeanAndSdByGroup(wine.lda.values$x,wine[1]) 1623 | [1] "Means:" 1624 | V1 LD1 LD2 1625 | 1 1 -3.42248851 1.691674 1626 | 2 2 -0.07972623 -2.472656 1627 | 3 3 4.32473717 1.578120 1628 | 1629 | We find that the mean value of the first discriminant function is -3.42248851 for cultivar 1, -0.07972623 for cultivar 2, 1630 | and 4.32473717 for cultivar 3. The mid-way point between the mean values for cultivars 1 and 2 is (-3.42248851-0.07972623)/2=-1.751107, 1631 | and the mid-way point between the mean values for cultivars 2 and 3 is (-0.07972623+4.32473717)/2 = 2.122505. 
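These mid-way points can also be calculated in R, using the mean values printed above:

::

    > ld1means <- c(-3.42248851, -0.07972623, 4.32473717) # mean of LD1 for cultivars 1, 2 and 3
    > (ld1means[1] + ld1means[2])/2 # cut-off between cultivars 1 and 2
    [1] -1.751107
    > (ld1means[2] + ld1means[3])/2 # cut-off between cultivars 2 and 3
    [1] 2.122505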
1632 | 1633 | Therefore, we can use the following allocation rule: 1634 | 1635 | * if the first discriminant function is <= -1.751107, predict the sample to be from cultivar 1 1636 | * if the first discriminant function is > -1.751107 and <= 2.122505, predict the sample to be from cultivar 2 1637 | * if the first discriminant function is > 2.122505, predict the sample to be from cultivar 3 1638 | 1639 | We can examine the accuracy of this allocation rule by using the "calcAllocationRuleAccuracy()" function below: 1640 | 1641 | :: 1642 | 1643 | > calcAllocationRuleAccuracy <- function(ldavalue, groupvariable, cutoffpoints) 1644 | { 1645 | # find out how many values the group variable can take 1646 | groupvariable2 <- as.factor(groupvariable[[1]]) 1647 | levels <- levels(groupvariable2) 1648 | numlevels <- length(levels) 1649 | # calculate the number of true positives and false negatives for each group 1650 | 1651 | for (i in 1:numlevels) 1652 | { 1653 | leveli <- levels[i] 1654 | levelidata <- ldavalue[groupvariable==leveli] 1655 | # see how many of the samples from this group are classified in each group 1656 | for (j in 1:numlevels) 1657 | { 1658 | levelj <- levels[j] 1659 | if (j == 1) 1660 | { 1661 | cutoff1 <- cutoffpoints[1] 1662 | cutoff2 <- "NA" 1663 | results <- summary(levelidata <= cutoff1) 1664 | } 1665 | else if (j == numlevels) 1666 | { 1667 | cutoff1 <- cutoffpoints[(numlevels-1)] 1668 | cutoff2 <- "NA" 1669 | results <- summary(levelidata > cutoff1) 1670 | } 1671 | else 1672 | { 1673 | cutoff1 <- cutoffpoints[(j-1)] 1674 | cutoff2 <- cutoffpoints[(j)] 1675 | results <- summary(levelidata > cutoff1 & levelidata <= cutoff2) 1676 | } 1677 | trues <- results["TRUE"] 1678 | trues <- trues[[1]] 1679 | print(paste("Number of samples of group",leveli,"classified as group",levelj," : ", 1680 | trues,"(cutoffs:",cutoff1,",",cutoff2,")")) 1681 | } 1682 | } 1683 | } 1684 | 1685 | For example, to calculate the accuracy for the wine
data based on the allocation 1686 | rule for the first discriminant function, we type: 1687 | 1688 | :: 1689 | 1690 | > calcAllocationRuleAccuracy(wine.lda.values$x[,1], wine[1], c(-1.751107, 2.122505)) 1691 | [1] "Number of samples of group 1 classified as group 1 : 56 (cutoffs: -1.751107 , NA )" 1692 | [1] "Number of samples of group 1 classified as group 2 : 3 (cutoffs: -1.751107 , 2.122505 )" 1693 | [1] "Number of samples of group 1 classified as group 3 : NA (cutoffs: 2.122505 , NA )" 1694 | [1] "Number of samples of group 2 classified as group 1 : 5 (cutoffs: -1.751107 , NA )" 1695 | [1] "Number of samples of group 2 classified as group 2 : 65 (cutoffs: -1.751107 , 2.122505 )" 1696 | [1] "Number of samples of group 2 classified as group 3 : 1 (cutoffs: 2.122505 , NA )" 1697 | [1] "Number of samples of group 3 classified as group 1 : NA (cutoffs: -1.751107 , NA )" 1698 | [1] "Number of samples of group 3 classified as group 2 : NA (cutoffs: -1.751107 , 2.122505 )" 1699 | [1] "Number of samples of group 3 classified as group 3 : 48 (cutoffs: 2.122505 , NA )" 1700 | 1701 | This can be displayed in a "confusion matrix": 1702 | 1703 | +------------+----------------------+----------------------+----------------------+ 1704 | | | Allocated to group 1 | Allocated to group 2 | Allocated to group 3 | 1705 | +============+======================+======================+======================+ 1706 | | Is group 1 | 56 | 3 | 0 | 1707 | +------------+----------------------+----------------------+----------------------+ 1708 | | Is group 2 | 5 | 65 | 1 | 1709 | +------------+----------------------+----------------------+----------------------+ 1710 | | Is group 3 | 0 | 0 | 48 | 1711 | +------------+----------------------+----------------------+----------------------+ 1712 | 1713 | There are 3+5+1=9 wine samples that are misclassified, out of (56+3+5+65+1+48=) 178 wine samples: 1714 | 3 samples from cultivar 1 are predicted to be from cultivar 2, 5 samples from cultivar 2 are 
predicted 1715 | to be from cultivar 1, and 1 sample from cultivar 2 is predicted to be from cultivar 3. 1716 | Therefore, the misclassification rate is 9/178, or 5.1%. The misclassification rate is quite low, 1717 | and therefore the accuracy of the allocation rule appears to be relatively high. 1718 | 1719 | However, this is probably an underestimate of the misclassification rate, as the allocation rule was based on this data (this is 1720 | the "training set"). If we calculated the misclassification rate for a separate "test set" consisting of data other than that 1721 | used to make the allocation rule, we would probably get a higher estimate of the misclassification rate. 1722 | 1723 | Links and Further Reading 1724 | ------------------------- 1725 | 1726 | Here are some links for further reading. 1727 | 1728 | For a more in-depth introduction to R, a good online tutorial is 1729 | available on the "Kickstarting R" website, 1730 | `cran.r-project.org/doc/contrib/Lemon-kickstart <http://cran.r-project.org/doc/contrib/Lemon-kickstart/>`_. 1731 | 1732 | There is another nice (slightly more in-depth) tutorial on R 1733 | available on the "Introduction to R" website, 1734 | `cran.r-project.org/doc/manuals/R-intro.html <http://cran.r-project.org/doc/manuals/R-intro.html>`_. 1735 | 1736 | To learn about multivariate analysis, I would highly recommend the book "Multivariate 1737 | analysis" (product code M249/03) by the Open University, available from `the Open University Shop 1738 | `_. 1739 | 1740 | There is a book available in the "Use R!" series on using R for multivariate analyses, 1741 | `An Introduction to Applied Multivariate Analysis with R `_ 1742 | by Everitt and Hothorn. 1743 | 1744 | Acknowledgements 1745 | ---------------- 1746 | 1747 | Many of the examples in this booklet are inspired by examples in the excellent Open University book, 1748 | "Multivariate Analysis" (product code M249/03), 1749 | available from `the Open University Shop `_.
1750 | 1751 | I am grateful to the UCI Machine Learning Repository, 1752 | `http://archive.ics.uci.edu/ml <http://archive.ics.uci.edu/ml>`_, for making data sets available 1753 | which I have used in the examples in this booklet. 1754 | 1755 | Thank you to the following users for very helpful comments: to Rich O'Hara and Patrick Hausmann for pointing 1756 | out that sd() and mean() are deprecated; to Arnau Serra-Cayuela for pointing out a typo 1757 | in the LDA section; to John Christie for suggesting a more compact form for my printMeanAndSdByGroup() function, 1758 | and to Rama Ramakrishnan for suggesting a more compact form for my mosthighlycorrelated() function. 1759 | 1760 | Contact 1761 | ------- 1762 | 1763 | I would be grateful if you would send me (`Avril Coghlan `_) corrections or suggestions for improvements to 1764 | my email address alc@sanger.ac.uk 1765 | 1766 | License 1767 | ------- 1768 | 1769 | The content in this book is licensed under a `Creative Commons Attribution 3.0 License 1770 | `_. 1771 | 1772 | .. |image1| image:: ../_static/image1.png 1773 | :width: 500 1774 | .. |image2| image:: ../_static/image2.png 1775 | :width: 400 1776 | .. |image4| image:: ../_static/image4.png 1777 | :width: 400 1778 | .. |image5| image:: ../_static/image5.png 1779 | :width: 400 1780 | .. |image6| image:: ../_static/image6.png 1781 | :width: 400 1782 | .. |image7| image:: ../_static/image7.png 1783 | :width: 400 1784 | .. |image8| image:: ../_static/image8.png 1785 | :width: 400 1786 | .. |image9| image:: ../_static/image9.png 1787 | :width: 400 1788 | .. |image10| image:: ../_static/image10.png 1789 | :width: 400 1790 | --------------------------------------------------------------------------------